1
|
Zhang Y, Yang Z, Yang Y, Lin H, Wang J. Location-enhanced syntactic knowledge for biomedical relation extraction. J Biomed Inform 2024; 156:104676. [PMID: 38876451 DOI: 10.1016/j.jbi.2024.104676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2024] [Revised: 06/08/2024] [Accepted: 06/10/2024] [Indexed: 06/16/2024]
Abstract
Biomedical relation extraction has long been considered a challenging task due to the specialization and complexity of biomedical texts. Syntactic knowledge has been widely employed in existing research to enhance relation extraction, providing guidance for the semantic understanding and text representation of models. However, the utilization of syntactic knowledge in most studies is not exhaustive, and there is often a lack of fine-grained noise reduction, leading to confusion in relation classification. In this paper, we propose an attention generator that comprehensively considers both syntactic dependency type information and syntactic position information to distinguish the importance of different dependency connections. Additionally, we integrate positional information, dependency type information, and word representations together to introduce location-enhanced syntactic knowledge for guiding our biomedical relation extraction. Experimental results on three widely used English benchmark datasets in the biomedical domain consistently outperform a range of baseline models, demonstrating that our approach not only makes full use of syntactic knowledge but also effectively reduces the impact of noisy words.
Collapse
Affiliation(s)
- Yan Zhang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
| | - Zhihao Yang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
| | - Yumeng Yang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
| | - Hongfei Lin
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
| | - Jian Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
| |
Collapse
|
2
|
Hu Y, Chen Y, Qin Y, Huang R. Learning entity-oriented representation for biomedical relation extraction. J Biomed Inform 2023; 147:104527. [PMID: 37852347 DOI: 10.1016/j.jbi.2023.104527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2023] [Revised: 10/11/2023] [Accepted: 10/15/2023] [Indexed: 10/20/2023]
Abstract
Biomedical Relation Extraction (BioRE) aims to automatically extract semantic relations for given entity pairs and is of great significance in biomedical research. Current popular methods often utilize pretrained language models to extract semantic features from individual input instances, which frequently suffer from overlapping semantics. Overlapping semantics refers to the situation in which a sentence contains multiple entity pairs that share the same context, leading to highly similar information between these entity pairs. In this study, we propose a model for learning Entity-oriented Representation (EoR) that aims to improve the performance of the model by enhancing the discriminability between entity pairs. It contains three modules: sentence representation, entity-oriented representation, and output. The first module learns the global semantic information of the input instance; the second module focuses on extracting the semantic information of the sentence from the target entities; and the third module enhances distinguishability among entity pairs and classifies the relation type. We evaluated our approach on four BioRE tasks with eight datasets, and the experiments showed that our EoR achieved state-of-the-art performance for PPI, DDI, CPI, and DPI tasks. Further analysis demonstrated the benefits of entity-oriented semantic information in handling multiple entity pairs in the BioRE task.
Collapse
Affiliation(s)
- Ying Hu
- Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
| | - Yanping Chen
- Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
| | - Yongbin Qin
- Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
| | - Ruizhang Huang
- Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China.
| |
Collapse
|
3
|
Nezamuldeen L, Jafri MS. Protein-Protein Interaction Network Extraction Using Text Mining Methods Adds Insight into Autism Spectrum Disorder. BIOLOGY 2023; 12:1344. [PMID: 37887054 PMCID: PMC10604135 DOI: 10.3390/biology12101344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/07/2023] [Revised: 10/02/2023] [Accepted: 10/12/2023] [Indexed: 10/28/2023]
Abstract
Text mining methods are being developed to assimilate the volume of biomedical textual materials that are continually expanding. Understanding protein-protein interaction (PPI) deficits would assist in explaining the genesis of diseases. In this study, we designed an automated system to extract PPIs from the biomedical literature that uses a deep learning sentence classification model, a pretrained word embedding, and a BiLSTM recurrent neural network with additional layers, a conditional random field (CRF) named entity recognition (NER) model, and shortest-dependency path (SDP) model using the SpaCy library in Python. The automated system ensures that it targets sentences that contain PPIs and not just these proteins mentioned in the framework of disease discovery or other context. Our first model achieved 13% greater precision on the Aimed/BioInfr benchmark corpus than the previous state-of-the-art BiLSTM neural network models. The NER model presented in this study achieved 98% precision on the Aimed/BioInfr corpus over previous models. In order to facilitate the production of an accurate representation of the PPI network, the processes were developed to systematically map the protein interactions in the texts. Overall, evaluating our system through the use of 6027 abstracts pertaining to seven proteins associated with Autism Spectrum Disorder completed the manually curated PPI network for these proteins. When it comes to complicated diseases, these networks would assist in understanding how PPI deficits contribute to disease development while also emphasizing the influence of interactions on protein function and biological processes.
Collapse
Affiliation(s)
- Leena Nezamuldeen
- School of Systems Biology, George Mason University, Fairfax, VA 22030, USA
- King Fahd Medical Research Centre, King Abdulaziz University, Jeddah 21589, Saudi Arabia;
| | - Mohsin Saleet Jafri
- School of Systems Biology, George Mason University, Fairfax, VA 22030, USA
- Center for Biomedical Engineering and Technology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| |
Collapse
|
4
|
Liu X, Tan J, Fan J, Tan K, Hu J, Dong S. A Syntax-enhanced model based on category keywords for biomedical relation extraction. J Biomed Inform 2022; 132:104135. [PMID: 35842217 DOI: 10.1016/j.jbi.2022.104135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2021] [Revised: 05/10/2022] [Accepted: 07/05/2022] [Indexed: 10/17/2022]
Abstract
Certain categories in multi-category biomedical relationship extraction have linguistic similarities to some extent. Keywords related to categories and syntax structures of samples between these categories have some notable features, which are very useful in biomedical relation extraction. The pre-trained model has been widely used and has achieved great success in biomedical relationship extraction, but it is still incapable of mining this kind of information accurately. To solve the problem, we present a syntax-enhanced model based on category keywords. First, we prune syntactic dependency trees in terms of category keywords obtained by the chi-square test. It reduces noisy information caused by current syntactic parsing tools and retains useful information related to categories. Next, to encode category-related syntactic dependency trees, a syntactic transformer is presented, which enhances the ability of the pre-trained model to capture syntax structures and to distinguish multiple categories. We evaluate our method on three biomedical datasets. Compared with state-of-the-art models, our method performs better on these datasets. We conduct further analysis to verify the effectiveness of our method.
Collapse
Affiliation(s)
- Xiaofeng Liu
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China; Zhongshan Institute of Modern Industrial Technology, South China University of Technology, Zhongshan, China
| | - Jiajie Tan
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
| | - Jianye Fan
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
| | - Kaiwen Tan
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
| | - Jinlong Hu
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China; Zhongshan Institute of Modern Industrial Technology, South China University of Technology, Zhongshan, China
| | - Shoubin Dong
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China; Zhongshan Institute of Modern Industrial Technology, South China University of Technology, Zhongshan, China.
| |
Collapse
|
5
|
Le TD, Nguyen PD, Korkin D, Thieu T. PHILM2Web: A high-throughput database of macromolecular host–pathogen interactions on the Web. Database (Oxford) 2022; 2022:6625823. [PMID: 35776535 PMCID: PMC9248916 DOI: 10.1093/database/baac042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2021] [Revised: 04/27/2022] [Accepted: 05/31/2022] [Indexed: 12/02/2022]
Abstract
During infection, the pathogen’s entry into the host organism, breaching the host immune defense, spread and multiplication are frequently mediated by multiple interactions between the host and pathogen proteins. Systematic studying of host–pathogen interactions (HPIs) is a challenging task for both experimental and computational approaches and is critically dependent on the previously obtained knowledge about these interactions found in the biomedical literature. While several HPI databases exist that manually filter HPI protein–protein interactions from the generic databases and curated experimental interactomic studies, no comprehensive database on HPIs obtained from the biomedical literature is currently available. Here, we introduce a high-throughput literature-mining platform for extracting HPI data that includes the most comprehensive to date collection of HPIs obtained from the PubMed abstracts. Our HPI data portal, PHILM2Web (Pathogen–Host Interactions by Literature Mining on the Web), integrates an automatically generated database of interactions extracted by PHILM, our high-precision HPI literature-mining algorithm. Currently, the database contains 23 581 generic HPIs between 157 host and 403 pathogen organisms from 11 609 abstracts. The interactions were obtained from processing 608 972 PubMed abstracts, each containing mentions of at least one host and one pathogen organisms. In response to the coronavirus disease 2019 (COVID-19) pandemic, we also utilized PHILM to process 25 796 PubMed abstracts obtained by the same query as the COVID-19 Open Research Dataset. This COVID-19 processing batch resulted in 257 HPIs between 19 host and 31 pathogen organisms from 167 abstracts. The access to the entire HPI dataset is available via a searchable PHILM2Web interface; scientists can also download the entire database in bulk for offline processing. Database URL: http://philm2web.live
Collapse
Affiliation(s)
- Tuan-Dung Le
- Department of Computer Science, Oklahoma State University , Stillwater, OK, USA
| | - Phuong D Nguyen
- Department of Biochemistry and Molecular Biology, Oklahoma State University , Stillwater, OK, USA
| | - Dmitry Korkin
- Department of Computer Science and Bioinformatics and Computational Biology Program, Worcester Polytechnic Institute , Worcester, MA, USA
| | - Thanh Thieu
- Machine Learning Department, Moffitt Cancer Center and Research Institute , Tampa, FL, USA
| |
Collapse
|
6
|
Gu W, Yang X, Yang M, Han K, Pan W, Zhu Z. MarkerGenie: an NLP-enabled text-mining system for biomedical entity relation extraction. BIOINFORMATICS ADVANCES 2022; 2:vbac035. [PMID: 36699388 PMCID: PMC9710573 DOI: 10.1093/bioadv/vbac035] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 02/21/2022] [Revised: 05/04/2022] [Accepted: 05/09/2022] [Indexed: 01/28/2023]
Abstract
Motivation Natural language processing (NLP) tasks aim to convert unstructured text data (e.g. articles or dialogues) to structured information. In recent years, we have witnessed fundamental advances of NLP technique, which has been widely used in many applications such as financial text mining, news recommendation and machine translation. However, its application in the biomedical space remains challenging due to a lack of labeled data, ambiguities and inconsistencies of biological terminology. In biomedical marker discovery studies, tools that rely on NLP models to automatically and accurately extract relations of biomedical entities are valuable as they can provide a more thorough survey of all available literature, hence providing a less biased result compared to manual curation. In addition, the fast speed of machine reader helps quickly orient research and development. Results To address the aforementioned needs, we developed automatic training data labeling, rule-based biological terminology cleaning and a more accurate NLP model for binary associative and multi-relation prediction into the MarkerGenie program. We demonstrated the effectiveness of the proposed methods in identifying relations between biomedical entities on various benchmark datasets and case studies. Availability and implementation MarkerGenie is available at https://www.genegeniedx.com/markergenie/. Data for model training and evaluation, term lists of biomedical entities, details of the case studies and all trained models are provided at https://drive.google.com/drive/folders/14RypiIfIr3W_K-mNIAx9BNtObHSZoAyn?usp=sharing. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
- Wenhao Gu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China,GeneGenieDx Corp, San Jose, CA 95134, USA
| | - Xiao Yang
- GeneGenieDx Corp, San Jose, CA 95134, USA
| | - Minhao Yang
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - Kun Han
- GeneGenieDx Corp, San Jose, CA 95134, USA
| | | | - Zexuan Zhu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China,To whom correspondence should be addressed.
| |
Collapse
|
7
|
Yadav S, Ramesh S, Saha S, Ekbal A. Relation Extraction From Biomedical and Clinical Text: Unified Multitask Learning Framework. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1105-1116. [PMID: 32853152 DOI: 10.1109/tcbb.2020.3020016] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
MOTIVATION To minimize the accelerating amount of time invested on the biomedical literature search, numerous approaches for automated knowledge extraction have been proposed. Relation extraction is one such task where semantic relations between the entities are identified from the free text. In the biomedical domain, extraction of regulatory pathways, metabolic processes, adverse drug reaction or disease models necessitates knowledge from the individual relations, for example, physical or regulatory interactions between genes, proteins, drugs, chemical, disease or phenotype. RESULTS In this paper, we study the relation extraction task from three major biomedical and clinical tasks, namely drug-drug interaction, protein-protein interaction, and medical concept relation extraction. Towards this, we model the relation extraction problem in a multi-task learning (MTL)framework, and introduce for the first time the concept of structured self-attentive network complemented with the adversarial learning approach for the prediction of relationships from the biomedical and clinical text. The fundamental notion of MTL is to simultaneously learn multiple problems together by utilizing the concepts of the shared representation. Additionally, we also generate the highly efficient single task model which exploits the shortest dependency path embedding learned over the attentive gated recurrent unit to compare our proposed MTL models. The framework we propose significantly improves over all the baselines (deep learning techniques)and single-task models for predicting the relationships, without compromising on the performance of all the tasks.
Collapse
|
8
|
Liu X, Tan K, Dong S. Multi-granularity sequential neural network for document-level biomedical relation extraction. Inf Process Manag 2021. [DOI: 10.1016/j.ipm.2021.102718] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
9
|
Pang N, Qian L, Lyu W, Yang JD. Using pretraining and text mining methods to automatically extract the chemical scientific data. DATA TECHNOLOGIES AND APPLICATIONS 2021. [DOI: 10.1108/dta-11-2020-0284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
PurposeIn computational chemistry, the chemical bond energy (pKa) is essential, but most pKa-related data are submerged in scientific papers, with only a few data that have been extracted by domain experts manually. The loss of scientific data does not contribute to in-depth and innovative scientific data analysis. To address this problem, this study aims to utilize natural language processing methods to extract pKa-related scientific data in chemical papers.Design/methodology/approachBased on the previous Bert-CRF model combined with dictionaries and rules to resolve the problem of a large number of unknown words of professional vocabulary, in this paper, the authors proposed an end-to-end Bert-CRF model with inputting constructed domain wordpiece tokens using text mining methods. The authors use standard high-frequency string extraction techniques to construct domain wordpiece tokens for specific domains. And in the subsequent deep learning work, domain features are added to the input.FindingsThe experiments show that the end-to-end Bert-CRF model could have a relatively good result and can be easily transferred to other domains because it reduces the requirements for experts by using automatic high-frequency wordpiece tokens extraction techniques to construct the domain wordpiece tokenization rules and then input domain features to the Bert model.Originality/valueBy decomposing lots of unknown words with domain feature-based wordpiece tokens, the authors manage to resolve the problem of a large amount of professional vocabulary and achieve a relatively ideal extraction result compared to the baseline model. The end-to-end model explores low-cost migration for entity and relation extraction in professional fields, reducing the requirements for experts.
Collapse
|
10
|
Pourreza Shahri M, Kahanda I. Deep semi-supervised learning ensemble framework for classifying co-mentions of human proteins and phenotypes. BMC Bioinformatics 2021; 22:500. [PMID: 34656098 PMCID: PMC8520253 DOI: 10.1186/s12859-021-04421-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2021] [Accepted: 10/04/2021] [Indexed: 11/13/2022] Open
Abstract
Background Identifying human protein-phenotype relationships has attracted researchers in bioinformatics and biomedical natural language processing due to its importance in uncovering rare and complex diseases. Since experimental validation of protein-phenotype associations is prohibitive, automated tools capable of accurately extracting these associations from the biomedical text are in high demand. However, while the manual annotation of protein-phenotype co-mentions required for training such models is highly resource-consuming, extracting millions of unlabeled co-mentions is straightforward. Results In this study, we propose a novel deep semi-supervised ensemble framework that combines deep neural networks, semi-supervised, and ensemble learning for classifying human protein-phenotype co-mentions with the help of unlabeled data. This framework allows the ability to incorporate an extensive collection of unlabeled sentence-level co-mentions of human proteins and phenotypes with a small labeled dataset to enhance overall performance. We develop PPPredSS, a prototype of our proposed semi-supervised framework that combines sophisticated language models, convolutional networks, and recurrent networks. Our experimental results demonstrate that the proposed approach provides a new state-of-the-art performance in classifying human protein-phenotype co-mentions by outperforming other supervised and semi-supervised counterparts. Furthermore, we highlight the utility of PPPredSS in powering a curation assistant system through case studies involving a group of biologists. Conclusions This article presents a novel approach for human protein-phenotype co-mention classification based on deep, semi-supervised, and ensemble learning. The insights and findings from this work have implications for biomedical researchers, biocurators, and the text mining community working on biomedical relationship extraction.
Collapse
Affiliation(s)
| | - Indika Kahanda
- School of Computing, University of North Florida, Jacksonville, USA.
| |
Collapse
|
11
|
Yadav S, Lokala U, Daniulaityte R, Thirunarayan K, Lamy F, Sheth A. "When they say weed causes depression, but it's your fav antidepressant": Knowledge-aware attention framework for relationship extraction. PLoS One 2021; 16:e0248299. [PMID: 33764983 PMCID: PMC7993863 DOI: 10.1371/journal.pone.0248299] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2020] [Accepted: 02/23/2021] [Indexed: 11/19/2022] Open
Abstract
With the increasing legalization of medical and recreational use of cannabis, more research is needed to understand the association between depression and consumer behavior related to cannabis consumption. Big social media data has potential to provide deeper insights about these associations to public health analysts. In this interdisciplinary study, we demonstrate the value of incorporating domain-specific knowledge in the learning process to identify the relationships between cannabis use and depression. We develop an end-to-end knowledge infused deep learning framework (Gated-K-BERT) that leverages the pre-trained BERT language representation model and domain-specific declarative knowledge source (Drug Abuse Ontology) to jointly extract entities and their relationship using gated fusion sharing mechanism. Our model is further tailored to provide more focus to the entities mention in the sentence through entity-position aware attention layer, where ontology is used to locate the target entities position. Experimental results show that inclusion of the knowledge-aware attentive representation in association with BERT can extract the cannabis-depression relationship with better coverage in comparison to the state-of-the-art relation extractor.
Collapse
Affiliation(s)
- Shweta Yadav
- Wright State University, Dayton, Ohio, United States of America
- * E-mail:
| | - Usha Lokala
- University of South Carolina, Columbia, South Carolina, United States of America
| | | | | | | | - Amit Sheth
- University of South Carolina, Columbia, South Carolina, United States of America
| |
Collapse
|
12
|
Meng F, Liang Z, Zhao K, Luo C. Drug design targeting active posttranslational modification protein isoforms. Med Res Rev 2020; 41:1701-1750. [PMID: 33355944 DOI: 10.1002/med.21774] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2020] [Revised: 11/29/2020] [Accepted: 12/03/2020] [Indexed: 12/11/2022]
Abstract
Modern drug design aims to discover novel lead compounds with attractable chemical profiles to enable further exploration of the intersection of chemical space and biological space. Identification of small molecules with good ligand efficiency, high activity, and selectivity is crucial toward developing effective and safe drugs. However, the intersection is one of the most challenging tasks in the pharmaceutical industry, as chemical space is almost infinity and continuous, whereas the biological space is very limited and discrete. This bottleneck potentially limits the discovery of molecules with desirable properties for lead optimization. Herein, we present a new direction leveraging posttranslational modification (PTM) protein isoforms target space to inspire drug design termed as "Post-translational Modification Inspired Drug Design (PTMI-DD)." PTMI-DD aims to extend the intersections of chemical space and biological space. We further rationalized and highlighted the importance of PTM protein isoforms and their roles in various diseases and biological functions. We then laid out a few directions to elaborate the PTMI-DD in drug design including discovering covalent binding inhibitors mimicking PTMs, targeting PTM protein isoforms with distinctive binding sites from that of wild-type counterpart, targeting protein-protein interactions involving PTMs, and hijacking protein degeneration by ubiquitination for PTM protein isoforms. These directions will lead to a significant expansion of the biological space and/or increase the tractability of compounds, primarily due to precisely targeting PTM protein isoforms or complexes which are highly relevant to biological functions. Importantly, this new avenue will further enrich the personalized treatment opportunity through precision medicine targeting PTM isoforms.
Collapse
Affiliation(s)
- Fanwang Meng
- Drug Discovery and Design Center, the Center for Chemical Biology, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China.,Department of Chemistry and Chemical Biology, McMaster University, Hamilton, Ontario, Canada
| | - Zhongjie Liang
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou, China
| | - Kehao Zhao
- School of Pharmacy, Key Laboratory of Molecular Pharmacology and Drug Evaluation (Yantai University), Ministry of Education, Collaborative Innovation Center of Advanced Drug Delivery System and Biotech Drugs in Universities of Shandong, Yantai University, Yantai, China
| | - Cheng Luo
- Drug Discovery and Design Center, the Center for Chemical Biology, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
| |
Collapse
|
13
|
Mitra S, Saha S, Hasanuzzaman M. A Multi-View Deep Neural Network Model for Chemical-Disease Relation Extraction From Imbalanced Datasets. IEEE J Biomed Health Inform 2020; 24:3315-3325. [DOI: 10.1109/jbhi.2020.2983365] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
|
14
|
Nicholson DN, Greene CS. Constructing knowledge graphs and their biomedical applications. Comput Struct Biotechnol J 2020; 18:1414-1428. [PMID: 32637040 PMCID: PMC7327409 DOI: 10.1016/j.csbj.2020.05.017] [Citation(s) in RCA: 76] [Impact Index Per Article: 19.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2020] [Revised: 05/22/2020] [Accepted: 05/23/2020] [Indexed: 12/31/2022] Open
Abstract
Knowledge graphs can support many biomedical applications. These graphs represent biomedical concepts and relationships in the form of nodes and edges. In this review, we discuss how these graphs are constructed and applied with a particular focus on how machine learning approaches are changing these processes. Biomedical knowledge graphs have often been constructed by integrating databases that were populated by experts via manual curation, but we are now seeing a more robust use of automated systems. A number of techniques are used to represent knowledge graphs, but often machine learning methods are used to construct a low-dimensional representation that can support many different applications. This representation is designed to preserve a knowledge graph's local and/or global structure. Additional machine learning methods can be applied to this representation to make predictions within genomic, pharmaceutical, and clinical domains. We frame our discussion first around knowledge graph construction and then around unifying representational learning techniques and unifying applications. Advances in machine learning for biomedicine are creating new opportunities across many domains, and we note potential avenues for future work with knowledge graphs that appear particularly promising.
Collapse
Affiliation(s)
- David N. Nicholson
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, United States
| | - Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Childhood Cancer Data Lab, Alex’s Lemonade Stand Foundation, United States
| |
Collapse
|
15
|
Xu B, Liu Y, Yu S, Wang L, Dong J, Lin H, Yang Z, Wang J, Xia F. A network embedding model for pathogenic genes prediction by multi-path random walking on heterogeneous network. BMC Med Genomics 2019; 12:188. [PMID: 31865919 PMCID: PMC6927107 DOI: 10.1186/s12920-019-0627-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Prediction of pathogenic genes is crucial for disease prevention, diagnosis, and treatment. But traditional genetic localization methods are often technique-difficulty and time-consuming. With the development of computer science, computational biology has gradually become one of the main methods for finding candidate pathogenic genes. METHODS We propose a pathogenic genes prediction method based on network embedding which is called Multipath2vec. Firstly, we construct an heterogeneous network which is called GP-network. It is constructed based on three kinds of relationships between genes and phenotypes, including correlations between phenotypes, interactions between genes and known gene-phenotype pairs. Then in order to embedding the network better, we design the multi-path to guide random walk in GP-network. The multi-path includes multiple paths between genes and phenotypes which can capture complex structural information of heterogeneous network. Finally, we use the learned vector representation of each phenotype and protein to calculate the similarities and rank according to the similarities between candidate genes and the target phenotype. RESULTS We implemented Multipath2vec and four baseline approaches (i.e., CATAPULT, PRINCE, Deepwalk and Metapath2vec) on many-genes gene-phenotype data, single-gene gene-phenotype data and whole gene-phenotype data. Experimental results show that Multipath2vec outperformed the state-of-the-art baselines in pathogenic genes prediction task. CONCLUSIONS We propose Multipath2vec that can be utilized to predict pathogenic genes and experimental results show the higher accuracy of pathogenic genes prediction.
Collapse
Affiliation(s)
- Bo Xu
- School of Software, Dalian University of Technology, Dalian, 116000 China
- Key Laboratory for Ubiquitous Network and Service Software of Liaoning, Dalian, 116000 China
| | - Yu Liu
- School of Software, Dalian University of Technology, Dalian, 116000 China
| | - Shuo Yu
- School of Software, Dalian University of Technology, Dalian, 116000 China
| | - Lei Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116000 China
| | - Jie Dong
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116000 China
| | - Hongfei Lin
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116000 China
| | - Zhihao Yang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116000 China
| | - Jian Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116000 China
| | - Feng Xia
- School of Software, Dalian University of Technology, Dalian, 116000 China
- Key Laboratory for Ubiquitous Network and Service Software of Liaoning, Dalian, 116000 China
| |
Collapse
|
16
|
Neural network-based approaches for biomedical relation classification: A review. J Biomed Inform 2019; 99:103294. [DOI: 10.1016/j.jbi.2019.103294] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2019] [Revised: 06/02/2019] [Accepted: 09/21/2019] [Indexed: 12/14/2022]
|
17
|
Zhao S, Li L, Lu H, Zhou A, Qian S. Associative attention networks for temporal relation extraction from electronic health records. J Biomed Inform 2019; 99:103309. [PMID: 31627021 DOI: 10.1016/j.jbi.2019.103309] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2019] [Revised: 10/10/2019] [Accepted: 10/11/2019] [Indexed: 01/17/2023]
Abstract
Temporal relations are crucial in constructing a timeline over the course of clinical care, which can help medical practitioners and researchers track the progression of diseases, treatments and adverse reactions over time. Due to the rapid adoption of Electronic Health Records (EHRs) and high cost of manual curation, using Natural Language Processing (NLP) to extract temporal relations automatically has become a promising approach. Typically temporal relation extraction is formulated as a classification problem for the instances of entity pairs, which relies on the information hidden in context. However, EHRs contain an overwhelming amount of entities and a large number of entity pairs gathering in the same context, making it difficult to distinguish instances and identify relevant contextual information for a specific entity pair. All these pose significant challenges towards temporal relation extraction while existing methods rarely pay attention to. In this work, we propose the associative attention networks to address these issues. Each instance is first carved into three segments according to the entity pair to obtain the differentiated representation initially. Then we devise the associative attention mechanism for a further distinction by emphasizing the relevant information, and meanwhile, for the reconstruction of association among segments as the final representation of the whole instance. In addition, position weights are utilized to enhance the performance. We validate the merit of our method on the widely used THYME corpus and achieve an average F1-score of 64.3% over three runs, which outperforms the state-of-the-art by 1.5%.
Collapse
Affiliation(s)
- Shiyi Zhao
- School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China
| | - Lishuang Li
- School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China.
| | - Hongbin Lu
- School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China
| | - Anqiao Zhou
- School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China
| | - Shuang Qian
- School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China
| |
Collapse
|
18
|
Caufield JH, Ping P. New advances in extracting and learning from protein-protein interactions within unstructured biomedical text data. Emerg Top Life Sci 2019; 3:357-369. [PMID: 33523203 DOI: 10.1042/etls20190003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2019] [Revised: 07/11/2019] [Accepted: 07/16/2019] [Indexed: 12/14/2022]
Abstract
Protein-protein interactions, or PPIs, constitute a basic unit of our understanding of protein function. Though substantial effort has been made to organize PPI knowledge into structured databases, maintenance of these resources requires careful manual curation. Even then, many PPIs remain uncurated within unstructured text data. Extracting PPIs from experimental research supports assembly of PPI networks and highlights relationships crucial to elucidating protein functions. Isolating specific protein-protein relationships from numerous documents is technically demanding by both manual and automated means. Recent advances in the design of these methods have leveraged emerging computational developments and have demonstrated impressive results on test datasets. In this review, we discuss recent developments in PPI extraction from unstructured biomedical text. We explore the historical context of these developments, recent strategies for integrating and comparing PPI data, and their application to advancing the understanding of protein function. Finally, we describe the challenges facing the application of PPI mining to the text concerning protein families, using the multifunctional 14-3-3 protein family as an example.
Collapse
Affiliation(s)
- J Harry Caufield
- The NIH BD2K Center of Excellence in Biomedical Computing, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Physiology, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
| | - Peipei Ping
- The NIH BD2K Center of Excellence in Biomedical Computing, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Physiology, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Medicine/Cardiology, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Bioinformatics, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Scalable Analytics Institute (ScAi), University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
| |
Collapse
|
19
|
|
20
|
Li M, He Q, Ma J, He F, Zhu Y, Chang C, Chen T. PPICurator: A Tool for Extracting Comprehensive Protein-Protein Interaction Information. Proteomics 2019; 19:e1800291. [PMID: 30521143 DOI: 10.1002/pmic.201800291] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2018] [Revised: 11/12/2018] [Indexed: 11/07/2022]
Abstract
Protein-protein interaction extraction through biological literature curation is widely employed for proteome analysis. There is a strong need for a tool that can assist researchers in extracting comprehensive PPI information through literature curation, which is critical in research on protein, for example, construction of protein interaction network, identification of protein signaling pathway, and discovery of meaningful protein interaction. However, most of current tools can only extract PPI relations. None of them are capable of extracting other important PPI information, such as interaction directions, effects, and functional annotations. To address these issues, this paper proposes PPICurator, a novel tool for extracting comprehensive PPI information with a variety of logic and syntax features based on a new support vector machine classifier. PPICurator provides a friendly web-based user interface. It is a platform that automates the extraction of comprehensive PPI information through literature, including PPI relations, as well as their confidential scores, interaction directions, effects, and functional annotations. Thus, PPICurator is more comprehensive than state-of-the-art tools. Moreover, it outperforms state-of-the-art tools in the accuracy of PPI relation extraction measured by F-score and recall on the widely used open datasets. PPICurator is available at https://ppicurator.hupo.org.cn.
Collapse
Affiliation(s)
- Mansheng Li
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
| | - Qiang He
- School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, Victoria, 3122, Australia
| | - Jie Ma
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
| | - Fuchu He
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
| | - Yunping Zhu
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
| | - Cheng Chang
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
| | - Tao Chen
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
| |
Collapse
|
21
|
Wang X, Zhai Y, Lin Y, Wang F. Mining layered technological information in scientific papers: A semi-supervised method. J Inf Sci 2018. [DOI: 10.1177/0165551518816941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
Tech mining is the application of text mining tools to science and technology information resources. The ever-increasing volume of scientific outputs is a boom to technological innovation, but it also complicates efforts to obtain useful and concise information for problem solving. This challenge extends to tech mining, where the development of techniques compatible with big data is an urgent issue. This article introduces a semi-supervised method for extracting layered technological information from scientific papers in order to extend the reach of tech mining. Our method starts with several pre-set seed patterns used to extract candidate phrases by matching the dependency tree of each sentence. Then, after a series of judgements, phrases are divided into two categories: ‘main technique’ and ‘tech-component’. (A technique, for the purposes of this study, is a method or tool used in the article being analysed.) In order to generate new patterns for subsequent iterations, a weighted pattern learning method is also adopted. Finally, multiple iterations of the method are applied to extract technological information from each paper. A dataset from the field of optical switcher is used to verify the method’s effectiveness. Our findings are that (1) by two loops of extraction process in each iteration, our method realises the layered technological information extraction, which contains the ‘part–whole’ relationships between main techniques and tech-components; (2) the recall rate for main techniques is superior to the baseline after iterating 23 rounds; (3) when layering is disregarded, in the aspect of the precision and the volume of techniques, the new method is higher than that for the baseline; and (4) adjusting another two parameters can optimise the efficiency – however, the effect is neither pronounced nor straightforward.
Collapse
Affiliation(s)
- Xiaoyu Wang
- Department of Information Resource Management, Business School, Nankai University, P.R. China; CETC Big Data Research Institute Co., Ltd., P.R. China
| | - Yujia Zhai
- Department of Information Resource Management, School of Management, Tianjin Normal University, P.R. China
| | - Yuanhai Lin
- Institute of Information Photonics Technology and College of Applied Sciences, Beijing University of Technology, P.R. China
| | - Fang Wang
- Department of Information Resource Management, Business School, Nankai University, P.R. China
| |
Collapse
|