51
|
Chen Q, Lee K, Yan S, Kim S, Wei CH, Lu Z. BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale. PLoS Comput Biol 2020; 16:e1007617. [PMID: 32324731 PMCID: PMC7237030 DOI: 10.1371/journal.pcbi.1007617] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2019] [Revised: 05/19/2020] [Accepted: 12/19/2019] [Indexed: 12/14/2022] Open
Abstract
A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. The capturing of the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings—which involve the learning of vector representations of concepts using machine learning models—have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is of the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we respectively performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which is, by far, the most comprehensive to our best knowledge. The intrinsic evaluation results demonstrate that BioConceptVec consistently has, by a large margin, better performance than existing concept embeddings in identifying similar and related concepts. More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec. Capturing the semantics of related biological concepts, such as genes and mutations, is of significant importance to many research tasks in computational biology such as protein-protein interaction detection, gene-drug association prediction, and biomedical literature-based discovery. Here, we propose to leverage state-of-the-art text mining tools and machine learning models to learn the semantics via vector representations (aka. embeddings) of over 400,000 biological concepts mentioned in the entire PubMed abstracts. Our learned embeddings, namely BioConceptVec, can capture related concepts based on their surrounding contextual information in the literature, which is beyond exact term match or co-occurrence-based methods. BioConceptVec has been thoroughly evaluated in multiple bioinformatics tasks consisting of over 25 million instances from nine different biological datasets. The evaluation results demonstrate that BioConceptVec has better performance than existing methods in all tasks. Finally, BioConceptVec is made freely available to the research community and general public.
Collapse
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Kyubum Lee
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Shankai Yan
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Sun Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| |
Collapse
|
52
|
Abstract
Abstract
Biodiversity research studies the variability and diversity of organisms, including variability within and between species with particular focus on the functional diversity of traits and their relationship to environment. Managing biodiversity data implies dealing with its heterogeneous nature using semantics and tailored ontologies. These are themselves differently conceived, and combining them in semantically enabled applications necessitates an effective alignment between their concepts. This paper describes the ontology matching of biodiversity- and ecology-related ontologies. We illustrate diverse challenges introduced by this kind of ontologies to ontology matching in general. Real use cases requiring pairwise alignments between environment and trait ontologies are introduced. We describe our experience creating a new track at the Ontology Alignment Evaluation Initiative designed for this specific domain and report on the results obtained by state-of-the-art participating systems. The biodiversity and ecology use case turns out to be a strong one for ontology matching, introducing new interesting challenges. Even if most of the matching systems perform relatively well in the proposed matching tasks, there is still room for improvement. We highlight possible directions in that matter and elaborate on our plan to further progress with the track.
Collapse
|
53
|
GO2Vec: transforming GO terms and proteins to vector representations via graph embeddings. BMC Genomics 2019; 20:918. [PMID: 31874639 PMCID: PMC8424702 DOI: 10.1186/s12864-019-6272-2] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2019] [Accepted: 11/12/2019] [Indexed: 12/13/2022] Open
Abstract
Background Semantic similarity between Gene Ontology (GO) terms is a fundamental measure for many bioinformatics applications, such as determining functional similarity between genes or proteins. Most previous research exploited information content to estimate the semantic similarity between GO terms; recently some research exploited word embeddings to learn vector representations for GO terms from a large-scale corpus. In this paper, we proposed a novel method, named GO2Vec, that exploits graph embeddings to learn vector representations for GO terms from GO graph. GO2Vec combines the information from both GO graph and GO annotations, and its learned vectors can be applied to a variety of bioinformatics applications, such as calculating functional similarity between proteins and predicting protein-protein interactions. Results We conducted two kinds of experiments to evaluate the quality of GO2Vec: (1) functional similarity between proteins on the Collaborative Evaluation of GO-based Semantic Similarity Measures (CESSM) dataset and (2) prediction of protein-protein interactions on the Yeast and Human datasets from the STRING database. Experimental results demonstrate the effectiveness of GO2Vec over the information content-based measures and the word embedding-based measures. Conclusion Our experimental results demonstrate the effectiveness of using graph embeddings to learn vector representations from undirected GO and GOA graphs. Our results also demonstrate that GO annotations provide useful information for computing the similarity between GO terms and between proteins.
Collapse
|
54
|
Althubaiti S, Karwath A, Dallol A, Noor A, Alkhayyat SS, Alwassia R, Mineta K, Gojobori T, Beggs AD, Schofield PN, Gkoutos GV, Hoehndorf R. Ontology-based prediction of cancer driver genes. Sci Rep 2019; 9:17405. [PMID: 31757986 PMCID: PMC6874647 DOI: 10.1038/s41598-019-53454-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2019] [Accepted: 10/29/2019] [Indexed: 01/05/2023] Open
Abstract
Identifying and distinguishing cancer driver genes among thousands of candidate mutations remains a major challenge. Accurate identification of driver genes and driver mutations is critical for advancing cancer research and personalizing treatment based on accurate stratification of patients. Due to inter-tumor genetic heterogeneity many driver mutations within a gene occur at low frequencies, which make it challenging to distinguish them from non-driver mutations. We have developed a novel method for identifying cancer driver genes. Our approach utilizes multiple complementary types of information, specifically cellular phenotypes, cellular locations, functions, and whole body physiological phenotypes as features. We demonstrate that our method can accurately identify known cancer driver genes and distinguish between their role in different types of cancer. In addition to confirming known driver genes, we identify several novel candidate driver genes. We demonstrate the utility of our method by validating its predictions in nasopharyngeal cancer and colorectal cancer using whole exome and whole genome sequencing.
Collapse
Affiliation(s)
- Sara Althubaiti
- Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
| | - Andreas Karwath
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, B15 2TT, Birmingham, United Kingdom
- Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, B15 2TT, Birmingham, United Kingdom
| | - Ashraf Dallol
- Centre of Excellence in Genomic Medicine Research, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Adeeb Noor
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, 80221, Saudi Arabia
| | | | - Rolina Alwassia
- Radiation Oncology Unit, King Abdulaziz University Hospital, Jeddah, Saudi Arabia
| | - Katsuhiko Mineta
- Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
| | - Takashi Gojobori
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
- Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
| | - Andrew D Beggs
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, B15 2TT, Birmingham, United Kingdom
| | - Paul N Schofield
- Department of Physiology, Development & Neuroscience, University of Cambridge, Downing Street, CB2 3EG, Cambridge, United Kingdom
| | - Georgios V Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, B15 2TT, Birmingham, United Kingdom
- Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, B15 2TT, Birmingham, United Kingdom
- NIHR Experimental Cancer Medicine Centre, B15 2TT, Birmingham, UK
- NIHR Surgical Reconstruction and Microbiology Research Centre, B15 2TT, Birmingham, UK
- MRC Health Data Research UK (HDR UK) Midlands, Birmingham, United Kingdom
| | - Robert Hoehndorf
- Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia.
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia.
| |
Collapse
|
55
|
Yao Y, Du X, Diao Y, Zhu H. An integration of deep learning with feature embedding for protein–protein interaction prediction. PeerJ 2019; 7:e7126. [PMID: 31245182 PMCID: PMC6585896 DOI: 10.7717/peerj.7126] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2018] [Accepted: 05/14/2019] [Indexed: 12/22/2022] Open
Abstract
Protein–protein interactions are closely relevant to protein function and drug discovery. Hence, accurately identifying protein–protein interactions will help us to understand the underlying molecular mechanisms and significantly facilitate the drug discovery. However, the majority of existing computational methods for protein–protein interactions prediction are focused on the feature extraction and combination of features and there have been limited gains from the state-of-the-art models. In this work, a new residue representation method named Res2vec is designed for protein sequence representation. Residue representations obtained by Res2vec describe more precisely residue-residue interactions from raw sequence and supply more effective inputs for the downstream deep learning model. Combining effective feature embedding with powerful deep learning techniques, our method provides a general computational pipeline to infer protein–protein interactions, even when protein structure knowledge is entirely unknown. The proposed method DeepFE-PPI is evaluated on the S. Cerevisiae and human datasets. The experimental results show that DeepFE-PPI achieves 94.78% (accuracy), 92.99% (recall), 96.45% (precision), 89.62% (Matthew’s correlation coefficient, MCC) and 98.71% (accuracy), 98.54% (recall), 98.77% (precision), 97.43% (MCC), respectively. In addition, we also evaluate the performance of DeepFE-PPI on five independent species datasets and all the results are superior to the existing methods. The comparisons show that DeepFE-PPI is capable of predicting protein–protein interactions by a novel residue representation method and a deep learning classification framework in an acceptable level of accuracy. The codes along with instructions to reproduce this work are available from https://github.com/xal2019/DeepFE-PPI.
Collapse
|
56
|
Kafkas Ş, Abdelhakim M, Hashish Y, Kulmanov M, Abdellatif M, Schofield PN, Hoehndorf R. PathoPhenoDB, linking human pathogens to their phenotypes in support of infectious disease research. Sci Data 2019; 6:79. [PMID: 31160594 PMCID: PMC6546783 DOI: 10.1038/s41597-019-0090-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2018] [Accepted: 05/07/2019] [Indexed: 12/11/2022] Open
Abstract
Understanding the relationship between the pathophysiology of infectious disease, the biology of the causative agent and the development of therapeutic and diagnostic approaches is dependent on the synthesis of a wide range of types of information. Provision of a comprehensive and integrated disease phenotype knowledgebase has the potential to provide novel and orthogonal sources of information for the understanding of infectious agent pathogenesis, and support for research on disease mechanisms. We have developed PathoPhenoDB, a database containing pathogen-to-phenotype associations. PathoPhenoDB relies on manual curation of pathogen-disease relations, on ontology-based text mining as well as manual curation to associate host disease phenotypes with infectious agents. Using Semantic Web technologies, PathoPhenoDB also links to knowledge about drug resistance mechanisms and drugs used in the treatment of infectious diseases. PathoPhenoDB is accessible at http://patho.phenomebrowser.net/ , and the data are freely available through a public SPARQL endpoint.
Collapse
Affiliation(s)
- Şenay Kafkas
- Computer, Electrical and Mathematical Sciences & Engineering with Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
| | - Marwa Abdelhakim
- Computer, Electrical and Mathematical Sciences & Engineering with Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
| | - Yasmeen Hashish
- Computer, Electrical and Mathematical Sciences & Engineering with Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
| | - Maxat Kulmanov
- Computer, Electrical and Mathematical Sciences & Engineering with Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
| | - Marwa Abdellatif
- Computer, Electrical and Mathematical Sciences & Engineering with Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia
| | - Paul N Schofield
- Department of Physiology, Development & Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3EG, United Kingdom
| | - Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences & Engineering with Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955, Saudi Arabia.
| |
Collapse
|
57
|
Deep learning in bioinformatics: Introduction, application, and perspective in the big data era. Methods 2019; 166:4-21. [PMID: 31022451 DOI: 10.1016/j.ymeth.2019.04.008] [Citation(s) in RCA: 132] [Impact Index Per Article: 26.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2018] [Revised: 03/23/2019] [Accepted: 04/15/2019] [Indexed: 12/13/2022] Open
Abstract
Deep learning, which is especially formidable in handling big data, has achieved great success in various fields, including bioinformatics. With the advances of the big data era in biology, it is foreseeable that deep learning will become increasingly important in the field and will be incorporated in vast majorities of analysis pipelines. In this review, we provide both the exoteric introduction of deep learning, and concrete examples and implementations of its representative applications in bioinformatics. We start from the recent achievements of deep learning in the bioinformatics field, pointing out the problems which are suitable to use deep learning. After that, we introduce deep learning in an easy-to-understand fashion, from shallow neural networks to legendary convolutional neural networks, legendary recurrent neural networks, graph neural networks, generative adversarial networks, variational autoencoder, and the most recent state-of-the-art architectures. After that, we provide eight examples, covering five bioinformatics research directions and all the four kinds of data type, with the implementation written in Tensorflow and Keras. Finally, we discuss the common issues, such as overfitting and interpretability, that users will encounter when adopting deep learning methods and provide corresponding suggestions. The implementations are freely available at https://github.com/lykaust15/Deep_learning_examples.
Collapse
|
58
|
Aligning Biomedical Metadata with Ontologies Using Clustering and Embeddings. THE SEMANTIC WEB 2019. [DOI: 10.1007/978-3-030-21348-0_10] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/04/2022]
|