1
Bandyopadhyay SS, Halder AK, Saha S, Chatterjee P, Nasipuri M, Basu S. Assessment of GO-Based Protein Interaction Affinities in the Large-Scale Human–Coronavirus Family Interactome. Vaccines (Basel) 2023; 11:549. [PMID: 36992133 DOI: 10.3390/vaccines11030549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Revised: 02/19/2023] [Accepted: 02/23/2023] [Indexed: 03/03/2023] Open
Abstract
SARS-CoV-2 is a novel coronavirus that replicates by interacting with host proteins. Identifying virus–host protein–protein interactions could therefore help researchers better understand the transmission behavior of the disease and identify possible COVID-19 drugs. The International Committee on Taxonomy of Viruses has determined that nCoV is 89% genetically similar to the SARS-CoV behind the 2003 epidemic. This paper assesses the host–pathogen protein interaction affinity of the coronavirus family, which has 44 different variants. To this end, a GO-semantic scoring function based on Gene Ontology (GO) graphs is provided for determining the binding affinity of any two proteins at the organism level. Based on the availability of GO annotations for the proteins, 11 of the 44 viral variants are considered, viz., SARS-CoV-2, SARS, MERS, Bat coronavirus HKU3, Bat coronavirus Rp3/2004, Bat coronavirus HKU5, Murine coronavirus, Bovine coronavirus, Rat coronavirus, Bat coronavirus HKU4, and Bat coronavirus 133/2005. The fuzzy scoring function has been applied to the entire host–pathogen network, with ~180 million potential interactions generated from 19,281 host proteins and around 242 viral proteins. Approximately 4.5 million potential level-one host–pathogen interactions are computed based on the estimated interaction affinity threshold. The resulting host–pathogen interactome is also validated against state-of-the-art experimental networks. The study is further extended toward drug repurposing by analyzing the FDA-listed COVID drugs.
Affiliation(s)
- Soumyendu Sekhar Bandyopadhyay
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
  - Department of Computer Science and Engineering, School of Engineering and Technology, Adamas University, Kolkata 700126, India
- Anup Kumar Halder
  - Faculty of Mathematics and Information Sciences, Warsaw University of Technology, 00-662 Warsaw, Poland
- Sovan Saha
  - Department of Computer Science and Engineering (Artificial Intelligence and Machine Learning), Techno Main Salt Lake, Sector V, Kolkata 700091, India
- Piyali Chatterjee
  - Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata 700152, India
- Mita Nasipuri
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
- Subhadip Basu
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
2
Pesaranghader A, Matwin S, Sokolova M, Grenier JC, Beiko RG, Hussin J. OUP accepted manuscript. Bioinformatics 2022; 38:3051-3061. [PMID: 35536192 PMCID: PMC9154256 DOI: 10.1093/bioinformatics/btac304] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/21/2021] [Revised: 02/12/2022] [Indexed: 11/24/2022] Open
Abstract
Motivation: There is a plethora of measures to evaluate functional similarity (FS) of genes based on their co-expression, protein–protein interactions and sequence similarity. These measures are typically derived from hand-engineered and application-specific metrics to quantify the degree of shared information between two genes using their Gene Ontology (GO) annotations.
Results: We introduce deepSimDEF, a deep learning method to automatically learn FS estimation of gene pairs given a set of genes and their GO annotations. deepSimDEF’s key novelty is its ability to learn low-dimensional embedding vector representations of GO terms and gene products and then calculate FS using these learned vectors. We show that deepSimDEF can predict the FS of new genes using their annotations: it outperformed all other FS measures by >5–10% on yeast and human reference datasets on protein–protein interactions, gene co-expression and sequence homology tasks. Thus, deepSimDEF offers a powerful and adaptable deep neural architecture that can benefit a wide range of problems in genomics and proteomics, and its architecture is flexible enough to support its extension to any organism.
Availability and implementation: Source code and data are available at https://github.com/ahmadpgh/deepSimDEF
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Stan Matwin
  - Faculty of Computer Science, Dalhousie University, Halifax B3H 4R2, Canada
  - Institute for Big Data Analytics, Dalhousie University, Halifax B3H 4R2, Canada
  - Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
- Marina Sokolova
  - Institute for Big Data Analytics, Dalhousie University, Halifax B3H 4R2, Canada
  - Faculty of Medicine and Faculty of Engineering, University of Ottawa, Ottawa K1H 8M5, Canada
- Robert G Beiko
  - Faculty of Computer Science, Dalhousie University, Halifax B3H 4R2, Canada
  - Institute for Big Data Analytics, Dalhousie University, Halifax B3H 4R2, Canada
3
Kim J, Kim D, Sohn KA. HiG2Vec: hierarchical representations of Gene Ontology and genes in the Poincaré ball. Bioinformatics 2021; 37:2971-2980. [PMID: 33760022 DOI: 10.1093/bioinformatics/btab193] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/26/2020] [Revised: 03/14/2021] [Accepted: 03/23/2021] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION: Knowledge manipulation of Gene Ontology (GO) and Gene Ontology Annotation (GOA) can be done primarily by using vector representation of GO terms and genes. Previous studies have represented GO terms and genes or gene products in Euclidean space to measure their semantic similarity using an embedding method such as the Word2Vec-based method to represent entities as numeric vectors. However, this method has the limitation that embedding large graph-structured data in the Euclidean space cannot prevent a loss of information of latent hierarchies, thus precluding the semantics of GO and GOA from being captured optimally. On the other hand, hyperbolic spaces such as the Poincaré balls are more suitable for modeling hierarchies, as they have a geometric property in which the distance increases exponentially as it nears the boundary because of negative curvature.
RESULTS: In this article, we propose hierarchical representations of GO and genes (HiG2Vec) by applying Poincaré embedding specialized in the representation of hierarchy through a two-step procedure: GO embedding and gene embedding. Through experiments, we show that our model represents the hierarchical structure better than other approaches and predicts the interaction of genes or gene products similar to or better than previous studies. The results indicate that HiG2Vec is superior to other methods in capturing the GO and gene semantics and in data utilization as well. It can be robustly applied to manipulate various biological knowledge.
AVAILABILITY AND IMPLEMENTATION: https://github.com/JaesikKim/HiG2Vec.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
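The geometric property the abstract refers to can be illustrated with the standard distance function of the unit Poincaré ball. The snippet below is a minimal sketch of that textbook formula only; it is not taken from the HiG2Vec repository, and the function name `poincare_distance` and the sample points are illustrative assumptions.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

# Two pairs with the same Euclidean gap (0.05), one near the origin, one near the boundary.
near_origin = poincare_distance([0.0, 0.0], [0.05, 0.0])
near_boundary = poincare_distance([0.90, 0.0], [0.95, 0.0])
```

Under these assumptions `near_boundary` comes out several times larger than `near_origin`, which is the room near the boundary that lets deep levels of a hierarchy spread out.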
Affiliation(s)
- Jaesik Kim
  - Department of Computer Engineering, Ajou University, Suwon 16499, South Korea
  - Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Dokyoon Kim
  - Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Kyung-Ah Sohn
  - Department of Computer Engineering, Ajou University, Suwon 16499, South Korea
  - Department of Artificial Intelligence, Ajou University, Suwon 16499, South Korea
4
Ikram N, Qadir MA, Afzal MT. SimExact – An Efficient Method to Compute Function Similarity Between Proteins Using Gene Ontology. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191017092842] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/22/2022]
Abstract
Background:
The rapidly growing protein and annotation databases necessitate the development
of efficient tools to process this valuable information. Biologists frequently need to
find proteins similar to a given protein, for which BLAST tools are commonly used. With the development
of biomedical ontologies, e.g. Gene Ontology, methods were designed to measure
function (semantic) similarity between two proteins. These methods work well on protein pairs,
but are not suitable for protein query processing.
Objective:
Our aim is to facilitate searching of similar proteins in an acceptable time.
Methods:
A novel method SimExact for high speed searching of functionally similar proteins has
been proposed.
Results:
The experiments of this study show that SimExact gives correct results required for protein
searching. A fully functional prototype of an online tool (www.datafurnish.com/protsem.php)
has been provided that generates a ranked list of the proteins similar to a query protein, with a response
time of less than 20 seconds in our setup. SimExact was used to search for protein pairs
having high disparity between function similarity and sequence similarity.
Conclusion:
SimExact makes such searches practical, which would not be possible in a reasonable
time otherwise.
Affiliation(s)
- Najmul Ikram
  - COMSATS University Islamabad, Wah Campus, Islamabad, Pakistan
- Muhammad Abdul Qadir
  - Center for Distributed and Semantic Computing, Capital University of Science and Technology, Islamabad, Pakistan
- Muhammad Tanvir Afzal
  - Center for Distributed and Semantic Computing, Capital University of Science and Technology, Islamabad, Pakistan
5
Wang X, Zhu X, Ye M, Wang Y, Li CD, Xiong Y, Wei DQ. STS-NLSP: A Network-Based Label Space Partition Method for Predicting the Specificity of Membrane Transporter Substrates Using a Hybrid Feature of Structural and Semantic Similarity. Front Bioeng Biotechnol 2019; 7:306. [PMID: 31781551 PMCID: PMC6851049 DOI: 10.3389/fbioe.2019.00306] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 08/05/2019] [Accepted: 10/17/2019] [Indexed: 12/11/2022] Open
Abstract
Membrane transport proteins play crucial roles in the pharmacokinetics of substrate drugs and in drug resistance in cancer, and are vital to drug discovery, development and anti-cancer therapeutics. However, experimental methods to profile a substrate drug against a panel of transporters to determine its specificity are labor intensive and time consuming. In this article, we aim to develop an in silico multi-label classification approach to predict whether a substrate can specifically recognize one of the 13 categories of drug transporters, ranging from ATP-binding cassette to solute carrier families, using both structural fingerprints and chemical ontology information of substrates. The data-driven network-based label space partition (NLSP) method was used to construct the model on a hybrid similarity-based feature that integrates 2D fingerprint and semantic similarity. This method builds a predictor for each label cluster (possibly intersecting) detected by community detection algorithms and takes the union of label sets for a compound as the final prediction. NLSP belongs to the ensembles-of-multi-label-classifiers category in the multi-label learning field. We used the Cramér's V statistic to quantify the label correlations and depicted them via a heatmap. Jackknife tests and an iterative-stratification-based cross-validation method were applied to a benchmark dataset to evaluate the prediction performance of the proposed models in both a multi-label and a label-wise manner. Compared with other powerful multi-label methods, ML-kNN, MTSVM, and RAkELd, our multi-label classification model NLSP-RF (random forest-based NLSP) has proven to be a feasible and effective model, and performed satisfactorily in the predictive task of transporter-substrate specificity. The idea behind the NLSP method is intriguing, and the power of NLSP remains to be explored for multi-label learning problems in bioinformatics.
The benchmark dataset, intermediate results and python code which can fully reproduce our experiments and results are available at https://github.com/dqwei-lab/STS.
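The label-correlation step mentioned in the abstract relies on Cramér's V. As a rough illustration, the sketch below computes the plain (uncorrected) statistic for two categorical label columns. It is not taken from the STS repository; the function name `cramers_v` and the toy label vectors are assumptions made for this example.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equally long categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed counts per (x, y) cell
    row = Counter(xs)              # marginal counts of xs
    col = Counter(ys)              # marginal counts of ys
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0  # a constant column carries no association
    return math.sqrt(chi2 / (n * k))

# Toy binary label columns: one pair perfectly associated, one pair independent.
perfect = cramers_v([0, 1, 0, 1], [0, 1, 0, 1])
independent = cramers_v([0, 0, 1, 1], [0, 1, 0, 1])
```

Computing this value for every pair of labels gives the symmetric matrix that a heatmap would display.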
Affiliation(s)
- Xiangeng Wang
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Joint Laboratory of International Cooperation in Metabolic and Developmental Sciences, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
  - Peng Cheng Laboratory, Shenzhen, China
- Xiaolei Zhu
  - School of Sciences, Anhui Agricultural University, Hefei, China
- Mingzhi Ye
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Joint Laboratory of International Cooperation in Metabolic and Developmental Sciences, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
- Yanjing Wang
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Joint Laboratory of International Cooperation in Metabolic and Developmental Sciences, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
- Cheng-Dong Li
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Joint Laboratory of International Cooperation in Metabolic and Developmental Sciences, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
- Yi Xiong
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Joint Laboratory of International Cooperation in Metabolic and Developmental Sciences, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
- Dong-Qing Wei
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Joint Laboratory of International Cooperation in Metabolic and Developmental Sciences, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China
  - Peng Cheng Laboratory, Shenzhen, China
6
Halder AK, Dutta P, Kundu M, Basu S, Nasipuri M. Review of computational methods for virus-host protein interaction prediction: a case study on novel Ebola-human interactions. Brief Funct Genomics 2019; 17:381-391. [PMID: 29028879 PMCID: PMC7109800 DOI: 10.1093/bfgp/elx026] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/15/2022] Open
Abstract
Identification of potential virus–host interactions is useful and vital to control the highly infectious virus-caused diseases. This may contribute toward development of new drugs to treat the viral infections. Recently, database records of clinically and experimentally validated interactions between a small set of human proteins and Ebola virus (EBOV) have been published. Using the information of the known human interaction partners of EBOV, our main objective is to identify a set of proteins that may interact with EBOV proteins. Here, we first review the state-of-the-art, computational methods used for prediction of novel virus–host interactions for infectious diseases followed by a case study on EBOV–human interactions. The assessment result shows that the predicted human host proteins are highly similar with known human interaction partners of EBOV in the context of structure and semantics and are responsible for similar biochemical activities, pathways and host–pathogen relationships.
Affiliation(s)
- Anup Kumar Halder
  - Department of Computer Science and Engineering, Jadavpur University, India
- Pritha Dutta
  - Department of Computer Science and Engineering, Jadavpur University, India
- Mahantapas Kundu
  - Department of Computer Science and Engineering, Jadavpur University, India
- Subhadip Basu
  - Department of Computer Science and Engineering, Jadavpur University, India
- Mita Nasipuri
  - Department of Computer Science and Engineering, Jadavpur University, India
7
Handling Big Data Scalability in Biological Domain Using Parallel and Distributed Processing: A Case of Three Biological Semantic Similarity Measures. BIOMED RESEARCH INTERNATIONAL 2019; 2019:6750296. [PMID: 30809545 PMCID: PMC6369486 DOI: 10.1155/2019/6750296] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/12/2018] [Accepted: 01/13/2019] [Indexed: 11/30/2022]
Abstract
In the field of biology, researchers need to compare genes or gene products using semantic similarity measures (SSM). Continuous data growth and diversity in data characteristics comprise what is called big data; current biological SSMs cannot handle big data. Therefore, these measures need the ability to control the size of big data. We used parallel and distributed processing by splitting data into multiple partitions and applied SSM measures to each partition; this approach helped manage big data scalability and computational problems. Our solution involves three steps: split gene ontology (GO), data clustering, and semantic similarity calculation. To test this method, split GO and data clustering algorithms were defined and assessed for performance in the first two steps. Three of the best SSMs in biology [Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA] are enhanced by introducing threaded parallel processing, which is used in the third step. Our results demonstrate that introducing threads in SSMs reduced the time of calculating semantic similarity between gene pairs and improved performance of the three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, combined with using split GO and data clustering algorithms to split input data based on their similarity, reduced the average time more than did the approach of equally dividing input data. Time reduction increased with increasing number of splits. Time reduction percentage was 24.1%, 39.2%, and 66.6% for Threaded SSDD; 33.0%, 78.2%, and 93.1% for Threaded SORA in the case of 2, 3, and 4 slaves, respectively; and 92.04% for Threaded Resnik in the case of four slaves.
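The partition-then-score pattern described in the abstract can be sketched on a toy ontology. The example below is an assumption-laden illustration, not the authors' implementation: the six-term graph `PARENTS`, the term probabilities `PROB`, and the helper names are all hypothetical, and only Resnik's measure (information content of the most informative common ancestor) is shown. In standard CPython, threads do not speed up pure-Python arithmetic, so this shows the partitioning structure rather than a measured speed-up.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy ontology: term -> list of parent terms.
PARENTS = {"root": [], "a": ["root"], "b": ["root"],
           "a1": ["a"], "a2": ["a"], "b1": ["b"]}
# Hypothetical annotation probabilities p(t); each covers the term and its descendants.
PROB = {"root": 1.0, "a": 0.5, "b": 0.5, "a1": 0.25, "a2": 0.25, "b1": 0.25}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found

def resnik(t1, t2):
    """Resnik similarity: highest information content among common ancestors."""
    common = ancestors(t1) & ancestors(t2)
    return max(-math.log(PROB[t]) for t in common)

def score_partition(pairs):
    """Score every term pair in one partition."""
    return [(t1, t2, resnik(t1, t2)) for t1, t2 in pairs]

def threaded_resnik(pairs, n_partitions=2):
    """Split the pairs into partitions and score each partition in its own thread."""
    chunks = [pairs[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(score_partition, chunks)
    return {(t1, t2): s for part in results for t1, t2, s in part}

scores = threaded_resnik([("a1", "a2"), ("a1", "b1"), ("a1", "a1")])
```

Here the round-robin split stands in for the similarity-based data clustering the abstract mentions; a real partitioner would group related genes so each worker touches a smaller part of the ontology.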
8
Identifying disease genes using machine learning and gene functional similarities, assessed through Gene Ontology. PLoS One 2018; 13:e0208626. [PMID: 30532199 PMCID: PMC6287949 DOI: 10.1371/journal.pone.0208626] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 07/14/2018] [Accepted: 11/20/2018] [Indexed: 12/14/2022] Open
Abstract
Identifying disease genes from a vast amount of genetic data is one of the most challenging tasks in the post-genomic era. Also, complex diseases present highly heterogeneous genotypes, which makes biological marker identification difficult. Machine learning methods are widely used to identify these markers, but their performance is highly dependent upon the size and quality of available data. In this study, we demonstrated that machine learning classifiers trained on gene functional similarities, using Gene Ontology (GO), can improve the identification of genes involved in complex diseases. For this purpose, we developed a supervised machine learning methodology to predict complex disease genes. The proposed pipeline was assessed using Autism Spectrum Disorder (ASD) candidate genes. A quantitative measure of gene functional similarities was obtained by employing different semantic similarity measures. To infer the hidden functional similarities between ASD genes, various types of machine learning classifiers were built on quantitative semantic similarity matrices of ASD and non-ASD genes. The classifiers trained and tested on ASD and non-ASD gene functional similarities outperformed previously reported ASD classifiers. For example, a Random Forest (RF) classifier achieved an AUC of 0.80 for predicting new ASD genes, which was higher than the reported classifier (0.73). Additionally, this classifier was able to predict 73 novel ASD candidate genes that were enriched for core ASD phenotypes, such as autism and obsessive-compulsive behavior. In addition, predicted genes were also enriched for ASD co-occurring conditions, including Attention Deficit Hyperactivity Disorder (ADHD). We also developed a KNIME workflow with the proposed methodology which allows users to configure and execute it without requiring machine learning and programming skills.
Machine learning is an effective and reliable technique to decipher ASD mechanism by identifying novel disease genes, but this study further demonstrated that their performance can be improved by incorporating a quantitative measure of gene functional similarities. Source code and the workflow of the proposed methodology are available at https://github.com/Muh-Asif/ASD-genes-prediction.
9
GOGO: An improved algorithm to measure the semantic similarity between gene ontology terms. Sci Rep 2018; 8:15107. [PMID: 30305653 PMCID: PMC6180005 DOI: 10.1038/s41598-018-33219-y] [Citation(s) in RCA: 49] [Impact Index Per Article: 8.2] [Received: 10/31/2017] [Accepted: 09/24/2018] [Indexed: 01/29/2023] Open
Abstract
Measuring the semantic similarity between Gene Ontology (GO) terms is an essential step in functional bioinformatics research. We implemented a software tool named GOGO for calculating the semantic similarity between GO terms. GOGO has the advantages of both information-content-based and hybrid methods, such as Resnik’s and Wang’s methods. Moreover, GOGO is relatively fast and does not need to calculate information content (IC) from a large gene annotation corpus but still has the advantage of using IC. This is achieved by considering the number of child nodes in the GO directed acyclic graphs when calculating the semantic contribution that an ancestor node gives to its descendant nodes. GOGO can calculate functional similarities between genes and then cluster genes based on their functional similarities. Evaluations performed on multiple pathways retrieved from the Saccharomyces Genome Database (SGD) show that GOGO can accurately and robustly cluster genes based on functional similarities. We release GOGO as a web server and also as a stand-alone tool, which allows convenient execution of the tool for a small number of GO terms or integration of the tool into bioinformatics pipelines for large-scale calculations. GOGO can be freely accessed or downloaded from http://dna.cs.miami.edu/GOGO/.
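For orientation, the graph-based scheme that the abstract builds on (Wang's method) can be sketched as follows. This is an illustration of Wang's original fixed edge weights (0.8 for is_a, 0.6 for part_of), not of GOGO itself: the abstract does not give GOGO's child-count formula, so it is not reproduced here. The `weight` hook, the `TOY` graph and all function names are hypothetical.

```python
# Fixed semantic contribution factors from Wang's method.
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}

def s_values(term, parents, weight=lambda rel, parent: WEIGHTS[rel]):
    """Semantic contribution of every ancestor of `term` (and of `term` itself).

    `parents` maps a term to a list of (parent, relation) tuples. `weight` is a
    hypothetical hook: a rule that depends on the parent's number of children
    could be passed in here instead of the fixed factors.
    """
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, rel in parents.get(child, []):
            candidate = weight(rel, parent) * s[child]
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate   # keep the best path contribution
                stack.append(parent)    # re-propagate the improved value upward
    return s

def wang_similarity(a, b, parents):
    """Shared contribution of common ancestors over the total semantic value."""
    sa, sb = s_values(a, parents), s_values(b, parents)
    common = sa.keys() & sb.keys()
    shared = sum(sa[t] + sb[t] for t in common)
    return shared / (sum(sa.values()) + sum(sb.values()))

# Hypothetical toy graph: two sibling terms under one parent.
TOY = {"c1": [("p", "is_a")], "c2": [("p", "is_a")],
       "p": [("root", "is_a")], "root": []}
sibling_similarity = wang_similarity("c1", "c2", TOY)
```

Because no annotation corpus is consulted, the score depends only on graph structure, which is the property the abstract highlights as making this family of methods fast.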
10
Liu W, Liu J, Rajapakse JC. Gene Ontology Enrichment Improves Performances of Functional Similarity of Genes. Sci Rep 2018; 8:12100. [PMID: 30108262 PMCID: PMC6092333 DOI: 10.1038/s41598-018-30455-0] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 11/23/2017] [Accepted: 07/25/2018] [Indexed: 12/23/2022] Open
Abstract
There exists a plethora of measures to evaluate functional similarity (FS) between genes, which is widely used in many bioinformatics applications including detecting molecular pathways, identifying co-expressed genes, predicting protein-protein interactions, and prioritizing disease genes. Measures of FS between genes are mostly derived from the Information Content (IC) of Gene Ontology (GO) terms annotating the genes. However, existing measures evaluating the IC of terms based either on the representations of terms in the annotating corpus or on the knowledge embedded in the GO hierarchy do not consider the enrichment of GO terms by the querying pair of genes. The enrichment of a GO term by a pair of genes depends on whether the term is annotated by one gene (i.e., partial annotation) or by both genes (i.e., complete annotation) in the pair. In this paper, we propose a method that incorporates enrichment of GO terms by a gene pair in computing their FS and show that GO enrichment improves the performance of 46 existing FS measures in the prediction of sequence homologies, gene expression correlations, protein-protein interactions, and disease-associated genes.
Affiliation(s)
- Wenting Liu
  - Human Genetics, Genome Institute of Singapore, Singapore, Singapore.
- Jianjun Liu
  - Human Genetics, Genome Institute of Singapore, Singapore, Singapore.
- Jagath C Rajapakse
  - School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore.
11
Ikram N, Qadir MA, Afzal MT. Investigating Correlation between Protein Sequence Similarity and Semantic Similarity Using Gene Ontology Annotations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:905-912. [PMID: 28436885 DOI: 10.1109/tcbb.2017.2695542] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 06/07/2023]
Abstract
Sequence similarity is a commonly used measure to compare proteins. With the increasing use of ontologies, semantic (function) similarity is gaining importance. The correlation between these measures has been applied in the evaluation of new semantic similarity methods, and in protein function prediction. In this research, we investigate the relationship between the two similarity methods. The results suggest the absence of a strong correlation between sequence and semantic similarities. There is a large number of proteins with low sequence similarity and high semantic similarity. We observe that Pearson's correlation coefficient is not sufficient to explain the nature of this relationship. Interestingly, the term semantic similarity values above 0 and below 1 do not seem to play a role in improving the correlation. That is, the correlation coefficient depends only on the number of common GO terms in proteins under comparison, and the semantic similarity measurement method does not influence it. Semantic similarity and sequence similarity behave distinctly. These findings have significant implications for future work on protein comparison, and will help to better understand the semantic similarity between proteins.
12
Dutta P, Basu S, Kundu M. Assessment of Semantic Similarity between Proteins Using Information Content and Topological Properties of the Gene Ontology Graph. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:839-849. [PMID: 28371781 DOI: 10.1109/tcbb.2017.2689762] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 06/07/2023]
Abstract
The semantic similarity between two interacting proteins can be estimated by combining the similarity scores of the GO terms associated with the proteins. Greater number of similar GO annotations between two proteins indicates greater interaction affinity. Existing semantic similarity measures make use of the GO graph structure, the information content of GO terms, or a combination of both. In this paper, we present a hybrid approach which utilizes both the topological features of the GO graph and information contents of the GO terms. More specifically, we 1) consider a fuzzy clustering of the GO graph based on the level of association of the GO terms, 2) estimate the GO term memberships to each cluster center based on the respective shortest path lengths, and 3) assign weightage to GO term pairs on the basis of their dissimilarity with respect to the cluster centers. We test the performance of our semantic similarity measure against seven other previously published similarity measures using benchmark protein-protein interaction datasets of Homo sapiens and Saccharomyces cerevisiae based on sequence similarity, Pfam similarity, area under ROC curve, and measure.
13
Chu Y, Wang Z, Wang R, Zhang N, Li J, Hu Y, Teng M, Wang Y. WDNfinder: A method for minimum driver node set detection and analysis in directed and weighted biological network. J Bioinform Comput Biol 2017; 15:1750021. [DOI: 10.1142/s0219720017500214] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/18/2022]
Abstract
Structural controllability is the generalization of traditional controllability for dynamical systems. During the last decade, interesting biological discoveries have been inferred by applying structural controllability analysis to biological networks. However, false positive/negative information (i.e. nodes and edges) widely exists in biological networks that are documented in public data sources, which can hinder accurate analysis of structural controllability. In this study, we propose WDNfinder, a comprehensive analysis package that provides structural controllability analysis with consideration of node connection strength in biological networks. When applied to the human cancer signaling network and the p53-mediated DNA damage response network, WDNfinder shows high accuracy in predicting essential nodes in these networks. Compared to existing methods, WDNfinder can significantly narrow down the set of minimum driver node sets (MDSs) under the restriction of domain knowledge. Using the p53-mediated DNA damage response network as an illustration, we find more meaningful MDSs with WDNfinder. The source code is implemented in Python and publicly available together with relevant data on GitHub: https://github.com/dustincys/WDNfinder.
Affiliation(s)
- Yanshuo Chu
- School of Computer Science and Technology, Harbin Institute of Technology, P. R. China
| | - Zhenxing Wang
- School of Computer Science and Technology, Harbin Institute of Technology, P. R. China
| | - Rongjie Wang
- School of Computer Science and Technology, Harbin Institute of Technology, P. R. China
| | - Ningyi Zhang
- School of Computer Science and Technology, Harbin Institute of Technology, P. R. China
| | - Jie Li
- School of Computer Science and Technology, Harbin Institute of Technology, P. R. China
| | - Yang Hu
- School of Computer Science and Technology, Harbin Institute of Technology, P. R. China
| | - Mingxiang Teng
- School of Computer Science and Technology, Harbin Institute of Technology, P. R. China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, P. R. China
| |
Collapse
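To make the notion of a minimum driver node set (MDS) in the abstract above more concrete, here is a minimal Python sketch of the standard unweighted maximum-matching formulation of structural controllability. It is not WDNfinder's own procedure, which also accounts for connection strength; the function names `max_matching` and `driver_nodes` and the toy networks are hypothetical.

```python
def max_matching(n, edges):
    """Maximum matching on a directed graph via augmenting paths.

    Each directed edge (u, v) links the "out" copy of u to the "in" copy
    of v. matched_by[v] holds the node u whose edge (u, v) is matched.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    matched_by = [None] * n

    def try_augment(u, visited):
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            if matched_by[v] is None or try_augment(matched_by[v], visited):
                matched_by[v] = u
                return True
        return False

    for u in range(n):
        try_augment(u, set())
    return matched_by


def driver_nodes(n, edges):
    """Return one minimum driver node set: nodes with no matched incoming edge.

    If every node is matched, a single arbitrary node (here node 0) suffices.
    """
    matched_by = max_matching(n, edges)
    unmatched = [v for v in range(n) if matched_by[v] is None]
    return unmatched or [0]


# Hypothetical toy networks.
chain = [(0, 1), (1, 2), (2, 3)]
star = [(0, 1), (0, 2), (0, 3)]
print(driver_nodes(4, chain))
print(driver_nodes(4, star))
```

A maximum matching is generally not unique, so different matchings can yield different MDSs of the same size. That ambiguity is what motivates narrowing the candidate MDSs with additional knowledge, as the abstract describes.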
|
14
|
Chen Q, Wan Y, Zhang X, Lei Y, Zobel J, Verspoor K. Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases. ACM JOURNAL OF DATA AND INFORMATION QUALITY 2017. [DOI: 10.1145/3131611] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 12/14/2022]
Abstract
The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis.
Clustering methods have been widely applied to biological sequences for database deduplication. Since an exhaustive all-by-all pairwise comparison of sequences cannot scale for a high volume of data, heuristic approaches have been recruited, such as the use of simple similarity thresholds. In this article, we present a comparison between CD-HIT and UCLUST, the two best-known clustering tools for sequence database deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a biological case study that assesses intracluster function annotation consistency to demonstrate the impact of these factors on a practical application of the sequence clustering methods. Our results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. This evaluation leads to practical recommendations for users for more effective uses of the sequence clustering tools for deduplication.
Affiliation(s)
- Yu Wan
- University of Melbourne, Victoria, Australia
- Yang Lei
- University of Melbourne, Australia
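The threshold-based heuristic that the abstract above contrasts with all-by-all comparison can be sketched as greedy incremental clustering. This is only an illustration of the general idea, not the actual CD-HIT or UCLUST implementation: the `identity` function is a toy position-wise measure standing in for alignment-based identity, and all names and sequences are hypothetical.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(sequences, threshold):
    """Greedy incremental clustering with a similarity threshold.

    Sequences are processed longest first. Each sequence joins the first
    cluster whose representative (its first member) is similar enough,
    otherwise it starts a new cluster and becomes its representative.
    """
    clusters = []
    for seq in sorted(sequences, key=len, reverse=True):
        for members in clusters:
            if identity(members[0], seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


# Hypothetical toy sequences.
seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
print(greedy_cluster(seqs, 0.8))
print(greedy_cluster(seqs, 0.9))
```

Each sequence is compared only against cluster representatives, which is why the approach scales better than exhaustive pairwise comparison. It also means that members of the same cluster are never compared with one another, so lower thresholds and larger clusters leave more room for dissimilar members to end up together.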
|
15
|
A Novel Measure for Semantic Similarity Computation of Gene Ontology Terms Using Weighted Aggregation of Information Contents. HEPATITIS MONTHLY 2017. [DOI: 10.5812/zjrms.12041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/13/2023]
|
16
|
Dutta P, Halder AK, Basu S, Kundu M. A survey on Ebola genome and current trends in computational research on the Ebola virus. Brief Funct Genomics 2017; 17:374-380. [DOI: 10.1093/bfgp/elx020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/30/2023] Open
|
17
|
Bible PW, Sun HW, Morasso MI, Loganantharaj R, Wei L. The effects of shared information on semantic calculations in the gene ontology. Comput Struct Biotechnol J 2017; 15:195-211. [PMID: 28217262 PMCID: PMC5299144 DOI: 10.1016/j.csbj.2017.01.009] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/29/2016] [Revised: 01/25/2017] [Accepted: 01/25/2017] [Indexed: 01/01/2023] Open
Abstract
The structured vocabulary that describes gene function, the gene ontology (GO), serves as a powerful tool in biological research. One application of GO in computational biology calculates semantic similarity between two concepts to make inferences about the functional similarity of genes. A class of term similarity algorithms explicitly calculates the shared information (SI) between concepts then substitutes this calculation into traditional term similarity measures such as Resnik, Lin, and Jiang-Conrath. Alternative SI approaches, when combined with ontology choice and term similarity type, lead to many gene-to-gene similarity measures. No thorough investigation has been made into the behavior, complexity, and performance of semantic methods derived from distinct SI approaches. We apply bootstrapping to compare the generalized performance of 57 gene-to-gene semantic measures across six benchmarks. Considering the number of measures, we additionally evaluate whether these methods can be leveraged through ensemble machine learning to improve prediction performance. Results showed that the choice of ontology type most strongly influenced performance across all evaluations. Combining measures into an ensemble classifier reduces cross-validation error beyond any individual measure for protein interaction prediction. This improvement resulted from information gained through the combination of ontology types as ensemble methods within each GO type offered no improvement. These results demonstrate that multiple SI measures can be leveraged for machine learning tasks such as automated gene function prediction by incorporating methods from across the ontologies. To facilitate future research in this area, we developed the GO Graph Tool Kit (GGTK), an open source C++ library with Python interface (github.com/paulbible/ggtk).
Affiliation(s)
- Paul W Bible
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
- Hong-Wei Sun
- Biodata Mining and Discovery Section, Office of Science and Technology, Intramural Research Program, National Institute of Arthritis and Musculoskeletal and Skin Diseases, Bethesda, Maryland
- Maria I Morasso
- Laboratory of Skin Biology, Intramural Research Program, National Institute of Arthritis and Musculoskeletal and Skin Diseases, Bethesda, Maryland
- Rasiah Loganantharaj
- Laboratory of Bioinformatics, Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, Louisiana
- Lai Wei
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
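The abstract above describes substituting a shared-information (SI) value into the Resnik, Lin, and Jiang-Conrath term similarity measures. The Python sketch below illustrates that pattern using one common SI choice, the information content of the most informative common ancestor. It does not use the GGTK API; the toy ontology, the annotation counts, and all function names are hypothetical.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (a DAG rooted at "root").
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "protein_binding": {"binding"},
    "dna_binding": {"binding"},
}

# Hypothetical annotation counts, already propagated up to ancestors.
COUNTS = {"root": 100, "binding": 60, "catalysis": 40,
          "protein_binding": 30, "dna_binding": 10}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content: negative log of the term's annotation probability."""
    return -math.log(COUNTS[term] / COUNTS["root"])


def shared_information(a, b):
    """SI as the IC of the most informative common ancestor (one SI choice)."""
    return max(ic(t) for t in ancestors(a) & ancestors(b))


def resnik(a, b):
    return shared_information(a, b)


def lin(a, b):
    denom = ic(a) + ic(b)
    return 2 * shared_information(a, b) / denom if denom else 1.0


def jiang_conrath_distance(a, b):
    return ic(a) + ic(b) - 2 * shared_information(a, b)


print(resnik("protein_binding", "dna_binding"))
print(lin("protein_binding", "dna_binding"))
print(jiang_conrath_distance("protein_binding", "dna_binding"))
```

Swapping in a different `shared_information` function changes all three measures at once, which is how alternative SI approaches multiply into many gene-to-gene measures.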
|
18
|
Shui Y, Cho YR. Alignment of PPI Networks Using Semantic Similarity for Conserved Protein Complex Prediction. IEEE Trans Nanobioscience 2017; 15:380-389. [PMID: 28113907 DOI: 10.1109/tnb.2016.2555802] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/07/2022]
Abstract
Network alignment is a computational technique to identify topological similarity of graph data by mapping link patterns. In bioinformatics, network alignment algorithms have been applied to protein-protein interaction (PPI) networks to discover evolutionarily conserved substructures at the system level. In particular, local network alignment of PPI networks searches for conserved functional components between species and predicts unknown protein complexes and signaling pathways. In this article, we present a novel approach of local network alignment by semantic mapping. While most previous methods find protein matches between species by sequence homology, our approach uses semantic similarity. Given Gene Ontology (GO) and its annotation data, we estimate functional closeness between two proteins by measuring their semantic similarity. We adopted a new semantic similarity measure, simVICD, which has the best performance for PPI validation and functional match. We tested alignment between the PPI networks of well-studied yeast protein complexes and the genome-wide PPI network of human in order to predict human protein complexes. The experimental results demonstrate that our approach has higher accuracy in protein complex prediction than graph clustering algorithms, and higher efficiency than previous network alignment algorithms.
|
19
|
Abstract
Gene Ontology-based semantic similarity (SS) allows the comparison of GO terms or entities annotated with GO terms, by leveraging the ontology structure and properties and annotation corpora. In the last decade the number and diversity of SS measures based on GO has grown considerably, and their applications range from functional coherence evaluation and protein interaction prediction to disease gene prioritization. Understanding how SS measures work, what issues can affect their performance and how they compare to each other in different evaluation settings is crucial to gain a comprehensive view of this area and choose the most appropriate approaches for a given application. In this chapter, we provide a guide to understanding and selecting SS measures for biomedical researchers. We present a straightforward categorization of SS measures and describe the main strategies they employ. We discuss the intrinsic and external issues that affect their performance, and how these can be addressed. We summarize comparative assessment studies, highlighting the top measures in different settings, and compare different implementation strategies and their use. Finally, we discuss some of the extant challenges and opportunities, namely the increased semantic complexity of GO and the need for fast and efficient computation, pointing the way towards the future generation of SS measures.
Affiliation(s)
- Catia Pesquita
- LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Edifício C6, Piso 3, Campo Grande, 1749-016, Lisbon, Portugal.
|
20
|
Ehsani R, Drabløs F. TopoICSim: a new semantic similarity measure based on gene ontology. BMC Bioinformatics 2016; 17:296. [PMID: 27473391 PMCID: PMC4966780 DOI: 10.1186/s12859-016-1160-0] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 03/12/2016] [Accepted: 07/21/2016] [Indexed: 01/14/2023] Open
Abstract
Background The Gene Ontology (GO) is a dynamic, controlled vocabulary that describes the cellular function of genes and proteins according to three major categories: biological process, molecular function and cellular component. It has become widely used in many bioinformatics applications for annotating genes and measuring their semantic similarity, rather than their sequence similarity. Generally speaking, semantic similarity measures involve the GO tree topology, information content of GO terms, or a combination of both. Results Here we present a new semantic similarity measure called TopoICSim (Topological Information Content Similarity) which uses information on the specific paths between GO terms based on the topology of the GO tree, and the distribution of information content along these paths. The TopoICSim algorithm was evaluated on two human benchmark datasets based on KEGG pathways and Pfam domains grouped as clans, using GO terms from either the biological process or molecular function. The performance of the TopoICSim measure compared favorably to five existing methods. Furthermore, the TopoICSim similarity was also tested on gene/protein sets defined by correlated gene expression, using three human datasets, and showed improved performance compared to two previously published similarity measures. Finally, we used an online benchmarking resource which evaluates any similarity measure against a set of 11 similarity measures in three tests, using gene/protein sets based on sequence similarity, Pfam domains, and enzyme classifications. The results for TopoICSim showed improved performance relative to most of the measures included in the benchmarking, and in particular a very robust performance throughout the different tests. Conclusions The TopoICSim similarity measure provides a competitive method with robust performance for quantification of semantic similarity between genes and proteins based on GO annotations. An R script for TopoICSim is available at http://bigr.medisin.ntnu.no/tools/TopoICSim.R.
Affiliation(s)
- Rezvan Ehsani
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, P.O. Box 8905, NO-7491, Trondheim, Norway; Department of Mathematics, University of Zabol, Zabol, Iran
- Finn Drabløs
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, P.O. Box 8905, NO-7491, Trondheim, Norway.
|
21
|
Zhang SB, Lai JH. Exploring information from the topology beneath the Gene Ontology terms to improve semantic similarity measures. Gene 2016; 586:148-57. [PMID: 27080954 DOI: 10.1016/j.gene.2016.04.024] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/24/2015] [Revised: 03/28/2016] [Accepted: 04/08/2016] [Indexed: 11/19/2022]
Abstract
Measuring the similarity between pairs of biological entities is important in molecular biology. The introduction of Gene Ontology (GO) provides us with a promising approach to quantifying the semantic similarity between two genes or gene products. This kind of similarity measure is closely associated with the GO terms annotated to the biological entities under consideration and the structure of the GO graph. However, previous works in this field mainly focused on the upper part of the graph and seldom considered the lower part. In this study, we aim to explore information from the lower part of the GO graph for better semantic similarity. We propose a framework to quantify the similarity measure beneath a term pair, which takes into account both the information two ancestral terms share and the probability that they co-occur with their common descendants. The effectiveness of our approach was evaluated against seven typical measurements on the public platform CESSM and on protein-protein interaction and gene expression datasets. Experimental results consistently show that the similarity derived from the lower part contributes to a better semantic similarity measure. The promising features of our approach are the following: (1) it provides a mirror model to characterize the information two ancestral terms share with respect to their common descendant; (2) it quantifies the probability that two terms co-occur with their common descendant in an efficient way; and (3) our framework can effectively capture the similarity measure beneath two terms, which can serve as an add-on to improve traditional semantic similarity measures between two GO terms. The algorithm was implemented in Matlab and is freely available from http://ejl.org.cn/bio/GOBeneath/.
Affiliation(s)
- Shu-Bo Zhang
- Department of Computer Science, Guangzhou Maritime Institute, Room 803 Building 88, Dashabei Road, Huangpu District, Guangzhou 510275, PR China.
- Jian-Huang Lai
- School of Information Science and Technology, Sun Yat-sen University, Room 105 Building 110 East District, 135 Xingangxi Road, Guangzhou 510275, PR China.
|
22
|
Pesaranghader A, Matwin S, Sokolova M, Beiko RG. simDEF: definition-based semantic similarity measure of gene ontology terms for functional similarity analysis of genes. Bioinformatics 2015; 32:1380-7. [PMID: 26708333 DOI: 10.1093/bioinformatics/btv755] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 09/07/2015] [Accepted: 12/21/2015] [Indexed: 12/19/2022] Open
Abstract
MOTIVATION Measures of protein functional similarity are essential tools for function prediction, evaluation of protein-protein interactions (PPIs) and other applications. Several existing methods perform comparisons between proteins based on the semantic similarity of their GO terms; however, these measures are highly sensitive to modifications in the topological structure of GO, tend to be focused on specific analytical tasks and concentrate on the GO terms themselves rather than considering their textual definitions. RESULTS We introduce simDEF, an efficient method for measuring semantic similarity of GO terms using their GO definitions, which is based on the Gloss Vector measure commonly used in natural language processing. The simDEF approach builds optimized definition vectors for all relevant GO terms, and expresses the similarity of a pair of proteins as the cosine of the angle between their definition vectors. Relative to existing similarity measures, when validated on a yeast reference database, simDEF improves correlation with sequence homology by up to 50%, shows a correlation improvement >4% with gene expression in the biological process hierarchy of GO and increases PPI predictability by > 2.5% in F1 score for molecular function hierarchy. AVAILABILITY AND IMPLEMENTATION Datasets, results and source code are available at http://kiwi.cs.dal.ca/Software/simDEF CONTACT: ahmad.pgh@dal.ca or beiko@cs.dal.ca SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ahmad Pesaranghader
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada; Institute for Big Data Analytics, Halifax, NS B3H 4R2, Canada
- Stan Matwin
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada; Institute for Big Data Analytics, Halifax, NS B3H 4R2, Canada; Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
- Marina Sokolova
- Institute for Big Data Analytics, Halifax, NS B3H 4R2, Canada; Faculty of Medicine and Faculty of Engineering, University of Ottawa, Ottawa, ON K1H 8M5, Canada
- Robert G Beiko
- Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada
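The simDEF abstract above expresses similarity as the cosine of the angle between definition vectors. The following Python sketch shows only that final cosine step. How simDEF actually builds and optimizes its definition vectors is not reproduced here, and the example vectors and the function name `cosine_similarity` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical definition vectors (e.g. weighted word counts from term definitions).
term_a = [2.0, 0.0, 1.0, 3.0]
term_b = [1.0, 0.0, 0.5, 1.5]
term_c = [0.0, 4.0, 0.0, 0.0]
print(cosine_similarity(term_a, term_b))
print(cosine_similarity(term_a, term_c))
```

Because the cosine depends only on direction, vectors that differ by a scale factor (such as `term_a` and `term_b`) are maximally similar, while vectors with no overlapping non-zero components score zero.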
|
23
|
Harispe S, Ranwez S, Janaqi S, Montmain J. Semantic Similarity from Natural Language and Ontology Analysis. Synthesis Lectures on Human Language Technologies 2015. [DOI: 10.2200/s00639ed1v01y201504hlt027] [Citation(s) in RCA: 113] [Impact Index Per Article: 12.6] [Indexed: 12/25/2022]
|
24
|
Zhang SB, Lai JH. Semantic similarity measurement between gene ontology terms based on exclusively inherited shared information. Gene 2015; 558:108-17. [DOI: 10.1016/j.gene.2014.12.062] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 07/16/2014] [Revised: 12/15/2014] [Accepted: 12/24/2014] [Indexed: 11/25/2022]
|
25
|
Palma G, Vidal ME, Haag E, Raschid L, Thor A. Determining similarity of scientific entities in annotation datasets. Database (Oxford) 2015; 2015:bau123. [PMID: 25725057 PMCID: PMC4343076 DOI: 10.1093/database/bau123] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 08/21/2014] [Revised: 12/02/2014] [Accepted: 12/03/2014] [Indexed: 11/22/2022]
Abstract
Linked Open Data initiatives have made available a diversity of scientific collections where scientists have annotated entities in the datasets with controlled vocabulary terms from ontologies. Annotations encode scientific knowledge, which is captured in annotation datasets. Determining relatedness between annotated entities becomes a building block for pattern mining, e.g. identifying drug-drug relationships may depend on the similarity of the targets that interact with each drug. A diversity of similarity measures has been proposed in the literature to compute relatedness between a pair of entities. Each measure exploits some knowledge including the name, function, relationships with other entities, taxonomic neighborhood and semantic knowledge. We propose a novel general-purpose annotation similarity measure called 'AnnSim' that measures the relatedness between two entities based on the similarity of their annotations. We model AnnSim as a 1-1 maximum weight bipartite match and exploit properties of existing solvers to provide an efficient solution. We empirically study the performance of AnnSim on real-world datasets of drugs and disease associations from clinical trials and relationships between drugs and (genomic) targets. Using baselines that include a variety of measures, we identify where AnnSim can provide a deeper understanding of the semantics underlying the relatedness of a pair of entities or where it could lead to predicting new links or identifying potential novel patterns. Although AnnSim does not exploit knowledge or properties of a particular domain, its performance compares well with a variety of state-of-the-art domain-specific measures. Database URL: http://www.yeastgenome.org/
Affiliation(s)
- Guillermo Palma
- Departamento de Computación Universidad Simón Bolívar, Caracas, Venezuela, Department of Biology, University of Maryland, College Park, MD, 20742 USA Smith School of Business, Institute of Advanced Computer Studies, and Department of Computer Science. College Park, MD, 20742 USA and University of Applied Sciences for Telecommunications, Leipzig, Germany 04277
- Maria-Esther Vidal
- Departamento de Computación Universidad Simón Bolívar, Caracas, Venezuela, Department of Biology, University of Maryland, College Park, MD, 20742 USA Smith School of Business, Institute of Advanced Computer Studies, and Department of Computer Science. College Park, MD, 20742 USA and University of Applied Sciences for Telecommunications, Leipzig, Germany 04277
- Eric Haag
- Departamento de Computación Universidad Simón Bolívar, Caracas, Venezuela, Department of Biology, University of Maryland, College Park, MD, 20742 USA Smith School of Business, Institute of Advanced Computer Studies, and Department of Computer Science. College Park, MD, 20742 USA and University of Applied Sciences for Telecommunications, Leipzig, Germany 04277
- Louiqa Raschid
- Departamento de Computación Universidad Simón Bolívar, Caracas, Venezuela, Department of Biology, University of Maryland, College Park, MD, 20742 USA Smith School of Business, Institute of Advanced Computer Studies, and Department of Computer Science. College Park, MD, 20742 USA and University of Applied Sciences for Telecommunications, Leipzig, Germany 04277
- Andreas Thor
- Departamento de Computación Universidad Simón Bolívar, Caracas, Venezuela, Department of Biology, University of Maryland, College Park, MD, 20742 USA Smith School of Business, Institute of Advanced Computer Studies, and Department of Computer Science. College Park, MD, 20742 USA and University of Applied Sciences for Telecommunications, Leipzig, Germany 04277
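The abstract above models AnnSim as a 1-1 maximum weight bipartite match between the annotations of two entities. The Python sketch below illustrates that idea with a brute-force search, which is only practical for very small annotation sets and is not the solver the authors used. The normalization `2 * total / (|A| + |B|)` is an assumption made here for illustration, and the similarity matrices and function names are hypothetical.

```python
from itertools import permutations


def max_weight_matching(sim):
    """Brute-force 1-1 maximum weight bipartite matching.

    sim[i][j] is the similarity between annotation i of entity A and
    annotation j of entity B. Returns (total weight, list of (i, j) pairs).
    """
    n_rows = len(sim)
    n_cols = len(sim[0]) if sim else 0
    transposed = n_rows > n_cols
    if transposed:  # make sure the smaller side indexes the rows
        sim = [list(col) for col in zip(*sim)]
        n_rows, n_cols = n_cols, n_rows
    best_total, best_pairs = float("-inf"), []
    for cols in permutations(range(n_cols), n_rows):
        total = sum(sim[r][c] for r, c in enumerate(cols))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(cols))
    if transposed:  # report pairs in the original (A, B) orientation
        best_pairs = [(c, r) for r, c in best_pairs]
    return best_total, best_pairs


def annsim(sim):
    """Matching weight normalized by the two annotation set sizes (assumed form)."""
    n_a = len(sim)
    n_b = len(sim[0]) if sim else 0
    if n_a + n_b == 0:
        return 0.0
    total, _ = max_weight_matching(sim)
    return 2 * total / (n_a + n_b)


# Hypothetical pairwise annotation similarities.
sim = [[0.9, 0.8],
       [0.8, 0.1]]
print(max_weight_matching(sim))
print(annsim(sim))
```

In this toy matrix a greedy choice of the single best pair (0.9) forces the remaining pair to score 0.1, whereas the optimal matching pairs both annotations at 0.8. This is why the problem is posed as a maximum weight match instead of a greedy pairing.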
|
26
|
Na D, Son H, Gsponer J. Categorizer: a tool to categorize genes into user-defined biological groups based on semantic similarity. BMC Genomics 2014; 15:1091. [PMID: 25495442 PMCID: PMC4298957 DOI: 10.1186/1471-2164-15-1091] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 06/02/2014] [Accepted: 12/04/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Commonalities between large sets of genes obtained from high-throughput experiments are often identified by searching for enrichments of genes with the same Gene Ontology (GO) annotations. The GO analysis tools used for these enrichment analyses assume that GO terms are independent and the semantic distances between all parent-child terms are identical, which is not true in a biological sense. In addition, these tools output lists of often redundant or overly specific GO terms, which are difficult to interpret in the context of the biological question investigated by the user. Therefore, there is a demand for a robust and reliable method for gene categorization and enrichment analysis. RESULTS We have developed Categorizer, a tool that classifies genes into user-defined groups (categories) and calculates p-values for the enrichment of the categories. Categorizer identifies the biologically best-fit category for each gene by taking advantage of a specialized semantic similarity measure for GO terms. We demonstrate that Categorizer provides improved categorization and enrichment results of genetic modifiers of Huntington's disease compared to a classical GO Slim-based approach or categorizations using other semantic similarity measures. CONCLUSION Categorizer enables more accurate categorizations of genes than currently available methods. This new tool will help experimental and computational biologists analyze genomic and proteomic data according to their specific needs in a more reliable manner.
Affiliation(s)
- Jörg Gsponer
- Department of Biochemistry and Molecular Biology, Centre for High-throughput Biology, University of British Columbia, 2125 East Mall, Vancouver, BC V6T 1Z4, Canada.
|
27
|
Harispe S, Sánchez D, Ranwez S, Janaqi S, Montmain J. A framework for unifying ontology-based semantic similarity measures: a study in the biomedical domain. J Biomed Inform 2013; 48:38-53. [PMID: 24269894 DOI: 10.1016/j.jbi.2013.11.006] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Received: 07/16/2013] [Revised: 11/06/2013] [Accepted: 11/09/2013] [Indexed: 10/26/2022]
Abstract
Ontologies are widely adopted in the biomedical domain to characterize various resources (e.g. diseases, drugs, scientific publications) with non-ambiguous meanings. By exploiting the structured knowledge that ontologies provide, a plethora of ad hoc and domain-specific semantic similarity measures have been defined over the last years. Nevertheless, some critical questions remain: which measure should be defined/chosen for a concrete application? Are some of the, a priori different, measures indeed equivalent? In order to bring some light to these questions, we perform an in-depth analysis of existing ontology-based measures to identify the core elements of semantic similarity assessment. As a result, this paper presents a unifying framework that aims to improve the understanding of semantic measures, to highlight their equivalences and to propose bridges between their theoretical bases. By demonstrating that groups of measures are just particular instantiations of parameterized functions, we unify a large number of state-of-the-art semantic similarity measures through common expressions. The application of the proposed framework and its practical usefulness is underlined by an empirical analysis of hundreds of semantic measures in a biomedical context.
Affiliation(s)
- Sébastien Harispe
- LGI2P/EMA Research Centre, Site de Nîmes, Parc scientifique G. Besse, 30035 Nîmes cedex 1, France.
- David Sánchez
- Departament d'Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, Av. Països Catalans, 26, 43007 Tarragona, Spain
- Sylvie Ranwez
- LGI2P/EMA Research Centre, Site de Nîmes, Parc scientifique G. Besse, 30035 Nîmes cedex 1, France
- Stefan Janaqi
- LGI2P/EMA Research Centre, Site de Nîmes, Parc scientifique G. Besse, 30035 Nîmes cedex 1, France
- Jacky Montmain
- LGI2P/EMA Research Centre, Site de Nîmes, Parc scientifique G. Besse, 30035 Nîmes cedex 1, France
|
28
|
Cho YR, Mina M, Lu Y, Kwon N, Guzzi PH. M-Finder: Uncovering functionally associated proteins from interactome data integrated with GO annotations. Proteome Sci 2013; 11:S3. [PMID: 24565382 PMCID: PMC3909039 DOI: 10.1186/1477-5956-11-s1-s3] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND Protein-protein interactions (PPIs) play a key role in understanding the mechanisms of cellular processes. The availability of interactome data has catalyzed the development of computational approaches to elucidate functional behaviors of proteins on a system level. Gene Ontology (GO) and its annotations are a significant resource for functional characterization of proteins. Because of wide coverage, GO data have often been adopted as a benchmark for protein function prediction on the genomic scale. RESULTS We propose a computational approach, called M-Finder, for functional association pattern mining. This method employs semantic analytics to integrate the genome-wide PPIs with GO data. We also introduce an interactive web application tool that visualizes a functional association network linked to a protein specified by a user. The proposed approach comprises two major components. First, the PPIs that have been generated by high-throughput methods are weighted in terms of their functional consistency using GO and its annotations. We assess two advanced semantic similarity metrics which quantify the functional association level of each interacting protein pair. We demonstrate that these measures outperform the other existing methods by evaluating their agreement to other biological features, such as sequence similarity, the presence of common Pfam domains, and core PPIs. Second, the information flow-based algorithm is employed to discover a set of proteins functionally associated with the protein in a query and their links efficiently. This algorithm reconstructs a functional association network of the query protein. The output network size can be flexibly determined by parameters. CONCLUSIONS M-Finder provides a useful framework to investigate functional association patterns with any protein. This software will also allow users to perform further systematic analysis of a set of proteins for any specific function. 
It is available online at http://bionet.ecs.baylor.edu/mfinder.
|
29
|
Couto FM, Pinto HS. The next generation of similarity measures that fully explore the semantics in biomedical ontologies. J Bioinform Comput Biol 2013; 11:1371001. [DOI: 10.1142/s0219720013710017] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 11/18/2022]
Abstract
There is a prominent trend to augment and improve the formality of biomedical ontologies. For example, this is shown by the current effort on adding description logic axioms, such as disjointness. One of the key ontology applications that can take advantage of this effort is the conceptual (functional) similarity measurement. The presence of description logic axioms in biomedical ontologies makes the current structural or extensional approaches weaker and further away from providing sound semantics-based similarity measures. Although beneficial in small ontologies, the exploration of description logic axioms by semantics-based similarity measures is computationally expensive. This limitation is critical for biomedical ontologies that normally contain thousands of concepts. Thus, in the process of gaining their rightful place, biomedical functional similarity measures have to take the journey of finding how this rich and powerful knowledge can be fully explored while keeping computational costs feasible. This manuscript aims at promoting and guiding the development of compelling tools that deliver what the biomedical community will require in the near future: a next generation of biomedical similarity measures that efficiently and fully explore the semantics present in biomedical ontologies.
Affiliation(s)
- Francisco M. Couto
- Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
- H. Sofia Pinto
- INESC-ID, Departamento de Engenharia Informática, Instituto Superior Técnico, Lisboa 1000-029, Portugal
|
30
|
Ferreira JD, Hastings J, Couto FM. Exploiting disjointness axioms to improve semantic similarity measures. Bioinformatics 2013; 29:2781-7. [PMID: 24002110 DOI: 10.1093/bioinformatics/btt491] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Representing domain knowledge in biology has traditionally been accomplished by creating simple hierarchies of classes with textual annotations. Recently, expressive ontology languages, such as Web Ontology Language, have become more widely adopted, supporting axioms that express logical relationships other than class-subclass, e.g. disjointness. This is improving the coverage and validity of the knowledge contained in biological ontologies. However, current semantic tools still need to adapt to this more expressive information. In this article, we propose a method to integrate disjointness axioms, which are being incorporated in real-world ontologies, such as the Gene Ontology and the chemical entities of biological interest ontology, into semantic similarity, the measure that estimates the closeness in meaning between classes. RESULTS We present a modification of the measure of shared information content, which extends the base measure to allow the incorporation of disjointness information. To evaluate our approach, we applied it to several randomly selected datasets extracted from the chemical entities of biological interest ontology. In 93.8% of these datasets, our measure performed better than the base measure of shared information content. This supports the idea that semantic similarity is more accurate if it extends beyond the hierarchy of classes of the ontology. CONTACT joao.ferreira@lasige.di.fc.ul.pt. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- João D Ferreira
- Department of Informatics, Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal, Cheminformatics and Metabolism, EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK, Swiss Center for Affective Sciences, University of Geneva, 7, rue des Battoirs, 1205 Geneva, Switzerland and Evolutionary Bioinformatics Group, Swiss Institute of Bioinformatics, Biophore - CH-1015 Lausanne, Switzerland
|
31
|
Teng Z, Guo M, Liu X, Dai Q, Wang C, Xuan P. Measuring gene functional similarity based on group-wise comparison of GO terms. Bioinformatics 2013; 29:1424-32. [PMID: 23572412 DOI: 10.1093/bioinformatics/btt160] [Citation(s) in RCA: 72] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
|
32
|
Lopes P, Oliveira JL. COEUS: "semantic web in a box" for biomedical applications. J Biomed Semantics 2012; 3:11. [PMID: 23244467 PMCID: PMC3554586 DOI: 10.1186/2041-1480-3-11] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2012] [Accepted: 11/05/2012] [Indexed: 11/30/2022] Open
Abstract
Background As the “omics” revolution unfolds, the growth in data quantity and diversity is bringing about the need for pioneering bioinformatics software, capable of significantly improving the research workflow. To cope with these computer science demands, biomedical software engineers are adopting emerging semantic web technologies that better suit the life sciences domain. The latter’s complex relationships are easily mapped into semantic web graphs, enabling a superior understanding of collected knowledge. Despite increased awareness of semantic web technologies in bioinformatics, their use is still limited. Results COEUS is a new semantic web framework, aiming at a streamlined application development cycle and following a “semantic web in a box” approach. The framework provides a single package including advanced data integration and triplification tools, base ontologies, a web-oriented engine and a flexible exploration API. Resources can be integrated from heterogeneous sources, including CSV and XML files or SQL and SPARQL query results, and mapped directly to one or more ontologies. Advanced interoperability features include REST services, a SPARQL endpoint and LinkedData publication. These enable the creation of multiple applications for web, desktop or mobile environments, and empower a new knowledge federation layer. Conclusions The platform, targeted at biomedical application developers, provides a complete skeleton ready for rapid application deployment, enhancing the creation of new semantic information systems. COEUS is available as open source at http://bioinformatics.ua.pt/coeus/.
Affiliation(s)
- Pedro Lopes
- DETI/IEETA, Universidade de Aveiro, Campus Universitário de Santiago, Aveiro, 3810 - 193, Portugal.
|
33
|
Yang H, Nepusz T, Paccanaro A. Improving GO semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty. Bioinformatics 2012; 28:1383-9. [PMID: 22522134 DOI: 10.1093/bioinformatics/bts129] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
MOTIVATION Several measures have been recently proposed for quantifying the functional similarity between gene products according to well-structured controlled vocabularies where biological terms are organized in a tree or in a directed acyclic graph (DAG) structure. However, existing semantic similarity measures ignore two important facts. First, when calculating the similarity between two terms, they disregard the descendants of these terms. While this makes no difference when the ontology is a tree, we shall show that it has important consequences when the ontology is a DAG, as is the case, for example, with the Gene Ontology (GO). Second, existing similarity measures do not model the inherent uncertainty which comes from the fact that our current knowledge of the gene annotation and of the ontology structure is incomplete. Here, we propose a novel approach based on downward random walks that can be used to improve any of the existing similarity measures to exhibit these two properties. The approach is computationally efficient: random walks do not need to be simulated, as we provide formulas to calculate their stationary distributions. RESULTS To show that our approach can potentially improve any semantic similarity measure, we test it on six different semantic similarity measures: three commonly used measures by Resnik (1999), Lin (1998), and Jiang and Conrath (1997); and three recently proposed measures: simUI, simGIC by Pesquita et al. (2008); GraSM by Couto et al. (2007); and Couto and Silva (2011). We applied these improved measures to the GO annotations of the yeast Saccharomyces cerevisiae, and tested how they correlate with sequence similarity, mRNA co-expression and protein-protein interaction data. Our results consistently show that the use of downward random walks leads to more reliable similarity measures.
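As a rough illustration of the baseline measures named in this abstract, the sketch below computes Resnik and Lin similarity on a toy DAG. The term names, the parent map and the annotation counts are all hypothetical, and the code shows only the standard information-content formulation, not the downward random walk extension proposed by the authors.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents (not real GO IDs).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A", "B"},  # two parents, so the ontology is a DAG, not a tree
    "D": {"A"},
}

# Hypothetical number of gene products annotated directly to each term.
DIRECT_COUNTS = {"root": 0, "A": 2, "B": 2, "C": 3, "D": 1}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log p(t), where p(t) counts annotations to t and its descendants."""
    totals = {t: 0 for t in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            totals[anc] += count
    n_all = totals["root"]
    return {t: -math.log(c / n_all) for t, c in totals.items()}


IC = information_content()


def resnik(t1, t2):
    """Resnik: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(IC[t] for t in common)


def lin(t1, t2):
    """Lin: Resnik score normalised by the IC of both terms."""
    denom = IC[t1] + IC[t2]
    return 2 * resnik(t1, t2) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    print(resnik("C", "D"), lin("C", "D"))
```

Both functions look only at the ancestors of the two terms, which is the limitation the abstract points out: nothing below the terms influences the score.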
Affiliation(s)
- Haixuan Yang
- Department of Computer Science and Centre for Systems and Synthetic Biology, Royal Holloway, University of London, Egham, TW20 0EX, UK
|
34
|
Falda M, Toppo S, Pescarolo A, Lavezzo E, Di Camillo B, Facchinetti A, Cilia E, Velasco R, Fontana P. Argot2: a large scale function prediction tool relying on semantic similarity of weighted Gene Ontology terms. BMC Bioinformatics 2012; 13 Suppl 4:S14. [PMID: 22536960 PMCID: PMC3314586 DOI: 10.1186/1471-2105-13-s4-s14] [Citation(s) in RCA: 108] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Predicting protein function has become increasingly demanding in the era of next generation sequencing technology. The task of assigning a curator-reviewed function to every single sequence is impracticable. Bioinformatics tools, easy to use and able to provide automatic and reliable annotations at a genomic scale, are necessary and urgent. In this scenario, the Gene Ontology has provided the means to standardize the annotation classification with a structured vocabulary which can be easily exploited by computational methods. RESULTS Argot2 is a web-based function prediction tool able to annotate nucleic or protein sequences from small datasets up to entire genomes. It accepts as input a list of sequences in FASTA format, which are processed using BLAST and HMMER searches vs UniProtKB and Pfam databases respectively; these sequences are then annotated with GO terms retrieved from the UniProtKB-GOA database and the terms are weighted using the e-values from BLAST and HMMER. The weighted GO terms are processed according to both their semantic similarity relations described by the Gene Ontology and their associated score. The algorithm is based on the original idea developed in a previous tool called Argot. The entire engine has been completely rewritten to improve both accuracy and computational efficiency, thus allowing for the annotation of complete genomes. CONCLUSIONS The revised algorithm has already been employed and successfully tested during in-house genome projects of grape and apple, and has proven to have a high precision and recall in all our benchmark conditions. It has also been successfully compared with Blast2GO, one of the methods most commonly employed for sequence annotation. The server is freely accessible at http://www.medcomp.medicina.unipd.it/Argot2.
Affiliation(s)
- Marco Falda
- Department of Molecular Medicine, University of Padova, via U. Bassi 58/B, 35121, Padova, Italy.
|
35
|
Lv S, Li Y, Wang Q, Ning S, Huang T, Wang P, Sun J, Zheng Y, Liu W, Ai J, Li X. A novel method to quantify gene set functional association based on gene ontology. J R Soc Interface 2011; 9:1063-72. [PMID: 21998111 DOI: 10.1098/rsif.2011.0551] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023] Open
Abstract
Numerous gene sets have been used as molecular signatures for exploring the genetic basis of complex disorders. These gene sets are distinct but related to each other in many cases; therefore, efforts have been made to compare gene sets for studies such as those evaluating the reproducibility of different experiments. Comparison in terms of biological function has been demonstrated to be helpful to biologists. We improved the measurement of semantic similarity to quantify the functional association between gene sets in the context of gene ontology and developed a web toolkit named Gene Set Functional Similarity (GSFS; http://bioinfo.hrbmu.edu.cn/GSFS). Validation based on protein complexes for which the functional associations are known demonstrated that the GSFS scores tend to be correlated with sequence similarity scores and that complexes with high GSFS scores tend to be involved in the same functional catalogue. Compared with the pairwise method and the annotation method, the GSFS shows better discrimination and more accurately reflects the known functional catalogues shared between complexes. Case studies comparing differentially expressed genes of prostate tumour samples from different microarray platforms and identifying coronary heart disease susceptibility pathways revealed that the method could contribute to future studies exploring the molecular basis of complex disorders.
Affiliation(s)
- Sali Lv
- College of Bioinformatics Science and Technology and Bio-pharmaceutical Key Laboratory of Heilongjiang Province, Harbin Medical University, Harbin, People's Republic of China
|