1
|
Gu Z. simona: a comprehensive R package for semantic similarity analysis on bio-ontologies. BMC Genomics 2024; 25:869. [PMID: 39285315 PMCID: PMC11406866 DOI: 10.1186/s12864-024-10759-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2024] [Accepted: 09/02/2024] [Indexed: 09/19/2024] Open
Abstract
BACKGROUND Bio-ontologies are keys in structuring complex biological information for effective data integration and knowledge representation. Semantic similarity analysis on bio-ontologies quantitatively assesses the degree of similarity between biological concepts based on the semantics encoded in ontologies. It plays an important role in structured and meaningful interpretations and integration of complex data from multiple biological domains. RESULTS We present simona, a novel R package for semantic similarity analysis on general bio-ontologies. Simona implements infrastructures for ontology analysis by offering efficient data structures, fast ontology traversal methods, and elegant visualizations. Moreover, it provides a robust toolbox supporting over 70 methods for semantic similarity analysis. With simona, we conducted a benchmark against current semantic similarity methods. The results demonstrate methods are clustered based on their mathematical methodologies, thus guiding researchers in the selection of appropriate methods. Additionally, we explored annotation-based versus topology-based methods, revealing that semantic similarities solely based on ontology topology can efficiently reveal semantic similarity structures, facilitating analysis on less-studied organisms and other ontologies. CONCLUSIONS Simona offers a versatile interface and efficient implementation for processing, visualization, and semantic similarity analysis on bio-ontologies. We believe that simona will serve as a robust tool for uncovering relationships and enhancing the interoperability of biological knowledge systems.
Collapse
Affiliation(s)
- Zuguang Gu
- Molecular Precision Oncology Program, National Center for Tumor Diseases (NCT), Im Neuenheimer Feld 280, Heidelberg, 69120, Germany.
| |
Collapse
|
2
|
Zhao Y, Wang J, Guo M, Zhang X, Yu G. Cross-Species Protein Function Prediction with Asynchronous-Random Walk. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1439-1450. [PMID: 31562099 DOI: 10.1109/tcbb.2019.2943342] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Protein function prediction is a fundamental task in the post-genomic era. Available functional annotations of proteins are incomplete and the annotations of two homologous species are complementary to each other. However, how to effectively leverage mutually complementary annotations of different species to further boost the prediction performance is still not well studied. In this paper, we propose a cross-species protein function prediction approach by performing Asynchronous Random Walk on a heterogeneous network (AsyRW). AsyRW first constructs a heterogeneous network to integrate multiple functional association networks derived from different biological data, established homology-relationships between proteins from different species, known annotations of proteins and Gene Ontology (GO). To account for the intrinsic structures of intra- and inter-species of proteins and that of GO, AsyRW quantifies the individual walk lengths of each network node using the gravity-like theory, and then performs asynchronous-random walk with the individual length to predict associations between proteins and GO terms. Experiments on annotations archived in different years show that individual walk length and asynchronous-random walk can effectively leverage the complementary annotations of different species, AsyRW has a significantly improved performance to other related and competitive methods. The codes of AsyRW are available at: http://mlda.swu.edu.cn/codes.php?name=AsyRW.
Collapse
|
3
|
Moro G, Masseroli M. Gene function finding through cross-organism ensemble learning. BioData Min 2021; 14:14. [PMID: 33579334 PMCID: PMC7879670 DOI: 10.1186/s13040-021-00239-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2020] [Accepted: 01/10/2021] [Indexed: 11/12/2022] Open
Abstract
Background Structured biological information about genes and proteins is a valuable resource to improve discovery and understanding of complex biological processes via machine learning algorithms. Gene Ontology (GO) controlled annotations describe, in a structured form, features and functions of genes and proteins of many organisms. However, such valuable annotations are not always reliable and sometimes are incomplete, especially for rarely studied organisms. Here, we present GeFF (Gene Function Finder), a novel cross-organism ensemble learning method able to reliably predict new GO annotations of a target organism from GO annotations of another source organism evolutionarily related and better studied. Results Using a supervised method, GeFF predicts unknown annotations from random perturbations of existing annotations. The perturbation consists in randomly deleting a fraction of known annotations in order to produce a reduced annotation set. The key idea is to train a supervised machine learning algorithm with the reduced annotation set to predict, namely to rebuild, the original annotations. The resulting prediction model, in addition to accurately rebuilding the original known annotations for an organism from their perturbed version, also effectively predicts new unknown annotations for the organism. Moreover, the prediction model is also able to discover new unknown annotations in different target organisms without retraining.We combined our novel method with different ensemble learning approaches and compared them to each other and to an equivalent single model technique. We tested the method with five different organisms using their GO annotations: Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum. The outcomes demonstrate the effectiveness of the cross-organism ensemble approach, which can be customized with a trade-off between the desired number of predicted new annotations and their precision.A Web application to browse both input annotations used and predicted ones, choosing the ensemble prediction method to use, is publicly available at http://tiny.cc/geff/. Conclusions Our novel cross-organism ensemble learning method provides reliable predicted novel gene annotations, i.e., functions, ranked according to an associated likelihood value. They are very valuable both to speed the annotation curation, focusing it on the prioritized new annotations predicted, and to complement known annotations available.
Collapse
Affiliation(s)
- Gianluca Moro
- DISI - University of Bologna, Via dell'Università, Cesena (FC), Italy.
| | - Marco Masseroli
- DEIB, Politecnico di Milano, Piazza L. Da Vinci 32, Milan, 20133, Italy
| |
Collapse
|
4
|
Zhao Y, Wang J, Chen J, Zhang X, Guo M, Yu G. A Literature Review of Gene Function Prediction by Modeling Gene Ontology. Front Genet 2020; 11:400. [PMID: 32391061 PMCID: PMC7193026 DOI: 10.3389/fgene.2020.00400] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2020] [Accepted: 03/30/2020] [Indexed: 12/14/2022] Open
Abstract
Annotating the functional properties of gene products, i.e., RNAs and proteins, is a fundamental task in biology. The Gene Ontology database (GO) was developed to systematically describe the functional properties of gene products across species, and to facilitate the computational prediction of gene function. As GO is routinely updated, it serves as the gold standard and main knowledge source in functional genomics. Many gene function prediction methods making use of GO have been proposed. But no literature review has summarized these methods and the possibilities for future efforts from the perspective of GO. To bridge this gap, we review the existing methods with an emphasis on recent solutions. First, we introduce the conventions of GO and the widely adopted evaluation metrics for gene function prediction. Next, we summarize current methods of gene function prediction that apply GO in different ways, such as using hierarchical or flat inter-relationships between GO terms, compressing massive GO terms and quantifying semantic similarities. Although many efforts have improved performance by harnessing GO, we conclude that there remain many largely overlooked but important topics for future research.
Collapse
Affiliation(s)
- Yingwen Zhao
- College of Computer and Information Science, Southwest University, Chongqing, China
| | - Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing, China
| | - Jian Chen
- State Key Laboratory of Agrobiotechnology and National Maize Improvement Center, China Agricultural University, Beijing, China
| | - Xiangliang Zhang
- CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
| | - Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing, China
- CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| |
Collapse
|
5
|
Makrodimitris S, van Ham RCHJ, Reinders MJT. Improving protein function prediction using protein sequence and GO-term similarities. Bioinformatics 2020; 35:1116-1124. [PMID: 30169569 PMCID: PMC6449755 DOI: 10.1093/bioinformatics/bty751] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2017] [Revised: 07/04/2018] [Accepted: 08/28/2018] [Indexed: 12/26/2022] Open
Abstract
MOTIVATION Most automatic functional annotation methods assign Gene Ontology (GO) terms to proteins based on annotations of highly similar proteins. We advocate that proteins that are less similar are still informative. Also, despite their simplicity and structure, GO terms seem to be hard for computers to learn, in particular the Biological Process ontology, which has the most terms (>29 000). We propose to use Label-Space Dimensionality Reduction (LSDR) techniques to exploit the redundancy of GO terms and transform them into a more compact latent representation that is easier to predict. RESULTS We compare proteins using a sequence similarity profile (SSP) to a set of annotated training proteins. We introduce two new LSDR methods, one based on the structure of the GO, and one based on semantic similarity of terms. We show that these LSDR methods, as well as three existing ones, improve the Critical Assessment of Functional Annotation performance of several function prediction algorithms. Cross-validation experiments on Arabidopsis thaliana proteins pinpoint the superiority of our GO-aware LSDR over generic LSDR. Our experiments on A.thaliana proteins show that the SSP representation in combination with a kNN classifier outperforms state-of-the-art and baseline methods in terms of cross-validated F-measure. AVAILABILITY AND IMPLEMENTATION Source code for the experiments is available at https://github.com/stamakro/SSP-LSDR. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Stavros Makrodimitris
- Department of Intelligent Systems, Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands.,Department of Bioinformatics, Keygene N.V., Wageningen, The Netherlands
| | - Roeland C H J van Ham
- Department of Intelligent Systems, Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands.,Department of Bioinformatics, Keygene N.V., Wageningen, The Netherlands
| | - Marcel J T Reinders
- Department of Intelligent Systems, Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
| |
Collapse
|
6
|
Cardoso C, Sousa RT, Köhler S, Pesquita C. A Collection of Benchmark Data Sets for Knowledge Graph-based Similarity in the Biomedical Domain. Database (Oxford) 2020; 2020:baaa078. [PMID: 33181823 PMCID: PMC7661097 DOI: 10.1093/database/baaa078] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2020] [Revised: 08/13/2020] [Accepted: 08/24/2020] [Indexed: 01/12/2023]
Abstract
The ability to compare entities within a knowledge graph is a cornerstone technique for several applications, ranging from the integration of heterogeneous data to machine learning. It is of particular importance in the biomedical domain, where semantic similarity can be applied to the prediction of protein-protein interactions, associations between diseases and genes, cellular localization of proteins, among others. In recent years, several knowledge graph-based semantic similarity measures have been developed, but building a gold standard data set to support their evaluation is non-trivial. We present a collection of 21 benchmark data sets that aim at circumventing the difficulties in building benchmarks for large biomedical knowledge graphs by exploiting proxies for biomedical entity similarity. These data sets include data from two successful biomedical ontologies, Gene Ontology and Human Phenotype Ontology, and explore proxy similarities calculated based on protein sequence similarity, protein family similarity, protein-protein interactions and phenotype-based gene similarity. Data sets have varying sizes and cover four different species at different levels of annotation completion. For each data set, we also provide semantic similarity computations with state-of-the-art representative measures. Database URL: https://github.com/liseda-lab/kgsim-benchmark.
Collapse
Affiliation(s)
- Carlota Cardoso
- Departamento de informática, LASIGE Faculdade de Ciências da Universidade de Lisboa, 1749 - 016 Lisboa, Portugal
| | - Rita T Sousa
- Departamento de informática, LASIGE Faculdade de Ciências da Universidade de Lisboa, 1749 - 016 Lisboa, Portugal
| | | | - Catia Pesquita
- Departamento de informática, LASIGE Faculdade de Ciências da Universidade de Lisboa, 1749 - 016 Lisboa, Portugal
| |
Collapse
|
7
|
Yu G, Wang K, Fu G, Guo M, Wang J. NMFGO: Gene Function Prediction via Nonnegative Matrix Factorization with Gene Ontology. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:238-249. [PMID: 30059316 DOI: 10.1109/tcbb.2018.2861379] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Gene Ontology (GO) is a controlled vocabulary of terms that describe molecule function, biological roles, and cellular locations of gene products (i.e., proteins and RNAs), it hierarchically organizes more than 43,000 GO terms via the direct acyclic graph. A gene is generally annotated with several of these GO terms. Therefore, accurately predicting the association between genes and massive terms is a difficult challenge. To combat with this challenge, we propose an matrix factorization based approach called NMFGO. NMFGO stores the available GO annotations of genes in a gene-term association matrix and adopts an ontological structure based taxonomic similarity measure to capture the GO hierarchy. Next, it factorizes the association matrix into two low-rank matrices via nonnegative matrix factorization regularized with the GO hierarchy. After that, it employs a semantic similarity based k nearest neighbor classifier in the low-rank matrices approximated subspace to predict gene functions. Empirical study on three model species (S. cerevisiae, H. sapiens, and A. thaliana) shows that NMFGO is robust to the input parameters and achieves significantly better prediction performance than GIC, TO, dRW- kNN, and NtN, which were re-implemented based on the instructions of the original papers. The supplementary file and demo codes of NMFGO are available at http://mlda.swu.edu.cn/codes.php?name=NMFGO.
Collapse
|
8
|
Genome-Wide Investigation of the NAC Gene Family and Its Potential Association with the Secondary Cell Wall in Moso Bamboo. Biomolecules 2019; 9:biom9100609. [PMID: 31615151 PMCID: PMC6843218 DOI: 10.3390/biom9100609] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2019] [Revised: 10/09/2019] [Accepted: 10/13/2019] [Indexed: 01/20/2023] Open
Abstract
NAC (NAM, ATAF, and CUC) transcription factors (TFs) are implicated in the transcriptional regulation of diverse processes and have been characterized in a number of plant species. However, NAC TFs are still not well understood in bamboo, especially their potential association with the secondary cell wall (SCW). Here, 94 PeNACs were identified and characterized in moso bamboo (Phyllostachys edulis). Based on their gene structures and conserved motifs, the PeNACs were divided into 11 groups according to their homologs in Arabidopsis. PeNACs were expressed variously in different tissues of moso bamboo, suggesting their functional diversity. Fifteen PeNACs associated with the SCW were selected for co-expression analysis and validation. It was predicted that 396 genes were co-expressed with the 15 PeNACs, in which 16 and 55 genes were involved in the lignin catabolic process and cellulose biosynthetic process respectively. As the degree of lignification in the growing bamboo shoots increased, all 15 PeNACs were upregulated with a trend of rising first and then decreasing except PeNAC37, which increased continuously. These results indicated that these PeNACs might play important roles in SCW biosynthesis and lignification in bamboo shoots. Seven of 15 PeNACs had been found positively co-expressed with seven PeMYBs, and they had similar expression patterns with those of the PeMYBs in bamboo shoots. The targeted sites of miR164 were found in 16 PeNACs, of which three PeNACs associated with SCW were validated to have an opposite expression trend to that of miR164 in growing bamboo shoots. In addition, three PeNACs were selected and verified to have self-activation activities. These results provide comprehensive information of the NAC gene family in moso bamboo, which will be helpful for further functional studies of PeNACs to reveal the molecular regulatory mechanisms of bamboo wood property.
Collapse
|
9
|
Wang L, Law J, Kale SD, Murali TM, Pandey G. Large-scale protein function prediction using heterogeneous ensembles. F1000Res 2018; 7. [PMID: 30450194 PMCID: PMC6221071 DOI: 10.12688/f1000research.16415.1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 09/26/2018] [Indexed: 12/24/2022] Open
Abstract
Heterogeneous ensembles are an effective approach in scenarios where the ideal data type and/or individual predictor are unclear for a given problem. These ensembles have shown promise for protein function prediction (PFP), but their ability to improve PFP at a large scale is unclear. The overall goal of this study is to critically assess this ability of a variety of heterogeneous ensemble methods across a multitude of functional terms, proteins and organisms. Our results show that these methods, especially Stacking using Logistic Regression, indeed produce more accurate predictions for a variety of Gene Ontology terms differing in size and specificity. To enable the application of these methods to other related problems, we have publicly shared the HPC-enabled code underlying this work as LargeGOPred ( https://github.com/GauravPandeyLab/LargeGOPred).
Collapse
Affiliation(s)
- Linhua Wang
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
| | - Jeffrey Law
- Genetics, Bioinformatics, and Computational Biology Ph.D. Program, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA
| | - Shiv D Kale
- Biocomplexity Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA
| | - T M Murali
- Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA
| | - Gaurav Pandey
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
| |
Collapse
|
10
|
Zhao Y, Fu G, Wang J, Guo M, Yu G. Gene function prediction based on Gene Ontology Hierarchy Preserving Hashing. Genomics 2018; 111:334-342. [PMID: 29477548 DOI: 10.1016/j.ygeno.2018.02.008] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2017] [Revised: 02/02/2018] [Accepted: 02/16/2018] [Indexed: 12/27/2022]
Abstract
Gene Ontology (GO) uses structured vocabularies (or terms) to describe the molecular functions, biological roles, and cellular locations of gene products in a hierarchical ontology. GO annotations associate genes with GO terms and indicate the given gene products carrying out the biological functions described by the relevant terms. However, predicting correct GO annotations for genes from a massive set of GO terms as defined by GO is a difficult challenge. To combat with this challenge, we introduce a Gene Ontology Hierarchy Preserving Hashing (HPHash) based semantic method for gene function prediction. HPHash firstly measures the taxonomic similarity between GO terms. It then uses a hierarchy preserving hashing technique to keep the hierarchical order between GO terms, and to optimize a series of hashing functions to encode massive GO terms via compact binary codes. After that, HPHash utilizes these hashing functions to project the gene-term association matrix into a low-dimensional one and performs semantic similarity based gene function prediction in the low-dimensional space. Experimental results on three model species (Homo sapiens, Mus musculus and Rattus norvegicus) for interspecies gene function prediction show that HPHash performs better than other related approaches and it is robust to the number of hash functions. In addition, we also take HPHash as a plugin for BLAST based gene function prediction. From the experimental results, HPHash again significantly improves the prediction performance. The codes of HPHash are available at: http://mlda.swu.edu.cn/codes.php?name=HPHash.
Collapse
Affiliation(s)
- Yingwen Zhao
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Guangyuan Fu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China; Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing 100044, China.
| | - Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China.
| |
Collapse
|
11
|
Peng J, Wang H, Lu J, Hui W, Wang Y, Shang X. Identifying term relations cross different gene ontology categories. BMC Bioinformatics 2017; 18:573. [PMID: 29297309 PMCID: PMC5751813 DOI: 10.1186/s12859-017-1959-3] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
Background The Gene Ontology (GO) is a community-based bioinformatics resource that employs ontologies to represent biological knowledge and describes information about gene and gene product function. GO includes three independent categories: molecular function, biological process and cellular component. For better biological reasoning, identifying the biological relationships between terms in different categories are important. However, the existing measurements to calculate similarity between terms in different categories are either developed by using the GO data only or only take part of combined gene co-function network information. Results We propose an iterative ranking-based method called CroGO2 to measure the cross-categories GO term similarities by incorporating level information of GO terms with both direct and indirect interactions in the gene co-function network. Conclusions The evaluation test shows that CroGO2 performs better than the existing methods. A genome-specific term association network for yeast is also generated by connecting terms with the high confidence score. The linkages in the term association network could be supported by the literature. Given a gene set, the related terms identified by using the association network have overlap with the related terms identified by GO enrichment analysis.
Collapse
Affiliation(s)
- Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| | - Honggang Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Junya Lu
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| | - Weiwei Hui
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
| | - Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China.
| |
Collapse
|
12
|
HashGO: hashing gene ontology for protein function prediction. Comput Biol Chem 2017; 71:264-273. [DOI: 10.1016/j.compbiolchem.2017.09.010] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2017] [Accepted: 09/25/2017] [Indexed: 10/18/2022]
|