1
|
Gu Z. simona: a comprehensive R package for semantic similarity analysis on bio-ontologies. BMC Genomics 2024; 25:869. [PMID: 39285315 PMCID: PMC11406866 DOI: 10.1186/s12864-024-10759-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2024] [Accepted: 09/02/2024] [Indexed: 09/19/2024] Open
Abstract
BACKGROUND Bio-ontologies are key to structuring complex biological information for effective data integration and knowledge representation. Semantic similarity analysis on bio-ontologies quantitatively assesses the degree of similarity between biological concepts based on the semantics encoded in ontologies. It plays an important role in structured and meaningful interpretation and integration of complex data from multiple biological domains. RESULTS We present simona, a novel R package for semantic similarity analysis on general bio-ontologies. Simona implements infrastructures for ontology analysis by offering efficient data structures, fast ontology traversal methods, and elegant visualizations. Moreover, it provides a robust toolbox supporting over 70 methods for semantic similarity analysis. With simona, we conducted a benchmark against current semantic similarity methods. The results demonstrate that methods cluster according to their mathematical methodologies, thus guiding researchers in the selection of appropriate methods. Additionally, we explored annotation-based versus topology-based methods, revealing that semantic similarities based solely on ontology topology can efficiently reveal semantic similarity structures, facilitating analysis on less-studied organisms and other ontologies. CONCLUSIONS Simona offers a versatile interface and efficient implementation for processing, visualization, and semantic similarity analysis on bio-ontologies. We believe that simona will serve as a robust tool for uncovering relationships and enhancing the interoperability of biological knowledge systems.
Affiliation(s)
- Zuguang Gu
  - Molecular Precision Oncology Program, National Center for Tumor Diseases (NCT), Im Neuenheimer Feld 280, Heidelberg, 69120, Germany.
|
2
|
Koutsandreas T, Felden B, Chevet E, Chatziioannou A. Protein homeostasis imprinting across evolution. NAR Genom Bioinform 2024; 6:lqae014. [PMID: 38486886 PMCID: PMC10939379 DOI: 10.1093/nargab/lqae014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Revised: 10/07/2023] [Accepted: 01/24/2024] [Indexed: 03/17/2024] Open
Abstract
Protein homeostasis (a.k.a. proteostasis) is associated with the primary functions of life, and therefore with evolution. However, it is unclear how cellular proteostasis machines have evolved to adjust protein biogenesis needs to environmental constraints. Herein, we describe a novel computational approach, based on semantic network analysis, to evaluate proteostasis plasticity during evolution. We show that the molecular components of the proteostasis network (PN) are reliable metrics to deconvolute the life forms into Archaea, Bacteria and Eukarya and to assess the evolution rates among species. Semantic graphs were used as new criteria to evaluate PN complexity in 93 Eukarya, 250 Bacteria and 62 Archaea, thus representing a novel strategy for taxonomic classification, which provided information about species divergence. Kingdom-specific PN components were identified, suggesting that PN complexity may correlate with evolution. We found that the gains that occurred throughout PN evolution revealed a dichotomy within both the PN conserved modules and within kingdom-specific modules. Additionally, many of these components contribute to the evolutionary imprinting of other conserved mechanisms. Finally, the current study suggests a new way to exploit the genomic annotation of biomedical ontologies, deriving new knowledge from the semantic comparison of different biological systems.
Affiliation(s)
- Thodoris Koutsandreas
  - Center of Systems Biology, Biomedical Research Foundation of the Academy of Athens, Athens, Greece
  - e-NIOS Applications PC, Kallithea-Athens, Greece
- Brice Felden
  - University of Rennes, INSERM U1230, Rennes, France
- Eric Chevet
  - INSERM U1242, University of Rennes, Rennes, France
  - Centre de Lutte Contre le Cancer Eugène Marquis, Rennes, France
- Aristotelis Chatziioannou
  - Center of Systems Biology, Biomedical Research Foundation of the Academy of Athens, Athens, Greece
  - e-NIOS Applications PC, Kallithea-Athens, Greece
|
3
|
Lastra-Díaz JJ, Lara-Clares A, Garcia-Serrano A. HESML: a real-time semantic measures library for the biomedical domain with a reproducible survey. BMC Bioinformatics 2022; 23:23. [PMID: 34991460 PMCID: PMC8734250 DOI: 10.1186/s12859-021-04539-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2020] [Accepted: 12/15/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Ontology-based semantic similarity measures based on SNOMED-CT, MeSH, and Gene Ontology are being extensively used in many applications in biomedical text mining and genomics respectively, which has encouraged the development of semantic measures libraries based on the aforementioned ontologies. However, current state-of-the-art semantic measures libraries have some performance and scalability drawbacks derived from their ontology representations based on relational databases, or naive in-memory graph representations. Likewise, a recent reproducible survey on word similarity shows that one hybrid IC-based measure which integrates a shortest-path computation sets the state of the art in the family of ontology-based semantic measures. However, the lack of an efficient shortest-path algorithm for their real-time computation prevents both their practical use in any application and the use of any other path-based semantic similarity measure. RESULTS To bridge the two aforementioned gaps, this work introduces for the first time an updated version of the HESML Java software library especially designed for the biomedical domain, which implements the most efficient and scalable ontology representation reported in the literature, together with a new method for the approximation of Dijkstra's algorithm for taxonomies, called Ancestors-based Shortest-Path Length (AncSPL), which allows the real-time computation of any path-based semantic similarity measure. CONCLUSIONS We introduce a set of reproducible benchmarks showing that HESML outperforms by several orders of magnitude the current state-of-the-art libraries in the three aforementioned biomedical ontologies, as well as the real-time performance and approximation quality of the new AncSPL shortest-path algorithm. Likewise, we show that AncSPL scales linearly with the dimension of the common ancestor subgraph regardless of the ontology size. Path-based measures based on the new AncSPL algorithm are up to six orders of magnitude faster than their exact implementation in large ontologies like SNOMED-CT and GO. Finally, we provide a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.
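The idea of restricting a shortest-path search to the ancestors of the two concepts can be illustrated with a small Python sketch. This is not HESML's AncSPL implementation (HESML is a Java library, and the abstract does not give the algorithm's details); it is a simplified, unweighted illustration that only considers paths climbing from each concept to a shared ancestor. The toy taxonomy and all function names are hypothetical.

```python
from collections import deque


def ancestor_distances(parents, start):
    """Breadth-first search upward: hop count from `start` to each of its ancestors."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def ancestor_path_length(parents, a, b):
    """Length of the shortest a-b path that passes through a shared ancestor.

    Only the ancestors of `a` and `b` are visited, so the cost depends on the
    size of their ancestor sets rather than on the size of the whole taxonomy.
    Returns None when the two concepts share no ancestor.
    """
    dist_a = ancestor_distances(parents, a)
    dist_b = ancestor_distances(parents, b)
    common = dist_a.keys() & dist_b.keys()
    if not common:
        return None
    return min(dist_a[c] + dist_b[c] for c in common)


# Hypothetical toy taxonomy: each concept maps to its list of parents.
toy_parents = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
    "E": ["D"],
}

print(ancestor_path_length(toy_parents, "C", "E"))
```

Because the search never leaves the two ancestor sets, the work done is independent of how many unrelated concepts the ontology contains, which is the property the abstract highlights.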
Affiliation(s)
- Juan J. Lastra-Díaz
  - NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
- Alicia Lara-Clares
  - NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
- Ana Garcia-Serrano
  - NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
|
4
|
Kulmanov M, Smaili FZ, Gao X, Hoehndorf R. Semantic similarity and machine learning with ontologies. Brief Bioinform 2021; 22:bbaa199. [PMID: 33049044 PMCID: PMC8293838 DOI: 10.1093/bib/bbaa199] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 05/07/2020] [Revised: 08/03/2020] [Accepted: 08/04/2020] [Indexed: 12/13/2022] Open
Abstract
Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge, and they are employed in almost every major biological database. Recently, ontologies have increasingly been used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview of the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.
Affiliation(s)
- Xin Gao
  - Computational Bioscience Research Center and lead of the Structural and Functional Bioinformatics Group at King Abdullah University of Science and Technology
|
5
|
Choudhury A, Aron S, Botigué LR, Sengupta D, Botha G, Bensellak T, Wells G, Kumuthini J, Shriner D, Fakim YJ, Ghoorah AW, Dareng E, Odia T, Falola O, Adebiyi E, Hazelhurst S, Mazandu G, Nyangiri OA, Mbiyavanga M, Benkahla A, Kassim SK, Mulder N, Adebamowo SN, Chimusa ER, Muzny D, Metcalf G, Gibbs RA, Rotimi C, Ramsay M, Adeyemo AA, Lombard Z, Hanchard NA. High-depth African genomes inform human migration and health. Nature 2020; 586:741-748. [PMID: 33116287 PMCID: PMC7759466 DOI: 10.1038/s41586-020-2859-7] [Citation(s) in RCA: 168] [Impact Index Per Article: 42.0] [Received: 05/10/2019] [Accepted: 08/07/2020] [Indexed: 01/05/2023]
Abstract
The African continent is regarded as the cradle of modern humans and African genomes contain more genetic variation than those from any other continent, yet only a fraction of the genetic diversity among African individuals has been surveyed. Here we performed whole-genome sequencing analyses of 426 individuals (comprising 50 ethnolinguistic groups, including previously unsampled populations) to explore the breadth of genomic diversity across Africa. We uncovered more than 3 million previously undescribed variants, most of which were found among individuals from newly sampled ethnolinguistic groups, as well as 62 previously unreported loci that are under strong selection, which were predominantly found in genes that are involved in viral immunity, DNA repair and metabolism. We observed complex patterns of ancestral admixture and putative-damaging and novel variation, both within and between populations, alongside evidence that Zambia was a likely intermediate site along the routes of expansion of Bantu-speaking populations. Pathogenic variants in genes that are currently characterized as medically relevant were uncommon, but in other genes, variants denoted as 'likely pathogenic' in the ClinVar database were commonly observed. Collectively, these findings refine our current understanding of continental migration, identify gene flow and the response to human disease as strong drivers of genome-level population variation, and underscore the scientific imperative for a broader characterization of the genomic diversity of African individuals to understand human ancestry and improve health.
Affiliation(s)
- Ananyo Choudhury
  - Sydney Brenner Institute for Molecular Bioscience, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
- Shaun Aron
  - Sydney Brenner Institute for Molecular Bioscience, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
- Laura R Botigué
  - Center for Research in Agricultural Genomics (CRAG), Plant and Animal Genomics Program, CSIC-IRTA-UAB-UB, Barcelona, Spain
- Dhriti Sengupta
  - Sydney Brenner Institute for Molecular Bioscience, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
- Gerrit Botha
  - Computational Biology Division and H3ABioNet, Department of Integrative Biomedical Sciences, IDM, University of Cape Town, Cape Town, South Africa
- Taoufik Bensellak
  - System and Data Engineering Team, Abdelmalek Essaadi University, ENSA, Tangier, Morocco
- Gordon Wells
  - Centre for Proteomic and Genomic Research (CPGR), Cape Town, South Africa
  - South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa
  - Africa Health Research Institute, Durban, South Africa
- Judit Kumuthini
  - Centre for Proteomic and Genomic Research (CPGR), Cape Town, South Africa
  - South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa
- Daniel Shriner
  - Center for Research on Genomics and Global Health, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Yasmina J Fakim
  - Department of Agriculture and Food Science, Faculty of Agriculture, University of Mauritius, Reduit, Mauritius
  - Department of Digital Technologies, Faculty of Information, Communication & Digital Technologies, University of Mauritius, Reduit, Mauritius
- Anisah W Ghoorah
  - Department of Digital Technologies, Faculty of Information, Communication & Digital Technologies, University of Mauritius, Reduit, Mauritius
- Eileen Dareng
  - Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - Institute of Human Virology Nigeria, Abuja, Nigeria
- Trust Odia
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Nigeria
- Oluwadamilare Falola
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Nigeria
- Ezekiel Adebiyi
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Nigeria
  - Department of Computer and Information Sciences, Covenant University, Ota, Nigeria
- Scott Hazelhurst
  - Sydney Brenner Institute for Molecular Bioscience, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
  - School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, South Africa
- Gaston Mazandu
  - Computational Biology Division and H3ABioNet, Department of Integrative Biomedical Sciences, IDM, University of Cape Town, Cape Town, South Africa
- Oscar A Nyangiri
  - College of Veterinary Medicine, Animal Resources and Biosecurity, Makerere University, Kampala, Uganda
- Mamana Mbiyavanga
  - Computational Biology Division and H3ABioNet, Department of Integrative Biomedical Sciences, IDM, University of Cape Town, Cape Town, South Africa
- Alia Benkahla
  - Laboratory of Bioinformatics, Biomathematics and Biostatistics (BIMS), Institute Pasteur of Tunis, Tunis, Tunisia
- Samar K Kassim
  - Medical Biochemistry and Molecular Biology Department, Faculty of Medicine, Ain Shams University, Abbaseya, Cairo, Egypt
- Nicola Mulder
  - Computational Biology Division and H3ABioNet, Department of Integrative Biomedical Sciences, IDM, University of Cape Town, Cape Town, South Africa
- Sally N Adebamowo
  - Department of Epidemiology and Public Health, University of Maryland School of Medicine, University of Maryland Baltimore, Baltimore, MD, USA
  - University of Maryland Greenebaum Comprehensive Cancer Center, University of Maryland School of Medicine, University of Maryland Baltimore, Baltimore, MD, USA
- Emile R Chimusa
  - Division of Human Genetics, Department of Pathology, Faculty of Health Sciences, Institute for Infectious Disease and Molecular Medicine, University of Cape Town, Cape Town, South Africa
- Donna Muzny
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Ginger Metcalf
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Richard A Gibbs
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Charles Rotimi
  - Center for Research on Genomics and Global Health, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Michèle Ramsay
  - Sydney Brenner Institute for Molecular Bioscience, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
  - Division of Human Genetics, National Health Laboratory Service, and School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
- Adebowale A Adeyemo
  - Center for Research on Genomics and Global Health, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
- Zané Lombard
  - Division of Human Genetics, National Health Laboratory Service, and School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa.
- Neil A Hanchard
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA.
|
6
|
Zhao Y, Wang J, Chen J, Zhang X, Guo M, Yu G. A Literature Review of Gene Function Prediction by Modeling Gene Ontology. Front Genet 2020; 11:400. [PMID: 32391061 PMCID: PMC7193026 DOI: 10.3389/fgene.2020.00400] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 01/02/2020] [Accepted: 03/30/2020] [Indexed: 12/14/2022] Open
Abstract
Annotating the functional properties of gene products, i.e., RNAs and proteins, is a fundamental task in biology. The Gene Ontology database (GO) was developed to systematically describe the functional properties of gene products across species, and to facilitate the computational prediction of gene function. As GO is routinely updated, it serves as the gold standard and main knowledge source in functional genomics. Many gene function prediction methods making use of GO have been proposed. But no literature review has summarized these methods and the possibilities for future efforts from the perspective of GO. To bridge this gap, we review the existing methods with an emphasis on recent solutions. First, we introduce the conventions of GO and the widely adopted evaluation metrics for gene function prediction. Next, we summarize current methods of gene function prediction that apply GO in different ways, such as using hierarchical or flat inter-relationships between GO terms, compressing massive GO terms and quantifying semantic similarities. Although many efforts have improved performance by harnessing GO, we conclude that there remain many largely overlooked but important topics for future research.
Affiliation(s)
- Yingwen Zhao
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Jun Wang
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Jian Chen
  - State Key Laboratory of Agrobiotechnology and National Maize Improvement Center, China Agricultural University, Beijing, China
- Xiangliang Zhang
  - CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
- Guoxian Yu
  - College of Computer and Information Science, Southwest University, Chongqing, China
  - CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
|
7
|
Acharya S, Saha S, Pradhan P. Multi-Factored Gene-Gene Proximity Measures Exploiting Biological Knowledge Extracted from Gene Ontology: Application in Gene Clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:207-219. [PMID: 29994130 DOI: 10.1109/tcbb.2018.2849362] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 06/08/2023]
Abstract
To describe the cellular functions of proteins and genes, a potential dynamic vocabulary is Gene Ontology (GO), which comprises three sub-ontologies, namely Biological-process, Cellular-component, and Molecular-function. It has several applications in the field of bioinformatics, such as annotating/measuring gene-gene or protein-protein semantic similarity and identifying genes/proteins by their GO annotations for disease gene and target discovery. To determine semantic similarity between genes, several semantic measures have been proposed in the literature, which involve the information content of GO-terms, the GO tree structure, or a combination of both. However, most of the existing semantic similarity measures do not consider different topological and information theoretic aspects of GO-terms collectively. Inspired by this fact, in this article we first propose three novel semantic similarity/distance measures for genes covering different aspects of the GO-tree. These are further embedded in the frameworks of well-known multi-objective and single-objective clustering algorithms to determine functionally similar genes. For comparative analysis, 10 popular existing GO-based semantic similarity/distance measures and tools are also considered. Experimental results on Mouse genome, Yeast, and Human genome datasets demonstrate the superiority of multi-objective clustering algorithms in association with the proposed multi-factored similarity/distance measures. Clustering outcomes are further validated by conducting some biological/statistical significance tests. Supplementary information is available at https://www.iitp.ac.in/sriparna/journals.html.
|
8
|
Mazandu GK, Chimusa ER, Rutherford K, Zekeng EG, Gebremariam ZZ, Onifade MY, Mulder NJ. Large-scale data-driven integrative framework for extracting essential targets and processes from disease-associated gene data sets. Brief Bioinform 2019; 19:1141-1152. [PMID: 28520909 DOI: 10.1093/bib/bbx052] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/19/2017] [Indexed: 12/20/2022] Open
Abstract
Populations worldwide currently face several public health challenges, including growing prevalence of infections and the emergence of new pathogenic organisms. The cost and risk associated with drug development make the development of new drugs for several diseases, especially orphan or rare diseases, unappealing to the pharmaceutical industry. Proof of drug safety and efficacy is required before market approval, and rigorous testing makes the drug development process slow and expensive, and it frequently results in failure. This failure is often because of the use of irrelevant targets identified in the early steps of the drug discovery process, suggesting that target identification and validation are cornerstones for the success of drug discovery and development. Here, we present a large-scale data-driven integrative computational framework to extract essential targets and processes from an existing disease-associated data set and enhance target selection by leveraging drug-target-disease association at the systems level. We applied this framework to tuberculosis and Ebola virus diseases combining heterogeneous data from multiple sources, including protein-protein functional interaction, functional annotation and pharmaceutical data sets. Results obtained demonstrate the effectiveness of the pipeline, leading to the extraction of essential drug targets and to the rational use of existing approved drugs. This provides an opportunity to move toward optimal target-based strategies for screening available drugs and for drug discovery. There is potential for this model to bridge the gap in the production of orphan disease therapies, offering a systematic approach to predict new uses for existing drugs, thereby harnessing their full therapeutic potential.
Affiliation(s)
- Gaston K Mazandu
  - Institute of Infectious Disease and Molecular Medicine at UCT and a Researcher at AIMS
- Zoe Z Gebremariam
  - Institute of Infection and Global Health, University of Liverpool, UK
- Maryam Y Onifade
  - African Institute for Mathematical Sciences jointly with University of Cape Coast, Ghana
- Nicola J Mulder
  - Department of Integrative Biomedical Sciences and the Head of the Computational Biology Division, UCT
|
9
|
Acharya S, Saha S, Pradhan P. Novel symmetry-based gene-gene dissimilarity measures utilizing Gene Ontology: Application in gene clustering. Gene 2018; 679:341-351. [PMID: 30184472 DOI: 10.1016/j.gene.2018.08.062] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/11/2018] [Revised: 08/21/2018] [Accepted: 08/21/2018] [Indexed: 11/25/2022]
Abstract
In recent years DNA microarray technology, leading to the generation of high-volume biological data, has gained significant attention. One powerful tool for analyzing such high-volume gene-expression data is clustering. The efficiency of any clustering algorithm depends largely on the underlying similarity/dissimilarity measure. During the analysis of such data there is often a need to further explore the similarity of genes not only with respect to their expression values but also with respect to their functional annotations, which can be obtained from Gene Ontology (GO) databases. In the existing literature, several novel clustering and bi-clustering approaches were proposed to identify co-regulated genes from gene-expression datasets. Identifying co-regulated genes from gene expression data misses some important biological information about functionalities of genes, which is necessary to identify semantically related genes. In this paper, we propose sixteen different semantic gene-gene dissimilarity measures utilizing biological information of genes retrieved from a global biological database, namely Gene Ontology (GO). Four proximity measures, viz. Euclidean, Cosine, point symmetry and line symmetry, are combined with four different representations of gene-GO-term annotation vectors to develop a total of sixteen gene-gene dissimilarity measures. To illustrate the usefulness of the developed dissimilarity measures, some multi-objective as well as single-objective clustering algorithms are applied using the proposed measures to identify functionally similar genes from Mouse genome and Yeast datasets. Furthermore, we have compared the performance of our proposed sixteen dissimilarity measures with three existing state-of-the-art semantic similarity and distance measures.
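As a rough illustration of how a proximity measure is paired with an annotation-vector representation, the Python sketch below computes Euclidean and cosine dissimilarities between two genes. The binary (present/absent) encoding is an assumption made here for simplicity; the abstract does not specify the four representations the authors use, and the symmetry-based measures are not shown. The term labels and gene sets are hypothetical.

```python
import math


def annotation_vector(gene_terms, vocabulary):
    """Binary gene-term vector: 1.0 if the gene is annotated with the term, else 0.0."""
    return [1.0 if term in gene_terms else 0.0 for term in vocabulary]


def euclidean_dissimilarity(u, v):
    """Straight-line distance between two annotation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cosine_dissimilarity(u, v):
    """One minus the cosine of the angle between two annotation vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A gene with no annotations is treated as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical vocabulary of terms and two hypothetical genes.
vocabulary = ["T1", "T2", "T3", "T4"]
gene_x = annotation_vector({"T1", "T2"}, vocabulary)
gene_y = annotation_vector({"T2", "T3"}, vocabulary)

print(euclidean_dissimilarity(gene_x, gene_y))
print(cosine_dissimilarity(gene_x, gene_y))
```

Swapping in a different vector representation (for example, weighted rather than binary entries) changes only `annotation_vector`, which is how four measures and four representations yield sixteen combinations.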
Affiliation(s)
- Sudipta Acharya
  - Department of Computer Science and Engineering, IIT Patna, India.
- Sriparna Saha
  - Department of Computer Science and Engineering, IIT Patna, India
- Prasanna Pradhan
  - Department of Computer Applications, Sikkim Manipal Institute of Technology, India
|
10
|
Qu R, Fang Y, Bai W, Jiang Y. Computing semantic similarity based on novel models of semantic representation using Wikipedia. Inf Process Manag 2018. [DOI: 10.1016/j.ipm.2018.07.002] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Indexed: 10/28/2022]
|
11
|
GOGO: An improved algorithm to measure the semantic similarity between gene ontology terms. Sci Rep 2018; 8:15107. [PMID: 30305653 PMCID: PMC6180005 DOI: 10.1038/s41598-018-33219-y] [Citation(s) in RCA: 49] [Impact Index Per Article: 8.2] [Received: 10/31/2017] [Accepted: 09/24/2018] [Indexed: 01/29/2023] Open
Abstract
Measuring the semantic similarity between Gene Ontology (GO) terms is an essential step in functional bioinformatics research. We implemented a software tool named GOGO for calculating the semantic similarity between GO terms. GOGO has the advantages of both information-content-based and hybrid methods, such as Resnik’s and Wang’s methods. Moreover, GOGO is relatively fast and does not need to calculate information content (IC) from a large gene annotation corpus but still has the advantage of using IC. This is achieved by considering the number of child nodes in the GO directed acyclic graphs when calculating the semantic contribution that an ancestor node gives to its descendant nodes. GOGO can calculate functional similarities between genes and then cluster genes based on their functional similarities. Evaluations performed on multiple pathways retrieved from the Saccharomyces Genome Database (SGD) show that GOGO can accurately and robustly cluster genes based on functional similarities. We release GOGO as a web server and also as a stand-alone tool, which allows convenient execution of the tool for a small number of GO terms or integration of the tool into bioinformatics pipelines for large-scale calculations. GOGO can be freely accessed or downloaded from http://dna.cs.miami.edu/GOGO/.
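The effect of letting the number of children influence an ancestor's semantic contribution can be sketched in Python along the lines of a Wang-style calculation. This is not GOGO's code: the weight function `edge_weight` and its constants `c` and `d` are illustrative assumptions, since the abstract does not state the formula, and the toy graph and all function names are hypothetical.

```python
def invert(parents):
    """Build a child list for every node from a node -> parents mapping."""
    children = {node: [] for node in parents}
    for node, node_parents in parents.items():
        for parent in node_parents:
            children.setdefault(parent, []).append(node)
    return children


def edge_weight(n_children, c=1.0, d=0.3):
    """Illustrative weight (an assumption): smaller when the parent has more children."""
    return 1.0 / (c + n_children) + d


def s_values(term, parents, children):
    """Semantic contribution of `term` and of each of its ancestors to `term`.

    The term itself contributes 1.0; an ancestor contributes the best
    (largest) product of edge weights along any path down to the term.
    """
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            candidate = edge_weight(len(children[parent])) * s[node]
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                frontier.append(parent)
    return s


def term_similarity(a, b, parents, children):
    """Shared contributions divided by total contributions of both terms."""
    s_a = s_values(a, parents, children)
    s_b = s_values(b, parents, children)
    shared = s_a.keys() & s_b.keys()
    numerator = sum(s_a[t] + s_b[t] for t in shared)
    return numerator / (sum(s_a.values()) + sum(s_b.values()))


# Hypothetical toy graph: each term maps to its list of parents.
toy_parents = {"root": [], "A": ["root"], "B": ["root"], "C": ["A"], "D": ["A"]}
toy_children = invert(toy_parents)

print(term_similarity("C", "D", toy_parents, toy_children))
print(term_similarity("C", "B", toy_parents, toy_children))
```

Because the weight depends only on graph structure, no annotation corpus is needed, yet broad ancestors with many children contribute less, which loosely mimics the behaviour of information content.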
|
12
|
Gao W, Guirao JLG, Basavanagoud B, Wu J. Partial multi-dividing ontology learning algorithm. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.07.049] [Citation(s) in RCA: 148] [Impact Index Per Article: 24.7] [Indexed: 01/08/2023]
|
13
|
Liu W, Liu J, Rajapakse JC. Gene Ontology Enrichment Improves Performances of Functional Similarity of Genes. Sci Rep 2018; 8:12100. [PMID: 30108262 PMCID: PMC6092333 DOI: 10.1038/s41598-018-30455-0] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 11/23/2017] [Accepted: 07/25/2018] [Indexed: 12/23/2022] Open
Abstract
There exists a plethora of measures to evaluate functional similarity (FS) between genes, which is widely used in many bioinformatics applications including detecting molecular pathways, identifying co-expressed genes, predicting protein-protein interactions, and prioritizing disease genes. Measures of FS between genes are mostly derived from the Information Content (IC) of the Gene Ontology (GO) terms annotating the genes. However, existing measures that evaluate the IC of terms based either on the representations of terms in the annotating corpus or on the knowledge embedded in the GO hierarchy do not consider the enrichment of GO terms by the querying pair of genes. The enrichment of a GO term by a pair of genes depends on whether the term is annotated by one gene (i.e., partial annotation) or by both genes (i.e., complete annotation) in the pair. In this paper, we propose a method that incorporates enrichment of GO terms by a gene pair in computing their FS and show that GO enrichment improves the performance of 46 existing FS measures in the prediction of sequence homologies, gene expression correlations, protein-protein interactions, and disease-associated genes.
Affiliation(s)
- Wenting Liu
- Human Genetics, Genome Institute of Singapore, Singapore, Singapore.
- Jianjun Liu
- Human Genetics, Genome Institute of Singapore, Singapore, Singapore.
- Jagath C Rajapakse
- School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore.
|
14
|
Wang XL, Hou L, Zhao CG, Tang Y, Zhang B, Zhao JY, Wu YB. Screening of genes involved in epithelial-mesenchymal transition and differential expression of complement-related genes induced by PAX2 in renal tubules. Nephrology (Carlton) 2018; 24:263-271. [PMID: 29280536 PMCID: PMC6585862 DOI: 10.1111/nep.13216] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Accepted: 12/15/2017] [Indexed: 01/09/2023]
Abstract
Aim: The aim of the present study was to screen and verify downstream genes involved in the epithelial‐mesenchymal transition (EMT) induced by paired box 2 (PAX2) in NRK‐52E cells. Methods: NRK‐52E cells were transfected with lentivirus carrying the PAX2 gene or with no‐load virus, respectively. Total RNA was isolated 72 h after transfection from PAX2‐overexpressing cells and control cells. Isolated RNA was then hybridized with the Rat OneArray Plus expression profile chip. The chips were examined by Agilent 0.1 XDR to screen for differentially expressed genes, which were further analyzed to investigate complement‐related genes as genes of interest. Results: In NRK‐52E cells, PAX2 overexpression promoted EMT followed by upregulation of 298 genes and downregulation of 293 genes. KEGG analysis indicated the differential expression of genes related to cytokines and their receptors, extracellular matrix (ECM), MAPKs, local adhesion, cancer, the complement cascade, and coagulation. Gene Ontology analysis screened out genes related to molecular functions (e.g., hydrolase activity, phospholipase activity, components of the ECM), biological processes (e.g., cell development, signal transduction, phylogeny), and cell components (e.g., cytoplasm, cell membrane, and ECM). Analysis of the complement system revealed upregulation of C3 and downregulation of CD55 and complement regulator factor H (CFH). Conclusion: PAX2 overexpression upregulates EMT in vitro and may regulate C3, CD55, and CFH. This molecular analysis examines the effect of overexpressing paired box 2 (PAX2) in a tubule epithelial cell line. Results establish a link between PAX2 and both epithelial‐mesenchymal transition (EMT) and the complement pathway.
Affiliation(s)
- Xiu-Li Wang
- Department of Pediatric Nephrology, Shengjing Hospital of China Medical University, Shenyang, China
- Ling Hou
- Department of Pediatric Nephrology, Shengjing Hospital of China Medical University, Shenyang, China
- Cheng-Guang Zhao
- Department of Pediatric Nephrology, Shengjing Hospital of China Medical University, Shenyang, China
- Ying Tang
- Department of Pediatric Nephrology, Shengjing Hospital of China Medical University, Shenyang, China
- Bo Zhang
- Department of Pediatric Nephrology, Shengjing Hospital of China Medical University, Shenyang, China
- Jing-Ying Zhao
- Department of Pediatric Nephrology, Shengjing Hospital of China Medical University, Shenyang, China
- Yu-Bin Wu
- Department of Pediatric Nephrology, Shengjing Hospital of China Medical University, Shenyang, China
|
15
|
Mazandu GK, Chimusa ER, Mulder NJ. Gene Ontology semantic similarity tools: survey on features and challenges for biological knowledge discovery. Brief Bioinform 2017; 18:886-901. [PMID: 27473066 DOI: 10.1093/bib/bbw067] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 01/27/2016] [Indexed: 01/02/2023] Open
Abstract
Gene Ontology (GO) semantic similarity tools enable retrieval of semantic similarity scores, which incorporate biological knowledge embedded in the GO structure for comparing or classifying different proteins or lists of proteins based on their GO annotations. This facilitates a better understanding of the biological phenomena underlying the corresponding experiment and enables the identification of processes pertinent to different biological conditions. Currently, about 14 tools are available, which may play an important role in improving protein analyses at the functional level using different GO semantic similarity measures. Here we survey these tools to provide a comprehensive view of the challenges and advances made in this area, in order to avoid redundant effort in developing features that already exist or implementing ideas already proven to be obsolete in the context of GO. This helps researchers, tool developers, and end users understand the underlying semantic similarity measures implemented, through knowledge of the pertinent features of, and issues related to, a particular tool. This should empower users to make appropriate choices for their biological applications and ensure effective knowledge discovery based on GO annotations.
|
16
|
Yocgo RE, Geza E, Chimusa ER, Mazandu GK. A post-gene silencing bioinformatics protocol for plant-defence gene validation and underlying process identification: case study of the Arabidopsis thaliana NPR1. BMC Plant Biology 2017; 17:218. [PMID: 29169324 PMCID: PMC5701366 DOI: 10.1186/s12870-017-1151-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/30/2016] [Accepted: 11/07/2017] [Indexed: 06/07/2023]
Abstract
BACKGROUND Advances in forward and reverse genetic techniques have enabled the discovery and identification of several plant defence genes based on quantifiable disease phenotypes in mutant populations. Existing models for testing the effect of gene inactivation or genes causing these phenotypes do not take into account the possible uncertainty of these datasets and the potential noise inherent in the biological experiment used, which may mask downstream analysis and limit the use of these datasets. Moreover, elucidating the biological mechanisms driving the induced disease resistance and influencing these observable disease phenotypes has never been systematically tackled, eliciting the need for an efficient model to characterize completely the gene target under consideration. RESULTS We developed a post-gene silencing bioinformatics (post-GSB) protocol which accounts for potential biases related to the disease phenotype datasets in assessing the contribution of the gene target to the plant defence response. The post-GSB protocol uses Gene Ontology semantic similarity and pathway datasets to generate an enriched process regulatory network based on the functional degeneracy of the plant proteome to help understand the induced plant defence response. We applied this protocol to investigate the effect of NPR1 gene silencing on changes in Arabidopsis thaliana plants following Pseudomonas syringae pathovar tomato strain DC3000 infection. Results indicated that the presence of a functionally active NPR1 reduced the plant's susceptibility to the infection, with about 99% of variability in Pseudomonas spore growth between npr1 mutant and wild-type samples. Moreover, the post-GSB protocol has revealed the coordinated action of target-associated genes and pathways through an enriched process regulatory network, summarizing the potential target-based induced disease resistance mechanism.
CONCLUSIONS This protocol can improve the characterization of the gene target and, potentially, elucidate induced defence response by more effectively utilizing available phenotype information and plant proteome functional knowledge.
Affiliation(s)
- Rosita E. Yocgo
- African Institute for Mathematical Sciences (AIMS), AIMS South Africa and AIMS Ghana, Cape Town, South Africa
- Biomathematics Division, Department of Mathematical Sciences, Stellenbosch University, Stellenbosch, South Africa
- Ephifania Geza
- African Institute for Mathematical Sciences (AIMS), AIMS South Africa and AIMS Ghana, Cape Town, South Africa
- Computational Biology Division, Department of Integrative Biomedical Sciences, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Medical School, Anzio Road, Observatory, Cape Town, 7925 South Africa
- Emile R. Chimusa
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Medical School, Anzio Road, Observatory, Cape Town, 7925 South Africa
- Gaston K. Mazandu
- African Institute for Mathematical Sciences (AIMS), AIMS South Africa and AIMS Ghana, Cape Town, South Africa
- Biomathematics Division, Department of Mathematical Sciences, Stellenbosch University, Stellenbosch, South Africa
- Computational Biology Division, Department of Integrative Biomedical Sciences, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Medical School, Anzio Road, Observatory, Cape Town, 7925 South Africa
|
17
|
Chen Q, Wan Y, Zhang X, Lei Y, Zobel J, Verspoor K. Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases. ACM Journal of Data and Information Quality 2017. [DOI: 10.1145/3131611] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 12/14/2022]
Abstract
The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis.
Clustering methods have been widely applied to biological sequences for database deduplication. Since an exhaustive all-by-all pairwise comparison of sequences cannot scale for a high volume of data, heuristic approaches have been recruited, such as the use of simple similarity thresholds. In this article, we present a comparison between CD-HIT and UCLUST, the two best-known clustering tools for sequence database deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a biological case study that assesses intracluster function annotation consistency to demonstrate the impact of these factors on a practical application of the sequence clustering methods. Our results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. This evaluation leads to practical recommendations for users for more effective uses of the sequence clustering tools for deduplication.
Affiliation(s)
- Yu Wan
- University of Melbourne, Victoria, Australia
- Yang Lei
- University of Melbourne, Australia
|
18
|
Exploring Approaches for Detecting Protein Functional Similarity within an Orthology-based Framework. Sci Rep 2017; 7:381. [PMID: 28336965 PMCID: PMC5428484 DOI: 10.1038/s41598-017-00465-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 07/08/2016] [Accepted: 02/28/2017] [Indexed: 11/21/2022] Open
Abstract
Protein functional similarity based on gene ontology (GO) annotations serves as a powerful tool when comparing proteins on a functional level in applications such as protein-protein interaction prediction, gene prioritization, and disease gene discovery. Functional similarity (FS) is usually quantified by combining the GO hierarchy with an annotation corpus that links genes and gene products to GO terms. One large group of algorithms involves calculation of GO term semantic similarity (SS) between all the terms annotating the two proteins, followed by a second step, described as a “mixing strategy”, which involves combining the SS values to yield the final FS value. Due to the variability of protein annotation caused, e.g., by annotation bias, this value cannot be reliably compared on an absolute scale. We therefore introduce a similarity z-score that takes into account the FS background distribution of each protein. For a selection of popular SS measures and mixing strategies, we demonstrate a moderate accuracy improvement when using z-scores in a benchmark that aims to separate orthologous cases from random gene pairs, and we discuss in this context the impact of annotation corpus choice. The approach has been implemented in Frela, a fast high-throughput public web server for protein FS calculation and interpretation.
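The abstract does not state how the similarity z-score is computed. As a rough illustration only, the Python sketch below assumes that each protein has a list of background FS values (its FS against many other proteins) and that the raw FS of a pair is standardized against each protein's background, with the two results averaged. The function names and the averaging step are hypothetical choices made for this example; they are not taken from the paper or from Frela.

```python
from statistics import mean, stdev


def z_score(fs_value, background):
    """Standardize one FS value against a protein's background FS values.

    Assumption: the background is summarized by its sample mean and
    sample standard deviation.
    """
    mu = mean(background)
    sigma = stdev(background)  # needs at least two background values
    if sigma == 0:
        return 0.0  # a flat background carries no scale information
    return (fs_value - mu) / sigma


def pair_z_score(fs_ab, background_a, background_b):
    """Hypothetical symmetric score: average of the two per-protein z-scores."""
    return (z_score(fs_ab, background_a) + z_score(fs_ab, background_b)) / 2


if __name__ == "__main__":
    # Toy numbers, invented purely for illustration.
    bg_a = [0.2, 0.4, 0.6]
    bg_b = [0.1, 0.5, 0.9]
    print(pair_z_score(0.8, bg_a, bg_b))
```

Under this reading, the same raw FS value counts for more when a protein's background scores are low and tightly clustered than when they are high or widely spread, which is the kind of annotation-bias correction the abstract describes.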
|
19
|
Gao W, Qudair Baig A, Ali H, Sajjad W, Reza Farahani M. Margin based ontology sparse vector learning algorithm and applied in biology science. Saudi J Biol Sci 2016; 24:132-138. [PMID: 28053583 PMCID: PMC5199015 DOI: 10.1016/j.sjbs.2016.09.001] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.0] [Received: 06/29/2016] [Revised: 09/01/2016] [Accepted: 09/01/2016] [Indexed: 01/02/2023] Open
Abstract
In the field of biology, ontology applications involve a large amount of genetic information and chemical information on molecular structure, so ontology concepts convey much information. Therefore, in mathematical notation, the dimension of the vector corresponding to an ontology concept is often very large, which places higher demands on ontology algorithms. Against this background, we consider the design of an ontology sparse vector algorithm and its application in biology. In this paper, using knowledge of marginal likelihood and marginal distribution, an optimized strategy for a margin-based ontology sparse vector learning algorithm is presented. Finally, the new algorithm is applied to the gene ontology and the plant ontology to verify its efficiency.
Affiliation(s)
- Wei Gao
- School of Information Science and Technology, Yunnan Normal University, Kunming 650500, China
- Abdul Qudair Baig
- Department of Mathematics, COMSATS Institute of Information Technology, Attock, Pakistan
- Haidar Ali
- Department of Mathematics, COMSATS Institute of Information Technology, Attock, Pakistan
- Wasim Sajjad
- Department of Mathematics, COMSATS Institute of Information Technology, Attock, Pakistan
- Mohammad Reza Farahani
- Department of Applied Mathematics, Iran University of Science and Technology, Narmak, 16844 Tehran, Iran
|
20
|
Pakhomov SVS, Finley G, McEwan R, Wang Y, Melton GB. Corpus domain effects on distributional semantic modeling of medical terms. Bioinformatics 2016; 32:3635-3644. [PMID: 27531100 DOI: 10.1093/bioinformatics/btw529] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Received: 03/01/2016] [Revised: 05/03/2016] [Accepted: 08/09/2016] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION Automatically quantifying semantic similarity and relatedness between clinical terms is an important aspect of text mining from electronic health records, which are increasingly recognized as valuable sources of phenotypic information for clinical genomics and bioinformatics research. A key obstacle to the development of semantic relatedness measures is the limited availability of large quantities of clinical text to researchers and developers outside of major medical centers. Text from general English and biomedical literature is freely available; however, its validity as a substitute for the clinical domain to represent the semantics of clinical terms remains to be demonstrated. RESULTS We constructed neural network representations of clinical terms found in a publicly available benchmark dataset manually labeled for semantic similarity and relatedness. Similarity and relatedness measures computed from text corpora in three domains (Clinical Notes, PubMed Central articles and Wikipedia) were compared using the benchmark as reference. We found that measures computed from the full text of biomedical articles in the PubMed Central repository (rho = 0.62 for similarity and 0.58 for relatedness) are on par with measures computed from clinical reports (rho = 0.60 for similarity and 0.57 for relatedness). We also evaluated the use of neural network based relatedness measures for query expansion in a clinical document retrieval task and a biomedical term word sense disambiguation task. We found that, with some limitations, biomedical articles may be used in lieu of clinical reports to represent the semantics of clinical terms and that distributional semantic methods are useful for clinical and biomedical natural language processing applications. AVAILABILITY AND IMPLEMENTATION The software and reference standards used in this study to evaluate semantic similarity and relatedness measures are publicly available as detailed in the article.
CONTACT pakh0002@umn.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Serguei V S Pakhomov
- College of Pharmacy, University of Minnesota, Minneapolis, MN 55455, USA
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, USA
- Greg Finley
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, USA
- Reed McEwan
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, USA
- Yan Wang
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, USA
- Genevieve B Melton
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, USA
|