1
Chen J, Goudey B, Geard N, Verspoor K. Integration of background knowledge for automatic detection of inconsistencies in gene ontology annotation. Bioinformatics 2024; 40:i390-i400. [PMID: 38940182 PMCID: PMC11256942 DOI: 10.1093/bioinformatics/btae246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024]
Abstract
MOTIVATION Biological background knowledge plays an important role in the manual quality assurance (QA) of biological database records. One such QA task is the detection of inconsistencies in literature-based Gene Ontology Annotation (GOA). This manual verification ensures the accuracy of the GO annotations based on a comprehensive review of the literature used as evidence, Gene Ontology (GO) terms, and annotated genes in GOA records. While automatic approaches for the detection of semantic inconsistencies in GOA have been developed, they operate within predetermined contexts, lacking the ability to leverage broader evidence, especially relevant domain-specific background knowledge. This paper investigates various types of background knowledge that could improve the detection of prevalent inconsistencies in GOA. In addition, the paper proposes several approaches to integrate background knowledge into the automatic GOA inconsistency detection process. RESULTS We have extended a previously developed GOA inconsistency dataset with several kinds of GOA-related background knowledge, including GeneRIF statements, biological concepts mentioned within evidence texts, GO hierarchy and existing GO annotations of the specific gene. We have proposed several effective approaches to integrate background knowledge as part of the automatic GOA inconsistency detection process. The proposed approaches can improve automatic detection of self-consistency and several of the most prevalent types of inconsistencies. This is the first study to explore the advantages of utilizing background knowledge and to propose a practical approach to incorporate knowledge in automatic GOA inconsistency detection. We establish a new benchmark for performance on this task. Our methods may be applicable to various tasks that involve incorporating biological background knowledge. AVAILABILITY AND IMPLEMENTATION https://github.com/jiyuc/de-inconsistency.
Affiliation(s)
- Jiyu Chen
  - School of Computing and Information Systems, The University of Melbourne, Parkville 3010, VIC, Australia
  - Data61, The Commonwealth Scientific and Industrial Research Organisation, Marsfield 2122, NSW, Australia
- Benjamin Goudey
  - School of Computing and Information Systems, The University of Melbourne, Parkville 3010, VIC, Australia
- Nicholas Geard
  - School of Computing and Information Systems, The University of Melbourne, Parkville 3010, VIC, Australia
- Karin Verspoor
  - School of Computing Technologies, RMIT University, Melbourne, Victoria 3000, Australia
2
Chen J, Goudey B, Zobel J, Geard N, Verspoor K. Exploring automatic inconsistency detection for literature-based gene ontology annotation. Bioinformatics 2022; 38:i273-i281. [PMID: 35758780 PMCID: PMC9235499 DOI: 10.1093/bioinformatics/btac230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/08/2022] [Indexed: 11/12/2022]
Abstract
Motivation: Literature-based gene ontology annotations (GOA) are biological database records that use controlled vocabulary to uniformly represent gene function information that is described in the primary literature. Assurance of the quality of GOA is crucial for supporting biological research. However, a range of different kinds of inconsistencies between the literature used as evidence and the annotated GO terms can be identified; these have not been systematically studied at the record level. The existing manual-curation approach to GOA consistency assurance is inefficient and is unable to keep pace with the rate of updates to gene function knowledge. Automatic tools are therefore needed to assist with GOA consistency assurance. This article presents an exploration of different GOA inconsistencies and an early feasibility study of automatic inconsistency detection. Results: We have created a reliable synthetic dataset to simulate four realistic types of GOA inconsistency in biological databases. Three automatic approaches are proposed. They provide reasonable performance on the task of distinguishing the four types of inconsistency and are directly applicable to detecting inconsistencies in real-world GOA database records. Major challenges resulting from such inconsistencies in the context of several specific application settings are reported. This is the first study to introduce automatic approaches that are designed to address the challenges in current GOA quality assurance workflows. The data underlying this article are available on GitHub at https://github.com/jiyuc/AutoGOAConsistency.
Affiliation(s)
- Jiyu Chen
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Benjamin Goudey
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Justin Zobel
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Nicholas Geard
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
  - School of Computer Technologies, RMIT University, Melbourne, VIC 3000, Australia
3
Ayllon-Benitez A, Bourqui R, Thébault P, Mougin F. GSAn: an alternative to enrichment analysis for annotating gene sets. NAR Genom Bioinform 2020; 2:lqaa017. [PMID: 33575577 PMCID: PMC7671311 DOI: 10.1093/nargab/lqaa017] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/16/2019] [Revised: 02/12/2020] [Accepted: 02/25/2020] [Indexed: 12/27/2022]
Abstract
The revolution in new sequencing technologies is leading to new understandings of the relations between genotype and phenotype. To interpret and analyze data that are grouped according to a phenotype of interest, methods based on statistical enrichment have become standard in biology. However, these methods synthesize the biological information by a priori selecting the over-represented terms and may suffer from focusing on the most studied genes, which represent a limited coverage of annotated genes within a gene set. Semantic similarity measures have shown great results in pairwise gene comparison by taking advantage of the underlying structure of the Gene Ontology. We developed GSAn, a novel gene set annotation method that uses semantic similarity measures to synthesize a priori Gene Ontology annotation terms. The originality of our approach is to identify the best compromise between the number of retained annotation terms, which has to be drastically reduced, and the number of related genes, which has to be as large as possible. Moreover, GSAn offers interactive visualization facilities dedicated to the multi-scale analysis of gene set annotations. Compared to enrichment analysis tools, GSAn has shown excellent results in terms of maximizing the gene coverage while minimizing the number of terms.
Affiliation(s)
- Aaron Ayllon-Benitez
  - University of Bordeaux, Inserm UMR 1219, Bordeaux Population Health Research Center, team ERIAS, Bordeaux 33000, France
  - University of Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux 33400, France
- Romain Bourqui
  - University of Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux 33400, France
- Patricia Thébault
  - University of Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux 33400, France
- Fleur Mougin
  - University of Bordeaux, Inserm UMR 1219, Bordeaux Population Health Research Center, team ERIAS, Bordeaux 33000, France
4
Wang D, Li J, Liu R, Wang Y. Optimizing gene set annotations combining GO structure and gene expression data. BMC Systems Biology 2018; 12:133. [PMID: 30598093 PMCID: PMC6311910 DOI: 10.1186/s12918-018-0659-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Abstract
Background: With the rapid accumulation of genomic data, annotating and interpreting these data has become a challenge. As a representative approach, gene set enrichment analysis has been widely used to interpret large molecular datasets generated by biological experiments. The result of gene set enrichment analysis relies heavily on the quality and integrity of gene set annotations. Although several methods have been developed to annotate gene sets, there is still a lack of high-quality annotation methods. Here, we propose a novel method to improve annotation accuracy by combining the GO structure and gene expression data. Results: We propose a novel approach for optimizing gene set annotations to get more accurate annotation results. The proposed method filters inconsistent annotations using GO structure information and probabilistic gene set clusters calculated for a range of cluster sizes over multiple bootstrap resampled datasets. The proposed method is employed to analyze p53 cell line, colon cancer and breast cancer gene expression data. The experimental results show that the proposed method can filter a number of annotations unrelated to the experimental data, increase gene set enrichment power and reduce annotation inconsistency. Conclusions: A novel gene set annotation optimization approach is proposed to improve the quality of gene annotations. Experimental results indicate that the proposed method effectively improves gene set annotation quality based on the GO structure and gene expression data.
Affiliation(s)
- Dong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, West Da-Zhi Street, Harbin, China
- Jie Li
  - School of Computer Science and Technology, Harbin Institute of Technology, West Da-Zhi Street, Harbin, China
- Rui Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, West Da-Zhi Street, Harbin, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, West Da-Zhi Street, Harbin, China
5
Ayllón-Benítez A, Mougin F, Allali J, Thiébaut R, Thébault P. A new method for evaluating the impacts of semantic similarity measures on the annotation of gene sets. PLoS One 2018; 13:e0208037. [PMID: 30481204 PMCID: PMC6258551 DOI: 10.1371/journal.pone.0208037] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 07/18/2018] [Accepted: 11/09/2018] [Indexed: 01/01/2023]
Abstract
MOTIVATION The recent revolution in new sequencing technologies, as a part of the continuous process of adopting new innovative protocols has strongly impacted the interpretation of relations between phenotype and genotype. Thus, understanding the resulting gene sets has become a bottleneck that needs to be addressed. Automatic methods have been proposed to facilitate the interpretation of gene sets. While statistical functional enrichment analyses are currently well known, they tend to focus on well-known genes and to ignore new information from less-studied genes. To address such issues, applying semantic similarity measures is logical if the knowledge source used to annotate the gene sets is hierarchically structured. In this work, we propose a new method for analyzing the impact of different semantic similarity measures on gene set annotations. RESULTS We evaluated the impact of each measure by taking into consideration the two following features that correspond to relevant criteria for a "good" synthetic gene set annotation: (i) the number of annotation terms has to be drastically reduced and the representative terms must be retained while annotating the gene set, and (ii) the number of genes described by the selected terms should be as large as possible. Thus, we analyzed nine semantic similarity measures to identify the best possible compromise between both features while maintaining a sufficient level of details. Using Gene Ontology to annotate the gene sets, we obtained better results with node-based measures that use the terms' characteristics than with measures based on edges that link the terms. The annotation of the gene sets achieved with the node-based measures did not exhibit major differences regardless of the characteristics of terms used.
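The contrast between node-based and edge-based measures may be easier to see with a toy example. The sketch below implements Resnik similarity, a well-known node-based measure that scores two terms by the information content (IC) of their most informative common ancestor. The abstract does not name the nine measures that were evaluated, so the choice of Resnik, the miniature ontology in `PARENTS`, and the annotation counts in `COUNTS` are all assumptions made purely for illustration.

```python
import math

# Hypothetical toy ontology: term -> set of direct parent terms.
PARENTS = {
    "root": set(),
    "metabolism": {"root"},
    "signaling": {"root"},
    "lipid_metabolism": {"metabolism"},
    "sugar_metabolism": {"metabolism"},
}

# Hypothetical annotation counts, already propagated to ancestors,
# so a parent's count is at least as large as any child's.
COUNTS = {
    "root": 100,
    "metabolism": 40,
    "signaling": 60,
    "lipid_metabolism": 10,
    "sugar_metabolism": 30,
}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t), where p(t) is the term's annotation frequency."""
    return -math.log(COUNTS[term] / COUNTS["root"])


def resnik(term_a, term_b):
    """IC of the most informative common ancestor of the two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)
```

In this toy setting the two metabolism terms share the ancestor `metabolism` and score `-log(0.4)`, whereas a metabolism term and `signaling` share only the root and score zero. An edge-based measure would instead count the links on the path between the terms and ignore the annotation frequencies.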
Affiliation(s)
- Aarón Ayllón-Benítez
  - Univ. Bordeaux, Inserm UMR 1219, Bordeaux Population Health Research Center, team ERIAS, Bordeaux, France
  - Univ. Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux, France
- Fleur Mougin
  - Univ. Bordeaux, Inserm UMR 1219, Bordeaux Population Health Research Center, team ERIAS, Bordeaux, France
  - Univ. Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux, France
- Julien Allali
  - Univ. Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux, France
- Rodolphe Thiébaut
  - Univ. Bordeaux, Inserm UMR 1219, INRIA SISTM, Bordeaux, France
  - CHU de Bordeaux, Pole de sante publique, Service d’information medicale, Bordeaux, France
  - Vaccine Research Institute, Creteil, France
- Patricia Thébault
  - Univ. Bordeaux, CNRS UMR 5800, LaBRI, Bordeaux, France
6
Mitterboeck TF, Liu S, Adamowicz SJ, Fu J, Zhang R, Song W, Meusemann K, Zhou X. Positive and relaxed selection associated with flight evolution and loss in insect transcriptomes. Gigascience 2018; 6:1-14. [PMID: 29020740 PMCID: PMC5632299 DOI: 10.1093/gigascience/gix073] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 03/07/2017] [Accepted: 08/01/2017] [Indexed: 12/31/2022]
Abstract
The evolution of powered flight is a major innovation that has facilitated the success of insects. Previously, studies of birds, bats, and insects have detected molecular signatures of differing selection regimes in energy-related genes associated with flight evolution and/or loss. Here, using DNA sequences from more than 1000 nuclear and mitochondrial protein-coding genes obtained from insect transcriptomes, we conduct a broader exploration of which gene categories display positive and relaxed selection at the origin of flight as well as with multiple independent losses of flight. We detected a number of categories of nuclear genes more often under positive selection in the lineage leading to the winged insects (Pterygota), related to catabolic processes such as proteases, as well as splicing-related genes. Flight loss was associated with relaxed selection signatures in splicing genes, mirroring the results for flight evolution. Similar to previous studies of flight loss in various animal taxa, we observed consistently higher nonsynonymous-to-synonymous substitution ratios in mitochondrial genes of flightless lineages, indicative of relaxed selection in energy-related genes. While oxidative phosphorylation genes were not detected as being under selection with the origin of flight specifically, they were most often detected as being under positive selection in holometabolous (complete metamorphosis) insects as compared with other insect lineages. This study supports some convergence in gene-specific selection pressures associated with flight ability, and the exploratory analysis provided some new insights into gene categories potentially associated with the gain and loss of flight in insects.
Affiliation(s)
- T Fatima Mitterboeck
  - Department of Integrative Biology, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1 Canada
  - Biodiversity Institute of Ontario, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1 Canada
- Shanlin Liu
  - BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen, Guangdong Province, 518083 China
  - Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5-7, 1350 Copenhagen, Denmark
- Sarah J Adamowicz
  - Department of Integrative Biology, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1 Canada
  - Biodiversity Institute of Ontario, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1 Canada
- Jinzhong Fu
  - Department of Integrative Biology, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1 Canada
- Rui Zhang
  - BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen, Guangdong Province, 518083 China
- Wenhui Song
  - BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen, Guangdong Province, 518083 China
- Karen Meusemann
  - University of Freiburg, Department for Biology I (Zoology), Evolutionary Biology and Ecology, Hauptstr. 1, D-79104 Freiburg, Germany
  - Center for Molecular Biodiversity Research, Zoological Research Museum Alexander Koenig, Adenauerallee 160, 53113 Bonn, Germany
  - Australian National Insect Collection CSIRO, Natl Collections & Marine Infrastructure, Clunies Ross Street, ACTON, 2601 ACT, Canberra, Australia
- Xin Zhou
  - Beijing Advanced Innovation Center for Food Nutrition and Human Health, China Agricultural University, 2 West Yuanmingyuan Rd., Haidian District, Beijing 100193, China
  - College of Plant Protection, China Agricultural University, 2 West Yuanmingyuan Rd., Haidian District, Beijing 100193, China
7
Liang X, Zhu L, Huang DS. Optimization of Gene Set Annotations Using Robust Trace-Norm Multitask Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:1016-1021. [PMID: 28391202 DOI: 10.1109/tcbb.2017.2690427] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/07/2023]
Abstract
Gene set enrichment (GSE) is a useful tool for analyzing and interpreting large molecular datasets generated by modern biomedical science. The accuracy and reproducibility of GSE analysis are heavily affected by the quality and integrity of gene set annotations. In this paper, we propose a novel method, robust trace-norm multitask learning, to solve the optimization problem of gene set annotations. Inspired by the binary nature of annotations, we convert the optimization of gene set annotations into a weakly supervised classification problem and use discriminative logistic regression to fit these datasets. The output of the logistic regression can then be used to measure the probability that an annotation exists. In addition, the optimization of each row of the annotation matrix can be treated as an independent weakly supervised classification task, and we use the multitask learning approach with trace-norm regularization to optimize all rows of the annotation matrix simultaneously. Finally, experiments on simulated and real data demonstrate the effectiveness and good performance of the proposed method.
8
Laodim T, Elzo MA, Koonawootrittriron S, Suwanasopee T, Jattawa D. Identification of SNP markers associated with milk and fat yields in multibreed dairy cattle using two genetic group structures. Livest Sci 2017. [DOI: 10.1016/j.livsci.2017.10.015] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/29/2022]
9
Yu G, Lu C, Wang J. NoGOA: predicting noisy GO annotations using evidences and sparse representation. BMC Bioinformatics 2017; 18:350. [PMID: 28732468 PMCID: PMC5521088 DOI: 10.1186/s12859-017-1764-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 05/02/2017] [Accepted: 07/14/2017] [Indexed: 01/11/2023]
Abstract
BACKGROUND Gene Ontology (GO) is a community effort to represent functional features of gene products. GO annotations (GOA) provide functional associations between GO terms and gene products. Due to resource limitations, only a small portion of annotations are manually checked by curators, and the others are electronically inferred. Although quality control techniques have been applied to ensure the quality of annotations, the community consistently reports that a considerable number of noisy (or incorrect) annotations remain. Given the wide application of annotations, identifying noisy annotations is an important yet seldom studied open problem. RESULTS We introduce a novel approach called NoGOA to predict noisy annotations. First, NoGOA applies sparse representation on the gene-term association matrix to reduce the impact of noisy annotations, and takes advantage of sparse representation coefficients to measure the semantic similarity between genes. Second, it makes a preliminary prediction of the noisy annotations of a gene based on aggregated votes from that gene's semantic neighborhood genes. Next, NoGOA estimates the ratio of noisy annotations for each evidence code based on direct annotations in GOA files archived at different periods, then weights entries of the association matrix via the estimated ratios and propagates the weights to ancestors of direct annotations using the GO hierarchy. Finally, it integrates the evidence-weighted association matrix and the aggregated votes to predict noisy annotations. Experiments on archived GOA files of six model species (H. sapiens, A. thaliana, S. cerevisiae, G. gallus, B. taurus and M. musculus) demonstrate that NoGOA achieves significantly better results than other related methods and that removing noisy annotations improves the performance of gene function prediction. CONCLUSIONS The comparative study justifies the effectiveness of integrating evidence codes with sparse representation for predicting noisy GO annotations. Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=NoGOA.
Affiliation(s)
- Guoxian Yu
  - College of Computer and Information Sciences, Southwest University, Chongqing, China
- Chang Lu
  - College of Computer and Information Sciences, Southwest University, Chongqing, China
- Jun Wang
  - College of Computer and Information Sciences, Southwest University, Chongqing, China
10
Agapito G, Milano M, Guzzi PH, Cannataro M. Extracting Cross-Ontology Weighted Association Rules from Gene Ontology Annotations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:197-208. [PMID: 27045823 DOI: 10.1109/tcbb.2015.2462348] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 06/05/2023]
Abstract
Gene Ontology (GO) is a structured repository of concepts (GO terms) that are associated with one or more gene products through a process referred to as annotation. The analysis of annotated data is an important opportunity for bioinformatics. Among the different approaches to analysis, association rules (AR) provide useful knowledge by discovering biologically relevant, previously unknown associations between GO terms. In a previous work, we introduced GO-WAR (Gene Ontology-based Weighted Association Rules), a methodology for extracting weighted association rules from ontology-based annotated datasets. Here we adapt the GO-WAR algorithm to mine cross-ontology association rules, i.e., rules that involve GO terms present in the three sub-ontologies of GO. We conduct an in-depth performance evaluation of GO-WAR by mining publicly available GO annotated datasets, showing how GO-WAR outperforms current state-of-the-art approaches.
11
Benites F, Sapozhnikova E. Hierarchical interestingness measures for association rules with generalization on both antecedent and consequent sides. Pattern Recognit Lett 2015. [DOI: 10.1016/j.patrec.2015.07.027] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 11/25/2022]
12
Agapito G, Cannataro M, Guzzi PH, Milano M. Using GO-WAR for mining cross-ontology weighted association rules. Computer Methods and Programs in Biomedicine 2015; 120:113-122. [PMID: 25921876 DOI: 10.1016/j.cmpb.2015.03.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/08/2014] [Revised: 03/16/2015] [Accepted: 03/23/2015] [Indexed: 06/04/2023]
Abstract
The Gene Ontology (GO) is a structured repository of concepts (GO terms) that are associated with one or more gene products. The process of association is referred to as annotation. The relevance and the specificity of both GO terms and annotations are evaluated by a measure defined as information content (IC). The analysis of annotated data is thus an important challenge for bioinformatics. Different approaches to analysis exist. Among them, association rules (AR) may provide useful knowledge and have been used in some applications, e.g. improving the quality of annotations. Nevertheless, classical association rule algorithms take into account neither the source of an annotation nor its importance, which leads to the generation of candidate rules with low IC. This paper presents GO-WAR (Gene Ontology-based Weighted Association Rules), a methodology for extracting weighted association rules. GO-WAR can extract association rules with a high level of IC without loss of support and confidence from a dataset of annotated data. A case study applying GO-WAR to publicly available GO annotation datasets demonstrates that our method outperforms current state-of-the-art approaches.
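The three quantities this abstract relies on (support, confidence, and information content) can be sketched with their standard textbook definitions. The abstract does not say how GO-WAR turns IC into rule weights, so the code below does not attempt that step; the gene names, term labels, and the `ANNOTATIONS` table are hypothetical and exist only to make the example runnable.

```python
import math

# Hypothetical annotation data: gene -> set of GO-like term labels.
ANNOTATIONS = {
    "gene1": {"T1", "T2", "T3"},
    "gene2": {"T1", "T2"},
    "gene3": {"T1", "T3"},
    "gene4": {"T1"},
}


def support(terms):
    """Fraction of genes annotated with every term in `terms`."""
    hits = sum(1 for annotated in ANNOTATIONS.values() if terms <= annotated)
    return hits / len(ANNOTATIONS)


def confidence(antecedent, consequent):
    """Support of the whole rule divided by support of its antecedent."""
    return support(antecedent | consequent) / support(antecedent)


def information_content(term):
    """IC(t) = -log p(t): terms annotating fewer genes carry more information."""
    return -math.log(support({term}))
```

With this toy data the rule `T2 -> T3` has support 0.25 and confidence 0.5. The term `T1` annotates every gene, so its IC is zero, which illustrates why a rule built only from very common terms can have good support and confidence while carrying little information.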
Affiliation(s)
- Giuseppe Agapito
  - Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Italy
- Mario Cannataro
  - Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Italy
- Pietro Hiram Guzzi
  - Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Italy
- Marianna Milano
  - Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Italy
13
Callahan A, Cifuentes JJ, Dumontier M. An evidence-based approach to identify aging-related genes in Caenorhabditis elegans. BMC Bioinformatics 2015; 16:40. [PMID: 25888240 PMCID: PMC4339751 DOI: 10.1186/s12859-015-0469-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 05/22/2014] [Accepted: 01/15/2015] [Indexed: 12/21/2022]
Abstract
Background Extensive studies have been carried out on Caenorhabditis elegans as a model organism to elucidate mechanisms of aging and the effects of perturbing known aging-related genes on lifespan and behavior. This research has generated large amounts of experimental data that is increasingly difficult to integrate and analyze with existing databases and domain knowledge. To address this challenge, we demonstrate a scalable and effective approach for automatic evidence gathering and evaluation that leverages existing experimental data and literature-curated facts to identify genes involved in aging and lifespan regulation in C. elegans. Results We developed a semantic knowledge base for aging by integrating data about C. elegans genes from WormBase with data about 2005 human and model organism genes from GenAge and 149 genes from GenDR, and with the Bio2RDF network of linked data for the life sciences. Using HyQue (a Semantic Web tool for hypothesis-based querying and evaluation) to interrogate this knowledge base, we examined 48,231 C. elegans genes for their role in modulating lifespan and aging. HyQue identified 24 novel but well-supported candidate aging-related genes for further experimental validation. Conclusions We use semantic technologies to discover candidate aging genes whose effects on lifespan are not yet well understood. Our customized HyQue system, the aging research knowledge base it operates over, and HyQue evaluations of all C. elegans genes are freely available at http://hyque.semanticscience.org. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0469-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Alison Callahan
  - Stanford Center for Biomedical Informatics Research, School of Medicine, Stanford University, Stanford California, AC, USA
- Juan José Cifuentes
  - Molecular Bioinformatics Laboratory, Millennium Institute on Immunology and Immunotherapy, 49 Santiago, CP, 8330025, Portugal
  - Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
- Michel Dumontier
  - Stanford Center for Biomedical Informatics Research, School of Medicine, Stanford University, Stanford California, AC, USA
14
Veres DV, Gyurkó DM, Thaler B, Szalay KZ, Fazekas D, Korcsmáros T, Csermely P. ComPPI: a cellular compartment-specific database for protein-protein interaction network analysis. Nucleic Acids Res 2014; 43:D485-93. [PMID: 25348397 PMCID: PMC4383876 DOI: 10.1093/nar/gku1007] [Citation(s) in RCA: 88] [Impact Index Per Article: 8.0] [Indexed: 02/06/2023]
Abstract
Here we present ComPPI, a cellular compartment-specific database of proteins and their interactions enabling an extensive, compartmentalized protein–protein interaction network analysis (URL: http://ComPPI.LinkGroup.hu). ComPPI enables the user to filter biologically unlikely interactions, where the two interacting proteins have no common subcellular localizations and to predict novel properties, such as compartment-specific biological functions. ComPPI is an integrated database covering four species (S. cerevisiae, C. elegans, D. melanogaster and H. sapiens). The compilation of nine protein–protein interaction and eight subcellular localization data sets had four curation steps including a manually built, comprehensive hierarchical structure of >1600 subcellular localizations. ComPPI provides confidence scores for protein subcellular localizations and protein–protein interactions. ComPPI has user-friendly search options for individual proteins giving their subcellular localization, their interactions and the likelihood of their interactions considering the subcellular localization of their interacting partners. Download options of search results, whole-proteomes, organelle-specific interactomes and subcellular localization data are available on its website. Due to its novel features, ComPPI is useful for the analysis of experimental results in biochemistry and molecular biology, as well as for proteome-wide studies in bioinformatics and network science helping cellular biology, medicine and drug design.
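The filtering idea described here, discarding interactions whose two partners share no subcellular localization, can be sketched in a few lines. This is only an illustration of the principle and not ComPPI's implementation: it ignores the confidence scores and the localization hierarchy mentioned in the abstract, and the protein identifiers, compartment names, and function name are hypothetical.

```python
# Hypothetical inputs: protein -> set of subcellular localizations,
# and a list of candidate interacting pairs.
LOCALIZATIONS = {
    "P1": {"nucleus", "cytosol"},
    "P2": {"cytosol"},
    "P3": {"mitochondrion"},
}
INTERACTIONS = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]


def filter_by_shared_compartment(interactions, localizations):
    """Keep only pairs whose proteins share at least one localization.

    Proteins with no localization data are treated as having none,
    so their interactions are dropped (an assumption of this sketch).
    """
    kept = []
    for a, b in interactions:
        shared = localizations.get(a, set()) & localizations.get(b, set())
        if shared:
            kept.append((a, b, sorted(shared)))
    return kept
```

Calling `filter_by_shared_compartment(INTERACTIONS, LOCALIZATIONS)` keeps only the `P1`/`P2` pair, which shares the cytosol. Whether unlocalized proteins such as `P4` should be dropped or retained is a design choice that a real pipeline would need to make explicitly.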
Affiliation(s)
- Daniel V Veres
- Department of Medical Chemistry, Semmelweis University, Budapest, Hungary
| | - Dávid M Gyurkó
- Department of Medical Chemistry, Semmelweis University, Budapest, Hungary
| | - Benedek Thaler
- Department of Medical Chemistry, Semmelweis University, Budapest, Hungary
- Faculty of Electrical Engineering and Informatics, Budapest University of Technology and Economics, Budapest, Hungary
| | - Kristóf Z Szalay
- Department of Medical Chemistry, Semmelweis University, Budapest, Hungary
| | - Dávid Fazekas
- Department of Genetics, Eötvös Loránd University, Budapest, Hungary
| | - Tamás Korcsmáros
- Department of Genetics, Eötvös Loránd University, Budapest, Hungary
- TGAC, The Genome Analysis Centre, Norwich, UK
- Gut Health and Food Safety Programme, Institute of Food Research, Norwich, UK
| | - Peter Csermely
- Department of Medical Chemistry, Semmelweis University, Budapest, Hungary
|
15
|
Frost HR, Moore JH. Optimization of gene set annotations via entropy minimization over variable clusters (EMVC). Bioinformatics 2014; 30:1698-706. [PMID: 24574114 PMCID: PMC4058919 DOI: 10.1093/bioinformatics/btu110] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/17/2022] Open
Abstract
Motivation: Gene set enrichment has become a critical tool for interpreting the results of high-throughput genomic experiments. Inconsistent annotation quality and lack of annotation specificity, however, limit the statistical power of enrichment methods and make it difficult to replicate enrichment results across biologically similar datasets. Results: We propose a novel algorithm for optimizing gene set annotations to best match the structure of specific empirical data sources. Our proposed method, entropy minimization over variable clusters (EMVC), filters the annotations for each gene set to minimize a measure of entropy across disjoint gene clusters computed for a range of cluster sizes over multiple bootstrap resampled datasets. As shown using simulated gene sets with simulated data and Molecular Signatures Database collections with microarray gene expression data, the EMVC algorithm accurately filters annotations unrelated to the experimental outcome, resulting in increased gene set enrichment power and better replication of enrichment results. Availability and implementation: http://cran.r-project.org/web/packages/EMVC/index.html. Contact: jason.h.moore@dartmouth.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- H Robert Frost
- Departments of Genetics and Community and Family Medicine, Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, USA
| | - Jason H Moore
- Departments of Genetics and Community and Family Medicine, Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, USA
|
16
|
Abstract
The constantly increasing volume and complexity of available biological data requires new methods for their management and analysis. An important challenge is the integration of information from different sources in order to discover possible hidden relations between already known data. In this paper we introduce a data mining approach which relates biological ontologies by mining cross and intra-ontology pairwise generalized association rules. Its advantage is sensitivity to rare associations, for these are important for biologists. We propose a new class of interestingness measures designed for hierarchically organized rules. These measures allow one to select the most important rules and to take into account rare cases. They favor rules with an actual interestingness value that exceeds the expected value. The latter is calculated taking into account the parent rule. We demonstrate this approach by applying it to the analysis of data from Gene Ontology and GPCR databases. Our objective is to discover interesting relations between two different ontologies or parts of a single ontology. The association rules that are thus discovered can provide the user with new knowledge about underlying biological processes or help improve annotation consistency. The obtained results show that produced rules represent meaningful and quite reliable associations.
|
17
|
Measuring the evolution of ontology complexity: the gene ontology case study. PLoS One 2013; 8:e75993. [PMID: 24146805 PMCID: PMC3795689 DOI: 10.1371/journal.pone.0075993] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 06/27/2013] [Accepted: 08/20/2013] [Indexed: 01/09/2023] Open
Abstract
Ontologies support automatic sharing, combination and analysis of life sciences data. They undergo regular curation and enrichment. We studied the impact of an ontology's evolution on its structural complexity. As a case study we used the sixty monthly releases between January 2008 and December 2012 of the Gene Ontology and its three independent branches, i.e. biological processes (BP), cellular components (CC) and molecular functions (MF). For each case, we measured complexity by computing metrics related to the size, the node connectivity and the hierarchical structure. The number of classes and relations increased monotonically for each branch, with different growth rates. BP and CC had similar connectivity, superior to that of MF. Connectivity increased monotonically for BP, decreased for CC and remained stable for MF, with a marked increase for the three branches in November and December 2012. Hierarchy-related measures showed that CC and MF had similar proportions of leaves, average depths and average heights. BP had a lower proportion of leaves, and a higher average depth and average height. For BP and MF, the late 2012 increase of connectivity resulted in an increase of the average depth and average height and a decrease of the proportion of leaves, indicating that a major enrichment effort of the intermediate-level hierarchy occurred. The variation of the number of classes and relations in an ontology does not provide enough information about the evolution of its complexity. However, connectivity and hierarchy-related metrics revealed different patterns of values as well as of evolution for the three branches of the Gene Ontology. CC was similar to BP in terms of connectivity, and similar to MF in terms of hierarchy. Overall, BP complexity increased, CC was refined with the addition of leaves providing a finer level of annotations but decreasing slightly its complexity, and MF complexity remained stable.
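Two of the hierarchy-related measures named in this abstract, the proportion of leaves and the average depth, can be sketched on a toy directed acyclic graph. The abstract does not give the exact definitions used in the study, so this sketch assumes that a leaf is a class with no children and that depth is the length of the shortest path from a root. The function names and the toy ontology are hypothetical.

```python
from collections import deque
from typing import Dict, List


def leaf_proportion(parents: Dict[str, List[str]]) -> float:
    """Fraction of classes that have no children (assumed leaf definition)."""
    nodes = set(parents)
    has_child = {p for parent_list in parents.values() for p in parent_list}
    nodes |= has_child
    return len(nodes - has_child) / len(nodes)


def average_depth(parents: Dict[str, List[str]]) -> float:
    """Mean shortest-path distance from a root (assumed depth definition)."""
    children: Dict[str, List[str]] = {}
    nodes = set(parents)
    for child, parent_list in parents.items():
        for parent in parent_list:
            nodes.add(parent)
            children.setdefault(parent, []).append(child)
    # Roots are classes without any parent.
    roots = [n for n in nodes if not parents.get(n)]
    depth = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:  # breadth-first search reaches each class by a shortest path first
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return sum(depth.values()) / len(depth)


# Hypothetical toy ontology: each class maps to its direct parents.
toy = {"root": [], "A": ["root"], "B": ["root"], "C": ["A", "B"], "D": ["C"]}

print(leaf_proportion(toy))
print(average_depth(toy))
```

In this toy graph only D is a leaf, and the depths are 0, 1, 1, 2 and 3. Using the longest path instead of the shortest would be an equally plausible definition and would give different values on ontologies with multiple inheritance.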
|
18
|
Manda P, McCarthy F, Bridges SM. Interestingness measures and strategies for mining multi-ontology multi-level association rules from gene ontology annotations for the discovery of new GO relationships. J Biomed Inform 2013; 46:849-56. [PMID: 23850840 DOI: 10.1016/j.jbi.2013.06.012] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 10/18/2012] [Revised: 06/07/2013] [Accepted: 06/26/2013] [Indexed: 02/04/2023]
Abstract
The Gene Ontology (GO), a set of three sub-ontologies, is one of the most popular bio-ontologies used for describing gene product characteristics. GO annotation data containing terms from multiple sub-ontologies and at different levels in the ontologies is an important source of implicit relationships between terms from the three sub-ontologies. Data mining techniques such as association rule mining that are tailored to mine from multiple ontologies at multiple levels of abstraction are required for effective knowledge discovery from GO annotation data. We present a data mining approach, Multi-ontology data mining at All Levels (MOAL) that uses the structure and relationships of the GO to mine multi-ontology multi-level association rules. We introduce two interestingness measures: Multi-ontology Support (MOSupport) and Multi-ontology Confidence (MOConfidence) customized to evaluate multi-ontology multi-level association rules. We also describe a variety of post-processing strategies for pruning uninteresting rules. We use publicly available GO annotation data to demonstrate our methods with respect to two applications (1) the discovery of co-annotation suggestions and (2) the discovery of new cross-ontology relationships.
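For orientation, the classical support and confidence measures that association rule mining builds on can be sketched as follows. This is the standard single-level formulation, not the MOSupport or MOConfidence measures introduced in the paper, whose definitions are not given in the abstract. Each gene is treated as a set of annotation terms; the function names and term labels are hypothetical placeholders rather than real GO identifiers.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of genes annotated with every term in the itemset."""
    items = set(itemset)
    return sum(1 for terms in transactions if items <= terms) / len(transactions)


def confidence(
    transactions: List[Set[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Estimated P(consequent | antecedent) over the annotated genes."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0  # rule never fires; report zero rather than divide by zero
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / antecedent_support


# Hypothetical annotations: one set of term labels per gene.
annotations = [
    {"MF:kinase", "BP:phosphorylation"},
    {"MF:kinase", "BP:phosphorylation", "CC:membrane"},
    {"MF:kinase"},
    {"CC:membrane"},
]

rule_support = support(annotations, {"MF:kinase", "BP:phosphorylation"})
rule_confidence = confidence(annotations, {"MF:kinase"}, {"BP:phosphorylation"})
print(rule_support, rule_confidence)
```

A multi-level variant would additionally have to account for the ontology hierarchy, for example by propagating annotations to ancestor terms before counting; that step is deliberately left out of this sketch.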
Affiliation(s)
- Prashanti Manda
- Department of Computer Science and Engineering, Mississippi State University, MS, USA.
|
19
|
Schnoes AM, Ream DC, Thorman AW, Babbitt PC, Friedberg I. Biases in the experimental annotations of protein function and their effect on our understanding of protein function space. PLoS Comput Biol 2013; 9:e1003063. [PMID: 23737737 PMCID: PMC3667760 DOI: 10.1371/journal.pcbi.1003063] [Citation(s) in RCA: 84] [Impact Index Per Article: 7.0] [Received: 01/07/2013] [Accepted: 04/02/2013] [Indexed: 11/19/2022] Open
Abstract
The ongoing functional annotation of proteins relies upon the work of curators to capture experimental findings from scientific literature and apply them to protein sequence and structure data. However, with the increasing use of high-throughput experimental assays, a small number of experimental studies dominate the functional protein annotations collected in databases. Here, we investigate just how prevalent the “few articles - many proteins” phenomenon is. We examine the experimentally validated annotation of proteins provided by several groups in the GO Consortium, and show that the distribution of proteins per published study is exponential, with 0.14% of articles providing the source of annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the dominant articles describes the use of an assay that can find only one function or a small group of functions, this leads to substantial biases in what we know about the function of many proteins. Mass-spectrometry, microscopy and RNAi experiments dominate high throughput experiments. Consequently, the functional information derived from these experiments is mostly of the subcellular location of proteins, and of the participation of proteins in embryonic developmental pathways. For some organisms, the information provided by different studies overlaps by a large amount. We also show that the information provided by high throughput experiments is less specific than that provided by low throughput experiments. Given the experimental techniques available, certain biases in protein function annotation due to high-throughput experiments are unavoidable. Knowing that these biases exist and understanding their characteristics and extent is important for database curators, developers of function annotation programs, and anyone who uses protein function annotation data to plan experiments.
Experiments and observations are the vehicles used by science to understand the world around us. In the field of molecular biology, we are increasingly relying on high-throughput, genome-wide experiments to provide answers about the function of biological macromolecules. However, any experimental assay is essentially limited in the type of information it can discover. Here, we show that our increasing reliance on high-throughput experiments biases our understanding of protein function. While the primary source of information is experiments, the functions of many proteins are computationally annotated by sequence-based similarity, either directly or indirectly, to proteins whose function is experimentally determined. Therefore, any biases in experimental annotations can get amplified and entrenched in the majority of protein databases. We show here that high-throughput studies are biased towards certain aspects of protein function, and that they provide less information than low-throughput studies. While there is no clear solution to the phenomenon of bias from high-throughput experiments, recognizing its existence and its impact can help take steps to mitigate its effect.
Affiliation(s)
- Alexandra M. Schnoes
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, California, United States of America
| | - David C. Ream
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
| | - Alexander W. Thorman
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
| | - Patricia C. Babbitt
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, California, United States of America
| | - Iddo Friedberg
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
- Department of Computer Science and Software Engineering, Miami University, Oxford, Ohio, United States of America
|
20
|
Abstract
EcoGene (http://ecogene.org) is a database and website devoted to continuously improving the structural and functional annotation of Escherichia coli K-12, one of the most well understood model organisms, represented by the MG1655(Seq) genome sequence and annotations. Major improvements to EcoGene in the past decade include (i) graphic presentations of genome map features; (ii) ability to design Boolean queries and Venn diagrams from EcoArray, EcoTopics or user-provided GeneSets; (iii) the genome-wide clone and deletion primer design tool, PrimerPairs; (iv) sequence searches using a customized EcoBLAST; (v) a Cross Reference table of synonymous gene and protein identifiers; (vi) proteome-wide indexing with GO terms; (vii) EcoTools access to >2000 complete bacterial genomes in EcoGene-RefSeq; (viii) establishment of a MySQL relational database; and (ix) use of web content management systems. The biomedical literature is surveyed daily to provide citation and gene function updates. As of September 2012, the review of 37 397 abstracts and articles led to creation of 98 425 PubMed-Gene links and 5415 PubMed-Topic links. Annotation updates to GenBank U00096 are transmitted from EcoGene to NCBI. Experimental verifications include confirmation of a CTG start codon, pseudogene restoration and quality assurance of the Keio strain collection.
Affiliation(s)
- Jindan Zhou
- Department of Biochemistry and Molecular Biology, The Miller School of Medicine, University of Miami, Miami, FL 33143, USA
|