1
Romero M, Nakano FK, Finke J, Rocha C, Vens C. Leveraging class hierarchy for detecting missing annotations on hierarchical multi-label classification. Comput Biol Med 2023; 152:106423. [PMID: 36529023 DOI: 10.1016/j.compbiomed.2022.106423] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 11/09/2022] [Accepted: 12/11/2022] [Indexed: 12/15/2022]
Abstract
With the development of new sequencing technologies, the availability of genomic data has grown exponentially. Over the past decade, numerous studies have used genomic data to identify associations between genes and biological functions. While these studies have shown success in annotating genes with functions, they often assume that genes are completely annotated and fail to take into account that datasets are sparse and noisy. This work proposes a method to detect missing annotations in the context of hierarchical multi-label classification. More precisely, our method exploits the relations between functions, represented as a hierarchy, by computing probabilities based on the paths of functions in the hierarchy. By performing several experiments on a variety of rice (Oryza sativa Japonica), we show that the proposed method accurately detects missing annotations and yields superior results when compared to state-of-the-art methods from the literature.
Affiliation(s)
- Miguel Romero
  - Department of Electronics and Computer Science, Pontificia Universidad Javeriana, Calle 18 N 118-250, Cali, 760031, Colombia.
- Felipe Kenji Nakano
  - Department of Public Health and Primary Care, KU Leuven Campus KULAK, Etienne Sabbelaan 53, Kortrijk, 8500, Belgium; Itec, imec research group at KU Leuven, Etienne Sabbelaan 53, Kortrijk, 8500, Belgium.
- Jorge Finke
  - Department of Electronics and Computer Science, Pontificia Universidad Javeriana, Calle 18 N 118-250, Cali, 760031, Colombia.
- Camilo Rocha
  - Department of Electronics and Computer Science, Pontificia Universidad Javeriana, Calle 18 N 118-250, Cali, 760031, Colombia.
- Celine Vens
  - Department of Public Health and Primary Care, KU Leuven Campus KULAK, Etienne Sabbelaan 53, Kortrijk, 8500, Belgium; Itec, imec research group at KU Leuven, Etienne Sabbelaan 53, Kortrijk, 8500, Belgium.
2
Yunes JM, Babbitt PC. Effusion: prediction of protein function from sequence similarity networks. Bioinformatics 2019; 35:442-451. [PMID: 30084920 PMCID: PMC6361244 DOI: 10.1093/bioinformatics/bty672] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 01/25/2018] [Revised: 07/24/2018] [Accepted: 07/30/2018] [Indexed: 12/26/2022]
Abstract
Motivation: Critical evaluation of methods for protein function prediction shows that data integration improves the performance of methods that predict protein function, but a basic BLAST-based method is still a top contender. We sought to engineer a method that modernizes the classical approach while avoiding pitfalls common to state-of-the-art methods. Results: We present a method for predicting protein function, Effusion, which uses a sequence similarity network to add context for homology transfer, a probabilistic model to account for the uncertainty in labels and function propagation, and the structure of the Gene Ontology (GO) to best utilize sparse input labels and make consistent output predictions. Effusion's model makes it practical to integrate rare experimental data and abundant primary sequence and sequence similarity. We demonstrate Effusion's performance using a critical evaluation method and provide an in-depth analysis. We also dissect the design decisions we used to address challenges for predicting protein function. Finally, we propose directions in which the framework of the method can be modified for additional predictive power. Availability and implementation: The source code for an implementation of Effusion is freely available at https://github.com/babbittlab/effusion. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jeffrey M Yunes
  - UC Berkeley - UCSF Graduate Program in Bioengineering, University of California, San Francisco, CA, USA
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA, USA
- Patricia C Babbitt
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA, USA
  - Department of Pharmaceutical Chemistry, University of California, San Francisco, CA, USA
  - Quantitative Biosciences Institute, University of California, San Francisco, CA, USA
3
Transitive closure of subsumption and causal relations in a large ontology of radiological diagnosis. J Biomed Inform 2016; 61:27-33. [PMID: 27005590 DOI: 10.1016/j.jbi.2016.03.015] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/23/2015] [Revised: 03/12/2016] [Accepted: 03/18/2016] [Indexed: 01/12/2023]
Abstract
The Radiology Gamuts Ontology (RGO), an ontology of diseases, interventions, and imaging findings, was developed to aid in decision support, education, and translational research in diagnostic radiology. The ontology defines a subsumption (is_a) relation between more general and more specific terms, and a causal relation (may_cause) to express the relationship between disorders and their possible imaging manifestations. RGO incorporated 19,745 terms with their synonyms and abbreviations, 1768 subsumption relations, and 55,558 causal relations. Transitive closure was computed iteratively; it yielded 2154 relations over subsumption and 1,594,896 relations over causality. Five causal cycles were discovered, all with path length of no more than 5. The graph-theoretic metrics of in-degree and out-degree were explored; the most useful metric to prioritize modification of the ontology was found to be the product of the in-degree of transitive closure over subsumption and the out-degree of transitive closure over causality. Two general types of error were identified: (1) causal assertions that used overly general terms because they implicitly assumed an organ-specific context and (2) subsumption relations where a site-specific disorder was asserted to be a subclass of the general disorder. Transitive closure helped identify incorrect assertions, prioritized and guided ontology revision, and aided resources that applied the ontology's knowledge.
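The iterative transitive closure and cycle detection described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the toy edge sets are assumptions made for illustration, and the fixed-point loop is only one simple way to compute a closure.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Return every (a, b) pair connected by a path of one or more edges.

    The closure is grown by repeatedly composing known pairs until an
    iteration adds nothing new (a fixed point).
    """
    closure = set(edges)
    while True:
        successors = defaultdict(set)
        for a, b in closure:
            successors[a].add(b)
        # Compose (a, b) with (b, c) to obtain (a, c).
        new_pairs = {(a, c) for a, b in closure for c in successors[b]} - closure
        if not new_pairs:
            return closure
        closure |= new_pairs


def nodes_on_cycles(closure):
    """A term lies on a cycle exactly when the closure relates it to itself."""
    return {a for a, b in closure if a == b}


# Hypothetical toy relations, not taken from RGO.
is_a = {("x", "y"), ("y", "z")}
may_cause = {("a", "b"), ("b", "a")}

print(sorted(transitive_closure(is_a)))
print(sorted(nodes_on_cycles(transitive_closure(may_cause))))
```

In this sketch a cycle shows up as a self-pair in the closure, which is why cycle detection comes almost for free once the closure has been computed.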
4
Wang H, Huang H, Ding C. Correlated Protein Function Prediction via Maximization of Data-Knowledge Consistency. J Comput Biol 2015; 22:546-62. [PMID: 25922963 DOI: 10.1089/cmb.2014.0172] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 01/05/2023]
Abstract
Conventional computational approaches for protein function prediction fundamentally predict one function at a time. As a result, the protein functions are treated as separate target classes. However, biological processes are highly correlated in reality, so the multiple functions assigned to a protein are not independent. Therefore, it would be beneficial to make use of function category correlations when predicting protein functions. In this article, we propose a novel Maximization of Data-Knowledge Consistency (MDKC) approach to exploit function category correlations for protein function prediction. Our approach banks on the assumption that two proteins are likely to have a large overlap in their annotated functions if they are highly similar according to certain experimental data. We first establish a new pairwise protein similarity using protein annotations from a knowledge perspective. Then, by maximizing the consistency between the established knowledge similarity upon annotations and the data similarity upon biological experiments, putative functions are assigned to unannotated proteins. Most importantly, function category correlations are gracefully incorporated into our learning objective through the knowledge similarity. Comprehensive experimental evaluations on the Saccharomyces cerevisiae species have demonstrated promising results that validate the performance of our methods.
Affiliation(s)
- Hua Wang
  - Department of Electrical Engineering and Computer Science, Colorado School of Mines, Golden, Colorado
- Heng Huang
  - Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas
- Chris Ding
  - Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas
5
Yu G, Zhu H, Domeniconi C. Predicting protein functions using incomplete hierarchical labels. BMC Bioinformatics 2015; 16:1. [PMID: 25591917 PMCID: PMC4384381 DOI: 10.1186/s12859-014-0430-y] [Citation(s) in RCA: 83] [Impact Index Per Article: 9.2] [Received: 07/24/2014] [Accepted: 12/11/2014] [Indexed: 02/07/2023]
Abstract
BACKGROUND: Protein function prediction aims to assign biological or biochemical functions to proteins, and it is a challenging computational problem characterized by several factors: (1) the number of function labels (annotations) is large; (2) a protein may be associated with multiple labels; (3) the function labels are structured in a hierarchy; and (4) the labels are incomplete. Current predictive models often assume that the labels of the labeled proteins are complete, i.e. no label is missing. But in real scenarios, we may be aware of only some hierarchical labels of a protein, and we may not know whether additional ones are actually present. The scenario of incomplete hierarchical labels, a challenging and practical problem, is seldom studied in protein function prediction. RESULTS: In this paper, we propose an algorithm to Predict protein functions using Incomplete hierarchical LabeLs (PILL in short). PILL takes into account the hierarchical and the flat taxonomy similarity between function labels, and defines a Combined Similarity (ComSim) to measure the correlation between labels. PILL estimates the missing labels for a protein based on ComSim and the known labels of the protein, and uses a regularization to exploit the interactions between proteins for function prediction. PILL is shown to outperform other related techniques in replenishing the missing labels and in predicting the functions of completely unlabeled proteins on publicly available PPI datasets annotated with MIPS Functional Catalogue and Gene Ontology labels. CONCLUSION: The empirical study shows that it is important to consider incomplete annotation in protein function prediction. The proposed method (PILL) can serve as a valuable tool for protein function prediction using incomplete labels. The Matlab code of PILL is available upon request.
Affiliation(s)
- Guoxian Yu
  - Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China.
  - College of Computer and Information Sciences, Southwest University, Chongqing, China.
- Hailong Zhu
  - Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China.
6
Valentini G. Hierarchical ensemble methods for protein function prediction. ISRN Bioinformatics 2014; 2014:901419. [PMID: 25937954 PMCID: PMC4393075 DOI: 10.1155/2014/901419] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 02/02/2014] [Accepted: 02/25/2014] [Indexed: 12/11/2022]
Abstract
Protein function prediction is a complex multiclass multilabel classification problem, characterized by multiple issues such as the incompleteness of the available annotations, the integration of multiple sources of high dimensional biomolecular data, the unbalance of several functional classes, and the difficulty of univocally determining negative examples. Moreover, the hierarchical relationships between functional classes that characterize both the Gene Ontology and FunCat taxonomies motivate the development of hierarchy-aware prediction methods that showed significantly better performances than hierarchical-unaware "flat" prediction methods. In this paper, we provide a comprehensive review of hierarchical methods for protein function prediction based on ensembles of learning machines. According to this general approach, a separate learning machine is trained to learn a specific functional term and then the resulting predictions are assembled in a "consensus" ensemble decision, taking into account the hierarchical relationships between classes. The main hierarchical ensemble methods proposed in the literature are discussed in the context of existing computational methods for protein function prediction, highlighting their characteristics, advantages, and limitations. Open problems of this exciting research area of computational biology are finally considered, outlining novel perspectives for future research.
Affiliation(s)
- Giorgio Valentini
  - AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
7
Yu D, Kim M, Xiao G, Hwang TH. Review of biological network data and its applications. Genomics Inform 2013; 11:200-10. [PMID: 24465231 PMCID: PMC3897847 DOI: 10.5808/gi.2013.11.4.200] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Received: 10/15/2013] [Revised: 11/20/2013] [Accepted: 11/21/2013] [Indexed: 12/16/2022]
Abstract
Studying biological networks, such as protein-protein interactions, is key to understanding complex biological activities. Various types of large-scale biological datasets have been collected and analyzed with high-throughput technologies, including DNA microarray, next-generation sequencing, and the two-hybrid screening system, for this purpose. In this review, we focus on network-based approaches that help in understanding biological systems and identifying biological functions. Accordingly, this paper covers two major topics in network biology: reconstruction of gene regulatory networks and network-based applications, including protein function prediction, disease gene prioritization, and network-based genome-wide association study.
Affiliation(s)
- Donghyeon Yu
  - Department of Clinical Sciences, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
- Minsoo Kim
  - Department of Clinical Sciences, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
- Guanghua Xiao
  - Department of Clinical Sciences, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
- Tae Hyun Hwang
  - Department of Clinical Sciences, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
8
Stojanova D, Ceci M, Malerba D, Dzeroski S. Using PPI network autocorrelation in hierarchical multi-label classification trees for gene function prediction. BMC Bioinformatics 2013; 14:285. [PMID: 24070402 PMCID: PMC3850549 DOI: 10.1186/1471-2105-14-285] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 08/09/2012] [Accepted: 09/18/2013] [Indexed: 12/23/2022]
Abstract
BACKGROUND: Ontologies and catalogs of gene functions, such as the Gene Ontology (GO) and MIPS-FUN, assume that functional classes are organized hierarchically, that is, general functions include more specific ones. This has recently motivated the development of several machine learning algorithms for gene function prediction that leverage this hierarchical organization, where instances may belong to multiple classes. In addition, it is possible to exploit relationships among examples, since it is plausible that related genes tend to share functional annotations. Although these relationships have been identified and extensively studied in the area of protein-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Relations between genes introduce autocorrelation in functional annotations and violate the assumption that instances are independently and identically distributed (i.i.d.), which underlies most machine learning algorithms. Although the explicit consideration of these relations brings additional complexity to the learning process, we expect substantial benefits in predictive accuracy of learned classifiers. RESULTS: This article demonstrates the benefits (in terms of predictive accuracy) of considering autocorrelation in multi-class gene function prediction. We develop a tree-based algorithm for considering network autocorrelation in the setting of Hierarchical Multi-label Classification (HMC). We empirically evaluate the proposed algorithm, called NHMC (Network Hierarchical Multi-label Classification), on 12 yeast datasets using each of the MIPS-FUN and GO annotation schemes and exploiting 2 different PPI networks. The results clearly show that taking autocorrelation into account improves the predictive performance of the learned models for predicting gene function.
CONCLUSIONS: Our newly developed method for HMC takes into account network information in the learning phase. When used for gene function prediction in the context of PPI networks, the explicit consideration of network autocorrelation increases the predictive performance of the learned models. Overall, we found that this holds for different gene features/descriptions, functional annotation schemes, and PPI networks. Best results are achieved when the PPI network is dense and contains a large proportion of function-relevant interactions.
Affiliation(s)
- Daniela Stojanova
  - Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia.
9
Hu P, Jiang H, Emili A. Incorporating Correlations among Gene Ontology Terms into Predicting Protein Functions. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
Abstract
The authors describe a new strategy that has better prediction performance than previous methods, which gives additional insights about the importance of the dependence between functional terms when inferring protein function.
Affiliation(s)
- Pingzhao Hu
  - York University, Canada & University of Toronto, Canada
10
Kourmpetis YAI, van Dijk ADJ, ter Braak CJF. Gene Ontology consistent protein function prediction: the FALCON algorithm applied to six eukaryotic genomes. Algorithms Mol Biol 2013; 8:10. [PMID: 23531338 PMCID: PMC3691668 DOI: 10.1186/1748-7188-8-10] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 08/10/2011] [Accepted: 03/04/2013] [Indexed: 11/10/2022]
Abstract
Gene Ontology (GO) is a hierarchical vocabulary for the description of biological functions and locations, often employed by computational methods for protein function prediction. Due to the structure of GO, function predictions can be self-contradictory. For example, a protein may be predicted to belong to a detailed functional class, but not to a broader class that, due to the vocabulary structure, includes the predicted one. We present a novel discrete optimization algorithm called Functional Annotation with Labeling CONsistency (FALCON) that resolves such contradictions. The GO is modeled as a discrete Bayesian Network. For any given input of GO term membership probabilities, the algorithm returns the most probable GO term assignments that are in accordance with the Gene Ontology structure. The optimization is done using the Differential Evolution algorithm. Performance is evaluated on simulated and also real data from Arabidopsis thaliana, showing improvement compared to related approaches. We finally applied the FALCON algorithm to obtain genome-wide function predictions for six eukaryotic species based on data provided by the CAFA (Critical Assessment of Function Annotation) project.
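The kind of contradiction this abstract describes (a specific term assigned without one of its broader terms) can be detected with a short Python sketch. This is not the FALCON algorithm, which the abstract says relies on a Bayesian network and Differential Evolution; it only checks an assignment for inconsistencies. The function name, the `parents` mapping, and the toy term names are hypothetical.

```python
def find_contradictions(assigned, parents):
    """Return (term, ancestor) pairs where `term` is assigned but `ancestor` is not.

    `assigned` is a set of term names; `parents` maps each term to its
    direct parents in the hierarchy (terms without parents may be absent).
    """
    contradictions = []
    for term in sorted(assigned):
        stack = list(parents.get(term, ()))
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            if ancestor not in assigned:
                contradictions.append((term, ancestor))
            stack.extend(parents.get(ancestor, ()))
    return contradictions


# Hypothetical three-level hierarchy: specific -> broad -> root.
parents = {"specific": ["broad"], "broad": ["root"]}

# "specific" is predicted without "broad", which the hierarchy forbids.
print(find_contradictions({"specific", "root"}, parents))
```

An empty result means the assignment already respects the hierarchy; a resolution method such as the one in the abstract would be needed only when the list is non-empty.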
Affiliation(s)
- Yiannis AI Kourmpetis
  - Biometris, Wageningen University and Research Centre, 6700AC Wageningen, The Netherlands
  - Current address: Functional Genomics, Nestlé Institute of Health Sciences, Campus EPFL, Quartier de l’Innovation, 1015 Lausanne, Switzerland
- Aalt DJ van Dijk
  - Biometris, Wageningen University and Research Centre, 6700AC Wageningen, The Netherlands
  - Applied Bioinformatics, Plant Research International, Wageningen University and Research Centre, 6700AC Wageningen, The Netherlands
- Cajo JF ter Braak
  - Biometris, Wageningen University and Research Centre, 6700AC Wageningen, The Netherlands
11
A Latent Eigenprobit Model with Link Uncertainty for Prediction of Protein–Protein Interactions. Statistics in Biosciences 2012. [DOI: 10.1007/s12561-011-9049-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
12
Re M, Valentini G. Ensemble Methods. Advances in Machine Learning and Data Mining for Astronomy 2012. [DOI: 10.1201/b11822-34] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Indexed: 12/24/2022]
13
Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference. Mach Learn 2011. [DOI: 10.1007/s10994-011-5271-6] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Indexed: 10/14/2022]
14
Mazandu GK, Mulder NJ. Using the underlying biological organization of the Mycobacterium tuberculosis functional network for protein function prediction. Infection, Genetics and Evolution 2011; 12:922-32. [PMID: 22085822 DOI: 10.1016/j.meegid.2011.10.027] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 07/06/2011] [Revised: 10/25/2011] [Accepted: 10/28/2011] [Indexed: 10/15/2022]
Abstract
Despite ever-increasing amounts of sequence and functional genomics data, there is still a deficiency of functional annotation for many newly sequenced proteins. For Mycobacterium tuberculosis (MTB), more than half of its genome is still uncharacterized, which hampers the search for new drug targets within the bacterial pathogen and limits our understanding of its pathogenicity. As for many other genomes, the annotations of proteins in the MTB proteome were generally inferred from sequence homology, which is effective but has limited applicability. We have carried out large-scale biological data integration to produce an MTB protein functional interaction network. Protein functional relationships were extracted from the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database, and additional functional interactions from microarray, sequence and protein signature data. The confidence level of protein relationships in the additional functional interaction data was evaluated using a dynamic data-driven scoring system. This functional network has been used to predict functions of uncharacterized proteins using Gene Ontology (GO) terms, with the semantic similarity between these terms measured using a state-of-the-art GO similarity metric. To achieve a better trade-off between improvement of quality, genomic coverage and scalability, this prediction is done by observing the key principles driving the biological organization of the functional network. This study yields a new functionally characterized MTB strain CDC1551 proteome, consisting of 3804 and 3698 proteins out of 4195 with annotations in terms of the biological process and molecular function ontologies, respectively. These data can contribute to research into the development of effective anti-tubercular drugs with novel biological mechanisms of action.
Affiliation(s)
- Gaston K Mazandu
  - Computational Biology Group, Department of Clinical Laboratory Sciences, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Medical School, 7925 Observatory, Cape Town, South Africa
15
Valentini G. True path rule hierarchical ensembles for genome-wide gene function prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 8:832-847. [PMID: 20479498 DOI: 10.1109/tcbb.2010.38] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.5] [Indexed: 05/29/2023]
Abstract
Gene function prediction is a complex computational problem, characterized by several items: the number of functional classes is large, and a gene may belong to multiple classes; functional classes are structured according to a hierarchy; classes are usually unbalanced, with more negative than positive examples; class labels can be uncertain and the annotations largely incomplete; to improve the predictions, multiple sources of data need to be properly integrated. In this contribution, we focus on the first three items, and, in particular, on the development of a new method for hierarchical genome-wide and ontology-wide gene function prediction. The proposed algorithm is inspired by the “true path rule” (TPR) that governs both the Gene Ontology and FunCat taxonomies. According to this rule, the proposed TPR ensemble method is characterized by a two-way asymmetric flow of information that traverses the graph-structured ensemble: positive predictions for a node recursively influence its ancestors, while negative predictions influence its descendants. Cross-validated results with the model organism S. cerevisiae, using seven different sources of biomolecular data, and a theoretical analysis of the TPR algorithm show the effectiveness and the drawbacks of the proposed approach.
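The two-way flow described in this abstract can be sketched in Python under simplifying assumptions. The sketch assumes a tree-shaped hierarchy, a fixed decision threshold of 0.5, and a simple averaging rule in which each node's score is combined with the consensus scores of its positively predicted children; setting descendants of a negative node to 0.0 is also an illustrative choice. The function name and the toy data are hypothetical, and the published method may differ in its details.

```python
def tpr_consensus(scores, children, root, threshold=0.5):
    """Combine per-term classifier scores along a tree-shaped hierarchy.

    Bottom-up: positive children raise (or lower) the score of their parent.
    Top-down: a negative node forces all of its descendants to be negative.
    """
    consensus = {}

    def bottom_up(node):
        positive_children = []
        for child in children.get(node, ()):
            bottom_up(child)
            if consensus[child] > threshold:
                positive_children.append(consensus[child])
        # Average the node's own score with its positive children's scores.
        consensus[node] = (scores[node] + sum(positive_children)) / (1 + len(positive_children))

    def top_down(node, ancestor_negative):
        if ancestor_negative:
            consensus[node] = 0.0  # illustrative choice for a forced negative
        negative = consensus[node] <= threshold
        for child in children.get(node, ()):
            top_down(child, negative)

    bottom_up(root)
    top_down(root, False)
    return consensus


# Hypothetical hierarchy and base-classifier scores.
children = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"]}
scores = {"root": 0.9, "a": 0.4, "a1": 0.8, "b": 0.2, "b1": 0.3}
print(tpr_consensus(scores, children, "root"))
```

In this toy run the confident child `a1` lifts its parent `a` above the threshold, while the negative node `b` forces its child `b1` to be negative, which mirrors the asymmetry the abstract describes.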
Affiliation(s)
- Giorgio Valentini
  - Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Via Comelico 39, Milano, Italy.
16
Jiang X, Gold D, Kolaczyk ED. Network-based auto-probit modeling for protein function prediction. Biometrics 2010; 67:958-66. [PMID: 21133881 DOI: 10.1111/j.1541-0420.2010.01519.x] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/26/2022]
Abstract
Predicting the functional roles of proteins based on various genome-wide data, such as protein-protein association networks, has become a canonical problem in computational biology. Approaching this task as a binary classification problem, we develop a network-based extension of the spatial auto-probit model. In particular, we develop a hierarchical Bayesian probit-based framework for modeling binary network-indexed processes, with a latent multivariate conditional autoregressive Gaussian process. The latter allows for the easy incorporation of protein-protein association network topologies, either binary or weighted, in modeling protein functional similarity. We use this framework to predict protein functions, for functions defined as terms in the Gene Ontology (GO) database, a popular rigorous vocabulary for biological functionality. Furthermore, we show how a natural extension of this framework can be used to model and correct for the high percentage of false negative labels in training data derived from GO, a serious shortcoming endemic to biological databases of this type. Our method performance is evaluated and compared with standard algorithms on weighted yeast protein-protein association networks, extracted from a recently developed integrative database called Search Tool for the Retrieval of INteracting Genes/proteins (STRING). Results show that our basic method is competitive with these other methods, and that the extended method, which incorporates the uncertainty in negative labels among the training data, can yield nontrivial improvements in predictive accuracy.
Affiliation(s)
- Xiaoyu Jiang
  - Boehringer Ingelheim Pharmaceuticals, Inc., 900 Ridgebury Road, Ridgefield, Connecticut 06877, USA
17
Roy Choudhury D, Small C, Wang Y, Mueller PR, Rebel VI, Griswold MD, McCarrey JR. Microarray-based analysis of cell-cycle gene expression during spermatogenesis in the mouse. Biol Reprod 2010; 83:663-75. [PMID: 20631398 DOI: 10.1095/biolreprod.110.084889] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Indexed: 12/28/2022]
Abstract
Mammalian spermatogenesis is a continuum of cellular differentiation in a lineage that features three principal stages: 1) a mitotically active stage in spermatogonia, 2) a meiotic stage in spermatocytes, and 3) a postreplicative stage in spermatids. We used a microarray-based approach to identify changes in expression of cell-cycle genes that distinguish 1) mitotic type A spermatogonia from meiotic pachytene spermatocytes and 2) pachytene spermatocytes from postreplicative round spermatids. We detected expression of 550 genes related to cell-cycle function in one or more of these cell types. Although a majority of these genes were expressed during all three stages of spermatogenesis, we observed dramatic changes in levels of individual transcripts between mitotic spermatogonia and meiotic spermatocytes and between meiotic spermatocytes and postreplicative spermatids. Our results suggest that distinct cell-cycle gene regulatory networks or subnetworks are associated with each phase of the cell cycle in each spermatogenic cell type. In addition, we observed expression of different members of certain cell-cycle gene families in each of the three spermatogenic cell types investigated. Finally, we report expression of 221 cell-cycle genes that have not previously been annotated as part of the cell cycle network expressed during spermatogenesis, including eight novel genes that appear to be testis-specific.
18
A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature. PLoS Comput Biol 2010; 6:e1000837. [PMID: 20617200 PMCID: PMC2895635 DOI: 10.1371/journal.pcbi.1000837] [Citation(s) in RCA: 77] [Impact Index Per Article: 5.5] [Received: 01/15/2010] [Accepted: 05/27/2010] [Indexed: 02/07/2023]
Abstract
The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein-protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed - convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. 
Nevertheless, our study shows that three kernels are clearly superior to the other methods.
|
19
|
Xiong B, Wu J, Burk DL, Xue M, Jiang H, Shen J. BSSF: a fingerprint based ultrafast binding site similarity search and function analysis server. BMC Bioinformatics 2010; 11:47. [PMID: 20100327 PMCID: PMC3098077 DOI: 10.1186/1471-2105-11-47] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Received: 07/16/2009] [Accepted: 01/25/2010] [Indexed: 11/17/2022] Open
Abstract
Background: Genome sequencing and post-genomics projects such as structural genomics are extending the frontier of the study of sequence-structure-function relationship of genes and their products. Although many sequence/structure-based methods have been devised with the aim of deciphering this delicate relationship, there still remain large gaps in this fundamental problem, which continuously drives researchers to develop novel methods to extract relevant information from sequences and structures and to infer the functions of newly identified genes by genomics technology.
Results: Here we present an ultrafast method, named BSSF (Binding Site Similarity & Function), which enables researchers to conduct similarity searches in a comprehensive three-dimensional binding site database extracted from PDB structures. This method utilizes a fingerprint representation of the binding site and a validated statistical Z-score function scheme to judge the similarity between the query and database items, even if their similarities are only constrained in a sub-pocket. This fingerprint based similarity measurement was also validated on a known binding site dataset by comparing with geometric hashing, which is a standard 3D similarity method. The comparison clearly demonstrated the utility of this ultrafast method. After conducting the database searching, the hit list is further analyzed to provide basic statistical information about the occurrences of Gene Ontology terms and Enzyme Commission numbers, which may benefit researchers by helping them to design further experiments to study the query proteins.
Conclusions: This ultrafast web-based system will not only help researchers interested in drug design and structural genomics to identify similar binding sites, but also assist them by providing further analysis of hit list from database searching.
Affiliation(s)
- Bing Xiong
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Zhangjiang Hi-Tech Park, Pudong, Shanghai, 201203, PR China.
|
20
|
Re M, Valentini G. An Experimental Comparison of Hierarchical Bayes and True Path Rule Ensembles for Protein Function Prediction. Multiple Classifier Systems 2010. [DOI: 10.1007/978-3-642-12127-2_30] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 02/04/2023]
|
21
|
Christie KR, Hong EL, Cherry JM. Functional annotations for the Saccharomyces cerevisiae genome: the knowns and the known unknowns. Trends Microbiol 2009; 17:286-94. [PMID: 19577472 PMCID: PMC3057094 DOI: 10.1016/j.tim.2009.04.005] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Received: 11/04/2008] [Revised: 04/20/2009] [Accepted: 04/24/2009] [Indexed: 11/27/2022]
Abstract
The quest to characterize each of the genes of the yeast Saccharomyces cerevisiae has propelled the development and application of novel high-throughput (HTP) experimental techniques. To handle the enormous amount of information generated by these techniques, new bioinformatics tools and resources are needed. Gene Ontology (GO) annotations curated by the Saccharomyces Genome Database (SGD) have facilitated the development of algorithms that analyze HTP data and help predict functions for poorly characterized genes in S. cerevisiae and other organisms. Here, we describe how published results are incorporated into GO annotations at SGD and why researchers can benefit from using these resources wisely to analyze their HTP data and predict gene functions.
Affiliation(s)
- Karen R Christie
- Department of Genetics, Stanford University Medical School, Stanford, CA 94305-5120, USA
|