51
Yan H, Venkatesan K, Beaver JE, Klitgord N, Yildirim MA, Hao T, Hill DE, Cusick ME, Perrimon N, Roth FP, Vidal M. A genome-wide gene function prediction resource for Drosophila melanogaster. PLoS One 2010; 5:e12139. [PMID: 20711346 PMCID: PMC2920829 DOI: 10.1371/journal.pone.0012139] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 01/12/2010] [Accepted: 07/14/2010] [Indexed: 11/19/2022] Open
Abstract
Predicting gene functions by integrating large-scale biological data remains a challenge for systems biology. Here we present a resource for Drosophila melanogaster gene function predictions. We trained function-specific classifiers to optimize the influence of different biological datasets for each functional category. Our model predicted GO terms and KEGG pathway memberships for Drosophila melanogaster genes with high accuracy, as affirmed by cross-validation, supporting literature evidence, and large-scale RNAi screens. The resulting resource of prioritized associations between Drosophila genes and their potential functions offers a guide for experimental investigations.
Affiliation(s)
- Han Yan
  - Department of Cancer Biology, Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America
- Kavitha Venkatesan
  - Department of Cancer Biology, Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America
- John E. Beaver
  - Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts, United States of America
- Niels Klitgord
  - Department of Cancer Biology, Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America
- Muhammed A. Yildirim
  - Department of Cancer Biology, Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America
  - Applied Physics Program, Division of Engineering and Applied Sciences, Graduate School of Arts and Sciences, Harvard University, Cambridge, Massachusetts, United States of America
- Tong Hao
  - Department of Cancer Biology, Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America
- David E. Hill
  - Department of Cancer Biology, Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America
- Michael E. Cusick
  - Department of Cancer Biology, Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America
- Norbert Perrimon
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America
  - Howard Hughes Medical Institute, Boston, Massachusetts, United States of America
- Frederick P. Roth
  - Department of Cancer Biology, Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
  - Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts, United States of America
  - * E-mail: (FPR); (MV)
- Marc Vidal
  - Department of Cancer Biology, Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America
  - * E-mail: (FPR); (MV)
52
Sokolov A, Ben-Hur A. Hierarchical classification of gene ontology terms using the GOstruct method. J Bioinform Comput Biol 2010; 8:357-76. [PMID: 20401950 DOI: 10.1142/s0219720010004744] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.0] [Received: 07/25/2009] [Revised: 11/08/2009] [Accepted: 11/08/2009] [Indexed: 11/18/2022]
Abstract
Protein function prediction is an active area of research in bioinformatics. Yet, the transfer of annotation on the basis of sequence or structural similarity remains widely used as an annotation method. Most of today's machine learning approaches reduce the problem to a collection of binary classification problems: whether a protein performs a particular function, sometimes with a post-processing step to combine the binary outputs. We propose a method that directly predicts a full functional annotation of a protein by modeling the structure of the Gene Ontology hierarchy in the framework of kernel methods for structured-output spaces. Our empirical results show improved performance over a BLAST nearest-neighbor method, and over algorithms that employ a collection of binary classifiers as measured on the Mousefunc benchmark dataset.
Affiliation(s)
- Artem Sokolov
  - Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA.
53
Beaver JE, Tasan M, Gibbons FD, Tian W, Hughes TR, Roth FP. FuncBase: a resource for quantitative gene function annotation. Bioinformatics 2010; 26:1806-7. [PMID: 20495000 PMCID: PMC2894510 DOI: 10.1093/bioinformatics/btq265] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 04/17/2010] [Revised: 04/17/2010] [Accepted: 05/16/2010] [Indexed: 11/14/2022] Open
Abstract
SUMMARY Computational gene function prediction can serve to focus experimental resources on high-priority experimental tasks. FuncBase is a web resource for viewing quantitative machine learning-based gene function annotations. Quantitative annotations of genes, including fungal and mammalian genes, with Gene Ontology terms are accompanied by a community feedback system. Evidence underlying function annotations is shown. For example, a custom Cytoscape viewer shows functional linkage graphs relevant to the gene or function of interest. FuncBase provides links to external resources, and may be accessed directly or via links from species-specific databases. AVAILABILITY FuncBase as well as all underlying data and annotations are freely available via http://func.med.harvard.edu/
Affiliation(s)
- John E Beaver
  - Department of Biological Chemistry & Molecular Pharmacology, Harvard Medical School, Boston, MA 02115, USA
54
Lee I, Lehner B, Vavouri T, Shin J, Fraser AG, Marcotte EM. Predicting genetic modifier loci using functional gene networks. Genome Res 2010; 20:1143-53. [PMID: 20538624 DOI: 10.1101/gr.102749.109] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.9] [Indexed: 12/23/2022]
Abstract
Most phenotypes are genetically complex, with contributions from mutations in many different genes. Mutations in more than one gene can combine synergistically to cause phenotypic change, and systematic studies in model organisms show that these genetic interactions are pervasive. However, in human association studies such nonadditive genetic interactions are very difficult to identify because of a lack of statistical power--simply put, the number of potential interactions is too vast. One approach to resolve this is to predict candidate modifier interactions between loci, and then to specifically test these for associations with the phenotype. Here, we describe a general method for predicting genetic interactions based on the use of integrated functional gene networks. We show that in both Saccharomyces cerevisiae and Caenorhabditis elegans a single high-coverage, high-quality functional network can successfully predict genetic modifiers for the majority of genes. For C. elegans we also describe the construction of a new, improved, and expanded functional network, WormNet 2. Using this network we demonstrate how it is possible to rapidly expand the number of modifier loci known for a gene, predicting and validating new genetic interactions for each of three signal transduction genes. We propose that this approach, termed network-guided modifier screening, provides a general strategy for predicting genetic interactions. This work thus suggests that a high-quality integrated human gene network will provide a powerful resource for modifier locus discovery in many different diseases.
Affiliation(s)
- Insuk Lee
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seodaemun-ku, Seoul 120-749, South Korea.
55
Hill DP, Berardini TZ, Howe DG, Van Auken KM. Representing ontogeny through ontology: a developmental biologist's guide to the gene ontology. Mol Reprod Dev 2010; 77:314-29. [PMID: 19921742 DOI: 10.1002/mrd.21130] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Indexed: 11/06/2022]
Abstract
Developmental biology, like many other areas of biology, has undergone a dramatic shift in the perspective from which developmental processes are viewed. Instead of focusing on the actions of a handful of genes or functional RNAs, we now consider the interactions of large functional gene networks and study how these complex systems orchestrate the unfolding of an organism, from gametes to adult. Developmental biologists are beginning to realize that understanding ontogeny on this scale requires the utilization of computational methods to capture, store and represent the knowledge we have about the underlying processes. Here we review the use of the Gene Ontology (GO) to study developmental biology. We describe the organization and structure of the GO and illustrate some of the ways we use it to capture the current understanding of many common developmental processes. We also discuss ways in which gene product annotations using the GO have been used to ask and answer developmental questions in a variety of model developmental systems. We provide suggestions as to how the GO might be used in more powerful ways to address questions about development. Our goal is to provide developmental biologists with enough background about the GO that they can begin to think about how they might use the ontology efficiently and in the most powerful ways possible.
56
Lippert C, Ghahramani Z, Borgwardt KM. Gene function prediction from synthetic lethality networks via ranking on demand. Bioinformatics 2010; 26:912-8. [DOI: 10.1093/bioinformatics/btq053] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Indexed: 12/28/2022] Open
57
Genomics Portals: integrative web-platform for mining genomics data. BMC Genomics 2010; 11:27. [PMID: 20070909 PMCID: PMC2824719 DOI: 10.1186/1471-2164-11-27] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 09/10/2009] [Accepted: 01/13/2010] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND A large amount of experimental data generated by modern high-throughput technologies is available through various public repositories. Our knowledge about molecular interaction networks, functional biological pathways and transcriptional regulatory modules is rapidly expanding, and is being organized in lists of functionally related genes. Jointly, these two sources of information hold a tremendous potential for gaining new insights into functioning of living systems. RESULTS Genomics Portals platform integrates access to an extensive knowledge base and a large database of human, mouse, and rat genomics data with basic analytical visualization tools. It provides the context for analyzing and interpreting new experimental data and the tool for effective mining of a large number of publicly available genomics datasets stored in the back-end databases. The uniqueness of this platform lies in the volume and the diversity of genomics data that can be accessed and analyzed (gene expression, ChIP-chip, ChIP-seq, epigenomics, computationally predicted binding sites, etc), and the integration with an extensive knowledge base that can be used in such analysis. CONCLUSION The integrated access to primary genomics data, functional knowledge and analytical tools makes Genomics Portals platform a unique tool for interpreting results of new genomics experiments and for mining the vast amount of data stored in the Genomics Portals backend databases. Genomics Portals can be accessed and used freely at http://GenomicsPortals.org.
58
Klie S, Nikoloski Z, Selbig J. Biological cluster evaluation for gene function prediction. J Comput Biol 2010; 21:428-45. [PMID: 20059365 DOI: 10.1089/cmb.2009.0129] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022] Open
Abstract
Recent advances in high-throughput omics techniques render it possible to decode the function of genes by using the "guilt-by-association" principle on biologically meaningful clusters of gene expression data. However, the existing frameworks for biological evaluation of gene clusters are hindered by two bottleneck issues: (1) the choice for the number of clusters, and (2) the external measures which do not take in consideration the structure of the analyzed data and the ontology of the existing biological knowledge. Here, we address the identified bottlenecks by developing a novel framework that allows not only for biological evaluation of gene expression clusters based on existing structured knowledge, but also for prediction of putative gene functions. The proposed framework facilitates propagation of statistical significance at each of the following steps: (1) estimating the number of clusters, (2) evaluating the clusters in terms of novel external structural measures, (3) selecting an optimal clustering algorithm, and (4) predicting gene functions. The framework also includes a method for evaluation of gene clusters based on the structure of the employed ontology. Moreover, our method for obtaining a probabilistic range for the number of clusters is demonstrated valid on synthetic data and available gene expression profiles from Saccharomyces cerevisiae. Finally, we propose a network-based approach for gene function prediction which relies on the clustering of optimal score and the employed ontology. Our approach effectively predicts gene function on the Saccharomyces cerevisiae data set and is also employed to obtain putative gene functions for an Arabidopsis thaliana data set.
Affiliation(s)
- Sebastian Klie
  - Max-Planck Institute for Molecular Plant Physiology, Potsdam, Brandenburg, Germany
59
Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics 2010; 11:2. [PMID: 20044933 PMCID: PMC2824675 DOI: 10.1186/1471-2105-11-2] [Citation(s) in RCA: 111] [Impact Index Per Article: 7.9] [Received: 01/08/2009] [Accepted: 01/02/2010] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND S. cerevisiae, A. thaliana and M. musculus are well-studied organisms in biology and the sequencing of their genomes was completed many years ago. It is still a challenge, however, to develop methods that assign biological functions to the ORFs in these genomes automatically. Different machine learning methods have been proposed to this end, but it remains unclear which method is to be preferred in terms of predictive performance, efficiency and usability. RESULTS We study the use of decision tree based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning hierarchical multi-label decision trees. These can simultaneously predict all the functions of an ORF, while respecting a given hierarchy of gene functions (such as FunCat or GO). We present new results obtained with this algorithm, showing that the trees found by it exhibit clearly better predictive performance than the trees found by previously described methods. Nevertheless, the predictive performance of individual trees is lower than that of some recently proposed statistical learning methods. We show that ensembles of such trees are more accurate than single trees and are competitive with state-of-the-art statistical learning and functional linkage methods. Moreover, the ensemble method is computationally efficient and easy to use. CONCLUSIONS Our results suggest that decision tree based methods are a state-of-the-art, efficient and easy-to-use approach to ORF function prediction.
60
Abstract
A major challenge in current biology is to understand the genetic basis of variation for quantitative traits. We review the principles of quantitative trait locus mapping and summarize insights about the genetic architecture of quantitative traits that have been obtained over the past decades. We are currently in the midst of a genomic revolution, which enables us to incorporate genetic variation in transcript abundance and other intermediate molecular phenotypes into a quantitative trait locus mapping framework. This systems genetics approach enables us to understand the biology inside the 'black box' that lies between genotype and phenotype in terms of causal networks of interacting genes.
61
Christie KR, Hong EL, Cherry JM. Functional annotations for the Saccharomyces cerevisiae genome: the knowns and the known unknowns. Trends Microbiol 2009; 17:286-94. [PMID: 19577472 PMCID: PMC3057094 DOI: 10.1016/j.tim.2009.04.005] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Received: 11/04/2008] [Revised: 04/20/2009] [Accepted: 04/24/2009] [Indexed: 11/27/2022]
Abstract
The quest to characterize each of the genes of the yeast Saccharomyces cerevisiae has propelled the development and application of novel high-throughput (HTP) experimental techniques. To handle the enormous amount of information generated by these techniques, new bioinformatics tools and resources are needed. Gene Ontology (GO) annotations curated by the Saccharomyces Genome Database (SGD) have facilitated the development of algorithms that analyze HTP data and help predict functions for poorly characterized genes in S. cerevisiae and other organisms. Here, we describe how published results are incorporated into GO annotations at SGD and why researchers can benefit from using these resources wisely to analyze their HTP data and predict gene functions.
Affiliation(s)
- Karen R Christie
  - Department of Genetics, Stanford University Medical School, Stanford, CA 94305-5120, USA
62
Rogers MF, Ben-Hur A. The use of gene ontology evidence codes in preventing classifier assessment bias. Bioinformatics 2009; 25:1173-7. [PMID: 19254922 DOI: 10.1093/bioinformatics/btp122] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Indexed: 11/13/2022]
Abstract
MOTIVATION The biological community's reliance on computational annotations of protein function makes correct assessment of function prediction methods an issue of great importance. The fact that a large fraction of the annotations in current biological databases are based on computational methods can lead to bias in estimating the accuracy of function prediction methods. This can happen since predicting an annotation that was derived computationally in the first place is likely easier than predicting annotations that were derived experimentally, leading to over-optimistic classifier performance estimates. RESULTS We illustrate this phenomenon in a set of controlled experiments using a nearest neighbor classifier that uses PSI-BLAST similarity scores. Our results demonstrate that the source of Gene Ontology (GO) annotations used to assess a protein function predictor can have a highly significant influence on classifier accuracy: the average accuracy over four species and over GO terms in the biological process namespace increased from 0.72 to 0.87 when the classifier was given access to annotations that are assigned evidence codes that indicate a possible computational source, instead of experimentally determined annotations. Slightly smaller increases were observed in the other namespaces. In these comparisons the total number of annotations and their distribution across GO terms were kept the same. CONCLUSION In conclusion, taking into account GO evidence codes is required for reporting accuracy statistics that do not overestimate a model's performance, and is of particular importance for a fair comparison of classifiers that rely on different information sources. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mark F Rogers
  - Computer Science Department, Colorado State University, Ft. Collins, CO, USA.
63
Aiyar RS, Gagneur J, Steinmetz LM. Identification of mitochondrial disease genes through integrative analysis of multiple datasets. Methods 2008; 46:248-55. [PMID: 18930150 PMCID: PMC2774125 DOI: 10.1016/j.ymeth.2008.10.002] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 05/14/2008] [Revised: 10/03/2008] [Accepted: 10/08/2008] [Indexed: 11/24/2022] Open
Abstract
Determining the genetic factors in a disease is crucial to elucidating its molecular basis. This task is challenging due to a lack of information on gene function. The integration of large-scale functional genomics data has proven to be an effective strategy to prioritize candidate disease genes. Mitochondrial disorders are a prevalent and heterogeneous class of diseases that are particularly amenable to this approach. Here we explain the application of integrative approaches to the identification of mitochondrial disease genes. We first examine various datasets that can be used to evaluate the involvement of each gene in mitochondrial function. The data integration methodology is then described, accompanied by examples of common implementations. Finally, we discuss how gene networks are constructed using integrative techniques and applied to candidate gene prioritization. Relevant public data resources are indicated. This report highlights the success and potential of data integration as well as its applicability to the search for mitochondrial disease genes.
Affiliation(s)
- Raeka S. Aiyar
  - European Molecular Biology Laboratory, Meyerhofstraße 1, 69117 Heidelberg, Germany
- Julien Gagneur
  - European Molecular Biology Laboratory, Meyerhofstraße 1, 69117 Heidelberg, Germany
- Lars M. Steinmetz
  - European Molecular Biology Laboratory, Meyerhofstraße 1, 69117 Heidelberg, Germany
64
Taşan M, Tian W, Hill DP, Gibbons FD, Blake JA, Roth FP. An en masse phenotype and function prediction system for Mus musculus. Genome Biol 2008; 9 Suppl 1:S8. [PMID: 18613952 PMCID: PMC2447542 DOI: 10.1186/gb-2008-9-s1-s8] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Individual researchers are struggling to keep up with the accelerating emergence of high-throughput biological data, and to extract information that relates to their specific questions. Integration of accumulated evidence should permit researchers to form fewer - and more accurate - hypotheses for further study through experimentation. RESULTS Here a method previously used to predict Gene Ontology (GO) terms for Saccharomyces cerevisiae (Tian et al.: Combining guilt-by-association and guilt-by-profiling to predict Saccharomyces cerevisiae gene function. Genome Biol 2008, 9(Suppl 1):S7) is applied to predict GO terms and phenotypes for 21,603 Mus musculus genes, using a diverse collection of integrated data sources (including expression, interaction, and sequence-based data). This combined 'guilt-by-profiling' and 'guilt-by-association' approach optimizes the combination of two inference methodologies. Predictions at all levels of confidence are evaluated by examining genes not used in training, and top predictions are examined manually using available literature and knowledge base resources. CONCLUSION We assigned a confidence score to each gene/term combination. The results provided high prediction performance, with nearly every GO term achieving greater than 40% precision at 1% recall. Among the 36 novel predictions for GO terms and 40 for phenotypes that were studied manually, >80% and >40%, respectively, were identified as accurate. We also illustrate that a combination of 'guilt-by-profiling' and 'guilt-by-association' outperforms either approach alone in their application to M. musculus.
Affiliation(s)
- Murat Taşan
  - Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Longwood Avenue, Boston, Massachusetts 02115, USA
65