101
Chu LH, Rivera CG, Popel AS, Bader JS. Constructing the angiome: a global angiogenesis protein interaction network. Physiol Genomics 2012; 44:915-24. [PMID: 22911453 DOI: 10.1152/physiolgenomics.00181.2011] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Indexed: 02/05/2023]
Abstract
Angiogenesis is the formation of new blood vessels from pre-existing microvessels. Excessive and insufficient angiogenesis have been associated with many diseases including cancer, age-related macular degeneration, ischemic heart, brain, and skeletal muscle diseases. A comprehensive understanding of angiogenesis regulatory processes is needed to improve treatment of these diseases. To identify proteins related to angiogenesis, we developed a novel integrative framework for diverse sources of high-throughput data. The system, called GeneHits, was used to expand on known angiogenesis pathways to construct the angiome, a protein-protein interaction network for angiogenesis. The network consists of 478 proteins and 1,488 interactions. The network was validated through cross validation and analysis of five gene expression datasets from in vitro angiogenesis assays. We calculated the topological properties of the angiome. We analyzed the functional enrichment of angiogenesis-annotated and associated proteins. We also constructed an extended angiome with 1,233 proteins and 5,726 interactions to derive a more complete map of protein-protein interactions in angiogenesis. Finally, the extended angiome was used to identify growth factor signaling networks that drive angiogenesis and antiangiogenic signaling networks. The results of this analysis can be used to identify genes and proteins in different disease conditions and putative targets for therapeutic interventions as high-ranked candidates for experimental validation.
Affiliation(s)
- Liang-Hui Chu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21205, USA.
102
The need for mouse models in osteoporosis genetics research. Bonekey Rep 2012; 1:98. [PMID: 23951485 DOI: 10.1038/bonekey.2012.98] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/05/2011] [Accepted: 04/08/2012] [Indexed: 02/08/2023]
Abstract
Osteoporosis, the progressive loss of bone mass resulting in fragility fractures, affects ∼75 million people in the United States, Europe and Japan. Bone mineral density (BMD) correlates with fracture risk and is widely used in clinical settings to predict fracture. Numerous studies have demonstrated that peak bone mass is highly heritable and consequently a number of genome-wide association studies (GWASs) have been conducted to identify the genes that regulate BMD. Traditional intercross mapping in the mouse has met with limited successes in the field of skeletal biology. With the advent of human GWAS, questions have arisen about the continued need for mouse models in genetics research. However, significant advances have been made in the field of mouse genetics, including new genetics resource populations and loci mapping techniques, which enable gene-level mapping resolution. In this review, we discuss the need for mouse models to help understand the skeletal biology underlying novel human GWAS findings, how loci discovered in the mouse can be used to complement GWAS analysis and highlight the recent advances made in the field of skeletal biology from the use of these new and developing resources. We conclude this paper with a discussion of the need for systems-level approaches in the skeletal biology field, with an emphasis on the need for pathway and network analyses.
103
Wong AK, Park CY, Greene CS, Bongo LA, Guan Y, Troyanskaya OG. IMP: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res 2012; 40:W484-90. [PMID: 22684505 PMCID: PMC3394282 DOI: 10.1093/nar/gks458] [Citation(s) in RCA: 75] [Impact Index Per Article: 6.3] [Indexed: 12/20/2022]
Abstract
Integrative multi-species prediction (IMP) is an interactive web server that enables molecular biologists to interpret experimental results and to generate hypotheses in the context of a large cross-organism compendium of functional predictions and networks. The system provides a framework for biologists to analyze their candidate gene sets in the context of functional networks, as they expand or focus these sets by mining functional relationships predicted from integrated high-throughput data. IMP integrates prior knowledge and data collections from multiple organisms in its analyses. Through flexible and interactive visualizations, researchers can compare functional contexts and interpret the behavior of their gene sets across organisms. Additionally, IMP identifies homologs with conserved functional roles for knowledge transfer, allowing for accurate function predictions even for biological processes that have very few experimental annotations in a given organism. IMP currently supports seven organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Caenorhabditis elegans and Saccharomyces cerevisiae), does not require any registration or installation and is freely available for use at http://imp.princeton.edu.
Affiliation(s)
- Aaron K Wong
- Department of Computer Science, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
104
Zhang XF, Dai DQ. A framework for incorporating functional interrelationships into protein function prediction algorithms. IEEE/ACM Trans Comput Biol Bioinform 2012; 9:740-753. [PMID: 22084148 DOI: 10.1109/tcbb.2011.148] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 05/31/2023]
Abstract
The functional annotation of proteins is one of the most important tasks in the post-genomic era. Although many computational approaches have been developed in recent years to predict protein function, most of these traditional algorithms do not take interrelationships among functional terms into account; for example, different GO terms are often coannotated to the same proteins. In this study, we propose a new functional similarity measure in the form of a Jaccard coefficient to quantify these interrelationships and also develop a framework for incorporating GO term similarity into the protein function prediction process. The experimental results of cross-validation on S. cerevisiae and Homo sapiens data sets demonstrate that our method is able to improve the performance of protein function prediction. In addition, we find that small terms associated with only a few proteins benefit more than large ones when functional interrelationships are considered. We also compare our similarity measure with two other widely used measures, and the results indicate that our proposed measure is more effective when incorporated into function prediction algorithms. Experimental results also show that our algorithms outperform two previous competing algorithms, which also take functional interrelationships into account, in prediction accuracy. Finally, we show that our method is robust to the current incompleteness of annotations in the database. These results give new insights into the importance of functional interrelationships in protein function prediction.
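As a rough illustration of the kind of measure this abstract describes, the sketch below computes a Jaccard coefficient between two GO terms from the sets of proteins annotated to each. The abstract does not give the exact formula, so the set-based definition, the function name `jaccard`, and the toy annotation table are assumptions made for illustration only.

```python
def jaccard(proteins_a, proteins_b):
    """Jaccard coefficient of two protein sets: |A & B| / |A | B|.

    Assumption: each GO term is represented by the set of proteins
    annotated to it; the paper's exact measure may differ.
    """
    a, b = set(proteins_a), set(proteins_b)
    union = a | b
    if not union:
        # Two terms with no annotated proteins share nothing measurable.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical toy annotations (term -> annotated proteins).
annotations = {
    "term_A": {"P1", "P2", "P3"},
    "term_B": {"P2", "P3", "P4"},
    "term_C": {"P5"},
}

if __name__ == "__main__":
    print(jaccard(annotations["term_A"], annotations["term_B"]))  # 2 shared / 4 total
    print(jaccard(annotations["term_A"], annotations["term_C"]))  # no shared proteins
```

Terms that are frequently coannotated score close to 1, while terms with disjoint protein sets score 0, which is the interrelationship signal the abstract proposes to feed into function prediction.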
Affiliation(s)
- Xiao-Fei Zhang
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China.
105
"Guilt by association" is the exception rather than the rule in gene networks. PLoS Comput Biol 2012; 8:e1002444. [PMID: 22479173 PMCID: PMC3315453 DOI: 10.1371/journal.pcbi.1002444] [Citation(s) in RCA: 144] [Impact Index Per Article: 12.0] [Received: 11/15/2011] [Accepted: 02/09/2012] [Indexed: 12/16/2022]
Abstract
Gene networks are commonly interpreted as encoding functional information in their connections. An extensively validated principle called guilt by association states that genes which are associated or interacting are more likely to share function. Guilt by association provides the central top-down principle for analyzing gene networks in functional terms or assessing their quality in encoding functional information. In this work, we show that functional information within gene networks is typically concentrated in only a very few interactions whose properties cannot be reliably related to the rest of the network. In effect, the apparent encoding of function within networks has been largely driven by outliers whose behaviour cannot even be generalized to individual genes, let alone to the network at large. While experimentalist-driven analysis of interactions may use prior expert knowledge to focus on the small fraction of critically important data, large-scale computational analyses have typically assumed that high-performance cross-validation in a network is due to a generalizable encoding of function. Because we find that gene function is not systemically encoded in networks, but dependent on specific and critical interactions, we conclude it is necessary to focus on the details of how networks encode function and what information computational analyses use to extract functional meaning. We explore a number of consequences of this and find that network structure itself provides clues as to which connections are critical and that systemic properties, such as scale-free-like behaviour, do not map onto the functional connectivity within networks.
The analysis of gene function and gene networks is a major theme of post-genome biomedical research. Historically, many attempts to understand gene function leverage a biological principle known as “guilt by association” (GBA). GBA states that genes with related functions tend to share properties such as genetic or physical interactions. In the past ten years, GBA has been scaled up for application to large gene networks, becoming a favored way to grapple with the complex interdependencies of gene functions in the face of floods of genomics and proteomics data. However, there is a growing realization that scaled-up GBA is not a panacea. In this study, we report a precise identification of the limits of GBA and show that it cannot provide a way to understand gene networks in a way that is simultaneously general and useful. Our findings indicate that the assumptions underlying the high-throughput use of gene networks to interpret function are fundamentally flawed, with wide-ranging implications for the interpretation of genome-wide data.
106
Park J, Costanzo MC, Balakrishnan R, Cherry JM, Hong EL. CvManGO, a method for leveraging computational predictions to improve literature-based Gene Ontology annotations. Database (Oxford) 2012; 2012:bas001. [PMID: 22434836 PMCID: PMC3308158 DOI: 10.1093/database/bas001] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 01/18/2023]
Abstract
The set of annotations at the Saccharomyces Genome Database (SGD) that classifies the cellular function of S. cerevisiae gene products using Gene Ontology (GO) terms has become an important resource for facilitating experimental analysis. In addition to capturing and summarizing experimental results, the structured nature of GO annotations allows for functional comparison across organisms as well as propagation of functional predictions between related gene products. Due to their relevance to many areas of research, ensuring the accuracy and quality of these annotations is a priority at SGD. GO annotations are assigned either manually, by biocurators extracting experimental evidence from the scientific literature, or through automated methods that leverage computational algorithms to predict functional information. Here, we discuss the relationship between literature-based and computationally predicted GO annotations in SGD and extend a strategy whereby comparison of these two types of annotation identifies genes whose annotations need review. Our method, CvManGO (Computational versus Manual GO annotations), pairs literature-based GO annotations with computational GO predictions and evaluates the relationship of the two terms within GO, looking for instances of discrepancy. We found that this method will identify genes that require annotation updates, taking an important step towards finding ways to prioritize literature review. Additionally, we explored factors that may influence the effectiveness of CvManGO in identifying relevant gene targets to find in particular those genes that are missing literature-supported annotations, but our survey found that there are no immediately identifiable criteria by which one could enrich for these under-annotated genes. Finally, we discuss possible ways to improve this strategy, and the applicability of this method to other projects that use the GO for curation. Database URL:http://www.yeastgenome.org
Affiliation(s)
- Julie Park
- Department of Genetics, Stanford University, Stanford, CA 94305-5120, USA
107
Yuan Y, Xu Y, Xu J, Ball RL, Liang H. Predicting the lethal phenotype of the knockout mouse by integrating comprehensive genomic data. Bioinformatics 2012; 28:1246-52. [PMID: 22419784 DOI: 10.1093/bioinformatics/bts120] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Indexed: 11/12/2022]
Abstract
MOTIVATION: The phenotypes of knockout mice provide crucial information for understanding the biological functions of mammalian genes. Among various knockout phenotypes, lethality is of great interest because the genes involved play essential roles. With the availability of large-scale genomic data, we aimed to assess how well the integration of various genomic features can predict the lethal phenotype of single-gene knockout mice. RESULTS: We first assembled a comprehensive list of 491 candidate genomic features derived from diverse data sources. Using mouse genes with a known phenotype as the training set, we integrated the informative genomic features to predict knockout lethality through three machine learning methods. Based on cross-validation, our models achieved good performance (accuracy = 73% and recall = 63%). Our results serve as a valuable practical resource for the mouse genetics research community, and should also accelerate the translation of knowledge about mouse genes into better strategies for studying human disease.
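For readers less familiar with the two metrics quoted in this abstract, the sketch below shows how accuracy and recall are conventionally computed from binary predictions. The labels are invented toy data (1 = lethal, 0 = viable), not the study's data, and the function name is hypothetical; the authors' cross-validation procedure is not reproduced here.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels, 1 being the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Recall is the fraction of truly positive (lethal) genes that were recovered.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


if __name__ == "__main__":
    # Toy example only: 8 genes, 4 truly lethal.
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 1, 0, 0, 0, 1, 0, 0]
    print(accuracy_and_recall(truth, predicted))
```

Reporting both numbers matters because a classifier can reach high accuracy on an imbalanced set while still missing many lethal genes, which recall exposes.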
Affiliation(s)
- Yuan Yuan
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Baylor College of Medicine, Houston, TX 77030, USA
108
Uncovering the molecular machinery of the human spindle--an integration of wet and dry systems biology. PLoS One 2012; 7:e31813. [PMID: 22427808 PMCID: PMC3302876 DOI: 10.1371/journal.pone.0031813] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 07/07/2011] [Accepted: 01/18/2012] [Indexed: 11/19/2022]
Abstract
The mitotic spindle is an essential molecular machine involved in cell division, whose composition has been studied extensively by detailed cellular biology, high-throughput proteomics, and RNA interference experiments. However, because of its dynamic organization and complex regulation it is difficult to obtain a complete description of its molecular composition. We have implemented an integrated computational approach to characterize novel human spindle components and have analysed in detail the individual candidates predicted to be spindle proteins, as well as the network of predicted relations connecting known and putative spindle proteins. The subsequent experimental validation of a number of predicted novel proteins confirmed not only their association with the spindle apparatus but also their role in mitosis. We found that 75% of our tested proteins are localizing to the spindle apparatus compared to a success rate of 35% when expert knowledge alone was used. We compare our results to the previously published MitoCheck study and see that our approach does validate some findings by this consortium. Further, we predict so-called "hidden spindle hub", proteins whose network of interactions is still poorly characterised by experimental means and which are thought to influence the functionality of the mitotic spindle on a large scale. Our analyses suggest that we are still far from knowing the complete repertoire of functionally important components of the human spindle network. Combining integrated bio-computational approaches and single gene experimental follow-ups could be key to exploring the still hidden regions of the human spindle system.
109
Zhu W, Hou J, Chen YPP. Exploiting multi-layered information to iteratively predict protein functions. Math Biosci 2012; 236:108-16. [PMID: 22391459 DOI: 10.1016/j.mbs.2012.02.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 06/14/2011] [Revised: 02/02/2012] [Accepted: 02/15/2012] [Indexed: 01/21/2023]
Abstract
BACKGROUND: Similarity-based computational methods are a useful tool for predicting protein functions from protein-protein interaction (PPI) datasets. Although various similarity-based prediction algorithms have been proposed, their prediction results have often been unsatisfactory. The purpose of this type of algorithm is to predict the functions of an unannotated protein from the functions of proteins that are similar to it. Therefore, the prediction quality largely depends on how to select a set of proper proteins (i.e., a prediction domain) from which the functions of an unannotated protein are predicted, and how to measure the similarity between proteins. Another issue with existing algorithms is that they treat function prediction as a one-off procedure, ignoring the fact that interactions amongst proteins are mutual and dynamic in terms of similarity when predicting functions. How to resolve these major issues to increase prediction quality remains a challenge in computational biology. RESULTS: In this paper, we propose an innovative approach to predict protein functions of unannotated proteins iteratively from a PPI dataset. The iterative approach takes into account the mutual and dynamic features of protein interactions when predicting functions, and addresses the issues of protein similarity measurement and prediction domain selection by introducing into the prediction algorithm a new semantic protein similarity and a method of selecting the multi-layer prediction domain. The new protein similarity is based on the multi-layered information carried by protein functions. The evaluations conducted on real protein interaction datasets demonstrated that the proposed iterative function prediction method outperformed other similar or non-iterative methods, and provided better prediction results.
CONCLUSIONS: The new protein similarity derived from multi-layered information of protein functions more reasonably reflects the intrinsic relationships among proteins, and significant improvement to the prediction quality can occur through incorporation of mutual and dynamic features of protein interactions into the prediction algorithm.
Affiliation(s)
- Wei Zhu
- Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Australia.
110
A Resource of Quantitative Functional Annotation for Homo sapiens Genes. G3 (Bethesda) 2012; 2:223-33. [PMID: 22384401 PMCID: PMC3284330 DOI: 10.1534/g3.111.000828] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 08/04/2011] [Accepted: 11/23/2011] [Indexed: 01/31/2023]
Abstract
The body of human genomic and proteomic evidence continues to grow at ever-increasing rates, while annotation efforts struggle to keep pace. A surprisingly small fraction of human genes have clear, documented associations with specific functions, and new functions continue to be found for characterized genes. Here we assembled an integrated collection of diverse genomic and proteomic data for 21,341 human genes and make quantitative associations of each to 4333 Gene Ontology terms. We combined guilt-by-profiling and guilt-by-association approaches to exploit features unique to the data types. Performance was evaluated by cross-validation, prospective validation, and by manual evaluation with the biological literature. Functional-linkage networks were also constructed, and their utility was demonstrated by identifying candidate genes related to a glioma FLN using a seed network from genome-wide association studies. Our annotations are presented—alongside existing validated annotations—in a publicly accessible and searchable web interface.
111
Greene CS, Troyanskaya OG. Accurate evaluation and analysis of functional genomics data and methods. Ann N Y Acad Sci 2012; 1260:95-100. [PMID: 22268703 DOI: 10.1111/j.1749-6632.2011.06383.x] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 11/27/2022]
Abstract
The development of technology capable of inexpensively performing large-scale measurements of biological systems has generated a wealth of data. Integrative analysis of these data holds the promise of uncovering gene function, regulation, and, in the longer run, understanding complex disease. However, their analysis has proved very challenging, as it is difficult to quickly and effectively assess the relevance and accuracy of these data for individual biological questions. Here, we identify biases that present challenges for the assessment of functional genomics data and methods. We then discuss evaluation methods that, taken together, begin to address these issues. We also argue that the funding of systematic data-driven experiments and of high-quality curation efforts will further improve evaluation metrics so that they more-accurately assess functional genomics data and methods. Such metrics will allow researchers in the field of functional genomics to continue to answer important biological questions in a data-driven manner.
Affiliation(s)
- Casey S Greene
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA.
112
Jelier R, Semple JI, Garcia-Verdugo R, Lehner B. Predicting phenotypic variation in yeast from individual genome sequences. Nat Genet 2011; 43:1270-4. [PMID: 22081227 DOI: 10.1038/ng.1007] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.5] [Received: 05/31/2011] [Accepted: 10/19/2011] [Indexed: 12/16/2022]
Abstract
A central challenge in genetics is to predict phenotypic variation from individual genome sequences. Here we construct and evaluate phenotypic predictions for 19 strains of Saccharomyces cerevisiae. We use conservation-based methods to predict the impact of protein-coding variation within genes on protein function. We then rank strains using a prediction score that measures the total sum of function-altering changes in different sets of genes reported to influence over 100 phenotypes in genome-wide loss-of-function screens. We evaluate our predictions by comparing them with the observed growth rate and efficiency of 15 strains tested across 20 conditions in quantitative experiments. The median predictive performance, as measured by ROC AUC, was 0.76, and predictions were more accurate when the genes reported to influence a trait were highly connected in a functional gene network.
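The ROC AUC reported in this abstract can be read as the probability that a randomly chosen positive case receives a higher prediction score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison, counting ties as one half; the scores and labels are invented toy values, and the function name `roc_auc` is an assumption for illustration rather than the authors' evaluation code.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy example: prediction scores for four strains, two showing the phenotype.
    toy_scores = [0.9, 0.8, 0.4, 0.3]
    toy_labels = [1, 0, 1, 0]
    print(roc_auc(toy_scores, toy_labels))  # 3 of 4 pairs ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the median of 0.76 quoted above sits roughly midway between the two.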
Affiliation(s)
- Rob Jelier
- European Molecular Biology Laboratory, Centre for Genomic Regulation, Systems Biology Research Unit, Barcelona, Spain
113
Genetic dissection of the biotic stress response using a genome-scale gene network for rice. Proc Natl Acad Sci U S A 2011; 108:18548-53. [PMID: 22042862 DOI: 10.1073/pnas.1110384108] [Citation(s) in RCA: 134] [Impact Index Per Article: 10.3] [Indexed: 12/13/2022]
Abstract
Rice is a staple food for one-half the world's population and a model for other monocotyledonous species. Thus, efficient approaches for identifying key genes controlling simple or complex traits in rice have important biological, agricultural, and economic consequences. Here, we report on the construction of RiceNet, an experimentally tested genome-scale gene network for a monocotyledonous species. Many different datasets, derived from five different organisms including plants, animals, yeast, and humans, were evaluated, and 24 of the most useful were integrated into a statistical framework that allowed for the prediction of functional linkages between pairs of genes. Genes could be linked to traits by using guilt-by-association, predicting gene attributes on the basis of network neighbors. We applied RiceNet to an important agronomic trait, the biotic stress response. Using network guilt-by-association followed by focused protein-protein interaction assays, we identified and validated, in planta, two positive regulators, LOC_Os01g70580 (now Regulator of XA21; ROX1) and LOC_Os02g21510 (ROX2), and one negative regulator, LOC_Os06g12530 (ROX3). These proteins control resistance mediated by rice XA21, a pattern recognition receptor. We also showed that RiceNet can accurately predict gene function in another major monocotyledonous crop species, maize. RiceNet thus enables the identification of genes regulating important crop traits, facilitating engineering of pathways critical to crop productivity.
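Network guilt-by-association, as described in this abstract, predicts gene attributes from network neighbors. A minimal sketch of one common variant is given below: each candidate gene is scored by the summed weight of its links to genes already known to be involved in the trait. The abstract does not specify RiceNet's scoring scheme, so the weighted-sum rule, the function name `gba_scores`, and the gene identifiers are all assumptions for illustration.

```python
from collections import defaultdict


def gba_scores(edges, seeds):
    """Rank non-seed genes by total edge weight to seed (trait-associated) genes.

    edges: iterable of (gene_a, gene_b, weight) for an undirected network.
    seeds: genes already annotated with the trait of interest.
    Returns a list of (gene, score) sorted by descending score, then by name.
    """
    seeds = set(seeds)
    scores = defaultdict(float)
    for a, b, weight in edges:
        # Credit a candidate once for every link it has to a seed gene.
        if a in seeds and b not in seeds:
            scores[b] += weight
        if b in seeds and a not in seeds:
            scores[a] += weight
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical toy network; gene names are placeholders.
    toy_edges = [
        ("g1", "g2", 2.0),
        ("g1", "g3", 1.0),
        ("g4", "g2", 1.5),
        ("g3", "g5", 0.5),
    ]
    print(gba_scores(toy_edges, seeds={"g1", "g4"}))
```

In this toy network g2 ranks first because it is linked to both seed genes, while g5 receives no score because it has no direct seed neighbor; top-ranked candidates would then be the ones taken forward to focused assays of the kind the abstract mentions.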
114
Dozmorov MG, Giles CB, Wren JD. Predicting gene ontology from a global meta-analysis of 1-color microarray experiments. BMC Bioinformatics 2011; 12 Suppl 10:S14. [PMID: 22166114 PMCID: PMC3236836 DOI: 10.1186/1471-2105-12-s10-s14] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Indexed: 12/31/2022]
Abstract
Affiliation(s)
- Mikhail G Dozmorov
- Arthritis and Clinical Immunology Research Program, Oklahoma Medical Research Foundation, 825 NE 13th Street, Oklahoma City, Oklahoma 73104-5005, USA
115
Rivera CG, Mellberg S, Claesson-Welsh L, Bader JS, Popel AS. Analysis of VEGF-A regulated gene expression in endothelial cells to identify genes linked to angiogenesis. PLoS One 2011; 6:e24887. [PMID: 21931866 PMCID: PMC3172305 DOI: 10.1371/journal.pone.0024887] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Received: 03/08/2011] [Accepted: 08/23/2011] [Indexed: 02/06/2023]
Abstract
Angiogenesis is important for many physiological processes, diseases, and also regenerative medicine. Therapies that inhibit the vascular endothelial growth factor (VEGF) pathway have been used in the clinic for cancer and macular degeneration. In cancer applications, these treatments suffer from a “tumor escape phenomenon” where alternative pathways are upregulated and angiogenesis continues. The redundancy of angiogenesis regulation indicates the need for additional studies and new drug targets. We aimed to (i) identify novel and missing angiogenesis annotations and (ii) verify their significance to angiogenesis. To achieve these goals, we integrated the human interactome with known angiogenesis-annotated proteins to identify a set of 202 angiogenesis-associated proteins. Across endothelial cell lines, we found that a significant fraction of these proteins had highly perturbed gene expression during angiogenesis. After treatment with VEGF-A, we found increasing expression of HIF-1α, APP, HIV-1 tat interactive protein 2, and MEF2C, while endoglin, liprin β1 and HIF-2α had decreasing expression across three endothelial cell lines. The analysis showed differential regulation of HIF-1α and HIF-2α. The data also provided additional evidence for the role of endothelial cells in Alzheimer's disease.
Affiliation(s)
- Corban G Rivera
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America.
116
Drew K, Winters P, Butterfoss GL, Berstis V, Uplinger K, Armstrong J, Riffle M, Schweighofer E, Bovermann B, Goodlett DR, Davis TN, Shasha D, Malmström L, Bonneau R. The Proteome Folding Project: proteome-scale prediction of structure and function. Genome Res 2011; 21:1981-94. [PMID: 21824995 DOI: 10.1101/gr.121475.111] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.5] [Indexed: 12/20/2022]
Abstract
The incompleteness of proteome structure and function annotation is a critical problem for biologists and, in particular, severely limits interpretation of high-throughput and next-generation experiments. We have developed a proteome annotation pipeline based on structure prediction, where function and structure annotations are generated using an integration of sequence comparison, fold recognition, and grid-computing-enabled de novo structure prediction. We predict protein domain boundaries and three-dimensional (3D) structures for protein domains from 94 genomes (including human, Arabidopsis, rice, mouse, fly, yeast, Escherichia coli, and worm). De novo structure predictions were distributed on a grid of more than 1.5 million CPUs worldwide (World Community Grid). We generated significant numbers of new confident fold annotations (9% of domains that are otherwise unannotated in these genomes). We demonstrate that predicted structures can be combined with annotations from the Gene Ontology database to predict new and more specific molecular functions.
Affiliation(s)
- Kevin Drew
- Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York 10003, USA
117
Gillis J, Pavlidis P. The role of indirect connections in gene networks in predicting function. Bioinformatics 2011; 27:1860-6. [PMID: 21551147 PMCID: PMC3117376 DOI: 10.1093/bioinformatics/btr288] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.9] [Received: 02/04/2011] [Revised: 04/12/2011] [Accepted: 05/02/2011] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Gene networks have been used widely in gene function prediction algorithms, many based on complex extensions of the 'guilt by association' principle. We sought to provide a unified explanation for the performance of gene function prediction algorithms in exploiting network structure and thereby simplify future analysis. RESULTS: We use co-expression networks to show that most exploited network structure simply reconstructs the original correlation matrices from which the co-expression network was obtained. We show the same principle works in predicting gene function in protein interaction networks and that these methods perform comparably to much more sophisticated gene function prediction algorithms. AVAILABILITY AND IMPLEMENTATION: Data and algorithm implementation are fully described and available at http://www.chibi.ubc.ca/extended. Programs are provided in Matlab m-code. CONTACT: paul@chibi.ubc.ca
Affiliation(s)
- Jesse Gillis
- Centre for High-Throughput Biology and Department of Psychiatry, 177 Michael Smith Laboratories, 2185 East Mall, University of British Columbia, Vancouver, BC V6T1Z4, Canada
|
118
|
Integrated genome-scale prediction of detrimental mutations in transcription networks. PLoS Genet 2011; 7:e1002077. [PMID: 21637788 PMCID: PMC3102745 DOI: 10.1371/journal.pgen.1002077] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 09/27/2010] [Accepted: 03/25/2011] [Indexed: 01/10/2023] Open
Abstract
A central challenge in genetics is to understand when and why mutations alter the phenotype of an organism. The consequences of gene inhibition have been systematically studied and can be predicted reasonably well across a genome. However, many sequence variants important for disease and evolution may alter gene regulation rather than gene function. The consequences of altering a regulatory interaction (or “edge”) rather than a gene (or “node”) in a network have not been as extensively studied. Here we use an integrative analysis and evolutionary conservation to identify features that predict when the loss of a regulatory interaction is detrimental in the extensively mapped transcription network of budding yeast. Properties such as the strength of an interaction, location and context in a promoter, regulator and target gene importance, and the potential for compensation (redundancy) associate to some extent with interaction importance. Combined, however, these features predict quite well whether the loss of a regulatory interaction is detrimental across many promoters and for many different transcription factors. Thus, despite the potential for regulatory diversity, common principles can be used to understand and predict when changes in regulation are most harmful to an organism. The genomes of individuals differ in sequence at thousands of base pairs. Some of these polymorphisms affect the sequence of proteins, but many are likely to alter how genes are regulated. When are changes in gene regulation detrimental to an organism? We have used an integrative analysis of transcription factor binding site conservation in budding yeast to address the extent to which different features predict when potential changes in gene regulation are detrimental. We found that, despite the diversity of transcription factors and regulatory regions in a genome, a few simple properties can be used to predict and understand when changes in regulation are most harmful.
|
119
|
A gene-phenotype network for the laboratory mouse and its implications for systematic phenotyping. PLoS One 2011; 6:e19693. [PMID: 21625554 PMCID: PMC3098258 DOI: 10.1371/journal.pone.0019693] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 01/14/2011] [Accepted: 04/11/2011] [Indexed: 01/22/2023] Open
Abstract
The laboratory mouse is the pre-eminent model organism for the dissection of human disease pathways. With the advent of a comprehensive panel of gene knockouts, projects to characterise the phenotypes of all knockout lines are being initiated. The range of genotype-phenotype associations can be represented using the Mammalian Phenotype ontology. Using publicly available data annotated with this ontology we have constructed gene and phenotype networks representing these associations. These networks show a scale-free, hierarchical and modular character and community structure. They also exhibit enrichment for gene coexpression, protein-protein interactions and Gene Ontology annotation similarity. Close association between gene communities and some high-level ontology terms suggests that systematic phenotyping can provide a direct insight into underlying pathways. However some phenotypes are distributed more diffusely across gene networks, likely reflecting the pleiotropic roles of many genes. Phenotype communities show a many-to-many relationship to human disease communities, but stronger overlap at more granular levels of description. This may suggest that systematic phenotyping projects should aim for high granularity annotations to maximise their relevance to human disease.
|
120
|
Valentini G. True path rule hierarchical ensembles for genome-wide gene function prediction. IEEE/ACM Trans Comput Biol Bioinform 2011; 8:832-847. [PMID: 20479498 DOI: 10.1109/tcbb.2010.38] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.5] [Indexed: 05/29/2023]
Abstract
Gene function prediction is a complex computational problem, characterized by several items: the number of functional classes is large, and a gene may belong to multiple classes; functional classes are structured according to a hierarchy; classes are usually unbalanced, with more negative than positive examples; class labels can be uncertain and the annotations largely incomplete; to improve the predictions, multiple sources of data need to be properly integrated. In this contribution, we focus on the first three items, and, in particular, on the development of a new method for hierarchical genome-wide and ontology-wide gene function prediction. The proposed algorithm is inspired by the “true path rule” (TPR) that governs both the Gene Ontology and FunCat taxonomies. According to this rule, the proposed TPR ensemble method is characterized by a two-way asymmetric flow of information that traverses the graph-structured ensemble: positive predictions for a node recursively influence its ancestors, while negative predictions influence its offspring. Cross-validated results with the model organism S. cerevisiae, using seven different sources of biomolecular data, and a theoretical analysis of the TPR algorithm show the effectiveness and the drawbacks of the proposed approach.
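The two-way flow described in this abstract can be illustrated with a small sketch. It is not the published TPR ensemble: it assumes a tree-shaped hierarchy, a single score per term, and a fixed threshold, and the term names, scores, and the function `true_path_consensus` are hypothetical. Positive child predictions are averaged into the parent's score (upward flow), and a negative decision at a term excludes everything beneath it (downward flow).

```python
def true_path_consensus(children, root, local_scores, threshold=0.5):
    """Toy two-pass scheme inspired by the true path rule (illustrative only).

    children:     dict mapping a term to a list of its child terms (a tree)
    local_scores: dict mapping each term to a raw classifier score in [0, 1]
    """
    consensus = {}

    def lift(term):
        # Bottom-up: average the term's own score with the consensus
        # scores of those children that were predicted positive.
        positive_child_scores = []
        for child in children.get(term, []):
            child_score = lift(child)
            if child_score >= threshold:
                positive_child_scores.append(child_score)
        total = local_scores[term] + sum(positive_child_scores)
        consensus[term] = total / (1 + len(positive_child_scores))
        return consensus[term]

    lift(root)

    # Top-down: descend only through positive terms, so a negative
    # decision at a term also rules out all of its descendants.
    positives = set()
    stack = [root]
    while stack:
        term = stack.pop()
        if consensus[term] >= threshold:
            positives.add(term)
            stack.extend(children.get(term, []))
    return consensus, positives


# Hypothetical hierarchy and scores, for illustration only.
toy_children = {
    "root": ["transport", "binding"],
    "transport": ["ion_transport", "lipid_transport"],
    "binding": ["dna_binding"],
}
toy_scores = {
    "root": 0.9, "transport": 0.4, "ion_transport": 0.9,
    "lipid_transport": 0.2, "binding": 0.3, "dna_binding": 0.6,
}
consensus, positives = true_path_consensus(toy_children, "root", toy_scores)
```

In this toy run, the confident `ion_transport` prediction lifts its weakly scored parent `transport` above the threshold, while `dna_binding` is excluded because its parent `binding` stays negative.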
Affiliation(s)
- Giorgio Valentini
- Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Via Comelico 39, Milano, Italy.
|
121
|
Fortney K, Jurisica I. Integrative computational biology for cancer research. Hum Genet 2011; 130:465-81. [PMID: 21691773 PMCID: PMC3179275 DOI: 10.1007/s00439-011-0983-z] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Received: 02/03/2011] [Accepted: 04/02/2011] [Indexed: 12/21/2022]
Abstract
Over the past two decades, high-throughput (HTP) technologies such as microarrays and mass spectrometry have fundamentally changed clinical cancer research. They have revealed novel molecular markers of cancer subtypes, metastasis, and drug sensitivity and resistance. Some have been translated into the clinic as tools for early disease diagnosis, prognosis, and individualized treatment and response monitoring. Despite these successes, many challenges remain: HTP platforms are often noisy and suffer from false positives and false negatives; optimal analysis and successful validation require complex workflows; and great volumes of data are accumulating at a rapid pace. Here we discuss these challenges, and show how integrative computational biology can help diminish them by creating new software tools, analytical methods, and data standards.
Affiliation(s)
- Kristen Fortney
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
|
122
|
Gillis J, Pavlidis P. The impact of multifunctional genes on "guilt by association" analysis. PLoS One 2011; 6:e17258. [PMID: 21364756 PMCID: PMC3041792 DOI: 10.1371/journal.pone.0017258] [Citation(s) in RCA: 136] [Impact Index Per Article: 10.5] [Received: 08/05/2010] [Accepted: 01/27/2011] [Indexed: 02/02/2023] Open
Abstract
Many previous studies have shown that by using variants of "guilt-by-association", gene function predictions can be made with very high statistical confidence. In these studies, it is assumed that the "associations" in the data (e.g., protein interaction partners) of a gene are necessary in establishing "guilt". In this paper we show that multifunctionality, rather than association, is a primary driver of gene function prediction. We first show that knowledge of the degree of multifunctionality alone can produce astonishingly strong performance when used as a predictor of gene function. We then demonstrate how multifunctionality is encoded in gene interaction data (such as protein interactions and coexpression networks) and how this can feed forward into gene function prediction algorithms. We find that high-quality gene function predictions can be made using data that possesses no information on which gene interacts with which. By examining a wide range of networks from mouse, human and yeast, as well as multiple prediction methods and evaluation metrics, we provide evidence that this problem is pervasive and does not reflect the failings of any particular algorithm or data type. We propose computational controls that can be used to provide more meaningful control when estimating gene function prediction performance. We suggest that this source of bias due to multifunctionality is important to control for, with widespread implications for the interpretation of genomics studies.
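The central claim here, that a ranking by multifunctionality alone can score well, can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not the authors' procedure: multifunctionality is approximated as the number of a gene's other annotations, the gene and function names are invented, and `auroc` and `multifunctionality_scores` are hypothetical helpers.

```python
def auroc(scores, positives):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for gene, s in scores.items() if gene in positives]
    neg = [s for gene, s in scores.items() if gene not in positives]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def multifunctionality_scores(annotations, target):
    """Score each gene by how many functions it has, ignoring the target."""
    return {gene: len(funcs - {target}) for gene, funcs in annotations.items()}


# Hypothetical annotation table: gene -> set of function labels.
toy_annotations = {
    "g1": {"F1", "F2", "F3", "F4"},
    "g2": {"F1", "F2", "F3"},
    "g3": {"F1", "F4"},
    "g4": {"F2"},
    "g5": {"F3"},
    "g6": set(),
}

target = "F1"
true_members = {g for g, funcs in toy_annotations.items() if target in funcs}
scores = multifunctionality_scores(toy_annotations, target)
performance = auroc(scores, true_members)
```

No interaction data enters the score, yet in this toy table the ranking recovers members of `F1` well above the chance level of 0.5, which is the kind of bias the abstract argues should be controlled for.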
Affiliation(s)
- Jesse Gillis
- Centre for High-Throughput Biology, Department of Psychiatry, University of British Columbia, Vancouver, British Columbia, Canada
- Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
- Paul Pavlidis
- Centre for High-Throughput Biology, Department of Psychiatry, University of British Columbia, Vancouver, British Columbia, Canada
- Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
|
123
|
Chikina MD, Troyanskaya OG. Accurate quantification of functional analogy among close homologs. PLoS Comput Biol 2011; 7:e1001074. [PMID: 21304936 PMCID: PMC3033368 DOI: 10.1371/journal.pcbi.1001074] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Received: 06/04/2010] [Accepted: 01/02/2011] [Indexed: 11/18/2022] Open
Abstract
Correctly evaluating functional similarities among homologous proteins is necessary for accurate transfer of experimental knowledge from one organism to another, and is of particular importance for the development of animal models of human disease. While the fact that sequence similarity implies functional similarity is a fundamental paradigm of molecular biology, sequence comparison does not directly assess the extent to which two proteins participate in the same biological processes, and has limited utility for analyzing families with several paralogous members. Nevertheless, we show that it is possible to provide a cross-organism functional similarity measure in an unbiased way through the exclusive use of high-throughput gene-expression data. Our methodology is based on probabilistic cross-species mapping of functionally analogous proteins based on Bayesian integrative analysis of gene expression compendia. We demonstrate that even among closely related genes, our method is able to predict functionally analogous homolog pairs better than relying on sequence comparison alone. We also demonstrate that the landscape of functional similarity is often complex and that definitive “functional orthologs” do not always exist. Even in these cases, our method and the online interface we provide are designed to allow detailed exploration of sources of inferred functional similarity that can be evaluated by the user. Common ancestry is a central tenet of modern biology, as genes from different species often show a high degree of sequence similarity, making it possible to study analogous processes across model organisms. However, many genes belong to large families with several duplicates and the relationship between genes from different species is often not one-to-one, complicating the transfer of experimental knowledge.
We present a method that uses a large compendium of high-throughput expression data, covering many genes that have not been analyzed in any other way, to systematically predict which genes are most likely to participate in the same biological process and thus have analogous function in different organisms. We show that our method agrees well with current experimental knowledge and we use it to investigate several families of genes that demonstrate the complexity of functional analogy.
Affiliation(s)
- Maria D. Chikina
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Olga G. Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
|
124
|
Karathia H, Vilaprinyo E, Sorribas A, Alves R. Saccharomyces cerevisiae as a model organism: a comparative study. PLoS One 2011; 6:e16015. [PMID: 21311596 PMCID: PMC3032731 DOI: 10.1371/journal.pone.0016015] [Citation(s) in RCA: 95] [Impact Index Per Article: 7.3] [Received: 09/03/2010] [Accepted: 12/03/2010] [Indexed: 02/04/2023] Open
Abstract
BACKGROUND: Model organisms are used for research because they provide a framework on which to develop and optimize methods that facilitate and standardize analysis. Such organisms should be representative of the living beings for which they are to serve as proxy. However, in practice, a model organism is often selected ad hoc, and without considering its representativeness, because a systematic and rational method to include this consideration in the selection process is still lacking. METHODOLOGY/PRINCIPAL FINDINGS: In this work we propose such a method and apply it in a pilot study of strengths and limitations of Saccharomyces cerevisiae as a model organism. The method relies on the functional classification of proteins into different biological pathways and processes and on full proteome comparisons between the putative model organism and other organisms for which we would like to extrapolate results. Here we compare S. cerevisiae to 704 other organisms from various phyla. For each organism, our results identify the pathways and processes for which S. cerevisiae is predicted to be a good model to extrapolate from. We find that animals in general and Homo sapiens in particular are some of the non-fungal organisms for which S. cerevisiae is likely to be a good model in which to study a significant fraction of common biological processes. We validate our approach by correctly predicting which organisms are phenotypically more distant from S. cerevisiae with respect to several different biological processes. CONCLUSIONS/SIGNIFICANCE: The method we propose could be used to choose appropriate substitute model organisms for the study of biological processes in other species that are harder to study. For example, one could identify appropriate models to study either pathologies in humans or specific biological processes in species with a long development time, such as plants.
Affiliation(s)
- Hiren Karathia
- Departament Ciències Mèdiques Bàsiques, Universitat de Lleida & IRBLleida, Lleida, Spain
- Ester Vilaprinyo
- Evaluation and Clinical Epidemiology Department, Hospital del Mar-IMIM, Barcelona, Spain
- Albert Sorribas
- Departament Ciències Mèdiques Bàsiques, Universitat de Lleida & IRBLleida, Lleida, Spain
- Rui Alves
- Departament Ciències Mèdiques Bàsiques, Universitat de Lleida & IRBLleida, Lleida, Spain
|
125
|
Hu L, Huang T, Shi X, Lu WC, Cai YD, Chou KC. Predicting functions of proteins in mouse based on weighted protein-protein interaction network and protein hybrid properties. PLoS One 2011; 6:e14556. [PMID: 21283518 PMCID: PMC3023709 DOI: 10.1371/journal.pone.0014556] [Citation(s) in RCA: 130] [Impact Index Per Article: 10.0] [Received: 06/20/2010] [Accepted: 12/21/2010] [Indexed: 11/27/2022] Open
Abstract
Background: With the huge number of uncharacterized protein sequences generated in the post-genomic age, it is highly desirable to develop effective computational methods for quickly and accurately predicting their functions. The information thus obtained would be very useful for both basic research and drug development in a timely manner. Methodology/Principal Findings: Although many efforts have been made in this regard, most of them were based on either sequence similarity or protein-protein interaction (PPI) information. However, the former often fails if a query protein has little or no sequence similarity to any protein of known function, while the latter has a similar problem if the relevant PPI information is not available. In view of this, a new approach is proposed by hybridizing the PPI information and the biochemical/physicochemical features of protein sequences. The overall first-order success rates by the new predictor for the functions of mouse proteins on training set and test set were 69.1% and 70.2%, respectively, and the success rate covered by the results of the top-4 order from a total of 24 orders was 65.2%. Conclusions/Significance: The results indicate that the new approach is quite promising and may open a new avenue for addressing this difficult and complicated problem.
Affiliation(s)
- Lele Hu
- Institute of Systems Biology, Shanghai University, Shanghai, China
- Department of Chemistry, College of Sciences, Shanghai University, Shanghai, China
- Tao Huang
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Shanghai Center for Bioinformation Technology, Shanghai, China
- Xiaohe Shi
- Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences and Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Wen-Cong Lu
- Department of Chemistry, College of Sciences, Shanghai University, Shanghai, China
- Yu-Dong Cai
- Institute of Systems Biology, Shanghai University, Shanghai, China
- Centre for Computational Systems Biology, Fudan University, Shanghai, China
- Gordon Life Science Institute, San Diego, California, United States of America
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, California, United States of America
|
126
|
Abstract
A large number of genome-scale networks, including protein-protein and genetic interaction networks, are now available for several organisms. In parallel, many studies have focused on analyzing, characterizing, and modeling these networks. Beyond investigating the topological characteristics such as degree distribution, clustering coefficient, and average shortest-path distance, another area of particular interest is the prediction of nodes (genes) with a given characteristic (labels) - for example prediction of genes that cause a particular phenotype or have a given function. In this chapter, we describe methods and algorithms for predicting node labels from network-based datasets with an emphasis on label propagation algorithms (LPAs) and their relation to local neighborhood methods.
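As a rough illustration of what a label propagation algorithm does, the sketch below implements one common iterative variant, which is not necessarily the formulation used in the chapter. Each node's score is a blend of its seed label and the weighted average of its neighbors' scores. The function name `propagate_labels`, the parameter `alpha`, and the toy network are assumptions made for this example.

```python
def propagate_labels(adjacency, seed_labels, alpha=0.8, iterations=50):
    """Iteratively spread seed label scores over a weighted, undirected graph.

    adjacency:   dict mapping node -> {neighbor: edge weight}
    seed_labels: dict mapping node -> initial score (1.0 for known positives)
    alpha:       share of each updated score taken from the neighbors
    """
    scores = {node: seed_labels.get(node, 0.0) for node in adjacency}
    for _ in range(iterations):
        updated = {}
        for node, neighbors in adjacency.items():
            total_weight = sum(neighbors.values())
            if total_weight > 0:
                neighbor_avg = sum(
                    w * scores[nb] for nb, w in neighbors.items()
                ) / total_weight
            else:
                neighbor_avg = 0.0
            # Blend the fixed seed label with the neighborhood average.
            updated[node] = (
                (1 - alpha) * seed_labels.get(node, 0.0) + alpha * neighbor_avg
            )
        scores = updated
    return scores


# Hypothetical five-gene network; only gene "A" carries a known label.
toy_network = {
    "A": {"B": 1.0, "C": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0, "D": 1.0},
    "D": {"C": 1.0, "E": 1.0},
    "E": {"D": 1.0},
}
scores = propagate_labels(toy_network, {"A": 1.0})
ranking = sorted(scores, key=scores.get, reverse=True)
```

With `alpha` set to 0, the scores reduce to the seed labels. A single iteration with `alpha` near 1 approximates a local neighborhood vote, which is one way to see the relation between the two families of methods mentioned in the abstract.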
Affiliation(s)
- Sara Mostafavi
- Department of Computer Science, Centre for Cellular and Biomolecular Research (CCBR), University of Toronto, Toronto, ON, Canada
|
127
|
Kourmpetis YA, van Dijk AD, van Ham RC, ter Braak CJ. Genome-wide computational function prediction of Arabidopsis proteins by integration of multiple data sources. Plant Physiol 2011; 155:271-81. [PMID: 21098674 PMCID: PMC3075770 DOI: 10.1104/pp.110.162164] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 05/03/2023]
Abstract
Although Arabidopsis (Arabidopsis thaliana) is the best studied plant species, the biological role of one-third of its proteins is still unknown. We developed a probabilistic protein function prediction method that integrates information from sequences, protein-protein interactions, and gene expression. The method was applied to proteins from Arabidopsis. Evaluation of prediction performance showed that our method has improved performance compared with single source-based prediction approaches and two existing integration approaches. An innovative feature of our method is that it enables transfer of functional information between proteins that are not directly associated with each other. We provide novel function predictions for 5,807 proteins. Recent experimental studies confirmed several of the predictions. We highlight these in detail for proteins predicted to be involved in flowering and floral organ development.
|
128
|
Jiang X, Gold D, Kolaczyk ED. Network-based auto-probit modeling for protein function prediction. Biometrics 2010; 67:958-66. [PMID: 21133881 DOI: 10.1111/j.1541-0420.2010.01519.x] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/26/2022]
Abstract
Predicting the functional roles of proteins based on various genome-wide data, such as protein-protein association networks, has become a canonical problem in computational biology. Approaching this task as a binary classification problem, we develop a network-based extension of the spatial auto-probit model. In particular, we develop a hierarchical Bayesian probit-based framework for modeling binary network-indexed processes, with a latent multivariate conditional autoregressive Gaussian process. The latter allows for the easy incorporation of protein-protein association network topologies (either binary or weighted) in modeling protein functional similarity. We use this framework to predict protein functions, for functions defined as terms in the Gene Ontology (GO) database, a popular rigorous vocabulary for biological functionality. Furthermore, we show how a natural extension of this framework can be used to model and correct for the high percentage of false negative labels in training data derived from GO, a serious shortcoming endemic to biological databases of this type. Our method's performance is evaluated and compared with standard algorithms on weighted yeast protein-protein association networks, extracted from a recently developed integrative database called Search Tool for the Retrieval of INteracting Genes/proteins (STRING). Results show that our basic method is competitive with these other methods, and that the extended method, which incorporates the uncertainty in negative labels among the training data, can yield nontrivial improvements in predictive accuracy.
Affiliation(s)
- Xiaoyu Jiang
- Boehringer Ingelheim Pharmaceuticals, Inc., 900 Ridgebury Road, Ridgefield, Connecticut 06877, USA
|
129
|
Santoni FA, Hartley O, Luban J. Deciphering the code for retroviral integration target site selection. PLoS Comput Biol 2010; 6:e1001008. [PMID: 21124862 PMCID: PMC2991247 DOI: 10.1371/journal.pcbi.1001008] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.7] [Received: 05/20/2010] [Accepted: 10/25/2010] [Indexed: 01/17/2023] Open
Abstract
Upon cell invasion, retroviruses generate a DNA copy of their RNA genome and integrate retroviral cDNA within host chromosomal DNA. Integration occurs throughout the host cell genome, but target site selection is not random. Each subgroup of retrovirus is distinguished from the others by attraction to particular features on chromosomes. Despite extensive efforts to identify host factors that interact with retrovirion components or chromosome features predictive of integration, little is known about how integration sites are selected. We attempted to identify markers predictive of retroviral integration by exploiting Precision-Recall methods for extracting information from highly skewed datasets to derive robust and discriminating measures of association. ChIPSeq datasets for more than 60 factors were compared with 14 retroviral integration datasets. When compared with MLV, PERV or XMRV integration sites, strong association was observed with STAT1, acetylation of H3 and H4 at several positions, and methylation of H2AZ, H3K4, and K9. By combining peaks from ChIPSeq datasets, a supermarker was identified that localized within 2 kB of 75% of MLV proviruses and detected differences in integration preferences among different cell types. The supermarker predicted the likelihood of integration within specific chromosomal regions in a cell-type specific manner, yielding probabilities for integration into proto-oncogene LMO2 identical to experimentally determined values. The supermarker thus identifies chromosomal features highly favored for retroviral integration, provides clues to the mechanism by which retrovirus integration sites are selected, and offers a tool for predicting cell-type specific proto-oncogene activation by retroviruses. When HIV-1, murine leukemia virus (MLV), or other retroviruses infect a cell, the virus generates a DNA copy of the viral RNA genome and ligates the cDNA within host chromosomal DNA. 
This integration reaction occurs at sites throughout the host cell genome, but little is known about how integration sites are selected. We attempted to identify markers predictive of retroviral integration by comparing the genome-wide binding sites for more than 60 factors with 14 retroviral integration datasets. We borrowed Precision-Recall methods from the Information Retrieval field for extracting information from highly skewed datasets such as these. For MLV and other gammaretroviruses, strong association was observed with STAT1, acetylation of H3 and H4 at several positions, and methylation of H2AZ, H3K4, and K9. We generated a supermarker by combining high scoring markers. The supermarker localized within 2 kB of 75% of MLV proviruses and predicted the likelihood of integration within specific chromosomal regions in a cell-type specific manner. This study identified chromosomal features highly favored for retroviral integration. It also provides clues to the mechanism by which retrovirus integration sites are selected, and offers a tool for predicting cell-type specific proto-oncogene activation by retroviruses.
Affiliation(s)
- Federico Andrea Santoni
- Department of Microbiology and Molecular Medicine, University of Geneva, Geneva, Switzerland
- Swiss Institute of Bioinformatics, University of Geneva, Geneva, Switzerland
- Center for Advanced Studies, Research, and Development in Sardinia, Pula, Italy
- Oliver Hartley
- Department of Structural Biology and Bioinformatics, University of Geneva, Geneva, Switzerland
- Jeremy Luban
- Department of Microbiology and Molecular Medicine, University of Geneva, Geneva, Switzerland
|
130
|
Nian Chua H. Prediction of Protein Function. Genomics 2010. [DOI: 10.1002/9780470711675.ch9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
|
131
|
Freudenberg JM, Sivaganesan S, Phatak M, Shinde K, Medvedovic M. Generalized random set framework for functional enrichment analysis using primary genomics datasets. Bioinformatics 2010; 27:70-7. [PMID: 20971985 DOI: 10.1093/bioinformatics/btq593] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 01/01/2023]
Abstract
MOTIVATION: Functional enrichment analysis using primary genomics datasets is an emerging approach to complement established methods for functional enrichment based on predefined lists of functionally related genes. Currently used methods depend on creating lists of 'significant' and 'non-significant' genes based on ad hoc significance cutoffs. This can lead to loss of statistical power and can introduce biases affecting the interpretation of experimental results. RESULTS: We developed and validated a new statistical framework, generalized random set (GRS) analysis, for comparing the genomic signatures in two datasets without the need for gene categorization. In our tests, GRS produced correct measures of statistical significance, and it showed dramatic improvement in the statistical power over other methods currently used in this setting. We also developed a procedure for identifying genes driving the concordance of the genomics profiles and demonstrated a dramatic improvement in functional coherence of genes identified in such analysis. AVAILABILITY: GRS can be downloaded as part of the R package CLEAN from http://ClusterAnalysis.org/. An online implementation is available at http://GenomicsPortals.org/.
Affiliation(s)
- Johannes M Freudenberg
- Department of Environmental Health, University of Cincinnati College of Medicine, Cincinnati, OH 45267, USA
|
132
|
Wang PI, Marcotte EM. It's the machine that matters: Predicting gene function and phenotype from protein networks. J Proteomics 2010; 73:2277-89. [PMID: 20637909 PMCID: PMC2953423 DOI: 10.1016/j.jprot.2010.07.005] [Citation(s) in RCA: 102] [Impact Index Per Article: 7.3] [Received: 03/22/2010] [Revised: 06/22/2010] [Accepted: 07/07/2010] [Indexed: 12/17/2022]
Abstract
Increasing knowledge about the organization of proteins into complexes, systems, and pathways has led to a flowering of theoretical approaches for exploiting this knowledge in order to better learn the functions of proteins and their roles underlying phenotypic traits and diseases. Much of this body of theory has been developed and tested in model organisms, relying on their relative simplicity and genetic and biochemical tractability to accelerate the research. In this review, we discuss several of the major approaches for computationally integrating proteomics and genomics observations into integrated protein networks, then applying guilt-by-association in these networks in order to identify genes underlying traits. Recent trends in this field include a rising appreciation of the modular network organization of proteins underlying traits or mutational phenotypes, and how to exploit such protein modularity using computational approaches related to the internet search algorithm PageRank. Many protein network-based predictions have recently been experimentally confirmed in yeast, worms, plants, and mice, and several successful approaches in model organisms have been directly translated to analyze human disease, with notable recent applications to glioma and breast cancer prognosis.
Affiliation(s)
- Peggy I Wang
- Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, TX 78712-1064, USA.
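The PageRank-related guilt-by-association idea mentioned in the abstract above can be illustrated with a random walk with restart. This is a minimal sketch under stated assumptions, not code from the review: the function name, the restart probability of 0.3, and the toy gene labels A to E are all hypothetical.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, iterations=100):
    """Propagate evidence from seed genes through a weighted, undirected network.

    adjacency: {gene: {neighbor: edge_weight}}; seeds: set of known trait genes.
    Returns {gene: score}; higher scores mean closer to the seeds.
    """
    nodes = sorted(adjacency)
    # Restart distribution: uniform over the seed genes.
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(iterations):
        # Every step, a fraction `restart` of the walker jumps back to the seeds.
        updated = {n: restart * start[n] for n in nodes}
        for gene in nodes:
            total = sum(adjacency[gene].values())
            if total == 0:
                continue  # isolated gene: nothing to pass on
            for neighbor, weight in adjacency[gene].items():
                updated[neighbor] += (1 - restart) * scores[gene] * weight / total
        scores = updated
    return scores


# Hypothetical toy network: A and B are the known trait genes.
network = {
    "A": {"B": 1.0, "C": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
    "E": {},
}
scores = random_walk_with_restart(network, seeds={"A", "B"})
candidates = sorted((g for g in network if g not in {"A", "B"}),
                    key=lambda g: scores[g], reverse=True)
print(candidates)
```

In this toy network, gene C touches both seeds and ranks first among the candidates, D is reachable only through C and ranks second, and the isolated gene E receives no score.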
|
133
|
Montojo J, Zuberi K, Rodriguez H, Kazi F, Wright G, Donaldson SL, Morris Q, Bader GD. GeneMANIA Cytoscape plugin: fast gene function predictions on the desktop. Bioinformatics 2010; 26:2927-8. [PMID: 20926419 PMCID: PMC2971582 DOI: 10.1093/bioinformatics/btq562] [Citation(s) in RCA: 457] [Impact Index Per Article: 32.6] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
Summary: The GeneMANIA Cytoscape plugin brings fast gene function prediction capabilities to the desktop. GeneMANIA identifies the most related genes to a query gene set using a guilt-by-association approach. The plugin uses over 800 networks from six organisms and each related gene is traceable to the source network used to make the prediction. Users may add their own interaction networks and expression profile data to complement or override the default data. Availability and Implementation: The GeneMANIA Cytoscape plugin is implemented in Java and is freely available at http://www.genemania.org/plugin/. Contact: gary.bader@utoronto.ca; quaid.morris@utoronto.ca
Affiliation(s)
- J Montojo
- Banting and Best Department of Medical Research, The Donnelly Centre, University of Toronto, 160 College Street, Toronto, ON, M5S 3E1, Canada
|
134
|
Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, Franz M, Grouios C, Kazi F, Lopes CT, Maitland A, Mostafavi S, Montojo J, Shao Q, Wright G, Bader GD, Morris Q. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res 2010; 38:W214-20. [PMID: 20576703 PMCID: PMC2896186 DOI: 10.1093/nar/gkq537] [Citation(s) in RCA: 2922] [Impact Index Per Article: 208.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
GeneMANIA (http://www.genemania.org) is a flexible, user-friendly web interface for generating hypotheses about gene function, analyzing gene lists and prioritizing genes for functional assays. Given a query list, GeneMANIA extends the list with functionally similar genes that it identifies using available genomics and proteomics data. GeneMANIA also reports weights that indicate the predictive value of each selected data set for the query. Six organisms are currently supported (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus, Homo sapiens and Saccharomyces cerevisiae) and hundreds of data sets have been collected from GEO, BioGRID, Pathway Commons and I2D, as well as organism-specific functional genomics data sets. Users can select arbitrary subsets of the data sets associated with an organism to perform their analyses and can upload their own data sets to analyze. The GeneMANIA algorithm performs as well as or better than other gene function prediction methods on yeast and mouse benchmarks. The high accuracy of the GeneMANIA prediction algorithm, an intuitive user interface and large database make GeneMANIA a useful tool for any biologist.
Affiliation(s)
- David Warde-Farley
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
|
135
|
Tedder PMR, Bradford JR, Needham CJ, McConkey GA, Bulpitt AJ, Westhead DR. Gene function prediction using semantic similarity clustering and enrichment analysis in the malaria parasite Plasmodium falciparum. Bioinformatics 2010; 26:2431-7. [PMID: 20693320 DOI: 10.1093/bioinformatics/btq450] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
MOTIVATION Functional genomics data provides a rich source of information that can be used in the annotation of the thousands of genes of unknown function found in most sequenced genomes. However, previous gene function prediction programs are mostly produced for relatively well-annotated organisms that often have a large amount of functional genomics data. Here, we present a novel method for predicting gene function that uses clustering of genes by semantic similarity, a naïve Bayes classifier and 'enrichment analysis' to predict gene function for a genome that is less well annotated but does have a severe effect on human health, that of the malaria parasite Plasmodium falciparum. RESULTS Predictions for the molecular function, biological process and cellular component of P. falciparum genes were created from eight different datasets with a combined prediction also being produced. The high-confidence predictions produced by the combined prediction were compared to those produced by a simple K-nearest neighbour classifier approach and were shown to improve accuracy and coverage. Finally, two case studies are described, which investigate two biological processes in more detail, that of translation initiation and invasion of the host cell. AVAILABILITY Predictions produced are available at http://www.bioinformatics.leeds.ac.uk/~bio5pmrt/PAGODA.
Affiliation(s)
- Philip M R Tedder
- Institute of Molecular and Cellular Biology, University of Leeds, Leeds, UK
|
136
|
Sokolov A, Ben-Hur A. Hierarchical classification of gene ontology terms using the GOstruct method. J Bioinform Comput Biol 2010; 8:357-76. [PMID: 20401950 DOI: 10.1142/s0219720010004744] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2009] [Revised: 11/08/2009] [Accepted: 11/08/2009] [Indexed: 11/18/2022]
Abstract
Protein function prediction is an active area of research in bioinformatics. Yet, the transfer of annotation on the basis of sequence or structural similarity remains widely used as an annotation method. Most of today's machine learning approaches reduce the problem to a collection of binary classification problems: whether a protein performs a particular function, sometimes with a post-processing step to combine the binary outputs. We propose a method that directly predicts a full functional annotation of a protein by modeling the structure of the Gene Ontology hierarchy in the framework of kernel methods for structured-output spaces. Our empirical results show improved performance over a BLAST nearest-neighbor method, and over algorithms that employ a collection of binary classifiers as measured on the Mousefunc benchmark dataset.
Affiliation(s)
- Artem Sokolov
- Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA.
|
137
|
Beaver JE, Tasan M, Gibbons FD, Tian W, Hughes TR, Roth FP. FuncBase: a resource for quantitative gene function annotation. Bioinformatics 2010; 26:1806-7. [PMID: 20495000 PMCID: PMC2894510 DOI: 10.1093/bioinformatics/btq265] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2010] [Revised: 04/17/2010] [Accepted: 05/16/2010] [Indexed: 11/14/2022] Open
Abstract
SUMMARY Computational gene function prediction can serve to focus experimental resources on high-priority experimental tasks. FuncBase is a web resource for viewing quantitative machine learning-based gene function annotations. Quantitative annotations of genes, including fungal and mammalian genes, with Gene Ontology terms are accompanied by a community feedback system. Evidence underlying function annotations is shown. For example, a custom Cytoscape viewer shows functional linkage graphs relevant to the gene or function of interest. FuncBase provides links to external resources, and may be accessed directly or via links from species-specific databases. AVAILABILITY FuncBase as well as all underlying data and annotations are freely available via http://func.med.harvard.edu/
Affiliation(s)
- John E Beaver
- Department of Biological Chemistry & Molecular Pharmacology, Harvard Medical School, Boston, MA 02115, USA
|
138
|
Hu J, Wan J, Hackler L, Zack DJ, Qian J. Computational analysis of tissue-specific gene networks: application to murine retinal functional studies. Bioinformatics 2010; 26:2289-97. [PMID: 20616386 DOI: 10.1093/bioinformatics/btq408] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023]
Abstract
MOTIVATION The vertebrate retina is a complex neuronal tissue, and its development, normal functioning and response to injury and disease are subject to a variety of genetic factors. To understand better the regulatory and functional relationships between the genes expressed within the retina, we constructed an interactive gene network of the mouse retina by applying a Bayesian statistics approach to information derived from a variety of gene expression, protein-protein interaction and gene ontology annotation databases. RESULTS The network contains 673 retina-related genes. Most of them are obtained through manual literature-based curation, while the others are the genes preferentially expressed in the retina. These retina-related genes are linked by 3403 potential functional associations in the network. The prediction of gene functional associations using the Bayesian approach outperforms predictions using only one source of information. The network includes five major gene clusters, each enriched in different biological activities. There are several applications of this network. First, we identified approximately 50 hub genes that are predicted to play particularly important roles in the function of the retina. Some of them are not yet well studied. Second, we can predict novel gene functions using the 'guilt by association' method. Third, we also predicted novel retinal disease-associated genes based on the network analysis. AVAILABILITY To provide easy access to the retinal network, we constructed an interactive web tool, named MoReNet, which is available at http://bioinfo.wilmer.jhu.edu/morenet/.
Affiliation(s)
- Jianfei Hu
- Wilmer Institute, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
|
139
|
Lee I, Lehner B, Vavouri T, Shin J, Fraser AG, Marcotte EM. Predicting genetic modifier loci using functional gene networks. Genome Res 2010; 20:1143-53. [PMID: 20538624 DOI: 10.1101/gr.102749.109] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
Abstract
Most phenotypes are genetically complex, with contributions from mutations in many different genes. Mutations in more than one gene can combine synergistically to cause phenotypic change, and systematic studies in model organisms show that these genetic interactions are pervasive. However, in human association studies such nonadditive genetic interactions are very difficult to identify because of a lack of statistical power--simply put, the number of potential interactions is too vast. One approach to resolve this is to predict candidate modifier interactions between loci, and then to specifically test these for associations with the phenotype. Here, we describe a general method for predicting genetic interactions based on the use of integrated functional gene networks. We show that in both Saccharomyces cerevisiae and Caenorhabditis elegans a single high-coverage, high-quality functional network can successfully predict genetic modifiers for the majority of genes. For C. elegans we also describe the construction of a new, improved, and expanded functional network, WormNet 2. Using this network we demonstrate how it is possible to rapidly expand the number of modifier loci known for a gene, predicting and validating new genetic interactions for each of three signal transduction genes. We propose that this approach, termed network-guided modifier screening, provides a general strategy for predicting genetic interactions. This work thus suggests that a high-quality integrated human gene network will provide a powerful resource for modifier locus discovery in many different diseases.
Affiliation(s)
- Insuk Lee
- Department of Biotechnology, College of Life science and Biotechnology, Yonsei University, Seodaemun-ku, Seoul 120-749, South Korea.
|
140
|
Mostafavi S, Morris Q. Fast integration of heterogeneous data sources for predicting gene function with limited annotation. Bioinformatics 2010; 26:1759-65. [PMID: 20507895 PMCID: PMC2894508 DOI: 10.1093/bioinformatics/btq262] [Citation(s) in RCA: 102] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
Motivation: Many algorithms that integrate multiple functional association networks for predicting gene function construct a composite network as a weighted sum of the individual networks and then use the composite network to predict gene function. The weight assigned to an individual network represents the usefulness of that network in predicting a given gene function. However, because many categories of gene function have a small number of annotations, the process of assigning these network weights is prone to overfitting. Results: Here, we address this problem by proposing a novel approach to combining multiple functional association networks. In particular, we present a method where network weights are simultaneously optimized on sets of related function categories. The method is simpler and faster than existing approaches. Further, we show that it produces composite networks with improved function prediction accuracy using five example species (yeast, mouse, fly, Escherichia coli and human). Availability: Networks and code are available from: http://morrislab.med.utoronto.ca/~sara/SW Contact: smostafavi@cs.toronto.edu; quaid.morris@utoronto.ca Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sara Mostafavi
- Department of Computer Science and Center for Cellular and Biomolecular Research, University of Toronto, Canada.
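The composite network described in the abstract above, a weighted sum of individual functional association networks, can be sketched as follows. This is only an illustration under stated assumptions: the function name, the edge-dictionary representation, the gene labels, and the weights 0.5 and 0.25 are hypothetical, and the weights are supplied by hand rather than optimized as in the paper.

```python
def composite_network(networks, weights):
    """Combine several networks into one as a weighted sum of edge weights.

    networks: list of {(gene_u, gene_v): edge_weight} dicts (undirected edges).
    weights:  one non-negative weight per network (assumed given here).
    """
    combined = {}
    for network, alpha in zip(networks, weights):
        for edge, value in network.items():
            key = tuple(sorted(edge))  # treat (u, v) and (v, u) as the same edge
            combined[key] = combined.get(key, 0.0) + alpha * value
    return combined


# Hypothetical inputs: a co-expression network and a physical-interaction network.
coexpression = {("g1", "g2"): 0.8, ("g2", "g3"): 0.4}
interactions = {("g2", "g1"): 1.0, ("g3", "g4"): 1.0}
combined = composite_network([coexpression, interactions], weights=[0.5, 0.25])
for edge in sorted(combined):
    print(edge, round(combined[edge], 3))
```

An edge supported by both sources, such as g1-g2, accumulates evidence from each network in proportion to that network's weight, which is why the choice of weights matters for prediction quality.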
|
141
|
Shikano T, Ramadevi J, Shimada Y, Merilä J. Utility of sequenced genomes for microsatellite marker development in non-model organisms: a case study of functionally important genes in nine-spined sticklebacks (Pungitius pungitius). BMC Genomics 2010; 11:334. [PMID: 20507571 PMCID: PMC2891615 DOI: 10.1186/1471-2164-11-334] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2010] [Accepted: 05/27/2010] [Indexed: 12/04/2022] Open
Abstract
Background Identification of genes involved in adaptation and speciation by targeting specific genes of interest has become a plausible strategy also for non-model organisms. We investigated the potential utility of available sequenced fish genomes to develop microsatellite (cf. simple sequence repeat, SSR) markers for functionally important genes in nine-spined sticklebacks (Pungitius pungitius), as well as cross-species transferability of SSR primers from three-spined (Gasterosteus aculeatus) to nine-spined sticklebacks. In addition, we examined the patterns and degree of SSR conservation between these species using their aligned sequences. Results Cross-species amplification success was lower for SSR markers located in or around functionally important genes (27 out of 158) than for those randomly derived from genomic (35 out of 101) and cDNA (35 out of 87) libraries. Polymorphism was observed at a large proportion (65%) of the cross-amplified loci independently of SSR type. To develop SSR markers for functionally important genes in nine-spined sticklebacks, SSR locations were surveyed in or around 67 target genes based on the three-spined stickleback genome and these regions were sequenced with primers designed from conserved sequences in sequenced fish genomes. Out of the 81 SSRs identified in the sequenced regions (44,084 bp), 57 exhibited the same motifs at the same locations as in the three-spined stickleback. Di- and trinucleotide SSRs appeared to be highly conserved whereas mononucleotide SSRs were less so. Species-specific primers were designed to amplify 58 SSRs using the sequences of nine-spined sticklebacks. Conclusions Our results demonstrated that a large proportion of SSRs are conserved in the species that have diverged more than 10 million years ago. Therefore, the three-spined stickleback genome can be used to predict SSR locations in the nine-spined stickleback genome. 
While cross-species utility of SSR primers is limited due to low amplification success, SSR markers can be developed for target genes and genomic regions using our approach, which should be also applicable to other non-model organisms. The SSR markers developed in this study should be useful for identification of genes responsible for phenotypic variation and adaptive divergence of nine-spined stickleback populations, as well as for constructing comparative gene maps of nine-spined and three-spined sticklebacks.
Affiliation(s)
- Takahito Shikano
- Ecological Genetics Research Unit, Department of Biosciences, University of Helsinki, P.O. Box 65, FI-00014, Helsinki, Finland.
|
142
|
Affiliation(s)
- Curtis Huttenhower
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America.
|
143
|
Bayesian approach to transforming public gene expression repositories into disease diagnosis databases. Proc Natl Acad Sci U S A 2010; 107:6823-8. [PMID: 20360561 DOI: 10.1073/pnas.0912043107] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023] Open
Abstract
The rapid accumulation of gene expression data has offered unprecedented opportunities to study human diseases. The National Center for Biotechnology Information Gene Expression Omnibus is currently the largest database that systematically documents the genome-wide molecular basis of diseases. However, thus far, this resource has been far from fully utilized. This paper describes the first study to transform public gene expression repositories into an automated disease diagnosis database. Particularly, we have developed a systematic framework, including a two-stage Bayesian learning approach, to achieve the diagnosis of one or multiple diseases for a query expression profile along a hierarchical disease taxonomy. Our approach, including standardizing cross-platform gene expression data and heterogeneous disease annotations, allows analyzing both sources of information in a unified probabilistic system. A high level of overall diagnostic accuracy was shown by cross validation. It was also demonstrated that the power of our method can increase significantly with the continued growth of public gene expression repositories. Finally, we showed how our disease diagnosis system can be used to characterize complex phenotypes and to construct a disease-drug connectivity map.
|
144
|
Guan Y, Dunham M, Caudy A, Troyanskaya O. Systematic planning of genome-scale experiments in poorly studied species. PLoS Comput Biol 2010; 6:e1000698. [PMID: 20221257 PMCID: PMC2832676 DOI: 10.1371/journal.pcbi.1000698] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2009] [Accepted: 01/30/2010] [Indexed: 01/02/2023] Open
Abstract
Genome-scale datasets have been used extensively in model organisms to screen for specific candidates or to predict functions for uncharacterized genes. However, despite the availability of extensive knowledge in model organisms, the planning of genome-scale experiments in poorly studied species is still based on the intuition of experts or heuristic trials. We propose that computational and systematic approaches can be applied to drive the experiment planning process in poorly studied species based on available data and knowledge in closely related model organisms. In this paper, we suggest a computational strategy for recommending genome-scale experiments based on their capability to interrogate diverse biological processes to enable protein function assignment. To this end, we use the data-rich functional genomics compendium of the model organism to quantify the accuracy of each dataset in predicting each specific biological process and the overlap in such coverage between different datasets. Our approach uses an optimized combination of these quantifications to recommend an ordered list of experiments for accurately annotating most proteins in the poorly studied related organisms to most biological processes, as well as a set of experiments that target each specific biological process. The effectiveness of this experiment-planning system is demonstrated for two related yeast species: the model organism Saccharomyces cerevisiae and the comparatively poorly studied Saccharomyces bayanus. Our system recommended a set of S. bayanus experiments based on an S. cerevisiae microarray data compendium. In silico evaluations estimate that less than 10% of the experiments could achieve similar functional coverage to the whole microarray compendium. This estimation was confirmed by performing the recommended experiments in S. bayanus, therefore significantly reducing the labor devoted to characterizing the poorly studied genome.
This experiment-planning framework could readily be adapted to the design of other types of large-scale experiments as well as other groups of organisms. Microarray expression experiments allow fast functional profiling of an organism's entire genome and significant efforts are devoted to analyzing the resulting data. Available genome sequences are also increasing quickly. However, it is unexplored how to use available functional genomics data to direct large-scale experiments in newly sequenced but poorly studied species. In this paper, we propose a strategy to systematically plan experimental treatments in the poorly studied species based on their model organism relatives. We consider both the accuracy of the datasets in capturing different biological processes and the redundancy between datasets. Quantifying the above information allows us to recommend a list of experimental treatments. We demonstrate the efficacy of this approach by designing, performing and evaluating S. bayanus microarray experiments using an available S. cerevisiae data repository. We show that this systematic planning process could reduce the labor in doing microarray experiments by 10 fold and achieve similar functional coverage.
Affiliation(s)
- Yuanfang Guan
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Maitreya Dunham
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Amy Caudy
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Olga Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
|
145
|
Kourmpetis YAI, van Dijk ADJ, Bink MCAM, van Ham RCHJ, ter Braak CJF. Bayesian Markov Random Field analysis for protein function prediction based on network data. PLoS One 2010; 5:e9293. [PMID: 20195360 PMCID: PMC2827541 DOI: 10.1371/journal.pone.0009293] [Citation(s) in RCA: 78] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2009] [Accepted: 01/15/2010] [Indexed: 01/02/2023] Open
Abstract
Inference of protein functions is one of the most important aims of modern biology. To fully exploit the large volumes of genomic data typically produced in modern-day genomic experiments, automated computational methods for protein function prediction are urgently needed. Established methods use sequence or structure similarity to infer functions but those types of data do not suffice to determine the biological context in which proteins act. Current high-throughput biological experiments produce large amounts of data on the interactions between proteins. Such data can be used to infer interaction networks and to predict the biological process that the protein is involved in. Here, we develop a probabilistic approach for protein function prediction using network data, such as protein-protein interaction measurements. We take a Bayesian approach to an existing Markov Random Field method by performing simultaneous estimation of the model parameters and prediction of protein functions. We use an adaptive Markov Chain Monte Carlo algorithm that leads to more accurate parameter estimates and consequently to improved prediction performance compared to the standard Markov Random Fields method. We tested our method using a high quality S. cerevisiae validation network with 1622 proteins against 90 Gene Ontology terms of different levels of abstraction. Compared to three other protein function prediction methods, our approach shows very good prediction performance. Our method can be directly applied to protein-protein interaction or coexpression networks, but also can be extended to use multiple data sources. We apply our method to physical protein interaction data from S. cerevisiae and provide novel predictions, using 340 Gene Ontology terms, for 1170 unannotated proteins and we evaluate the predictions using the available literature.
Affiliation(s)
- Aalt D. J. van Dijk
- Applied Bioinformatics, Plant Research International, Wageningen, The Netherlands
- Marco C. A. M. Bink
- Biometris, Wageningen University and Research Centre, Wageningen, The Netherlands
- Roeland C. H. J. van Ham
- Applied Bioinformatics, Plant Research International, Wageningen, The Netherlands
- Laboratory of Bioinformatics, Wageningen University, Wageningen, The Netherlands
- Cajo J. F. ter Braak
- Biometris, Wageningen University and Research Centre, Wageningen, The Netherlands
|
146
|
Genomics Portals: integrative web-platform for mining genomics data. BMC Genomics 2010; 11:27. [PMID: 20070909 PMCID: PMC2824719 DOI: 10.1186/1471-2164-11-27] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2009] [Accepted: 01/13/2010] [Indexed: 12/21/2022] Open
Abstract
Background A large amount of experimental data generated by modern high-throughput technologies is available through various public repositories. Our knowledge about molecular interaction networks, functional biological pathways and transcriptional regulatory modules is rapidly expanding, and is being organized in lists of functionally related genes. Jointly, these two sources of information hold a tremendous potential for gaining new insights into functioning of living systems. Results Genomics Portals platform integrates access to an extensive knowledge base and a large database of human, mouse, and rat genomics data with basic analytical visualization tools. It provides the context for analyzing and interpreting new experimental data and the tool for effective mining of a large number of publicly available genomics datasets stored in the back-end databases. The uniqueness of this platform lies in the volume and the diversity of genomics data that can be accessed and analyzed (gene expression, ChIP-chip, ChIP-seq, epigenomics, computationally predicted binding sites, etc), and the integration with an extensive knowledge base that can be used in such analysis. Conclusion The integrated access to primary genomics data, functional knowledge and analytical tools makes Genomics Portals platform a unique tool for interpreting results of new genomics experiments and for mining the vast amount of data stored in the Genomics Portals backend databases. Genomics Portals can be accessed and used freely at http://GenomicsPortals.org.
|
147
|
Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics 2010; 11:2. [PMID: 20044933 PMCID: PMC2824675 DOI: 10.1186/1471-2105-11-2] [Citation(s) in RCA: 111] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2009] [Accepted: 01/02/2010] [Indexed: 12/04/2022] Open
Abstract
Background S. cerevisiae, A. thaliana and M. musculus are well-studied organisms in biology and the sequencing of their genomes was completed many years ago. It is still a challenge, however, to develop methods that assign biological functions to the ORFs in these genomes automatically. Different machine learning methods have been proposed to this end, but it remains unclear which method is to be preferred in terms of predictive performance, efficiency and usability. Results We study the use of decision tree based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning hierarchical multi-label decision trees. These can simultaneously predict all the functions of an ORF, while respecting a given hierarchy of gene functions (such as FunCat or GO). We present new results obtained with this algorithm, showing that the trees found by it exhibit clearly better predictive performance than the trees found by previously described methods. Nevertheless, the predictive performance of individual trees is lower than that of some recently proposed statistical learning methods. We show that ensembles of such trees are more accurate than single trees and are competitive with state-of-the-art statistical learning and functional linkage methods. Moreover, the ensemble method is computationally efficient and easy to use. Conclusions Our results suggest that decision tree based methods are a state-of-the-art, efficient and easy-to-use approach to ORF function prediction.
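The requirement in the abstract above that predictions respect a hierarchy of gene functions can be illustrated with a small consistency check. This sketch is not the tree-learning algorithm from the paper; it is a generic post-processing step, assumed here for illustration, that caps each term's score at the lowest score among its ancestors. The function name, term names, and scores are hypothetical.

```python
def enforce_hierarchy(scores, parents):
    """Make per-term scores consistent with a function hierarchy.

    scores:  {term: predicted score in [0, 1]}
    parents: {term: list of parent terms}; the hierarchy may be a tree or a DAG.
    A term's returned score never exceeds the score of any of its ancestors.
    """
    consistent = {}

    def resolve(term):
        if term not in consistent:
            value = scores[term]
            for parent in parents.get(term, ()):
                value = min(value, resolve(parent))
            consistent[term] = value
        return consistent[term]

    for term in scores:
        resolve(term)
    return consistent


# Hypothetical raw scores from independent per-term predictors.
raw = {
    "metabolism": 0.6,
    "lipid metabolism": 0.9,
    "fatty acid metabolism": 0.8,
    "transport": 0.7,
}
hierarchy = {
    "lipid metabolism": ["metabolism"],
    "fatty acid metabolism": ["lipid metabolism"],
}
print(enforce_hierarchy(raw, hierarchy))
```

Here the two more specific metabolism terms are capped at 0.6, the score of their shared ancestor, while the unrelated term keeps its raw score. Methods that learn all functions jointly, as the abstract describes, aim to make such a correction unnecessary.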
|
148
|
Re M, Valentini G. An Experimental Comparison of Hierarchical Bayes and True Path Rule Ensembles for Protein Function Prediction. MULTIPLE CLASSIFIER SYSTEMS 2010. [DOI: 10.1007/978-3-642-12127-2_30] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/04/2023]
|
149
|
Ko S, Lee H. Integrative approaches to the prediction of protein functions based on the feature selection. BMC Bioinformatics 2009; 10:455. [PMID: 20043848 PMCID: PMC2813249 DOI: 10.1186/1471-2105-10-455] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 05/28/2009] [Accepted: 12/31/2009] [Indexed: 01/30/2023]
Abstract
Background: Protein function prediction has been one of the most important issues in functional genomics. With the current availability of various genomic data sets, many researchers have attempted to develop integration models that combine all available genomic data for protein function prediction. These efforts have resulted in the improvement of prediction quality and the extension of prediction coverage. However, it has also been observed that integrating more data sources does not always increase the prediction quality. Therefore, selecting data sources that highly contribute to the protein function prediction has become an important issue. Results: We present systematic feature selection methods that assess the contribution of genome-wide data sets to predict protein functions and then investigate the relationship between genomic data sources and protein functions. In this study, we use ten different genomic data sources in Mus musculus, including: protein-domains, protein-protein interactions, gene expressions, phenotype ontology, phylogenetic profiles and disease data sources to predict protein functions that are labelled with Gene Ontology (GO) terms. We then apply two approaches to feature selection: exhaustive search feature selection using a kernel based logistic regression (KLR), and a kernel based L1-norm regularized logistic regression (KL1LR). In the first approach, we exhaustively measure the contribution of each data set for each function based on its prediction quality. In the second approach, we use the estimated coefficients of features as measures of contribution of data sources. Our results show that the proposed methods improve the prediction quality compared to the full integration of all data sources and other filter-based feature selection methods. We also show that contributing data sources can differ depending on the protein function. Furthermore, we observe that highly contributing data sets can be similar among a group of protein functions that have the same parent in the GO hierarchy. Conclusions: In contrast to previous integration methods, our approaches not only increase the prediction quality but also gather information about highly contributing data sources for each protein function. This information can help researchers collect relevant data sources for annotating protein functions.
Affiliation(s)
- Seokha Ko
- Department of Information and Communications, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea.
|
150
|
Zheng P, Griswold MD, Hassold TJ, Hunt PA, Small CL, Ye P. Predicting meiotic pathways in human fetal oogenesis. Biol Reprod 2009; 82:543-51. [PMID: 19846598 DOI: 10.1095/biolreprod.109.079590] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 11/01/2022]
Abstract
Gene function prediction has proven valuable in formulating testable hypotheses. It is particularly useful for exploring biological processes that are experimentally intractable, such as meiotic initiation and progression in the human fetal ovary. In this study, we developed the first functional gene network for the human fetal ovary, HFOnet, by probabilistically integrating multiple genomic features using a naïve Bayesian model. We demonstrated that this network could accurately recapture known functional connections between genes, as well as predict new connections. Our findings suggest that known meiosis-specific genes (i.e., with functions only in meiotic processes in the germ cells) make either no or a few functional connections but are highly clustered with neighbor genes. In contrast, known nonspecific meiotic genes (i.e., with functions in both meiotic and nonmeiotic processes in the germ cells and somatic cells) exhibit numerous connections but low clustering coefficients, indicating their role as central modulators of diverse pathways, including those in meiosis. We also predicted novel genes that may be involved in meiotic initiation and DNA repair. This global functional network provides a much-needed framework for exploring gene functions and pathway components in early human female meiosis that are difficult to tackle by traditional in vivo mammalian genetics.
Affiliation(s)
- Ping Zheng
- School of Molecular Biosciences, Center for Reproductive Biology, Washington State University, Pullman, WA 99164, USA
|