1
|
Zhu YH, Zhang C, Liu Y, Omenn GS, Freddolino PL, Yu DJ, Zhang Y. TripletGO: Integrating Transcript Expression Profiles with Protein Homology Inferences for Gene Function Prediction. GENOMICS, PROTEOMICS & BIOINFORMATICS 2022; 20:1013-1027. [PMID: 35568117 PMCID: PMC10025770 DOI: 10.1016/j.gpb.2022.03.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/20/2021] [Revised: 03/02/2022] [Accepted: 04/16/2022] [Indexed: 01/13/2023]
Abstract
Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. Here, we proposed a new method, TripletGO, to deduce GO terms of protein-coding and non-coding genes, through the integration of four complementary pipelines built on transcript expression profile, genetic sequence alignment, protein sequence alignment, and naïve probability. TripletGO was tested on a large set of 5754 genes from 8 species (human, mouse, Arabidopsis, rat, fly, budding yeast, fission yeast, and nematoda) and 2433 proteins with available expression data from the third Critical Assessment of Protein Function Annotation challenge (CAFA3). Experimental results show that TripletGO achieves function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet network-based profiling method with the feature space mapping technique, which can accurately recognize function patterns from transcript expression profiles. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanggroup.org/TripletGO/.
Collapse
Affiliation(s)
- Yi-Heng Zhu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Yan Liu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Gilbert S Omenn
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Departments of Internal Medicine and Human Genetics, and School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA
| | - Peter L Freddolino
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China.
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA.
| |
Collapse
|
2
|
Law JN, Akers K, Tasnina N, Santina CMD, Deutsch S, Kshirsagar M, Klein-Seetharaman J, Crovella M, Rajagopalan P, Kasif S, Murali TM. Interpretable network propagation with application to expanding the repertoire of human proteins that interact with SARS-CoV-2. Gigascience 2021; 10:giab082. [PMID: 34966926 PMCID: PMC8716363 DOI: 10.1093/gigascience/giab082] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2021] [Revised: 09/21/2021] [Accepted: 11/28/2021] [Indexed: 01/02/2023] Open
Abstract
BACKGROUND Network propagation has been widely used for nearly 20 years to predict gene functions and phenotypes. Despite the popularity of this approach, little attention has been paid to the question of provenance tracing in this context, e.g., determining how much any experimental observation in the input contributes to the score of every prediction. RESULTS We design a network propagation framework with 2 novel components and apply it to predict human proteins that directly or indirectly interact with SARS-CoV-2 proteins. First, we trace the provenance of each prediction to its experimentally validated sources, which in our case are human proteins experimentally determined to interact with viral proteins. Second, we design a technique that helps to reduce the manual adjustment of parameters by users. We find that for every top-ranking prediction, the highest contribution to its score arises from a direct neighbor in a human protein-protein interaction network. We further analyze these results to develop functional insights on SARS-CoV-2 that expand on known biology such as the connection between endoplasmic reticulum stress, HSPA5, and anti-clotting agents. CONCLUSIONS We examine how our provenance-tracing method can be generalized to a broad class of network-based algorithms. We provide a useful resource for the SARS-CoV-2 community that implicates many previously undocumented proteins with putative functional relationships to viral infection. This resource includes potential drugs that can be opportunistically repositioned to target these proteins. We also discuss how our overall framework can be extended to other, newly emerging viruses.
Collapse
Affiliation(s)
- Jeffrey N Law
- Interdisciplinary Ph.D. Program in Genetics, Bioinformatics, and Computational Biology, Virginia Tech, Blacksburg, VA 24061, USA
| | - Kyle Akers
- Interdisciplinary Ph.D. Program in Genetics, Bioinformatics, and Computational Biology, Virginia Tech, Blacksburg, VA 24061, USA
| | - Nure Tasnina
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
| | | | - Shay Deutsch
- Department of Mathematics, University of California, Los Angeles, CA 90095, USA
| | | | | | - Mark Crovella
- Department of Computer Science, Boston University, Boston, MA 02215, USA
| | | | - Simon Kasif
- Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
| | - T M Murali
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
| |
Collapse
|
3
|
We need to keep a reproducible trace of facts, predictions, and hypotheses from gene to function in the era of big data. PLoS Biol 2020; 18:e3000999. [PMID: 33253151 PMCID: PMC7728211 DOI: 10.1371/journal.pbio.3000999] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Revised: 12/10/2020] [Indexed: 01/18/2023] Open
Abstract
How do we scale biological science to the demand of next generation biology and medicine to keep track of the facts, predictions, and hypotheses? These days, enormous amounts of DNA sequence and other omics data are generated. Since these data contain the blueprint for life, it is imperative that we interpret it accurately. The abundance of DNA is only one part of the challenge. Artificial Intelligence (AI) and network methods routinely build on large screens, single cell technologies, proteomics, and other modalities to infer or predict biological functions and phenotypes associated with proteins, pathways, and organisms. As a first step, how do we systematically trace the provenance of knowledge from experimental ground truth to gene function predictions and annotations? Here, we review the main challenges in tracking the evolution of biological knowledge and propose several specific solutions to provenance and computational tracing of evidence in functional linkage networks.
Collapse
|
4
|
Pournoor E, Mousavian Z, Dalini AN, Masoudi-Nejad A. Identification of Key Components in Colon Adenocarcinoma Using Transcriptome to Interactome Multilayer Framework. Sci Rep 2020; 10:4991. [PMID: 32193399 PMCID: PMC7081269 DOI: 10.1038/s41598-020-59605-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2019] [Accepted: 01/31/2020] [Indexed: 12/21/2022] Open
Abstract
Complexity of cascading interrelations between molecular cell components at different levels from genome to metabolome ordains a massive difficulty in comprehending biological happenings. However, considering these complications in the systematic modelings will result in realistic and reliable outputs. The multilayer networks approach is a relatively innovative concept that could be applied for multiple omics datasets as an integrative methodology to overcome heterogeneity difficulties. Herein, we employed the multilayer framework to rehabilitate colon adenocarcinoma network by observing co-expression correlations, regulatory relations, and physical binding interactions. Hub nodes in this three-layer network were selected using a heterogeneous random walk with random jump procedure. We exploited local composite modules around the hub nodes having high overlay with cancer-specific pathways, and investigated their genes showing a different expressional pattern in the tumor progression. These genes were examined for survival effects on the patient's lifespan, and those with significant impacts were selected as potential candidate biomarkers. Results suggest that identified genes indicate noteworthy importance in the carcinogenesis of the colon.
Collapse
Affiliation(s)
- Ehsan Pournoor
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
| | - Zaynab Mousavian
- School of Mathematics, Statistics, and Computer Science, College of Science, University of Tehran, Tehran, Iran
| | - Abbas Nowzari Dalini
- School of Mathematics, Statistics, and Computer Science, College of Science, University of Tehran, Tehran, Iran
| | - Ali Masoudi-Nejad
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
| |
Collapse
|
5
|
Havugimana PC, Hu P, Emili A. Protein complexes, big data, machine learning and integrative proteomics: lessons learned over a decade of systematic analysis of protein interaction networks. Expert Rev Proteomics 2017; 14:845-855. [PMID: 28918672 DOI: 10.1080/14789450.2017.1374179] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2023]
Abstract
OVERVIEW Elucidation of the networks of physical (functional) interactions present in cells and tissues is fundamental for understanding the molecular organization of biological systems, the mechanistic basis of essential and disease-related processes, and for functional annotation of previously uncharacterized proteins (via guilt-by-association or -correlation). After a decade in the field, we felt it timely to document our own experiences in the systematic analysis of protein interaction networks. Areas covered: Researchers worldwide have contributed innovative experimental and computational approaches that have driven the rapidly evolving field of 'functional proteomics'. These include mass spectrometry-based methods to characterize macromolecular complexes on a global-scale and sophisticated data analysis tools - most notably machine learning - that allow for the generation of high-quality protein association maps. Expert commentary: Here, we recount some key lessons learned, with an emphasis on successful workflows, and challenges, arising from our own and other groups' ongoing efforts to generate, interpret and report proteome-scale interaction networks in increasingly diverse biological contexts.
Collapse
Affiliation(s)
- Pierre C Havugimana
- a Donnelly Centre for Cellular and Biomolecular Research , University of Toronto , Toronto , ON , Canada.,b Department of Molecular Genetics , University of Toronto , Toronto , ON , Canada
| | - Pingzhao Hu
- c Department of Biochemistry and Medical Genetics , University of Manitoba , Winnipeg , MB , Canada
| | - Andrew Emili
- a Donnelly Centre for Cellular and Biomolecular Research , University of Toronto , Toronto , ON , Canada.,b Department of Molecular Genetics , University of Toronto , Toronto , ON , Canada
| |
Collapse
|
6
|
Li J, Li X, Zhu B. User opinion classification in social media: A global consistency maximization approach. INFORMATION & MANAGEMENT 2016. [DOI: 10.1016/j.im.2016.06.004] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
7
|
Panwar B, Menon R, Eksi R, Li HD, Omenn GS, Guan Y. Genome-Wide Functional Annotation of Human Protein-Coding Splice Variants Using Multiple Instance Learning. J Proteome Res 2016; 15:1747-53. [PMID: 27142340 DOI: 10.1021/acs.jproteome.5b00883] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
Abstract
The vast majority of human multiexon genes undergo alternative splicing and produce a variety of splice variant transcripts and proteins, which can perform different functions. These protein-coding splice variants (PCSVs) greatly increase the functional diversity of proteins. Most functional annotation algorithms have been developed at the gene level; the lack of isoform-level gold standards is an important intellectual limitation for currently available machine learning algorithms. The accumulation of a large amount of RNA-seq data in the public domain greatly increases our ability to examine the functional annotation of genes at isoform level. In the present study, we used a multiple instance learning (MIL)-based approach for predicting the function of PCSVs. We used transcript-level expression values and gene-level functional associations from the Gene Ontology database. A support vector machine (SVM)-based 5-fold cross-validation technique was applied. Comparatively, genes with multiple PCSVs performed better than single PCSV genes, and performance also improved when more examples were available to train the models. We demonstrated our predictions using literature evidence of ADAM15, LMNA/C, and DMXL2 genes. All predictions have been implemented in a web resource called "IsoFunc", which is freely available for the global scientific community through http://guanlab.ccmb.med.umich.edu/isofunc .
Collapse
Affiliation(s)
- Bharat Panwar
- Department of Computational Medicine and Bioinformatics, ‡Department of Internal Medicine, §Department of Human Genetics and School of Public Health, and ∥Department of Electrical Engineering and Computer Science, University of Michigan , Ann Arbor, Michigan 48109, United States
| | - Rajasree Menon
- Department of Computational Medicine and Bioinformatics, ‡Department of Internal Medicine, §Department of Human Genetics and School of Public Health, and ∥Department of Electrical Engineering and Computer Science, University of Michigan , Ann Arbor, Michigan 48109, United States
| | - Ridvan Eksi
- Department of Computational Medicine and Bioinformatics, ‡Department of Internal Medicine, §Department of Human Genetics and School of Public Health, and ∥Department of Electrical Engineering and Computer Science, University of Michigan , Ann Arbor, Michigan 48109, United States
| | - Hong-Dong Li
- Department of Computational Medicine and Bioinformatics, ‡Department of Internal Medicine, §Department of Human Genetics and School of Public Health, and ∥Department of Electrical Engineering and Computer Science, University of Michigan , Ann Arbor, Michigan 48109, United States
| | - Gilbert S Omenn
- Department of Computational Medicine and Bioinformatics, ‡Department of Internal Medicine, §Department of Human Genetics and School of Public Health, and ∥Department of Electrical Engineering and Computer Science, University of Michigan , Ann Arbor, Michigan 48109, United States
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, ‡Department of Internal Medicine, §Department of Human Genetics and School of Public Health, and ∥Department of Electrical Engineering and Computer Science, University of Michigan , Ann Arbor, Michigan 48109, United States
| |
Collapse
|
8
|
Pushing the annotation of cellular activities to a higher resolution: Predicting functions at the isoform level. Methods 2015; 93:110-8. [PMID: 26238263 DOI: 10.1016/j.ymeth.2015.07.016] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2015] [Revised: 07/20/2015] [Accepted: 07/29/2015] [Indexed: 12/23/2022] Open
Abstract
In past decades, the experimental determination of protein functions was expensive and time-consuming, so numerous computational methods were developed to speed up and guide the process. However, most of these methods predict protein functions at the gene level and do not consider the fact that protein isoforms (translated from alternatively spliced transcripts), not genes, are the actual function carriers. Now, high-throughput RNA-seq technology is providing unprecedented opportunities to unravel protein functions at the isoform level. In this article, we review recent progress in the high-resolution functional annotations of protein isoforms, focusing on two methods developed by the authors. Both methods can integrate multiple RNA-seq datasets for comprehensively characterizing functions of protein isoforms.
Collapse
|
9
|
Frasca M, Bassis S, Valentini G. Learning node labels with multi-category Hopfield networks. Neural Comput Appl 2015. [DOI: 10.1007/s00521-015-1965-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/02/2023]
|
10
|
Wang S, Cho H, Zhai C, Berger B, Peng J. Exploiting ontology graph for predicting sparsely annotated gene function. Bioinformatics 2015; 31:i357-64. [PMID: 26072504 PMCID: PMC4542782 DOI: 10.1093/bioinformatics/btv260] [Citation(s) in RCA: 74] [Impact Index Per Article: 8.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/16/2023] Open
Abstract
MOTIVATION Systematically predicting gene (or protein) function based on molecular interaction networks has become an important tool in refining and enhancing the existing annotation catalogs, such as the Gene Ontology (GO) database. However, functional labels with only a few (<10) annotated genes, which constitute about half of the GO terms in yeast, mouse and human, pose a unique challenge in that any prediction algorithm that independently considers each label faces a paucity of information and thus is prone to capture non-generalizable patterns in the data, resulting in poor predictive performance. There exist a variety of algorithms for function prediction, but none properly address this 'overfitting' issue of sparsely annotated functions, or do so in a manner scalable to tens of thousands of functions in the human catalog. RESULTS We propose a novel function prediction algorithm, clusDCA, which transfers information between similar functional labels to alleviate the overfitting problem for sparsely annotated functions. Our method is scalable to datasets with a large number of annotations. In a cross-validation experiment in yeast, mouse and human, our method greatly outperformed previous state-of-the-art function prediction algorithms in predicting sparsely annotated functions, without sacrificing the performance on labels with sufficient information. Furthermore, we show that our method can accurately predict genes that will be assigned a functional label that has no known annotations, based only on the ontology graph structure and genes associated with other labels, which further suggests that our method effectively utilizes the similarity between gene functions. AVAILABILITY AND IMPLEMENTATION https://github.com/wangshenguiuc/clusDCA.
Collapse
Affiliation(s)
- Sheng Wang
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA and Department of Mathematics, MIT, Cambridge, MA, USA
| | - Hyunghoon Cho
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA and Department of Mathematics, MIT, Cambridge, MA, USA
| | - ChengXiang Zhai
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA and Department of Mathematics, MIT, Cambridge, MA, USA
| | - Bonnie Berger
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA and Department of Mathematics, MIT, Cambridge, MA, USA Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA and Department of Mathematics, MIT, Cambridge, MA, USA
| | - Jian Peng
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA and Department of Mathematics, MIT, Cambridge, MA, USA
| |
Collapse
|
11
|
Sedaghat N, Saegusa T, Randolph T, Shojaie A. Comparative study of computational methods for reconstructing genetic networks of cancer-related pathways. Cancer Inform 2014; 13:55-66. [PMID: 25288880 PMCID: PMC4179645 DOI: 10.4137/cin.s13781] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2014] [Revised: 05/08/2014] [Accepted: 05/10/2014] [Indexed: 12/16/2022] Open
Abstract
Network reconstruction is an important yet challenging task in systems biology. While many methods have been recently proposed for reconstructing biological networks from diverse data types, properties of estimated networks and differences between reconstruction methods are not well understood. In this paper, we conduct a comprehensive empirical evaluation of seven existing network reconstruction methods, by comparing the estimated networks with different sparsity levels for both normal and tumor samples. The results suggest substantial heterogeneity in networks reconstructed using different reconstruction methods. Our findings also provide evidence for significant differences between networks of normal and tumor samples, even after accounting for the considerable variability in structures of networks estimated using different reconstruction methods. These differences can offer new insight into changes in mechanisms of genetic interaction associated with cancer initiation and progression.
Collapse
Affiliation(s)
- Nafiseh Sedaghat
- Computer Engineering Department, Iran University of Science and Technology, Tehran, Iran
| | - Takumi Saegusa
- Department of Biostatistics, University of Washington, Seattle, WA, USA
| | - Timothy Randolph
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
| | - Ali Shojaie
- Department of Biostatistics, University of Washington, Seattle, WA, USA
| |
Collapse
|
12
|
Mulder NJ, Akinola RO, Mazandu GK, Rapanoel H. Using biological networks to improve our understanding of infectious diseases. Comput Struct Biotechnol J 2014; 11:1-10. [PMID: 25379138 PMCID: PMC4212278 DOI: 10.1016/j.csbj.2014.08.006] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Infectious diseases are the leading cause of death, particularly in developing countries. Although many drugs are available for treating the most common infectious diseases, in many cases the mechanism of action of these drugs or even their targets in the pathogen remain unknown. In addition, the key factors or processes in pathogens that facilitate infection and disease progression are often not well understood. Since proteins do not work in isolation, understanding biological systems requires a better understanding of the interconnectivity between proteins in different pathways and processes, which includes both physical and other functional interactions. Such biological networks can be generated within organisms or between organisms sharing a common environment using experimental data and computational predictions. Though different data sources provide different levels of accuracy, confidence in interactions can be measured using interaction scores. Connections between interacting proteins in biological networks can be represented as graphs and edges, and thus studied using existing algorithms and tools from graph theory. There are many different applications of biological networks, and here we discuss three such applications, specifically applied to the infectious disease tuberculosis, with its causative agent Mycobacterium tuberculosis and host, Homo sapiens. The applications include the use of the networks for function prediction, comparison of networks for evolutionary studies, and the generation and use of host–pathogen interaction networks.
Collapse
Affiliation(s)
- Nicola J Mulder
- Computational Biology Group, Department of Clinical Laboratory Sciences, IDM, University of Cape Town Faculty of Health Sciences, Anzio Road, Observatory, Cape Town, South Africa
| | - Richard O Akinola
- Computational Biology Group, Department of Clinical Laboratory Sciences, IDM, University of Cape Town Faculty of Health Sciences, Anzio Road, Observatory, Cape Town, South Africa
| | - Gaston K Mazandu
- Computational Biology Group, Department of Clinical Laboratory Sciences, IDM, University of Cape Town Faculty of Health Sciences, Anzio Road, Observatory, Cape Town, South Africa
| | - Holifidy Rapanoel
- Computational Biology Group, Department of Clinical Laboratory Sciences, IDM, University of Cape Town Faculty of Health Sciences, Anzio Road, Observatory, Cape Town, South Africa
| |
Collapse
|
13
|
Yu D, Kim M, Xiao G, Hwang TH. Review of biological network data and its applications. Genomics Inform 2013; 11:200-10. [PMID: 24465231 PMCID: PMC3897847 DOI: 10.5808/gi.2013.11.4.200] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2013] [Revised: 11/20/2013] [Accepted: 11/21/2013] [Indexed: 12/16/2022] Open
Abstract
Studying biological networks, such as protein-protein interactions, is key to understanding complex biological activities. Various types of large-scale biological datasets have been collected and analyzed with high-throughput technologies, including DNA microarray, next-generation sequencing, and the two-hybrid screening system, for this purpose. In this review, we focus on network-based approaches that help in understanding biological systems and identifying biological functions. Accordingly, this paper covers two major topics in network biology: reconstruction of gene regulatory networks and network-based applications, including protein function prediction, disease gene prioritization, and network-based genome-wide association study.
Collapse
Affiliation(s)
- Donghyeon Yu
- Department of Clinical Sciences, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
| | - Minsoo Kim
- Department of Clinical Sciences, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
| | - Guanghua Xiao
- Department of Clinical Sciences, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
| | - Tae Hyun Hwang
- Department of Clinical Sciences, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
| |
Collapse
|
14
|
Anton BP, Chang YC, Brown P, Choi HP, Faller LL, Guleria J, Hu Z, Klitgord N, Levy-Moonshine A, Maksad A, Mazumdar V, McGettrick M, Osmani L, Pokrzywa R, Rachlin J, Swaminathan R, Allen B, Housman G, Monahan C, Rochussen K, Tao K, Bhagwat AS, Brenner SE, Columbus L, de Crécy-Lagard V, Ferguson D, Fomenkov A, Gadda G, Morgan RD, Osterman AL, Rodionov DA, Rodionova IA, Rudd KE, Söll D, Spain J, Xu SY, Bateman A, Blumenthal RM, Bollinger JM, Chang WS, Ferrer M, Friedberg I, Galperin MY, Gobeill J, Haft D, Hunt J, Karp P, Klimke W, Krebs C, Macelis D, Madupu R, Martin MJ, Miller JH, O'Donovan C, Palsson B, Ruch P, Setterdahl A, Sutton G, Tate J, Yakunin A, Tchigvintsev D, Plata G, Hu J, Greiner R, Horn D, Sjölander K, Salzberg SL, Vitkup D, Letovsky S, Segrè D, DeLisi C, Roberts RJ, Steffen M, Kasif S. The COMBREX project: design, methodology, and initial results. PLoS Biol 2013; 11:e1001638. [PMID: 24013487 PMCID: PMC3754883 DOI: 10.1371/journal.pbio.1001638] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Affiliation(s)
- Brian P. Anton
- New England Biolabs, Ipswich, Massachusetts, United States of America
- * E-mail: (BPA); (SK)
| | - Yi-Chien Chang
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
| | - Peter Brown
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
| | - Han-Pil Choi
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
| | - Lina L. Faller
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
| | - Jyotsna Guleria
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
| | - Zhenjun Hu
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
| | - Niels Klitgord
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
| | - Ami Levy-Moonshine
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
| | - Almaz Maksad
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
| | - Varun Mazumdar
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
| | - Mark McGettrick
- Diatom Software LLC, Holliston, Massachusetts, United States of America
| | - Lais Osmani
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
| | - Revonda Pokrzywa
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
| | - John Rachlin
- Diatom Software LLC, Holliston, Massachusetts, United States of America
| | - Rajeswari Swaminathan
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
| | - Benjamin Allen
- Program for Evolutionary Dynamics, Harvard University, Cambridge, Massachusetts, United States of America
- Department of Mathematics, Emmanuel College, Boston, Massachusetts, United States of America
| | - Genevieve Housman
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
| | - Caitlin Monahan
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
| | - Krista Rochussen
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
| | - Kevin Tao
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
| | - Ashok S. Bhagwat
- Department of Chemistry, Wayne State University, Detroit, Michigan, United States of America
| | - Steven E. Brenner
- Department of Plant and Microbial Biology, University of California, Berkeley, California, United States of America
| | - Linda Columbus
- Department of Chemistry, University of Virginia, Charlottesville, Virginia, United States of America
| | - Valérie de Crécy-Lagard
- Department of Microbiology and Cell Science, University of Florida, Gainesville, Florida, United States of America
| | - Donald Ferguson
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
| | - Alexey Fomenkov
- New England Biolabs, Ipswich, Massachusetts, United States of America
| | - Giovanni Gadda
- Department of Chemistry, Georgia State University, Atlanta, Georgia, United States of America
| | - Richard D. Morgan
- New England Biolabs, Ipswich, Massachusetts, United States of America
| | - Andrei L. Osterman
- Bioinformatics and Systems Biology, Sanford Burnham Medical Research Institute, La Jolla, California, United States of America
| | - Dmitry A. Rodionov
- Bioinformatics and Systems Biology, Sanford Burnham Medical Research Institute, La Jolla, California, United States of America
| | - Irina A. Rodionova
- Bioinformatics and Systems Biology, Sanford Burnham Medical Research Institute, La Jolla, California, United States of America
| | - Kenneth E. Rudd
- Department of Biochemistry and Molecular Biology, University of Miami, Miami, Florida, United States of America
| | - Dieter Söll
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
| | - James Spain
- School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America
| | - Shuang-yong Xu
- New England Biolabs, Ipswich, Massachusetts, United States of America
| | - Alex Bateman
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom
| | - Robert M. Blumenthal
- Department of Medical Microbiology and Immunology, and Program in Bioinformatics, University of Toledo, Toledo, Ohio, United States of America
| | - J. Martin Bollinger
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Woo-Suk Chang
- Department of Biology, University of Texas-Arlington, Arlington, Texas, United States of America
| | - Manuel Ferrer
- Spanish National Research Council (CSIC), Institute of Catalysis, Madrid, Spain
| | - Iddo Friedberg
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
| | - Michael Y. Galperin
- National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Julien Gobeill
- Department of Library and Information Sciences, University of Applied Sciences Western Switzerland, Geneva, Switzerland
- Bibliomics and Text Mining Group, Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Daniel Haft
- J. Craig Venter Institute, Rockville, Maryland, United States of America
| | - John Hunt
- Biological Sciences, Columbia University, New York, New York, United States of America
| | - Peter Karp
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Menlo Park, California, United States of America
| | - William Klimke
- National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Carsten Krebs
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Dana Macelis
- New England Biolabs, Ipswich, Massachusetts, United States of America
| | - Ramana Madupu
- J. Craig Venter Institute, Rockville, Maryland, United States of America
| | - Maria J. Martin
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom
| | - Jeffrey H. Miller
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, Los Angeles, California, United States of America
| | - Claire O'Donovan
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom
| | - Bernhard Palsson
- Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
| | - Patrick Ruch
- Department of Library and Information Sciences, University of Applied Sciences Western Switzerland, Geneva, Switzerland
- Bibliomics and Text Mining Group, Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Aaron Setterdahl
- Department of Chemistry, Indiana University Southeast, New Albany, Indiana, United States of America
| | - Granger Sutton
- J. Craig Venter Institute, Rockville, Maryland, United States of America
| | - John Tate
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom
| | - Alexander Yakunin
- Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario, Canada
| | - Dmitri Tchigvintsev
- Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario, Canada
| | - Germán Plata
- Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, United States of America
- Integrated Program in Cellular, Molecular, Structural, and Genetic Studies, Columbia University, New York, New York, United States of America
| | - Jie Hu
- Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, United States of America
| | - Russell Greiner
- Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
| | - David Horn
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
| | - Kimmen Sjölander
- Berkeley Phylogenomics Group, University of California, Berkeley, California, United States of America
| | - Steven L. Salzberg
- Departments of Medicine and Biostatistics, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, United States of America
| | - Dennis Vitkup
- Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, United States of America
| | - Stanley Letovsky
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
| | - Daniel Segrè
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
| | - Charles DeLisi
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
| | - Richard J. Roberts
- New England Biolabs, Ipswich, Massachusetts, United States of America
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
| | - Martin Steffen
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
| | - Simon Kasif
- Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- * E-mail: (BPA); (SK)
| |
Collapse
|
15
|
Frasca M, Bertoni A, Re M, Valentini G. A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Netw 2013; 43:84-98. [DOI: 10.1016/j.neunet.2013.01.021] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2012] [Revised: 01/28/2013] [Accepted: 01/29/2013] [Indexed: 01/03/2023]
|
16
|
Zhang J, Li L, Peng L, Sun Y, Li J. An efficient weighted graph strategy to identify differentiation associated genes in embryonic stem cells. PLoS One 2013; 8:e62716. [PMID: 23638139 PMCID: PMC3637163 DOI: 10.1371/journal.pone.0062716] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2012] [Accepted: 03/25/2013] [Indexed: 11/18/2022] Open
Abstract
In the past few decades, embryonic stem cells (ESCs) were of great interest as a model system for studying early developmental processes and because of their potential therapeutic applications in regenerative medicine. However, the underlying mechanisms of ESC differentiation remain unclear, which limits our exploration of the therapeutic potential of stem cells. Fortunately, the increasing quantity and diversity of biological datasets can provide us with opportunities to explore the biological secrets. However, taking advantage of diverse biological information to facilitate the advancement of ESC research still remains a challenge. Here, we propose a scalable, efficient and flexible function prediction framework that integrates diverse biological information using a simple weighted strategy, for uncovering the genetic determinants of mouse ESC differentiation. The advantage of this approach is that it can make predictions based on dynamic information fusion, owing to the simple weighted strategy. With this approach, we identified 30 genes that had been reported to be associated with differentiation of stem cells, which we regard to be associated with differentiation or pluripotency in embryonic stem cells. We also predicted 70 genes as candidates for contributing to differentiation, which requires further confirmation. As a whole, our results showed that this strategy could be applied as a useful tool for ESC research.
Collapse
Affiliation(s)
- Jie Zhang
- Department of Prevention, Tongji University School of Medicine, Shanghai, China
- * E-mail: (JZ); (JL)
| | - Li Li
- Key Laboratory of Arrhythmias, Ministry of Education, Tongji University School of Medicine, Shanghai, China
| | - Luying Peng
- Key Laboratory of Arrhythmias, Ministry of Education, Tongji University School of Medicine, Shanghai, China
| | - Yingxian Sun
- Department of Cardiology, The First Hospital of China Medical University, Shenyang, China
| | - Jue Li
- Department of Prevention, Tongji University School of Medicine, Shanghai, China
- * E-mail: (JZ); (JL)
| |
Collapse
|
17
|
Hu P, Jiang H, Emili A. Incorporating Correlations among Gene Ontology Terms into Predicting Protein Functions. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022] Open
Abstract
The authors describe a new strategy that has better prediction performance than previous methods, which gives additional insights about the importance of the dependence between functional terms when inferring protein function.
Collapse
Affiliation(s)
- Pingzhao Hu
- York University, Canada & University of Toronto, Canada
| | | | | |
Collapse
|
18
|
Abstract
Functional characterization of genes and their protein products is essential to biological and clinical research. Yet, there is still no reliable way of assigning functional annotations to proteins in a high-throughput manner. In this article, the authors provide an introduction to the task of automated protein function prediction. They discuss about the motivation for automated protein function prediction, the challenges faced in this task, as well as some approaches that are currently available. In particular, they take a closer look at methods that use protein-protein interaction for protein function prediction, elaborating on their underlying techniques and assumptions, as well as their strengths and limitations.
Collapse
|
19
|
Chu LH, Rivera CG, Popel AS, Bader JS. Constructing the angiome: a global angiogenesis protein interaction network. Physiol Genomics 2012; 44:915-24. [PMID: 22911453 DOI: 10.1152/physiolgenomics.00181.2011] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023] Open
Abstract
Angiogenesis is the formation of new blood vessels from pre-existing microvessels. Excessive and insufficient angiogenesis have been associated with many diseases including cancer, age-related macular degeneration, ischemic heart, brain, and skeletal muscle diseases. A comprehensive understanding of angiogenesis regulatory processes is needed to improve treatment of these diseases. To identify proteins related to angiogenesis, we developed a novel integrative framework for diverse sources of high-throughput data. The system, called GeneHits, was used to expand on known angiogenesis pathways to construct the angiome, a protein-protein interaction network for angiogenesis. The network consists of 478 proteins and 1,488 interactions. The network was validated through cross validation and analysis of five gene expression datasets from in vitro angiogenesis assays. We calculated the topological properties of the angiome. We analyzed the functional enrichment of angiogenesis-annotated and associated proteins. We also constructed an extended angiome with 1,233 proteins and 5,726 interactions to derive a more complete map of protein-protein interactions in angiogenesis. Finally, the extended angiome was used to identify growth factor signaling networks that drive angiogenesis and antiangiogenic signaling networks. The results of this analysis can be used to identify genes and proteins in different disease conditions and putative targets for therapeutic interventions as high-ranked candidates for experimental validation.
Collapse
Affiliation(s)
- Liang-Hui Chu
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21205, USA.
| | | | | | | |
Collapse
|
20
|
A Resource of Quantitative Functional Annotation for Homo sapiens Genes. G3-GENES GENOMES GENETICS 2012; 2:223-33. [PMID: 22384401 PMCID: PMC3284330 DOI: 10.1534/g3.111.000828] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/04/2011] [Accepted: 11/23/2011] [Indexed: 01/31/2023]
Abstract
The body of human genomic and proteomic evidence continues to grow at ever-increasing rates, while annotation efforts struggle to keep pace. A surprisingly small fraction of human genes have clear, documented associations with specific functions, and new functions continue to be found for characterized genes. Here we assembled an integrated collection of diverse genomic and proteomic data for 21,341 human genes and make quantitative associations of each to 4333 Gene Ontology terms. We combined guilt-by-profiling and guilt-by-association approaches to exploit features unique to the data types. Performance was evaluated by cross-validation, prospective validation, and by manual evaluation with the biological literature. Functional-linkage networks were also constructed, and their utility was demonstrated by identifying candidate genes related to a glioma FLN using a seed network from genome-wide association studies. Our annotations are presented—alongside existing validated annotations—in a publicly accessible and searchable web interface.
Collapse
|
21
|
WANG JINGYAN, LI YONGPING. SEQUENTIAL LINEAR NEIGHBORHOOD PROPAGATION FOR SEMI-SUPERVISED PROTEIN FUNCTION PREDICTION. J Bioinform Comput Biol 2012; 9:663-79. [DOI: 10.1142/s0219720011005550] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2010] [Revised: 10/14/2010] [Accepted: 03/21/2011] [Indexed: 11/18/2022]
Abstract
Predicting protein function is one of the most challenging problems of the post-genomic era. The development of experimental methods for genome scale analysis of molecular interaction networks has provided new approaches to inferring protein function. In this paper we introduce a new graph-based semi-supervised classification algorithm Sequential Linear Neighborhood Propagation (SLNP), which addresses the problem of the classification of partially labeled protein interaction networks. The proposed SLNP first constructs a sequence of node sets according to their shortest distance to the labeled nodes, and then predicts the function of the unlabel proteins from the set closer to labeled one, using Linear Neighborhood Propagation. Its performance is assessed on the Saccharomyces cerevisiae PPI network data sets, with good results compared with three current state-of-the-art algorithms, especially in settings where only a small fraction of the proteins are labeled.
Collapse
Affiliation(s)
- JINGYAN WANG
- Shanghai Institute of Applied Physics, Chinese Academy of Sciences, 2019 Jialuo Road, Jiading District, Shanghai 201800, P. R. China
| | - YONGPING LI
- Shanghai Institute of Applied Physics, Chinese Academy of Sciences, 2019 Jialuo Road, Jiading District, Shanghai 201800, P. R. China
| |
Collapse
|
22
|
Mazandu GK, Mulder NJ. Using the underlying biological organization of the Mycobacterium tuberculosis functional network for protein function prediction. INFECTION GENETICS AND EVOLUTION 2011; 12:922-32. [PMID: 22085822 DOI: 10.1016/j.meegid.2011.10.027] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/06/2011] [Revised: 10/25/2011] [Accepted: 10/28/2011] [Indexed: 10/15/2022]
Abstract
Despite ever-increasing amounts of sequence and functional genomics data, there is still a deficiency of functional annotation for many newly sequenced proteins. For Mycobacterium tuberculosis (MTB), more than half of its genome is still uncharacterized, which hampers the search for new drug targets within the bacterial pathogen and limits our understanding of its pathogenicity. As for many other genomes, the annotations of proteins in the MTB proteome were generally inferred from sequence homology, which is effective but its applicability has limitations. We have carried out large-scale biological data integration to produce an MTB protein functional interaction network. Protein functional relationships were extracted from the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database, and additional functional interactions from microarray, sequence and protein signature data. The confidence level of protein relationships in the additional functional interaction data was evaluated using a dynamic data-driven scoring system. This functional network has been used to predict functions of uncharacterized proteins using Gene Ontology (GO) terms, and the semantic similarity between these terms measured using a state-of-the-art GO similarity metric. To achieve better trade-off between improvement of quality, genomic coverage and scalability, this prediction is done by observing the key principles driving the biological organization of the functional network. This study yields a new functionally characterized MTB strain CDC1551 proteome, consisting of 3804 and 3698 proteins out of 4195 with annotations in terms of the biological process and molecular function ontologies, respectively. These data can contribute to research into the Development of effective anti-tubercular drugs with novel biological mechanisms of action.
Collapse
Affiliation(s)
- Gaston K Mazandu
- Computational Biology Group, Department of Clinical Laboratory Sciences, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Medical School, 7925 Observatory, Cape Town, South Africa
| | | |
Collapse
|
23
|
Murali TM, Dyer MD, Badger D, Tyler BM, Katze MG. Network-based prediction and analysis of HIV dependency factors. PLoS Comput Biol 2011; 7:e1002164. [PMID: 21966263 PMCID: PMC3178628 DOI: 10.1371/journal.pcbi.1002164] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2010] [Accepted: 06/30/2011] [Indexed: 01/27/2023] Open
Abstract
HIV Dependency Factors (HDFs) are a class of human proteins that are essential for HIV replication, but are not lethal to the host cell when silenced. Three previous genome-wide RNAi experiments identified HDF sets with little overlap. We combine data from these three studies with a human protein interaction network to predict new HDFs, using an intuitive algorithm called SinkSource and four other algorithms published in the literature. Our algorithm achieves high precision and recall upon cross validation, as do the other methods. A number of HDFs that we predict are known to interact with HIV proteins. They belong to multiple protein complexes and biological processes that are known to be manipulated by HIV. We also demonstrate that many predicted HDF genes show significantly different programs of expression in early response to SIV infection in two non-human primate species that differ in AIDS progression. Our results suggest that many HDFs are yet to be discovered and that they have potential value as prognostic markers to determine pathological outcome and the likelihood of AIDS development. More generally, if multiple genome-wide gene-level studies have been performed at independent labs to study the same biological system or phenomenon, our methodology is applicable to interpret these studies simultaneously in the context of molecular interaction networks and to ask if they reinforce or contradict each other. Medicines to cure infectious diseases usually target proteins in the pathogens. Since pathogens have short life cycles, the targeted proteins can rapidly evolve and make the medicines ineffective, especially in viruses such as HIV. However, since viruses have very small genomes, they must exploit the cellular machinery of the host to propagate. Therefore, disrupting the activity of selected host proteins may impede viruses. Three recent experiments have discovered hundreds of such proteins in human cells that HIV depends upon. Surprisingly, these three sets have very little overlap. In this work, we demonstrate that this discrepancy can be explained by considering physical interactions between the human proteins in these studies. Moreover, we exploit these interactions to predict new dependency factors for HIV. Our predictions show very significant overlaps with human proteins that are known to interact with HIV proteins and with human cellular processes that are known to be subverted by the virus. Most importantly, we show that proteins predicted by us may play a prominent role in affecting HIV-related disease progression in lymph nodes. Therefore, our predictions constitute a powerful resource for experimentalists who desire to discover new human proteins that can control the spread of HIV.
Collapse
Affiliation(s)
- T. M. Murali
- Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, United States of America
- * E-mail: (TMM) (TM); (MGK) (MK)
| | - Matthew D. Dyer
- Applied Biosystems, Foster City, California, United States of America
| | - David Badger
- Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, United States of America
| | - Brett M. Tyler
- Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, United States of America
| | - Michael G. Katze
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
- * E-mail: (TMM) (TM); (MGK) (MK)
| |
Collapse
|
24
|
Rivera CG, Mellberg S, Claesson-Welsh L, Bader JS, Popel AS. Analysis of VEGF--a regulated gene expression in endothelial cells to identify genes linked to angiogenesis. PLoS One 2011; 6:e24887. [PMID: 21931866 PMCID: PMC3172305 DOI: 10.1371/journal.pone.0024887] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2011] [Accepted: 08/23/2011] [Indexed: 02/06/2023] Open
Abstract
Angiogenesis is important for many physiological processes, diseases, and also regenerative medicine. Therapies that inhibit the vascular endothelial growth factor (VEGF) pathway have been used in the clinic for cancer and macular degeneration. In cancer applications, these treatments suffer from a “tumor escape phenomenon” where alternative pathways are upregulated and angiogenesis continues. The redundancy of angiogenesis regulation indicates the need for additional studies and new drug targets. We aimed to (i) identify novel and missing angiogenesis annotations and (ii) verify their significance to angiogenesis. To achieve these goals, we integrated the human interactome with known angiogenesis-annotated proteins to identify a set of 202 angiogenesis-associated proteins. Across endothelial cell lines, we found that a significant fraction of these proteins had highly perturbed gene expression during angiogenesis. After treatment with VEGF-A, we found increasing expression of HIF-1α, APP, HIV-1 tat interactive protein 2, and MEF2C, while endoglin, liprin β1 and HIF-2α had decreasing expression across three endothelial cell lines. The analysis showed differential regulation of HIF-1α and HIF-2α. The data also provided additional evidence for the role of endothelial cells in Alzheimer's disease.
Collapse
Affiliation(s)
- Corban G Rivera
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America.
| | | | | | | | | |
Collapse
|
25
|
Mazandu GK, Mulder NJ. Scoring protein relationships in functional interaction networks predicted from sequence data. PLoS One 2011; 6:e18607. [PMID: 21526183 PMCID: PMC3079720 DOI: 10.1371/journal.pone.0018607] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2010] [Accepted: 03/07/2011] [Indexed: 11/21/2022] Open
Abstract
UNLABELLED The abundance of diverse biological data from various sources constitutes a rich source of knowledge, which has the power to advance our understanding of organisms. This requires computational methods in order to integrate and exploit these data effectively and elucidate local and genome wide functional connections between protein pairs, thus enabling functional inferences for uncharacterized proteins. These biological data are primarily in the form of sequences, which determine functions, although functional properties of a protein can often be predicted from just the domains it contains. Thus, protein sequences and domains can be used to predict protein pair-wise functional relationships, and thus contribute to the function prediction process of uncharacterized proteins in order to ensure that knowledge is gained from sequencing efforts. In this work, we introduce information-theoretic based approaches to score protein-protein functional interaction pairs predicted from protein sequence similarity and conserved protein signature matches. The proposed schemes are effective for data-driven scoring of connections between protein pairs. We applied these schemes to the Mycobacterium tuberculosis proteome to produce a homology-based functional network of the organism with a high confidence and coverage. We use the network for predicting functions of uncharacterised proteins. AVAILABILITY Protein pair-wise functional relationship scores for Mycobacterium tuberculosis strain CDC1551 sequence data and python scripts to compute these scores are available at http://web.cbio.uct.ac.za/~gmazandu/scoringschemes.
Collapse
Affiliation(s)
- Gaston K Mazandu
- Computational Biology Group, Department of Clinical Laboratory Sciences, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Cape Town, South Africa
| | | |
Collapse
|
26
|
Smoot M, Ono K, Ideker T, Maere S. PiNGO: a Cytoscape plugin to find candidate genes in biological networks. Bioinformatics 2011; 27:1030-1. [PMID: 21278188 DOI: 10.1093/bioinformatics/btr045] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022] Open
Abstract
UNLABELLED PiNGO is a tool to screen biological networks for candidate genes, i.e. genes predicted to be involved in a biological process of interest. The user can narrow the search to genes with particular known functions or exclude genes belonging to particular functional classes. PiNGO provides support for a wide range of organisms and Gene Ontology classification schemes, and it can easily be customized for other organisms and functional classifications. PiNGO is implemented as a plugin for Cytoscape, a popular network visualization platform. AVAILABILITY PiNGO is distributed as an open-source Java package under the GNU General Public License (http://www.gnu.org/), and can be downloaded via the Cytoscape plugin manager. A detailed user guide and tutorial are available on the PiNGO website (http://www.psb.ugent.be/esb/PiNGO.
Collapse
Affiliation(s)
- Michael Smoot
- Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
| | | | | | | |
Collapse
|
27
|
Abstract
A large number of genome-scale networks, including protein-protein and genetic interaction networks, are now available for several organisms. In parallel, many studies have focused on analyzing, characterizing, and modeling these networks. Beyond investigating the topological characteristics such as degree distribution, clustering coefficient, and average shortest-path distance, another area of particular interest is the prediction of nodes (genes) with a given characteristic (labels) - for example prediction of genes that cause a particular phenotype or have a given function. In this chapter, we describe methods and algorithms for predicting node labels from network-based datasets with an emphasis on label propagation algorithms (LPAs) and their relation to local neighborhood methods.
Collapse
Affiliation(s)
- Sara Mostafavi
- Department of Computer Science, Centre for Cellular and Biomolecular Research (CCBR), University of Toronto, Toronto, ON, Canada
| | | | | |
Collapse
|
28
|
Heo HS, Oh SJ, Kim JM, Kim HS, Chung HY. TREP_DB: transcriptional regulatory elements pattern database. Biochem Biophys Res Commun 2010; 394:309-316. [PMID: 20206134 DOI: 10.1016/j.bbrc.2010.02.169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2010] [Accepted: 02/26/2010] [Indexed: 05/28/2023]
Abstract
Predicting and assigning functions for putative genes and hypothetical proteins are important goals in the post-genomic era. Many methods have been developed for this challenge, among which the straightforward way is function prediction using sequence homology. Homology-based function prediction applies sequence-alignment tools to find homology relationships between functions of known genes and putative genes, and transfers the most similar functions of known genes to putative genes. This approach fails completely for about 30% of genes, and only 3% have any supporting experimental evidence. According to supporting evidence, genes are known to be regulated by a common transcriptional regulatory element if the expression profiles of the coregulated genes are highly correlated. We propose a new conceptual approach and method for nonhomology-based function-prediction methods for putative genes and hypothetical proteins. We have established patterns, also considered to be combinations, of common transcriptional regulatory elements for functional classes of mouse (Mus musculus) transcripts (the TREP_DB). Using these results, we have also established a function-prediction method for putative genes and hypothetical proteins.
Collapse
Affiliation(s)
- Hyoung-Sam Heo
- Department of Pharmacy, College of Pharmacy and Molecular Inflammation Research Center for Aging Intervention, Pusan National University, Gumjung-gu, Busan 609-735, Republic of Korea
| | | | | | | | | |
Collapse
|
29
|
Bradford JR, Needham CJ, Tedder P, Care MA, Bulpitt AJ, Westhead DR. GO-At: in silico prediction of gene function in Arabidopsis thaliana by combining heterogeneous data. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2010; 61:713-721. [PMID: 19947983 DOI: 10.1111/j.1365-313x.2009.04097.x] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]
Abstract
Despite recent advances, accurate gene function prediction remains an elusive goal, with very few methods directly applicable to the plant Arabidopsis thaliana. In this study, we present GO-At (gene ontology prediction in A. thaliana), a method that combines five data types (co-expression, sequence, phylogenetic profile, interaction and gene neighbourhood) to predict gene function in Arabidopsis. Using a simple, yet powerful two-step approach, GO-At first generates a list of genes ranked in descending order of probability of functional association with the query gene. Next, a prediction score is automatically assigned to each function in this list based on the assumption that functions appearing most frequently at the top of the list are most likely to represent the function of the query gene. In this way, the second step provides an effective alternative to simply taking the 'best hit' from the first list, and achieves success rates of up to 79%. GO-At is applicable across all three GO categories: molecular function, biological process and cellular component, and can assign functions at multiple levels of annotation detail. Furthermore, we demonstrate GO-At's ability to predict functions of uncharacterized genes by identifying ten putative golgins/Golgi-associated proteins amongst 8219 genes of previously unknown cellular component and present independent evidence to support our predictions. A web-based implementation of GO-At (http://www.bioinformatics.leeds.ac.uk/goat) is available, providing a unique resource for plant researchers to make predictions for uncharacterized genes and predict novel functions in Arabidopsis.
Collapse
Affiliation(s)
- James R Bradford
- Institute of Molecular and Cellular Biology, University of Leeds, Leeds, LS2 9JT, UK
| | | | | | | | | | | |
Collapse
|
30
|
Hu P, Jiang H, Emili A. Predicting protein functions by relaxation labelling protein interaction network. BMC Bioinformatics 2010; 11 Suppl 1:S64. [PMID: 20122240 PMCID: PMC3009538 DOI: 10.1186/1471-2105-11-s1-s64] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Background One of key issues in the post-genomic era is to assign functions to uncharacterized proteins. Since proteins seldom act alone; rather, they must interact with other biomolecular units to execute their functions. Thus, the functions of unknown proteins may be discovered through studying their interactions with proteins having known functions. Although many approaches have been developed for this purpose, one of main limitations in most of these methods is that the dependence among functional terms has not been taken into account. Results We developed a new network-based protein function prediction method which combines the likelihood scores of local classifiers with a relaxation labelling technique. The framework can incorporate the inter-relationship among functional labels into the function prediction procedure and allow us to efficiently discover relevant non-local dependence. We evaluated the performance of the new method with one other representative network-based function prediction method using E. coli protein functional association networks. Conclusion Our results showed that the new method has better prediction performance than the previous method. The better predictive power of our method gives new insights about the importance of the dependence between functional terms in protein functional prediction.
Collapse
Affiliation(s)
- Pingzhao Hu
- Department of Computer Science and Engineering, York University, ON, Toronto, Canada.
| | | | | |
Collapse
|
31
|
|
32
|
Li X, Chen H, Li J, Zhang Z. Gene function prediction with gene interaction networks: a context graph kernel approach. ACTA ACUST UNITED AC 2009; 14:119-28. [PMID: 19789115 DOI: 10.1109/titb.2009.2033116] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
Predicting gene functions is a challenge for biologists in the postgenomic era. Interactions among genes and their products compose networks that can be used to infer gene functions. Most previous studies adopt a linkage assumption, i.e., they assume that gene interactions indicate functional similarities between connected genes. In this study, we propose to use a gene's context graph, i.e., the gene interaction network associated with the focal gene, to infer its functions. In a kernel-based machine-learning framework, we design a context graph kernel to capture the information in context graphs. Our experimental study on a testbed of p53-related genes demonstrates the advantage of using indirect gene interactions and shows the empirical superiority of the proposed approach over linkage-assumption-based methods, such as the algorithm to minimize inconsistent connected genes and diffusion kernels.
Collapse
Affiliation(s)
- Xin Li
- Department of Information Systems, City University of Hong Kong, Kowloon Tong, Hong Kong.
| | | | | | | |
Collapse
|
33
|
Hu P, Janga SC, Babu M, Díaz-Mejía JJ, Butland G, Yang W, Pogoutse O, Guo X, Phanse S, Wong P, Chandran S, Christopoulos C, Nazarians-Armavil A, Nasseri NK, Musso G, Ali M, Nazemof N, Eroukova V, Golshani A, Paccanaro A, Greenblatt JF, Moreno-Hagelsieb G, Emili A. Global functional atlas of Escherichia coli encompassing previously uncharacterized proteins. PLoS Biol 2009; 7:e96. [PMID: 19402753 PMCID: PMC2672614 DOI: 10.1371/journal.pbio.1000096] [Citation(s) in RCA: 268] [Impact Index Per Article: 17.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2008] [Accepted: 03/16/2009] [Indexed: 12/28/2022] Open
Abstract
One-third of the 4,225 protein-coding genes of Escherichia coli K-12 remain functionally unannotated (orphans). Many map to distant clades such as Archaea, suggesting involvement in basic prokaryotic traits, whereas others appear restricted to E. coli, including pathogenic strains. To elucidate the orphans' biological roles, we performed an extensive proteomic survey using affinity-tagged E. coli strains and generated comprehensive genomic context inferences to derive a high-confidence compendium for virtually the entire proteome consisting of 5,993 putative physical interactions and 74,776 putative functional associations, most of which are novel. Clustering of the respective probabilistic networks revealed putative orphan membership in discrete multiprotein complexes and functional modules together with annotated gene products, whereas a machine-learning strategy based on network integration implicated the orphans in specific biological processes. We provide additional experimental evidence supporting orphan participation in protein synthesis, amino acid metabolism, biofilm formation, motility, and assembly of the bacterial cell envelope. This resource provides a "systems-wide" functional blueprint of a model microbe, with insights into the biological and evolutionary significance of previously uncharacterized proteins.
Collapse
Affiliation(s)
- Pingzhao Hu
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Sarath Chandra Janga
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Medical Research Council Laboratory of Molecular Biology, Cambridge, United Kingdom
| | - Mohan Babu
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - J. Javier Díaz-Mejía
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Department of Biology, Wilfrid Laurier University, Waterloo, Ontario, Canada
| | - Gareth Butland
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Wenhong Yang
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Oxana Pogoutse
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Xinghua Guo
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Sadhna Phanse
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Peter Wong
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Shamanta Chandran
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Constantine Christopoulos
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Anaies Nazarians-Armavil
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Negin Karimi Nasseri
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Gabriel Musso
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Mehrab Ali
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Nazila Nazemof
- Department of Biology and Ottawa Institute of Systems Biology, Carleton University, Ottawa, Canada
| | - Veronika Eroukova
- Department of Biology and Ottawa Institute of Systems Biology, Carleton University, Ottawa, Canada
| | - Ashkan Golshani
- Department of Biology and Ottawa Institute of Systems Biology, Carleton University, Ottawa, Canada
| | - Alberto Paccanaro
- Department of Computer Science, Royal Holloway, University of London, Egham, United Kingdom
| | - Jack F Greenblatt
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Gabriel Moreno-Hagelsieb
- Department of Biology, Wilfrid Laurier University, Waterloo, Ontario, Canada
- * To whom correspondence should be addressed. E-mail: (GM-H); (AE)
| | - Andrew Emili
- Banting and Best Department of Medical Research, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- * To whom correspondence should be addressed. E-mail: (GM-H); (AE)
| |
Collapse
|
34
|
Huttenhower C, Haley EM, Hibbs MA, Dumeaux V, Barrett DR, Coller HA, Troyanskaya OG. Exploring the human genome with functional maps. Genes Dev 2009; 19:1093-106. [PMID: 19246570 PMCID: PMC2694471 DOI: 10.1101/gr.082214.108] [Citation(s) in RCA: 166] [Impact Index Per Article: 11.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2008] [Accepted: 02/09/2009] [Indexed: 11/24/2022]
Abstract
Human genomic data of many types are readily available, but the complexity and scale of human molecular biology make it difficult to integrate this body of data, understand it from a systems level, and apply it to the study of specific pathways or genetic disorders. An investigator could best explore a particular protein, pathway, or disease if given a functional map summarizing the data and interactions most relevant to his or her area of interest. Using a regularized Bayesian integration system, we provide maps of functional activity and interaction networks in over 200 areas of human cellular biology, each including information from approximately 30,000 genome-scale experiments pertaining to approximately 25,000 human genes. Key to these analyses is the ability to efficiently summarize this large data collection from a variety of biologically informative perspectives: prediction of protein function and functional modules, cross-talk among biological processes, and association of novel genes and pathways with known genetic disorders. In addition to providing maps of each of these areas, we also identify biological processes active in each data set. Experimental investigation of five specific genes, AP3B1, ATP6AP1, BLOC1S1, LAMP2, and RAB11A, has confirmed novel roles for these proteins in the proper initiation of macroautophagy in amino acid-starved human fibroblasts. Our functional maps can be explored using HEFalMp (Human Experimental/Functional Mapper), a web interface allowing interactive visualization and investigation of this large body of information.
Collapse
Affiliation(s)
- Curtis Huttenhower
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA
| | - Erin M. Haley
- Department of Molecular Biology, Princeton University, Princeton, New Jersey 08544, USA
| | | | - Vanessa Dumeaux
- Institute of Community Medicine, Tromsø University, Tromsø, Norway
| | - Daniel R. Barrett
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
| | - Hilary A. Coller
- Department of Molecular Biology, Princeton University, Princeton, New Jersey 08544, USA
| | - Olga G. Troyanskaya
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA
| |
Collapse
|
35
|
Hess DC, Myers CL, Huttenhower C, Hibbs MA, Hayes AP, Paw J, Clore JJ, Mendoza RM, Luis BS, Nislow C, Giaever G, Costanzo M, Troyanskaya OG, Caudy AA. Computationally driven, quantitative experiments discover genes required for mitochondrial biogenesis. PLoS Genet 2009; 5:e1000407. [PMID: 19300474 PMCID: PMC2648979 DOI: 10.1371/journal.pgen.1000407] [Citation(s) in RCA: 116] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2008] [Accepted: 02/05/2009] [Indexed: 01/09/2023] Open
Abstract
Mitochondria are central to many cellular processes including respiration, ion homeostasis, and apoptosis. Using computational predictions combined with traditional quantitative experiments, we have identified 100 proteins whose deficiency alters mitochondrial biogenesis and inheritance in Saccharomyces cerevisiae. In addition, we used computational predictions to perform targeted double-mutant analysis detecting another nine genes with synthetic defects in mitochondrial biogenesis. This represents an increase of about 25% over previously known participants. Nearly half of these newly characterized proteins are conserved in mammals, including several orthologs known to be involved in human disease. Mutations in many of these genes demonstrate statistically significant mitochondrial transmission phenotypes more subtle than could be detected by traditional genetic screens or high-throughput techniques, and 47 have not been previously localized to mitochondria. We further characterized a subset of these genes using growth profiling and dual immunofluorescence, which identified genes specifically required for aerobic respiration and an uncharacterized cytoplasmic protein required for normal mitochondrial motility. Our results demonstrate that by leveraging computational analysis to direct quantitative experimental assays, we have characterized mutants with subtle mitochondrial defects whose phenotypes were undetected by high-throughput methods. Mitochondria are the proverbial powerhouses of the cell, running the fundamental biochemical processes that produce energy from nutrients using oxygen. These processes are conserved in all eukaryotes, from humans to model organisms such as baker's yeast. In humans, mitochondrial dysfunction plays a role in a variety of diseases, including diabetes, neuromuscular disorders, and aging. In order to better understand fundamental mitochondrial biology, we studied genes involved in mitochondrial biogenesis in the yeast S. cerevisiae, discovering over 100 proteins with novel roles in this process. These experiments assigned function to 5% of the genes whose function was not known. In order to achieve this rapid rate of discovery, we developed a system incorporating highly quantitative experimental assays and an integrated, iterative process of computational protein function prediction. Beginning from relatively little prior knowledge, we found that computational predictions achieved about 60% accuracy and rapidly guided our laboratory work towards hundreds of promising candidate genes. Thus, in addition to providing a more thorough understanding of mitochondrial biology, this study establishes a framework for successfully integrating computation and experimentation to drive biological discovery. A companion manuscript, published in PLoS Computational Biology (doi:10.1371/journal.pcbi.1000322), discusses observations and conclusions important for the computational community.
Collapse
Affiliation(s)
- David C. Hess
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
| | - Chad L. Myers
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Curtis Huttenhower
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
| | - Matthew A. Hibbs
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
| | - Alicia P. Hayes
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
| | - Jadine Paw
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - John J. Clore
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
| | - Rosa M. Mendoza
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
| | - Bryan San Luis
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Corey Nislow
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Guri Giaever
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Michael Costanzo
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Olga G. Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- * E-mail: (OGT); (AAC)
| | - Amy A. Caudy
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- * E-mail: (OGT); (AAC)
| |
Collapse
|
36
|
Abstract
Recent advances in biotechnology have produced a wealth of genomic data, which capture a variety of complementary cellular features. While these data promise to yield key insights into molecular biology, much of the available information remains underutilized because of the lack of scalable approaches for integrating signals across large, diverse data sets. A proper framework for capturing these numerous snapshots of complementary phenomena under a variety of conditions can provide the holistic view necessary for developing precise systems-level hypotheses. Here we describe bioPIXIE, a system for combining information from diverse genomic data sets to predict biological networks. bioPIXIE utilizes a Bayesian framework for probabilistic integration of several high-throughput genomic data types including gene expression, protein-protein interactions, genetic interactions, protein localization, and sequence data to predict biological networks. The main purpose of the system is to support user-driven exploration through the inferred functional network, which is enabled by a public, web-based interface. We describe the features and supporting methods of this integration and discovery framework and present case examples where bioPIXIE has been used to generate specific, testable hypotheses for Saccharomyces cerevisiae, many of which have been confirmed experimentally.
Collapse
|
37
|
Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc Natl Acad Sci U S A 2008; 105:12763-8. [PMID: 18725631 DOI: 10.1073/pnas.0806627105] [Citation(s) in RCA: 211] [Impact Index Per Article: 13.2] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Protein-protein interactions (PPIs) and their networks play a central role in all biological processes. Akin to the complete sequencing of genomes and their comparative analysis, complete descriptions of interactomes and their comparative analysis is fundamental to a deeper understanding of biological processes. A first step in such an analysis is to align two or more PPI networks. Here, we introduce an algorithm, IsoRank, for global alignment of multiple PPI networks. The guiding intuition here is that a protein in one PPI network is a good match for a protein in another network if their respective sequences and neighborhood topologies are a good match. We encode this intuition as an eigenvalue problem in a manner analogous to Google's PageRank method. Using IsoRank, we compute a global alignment of the Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Mus musculus, and Homo sapiens PPI networks. We demonstrate that incorporating PPI data in ortholog prediction results in improvements over existing sequence-only approaches and over predictions from local alignments of the yeast and fly networks. Previous methods have been effective at identifying conserved, localized network patterns across pairs of networks. This work takes the further step of performing a global alignment of multiple PPI networks. It simultaneously uses sequence similarity and network data and, unlike previous approaches, explicitly models the tradeoff inherent in combining them. We expect IsoRank-with its simultaneous handling of node similarity and network similarity-to be applicable across many scientific domains.
Collapse
|
38
|
Jiang X, Nariai N, Steffen M, Kasif S, Kolaczyk ED. Integration of relational and hierarchical network information for protein function prediction. BMC Bioinformatics 2008; 9:350. [PMID: 18721473 PMCID: PMC2535605 DOI: 10.1186/1471-2105-9-350] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2008] [Accepted: 08/22/2008] [Indexed: 11/22/2022] Open
Abstract
Background In the current climate of high-throughput computational biology, the inference of a protein's function from related measurements, such as protein-protein interaction relations, has become a canonical task. Most existing technologies pursue this task as a classification problem, on a term-by-term basis, for each term in a database, such as the Gene Ontology (GO) database, a popular rigorous vocabulary for biological functions. However, ontology structures are essentially hierarchies, with certain top to bottom annotation rules which protein function predictions should in principle follow. Currently, the most common approach to imposing these hierarchical constraints on network-based classifiers is through the use of transitive closure to predictions. Results We propose a probabilistic framework to integrate information in relational data, in the form of a protein-protein interaction network, and a hierarchically structured database of terms, in the form of the GO database, for the purpose of protein function prediction. At the heart of our framework is a factorization of local neighborhood information in the protein-protein interaction network across successive ancestral terms in the GO hierarchy. We introduce a classifier within this framework, with computationally efficient implementation, that produces GO-term predictions that naturally obey a hierarchical 'true-path' consistency from root to leaves, without the need for further post-processing. Conclusion A cross-validation study, using data from the yeast Saccharomyces cerevisiae, shows our method offers substantial improvements over both standard 'guilt-by-association' (i.e., Nearest-Neighbor) and more refined Markov random field methods, whether in their original form or when post-processed to artificially impose 'true-path' consistency. Further analysis of the results indicates that these improvements are associated with increased predictive capabilities (i.e., increased positive predictive value), and that this increase is consistent uniformly with GO-term depth. Additional in silico validation on a collection of new annotations recently added to GO confirms the advantages suggested by the cross-validation study. Taken as a whole, our results show that a hierarchical approach to network-based protein function prediction, that exploits the ontological structure of protein annotation databases in a principled manner, can offer substantial advantages over the successive application of 'flat' network-based methods.
Collapse
Affiliation(s)
- Xiaoyu Jiang
- Department of Mathematics and Statistics, Boston University, Boston, MA 02215, USA.
| | | | | | | | | |
Collapse
|
39
|
Aguilar-Ruiz JS, Moore JH, Ritchie MD. Filling the gap between biology and computer science. BioData Min 2008; 1:1. [PMID: 18822148 PMCID: PMC2547862 DOI: 10.1186/1756-0381-1-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2008] [Accepted: 07/17/2008] [Indexed: 01/07/2023] Open
Abstract
This editorial introduces BioData Mining, a new journal which publishes research articles related to advances in computational methods and techniques for the extraction of useful knowledge from heterogeneous biological data. We outline the aims and scope of the journal, introduce the publishing model and describe the open peer review policy, which fosters interaction within the research community.
Collapse
|
40
|
Peña-Castillo L, Tasan M, Myers CL, Lee H, Joshi T, Zhang C, Guan Y, Leone M, Pagnani A, Kim WK, Krumpelman C, Tian W, Obozinski G, Qi Y, Mostafavi S, Lin GN, Berriz GF, Gibbons FD, Lanckriet G, Qiu J, Grant C, Barutcuoglu Z, Hill DP, Warde-Farley D, Grouios C, Ray D, Blake JA, Deng M, Jordan MI, Noble WS, Morris Q, Klein-Seetharaman J, Bar-Joseph Z, Chen T, Sun F, Troyanskaya OG, Marcotte EM, Xu D, Hughes TR, Roth FP. A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biol 2008; 9 Suppl 1:S2. [PMID: 18613946 PMCID: PMC2447536 DOI: 10.1186/gb-2008-9-s1-s2] [Citation(s) in RCA: 197] [Impact Index Per Article: 12.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated. RESULTS In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%. CONCLUSION We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.
Collapse
Affiliation(s)
- Lourdes Peña-Castillo
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S3E1, Canada
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
41
|
Evidence-based annotation of the malaria parasite's genome using comparative expression profiling. PLoS One 2008; 3:e1570. [PMID: 18270564 PMCID: PMC2215772 DOI: 10.1371/journal.pone.0001570] [Citation(s) in RCA: 72] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2007] [Accepted: 01/09/2008] [Indexed: 11/19/2022] Open
Abstract
A fundamental problem in systems biology and whole genome sequence analysis is how to infer functions for the many uncharacterized proteins that are identified, whether they are conserved across organisms of different phyla or are phylum-specific. This problem is especially acute in pathogens, such as malaria parasites, where genetic and biochemical investigations are likely to be more difficult. Here we perform comparative expression analysis on Plasmodium parasite life cycle data derived from P. falciparum blood, sporozoite, zygote and ookinete stages, and P. yoelii mosquito oocyst and salivary gland sporozoites, blood and liver stages and show that type II fatty acid biosynthesis genes are upregulated in liver and insect stages relative to asexual blood stages. We also show that some universally uncharacterized genes with orthologs in Plasmodium species, Saccharomyces cerevisiae and humans show coordinated transcription patterns in large collections of human and yeast expression data and that the function of the uncharacterized genes can sometimes be predicted based on the expression patterns across these diverse organisms. We also use a comprehensive and unbiased literature mining method to predict which uncharacterized parasite-specific genes are likely to have roles in processes such as gliding motility, host-cell interactions, sporozoite stage, or rhoptry function. These analyses, together with protein-protein interaction data, provide probabilistic models that predict the function of 926 uncharacterized malaria genes and also suggest that malaria parasites may provide a simple model system for the study of some human processes. These data also provide a foundation for further studies of transcriptional regulation in malaria parasites.
Collapse
|
42
|
Chua HN, Sung WK, Wong L. An efficient strategy for extensive integration of diverse biological data for protein function prediction. Bioinformatics 2007; 23:3364-73. [DOI: 10.1093/bioinformatics/btm520] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
|