1
Giudice G, Chen H, Koutsandreas T, Petsalaki E. phuEGO: A Network-Based Method to Reconstruct Active Signaling Pathways From Phosphoproteomics Datasets. Mol Cell Proteomics 2024; 23:100771. [PMID: 38642805 PMCID: PMC11134849 DOI: 10.1016/j.mcpro.2024.100771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Revised: 04/08/2024] [Accepted: 04/17/2024] [Indexed: 04/22/2024] Open
Abstract
Signaling networks are critical for virtually all cell functions. Our current knowledge of cell signaling has been summarized in signaling pathway databases, which, while useful, are highly biased toward well-studied processes, and do not capture context specific network wiring or pathway cross-talk. Mass spectrometry-based phosphoproteomics data can provide a more unbiased view of active cell signaling processes in a given context, however, it suffers from low signal-to-noise ratio and poor reproducibility across experiments. While progress in methods to extract active signaling signatures from such data has been made, there are still limitations with respect to balancing bias and interpretability. Here we present phuEGO, which combines up-to-three-layer network propagation with ego network decomposition to provide small networks comprising active functional signaling modules. PhuEGO boosts the signal-to-noise ratio from global phosphoproteomics datasets, enriches the resulting networks for functional phosphosites and allows the improved comparison and integration across datasets. We applied phuEGO to five phosphoproteomics data sets from cell lines collected upon infection with SARS CoV2. PhuEGO was better able to identify common active functions across datasets and to point to a subnetwork enriched for known COVID-19 targets. Overall, phuEGO provides a flexible tool to the community for the improved functional interpretation of global phosphoproteomics datasets.
Affiliation(s)
- Girolamo Giudice
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridgeshire, United Kingdom
- Haoqi Chen
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridgeshire, United Kingdom
- Thodoris Koutsandreas
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridgeshire, United Kingdom
- Evangelia Petsalaki
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridgeshire, United Kingdom.
2
De Silva K, Demmer RT, Jönsson D, Mousa A, Forbes A, Enticott J. Highly perturbed genes and hub genes associated with type 2 diabetes in different tissues of adult humans: a bioinformatics analytic workflow. Funct Integr Genomics 2022; 22:1003-1029. [PMID: 35788821 PMCID: PMC9255467 DOI: 10.1007/s10142-022-00881-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2021] [Revised: 06/19/2022] [Accepted: 06/24/2022] [Indexed: 11/28/2022]
Abstract
Type 2 diabetes (T2D) has a complex etiology which is not yet fully elucidated. The identification of gene perturbations and hub genes of T2D may deepen our understanding of its genetic basis. We aimed to identify highly perturbed genes and hub genes associated with T2D via an extensive bioinformatics analytic workflow consisting of five steps: systematic review of Gene Expression Omnibus and associated literature; identification and classification of differentially expressed genes (DEGs); identification of highly perturbed genes via meta-analysis; identification of hub genes via network analysis; and downstream analysis of highly perturbed genes and hub genes. Three meta-analytic strategies, random effects model, vote-counting approach, and p value combining approach, were applied. Hub genes were defined as those nodes having above-average betweenness, closeness, and degree in the network. Downstream analyses included gene ontologies, Kyoto Encyclopedia of Genes and Genomes pathways, metabolomics, COVID-19-related gene sets, and Genotype-Tissue Expression profiles. Analysis of 27 eligible microarrays identified 6284 DEGs (4592 downregulated and 1692 upregulated) in four tissue types. Tissue-specific gene expression was significantly greater than tissue non-specific (shared) gene expression. Analyses revealed 79 highly perturbed genes and 28 hub genes. Downstream analyses identified enrichments of shared genes with certain other diabetes phenotypes; insulin synthesis and action-related pathways and metabolomics; mechanistic associations with apoptosis and immunity-related pathways; COVID-19-related gene sets; and cell types demonstrating over- and under-expression of marker genes of T2D. Our approach provided valuable insights on T2D pathogenesis and pathophysiological manifestations. Broader utility of this pipeline beyond T2D is envisaged.
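The hub-gene criterion stated in this abstract (above-average betweenness, closeness, and degree) can be illustrated with a minimal sketch. This is not the authors' code: the toy graph, the node labels, and the function names are hypothetical, and betweenness is computed with Brandes' algorithm on an unweighted, undirected graph as an assumption about the network type.

```python
from collections import deque

def closeness(adj):
    """Closeness = (reachable nodes) / (sum of shortest-path distances)."""
    scores = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return scores

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph."""
    scores = {n: 0.0 for n in adj}
    for s in adj:
        stack, preds = [], {n: [] for n in adj}
        sigma = {n: 0 for n in adj}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            stack.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {n: 0.0 for n in adj}
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                scores[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {n: v / 2 for n, v in scores.items()}

def hub_nodes(adj):
    """Nodes whose degree, closeness, and betweenness all exceed the mean."""
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    hubs = set(adj)
    for measure in (degree, closeness(adj), betweenness(adj)):
        mean = sum(measure.values()) / len(measure)
        hubs &= {n for n, v in measure.items() if v > mean}
    return sorted(hubs)

# Hypothetical toy network: A is linked to B, C, D; D is linked to E.
toy = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"},
       "D": {"A", "E"}, "E": {"D"}}
print(hub_nodes(toy))
```

Requiring all three measures at once is a conservative choice: a node with many neighbours but no bridging role is excluded.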
Affiliation(s)
- Kushan De Silva
- Monash Centre for Health Research and Implementation, School of Public Health and Preventive Medicine, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Clayton, 3168, Australia.
- Ryan T Demmer
- Division of Epidemiology and Community Health, School of Public Health, University of Minnesota, Minneapolis, MN, USA.,Mailman School of Public Health, Columbia University, New York, NY, USA
- Daniel Jönsson
- Department of Periodontology, Faculty of Odontology, Malmö University, 21119, Malmö, Sweden.,Department of Clinical Sciences, Lund University, 21428, Malmö, Sweden
- Aya Mousa
- Monash Centre for Health Research and Implementation, School of Public Health and Preventive Medicine, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Clayton, 3168, Australia
- Andrew Forbes
- Biostatistics Unit, Division of Research Methodology, School of Public Health and Preventive Medicine, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Melbourne, 3004, Australia
- Joanne Enticott
- Monash Centre for Health Research and Implementation, School of Public Health and Preventive Medicine, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Clayton, 3168, Australia
3
Mancuso CA, Bills PS, Krum D, Newsted J, Liu R, Krishnan A. GenePlexus: a web-server for gene discovery using network-based machine learning. Nucleic Acids Res 2022; 50:W358-W366. [PMID: 35580053 PMCID: PMC9252732 DOI: 10.1093/nar/gkac335] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/08/2022] [Revised: 04/13/2022] [Accepted: 04/30/2022] [Indexed: 11/28/2022] Open
Abstract
Biomedical researchers take advantage of high-throughput, high-coverage technologies to routinely generate sets of genes of interest across a wide range of biological conditions. Although these technologies have directly shed light on the molecular underpinnings of various biological processes and diseases, the list of genes from any individual experiment is often noisy and incomplete. Additionally, interpreting these lists of genes can be challenging in terms of how they are related to each other and to other genes in the genome. In this work, we present GenePlexus (https://www.geneplexus.net/), a web-server that allows a researcher to utilize a powerful, network-based machine learning method to gain insights into their gene set of interest and additional functionally similar genes. Once a user uploads their own set of human genes and chooses between a number of different human network representations, GenePlexus provides predictions of how associated every gene in the network is to the input set. The web-server also provides interpretability through network visualization and comparison to other machine learning models trained on thousands of known process/pathway and disease gene sets. GenePlexus is free and open to all users without the need for registration.
Affiliation(s)
- Christopher A Mancuso
- Department Of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Patrick S Bills
- Data Management and Analytics, IT Services, Michigan State University, East Lansing, MI 48824, USA
- Douglas Krum
- Data Management and Analytics, IT Services, Michigan State University, East Lansing, MI 48824, USA
- Jacob Newsted
- Data Management and Analytics, IT Services, Michigan State University, East Lansing, MI 48824, USA
- Renming Liu
- Department Of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Arjun Krishnan
- Department Of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.,Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
4
Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet 2022; 23:169-181. [PMID: 34837041 DOI: 10.1038/s41576-021-00434-9] [Citation(s) in RCA: 66] [Impact Index Per Article: 33.0] [Accepted: 10/28/2021] [Indexed: 11/08/2022]
Abstract
The scale of genetic, epigenomic, transcriptomic, cheminformatic and proteomic data available today, coupled with easy-to-use machine learning (ML) toolkits, has propelled the application of supervised learning in genomics research. However, the assumptions behind the statistical models and performance evaluations in ML software frequently are not met in biological systems. In this Review, we illustrate the impact of several common pitfalls encountered when applying supervised ML in genomics. We explore how the structure of genomics data can bias performance evaluations and predictions. To address the challenges associated with applying cutting-edge ML methods to genomics, we describe solutions and appropriate use cases where ML modelling shows great potential.
5
Artificial Intelligence and Cardiovascular Genetics. Life (Basel) 2022; 12:life12020279. [PMID: 35207566 PMCID: PMC8875522 DOI: 10.3390/life12020279] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 12/30/2021] [Revised: 01/26/2022] [Accepted: 02/09/2022] [Indexed: 12/13/2022] Open
Abstract
Polygenic diseases, which are genetic disorders caused by the combined action of multiple genes, pose unique and significant challenges for the diagnosis and management of affected patients. A major goal of cardiovascular medicine has been to understand how genetic variation leads to the clinical heterogeneity seen in polygenic cardiovascular diseases (CVDs). Recent advances and emerging technologies in artificial intelligence (AI), coupled with the ever-increasing availability of next generation sequencing (NGS) technologies, now provide researchers with unprecedented possibilities for dynamic and complex biological genomic analyses. Combining these technologies may lead to a deeper understanding of heterogeneous polygenic CVDs, better prognostic guidance, and, ultimately, greater personalized medicine. Advances will likely be achieved through increasingly frequent and robust genomic characterization of patients, as well as the integration of genomic data with other clinical data, such as cardiac imaging, coronary angiography, and clinical biomarkers. This review discusses the current opportunities and limitations of genomics; provides a brief overview of AI; and identifies the current applications, limitations, and future directions of AI in genomics.
6
Gunning M, Pavlidis P. "Guilt by association" is not competitive with genetic association for identifying autism risk genes. Sci Rep 2021; 11:15950. [PMID: 34354131 PMCID: PMC8342445 DOI: 10.1038/s41598-021-95321-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/22/2021] [Accepted: 07/16/2021] [Indexed: 12/25/2022] Open
Abstract
Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data, and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: can machine learning aid in the discovery of disease genes? We collected 13 published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, like number of association publications. We found that ML methods which do not incorporate genetics information have limited utility for prioritization of ASD risk genes. These studies perform at a comparable level to generic measures of likelihood for the involvement of genes in any condition, and do not out-perform genetic association studies. Future efforts to discover disease genes should be focused on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than developing complex machine learning methods using complex heterogeneous biological data with unknown reliability.
Affiliation(s)
- Margot Gunning
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Department of Psychiatry, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Graduate Program in Bioinformatics, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Paul Pavlidis
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada.
- Department of Psychiatry, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada.
- Djavad Mowafaghian Centre for Brain Health, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada.
7
Structure-based protein function prediction using graph convolutional networks. Nat Commun 2021; 12:3168. [PMID: 34039967 PMCID: PMC8155034 DOI: 10.1038/s41467-021-23303-9] [Citation(s) in RCA: 217] [Impact Index Per Article: 72.3] [Received: 09/21/2020] [Accepted: 04/22/2021] [Indexed: 02/04/2023] Open
Abstract
The rapid increase in the number of proteins in sequence databases and the diversity of their functions challenge computational approaches for automated function prediction. Here, we introduce DeepFRI, a Graph Convolutional Network for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures. It outperforms current leading methods and sequence-based Convolutional Neural Networks and scales to the size of current sequence repositories. Augmenting the training set of experimental structures with homology models allows us to significantly expand the number of predictable functions. DeepFRI has significant de-noising capability, with only a minor drop in performance when experimental structures are replaced by protein models. Class activation mapping allows function predictions at an unprecedented resolution, allowing site-specific annotations at the residue-level in an automated manner. We show the utility and high performance of our method by annotating structures from the PDB and SWISS-MODEL, making several new confident function predictions. DeepFRI is available as a webserver at https://beta.deepfri.flatironinstitute.org/ .
8
Ietswaart R, Gyori BM, Bachman JA, Sorger PK, Churchman LS. GeneWalk identifies relevant gene functions for a biological context using network representation learning. Genome Biol 2021; 22:55. [PMID: 33526072 PMCID: PMC7852222 DOI: 10.1186/s13059-021-02264-8] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 07/15/2020] [Accepted: 01/05/2021] [Indexed: 12/13/2022] Open
Abstract
A bottleneck in high-throughput functional genomics experiments is identifying the most important genes and their relevant functions from a list of gene hits. Gene Ontology (GO) enrichment methods provide insight at the gene set level. Here, we introduce GeneWalk ( github.com/churchmanlab/genewalk ) that identifies individual genes and their relevant functions critical for the experimental setting under examination. After the automatic assembly of an experiment-specific gene regulatory network, GeneWalk uses representation learning to quantify the similarity between vector representations of each gene and its GO annotations, yielding annotation significance scores that reflect the experimental context. By performing gene- and condition-specific functional analysis, GeneWalk converts a list of genes into data-driven hypotheses.
Affiliation(s)
- Robert Ietswaart
- Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA, 02115, USA
- Benjamin M Gyori
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, 02115, USA
- John A Bachman
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, 02115, USA
- Peter K Sorger
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, 02115, USA
- L Stirling Churchman
- Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA, 02115, USA.
9
Perlasca P, Frasca M, Ba CT, Gliozzo J, Notaro M, Pennacchioni M, Valentini G, Mesiti M. Multi-resolution visualization and analysis of biomolecular networks through hierarchical community detection and web-based graphical tools. PLoS One 2020; 15:e0244241. [PMID: 33351828 PMCID: PMC7755227 DOI: 10.1371/journal.pone.0244241] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/18/2020] [Accepted: 12/04/2020] [Indexed: 11/19/2022] Open
Abstract
The visual exploration and analysis of biomolecular networks is of paramount importance for identifying hidden and complex interaction patterns among proteins. Although many tools have been proposed for this task, they are mainly focused on the query and visualization of a single protein with its neighborhood. The global exploration of the entire network and the interpretation of its underlying structure still remains difficult, mainly due to the excessively large size of the biomolecular networks. In this paper we propose a novel multi-resolution representation and exploration approach that exploits hierarchical community detection algorithms for the identification of communities occurring in biomolecular networks. The proposed graphical rendering combines two types of nodes (protein and communities) and three types of edges (protein-protein, community-community, protein-community), and displays communities at different resolutions, allowing the user to interactively zoom in and out from different levels of the hierarchy. Links among communities are shown in terms of relationships and functional correlations among the biomolecules they contain. This form of navigation can be also combined by the user with a vertex centric visualization for identifying the communities holding a target biomolecule. Since communities gather limited-size groups of correlated proteins, the visualization and exploration of complex and large networks becomes feasible on off-the-shelf computer machines. The proposed graphical exploration strategies have been implemented and integrated in UNIPred-Web, a web application that we recently introduced for combining the UNIPred algorithm, able to address both integration and protein function prediction in an imbalance-aware fashion, with an easy to use vertex-centric exploration of the integrated network. The tool has been deeply amended from different standpoints, including the prediction core algorithm. 
Several tests on networks of different size and connectivity have been conducted to show off the vast potential of our methodology; moreover, enrichment analyses have been performed to assess the biological meaningfulness of detected communities. Finally, a CoV-human network has been embedded in the system, and a corresponding case study presented, including the visualization and the prediction of human host proteins that potentially interact with SARS-CoV2 proteins.
Affiliation(s)
- Paolo Perlasca
- AnacletoLab, Department of Computer Science, University of Milan, Milan, Italy
- Marco Frasca
- AnacletoLab, Department of Computer Science, University of Milan, Milan, Italy
- Cheick Tidiane Ba
- AnacletoLab, Department of Computer Science, University of Milan, Milan, Italy
- Jessica Gliozzo
- Neuroradiology Unit, IRCCS San Raffaele Hospital, Milan, Italy
- Marco Notaro
- AnacletoLab, Department of Computer Science, University of Milan, Milan, Italy
- Mario Pennacchioni
- AnacletoLab, Department of Computer Science, University of Milan, Milan, Italy
- Giorgio Valentini
- AnacletoLab, Department of Computer Science, University of Milan, Milan, Italy
- CINI National Laboratory in Artificial Intelligence and Intelligent Systems—AIIS, Rome, Italy
- Marco Mesiti
- AnacletoLab, Department of Computer Science, University of Milan, Milan, Italy
10
Lee J, Shah M, Ballouz S, Crow M, Gillis J. CoCoCoNet: conserved and comparative co-expression across a diverse set of species. Nucleic Acids Res 2020; 48:W566-W571. [PMID: 32392296 PMCID: PMC7319556 DOI: 10.1093/nar/gkaa348] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 03/12/2020] [Revised: 04/21/2020] [Accepted: 04/24/2020] [Indexed: 12/19/2022] Open
Abstract
Co-expression analysis has provided insight into gene function in organisms from Arabidopsis to zebrafish. Comparison across species has the potential to enrich these results, for example by prioritizing among candidate human disease genes based on their network properties or by finding alternative model systems where their co-expression is conserved. Here, we present CoCoCoNet as a tool for identifying conserved gene modules and comparing co-expression networks. CoCoCoNet is a resource for both data and methods, providing gold standard networks and sophisticated tools for on-the-fly comparative analyses across 14 species. We show how CoCoCoNet can be used in two use cases. In the first, we demonstrate deep conservation of a nucleolus gene module across very divergent organisms, and in the second, we show how the heterogeneity of autism mechanisms in humans can be broken down by functional groups and translated to model organisms. CoCoCoNet is free to use and available to all at https://milton.cshl.edu/CoCoCoNet, with data and R scripts available at ftp://milton.cshl.edu/data.
Affiliation(s)
- John Lee
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Blvd., Woodbury, NY 11797, USA
- Manthan Shah
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Blvd., Woodbury, NY 11797, USA
- Sara Ballouz
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Blvd., Woodbury, NY 11797, USA
- Megan Crow
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Blvd., Woodbury, NY 11797, USA
- Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Blvd., Woodbury, NY 11797, USA
11
Sethi S, Vorontsov IE, Kulakovskiy IV, Greenaway S, Williams J, Makeev VJ, Brown SDM, Simon MM, Mallon AM. A holistic view of mouse enhancer architectures reveals analogous pleiotropic effects and correlation with human disease. BMC Genomics 2020; 21:754. [PMID: 33138777 PMCID: PMC7607678 DOI: 10.1186/s12864-020-07109-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/12/2019] [Accepted: 09/29/2020] [Indexed: 01/18/2023] Open
Abstract
BACKGROUND Efforts to elucidate the function of enhancers in vivo are underway but their vast numbers alongside differing enhancer architectures make it difficult to determine their impact on gene activity. By systematically annotating multiple mouse tissues with super- and typical-enhancers, we have explored their relationship with gene function and phenotype. RESULTS Though super-enhancers drive high total- and tissue-specific expression of their associated genes, we find that typical-enhancers also contribute heavily to the tissue-specific expression landscape on account of their large numbers in the genome. Unexpectedly, we demonstrate that both enhancer types are preferentially associated with relevant 'tissue-type' phenotypes and exhibit no difference in phenotype effect size or pleiotropy. Modelling regulatory data alongside molecular data, we built a predictive model to infer gene-phenotype associations and use this model to predict potentially novel disease-associated genes. CONCLUSION Overall our findings reveal that differing enhancer architectures have a similar impact on mammalian phenotypes whilst harbouring differing cellular and expression effects. Together, our results systematically characterise enhancers with predicted phenotypic traits endorsing the role for both types of enhancers in human disease and disorders.
Affiliation(s)
- Siddharth Sethi
- Mammalian Genetics Unit, MRC Harwell Institute, Oxfordshire, OX11 0RD, UK
- Ilya E Vorontsov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Gubkina 3, Moscow, 119991, Russia
- Institute of Protein Research, Russian Academy of Sciences, Institutskaya 4, Pushchino, Moscow Region, 142290, Russia
- Ivan V Kulakovskiy
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Gubkina 3, Moscow, 119991, Russia
- Institute of Protein Research, Russian Academy of Sciences, Institutskaya 4, Pushchino, Moscow Region, 142290, Russia
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilova 32, Moscow, 119991, Russia
- Simon Greenaway
- Mammalian Genetics Unit, MRC Harwell Institute, Oxfordshire, OX11 0RD, UK
- John Williams
- Mammalian Genetics Unit, MRC Harwell Institute, Oxfordshire, OX11 0RD, UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, B15 2TH, UK
- Institute of Cancer and Genomic Sciences, University of Birmingham, Birmingham, B15 2TT, UK
- Vsevolod J Makeev
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Gubkina 3, Moscow, 119991, Russia
- Institute of Protein Research, Russian Academy of Sciences, Institutskaya 4, Pushchino, Moscow Region, 142290, Russia
- Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, 141700, Russia
- Steve D M Brown
- Mammalian Genetics Unit, MRC Harwell Institute, Oxfordshire, OX11 0RD, UK
- Michelle M Simon
- Mammalian Genetics Unit, MRC Harwell Institute, Oxfordshire, OX11 0RD, UK.
- Ann-Marie Mallon
- Mammalian Genetics Unit, MRC Harwell Institute, Oxfordshire, OX11 0RD, UK.
12
Liu R, Mancuso CA, Yannakopoulos A, Johnson KA, Krishnan A. Supervised learning is an accurate method for network-based gene classification. Bioinformatics 2020; 36:3457-3465. [PMID: 32129827 PMCID: PMC7267831 DOI: 10.1093/bioinformatics/btaa150] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 08/26/2019] [Revised: 12/01/2019] [Accepted: 02/27/2020] [Indexed: 12/22/2022] Open
Abstract
Background: Assigning every human gene to specific functions, diseases and traits is a grand challenge in modern genetics. Key to addressing this challenge are computational methods, such as supervised learning and label propagation, that can leverage molecular interaction networks to predict gene attributes. In spite of being a popular machine-learning technique across fields, supervised learning has been applied only in a few network-based studies for predicting pathway-, phenotype- or disease-associated genes. It is unknown how supervised learning broadly performs across different networks and diverse gene classification tasks, and how it compares to label propagation, the widely benchmarked canonical approach for this problem. Results: In this study, we present a comprehensive benchmarking of supervised learning for network-based gene classification, evaluating this approach and a classic label propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes. We demonstrate that supervised learning on a gene’s full network connectivity outperforms label propagation and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label propagation’s appeal for naturally using network topology. We further show that supervised learning on the full network is also superior to learning on node embeddings (derived using node2vec), an increasingly popular approach for concisely representing network connectivity. These results show that supervised learning is an accurate approach for prioritizing genes associated with diverse functions, diseases and traits and should be considered a staple of network-based gene classification workflows. Availability and implementation: The datasets and the code used to reproduce the results and add new gene classification methods have been made freely available.
Contact: arjun@msu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
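The two families of methods contrasted in this abstract can be sketched on a toy graph. The block below is an illustration only, not the benchmark code: it uses random walk with restart as one common form of label propagation (the paper's exact variant is not given here), and the function names, the `alpha` value, and the path graph are assumptions. The `adjacency_row` helper shows what "a gene's full network connectivity" looks like as a feature vector that a supervised classifier could consume.

```python
def label_propagation(adj, seeds, alpha=0.5, n_iter=200):
    """Random walk with restart on {node: set(neighbours)}.

    alpha is the probability of following an edge; (1 - alpha) is the
    probability of restarting at a seed gene. Both are assumed values.
    """
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in adj}
    scores = dict(restart)
    for _ in range(n_iter):
        updated = {}
        for n in adj:
            # Each neighbour m passes on an equal share of its score.
            spread = sum(scores[m] / len(adj[m]) for m in adj[n])
            updated[n] = (1 - alpha) * restart[n] + alpha * spread
        scores = updated
    return scores

def adjacency_row(adj, node):
    """Binary feature vector: the node's connections to every gene."""
    return [1 if other in adj[node] else 0 for other in sorted(adj)]

# Hypothetical path graph g0 - g1 - g2 - g3 - g4 with one known gene, g0.
path = {"g0": {"g1"}, "g1": {"g0", "g2"}, "g2": {"g1", "g3"},
        "g3": {"g2", "g4"}, "g4": {"g3"}}
scores = label_propagation(path, seeds={"g0"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
print(adjacency_row(path, "g1"))
```

In the propagation view, a gene's score depends only on how close it sits to the seeds; in the supervised view, the classifier is free to weight each column of the adjacency row separately.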
Affiliation(s)
- Renming Liu
- Department of Computational Mathematics, Science and Engineering
- Kayla A Johnson
- Department of Computational Mathematics, Science and Engineering
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Arjun Krishnan
- Department of Computational Mathematics, Science and Engineering
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- To whom correspondence should be addressed.
13
Li SS, Tian JM, Wei TH, Wang HR. Identification of key genes for type 1 diabetes mellitus by network-based guilt by association. Rev Assoc Med Bras (1992) 2020; 66:778-783. [PMID: 32696859 DOI: 10.1590/1806-9282.66.6.778] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/30/2019] [Accepted: 01/19/2020] [Indexed: 12/31/2022]
Abstract
OBJECTIVE This study aimed to propose a co-expression network (CEN)-based method of gene functional inference that extends the "Guilt by Association" (GBA) principle to predict candidate gene functions for type 1 diabetes mellitus (T1DM). METHODS First, transcriptome data of T1DM were retrieved from the genomics data repository for differentially expressed gene (DEG) analysis, and a weighted differential CEN was generated. The area under the receiver operating characteristics curve (AUC) was chosen as the performance metric for each Gene Ontology (GO) term. Differential expression analysis identified 325 DEGs in T1DM, and co-expression analysis generated a differential CEN with edge weight > 0.8. RESULTS A total of 282 GO annotations with more than 20 DEGs remained for functional inference. By calculating the multifunctionality score of genes, gene function inference was performed to identify the optimal gene functions for T1DM based on the optimal ranking gene list. Considering an AUC > 0.7, six optimal gene functions for T1DM were identified, such as regulation of immune system process and receptor activity. CONCLUSIONS CEN-based gene functional inference extending the GBA principle predicted six optimal gene functions for T1DM. The results may point to potential paths for therapeutic or preventive treatments of T1DM.
Affiliation(s)
- Shan-Shan Li
- Department of Endocrinology, Linyi People's Hospital, Linyi, China
- Jia-Mei Tian
- Department of Pediatric Internal Medicine, Linyi People's Hospital, Linyi, China
- Tong-Huan Wei
- Department of Internal Medicine, The People's Hospital of Linyi Hi-Tech Industrial Development Zone, Linyi, China
- Hao-Ren Wang
- Department of Internal Medicine, Linyi Luozhuang Central Hospital, Linyi, China
14
Liu R, Mancuso CA, Yannakopoulos A, Johnson KA, Krishnan A. Supervised learning is an accurate method for network-based gene classification. Bioinformatics 2020; 36:3457-3465. [PMID: 32129827 DOI: 10.1101/721423] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/26/2019] [Revised: 12/01/2019] [Accepted: 02/27/2020] [Indexed: 05/26/2023]
Abstract
BACKGROUND Assigning every human gene to specific functions, diseases and traits is a grand challenge in modern genetics. Key to addressing this challenge are computational methods, such as supervised learning and label propagation, that can leverage molecular interaction networks to predict gene attributes. In spite of being a popular machine-learning technique across fields, supervised learning has been applied only in a few network-based studies for predicting pathway-, phenotype- or disease-associated genes. It is unknown how supervised learning broadly performs across different networks and diverse gene classification tasks, and how it compares to label propagation, the widely benchmarked canonical approach for this problem. RESULTS In this study, we present a comprehensive benchmarking of supervised learning for network-based gene classification, evaluating this approach and a classic label propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes. We demonstrate that supervised learning on a gene's full network connectivity outperforms label propagation and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label propagation's appeal for naturally using network topology. We further show that supervised learning on the full network is also superior to learning on node embeddings (derived using node2vec), an increasingly popular approach for concisely representing network connectivity. These results show that supervised learning is an accurate approach for prioritizing genes associated with diverse functions, diseases and traits and should be considered a staple of network-based gene classification workflows. AVAILABILITY AND IMPLEMENTATION The datasets and the code used to reproduce the results and add new gene classification methods have been made freely available. CONTACT arjun@msu.edu.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
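The two approaches compared in this abstract can be illustrated on a toy graph. The sketch below is a minimal illustration, not the authors' code: it runs a random walk with restart, which is one common form of label propagation (the abstract does not say which variant was benchmarked), and it builds the adjacency-row feature vector that a supervised classifier trained on "full network connectivity" would plausibly take as input. The graph, gene names, function names and parameter values are all hypothetical.

```python
# Toy undirected network: two triangles joined by one edge (hypothetical genes).
ADJ = {
    "g1": {"g2", "g3"},
    "g2": {"g1", "g3"},
    "g3": {"g1", "g2", "g4"},
    "g4": {"g3", "g5", "g6"},
    "g5": {"g4", "g6"},
    "g6": {"g4", "g5"},
}


def label_propagation(adj, seeds, alpha=0.85, iters=200):
    """Random walk with restart from the seed (positively labeled) genes.

    alpha is the probability of following an edge; 1 - alpha is the
    probability of jumping back to a seed. Both values are assumptions.
    """
    nodes = list(adj)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iters):
        new_score = {}
        for n in nodes:
            # Each neighbor m spreads its current score evenly over its edges.
            inflow = sum(score[m] / len(adj[m]) for m in adj[n])
            new_score[n] = alpha * inflow + (1 - alpha) * restart[n]
        score = new_score
    return score


def adjacency_features(adj, node):
    """Row of the adjacency matrix for one gene, usable as a feature vector."""
    order = sorted(adj)
    return [1 if other in adj[node] else 0 for other in order]


scores = label_propagation(ADJ, seeds={"g1"})
features_g3 = adjacency_features(ADJ, "g3")
```

In this sketch the propagation score ranks unlabeled genes by proximity to the seeds, whereas the feature vector lets any standard classifier learn which connections are predictive of a label.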
Affiliation(s)
- Renming Liu
- Department of Computational Mathematics, Science and Engineering
- Kayla A Johnson
- Department of Computational Mathematics, Science and Engineering
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Arjun Krishnan
- Department of Computational Mathematics, Science and Engineering
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
15
Koo DCE, Bonneau R. Towards region-specific propagation of protein functions. Bioinformatics 2020; 35:1737-1744. [PMID: 30304483 PMCID: PMC6513163 DOI: 10.1093/bioinformatics/bty834] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/23/2018] [Revised: 08/23/2018] [Accepted: 10/08/2018] [Indexed: 01/06/2023] Open
Abstract
MOTIVATION Due to the nature of experimental annotation, most protein function prediction methods operate at the protein-level, where functions are assigned to full-length proteins based on overall similarities. However, most proteins function by interacting with other proteins or molecules, and many functional associations should be limited to specific regions rather than the entire protein length. Most domain-centric function prediction methods depend on accurate domain family assignments to infer relationships between domains and functions, with regions that are unassigned to a known domain-family left out of functional evaluation. Given the abundance of residue-level annotations currently available, we present a function prediction methodology that automatically infers function labels of specific protein regions using protein-level annotations and multiple types of region-specific features. RESULTS We apply this method to local features obtained from InterPro, UniProtKB and amino acid sequences and show that this method improves both the accuracy and region-specificity of protein function transfer and prediction. We compare region-level predictive performance of our method against that of a whole-protein baseline method using proteins with structurally verified binding sites and also compare protein-level temporal holdout predictive performances to expand the variety and specificity of GO terms we could evaluate. Our results can also serve as a starting point to categorize GO terms into region-specific and whole-protein terms and select prediction methods for different classes of GO terms. AVAILABILITY AND IMPLEMENTATION The code and features are freely available at: https://github.com/ek1203/rsfp. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Da Chen Emily Koo
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA
- Richard Bonneau
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Center for Data Science, New York University, New York, NY, USA
16
Zhao Y, Wang J, Chen J, Zhang X, Guo M, Yu G. A Literature Review of Gene Function Prediction by Modeling Gene Ontology. Front Genet 2020; 11:400. [PMID: 32391061 PMCID: PMC7193026 DOI: 10.3389/fgene.2020.00400] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 01/02/2020] [Accepted: 03/30/2020] [Indexed: 12/14/2022] Open
Abstract
Annotating the functional properties of gene products, i.e., RNAs and proteins, is a fundamental task in biology. The Gene Ontology database (GO) was developed to systematically describe the functional properties of gene products across species, and to facilitate the computational prediction of gene function. As GO is routinely updated, it serves as the gold standard and main knowledge source in functional genomics. Many gene function prediction methods making use of GO have been proposed. But no literature review has summarized these methods and the possibilities for future efforts from the perspective of GO. To bridge this gap, we review the existing methods with an emphasis on recent solutions. First, we introduce the conventions of GO and the widely adopted evaluation metrics for gene function prediction. Next, we summarize current methods of gene function prediction that apply GO in different ways, such as using hierarchical or flat inter-relationships between GO terms, compressing massive GO terms and quantifying semantic similarities. Although many efforts have improved performance by harnessing GO, we conclude that there remain many largely overlooked but important topics for future research.
Affiliation(s)
- Yingwen Zhao
- College of Computer and Information Science, Southwest University, Chongqing, China
- Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing, China
- Jian Chen
- State Key Laboratory of Agrobiotechnology and National Maize Improvement Center, China Agricultural University, Beijing, China
- Xiangliang Zhang
- CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
- Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing, China
- CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
17
Blatti C, Emad A, Berry MJ, Gatzke L, Epstein M, Lanier D, Rizal P, Ge J, Liao X, Sobh O, Lambert M, Post CS, Xiao J, Groves P, Epstein AT, Chen X, Srinivasan S, Lehnert E, Kalari KR, Wang L, Weinshilboum RM, Song JS, Jongeneel CV, Han J, Ravaioli U, Sobh N, Bushell CB, Sinha S. Knowledge-guided analysis of "omics" data using the KnowEnG cloud platform. PLoS Biol 2020; 18:e3000583. [PMID: 31971940 PMCID: PMC6977717 DOI: 10.1371/journal.pbio.3000583] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 05/29/2019] [Accepted: 12/19/2019] [Indexed: 12/19/2022] Open
Abstract
We present Knowledge Engine for Genomics (KnowEnG), a free-to-use computational system for analysis of genomics data sets, designed to accelerate biomedical discovery. It includes tools for popular bioinformatics tasks such as gene prioritization, sample clustering, gene set analysis, and expression signature analysis. The system specializes in "knowledge-guided" data mining and machine learning algorithms, in which user-provided data are analyzed in light of prior information about genes, aggregated from numerous knowledge bases and encoded in a massive "Knowledge Network." KnowEnG adheres to "FAIR" principles (findable, accessible, interoperable, and reusable): its tools are easily portable to diverse computing environments, run on the cloud for scalable and cost-effective execution, and are interoperable with other computing platforms. The analysis tools are made available through multiple access modes, including a web portal with specialized visualization modules. We demonstrate the KnowEnG system's potential value in democratization of advanced tools for the modern genomics era through several case studies that use its tools to recreate and expand upon the published analysis of cancer data sets.
Affiliation(s)
- Charles Blatti
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Amin Emad
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Department of Electrical and Computer Engineering, McGill University, Montreal, Canada
- Matthew J. Berry
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Lisa Gatzke
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Milt Epstein
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Daniel Lanier
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Pramod Rizal
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Jing Ge
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Xiaoxia Liao
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Omar Sobh
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Mike Lambert
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Corey S. Post
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Jinfeng Xiao
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Peter Groves
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Aidan T. Epstein
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Xi Chen
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Subhashini Srinivasan
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Erik Lehnert
- Seven Bridges Genomics, Charlestown, Massachusetts, United States of America
- Krishna R. Kalari
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Liewei Wang
- Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, Minnesota, United States of America
- Richard M. Weinshilboum
- Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, Minnesota, United States of America
- Jun S. Song
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Department of Physics, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- C. Victor Jongeneel
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Jiawei Han
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Umberto Ravaioli
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Nahil Sobh
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Colleen B. Bushell
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Saurabh Sinha
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
18
Plyusnin I, Holm L, Törönen P. Novel comparison of evaluation metrics for gene ontology classifiers reveals drastic performance differences. PLoS Comput Biol 2019; 15:e1007419. [PMID: 31682632 PMCID: PMC6855565 DOI: 10.1371/journal.pcbi.1007419] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 10/08/2018] [Revised: 11/14/2019] [Accepted: 09/24/2019] [Indexed: 11/18/2022] Open
Abstract
Automated protein annotation using the Gene Ontology (GO) plays an important role in the biosciences. Evaluation has always been considered central to developing novel annotation methods, but little attention has been paid to the evaluation metrics themselves. Evaluation metrics define how well an annotation method performs and allow methods to be ranked against one another. Unfortunately, most of these metrics were adopted from the machine learning literature without establishing whether they were appropriate for GO annotations. We propose a novel approach for comparing GO evaluation metrics called Artificial Dilution Series (ADS). Our approach uses existing annotation data to generate a series of annotation sets with different levels of correctness (referred to as their signal level). We calculate the evaluation metric being tested for each annotation set in the series, allowing us to identify whether it can separate different signal levels. Finally, we contrast these results with several false positive annotation sets, which are designed to expose systematic weaknesses in GO assessment. We compared 37 evaluation metrics for GO annotation using ADS and identified drastic differences between metrics. We show that some metrics struggle to differentiate between different signal levels, while others give erroneously high scores to the false positive data sets. Based on our findings, we provide guidelines on which evaluation metrics perform well with the Gene Ontology and propose improvements to several well-known evaluation metrics. In general, we argue that evaluation metrics should be tested for their performance and we provide software for this purpose (https://bitbucket.org/plyusnin/ads/). ADS is applicable to other areas of science where the evaluation of prediction results is non-trivial.
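The dilution idea described in this abstract can be sketched in a few lines. The example below is a heavily simplified illustration, not the ADS software: it replaces a growing fraction of correct annotations with incorrect ones to produce sets at different signal levels, then checks whether one candidate metric (the Jaccard index, chosen here only as an assumption) separates those levels. The GO identifiers are placeholders, and the real method also works with the GO hierarchy and dedicated false positive sets, which this sketch ignores.

```python
import random


def dilute(true_terms, all_terms, noise_fraction, rng):
    """Return an annotation set in which a fraction of true terms is swapped for wrong ones."""
    true_list = sorted(true_terms)
    n_noise = round(noise_fraction * len(true_list))
    kept = rng.sample(true_list, len(true_list) - n_noise)
    wrong_pool = sorted(set(all_terms) - set(true_terms))
    noise = rng.sample(wrong_pool, n_noise)
    return set(kept) | set(noise)


def jaccard(predicted, truth):
    """Candidate evaluation metric under test: overlap divided by union."""
    return len(predicted & truth) / len(predicted | truth)


# Placeholder term identifiers, not real annotations.
ALL_TERMS = [f"GO:{i:07d}" for i in range(1, 41)]
TRUTH = set(ALL_TERMS[:10])

rng = random.Random(0)  # fixed seed so the series is reproducible
noise_levels = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
series = [jaccard(dilute(TRUTH, ALL_TERMS, level, rng), TRUTH) for level in noise_levels]

# A metric that separates signal levels should fall as the noise level rises.
separates_levels = all(a > b for a, b in zip(series, series[1:]))
```

A metric for which `separates_levels` is false on such a series would be a poor candidate under this simplified criterion.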
Affiliation(s)
- Ilya Plyusnin
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Helsinki, Finland
- Liisa Holm
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Helsinki, Finland
- Research Programme in Organismal and Evolutionary Biology, Faculty of Biosciences, University of Helsinki, Helsinki, Finland
- Petri Törönen
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Helsinki, Finland
19
Gligorijevic V, Barot M, Bonneau R. deepNF: deep network fusion for protein function prediction. Bioinformatics 2019; 34:3873-3881. [PMID: 29868758 PMCID: PMC6223364 DOI: 10.1093/bioinformatics/bty440] [Citation(s) in RCA: 111] [Impact Index Per Article: 22.2] [Received: 11/24/2017] [Accepted: 05/28/2018] [Indexed: 01/10/2023] Open
Abstract
Motivation The prevalence of high-throughput experimental methods has resulted in an abundance of large-scale molecular and functional interaction networks. The connectivity of these networks provides a rich source of information for inferring functional annotations for genes and proteins. An important challenge has been to develop methods for combining these heterogeneous networks to extract useful protein feature representations for function prediction. Most of the existing approaches for network integration use shallow models that encounter difficulty in capturing complex and highly non-linear network structures. Thus, we propose deepNF, a network fusion method based on Multimodal Deep Autoencoders to extract high-level features of proteins from multiple heterogeneous interaction networks. Results We apply this method to combine STRING networks to construct a common low-dimensional representation containing high-level protein features. We use separate layers for different network types in the early stages of the multimodal autoencoder, later connecting all the layers into a single bottleneck layer from which we extract features to predict protein function. We compare the cross-validation and temporal holdout predictive performance of our method with state-of-the-art methods, including the recently proposed method Mashup. Our results show that our method outperforms previous methods for both human and yeast STRING networks. We also show substantial improvement in the performance of our method in predicting gene ontology terms of varying type and specificity. Availability and implementation deepNF is freely available at: https://github.com/VGligorijevic/deepNF. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Vladimir Gligorijevic
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Meet Barot
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Richard Bonneau
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA
- Center for Data Science, New York University, New York, NY, USA
20
Perlasca P, Frasca M, Ba CT, Notaro M, Petrini A, Casiraghi E, Grossi G, Gliozzo J, Valentini G, Mesiti M. UNIPred-Web: a web tool for the integration and visualization of biomolecular networks for protein function prediction. BMC Bioinformatics 2019; 20:422. [PMID: 31412768 PMCID: PMC6694573 DOI: 10.1186/s12859-019-2959-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 03/08/2019] [Accepted: 06/18/2019] [Indexed: 01/06/2023] Open
Abstract
Background One of the main issues in the automated protein function prediction (AFP) problem is the integration of multiple networked data sources. The UNIPred algorithm was proposed to efficiently integrate protein networks in a function-specific fashion, taking into account the imbalance that characterizes protein annotations, and subsequently to predict novel hypotheses about unannotated proteins. UNIPred is publicly available as R code, which may limit its use by non-expert users. Moreover, its application requires effort in acquiring and preparing the networks to be integrated. Finally, the UNIPred source code does not handle the visualization of the resulting consensus network, whereas suitable views of the network topology are necessary to explore and interpret existing protein relationships. Results We address the aforementioned issues by proposing UNIPred-Web, a user-friendly Web tool for the application of the UNIPred algorithm to a variety of biomolecular networks, already supplied by the system, and for the visualization and exploration of protein networks. We support different organisms and different types of networks (e.g., co-expression, shared domains and physical interaction networks). Users are supported in the different phases of the process, ranging from the selection of the networks and the protein function to be predicted, to the navigation of the integrated network. The system also supports the upload of user-defined protein networks. The vertex-centric and highly interactive approach of UNIPred-Web allows a focused exploration of specific proteins, and an interactive analysis of large sub-networks with only a few mouse clicks. Conclusions UNIPred-Web offers practical and intuitive (visual) guidance to biologists interested in gaining insights into protein biomolecular functions. UNIPred-Web provides facilities for the integration of networks, and supplies a framework for the imbalance-aware protein network integration of nine organisms, the prediction of thousands of GO protein functions, and an easy-to-use graphical interface for the visual analysis, navigation and interpretation of the integrated networks and of the functional predictions.
Affiliation(s)
- Paolo Perlasca
- Department of Computer Science, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
- Marco Frasca
- Department of Computer Science, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
- Cheick Tidiane Ba
- Department of Computer Science, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
- Marco Notaro
- Department of Computer Science, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
- Alessandro Petrini
- Department of Computer Science, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
- Elena Casiraghi
- Department of Computer Science, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
- Giuliano Grossi
- Department of Computer Science, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
- Jessica Gliozzo
- Department of Computer Science, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
- Fondazione IRCCS Ca' Granda - Ospedale Maggiore Policlinico, Università degli Studi di Milano, Via della Commenda 10, Milano, 20122, Italy
- Giorgio Valentini
- Department of Computer Science, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
- Marco Mesiti
- Department of Computer Science, Università degli Studi di Milano, Via Celoria 18, Milano, 20133, Italy
21
Li SS, Zhao XB, Tian JM, Wang HR, Wei TH. Prediction of seed gene function in progressive diabetic neuropathy by a network-based inference method. Exp Ther Med 2019; 17:4176-4182. [PMID: 31007748 PMCID: PMC6468912 DOI: 10.3892/etm.2019.7441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2018] [Accepted: 03/07/2019] [Indexed: 11/07/2022] Open
Abstract
The guilt by association (GBA) algorithm has been widely used to statistically predict gene functions, and network-based approaches increase the confidence and veracity of identifying molecular signatures for diseases. This work proposed a network-based GBA method, integrating the GBA algorithm with network analysis, to identify seed gene functions for progressive diabetic neuropathy (PDN). The prediction of seed gene functions comprised three steps: i) preparing gene lists and sets; ii) constructing a co-expression matrix (CEM) on gene lists by the Spearman correlation coefficient (SCC) method; and iii) predicting gene functions by the GBA algorithm. Ultimately, seed gene functions were selected according to the area under the receiver operating characteristics curve (AUC) index. A total of 79 differentially expressed genes (DEGs) and 40 background gene ontology (GO) terms were regarded as gene lists and sets for the subsequent analyses, respectively. The predicted results obtained from the network-based GBA approach showed that 27.5% of all gene sets had a good classification performance with AUC >0.5. Most significantly, 3 gene sets with AUC >0.6 were denoted as seed gene functions for PDN, including binding, molecular function and regulation of the metabolic process. In summary, we predicted 3 seed gene functions for PDN compared with non-progressors utilizing the network-based GBA algorithm. The findings provide insights into the pathological and molecular mechanisms underlying PDN.
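The three steps named in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the GBA step is approximated by neighbor voting (each gene is scored by the share of its co-expression weight that falls on genes in the set), absolute Spearman correlation is used as the edge weight, and the expression values, gene names and function names are invented for the example.

```python
def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def spearman(x, y):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def coexpression_matrix(expr):
    """Step ii: pairwise |Spearman| weights, excluding self-edges."""
    genes = list(expr)
    return {g: {h: abs(spearman(expr[g], expr[h])) for h in genes if h != g} for g in genes}


def gba_scores(cem, members):
    """Step iii (assumed neighbor voting): a gene never votes for itself."""
    scores = {}
    for gene, neighbors in cem.items():
        total = sum(neighbors.values())
        hits = sum(w for other, w in neighbors.items() if other in members)
        scores[gene] = hits / total if total else 0.0
    return scores


def auc(scores, members):
    """Probability that a set member outranks a non-member (ties count half)."""
    pos = [s for g, s in scores.items() if g in members]
    neg = [s for g, s in scores.items() if g not in members]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Step i: hypothetical gene list (expression across five samples) and one gene set.
EXPR = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 3, 4, 5, 7],
    "C": [1, 3, 2, 5, 4],
    "D": [5, 1, 4, 2, 3],
    "E": [3, 5, 1, 4, 2],
}
GENE_SET = {"A", "B"}

cem = coexpression_matrix(EXPR)
set_auc = auc(gba_scores(cem, GENE_SET), GENE_SET)
```

Repeating the last two lines for every gene set and keeping the sets whose AUC exceeds a chosen cutoff mirrors the selection described in the abstract.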
Affiliation(s)
- Shan-Shan Li
- Department of Endocrinology, Linyi People's Hospital, Linyi, Shandong 276000, P.R. China
- Xin-Bo Zhao
- Department of Endocrinology, Linyi People's Hospital, Linyi, Shandong 276000, P.R. China
- Jia-Mei Tian
- Department of Pediatrics, Linyi People's Hospital, Linyi, Shandong 276000, P.R. China
- Hao-Ren Wang
- Department of Medicine, Linyi Luozhuang Central Hospital, Linyi, Shandong 276000, P.R. China
- Tong-Huan Wei
- Department of Medicine, People's Hospital of Linyi High-Tech Industrial Development Zone, Linyi, Shandong 276000, P.R. China
22
Pan ZG, Zhang XZ, Zhang ZM, Dong YJ. Optimal pathways involved in the treatment of sevoflurane or propofol for patients undergoing coronary artery bypass graft surgery. Exp Ther Med 2019; 17:3637-3643. [PMID: 30988747 PMCID: PMC6447764 DOI: 10.3892/etm.2019.7354] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/21/2018] [Accepted: 02/14/2019] [Indexed: 01/02/2023] Open
Abstract
The cardio-protective mechanisms of sevoflurane and propofol remain unclear in patients undergoing coronary artery bypass grafting (CABG). We designed the present study to identify the optimal pathways by integrating the differential co-expressed network (DCN)-based guilt by association (GBA) principle, using the expression data of E-GEOD-4386 downloaded from EMBL-EBI. Differentially expressed genes (DEGs) were first identified, and then the DCN and sub-DCN were established. The seed pathways were predicted through the GBA principle using the area under the curve (AUC) for pathway categories, and the pathway terms with AUC >0.9 were defined as the seed pathways. KEGG pathway analysis was applied to the DEGs based on DAVID to detect significant pathways. The final optimal pathways were identified based on the traditional pathway analysis and the network-based pathway inference approach. There were 83 common, 99 sevoflurane-specific and 4 propofol-specific DEGs in the expression profile of atrial samples. Finally, 8 and 4 pathway terms with AUC >0.9 were identified and determined as the seed pathways in the propofol and sevoflurane groups, respectively. The TNF signaling pathway, NF-κB signaling pathway and NOD-like receptor signaling pathway were the common optimal pathways in these two groups. Only the cytokine-cytokine receptor interaction pathway was unique to sevoflurane, and no pathway was specific to propofol. Our results suggested that sevoflurane and propofol might synergistically possess some cardio-protective properties in patients undergoing CABG.
Affiliation(s)
- Zhen-Guo Pan
- Department of Anesthesiology, The Second People's Hospital of Liaocheng, Linqing, Shandong 252600, P.R. China
| | - Xi-Zeng Zhang
- Department of Anesthesiology, The Second People's Hospital of Liaocheng, Linqing, Shandong 252600, P.R. China
| | - Zhi-Mei Zhang
- Department of Anesthesiology, The Second People's Hospital of Liaocheng, Linqing, Shandong 252600, P.R. China
| | - Yun-Jie Dong
- Department of Medical Administration, The Second People's Hospital of Liaocheng, Linqing, Shandong 252600, P.R. China
| |
Collapse
|
23
|
He M, Lin Y, Xu Y. Identification of prognostic biomarkers in colorectal cancer using a long non-coding RNA-mediated competitive endogenous RNA network. Oncol Lett 2019; 17:2687-2694. [PMID: 30854042 PMCID: PMC6365949 DOI: 10.3892/ol.2019.9936] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 05/28/2018] [Accepted: 12/05/2018] [Indexed: 02/07/2023] Open
Abstract
Colorectal cancer (CRC) is a highly malignant gastrointestinal tumor with a poor prognosis. Long non-coding RNA (lncRNA) plays an important role in the progression and physiology of tumors, as it competes with endogenous RNAs, including miRNA and mRNA. In the present study, a multi-step computational method was used to build a CRC-related functional lncRNA-mediated competitive endogenous RNA (ceRNA) network (LMCN). lncRNAs with higher degree and betweenness centrality (BC) were screened out as hub lncRNAs. Functional enrichment analyses of the lncRNAs were then carried out using the Gene Ontology (GO) and Reactome pathway databases, based on the 'guilt by association' principle. As a result, lncRNAs in the LMCN displayed specific topological characteristics in accordance with the regulatory correlation of coding mRNAs in CRC pathology. HCP5, EPB41L4A-AS1, SNHG12, and LINC00649 were screened out as hub lncRNAs that were more significantly related to the development and prognosis of CRC. The hub lncRNAs in CRC were involved in the functions of cell cycle arrest, vacuolar transport and histone modification, and in the GPCR, signaling by Rho GTPases and axon guidance pathways, meaning that they might be potential biomarkers for the diagnosis, evaluation and gene-targeted therapy of CRC. Thus, the LMCN construction method could accelerate lncRNA discovery and therapeutic development in CRC.
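The hub-screening step described above can be illustrated with a minimal sketch. It assumes an unweighted, undirected network given as a dictionary of neighbor sets, computes betweenness centrality with Brandes' algorithm, and ranks nodes by degree with betweenness as a tie-breaker. The function names, the toy node labels and the ranking rule are illustrative assumptions; the abstract does not state how the two measures were combined.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalized betweenness (Brandes' algorithm) for an unweighted,
    undirected graph given as {node: set_of_neighbors}."""
    bc = {v: 0.0 for v in adjacency}
    for s in adjacency:
        stack = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}   # number of shortest paths from s
        dist = {v: -1 for v in adjacency}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                        # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adjacency}
        while stack:                        # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every undirected pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


def hub_nodes(adjacency, top_k=1):
    """Hypothetical ranking rule: degree first, betweenness as tie-breaker."""
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    bc = betweenness_centrality(adjacency)
    ranked = sorted(adjacency, key=lambda v: (degree[v], bc[v]), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Toy star-shaped network with made-up node labels.
    toy = {"lnc1": {"m1", "m2", "m3"}, "m1": {"lnc1"},
           "m2": {"lnc1"}, "m3": {"lnc1"}}
    print(hub_nodes(toy))
```

In practice a graph library would normally be used for this; the hand-written version is only meant to make the two centrality measures concrete.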
Affiliation(s)
- Minjie He
- Department of Medical Oncology, The First Affiliated Hospital of Kunming Medical University, Kunming, Yunnan 650000, P.R. China
- Yan Lin
- Department of Oncology, The Affiliated Traditional Chinese Medical Hospital of Xinjiang Medical University, Urumqi, Xinjiang 830000, P.R. China
- Yuzhen Xu
- Department of Gastrointestinal Surgery, Xuzhou Hospital Affiliated to Medical School of Southeast University, Xuzhou, Jiangsu 221009, P.R. China
|
24
|
Combined haplotype blocks regression and multi-locus mixed model analysis reveals novel candidate genes associated with milk traits in dairy sheep. Livest Sci 2019. [DOI: 10.1016/j.livsci.2018.11.020] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/20/2022]
|
25
|
Pai S, Bader GD. Patient Similarity Networks for Precision Medicine. J Mol Biol 2018; 430:2924-2938. [PMID: 29860027 PMCID: PMC6097926 DOI: 10.1016/j.jmb.2018.05.037] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Received: 04/14/2018] [Revised: 05/24/2018] [Accepted: 05/29/2018] [Indexed: 02/08/2023]
Abstract
Clinical research and practice in the 21st century is poised to be transformed by analysis of computable electronic medical records and population-level genome-scale patient profiles. Genomic data capture genetic and environmental state, providing information on heterogeneity in disease and treatment outcome, but genomic-based clinical risk scores are limited. Achieving the goal of routine precision medicine that takes advantage of these rich genomics data will require computational methods that support heterogeneous data, have excellent predictive performance, and ideally, provide biologically interpretable results. Traditional machine-learning approaches excel at performance, but often have limited interpretability. Patient similarity networks are an emerging paradigm for precision medicine, in which patients are clustered or classified based on their similarities in various features, including genomic profiles. This strategy is analogous to standard medical diagnosis, has excellent performance, is interpretable, and can preserve patient privacy. We review new methods based on patient similarity networks, including Similarity Network Fusion for patient clustering and netDx for patient classification. While these methods are already useful, much work is required to improve their scalability for contemporary genetic cohorts, optimize parameters, and incorporate a wide range of genomics and clinical data. The coming 5 years will provide an opportunity to assess the utility of network-based algorithms for precision medicine.
Affiliation(s)
- Shraddha Pai
- The Donnelly Centre, University of Toronto, Toronto, Canada
- Gary D Bader
- The Donnelly Centre, University of Toronto, Toronto, Canada; Department of Molecular Genetics, University of Toronto, Toronto, Canada; Department of Computer Science, University of Toronto, Toronto, Canada; The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Canada
|
26
|
Ye KJ, Dai J, Liu LY, Peng MJ. Network-based gene function inference method to predict optimal gene functions associated with fetal growth restriction. Mol Med Rep 2018; 18:3003-3010. [PMID: 30015878 DOI: 10.3892/mmr.2018.9232] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/04/2017] [Accepted: 07/21/2017] [Indexed: 11/06/2022] Open
Abstract
The guilt by association (GBA) principle has been widely used to predict gene functions, and a network-based approach may enhance the confidence and stability of the analysis compared with focusing on individual genes. Fetal growth restriction (FGR) is the second primary cause of perinatal mortality. Therefore, the present study aimed to predict the optimal gene functions for FGR using a network-based GBA method. The method comprised four parts: identification of differentially expressed genes (DEGs) between patients with FGR and normal controls based on gene expression data; construction of a co-expression network (CEN) from the DEGs using the Spearman correlation coefficient; collection of gene ontology (GO) data on the basis of a known confirmed database and the DEGs; and prediction of optimal gene functions using the GBA algorithm, for which the area under the receiver operating characteristic curve (AUC) was obtained for each GO term. A total of 115 DEGs and 109 GO terms were obtained for subsequent analysis. All DEGs were mapped to the CEN and formed 6,555 edges. The results of the GBA algorithm demonstrated that 78 GO terms had a good classification performance, with AUC >0.5. In particular, the AUC for 5 of the GO terms was >0.7, and these were defined as optimal gene functions, including defense response, immune system process, response to stress, cellular response to chemical stimulus and positive regulation of biological process. In conclusion, the results of the present study provided insights into the pathological mechanism underlying FGR, and provided potential biomarkers for early detection and targeted treatment of this disease. However, the interactions between the 5 GO terms remain unclear, and further studies are required.
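The pipeline summarized above (Spearman co-expression network, then a GBA score and an AUC per GO term) can be sketched roughly as follows. This is a simplified neighbor-voting variant under stated assumptions, not the authors' implementation: edge weights are taken as absolute Spearman correlations, a gene's score is the share of its total edge weight that goes to other annotated genes, and the AUC is computed from ranks (Mann-Whitney). All function names and the toy expression values are hypothetical.

```python
from itertools import combinations


def rank(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return cov / (sxx * syy) ** 0.5 if sxx and syy else 0.0


def auc(scores, labels):
    """Rank-based (Mann-Whitney) area under the ROC curve."""
    ranks = rank(scores)
    n_pos = sum(1 for flag in labels if flag)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative gene")
    pos_rank_sum = sum(r for r, flag in zip(ranks, labels) if flag)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def gba_auc(expression, annotated):
    """AUC of neighbor-voting scores for one GO term.

    expression: {gene: list of expression values across samples}
    annotated:  set of genes carrying the GO term
    """
    genes = list(expression)
    weight = {g: {} for g in genes}
    for a, b in combinations(genes, 2):
        weight[a][b] = weight[b][a] = abs(spearman(expression[a], expression[b]))
    scores = []
    for g in genes:
        total = sum(weight[g].values())
        votes = sum(w for nb, w in weight[g].items() if nb in annotated)
        scores.append(votes / total if total else 0.0)
    return auc(scores, [g in annotated for g in genes])


if __name__ == "__main__":
    toy_expression = {  # made-up values for four genes across five samples
        "a": [1, 2, 3, 4, 5], "b": [1, 2, 3, 5, 4],
        "c": [3, 1, 5, 2, 4], "d": [3, 1, 5, 4, 2],
    }
    print(gba_auc(toy_expression, annotated={"a", "b"}))
```

A GO term would then be kept as an "optimal" function when this AUC exceeds the chosen threshold (0.7 in the abstract above). Published GBA tools typically hide labels in cross-validation folds, which this sketch omits for brevity.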
Affiliation(s)
- Ke-Jun Ye
- Department of Gynaecology and Obstetrics, Ruian People's Hospital, Ruian, Zhejiang 325200, P.R. China
- Jie Dai
- Department of Gynaecology and Obstetrics, Ruian People's Hospital, Ruian, Zhejiang 325200, P.R. China
- Ling-Yun Liu
- Department of Clinical Laboratory, Ruian People's Hospital, Ruian, Zhejiang 325200, P.R. China
- Meng-Jia Peng
- Department of Gynaecology and Obstetrics, Ruian People's Hospital, Ruian, Zhejiang 325200, P.R. China
|
27
|
Wu W, Huang B, Yan Y, Zhong ZQ. Exploration of gene functions for esophageal squamous cell carcinoma using network-based guilt by association principle. Braz J Med Biol Res 2018; 51:e6801. [PMID: 29694510 PMCID: PMC5937724 DOI: 10.1590/1414-431x20186801] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/16/2017] [Accepted: 01/25/2018] [Indexed: 11/21/2022]
Abstract
Gene networks have been broadly used to predict gene functions based on the guilt by association (GBA) principle. To better understand the molecular mechanisms of esophageal squamous cell carcinoma (ESCC), our study was designed to use a network-based GBA method to identify the optimal gene functions for ESCC. To identify genomic bio-signatures for ESCC, microarray data of GSE20347 were first downloaded from the Gene Expression Omnibus database, a public functional genomics data repository. Then, differentially expressed genes (DEGs) between ESCC patients and controls were identified using the LIMMA method. Afterwards, a differential co-expression network (DCN) was constructed from the DEGs, followed by gene ontology (GO) enrichment analysis based on a known confirmed database and the DEGs. Finally, the optimal gene functions were predicted using the GBA algorithm based on the area under the curve (AUC) for each GO term. Overall, 43 DEGs and 67 GO terms were obtained for subsequent analysis. GBA predictions demonstrated that 13 GO functions with AUC>0.7 had a good classification ability. Of these 13 GO terms, 6 yielded AUC>0.8 and were determined to be the optimal gene functions. Two GO categories had AUC>0.9: cell cycle checkpoint (AUC=0.91648) and mitotic sister chromatid segregation (AUC=0.91597). Our findings highlight the clinical implications of cell cycle checkpoint and mitotic sister chromatid segregation in ESCC progression and provide the molecular foundation for developing therapeutic targets.
Affiliation(s)
- Wei Wu
- Department of Gastroenterology (40th Ward), Daqing Oilfield General Hospital, Daqing, China
- Bo Huang
- Department of Gastroenterology (40th Ward), Daqing Oilfield General Hospital, Daqing, China
- Yan Yan
- Department of Ultrasonics, Daqing Oilfield General Hospital, Daqing, China
- Zhi-Qiang Zhong
- Department of Gastroenterology (40th Ward), Daqing Oilfield General Hospital, Daqing, China
|
28
|
Crow M, Paul A, Ballouz S, Huang ZJ, Gillis J. Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor. Nat Commun 2018; 9:884. [PMID: 29491377 PMCID: PMC5830442 DOI: 10.1038/s41467-018-03282-0] [Citation(s) in RCA: 158] [Impact Index Per Article: 26.3] [Received: 11/29/2017] [Accepted: 02/02/2018] [Indexed: 12/19/2022] Open
Abstract
Single-cell RNA-sequencing (scRNA-seq) technology provides a new avenue to discover and characterize cell types; however, the experiment-specific technical biases and analytic variability inherent to current pipelines may undermine its replicability. Meta-analysis is further hampered by the use of ad hoc naming conventions. Here we demonstrate our replication framework, MetaNeighbor, that quantifies the degree to which cell types replicate across datasets, and enables rapid identification of clusters with high similarity. We first measure the replicability of neuronal identity, comparing results across eight technically and biologically diverse datasets to define best practices for more complex assessments. We then apply this to novel interneuron subtypes, finding that 24/45 subtypes have evidence of replication, which enables the identification of robust candidate marker genes. Across tasks we find that large sets of variably expressed genes can identify replicable cell types with high accuracy, suggesting a general route forward for large-scale evaluation of scRNA-seq data.
Affiliation(s)
- Megan Crow
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Anirban Paul
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Sara Ballouz
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Z Josh Huang
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Jesse Gillis
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA
|
29
|
Ballouz S, Weber M, Pavlidis P, Gillis J. EGAD: ultra-fast functional analysis of gene networks. Bioinformatics 2018; 33:612-614. [PMID: 27993773 DOI: 10.1093/bioinformatics/btw695] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 04/20/2016] [Accepted: 11/03/2016] [Indexed: 12/25/2022] Open
Abstract
Summary: Evaluating gene networks with respect to known biology is a common task but often a computationally costly one. Many computational experiments are difficult to apply exhaustively in network analysis due to run-times. To permit high-throughput analysis of gene networks, we have implemented a set of very efficient tools to calculate functional properties in networks based on guilt-by-association methods. EGAD (Extending 'Guilt-by-Association' by Degree) allows gene networks to be evaluated with respect to hundreds or thousands of gene sets. The methods predict novel members of gene groups, assess how well a gene network groups known sets of genes, and determine the degree to which generic predictions drive performance. By allowing fast evaluations, whether of random sets or real functional ones, EGAD provides the user with an assessment of performance which can easily be used in controlled evaluations across many parameters. Availability and Implementation: The software package is freely available at https://github.com/sarbal/EGAD and implemented for use in R and Matlab. The package is also freely available under the LGPL license from the Bioconductor web site (http://bioconductor.org). Contact: JGillis@cshl.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sara Ballouz
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
- Melanie Weber
- Department of Mathematics and Computer Science, University of Leipzig, Leipzig, Germany
- Paul Pavlidis
- Department of Psychiatry and Michael Smith Laboratories, University of British Columbia, Vancouver, Canada
- Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
|
30
|
Abstract
Biological networks are powerful resources for the discovery of genes and genetic modules that drive disease. Fundamental to network analysis is the concept that genes underlying the same phenotype tend to interact; this principle can be used to combine and to amplify signals from individual genes. Recently, numerous bioinformatic techniques have been proposed for genetic analysis using networks, based on random walks, information diffusion and electrical resistance. These approaches have been applied successfully to identify disease genes, genetic modules and drug targets. In fact, all these approaches are variations of a unifying mathematical machinery - network propagation - suggesting that it is a powerful data transformation method of broad utility in genetic research.
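One common form of network propagation mentioned above, the random walk with restart, can be sketched as follows. This is a minimal illustration under stated assumptions: the network is unweighted and undirected, given as a dictionary of neighbor sets; the walker restarts at the seed genes with probability `restart`; and an isolated node simply keeps its own mass. The function name, parameter names and default values are illustrative choices, not taken from the reviewed work.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Propagate seed scores over a network by power iteration.

    adjacency: {node: set_of_neighbors} for an undirected, unweighted graph
    seeds:     nodes carrying the initial signal (e.g. known disease genes)
    Returns the stationary visiting probability of every node.
    """
    nodes = list(adjacency)
    seeds = set(seeds)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(max_iter):
        # Restart term: a fixed share of the mass jumps back to the seeds.
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            moving = (1.0 - restart) * prob[n]
            if adjacency[n]:
                share = moving / len(adjacency[n])
                for nb in adjacency[n]:
                    nxt[nb] += share   # spread evenly over the neighbors
            else:
                nxt[n] += moving       # isolated node keeps its own mass
        change = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if change < tol:
            break
    return prob


if __name__ == "__main__":
    # Toy path network A - B - C - D with a single seed at A.
    path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
    scores = random_walk_with_restart(path, seeds={"A"})
    for gene, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(gene, round(score, 4))
```

Nodes close to the seeds receive higher scores, which is how propagation amplifies weak signals from individual genes. The restart probability controls how far the signal spreads and is usually tuned per application.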
|
31
|
Chen X. Prediction of optimal gene functions for osteosarcoma using network-based- guilt by association method based on gene oncology and microarray profile. J Bone Oncol 2017; 7:18-22. [PMID: 28443230 PMCID: PMC5396855 DOI: 10.1016/j.jbo.2017.04.003] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 03/15/2017] [Revised: 04/05/2017] [Accepted: 04/06/2017] [Indexed: 01/21/2023] Open
Abstract
In the current study, we aimed to predict the optimal gene functions for osteosarcoma (OS) by integrating a network-based method with the guilt by association (GBA) principle (called the network-based gene function inference approach), based on gene ontology (GO) data and gene expression profiles. To begin with, differentially expressed genes (DEGs) were extracted using the linear models for microarray data (LIMMA) package. Then, a differential co-expression network (DCN) was constructed from the DEGs, and a sub-DCN was identified using the Spearman correlation coefficient (SCC). Subsequently, GO annotations for OS were collected according to a known confirmed database and the DEGs. Finally, gene functions were predicted by means of the GBA principle based on the area under the curve (AUC) for GO terms, and we determined GO terms with AUC >0.7 to be the optimal gene functions for OS. In total, 123 DEGs and 137 GO terms were obtained for further analysis. A DCN was constructed, which included 123 DEGs and 7503 interactions. A total of 105 GO terms were identified when the threshold was set at AUC >0.5, indicating good classification performance. Among these 105 GO terms, 2 functions had AUC >0.7 and were determined to be the optimal gene functions: angiogenesis (AUC =0.767) and regulation of immune system process (AUC =0.710). These gene functions appear to have potential for early detection and clinical treatment of OS in the future.
|
32
|
Kominakis A, Hager-Theodorides AL, Zoidis E, Saridaki A, Antonakos G, Tsiamis G. Combined GWAS and 'guilt by association'-based prioritization analysis identifies functional candidate genes for body size in sheep. Genet Sel Evol 2017; 49:41. [PMID: 28454565 PMCID: PMC5408376 DOI: 10.1186/s12711-017-0316-3] [Citation(s) in RCA: 52] [Impact Index Per Article: 7.4] [Received: 08/16/2016] [Accepted: 04/19/2017] [Indexed: 12/30/2022] Open
Abstract
Background: Body size in sheep is an important indicator of productivity, growth and health as well as of environmental adaptation. It is a composite quantitative trait that has been studied with high-throughput genomic methods, i.e. genome-wide association studies (GWAS), in various mammalian species. Several genomic markers have been associated with body size traits, and genes have been identified as causative candidates in humans, dogs and cattle. A limited number of related GWAS have been performed in various sheep breeds and have identified genomic regions and candidate genes that partly account for body size variability. Here, we conducted a GWAS in Frizarta dairy sheep with phenotypic data from 10 body size measurements and genotypic data (from the Illumina ovineSNP50 BeadChip) for 459 ewes. Results: The 10 body size measurements were subjected to principal component analysis, and three independent principal components (PC) were constructed, interpretable as width, height and length dimensions, respectively. The GWAS performed for each PC identified 11 SNPs that were significant at the chromosome level: one on each of chromosomes 3, 8, 9, 10, 11, 12, 19, 20 and 23, and two on chromosome 25. Nine of the 11 SNPs were located on previously identified quantitative trait loci for sheep meat, production or reproduction. One hundred and ninety-seven positional candidate genes within a 1-Mb distance from each significant SNP were found. A guilt-by-association-based (GBA) prioritization analysis (PA) was performed to identify the most plausible functional candidate genes. GBA-based PA identified 39 genes that were significantly associated with gene networks relevant to body size traits. Prioritized genes were identified in the vicinity of all significant SNPs except for those on chromosomes 10 and 12. The top five ranking genes were TP53, BMPR1A, PIK3R5, RPL26 and PRKDC.
Conclusions: The results of this GWAS provide evidence for 39 causative candidate genes across nine chromosomal regions for body size traits, some of which are novel and some of which are previously identified candidates from other studies (e.g. TP53, NTN1 and ZNF521). GBA-based PA has proved to be a useful tool to identify genes with increased biological relevance, but it is subject to certain limitations. Electronic supplementary material: The online version of this article (doi:10.1186/s12711-017-0316-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Antonios Kominakis
- Department of Animal Science and Aquaculture, Agricultural University of Athens, Iera Odos 75, 11855, Athens, Greece
- Ariadne L Hager-Theodorides
- Department of Animal Science and Aquaculture, Agricultural University of Athens, Iera Odos 75, 11855, Athens, Greece
- Evangelos Zoidis
- Department of Animal Science and Aquaculture, Agricultural University of Athens, Iera Odos 75, 11855, Athens, Greece
- Aggeliki Saridaki
- Department of Environmental and Natural Resources Management, University of Patras, Seferi 2, 30100, Agrinio, Greece
- George Antonakos
- Agricultural and Livestock Union of Western Greece, 13rd Km N.R. Agrinio-Ioannina, 30100, Lepenou, Greece
- George Tsiamis
- Department of Environmental and Natural Resources Management, University of Patras, Seferi 2, 30100, Agrinio, Greece
|
33
|
Guala D, Sonnhammer ELL. A large-scale benchmark of gene prioritization methods. Sci Rep 2017; 7:46598. [PMID: 28429739 PMCID: PMC5399445 DOI: 10.1038/srep46598] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 06/09/2016] [Accepted: 03/22/2017] [Indexed: 11/16/2022] Open
Abstract
To maximize the use of results from high-throughput experimental studies, e.g. GWAS, for identification and diagnostics of new disease-associated genes, it is important to have properly analyzed and benchmarked gene prioritization tools. While prospective benchmarks are underpowered to provide statistically significant results in their attempt to differentiate the performance of gene prioritization tools, a strategy for retrospective benchmarking has been missing, and new tools usually only provide internal validations. The Gene Ontology (GO) contains genes clustered around annotation terms. This intrinsic property of GO can be utilized in the construction of robust benchmarks, objective to the problem domain. We demonstrate how this can be achieved for network-based gene prioritization tools, utilizing the FunCoup network. We use cross-validation and a set of appropriate performance measures to compare state-of-the-art gene prioritization algorithms: three based on network diffusion (NetRank and two implementations of Random Walk with Restart), and MaxLink, which utilizes network neighborhood. Our benchmark suite provides a systematic and objective way to compare the multitude of available and future gene prioritization tools, enabling researchers to select the best gene prioritization tool for the task at hand, and helping to guide the development of more accurate methods.
Affiliation(s)
- Dimitri Guala
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
- Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
|
34
|
Jiang B, Kloster K, Gleich DF, Gribskov M. AptRank: an adaptive PageRank model for protein function prediction on bi-relational graphs. Bioinformatics 2017; 33:1829-1836. [PMID: 28200073 DOI: 10.1093/bioinformatics/btx029] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 05/23/2016] [Accepted: 02/14/2017] [Indexed: 11/15/2022] Open
Affiliation(s)
- Biaobin Jiang
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
- Kyle Kloster
- Department of Mathematics, Purdue University, West Lafayette, IN, USA
- David F Gleich
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Michael Gribskov
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
|
35
|
Kim E, Lee I. Network-Based Gene Function Prediction in Mouse and Other Model Vertebrates Using MouseNet Server. Methods Mol Biol 2017; 1611:183-198. [PMID: 28451980 DOI: 10.1007/978-1-4939-7015-5_14] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 01/06/2023]
Abstract
The mouse, Mus musculus, is a popular model organism for the study of human genes involved in development, immunology, and disease phenotypes. Despite recent revolutions in gene-knockout technologies in mouse, identification of candidate genes for functions of interest can further accelerate the discovery of novel gene functions. The collaborative nature of genetic functions allows for the inference of gene functions based on the principle of guilt-by-association. Genome-scale co-functional networks could therefore provide functional predictions for genes via network analysis. We recently constructed such a network for mouse (MouseNet), which interconnects over 88% of protein-coding genes with 788,080 functional relationships. The companion web server ( www.inetbio.org/mousenet ) enables researchers with no bioinformatics expertise to generate predictions that facilitate discovery of novel gene functions. In this chapter, we present the theoretical framework for MouseNet, as well as step-by-step instructions and technical tips for functional prediction of genes and pathways in mouse and other model vertebrates.
Affiliation(s)
- Eiru Kim
- Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul, 03722, Korea
- Insuk Lee
- Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul, 03722, Korea
|
36
|
Fu G, Wang J, Yang B, Yu G. NegGOA: negative GO annotations selection using ontology structure. Bioinformatics 2016; 32:2996-3004. [PMID: 27318205 DOI: 10.1093/bioinformatics/btw366] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 02/25/2016] [Accepted: 06/01/2016] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: Predicting the biological functions of proteins is one of the key challenges in the post-genomic era. Computational models have demonstrated the utility of applying machine learning methods to predict protein function. Most prediction methods explicitly require a set of negative examples, i.e. proteins that are known not to carry out a particular function. However, Gene Ontology (GO) almost always only provides the knowledge that proteins carry out a particular function, and functional annotations of proteins are incomplete. GO structurally organizes tens of thousands of GO terms, and a protein is annotated with several (or dozens) of these terms. For these reasons, the negative examples of a protein can greatly help to distinguish the true positive examples of the protein from such a large candidate GO space. RESULTS: In this paper, we present a novel approach (called NegGOA) to select negative examples. Specifically, NegGOA takes advantage of the ontology structure, the available annotations and the potential for additional annotations of a protein to choose negative examples of the protein. We compare NegGOA with other negative example selection algorithms and find that NegGOA produces far fewer false negatives than they do. We incorporate the selected negative examples into an efficient function prediction model to predict the functions of proteins in Yeast, Human, Mouse and Fly. NegGOA also demonstrates higher accuracy than the compared algorithms across various evaluation metrics. In addition, NegGOA is less affected by incomplete annotations of proteins than the compared methods. AVAILABILITY AND IMPLEMENTATION: The Matlab and R codes are available at https://sites.google.com/site/guoxian85/neggoa CONTACT: gxyu@swu.edu.cn SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Guangyuan Fu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
- Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
- Bo Yang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
- Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
|
37
|
Abstract
“Big Data” has surpassed “systems biology” and “omics” as the hottest buzzword in the biological sciences, but is there any substance behind the hype? Certainly, we have learned about various aspects of cell and molecular biology from the many individual high-throughput data sets that have been published in the past 15–20 years. These data, although useful as individual data sets, can provide much more knowledge when interrogated with Big Data approaches, such as applying integrative methods that leverage the heterogeneous data compendia in their entirety. Here we discuss the benefits and challenges of such Big Data approaches in biology and how cell and molecular biologists can best take advantage of them.
Affiliation(s)
- Kara Dolinski
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540
- Olga G Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540; Department of Computer Science, Princeton University, Princeton, NJ 08540; Simons Center for Data Analysis, Simons Foundation, New York, NY 10010
|
38
|
Guan Y, Martini S, Mariani LH. Genes Caught In Flagranti: Integrating Renal Transcriptional Profiles With Genotypes and Phenotypes. Semin Nephrol 2016. [PMID: 26215861 DOI: 10.1016/j.semnephrol.2015.04.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/27/2022]
Abstract
In the past decade, population genetics has gained tremendous success in identifying genetic variations that are statistically relevant to renal diseases and kidney function. However, it is challenging to interpret the functional relevance of the genetic variations found by population genetics studies. In this review, we discuss studies that integrate multiple levels of data, especially transcriptome profiles and phenotype data, to assign functional roles of genetic variations involved in kidney function. Furthermore, we introduce state-of-the-art machine learning algorithms, Bayesian networks, support vector machines, and Gaussian process regression, which have been applied successfully to integrating genetic, regulatory, and clinical information to predict clinical outcomes. These methods are likely to be deployed successfully in the nephrology field in the near future.
Affiliation(s)
- Yuanfang Guan
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI
  - Department of Internal Medicine, University of Michigan, Ann Arbor, MI
  - Department of Computer Science and Engineering, University of Michigan, Ann Arbor, MI
- Sebastian Martini
  - Department of Internal Medicine, University of Michigan, Ann Arbor, MI
  - Nephrologisches Zentrum, Medizinische Klinik und Poliklinik IV, Klinikum der Universität München, Ludwig-Maximilians-University Munich, Munich, Germany
- Laura H Mariani
  - Department of Internal Medicine, University of Michigan, Ann Arbor, MI
|
39
|
Conrad T, Albrecht AS, de Melo Costa VR, Sauer S, Meierhofer D, Ørom UA. Serial interactome capture of the human cell nucleus. Nat Commun 2016; 7:11212. [PMID: 27040163 PMCID: PMC4822031 DOI: 10.1038/ncomms11212] [Citation(s) in RCA: 98] [Impact Index Per Article: 12.3] [Received: 09/04/2015] [Accepted: 03/02/2016] [Indexed: 11/15/2022] Open
Abstract
Novel RNA-guided cellular functions are paralleled by an increasing number of RNA-binding proteins (RBPs). Here we present 'serial RNA interactome capture' (serIC), a multiple purification procedure of ultraviolet-crosslinked poly(A)-RNA-protein complexes that enables global RBP detection with high specificity. We apply serIC to the nuclei of proliferating K562 cells to obtain the first human nuclear RNA interactome. The domain composition of the 382 identified nuclear RBPs markedly differs from previous IC experiments, including few factors without known RNA-binding domains that are in good agreement with computationally predicted RNA binding. serIC extends the number of DNA-RNA-binding proteins (DRBPs), and reveals a network of RBPs involved in p53 signalling and double-strand break repair. serIC is an effective tool to couple global RBP capture with additional selection or labelling steps for specific detection of highly purified RBPs.
Affiliation(s)
- Thomas Conrad
  - Max Planck Institute for Molecular Genetics, Otto Warburg Laboratories, 14195 Berlin, Germany
- Anne-Susann Albrecht
  - Max Planck Institute for Molecular Genetics, Otto Warburg Laboratories, 14195 Berlin, Germany
  - Department of Biochemistry, Free University of Berlin, 14195 Berlin, Germany
- Veronica Rodrigues de Melo Costa
  - Max Planck Institute for Molecular Genetics, Otto Warburg Laboratories, 14195 Berlin, Germany
  - Department of Informatics, Free University of Berlin, 14195 Berlin, Germany
- Sascha Sauer
  - Max Planck Institute for Molecular Genetics, Otto Warburg Laboratories, 14195 Berlin, Germany
  - CU Systems Medicine, 97080 Würzburg 14195, Germany
- David Meierhofer
  - Max Planck Institute for Molecular Genetics, Mass Spectrometry Core Facility, 14195 Berlin, Germany
- Ulf Andersson Ørom
  - Max Planck Institute for Molecular Genetics, Otto Warburg Laboratories, 14195 Berlin, Germany
|
40
|
Blatti C, Sinha S. Characterizing gene sets using discriminative random walks with restart on heterogeneous biological networks. Bioinformatics 2016; 32:2167-75. [PMID: 27153592 PMCID: PMC4937193 DOI: 10.1093/bioinformatics/btw151] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 11/12/2015] [Accepted: 03/14/2016] [Indexed: 12/11/2022] Open
Abstract
Motivation: Analysis of co-expressed gene sets typically involves testing for enrichment of different annotations or ‘properties’ such as biological processes, pathways, transcription factor binding sites, etc., one property at a time. This common approach ignores any known relationships among the properties or the genes themselves. It is believed that known biological relationships among genes and their many properties may be exploited to more accurately reveal commonalities of a gene set. Previous work has sought to achieve this by building biological networks that combine multiple types of gene–gene or gene–property relationships, and performing network analysis to identify other genes and properties most relevant to a given gene set. Most existing network-based approaches for recognizing genes or annotations relevant to a given gene set collapse information about different properties to simplify (homogenize) the networks. Results: We present a network-based method for ranking genes or properties related to a given gene set. Such related genes or properties are identified from among the nodes of a large, heterogeneous network of biological information. Our method involves a random walk with restarts, performed on an initial network with multiple node and edge types that preserve more of the original, specific property information than current methods that operate on homogeneous networks. In this first stage of our algorithm, we find the properties that are the most relevant to the given gene set and extract a subnetwork of the original network, comprising only these relevant properties. We then re-rank genes by their similarity to the given gene set, based on a second random walk with restarts, performed on the above subnetwork. We demonstrate the effectiveness of this algorithm for ranking genes related to Drosophila embryonic development and aggressive responses in the brains of social animals. 
Availability and Implementation: DRaWR was implemented as an R package available at veda.cs.illinois.edu/DRaWR. Contact: blatti@illinois.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
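The random walk with restarts that this abstract builds on can be illustrated with a minimal sketch. It is not the DRaWR implementation (which is an R package and runs two walks on a heterogeneous network); it only shows a single generic walk on a small homogeneous graph. The function name `random_walk_with_restart`, the restart probability of 0.3, and the toy adjacency matrix are assumptions made for illustration.

```python
import numpy as np

def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Score every node by its proximity to the seed nodes (generic RWR sketch)."""
    A = np.asarray(adjacency, dtype=float)
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # avoid division by zero for isolated nodes
    W = A / col_sums                       # column-normalised transition matrix
    p0 = np.zeros(A.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)     # restart distribution: uniform over seeds
    p = p0.copy()
    for _ in range(max_iter):
        # One step: follow an edge with prob. (1 - restart), jump back to seeds otherwise.
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy path graph 0 - 1 - 2 - 3 with node 0 as the only seed (hypothetical data).
toy = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
scores = random_walk_with_restart(toy, seeds=[0])
```

On this toy graph the scores decrease with distance from the seed, which is the property such methods use to rank nodes related to a query gene set.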
Affiliation(s)
- Charles Blatti
  - Department of Computer Science, University of Illinois, Urbana, IL 61801, USA
- Saurabh Sinha
  - Department of Computer Science, University of Illinois, Urbana, IL 61801, USA
  - Institute of Genomic Biology, University of Illinois, Urbana, IL 61801, USA
|
41
|
Hung CM, Yu AY, Lai YT, Shaner PJL. Developing informative microsatellite markers for non-model species using reference mapping against a model species' genome. Sci Rep 2016; 6:23087. [PMID: 26976328 PMCID: PMC4791680 DOI: 10.1038/srep23087] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 09/27/2015] [Accepted: 03/01/2016] [Indexed: 11/12/2022] Open
Abstract
Microsatellites have a wide range of applications from behavioral biology, evolution, to agriculture-based breeding programs. The recent progress in the next-generation sequencing technologies and the rapidly increasing number of published genomes may greatly enhance the current applications of microsatellites by turning them from anonymous to informative markers. Here we developed an approach to anchor microsatellite markers of any target species in a genome of a related model species, through which the genomic locations of the markers, along with any functional genes potentially linked to them, can be revealed. We mapped the shotgun sequence reads of a non-model rodent species Apodemus semotus against the genome of a model species, Mus musculus, and presented 24 polymorphic microsatellite markers with detailed background information for A. semotus in this study. The developed markers can be used in other rodent species, especially those that are closely related to A. semotus or M. musculus. Compared to the traditional approaches based on DNA cloning, our approach is likely to yield more loci for the same cost. This study is a timely demonstration of how a research team can efficiently generate informative (neutral or function-associated) microsatellite markers for their study species and unique biological questions.
Affiliation(s)
- Chih-Ming Hung
  - Biodiversity Research Center, Academia Sinica, Taipei, Taiwan
- Ai-Yun Yu
  - Department of Life Science, National Taiwan Normal University, Taipei, Taiwan
- Yu-Ting Lai
  - Department of Life Science, National Taiwan Normal University, Taipei, Taiwan
- Pei-Jen L Shaner
  - Department of Life Science, National Taiwan Normal University, Taipei, Taiwan
|
42
|
Yu G, Fu G, Wang J, Zhu H. Predicting Protein Function via Semantic Integration of Multiple Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:220-232. [PMID: 26800544 DOI: 10.1109/tcbb.2015.2459713] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Indexed: 06/05/2023]
Abstract
Determining the biological functions of proteins is one of the key challenges in the post-genomic era. The rapidly accumulating volumes of proteomic and genomic data drive the development of computational models for automatically predicting protein function at large scale. Recent approaches focus on integrating multiple heterogeneous data sources, and they often get better results than methods that use a single data source alone. In this paper, we investigate how to integrate multiple biological data sources with biological knowledge, i.e., Gene Ontology (GO), for protein function prediction. We propose a method, called SimNet, to Semantically integrate multiple functional association Networks derived from heterogeneous data sources. SimNet first utilizes GO annotations of proteins to capture the semantic similarity between proteins and introduces a semantic kernel based on the similarity. Next, SimNet constructs a composite network, obtained as a weighted summation of individual networks, and aligns the network with the kernel to get the weights assigned to individual networks. Then, it applies a network-based classifier on the composite network to predict protein function. Experimental results on heterogeneous proteomic data sources of Yeast, Human, Mouse, and Fly show that SimNet not only achieves better (or comparable) results than other related competitive approaches, but also takes much less time. The Matlab codes of SimNet are available at https://sites.google.com/site/guoxian85/simnet.
|
43
|
Abstract
Spurred by advances in processing power, memory, storage, and an unprecedented wealth of data, computers are being asked to tackle increasingly complex learning tasks, often with astonishing success. Computers have now mastered a popular variant of poker, learned the laws of physics from experimental data, and become experts in video games - tasks that would have been deemed impossible not too long ago. In parallel, the number of companies centered on applying complex data analysis to varying industries has exploded, and it is thus unsurprising that some analytic companies are turning attention to problems in health care. The purpose of this review is to explore what problems in medicine might benefit from such learning approaches and use examples from the literature to introduce basic concepts in machine learning. It is important to note that seemingly large enough medical data sets and adequate learning algorithms have been available for many decades, and yet, although there are thousands of papers applying machine learning algorithms to medical data, very few have contributed meaningfully to clinical care. This lack of impact stands in stark contrast to the enormous relevance of machine learning to many other industries. Thus, part of my effort will be to identify what obstacles there may be to changing the practice of medicine through statistical learning approaches, and discuss how these might be overcome.
Affiliation(s)
- Rahul C Deo
  - From Cardiovascular Research Institute, Department of Medicine and Institute for Human Genetics, University of California, San Francisco, and California Institute for Quantitative Biosciences, San Francisco.
|
44
|
Li HD, Omenn GS, Guan Y. A proteogenomic approach to understand splice isoform functions through sequence and expression-based computational modeling. Brief Bioinform 2016; 17:1024-1031. [PMID: 26740460 DOI: 10.1093/bib/bbv109] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/27/2015] [Revised: 11/03/2015] [Indexed: 01/23/2023] Open
Abstract
The products of multi-exon genes are a mixture of alternatively spliced isoforms, from which the translated proteins can have similar, different or even opposing functions. It is therefore essential to differentiate and annotate functions for individual isoforms. Computational approaches provide an efficient complement to expensive and time-consuming experimental studies. The input data of these methods range from DNA sequence, to RNA selection pressure, to expressed sequence tags, to full-length complementary DNA, to exon array, to RNA-seq expression, to proteomic data. Notably, RNA-seq technology generates quantitative profiling of transcript expression at the genome scale, with an unprecedented amount of expression data available for developing isoform function prediction methods. Integrative analysis of these data at different molecular levels enables a proteogenomic approach to systematically interrogate isoform functions. Here, we briefly review the state-of-the-art methods according to their input data sources, discuss their advantages and limitations and point out potential ways to improve prediction accuracies.
|
45
|
Kim E, Hwang S, Kim H, Shim H, Kang B, Yang S, Shim JH, Shin SY, Marcotte EM, Lee I. MouseNet v2: a database of gene networks for studying the laboratory mouse and eight other model vertebrates. Nucleic Acids Res 2016; 44:D848-54. [PMID: 26527726 PMCID: PMC4702832 DOI: 10.1093/nar/gkv1155] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 09/08/2015] [Revised: 10/05/2015] [Accepted: 10/19/2015] [Indexed: 12/26/2022] Open
Abstract
Laboratory mouse, Mus musculus, is one of the most important animal tools in biomedical research. Functional characterization of the mouse genes, hence, has been a long-standing goal in mammalian and human genetics. Although large-scale knockout phenotyping is under progress by international collaborative efforts, a large portion of mouse genome is still poorly characterized for cellular functions and associations with disease phenotypes. A genome-scale functional network of mouse genes, MouseNet, was previously developed in context of MouseFunc competition, which allowed only limited input data for network inferences. Here, we present an improved mouse co-functional network, MouseNet v2 (available at http://www.inetbio.org/mousenet), which covers 17 714 genes (>88% of coding genome) with 788 080 links, along with a companion web server for network-assisted functional hypothesis generation. The network database has been substantially improved by large expansion of genomics data. For example, MouseNet v2 database contains 183 co-expression networks inferred from 8154 public microarray samples. We demonstrated that MouseNet v2 is predictive for mammalian phenotypes as well as human diseases, which suggests its usefulness in discovery of novel disease genes and dissection of disease pathways. Furthermore, MouseNet v2 database provides functional networks for eight other vertebrate models used in various research fields.
Affiliation(s)
- Eiru Kim
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
- Sohyun Hwang
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
  - Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, University of Texas at Austin, TX 78712, USA
- Hyojin Kim
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
- Hongseok Shim
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
- Byunghee Kang
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
- Sunmo Yang
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
- Jae Ho Shim
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
- Seung Yeon Shin
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
- Edward M Marcotte
  - Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, University of Texas at Austin, TX 78712, USA
- Insuk Lee
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
|
46
|
Verleyen W, Ballouz S, Gillis J. Positive and negative forms of replicability in gene network analysis. Bioinformatics 2015; 32:1065-73. [PMID: 26668004 DOI: 10.1093/bioinformatics/btv734] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 07/13/2015] [Accepted: 12/09/2015] [Indexed: 02/07/2023] Open
Abstract
MOTIVATION: Gene networks have become a central tool in the analysis of genomic data but are widely regarded as hard to interpret. This has motivated a great deal of comparative evaluation and research into best practices. We explore the possibility that this may lead to overfitting in the field as a whole. RESULTS: We construct a model of 'research communities' sampling from real gene network data and machine learning methods to characterize performance trends. Our analysis reveals an important principle limiting the value of replication, namely that targeting it directly causes 'easy' or uninformative replication to dominate analyses. We find that when sampling across network data and algorithms with similar variability, the relationship between replicability and accuracy is positive (Spearman's correlation, rs ∼ 0.33) but where no such constraint is imposed, the relationship becomes negative for a given gene function (rs ∼ -0.13). We predict factors driving replicability in some prior analyses of gene networks and show that they are unconnected with the correctness of the original result, instead reflecting replicable biases. Without these biases, the original results also vanish replicably. We show these effects can occur quite far upstream in network data and that there is a strong tendency within protein-protein interaction data for highly replicable interactions to be associated with poor quality control. AVAILABILITY AND IMPLEMENTATION: Algorithms, network data and a guide to the code available at: https://github.com/wimverleyen/AggregateGeneFunctionPrediction CONTACT: jgillis@cshl.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- W Verleyen
  - Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Boulevard Woodbury, NY 11797, USA
- S Ballouz
  - Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Boulevard Woodbury, NY 11797, USA
- J Gillis
  - Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Boulevard Woodbury, NY 11797, USA
|
47
|
Madhukar NS, Elemento O, Pandey G. Prediction of Genetic Interactions Using Machine Learning and Network Properties. Front Bioeng Biotechnol 2015; 3:172. [PMID: 26579514 PMCID: PMC4620407 DOI: 10.3389/fbioe.2015.00172] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 09/01/2015] [Accepted: 10/12/2015] [Indexed: 12/04/2022] Open
Abstract
A genetic interaction (GI) is a type of interaction where the effect of one gene is modified by the effect of one or several other genes. These interactions are important for delineating functional relationships among genes and their corresponding proteins, as well as elucidating complex biological processes and diseases. An important type of GI - synthetic sickness or synthetic lethality - involves two or more genes, where the loss of either gene alone has little impact on cell viability, but the combined loss of all genes leads to a severe decrease in fitness (sickness) or cell death (lethality). The identification of GIs is an important problem for it can help delineate pathways, protein complexes, and regulatory dependencies. Synthetic lethal interactions have important clinical and biological significance, such as providing therapeutically exploitable weaknesses in tumors. While near systematic high-content screening for GIs is possible in single cell organisms such as yeast, the systematic discovery of GIs is extremely difficult in mammalian cells. Therefore, there is a great need for computational approaches to reliably predict GIs, including synthetic lethal interactions, in these organisms. Here, we review the state-of-the-art approaches, strategies, and rigorous evaluation methods for learning and predicting GIs, both under general (healthy/standard laboratory) conditions and under specific contexts, such as diseases.
Affiliation(s)
- Neel S Madhukar
  - Department of Physiology and Biophysics, Meyer Cancer Center, Institute for Precision Medicine and Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, USA
  - Tri-Institutional Training Program in Computational Biology and Medicine, New York, NY, USA
- Olivier Elemento
  - Department of Physiology and Biophysics, Meyer Cancer Center, Institute for Precision Medicine and Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, USA
  - Tri-Institutional Training Program in Computational Biology and Medicine, New York, NY, USA
- Gaurav Pandey
  - Department of Genetics and Genomic Sciences and Graduate School of Biomedical Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
|
48
|
Frasca M, Bertoni A, Valentini G. UNIPred: Unbalance-Aware Network Integration and Prediction of Protein Functions. J Comput Biol 2015; 22:1057-74. [PMID: 26402488 DOI: 10.1089/cmb.2014.0110] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 12/29/2022] Open
Abstract
The proper integration of multiple sources of data and the unbalance between annotated and unannotated proteins represent two of the main issues of the automated function prediction (AFP) problem. Most of supervised and semisupervised learning algorithms for AFP proposed in literature do not jointly consider these items, with a negative impact on both sensitivity and precision performances, due to the unbalance between annotated and unannotated proteins that characterize the majority of functional classes and to the specific and complementary information content embedded in each available source of data. We propose UNIPred (unbalance-aware network integration and prediction of protein functions), an algorithm that properly combines different biomolecular networks and predicts protein functions using parametric semisupervised neural models. The algorithm explicitly takes into account the unbalance between unannotated and annotated proteins both to construct the integrated network and to predict protein annotations for each functional class. Full-genome and ontology-wide experiments with three eukaryotic model organisms show that the proposed method compares favorably with state-of-the-art learning algorithms for AFP.
Affiliation(s)
- Marco Frasca
  - DI - Department of Computer Science, University of Milan, Milan, Italy
- Alberto Bertoni
  - DI - Department of Computer Science, University of Milan, Milan, Italy
- Giorgio Valentini
  - DI - Department of Computer Science, University of Milan, Milan, Italy
|
49
|
Whalen S, Pandey OP, Pandey G. Predicting protein function and other biomedical characteristics with heterogeneous ensembles. Methods 2015; 93:92-102. [PMID: 26342255 DOI: 10.1016/j.ymeth.2015.08.016] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 05/21/2015] [Revised: 08/03/2015] [Accepted: 08/23/2015] [Indexed: 12/29/2022] Open
Abstract
Prediction problems in biomedical sciences, including protein function prediction (PFP), are generally quite difficult. This is due in part to incomplete knowledge of the cellular phenomenon of interest, the appropriateness and data quality of the variables and measurements used for prediction, as well as a lack of consensus regarding the ideal predictor for specific problems. In such scenarios, a powerful approach to improving prediction performance is to construct heterogeneous ensemble predictors that combine the output of diverse individual predictors that capture complementary aspects of the problems and/or datasets. In this paper, we demonstrate the potential of such heterogeneous ensembles, derived from stacking and ensemble selection methods, for addressing PFP and other similar biomedical prediction problems. Deeper analysis of these results shows that the superior predictive ability of these methods, especially stacking, can be attributed to their attention to the following aspects of the ensemble learning process: (i) better balance of diversity and performance, (ii) more effective calibration of outputs and (iii) more robust incorporation of additional base predictors. Finally, to make the effective application of heterogeneous ensembles to large complex datasets (big data) feasible, we present DataSink, a distributed ensemble learning framework, and demonstrate its sound scalability using the examined datasets. DataSink is publicly available from https://github.com/shwhalen/datasink.
Affiliation(s)
- Sean Whalen
  - Gladstone Institutes, University of California, San Francisco, CA, USA.
- Om Prakash Pandey
  - Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
- Gaurav Pandey
  - Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Graduate School of Biomedical Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
|
50
|
Lehtinen S, Lees J, Bähler J, Shawe-Taylor J, Orengo C. Gene Function Prediction from Functional Association Networks Using Kernel Partial Least Squares Regression. PLoS One 2015; 10:e0134668. [PMID: 26288239 PMCID: PMC4545790 DOI: 10.1371/journal.pone.0134668] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 04/27/2015] [Accepted: 07/13/2015] [Indexed: 11/18/2022] Open
Abstract
With the growing availability of large-scale biological datasets, automated methods of extracting functionally meaningful information from this data are becoming increasingly important. Data relating to functional association between genes or proteins, such as co-expression or functional association, is often represented in terms of gene or protein networks. Several methods of predicting gene function from these networks have been proposed. However, evaluating the relative performance of these algorithms may not be trivial: concerns have been raised over biases in different benchmarking methods and datasets, particularly relating to non-independence of functional association data and test data. In this paper we propose a new network-based gene function prediction algorithm using a commute-time kernel and partial least squares regression (Compass). We compare Compass to GeneMANIA, a leading network-based prediction algorithm, using a number of different benchmarks, and find that Compass outperforms GeneMANIA on these benchmarks. We also explicitly explore problems associated with the non-independence of functional association data and test data. We find that a benchmark based on the Gene Ontology database, which, directly or indirectly, incorporates information from other databases, may considerably overestimate the performance of algorithms exploiting functional association data for prediction.
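The commute-time kernel named in this abstract is commonly defined as the Moore-Penrose pseudoinverse of the graph Laplacian. The sketch below shows only that kernel on a toy graph; it does not reproduce Compass, and the partial least squares regression step is omitted. The function names `commute_time_kernel` and `commute_time` and the three-node example are assumptions for illustration.

```python
import numpy as np

def commute_time_kernel(adjacency):
    """Pseudoinverse of the graph Laplacian L = D - A (a standard commute-time kernel)."""
    A = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A
    return np.linalg.pinv(laplacian)

def commute_time(adjacency, i, j):
    """Expected round-trip random-walk time between nodes i and j."""
    A = np.asarray(adjacency, dtype=float)
    K = commute_time_kernel(A)
    volume = A.sum()                       # sum of all node degrees
    # Effective resistance between i and j, scaled by the graph volume.
    return volume * (K[i, i] + K[j, j] - 2.0 * K[i, j])

# Toy path graph 0 - 1 - 2 (hypothetical data).
toy = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
K = commute_time_kernel(toy)
t_02 = commute_time(toy, 0, 2)
```

Nodes that are well connected have short commute times and therefore large kernel similarity, which is what makes the kernel usable as a feature space for a downstream regression.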
Affiliation(s)
- Sonja Lehtinen
  - CoMPLEX, University College London, London, United Kingdom
  - Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Jon Lees
  - Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Jürg Bähler
  - Department of Genetics, Evolution and Environment, University College London, London, United Kingdom
- John Shawe-Taylor
  - Department of Computer Science, University College London, London, United Kingdom
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, London, United Kingdom
|