1
Zhang C, Freddolino L. A large-scale assessment of sequence database search tools for homology-based protein function prediction. Brief Bioinform 2024; 25:bbae349. [PMID: 39038936 PMCID: PMC11262835 DOI: 10.1093/bib/bbae349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2024] [Revised: 06/03/2024] [Accepted: 07/05/2024] [Indexed: 07/24/2024]
Abstract
Sequence database searches followed by homology-based function transfer form one of the oldest and most popular approaches for predicting protein functions, such as Gene Ontology (GO) terms. These searches are also a critical component in most state-of-the-art machine learning and deep learning-based protein function predictors. Although sequence search tools are the basis of homology-based protein function prediction, previous studies have scarcely explored how to select the optimal sequence search tools and configure their parameters to achieve the best function prediction. In this paper, we evaluate the effect of using different options from among popular search tools, as well as the impacts of search parameters, on protein function prediction. When predicting GO terms on a large benchmark dataset, we found that, under default search parameters, BLASTp and MMseqs2 consistently exceed the performance of other tools, including DIAMOND, one of the most popular tools for function prediction. However, with the correct parameter settings, DIAMOND can perform comparably to BLASTp and MMseqs2 in function prediction. Additionally, we developed a new scoring function to derive GO predictions from homologous hits that consistently outperforms previously proposed scoring functions. These findings enable the improvement of almost all protein function prediction algorithms with a few easily implementable changes in their sequence homolog-based component. This study emphasizes the critical role of search parameter settings in homology-based function transfer and should make an important contribution to the development of future protein function prediction algorithms.
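To make the idea of deriving GO predictions from homologous hits concrete, the following is a minimal Python sketch of one simple bit-score-weighted scheme. It is an illustrative baseline only, not the new scoring function developed in the paper, which the abstract does not define. The function name `score_go_terms`, the layout of the hit tuples, and the example values are all assumptions made for illustration.

```python
from collections import defaultdict


def score_go_terms(hits):
    """Score GO terms from a list of homologous hits.

    hits: list of (bitscore, set_of_go_terms) tuples, one per hit
          (a hypothetical layout, not any tool's native output).
    Returns {go_term: score}, where score is the fraction of the total
    bit-score contributed by hits annotated with that term (0 to 1).
    """
    total = sum(bitscore for bitscore, _ in hits)
    if total <= 0:
        return {}
    weighted = defaultdict(float)
    for bitscore, terms in hits:
        for term in terms:
            weighted[term] += bitscore
    return {term: w / total for term, w in weighted.items()}


# Hypothetical hits for a single query protein.
hits = [
    (200.0, {"GO:0003677", "GO:0005634"}),
    (100.0, {"GO:0003677"}),
    (100.0, {"GO:0016301"}),
]
scores = score_go_terms(hits)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 2))
```

Under this scheme, a term supported by many high-scoring hits receives a score close to 1, while a term seen only in a weak hit scores near 0. Which hits enter the list depends on the search tool and its parameters, which is the aspect the study evaluates.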
Affiliation(s)
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, Department of Biological Chemistry, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI 48109, United States
- Lydia Freddolino
- Department of Computational Medicine and Bioinformatics, Department of Biological Chemistry, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI 48109, United States
2
Fan K, Zhang Y. Pseudo2GO: A Graph-Based Deep Learning Method for Pseudogene Function Prediction by Borrowing Information From Coding Genes. Front Genet 2020; 11:807. [PMID: 33014009 PMCID: PMC7461887 DOI: 10.3389/fgene.2020.00807] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/26/2020] [Accepted: 07/06/2020] [Indexed: 12/16/2022]
Abstract
Pseudogenes, historically regarded as relics of evolution, have recently shown increasing functional potential. Computational methods for predicting pseudogene functions in terms of the Gene Ontology (GO) are important for directing experimental discovery. However, no pseudogene-specific computational methods have been proposed to directly predict their GO terms. The biggest challenge for pseudogene function prediction is the lack of sufficient features and functional annotations, which makes training a predictive model difficult. Considering the close functional similarity between pseudogenes and their parent coding genes, with which they share a large amount of DNA sequence, and the rich annotations available for coding genes, we aim to predict pseudogene functions by borrowing information from coding genes in a graph-based way. Here we propose Pseudo2GO, a graph-based deep learning semi-supervised model for pseudogene function prediction. A sequence similarity graph is first constructed to connect pseudogenes and coding genes. Multiple features are incorporated into the model as node attributes to make the graph an attributed graph, including expression profiles, interactions with microRNAs, protein-protein interactions (PPIs), and genetic interactions. Graph convolutional networks are used to propagate node attributes across the graph to make classifications on pseudogenes. Comparing Pseudo2GO with other frameworks adapted from popular protein function prediction methods, we demonstrated that our method has achieved state-of-the-art performance, significantly outperforming other methods in terms of the M-AUPR metric.
Affiliation(s)
- Kunjie Fan
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
- Yan Zhang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
- The Ohio State University Comprehensive Cancer Center, Columbus, OH, United States
3
Fan K, Guan Y, Zhang Y. Graph2GO: a multi-modal attributed network embedding method for inferring protein functions. Gigascience 2020; 9:giaa081. [PMID: 32770210 PMCID: PMC7414417 DOI: 10.1093/gigascience/giaa081] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 12/14/2019] [Revised: 04/30/2020] [Indexed: 01/17/2023]
Abstract
BACKGROUND Identifying protein functions is important for many biological applications. Since experimental functional characterization of proteins is time-consuming and costly, accurate and efficient computational methods for predicting protein functions are in great demand for generating testable hypotheses to guide large-scale experiments. RESULTS Here, we propose Graph2GO, a multi-modal graph-based representation learning model that can integrate heterogeneous information, including multiple types of interaction networks (sequence similarity network and protein-protein interaction network) and protein features (amino acid sequence, subcellular location, and protein domains) to predict protein functions on gene ontology. Comparing Graph2GO to BLAST, as a baseline model, and to two popular protein function prediction methods (Mashup and deepNF), we demonstrated that our model can achieve state-of-the-art performance. We show the robustness of our model by testing on multiple species. We also provide a web server supporting function query and downstream analysis on-the-fly. CONCLUSIONS Graph2GO is the first model that has utilized attributed network representation learning methods to model both interaction networks and protein features for predicting protein functions, and achieved promising performance. Our model can be easily extended to include more protein features to further improve the performance. In addition, Graph2GO is applicable to other application scenarios involving biological networks, and the learned latent representations can be used as feature inputs for machine learning tasks in various downstream analyses.
Affiliation(s)
- Kunjie Fan
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Yan Zhang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- The Ohio State University Comprehensive Cancer Center (OSUCCC - James), Columbus, OH 43210, USA
4
Piovesan D, Tosatto SCE. INGA 2.0: improving protein function prediction for the dark proteome. Nucleic Acids Res 2020; 47:W373-W378. [PMID: 31073595 PMCID: PMC6602455 DOI: 10.1093/nar/gkz375] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 02/13/2019] [Revised: 04/29/2019] [Accepted: 04/30/2019] [Indexed: 12/21/2022]
Abstract
Our current knowledge of complex biological systems is stored in a computable form through the Gene Ontology (GO), which provides a comprehensive description of gene function. Prediction of GO terms from sequence remains, however, a challenging task, which is particularly critical for novel genomes. Here we present INGA 2.0, a new version of the INGA software for protein function prediction. INGA exploits homology, domain architecture, interaction networks and information from the ‘dark proteome’, such as transmembrane and intrinsically disordered regions, to generate a consensus prediction. INGA was ranked in the top ten methods on both CAFA2 and CAFA3 blind tests. The new algorithm can process entire genomes in a few hours or even less when additional input files are provided. The new interface provides a better user experience by integrating filters and widgets to explore the graph structure of the predicted terms. The INGA web server, databases and benchmarking are available from URL: https://inga.bio.unipd.it/.
Affiliation(s)
- Damiano Piovesan
- Department of Biomedical Sciences, University of Padua, Padua, Italy
- Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padua, Padua, Italy
- CNR Institute of Neuroscience, Padua, Italy
5
Zheng W, Zhang C, Bell EW, Zhang Y. I-TASSER gateway: A protein structure and function prediction server powered by XSEDE. Future Generations Computer Systems 2019; 99:73-85. [PMID: 31427836 PMCID: PMC6699767 DOI: 10.1016/j.future.2019.04.011] [Citation(s) in RCA: 59] [Impact Index Per Article: 11.8] [Indexed: 05/12/2023]
Abstract
There is an increasing gap between the number of known protein sequences and the number of proteins with experimentally characterized structure and function. To alleviate this issue, we have developed the I-TASSER gateway, an online server for automated and reliable protein structure and function prediction. For a given sequence, I-TASSER starts with template recognition from a known structure library, followed by full-length atomic model construction by iterative assembly simulations of the continuous structural fragments excised from the template alignments. Functional insights are then derived from comparative matching of the predicted model with a library of proteins with known function. The I-TASSER pipeline has been recently integrated with the XSEDE Gateway system to accommodate pressing demand from the user community and increasing computing costs. This report summarizes the configuration of the I-TASSER Gateway with the XSEDE-Comet supercomputer cluster, together with an overview of the I-TASSER method and milestones of its development.
6
Profiti G, Martelli PL, Casadio R. The Bologna Annotation Resource (BAR 3.0): improving protein functional annotation. Nucleic Acids Res 2019; 45:W285-W290. [PMID: 28453653 PMCID: PMC5570247 DOI: 10.1093/nar/gkx330] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 01/31/2017] [Accepted: 04/18/2017] [Indexed: 01/03/2023]
Abstract
BAR 3.0 updates our server BAR (Bologna Annotation Resource) for predicting protein structural and functional features from sequence. We increase data volume, query capabilities and information conveyed to the user. The core of BAR 3.0 is a graph-based clustering procedure of UniProtKB sequences, following strict pairwise similarity criteria (sequence identity ≥40% with alignment coverage ≥90%). Each cluster contains the available annotation downloaded from UniProtKB, GO, PFAM and PDB. After statistical validation, GO terms and PFAM domains are cluster-specific and annotate new sequences entering the cluster after satisfying similarity constraints. BAR 3.0 includes 28 869 663 sequences in 1 361 773 clusters, of which 22.2% (22 241 661 sequences) and 47.4% (24 555 055 sequences) have at least one validated GO term and one PFAM domain, respectively. 1.4% of the clusters (36% of all sequences) include PDB structures and the cluster is associated to a hidden Markov model that allows building template-target alignment suitable for structural modeling. Some other 3 399 026 sequences are singletons. BAR 3.0 offers an improved search interface, allowing queries by UniProtKB-accession, Fasta sequence, GO-term, PFAM-domain, organism, PDB and ligand/s. When evaluated on the CAFA2 targets, BAR 3.0 largely outperforms our previous version and scores among state-of-the-art methods. BAR 3.0 is publicly available and accessible at http://bar.biocomp.unibo.it/bar3.
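The pairwise criteria quoted above (sequence identity ≥40% with alignment coverage ≥90%) can be illustrated with a small Python sketch. It assumes, for illustration only, that clusters are the connected components of a graph whose edges are the pairs passing both thresholds; the abstract says only that the procedure is graph-based, so this is not necessarily how BAR 3.0 builds its clusters. The function name `cluster_sequences` and the example identifiers are hypothetical.

```python
def cluster_sequences(ids, pairs, min_identity=40.0, min_coverage=90.0):
    """Group sequences into connected components.

    ids:   iterable of sequence identifiers.
    pairs: iterable of (id_a, id_b, identity_percent, coverage_percent).
    An edge is kept only if both thresholds are satisfied.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in pairs:
        if identity >= min_identity and coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for i in parent:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical alignment results: P3-P4 fails the coverage threshold.
ids = ["P1", "P2", "P3", "P4"]
pairs = [
    ("P1", "P2", 55.0, 95.0),
    ("P2", "P3", 42.0, 91.0),
    ("P3", "P4", 80.0, 60.0),
]
clusters = cluster_sequences(ids, pairs)
print(clusters)
```

In this toy input, P4 ends up as a singleton because its only alignment does not satisfy the coverage constraint, which mirrors the singletons mentioned in the abstract.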
Affiliation(s)
- Giuseppe Profiti
- Biocomputing Group, BiGeA/CIG, 'Luigi Galvani' Interdepartmental Center for Integrated Studies of Bioinformatics, Biophysics and Biocomplexity, University of Bologna, Bologna 40126, Italy
- Pier Luigi Martelli
- Biocomputing Group, BiGeA/CIG, 'Luigi Galvani' Interdepartmental Center for Integrated Studies of Bioinformatics, Biophysics and Biocomplexity, University of Bologna, Bologna 40126, Italy
- Rita Casadio
- Biocomputing Group, BiGeA/CIG, 'Luigi Galvani' Interdepartmental Center for Integrated Studies of Bioinformatics, Biophysics and Biocomplexity, University of Bologna, Bologna 40126, Italy
7
Vendramin V, Ormanbekova D, Scalabrin S, Scaglione D, Maccaferri M, Martelli P, Salvi S, Jurman I, Casadio R, Cattonaro F, Tuberosa R, Massi A, Morgante M. Genomic tools for durum wheat breeding: de novo assembly of Svevo transcriptome and SNP discovery in elite germplasm. BMC Genomics 2019; 20:278. [PMID: 30971220 PMCID: PMC6456968 DOI: 10.1186/s12864-019-5645-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/04/2018] [Accepted: 03/25/2019] [Indexed: 12/30/2022]
Abstract
BACKGROUND The tetraploid durum wheat (Triticum turgidum L. ssp. durum Desf. Husnot) is an important crop which provides the raw material for pasta production and a valuable source of genetic diversity for breeding hexaploid wheat (Triticum aestivum L.). Future breeding efforts to enhance yield potential and climate resilience will increasingly rely on genomics-based approaches to identify and select beneficial alleles. A deeper characterisation of the molecular and functional diversity of the durum wheat transcriptome will be instrumental to more effectively harness its genetic diversity. RESULTS We report on the de novo transcriptome assembly of durum wheat cultivar 'Svevo'. The transcriptome of four tissues/organs (shoots and roots at the seedling stage, reproductive organs and developing grains) was assembled de novo, yielding 180,108 contigs, with a N50 length of 1121 bp and mean contig length of 883 bp. Alignment against the transcriptome of nine plant species identified 43% of transcripts with homology to at least one reference transcriptome. The functional annotation was completed by means of a combination of complementary software. The presence of differential expression between the A- and B-homoeolog copies of the durum wheat tetraploid genome was ascertained by phase reconstruction of polymorphic sites based on the T. urartu transcripts and inferring homoeolog-specific sequences. We observed greater expression divergence between A and B homoeologs in grains rather than in leaves and roots. The transcriptomes of 13 durum wheat cultivars spanning the breeding period from 1969 to 2005 were analysed for SNP diversity, leading to 95,358 non-rare, hemi-SNPs shared among two or more cultivars and 33,747 locus-specific (diploid inheritance) SNPs. CONCLUSIONS Our study updates and expands the de novo transcriptome reference assembly available for durum wheat. 
Out of 180,108 assembled transcripts, 13,636 were specific to the Svevo cultivar as compared to the only other reference transcriptome available for durum, thus contributing to the identification of the tetraploid wheat pan-transcriptome. Additionally, the analysis of 13 historically relevant hallmark varieties produced a SNP dataset that could successfully validate the genotyping in tetraploid wheat and provide a valuable resource for genomics-assisted breeding of both tetraploid and hexaploid wheats.
Affiliation(s)
- Vera Vendramin
- IGA Technology Services, via J. Linussio 51, 33100, Udine, Italy
- Danara Ormanbekova
- Department of Agricultural and Food Sciences DISTAL, University of Bologna, Viale G. Fanin 44, 40127, Bologna, Italy
- Simone Scalabrin
- IGA Technology Services, via J. Linussio 51, 33100, Udine, Italy
- Davide Scaglione
- IGA Technology Services, via J. Linussio 51, 33100, Udine, Italy
- Marco Maccaferri
- Department of Agricultural and Food Sciences DISTAL, University of Bologna, Viale G. Fanin 44, 40127, Bologna, Italy
- Pierluigi Martelli
- Biocomputing Group, University of Bologna, via San Giacomo 9/2, 40126, Bologna, Italy
- Silvio Salvi
- Department of Agricultural and Food Sciences DISTAL, University of Bologna, Viale G. Fanin 44, 40127, Bologna, Italy
- Irena Jurman
- Istituto di Genomica Applicata, via J. Linussio 51, 33100, Udine, Italy
- Rita Casadio
- Biocomputing Group, University of Bologna, via San Giacomo 9/2, 40126, Bologna, Italy
- Roberto Tuberosa
- Department of Agricultural and Food Sciences DISTAL, University of Bologna, Viale G. Fanin 44, 40127, Bologna, Italy
- Andrea Massi
- Società produttori Sementi Bologna, Via Macero 1, 40050, Argelato, BO, Italy
- Michele Morgante
- Istituto di Genomica Applicata, via J. Linussio 51, 33100, Udine, Italy
- Department of Agricultural, Food, Environmental and Animal Research - DI4A, University of Udine, via delle Scienze 206, 33100, Udine, Italy
8
Abstract
Drugs modulate disease states through their actions on targets in the body. Determining these targets aids the focused development of new treatments, and helps to better characterize those already employed. One means of accomplishing this is through the deployment of in silico methodologies, harnessing computational analytical and predictive power to produce educated hypotheses for experimental verification. Here, we provide an overview of the current state of the art, describe some of the well-established methods in detail, and reflect on how they, and emerging technologies promoting the incorporation of complex and heterogeneous data-sets, can be employed to improve our understanding of (poly)pharmacology.
Affiliation(s)
- Ryan Byrne
- Department of Chemistry and Applied Biosciences, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland
- Gisbert Schneider
- Department of Chemistry and Applied Biosciences, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland
9
Abstract
Surveys of public sequence resources show that experimentally supported functional information is still completely missing for a considerable fraction of known proteins and is clearly incomplete for an even larger portion. Bioinformatics methods have long made use of very diverse data sources alone or in combination to predict protein function, with the understanding that different data types help elucidate complementary biological roles. This chapter focuses on methods accepting amino acid sequences as input and producing GO term assignments directly as outputs; the relevant biological and computational concepts are presented along with the advantages and limitations of individual approaches.
Affiliation(s)
- Domenico Cozzetto
- Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
- David T Jones
- Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
10
Fontanesi L, Di Palma F, Flicek P, Smith AT, Thulin CG, Alves PC. LaGomiCs-Lagomorph Genomics Consortium: An International Collaborative Effort for Sequencing the Genomes of an Entire Mammalian Order. J Hered 2016; 107:295-308. [PMID: 26921276 DOI: 10.1093/jhered/esw010] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 09/20/2015] [Accepted: 02/02/2016] [Indexed: 01/07/2023]
Abstract
The order Lagomorpha comprises about 90 living species, divided in 2 families: the pikas (Family Ochotonidae), and the rabbits, hares, and jackrabbits (Family Leporidae). Lagomorphs are important economically and scientifically as major human food resources, valued game species, pests of agricultural significance, model laboratory animals, and key elements in food webs. A quarter of the lagomorph species are listed as threatened. They are native to all continents except Antarctica, and occur up to 5000 m above sea level, from the equator to the Arctic, spanning a wide range of environmental conditions. The order has notable taxonomic problems presenting significant difficulties for defining a species due to broad phenotypic variation, overlap of morphological characteristics, and relatively recent speciation events. At present, only the genomes of 2 species, the European rabbit (Oryctolagus cuniculus) and American pika (Ochotona princeps) have been sequenced and assembled. Starting from a paucity of genome information, the main scientific aim of the Lagomorph Genomics Consortium (LaGomiCs), born from a cooperative initiative of the European COST Action "A Collaborative European Network on Rabbit Genome Biology-RGB-Net" and the World Lagomorph Society (WLS), is to provide an international framework for the sequencing of the genome of all extant and selected extinct lagomorphs. Sequencing the genomes of an entire order will provide a large amount of information to address biological problems not only related to lagomorphs but also to all mammals. We present current and planned sequencing programs and outline the final objective of LaGomiCs possible through broad international collaboration.
Affiliation(s)
- Luca Fontanesi
- Division of Animal Sciences, Department of Agricultural and Food Sciences, University of Bologna, Bologna, Italy
- Federica Di Palma
- Vertebrate and Health Genomics, The Genome Analysis Centre (TGAC), Norwich, UK
- Broad Institute of MIT and Harvard, Cambridge, MA
- Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Andrew T Smith
- School of Life Sciences, Arizona State University, Tempe, AZ
- Carl-Gustaf Thulin
- Department of Wildlife, Fish, and Environmental Studies, Swedish University of Agricultural Sciences, Umeå, Sweden
- Paulo C Alves
- CIBIO, Centro de Investigação em Biodiversidade e Recursos Geneticos, Universidade do Porto, Campus Agrario de Vairao, Vairao, Portugal
- Departamento de Biologia, Faculdade de Ciências da Universidade do Porto, Porto, Portugal
11
GoFDR: A sequence alignment based method for predicting protein functions. Methods 2016; 93:3-14. [DOI: 10.1016/j.ymeth.2015.08.009] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Received: 06/01/2015] [Revised: 07/27/2015] [Accepted: 08/11/2015] [Indexed: 01/01/2023]
12
Dorden S, Mahadevan P. Functional prediction of hypothetical proteins in human adenoviruses. Bioinformation 2015; 11:466-73. [PMID: 26664031 PMCID: PMC4658645 DOI: 10.6026/97320630011466] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/11/2015] [Accepted: 10/15/2015] [Indexed: 02/06/2023]
Abstract
Assigning functional information to hypothetical proteins in virus genomes is crucial for gaining insight into their proteomes. Human adenoviruses are medium sized viruses that cause a range of diseases. Their genomes possess proteins with uncharacterized function known as hypothetical proteins. Using a wide range of protein function prediction servers, functional information was obtained about these hypothetical proteins. A comparison of functional information obtained from these servers revealed that some of them produced functional information, while others provided little functional information about these human adenovirus hypothetical proteins. The PFP, ESG, PSIPRED, 3d2GO, and ProtFun servers produced the most functional information regarding these hypothetical proteins.
Affiliation(s)
- Shane Dorden
- Department of Biology, University of Tampa, 401 W. Kennedy Blvd., Box 3F, Tampa, FL, 33606, USA
- Padmanabhan Mahadevan
- Department of Biology, University of Tampa, 401 W. Kennedy Blvd., Box 3F, Tampa, FL, 33606, USA
13
Profiti G, Fariselli P, Casadio R. AlignBucket: a tool to speed up 'all-against-all' protein sequence alignments optimizing length constraints. Bioinformatics 2015; 31:3841-3. [PMID: 26231432 DOI: 10.1093/bioinformatics/btv451] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/14/2015] [Accepted: 07/24/2015] [Indexed: 11/13/2022]
Abstract
MOTIVATION The next-generation sequencing era requires reliable, fast and efficient approaches for the accurate annotation of the ever-increasing number of biological sequences and their variations. Transfer of annotation upon similarity search is a standard approach. The procedure of all-against-all protein comparison is a preliminary step of different available methods that annotate sequences based on information already present in databases. Given the current volume of sequences, methods are necessary to pre-process data to reduce the time of sequence comparison. RESULTS We present an algorithm that optimizes the partition of a large volume of sequences (the whole database) into sets where sequence length values (in residues) are constrained depending on a bounded minimal and expected alignment coverage. The idea is to optimally group protein sequences according to their length, and then compute the all-against-all sequence alignments among sequences that fall in a selected length range. We describe a mathematically optimal solution and we show that our method leads to a 5-fold speed-up in real world cases. AVAILABILITY AND IMPLEMENTATION The software is available for downloading at http://www.biocomp.unibo.it/~giuseppe/partitioning.html. CONTACT giuseppe.profiti2@unibo.it. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
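The length-constraint idea can be sketched in a few lines of Python. This is a simple prefilter written for illustration, not the optimal partitioning algorithm the authors describe. It assumes that coverage is measured relative to the longer sequence, so a pair can reach a minimum coverage c only if the shorter length is at least c times the longer length. The function name `candidate_pairs` and the example lengths are hypothetical.

```python
def candidate_pairs(lengths, min_coverage):
    """List pairs whose lengths allow the required alignment coverage.

    lengths:      dict mapping sequence id to length in residues (positive).
    min_coverage: required coverage as a fraction in (0, 1], assumed to be
                  measured on the longer sequence of each pair.
    Returns a list of (shorter_id, longer_id) tuples.
    """
    ordered = sorted(lengths.items(), key=lambda kv: (kv[1], kv[0]))
    pairs = []
    start = 0  # index of the shortest sequence still compatible
    for j, (id_long, len_long) in enumerate(ordered):
        # Skip sequences too short to ever reach the coverage threshold.
        while ordered[start][1] < min_coverage * len_long:
            start += 1
        for i in range(start, j):
            pairs.append((ordered[i][0], id_long))
    return pairs


# Hypothetical sequence lengths in residues.
lengths = {"a": 100, "b": 105, "c": 200, "d": 210}
pairs = candidate_pairs(lengths, min_coverage=0.9)
print(pairs)
```

With these four toy sequences, only two of the six possible pairs need to be aligned, which shows where the saving in an all-against-all comparison comes from.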
Collapse
Affiliation(s)
- Giuseppe Profiti
- Department of Computer Science and Engineering, via Mura Anteo Zamboni 7, Bologna, Bologna Biocomputing group, via S. Giacomo 9/2, Bologna and Health Sciences and Technologies ICIR, via Tolara di Sopra 41/E, Ozzano dell'Emilia, Italy
- Piero Fariselli
- Department of Computer Science and Engineering, via Mura Anteo Zamboni 7, Bologna, Bologna Biocomputing group, via S. Giacomo 9/2, Bologna and
- Rita Casadio
- Bologna Biocomputing group, via S. Giacomo 9/2, Bologna and Health Sciences and Technologies ICIR, via Tolara di Sopra 41/E, Ozzano dell'Emilia, Italy
|
14
|
Piovesan D, Giollo M, Leonardi E, Ferrari C, Tosatto SCE. INGA: protein function prediction combining interaction networks, domain assignments and sequence similarity. Nucleic Acids Res 2015; 43:W134-40. [PMID: 26019177 PMCID: PMC4489281 DOI: 10.1093/nar/gkv523] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Received: 03/01/2015] [Accepted: 05/07/2015] [Indexed: 01/10/2023] Open
Abstract
Identifying protein functions can be useful for numerous applications in biology. The prediction of gene ontology (GO) functional terms from sequence remains however a challenging task, as shown by the recent CAFA experiments. Here we present INGA, a web server developed to predict protein function from a combination of three orthogonal approaches. Sequence similarity and domain architecture searches are combined with protein-protein interaction network data to derive consensus predictions for GO terms using functional enrichment. The INGA server can be queried both programmatically through RESTful services and through a web interface designed for usability. The latter provides output supporting the GO term predictions with the annotating sequences. INGA is validated on the CAFA-1 data set and was recently shown to perform consistently well in the CAFA-2 blind test. The INGA web server is available from URL: http://protein.bio.unipd.it/inga.
Affiliation(s)
- Damiano Piovesan
- Department of Biomedical Sciences, University of Padua, Padua 35121, Italy
- Manuel Giollo
- Department of Biomedical Sciences, University of Padua, Padua 35121, Italy; Department of Information Engineering, University of Padua, Padua 35121, Italy
- Emanuela Leonardi
- Department of Women's and Children's Health, University of Padua, Padua 35128, Italy
- Carlo Ferrari
- Department of Information Engineering, University of Padua, Padua 35121, Italy
- Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padua, Padua 35121, Italy; CNR Institute of Neuroscience, Padua 35121, Italy
|
15
|
Martin AJM, Walsh I, Di Domenico T, Mičetić I, Tosatto SCE. PANADA: protein association network annotation, determination and analysis. PLoS One 2013; 8:e78383. [PMID: 24265686 PMCID: PMC3827049 DOI: 10.1371/journal.pone.0078383] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 07/04/2013] [Accepted: 09/20/2013] [Indexed: 11/18/2022] Open
Abstract
Increasingly large numbers of proteins require methods for functional annotation. This is typically based on pairwise inference from the homology of either protein sequence or structure. Recently, similarity networks have been presented to leverage both the ability to visualize relationships between proteins and assess the transferability of functional inference. Here we present PANADA, a novel toolkit for the visualization and analysis of protein similarity networks in Cytoscape. Networks can be constructed based on pairwise sequence or structural alignments either on a set of proteins or, alternatively, by database search from a single sequence. The Panada web server, executable for download and examples and extensive help files are available at URL: http://protein.bio.unipd.it/panada/.
Affiliation(s)
- Ian Walsh
- Department of Biology, University of Padova, Padova, Italy
- Ivan Mičetić
- Department of Biology, University of Padova, Padova, Italy
|
16
|
Piovesan D, Profiti G, Martelli PL, Fariselli P, Fontanesi L, Casadio R. SUS-BAR: a database of pig proteins with statistically validated structural and functional annotation. Database: The Journal of Biological Databases and Curation 2013; 2013:bat065. [PMID: 24065691 PMCID: PMC3781388 DOI: 10.1093/database/bat065] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
Given the relevance of the pig proteome in different studies, including human complex maladies, a statistical validation of the annotation is required for a better understanding of the role of specific genes and proteins in the complex networks underlying biological processes in the animal. Presently, approximately 80% of the pig proteome is still poorly annotated, and the existence of protein sequences is routinely inferred automatically by sequence alignment towards preexisting sequences. In this article, we introduce SUS-BAR, a database that derives information mainly from UniProt Knowledgebase and that includes 26 206 pig protein sequences. In SUS-BAR, 16 675 of the pig protein sequences are endowed with statistically validated functional and structural annotation. Our statistical validation is determined by adopting a cluster-centric annotation procedure that allows transfer of different types of annotation, including structure and function. Each sequence in the database can be associated with a set of statistically validated Gene Ontology (GO) terms of the three main sub-ontologies (Molecular Function, Biological Process and Cellular Component), with Pfam functional domains, and when possible, with a cluster Hidden Markov Model that allows modelling the 3D structure of the protein. A database search provides statistics demonstrating the enrichment in both GO and Pfam annotations of the pig proteins as compared with UniProt Knowledgebase annotation. Searching in SUS-BAR allows retrieval of the pig protein annotation for further analysis. The search is also possible on the basis of specific GO terms, and this allows retrieval of all the pig sequences participating in a given biological process, after annotation with our system. Alternatively, the search is possible on the basis of structural information, allowing retrieval of all the pig sequences with the same structural characteristics. Database URL: http://bar.biocomp.unibo.it/pig/
Affiliation(s)
- Damiano Piovesan
- Bologna Biocomputing Group, University of Bologna, via S. Giacomo 9/2, I-40126, Bologna, Italy, Department of Biological, Geological and Environmental Sciences (BIGEA), University of Bologna, via Selmi 3, I-40126, Bologna, Italy, Department of Computer Science and Engineering, University of Bologna, Mura A. Zamboni 7, I-40126, Bologna, Italy, Health Science and Technologies-ICIR, University of Bologna, Via Tolara di Sopra 41/E, I-40064, Ozzano dell'Emilia, Italy and Department of Agro-Food Science and Technology (DISTAL), University of Bologna, Viale Fanin 46, I-40127, Bologna, Italy
|
17
|
Piccoli S, Suku E, Garonzi M, Giorgetti A. Genome-wide Membrane Protein Structure Prediction. Curr Genomics 2013; 14:324-9. [PMID: 24403851 PMCID: PMC3763683 DOI: 10.2174/13892029113149990009] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 05/20/2013] [Revised: 07/19/2013] [Accepted: 07/22/2013] [Indexed: 01/25/2023] Open
Abstract
Transmembrane proteins allow cells to extensively communicate with the external world in a very accurate and specific way. They form principal nodes in several signaling pathways and attract large interest in therapeutic intervention, as the majority of pharmaceutical compounds target membrane proteins. Thus, according to the current genome annotation methods, a detailed structural/functional characterization at the protein level of each of the elements codified in the genome is also required. The extreme difficulty in obtaining high-resolution three-dimensional structures calls for computational approaches. Here we review to what extent the efforts made in the last few years, combining the structural characterization of membrane proteins with protein bioinformatics techniques, could help describe membrane proteins at a genome-wide scale. In particular we analyze the use of comparative modeling techniques as a way of overcoming the lack of high-resolution three-dimensional structures in the human membrane proteome.
Affiliation(s)
- Stefano Piccoli
- Applied Bioinformatics Group, Dept. of Biotechnology, University of Verona, strada Le grazie 15, 37134, Verona, Italy
- Eda Suku
- Applied Bioinformatics Group, Dept. of Biotechnology, University of Verona, strada Le grazie 15, 37134, Verona, Italy
- Marianna Garonzi
- Applied Bioinformatics Group, Dept. of Biotechnology, University of Verona, strada Le grazie 15, 37134, Verona, Italy
- Alejandro Giorgetti
- Applied Bioinformatics Group, Dept. of Biotechnology, University of Verona, strada Le grazie 15, 37134, Verona, Italy
- German Research School for Simulation Sciences, Juelich, Germany
- Center for Biomedical Computing (CBMC), University of Verona, strada Le grazie 8, 37134, Verona, Italy
|
18
|
Piovesan D, Martelli PL, Fariselli P, Profiti G, Zauli A, Rossi I, Casadio R. How to inherit statistically validated annotation within BAR+ protein clusters. BMC Bioinformatics 2013; 14 Suppl 3:S4. [PMID: 23514411 PMCID: PMC3584929 DOI: 10.1186/1471-2105-14-s3-s4] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/10/2022] Open
Abstract
Background: In the genomic era a key issue is protein annotation, namely how to endow protein sequences, upon translation from the corresponding genes, with structural and functional features. Routinely this operation is electronically done by deriving and integrating information from previous knowledge. The reference database for protein sequences is UniProtKB, divided into two sections: UniProtKB/TrEMBL, which is automatically annotated and not reviewed, and UniProtKB/Swiss-Prot, which is manually annotated and reviewed. The annotation process is essentially based on sequence similarity search. The question therefore arises as to what extent annotation based on transfer by inheritance is valuable and, specifically, whether it is possible to statistically validate inherited features when little homology exists between the target sequence and its template(s). Results: In this paper we address the problem of annotating protein sequences in a statistically validated manner, considering UniProtKB as the reference annotation resource. The test case is the set of 48,298 proteins recently released by the Critical Assessment of Function Annotations (CAFA) organization. We show that we can transfer, after validation, Gene Ontology (GO) terms of the three main categories and Pfam domains to about 68% and 72% of the sequences, respectively. This is possible after alignment of the CAFA sequences towards BAR+, our annotation resource that allows discriminating between statistically validated and not statistically validated annotation. By comparing with a direct UniProtKB annotation, we find that besides validating annotation of some 78% of the CAFA set, we assign new and statistically validated annotation to 14.8% of the sequences and find new structural templates for about 25% of the chains, half of which share less than 30% sequence identity to the corresponding template(s).
Conclusion: Inheritance of annotation by transfer generally requires a careful selection of the identity value between the target and the template in order to transfer structural and/or functional features. Here we prove that even distantly remote homologs can be safely endowed with structural templates and GO and/or Pfam terms, provided that annotation is done within clusters collecting cluster-related protein sequences and where a statistical validation of the shared structural and functional features is possible.
|
19
|
Piovesan D, Profiti G, Martelli PL, Casadio R. The human "magnesome": detecting magnesium binding sites on human proteins. BMC Bioinformatics 2012; 13 Suppl 14:S10. [PMID: 23095498 PMCID: PMC3439678 DOI: 10.1186/1471-2105-13-s14-s10] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Indexed: 11/10/2022] Open
Abstract
Background: Magnesium research is increasing in molecular medicine due to the relevance of this ion in several important biological processes and associated molecular pathogeneses. It is still difficult to predict from the protein covalent structure whether a human chain is or is not involved in magnesium binding. This is mainly due to little information on the structural characteristics of magnesium binding sites in proteins and protein complexes. Magnesium binding features, differently from those of other divalent cations such as calcium and zinc, are elusive. Here we address a question that is relevant in protein annotation: how many human proteins can bind Mg2+? Our analysis is performed taking advantage of the recently implemented Bologna Annotation Resource (BAR-PLUS), a non-hierarchical clustering method that relies on the pairwise sequence comparison of about 14 million proteins from over 300,000 species and their grouping into clusters where annotation can safely be inherited after statistical validation. Results: After cluster assignment of the latest version of the human proteome, the total number of human proteins for which we can assign putative Mg binding sites is 3,751. Among these proteins, 2,688 inherit annotation directly from human templates and 1,063 inherit annotation from templates of other organisms. Protein structures are highly conserved inside a given cluster. Transfer of structural properties is possible after alignment of a given sequence with the protein structures that characterise a given cluster as obtained with a Hidden Markov Model (HMM) based procedure. Interestingly a set of 370 human sequences inherit Mg2+ binding sites from templates sharing less than 30% sequence identity with the template. Conclusion: We describe and deliver the "human magnesome", a set of proteins of the human proteome that inherit putative binding of magnesium ions.
With our BAR-hMG, 251 clusters including 1,341 magnesium binding protein structures corresponding to 387 sequences are sufficient to annotate some 13,689 residues in 3,751 human sequences as "magnesium binding". Protein structures act therefore as three dimensional seeds for structural and functional annotation of human sequences. The database collects specifically all the human proteins that can be annotated according to our procedure as "magnesium binding", the corresponding structures and BAR+ clusters from where they derive the annotation (http://bar.biocomp.unibo.it/mg).
Affiliation(s)
- Damiano Piovesan
- Biocomputing Group, Department of Biology, University of Bologna, Bologna, 40126, Italy
|