1
Ghaly TM, Rajabal V, Penesyan A, Coleman NV, Paulsen IT, Gillings MR, Tetu SG. Functional enrichment of integrons: Facilitators of antimicrobial resistance and niche adaptation. iScience 2023; 26:108301. [PMID: 38026211 PMCID: PMC10661359 DOI: 10.1016/j.isci.2023.108301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 10/11/2023] [Accepted: 10/19/2023] [Indexed: 12/01/2023] Open
Abstract
Integrons are genetic elements, found among diverse bacteria and archaea, that capture and rearrange gene cassettes to rapidly generate genetic diversity and drive adaptation. Despite their broad taxonomic and geographic prevalence, and their role in microbial adaptation, the functions of gene cassettes remain poorly characterized. Here, using a combination of bioinformatic and experimental analyses, we examined the functional diversity of gene cassettes from different environments. We find that cassettes encode diverse antimicrobial resistance (AMR) determinants, including those conferring resistance to antibiotics currently in the developmental pipeline. Further, we find a subset of cassette functions is universally enriched relative to their broader metagenomes. These are largely involved in (a)biotic interactions, including AMR, phage defense, virulence, biodegradation, and stress tolerance. The remainder of functions are sample-specific, suggesting that they confer localised functions relevant to their microenvironment. Together, they comprise functional profiles different from bulk metagenomes, representing niche-adaptive components of the prokaryotic pangenome.
Affiliation(s)
- Timothy M. Ghaly
- School of Natural Sciences, Macquarie University, New South Wales 2109, Australia
- Vaheesan Rajabal
- School of Natural Sciences, Macquarie University, New South Wales 2109, Australia
- ARC Centre of Excellence in Synthetic Biology, Macquarie University, New South Wales 2109, Australia
- Anahit Penesyan
- School of Natural Sciences, Macquarie University, New South Wales 2109, Australia
- ARC Centre of Excellence in Synthetic Biology, Macquarie University, New South Wales 2109, Australia
- Nicholas V. Coleman
- School of Natural Sciences, Macquarie University, New South Wales 2109, Australia
- Ian T. Paulsen
- School of Natural Sciences, Macquarie University, New South Wales 2109, Australia
- ARC Centre of Excellence in Synthetic Biology, Macquarie University, New South Wales 2109, Australia
- Michael R. Gillings
- School of Natural Sciences, Macquarie University, New South Wales 2109, Australia
- ARC Centre of Excellence in Synthetic Biology, Macquarie University, New South Wales 2109, Australia
- Sasha G. Tetu
- School of Natural Sciences, Macquarie University, New South Wales 2109, Australia
- ARC Centre of Excellence in Synthetic Biology, Macquarie University, New South Wales 2109, Australia
2
Abstract
The vast, mostly unknown protein universe can be explored by analyzing protein sequences as a string of domains. A broader coverage can be achieved when these domains, the essential blocks in protein evolution, are detected using sequence profiles. Using clustering to collapse redundant profiles into unique function words (UFWs), we find that over the years 2009–2016, the number of UFWs saturates while the number of sequences matched by a combination of two or more UFWs grows exponentially. Between 2009 and 2016 the number of protein sequences from known species increased 10-fold from 8 million to 85 million. About 80% of these sequences contain at least one region recognized by the conserved domain architecture retrieval tool (CDART) as a sequence motif. Motifs provide clues to biological function but CDART often matches the same region of a protein by two or more profiles. Such synonyms complicate estimates of functional complexity. We do full-linkage clustering of redundant profiles by finding maximum disjoint cliques: Each cluster is replaced by a single representative profile to give what we term a unique function word (UFW). From 2009 to 2016, the number of sequence profiles used by CDART increased by 80%; the number of UFWs increased more slowly by 30%, indicating that the number of UFWs may be saturating. The number of sequences matched by a single UFW (sequences with single domain architectures) increased as slowly as the number of different words, whereas the number of sequences matched by a combination of two or more UFWs in sequences with multiple domain architectures (MDAs) increased at the same rate as the total number of sequences. This combinatorial arrangement of a limited number of UFWs in MDAs accounts for the genomic diversity of protein sequences. Although eukaryotes and prokaryotes use very similar sets of “words” or UFWs (57% shared), the “sentences” (MDAs) are different (1.3% shared).
3
Garrido-Martín D, Pazos F. Effect of the sequence data deluge on the performance of methods for detecting protein functional residues. BMC Bioinformatics 2018; 19:67. [PMID: 29482506 PMCID: PMC5827975 DOI: 10.1186/s12859-018-2084-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/07/2017] [Accepted: 02/21/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The exponential accumulation of new sequences in public databases is expected to improve the performance of all the approaches for predicting protein structural and functional features. Nevertheless, this was never assessed or quantified for some widely used methodologies, such as those aimed at detecting functional sites and functional subfamilies in protein multiple sequence alignments. Using raw protein sequences as only input, these approaches can detect fully conserved positions, as well as those with a family-dependent conservation pattern. Both types of residues are routinely used as predictors of functional sites and, consequently, understanding how the sequence content of the databases affects them is relevant and timely. RESULTS In this work we evaluate how the growth and change with time in the content of sequence databases affect five sequence-based approaches for detecting functional sites and subfamilies. We do that by recreating historical versions of the multiple sequence alignments that would have been obtained in the past based on the database contents at different time points, covering a period of 20 years. Applying the methods to these historical alignments allows quantifying the temporal variation in their performance. Our results show that the number of families to which these methods can be applied sharply increases with time, while their ability to detect potentially functional residues remains almost constant. CONCLUSIONS These results are informative for the methods' developers and final users, and may have implications in the design of new sequencing initiatives.
Affiliation(s)
- Diego Garrido-Martín
- Present address: Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, c/ Dr. Aiguader, 88, 08003, Barcelona, Spain
- Present address: Universitat Pompeu Fabra (UPF), Plaça de la Mercè, 10-12, 08002, Barcelona, Spain
- Florencio Pazos
- Computational Systems Biology Group, Systems Biology Program, National Centre for Biotechnology (CNB-CSIC), c/ Darwin, 3, 28049, Madrid, Spain
4
Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res 2016; 45:D170-D176. [PMID: 27899574 PMCID: PMC5614098 DOI: 10.1093/nar/gkw1081] [Citation(s) in RCA: 366] [Impact Index Per Article: 45.8] [Received: 08/15/2016] [Revised: 10/14/2016] [Accepted: 11/01/2016] [Indexed: 11/27/2022] Open
Abstract
We present three clustered protein sequence databases, Uniclust90, Uniclust50, Uniclust30 and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2 software for fast and sensitive protein sequence searching and clustering. Uniclust sequences are annotated with matches to Pfam, SCOP domains, and proteins in the PDB, using our HHblits homology detection tool. Due to its high sensitivity, Uniclust contains 17% more Pfam domain annotations than UniProt. Uniboost MSAs of three diversities are built by enriching the Uniclust30 MSAs with local sequence matches from MMseqs2 profile searches through Uniclust30. All databases can be downloaded from the Uniclust server at uniclust.mmseqs.com. Users can search clusters by keywords and explore their MSAs, taxonomic representation, and annotations. Uniclust is updated every two months with the new UniProt release.
Affiliation(s)
- Milot Mirdita
- Quantitative and Computational Biology Group, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
- Lars von den Driesch
- Quantitative and Computational Biology Group, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Clovis Galiez
- Quantitative and Computational Biology Group, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
- Maria J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Johannes Söding
- Quantitative and Computational Biology Group, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
- Martin Steinegger
- Quantitative and Computational Biology Group, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
- Department for Bioinformatics and Computational Biology, Technische Universität München, Munich, Germany
- Department of Chemistry, Seoul National University, Seoul, Korea
5
Hauser M, Steinegger M, Söding J. MMseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics 2016; 32:1323-30. [PMID: 26743509 DOI: 10.1093/bioinformatics/btw006] [Citation(s) in RCA: 84] [Impact Index Per Article: 10.5] [Received: 09/24/2015] [Accepted: 01/01/2016] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Sequence databases are growing fast, challenging existing analysis pipelines. Reducing the redundancy of sequence databases by similarity clustering improves speed and sensitivity of iterative searches. But existing tools cannot efficiently cluster databases of the size of UniProt to 50% maximum pairwise sequence identity or below. Furthermore, in metagenomics experiments typically large fractions of reads cannot be matched to any known sequence anymore because searching with sensitive but relatively slow tools (e.g. BLAST or HMMER3) through comprehensive databases such as UniProt is becoming too costly. RESULTS MMseqs (Many-against-Many sequence searching) is a software suite for fast and deep clustering and searching of large datasets, such as UniProt, or 6-frame translated metagenomics sequencing reads. MMseqs contains three core modules: a fast and sensitive prefiltering module that sums up the scores of similar k-mers between query and target sequences, an SSE2- and multi-core-parallelized local alignment module, and a clustering module. In our homology detection benchmarks, MMseqs is much more sensitive and 4-30 times faster than UBLAST and RAPsearch, respectively, although it does not reach BLAST sensitivity yet. Using its cascaded clustering workflow, MMseqs can cluster large databases down to ∼30% sequence identity at hundreds of times the speed of BLASTclust and much deeper than CD-HIT and USEARCH. MMseqs can also update a database clustering in linear instead of quadratic time. Its much improved sensitivity-speed trade-off should make MMseqs attractive for a wide range of large-scale sequence analysis tasks. AVAILABILITY AND IMPLEMENTATION MMseqs is open-source software available under GPL at https://github.com/soedinglab/MMseqs CONTACT martin.steinegger@mpibpc.mpg.de, soeding@mpibpc.mpg.de SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Maria Hauser
- Gene Center, Ludwig-Maximilians-Universität München, Munich 81377, Germany
- Martin Steinegger
- Gene Center, Ludwig-Maximilians-Universität München, Munich 81377, Germany, Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen 37077, Germany and TUM, Department of Informatics, Bioinformatics & Computational Biology-I12, Garching 85748, Germany
- Johannes Söding
- Gene Center, Ludwig-Maximilians-Universität München, Munich 81377, Germany, Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen 37077, Germany and
6
Abstract
Unravelling the genotype–phenotype relationship in humans remains a challenging task in genomics studies. Recent advances in sequencing technologies mean there are now thousands of sequenced human genomes, revealing millions of single nucleotide variants (SNVs). For non-synonymous SNVs present in proteins the difficulties of the problem lie in first identifying those nsSNVs that result in a functional change in the protein among the many non-functional variants and in turn linking this functional change to phenotype. Here we present VarMod (Variant Modeller) a method that utilises both protein sequence and structural features to predict nsSNVs that alter protein function. VarMod develops recent observations that functional nsSNVs are enriched at protein–protein interfaces and protein–ligand binding sites and uses these characteristics to make predictions. In benchmarking on a set of nearly 3000 nsSNVs VarMod performance is comparable to an existing state of the art method. The VarMod web server provides extensive resources to investigate the sequence and structural features associated with the predictions including visualisation of protein models and complexes via an interactive JSmol molecular viewer. VarMod is available for use at http://www.wasslab.org/varmod.
Affiliation(s)
- Morena Pappalardo
- Centre for Molecular Processing, School of Biosciences, University of Kent, CT2 7NH, UK
- Mark N Wass
- Centre for Molecular Processing, School of Biosciences, University of Kent, CT2 7NH, UK
7
Yates CM, Filippis I, Kelley LA, Sternberg MJE. SuSPect: enhanced prediction of single amino acid variant (SAV) phenotype using network features. J Mol Biol 2014; 426:2692-701. [PMID: 24810707 PMCID: PMC4087249 DOI: 10.1016/j.jmb.2014.04.026] [Citation(s) in RCA: 165] [Impact Index Per Article: 16.5] [Received: 01/31/2014] [Revised: 04/23/2014] [Accepted: 04/28/2014] [Indexed: 11/16/2022]
Abstract
Whole-genome and exome sequencing studies reveal many genetic variants between individuals, some of which are linked to disease. Many of these variants lead to single amino acid variants (SAVs), and accurate prediction of their phenotypic impact is important. Incorporating sequence conservation and network-level features, we have developed a method, SuSPect (Disease-Susceptibility-based SAV Phenotype Prediction), for predicting how likely SAVs are to be associated with disease. SuSPect performs significantly better than other available batch methods on the VariBench benchmarking dataset, with a balanced accuracy of 82%. SuSPect is available at www.sbg.bio.ic.ac.uk/suspect. The Web site has been implemented in Perl and SQLite and is compatible with modern browsers. An SQLite database of possible missense variants in the human proteome is available to download at www.sbg.bio.ic.ac.uk/suspect/download.html. Bioinformatics approaches are key for identification of disease-causing variants. SAV phenotype prediction can be improved using network information. A method including these features, SuSPect, outperforms tested methods. SuSPect is available to use at www.sbg.bio.ic.ac.uk/suspect.
Affiliation(s)
- Christopher M Yates
- Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London SW7 2AZ, UK
- Ioannis Filippis
- Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London SW7 2AZ, UK
- Lawrence A Kelley
- Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London SW7 2AZ, UK
- Michael J E Sternberg
- Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London SW7 2AZ, UK
8
Kryshtafovych A, Fidelis K, Moult J. CASP10 results compared to those of previous CASP experiments. Proteins 2013; 82 Suppl 2:164-74. [PMID: 24150928 DOI: 10.1002/prot.24448] [Citation(s) in RCA: 88] [Impact Index Per Article: 8.0] [Received: 07/31/2013] [Revised: 10/04/2013] [Accepted: 10/04/2013] [Indexed: 11/11/2022]
Abstract
We compare results of the community efforts in modeling protein structures in the tenth CASP experiment, with those in earlier CASPs particularly in CASP5, a decade ago. There is a substantial improvement in template based model accuracy as reflected in more successful modeling of regions of structure not easily derived from a single experimental structure template, most likely reflecting intensive work within the modeling community in developing methods that make use of multiple templates, as well as the increased number of experimental structures available. Deriving structural information not obvious from a template is the most demanding as well as one of the most useful tasks that modeling can perform. Thus this is gratifying progress. By contrast, overall backbone accuracy of models appears little changed in the last decade. This puzzling result is explained by two factors--increased database size in some ways makes it harder to choose the best available templates, and the increased intrinsic difficulty of CASP targets as experimental work has progressed to larger and more unusual structures. There is no detectable recent improvement in template-free modeling, but again, this may reflect the changing nature of CASP targets.
9
Hauser M, Mayer CE, Söding J. kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinformatics 2013; 14:248. [PMID: 23945046 PMCID: PMC3843501 DOI: 10.1186/1471-2105-14-248] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.5] [Received: 04/24/2013] [Accepted: 08/12/2013] [Indexed: 11/13/2022] Open
Abstract
Background Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable. Results Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%–30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%–30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed. Conclusions kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at http://toolkit.lmb.uni-muenchen.de/pub/kClust/.
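The alignment-free prefilter idea in this abstract can be illustrated with a deliberately simplified sketch. It is not kClust's implementation: kClust scores similar 6-mers, whereas this sketch counts only exact shared k-mers, and the function names (`kmer_counts`, `shared_kmer_score`, `prefilter`) and the `min_score` threshold are hypothetical names chosen for illustration.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in a sequence (empty if len(seq) < k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmer_score(a: str, b: str, k: int = 6) -> int:
    """Number of exact k-mer matches shared by two sequences.

    Simplifying assumption: only identical k-mers count. A real prefilter
    would also score similar (non-identical) k-mers via a substitution matrix.
    """
    # Counter intersection keeps the smaller count for each shared k-mer.
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())


def prefilter(query: str, targets: dict, k: int = 6, min_score: int = 1) -> list:
    """Return names of targets whose shared k-mer score reaches min_score."""
    return [name for name, seq in targets.items()
            if shared_kmer_score(query, seq, k) >= min_score]


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    targets = {"t1": "AAMKTAYIAKAA", "t2": "WWWWWWWWWW"}
    print(prefilter("MKTAYIAKQR", targets))
```

Only the pairs that pass such a cheap filter would go on to the more expensive alignment step, which is where the speed gain comes from.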
Affiliation(s)
- Maria Hauser
- Gene Center and Center for Integrated Protein Science (CIPSM), Ludwig-Maximilians-Universität München, Feodor-Lynen-Str, 25, Munich 81377, Germany
10
Abstract
The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt to achieve this robustness, we propose to cluster sequences by their domain sequence, i.e. the ordered sequence of domains in their protein sequence. In a study of 347 genomes from Escherichia coli we find on average around 4500 proteins having hits in Pfam-A in every genome, clustering into around 2500 distinct domain sequence families in each genome. Across all genomes we find a total of 5724 such families. A binomial mixture model approach indicates this is around 95% of all domain sequences we would expect to see in E. coli in the future. A Heaps law analysis indicates the population of domain sequences is larger, but this analysis is also very sensitive to smaller changes in the computation procedure. The resolution between strains is good despite the coarse grouping obtained by domain sequence families. Clustering sequences by their ordered domain content gives us domain sequence families, which are robust to errors in the gene prediction step. The computational load of the procedure scales linearly with the number of genomes, which is needed for the future explosion in the number of re-sequenced strains. The use of domain sequence families for a functional classification of strains clearly has some potential to be explored.
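As a minimal sketch of the grouping idea described above, proteins can be keyed by the ordered tuple of their domain hits, so that two proteins fall in the same family exactly when they carry the same domains in the same order. The input layout (a list of `(start, domain)` pairs per protein), the function name `domain_sequence_families`, and the identifiers `domA`, `domB`, `p1` to `p4` are all assumptions made for illustration, not the authors' data model.

```python
from collections import defaultdict


def domain_sequence_families(proteins: dict) -> dict:
    """Group protein IDs by their ordered domain sequence.

    `proteins` maps a protein ID to a list of (start_position, domain_name)
    hits. Proteins with no domain hits are left out, mirroring the abstract's
    restriction to proteins that have hits.
    """
    families = defaultdict(list)
    for protein_id, hits in proteins.items():
        # Order domains by where they start along the protein.
        ordered = tuple(name for _, name in sorted(hits))
        if ordered:
            families[ordered].append(protein_id)
    return dict(families)


if __name__ == "__main__":
    # Hypothetical hits: p1 and p2 share the order domA -> domB, p3 is reversed.
    proteins = {
        "p1": [(120, "domB"), (5, "domA")],
        "p2": [(8, "domA"), (130, "domB")],
        "p3": [(10, "domB"), (150, "domA")],
        "p4": [],
    }
    for family, members in domain_sequence_families(proteins).items():
        print(family, members)
```

Each protein is visited once, so the work grows linearly with the number of proteins, consistent with the linear scaling the abstract reports for the overall procedure.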
Affiliation(s)
- Lars-Gustav Snipen
- Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Ås, Norway
- David W Ussery
- Centre for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark
11
Cai H, Kuang R, Gu J, Wang Y. Proteases in malaria parasites - a phylogenomic perspective. Curr Genomics 2012; 12:417-27. [PMID: 22379395 PMCID: PMC3178910 DOI: 10.2174/138920211797248565] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 05/20/2011] [Revised: 07/17/2011] [Accepted: 07/20/2011] [Indexed: 12/21/2022] Open
Abstract
Malaria continues to be one of the most devastating global health problems due to the high morbidity and mortality it causes in endemic regions. The search for new antimalarial targets is of high priority because of the increasing prevalence of drug resistance in malaria parasites. Malarial proteases constitute a class of promising therapeutic targets as they play important roles in the parasite life cycle and it is possible to design and screen for specific protease inhibitors. In this mini-review, we provide a phylogenomic overview of malarial proteases. An evolutionary perspective on the origin and divergence of these proteases will provide insights into the adaptive mechanisms of parasite growth, development, infection, and pathogenesis.
Affiliation(s)
- Hong Cai
- Department of Biology, University of Texas at San Antonio, San Antonio, TX 78249, USA
12
Defining sequence space and reaction products within the cyanuric acid hydrolase (AtzD)/barbiturase protein family. J Bacteriol 2012; 194:4579-88. [PMID: 22730121 DOI: 10.1128/jb.00791-12] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Indexed: 11/20/2022] Open
Abstract
Cyanuric acid hydrolases (AtzD) and barbiturases are homologous, found almost exclusively in bacteria, and comprise a rare protein family with no discernible linkage to other protein families or an X-ray structural class. There has been confusion in the literature and in genome projects regarding the reaction products, the assignment of individual sequences as either cyanuric acid hydrolases or barbiturases, and spurious connection of this family to another protein family. The present study has addressed those issues. First, the published enzyme reaction products of cyanuric acid hydrolase are incorrectly identified as biuret and carbon dioxide. The current study employed (13)C nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry to show that cyanuric acid hydrolase releases carboxybiuret, which spontaneously decarboxylates to biuret. This is significant because it revealed that homologous cyanuric acid hydrolases and barbiturases catalyze completely analogous reactions. Second, enzymes that had been annotated incorrectly in genome projects have been reassigned here by bioinformatics, gene cloning, and protein characterization studies. Third, the AtzD/barbiturase family has previously been suggested to consist of members of the amidohydrolase superfamily, a large class of metallohydrolases. Bioinformatics and the lack of bound metals both argue against a connection to the amidohydrolase superfamily. Lastly, steady-state kinetic measurements and observations of protein stability suggested that the AtzD/barbiturase family might be an undistinguished protein family that has undergone some resurgence with the recent introduction of industrial s-triazine compounds such as atrazine and melamine into the environment.
13
Wass MN, Barton G, Sternberg MJE. CombFunc: predicting protein function using heterogeneous data sources. Nucleic Acids Res 2012; 40:W466-70. [PMID: 22641853 PMCID: PMC3394346 DOI: 10.1093/nar/gks489] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.6] [Indexed: 12/19/2022] Open
Abstract
Only a small fraction of known proteins have been functionally characterized, making protein function prediction essential to propose annotations for uncharacterized proteins. In recent years many function prediction methods have been developed using various sources of biological data from protein sequence and structure to gene expression data. Here we present the CombFunc web server, which makes Gene Ontology (GO)-based protein function predictions. CombFunc incorporates ConFunc, our existing function prediction method, with other approaches for function prediction that use protein sequence, gene expression and protein–protein interaction data. In benchmarking on a set of 1686 proteins CombFunc obtains precision and recall of 0.71 and 0.64 respectively for gene ontology molecular function terms. For biological process GO terms precision of 0.74 and recall of 0.41 is obtained. CombFunc is available at http://www.sbg.bio.ic.ac.uk/combfunc.
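For readers unfamiliar with the two benchmark measures quoted above, the following sketch computes precision and recall for a single protein from sets of predicted and annotated terms. It is a generic illustration, not CombFunc's evaluation protocol, which may treat the GO hierarchy differently; the function name `precision_recall` and the term labels are placeholders.

```python
def precision_recall(predicted: set, annotated: set) -> tuple:
    """Return (precision, recall) for one set of predicted terms.

    precision = correct predictions / all predictions
    recall    = correct predictions / all true annotations
    Either value is reported as 0.0 when its denominator is empty.
    """
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


if __name__ == "__main__":
    # Placeholder term labels, not real GO identifiers.
    predicted = {"term1", "term2", "term3", "term4"}
    annotated = {"term1", "term2", "term5"}
    print(precision_recall(predicted, annotated))
```

A method that predicts few, safe terms tends toward high precision and low recall, which is one way to read the gap between 0.74 and 0.41 reported for biological process terms.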
Affiliation(s)
- Mark N Wass
- Centre for Bioinformatics, Imperial College London, London, SW7 2AZ, UK
14
Buenavista MT, Roche DB, McGuffin LJ. Improvement of 3D protein models using multiple templates guided by single-template model quality assessment. Bioinformatics 2012; 28:1851-7. [PMID: 22592378 DOI: 10.1093/bioinformatics/bts292] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Indexed: 11/14/2022]
Abstract
MOTIVATION Modelling the 3D structures of proteins can often be enhanced if more than one fold template is used during the modelling process. However, in many cases, this may also result in poorer model quality for a given target or alignment method. There is a need for modelling protocols that can both consistently and significantly improve 3D models and provide an indication of when models might not benefit from the use of multiple target-template alignments. Here, we investigate the use of both global and local model quality prediction scores produced by ModFOLDclust2, to improve the selection of target-template alignments for the construction of multiple-template models. Additionally, we evaluate clustering the resulting population of multi- and single-template models for the improvement of our IntFOLD-TS tertiary structure prediction method. RESULTS We find that using accurate local model quality scores to guide alignment selection is the most consistent way to significantly improve models for each of the sequence to structure alignment methods tested. In addition, using accurate global model quality for re-ranking alignments, prior to selection, further improves the majority of multi-template modelling methods tested. Furthermore, subsequent clustering of the resulting population of multiple-template models significantly improves the quality of selected models compared with the previous version of our tertiary structure prediction method, IntFOLD-TS. AVAILABILITY AND IMPLEMENTATION Source code and binaries can be freely downloaded from http://www.reading.ac.uk/bioinf/downloads/
Affiliation(s)
- Maria T Buenavista
- School of Biological Sciences, University of Reading, Whiteknights, Reading RG6 6AS, UK
15
Faraggi E, Zhang T, Yang Y, Kurgan L, Zhou Y. SPINE X: improving protein secondary structure prediction by multistep learning coupled with prediction of solvent accessible surface area and backbone torsion angles. J Comput Chem 2012; 33:259-67. [PMID: 22045506 PMCID: PMC3240697 DOI: 10.1002/jcc.21968] [Citation(s) in RCA: 187] [Impact Index Per Article: 15.6] [Received: 05/10/2011] [Revised: 09/16/2011] [Accepted: 09/18/2011] [Indexed: 11/11/2022]
Abstract
Accurate prediction of protein secondary structure is essential for accurate sequence alignment, three-dimensional structure modeling, and function prediction. The accuracy of ab initio secondary structure prediction from sequence, however, has only increased from around 77 to 80% over the past decade. Here, we developed a multistep neural-network algorithm by coupling secondary structure prediction with prediction of solvent accessibility and backbone torsion angles in an iterative manner. Our method called SPINE X was applied to a dataset of 2640 proteins (25% sequence identity cutoff) previously built for the first version of SPINE and achieved a 82.0% accuracy based on 10-fold cross validation (Q(3)). Surpassing 81% accuracy by SPINE X is further confirmed by employing an independently built test dataset of 1833 protein chains, a recently built dataset of 1975 proteins and 117 CASP 9 targets (critical assessment of structure prediction techniques) with an accuracy of 81.3%, 82.3% and 81.8%, respectively. The prediction accuracy is further improved to 83.8% for the dataset of 2640 proteins if the DSSP assignment used above is replaced by a more consistent consensus secondary structure assignment method. Comparison to the popular PSIPRED and CASP-winning structure-prediction techniques is made. SPINE X predicts number of helices and sheets correctly for 21.0% of 1833 proteins, compared to 17.6% by PSIPRED. It further shows that SPINE X consistently makes more accurate prediction in helical residues (6%) without over prediction while PSIPRED makes more accurate prediction in coil residues (3-5%) and over predicts them by 7%. SPINE X Server and its training/test datasets are available at http://sparks.informatics.iupui.edu/
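The Q3 figures quoted in this abstract are per-residue three-state accuracies: the percentage of residues whose predicted state (helix H, strand E, or coil C) matches the reference assignment. A minimal sketch, assuming both inputs are equal-length strings over H/E/C; the function name `q3_accuracy` and the example strings are illustrative only.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


if __name__ == "__main__":
    # Invented 8-residue example with one mismatch at the sixth residue.
    print(q3_accuracy("HHHEECCC", "HHHEEECC"))
```

Because the score depends on the reference assignment, swapping the assignment method changes the number, which is why the abstract reports a higher accuracy when DSSP is replaced by a consensus assignment.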
Affiliation(s)
- Eshel Faraggi
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave Ste 319, Walker Plaza Building, Indianapolis, Indiana 46202, USA
- Tuo Zhang
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave Ste 319, Walker Plaza Building, Indianapolis, Indiana 46202, USA
- Yuedong Yang
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave Ste 319, Walker Plaza Building, Indianapolis, Indiana 46202, USA
- Lukasz Kurgan
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave Ste 319, Walker Plaza Building, Indianapolis, Indiana 46202, USA
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada
- Yaoqi Zhou
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave Ste 319, Walker Plaza Building, Indianapolis, Indiana 46202, USA
|
16
|
Peng J, Xu J. RaptorX: exploiting structure information for protein alignment by statistical inference. Proteins 2011; 79 Suppl 10:161-71. [PMID: 21987485 DOI: 10.1002/prot.23175] [Citation(s) in RCA: 241] [Impact Index Per Article: 18.5] [Received: 04/06/2011] [Revised: 07/25/2011] [Accepted: 08/19/2011] [Indexed: 12/13/2022]
Abstract
This work presents RaptorX, a statistical method for template-based protein modeling that improves alignment accuracy by exploiting structural information in a single or multiple templates. RaptorX consists of three major components: single-template threading, alignment quality prediction, and multiple-template threading. This work summarizes the methods used by RaptorX and presents an analysis of its CASP9 results, aiming to identify the major bottlenecks in RaptorX and in template-based modeling, along with possible directions for further study. Our results show that template structural information substantially helps both single-template and multiple-template protein threading, especially when closely related templates are unavailable, and that considerable room for improvement remains in both alignment and template selection. The RaptorX web server is available at http://raptorx.uchicago.edu.
Affiliation(s)
- Jian Peng
- Toyota Technological Institute at Chicago, 6045 S. Kenwood Avenue, Chicago, IL 60637, USA
|
17
|
Godzik A. Metagenomics and the protein universe. Curr Opin Struct Biol 2011; 21:398-403. [PMID: 21497084 DOI: 10.1016/j.sbi.2011.03.010] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Received: 01/25/2011] [Revised: 03/07/2011] [Accepted: 03/24/2011] [Indexed: 02/07/2023]
Abstract
Metagenomic sequencing projects have dramatically increased our knowledge of the protein universe and provided over one-half of currently known protein sequences; they have also introduced a much broader phylogenetic diversity into the protein databases. The full analysis of metagenomic datasets is only beginning, but it has already led to the discovery of thousands of new protein families, likely representing novel functions specific to given environments. At the same time, a deeper analysis of such novel families, including experimental structure determination of some representatives, suggests that most of them represent distant homologs of already characterized protein families, and thus that most of the protein diversity present in the new environments is due to functional divergence of known protein families rather than the emergence of new ones.
Affiliation(s)
- Adam Godzik
- Program on Bioinformatics and Systems Biology, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037, USA.
|
18
|
Söding J, Remmert M. Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Curr Opin Struct Biol 2011; 21:404-11. [PMID: 21458982 DOI: 10.1016/j.sbi.2011.03.005] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.2] [Received: 01/31/2011] [Revised: 03/01/2011] [Accepted: 03/09/2011] [Indexed: 11/26/2022]
Abstract
Protein sequence comparison methods have grown increasingly sensitive during the last decade and can often identify distantly related proteins sharing a common ancestor some 3 billion years ago. Although cellular function is not conserved over such long timescales, the molecular functions and structures of protein domains often are. In combination with a domain-centered approach to function and structure prediction, modern remote homology detection methods have a great and largely underexploited potential for elucidating protein functions and evolution. Advances during the last few years include nonlinear scoring functions combining various sequence features, the use of sequence context information, and powerful new software packages. Since progress depends on realistically assessing new and existing methods, and published benchmarks are often hard to compare, we propose 10 rules of good-practice benchmarking.
Affiliation(s)
- Johannes Söding
- Gene Center and Center for Integrated Protein Science, Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, Munich, Germany.
|