1
Kinateder T, Mayer C, Nazet J, Sterner R. Improving enzyme functional annotation by integrating in vitro and in silico approaches: The example of histidinol phosphate phosphatases. Protein Sci 2024; 33:e4899. [PMID: 38284491 PMCID: PMC10804674 DOI: 10.1002/pro.4899] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 12/13/2023] [Accepted: 01/01/2024] [Indexed: 01/30/2024]
Abstract
Advances in sequencing technologies have led to a rapid growth of public protein sequence databases, whereby the fraction of proteins with experimentally verified function continuously decreases. This problem is currently addressed by automated functional annotations with computational tools, which however lack the accuracy of experimental approaches and are susceptible to error propagation. Here, we present an approach that combines the efficiency of functional annotation by in silico methods with the rigor of enzyme characterization in vitro. First, a thorough experimental analysis of a representative enzyme of a group of homologues is performed which includes a focused alanine scan of the active site to determine a fingerprint of function-determining residues. In a second step, this fingerprint is used in combination with a sequence similarity network to identify putative isofunctional enzymes among the homologues. Using this approach in a proof-of-principle study, homologues of the histidinol phosphate phosphatase (HolPase) from Pseudomonas aeruginosa, many of which were annotated as phosphoserine phosphatases, were predicted to be HolPases. This functional annotation of the homologues was verified by in vitro testing of several representatives and an analysis of the occurrence of annotated HolPases in the corresponding phylogenetic groups. Moreover, the application of the same approach to the homologues of the HolPase from the archaeon Nitrosopumilus maritimus, which is not related to the HolPase from P. aeruginosa and was newly discovered in the course of this work, led to the annotation of the putative HolPase from various archaeal species.
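The fingerprint step described in this abstract can be illustrated with a minimal Python sketch. It assumes a pre-computed multiple sequence alignment and a fingerprint given as allowed residues per alignment column; the sequences, column indices and function names below are hypothetical and are not taken from the paper. The sequence similarity network step is not shown.

```python
from typing import Dict, List


def matches_fingerprint(aligned_seq: str, fingerprint: Dict[int, str]) -> bool:
    """Return True if every fingerprint column holds an allowed residue.

    `fingerprint` maps a 0-based alignment column to a string of allowed
    one-letter residue codes (hypothetical representation).
    """
    return all(
        col < len(aligned_seq) and aligned_seq[col] in allowed
        for col, allowed in fingerprint.items()
    )


def predict_isofunctional(alignment: Dict[str, str],
                          fingerprint: Dict[int, str]) -> List[str]:
    """List the names of aligned sequences that carry the full fingerprint."""
    return sorted(
        name for name, seq in alignment.items()
        if matches_fingerprint(seq, fingerprint)
    )


# Made-up toy alignment; real input would come from an alignment tool.
alignment = {
    "ref_HolPase": "MDKAHRT",
    "homolog_A": "MDRAHKT",
    "homolog_B": "MEKAHRT",  # lacks the required residue in column 1
    "homolog_C": "MDK-HRT",
}
# Made-up fingerprint: column 1 must be D, column 4 H, column 5 R or K.
fingerprint = {1: "D", 4: "H", 5: "RK"}

print(predict_isofunctional(alignment, fingerprint))
```

In this sketch a homologue is only flagged as putatively isofunctional when all fingerprint positions match; a tolerance for partial matches would be a separate design decision.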
Affiliation(s)
- Thomas Kinateder
- Institute of Biophysics and Physical Biochemistry & Regensburg Center for Biochemistry, University of Regensburg, Regensburg, Germany
- Carina Mayer
- Institute of Biophysics and Physical Biochemistry & Regensburg Center for Biochemistry, University of Regensburg, Regensburg, Germany
- Julian Nazet
- Institute of Biophysics and Physical Biochemistry & Regensburg Center for Biochemistry, University of Regensburg, Regensburg, Germany
- Reinhard Sterner
- Institute of Biophysics and Physical Biochemistry & Regensburg Center for Biochemistry, University of Regensburg, Regensburg, Germany
2
Necci M, Piovesan D, Clementel D, Dosztányi Z, Tosatto SCE. MobiDB-lite 3.0: fast consensus annotation of intrinsic disorder flavours in proteins. Bioinformatics 2020; 36:5533-5534. [PMID: 33325498 DOI: 10.1093/bioinformatics/btaa1045] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 08/05/2020] [Revised: 11/03/2020] [Accepted: 12/07/2020] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION The earlier version of MobiDB-lite is currently used in large-scale proteome annotation platforms to detect intrinsic disorder. However, new theoretical models allow for the classification of intrinsically disordered regions into subtypes from sequence features associated with specific polymeric properties or compositional bias. RESULTS MobiDB-lite 3.0 maintains its previous speed and performance but also provides a finer classification of disorder by identifying regions with characteristics of polyampholytes, positive or negative polyelectrolytes, low-complexity regions, or regions enriched in cysteine, proline, glycine or polar residues. Sub-regions are abundantly detected in IDRs of the human proteome. The new version of MobiDB-lite represents a new step for the proteome-level analysis of protein disorder. AVAILABILITY Both the MobiDB-lite 3.0 source code and a docker container are available from the GitHub repository: https://github.com/BioComputingUP/MobiDB-lite.
Affiliation(s)
- Marco Necci
- Department of Biomedical Sciences, University of Padua, via U. Bassi 58/b, 35121 Padova, Italy
- Damiano Piovesan
- Department of Biomedical Sciences, University of Padua, via U. Bassi 58/b, 35121 Padova, Italy
- Damiano Clementel
- Department of Biomedical Sciences, University of Padua, via U. Bassi 58/b, 35121 Padova, Italy
- Zsuzsanna Dosztányi
- MTA-ELTE Lendület Bioinformatics Research Group, Department of Biochemistry, ELTE Eötvös Loránd University, Pázmány Péter sétány 1/c, Budapest, Hungary
- Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padua, via U. Bassi 58/b, 35121 Padova, Italy
3
Structure-diverse Phylomer libraries as a rich source of bioactive hits from phenotypic and target directed screens against intracellular proteins. Curr Opin Chem Biol 2017; 38:127-133. [DOI: 10.1016/j.cbpa.2017.03.016] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 03/11/2017] [Revised: 03/27/2017] [Accepted: 03/27/2017] [Indexed: 01/15/2023]
4
Li W, Fontanelli O, Miramontes P. Size distribution of function-based human gene sets and the split-merge model. R Soc Open Sci 2016; 3:160275. [PMID: 27853602 PMCID: PMC5108952 DOI: 10.1098/rsos.160275] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 04/22/2016] [Accepted: 07/01/2016] [Indexed: 06/06/2023]
Abstract
The sizes of paralogues (gene families produced by ancestral duplication) are known to follow a power-law distribution. We examine the size distribution of gene sets or gene families where genes are grouped by a similar function or share a common property. The size distribution of Human Gene Nomenclature Committee (HGNC) gene sets deviates from the power law, and can be fitted much better by a beta rank function. We propose a simple mechanism to break a power-law size distribution by a combination of splitting and merging operations. The largest gene sets are split into two to account for the subfunctional categories, and a small proportion of other gene sets are merged into larger sets as new common themes might be realized. These operations are not uncommon for a curator of gene sets. A simulation shows that iteration of these operations changes the size distribution of Ensembl paralogues and could lead to a distribution fitted by a beta rank function. We further illustrate the application of the beta rank function by the example of the distribution of transcription factors and drug target genes among HGNC gene families.
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA
- Oscar Fontanelli
- Departamento de Matemáticas, Facultad de Ciencias, Universidad Nacional Autónoma de México, Circuito Exterior, Ciudad Universitaria, México 04510 DF, México
- Pedro Miramontes
- Departamento de Matemáticas, Facultad de Ciencias, Universidad Nacional Autónoma de México, Circuito Exterior, Ciudad Universitaria, México 04510 DF, México
- Bioinformatics Group and Interdisciplinary Center for Bioinformatics, University of Leipzig, Haertelstrasse 16–18, 04107 Leipzig, Germany
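The split-merge mechanism summarized in the abstract of this entry can be sketched in Python as follows. This is only an illustrative reading of the abstract: the number of sets split per step, the merged fraction, the random cut point and the Zipf-like starting sizes are all assumptions, not parameters reported by the authors.

```python
import random
from typing import List


def split_merge_step(sizes: List[int], n_split: int, merge_frac: float,
                     rng: random.Random) -> List[int]:
    """Apply one split-merge iteration to a list of gene-set sizes.

    The `n_split` largest sets are each split into two non-empty parts, and
    roughly `merge_frac` of the remaining sets are merged pairwise.
    Both parameters are hypothetical knobs for this sketch.
    """
    ordered = sorted(sizes, reverse=True)
    largest, rest = ordered[:n_split], ordered[n_split:]

    pieces: List[int] = []
    for size in largest:
        if size < 2:  # a set of one gene cannot be split
            pieces.append(size)
            continue
        cut = rng.randint(1, size - 1)  # random cut point (assumption)
        pieces.extend([cut, size - cut])

    rng.shuffle(rest)
    n_merge = int(len(rest) * merge_frac) // 2 * 2  # even number of sets
    merged = [rest[i] + rest[i + 1] for i in range(0, n_merge, 2)]
    return pieces + merged + rest[n_merge:]


rng = random.Random(0)
start = [1000 // rank for rank in range(1, 201)]  # Zipf-like starting sizes
sizes = start
for _ in range(50):
    sizes = split_merge_step(sizes, n_split=2, merge_frac=0.05, rng=rng)

print(len(sizes), max(sizes), sum(sizes) == sum(start))
```

Each step conserves the total number of genes, so only the rank-size shape changes; comparing the resulting ranked sizes with a fitted rank function would be the next step and is not shown here.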
5
The impact of structural genomics: the first quindecennial. J Struct Funct Genomics 2016; 17:1-16. [PMID: 26935210 DOI: 10.1007/s10969-016-9201-5] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.0] [Received: 12/12/2015] [Accepted: 02/17/2016] [Indexed: 12/21/2022]
Abstract
The period 2000-2015 brought the advent of high-throughput approaches to protein structure determination. With overall funding on the order of $2 billion (in 2010 dollars), the structural genomics (SG) consortia established worldwide have developed pipelines for target selection, protein production, sample preparation, crystallization, and structure determination by X-ray crystallography and NMR. These efforts resulted in the determination of over 13,500 protein structures, mostly from unique protein families, and increased the structural coverage of the expanding protein universe. SG programs contributed over 4400 publications to the scientific literature. The NIH-funded Protein Structure Initiatives alone have produced over 2000 scientific publications, which to date have attracted more than 93,000 citations. Software and database developments that were necessary to handle high-throughput structure determination workflows have led to structures of better quality and improved integrity of the associated data. Organized and accessible data have a positive impact on the reproducibility of scientific experiments. Most of the experimental data generated by the SG centers are freely available to the community and have been utilized by scientists in various fields of research. SG projects have created, improved, streamlined, and validated many protocols for protein production and crystallization, data collection, and functional analysis, significantly benefiting biological and biomedical research.
6
The history of the CATH structural classification of protein domains. Biochimie 2015; 119:209-17. [PMID: 26253692 PMCID: PMC4678953 DOI: 10.1016/j.biochi.2015.08.004] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 07/01/2015] [Accepted: 08/01/2015] [Indexed: 11/21/2022]
Abstract
This article presents a historical review of the protein structure classification database CATH. Together with the SCOP database, CATH remains comprehensive and reasonably up-to-date with the now more than 100,000 protein structures in the PDB. We review the expansion of the CATH and SCOP resources to capture predicted domain structures in the genome sequence data and to provide information on the likely functions of proteins mediated by their constituent domains. The establishment of comprehensive function annotation resources has also meant that domain families can be functionally annotated, allowing insights into functional divergence and evolution within protein families.
7
Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 2014; 31:926-32. [PMID: 25398609 PMCID: PMC4375400 DOI: 10.1093/bioinformatics/btu739] [Citation(s) in RCA: 956] [Impact Index Per Article: 95.6] [Indexed: 11/15/2022] Open
Abstract
MOTIVATION UniRef databases provide full-scale clustering of UniProtKB sequences and are utilized for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap threshold. Our hypothesis is that these improvements would enhance the speed and sensitivity of similarity searches and improve the consistency of annotation within clusters. RESULTS Intra-cluster molecular function consistency was examined by analysis of Gene Ontology terms. Results show that UniRef clusters bring together proteins of identical molecular function in more than 97% of the clusters, implying that clusters are useful for annotation and can also be used to detect annotation inconsistencies. To examine coverage in similarity results, BLASTP searches against UniRef50 followed by expansion of the hit lists with cluster members demonstrated advantages compared with searches against UniProtKB sequences; the searches are concise (∼7 times shorter hit list before expansion), faster (∼6 times) and more sensitive in detection of remote similarities (>96% recall at e-value <0.0001). Our results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforce their reliability for use in functional annotation.
Affiliation(s)
- Baris E Suzek
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA; Department of Computer Engineering, Muğla Sıtkı Koçman University, Muğla 48000, Turkey; Center for Bioinformatics and Computational Biology and Protein Information Resource, University of Delaware, Newark, DE 19711, USA; European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; and Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland
- Yuqi Wang
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA; Department of Computer Engineering, Muğla Sıtkı Koçman University, Muğla 48000, Turkey; Center for Bioinformatics and Computational Biology and Protein Information Resource, University of Delaware, Newark, DE 19711, USA; European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; and Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland
- Hongzhan Huang
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA; Department of Computer Engineering, Muğla Sıtkı Koçman University, Muğla 48000, Turkey; Center for Bioinformatics and Computational Biology and Protein Information Resource, University of Delaware, Newark, DE 19711, USA; European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; and Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland
- Peter B McGarvey
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA; Department of Computer Engineering, Muğla Sıtkı Koçman University, Muğla 48000, Turkey; Center for Bioinformatics and Computational Biology and Protein Information Resource, University of Delaware, Newark, DE 19711, USA; European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; and Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland
- Cathy H Wu
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA; Department of Computer Engineering, Muğla Sıtkı Koçman University, Muğla 48000, Turkey; Center for Bioinformatics and Computational Biology and Protein Information Resource, University of Delaware, Newark, DE 19711, USA; European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; and Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland
8
Bray JE. Target selection for structural genomics based on combining fold recognition and crystallisation prediction methods: application to the human proteome. J Struct Funct Genomics 2012; 13:37-46. [PMID: 22354707 DOI: 10.1007/s10969-012-9130-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 12/01/2011] [Accepted: 02/07/2012] [Indexed: 11/29/2022]
Abstract
The objective of this study is to automatically identify regions of the human proteome that are suitable for 3D structure determination by X-ray crystallography and to annotate them according to their likelihood to produce diffraction quality crystals. The results provide a powerful tool for structural genomics laboratories who wish to select human proteins based on the statistical likelihood of crystallisation success. Combining fold recognition and crystallisation prediction algorithms enables the efficient calculation of the crystallisability of the entire human proteome. This novel study estimates that there are approximately 40,000 crystallisable regions in the human proteome. Currently, only 15% of these regions (approx. 6,000 sequences) have been solved to at least 95% sequence identity. The remaining unsolved regions have been categorised into 5 crystallisation classes and an integral membrane protein (IMP) class, based on established structure prediction, crystallisation prediction and transmembrane (TM) helix prediction algorithms. Approximately 750 unsolved regions (2% of the proteome) have been identified as having a PDB fold representative (template) and an 'optimal' likelihood of crystallisation. At the other end of the spectrum, more than 10,500 non-IMP regions with a PDB template are classified as 'very difficult' to crystallise (26%) and almost 2,500 regions (6%) were predicted to contain at least 3 TM helices. The 3D-SPECS (3D Structural Proteomics Explorer with Crystallisation Scores) website contains crystallisation predictions for the entire human proteome and can be found at http://www.bioinformaticsplus.org/3dspecs.
Affiliation(s)
- James E Bray
- Structural Genomics Consortium, University of Oxford, Old Road Campus Research Building, Roosevelt Drive, Oxford, OX3 7DQ, UK.
9
Abstract
Gene evolution has long been thought to be primarily driven by duplication and rearrangement mechanisms. However, every evolutionary lineage harbours orphan genes that lack homologues in other lineages and whose evolutionary origin is only poorly understood. Orphan genes might arise from duplication and rearrangement processes followed by fast divergence; however, de novo evolution out of non-coding genomic regions is emerging as an important additional mechanism. This process appears to provide raw material continuously for the evolution of new gene functions, which can become relevant for lineage-specific adaptations.
10
Herrada A, Eguíluz VM, Hernández-García E, Duarte CM. Scaling properties of protein family phylogenies. BMC Evol Biol 2011; 11:155. [PMID: 21645345 PMCID: PMC3277297 DOI: 10.1186/1471-2148-11-155] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 03/14/2011] [Accepted: 06/06/2011] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND One of the classical questions in evolutionary biology is how evolutionary processes are coupled at the gene and species level. With this motivation, we compare the topological properties (mainly the depth scaling, as a characterization of balance) of a large set of protein phylogenies with those of a set of species phylogenies. RESULTS The comparative analysis between protein and species phylogenies shows that both sets of phylogenies share a remarkably similar scaling behavior, suggesting the universality of branching rules and of the evolutionary processes that drive biological diversification from gene to species level. In order to explain such generality, we propose a simple model which allows us to estimate the proportion of evolvability/robustness needed to approximate the scaling behavior observed in the phylogenies, highlighting the relevance of the robustness of a biological system (species or protein) in the scaling properties of the phylogenetic trees. CONCLUSIONS The invariance of the scaling properties at levels spanning from genes to species suggests that rules that govern the incapability of a biological system to diversify are equally relevant both at the gene and at the species level.
Affiliation(s)
- Alejandro Herrada
- Instituto de Física Interdisciplinar y Sistemas Complejos, IFISC (CSIC-UIB), Campus Universitat de les Illes Balears, E-07122 Palma de Mallorca, Spain
- Víctor M Eguíluz
- Instituto de Física Interdisciplinar y Sistemas Complejos, IFISC (CSIC-UIB), Campus Universitat de les Illes Balears, E-07122 Palma de Mallorca, Spain
- Emilio Hernández-García
- Instituto de Física Interdisciplinar y Sistemas Complejos, IFISC (CSIC-UIB), Campus Universitat de les Illes Balears, E-07122 Palma de Mallorca, Spain
- Carlos M Duarte
- Instituto Mediterráneo de Estudios Avanzados, IMEDEA (CSIC-UIB), C/Miquel Marqués 21, E-07190 Esporles, Spain
- Oceans Institute, University of Western Australia, 35 Stirling Highway, Crawley 6009, Australia
11
Cai XH, Jaroszewski L, Wooley J, Godzik A. Internal organization of large protein families: relationship between the sequence, structure, and function-based clustering. Proteins 2011; 79:2389-402. [PMID: 21671455 DOI: 10.1002/prot.23049] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 10/26/2010] [Revised: 02/12/2011] [Accepted: 03/13/2011] [Indexed: 12/14/2022]
Abstract
The protein universe can be organized in families that group proteins sharing common ancestry. Such families display variable levels of structural and functional divergence, from homogenous families, where all members have the same function and very similar structure, to very divergent families, where large variations in function and structure are observed. For practical purposes of structure and function prediction, it would be beneficial to identify sub-groups of proteins with highly similar structures (iso-structural) and/or functions (iso-functional) within divergent protein families. We compared three algorithms in their ability to cluster large protein families and discuss whether any of these methods could reliably identify such iso-structural or iso-functional groups. We show that clustering using profile-sequence and profile-profile comparison methods closely reproduces clusters based on similarities between 3D structures or clusters of proteins with similar biological functions. In contrast, the still commonly used sequence-based methods with fixed thresholds result in vast overestimates of structural and functional diversity in protein families. As a result, these methods also overestimate the number of protein structures that have to be determined to fully characterize structural space of such families. The fact that one can build reliable models based on apparently distantly related templates is crucial for extracting maximal amount of information from new sequencing projects.
Affiliation(s)
- Xiao-Hui Cai
- Joint Center for Structural Genomics, Center for Research in Biological Systems, University of California, San Diego, California 92093-0446, USA
12
Abstract
Correct classification of genes into gene families is important for understanding gene function and evolution. Although gene families of many species have been resolved both computationally and experimentally with high accuracy, gene family classification in most newly sequenced genomes has not been done with the same high standard. This project has been designed to develop a strategy to effectively and accurately classify gene families across genomes. We first examine and compare the performance of computer programs developed for automated gene family classification. We demonstrate that some programs, including the hierarchical average-linkage clustering algorithm MC-UPGMA and the popular Markov clustering algorithm TRIBE-MCL, can reconstruct manual curation of gene families accurately. However, their performance is highly sensitive to parameter setting, i.e. different gene families require different program parameters for correct resolution. To circumvent the problem of parameterization, we have developed a comparative strategy for gene family classification. This strategy takes advantage of existing curated gene families of reference species to find suitable parameters for classifying genes in related genomes. To demonstrate the effectiveness of this novel strategy, we use TRIBE-MCL to classify chemosensory and ABC transporter gene families in C. elegans and its four sister species. We conclude that fully automated programs can establish biologically accurate gene families if parameterized accordingly. Comparative gene family classification finds optimal parameters automatically, thus allowing rapid insights into gene families of newly sequenced species.
Affiliation(s)
- Christian Frech
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada
- Nansheng Chen
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada
13
Cuff A, Redfern OC, Greene L, Sillitoe I, Lewis T, Dibley M, Reid A, Pearl F, Dallman T, Todd A, Garratt R, Thornton J, Orengo C. The CATH hierarchy revisited-structural divergence in domain superfamilies and the continuity of fold space. Structure 2010; 17:1051-62. [PMID: 19679085 PMCID: PMC2741583 DOI: 10.1016/j.str.2009.06.015] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 07/21/2008] [Revised: 06/24/2009] [Accepted: 06/25/2009] [Indexed: 11/29/2022]
Abstract
This paper explores the structural continuum in CATH and the extent to which superfamilies adopt distinct folds. Although most superfamilies are structurally conserved, in some of the most highly populated superfamilies (4% of all superfamilies) there is considerable structural divergence. While relatives share a similar fold in the evolutionary conserved core, diverse elaborations to this core can result in significant differences in the global structures. Applying similar protocols to examine the extent to which structural overlaps occur between different fold groups, it appears this effect is confined to just a few architectures and is largely due to small, recurring super-secondary motifs (e.g., αβ-motifs, α-hairpins). Although 24% of superfamilies overlap with superfamilies having different folds, only 14% of nonredundant structures in CATH are involved in overlaps. Nevertheless, the existence of these overlaps suggests that, in some regions of structure space, the fold universe should be seen as more continuous.
Affiliation(s)
- Alison Cuff
- Institute of Structural and Molecular Biology, University College London, London, UK.
14
Dessailly BH, Nair R, Jaroszewski L, Fajardo JE, Kouranov A, Lee D, Fiser A, Godzik A, Rost B, Orengo C. PSI-2: structural genomics to cover protein domain family space. Structure 2009; 17:869-81. [PMID: 19523904 DOI: 10.1016/j.str.2009.03.015] [Citation(s) in RCA: 106] [Impact Index Per Article: 7.1] [Received: 11/26/2008] [Revised: 03/18/2009] [Accepted: 03/22/2009] [Indexed: 11/25/2022]
Abstract
One major objective of structural genomics efforts, including the NIH-funded Protein Structure Initiative (PSI), has been to increase the structural coverage of protein sequence space. Here, we present the target selection strategy used during the second phase of PSI (PSI-2). This strategy, jointly devised by the bioinformatics groups associated with the PSI-2 large-scale production centers, targets representatives from large, structurally uncharacterized protein domain families, and from structurally uncharacterized subfamilies in very large and diverse families with incomplete structural coverage. These very large families are extremely diverse both structurally and functionally, and are highly overrepresented in known proteomes. On the basis of several metrics, we then discuss to what extent PSI-2, during its first 3 years, has increased the structural coverage of genomes, and contributed structural and functional novelty. Together, the results presented here suggest that PSI-2 is successfully meeting its objectives and provides useful insights into structural and functional space.
Affiliation(s)
- Benoît H Dessailly
- Department of Structural and Molecular Biology, University College of London, London WC1E6BT, UK.
15
Addou S, Rentzsch R, Lee D, Orengo CA. Domain-based and family-specific sequence identity thresholds increase the levels of reliable protein function transfer. J Mol Biol 2008; 387:416-30. [PMID: 19135455 DOI: 10.1016/j.jmb.2008.12.045] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.2] [Received: 08/13/2008] [Revised: 12/12/2008] [Accepted: 12/17/2008] [Indexed: 11/24/2022]
Abstract
Divergence in function of homologous proteins is based on both sequence and structural changes. Overall enzyme function has been reported to diverge earlier (50% sequence identity) than overall structure (35%). We herein study the functional conservation of enzymes and non-enzyme sequences using the protein domain families in CATH-Gene3D. Despite the rapid increase in sequence data since the last comprehensive study by Tian and Skolnick, our findings suggest that generic thresholds of 40% and 60% aligned sequence identity are still sufficient to safely inherit third-level and full Enzyme Commission numbers, respectively. This increases to 50% and 70% on the domain level, unless the multi-domain architecture matches. Assignments from the Kyoto Encyclopedia of Genes and Genomes and the Munich Information Center for Protein Sequences Functional Catalogue seem to be less conserved with sequence, probably due to a more pathway-centric view: 80% domain sequence identity is required for safe function transfer. Comparing domains (more pairwise relationships) and the use of family-specific thresholds (varying evolutionary speeds) yields the highest coverage rates when transferring functions to model proteomes. An average twofold increase in enzyme annotations is seen for 523 proteomes in Gene3D. As simple 'rules of thumb', sequence identity thresholds do not require a bioinformatics background. We will provide and update this information with future releases of CATH-Gene3D.
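The generic "rules of thumb" quoted in this abstract can be written as a small Python helper. This is a sketch of one reading of the abstract, not the authors' implementation: it assumes the thresholds are inclusive and that a matching multi-domain architecture restores the 40%/60% thresholds for domain-level comparisons. The function name and the example EC number are chosen only for illustration.

```python
from typing import Optional


def transferable_ec(ec_number: str, identity: float,
                    domain_level: bool = False,
                    architecture_matches: bool = False) -> Optional[str]:
    """Return the part of an EC number that may be inherited, or None.

    Thresholds follow the abstract: 40% / 60% aligned sequence identity for
    third-level / full EC numbers, rising to 50% / 70% on the domain level
    unless the multi-domain architecture matches. Treating the thresholds
    as inclusive is an assumption of this sketch.
    """
    if domain_level and not architecture_matches:
        third_level, full = 50.0, 70.0
    else:
        third_level, full = 40.0, 60.0

    if identity >= full:
        return ec_number
    if identity >= third_level:
        # Keep the first three EC levels and mask the fourth.
        return ".".join(ec_number.split(".")[:3]) + ".-"
    return None


print(transferable_ec("3.1.3.15", 65.0))
print(transferable_ec("3.1.3.15", 65.0, domain_level=True))
print(transferable_ec("3.1.3.15", 45.0, domain_level=True))
```

The family-specific thresholds that the abstract reports as giving the highest coverage would replace the two constants with per-family values; those values are not given in the abstract and are therefore not modelled here.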
Affiliation(s)
- Sarah Addou
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
|
16
|
Lee D, Redfern O, Orengo C. Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 2007; 8:995-1005. [PMID: 18037900 DOI: 10.1038/nrm2281] [Citation(s) in RCA: 354] [Impact Index Per Article: 20.8] [Indexed: 11/09/2022]
|
17
|
Yeats C, Lees J, Reid A, Kellam P, Martin N, Liu X, Orengo C. Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res 2007; 36:D414-8. [PMID: 18032434 PMCID: PMC2238970 DOI: 10.1093/nar/gkm1019] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.6] [Indexed: 11/13/2022]
Abstract
Gene3D provides comprehensive structural and functional annotation of most available protein sequences, including the UniProt, RefSeq and Integr8 resources. The main structural annotation is generated through scanning these sequences against the CATH structural domain database profile-HMM library. CATH is a database of manually derived PDB-based structural domains, placed within a hierarchy reflecting topology, homology and conservation and is able to infer more ancient and divergent homology relationships than sequence-based approaches. This data is supplemented with Pfam-A, other non-domain structural predictions (i.e. coiled coils) and experimental data from UniProt. In order to enhance the investigations possible with this data, we have also incorporated a variety of protein annotation resources, including protein-protein interaction data, GO functional assignments, KEGG pathways, FUNCAT functional descriptions and links to microarray expression data. All of this data can be accessed through a newly re-designed website that has a focus on flexibility and clarity, with searches that can be restricted to a single genome or across the entire sequence database. Currently Gene3D contains over 3.5 million domain assignments for nearly 5 million proteins including 527 completed genomes. This is available at: http://gene3d.biochem.ucl.ac.uk/
Affiliation(s)
- Corin Yeats
- UCL, Department of Molecular Biology & Biochemistry, Darwin Building, Gower St, London, UK.
|
18
|
Abeln S, Teubner C, Deane CM. Using phylogeny to improve genome-wide distant homology recognition. PLoS Comput Biol 2006; 3:e3. [PMID: 17238281 PMCID: PMC1779300 DOI: 10.1371/journal.pcbi.0030003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 07/05/2006] [Accepted: 11/20/2006] [Indexed: 11/19/2022]
Abstract
The gap between the number of known protein sequences and structures continues to widen, particularly as a result of sequencing projects for entire genomes. Recently there have been many attempts to generate structural assignments to all genes on sets of completed genomes using fold-recognition methods. We developed a method that detects false positives made by these genome-wide structural assignment experiments by identifying isolated occurrences. The method was tested using two sets of assignments, generated by SUPERFAMILY and PSI-BLAST, on 150 completed genomes. A phylogeny of these genomes was built and a parsimony algorithm was used to identify isolated occurrences by detecting occurrences that cause a gain at leaf level. Isolated occurrences tend to have high e-values, and in both sets of assignments, a sudden increase in isolated occurrences is observed for e-values >10^-8 for SUPERFAMILY and >10^-4 for PSI-BLAST. Conditions to predict false positives are based on these results. Independent tests confirm that the predicted false positives are indeed more likely to be incorrectly assigned. Evaluation of the predicted false positives also showed that the accuracy of profile-based fold-recognition methods might depend on secondary structure content and sequence length. We show that false positives generated by fold-recognition methods can be identified by considering structural occurrence patterns on completed genomes; occurrences that are isolated within the phylogeny tend to be less reliable. The method provides a new independent way to examine the quality of fold assignments and may be used to improve the output of any genome-wide fold assignment method.
Affiliation(s)
- Sanne Abeln
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Carlo Teubner
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Charlotte M Deane
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- * To whom correspondence should be addressed. E-mail:
|
19
|
Bryson K, Loux V, Bossy R, Nicolas P, Chaillou S, van de Guchte M, Penaud S, Maguin E, Hoebeke M, Bessières P, Gibrat JF. AGMIAL: implementing an annotation strategy for prokaryote genomes as a distributed system. Nucleic Acids Res 2006; 34:3533-45. [PMID: 16855290 PMCID: PMC1524909 DOI: 10.1093/nar/gkl471] [Citation(s) in RCA: 80] [Impact Index Per Article: 4.4] [Indexed: 11/23/2022]
Abstract
We have implemented a genome annotation system for prokaryotes called AGMIAL. Our approach embodies a number of key principles. First, expert manual annotators are seen as a critical component of the overall system; user interfaces were cyclically refined to satisfy their needs. Second, the overall process should be orchestrated in terms of a global annotation strategy; this facilitates coordination between a team of annotators and automatic data analysis. Third, the annotation strategy should allow progressive and incremental annotation from a time when only a few draft contigs are available, to when a final finished assembly is produced. The overall architecture employed is modular and extensible, being based on the W3 standard Web services framework. Specialized modules interact with two independent core modules that are used to annotate, respectively, genomic and protein sequences. AGMIAL is currently being used by several INRA laboratories to analyze genomes of bacteria relevant to the food-processing industry, and is distributed under an open source license.
Affiliation(s)
- S. Chaillou
- Flore Lactique et Environnement Carné, INRA, 78352 Jouy-en-Josas Cedex, France
- S. Penaud
- Génétique Microbienne, INRA, 78352 Jouy-en-Josas Cedex, France
- E. Maguin
- Génétique Microbienne, INRA, 78352 Jouy-en-Josas Cedex, France
- J-F Gibrat
- To whom correspondence should be addressed. Tel: +33 1 34 65 28 97; Fax: +33 1 34 65 29 01; E-mail:
|
20
|
Marsden RL, Ranea JAG, Sillero A, Redfern O, Yeats C, Maibaum M, Lee D, Addou S, Reeves GA, Dallman TJ, Orengo CA. Exploiting protein structure data to explore the evolution of protein function and biological complexity. Philos Trans R Soc Lond B Biol Sci 2006; 361:425-40. [PMID: 16524831 PMCID: PMC1609337 DOI: 10.1098/rstb.2005.1801] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Indexed: 11/12/2022]
Abstract
New directions in biology are being driven by the complete sequencing of genomes, which has given us the protein repertoires of diverse organisms from all kingdoms of life. In tandem with this accumulation of sequence data, worldwide structural genomics initiatives, advanced by the development of improved technologies in X-ray crystallography and NMR, are expanding our knowledge of structural families and increasing our fold libraries. Methods for detecting remote sequence similarities have also been made more sensitive and this means that we can map domains from these structural families onto genome sequences to understand how these families are distributed throughout the genomes and reveal how they might influence the functional repertoires and biological complexities of the organisms. We have used robust protocols to assign sequences from completed genomes to domain structures in the CATH database, allowing up to 60% of domain sequences in these genomes, depending on the organism, to be assigned to a domain family of known structure. Analysis of the distribution of these families throughout bacterial genomes identified more than 300 universal families, some of which had expanded significantly in proportion to genome size. These highly expanded families are primarily involved in metabolism and regulation and appear to make major contributions to the functional repertoire and complexity of bacterial organisms. When comparisons are made across all kingdoms of life, we find a smaller set of universal domain families (approx. 140), of which families involved in protein biosynthesis are the largest conserved component. Analysis of the behaviour of other families reveals that some (e.g. those involved in metabolism, regulation) have remained highly innovative during evolution, making it harder to trace their evolutionary ancestry. 
Structural analyses of metabolic families provide some insights into the mechanisms of functional innovation, which include changes in domain partnerships and significant structural embellishments leading to modulation of active sites and protein interactions.
Affiliation(s)
- Russell L Marsden
- Department of Biochemistry, University College London Gower Street, London WC1E 6BT, UK.
|
21
|
Yeats C, Maibaum M, Marsden R, Dibley M, Lee D, Addou S, Orengo CA. Gene3D: modelling protein structure, function and evolution. Nucleic Acids Res 2006; 34:D281-4. [PMID: 16381865 PMCID: PMC1347420 DOI: 10.1093/nar/gkj057] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.7] [Indexed: 12/02/2022]
Abstract
The Gene3D release 4 database and web portal provide a combined structural, functional and evolutionary view of the protein world. It is focussed on providing structural annotation for protein sequences without structural representatives, including the complete proteome sets of over 240 different species. The protein sequences have also been clustered into whole-chain families so as to aid functional prediction. The structural annotation is generated using HMM models based on the CATH domain families; CATH is a repository for manually deduced protein domains. Amongst the changes from the last publication are: the addition of over 100 genomes and the UniProt sequence database, domain data from Pfam, metabolic pathway and functional data from COGs, KEGG and GO, and protein–protein interaction data from MINT and BIND. The website has been rebuilt to allow more sophisticated querying and the data returned is presented in a clearer format with greater functionality. Furthermore, all data can be downloaded in a simple XML format, allowing users to carry out complex investigations at their own computers.
Affiliation(s)
- Corin Yeats
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London, WC1E 6BT, UK.
|
22
|
Marsden RL, Lee D, Maibaum M, Yeats C, Orengo CA. Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space. Nucleic Acids Res 2006; 34:1066-80. [PMID: 16481312 PMCID: PMC1373602 DOI: 10.1093/nar/gkj494] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.0] [Indexed: 11/22/2022]
Abstract
We present an analysis of 203 completed genomes in the Gene3D resource (including 17 eukaryotes), which demonstrates that the number of protein families is continually expanding over time and that singleton-sequences appear to be an intrinsic part of the genomes. A significant proportion of the proteomes can be assigned to fewer than 6000 well-characterized domain families with the remaining domain-like regions belonging to a much larger number of small uncharacterized families that are largely species specific. Our comprehensive domain annotation of 203 genomes enables us to provide more accurate estimates of the number of multi-domain proteins found in the three kingdoms of life than previous calculations. We find that 67% of eukaryotic sequences are multi-domain compared with 56% of sequences in prokaryotes. By measuring the domain coverage of genome sequences, we show that the structural genomics initiatives should aim to provide structures for less than a thousand structurally uncharacterized Pfam families to achieve reasonable structural annotation of the genomes. However, in large families, additional structures should be determined as these would reveal more about the evolution of the family and enable a greater understanding of how function evolves.
Affiliation(s)
- Russell L Marsden
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK.
|
23
|
Todd AE, Marsden RL, Thornton JM, Orengo CA. Progress of Structural Genomics Initiatives: An Analysis of Solved Target Structures. J Mol Biol 2005; 348:1235-60. [PMID: 15854658 DOI: 10.1016/j.jmb.2005.03.037] [Citation(s) in RCA: 103] [Impact Index Per Article: 5.4] [Received: 12/23/2004] [Revised: 02/28/2005] [Accepted: 03/15/2005] [Indexed: 11/27/2022]
Abstract
The explosion in gene sequence data and technological breakthroughs in protein structure determination inspired the launch of structural genomics (SG) initiatives. An often stated goal of structural genomics is the high-throughput structural characterisation of all protein sequence families, with the long-term hope of significantly impacting on the life sciences, biotechnology and drug discovery. Here, we present a comprehensive analysis of solved SG targets to assess progress of these initiatives. Eleven consortia have contributed 316 non-redundant entries and 323 protein chains to the Protein Data Bank (PDB), and 459 and 393 domains to the CATH and SCOP structure classifications, respectively. The quality and size of these proteins are comparable to those solved in traditional structural biology and, despite huge scope for duplicated efforts, only 14% of targets have a close homologue (≥30% sequence identity) solved by another consortium. Analysis of CATH and SCOP revealed the significant contribution that structural genomics is making to the coverage of superfamilies and folds. A total of 67% of SG domains in CATH are unique, lacking an already characterised close homologue in the PDB, whereas only 21% of non-SG domains are unique. For 29% of domains, structure determination revealed a remote evolutionary relationship not apparent from sequence, and 19% and 11% contributed new superfamilies and folds. The secondary structure class, fold and superfamily distributions of this dataset reflect those of the genomes. The domains fall into 172 different folds and 259 superfamilies in CATH but the distribution is highly skewed. The most populous of these are those that recur most frequently in the genomes. Whilst 11% of superfamilies are bacteria-specific, most are common to all three superkingdoms of life and together the 316 PDB entries have provided new and reliable homology models for 9287 non-redundant gene sequences in 206 completely sequenced genomes.
From the perspective of this analysis, it appears that structural genomics is on track to be a success, and it is hoped that this work will inform future directions of the field.
Affiliation(s)
- Annabel E Todd
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London, WC1E 6BT, UK.
|
24
|
Pearl F, Todd A, Sillitoe I, Dibley M, Redfern O, Lewis T, Bennett C, Marsden R, Grant A, Lee D, Akpor A, Maibaum M, Harrison A, Dallman T, Reeves G, Diboun I, Addou S, Lise S, Johnston C, Sillero A, Thornton J, Orengo C. The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res 2005; 33:D247-51. [PMID: 15608188 PMCID: PMC539978 DOI: 10.1093/nar/gki024] [Citation(s) in RCA: 211] [Impact Index Per Article: 11.1] [Indexed: 11/13/2022]
Abstract
The CATH database of protein domain structures (http://www.biochem.ucl.ac.uk/bsm/cath/) currently contains 43,229 domains classified into 1467 superfamilies and 5107 sequence families. Each structural family is expanded with sequence relatives from GenBank and completed genomes, using a variety of efficient sequence search protocols and reliable thresholds. This extended CATH protein family database contains 616,470 domain sequences classified into 23,876 sequence families. This results in the significant expansion of the CATH HMM model library to include models built from the CATH sequence relatives, giving a 10% increase in coverage for detecting remote homologues. An improved Dictionary of Homologous superfamilies (DHS) (http://www.biochem.ucl.ac.uk/bsm/dhs/) containing specific sequence, structural and functional information for each superfamily in CATH considerably assists manual validation of homologues. Information on sequence relatives in CATH superfamilies, GenBank and completed genomes is presented in the CATH associated DHS and Gene3D resources. Domain partnership information can be obtained from Gene3D (http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/). A new CATH server has been implemented (http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl) providing automatic classification of newly determined sequences and structures using a suite of rapid sequence and structure comparison methods. The statistical significance of matches is assessed and links are provided to the putative superfamily or fold group to which the query sequence or structure is assigned.
Affiliation(s)
- Frances Pearl
- Biochemistry and Molecular Biology Department, University College London, University of London, Gower Street, London WC1E 6BT, UK
|