1
Bou Dagher L, Madern D, Malbos P, Brochier-Armanet C. Persistent homology reveals strong phylogenetic signal in 3D protein structures. PNAS Nexus 2024; 3:pgae158. [PMID: 38689707 PMCID: PMC11058471 DOI: 10.1093/pnasnexus/pgae158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Accepted: 04/01/2024] [Indexed: 05/02/2024]
Abstract
Changes that occur in proteins over time provide a phylogenetic signal that can be used to decipher their evolutionary history and the relationships between organisms. Sequence comparison is the most common way to access this phylogenetic signal, while approaches based on 3D structure comparisons are still in their infancy. In this study, we propose an effective approach based on Persistent Homology Theory (PH) to extract the phylogenetic information contained in protein structures. PH provides efficient and robust algorithms for extracting and comparing geometric features from noisy datasets at different spatial resolutions. PH has a growing number of applications in the life sciences, including the study of proteins (e.g. classification, folding). However, it has never been used to study the phylogenetic signal that protein structures may contain. Here, using 518 protein families, representing 22,940 protein sequences and structures, from 10 major taxonomic groups, we show that distances calculated with PH from protein structures correlate strongly with phylogenetic distances calculated from protein sequences, at both small and large evolutionary scales. We test several methods for calculating PH distances and propose some refinements to improve their relevance for addressing evolutionary questions. This work opens up new perspectives in evolutionary biology by proposing an efficient way to access the phylogenetic signal contained in protein structures, as well as future developments of topological analysis in the life sciences.
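The comparison of structure-derived and sequence-derived distances described in this abstract can be illustrated with a minimal sketch. It assumes two symmetric pairwise distance matrices over the same proteins and uses a plain Pearson correlation on their upper triangles. The abstract does not state which correlation statistic the authors used, so the function names and the toy matrices below are hypothetical and purely illustrative.

```python
import math

def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (invented): phylogenetic distances between four proteins,
# and structural distances built as an exact linear function of them.
phylo = [
    [0.0, 0.2, 0.5, 0.9],
    [0.2, 0.0, 0.4, 0.8],
    [0.5, 0.4, 0.0, 0.6],
    [0.9, 0.8, 0.6, 0.0],
]
struct = [[0.0 if i == j else 2.0 * d + 0.1 for j, d in enumerate(row)]
          for i, row in enumerate(phylo)]

r = pearson(upper_triangle(phylo), upper_triangle(struct))
print(f"correlation between distance matrices: {r:.3f}")
```

Because the entries of a distance matrix are not independent, a significance estimate for such a correlation would normally come from a permutation scheme such as a Mantel test rather than from the standard Pearson p-value.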
Affiliation(s)
- Léa Bou Dagher
  - Université Claude Bernard Lyon 1, CNRS, VetAgro Sup, Laboratoire de Biométrie et Biologie Évolutive, UMR5558, F-69622 Villeurbanne, France
  - Université Claude Bernard Lyon 1, CNRS, Institut Camille Jordan, UMR5208, F-69622 Villeurbanne, France
  - Université Libanaise, Laboratoire de Mathématiques, École Doctorale en Science et Technologie, PO BOX 5 Hadath, Liban
- Dominique Madern
  - University Grenoble Alpes, CEA, CNRS, IBS, 38000 Grenoble, France
- Philippe Malbos
  - Université Claude Bernard Lyon 1, CNRS, Institut Camille Jordan, UMR5208, F-69622 Villeurbanne, France
- Céline Brochier-Armanet
  - Université Claude Bernard Lyon 1, CNRS, VetAgro Sup, Laboratoire de Biométrie et Biologie Évolutive, UMR5558, F-69622 Villeurbanne, France
2
Koehler Leman J, Szczerbiak P, Renfrew PD, Gligorijevic V, Berenberg D, Vatanen T, Taylor BC, Chandler C, Janssen S, Pataki A, Carriero N, Fisk I, Xavier RJ, Knight R, Bonneau R, Kosciolek T. Sequence-structure-function relationships in the microbial protein universe. Nat Commun 2023; 14:2351. [PMID: 37100781 PMCID: PMC10133388 DOI: 10.1038/s41467-023-37896-w] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 05/18/2022] [Accepted: 04/05/2023] [Indexed: 04/28/2023]
Abstract
For the past half-century, structural biologists relied on the notion that similar protein sequences give rise to similar structures and functions. While this assumption has driven research to explore certain parts of the protein universe, it disregards spaces that don't rely on this assumption. Here we explore areas of the protein universe where similar protein functions can be achieved by different sequences and different structures. We predict ~200,000 structures for diverse protein sequences from 1,003 representative genomes across the microbial tree of life and annotate them functionally on a per-residue basis. Structure prediction is accomplished using the World Community Grid, a large-scale citizen science initiative. The resulting database of structural models is complementary to the AlphaFold database, with regard to domains of life as well as sequence diversity and sequence length. We identify 148 novel folds and describe examples where we map specific functions to structural motifs. We also show that the structural space is continuous and largely saturated, highlighting the need for a shift in focus across all branches of biology, from obtaining structures to putting them into context and from sequence-based to sequence-structure-function based meta-omics analyses.
Affiliation(s)
- Julia Koehler Leman
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
  - Department of Biology, New York University, New York, NY, USA
- Pawel Szczerbiak
  - Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
- P Douglas Renfrew
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
  - Department of Biology, New York University, New York, NY, USA
- Vladimir Gligorijevic
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
  - Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
- Daniel Berenberg
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
  - Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
  - Center for Data Science, New York University, New York, NY, 10011, USA
  - Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, NY, USA
- Tommi Vatanen
  - Broad Institute, Cambridge, MA, USA
  - Liggins Institute, University of Auckland, Auckland, New Zealand
  - Research Program for Clinical and Molecular Metabolism, Faculty of Medicine, 00014 University of Helsinki, Helsinki, Finland
- Bryn C Taylor
  - Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
  - In Silico Discovery and External Innovation, Janssen Research and Development, San Diego, CA, 92122, USA
- Chris Chandler
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Stefan Janssen
  - Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, 92093, USA
  - Algorithmic Bioinformatics, Justus Liebig University Giessen, Giessen, Germany
- Andras Pataki
  - Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Nick Carriero
  - Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Ian Fisk
  - Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Ramnik J Xavier
  - Broad Institute, Cambridge, MA, USA
  - Center for Microbiome Informatics and Therapeutics, MIT, Cambridge, MA, 02139, USA
- Rob Knight
  - Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
  - Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, 92093, USA
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
  - Department of Bioengineering, University of California, San Diego, USA
- Richard Bonneau
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
  - Department of Biology, New York University, New York, NY, USA
  - Center for Data Science, New York University, New York, NY, 10011, USA
  - Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, NY, USA
  - Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
- Tomasz Kosciolek
  - Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
3
Roethel A, Biliński P, Ishikawa T. BioS2Net: Holistic Structural and Sequential Analysis of Biomolecules Using a Deep Neural Network. Int J Mol Sci 2022; 23:2966. [PMID: 35328384 PMCID: PMC8954277 DOI: 10.3390/ijms23062966] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/15/2022] [Revised: 03/05/2022] [Accepted: 03/08/2022] [Indexed: 01/07/2023]
Abstract
BACKGROUND: For decades, the rate of solving new biomolecular structures has been exceeding that at which their manual classification and feature characterisation can be carried out efficiently. Therefore, a new comprehensive and holistic tool for their examination is needed. METHODS: Here we propose the Biological Sequence and Structure Network (BioS2Net), which is a novel deep neural network architecture that extracts both sequential and structural information of biomolecules. Our architecture consists of four main parts: (i) a sequence convolutional extractor, (ii) a 3D structure extractor, (iii) a 3D structure-aware sequence temporal network, as well as (iv) a fusion and classification network. RESULTS: We have evaluated our approach using two protein fold classification datasets. BioS2Net achieved a 95.4% mean class accuracy on the eDD dataset and a 76% mean class accuracy on the F184 dataset. The accuracy of BioS2Net obtained on the eDD dataset was comparable to results achieved by previously published methods, confirming that the algorithm described in this article is a top-class solution for protein fold recognition. CONCLUSIONS: BioS2Net is a novel tool for the holistic examination of biomolecules of known structure and sequence. It is a reliable tool for protein analysis and their unified representation as feature vectors.
Affiliation(s)
- Albert Roethel
  - Department of Molecular Biology, Institute of Biochemistry, Faculty of Biology, University of Warsaw, 02-096 Warsaw, Poland
  - College of Inter-Faculty Individual Studies in Mathematics and Natural Sciences, University of Warsaw, 02-097 Warsaw, Poland
- Piotr Biliński
  - Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, 02-097 Warsaw, Poland
- Takao Ishikawa
  - Department of Molecular Biology, Institute of Biochemistry, Faculty of Biology, University of Warsaw, 02-096 Warsaw, Poland
  - Correspondence: Tel.: +48-22-5543111
4
Mura C, Veretnik S, Bourne PE. The Urfold: Structural similarity just above the superfold level? Protein Sci 2019; 28:2119-2126. [PMID: 31599042 PMCID: PMC6863707 DOI: 10.1002/pro.3742] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 08/06/2019] [Revised: 09/30/2019] [Accepted: 10/01/2019] [Indexed: 01/16/2023]
Abstract
We suspect that there is a level of granularity of protein structure intermediate between the classical levels of "architecture" and "topology," as reflected in such phenomena as extensive three-dimensional structural similarity above the level of (super)folds. Here, we examine this notion of architectural identity despite topological variability, starting with a concept that we call the "Urfold." We believe that this model could offer a new conceptual approach for protein structural analysis and classification: indeed, the Urfold concept may help reconcile various phenomena that have been frequently recognized or debated for years, such as the precise meaning of "significant" structural overlap and the degree of continuity of fold space. More broadly, the role of structural similarity in sequence↔structure↔function evolution has been studied via many models over the years; by addressing a conceptual gap that we believe exists between the architecture and topology levels of structural classification schemes, the Urfold eventually may help synthesize these models into a generalized, consistent framework. Here, we begin by qualitatively introducing the concept.
Affiliation(s)
- Cameron Mura
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia
- Stella Veretnik
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia
- Philip E Bourne
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia
  - School of Data Science, University of Virginia, Charlottesville, Virginia
5
Herman JL. Enhancing Statistical Multiple Sequence Alignment and Tree Inference Using Structural Information. Methods Mol Biol 2019; 1851:183-214. [PMID: 30298398 DOI: 10.1007/978-1-4939-8736-8_10] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/08/2023]
Abstract
For highly divergent sequences, there is often insufficient information to reliably construct alignments and phylogenetic trees. Since protein structure may be strongly conserved despite large divergences in sequence, structural information can be used to help identify homology in such cases. While there exist well-studied models of sequence evolution, structurally informed alignment methods have typically made use of geometric measures of deviation that do not take into account the underlying mutational processes. In order to integrate structural information into sequence-based evolutionary models, we recently developed a stochastic model of structural evolution on a phylogenetic tree and implemented this as the StructAlign plugin for the StatAlign statistical alignment package. In this chapter, we will outline the types of analyses that can be carried out using StructAlign, illustrating how the inclusion of structural information can be used to inform joint estimation of alignments and trees. StructAlign can also be used to infer branch-specific rates of structural evolution, and analysis of an example globin dataset highlights strong variation in the inferred rate across the tree. While structure is more highly conserved within clades, the rate of structural divergence as a function of sequence variation is larger between functionally divergent proteins. Allowing for the rate of structural divergence to vary over the tree results in an improved fit to the empirically observed pairwise RMSD values.
Affiliation(s)
- Joseph L Herman
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
6
Dybas JM, Fiser A. Development of a motif-based topology-independent structure comparison method to identify evolutionarily related folds. Proteins 2016; 84:1859-1874. [PMID: 27671894 PMCID: PMC5118133 DOI: 10.1002/prot.25169] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 03/31/2016] [Revised: 08/17/2016] [Accepted: 08/25/2016] [Indexed: 11/09/2022]
Abstract
Structure conservation, functional similarities, and homologous relationships that exist across diverse protein topologies suggest that some regions of the protein fold universe are continuous. However, the current structure classification systems are based on hierarchical organizations, which cannot accommodate structural relationships that span fold definitions. Here, we describe a novel, super-secondary-structure motif-based, topology-independent structure comparison method (SmotifCOMP) that is able to quantitatively identify structural relationships between disparate topologies. The basis of SmotifCOMP is a systematically defined super-secondary-structure motif library whose representative geometries are shown to be saturated in the Protein Data Bank and exhibit a unique distribution within the known folds. SmotifCOMP offers a robust and quantitative technique to compare domains that adopt different topologies since the method does not rely on a global superposition. SmotifCOMP is used to perform an exhaustive comparison of the known folds and the identified relationships are used to produce a nonhierarchical representation of the fold space that reflects the notion of a continuous and connected fold universe. The current work offers insight into previously hypothesized evolutionary relationships between disparate folds and provides a resource for exploring novel ones.
Affiliation(s)
- Joseph M. Dybas
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, 1300 Morris Park Avenue Bronx, NY 10461, USA
  - Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue Bronx, NY 10461, USA
- Andras Fiser
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, 1300 Morris Park Avenue Bronx, NY 10461, USA
  - Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue Bronx, NY 10461, USA
7
Edwards H, Deane CM. Structural Bridges through Fold Space. PLoS Comput Biol 2015; 11:e1004466. [PMID: 26372166 PMCID: PMC4570669 DOI: 10.1371/journal.pcbi.1004466] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 02/20/2015] [Accepted: 07/12/2015] [Indexed: 12/05/2022]
Abstract
Several protein structure classification schemes exist that partition the protein universe into structural units called folds. Yet these schemes do not discuss how these units sit relative to each other in a global structure space. In this paper we construct networks that describe such global relationships between folds in the form of structural bridges. We generate these networks using four different structural alignment methods across multiple score thresholds. The networks constructed using the different methods remain a similar distance apart regardless of the probability threshold defining a structural bridge. This suggests that at least some structural bridges are method specific and that any attempt to build a picture of structural space should not be reliant on a single structural superposition method. Despite these differences all representations agree on an organisation of fold space into five principal community structures: all-α, all-β sandwiches, all-β barrels, α/β and α + β. We project estimated fold ages onto the networks and find that not only are the pairings of unconnected folds associated with higher age differences than bridged folds, but this difference increases with the number of networks displaying an edge. We also examine different centrality measures for folds within the networks and how these relate to fold age. While these measures interpret the central core of fold space in varied ways they all identify the disposition of ancestral folds to fall within this core and that of the more recently evolved structures to provide the peripheral landscape. These findings suggest that evolutionary information is encoded along these structural bridges. Finally, we identify four highly central pivotal folds representing dominant topological features which act as key attractors within our landscapes.
Folds are considered to be the structural units which make up the protein universe. Structural classification schemes focus on the assignment and organisation of protein domains into folds. However, they do not suggest how different folds might relate to one another in a global way. We introduce the concept of bridges through fold space: significant similarities between these units. We consider four alignment methods and a dynamic approach to placing these bridges. A greater consensus between these methods cannot be achieved by simply increasing the stringency with which edges are assigned. Instead, we emphasise the importance of considering consensus maps and only report results where there is agreement across all networks. It is possible that a study of the bridges may reveal evolutionary relationships. Based on a phylogenetic analysis of structures, we find that bridges consistently fall between folds which evolved at similar times. Moreover, the landscapes all consist of a core of older folds, with younger structures more often seen at the periphery. Finally we identify four pivotal folds in the landscapes. They contain topological motifs which unite disparate regions of fold space.
Affiliation(s)
- Hannah Edwards
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
- Charlotte M. Deane
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
8
Rackovsky S. Nonlinearities in protein space limit the utility of informatics in protein biophysics. Proteins 2015; 83:1923-8. [PMID: 26315852 DOI: 10.1002/prot.24916] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 05/26/2015] [Revised: 08/12/2015] [Accepted: 08/20/2015] [Indexed: 11/08/2022]
Abstract
We examine the utility of informatic-based methods in computational protein biophysics. To do so, we use newly developed metric functions to define completely independent sequence and structure spaces for a large database of proteins. By investigating the relationship between these spaces, we demonstrate quantitatively the limits of knowledge-based correlation between the sequences and structures of proteins. It is shown that there are well-defined, nonlinear regions of protein space in which dissimilar structures map onto similar sequences (the conformational switch), and dissimilar sequences map onto similar structures (remote homology). These nonlinearities are shown to be quite common: almost half the proteins in our database fall into one or the other of these two regions. They are not anomalies, but rather intrinsic properties of structural encoding in amino acid sequences. It follows that extreme care must be exercised in using bioinformatic data as a basis for computational structure prediction. The implications of these results for protein evolution are examined.
Affiliation(s)
- S Rackovsky
  - Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York, 14853
  - Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, New York, New York, 10029
9
Lhota J, Hauptman R, Hart T, Ng C, Xie L. A new method to improve network topological similarity search: applied to fold recognition. Bioinformatics 2015; 31:2106-14. [PMID: 25717198 DOI: 10.1093/bioinformatics/btv125] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 10/08/2014] [Accepted: 02/21/2015] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Similarity search is the foundation of bioinformatics. It plays a key role in establishing structural, functional and evolutionary relationships between biological sequences. Although the power of the similarity search has increased steadily in recent years, a high percentage of sequences remain uncharacterized in the protein universe. Thus, new similarity search strategies are needed to efficiently and reliably infer the structure and function of new sequences. The existing paradigm for studying protein sequence, structure, function and evolution has been established based on the assumption that the protein universe is discrete and hierarchical. Cumulative evidence suggests that the protein universe is continuous. As a result, conventional sequence homology search methods may not be able to detect novel structural, functional and evolutionary relationships between proteins from weak and noisy sequence signals. To overcome the limitations in existing similarity search methods, we propose a new algorithmic framework, Enrichment of Network Topological Similarity (ENTS), to improve the performance of large scale similarity searches in bioinformatics. RESULTS: We apply ENTS to a challenging unsolved problem: protein fold recognition. Our rigorous benchmark studies demonstrate that ENTS considerably outperforms state-of-the-art methods. As the concept of ENTS can be applied to any similarity metric, it may provide a general framework for similarity search on any set of biological entities, given their representation as a network. AVAILABILITY AND IMPLEMENTATION: Source code is freely available upon request. CONTACT: lxie@iscb.org.
Affiliation(s)
- John Lhota
  - Hunter College High School, New York, NY 10128, U.S.A.
  - Department of Computer Science, Hunter College, The City University of New York, New York, NY 10065, U.S.A.
  - Department of Biological Sciences, Hunter College, The City University of New York, New York, NY 10065, U.S.A.
  - The Graduate Center, The City University of New York, New York, NY 10016, U.S.A.
- Ruth Hauptman
  - Hunter College High School, New York, NY 10128, U.S.A.
  - Department of Computer Science, Hunter College, The City University of New York, New York, NY 10065, U.S.A.
  - Department of Biological Sciences, Hunter College, The City University of New York, New York, NY 10065, U.S.A.
  - The Graduate Center, The City University of New York, New York, NY 10016, U.S.A.
- Thomas Hart
  - Hunter College High School, New York, NY 10128, U.S.A.
  - Department of Computer Science, Hunter College, The City University of New York, New York, NY 10065, U.S.A.
  - Department of Biological Sciences, Hunter College, The City University of New York, New York, NY 10065, U.S.A.
  - The Graduate Center, The City University of New York, New York, NY 10016, U.S.A.
- Clara Ng
  - Hunter College High School, New York, NY 10128, U.S.A.
  - Department of Computer Science, Hunter College, The City University of New York, New York, NY 10065, U.S.A.
  - Department of Biological Sciences, Hunter College, The City University of New York, New York, NY 10065, U.S.A.
  - The Graduate Center, The City University of New York, New York, NY 10016, U.S.A.
- Lei Xie
  - Hunter College High School, New York, NY 10128, U.S.A.
  - Department of Computer Science, Hunter College, The City University of New York, New York, NY 10065, U.S.A.
  - Department of Biological Sciences, Hunter College, The City University of New York, New York, NY 10065, U.S.A.
  - The Graduate Center, The City University of New York, New York, NY 10016, U.S.A.
10
Abstract
To explore protein space from a global perspective, we consider 9,710 SCOP (Structural Classification of Proteins) domains with up to 70% sequence identity and present all similarities among them as networks: In the "domain network," nodes represent domains, and edges connect domains that share "motifs," i.e., significantly sized segments of similar sequence and structure. We explore the dependence of the network on the thresholds that define the evolutionary relatedness of the domains. At excessively strict thresholds the network falls apart completely; for very lax thresholds, there are network paths between virtually all domains. Interestingly, at intermediate thresholds the network constitutes two regions that can be described as "continuous" versus "discrete." The continuous region comprises a large connected component, dominated by domains with alternating alpha and beta elements, and the discrete region includes the rest of the domains in isolated islands, each generally corresponding to a fold. We also construct the "motif network," in which nodes represent recurring motifs, and edges connect motifs that appear in the same domain. This network also features a large and highly connected component of motifs that originate from domains with alternating alpha/beta elements (and some all-alpha domains), and smaller isolated islands. Indeed, the motif network suggests that nature reuses such motifs extensively. The networks suggest evolutionary paths between domains and give hints about protein evolution and the underlying biophysics. They provide natural means of organizing protein space, and could be useful for the development of strategies for protein search and design.
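The threshold dependence described in this abstract (a network that falls apart at strict thresholds and becomes almost fully connected at lax ones) can be illustrated with a minimal sketch. It assumes each candidate edge carries a single similarity score and that an edge is kept when its score meets the threshold. The function name, the scoring convention and the toy edges are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def largest_component_fraction(n_nodes, scored_edges, threshold):
    """Fraction of nodes in the largest connected component when only
    edges with score >= threshold are kept (union-find)."""
    parent = list(range(n_nodes))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_edges:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    sizes = defaultdict(int)
    for node in range(n_nodes):
        sizes[find(node)] += 1
    return max(sizes.values()) / n_nodes

# Toy data (invented): six domains, edges given as (domain_i, domain_j, similarity).
toy_edges = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.5), (3, 4, 0.4), (4, 5, 0.2)]

for threshold in (0.95, 0.6, 0.1):
    fraction = largest_component_fraction(6, toy_edges, threshold)
    print(f"threshold={threshold}: largest component holds {fraction:.2f} of nodes")
```

In this toy case the strict threshold leaves every domain isolated, the intermediate one yields a connected component alongside isolated nodes, and the lax one joins all domains into a single component.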
11
Edwards H, Abeln S, Deane CM. Exploring fold space preferences of new-born and ancient protein superfamilies. PLoS Comput Biol 2013; 9:e1003325. [PMID: 24244135 PMCID: PMC3828129 DOI: 10.1371/journal.pcbi.1003325] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 04/30/2013] [Accepted: 09/23/2013] [Indexed: 11/18/2022]
Abstract
The evolution of proteins is one of the fundamental processes that has delivered the diversity and complexity of life we see around us today. While we tend to define protein evolution in terms of sequence level mutations, insertions and deletions, it is hard to translate these processes to a more complete picture incorporating a polypeptide's structure and function. By considering how protein structures change over time we can gain an entirely new appreciation of their long-term evolutionary dynamics. In this work we seek to identify how populations of proteins at different stages of evolution explore their possible structure space. We apply an annotation of superfamily age to this space and explore the relationship between these ages and a diverse set of properties pertaining to a superfamily's sequence, structure and function. We note several marked differences between the populations of newly evolved and ancient structures, such as in their length distributions, secondary structure content and tertiary packing arrangements. In particular, many of these differences suggest a less elaborate structure for newly evolved superfamilies when compared with their ancient counterparts. We show that the structural preferences we report are not a residual effect of a more fundamental relationship with function. Furthermore, we demonstrate the robustness of our results, using significant variation in the algorithm used to estimate the ages. We present these age estimates as a useful tool to analyse protein populations. In particular, we apply this in a comparison of domains containing Greek key or jelly roll motifs.
Affiliation(s)
- Hannah Edwards
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
- Sanne Abeln
  - Department of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands
- Charlotte M. Deane
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
12
Affiliation(s)
- Rachel Kolodny
  - Department of Computer Science, University of Haifa, Haifa 31905, Israel
- Leonid Pereyaslavets
  - Department of Structural Biology, Stanford University, Stanford, California 94305
- Michael Levitt
  - Department of Structural Biology, Stanford University, Stanford, California 94305
13
Goldman AD, Baross JA, Samudrala R. The enzymatic and metabolic capabilities of early life. PLoS One 2012; 7:e39912. [PMID: 22970111 PMCID: PMC3438178 DOI: 10.1371/journal.pone.0039912] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 01/13/2011] [Accepted: 06/04/2012] [Indexed: 12/24/2022]
Abstract
We introduce the concept of metaconsensus and employ it to make high confidence predictions of early enzyme functions and the metabolic properties that they may have produced. Several independent studies have used comparative bioinformatics methods to identify taxonomically broad features of genomic sequence data, protein structure data, and metabolic pathway data in order to predict physiological features that were present in early, ancestral life forms. But all such methods carry with them some level of technical bias. Here, we cross-reference the results of these previous studies to determine enzyme functions predicted to be ancient by multiple methods. We survey modern metabolic pathways to identify those that maintain the highest frequency of metaconsensus enzymes. Using the full set of modern reactions catalyzed by these metaconsensus enzyme functions, we reconstruct a representative metabolic network that may reflect the core metabolism of early life forms. Our results show that ten enzyme functions, four hydrolases, three transferases, one oxidoreductase, one lyase, and one ligase, are determined by metaconsensus to be present at least as late as the last universal common ancestor. Subnetworks within central metabolic processes related to sugar and starch metabolism, amino acid biosynthesis, phospholipid metabolism, and CoA biosynthesis, have high frequencies of these enzyme functions. We demonstrate that a large metabolic network can be generated from this small number of enzyme functions.
Affiliation(s)
- Aaron David Goldman, Department of Ecology and Evolutionary Biology, Princeton, New Jersey, United States of America
|
14
|
Hensen U, Meyer T, Haas J, Rex R, Vriend G, Grubmüller H. Exploring protein dynamics space: the dynasome as the missing link between protein structure and function. PLoS One 2012; 7:e33931. [PMID: 22606222 PMCID: PMC3350514 DOI: 10.1371/journal.pone.0033931] [Citation(s) in RCA: 72] [Impact Index Per Article: 6.0] [Received: 09/23/2011] [Accepted: 02/20/2012] [Indexed: 12/25/2022]
Abstract
Proteins are usually described and classified according to amino acid sequence, structure or function. Here, we develop a minimally biased scheme to compare and classify proteins according to their internal mobility patterns. This approach is based on the notion that proteins not only fold into recurring structural motifs but might also be carrying out only a limited set of recurring mobility motifs. The complete set of these patterns, which we tentatively call the dynasome, spans a multi-dimensional space with axes, the dynasome descriptors, characterizing different aspects of protein dynamics. The unique dynamic fingerprint of each protein is represented as a vector in the dynasome space. The difference between any two vectors, consequently, gives a reliable measure of the difference between the corresponding protein dynamics. We characterize the properties of the dynasome by comparing the dynamics fingerprints obtained from molecular dynamics simulations of 112 proteins but our approach is, in principle, not restricted to any specific source of data of protein dynamics. We conclude that: 1. the dynasome consists of a continuum of proteins, rather than well separated classes. 2. For the majority of proteins we observe strong correlations between structure and dynamics. 3. Proteins with similar function carry out similar dynamics, which suggests a new method to improve protein function annotation based on protein dynamics.
Affiliation(s)
- Ulf Hensen, Theoretische und computergestützte Biophysik, Max-Planck-Institut für biophysikalische Chemie, Göttingen, Germany
- Tim Meyer, Theoretische und computergestützte Biophysik, Max-Planck-Institut für biophysikalische Chemie, Göttingen, Germany
- Jürgen Haas, Theoretische und computergestützte Biophysik, Max-Planck-Institut für biophysikalische Chemie, Göttingen, Germany
- René Rex, Theoretische und computergestützte Biophysik, Max-Planck-Institut für biophysikalische Chemie, Göttingen, Germany
- Gert Vriend, CMBI, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands
- Helmut Grubmüller, Theoretische und computergestützte Biophysik, Max-Planck-Institut für biophysikalische Chemie, Göttingen, Germany
|
15
|
Abstract
Motivation: Structural alignment methods are widely used to generate gold standard alignments for improving multiple sequence alignments and transferring functional annotations, as well as for assigning structural distances between proteins. However, the correctness of the alignments generated by these methods is difficult to assess objectively since little is known about the exact evolutionary history of most proteins. Since homology is an equivalence relation, an upper bound on alignment quality can be found by assessing the consistency of alignments. Measuring the consistency of current methods of structure alignment and determining the causes of inconsistencies can, therefore, provide information on the quality of current methods and suggest possibilities for further improvement. Results: We analyze the self-consistency of seven widely used structural alignment methods (SAP, TM-align, Fr-TM-align, MAMMOTH, DALI, CE and FATCAT) on a diverse, non-redundant set of 1863 domains from the SCOP database and demonstrate that even for relatively similar proteins the degree of inconsistency of the alignments on a residue level is high (30%). We further show that levels of consistency vary substantially between methods, with two methods (SAP and Fr-TM-align) producing more consistent alignments than the rest. Inconsistency is found to be higher near gaps and for proteins of low structural complexity, as well as for helices. The ability of the methods to identify good structural alignments is also assessed using geometric measures, for which FATCAT (flexible mode) is found to be the best performer despite being highly inconsistent. We conclude that there is substantial scope for improving the consistency of structural alignment methods. Contact: msadows@nimr.mrc.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- M I Sadowski, Division of Mathematical Biology, MRC National Institute for Medical Research, The Ridgeway, Mill Hill, London, UK
|
16
|
Hollup SM, Sadowski MI, Jonassen I, Taylor WR. Exploring the limits of fold discrimination by structural alignment: a large scale benchmark using decoys of known fold. Comput Biol Chem 2011; 35:174-88. [PMID: 21704264 PMCID: PMC3145973 DOI: 10.1016/j.compbiolchem.2011.04.008] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 01/21/2011] [Accepted: 04/23/2011] [Indexed: 11/10/2022]
Abstract
Protein structure comparison by pairwise alignment is commonly used to identify highly similar substructures in pairs of proteins and provide a measure of structural similarity based on the size and geometric similarity of the match. These scores are routinely applied in analyses of protein fold space under the assumption that high statistical significance is equivalent to a meaningful relationship; however, the truth of this assumption has previously been difficult to test since there is a lack of automated methods which do not rely on the same underlying principles. As a resolution to this we present a method based on the use of topological descriptions of global protein structure, providing an independent means to assess the ability of structural alignment to maintain meaningful structural correspondences on a large scale. Using a large set of decoys of specified global fold we benchmark three widely used methods for structure comparison, SAP, TM-align and DALI, and test the degree to which this assumption is justified for these methods. Application of a topological edit distance measure to provide a scale of the degree of fold change shows that while there is a broad correlation between high structural alignment scores and low edit distances, there remain many pairs with highly significant scores which differ by core strand swaps and therefore are structurally different on a global level. Possible causes of this problem and its meaning for present assessments of protein fold space are discussed.
|
17
|
Park YR, Kim J, Lee HW, Yoon YJ, Kim JH. GOChase-II: correcting semantic inconsistencies from Gene Ontology-based annotations for gene products. BMC Bioinformatics 2011; 12 Suppl 1:S40. [PMID: 21342572 PMCID: PMC3044297 DOI: 10.1186/1471-2105-12-s1-s40] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 11/26/2022]
Abstract
Background: The Gene Ontology (GO) provides a controlled vocabulary for describing genes and gene products. In spite of the undoubted importance of GO, several drawbacks associated with GO and GO-based annotations have been reported. We identified three types of semantic inconsistencies in GO-based annotations: semantically redundant, biological-domain inconsistent and taxonomy inconsistent annotations. Methods: To determine the semantic inconsistencies in GO annotation, we used the hierarchical structure of the GO graph and the tree structure of the NCBI taxonomy. Twenty-seven biological databases were collected for finding semantically inconsistent annotations. Results: The distributions and possible causes of the semantic inconsistencies were investigated using twenty-seven biological databases with GO-based annotations. We found that some evidence codes of annotation were associated with the inconsistencies. The numbers of gene products and species in a database, which are related to the complexity of database management, also correlate with the inconsistencies. Consequently, numerous annotation errors arise and are propagated throughout biological databases and GO-based high-level analyses. GOChase-II was developed to detect and correct both syntactic and semantic errors in GO-based annotations. Conclusions: We identified some inconsistencies in GO-based annotation and provided software, GOChase-II, for correcting these semantic inconsistencies in addition to the previous corrections for the syntactic errors by GOChase-I.
Affiliation(s)
- Yu Rang Park, Seoul National University Biomedical Informatics, Division of Biomedical Informatics, Seoul National University College of Medicine, Seoul 110799, Korea
|