1
|
Cohan MC, Ruff KM, Pappu RV. Information theoretic measures for quantifying sequence-ensemble relationships of intrinsically disordered proteins. Protein Eng Des Sel 2020; 32:191-202. [PMID: 31375817 PMCID: PMC7462041 DOI: 10.1093/protein/gzz014] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2019] [Accepted: 06/19/2019] [Indexed: 01/26/2023] Open
Abstract
Intrinsically disordered proteins (IDPs) contribute to a multitude of functions. De novo design of IDPs should open the door to modulating functions and phenotypes controlled by these systems. Recent design efforts have focused on compositional biases and specific sequence patterns as the design features. Analysis of the impact of these designs on sequence-function relationships indicates that individual sequence/compositional parameters are insufficient for describing sequence-function relationships in IDPs. To remedy this problem, we have developed information theoretic measures for sequence–ensemble relationships (SERs) of IDPs. These measures rely on prior availability of statistically robust conformational ensembles derived from all atom simulations. We show that the measures we have developed are useful for comparing sequence-ensemble relationships even when sequence is poorly conserved. Based on our results, we propose that de novo designs of IDPs, guided by knowledge of their SERs, should provide improved insights into their sequence–ensemble–function relationships.
Collapse
Affiliation(s)
- Megan C Cohan
- Department of Biomedical Engineering and Center for Science & Engineering of Living Systems (CSELS) Washington University in St. Louis, One Brookings Drive, Campus Box 1097, St. Louis MO, USA
| | - Kiersten M Ruff
- Department of Biomedical Engineering and Center for Science & Engineering of Living Systems (CSELS) Washington University in St. Louis, One Brookings Drive, Campus Box 1097, St. Louis MO, USA
| | - Rohit V Pappu
- Department of Biomedical Engineering and Center for Science & Engineering of Living Systems (CSELS) Washington University in St. Louis, One Brookings Drive, Campus Box 1097, St. Louis MO, USA
| |
Collapse
|
2
|
Konopka BM, Nebel JC, Kotulska M. Quality assessment of protein model-structures based on structural and functional similarities. BMC Bioinformatics 2012; 13:242. [PMID: 22998498 PMCID: PMC3526563 DOI: 10.1186/1471-2105-13-242] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2012] [Accepted: 09/14/2012] [Indexed: 11/10/2022] Open
Abstract
Background Experimental determination of protein 3D structures is expensive, time consuming and sometimes impossible. A gap between number of protein structures deposited in the World Wide Protein Data Bank and the number of sequenced proteins constantly broadens. Computational modeling is deemed to be one of the ways to deal with the problem. Although protein 3D structure prediction is a difficult task, many tools are available. These tools can model it from a sequence or partial structural information, e.g. contact maps. Consequently, biologists have the ability to generate automatically a putative 3D structure model of any protein. However, the main issue becomes evaluation of the model quality, which is one of the most important challenges of structural biology. Results GOBA - Gene Ontology-Based Assessment is a novel Protein Model Quality Assessment Program. It estimates the compatibility between a model-structure and its expected function. GOBA is based on the assumption that a high quality model is expected to be structurally similar to proteins functionally similar to the prediction target. Whereas DALI is used to measure structure similarity, protein functional similarity is quantified using standardized and hierarchical description of proteins provided by Gene Ontology combined with Wang's algorithm for calculating semantic similarity. Two approaches are proposed to express the quality of protein model-structures. One is a single model quality assessment method, the other is its modification, which provides a relative measure of model quality. Exhaustive evaluation is performed on data sets of model-structures submitted to the CASP8 and CASP9 contests. Conclusions The validation shows that the method is able to discriminate between good and bad model-structures. The best of tested GOBA scores achieved 0.74 and 0.8 as a mean Pearson correlation to the observed quality of models in our CASP8 and CASP9-based validation sets. GOBA also obtained the best result for two targets of CASP8, and one of CASP9, compared to the contest participants. Consequently, GOBA offers a novel single model quality assessment program that addresses the practical needs of biologists. In conjunction with other Model Quality Assessment Programs (MQAPs), it would prove useful for the evaluation of single protein models.
Collapse
Affiliation(s)
- Bogumil M Konopka
- Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, 50-370, Wroclaw, Poland
| | | | | |
Collapse
|
3
|
Deeds EJ, Shakhnovich EI. A structure-centric view of protein evolution, design, and adaptation. ADVANCES IN ENZYMOLOGY AND RELATED AREAS OF MOLECULAR BIOLOGY 2010; 75:133-91, xi-xii. [PMID: 17124867 DOI: 10.1002/9780471224464.ch2] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
Abstract
Proteins, by virtue of their central role in most biological processes, represent one of the key subjects of the study of molecular evolution. Inherent in the indispensability of proteins for living cells is the fact that a given protein can adopt a specific three-dimensional shape that is specified solely by the protein's sequence of amino acids. Over the past several decades, structural biologists have demonstrated that the array of structures that proteins may adopt is quite astounding, and this has lead to a strong interest in understanding how protein structures change and evolve over time. In this review we consider a large body of recent work that attempts to illuminate this structure-centric picture of protein evolution. Much of this work has focused on the question of how completely new protein structures (i.e., new folds or topologies) are discovered by protein sequences as they evolve. Pursuant to this question of structural innovation has been a desire to describe and understand the observation that certain types of protein structures are far more abundant than others and how this uneven distribution of proteins implicates on the process through which new shapes are discovered. We consider a number of theoretical models that have been successful at explaining this heterogeneity in protein populations and discuss the increasing amount of evidence that indicates that the process of structural evolution involves the divergence of protein sequences and structures from one another. We also consider the topic of protein designability, which concerns itself with understanding how a protein's structure influences the number of sequences that can fold successfully into that structure. Understanding and quantifying the relationship between the physical feature of a structure and its designability has been a long-standing goal of the study of protein structure and evolution, and we discuss a number of recent advances that have yielded a promising answer to this question. Finally, we review the relatively new field of protein structural phylogeny, an area of study in which information about the distribution of protein structures among different organisms is used to reconstruct the evolutionary relationships between them. Taken together, the work that we review presents an increasingly coherent picture of how these unique polymers have evolved over the course of life on Earth.
Collapse
Affiliation(s)
- Eric J Deeds
- Department of Molecular and Cellular Biology, Harvard University, 7 Divinity Avenue, Cambridge, MA 02138, USA
| | | |
Collapse
|
4
|
Quantitative Proteome–Property Relationships (QPPRs). Part 1: Finding biomarkers of organic drugs with mean Markov connectivity indices of spiral networks of blood mass spectra. Bioorg Med Chem 2008; 16:9684-93. [DOI: 10.1016/j.bmc.2008.10.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2008] [Revised: 09/29/2008] [Accepted: 10/02/2008] [Indexed: 11/22/2022]
|
5
|
Friedberg I, Godzik A. Connecting the protein structure universe by using sparse recurring fragments. Structure 2007; 13:1213-24. [PMID: 16084393 DOI: 10.1016/j.str.2005.05.009] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2005] [Revised: 04/22/2005] [Accepted: 05/11/2005] [Indexed: 10/25/2022]
Abstract
The quest to order and classify protein structures has lead to various classification schemes, focusing mostly on hierarchical relationships between structural domains. At the coarsest classification level, such schemes typically identify hundreds of types of fundamental units called folds. As a result, we picture protein structure space as a collection of isolated fold islands. It is obvious, however, that many protein folds share structural and functional commonalities. Locating those commonalities is important for our understanding of protein structure, function, and evolution. Here, we present an alternative view of the protein fold space, based on an interfold similarity measure that is related to the frequency of fragments shared between folds. In this view, protein structures form a complicated, crossconnected network with very interesting topology. We show that interfold similarity based on sequence/structure fragments correlates well with similarities of functions between protein populations in different folds.
Collapse
Affiliation(s)
- Iddo Friedberg
- Program in Bioinformatics and Systems Biology, The Burnham Institute, La Jolla, California 92037, USA.
| | | |
Collapse
|
6
|
Krishnamurthy N, Brown DP, Kirshner D, Sjölander K. PhyloFacts: an online structural phylogenomic encyclopedia for protein functional and structural classification. Genome Biol 2007; 7:R83. [PMID: 16973001 PMCID: PMC1794543 DOI: 10.1186/gb-2006-7-9-r83] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2006] [Revised: 07/12/2006] [Accepted: 09/14/2006] [Indexed: 11/16/2022] Open
Abstract
PhyloFacts, a structural phylogenomic database for protein functional and structural classification, is described. The Berkeley Phylogenomics Group presents PhyloFacts, a structural phylogenomic encyclopedia containing almost 10,000 'books' for protein families and domains, with pre-calculated structural, functional and evolutionary analyses. PhyloFacts enables biologists to avoid the systematic errors associated with function prediction by homology through the integration of a variety of experimental data and bioinformatics methods in an evolutionary framework. Users can submit sequences for classification to families and functional subfamilies. PhyloFacts is available as a worldwide web resource from .
Collapse
Affiliation(s)
- Nandini Krishnamurthy
- Department of Bioengineering, 473 Evans Hall #1762, University of California, Berkeley, CA 94720, USA
| | - Duncan P Brown
- Department of Bioengineering, 473 Evans Hall #1762, University of California, Berkeley, CA 94720, USA
| | - Dan Kirshner
- Department of Bioengineering, 473 Evans Hall #1762, University of California, Berkeley, CA 94720, USA
| | - Kimmen Sjölander
- Department of Bioengineering, 473 Evans Hall #1762, University of California, Berkeley, CA 94720, USA
| |
Collapse
|
7
|
Abstract
We develop models of the divergent evolution of genomes; the elementary object of sequence dynamics is the protein structural domain. To identify patterns of organization that reflect mechanisms of evolution, we consider the individual genomes of many procaryote species, studying the arrangement of protein structural domains in the space of all polypeptide structures. We view the network of structural similarities as a graph, called the organismal Protein Domain Universe Graph (oPDUG); vertices represent types of structural domains and edges represent strong structural similarity. As observed before, each oPDUG is a highly nonrandom graph, as evidenced in the vertex degree distribution, which resembles a Pareto law (which has a power-law asymptotic). To explain this and other peculiar properties of the oPDUGs, we construct an evolving-graph model for the long-timescale evolutionary dynamics of oPDUGs, containing only divergent mechanisms of domain discovery. The model generates degree distributions (resembling Pareto laws) and clustering-coefficient distributions that are characteristic of the oPDUGs. In the infinite-graph limit, we analytically compute the exponent for specific biological parameters, as well as the complete phase diagram of the model, finding two distinct regimes of domain innovation dynamics. Thus, divergent evolutionary dynamics quantitatively explains the nonrandom organization of oPDUGs.
Collapse
Affiliation(s)
- C Brian Roland
- Chemical Physics Program, Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, USA
| | | |
Collapse
|
8
|
Bhalla J, Storchan GB, MacCarthy CM, Uversky VN, Tcherkasskaya O. Local flexibility in molecular function paradigm. Mol Cell Proteomics 2006; 5:1212-23. [PMID: 16571897 DOI: 10.1074/mcp.m500315-mcp200] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022] Open
Abstract
It is generally accepted that the functional activity of biological macromolecules requires tightly packed three-dimensional structures. Recent theoretical and experimental evidence indicates, however, the importance of molecular flexibility for the proper functioning of some proteins. We examined high resolution structures of proteins in various functional categories with respect to the secondary structure assessment. The latter was considered as a characteristic of the inherent flexibility of a polypeptide chain. We found that the proteins in functionally competent conformational states might be comprised of 20-70% flexible residues. For instance, proteins involved in gene regulation, e.g. transcription factors, are on average largely disordered molecules with over 60% of amino acids residing in "coiled" configurations. In contrast, oxygen transporters constitute a class of relatively rigid molecules with only 30% of residues being locally flexible. Phylogenic comparison of a large number of protein families with respect to the propagation of secondary structure illuminates the growing role of the local flexibility in organisms of greater complexity. Furthermore the local flexibility in protein molecules appears to be dependent on the molecular confinement and is essentially larger in extracellular proteins.
Collapse
Affiliation(s)
- Jag Bhalla
- Biochemistry and Molecular & Cellular Biology, Georgetown University School of Medicine, Washington, DC 20007, USA
| | | | | | | | | |
Collapse
|
9
|
Deeds EJ, Hennessey H, Shakhnovich EI. Prokaryotic phylogenies inferred from protein structural domains. Genome Res 2005; 15:393-402. [PMID: 15741510 PMCID: PMC551566 DOI: 10.1101/gr.3033805] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
The determination of the phylogenetic relationships among microorganisms has long relied primarily on gene sequence information. Given that prokaryotic organisms often lack morphological characteristics amenable to phylogenetic analysis, prokaryotic phylogenies, in particular, are often based on sequence data. In this work, we explore a new source of phylogenetic information, the distribution of protein structural domains within fully sequenced prokaryotic genomes. The evolution of the structural domains we use has been studied extensively, allowing us to base our phylogenetic methods on testable theoretical models of structural evolution. We find that the methods that produce reasonable phylogenetic relationships are indeed the methods that are most consistent with theoretical evolutionary models. This work represents, to our knowledge, the first such theoretically motivated phylogeny, as well as the first application of structural information to phylogeny on this scale. Our results have strong implications for the phylogenetic relationships among prokaryotic organisms and for the understanding of protein evolution as a whole.
Collapse
Affiliation(s)
- Eric J Deeds
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts 02138, USA
| | | | | |
Collapse
|
10
|
Dokholyan NV. The architecture of the protein domain universe. Gene 2005; 347:199-206. [PMID: 15777630 DOI: 10.1016/j.gene.2004.12.020] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2004] [Revised: 11/08/2004] [Accepted: 12/16/2004] [Indexed: 11/23/2022]
Abstract
Understanding the design of the universe of protein structures may provide insights into protein evolution. We study the architecture of the protein domain universe, which has been found to poses peculiar scale-free properties. We examine the origin of these scale-free properties of the graph of protein domain structures (PDUG) and determine that that the PDUG is not modular, i.e. it does not consist of modules with uniform properties. Instead, we find the PDUG to be self-similar at all scales. We further characterize the PDUG architecture by studying the properties of the hub nodes that are responsible for the scale-free connectivity of the PDUG. We introduce a measure of the betweenness centrality of protein domains in the PDUG and find a power-law distribution of the betweenness centrality values. The scale-free distribution of hubs in the protein universe suggests that a set of specific statistical mechanics models, such as the self-organized criticality model, can potentially identify the principal driving forces of protein evolution. We also find a gatekeeper protein domain, removal of which partitions the largest cluster into two large sub-clusters. We suggest that the loss of such gatekeeper protein domains in the course of evolution is responsible for the creation of new fold families.
Collapse
Affiliation(s)
- Nikolay V Dokholyan
- Department of Biochemistry and Biophysics, The University of North Carolina at Chapel Hill, School of Medicine, Chapel Hill, NC 27599, USA.
| |
Collapse
|
11
|
Meleth S, Deshane J, Kim H. The case for well-conducted experiments to validate statistical protocols for 2D gels: different pre-processing = different lists of significant proteins. BMC Biotechnol 2005; 5:7. [PMID: 15707480 PMCID: PMC553976 DOI: 10.1186/1472-6750-5-7] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2004] [Accepted: 02/11/2005] [Indexed: 11/15/2022] Open
Abstract
Background The proteomics literature has seen a proliferation of publications that seek to apply the rapidly improving technology of 2D gels to study various biological systems. However, there is a dearth of systematic studies that have investigated appropriate statistical approaches to analyse the data from these experiments. Results Comparison of the effects of statistical pre-processing on the results of two sample t-tests suggests that the results of 2D gel experiments and by extension the conclusions derived from these experiments are not independent of the statistical protocol used. Conclusions This study suggests that there is a need for well-conducted validation studies to establish optimal statistical techniques to be used on such data sets.
Collapse
Affiliation(s)
- Sreelatha Meleth
- Biostatistics and Bioinformatics Unit, Comprehensive Cancer Center, University of Alabama at Birmingham, Birmingham, AL 35294, USA
| | - Jessy Deshane
- Department of Pharmacology, University of Alabama at Birmingham, Birmingham, AL 35924 USA
| | - Helen Kim
- Department of Pharmacology, University of Alabama at Birmingham, Birmingham, AL 35294, USA
| |
Collapse
|
12
|
Shakhnovich BE, Max Harvey J. Quantifying structure-function uncertainty: a graph theoretical exploration into the origins and limitations of protein annotation. J Mol Biol 2004; 337:933-49. [PMID: 15033362 DOI: 10.1016/j.jmb.2004.02.009] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2003] [Revised: 01/13/2004] [Accepted: 02/03/2004] [Indexed: 11/26/2022]
Abstract
Since the advent of investigations into structural genomics, research has focused on correctly identifying domain boundaries, as well as domain similarities and differences in the context of their evolutionary relationships. As the science of structural genomics ramps up adding more and more information into the databanks, questions about the accuracy and completeness of our classification and annotation systems appear on the forefront of this research. A central question of paramount importance is how structural similarity relates to functional similarity. Here, we begin to rigorously and quantitatively answer these questions by first exploring the consensus between the most common protein domain structure annotation databases CATH, SCOP and FSSP. Each of these databases explores the evolutionary relationships between protein domains using a combination of automatic and manual, structural and functional, continuous and discrete similarity measures. In order to examine the issue of consensus thoroughly, we build a generalized graph out of each of these databases and hierarchically cluster these graphs at interval thresholds. We then employ a distance measure to find regions of greatest overlap. Using this procedure we were able not only to enumerate the level of consensus between the different annotation systems, but also to define the graph-theoretical origins behind the annotation schema of class, family and superfamily by observing that the same thresholds that define the best consensus regions between FSSP, SCOP and CATH correspond to distinct, non-random phase-transitions in the structure comparison graph itself. To investigate the correspondence in divergence between structure and function further, we introduce a measure of functional entropy that calculates divergence in function space. First, we use this measure to calculate the general correlation between structural homology and functional proximity. We extend this analysis further by quantitatively calculating the average amount of functional information gained from our understanding of structural distance and the corollary inherent uncertainty that represents the theoretical limit of our ability to infer function from structural similarity. Finally we show how our measure of functional "entropy" translates into a more intuitive concept of functional annotation into similarity EC classes.
Collapse
|
13
|
Tiana G, Shakhnovich BE, Dokholyan NV, Shakhnovich EI. Imprint of evolution on protein structures. Proc Natl Acad Sci U S A 2004; 101:2846-51. [PMID: 14970345 PMCID: PMC365708 DOI: 10.1073/pnas.0306638101] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2003] [Accepted: 12/22/2003] [Indexed: 11/18/2022] Open
Abstract
We attempt to understand the evolutionary origin of protein folds by simulating their divergent evolution with a three-dimensional lattice model. Starting from an initial seed lattice structure, evolution of model proteins progresses by sequence duplication and subsequent point mutations. A new gene's ability to fold into a stable and unique structure is tested each time through direct kinetic folding simulations. Where possible, the algorithm accepts the new sequence and structure and thus a "new protein structure" is born. During the course of each run, this model evolutionary algorithm provides several thousand new proteins with diverse structures. Analysis of evolved structures shows that later evolved structures are more designable than seed structures as judged by recently developed structural determinant of protein designability, as well as direct estimate of designability for selected structures by thermodynamic sampling of their sequence space. We test the significance of this trend predicted on lattice models on real proteins and show that protein domains that are found in eukaryotic organisms only feature statistically significant higher designability than their prokaryotic counterparts. These results present a fundamental view on protein evolution highlighting the relative roles of structural selection and evolutionary dynamics on genesis of modern proteins.
Collapse
Affiliation(s)
- Guido Tiana
- Department of Physics and Istituto Nazionale di Fisica Nucleare, University of Milano, Via Celoria 16, 20133 Milan, Italy
| | | | | | | |
Collapse
|