1
Faure AJ, Martí-Aranda A, Hidalgo-Carcedo C, Beltran A, Schmiedel JM, Lehner B. The genetic architecture of protein stability. Nature 2024:10.1038/s41586-024-07966-0. [PMID: 39322666 DOI: 10.1038/s41586-024-07966-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Accepted: 08/20/2024] [Indexed: 09/27/2024]
Abstract
There are more ways to synthesize a 100-amino acid (aa) protein (20^100) than there are atoms in the universe. Only a very small fraction of such a vast sequence space can ever be experimentally or computationally surveyed. Deep neural networks are increasingly being used to navigate high-dimensional sequence spaces [1]. However, these models are extremely complicated. Here, by experimentally sampling from sequence spaces larger than 10^10, we show that the genetic architecture of at least some proteins is remarkably simple, allowing accurate genetic prediction in high-dimensional sequence spaces with fully interpretable energy models. These models capture the nonlinear relationships between free energies and phenotypes but otherwise consist of additive free energy changes with a small contribution from pairwise energetic couplings. These energetic couplings are sparse and associated with structural contacts and backbone proximity. Our results indicate that protein genetics is actually both rather simple and intelligible.
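
The kind of energy model this abstract describes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two-state Boltzmann form for the nonlinearity, the function names, the temperature, and all numeric values are assumptions chosen for illustration only.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol*K)

def total_folding_energy(variant, dg_wt, additive, couplings):
    """Wild-type folding energy plus additive changes plus sparse pairwise couplings.

    variant   -- set of mutation names present in the sequence
    additive  -- dict: mutation name -> free energy change (kcal/mol), hypothetical
    couplings -- dict: (mutation, mutation) -> coupling energy (kcal/mol), hypothetical
    """
    ddg = sum(additive[m] for m in variant)
    for (a, b), energy in couplings.items():
        if a in variant and b in variant:
            ddg += energy
    return dg_wt + ddg

def fraction_folded(dg, temperature=303.0):
    """Assumed two-state nonlinearity: fraction folded = 1 / (1 + exp(dG / RT))."""
    return 1.0 / (1.0 + math.exp(dg / (R * temperature)))

# Hypothetical parameters, for illustration only.
additive = {"A10G": 1.0, "L25V": 0.5}
couplings = {("A10G", "L25V"): 0.3}
dg_double = total_folding_energy({"A10G", "L25V"}, -2.0, additive, couplings)
print(round(dg_double, 3), round(fraction_folded(dg_double), 3))
```

The energies add linearly, and only the final mapping from free energy to the measured phenotype is nonlinear, which is what keeps such a model interpretable.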
Affiliation(s)
- Andre J Faure
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - ALLOX, Barcelona, Spain
- Aina Martí-Aranda
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Cristina Hidalgo-Carcedo
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Antoni Beltran
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Jörn M Schmiedel
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - factorize.bio, Berlin, Germany
- Ben Lehner
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain

2
Vila JA. Analysis of proteins in the light of mutations. EUROPEAN BIOPHYSICS JOURNAL : EBJ 2024; 53:255-265. [PMID: 38955858 DOI: 10.1007/s00249-024-01714-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2023] [Revised: 05/23/2024] [Accepted: 06/18/2024] [Indexed: 07/04/2024]
Abstract
Proteins have evolved through mutations (amino acid substitutions) since life appeared on Earth, some 10^9 years ago. The study of these phenomena has been of particular significance because of their impact on protein stability, function, and structure. This study offers a new viewpoint on how the most recent findings in these areas can be used to explore the impact of mutations on protein sequence, stability, and evolvability. Preliminary results indicate that: (1) mutations can be viewed as sensitive probes to identify 'typos' in the amino acid sequence, and also to assess the resistance of naturally occurring proteins to unwanted sequence alterations; (2) the presence of 'typos' in the amino acid sequence, rather than being an evolutionary obstacle, could promote faster evolvability and, in turn, increase the likelihood of higher protein stability; (3) the mutation site is far more important than the substituted amino acid in terms of the marginal stability changes of the protein; and (4) the unpredictability of protein evolution by mutations at the molecular level exists even in the absence of epistasis effects. Finally, the Darwinian concept of evolution, "descent with modification", and experimental evidence endorse one of the results of this study, which suggests that some regions of any protein sequence are susceptible to mutations while others are not. This work contributes to our general understanding of protein responses to mutations and may spur significant progress in our efforts to develop methods to accurately forecast changes in protein stability, their propensity for metamorphism, and their ability to evolve.
Affiliation(s)
- Jorge A Vila
  - IMASL-CONICET, Universidad Nacional de San Luis, Ejército de los Andes 950, 5700, San Luis, Argentina

3
Joshi SHN, Jenkins C, Ulaeto D, Gorochowski TE. Accelerating Genetic Sensor Development, Scale-up, and Deployment Using Synthetic Biology. BIODESIGN RESEARCH 2024; 6:0037. [PMID: 38919711 PMCID: PMC11197468 DOI: 10.34133/bdr.0037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Accepted: 04/23/2024] [Indexed: 06/27/2024] Open
Abstract
Living cells are exquisitely tuned to sense and respond to changes in their environment. Repurposing these systems to create engineered biosensors has seen growing interest in the field of synthetic biology and provides a foundation for many innovative applications spanning environmental monitoring to improved biobased production. In this review, we present a detailed overview of currently available biosensors and the methods that have supported their development, scale-up, and deployment. We focus on genetic sensors in living cells whose outputs affect gene expression. We find that emerging high-throughput experimental assays and evolutionary approaches combined with advanced bioinformatics and machine learning are establishing pipelines to produce genetic sensors for virtually any small molecule, protein, or nucleic acid. However, more complex sensing tasks based on classifying compositions of many stimuli and the reliable deployment of these systems into real-world settings remain challenges. We suggest that recent advances in our ability to precisely modify nonmodel organisms and the integration of proven control engineering principles (e.g., feedback) into the broader design of genetic sensing systems will be necessary to overcome these hurdles and realize the immense potential of the field.
Affiliation(s)
- Christopher Jenkins
  - CBR Division, Defence Science and Technology Laboratory, Porton Down, Wiltshire SP4 0JQ, UK
- David Ulaeto
  - CBR Division, Defence Science and Technology Laboratory, Porton Down, Wiltshire SP4 0JQ, UK
- Thomas E. Gorochowski
  - School of Biological Sciences, University of Bristol, Bristol BS8 1TQ, UK
  - BrisEngBio, School of Chemistry, University of Bristol, Bristol BS8 1TS, UK

4
Ndochinwa GO, Wang QY, Okoro NO, Amadi OC, Nwagu TN, Nnamchi CI, Moneke AN, Odiba AS. New advances in protein engineering for industrial applications: Key takeaways. Open Life Sci 2024; 19:20220856. [PMID: 38911927 PMCID: PMC11193397 DOI: 10.1515/biol-2022-0856] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2023] [Revised: 03/01/2024] [Accepted: 03/13/2024] [Indexed: 06/25/2024] Open
Abstract
Recent advancements in protein/enzyme engineering have enabled the production of a diverse array of high-value compounds in microbial systems with the potential for industrial applications. The goal of this review is to articulate some of the most recent protein engineering advances in bacteria, yeast, and other microbial systems to produce valuable substances. These high-value substances include α-farnesene, vitamin B12, fumaric acid, linalool, glucaric acid, carminic acid, mycosporine-like amino acids, patchoulol, orcinol glucoside, d-lactic acid, keratinase, α-glucanotransferases, β-glucosidase, seleno-methylselenocysteine, fatty acids, high-efficiency β-glucosidase enzymes, cellulase, β-carotene, physcion, and glucoamylase. Additionally, recent advances in enzyme engineering for enhancing thermostability will be discussed. These findings have the potential to revolutionize various industries, including biotechnology, food, pharmaceuticals, and biofuels.
Affiliation(s)
- Giles Obinna Ndochinwa
  - Department of Microbiology, Faculty of Biological Science, University of Nigeria, Nsukka, 410001, Nigeria
  - State Key Laboratory of Biomass Enzyme Technology, Guangxi Academy of Sciences, Nanning, Nanning, 530007, China
- Qing-Yan Wang
  - State Key Laboratory of Biomass Enzyme Technology, Guangxi Academy of Sciences, Nanning, Nanning, 530007, China
  - National Engineering Research Center for Non-Food Biorefinery, Guangxi Academy of Sciences, Nanning, Nanning, 530007, China
- Nkwachukwu Oziamara Okoro
  - Department of Pharmaceutical and Medicinal Chemistry, Faculty of Pharmaceutical Sciences, University of Nigeria, Nsukka, 410001, Nigeria
- Oyetugo Chioma Amadi
  - Department of Microbiology, Faculty of Biological Science, University of Nigeria, Nsukka, 410001, Nigeria
- Tochukwu Nwamaka Nwagu
  - Department of Microbiology, Faculty of Biological Science, University of Nigeria, Nsukka, 410001, Nigeria
- Chukwudi Innocent Nnamchi
  - Department of Microbiology, Faculty of Biological Science, University of Nigeria, Nsukka, 410001, Nigeria
- Anene Nwabu Moneke
  - Department of Microbiology, Faculty of Biological Science, University of Nigeria, Nsukka, 410001, Nigeria
- Arome Solomon Odiba
  - Department of Genetics and Biotechnology, Faculty of Biological Sciences, University of Nigeria, Nsukka, 410001, Nigeria

5
Tang X, Dai H, Knight E, Wu F, Li Y, Li T, Gerstein M. A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation. Brief Bioinform 2024; 25:bbae338. [PMID: 39007594 PMCID: PMC11247410 DOI: 10.1093/bib/bbae338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2024] [Revised: 05/21/2024] [Accepted: 06/27/2024] [Indexed: 07/16/2024] Open
Abstract
Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo drug design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. Rapid development in the field, combined with the inherent complexity of the drug design process, creates a difficult landscape for new researchers to enter. In this survey, we organize de novo drug design into two overarching themes: small molecule and protein generation. Within each theme, we identify a variety of subtasks and applications, highlighting important datasets, benchmarks, and model architectures and comparing the performance of top models. We take a broad approach to AI-driven drug design, allowing for both micro-level comparisons of various methods within each subtask and macro-level observations across different fields. We discuss parallel challenges and approaches between the two applications and highlight future directions for AI-driven de novo drug design as a whole. An organized repository of all covered sources is available at https://github.com/gersteinlab/GenAI4Drug.
Affiliation(s)
- Xiangru Tang
  - Department of Computer Science, Yale University, New Haven, CT 06520, United States
- Howard Dai
  - Department of Computer Science, Yale University, New Haven, CT 06520, United States
- Elizabeth Knight
  - School of Medicine, Yale University, New Haven, CT 06520, United States
- Fang Wu
  - Computer Science Department, Stanford University, CA 94305, United States
- Yunyang Li
  - Department of Computer Science, Yale University, New Haven, CT 06520, United States
- Tianxiao Li
  - Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States
- Mark Gerstein
  - Department of Computer Science, Yale University, New Haven, CT 06520, United States
  - Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States
  - Department of Statistics & Data Science, Yale University, New Haven, CT 06520, United States
  - Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, United States
  - Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, United States

6
Caparotta M, Perez A. Advancing Molecular Dynamics: Toward Standardization, Integration, and Data Accessibility in Structural Biology. J Phys Chem B 2024; 128:2219-2227. [PMID: 38418288 DOI: 10.1021/acs.jpcb.3c04823] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/01/2024]
Abstract
Molecular dynamics (MD) simulations have become a valuable tool in structural biology, offering insights into complex biological systems that are difficult to obtain through experimental techniques alone. The lack of available data sets and structures in most published computational work has limited other researchers' use of these models. In recent years, the emergence of online sharing platforms and MD database initiatives has favored the deposition of ensembles and structures to accompany publications, encouraging reuse of the data sets. However, the lack of uniformity in metadata collection, formats, and what data are deposited limits the impact of these data and their use by different communities that are not necessarily experts in MD. This Perspective highlights the need for standardization and better resource sharing for processing and interpreting MD simulation results, akin to efforts in other areas of structural biology. As the field moves forward, we will see an increase in popularity and benefits of MD-based integrative approaches combining experimental data and simulations through probabilistic reasoning, but these too are limited by uniformity in experimental data availability and choices on how the data are modeled that are not trivial to decipher from papers. Other fields have addressed similar challenges comprehensively by establishing task forces with different degrees of success. The large scope and number of communities needed to represent the breadth of types of MD simulations complicate a parallel approach that would fit all. Thus, each group typically decides what data and which format to upload on servers like Zenodo. Uploading data with FAIR (findable, accessible, interoperable, reusable) principles in mind, including optimal metadata collection, will make the data more accessible and actionable by the community. Such a wealth of simulation data will foster method development and infrastructure advancements, thus propelling the field forward.
Affiliation(s)
- Marcelo Caparotta
  - Department of Chemistry and Quantum Theory Project, University of Florida, Gainesville, Florida 32611, United States
- Alberto Perez
  - Department of Chemistry and Quantum Theory Project, University of Florida, Gainesville, Florida 32611, United States

7
Vila JA. Protein folding rate evolution upon mutations. Biophys Rev 2023; 15:661-669. [PMID: 37681091 PMCID: PMC10480377 DOI: 10.1007/s12551-023-01088-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Accepted: 06/24/2023] [Indexed: 09/09/2023] Open
Abstract
Despite the spectacular success of cutting-edge protein fold prediction methods, many critical questions remain unanswered, including why proteins can reach their native state in a biologically reasonable time. A satisfactory answer to this simple question could shed light on the slowest folding rate of proteins as well as how mutations (amino acid substitutions and/or post-translational modifications) might affect it. Preliminary results indicate that (i) the validity of Anfinsen's dogma ensures that proteins reach their native state on a reasonable timescale regardless of their sequence or length, and (ii) it is feasible to determine the evolution of protein folding rates without accounting for epistasis effects or the mutational trajectories between the starting and target sequences. These results have direct implications for evolutionary biology because they lay the groundwork for a better understanding of why, and to what extent, mutations (a crucial element of evolution and a factor influencing it) affect protein evolvability. Furthermore, they may spur significant progress in our efforts to solve crucial structural biology problems, such as how a sequence encodes its folding.
Affiliation(s)
- Jorge A. Vila
  - IMASL-CONICET, Universidad Nacional de San Luis, Ejército de Los Andes 950, 5700 San Luis, Argentina

8
Sánchez IE, Galpern EA, Garibaldi MM, Ferreiro DU. Molecular Information Theory Meets Protein Folding. J Phys Chem B 2022; 126:8655-8668. [PMID: 36282961 DOI: 10.1021/acs.jpcb.2c04532] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 01/11/2023]
Abstract
We propose an application of molecular information theory to analyze the folding of single domain proteins. We analyze results from various areas of protein science, such as sequence-based potentials, reduced amino acid alphabets, backbone configurational entropy, secondary structure content, residue burial layers, and mutational studies of protein stability changes. We found that the average information contained in the sequences of evolved proteins is very close to the average information needed to specify a fold ∼2.2 ± 0.3 bits/(site·operation). The effective alphabet size in evolved proteins equals the effective number of conformations of a residue in the compact unfolded state at around 5. We calculated an energy-to-information conversion efficiency upon folding of around 50%, lower than the theoretical limit of 70%, but much higher than human-built macroscopic machines. We propose a simple mapping between molecular information theory and energy landscape theory and explore the connections between sequence evolution, configurational entropy, and the energetics of protein folding.
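
The link between the reported information content (about 2.2 ± 0.3 bits per site) and an effective alphabet size of around 5 can be checked with a short calculation. The sketch below assumes the standard relation between Shannon information in bits and the equivalent number of equiprobable symbols, N = 2^H; whether the authors derive their figure in exactly this way is an assumption, and the function name is illustrative.

```python
def effective_alphabet_size(bits_per_site):
    """Number of equiprobable symbols that carry the given information per site."""
    return 2.0 ** bits_per_site

# Central value and the bounds of the reported +/- 0.3 range.
for bits in (1.9, 2.2, 2.5):
    print(f"{bits:.1f} bits/site -> {effective_alphabet_size(bits):.2f} symbols")
```

Under this assumption the range works out to roughly 3.7 to 5.7 symbols, which is consistent with the stated value of around 5.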
Affiliation(s)
- Ignacio E Sánchez
  - Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina
- Ezequiel A Galpern
  - Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina
- Martín M Garibaldi
  - Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina
- Diego U Ferreiro
  - Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina

9
Gaia as Solaris: An Alternative Default Evolutionary Trajectory. ORIGINS LIFE EVOL B 2022; 52:129-147. [PMID: 35441955 DOI: 10.1007/s11084-022-09619-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/22/2021] [Accepted: 01/21/2022] [Indexed: 01/23/2023]
Abstract
Now that we know that Earth-like planets are ubiquitous in the universe, as well as that most of them are much older than the Earth, it is justified to ask to what extent evolutionary outcomes on other such planets are similar, or indeed commensurable, to the outcomes we perceive around us. In order to assess the degree of specialty or mediocrity of our trajectory of biospheric evolution, we need to take into account recent advances in theoretical astrobiology, in particular (i) establishing the history of habitable planets' formation in the Galaxy, and (ii) understanding the crucial importance of "Gaian" feedback loops and temporal windows for the interaction of early life with its physical environment. Hereby we consider an alternative macroevolutionary pathway that may result in tight functional integration of all sub-planetary ecosystems, eventually giving rise to a true superorganism at the biospheric level. The blueprint for a possible outcome of this scenario has been masterfully provided by the great Polish novelist Stanisław Lem in his 1961 novel Solaris. In fact, Solaris offers such a persuasive and powerful case for an "extremely strong" Gaia hypothesis that it is, arguably, high time to investigate it in a discursive astrobiological and philosophical context. In addition to novel predictions in the domain of potentially detectable biosignatures, some additional cognitive and heuristic benefits of studying such extreme cases of functional integration are briefly discussed.

10
Mokhtari DA, Appel MJ, Fordyce PM, Herschlag D. High throughput and quantitative enzymology in the genomic era. Curr Opin Struct Biol 2021; 71:259-273. [PMID: 34592682 PMCID: PMC8648990 DOI: 10.1016/j.sbi.2021.07.010] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 05/07/2021] [Accepted: 07/23/2021] [Indexed: 12/28/2022]
Abstract
Accurate predictions from models based on physical principles are the ultimate metric of our biophysical understanding. Although there has been stunning progress toward structure prediction, quantitative prediction of enzyme function has remained challenging. Realizing this goal will require large numbers of quantitative measurements of rate and binding constants and the use of these ground-truth data sets to guide the development and testing of these quantitative models. Ground truth data more closely linked to the underlying physical forces are also desired. Here, we describe technological advances that enable both types of ground truth measurements. These advances allow classic models to be tested, provide novel mechanistic insights, and place us on the path toward a predictive understanding of enzyme structure and function.
Affiliation(s)
- D A Mokhtari
  - Department of Biochemistry, Stanford University, Stanford, CA, 94305, USA
- M J Appel
  - Department of Biochemistry, Stanford University, Stanford, CA, 94305, USA
- P M Fordyce
  - Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA
  - ChEM-H Institute, Stanford University, Stanford, CA, 94305, USA
  - Department of Genetics, Stanford University, Stanford, CA, 94305, USA
  - Chan Zuckerberg Biohub San Francisco, CA, 94110, USA
- D Herschlag
  - Department of Biochemistry, Stanford University, Stanford, CA, 94305, USA
  - Department of Chemical Engineering, Stanford University, Stanford, CA, 94305, USA
  - ChEM-H Institute, Stanford University, Stanford, CA, 94305, USA

11
Buchholz PCF, van Loo B, Eenink BDG, Bornberg-Bauer E, Pleiss J. Ancestral sequences of a large promiscuous enzyme family correspond to bridges in sequence space in a network representation. J R Soc Interface 2021; 18:20210389. [PMID: 34727710 DOI: 10.1098/rsif.2021.0389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Abstract
Evolutionary relationships of protein families can be characterized either by networks or by trees. Whereas trees allow for hierarchical grouping and reconstruction of the most likely ancestral sequences, networks lack a time axis but allow for thresholds of pairwise sequence identity to be chosen and, therefore, the clustering of family members with presumably more similar functions. Here, we use the large family of arylsulfatases and phosphonate monoester hydrolases to investigate similarities, strengths and weaknesses in tree and network representations. For varying thresholds of pairwise sequence identity, values of betweenness centrality and clustering coefficients were derived for nodes of the reconstructed ancestors to measure the propensity to act as a bridge in a network. Based on these properties, ancestral protein sequences emerge as bridges in protein sequence networks. Interestingly, many ancestral protein sequences appear close to extant sequences. Therefore, reconstructed ancestor sequences might also be interpreted as yet-to-be-identified homologues. The concept of ancestor reconstruction is compared to consensus sequences, too. It was found that hub sequences in a network, e.g. reconstructed ancestral sequences that are connected to many neighbouring sequences, share closer similarity with derived consensus sequences. Therefore, some reconstructed ancestor sequences can also be interpreted as consensus sequences.
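
The network measures named in this abstract can be sketched in plain Python. This is a toy illustration, not the authors' pipeline: the sequence names, identity values, and threshold are invented, and the betweenness routine is a straightforward implementation of Brandes' algorithm for unweighted, undirected graphs.

```python
from collections import deque

def build_network(identity, threshold):
    """Connect two sequences when their pairwise identity meets the threshold."""
    adj = {n: set() for pair in identity for n in pair}
    for (a, b), value in identity.items():
        if value >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}   # -1 marks "not reached yet"
        sigma[s], dist[s] = 1, 0
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each undirected pair was counted twice

def clustering(adj):
    """Local clustering coefficient: fraction of neighbour pairs that are linked."""
    coeff = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        coeff[v] = 2 * links / (k * (k - 1)) if k > 1 else 0.0
    return coeff

# Hypothetical identities: two tight clusters joined only through "anc".
identity = {
    ("A1", "A2"): 0.8, ("A1", "A3"): 0.8, ("A2", "A3"): 0.8,
    ("B1", "B2"): 0.8, ("B1", "B3"): 0.8, ("B2", "B3"): 0.8,
    ("anc", "A1"): 0.65, ("anc", "B1"): 0.65, ("A1", "B1"): 0.4,
}
network = build_network(identity, threshold=0.6)
bc, cc = betweenness(network), clustering(network)
print(bc["anc"], cc["anc"])
```

In this toy network the node `anc` lies on every shortest path between the two clusters and has no linked neighbours, so it combines high betweenness with a low clustering coefficient, the signature of a bridge.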
Affiliation(s)
- Patrick C F Buchholz
  - Institute of Biochemistry and Technical Biochemistry, University of Stuttgart, Allmandring 31, Stuttgart 70569, Germany
- Bert van Loo
  - Department of Applied Sciences, Northumbria University, Newcastle-upon-Tyne NE1 8ST, UK
  - Institute for Evolution and Biodiversity, University of Münster, Hüfferstraße 1, Münster 48149, Germany
- Bernard D G Eenink
  - Institute for Evolution and Biodiversity, University of Münster, Hüfferstraße 1, Münster 48149, Germany
- Erich Bornberg-Bauer
  - Institute for Evolution and Biodiversity, University of Münster, Hüfferstraße 1, Münster 48149, Germany
  - Department of Protein Evolution, Max Planck Institute for Developmental Biology, Max-Planck-Ring 5, Tübingen 72076, Germany
- Jürgen Pleiss
  - Institute of Biochemistry and Technical Biochemistry, University of Stuttgart, Allmandring 31, Stuttgart 70569, Germany

12
Bitard-Feildel T. Navigating the amino acid sequence space between functional proteins using a deep learning framework. PeerJ Comput Sci 2021; 7:e684. [PMID: 34616884 PMCID: PMC8459775 DOI: 10.7717/peerj-cs.684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2021] [Accepted: 07/30/2021] [Indexed: 06/13/2023]
Abstract
MOTIVATION: Shedding light on the relationships between protein sequences and functions is a challenging task with many implications in protein evolution, disease understanding, and protein design. The mapping of protein sequence space to specific functions is, however, hard to comprehend due to its complexity. Generative models help to decipher complex systems thanks to their ability to learn and recreate data specificity. Applied to proteins, they can capture the sequence patterns associated with functions and point out important relationships between sequence positions. By learning these dependencies between sequences and functions, they can ultimately be used to generate new sequences and navigate through uncharted areas of molecular evolution. RESULTS: This study presents an Adversarial Auto-Encoder (AAE) approach, an unsupervised generative model, to generate new protein sequences. AAEs are tested on three protein families known for their multiple functions: the sulfatase, the HUP and the TPP families. Clustering results on the encoded sequences from the latent space computed by AAEs display a high level of homogeneity regarding the protein sequence functions. The study also reports and analyzes for the first time two sampling strategies based on latent space interpolation and latent space arithmetic to generate intermediate protein sequences sharing sequential properties of original sequences linked to known functional properties issued from different families and functions. Sequences generated by interpolation between latent space data points demonstrate the ability of the AAE to generalize and produce meaningful biological sequences from an evolutionarily uncharted area of the biological sequence space.
Finally, 3D structure models computed by comparative modelling using generated sequences and templates of different sub-families point to the ability of the latent space arithmetic to successfully transfer protein sequence properties linked to function between different sub-families. All in all, this study confirms the ability of deep learning frameworks to model biological complexity and bring new tools to explore amino acid sequence and functional spaces.
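
The two sampling strategies named in this abstract, latent space interpolation and latent space arithmetic, can be sketched on plain vectors. The encoder and decoder of the AAE are not shown, and the specific arithmetic used here (shifting a query point by the difference between two sub-family centroids) is a common convention assumed for illustration, not a formula confirmed by the abstract. All names and values are hypothetical.

```python
def interpolate(z_start, z_end, steps):
    """Linear interpolation between two latent vectors, endpoints included.

    Requires steps >= 2.
    """
    return [
        [a + (b - a) * t / (steps - 1) for a, b in zip(z_start, z_end)]
        for t in range(steps)
    ]

def latent_arithmetic(z_query, z_source_mean, z_target_mean):
    """Shift a latent point by the offset between two sub-family centroids (assumed form)."""
    return [q - s + t for q, s, t in zip(z_query, z_source_mean, z_target_mean)]

# Hypothetical 2-D latent codes; a real model would use many more dimensions.
path = interpolate([0.0, 0.0], [1.0, 2.0], steps=3)
shifted = latent_arithmetic([1.0, 1.0], [1.0, 0.0], [0.0, 1.0])
print(path)
print(shifted)
```

Each vector along `path`, and the `shifted` vector, would then be passed through the trained decoder to obtain a candidate protein sequence.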
Affiliation(s)
- Tristan Bitard-Feildel
  - IBPS, CNRS, Laboratoire de Biologie Computationnelle et Quantitative, Sorbonne Université, Paris, France
  - Institut des Sciences du Calcul et des Données (ISCD), Sorbonne Université, Paris, France

13
Abstract
An accurate estimation of the Protein Space size, in light of the factors that govern it, is a long-standing problem of paramount importance in evolutionary biology, since it determines the nature of protein evolvability. A simple analysis will enable us to, firstly, reduce an unrealistic Protein Space size of ~10^130 sequences, for a 100-residue polypeptide chain, to ~10^9 functional proteins and, secondly, estimate a robust average mutation rate per amino acid (ξ ~ 1.23) and infer from it, in light of the protein marginal stability, that only a fraction of the sequence will be available at any one time for a functional protein to evolve. Although this result does not solve the Protein Space vastness problem, it frames it in a more rational way and illustrates the impact of the marginal stability on protein evolvability.
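
The starting figure of about 10^130 sequences follows from counting all chains of 100 residues over the 20 standard amino acids. The short sketch below reproduces that count and the size of the reduction to about 10^9 functional proteins; the function name is illustrative, and the 10^9 figure is taken from the abstract rather than derived here.

```python
import math

def log10_sequence_count(length, alphabet=20):
    """Base-10 logarithm of the number of sequences of a given length."""
    return length * math.log10(alphabet)

log_total = log10_sequence_count(100)   # about 130.1, i.e. 20^100 is roughly 10^130
log_functional = 9.0                    # ~10^9 functional proteins, as stated in the abstract
print(f"20^100 is about 10^{log_total:.1f}")
print(f"reduction factor is about 10^{log_total - log_functional:.1f}")
```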

14
Thorvaldsen S, Hössjer O. Using statistical methods to model the fine-tuning of molecular machines and systems. J Theor Biol 2020; 501:110352. [PMID: 32505827 DOI: 10.1016/j.jtbi.2020.110352] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/06/2019] [Revised: 05/26/2020] [Accepted: 05/27/2020] [Indexed: 10/24/2022]
Abstract
Fine-tuning has received much attention in physics, and it states that the fundamental constants of physics are finely tuned to precise values for a rich chemistry and life permittance. It has not yet been applied in a broad manner to molecular biology. However, in this paper we argue that biological systems present fine-tuning at different levels, e.g. functional proteins, complex biochemical machines in living cells, and cellular networks. This paper describes molecular fine-tuning, how it can be used in biology, and how it challenges conventional Darwinian thinking. We also discuss the statistical methods underpinning fine-tuning and present a framework for such analysis.
Affiliation(s)
| | - Ola Hössjer
- Stockholm University, Dep. of Mathematics, Division of Mathematical Statistics, Sweden.
|
15
|
Lespinats S, De Clerck O, Colange B, Gorelova V, Grando D, Maréchal E, Van Der Straeten D, Rébeillé F, Bastien O. Phylogeny and Sequence Space: A Combined Approach to Analyze the Evolutionary Trajectories of Homologous Proteins. The Case Study of Aminodeoxychorismate Synthase. Acta Biotheor 2020; 68:139-156. [PMID: 31312977 DOI: 10.1007/s10441-019-09352-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2019] [Accepted: 07/10/2019] [Indexed: 11/27/2022]
Abstract
During the course of evolution, variation of a protein sequence is an ongoing phenomenon, although it is limited by the need to maintain the protein's structural and functional integrity. Deciphering the evolutionary path of a protein is thus of fundamental interest. With the development of new methods to visualize high-dimensional spaces and the improvement of phylogenetic analysis tools, it is possible to study the evolutionary trajectories of proteins in sequence space. Using the data-driven high-dimensional scaling method, we show that it is possible to predict and represent potential evolutionary trajectories by mapping phylogenetic trees onto a 3D projection of the sequence space. With the case of aminodeoxychorismate synthase, an enzyme involved in folate synthesis, we show that this representation raises interesting questions about the complexity of the evolution of a given biological function, in particular concerning its capacity to explore the sequence space.
Affiliation(s)
| | - Olivier De Clerck
- Department of Biology, Phycology Research Group, Ghent University, Krijgslaan 281, 9000, Ghent, Belgium
| | - Benoît Colange
- Univ. Grenoble Alpes, INES, 73375, Le Bourget du Lac, France
| | - Vera Gorelova
- Department of Biology, Laboratory of Functional Plant Biology, Ghent University, K.L Ledeganckstraat 35, 9000, Ghent, Belgium
- Department of Botany and Plant Biology, Laboratory of Plant Biochemistry and Physiology, University of Geneva, Quai E. Ansermet 30, 1211, Geneva, Switzerland
| | - Delphine Grando
- Univ. Grenoble Alpes, CEA, CNRS, INRA, BIG-LPCV, 38000, Grenoble, France
| | - Eric Maréchal
- Univ. Grenoble Alpes, CEA, CNRS, INRA, BIG-LPCV, 38000, Grenoble, France
| | - Dominique Van Der Straeten
- Department of Biology, Laboratory of Functional Plant Biology, Ghent University, K.L Ledeganckstraat 35, 9000, Ghent, Belgium
| | - Fabrice Rébeillé
- Univ. Grenoble Alpes, CEA, CNRS, INRA, BIG-LPCV, 38000, Grenoble, France
| | - Olivier Bastien
- Univ. Grenoble Alpes, CEA, CNRS, INRA, BIG-LPCV, 38000, Grenoble, France.
- Laboratoire de Physiologie Cellulaire Végétale, Département Réponse et Dynamique Cellulaire, CEA Grenoble, UMR 5168, CNRS-CEA-INRA-Université J. Fourier, 17 rue des Martyrs, 38054, Grenoble Cedex 09, France.
|
16
|
Bauer TL, Buchholz PCF, Pleiss J. The modular structure of α/β-hydrolases. FEBS J 2019; 287:1035-1053. [PMID: 31545554 DOI: 10.1111/febs.15071] [Citation(s) in RCA: 65] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2019] [Revised: 08/15/2019] [Accepted: 09/19/2019] [Indexed: 12/22/2022]
Abstract
The α/β-hydrolase fold family is highly diverse in sequence, structure and biochemical function. To investigate the sequence-structure-function relationships, the Lipase Engineering Database (https://led.biocatnet.de) was updated. Overall, 280 638 protein sequences and 1557 protein structures were analysed. All α/β-hydrolases consist of the catalytically active core domain, but they might also contain additional structural modules, resulting in 12 different architectures: core domain only, additional lids at three different positions, three different caps, additional N- or C-terminal domains and combinations of N- and C-terminal domains with caps and lids respectively. In addition, the α/β-hydrolases were distinguished by their oxyanion hole signature (GX-, GGGX- and Y-types). The N-terminal domains show two different folds, the Rossmann fold or the β-propeller fold. The C-terminal domains show a β-sandwich fold. The N-terminal β-propeller domain and the C-terminal β-sandwich domain are structurally similar to carbohydrate-binding proteins such as lectins. The classification was applied to the newly discovered polyethylene terephthalate (PET)-degrading PETases and MHETases, which are core domain α/β-hydrolases of the GX- and the GGGX-type respectively. To investigate evolutionary relationships, sequence networks were analysed. The degree distribution followed a power law with a scaling exponent γ = 1.4, indicating a highly inhomogeneous network which consists of a few hubs and a large number of less connected sequences. The hub sequences have many functional neighbours and therefore are expected to be robust toward possible deleterious effects of mutations. The cluster size distribution followed a power law with an extrapolated scaling exponent τ = 2.6, which strongly supports the connectedness of the sequence space of α/β-hydrolases. 
DATABASE: Supporting data about domains from other proteins with structural similarity to the N- or C-terminal domains of α/β-hydrolases are available in Data Repository of the University of Stuttgart (DaRUS) under doi: https://doi.org/10.18419/darus-458.
Affiliation(s)
- Tabea L Bauer
- Institute of Biochemistry and Technical Biochemistry, University of Stuttgart, Germany
| | - Patrick C F Buchholz
- Institute of Biochemistry and Technical Biochemistry, University of Stuttgart, Germany
| | - Jürgen Pleiss
- Institute of Biochemistry and Technical Biochemistry, University of Stuttgart, Germany
|
17
|
Ćirković MM, Vukotić B, Stojanović M. Persistence of Technosignatures: A Comment on Lingam and Loeb. ASTROBIOLOGY 2019; 19:1300-1302. [PMID: 31260327 DOI: 10.1089/ast.2019.2052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
In a recent article in this journal, Lingam and Loeb developed an excellent heuristic for searches for biosignatures versus technosignatures. We consider two ways in which their approach could be extended and sharpened, with focus on durability of technosignatures. We also note an important consequence of the adopted heuristic that offers strong support to the ideas of the Dysonian Search for ExtraTerrestrial Intelligence (SETI).
Affiliation(s)
- Milan M Ćirković
- Astronomical Observatory of Belgrade, Belgrade, Serbia
- Future of Humanity Institute, Faculty of Philosophy, University of Oxford, Oxford, United Kingdom
|
18
|
Marchi J, Galpern EA, Espada R, Ferreiro DU, Walczak AM, Mora T. Size and structure of the sequence space of repeat proteins. PLoS Comput Biol 2019; 15:e1007282. [PMID: 31415557 PMCID: PMC6733475 DOI: 10.1371/journal.pcbi.1007282] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2019] [Revised: 09/09/2019] [Accepted: 07/24/2019] [Indexed: 11/18/2022] Open
Abstract
The coding space of protein sequences is shaped by evolutionary constraints set by requirements of function and stability. We show that the coding space of a given protein family (the total number of sequences in that family) can be estimated using models of maximum entropy trained on multiple sequence alignments of naturally occurring amino acid sequences. We analyzed and calculated the size of three abundant repeat protein families, whose members are large proteins made of many repetitions of conserved portions of ∼30 amino acids. While amino acid conservation at each position of the alignment explains most of the reduction of diversity relative to completely random sequences, we found that correlations between amino acid usage at different positions significantly impact that diversity. We quantified the impact of different types of correlations, functional and evolutionary, on sequence diversity. Analysis of the detailed structure of the coding space of the families revealed a rugged landscape, with many local energy minima of varying sizes with a hierarchical structure, reminiscent of frustrated energy landscapes of spin glasses in physics. This clustered structure indicates a multiplicity of subtypes within each family, and suggests new strategies for protein design.
Natural protein molecules are only a small subset of the possible strings of amino acids. This naturally raises the question of how many functional protein sequences can theoretically exist, and how many have already been explored by nature. To help answer this question, we developed a statistical method to calculate the total potential number of protein sequences of a given family, focusing on three families of repeat proteins, which play important roles in a variety of cellular processes. The number of sequences that we compute is limited by functional interactions between the residues of the protein, as well as its evolutionary history. Applying techniques from the physics of disordered systems, we show that the space of sequences has a rugged structure, which could hinder their evolution. Individual proteins can be organised into distinct clusters corresponding to basins of attraction of the landscape, suggesting the existence of subfamilies within each family.
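The simplest ingredient of such a size estimate, per-position conservation, can be sketched as follows. This is only an independent-site approximation, not the authors' full maximum-entropy model: it ignores the correlations between positions that the abstract reports as significant, so it would tend to overestimate the family size. The toy alignment and function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (in nats) of the residue frequencies in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def log10_effective_size(alignment: list[str]) -> float:
    """log10 of exp(sum of column entropies), i.e. the effective number of
    sequences under an independent-site model (assumption: no correlations)."""
    columns = zip(*alignment)  # transpose rows (sequences) into columns (positions)
    total_entropy = sum(column_entropy("".join(col)) for col in columns)
    return total_entropy / math.log(10)


# Hypothetical toy alignment: two conserved positions, two binary positions.
toy_alignment = ["ACDA", "ACEA", "ACDG", "ACEG"]
print(10 ** log10_effective_size(toy_alignment))
```

In the toy alignment two columns are fully conserved and two are split evenly between two residues, so the effective size is 2 × 2 = 4 sequences.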
Affiliation(s)
- Jacopo Marchi
- Laboratoire de physique de l’École normale supérieure (PSL University), CNRS, Sorbonne Université, and Université de Paris, 75005 Paris, France
| | - Ezequiel A. Galpern
- Protein Physiology Lab, Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, Departamento de Química Biológica, Buenos Aires, Argentina
- CONICET - Universidad de Buenos Aires, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Buenos Aires, Argentina
| | - Rocio Espada
- Laboratoire Gulliver, Ecole supérieure de physique et chimie industrielles (PSL University) and CNRS, 75005, Paris, France
| | - Diego U. Ferreiro
- Protein Physiology Lab, Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, Departamento de Química Biológica, Buenos Aires, Argentina
- CONICET - Universidad de Buenos Aires, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Buenos Aires, Argentina
| | - Aleksandra M. Walczak
- Laboratoire de physique de l’École normale supérieure (PSL University), CNRS, Sorbonne Université, and Université de Paris, 75005 Paris, France
- * E-mail: (AMW); (TM)
| | - Thierry Mora
- Laboratoire de physique de l’École normale supérieure (PSL University), CNRS, Sorbonne Université, and Université de Paris, 75005 Paris, France
- * E-mail: (AMW); (TM)
|
19
|
Gorelova V, Bastien O, De Clerck O, Lespinats S, Rébeillé F, Van Der Straeten D. Evolution of folate biosynthesis and metabolism across algae and land plant lineages. Sci Rep 2019; 9:5731. [PMID: 30952916 PMCID: PMC6451014 DOI: 10.1038/s41598-019-42146-5] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2018] [Accepted: 03/25/2019] [Indexed: 11/09/2022] Open
Abstract
Tetrahydrofolate and its derivatives, commonly known as folates, are essential for almost all living organisms. Besides acting as one-carbon donors and acceptors in reactions producing various important biomolecules such as nucleic and amino acids, as well as pantothenate, they also supply one-carbon units for methylation reactions. Plants along with bacteria, yeast and fungi synthesize folates de novo and therefore constitute a very important dietary source of folates for animals. All the major steps of folate biosynthesis and metabolism have been identified but only few have been genetically characterized in a handful of model plant species. The possible differences in the folate pathway between various plant and algal species have never been explored. In this study we present a comprehensive comparative study of folate biosynthesis and metabolism of all major land plant lineages as well as green and red algae. The study identifies new features of plant folate metabolism that might open new directions to folate research in plants.
Affiliation(s)
- V Gorelova
- Department of Biology, Laboratory of Functional Plant Biology, Ghent University, K.L Ledeganckstraat 35, 9000, Ghent, Belgium; Department of Botany and Plant Biology, Laboratory of Plant Biochemistry and Physiology, University of Geneva, Quai E. Ansermet 30, 1211, Geneva, Switzerland
| | - O Bastien
- Laboratoire de Physiologie Cellulaire Vegetale, UMR168 CNRS-CEA-INRA-Universite Joseph Fourier Grenoble I, Bioscience and Biotechnologies Institute of Grenoble, CEA-Grenoble, 17 rue des Martyrs, 38054, Grenoble, Cedex 9, France
| | - O De Clerck
- Department of Biology, Phycology Research Group, Ghent University, Krijgslaan 281, 9000, Gent, Belgium
| | - S Lespinats
- Laboratoire de Physiologie Cellulaire Vegetale, UMR168 CNRS-CEA-INRA-Universite Joseph Fourier Grenoble I, Bioscience and Biotechnologies Institute of Grenoble, CEA-Grenoble, 17 rue des Martyrs, 38054, Grenoble, Cedex 9, France
| | - F Rébeillé
- Laboratoire de Physiologie Cellulaire Vegetale, UMR168 CNRS-CEA-INRA-Universite Joseph Fourier Grenoble I, Bioscience and Biotechnologies Institute of Grenoble, CEA-Grenoble, 17 rue des Martyrs, 38054, Grenoble, Cedex 9, France
| | - D Van Der Straeten
- Department of Biology, Laboratory of Functional Plant Biology, Ghent University, K.L Ledeganckstraat 35, 9000, Ghent, Belgium.
|
20
|
Korenić A, Perović S, Ćirković MM, Miquel PA. Symmetry breaking and functional incompleteness in biological systems. PROGRESS IN BIOPHYSICS AND MOLECULAR BIOLOGY 2019; 150:1-12. [PMID: 30776381 DOI: 10.1016/j.pbiomolbio.2019.02.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/06/2018] [Revised: 12/31/2018] [Accepted: 02/05/2019] [Indexed: 12/20/2022]
Abstract
Symmetry-based explanations using symmetry breaking (SB) as the key explanatory tool have complemented and replaced traditional causal explanations in various domains of physics. The process of spontaneous SB is now a mainstay of contemporary explanatory accounts of large chunks of condensed-matter physics, quantum field theory, nonlinear dynamics, cosmology, and other disciplines. A wide range of empirical research into various phenomena related to symmetries and SB across biological scales has accumulated as well. Led by these results, we identify and explain some common features of the emergence, propagation, and cascading of SB-induced layers across the biosphere. These features are predicated on the thermodynamic openness and intrinsic functional incompleteness of the systems at stake and have not been systematically analyzed from a general philosophical and methodological perspective. We also consider possible continuity of SB across the physical and biological world and discuss the connection between Darwinism and SB-based analysis of the biosphere and its history.
Affiliation(s)
- Andrej Korenić
- The Centre for Laser Microscopy, Institute for Physiology and Biochemistry, Faculty of Biology, University of Belgrade, Serbia
|
21
|
Turjanski P, Ferreiro DU. On the Natural Structure of Amino Acid Patterns in Families of Protein Sequences. J Phys Chem B 2018; 122:11295-11301. [PMID: 30239207 DOI: 10.1021/acs.jpcb.8b07206] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023]
Abstract
All known terrestrial proteins are coded as continuous strings of ≈20 amino acids. The patterns formed by the repetitions of elements in groups of finite sequences describe the natural architectures of protein families. We present a method to search for patterns and groupings of patterns in protein sequences using a mathematically precise definition for "repetition", an efficient algorithmic implementation and a robust scoring system with no adjustable parameters. We show that the sequence patterns can be well-separated into disjoint classes according to their recurrence in nested structures. The statistics of the occurrences of patterns indicate that short repetitions are sufficient to account for the differences between natural families and randomized groups of sequences by more than 10 standard deviations, while contiguous sequence patterns shorter than 5 residues are effectively random in their occurrences. A small subset of patterns is sufficient to account for a robust "familiarity" definition between arbitrary sets of sequences.
Affiliation(s)
- Pablo Turjanski
- KAPOW, Departamento de Computación , Facultad de Ciencias Exactas y Naturales, UBA-CONICET-ICC , Buenos Aires , Argentina
| | - Diego U Ferreiro
- Protein Physiology Lab, Departamento de Química Biológica , Facultad de Ciencias Exactas y Naturales, UBA-CONICET-IQUIBICEN , Buenos Aires , Argentina
|
22
|
Abstract
The sequence space of five protein superfamilies was investigated by constructing sequence networks. The nodes represent individual sequences, and two nodes are connected by an edge if the global sequence identity of two sequences exceeds a threshold. The networks were characterized by their degree distribution (number of nodes with a given number of neighbors) and by their fractal network dimension. Although the five protein families differed in sequence length, fold, and domain arrangement, their network properties were similar. The fractal network dimension Df was distance-dependent: a high dimension for single and double mutants (Df = 4.0), which dropped to Df = 0.7-1.0 at 90% sequence identity, and increased to Df = 3.5-4.5 below 70% sequence identity. The distance dependency of the network dimension is consistent with evolutionary constraints for functional proteins. While random single and double mutations often result in a functional protein, the accumulation of more than ten mutations is dominated by epistasis. The networks of the five protein families were highly inhomogeneous with few highly connected communities ("hub sequences") and a large number of smaller and less connected communities. The degree distributions followed a power-law distribution with similar scaling exponents close to 1. Because the hub sequences have a large number of functional neighbors, they are expected to be robust toward possible deleterious effects of mutations. Because of their robustness, hub sequences have the potential of high innovability, with additional mutations readily inducing new functions. Therefore, they form hotspots of evolution and are promising candidates as starting points for directed evolution experiments in biotechnology.
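The network construction described in this abstract can be illustrated with a minimal sketch. It assumes pre-aligned sequences of equal length and a simple per-position identity; the abstract does not state how global identity was computed, so the `identity` function, the threshold value, and the toy sequences are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def degree_distribution(seqs: list[str], threshold: float) -> Counter:
    """Map each degree to the number of nodes (sequences) having that degree."""
    degree = {i: 0 for i in range(len(seqs))}
    for i, j in combinations(range(len(seqs)), 2):
        # An edge exists when the pairwise identity exceeds the threshold.
        if identity(seqs[i], seqs[j]) > threshold:
            degree[i] += 1
            degree[j] += 1
    return Counter(degree.values())


toy_seqs = ["AAAA", "AAAT", "AATT", "GGGG"]  # hypothetical toy data
print(degree_distribution(toy_seqs, threshold=0.7))
```

With these toy sequences, the second sequence is linked to both the first and the third, while the fourth has no neighbours. The all-pairs loop is quadratic in the number of sequences, so a real superfamily would need a faster neighbour search.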
|
23
|
Ćirković MM. Woodpeckers and Diamonds: Some Aspects of Evolutionary Convergence in Astrobiology. ASTROBIOLOGY 2018; 18:491-502. [PMID: 29676927 DOI: 10.1089/ast.2017.1741] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Jared Diamond's argument against extraterrestrial intelligence from evolutionary contingency is subjected to critical scrutiny. As with the earlier arguments of George Gaylord Simpson, it contains critical loopholes that lead to its unraveling. From the point of view of the contemporary debates about biological evolution, perhaps the most contentious aspect of such arguments is their atemporal and gradualist usage of the space of all possible biological forms (morphospace). Such usage enables the translation of the adaptive value of a trait into the probability of its evolving. This procedure, it is argued, is dangerously misleading. Contra Diamond, there are reasons to believe that convergence not only plays an important role in the history of life, but also profoundly improves the prospects for search for extraterrestrial intelligence success. Some further considerations about the role of observation selection effects and our scaling of complexity in the great debate about contingency and convergence are given. Taken together, these considerations militate against the pessimism of Diamond's conclusion, and suggest that the search for traces and manifestations of extraterrestrial intelligences is far from forlorn. Key Words: Astrobiology-Evolution-Contingency-Convergence-Complex life-SETI-Major evolutionary transitions-Selection effects-Jared Diamond. Astrobiology 18, 491-502.
|
24
|
Buchholz PCF, Fademrecht S, Pleiss J. Percolation in protein sequence space. PLoS One 2017; 12:e0189646. [PMID: 29261740 PMCID: PMC5738032 DOI: 10.1371/journal.pone.0189646] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2017] [Accepted: 11/28/2017] [Indexed: 01/08/2023] Open
Abstract
The currently known protein sequences are not distributed equally in sequence space, but cluster into families. Analyzing the cluster size distribution gives a glimpse of the large and unknown extant protein sequence space, which has been explored during evolution. For six protein superfamilies with different fold and function, the cluster size distributions followed a power law with slopes between 2.4 and 3.3, which represent upper limits to the cluster distribution of extant sequences. The power law distribution of cluster sizes is in accordance with percolation theory and strongly supports connectedness of extant sequence space. Percolation of extant sequence space has three major consequences: (1) It transforms our view of sequence space as a highly connected network where each sequence has multiple neighbors, and each pair of sequences is connected by many different paths. A high degree of connectedness is a necessary condition of efficient evolution, because it overcomes the possible blockage by sign epistasis and reciprocal sign epistasis. (2) The Fisher exponent is an indicator of connectedness and saturation of sequence space of each protein superfamily. (3) All clusters are expected to be connected by extant sequences that become apparent as a higher portion of extant sequence space becomes known. Being linked to biochemically distinct homologous families, bridging sequences are promising enzyme candidates for applications in biotechnology because they are expected to have substrate ambiguity or catalytic promiscuity.
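A cluster size distribution of the kind analysed here can be obtained from any grouping of sequences. The sketch below assumes single-linkage clustering at a pairwise identity threshold, implemented with a union-find structure; the abstract does not specify the clustering procedure, so this choice, the threshold, and the toy sequences are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_sizes(seqs: list[str], threshold: float) -> list[int]:
    """Sizes of single-linkage clusters, largest first (assumed clustering rule)."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, halving paths on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) > threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    sizes = Counter(find(i) for i in range(len(seqs)))
    return sorted(sizes.values(), reverse=True)


toy_seqs = ["AAAA", "AAAT", "AATT", "GGGG"]  # hypothetical toy data
print(cluster_sizes(toy_seqs, threshold=0.7))
```

A histogram of the returned sizes on log-log axes is the usual starting point for judging whether a power law is plausible.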
Affiliation(s)
- Patrick C. F. Buchholz
- Institute of Biochemistry and Technical Biochemistry, University of Stuttgart, Stuttgart, Germany
| | - Silvia Fademrecht
- Institute of Biochemistry and Technical Biochemistry, University of Stuttgart, Stuttgart, Germany
| | - Jürgen Pleiss
- Institute of Biochemistry and Technical Biochemistry, University of Stuttgart, Stuttgart, Germany
|
25
|
|
26
|
Yu JF, Cao Z, Yang Y, Wang CL, Su ZD, Zhao YW, Wang JH, Zhou Y. Natural protein sequences are more intrinsically disordered than random sequences. Cell Mol Life Sci 2016; 73:2949-57. [PMID: 26801222 PMCID: PMC4937073 DOI: 10.1007/s00018-016-2138-9] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2015] [Revised: 01/10/2016] [Accepted: 01/11/2016] [Indexed: 11/16/2022]
Abstract
Most natural protein sequences have resulted from millions or even billions of years of evolution. How they differ from random sequences is not fully understood. Previous computational and experimental studies of random proteins generated from noncoding regions yielded inconclusive results due to species-dependent codon biases and GC contents. Here, we approach this problem by investigating 10,000 sequences randomized at the amino acid level. Using well-established predictors for protein intrinsic disorder, we found that natural sequences have more long disordered regions than random sequences, even when random and natural sequences have the same overall composition of amino acid residues. We also showed that random sequences are as structured as natural sequences according to contents and length distributions of predicted secondary structure, although the structures from random sequences may be in a molten globular-like state, according to molecular dynamics simulations. The bias of natural sequences toward more intrinsic disorder suggests that natural sequences are created and evolved to avoid protein aggregation and increase functional diversity.
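One way to randomize a sequence at the amino acid level while keeping its overall composition is to shuffle its residues. The sketch below assumes this shuffling approach, which the abstract does not confirm as the authors' procedure; the example sequence and function name are hypothetical.

```python
import random
from collections import Counter


def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Return a random permutation of the residues in seq.

    The amino acid composition is preserved exactly; only the order changes.
    """
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
natural = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical example sequence
randomized = shuffle_sequence(natural, rng)
print(randomized)
print(Counter(randomized) == Counter(natural))
```

Passing an explicit `random.Random` instance keeps the randomization reproducible and avoids touching the global random state.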
Affiliation(s)
- Jia-Feng Yu
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, 253023, China
| | - Zanxia Cao
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, 253023, China
- College of Physics and Electronic Information, Dezhou University, Dezhou, 253023, China
| | - Yuedong Yang
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr, Southport, QLD, 4222, Australia
| | - Chun-Ling Wang
- College of Physics and Electronic Information, Dezhou University, Dezhou, 253023, China
| | - Zhen-Dong Su
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, 253023, China
| | - Ya-Wei Zhao
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, 253023, China
| | - Ji-Hua Wang
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, 253023, China
- College of Physics and Electronic Information, Dezhou University, Dezhou, 253023, China
| | - Yaoqi Zhou
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, 253023, China.
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr, Southport, QLD, 4222, Australia.
|
27
|
Louis AA. Contingency, convergence and hyper-astronomical numbers in biological evolution. STUDIES IN HISTORY AND PHILOSOPHY OF BIOLOGICAL AND BIOMEDICAL SCIENCES 2016; 58:107-116. [PMID: 26868415 DOI: 10.1016/j.shpsc.2015.12.014] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/21/2015] [Accepted: 12/21/2015] [Indexed: 06/05/2023]
Abstract
Counterfactual questions such as "what would happen if you re-run the tape of life?" turn on the nature of the landscape of biological possibilities. Since the number of potential sequences that store genetic information grows exponentially with length, genetic possibility spaces can be so unimaginably vast that commentators frequently reach for hyper-astronomical metaphors that compare their size to that of the universe. Re-run the tape of life and the likelihood of encountering the same sequences in such hyper-astronomically large spaces is infinitesimally small, suggesting that evolutionary outcomes are highly contingent. On the other hand, the widespread occurrence of evolutionary convergence implies that similar phenotypes can be found again with relative ease. How can this be? Part of the solution to this conundrum must lie in the manner that genotypes map to phenotypes. By studying simple genotype-phenotype maps, where the counterfactual space of all possible phenotypes can be enumerated, it is shown that strong bias in the arrival of variation may explain why certain phenotypes are (repeatedly) observed in nature, while others never appear. This biased variation provides a non-selective cause for certain types of convergence. It illustrates how the role of randomness and contingency may differ significantly between genetic and phenotype spaces.
Affiliation(s)
- Ard A Louis
- Rudolph Peierls Centre for Theoretical Physics, University of Oxford, 1 Keble Road, OX1 3NP, United Kingdom.
|
28
|
Brisendine JM, Koder RL. Fast, cheap and out of control--Insights into thermodynamic and informatic constraints on natural protein sequences from de novo protein design. BIOCHIMICA ET BIOPHYSICA ACTA 2016; 1857:485-492. [PMID: 26498191 PMCID: PMC4856154 DOI: 10.1016/j.bbabio.2015.10.002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/28/2015] [Accepted: 10/06/2015] [Indexed: 12/15/2022]
Abstract
The accumulated results of thirty years of rational and computational de novo protein design have taught us important lessons about the stability, information content, and evolution of natural proteins. First, de novo protein design has complicated the assertion that biological function is equivalent to biological structure, demonstrating the capacity to abstract active sites from natural contexts and paste them into non-native topologies without loss of function. The structure-function relationship has thus been revealed to be either a generality or strictly true only in a local sense. Second, the simplification to "maquette" topologies carried out by rational protein design also has demonstrated that even sophisticated functions such as conformational switching, cooperative ligand binding, and light-activated electron transfer can be achieved with low-information design approaches. This is because for simple topologies the functional footprint in sequence space is enormous and easily exceeds the number of structures which could have possibly existed in the history of life on Earth. Finally, the pervasiveness of extraordinary stability in designed proteins challenges accepted models for the "marginal stability" of natural proteins, suggesting that there must be a selection pressure against highly stable proteins. This can be explained using recent theories which relate non-equilibrium thermodynamics and self-replication. This article is part of a Special Issue entitled Biodesign for Bioenergetics--The design and engineering of electron transfer cofactors, proteins and protein networks, edited by Ronald L. Koder and J.L. Ross Anderson.
Affiliation(s)
- Joseph M Brisendine
- Department of Physics, The City College of New York, New York, NY 10031, United States; The Graduate Program in Biochemistry, The Graduate Center of CUNY, New York, NY 10016, United States
| | - Ronald L Koder
- Department of Physics, The City College of New York, New York, NY 10031, United States; Graduate Programs of Physics, Chemistry and Biochemistry, The Graduate Center of CUNY, New York, NY 10016, United States.
|
29
|
Uversky VN. Paradoxes and wonders of intrinsic disorder: Complexity of simplicity. INTRINSICALLY DISORDERED PROTEINS 2016; 4:e1135015. [PMID: 28232895 DOI: 10.1080/21690707.2015.1135015] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/07/2015] [Accepted: 10/18/2015] [Indexed: 01/20/2023]
Abstract
At first glance it may seem that intrinsically disordered proteins (IDPs) and IDP regions (IDPRs) are simpler than ordered proteins and domains on multiple levels. However, such multilevel simplicity equips these proteins with the ability to have very complex behavior.
Affiliation(s)
- Vladimir N Uversky
- Department of Molecular Medicine and USF Health Byrd Alzheimer Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL, USA; Faculty of Science, Biology Department, King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia; Institute for Biological Instrumentation, Russian Academy of Sciences, Pushchino, Moscow Region, Russia; Laboratory of Structural Dynamics, Stability and Folding of Proteins, Institute of Cytology, Russian Academy of Sciences, St. Petersburg, Russia
|
30
|
McLeish TCB. Are there ergodic limits to evolution? Ergodic exploration of genome space and convergence. Interface Focus 2015; 5:20150041. [PMID: 26640648 DOI: 10.1098/rsfs.2015.0041] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Indexed: 12/21/2022] Open
Abstract
We examine the analogy between evolutionary dynamics and statistical mechanics to include the fundamental question of ergodicity: the representative exploration of the space of possible states (in the case of evolution this is genome space). Several properties of evolutionary dynamics are identified that allow a generalization of the ergodic dynamics, familiar in dynamical systems theory, to evolution. Two classes of evolved biological structure then arise, differentiated by the qualitative duration of their evolutionary time scales. The first class has an ergodicity time scale (the time required for representative genome exploration) longer than available evolutionary time, and has incompletely explored the genotypic and phenotypic space of its possibilities. This case generates no expectation of convergence to an optimal phenotype or possibility of its prediction. The second, more interesting, class exhibits an evolutionary form of ergodicity: essentially all of the structural space within the constraints of slower evolutionary variables has been sampled; the ergodicity time scale for the system evolution is less than the evolutionary time. In this case, some convergence towards similar optima may be expected for equivalent systems in different species where both possess ergodic evolutionary dynamics. When the fitness maximum is set by physical, rather than co-evolved, constraints, it is additionally possible to make predictions of some properties of the evolved structures and systems. We propose four structures that emerge from evolution within genotypes whose fitness is induced from their phenotypes. Together, these result in an exponential speeding up of evolution, when compared with complete exploration of genomic space. We illustrate a possible case of application and a prediction of convergence together with attaining a physical fitness optimum in the case of invertebrate compound eye resolution.
Affiliation(s)
- Tom C B McLeish
- Department of Physics and Chemistry, Durham University, Durham DH1 3LE, UK; Biophysical Sciences Institute, Durham University, Durham DH1 3LE, UK
|
31
|
Abstract
Evolution has produced an astonishing array of organisms, but does it have limits and, if so, how are these overcome and how have they changed over the course of time? Here, I review models for describing and explaining existing diversity, and then explore parts of the evolutionary tree that remain empty. In an analysis of 32 forbidden states among eukaryotes, identified in major clades and in the three great habitat realms of water, land and air, I argue that no phenotypic constraint is absolute, that most constraints reflect a limited time-energy budget available to individual organisms, that natural selection is ultimately responsible for both imposing and overcoming constraints, including those normally ascribed to developmental patterns of construction and phylogenetic conservatism, and that increases in adaptive versatility in major clades together with accompanying new ecological opportunities have eliminated many constraints. Phenotypes that were inaccessible during the Early Palaeozoic era have evolved during later periods while very few adaptive states have disappeared. The filling of phenotypic space has proceeded cumulatively in three overlapping phases characterized by diversification at the biochemical, morphological and cultural levels.
Affiliation(s)
- Geerat J Vermeij
- Department of Earth and Planetary Sciences, University of California, Davis, CA, USA
|
32
|
Wagner A, Rosen W. Spaces of the possible: universal Darwinism and the wall between technological and biological innovation. J R Soc Interface 2015; 11:20131190. [PMID: 24850903 DOI: 10.1098/rsif.2013.1190] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Indexed: 11/12/2022] Open
Abstract
Innovations in biological evolution and in technology have many common features. Some of them involve similar processes, such as trial and error and horizontal information transfer. Others describe analogous outcomes such as multiple independent origins of similar innovations. Yet others display similar temporal patterns such as episodic bursts of change separated by periods of stasis. We review nine such commonalities, and propose that the mathematical concept of a space of innovations, discoveries or designs can help explain them. This concept can also help demolish a persistent conceptual wall between technological and biological innovation.
Affiliation(s)
- Andreas Wagner
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland; The Swiss Institute of Bioinformatics, Quartier Sorge - Batiment Genopode, 1015 Lausanne, Switzerland; The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
|
33
|
Dutilh BE. Metagenomic ventures into outer sequence space. Bacteriophage 2014; 4:e979664. [PMID: 26458273 PMCID: PMC4588555 DOI: 10.4161/21597081.2014.979664] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 08/23/2014] [Revised: 10/14/2014] [Accepted: 10/17/2014] [Indexed: 11/19/2022]
Abstract
Sequencing DNA or RNA directly from the environment often results in many sequencing reads that have no homologs in the database. These are referred to as “unknowns,” and reflect the vast unexplored microbial sequence space of our biosphere, also known as “biological dark matter.” However, unknowns also exist because metagenomic datasets are not optimally mined. There is a pressure on researchers to publish and move on, and the unknown sequences are often left for what they are, and conclusions drawn based on reads with annotated homologs. This can cause abundant and widespread genomes to be overlooked, such as the recently discovered human gut bacteriophage crAssphage. The unknowns may be enriched for bacteriophage sequences, the most abundant and genetically diverse component of the biosphere and of sequence space. However, it remains an open question, what is the actual size of biological sequence space? The de novo assembly of shotgun metagenomes is the most powerful tool to address this question.
Affiliation(s)
- Bas E Dutilh
- Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, The Netherlands; Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Centre, Nijmegen, The Netherlands; Department of Marine Biology, Institute of Biology, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
|
34
|
Krick T, Verstraete N, Alonso LG, Shub DA, Ferreiro DU, Shub M, Sánchez IE. Amino Acid metabolism conflicts with protein diversity. Mol Biol Evol 2014; 31:2905-12. [PMID: 25086000 PMCID: PMC4209132 DOI: 10.1093/molbev/msu228] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Indexed: 01/09/2023] Open
Abstract
The 20 protein-coding amino acids are found in proteomes with different relative abundances. The most abundant amino acid, leucine, is nearly an order of magnitude more prevalent than the least abundant amino acid, cysteine. Amino acid metabolic costs differ similarly, constraining their incorporation into proteins. On the other hand, a diverse set of protein sequences is necessary to build functional proteomes. Here, we present a simple model for a cost-diversity trade-off postulating that natural proteomes minimize amino acid metabolic flux while maximizing sequence entropy. The model explains the relative abundances of amino acids across a diverse set of proteomes. We found that the data are remarkably well explained when the cost function accounts for amino acid chemical decay. More than 100 organisms reach comparable solutions to the trade-off by different combinations of proteome cost and sequence diversity. Quantifying the interplay between proteome size and entropy shows that proteomes can get optimally large and diverse.
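The cost-diversity trade-off described in this abstract can be illustrated with a minimal sketch. It assumes the textbook result that maximizing Shannon entropy at a fixed mean cost gives frequencies proportional to exp(-beta * cost); the abstract does not state the authors' functional form, so this is an illustration rather than their model. The three-letter alphabet, the cost values in `toy_costs`, and the parameter `beta` are all hypothetical.

```python
import math

def shannon_entropy(freqs):
    """Shannon entropy (bits) of an amino acid frequency distribution."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)

def mean_cost(freqs, costs):
    """Average metabolic cost per residue under the given frequencies."""
    return sum(freqs[a] * costs[a] for a in freqs)

def max_entropy_frequencies(costs, beta):
    """Frequencies maximizing entropy at fixed mean cost (assumed Boltzmann form)."""
    weights = {a: math.exp(-beta * c) for a, c in costs.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Hypothetical costs in arbitrary units; NOT values from the paper.
toy_costs = {"A": 1.0, "L": 2.0, "C": 4.0}

for beta in (0.0, 0.5, 1.0):
    f = max_entropy_frequencies(toy_costs, beta)
    print(f"beta={beta}: entropy={shannon_entropy(f):.3f} bits, "
          f"mean cost={mean_cost(f, toy_costs):.3f}")
```

With `beta = 0` the distribution is uniform (maximal diversity); raising `beta` shifts weight toward the cheaper residues, lowering both mean cost and entropy.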
Affiliation(s)
- Teresa Krick
- Departamento de Matemática, Facultad de Ciencias Exactas y Naturales and IMAS-CONICET, Universidad de Buenos Aires, Buenos Aires, Argentina
- Nina Verstraete
- Protein Physiology Laboratory, Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales and IQUIBICEN-CONICET, Universidad de Buenos Aires, Buenos Aires, Argentina
- David A Shub
- Department of Biological Sciences, University at Albany, State University of New York
- Diego U Ferreiro
- Protein Physiology Laboratory, Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales and IQUIBICEN-CONICET, Universidad de Buenos Aires, Buenos Aires, Argentina
- Michael Shub
- IMAS-CONICET, Universidad de Buenos Aires, Buenos Aires, Argentina
- Ignacio E Sánchez
- Protein Physiology Laboratory, Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales and IQUIBICEN-CONICET, Universidad de Buenos Aires, Buenos Aires, Argentina
|
35
|
Widmann M, Pleiss J. Protein variants form a system of networks: microdiversity of IMP metallo-beta-lactamases. PLoS One 2014; 9:e101813. [PMID: 25013948 PMCID: PMC4094381 DOI: 10.1371/journal.pone.0101813] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 03/27/2014] [Accepted: 06/10/2014] [Indexed: 12/29/2022] Open
Abstract
Genome and metagenome sequencing projects support the view that only a tiny portion of the total protein microdiversity in the biosphere has been sequenced yet, while the vast majority of existing protein variants is still unknown. By using a network approach, the microdiversity of 42 metallo-β-lactamases of the IMP family was investigated. In the networks, the nodes are formed by the variants, while the edges correspond to single mutations between pairs of variants. The 42 variants were assigned to 7 separate networks. By analyzing the networks and their relationships, the structure of sequence space was studied and existing, but still unknown, functional variants were predicted. The largest network consists of 10 variants with IMP-1 in its center and includes two ubiquitous mutations, V67F and S262G. By relating the corresponding pairs of variants, the networks were integrated into a single system of networks. The largest network also included a quartet of variants: IMP-1, two single mutants, and the respective double mutant. The existence of quartets indicates that if two mutations resulted in functional enzymes, the double mutant may also be active and stable. Therefore, quartet construction from triplets was applied to predict 15 functional variants. Further functional mutants were predicted by applying the two ubiquitous mutations in all networks. In addition, since the networks are separated from each other by 10-15 mutations on average, it is expected that a subset of the theoretical intermediates are functional, and therefore are supposed to exist in the biosphere. Finally, the network analysis helps to distinguish between epistatic and additive effects of mutations; while the presence of correlated mutations indicates a strong interdependency between the respective positions, the mutations V67F and S262G are ubiquitous and therefore background independent.
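The network construction and the quartet-from-triplet prediction described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: variants are modelled as equal-length strings, an edge joins two variants at Hamming distance 1, and a triplet is a parent with two single mutants at different positions. The function names and the four-residue sequences in `toy` are hypothetical and are not IMP sequences.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def build_network(variants):
    """Edges join named variants that differ by exactly one substitution."""
    return {
        frozenset((u, v))
        for u, v in combinations(variants, 2)
        if hamming(variants[u], variants[v]) == 1
    }

def predict_from_triplets(variants):
    """Complete each triplet (parent + two single mutants) to a quartet
    by proposing the double mutant when it is not already known."""
    known = set(variants.values())
    predicted = set()
    for parent in known:
        singles = [s for s in known if hamming(parent, s) == 1]
        for s1, s2 in combinations(singles, 2):
            pos1 = next(i for i in range(len(parent)) if parent[i] != s1[i])
            pos2 = next(i for i in range(len(parent)) if parent[i] != s2[i])
            if pos1 == pos2:
                continue  # same site mutated twice: no double mutant
            double = list(parent)
            double[pos1], double[pos2] = s1[pos1], s2[pos2]
            double = "".join(double)
            if double not in known:
                predicted.add(double)
    return predicted

# Hypothetical toy data: a parent and two single mutants.
toy = {"P": "AVSG", "M1": "AFSG", "M2": "AVGG"}
print(build_network(toy))
print(predict_from_triplets(toy))
```

In the toy data the parent and its two single mutants form a triplet, so the sketch proposes the one missing double mutant that would complete the quartet.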
Affiliation(s)
- Michael Widmann
- Institute of Technical Biochemistry, University of Stuttgart, Stuttgart, Germany
- Jürgen Pleiss
- Institute of Technical Biochemistry, University of Stuttgart, Stuttgart, Germany
|
36
|
Chiarabelli C, Stano P, Luisi PL. Chemical synthetic biology: a mini-review. Front Microbiol 2013; 4:285. [PMID: 24065964 PMCID: PMC3779815 DOI: 10.3389/fmicb.2013.00285] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 06/05/2013] [Accepted: 09/04/2013] [Indexed: 01/08/2023] Open
Abstract
Chemical synthetic biology (CSB) is a branch of synthetic biology (SB) oriented toward the synthesis of chemical structures alternative to those present in nature. Whereas SB combines biology and engineering with the aim of synthesizing biological structures or life forms that do not exist in nature, often based on genome manipulation, CSB uses and assembles biological parts, synthetic or not, to create new and alternative structures. A short epistemological note will introduce the theoretical concepts related to these fields, whereas the text will be largely devoted to introducing and commenting on two main projects of CSB carried out in our laboratory in recent years. The “Never Born Biopolymers” project deals with the construction and the screening of RNA and peptide sequences that are not present in nature, whereas the “Minimal Cell” project focuses on the construction of semi-synthetic compartments (usually liposomes) containing the minimal and sufficient number of components to perform the basic function of a biological cell. These two topics are extremely important for both the general understanding of biology in terms of function, organization, and development, and for applied biotechnology.
|
37
|
Uversky VN. A decade and a half of protein intrinsic disorder: biology still waits for physics. Protein Sci 2013; 22:693-724. [PMID: 23553817 PMCID: PMC3690711 DOI: 10.1002/pro.2261] [Citation(s) in RCA: 364] [Impact Index Per Article: 33.1] [Received: 03/13/2013] [Revised: 03/23/2013] [Accepted: 03/25/2013] [Indexed: 12/28/2022]
Abstract
The abundant existence of proteins and regions that possess specific functions without being uniquely folded into unique 3D structures has become accepted by a significant number of protein scientists. Sequences of these intrinsically disordered proteins (IDPs) and IDP regions (IDPRs) are characterized by a number of specific features, such as low overall hydrophobicity and high net charge which makes these proteins predictable. IDPs/IDPRs possess large hydrodynamic volumes, low contents of ordered secondary structure, and are characterized by high structural heterogeneity. They are very flexible, but some may undergo disorder to order transitions in the presence of natural ligands. The degree of these structural rearrangements varies over a very wide range. IDPs/IDPRs are tightly controlled under the normal conditions and have numerous specific functions that complement functions of ordered proteins and domains. When lacking proper control, they have multiple roles in pathogenesis of various human diseases. Gaining structural and functional information about these proteins is a challenge, since they do not typically "freeze" while their "pictures are taken." However, despite or perhaps because of the experimental challenges, these fuzzy objects with fuzzy structures and fuzzy functions are among the most interesting targets for modern protein research. This review briefly summarizes some of the recent advances in this exciting field and considers some of the basic lessons learned from the analysis of physics, chemistry, and biology of IDPs.
Affiliation(s)
- Vladimir N Uversky
- Department of Molecular Medicine, USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, Florida 33612, USA.
|
38
|
Yanagawa H. Exploration of the Origin and Evolution of Globular Proteins by mRNA Display. Biochemistry 2013; 52:3841-51. [DOI: 10.1021/bi301704x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/16/2023]
Affiliation(s)
- Hiroshi Yanagawa
- Department of Biosciences and Informatics, Faculty of Sciences and Technology, Keio University, 3-14-1, Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan
|
39
|
Unusual biophysics of intrinsically disordered proteins. Biochimica et Biophysica Acta-Proteins and Proteomics 2012; 1834:932-51. [PMID: 23269364 DOI: 10.1016/j.bbapap.2012.12.008] [Citation(s) in RCA: 413] [Impact Index Per Article: 34.4] [Received: 10/23/2012] [Revised: 11/21/2012] [Accepted: 12/12/2012] [Indexed: 02/08/2023]
Abstract
Research over the past decade and a half leaves no doubt that a complete understanding of protein functionality requires close consideration of the fact that many functional proteins do not have well-folded structures. These intrinsically disordered proteins (IDPs) and proteins with intrinsically disordered protein regions (IDPRs) are highly abundant in nature and play a number of crucial roles in a living cell. Their functions, which are typically associated with a wide range of intermolecular interactions where IDPs possess remarkable binding promiscuity, complement the functional repertoire of ordered proteins. All this requires close attention to the peculiarities of the biophysics of these proteins. In this review, some key biophysical features of IDPs are covered. In addition to the peculiar sequence characteristics of IDPs, these biophysical features include sequential, structural, and spatiotemporal heterogeneity of IDPs; their rough and relatively flat energy landscapes; their ability to undergo both induced folding and induced unfolding; the ability to interact specifically with structurally unrelated partners; the ability to gain different structures at binding to different partners; and the ability to keep an essential amount of disorder even in the bound form. IDPs are also characterized by the "turned-out" response to the changes in their environment, where they gain some structure under conditions resulting in denaturation or even unfolding of ordered proteins. It is proposed that the heterogeneous spatiotemporal structure of IDPs/IDPRs can be described as a set of foldons, inducible foldons, semi-foldons, non-foldons, and unfoldons. They may lose their function when folded, and activation of some IDPs is associated with the awaking of the dormant disorder. It is possible that IDPs represent the "edge of chaos" systems which operate in a region between order and complete randomness or chaos, where the complexity is maximal.
This article is part of a Special Issue entitled: The emerging dynamic view of proteins: Protein plasticity in allostery, evolution and self-assembly.
|
40
|
Bywater RP. On dating stages in prebiotic chemical evolution. Naturwissenschaften 2012; 99:167-76. [DOI: 10.1007/s00114-012-0892-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 12/02/2011] [Revised: 01/28/2012] [Accepted: 01/30/2012] [Indexed: 01/08/2023]
|
41
|
Holliday GL, Fischer JD, Mitchell JBO, Thornton JM. Characterizing the complexity of enzymes on the basis of their mechanisms and structures with a bio-computational analysis. FEBS J 2011; 278:3835-45. [PMID: 21605342 PMCID: PMC3258480 DOI: 10.1111/j.1742-4658.2011.08190.x] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Indexed: 11/28/2022]
Abstract
Enzymes are basically composed of 20 naturally occurring amino acids, yet they catalyse a dizzying array of chemical reactions, with regiospecificity and stereospecificity and under physiological conditions. In this review, we attempt to gain some understanding of these complex proteins, from the chemical versatility of the catalytic toolkit, including the use of cofactors (both metal ions and organic molecules), to the complex mapping of reactions to proteins (which is rarely one-to-one), and finally the structural complexity of enzymes and their active sites, often involving multidomain or multisubunit assemblies. This work highlights how the enzymes that we see today reflect millions of years of evolution, involving de novo design followed by exquisite regulation and modulation to create optimal fitness for life.
|
42
|
Liu X, Lv B, Guo W. The size distribution of protein families within different types of folds. Biochem Biophys Res Commun 2011; 406:218-22. [PMID: 21303659 DOI: 10.1016/j.bbrc.2011.02.020] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 01/13/2011] [Accepted: 02/03/2011] [Indexed: 11/19/2022]
Abstract
It is well known that the structure is currently available only for a small fraction of known protein sequences. It is urgent to discover the important features of known protein sequences based on present protein structures. Here, we report a study on the size distribution of protein families within different types of folds. The fold of a protein means the global arrangement of its main secondary structures, both in terms of their relative orientations and their topological connections, which specify a certain biochemical and biophysical aspect. We first search protein families in the structural database SCOP against the sequence-based database Pfam, and acquire a pool of corresponding Pfam families whose structures can be deemed as known. This pool of Pfam families is called the sample space for short. Then the size distributions of protein families involving the sample space, the Pfam database and the SCOP database are obtained. The results indicate that the size distributions of protein families under different kinds of folds follow a similar power law. In particular, the largest families scatter evenly in different kinds of folds. This may help better understand the relationship of protein sequence, structure and function. We also show that the total of proteins with known structures can be considered a random sample from the whole space of protein sequences, which is an essential but unsettled assumption for related predictions, such as estimating the number of protein folds in nature. Finally, using a simple method, we conclude that about 2957 folds are needed to cover all Pfam families.
Affiliation(s)
- Xinsheng Liu
- Institute of Nanoscience, and Key Laboratory for Intelligent Nano Materials and Devices of Ministry of Education, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China.
|
43
|
Tanaka J, Doi N, Takashima H, Yanagawa H. Comparative characterization of random-sequence proteins consisting of 5, 12, and 20 kinds of amino acids. Protein Sci 2010; 19:786-95. [PMID: 20162614 DOI: 10.1002/pro.358] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Indexed: 12/22/2022]
Abstract
Screening of functional proteins from a random-sequence library has been used to evolve novel proteins in the field of evolutionary protein engineering. However, random-sequence proteins consisting of the 20 natural amino acids tend to aggregate, and the occurrence rate of functional proteins in a random-sequence library is low. From the viewpoint of the origin of life, it has been proposed that primordial proteins consisted of a limited set of amino acids that could have been abundantly formed early during chemical evolution. We have previously found that members of a random-sequence protein library constructed with five primitive amino acids show high solubility (Doi et al., Protein Eng Des Sel 2005;18:279-284). Although such a library is expected to be appropriate for finding functional proteins, the functionality may be limited, because they have no positively charged amino acid. Here, we constructed three libraries of 120-amino acid, random-sequence proteins using alphabets of 5, 12, and 20 amino acids by preselection using mRNA display (to eliminate sequences containing stop codons and frameshifts) and characterized and compared the structural properties of random-sequence proteins arbitrarily chosen from these libraries. We found that random-sequence proteins constructed with the 12-member alphabet (including five primitive amino acids and positively charged amino acids) have higher solubility than those constructed with the 20-member alphabet, though other biophysical properties are very similar in the two libraries. Thus, a library of moderate complexity constructed from 12 amino acids may be a more appropriate resource for functional screening than one constructed from 20 amino acids.
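To give a sense of scale for the three libraries compared in this abstract, the short sketch below computes the size of the theoretical sequence space for a 120-residue protein drawn from alphabets of 5, 12, and 20 amino acids. The chain length and alphabet sizes come from the abstract; the function name is illustrative, and the calculation assumes every position can hold any letter of the alphabet.

```python
import math

LENGTH = 120  # residues per random-sequence protein, as stated in the abstract

def log10_space_size(alphabet_size, length=LENGTH):
    """log10 of the number of distinct sequences, alphabet_size ** length."""
    return length * math.log10(alphabet_size)

for k in (5, 12, 20):
    print(f"{k:>2}-letter alphabet: about 10^{log10_space_size(k):.1f} sequences")
```

Even the reduced 5-letter alphabet spans a space far larger than any physical library, so all three libraries sample their sequence spaces extremely sparsely.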
Affiliation(s)
- Junko Tanaka
- Department of Biosciences and Informatics, Keio University, Yokohama 223-8522, Japan
|
44
|
Marliere P. The farther, the safer: a manifesto for securely navigating synthetic species away from the old living world. Systems and Synthetic Biology 2009; 3:77-84. [PMID: 19816802 PMCID: PMC2759432 DOI: 10.1007/s11693-009-9040-9] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.7] [Received: 05/21/2009] [Revised: 08/05/2009] [Accepted: 08/07/2009] [Indexed: 11/06/2022]
Abstract
Biotechnology has empirically established that it is easier to construct and evaluate variant genes and proteins than to account for the emergence and function of wild-type macromolecules. Systematizing this constructive approach, synthetic biology now promises to infer and assemble entirely novel genomes, cells and ecosystems. It is argued here that the theoretical and computational tools needed for this endeavor are missing altogether. However, such tools may not be required for diversifying organisms at the basic level of their chemical constitution by adding, substituting or removing elements and molecular components through directed evolution under selection. Most importantly, chemical diversification of life forms could be designed to block metabolic cross-feed and genetic cross-talk between synthetic and wild species and hence protect natural habitats and human health through novel types of containment.
|
45
|
Bywater RP. Membrane-spanning peptides and the origin of life. J Theor Biol 2009; 261:407-13. [PMID: 19679140 DOI: 10.1016/j.jtbi.2009.08.001] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 12/24/2008] [Revised: 07/31/2009] [Accepted: 08/04/2009] [Indexed: 11/29/2022]
Abstract
An explanation is given as to why membrane-spanning peptides must have been the first "information-rich" molecules in the development of life. These peptides are stabilised in a lipid bilayer membrane environment and they are preferentially made from the simplest, and likewise oldest, of the amino acids that survive today. Transmembrane peptides can exercise functions that are essential for biological systems such as signal transduction and material transport across membranes. More complex peptides possessing catalytic properties could later develop on either side of the membrane as independently folding functional units formed by extension of the protruding ends of the transmembrane peptides within an aqueous environment and thereby give rise to more of the functions that are necessary for life. But the membrane was the cradle for the development of the first information-rich biomolecules.
|