1
Abel DL. Selection in molecular evolution. Studies in History and Philosophy of Science 2024; 107:54-63. [PMID: 39137534 DOI: 10.1016/j.shpsa.2024.07.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2023] [Revised: 05/29/2024] [Accepted: 07/29/2024] [Indexed: 08/15/2024]
Abstract
Evolution requires selection. Molecular/chemical/preDarwinian evolution is no exception. One molecule must be selected over another for molecular evolution to occur and advance. Evolution, however, has no goal. The laws of physics have no utilitarian desire, intent or proficiency. Laws and constraints are blind to "usefulness." How then were potential multi-step processes anticipated, valued and pursued by inanimate nature? Can orchestration of formal systems be physico-chemically spontaneous? The purely physico-dynamic self-ordering of Chaos Theory and irreversible non-equilibrium thermodynamic "engines of disequilibria conversion" achieve neither orchestration nor formal organization. Natural selection is a passive and after-the-fact-of-life selection. Darwinian selection reduces to the differential survival and reproduction of the fittest already-living organisms. In the case of abiogenesis, selection had to be 1) Active, 2) Pre-Function, and 3) Efficacious. Selection had to take place at the molecular level prior to the existence of non-trivial functional processes. It could not have been passive or secondary. What naturalistic mechanisms might have been at play?
Affiliation(s)
- David Lynn Abel
- The Gene Emergence Project, Proto-BioCybernetics & Proto-Cellular Metabolomics, The Origin of Life Science Foundation, Inc., 14005 Youderian Drive, Bowie, MD, 20721-2225, USA.
2
Ardern Z. Alternative Reading Frames are an Underappreciated Source of Protein Sequence Novelty. J Mol Evol 2023; 91:570-580. [PMID: 37326679 DOI: 10.1007/s00239-023-10122-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2022] [Accepted: 05/31/2023] [Indexed: 06/17/2023]
Abstract
Protein-coding DNA sequences can be translated into completely different amino acid sequences if the nucleotide triplets used are shifted by a non-triplet amount on the same DNA strand or by translating codons from the opposite strand. Such "alternative reading frames" of protein-coding genes are a major contributor to the evolution of novel protein products. Recent studies demonstrating this include examples across the three domains of cellular life and in viruses. These sequences increase the number of trials potentially available for the evolutionary invention of new genes and also have unusual properties which may facilitate gene origin. There is evidence that the structure of the standard genetic code contributes to the features and gene-likeness of some alternative frame sequences. These findings have important implications across diverse areas of molecular biology, including for genome annotation, structural biology, and evolutionary genomics.
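The frame shifts described in this abstract can be illustrated with a short six-frame translation. This is a minimal sketch, not code from the paper: it assumes the standard genetic code, and the function names (`reverse_complement`, `translate`, `six_frames`) and the example DNA string are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (assumed table).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base T
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at `offset`; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(dna):
    """Translate the three forward and three reverse-strand reading frames."""
    opposite = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(opposite, offset)
    return frames


if __name__ == "__main__":
    # Hypothetical 12-nucleotide example; each frame yields a different peptide.
    for name, peptide in sorted(six_frames("ATGGCCAAGTAA").items()):
        print(name, peptide)
```

Shifting the start by one or two nucleotides, or reading the opposite strand, changes every codon, which is why the alternative frames encode unrelated amino acid sequences.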
3
Ding W, Nakai K, Gong H. Protein design via deep learning. Brief Bioinform 2022; 23:bbac102. [PMID: 35348602 PMCID: PMC9116377 DOI: 10.1093/bib/bbac102] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 12/16/2021] [Revised: 02/26/2022] [Accepted: 03/01/2022] [Indexed: 12/11/2022]
Abstract
Proteins with desired functions and properties are important in fields like nanotechnology and biomedicine. De novo protein design enables the production of previously unseen proteins from the ground up and is believed to be key to handling real social challenges. The recent introduction of deep learning into design methods has had a transformative influence and is expected to represent a promising and exciting future direction. In this review, we survey the major aspects of current advances in deep-learning-based design procedures and illustrate their novelty in comparison with conventional knowledge-based approaches through notable cases. We not only describe deep learning developments in structure-based protein design and direct sequence design, but also highlight recent applications of deep reinforcement learning in protein design. Future perspectives on design goals, challenges and opportunities are also comprehensively discussed.
Affiliation(s)
- Wenze Ding
- School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing 210044, China
- School of Future Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing 100084, China
- Kenta Nakai
- Institute of Medical Science, the University of Tokyo, Tokyo 1088639, Japan
- Haipeng Gong
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing 100084, China
4
Bitard-Feildel T. Navigating the amino acid sequence space between functional proteins using a deep learning framework. PeerJ Comput Sci 2021; 7:e684. [PMID: 34616884 PMCID: PMC8459775 DOI: 10.7717/peerj-cs.684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2021] [Accepted: 07/30/2021] [Indexed: 06/13/2023]
Abstract
MOTIVATION: Shedding light on the relationships between protein sequences and functions is a challenging task with many implications in protein evolution, disease understanding, and protein design. The mapping of protein sequence space to specific functions is, however, hard to comprehend due to its complexity. Generative models help to decipher complex systems thanks to their ability to learn and recreate data specificity. Applied to proteins, they can capture the sequence patterns associated with functions and point out important relationships between sequence positions. By learning these dependencies between sequences and functions, they can ultimately be used to generate new sequences and navigate through uncharted areas of molecular evolution. RESULTS: This study presents an Adversarial Auto-Encoder (AAE) approach, an unsupervised generative model, to generate new protein sequences. AAEs are tested on three protein families known for their multiple functions: the sulfatase, the HUP and the TPP families. Clustering results on the encoded sequences from the latent space computed by AAEs display a high level of homogeneity regarding the protein sequence functions. The study also reports and analyzes for the first time two sampling strategies, based on latent space interpolation and latent space arithmetic, to generate intermediate protein sequences sharing sequential properties of original sequences linked to known functional properties issued from different families and functions. Sequences generated by interpolation between latent space data points demonstrate the ability of the AAE to generalize and produce meaningful biological sequences from an evolutionarily uncharted area of the biological sequence space. Finally, 3D structure models computed by comparative modelling using generated sequences and templates of different sub-families point to the ability of latent space arithmetic to successfully transfer protein sequence properties linked to function between different sub-families. All in all, this study confirms the ability of deep learning frameworks to model biological complexity and bring new tools to explore amino acid sequence and functional spaces.
Affiliation(s)
- Tristan Bitard-Feildel
- IBPS, CNRS, Laboratoire de Biologie Computationnelle et Quantitative, Sorbonne Université, Paris, France
- Institut des Sciences du Calcul et des Données (ISCD), Sorbonne Université, Paris, France
5
Ferguson AL, Ranganathan R. 100th Anniversary of Macromolecular Science Viewpoint: Data-Driven Protein Design. ACS Macro Lett 2021; 10:327-340. [PMID: 35549066 DOI: 10.1021/acsmacrolett.0c00885] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Indexed: 12/13/2022]
Abstract
The design of synthetic proteins with the desired function is a long-standing goal in biomolecular science, with broad applications in biochemical engineering, agriculture, medicine, and public health. Rational de novo design and experimental directed evolution have achieved remarkable successes but are challenged by the requirement to find functional "needles" in the vast "haystack" of protein sequence space. Data-driven models for fitness landscapes provide a predictive map between protein sequence and function and can prospectively identify functional candidates for experimental testing to greatly improve the efficiency of this search. This Viewpoint reviews the applications of machine learning and, in particular, deep learning as part of data-driven protein engineering platforms. We highlight recent successes, review promising computational methodologies, and provide an outlook on future challenges and opportunities. The article is written for a broad audience comprising both polymer and protein scientists and computer and data scientists interested in an up-to-date review of recent innovations and opportunities in this rapidly evolving field.
Affiliation(s)
- Andrew L. Ferguson
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
- Rama Ranganathan
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
- Center for Physics of Evolving Systems, University of Chicago, Chicago, Illinois 60637, United States
- Biochemistry and Molecular Biology, University of Chicago, Chicago, Illinois 60637, United States
6
Repecka D, Jauniskis V, Karpus L, Rembeza E, Rokaitis I, Zrimec J, Poviloniene S, Laurynenas A, Viknander S, Abuajwa W, Savolainen O, Meskys R, Engqvist MKM, Zelezniak A. Expanding functional protein sequence spaces using generative adversarial networks. Nat Mach Intell 2021. [DOI: 10.1038/s42256-021-00310-5] [Citation(s) in RCA: 63] [Impact Index Per Article: 21.0] [Indexed: 01/17/2023]
7
Thorvaldsen S, Hössjer O. Using statistical methods to model the fine-tuning of molecular machines and systems. J Theor Biol 2020; 501:110352. [PMID: 32505827 DOI: 10.1016/j.jtbi.2020.110352] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/06/2019] [Revised: 05/26/2020] [Accepted: 05/27/2020] [Indexed: 10/24/2022]
Abstract
Fine-tuning has received much attention in physics, and it states that the fundamental constants of physics are finely tuned to precise values for a rich chemistry and life permittance. It has not yet been applied in a broad manner to molecular biology. However, in this paper we argue that biological systems present fine-tuning at different levels, e.g. functional proteins, complex biochemical machines in living cells, and cellular networks. This paper describes molecular fine-tuning, how it can be used in biology, and how it challenges conventional Darwinian thinking. We also discuss the statistical methods underpinning fine-tuning and present a framework for such analysis.
Affiliation(s)
- Ola Hössjer
- Stockholm University, Dep. of Mathematics, Division of Mathematical Statistics, Sweden.
8
Chowdhury R, Maranas CD. From directed evolution to computational enzyme engineering: A review. AIChE J 2019. [DOI: 10.1002/aic.16847] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Indexed: 12/13/2022]
Affiliation(s)
- Ratul Chowdhury
- Department of Chemical Engineering, The Pennsylvania State University, University Park, Pennsylvania
- Costas D. Maranas
- Department of Chemical Engineering, The Pennsylvania State University, University Park, Pennsylvania
9
Rai J. Peptide and protein mimetics by retro and retroinverso analogs. Chem Biol Drug Des 2019; 93:724-736. [PMID: 30582286 DOI: 10.1111/cbdd.13472] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 09/03/2018] [Revised: 12/10/2018] [Accepted: 12/16/2018] [Indexed: 12/19/2022]
Abstract
Retroinverso analog of a natural polypeptide can sometimes mimic the structure and function of the natural peptide. The additional advantage of using retroinverso analog is that it is resistant to proteolysis. The retroinverso analogs have peptide sequence in reverse direction with respect to natural peptide and also have chirality of amino acid inverted from L to D. The D amino acids cannot be recognized by common proteases of the body; therefore, these peptides will not be degraded easily and have a longer-lasting effect as vaccine and inhibitor drugs. There have been many contested propositions about the geometric relationship between a peptide and its retro, inverso, or retroinverso analog. A retroinverso analog sometimes fails to adopt the structure that can mimic the function of the natural peptide. In such cases, partial retroinverso analog and other modifications can help in achieving the desired structure and function. Here, we review the theory, major experimental attempts, prediction methods, and alternative strategies related to retroinverso peptidomimetics.
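The three analog types named in this abstract differ only in sequence direction and residue chirality, which a few string operations can make concrete. This is a sketch under stated assumptions: the input is an all-L peptide in uppercase one-letter code, D-residues are written in lowercase (a common notation, assumed here), and the function names and the example peptide are hypothetical.

```python
ACHIRAL = {"G"}  # glycine has no chiral alpha-carbon, so inversion leaves it unchanged


def retro(peptide):
    """Reverse the residue order; chirality is untouched."""
    return peptide[::-1]


def inverso(peptide):
    """Invert every chiral residue from L (uppercase) to D (lowercase)."""
    return "".join(r if r in ACHIRAL else r.lower() for r in peptide)


def retro_inverso(peptide):
    """Reverse the sequence and invert chirality, as described in the abstract."""
    return inverso(retro(peptide))


if __name__ == "__main__":
    parent = "GRKA"  # hypothetical all-L peptide
    print(retro(parent))          # AKRG
    print(inverso(parent))        # Grka
    print(retro_inverso(parent))  # akrG
```

The sketch only manipulates notation; whether a given retro-inverso analog actually mimics the parent structure is the experimental question the review addresses.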
10
Lipinski CA. Rule of five in 2015 and beyond: Target and ligand structural limitations, ligand chemistry structure and drug discovery project decisions. Adv Drug Deliv Rev 2016; 101:34-41. [PMID: 27154268 DOI: 10.1016/j.addr.2016.04.029] [Citation(s) in RCA: 275] [Impact Index Per Article: 34.4] [Received: 12/29/2015] [Revised: 04/22/2016] [Accepted: 04/27/2016] [Indexed: 12/13/2022]
Abstract
The rule of five (Ro5), based on physicochemical profiles of phase II drugs, is consistent with structural limitations in protein targets and the drug target ligands. Three of four parameters in Ro5 are fundamental to the structure of both target and drug binding sites. The chemical structure of the drug ligand depends on the ligand chemistry and design philosophy. Two extremes of chemical structure and design philosophy exist; ligands constructed in the medicinal chemistry synthesis laboratory without input from natural selection and natural product (NP) metabolites biosynthesized based on evolutionary selection. Exceptions to Ro5 are found mostly among NPs. Chemistry chameleon-like behavior of some NPs due to intra-molecular hydrogen bonding as exemplified by cyclosporine A is a strong contributor to NP Ro5 outliers. The fragment derived, drug Navitoclax is an example of the extensive expertise, resources, time and key decisions required for the rare discovery of a non-NP Ro5 outlier.
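A Ro5 check over the four parameters can be sketched as below. The abstract does not list the cutoffs, so the thresholds here are the commonly cited ones (molecular weight over 500 Da, calculated logP over 5, more than 5 hydrogen-bond donors, more than 10 hydrogen-bond acceptors) and should be read as an assumption; the `Ligand` class, the function name, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ligand:
    mol_weight: float  # Da
    logp: float        # calculated octanol-water partition coefficient
    h_donors: int      # hydrogen-bond donors
    h_acceptors: int   # hydrogen-bond acceptors


def ro5_violations(ligand):
    """List which of the four commonly cited Ro5 cutoffs the ligand exceeds."""
    checks = {
        "mol_weight > 500": ligand.mol_weight > 500,
        "logP > 5": ligand.logp > 5,
        "H-bond donors > 5": ligand.h_donors > 5,
        "H-bond acceptors > 10": ligand.h_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]


if __name__ == "__main__":
    # Illustrative values only, not measurements of any real compound.
    small_molecule = Ligand(mol_weight=350.0, logp=2.1, h_donors=2, h_acceptors=5)
    large_outlier = Ligand(mol_weight=1200.0, logp=3.0, h_donors=5, h_acceptors=12)
    print(ro5_violations(small_molecule))
    print(ro5_violations(large_outlier))
```

A static count like this cannot capture the conformation-dependent "chameleon" behavior the abstract attributes to some natural products, which is one reason such compounds appear as outliers.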
11
Quantifying protein sequences with reference to the genetic code. J Theor Biol 2015; 372:39-46. [DOI: 10.1016/j.jtbi.2015.02.017] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/08/2014] [Revised: 01/28/2015] [Accepted: 02/16/2015] [Indexed: 11/21/2022]
12
Currin A, Swainston N, Day PJ, Kell DB. Synthetic biology for the directed evolution of protein biocatalysts: navigating sequence space intelligently. Chem Soc Rev 2015; 44:1172-239. [PMID: 25503938 PMCID: PMC4349129 DOI: 10.1039/c4cs00351a] [Citation(s) in RCA: 251] [Impact Index Per Article: 27.9] [Received: 10/20/2014] [Indexed: 12/21/2022]
Abstract
The amino acid sequence of a protein affects both its structure and its function. Thus, the ability to modify the sequence, and hence the structure and activity, of individual proteins in a systematic way, opens up many opportunities, both scientifically and (as we focus on here) for exploitation in biocatalysis. Modern methods of synthetic biology, whereby increasingly large sequences of DNA can be synthesised de novo, allow an unprecedented ability to engineer proteins with novel functions. However, the number of possible proteins is far too large to test individually, so we need means for navigating the 'search space' of possible protein sequences efficiently and reliably in order to find desirable activities and other properties. Enzymologists distinguish binding (Kd) and catalytic (kcat) steps. In a similar way, judicious strategies have blended design (for binding, specificity and active site modelling) with the more empirical methods of classical directed evolution (DE) for improving kcat (where natural evolution rarely seeks the highest values), especially with regard to residues distant from the active site and where the functional linkages underpinning enzyme dynamics are both unknown and hard to predict. Epistasis (where the 'best' amino acid at one site depends on that or those at others) is a notable feature of directed evolution. The aim of this review is to highlight some of the approaches that are being developed to allow us to use directed evolution to improve enzyme properties, often dramatically. We note that directed evolution differs in a number of ways from natural evolution, including in particular the available mechanisms and the likely selection pressures. Thus, we stress the opportunities afforded by techniques that enable one to map sequence to (structure and) activity in silico, as an effective means of modelling and exploring protein landscapes. 
Because known landscapes may be assessed and reasoned about as a whole, simultaneously, this offers opportunities for protein improvement not readily available to natural evolution on rapid timescales. Intelligent landscape navigation, informed by sequence-activity relationships and coupled to the emerging methods of synthetic biology, offers scope for the development of novel biocatalysts that are both highly active and robust.
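Two points in this abstract lend themselves to quick arithmetic: the size of the search space and the definition of epistasis. The sketch below assumes the 20 canonical amino acids and an additive fitness scale; the function names and the fitness values are hypothetical and purely illustrative.

```python
AMINO_ACID_COUNT = 20  # canonical amino acids (assumption for this sketch)


def sequence_space_size(length):
    """Number of distinct sequences of the given length."""
    return AMINO_ACID_COUNT ** length


def single_mutant_neighbours(length):
    """Number of sequences exactly one substitution away from a given sequence."""
    return (AMINO_ACID_COUNT - 1) * length


def pairwise_epistasis(f_wt, f_a, f_b, f_ab):
    """Deviation of the double mutant from additivity on an additive fitness scale.

    Zero means the two substitutions act independently; a non-zero value
    means the effect of one depends on the presence of the other.
    """
    return f_ab - f_a - f_b + f_wt


if __name__ == "__main__":
    print(sequence_space_size(100))       # far too many to test individually
    print(single_mutant_neighbours(100))  # the local neighbourhood is small by comparison
    # Toy fitness values: the double mutant is better than the single effects predict.
    print(pairwise_epistasis(f_wt=1.0, f_a=2.0, f_b=3.0, f_ab=6.0))
```

The contrast between the two counts is what motivates guided navigation: each round of mutagenesis samples a tiny local neighbourhood of a space that cannot be enumerated.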
Affiliation(s)
- Andrew Currin
- Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
- School of Chemistry, The University of Manchester, Manchester M13 9PL, UK
- Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
- Neil Swainston
- Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
- Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
- School of Computer Science, The University of Manchester, Manchester M13 9PL, UK
- Philip J. Day
- Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
- Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
- Faculty of Medical and Human Sciences, The University of Manchester, Manchester M13 9PT, UK
- Douglas B. Kell
- Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
- School of Chemistry, The University of Manchester, Manchester M13 9PL, UK
- Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
13
Molecular Dynamics Simulations for the Ranking, Evaluation, and Refinement of Computationally Designed Proteins. Methods Enzymol 2013; 523:145-70. [DOI: 10.1016/b978-0-12-394292-0.00007-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 02/15/2023]
14
Abstract
Knowing how protein sequence maps to function (the "fitness landscape") is critical for understanding protein evolution as well as for engineering proteins with new and useful properties. We demonstrate that the protein fitness landscape can be inferred from experimental data, using Gaussian processes, a Bayesian learning technique. Gaussian process landscapes can model various protein sequence properties, including functional status, thermostability, enzyme activity, and ligand binding affinity. Trained on experimental data, these models achieve unrivaled quantitative accuracy. Furthermore, the explicit representation of model uncertainty allows for efficient searches through the vast space of possible sequences. We develop and test two protein sequence design algorithms motivated by Bayesian decision theory. The first one identifies small sets of sequences that are informative about the landscape; the second one identifies optimized sequences by iteratively improving the Gaussian process model in regions of the landscape that are predicted to be optimized. We demonstrate the ability of Gaussian processes to guide the search through protein sequence space by designing, constructing, and testing chimeric cytochrome P450s. These algorithms allowed us to engineer active P450 enzymes that are more thermostable than any previously made by chimeragenesis, rational design, or directed evolution.
15
Romero PA, Arnold FH. Random field model reveals structure of the protein recombinational landscape. PLoS Comput Biol 2012; 8:e1002713. [PMID: 23055915 PMCID: PMC3464211 DOI: 10.1371/journal.pcbi.1002713] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 06/05/2012] [Accepted: 08/03/2012] [Indexed: 11/28/2022]
Abstract
We are interested in how intragenic recombination contributes to the evolution of proteins and how this mechanism complements and enhances the diversity generated by random mutation. Experiments have revealed that proteins are highly tolerant to recombination with homologous sequences (mutation by recombination is conservative); more surprisingly, they have also shown that homologous sequence fragments make largely additive contributions to biophysical properties such as stability. Here, we develop a random field model to describe the statistical features of the subset of protein space accessible by recombination, which we refer to as the recombinational landscape. This model shows quantitative agreement with experimental results compiled from eight libraries of proteins that were generated by recombining gene fragments from homologous proteins. The model reveals a recombinational landscape that is highly enriched in functional sequences, with properties dominated by a large-scale additive structure. It also quantifies the relative contributions of parent sequence identity, crossover locations, and protein fold to the tolerance of proteins to recombination. Intragenic recombination explores a unique subset of sequence space that promotes rapid molecular diversification and functional adaptation. Mutation and recombination are the primary sources of genetic variation in evolving populations. The relative benefit of these two diversification mechanisms and how they complement each other has been a long-standing question in evolutionary biology. While it is clear what types of genetic diversity these two mechanisms can create, a significant challenge is relating these sequence changes to changes in fitness. The fitness landscape, which describes this mapping from genotype to phenotype, is extraordinarily complex and defined over an incomprehensibly large space of sequences. 
Here, we develop a model of the landscape that relies not on the details of this mapping, but rather on the statistical relationships between sequences. By studying the expected values of landscape properties, we can gain insights into the structure of the landscape that are independent of the details of how genotype dictates phenotype. We use this random field model to understand how recombination explores a functionally enriched and diverse subset of protein sequence space.
Affiliation(s)
- Frances H. Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California, United States of America
16
Abstract
The best approach for creating libraries of functional proteins with large numbers of nondisruptive amino acid substitutions is protein recombination, in which structurally related polypeptides are swapped among homologous proteins. Unfortunately, as more distantly related proteins are recombined, the fraction of variants having a disrupted structure increases. One way to enrich the fraction of folded and potentially interesting chimeras in these libraries is to use computational algorithms to anticipate which structural elements can be swapped without disturbing the integrity of a protein's structure. Herein, we describe how the algorithm Schema uses the sequences and structures of the parent proteins recombined to predict the structural disruption of chimeras, and we outline how dynamic programming can be used to find libraries with a range of amino acid substitution levels that are enriched in variants with low Schema disruption.
17
Ferrada E, Wagner A. Evolutionary innovations and the organization of protein functions in genotype space. PLoS One 2010; 5:e14172. [PMID: 21152394 PMCID: PMC2994758 DOI: 10.1371/journal.pone.0014172] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Received: 07/21/2010] [Accepted: 10/28/2010] [Indexed: 11/18/2022]
Abstract
The organization of protein structures in protein genotype space is well studied. The same does not hold for protein functions, whose organization is important to understand how novel protein functions can arise through blind evolutionary searches of sequence space. In systems other than proteins, two organizational features of genotype space facilitate phenotypic innovation. The first is that genotypes with the same phenotype form vast and connected genotype networks. The second is that different neighborhoods in this space contain different novel phenotypes. We here characterize the organization of enzymatic functions in protein genotype space, using a data set of more than 30,000 proteins with known structure and function. We show that different neighborhoods of genotype space contain proteins with very different functions. This property both facilitates evolutionary innovation through exploration of a genotype network, and it constrains the evolution of novel phenotypes. The phenotypic diversity of different neighborhoods is caused by the fact that some functions can be carried out by multiple structures. We show that the space of protein functions is not homogeneous, and different genotype neighborhoods tend to contain a different spectrum of functions, whose diversity increases with increasing distance of these neighborhoods in sequence space. Whether a protein with a given function can evolve specific new functions is thus determined by the protein's location in sequence space.
Affiliation(s)
- Evandro Ferrada
- Department of Biochemistry, University of Zurich, Zurich, Switzerland.
18
Abel DL. The Universal Plausibility Metric (UPM) & Principle (UPP). Theor Biol Med Model 2009; 6:27. [PMID: 19958539 PMCID: PMC2796651 DOI: 10.1186/1742-4682-6-27] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 09/29/2009] [Accepted: 12/03/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND Mere possibility is not an adequate basis for asserting scientific plausibility. A precisely defined universal bound is needed beyond which the assertion of plausibility, particularly in life-origin models, can be considered operationally falsified. But can something so seemingly relative and subjective as plausibility ever be quantified? Amazingly, the answer is, "Yes." A method of objectively measuring the plausibility of any chance hypothesis (The Universal Plausibility Metric [UPM]) is presented. A numerical inequality is also provided whereby any chance hypothesis can be definitively falsified when its UPM metric of xi is < 1 (The Universal Plausibility Principle [UPP]). Both UPM and UPP pre-exist and are independent of any experimental design and data set. CONCLUSION No low-probability hypothetical plausibility assertion should survive peer-review without subjection to the UPP inequality standard of formal falsification (xi < 1).
Affiliation(s)
- David L Abel
- Department of ProtoBioCybernetics/ProtoBioSemiotics, The Gene Emergence Project of The Origin of Life Science Foundation, Inc, 113-120 Hedgewood Dr, Greenbelt, MD 20770-1610, USA.
19
Abstract
Directed evolution circumvents our profound ignorance of how a protein's sequence encodes its function by using iterative rounds of random mutation and artificial selection to discover new and useful proteins. Proteins can be tuned to adapt to new functions or environments by simple adaptive walks involving small numbers of mutations. Directed evolution studies have shown how rapidly some proteins can evolve under strong selection pressures and, because the entire 'fossil record' of evolutionary intermediates is available for detailed study, they have provided new insight into the relationship between sequence and function. Directed evolution has also shown how mutations that are functionally neutral can set the stage for further adaptation.
Affiliation(s)
- Frances H. Arnold
- Dick and Barbara Dickinson Professor of Chemical Engineering and Biochemistry, Division of Chemistry and Chemical Engineering, 210-41, California Institute of Technology, Pasadena, CA 91125 USA, Tel: (626) 395-4162
20
Patel SC, Bradley LH, Jinadasa SP, Hecht MH. Cofactor binding and enzymatic activity in an unevolved superfamily of de novo designed 4-helix bundle proteins. Protein Sci 2009; 18:1388-400. [PMID: 19544578 PMCID: PMC2775209 DOI: 10.1002/pro.147] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.3] [Received: 03/01/2009] [Revised: 04/12/2009] [Accepted: 04/13/2009] [Indexed: 11/09/2022]
Abstract
To probe the potential for enzymatic activity in unevolved amino acid sequence space, we created a combinatorial library of de novo 4-helix bundle proteins. This collection of novel proteins can be considered an "artificial superfamily" of helical bundles. The superfamily of 102-residue proteins was designed using binary patterning of polar and nonpolar residues, and expressed in Escherichia coli from a library of synthetic genes. Sequences from the library were screened for a range of biological functions including heme binding and peroxidase, esterase, and lipase activities. Proteins exhibiting these functions were purified and characterized biochemically. The majority of de novo proteins from this superfamily bound the heme cofactor, and a sizable fraction of the proteins showed activity significantly above background for at least one of the tested enzymatic activities. Moreover, several of the designed 4-helix bundle proteins showed activity in all of the assays, thereby demonstrating the functional promiscuity of unevolved proteins. These studies reveal that de novo proteins, which have neither been designed for function nor subjected to evolutionary pressure (either in vivo or in vitro), can provide rudimentary activities and serve as a "feedstock" for evolution.
Affiliation(s)
- Shona C Patel
- Department of Chemical Engineering, Princeton University, Princeton, New Jersey 08544
- Luke H Bradley
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544
- Sayuri P Jinadasa
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544
- Michael H Hecht
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544
|
21
|
Dryden DTF, Thomson AR, White JH. How much of protein sequence space has been explored by life on Earth? J R Soc Interface 2008; 5:953-6. [PMID: 18426772 PMCID: PMC2459213 DOI: 10.1098/rsif.2008.0085] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.9] [Indexed: 11/16/2022] Open
Abstract
We suggest that the vastness of protein sequence space is actually completely explorable during the populating of the Earth by life by considering upper and lower limits for the number of organisms, genome size, mutation rate and the number of functionally distinct classes of amino acids. We conclude that rather than life having explored only an infinitesimally small part of sequence space in the last 4 Gyr, it is instead quite plausible for all of functional protein sequence space to have been explored and that furthermore, at the molecular level, there is no role for contingency.
Affiliation(s)
- David T F Dryden
- School of Chemistry, University of Edinburgh, The King's Buildings, Edinburgh EH9 3JJ, UK.
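The abstract above turns on how the size of sequence space depends on the number of functionally distinct amino acid classes. A minimal sketch of that arithmetic follows; the chain length of 100 and the class counts 20, 4, and 2 are illustrative assumptions, not values taken from the paper, and `log10_sequence_space` is a hypothetical helper name.

```python
import math


def log10_sequence_space(length, alphabet_size):
    """log10 of the number of distinct sequences: alphabet_size ** length."""
    return length * math.log10(alphabet_size)


# Assumed chain length of 100 residues; the class counts are illustrative only.
for classes in (20, 4, 2):
    exponent = log10_sequence_space(100, classes)
    print(f"{classes:>2} classes: about 10^{exponent:.1f} sequences")
```

Working in log10 avoids printing very large integers and makes the effect of collapsing the alphabet easy to compare across cases.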
|
22
|
Stylus: a system for evolutionary experimentation based on a protein/proteome model with non-arbitrary functional constraints. PLoS One 2008; 3:e2246. [PMID: 18523658 PMCID: PMC2405935 DOI: 10.1371/journal.pone.0002246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2008] [Accepted: 04/15/2008] [Indexed: 11/28/2022] Open
Abstract
The study of protein evolution is complicated by the vast size of protein sequence space, the huge number of possible protein folds, and the extraordinary complexity of the causal relationships between protein sequence, structure, and function. Much simpler model constructs may therefore provide an attractive complement to experimental studies in this area. Lattice models, which have long been useful in studies of protein folding, have found increasing use here. However, while these models incorporate actual sequences and structures (albeit non-biological ones), they incorporate no actual functions—relying instead on largely arbitrary structural criteria as a proxy for function. In view of the central importance of function to evolution, and the impossibility of incorporating real functional constraints without real function, it is important that protein-like models be developed around real structure–function relationships. Here we describe such a model and introduce open-source software that implements it. The model is based on the structure–function relationship in written language, where structures are two-dimensional ink paths and functions are the meanings that result when these paths form legible characters. To capture something like the hierarchical complexity of protein structure, we use the traditional characters of Chinese origin. Twenty coplanar vectors, encoded by base triplets, act like amino acids in building the character forms. This vector-world model captures many aspects of real proteins, including life-size sequences, a life-size structural repertoire, a realistic genetic code, secondary, tertiary, and quaternary structure, structural domains and motifs, operon-like genetic structures, and layered functional complexity up to a level resembling bacterial genomes and proteomes. Stylus is a full-featured implementation of the vector world for Unix systems. 
To demonstrate the utility of Stylus, we generated a sample set of homologous vector proteins by evolving successive lines from a single starting gene. These homologues show sequence and structure divergence resembling those of natural homologues in many respects, suggesting that the system may be sufficiently life-like for informative comparison to biology.
|
23
|
Rao AG. The outlook for protein engineering in crop improvement. PLANT PHYSIOLOGY 2008; 147:6-12. [PMID: 18443101 PMCID: PMC2330291 DOI: 10.1104/pp.108.117929] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 02/16/2008] [Accepted: 03/10/2008] [Indexed: 05/26/2023]
Affiliation(s)
- A Gururaj Rao
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011, USA.
|
24
|
Leisola M, Turunen O. Protein engineering: opportunities and challenges. Appl Microbiol Biotechnol 2007; 75:1225-32. [PMID: 17404726 DOI: 10.1007/s00253-007-0964-2] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.0] [Received: 02/28/2007] [Revised: 03/20/2007] [Accepted: 03/21/2007] [Indexed: 11/26/2022]
Abstract
The extraordinary properties of natural proteins demonstrate that life-like protein engineering is both achievable and valuable. Rapid progress and impressive results have been made towards this goal using rational design and random techniques or a combination of both. However, we still do not have a general theory on how to specify a structure that is suited to a target function nor can we specify a sequence that folds to a target structure. There is also overreliance on the Darwinian blind search to obtain practical results. In the long run, random methods cannot replace insight in constructing life-like proteins. For the near future, however, in enzyme development, we need to rely on a combination of both.
Affiliation(s)
- Matti Leisola
- Laboratory of Bioprocess Engineering, Helsinki University of Technology, P.O. Box 6100, 02015 HUT, Espoo, Finland.
|
25
|
Otey CR, Landwehr M, Endelman JB, Hiraga K, Bloom JD, Arnold FH. Structure-guided recombination creates an artificial family of cytochromes P450. PLoS Biol 2006; 4:e112. [PMID: 16594730 PMCID: PMC1431580 DOI: 10.1371/journal.pbio.0040112] [Citation(s) in RCA: 103] [Impact Index Per Article: 5.7] [Received: 12/28/2005] [Accepted: 02/09/2006] [Indexed: 11/19/2022] Open
Abstract
Creating artificial protein families affords new opportunities to explore the determinants of structure and biological function free from many of the constraints of natural selection. We have created an artificial family comprising 3,000 P450 heme proteins that correctly fold and incorporate a heme cofactor by recombining three cytochromes P450 at seven crossover locations chosen to minimize structural disruption. Members of this protein family differ from any known sequence at an average of 72 and by as many as 109 amino acids. Most (>73%) of the properly folded chimeric P450 heme proteins are catalytically active peroxygenases; some are more thermostable than the parent proteins. A multiple sequence alignment of 955 chimeras, including both folded and not, is a valuable resource for sequence-structure-function studies. Logistic regression analysis of the multiple sequence alignment identifies key structural contributions to cytochrome P450 heme incorporation and peroxygenase activity and suggests possible structural differences between parents CYP102A1 and CYP102A2.
Affiliation(s)
- Christopher R Otey
- Biochemistry and Molecular Biophysics, California Institute of Technology, Pasadena, California, United States of America
- Marco Landwehr
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California, United States of America
- Jeffrey B Endelman
- Bioengineering, California Institute of Technology, Pasadena, California, United States of America
- Kaori Hiraga
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California, United States of America
- Jesse D Bloom
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California, United States of America
- Frances H Arnold
- Biochemistry and Molecular Biophysics, California Institute of Technology, Pasadena, California, United States of America
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California, United States of America
- Bioengineering, California Institute of Technology, Pasadena, California, United States of America
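The recombination scheme in the abstract above (three parents, seven crossover locations) implies eight sequence blocks, each of which can come from any parent. Assuming every block is chosen independently, the upper bound on distinct chimeras is 3^8 = 6561. The sketch below enumerates that space with placeholder parents; the 16-character strings, the crossover positions, and the `chimeras` helper are hypothetical and stand in for real aligned P450 sequences.

```python
from itertools import product


def chimeras(parents, crossovers):
    """Yield every chimera formed by taking each block from any one parent.

    parents    : aligned sequences of equal length
    crossovers : sorted interior indices at which the sequences are split
    """
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parents must be aligned to the same length")
    bounds = [0, *crossovers, length]
    blocks = list(zip(bounds, bounds[1:]))  # (start, end) of each block
    for choice in product(range(len(parents)), repeat=len(blocks)):
        yield "".join(parents[p][a:b] for p, (a, b) in zip(choice, blocks))


# Placeholder parents (assumption): not real cytochrome P450 sequences.
parents = ["A" * 16, "B" * 16, "C" * 16]
crossovers = [2, 4, 6, 8, 10, 12, 14]  # seven crossovers give eight blocks
library = list(chimeras(parents, crossovers))
print(len(library))  # 3**8 = 6561
```

The count is a combinatorial ceiling only; which chimeras fold and bind heme is an experimental question that the enumeration does not address.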
|
26
|
Bloom JD, Silberg JJ, Wilke CO, Drummond DA, Adami C, Arnold FH. Thermodynamic prediction of protein neutrality. Proc Natl Acad Sci U S A 2005; 102:606-11. [PMID: 15644440 PMCID: PMC545518 DOI: 10.1073/pnas.0406744102] [Citation(s) in RCA: 261] [Impact Index Per Article: 13.7] [Indexed: 11/18/2022] Open
Abstract
We present a simple theory that uses thermodynamic parameters to predict the probability that a protein retains the wild-type structure after one or more random amino acid substitutions. Our theory predicts that for large numbers of substitutions the probability that a protein retains its structure will decline exponentially with the number of substitutions, with the severity of this decline determined by properties of the structure. Our theory also predicts that a protein can gain extra robustness to the first few substitutions by increasing its thermodynamic stability. We validate our theory with simulations on lattice protein models and by showing that it quantitatively predicts previously published experimental measurements on subtilisin and our own measurements on variants of TEM1 beta-lactamase. Our work unifies observations about the clustering of functional proteins in sequence space, and provides a basis for interpreting the response of proteins to substitutions in protein engineering applications.
Affiliation(s)
- Jesse D Bloom
- Division of Chemistry and Chemical Engineering 210-41, California Institute of Technology, Pasadena, CA 91125, USA.
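The exponential decline predicted in the abstract above can be illustrated numerically. The sketch assumes the simplest form, P(m) = nu^m, where nu is a hypothetical average probability that a single substitution leaves the structure intact; the abstract gives neither this formula nor any parameter values, so the function names and the numbers are assumptions made for illustration.

```python
import math


def retention_probability(m, nu):
    """Assumed model: probability of retaining structure after m substitutions."""
    if not 0.0 < nu <= 1.0:
        raise ValueError("nu must lie in (0, 1]")
    return nu ** m


def fit_nu(ms, fractions):
    """Estimate nu by least squares on log(fraction) = m * log(nu), through the origin."""
    num = sum(m * math.log(f) for m, f in zip(ms, fractions))
    den = sum(m * m for m in ms)
    return math.exp(num / den)


# Hypothetical measurements: fraction of variants still folded at each m.
ms = [1, 2, 3, 4, 5]
fractions = [retention_probability(m, 0.6) for m in ms]
print(fit_nu(ms, fractions))
```

A pure exponential is a large-m approximation here: the abstract also predicts extra robustness to the first few substitutions in thermodynamically stable proteins, which this one-parameter form does not capture.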
|