1
Zhu Z, Hu Q, Fu Y, Tong Y, Zhou Z. Design and Evolution of an Enzyme for the Asymmetric Michael Addition of Cyclic Ketones to Nitroolefins by Enamine Catalysis. Angew Chem Int Ed Engl 2024; 63:e202404312. [PMID: 38783596 DOI: 10.1002/anie.202404312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2024] [Revised: 05/01/2024] [Accepted: 05/23/2024] [Indexed: 05/25/2024]
Abstract
Consistent introduction of novel enzymes is required for developing efficient biocatalysts for challenging biotransformations. Absorbing catalytic modes from organocatalysis may be fruitful for designing new-to-nature enzymes with novel functions. Herein we report a newly designed artificial enzyme harboring a catalytic pyrrolidine residue that catalyzes the asymmetric Michael addition of cyclic ketones to nitroolefins through enamine activation with high efficiency. Diverse chiral γ-nitro cyclic ketones with two stereocenters were efficiently prepared with excellent stereoselectivity (up to 97 % e.e., >20 : 1 d.r.) and good yield (up to 86 %). This work provides an efficient biocatalytic strategy for cyclic ketone functionalization, and highlights the usefulness of artificial enzymes for extending biocatalysis to further non-natural reactions.
Affiliation(s)
- Zhixi Zhu
  - School of Life Sciences and Health Engineering, Jiangnan University, Wuxi, 214122, China
- Qinru Hu
  - School of Life Sciences and Health Engineering, Jiangnan University, Wuxi, 214122, China
- Yi Fu
  - School of Life Sciences and Health Engineering, Jiangnan University, Wuxi, 214122, China
- Yingjia Tong
  - School of Life Sciences and Health Engineering, Jiangnan University, Wuxi, 214122, China
- Zhi Zhou
  - School of Life Sciences and Health Engineering, Jiangnan University, Wuxi, 214122, China
2
Johnston KE, Almhjell PJ, Watkins-Dulaney EJ, Liu G, Porter NJ, Yang J, Arnold FH. A combinatorially complete epistatic fitness landscape in an enzyme active site. Proc Natl Acad Sci U S A 2024; 121:e2400439121. [PMID: 39074291 DOI: 10.1073/pnas.2400439121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2024] [Accepted: 06/17/2024] [Indexed: 07/31/2024] Open
Abstract
Protein engineering often targets binding pockets or active sites which are enriched in epistasis (nonadditive interactions between amino acid substitutions) and where the combined effects of multiple single substitutions are difficult to predict. Few existing sequence-fitness datasets capture epistasis at large scale, especially for enzyme catalysis, limiting the development and assessment of model-guided enzyme engineering approaches. We present here a combinatorially complete, 160,000-variant fitness landscape across four residues in the active site of an enzyme. Assaying the native reaction of a thermostable β-subunit of tryptophan synthase (TrpB) in a nonnative environment yielded a landscape characterized by significant epistasis and many local optima. These effects prevent simulated directed evolution approaches from efficiently reaching the global optimum. There is nonetheless wide variability in the effectiveness of different directed evolution approaches, which together provide experimental benchmarks for computational and machine learning workflows. The most-fit TrpB variants contain a substitution that is nearly absent in natural TrpB sequences, a result that conservation-based predictions would not capture. Thus, although fitness prediction using evolutionary data can enrich in more-active variants, these approaches struggle to identify and differentiate among the most-active variants, even for this near-native function. Overall, this work presents a large-scale testing ground for model-guided enzyme engineering and suggests that efficient navigation of epistatic fitness landscapes can be improved by advances in both machine learning and physical modeling.
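As a reader's aid, the 160,000-variant figure follows from placing any of the 20 canonical amino acids at each of four sites (20^4 = 160,000). The sketch below reproduces that count and illustrates one common way to quantify pairwise epistasis: the deviation of a double mutant from the additive expectation. The function names and the additive model on the raw fitness scale are assumptions made for illustration; the abstract does not state how the authors quantify epistasis.

```python
from itertools import product

# The 20 canonical amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def landscape_size(n_sites: int, alphabet: str = AMINO_ACIDS) -> int:
    """Number of variants in a combinatorially complete landscape."""
    return len(alphabet) ** n_sites


def all_variants(n_sites: int, alphabet: str = AMINO_ACIDS):
    """Lazily enumerate every residue combination across n_sites positions."""
    return ("".join(combo) for combo in product(alphabet, repeat=n_sites))


def pairwise_epistasis(f_wt, f_a, f_b, f_ab):
    """Hypothetical helper: double-mutant fitness minus the additive expectation.

    Zero means the two substitutions combine additively; a nonzero value
    indicates epistasis under this (assumed) additive model.
    """
    return f_ab - (f_a + f_b - f_wt)


print(landscape_size(4))  # 160000
```

A nonzero result from `pairwise_epistasis` means the effect of one substitution depends on the presence of the other, which is why combined effects are hard to predict from single-substitution data alone.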
Affiliation(s)
- Kadina E Johnston
  - Division of Biology and Bioengineering, California Institute of Technology, Pasadena, CA 91125
- Patrick J Almhjell
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125
- Ella J Watkins-Dulaney
  - Division of Biology and Bioengineering, California Institute of Technology, Pasadena, CA 91125
- Grace Liu
  - Division of Biology and Bioengineering, California Institute of Technology, Pasadena, CA 91125
- Nicholas J Porter
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125
- Jason Yang
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125
- Frances H Arnold
  - Division of Biology and Bioengineering, California Institute of Technology, Pasadena, CA 91125
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125
3
Lian X, Praljak N, Subramanian SK, Wasinger S, Ranganathan R, Ferguson AL. Deep-learning-based design of synthetic orthologs of SH3 signaling domains. Cell Syst 2024:S2405-4712(24)00204-7. [PMID: 39106868 DOI: 10.1016/j.cels.2024.07.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Revised: 11/12/2023] [Accepted: 07/22/2024] [Indexed: 08/09/2024]
Abstract
Evolution-based deep generative models represent an exciting direction in understanding and designing proteins. An open question is whether such models can learn specialized functional constraints that control fitness in specific biological contexts. Here, we examine the ability of generative models to produce synthetic versions of Src-homology 3 (SH3) domains that mediate signaling in the Sho1 osmotic stress response pathway of yeast. We show that a variational autoencoder (VAE) model produces artificial sequences that experimentally recapitulate the function of natural SH3 domains. More generally, the model organizes all fungal SH3 domains such that locality in the model latent space (but not simply locality in sequence space) enriches the design of synthetic orthologs and exposes non-obvious amino acid constraints distributed near and far from the SH3 ligand-binding site. The ability of generative models to design ortholog-like functions in vivo opens new avenues for engineering protein function in specific cellular contexts and environments.
Affiliation(s)
- Xinran Lian
  - Department of Chemistry, University of Chicago, Chicago, IL 60637, USA
- Nikša Praljak
  - Graduate Program in Biophysical Sciences, University of Chicago, Chicago, IL 60637, USA
- Subu K Subramanian
  - Department of Molecular and Cell Biology, California Institute for Quantitative Biosciences (QB3), and Howard Hughes Medical Institute, University of California, Berkeley, Berkeley, CA 94720, USA
- Sarah Wasinger
  - Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA
- Rama Ranganathan
  - Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA
  - Center for Physics of Evolving Systems and Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, IL 60637, USA
- Andrew L Ferguson
  - Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA
4
Freschlin CR, Fahlberg SA, Heinzelman P, Romero PA. Neural network extrapolation to distant regions of the protein fitness landscape. Nat Commun 2024; 15:6405. [PMID: 39080282 PMCID: PMC11289474 DOI: 10.1038/s41467-024-50712-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 07/13/2024] [Indexed: 08/02/2024] Open
Abstract
Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks' capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models' extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. We also find that implementing a simple ensemble of convolutional neural networks enables robust design of high-performing variants in the local landscape. Our findings highlight how each architecture's inductive biases prime them to learn different aspects of the protein fitness landscape and how a simple ensembling approach makes protein engineering more robust.
Affiliation(s)
- Chase R Freschlin
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Sarah A Fahlberg
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Pete Heinzelman
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Philip A Romero
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Chemical & Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
5
Guan A, He Z, Wang X, Jia ZJ, Qin J. Engineering the next-generation synthetic cell factory driven by protein engineering. Biotechnol Adv 2024; 73:108366. [PMID: 38663492 DOI: 10.1016/j.biotechadv.2024.108366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Revised: 03/21/2024] [Accepted: 04/22/2024] [Indexed: 05/09/2024]
Abstract
Synthetic cell factory offers substantial advantages in economically efficient production of biofuels, chemicals, and pharmaceutical compounds. However, to create a high-performance synthetic cell factory, precise regulation of cellular material and energy flux is essential. In this context, protein components including enzymes, transcription factor-based biosensors and transporters play pivotal roles. Protein engineering aims to create novel protein variants with desired properties by modifying or designing protein sequences. This review focuses on summarizing the latest advancements of protein engineering in optimizing various aspects of synthetic cell factory, including: enhancing enzyme activity to eliminate production bottlenecks, altering enzyme selectivity to steer metabolic pathways towards desired products, modifying enzyme promiscuity to explore innovative routes, and improving the efficiency of transporters. Furthermore, the utilization of protein engineering to modify protein-based biosensors accelerates evolutionary process and optimizes the regulation of metabolic pathways. The remaining challenges and future opportunities in this field are also discussed.
Affiliation(s)
- Ailin Guan
  - College of Biomass Science and Engineering, Sichuan University, Chengdu 610065, China
- Zixi He
  - College of Biomass Science and Engineering, Sichuan University, Chengdu 610065, China
- Xin Wang
  - West China School of Pharmacy, Sichuan University, Chengdu 610041, China
- Zhi-Jun Jia
  - West China School of Pharmacy, Sichuan University, Chengdu 610041, China
- Jiufu Qin
  - College of Biomass Science and Engineering, Sichuan University, Chengdu 610065, China
6
Goles M, Daza A, Cabas-Mora G, Sarmiento-Varón L, Sepúlveda-Yañez J, Anvari-Kazemabad H, Davari MD, Uribe-Paredes R, Olivera-Nappa Á, Navarrete MA, Medina-Ortiz D. Peptide-based drug discovery through artificial intelligence: towards an autonomous design of therapeutic peptides. Brief Bioinform 2024; 25:bbae275. [PMID: 38856172 PMCID: PMC11163380 DOI: 10.1093/bib/bbae275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2024] [Revised: 04/23/2024] [Accepted: 06/04/2024] [Indexed: 06/11/2024] Open
Abstract
With their diverse biological activities, peptides are promising candidates for therapeutic applications, showing antimicrobial, antitumour and hormonal signalling capabilities. Despite their advantages, therapeutic peptides face challenges such as short half-life, limited oral bioavailability and susceptibility to plasma degradation. The rise of computational tools and artificial intelligence (AI) in peptide research has spurred the development of advanced methodologies and databases that are pivotal in the exploration of these complex macromolecules. This perspective delves into integrating AI in peptide development, encompassing classifier methods, predictive systems and the avant-garde design facilitated by deep-generative models like generative adversarial networks and variational autoencoders. There are still challenges, such as the need for processing optimization and careful validation of predictive models. This work outlines traditional strategies for machine learning model construction and training techniques and proposes a comprehensive AI-assisted peptide design and validation pipeline. The evolving landscape of peptide design using AI is emphasized, showcasing the practicality of these methods in expediting the development and discovery of novel peptides within the context of peptide-based drug discovery.
Affiliation(s)
- Montserrat Goles
  - Departamento de Ingeniería en Computación, Universidad de Magallanes, Av. Pdte. Manuel Bulnes 01855, 6210427, Punta Arenas, Chile
  - Departamento de Ingeniería Química, Biotecnología y Materiales, Universidad de Chile, Beauchef 851, 8370456, Santiago, Chile
- Anamaría Daza
  - Centre for Biotechnology and Bioengineering, CeBiB, Universidad de Chile, Beauchef 851, 8370456, Santiago, Chile
- Gabriel Cabas-Mora
  - Departamento de Ingeniería en Computación, Universidad de Magallanes, Av. Pdte. Manuel Bulnes 01855, 6210427, Punta Arenas, Chile
- Lindybeth Sarmiento-Varón
  - Centro Asistencial de Docencia e Investigación, CADI, Universidad de Magallanes, Av. Los Flamencos 01364, 6210005, Punta Arenas, Chile
- Julieta Sepúlveda-Yañez
  - Facultad de Ciencias de la Salud, Universidad de Magallanes, Av. Pdte. Manuel Bulnes 01855, 6210427, Punta Arenas, Chile
- Hoda Anvari-Kazemabad
  - Departamento de Ingeniería en Computación, Universidad de Magallanes, Av. Pdte. Manuel Bulnes 01855, 6210427, Punta Arenas, Chile
- Mehdi D Davari
  - Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120, Halle, Germany
- Roberto Uribe-Paredes
  - Departamento de Ingeniería en Computación, Universidad de Magallanes, Av. Pdte. Manuel Bulnes 01855, 6210427, Punta Arenas, Chile
- Álvaro Olivera-Nappa
  - Centre for Biotechnology and Bioengineering, CeBiB, Universidad de Chile, Beauchef 851, 8370456, Santiago, Chile
- Marcelo A Navarrete
  - Centro Asistencial de Docencia e Investigación, CADI, Universidad de Magallanes, Av. Los Flamencos 01364, 6210005, Punta Arenas, Chile
  - Escuela de Medicina, Universidad de Magallanes, Av. Pdte. Manuel Bulnes 01855, 6210427, Punta Arenas, Chile
- David Medina-Ortiz
  - Departamento de Ingeniería en Computación, Universidad de Magallanes, Av. Pdte. Manuel Bulnes 01855, 6210427, Punta Arenas, Chile
  - Centre for Biotechnology and Bioengineering, CeBiB, Universidad de Chile, Beauchef 851, 8370456, Santiago, Chile
7
Zhou B, Zheng L, Wu B, Tan Y, Lv O, Yi K, Fan G, Hong L. Protein Engineering with Lightweight Graph Denoising Neural Networks. J Chem Inf Model 2024; 64:3650-3661. [PMID: 38630581 DOI: 10.1021/acs.jcim.4c00036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/19/2024]
Abstract
Protein engineering faces challenges in finding optimal mutants from a massive pool of candidate mutants. In this study, we introduce a deep-learning-based data-efficient fitness prediction tool to steer protein engineering. Our methodology establishes a lightweight graph neural network scheme for protein structures, which efficiently analyzes the microenvironment of amino acids in wild-type proteins and reconstructs the distribution of the amino acid sequences that are more likely to pass natural selection. This distribution serves as a general guidance for scoring proteins toward arbitrary properties on any order of mutations. Our proposed solution undergoes extensive wet-lab experimental validation spanning diverse physicochemical properties of various proteins, including fluorescence intensity, antigen-antibody affinity, thermostability, and DNA cleavage activity. More than 40% of ProtLGN-designed single-site mutants outperform their wild-type counterparts across all studied proteins and targeted properties. More importantly, our model can bypass the negative epistatic effect to combine single mutation sites and form deep mutants with up to seven mutation sites in a single round, whose physicochemical properties are significantly improved. This observation provides compelling evidence of the structure-based model's potential to guide deep mutations in protein engineering. Overall, our approach emerges as a versatile tool for protein engineering, benefiting both the computational and bioengineering communities.
Affiliation(s)
- Bingxin Zhou
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai 200240, China
- Lirong Zheng
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Banghao Wu
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
  - School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Yang Tan
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
- Outongyi Lv
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Kai Yi
  - School of Mathematics and Statistics, University of New South Wales, Sydney 2052, Australia
- Guisheng Fan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
- Liang Hong
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai 201203, China
8
Michael R, Kæstel-Hansen J, Mørch Groth P, Bartels S, Salomon J, Tian P, Hatzakis NS, Boomsma W. A systematic analysis of regression models for protein engineering. PLoS Comput Biol 2024; 20:e1012061. [PMID: 38701099 PMCID: PMC11095727 DOI: 10.1371/journal.pcbi.1012061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Revised: 05/15/2024] [Accepted: 04/10/2024] [Indexed: 05/05/2024] Open
Abstract
To optimize proteins for particular traits holds great promise for industrial and pharmaceutical purposes. Machine Learning is increasingly applied in this field to predict properties of proteins, thereby guiding the experimental optimization process. A natural question is: How much progress are we making with such predictions, and how important is the choice of regressor and representation? In this paper, we demonstrate that different assessment criteria for regressor performance can lead to dramatically different conclusions, depending on the choice of metric, and how one defines generalization. We highlight the fundamental issues of sample bias in typical regression scenarios and how this can lead to misleading conclusions about regressor performance. Finally, we make the case for the importance of calibrated uncertainty in this domain.
Affiliation(s)
- Richard Michael
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
- Peter Mørch Groth
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
  - Enzyme Research, Novozymes A/S, Kongens Lyngby, Denmark
- Simon Bartels
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
- Pengfei Tian
  - Enzyme Research, Novozymes A/S, Kongens Lyngby, Denmark
- Nikos S. Hatzakis
  - Department of Chemistry, University of Copenhagen, Copenhagen, Denmark
- Wouter Boomsma
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
9
Harihar B, Saravanan KM, Gromiha MM, Selvaraj S. Importance of Inter-residue Contacts for Understanding Protein Folding and Unfolding Rates, Remote Homology, and Drug Design. Mol Biotechnol 2024:10.1007/s12033-024-01119-4. [PMID: 38498284 DOI: 10.1007/s12033-024-01119-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2023] [Accepted: 02/10/2024] [Indexed: 03/20/2024]
Abstract
Inter-residue interactions in protein structures provide valuable insights into protein folding and stability. Understanding these interactions can be helpful in many crucial applications, including rational design of therapeutic small molecules and biologics, locating functional protein sites, and predicting protein-protein and protein-ligand interactions. The process of developing machine learning models incorporating inter-residue interactions has been improved recently. This review highlights the theoretical models incorporating inter-residue interactions in predicting folding and unfolding rates of proteins. Utilizing contact maps to depict inter-residue interactions aids researchers in developing computer models for detecting remote homologs and interface residues within protein-protein complexes which, in turn, enhances our knowledge of the relationship between sequence and structure of proteins. Further, the application of contact maps derived from inter-residue interactions is highlighted in the field of drug discovery. Overall, this review presents an extensive assessment of the significant models that use inter-residue interactions to investigate folding rates, unfolding rates, remote homology, and drug development, providing potential future advancements in constructing efficient computational models in structural biology.
Affiliation(s)
- Balasubramanian Harihar
  - Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, Tamil Nadu, 620024, India
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, Tamil Nadu, 600036, India
- Konda Mani Saravanan
  - Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, Tamil Nadu, 620024, India
  - Department of Biotechnology, Bharath Institute of Higher Education and Research, Chennai, Tamil Nadu, 600073, India
- Michael M Gromiha
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, Tamil Nadu, 600036, India
- Samuel Selvaraj
  - Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, Tamil Nadu, 620024, India
10
Honda Malca S, Duss N, Meierhofer J, Patsch D, Niklaus M, Reiter S, Hanlon SP, Wetzl D, Kuhn B, Iding H, Buller R. Effective engineering of a ketoreductase for the biocatalytic synthesis of an ipatasertib precursor. Commun Chem 2024; 7:46. [PMID: 38418529 PMCID: PMC10902378 DOI: 10.1038/s42004-024-01130-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2023] [Accepted: 02/15/2024] [Indexed: 03/01/2024] Open
Abstract
Semi-rational enzyme engineering is a powerful method to develop industrial biocatalysts. Profiting from advances in molecular biology and bioinformatics, semi-rational approaches can effectively accelerate enzyme engineering campaigns. Here, we present the optimization of a ketoreductase from Sporidiobolus salmonicolor for the chemo-enzymatic synthesis of ipatasertib, a potent protein kinase B inhibitor. Harnessing the power of mutational scanning and structure-guided rational design, we created a 10-amino acid substituted variant exhibiting a 64-fold higher apparent kcat and improved robustness under process conditions compared to the wild-type enzyme. In addition, the benefit of algorithm-aided enzyme engineering was studied to derive correlations in protein sequence-function data, and it was found that the applied Gaussian processes allowed us to reduce enzyme library size. The final scalable and high performing biocatalytic process yielded the alcohol intermediate with ≥ 98% conversion and a diastereomeric excess of 99.7% (R,R-trans) from 100 g L-1 ketone after 30 h. Modelling and kinetic studies shed light on the mechanistic factors governing the improved reaction outcome, with mutations T134V, A238K, M242W and Q245S exerting the most beneficial effect on reduction activity towards the target ketone.
Affiliation(s)
- Sumire Honda Malca
  - Institute of Chemistry and Biotechnology, Zurich University of Applied Sciences, Einsiedlerstrasse 31, 8820 Wädenswil, Switzerland
- Nadine Duss
  - Institute of Chemistry and Biotechnology, Zurich University of Applied Sciences, Einsiedlerstrasse 31, 8820 Wädenswil, Switzerland
- Jasmin Meierhofer
  - Institute of Chemistry and Biotechnology, Zurich University of Applied Sciences, Einsiedlerstrasse 31, 8820 Wädenswil, Switzerland
  - Analytical Research and Development, MSD Werthenstein BioPharma GmbH, Industrie Nord 1, 6105 Schachen, Switzerland
- David Patsch
  - Institute of Chemistry and Biotechnology, Zurich University of Applied Sciences, Einsiedlerstrasse 31, 8820 Wädenswil, Switzerland
- Michael Niklaus
  - Institute of Chemistry and Biotechnology, Zurich University of Applied Sciences, Einsiedlerstrasse 31, 8820 Wädenswil, Switzerland
- Stefanie Reiter
  - Institute of Chemistry and Biotechnology, Zurich University of Applied Sciences, Einsiedlerstrasse 31, 8820 Wädenswil, Switzerland
  - Manufacturing Science and Technology, Fisher Clinical Services GmbH, Biotech Innovation Park, 2543 Lengnau, Switzerland
- Steven Paul Hanlon
  - Process Chemistry and Catalysis, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070 Basel, Switzerland
- Dennis Wetzl
  - Process Chemistry and Catalysis, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070 Basel, Switzerland
  - Nonclinical Drug Development, Boehringer Ingelheim International GmbH, Birkendorfer Strasse 65, 88397 Biberach an der Riss, Germany
- Bernd Kuhn
  - Pharmaceutical Research and Early Development, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070 Basel, Switzerland
- Hans Iding
  - Process Chemistry and Catalysis, F. Hoffmann-La Roche Ltd., Grenzacherstrasse 124, 4070 Basel, Switzerland
- Rebecca Buller
  - Institute of Chemistry and Biotechnology, Zurich University of Applied Sciences, Einsiedlerstrasse 31, 8820 Wädenswil, Switzerland
11
Yang J, Li FZ, Arnold FH. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS Central Science 2024; 10:226-241. [PMID: 38435522 PMCID: PMC10906252 DOI: 10.1021/acscentsci.3c01275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Revised: 12/26/2023] [Accepted: 01/16/2024] [Indexed: 03/05/2024]
Abstract
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency, or even to unlock new catalytic activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some level of the desired activity followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting point discovery by functional annotation of known protein sequences or generating novel protein sequences with desired functions and (2) navigating protein fitness landscapes for fitness optimization by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to unlock improved engineering outcomes.
Affiliation(s)
- Jason Yang
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Francesca-Zhoufan Li
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Frances H. Arnold
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
12
Nam K, Shao Y, Major DT, Wolf-Watz M. Perspectives on Computational Enzyme Modeling: From Mechanisms to Design and Drug Development. ACS Omega 2024; 9:7393-7412. [PMID: 38405524 PMCID: PMC10883025 DOI: 10.1021/acsomega.3c09084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Revised: 01/15/2024] [Accepted: 01/19/2024] [Indexed: 02/27/2024]
Abstract
Understanding enzyme mechanisms is essential for unraveling the complex molecular machinery of life. In this review, we survey the field of computational enzymology, highlighting key principles governing enzyme mechanisms and discussing ongoing challenges and promising advances. Over the years, computer simulations have become indispensable in the study of enzyme mechanisms, with the integration of experimental and computational exploration now established as a holistic approach to gain deep insights into enzymatic catalysis. Numerous studies have demonstrated the power of computer simulations in characterizing reaction pathways, transition states, substrate selectivity, product distribution, and dynamic conformational changes for various enzymes. Nevertheless, significant challenges remain in investigating the mechanisms of complex multistep reactions, large-scale conformational changes, and allosteric regulation. Beyond mechanistic studies, computational enzyme modeling has emerged as an essential tool for computer-aided enzyme design and the rational discovery of covalent drugs for targeted therapies. Overall, enzyme design/engineering and covalent drug development can greatly benefit from our understanding of the detailed mechanisms of enzymes, such as protein dynamics, entropy contributions, and allostery, as revealed by computational studies. Such a convergence of different research approaches is expected to continue, creating synergies in enzyme research. This review, by outlining the ever-expanding field of enzyme research, aims to provide guidance for future research directions and facilitate new developments in this important and evolving field.
Affiliation(s)
- Kwangho Nam
- Department of Chemistry and Biochemistry, University of Texas at Arlington, Arlington, Texas 76019, United States
- Yihan Shao
- Department of Chemistry and Biochemistry, University of Oklahoma, Norman, Oklahoma 73019-5251, United States
- Dan T. Major
- Department of Chemistry and Institute for Nanotechnology & Advanced Materials, Bar-Ilan University, Ramat-Gan 52900, Israel
|
13
|
Notin P, Rollins N, Gal Y, Sander C, Marks D. Machine learning for functional protein design. Nat Biotechnol 2024; 42:216-228. [PMID: 38361074 DOI: 10.1038/s41587-024-02127-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 07/01/2023] [Accepted: 01/05/2024] [Indexed: 02/17/2024]
Abstract
Recent breakthroughs in AI coupled with the rapid accumulation of protein sequence and structure data have radically transformed computational protein design. New methods promise to escape the constraints of natural and laboratory evolution, accelerating the generation of proteins for applications in biotechnology and medicine. To make sense of the exploding diversity of machine learning approaches, we introduce a unifying framework that classifies models on the basis of their use of three core data modalities: sequences, structures and functional labels. We discuss the new capabilities and outstanding challenges for the practical design of enzymes, antibodies, vaccines, nanomachines and more. We then highlight trends shaping the future of this field, from large-scale assays to more robust benchmarks, multimodal foundation models, enhanced sampling strategies and laboratory automation.
Affiliation(s)
- Pascal Notin
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Department of Computer Science, University of Oxford, Oxford, UK
- Yarin Gal
- Department of Computer Science, University of Oxford, Oxford, UK
- Chris Sander
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Debora Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
|
14
|
Joho Y, Vongsouthi V, Gomez C, Larsen JS, Ardevol A, Jackson CJ. Improving plastic degrading enzymes via directed evolution. Protein Eng Des Sel 2024; 37:gzae009. [PMID: 38713696 PMCID: PMC11091475 DOI: 10.1093/protein/gzae009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2023] [Revised: 04/30/2024] [Accepted: 05/05/2024] [Indexed: 05/09/2024] Open
Abstract
Plastic degrading enzymes have immense potential for use in industrial applications. Protein engineering efforts over the last decade have resulted in considerable enhancement of many properties of these enzymes. Directed evolution, a protein engineering approach that mimics the natural process of evolution in a laboratory, has been particularly useful in overcoming some of the challenges of structure-based protein engineering. For example, directed evolution has been used to improve the catalytic activity and thermostability of polyethylene terephthalate (PET)-degrading enzymes, although its use for the improvement of other desirable properties, such as solvent tolerance, has been less studied. In this review, we aim to identify some of the knowledge gaps and current challenges, and highlight recent studies related to the directed evolution of plastic-degrading enzymes.
Affiliation(s)
- Yvonne Joho
- Manufacturing, Commonwealth Scientific and Industrial Research Organisation, Research Way, Clayton, Victoria 3168, Australia
- Research School of Chemistry, Australian National University, Sullivans Creek Rd, Canberra, ACT 2601, Australia
- CSIRO Advanced Engineering Biology Future Science Platform, GPO Box 1700, Canberra, ACT 2601, Australia
- Vanessa Vongsouthi
- Research School of Chemistry, Australian National University, Sullivans Creek Rd, Canberra, ACT 2601, Australia
- Chloe Gomez
- Research School of Chemistry, Australian National University, Sullivans Creek Rd, Canberra, ACT 2601, Australia
- Joachim S Larsen
- Research School of Chemistry, Australian National University, Sullivans Creek Rd, Canberra, ACT 2601, Australia
- ARC Centre of Excellence for Synthetic Biology, Research School of Chemistry, Australian National University, Sullivans Creek Rd, Canberra, ACT 2601, Australia
- Albert Ardevol
- Manufacturing, Commonwealth Scientific and Industrial Research Organisation, Research Way, Clayton, Victoria 3168, Australia
- CSIRO Advanced Engineering Biology Future Science Platform, GPO Box 1700, Canberra, ACT 2601, Australia
- Colin J Jackson
- Research School of Chemistry, Australian National University, Sullivans Creek Rd, Canberra, ACT 2601, Australia
- ARC Centre of Excellence for Synthetic Biology, Research School of Chemistry, Australian National University, Sullivans Creek Rd, Canberra, ACT 2601, Australia
- ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Sullivans Creek Rd, Canberra, ACT 2601, Australia
|
15
|
Marchal D, Schulz L, Schuster I, Ivanovska J, Paczia N, Prinz S, Zarzycki J, Erb TJ. Machine Learning-Supported Enzyme Engineering toward Improved CO2-Fixation of Glycolyl-CoA Carboxylase. ACS Synth Biol 2023; 12:3521-3530. [PMID: 37983631 PMCID: PMC10729300 DOI: 10.1021/acssynbio.3c00403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Revised: 11/01/2023] [Accepted: 11/07/2023] [Indexed: 11/22/2023]
Abstract
Glycolyl-CoA carboxylase (GCC) is a new-to-nature enzyme that catalyzes the key reaction in the tartronyl-CoA (TaCo) pathway, a synthetic photorespiration bypass that was recently designed to improve photosynthetic CO2 fixation. GCC was created from propionyl-CoA carboxylase (PCC) through five mutations. However, despite reaching activities of naturally evolved biotin-dependent carboxylases, the quintuple substitution variant GCC M5 still lags behind 4-fold in catalytic efficiency compared to its template PCC and suffers from futile ATP hydrolysis during CO2 fixation. To further improve upon GCC M5, we developed a machine learning-supported workflow that reduces screening efforts for identifying improved enzymes. Using this workflow, we present two novel GCC variants with 2-fold increased carboxylation rate and 60% reduced energy demand, respectively, which are able to address kinetic and thermodynamic limitations of the TaCo pathway. Our work highlights the potential of combining machine learning and directed evolution strategies to reduce screening efforts in enzyme engineering.
Affiliation(s)
- Daniel G. Marchal
- Department of Biochemistry and Synthetic Metabolism, Max-Planck-Institute for Terrestrial Microbiology, Marburg 35043, Germany
- Luca Schulz
- Department of Biochemistry and Synthetic Metabolism, Max-Planck-Institute for Terrestrial Microbiology, Marburg 35043, Germany
- Nicole Paczia
- Core Facility for Metabolomics and Small Molecule Mass Spectrometry, Max-Planck-Institute for Terrestrial Microbiology, Marburg 35043, Germany
- Simone Prinz
- Central Electron Microscopy Facility, Max-Planck-Institute of Biophysics, Frankfurt 60438, Germany
- Jan Zarzycki
- Department of Biochemistry and Synthetic Metabolism, Max-Planck-Institute for Terrestrial Microbiology, Marburg 35043, Germany
- Tobias J. Erb
- Department of Biochemistry and Synthetic Metabolism, Max-Planck-Institute for Terrestrial Microbiology, Marburg 35043, Germany
- SYNMIKRO Center for Synthetic Microbiology, Marburg 35032, Germany
|
16
|
Alquran H, Al Fahoum A, Zyout A, Abu Qasmieh I. A comprehensive framework for advanced protein classification and function prediction using synergistic approaches: Integrating bispectral analysis, machine learning, and deep learning. PLoS One 2023; 18:e0295805. [PMID: 38096313 PMCID: PMC10721063 DOI: 10.1371/journal.pone.0295805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Accepted: 11/29/2023] [Indexed: 12/17/2023] Open
Abstract
Proteins are fundamental components of diverse cellular systems and play crucial roles in a variety of disease processes. Consequently, it is crucial to comprehend their structure, function, and intricate interconnections. Classifying proteins into families or groups with comparable structural and functional characteristics is a crucial aspect of this comprehension. This classification is crucial for evolutionary research, predicting protein function, and identifying potential therapeutic targets. Sequence alignment and structure-based alignment are frequently ineffective techniques for identifying protein families. This study addresses the need for a more efficient and accurate technique for feature extraction and protein classification. The research proposes a novel method that integrates bispectrum characteristics, deep learning techniques, and machine learning algorithms to overcome the limitations of conventional methods. The proposed method uses numbers to represent protein sequences, utilizes bispectrum analysis, uses different topologies for convolutional neural networks to pull out features, and chooses robust features to classify protein families. The goal is to outperform existing methods for identifying protein families, thereby enhancing classification metrics. The materials consist of numerous protein datasets, whereas the methods incorporate bispectrum characteristics and deep learning strategies. The results of this study demonstrate that the proposed method for identifying protein families is superior to conventional approaches. Significantly enhanced quality metrics demonstrated the efficacy of the combined bispectrum and deep learning approaches. These findings have the potential to advance the field of protein biology and facilitate pharmaceutical innovation.
In conclusion, this study presents a novel method that employs bispectrum characteristics and deep learning techniques to improve the precision and efficiency of protein family identification. The demonstrated advancements in classification metrics demonstrate this method's applicability to numerous scientific disciplines. This furthers our understanding of protein function and its implications for disease and treatment.
Affiliation(s)
- Hiam Alquran
- Hijjawi Faculty for Engineering Technology, Biomedical Systems and Informatics Engineering Department, Yarmouk University, Irbid, Jordan
- Amjed Al Fahoum
- Hijjawi Faculty for Engineering Technology, Biomedical Systems and Informatics Engineering Department, Yarmouk University, Irbid, Jordan
- Ala’a Zyout
- Hijjawi Faculty for Engineering Technology, Biomedical Systems and Informatics Engineering Department, Yarmouk University, Irbid, Jordan
- Isam Abu Qasmieh
- Hijjawi Faculty for Engineering Technology, Biomedical Systems and Informatics Engineering Department, Yarmouk University, Irbid, Jordan
|
17
|
Parthiban S, Vijeesh T, Gayathri T, Shanmugaraj B, Sharma A, Sathishkumar R. Artificial intelligence-driven systems engineering for next-generation plant-derived biopharmaceuticals. Frontiers in Plant Science 2023; 14:1252166. [PMID: 38034587 PMCID: PMC10684705 DOI: 10.3389/fpls.2023.1252166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Accepted: 10/17/2023] [Indexed: 12/02/2023]
Abstract
Recombinant biopharmaceuticals including antigens, antibodies, hormones, cytokines, single-chain variable fragments, and peptides have been used as vaccines, diagnostics and therapeutics. Plant molecular pharming is a robust platform that uses plants as an expression system to produce simple and complex recombinant biopharmaceuticals on a large scale. Plant systems have several advantages over other host systems, such as humanized expression, glycosylation, scalability, reduced risk of human or animal pathogenic contaminants, and rapid and cost-effective production. Despite many advantages, the expression of recombinant proteins in plant systems is hindered by factors such as non-human post-translational modifications, protein misfolding, conformation changes and instability. Artificial intelligence (AI) plays a vital role in various fields of biotechnology, and in plant molecular pharming a significant increase in yield and stability can be achieved with an AI-based multi-approach intervention to overcome these hindrances. Current limitations of plant-based recombinant biopharmaceutical production can be circumvented with the aid of synthetic biology tools and AI algorithms in plant-based glycan engineering for protein folding, stability, viability, catalytic activity and organelle targeting. The AI models, including but not limited to neural networks, support vector machines, linear regression, Gaussian processes and regressor ensembles, work by predicting the training and experimental data sets to design and validate the protein structures, thereby optimizing properties such as thermostability, catalytic activity, antibody affinity, and protein folding. This review focuses on integrating systems engineering approaches and AI-based machine learning and deep learning algorithms in protein engineering and host engineering to augment protein production in plant systems to meet the ever-expanding therapeutics market.
Affiliation(s)
- Subramanian Parthiban
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
- Thandarvalli Vijeesh
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
- Thashanamoorthi Gayathri
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
- Balamurugan Shanmugaraj
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
- Ashutosh Sharma
- Tecnologico de Monterrey, School of Engineering and Sciences, Centre of Bioengineering, Queretaro, Mexico
- Ramalingam Sathishkumar
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
|
18
|
Khakzad H, Igashov I, Schneuing A, Goverde C, Bronstein M, Correia B. A new age in protein design empowered by deep learning. Cell Syst 2023; 14:925-939. [PMID: 37972559 DOI: 10.1016/j.cels.2023.10.006] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 03/02/2023] [Revised: 06/22/2023] [Accepted: 10/11/2023] [Indexed: 11/19/2023]
Abstract
The rapid progress in the field of deep learning has had a significant impact on protein design. Deep learning methods have recently produced a breakthrough in protein structure prediction, leading to the availability of high-quality models for millions of proteins. Along with novel architectures for generative modeling and sequence analysis, they have revolutionized the protein design field in the past few years remarkably by improving the accuracy and ability to identify novel protein sequences and structures. Deep neural networks can now learn and extract the fundamental features of protein structures, predict how they interact with other biomolecules, and have the potential to create new effective drugs for treating disease. As their applicability in protein design is rapidly growing, we review the recent developments and technology in deep learning methods and provide examples of their performance to generate novel functional proteins.
Affiliation(s)
- Hamed Khakzad
- Université de Lorraine, CNRS, Inria, LORIA, 54000 Nancy, France; École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Ilia Igashov
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Arne Schneuing
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Casper Goverde
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Bruno Correia
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
|
19
|
Michailidou F. Engineering of Therapeutic and Detoxifying Enzymes. Angew Chem Int Ed Engl 2023; 62:e202308814. [PMID: 37433049 DOI: 10.1002/anie.202308814] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/22/2023] [Revised: 07/07/2023] [Accepted: 07/11/2023] [Indexed: 07/13/2023]
Abstract
Therapeutic enzymes present excellent opportunities for the treatment of human disease, modulation of metabolic pathways and system detoxification. However, current use of enzyme therapy in the clinic is limited as naturally occurring enzymes are seldom optimal for such applications and require substantial improvement by protein engineering. Engineering strategies such as design and directed evolution that have been successfully implemented for industrial biocatalysis can significantly advance the field of therapeutic enzymes, leading to biocatalysts with new-to-nature therapeutic activities, high selectivity, and suitability for medical applications. This minireview highlights case studies of how state-of-the-art and emerging methods in protein engineering are explored for the generation of therapeutic enzymes and discusses gaps and future opportunities in the field of enzyme therapy.
Affiliation(s)
- Freideriki Michailidou
- Department of Health Sciences and Technology, ETH Zurich, Schmelzbergstrasse 9, 8092, Zürich, Switzerland
|
20
|
Fahlberg SA, Freschlin CR, Heinzelman P, Romero PA. Neural network extrapolation to distant regions of the protein fitness landscape. bioRxiv 2023:2023.11.08.566287. [PMID: 37987009 PMCID: PMC10659313 DOI: 10.1101/2023.11.08.566287] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2023]
Abstract
Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks' capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models' extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. Our findings highlight how each architecture's inductive biases prime them to learn different aspects of the protein fitness landscape.
Affiliation(s)
- Sarah A Fahlberg
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Chase R Freschlin
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Pete Heinzelman
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Philip A Romero
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Department of Chemical & Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
|
21
|
Romero-Romero S, Lindner S, Ferruz N. Exploring the Protein Sequence Space with Global Generative Models. Cold Spring Harb Perspect Biol 2023; 15:a041471. [PMID: 37848247 PMCID: PMC10626256 DOI: 10.1101/cshperspect.a041471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2023]
Abstract
Recent advancements in specialized large-scale architectures for training images and language have profoundly impacted the field of computer vision and natural language processing (NLP). Language models, such as the recent ChatGPT and GPT-4, have demonstrated exceptional capabilities in processing, translating, and generating human language. These breakthroughs have also been reflected in protein research, leading to the rapid development of numerous new methods in a short time, with unprecedented performance. Several of these models have been developed with the goal of generating sequences in novel regions of the protein space. In this work, we provide an overview of the use of protein generative models, reviewing (1) language models for the design of novel artificial proteins, (2) works that use non-transformer architectures, and (3) applications in directed evolution approaches.
Affiliation(s)
- Noelia Ferruz
- Barcelona Institute of Molecular Biology, 08028 Barcelona, Spain
|
22
|
Lopez-Martinez E, Manteca A, Ferruz N, Cortajarena AL. Statistical Analysis and Tokenization of Epitopes to Construct Artificial Neoepitope Libraries. ACS Synth Biol 2023; 12:2812-2818. [PMID: 37703075 PMCID: PMC10594869 DOI: 10.1021/acssynbio.3c00201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Indexed: 09/14/2023]
Abstract
Epitopes are specific regions on an antigen's surface that the immune system recognizes. Epitopes are usually protein regions on foreign immune-stimulating entities such as viruses and bacteria, and in some cases, endogenous proteins may act as antigens. Identifying epitopes is crucial for accelerating the development of vaccines and immunotherapies. However, mapping epitopes in pathogen proteomes is challenging using conventional methods. Screening artificial neoepitope libraries against antibodies can overcome this issue. Here, we applied conventional sequence analysis and methods inspired by natural language processing to reveal specific sequence patterns in the linear epitopes deposited in the Immune Epitope Database (www.iedb.org) that can serve as building blocks for the design of universal epitope libraries. Our results reveal that amino acid frequency in annotated linear epitopes differs from that in the human proteome. Aromatic residues are overrepresented, while the presence of cysteines is practically null in epitopes. Byte pair encoding tokenization shows high frequencies of tryptophan in tokens of 5, 6, and 7 amino acids, corroborating the findings of the conventional sequence analysis. These results can be applied to reduce the diversity of linear epitope libraries by orders of magnitude.
Affiliation(s)
- Elena Lopez-Martinez
- Centre for Cooperative Research in Biomaterials (CIC biomaGUNE), Basque Research and Technology Alliance (BRTA), Paseo de Miramón 194, Donostia-San Sebastián, 20014 Spain
- Aitor Manteca
- Centre for Cooperative Research in Biomaterials (CIC biomaGUNE), Basque Research and Technology Alliance (BRTA), Paseo de Miramón 194, Donostia-San Sebastián, 20014 Spain
- Noelia Ferruz
- Molecular Biology Institute of Barcelona (IBMB-CSIC), Barcelona Science Park, Baldiri Reixac, 15-21, 08028, Barcelona, Spain
- Aitziber L. Cortajarena
- Centre for Cooperative Research in Biomaterials (CIC biomaGUNE), Basque Research and Technology Alliance (BRTA), Paseo de Miramón 194, Donostia-San Sebastián, 20014 Spain
- IKERBASQUE, Basque Foundation for Science, Plaza Euskadi 5, 48009 Bilbao, Spain
|
23
|
Marshall LR, Bhattacharya S, Korendovych IV. Fishing for Catalysis: Experimental Approaches to Narrowing Search Space in Directed Evolution of Enzymes. JACS Au 2023; 3:2402-2412. [PMID: 37772192 PMCID: PMC10523367 DOI: 10.1021/jacsau.3c00315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 08/07/2023] [Accepted: 08/08/2023] [Indexed: 09/30/2023]
Abstract
Directed evolution has transformed protein engineering, offering a path to rapid improvement of protein properties. Yet, in practice it is limited by the hyper-astronomic protein sequence search space, and approaches to identify mutagenic hot spots, i.e., locations where mutations are most likely to have a productive impact, are needed. In this perspective, we categorize and discuss recent progress in the experimental approaches (broadly defined as structural, bioinformatic, and dynamic) to hot spot identification. Recent successes in harnessing protein dynamics and machine learning approaches provide new opportunities for the field and will undoubtedly help directed evolution reach its full potential.
Affiliation(s)
- Liam R. Marshall
- Department of Chemistry, Syracuse University, 111 College Place, Syracuse, New York 13224, United States
- Sagar Bhattacharya
- Department of Chemistry, Syracuse University, 111 College Place, Syracuse, New York 13224, United States
- Ivan V. Korendovych
- Department of Chemistry, Syracuse University, 111 College Place, Syracuse, New York 13224, United States
|
24
|
Qiu Y, Wei GW. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. Brief Bioinform 2023; 24:bbad289. [PMID: 37580175 PMCID: PMC10516362 DOI: 10.1093/bib/bbad289] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 05/08/2023] [Revised: 07/14/2023] [Accepted: 07/26/2023] [Indexed: 08/16/2023] Open
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
- Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, 48824 MI, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, 48824 MI, USA
|
25
|
Tachibana R, Zhang K, Zou Z, Burgener S, Ward TR. A Customized Bayesian Algorithm to Optimize Enzyme-Catalyzed Reactions. ACS Sustainable Chemistry & Engineering 2023; 11:12336-12344. [PMID: 37621696 PMCID: PMC10445256 DOI: 10.1021/acssuschemeng.3c02402] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2023] [Revised: 07/21/2023] [Indexed: 08/26/2023]
Abstract
Design of experiments (DoE) plays an important role in optimizing the catalytic performance of chemical reactions. The most commonly used DoE relies on the response surface methodology (RSM) to model the variable space of experimental conditions with the fewest number of experiments. However, the RSM leads to an exponential increase in the number of required experiments as the number of variables increases. Herein we describe a Bayesian optimization algorithm (BOA) to optimize the continuous parameters (e.g., temperature, reaction time, reactant and enzyme concentrations, etc.) of enzyme-catalyzed reactions with the aim of maximizing performance. Compared to existing Bayesian optimization methods, we propose an improved algorithm that leads to better results under limited resources and time for experiments. To validate the versatility of the BOA, we benchmarked its performance with biocatalytic C-C bond formation and amination for the optimization of the turnover number. Gratifyingly, up to 80% improvement compared to RSM and up to 360% improvement vs previous Bayesian optimization algorithms were obtained. Importantly, this strategy enabled simultaneous optimization of both the enzyme's activity and selectivity for cross-benzoin condensation.
Affiliation(s)
- Ryo Tachibana
- Department of Chemistry, University of Basel, Mattenstrasse 24a, BPR 1096, CH-4058, Basel, Switzerland
- National Center of Competence in Research (NCCR) “Catalysis”, ETHZ, 8093 Zurich, Switzerland
- Kailin Zhang
- Department of Chemistry, University of Basel, Mattenstrasse 24a, BPR 1096, CH-4058, Basel, Switzerland
- Zhi Zou
- Department of Chemistry, University of Basel, Mattenstrasse 24a, BPR 1096, CH-4058, Basel, Switzerland
- Simon Burgener
- Department of Chemistry, University of Basel, Mattenstrasse 24a, BPR 1096, CH-4058, Basel, Switzerland
- Thomas R. Ward
- Department of Chemistry, University of Basel, Mattenstrasse 24a, BPR 1096, CH-4058, Basel, Switzerland
- National Center of Competence in Research (NCCR) “Molecular Systems Engineering”, 4058 Basel, Switzerland
- National Center of Competence in Research (NCCR) “Catalysis”, ETHZ, 8093 Zurich, Switzerland
|
26
|
Yang J, Ducharme J, Johnston KE, Li FZ, Yue Y, Arnold FH. DeCOIL: Optimization of Degenerate Codon Libraries for Machine Learning-Assisted Protein Engineering. ACS Synth Biol 2023; 12:2444-2454. [PMID: 37524064 DOI: 10.1021/acssynbio.3c00301] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 08/02/2023]
Abstract
With advances in machine learning (ML)-assisted protein engineering, models based on data, biophysics, and natural evolution are being used to propose informed libraries of protein variants to explore. Synthesizing these libraries for experimental screens is a major bottleneck, as the cost of obtaining large numbers of exact gene sequences is often prohibitive. Degenerate codon (DC) libraries are a cost-effective alternative for generating combinatorial mutagenesis libraries where mutations are targeted to a handful of amino acid sites. However, existing computational methods to optimize DC libraries to include desired protein variants are not well suited to design libraries for ML-assisted protein engineering. To address these drawbacks, we present DEgenerate Codon Optimization for Informed Libraries (DeCOIL), a generalized method that directly optimizes DC libraries to be useful for protein engineering: to sample protein variants that are likely to have both high fitness and high diversity in the sequence search space. Using computational simulations and wet-lab experiments, we demonstrate that DeCOIL is effective across two specific case studies, with the potential to be applied to many other use cases. DeCOIL offers several advantages over existing methods, as it is direct, easy to use, generalizable, and scalable. With accompanying software (https://github.com/jsunn-y/DeCOIL), DeCOIL can be readily implemented to generate desired informed libraries.
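As a rough illustration of what a degenerate codon (DC) library is, the sketch below expands a degenerate codon written in IUPAC ambiguity codes into the concrete codons and amino acids it can encode, using the standard genetic code. It is not DeCOIL and does not reproduce its optimization; the function names (`expand_degenerate_codon`, `encoded_residues`, `library_size`) are hypothetical and chosen only for this example.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes (standard definitions).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def expand_degenerate_codon(dc):
    """Return every concrete codon encoded by a 3-letter degenerate codon."""
    return ["".join(c) for c in product(*(IUPAC[b] for b in dc.upper()))]


def encoded_residues(dc):
    """Return the amino acids (and '*' for stop) a degenerate codon can encode."""
    return {CODON_TABLE[c] for c in expand_degenerate_codon(dc)}


def library_size(dcs):
    """Count distinct protein variants (stops excluded), one degenerate codon per site."""
    size = 1
    for dc in dcs:
        size *= len(encoded_residues(dc) - {"*"})
    return size


if __name__ == "__main__":
    for dc in ("NNK", "NDT"):
        codons = expand_degenerate_codon(dc)
        residues = sorted(encoded_residues(dc))
        print(dc, len(codons), "codons ->", "".join(residues))
    print("4-site NNK library:", library_size(["NNK"] * 4), "protein variants")
```

In this sketch, choosing a different degenerate codon per site changes which residues are sampled and how large the library is, which is the kind of trade-off the abstract describes optimizing.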
Affiliation(s)
- Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Julie Ducharme
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Kadina E Johnston
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Francesca-Zhoufan Li
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Yisong Yue
- Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, California 91125, United States
| | - Frances H Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| |
|
27
|
Yu T, Boob AG, Singh N, Su Y, Zhao H. In vitro continuous protein evolution empowered by machine learning and automation. Cell Syst 2023; 14:633-644. [PMID: 37224814 DOI: 10.1016/j.cels.2023.04.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 09/04/2022] [Revised: 11/19/2022] [Accepted: 04/20/2023] [Indexed: 05/26/2023]
Abstract
Directed evolution has become one of the most successful and powerful tools for protein engineering. However, the efforts required for designing, constructing, and screening a large library of variants can be laborious, time-consuming, and costly. With the recent advent of machine learning (ML) in the directed evolution of proteins, researchers can now evaluate variants in silico and guide a more efficient directed evolution campaign. Furthermore, recent advancements in laboratory automation have enabled the rapid execution of long, complex experiments for high-throughput data acquisition in both industrial and academic settings, thus providing the means to collect a large quantity of data required to develop ML models for protein engineering. In this perspective, we propose a closed-loop in vitro continuous protein evolution framework that leverages the best of both worlds, ML and automation, and provide a brief overview of the recent developments in the field.
Affiliation(s)
- Tianhao Yu
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL, USA; NSF Molecule Maker Lab Institute, Urbana, IL, USA
| | - Aashutosh Girish Boob
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL, USA; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Nilmani Singh
- DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Yufeng Su
- NSF Molecule Maker Lab Institute, Urbana, IL, USA; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Huimin Zhao
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL, USA; NSF Molecule Maker Lab Institute, Urbana, IL, USA; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
| |
|
28
|
McConnell A, Hackel BJ. Protein engineering via sequence-performance mapping. Cell Syst 2023; 14:656-666. [PMID: 37494931 PMCID: PMC10527434 DOI: 10.1016/j.cels.2023.06.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 05/10/2023] [Accepted: 06/21/2023] [Indexed: 07/28/2023]
Abstract
Discovery and evolution of new and improved proteins have empowered molecular therapeutics, diagnostics, and industrial biotechnology. Discovery and evolution both require efficient screens and effective libraries, although they differ in their challenges because of the absence or presence, respectively, of an initial protein variant with the desired function. A host of high-throughput technologies, experimental and computational, enable efficient screens to identify performant protein variants. In partnership, an informed search of sequence space is needed to overcome the immensity, sparsity, and complexity of the sequence-performance landscape. Early in the historical trajectory of protein engineering, these elements aligned with distinct approaches to identify the most performant sequence: selection from large, randomized combinatorial libraries versus rational computational design. Substantial advances have now emerged from the synergy of these perspectives. Rational design of combinatorial libraries aids the experimental search of sequence space, and high-throughput, high-integrity experimental data inform computational design. At the core of the collaborative interface, efficient protein characterization (rather than mere selection of optimal variants) maps sequence-performance landscapes. Such quantitative maps elucidate the complex relationships between protein sequence and performance (e.g., binding, catalytic efficiency, biological activity, and developability), thereby advancing fundamental protein science and facilitating protein discovery and evolution.
Affiliation(s)
- Adam McConnell
- Department of Biomedical Engineering, University of Minnesota - Twin Cities, 421 Washington Avenue SE, Minneapolis, MN 55455, USA
| | - Benjamin J Hackel
- Department of Biomedical Engineering, University of Minnesota - Twin Cities, 421 Washington Avenue SE, Minneapolis, MN 55455, USA; Department of Chemical Engineering and Materials Science, University of Minnesota - Twin Cities, 421 Washington Avenue SE, Minneapolis, MN 55455, USA.
| |
|
29
|
Clements HD, Flynn AR, Nicholls BT, Grosheva D, Lefave SJ, Merriman MT, Hyster TK, Sigman MS. Using Data Science for Mechanistic Insights and Selectivity Predictions in a Non-Natural Biocatalytic Reaction. J Am Chem Soc 2023; 145:17656-17664. [PMID: 37530568 PMCID: PMC10602048 DOI: 10.1021/jacs.3c03639] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Indexed: 08/03/2023]
Abstract
The study of non-natural biocatalytic transformations relies heavily on empirical methods, such as directed evolution, for identifying improved variants. Although exceptionally effective, this approach provides limited insight into the molecular mechanisms behind the transformations and necessitates multiple protein engineering campaigns for new reactants. To address this limitation, we disclose a strategy to explore the biocatalytic reaction space and garner insight into the molecular mechanisms driving enzymatic transformations. Specifically, we explored the selectivity of an "ene"-reductase, GluER-T36A, to create a data-driven toolset that explores reaction space and rationalizes the observed and predicted selectivities of substrate/mutant combinations. The resultant statistical models related structural features of the enzyme and substrate to selectivity and were used to effectively predict selectivity in reactions with out-of-sample substrates and mutants. Our approach provided a deeper understanding of enantioinduction by GluER-T36A and holds the potential to enhance the virtual screening of enzyme mutants.
Affiliation(s)
- Hanna D Clements
- Department of Chemistry, University of Utah, 315 South 1400 East, Salt Lake City, Utah 84112, United States
| | - Autumn R Flynn
- Department of Chemistry, University of Utah, 315 South 1400 East, Salt Lake City, Utah 84112, United States
| | - Bryce T Nicholls
- Department of Chemistry and Chemical Biology, Cornell University, 122 Baker Laboratory, Ithaca, New York 14853, United States
| | - Daria Grosheva
- Department of Chemistry and Chemical Biology, Cornell University, 122 Baker Laboratory, Ithaca, New York 14853, United States
| | - Sarah J Lefave
- Department of Chemistry, University of Utah, 315 South 1400 East, Salt Lake City, Utah 84112, United States
| | - Morgan T Merriman
- Department of Chemistry, University of Utah, 315 South 1400 East, Salt Lake City, Utah 84112, United States
| | - Todd K Hyster
- Department of Chemistry and Chemical Biology, Cornell University, 122 Baker Laboratory, Ithaca, New York 14853, United States
| | - Matthew S Sigman
- Department of Chemistry, University of Utah, 315 South 1400 East, Salt Lake City, Utah 84112, United States
| |
|
30
|
Jagota M, Ye C, Albors C, Rastogi R, Koehl A, Ioannidis N, Song YS. Cross-protein transfer learning substantially improves disease variant prediction. Genome Biol 2023; 24:182. [PMID: 37550700 PMCID: PMC10408151 DOI: 10.1186/s13059-023-03024-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 12/06/2022] [Accepted: 07/27/2023] [Indexed: 08/09/2023] Open
Abstract
BACKGROUND: Genetic variation in the human genome is a major determinant of individual disease risk, but the vast majority of missense variants have unknown etiological effects. Here, we present a robust learning framework for leveraging saturation mutagenesis experiments to construct accurate computational predictors of proteome-wide missense variant pathogenicity. RESULTS: We train cross-protein transfer (CPT) models using deep mutational scanning (DMS) data from only five proteins and achieve state-of-the-art performance on clinical variant interpretation for unseen proteins across the human proteome. We also improve predictive accuracy on DMS data from held-out proteins. High sensitivity is crucial for clinical applications and our model CPT-1 particularly excels in this regime. For instance, at 95% sensitivity of detecting human disease variants annotated in ClinVar, CPT-1 improves specificity to 68%, from 27% for ESM-1v and 55% for EVE. Furthermore, for genes not used to train REVEL, a supervised method widely used by clinicians, we show that CPT-1 compares favorably with REVEL. Our framework combines predictive features derived from general protein sequence models, vertebrate sequence alignments, and AlphaFold structures, and it is adaptable to the future inclusion of other sources of information. We find that vertebrate alignments, albeit rather shallow with only 100 genomes, provide a strong signal for variant pathogenicity prediction that is complementary to recent deep learning-based models trained on massive amounts of protein sequence data. We release predictions for all possible missense variants in 90% of human genes. CONCLUSIONS: Our results demonstrate the utility of mutational scanning data for learning properties of variants that transfer to unseen proteins.
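The "specificity at 95% sensitivity" metric quoted in this abstract can be computed from predictor scores as sketched below. This is a generic illustration under stated assumptions, not the authors' evaluation code: it assumes higher scores mean "more likely pathogenic", calls a variant positive when its score is at or above the threshold, and uses a hypothetical function name and toy scores.

```python
import math


def specificity_at_sensitivity(pos_scores, neg_scores, target_sensitivity=0.95):
    """Specificity at the strictest threshold that still reaches the target sensitivity.

    pos_scores: scores of known pathogenic variants (assumed: higher = more pathogenic).
    neg_scores: scores of known benign variants.
    """
    ranked = sorted(pos_scores, reverse=True)
    # Smallest number of pathogenic variants that must be called positive;
    # the small epsilon guards against floating-point round-up in the product.
    needed = max(1, math.ceil(target_sensitivity * len(ranked) - 1e-9))
    threshold = ranked[needed - 1]
    # Benign variants scoring below the threshold are correctly called negative.
    true_negatives = sum(1 for score in neg_scores if score < threshold)
    return true_negatives / len(neg_scores)


# Toy scores, invented purely for illustration.
pathogenic = [0.9, 0.8, 0.7, 0.6]
benign = [0.65, 0.5, 0.4, 0.3]
print(specificity_at_sensitivity(pathogenic, benign, target_sensitivity=1.0))
```

Fixing sensitivity first and then reporting specificity reflects the clinical priority stated in the abstract: missing a disease variant is treated as the costlier error, so predictors are compared on how many benign variants they can still rule out.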
Affiliation(s)
- Milind Jagota
- Computer Science Division, University of California, Berkeley, 94720, CA, USA
| | - Chengzhong Ye
- Department of Statistics, University of California, Berkeley, 94720, CA, USA
| | - Carlos Albors
- Computer Science Division, University of California, Berkeley, 94720, CA, USA
| | - Ruchir Rastogi
- Computer Science Division, University of California, Berkeley, 94720, CA, USA
| | - Antoine Koehl
- Department of Statistics, University of California, Berkeley, 94720, CA, USA
| | - Nilah Ioannidis
- Computer Science Division, University of California, Berkeley, 94720, CA, USA
- Chan Zuckerberg Biohub, San Francisco, 94158, CA, USA
- Center for Computational Biology, University of California, Berkeley, 94720, CA, USA
| | - Yun S Song
- Computer Science Division, University of California, Berkeley, 94720, CA, USA.
- Department of Statistics, University of California, Berkeley, 94720, CA, USA.
- Center for Computational Biology, University of California, Berkeley, 94720, CA, USA.
| |
|
31
|
Rabitz H, Russell B, Ho TS. The Surprising Ease of Finding Optimal Solutions for Controlling Nonlinear Phenomena in Quantum and Classical Complex Systems. J Phys Chem A 2023; 127:4224-4236. [PMID: 37142303 DOI: 10.1021/acs.jpca.3c01896] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/06/2023]
Abstract
This Perspective addresses the often observed surprising ease of achieving optimal control of nonlinear phenomena in quantum and classical complex systems. The circumstances involved are wide-ranging, with scenarios including manipulation of atomic scale processes, maximization of chemical and material properties or synthesis yields, Nature's optimization of species' populations by natural selection, and directed evolution. Natural evolution will mainly be discussed in terms of laboratory experiments with microorganisms, and the field is also distinct from the other domains where a scientist specifies the goal(s) and oversees the control process. We use the word "control" in reference to all of the available variables, regardless of the circumstance. The empirical observations on the ease of achieving at least good, if not excellent, control in diverse domains of science raise the question of why this occurs despite the generally inherent complexity of the systems in each scenario. The key to addressing the question lies in examining the associated control landscape, which is defined as the optimization objective as a function of the control variables that can be as diverse as the phenomena under consideration. Controls may range from laser pulses, chemical reagents, chemical processing conditions, out to nucleic acids in the genome and more. This Perspective presents a conjecture, based on present findings, that the systematics of readily finding good outcomes from controlled phenomena may be unified through consideration of control landscapes with the same common set of three underlying assumptions (the existence of an optimal solution, the ability for local movement on the landscape, and the availability of sufficient control resources) whose validity needs assessment in each scenario. In practice, many cases permit using myopic gradient-like algorithms while other circumstances utilize algorithms having some elements of stochasticity or introduced noise, depending on whether the landscape is locally smooth or rough. The overarching observation is that only relatively short searches are required despite the common high dimensionality of the available controls in typical scenarios.
Affiliation(s)
- Herschel Rabitz
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
| | - Benjamin Russell
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
| | - Tak-San Ho
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
| |
|
32
|
Gantz M, Neun S, Medcalf EJ, van Vliet LD, Hollfelder F. Ultrahigh-Throughput Enzyme Engineering and Discovery in In Vitro Compartments. Chem Rev 2023; 123:5571-5611. [PMID: 37126602 PMCID: PMC10176489 DOI: 10.1021/acs.chemrev.2c00910] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Indexed: 05/03/2023]
Abstract
Novel and improved biocatalysts are increasingly sourced from libraries via experimental screening. The success of such campaigns is crucially dependent on the number of candidates tested. Water-in-oil emulsion droplets can replace the classical test tube, to provide in vitro compartments as an alternative screening format, containing genotype and phenotype and enabling a readout of function. The scale-down to micrometer droplet diameters and picoliter volumes brings about a >10⁷-fold volume reduction compared to 96-well-plate screening. Droplets made in automated microfluidic devices can be integrated into modular workflows to set up multistep screening protocols involving various detection modes to sort >10⁷ variants a day with kHz frequencies. The repertoire of assays available for droplet screening covers all seven enzyme commission (EC) number classes, setting the stage for widespread use of droplet microfluidics in everyday biochemical experiments. We review the practicalities of adapting droplet screening for enzyme discovery and for detailed kinetic characterization. These new ways of working will not just accelerate discovery experiments currently limited by screening capacity but profoundly change the paradigms we can probe. By interfacing the results of ultrahigh-throughput droplet screening with next-generation sequencing and deep learning, strategies for directed evolution can be implemented, examined, and evaluated.
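The volume-reduction and sorting-rate figures quoted in this abstract (both on the order of ten million) can be sanity-checked with a short back-of-the-envelope calculation. The three input values below are illustrative assumptions chosen for the sketch, not numbers taken from the review: a 100 µL well, a 20 µm droplet, and a 1 kHz sorting rate.

```python
import math

# All three inputs are illustrative assumptions, not values from the review.
WELL_VOLUME_L = 100e-6        # assumed working volume of one 96-well-plate well (100 µL)
DROPLET_DIAMETER_M = 20e-6    # assumed droplet diameter (20 µm)
SORT_RATE_HZ = 1_000          # assumed sorting frequency (1 kHz)


def sphere_volume_litres(diameter_m):
    """Volume of a sphere of the given diameter, converted from cubic metres to litres."""
    radius = diameter_m / 2
    return (4 / 3) * math.pi * radius ** 3 * 1_000


droplet_volume_l = sphere_volume_litres(DROPLET_DIAMETER_M)
fold_reduction = WELL_VOLUME_L / droplet_volume_l
variants_per_day = SORT_RATE_HZ * 24 * 60 * 60

print(f"droplet volume: {droplet_volume_l * 1e12:.1f} pL")
print(f"volume reduction: {fold_reduction:.1e}-fold")
print(f"variants sorted per day at 1 kHz: {variants_per_day:.1e}")
```

Under these assumptions a single droplet holds a few picoliters, the reduction relative to a well exceeds ten-million-fold, and continuous sorting at 1 kHz processes more than ten million droplets per day, consistent with the orders of magnitude stated above.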
Affiliation(s)
- Maximilian Gantz
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Rd, Cambridge CB2 1GA, U.K
| | - Stefanie Neun
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Rd, Cambridge CB2 1GA, U.K
| | - Elliot J Medcalf
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Rd, Cambridge CB2 1GA, U.K
| | - Liisa D van Vliet
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Rd, Cambridge CB2 1GA, U.K
| | - Florian Hollfelder
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Rd, Cambridge CB2 1GA, U.K
| |
|
33
|
Yu T, Boob AG, Volk MJ, Liu X, Cui H, Zhao H. Machine learning-enabled retrobiosynthesis of molecules. Nat Catal 2023. [DOI: 10.1038/s41929-022-00909-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/18/2023]
|
34
|
Qiu Y, Wei GW. Persistent spectral theory-guided protein engineering. Nat Comput Sci 2023; 3:149-163. [PMID: 37637776 PMCID: PMC10456983 DOI: 10.1038/s43588-022-00394-y] [Citation(s) in RCA: 22] [Impact Index Per Article: 22.0] [Received: 08/09/2022] [Accepted: 12/22/2022] [Indexed: 08/29/2023]
Abstract
While protein engineering, which iteratively optimizes protein fitness by screening the gigantic mutational space, is constrained by experimental capacity, various machine learning models have substantially expedited protein engineering. Three-dimensional protein structures promise further advantages, but their intricate geometric complexity hinders their applications in deep mutational screening. Persistent homology, an established algebraic topology tool for protein structural complexity reduction, fails to capture the homotopic shape evolution during the filtration of given data. This work introduces a Topology-offered protein Fitness (TopFit) framework to complement protein sequence and structure embeddings. Equipped with an ensemble regression strategy, TopFit integrates the persistent spectral theory, a new topological Laplacian, and two auxiliary sequence embeddings to capture mutation-induced topological invariants, shape evolution, and sequence disparity in the protein fitness landscape. The performance of TopFit is assessed by 34 benchmark datasets with 128,634 variants, involving a vast variety of protein structure acquisition modalities and training set size variations.
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, MI, 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
| |
|
35
|
Jiang Y, Ran X, Yang ZJ. Data-driven enzyme engineering to identify function-enhancing enzymes. Protein Eng Des Sel 2023; 36:gzac009. [PMID: 36214500 PMCID: PMC10365845 DOI: 10.1093/protein/gzac009] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 03/31/2022] [Revised: 08/08/2022] [Accepted: 09/28/2022] [Indexed: 01/22/2023] Open
Abstract
Identifying function-enhancing enzyme variants is a 'holy grail' challenge in protein science because it will allow researchers to expand the biocatalytic toolbox for late-stage functionalization of drug-like molecules, environmental degradation of plastics and other pollutants, and medical treatment of food allergies. Data-driven strategies, including statistical modeling, machine learning, and deep learning, have greatly advanced the understanding of the sequence-structure-function relationships for enzymes. They have also enhanced the capability of predicting and designing new enzymes and enzyme variants for catalyzing new-to-nature reactions. Here, we review recent progress in data-driven models that have been applied to identify efficiency-enhancing mutants for catalytic reactions. We also discuss existing challenges and obstacles faced by the community. Although the review is by no means comprehensive, we hope that the discussion can inform readers about the state of the art in data-driven enzyme engineering, inspiring more joint experimental-computational efforts to develop and apply data-driven modeling to innovate biocatalysts for synthetic and pharmaceutical applications.
Affiliation(s)
- Yaoyukun Jiang
- Department of Chemistry, Vanderbilt University, Nashville, TN 37235, USA
| | - Xinchun Ran
- Department of Chemistry, Vanderbilt University, Nashville, TN 37235, USA
| | - Zhongyue J Yang
- Department of Chemistry, Vanderbilt University, Nashville, TN 37235, USA
- Center for Structural Biology, Vanderbilt University, Nashville, TN 37235, USA
- Vanderbilt Institute of Chemical Biology, Vanderbilt University, Nashville, TN 37235, USA
- Data Science Institute, Vanderbilt University, Nashville, TN 37235, USA
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, Nashville, TN 37235, USA
| |
|
36
|
Paik I, Ngo PHT, Shroff R, Diaz DJ, Maranhao AC, Walker DJ, Bhadra S, Ellington AD. Improved Bst DNA Polymerase Variants Derived via a Machine Learning Approach. Biochemistry 2023; 62:410-418. [PMID: 34762799 PMCID: PMC9514386 DOI: 10.1021/acs.biochem.1c00451] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Indexed: 02/02/2023]
Abstract
The DNA polymerase I from Geobacillus stearothermophilus (also known as Bst DNAP) is widely used in isothermal amplification reactions, where its strand displacement ability is prized. More robust versions of this enzyme should be enabled for diagnostic applications, especially for carrying out higher temperature reactions that might proceed more quickly. To this end, we appended a short fusion domain from the actin-binding protein villin that improved both stability and purification of the enzyme. In parallel, we have developed a machine learning algorithm that assesses the relative fit of individual amino acids to their chemical microenvironments at any position in a protein and applied this algorithm to predict sequence substitutions in Bst DNAP. The top predicted variants had greatly improved thermotolerance (heating prior to assay), and upon combination, the mutations showed additive thermostability, with denaturation temperatures up to 2.5 °C higher than the parental enzyme. The increased thermostability of the enzyme allowed faster loop-mediated isothermal amplification assays to be carried out at 73 °C, where both Bst DNAP and its improved commercial counterpart Bst 2.0 are inactivated. Overall, this is one of the first examples of the application of machine learning approaches to the thermostabilization of an enzyme.
Affiliation(s)
- Inyup Paik
- Department of Molecular Biosciences, College of Natural Sciences, the University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712, United States
| | - Phuoc H. T. Ngo
- Department of Molecular Biosciences, College of Natural Sciences, the University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology and Department of Chemistry, College of Natural Sciences, The University of Texas at Austin, Austin, Texas 78712, United States
| | - Raghav Shroff
- Department of Molecular Biosciences, College of Natural Sciences, the University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712, United States; CCDC Army Research Lab-South, Austin, Texas 78712, United States
| | - Daniel J. Diaz
- Center for Systems and Synthetic Biology and Department of Chemistry, College of Natural Sciences, The University of Texas at Austin, Austin, Texas 78712, United States
| | - Andre C. Maranhao
- Department of Molecular Biosciences, College of Natural Sciences, the University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712, United States
| | - David J.F. Walker
- Department of Molecular Biosciences, College of Natural Sciences, the University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712, United States
| | - Sanchita Bhadra
- Department of Molecular Biosciences, College of Natural Sciences, the University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712, United States
| | - Andrew D. Ellington
- Department of Molecular Biosciences, College of Natural Sciences, the University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712, United States
| |
|
37
|
Miton CM, Tokuriki N. Insertions and Deletions (Indels): A Missing Piece of the Protein Engineering Jigsaw. Biochemistry 2023; 62:148-157. [PMID: 35830609 DOI: 10.1021/acs.biochem.2c00188] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 02/08/2023]
Abstract
Over the years, protein engineers have studied nature and borrowed its tricks to accelerate protein evolution in the test tube. While there have been considerable advances, our ability to generate new proteins in the laboratory is seemingly limited. One explanation for these shortcomings may be that insertions and deletions (indels), which frequently arise in nature, are largely overlooked during protein engineering campaigns. The profound effect of indels on protein structures, by way of drastic backbone alterations, could be perceived as "saltation" events that bring about significant phenotypic changes in a single mutational step. Should we leverage these effects to accelerate protein engineering and gain access to unexplored regions of adaptive landscapes? In this Perspective, we describe the role played by indels in the functional diversification of proteins in nature and discuss their untapped potential for protein engineering, despite their often-destabilizing nature. We hope to spark a renewed interest in indels, emphasizing that their wider study and use may prove insightful and shape the future of protein engineering by unlocking unique functional changes that substitutions alone could never achieve.
Affiliation(s)
- Charlotte M Miton
- Michael Smith Laboratories, University of British Columbia, Vancouver, V6T 1Z4 BC, Canada
| | - Nobuhiko Tokuriki
- Michael Smith Laboratories, University of British Columbia, Vancouver, V6T 1Z4 BC, Canada
| |
|
38
|
Clifton BE, Kozome D, Laurino P. Efficient Exploration of Sequence Space by Sequence-Guided Protein Engineering and Design. Biochemistry 2023; 62:210-220. [PMID: 35245020 DOI: 10.1021/acs.biochem.1c00757] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 02/02/2023]
Abstract
The rapid growth of sequence databases over the past two decades means that protein engineers faced with optimizing a protein for any given task will often have immediate access to a vast number of related protein sequences. These sequences encode information about the evolutionary history of the protein and the underlying sequence requirements to produce folded, stable, and functional protein variants. Methods that can take advantage of this information are an increasingly important part of the protein engineering tool kit. In this Perspective, we discuss the utility of sequence data in protein engineering and design, focusing on recent advances in three main areas: the use of ancestral sequence reconstruction as an engineering tool to generate thermostable and multifunctional proteins, the use of sequence data to guide engineering of multipoint mutants by structure-based computational protein design, and the use of unlabeled sequence data for unsupervised and semisupervised machine learning, allowing the generation of diverse and functional protein sequences in unexplored regions of sequence space. Altogether, these methods enable the rapid exploration of sequence space within regions enriched with functional proteins and therefore have great potential for accelerating the engineering of stable, functional, and diverse proteins for industrial and biomedical applications.
Affiliation(s)
- Ben E Clifton
- Protein Engineering and Evolution Unit, Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna, Okinawa 904-0495, Japan
| | - Dan Kozome
- Protein Engineering and Evolution Unit, Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna, Okinawa 904-0495, Japan
| | - Paola Laurino
- Protein Engineering and Evolution Unit, Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna, Okinawa 904-0495, Japan
| |
|
39
|
Marklund E, Ke Y, Greenleaf WJ. High-throughput biochemistry in RNA sequence space: predicting structure and function. Nat Rev Genet 2023; 24:401-414. [PMID: 36635406 DOI: 10.1038/s41576-022-00567-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Accepted: 12/08/2022] [Indexed: 01/14/2023]
Abstract
RNAs are central to fundamental biological processes in all known organisms. The set of possible intramolecular interactions of RNA nucleotides defines the range of alternative structural conformations of a specific RNA that can coexist, and these structures enable functional catalytic properties of RNAs and/or their productive intermolecular interactions with other RNAs or proteins. However, the immense combinatorial space of potential RNA sequences has precluded predictive mapping between RNA sequence and molecular structure and function. Recent advances in high-throughput approaches in vitro have enabled quantitative thermodynamic and kinetic measurements of RNA-RNA and RNA-protein interactions, across hundreds of thousands of sequence variations. In this Review, we explore these techniques, how they can be used to understand RNA function and how they might form the foundations of an accurate model to predict the structure and function of an RNA directly from its nucleotide sequence. The experimental techniques and modelling frameworks discussed here are also highly relevant for the sampling of sequence-structure-function space of DNAs and proteins.
Affiliation(s)
- Emil Marklund
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Yuxi Ke
- Department of Bioengineering, Stanford University, Stanford, CA, USA
| | - William J Greenleaf
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA.
| |
|
40
|
Iqbal WA, Lisitsa A, Kapralov MV. Predicting plant Rubisco kinetics from RbcL sequence data using machine learning. J Exp Bot 2023; 74:638-650. [PMID: 36094849 PMCID: PMC9833099 DOI: 10.1093/jxb/erac368] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 04/28/2022] [Accepted: 09/12/2022] [Indexed: 06/15/2023]
Abstract
Ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco) is responsible for the conversion of atmospheric CO2 to organic carbon during photosynthesis, and often acts as a rate-limiting step in the latter process. Screening the natural diversity of Rubisco kinetics is the main strategy used to find better Rubisco enzymes for crop engineering efforts. Here, we demonstrate the use of Gaussian processes (GPs), a family of Bayesian models, coupled with protein encoding schemes, for predicting Rubisco kinetics from Rubisco large subunit (RbcL) sequence data. GPs trained on published experimentally obtained Rubisco kinetic datasets were applied to over 9000 sequences encoding RbcL to predict Rubisco kinetic parameters. Notably, our predicted kinetic values were in agreement with known trends, e.g. higher carboxylation turnover rates (Kcat) for Rubisco enzymes from C4 or crassulacean acid metabolism (CAM) species, compared with those found in C3 species. This is the first study demonstrating machine learning approaches as a tool for screening and predicting Rubisco kinetics, which could be applied to other enzymes.
Affiliation(s)
- Wasim A Iqbal
- School of Natural and Environmental Sciences, Newcastle University, Newcastle upon Tyne, NE1 7RU, United Kingdom
| | - Alexei Lisitsa
- Department of Computer Science, University of Liverpool, Liverpool, L69 3BX, United Kingdom
| | | |
|
41
|
Qin Y, Li Q, Fan L, Ning X, Wei X, You C. Biomanufacturing by In Vitro Biotransformation (ivBT) Using Purified Cascade Multi-enzymes. Adv Biochem Eng Biotechnol 2023; 186:1-27. [PMID: 37455283 DOI: 10.1007/10_2023_231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/18/2023]
Abstract
In vitro biotransformation (ivBT) refers to the use of an artificial biological reaction system that employs purified enzymes for the one-pot conversion of low-cost materials into biocommodities such as ethanol, organic acids, and amino acids. Unshackled from cell growth and metabolism, ivBT exhibits distinct advantages compared with metabolic engineering, including but not limited to high engineering flexibility, ease of operation, fast reaction rate, high product yields, and good scalability. These characteristics position ivBT as a promising next-generation biomanufacturing platform. Nevertheless, challenges persist in the enhancement of bulk enzyme preparation methods, the acquisition of enzymes with superior catalytic properties, and the development of sophisticated approaches for pathway design and system optimization. In alignment with the workflow of ivBT development, this chapter presents a systematic introduction to pathway design, enzyme mining and engineering, system construction, and system optimization. The chapter also proffers perspectives on ivBT development.
Affiliation(s)
- Yanmei Qin
- University of Chinese Academy of Sciences, Beijing, China
- In Vitro Synthetic Biology Center, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
| | - Qiangzi Li
- In Vitro Synthetic Biology Center, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
| | - Lin Fan
- University of Chinese Academy of Sciences, Beijing, China
- In Vitro Synthetic Biology Center, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- University of Chinese Academy of Sciences Sino-Danish College, Beijing, China
| | - Xiao Ning
- University of Chinese Academy of Sciences, Beijing, China
- In Vitro Synthetic Biology Center, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
| | - Xinlei Wei
- In Vitro Synthetic Biology Center, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China.
- National Technology Innovation Center of Synthetic Biology, Tianjin, China.
| | - Chun You
- In Vitro Synthetic Biology Center, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China.
- National Technology Innovation Center of Synthetic Biology, Tianjin, China.
| |
|
42
|
Sellés Vidal L, Isalan M, Heap JT, Ledesma-Amaro R. A primer to directed evolution: current methodologies and future directions. RSC Chem Biol 2023; 4:271-291. [PMID: 37034405 PMCID: PMC10074555 DOI: 10.1039/d2cb00231k] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Accepted: 01/18/2023] [Indexed: 01/30/2023] Open
Abstract
This review summarises the methods available for directed evolution, including mutagenesis and variant selection techniques. The advantages and disadvantages of each technique are presented, and future challenges in the field are discussed.
Affiliation(s)
- Lara Sellés Vidal
- Imperial College Centre for Synthetic Biology, Imperial College London, London, SW7 2AZ, UK
- Department of Bioengineering, Imperial College London, London, SW7 2AZ, UK
| | - Mark Isalan
- Imperial College Centre for Synthetic Biology, Imperial College London, London, SW7 2AZ, UK
- Department of Life Sciences, Imperial College London, London, SW7 2AZ, UK
| | - John T. Heap
- Imperial College Centre for Synthetic Biology, Imperial College London, London, SW7 2AZ, UK
- Department of Life Sciences, Imperial College London, London, SW7 2AZ, UK
- School of Life Sciences, The University of Nottingham, University Park, Nottingham, NG7 2RD, UK
| | - Rodrigo Ledesma-Amaro
- Imperial College Centre for Synthetic Biology, Imperial College London, London, SW7 2AZ, UK
- Department of Bioengineering, Imperial College London, London, SW7 2AZ, UK
| |
|
43
|
Ó'Fágáin C. Protein Stability: Enhancement and Measurement. Methods Mol Biol 2023; 2699:369-419. [PMID: 37647007 DOI: 10.1007/978-1-0716-3362-5_18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
Abstract
This chapter defines protein stability, emphasizes its importance, and surveys the field of protein stabilization, with summary reference to a selection of 2014-2021 publications. One can enhance stability, particularly by protein engineering strategies but also by chemical modification and by other means. General protocols are set out on how to measure a given protein's (i) kinetic thermal stability and (ii) oxidative stability and (iii) how to undertake chemical modification of a protein in solution.
Affiliation(s)
- Ciarán Ó'Fágáin
- School of Biotechnology, Dublin City University, Dublin, Ireland.
| |
|
44
|
Minot M, Reddy ST. Nucleotide augmentation for machine learning-guided protein engineering. Bioinform Adv 2022; 3:vbac094. [PMID: 36698759 PMCID: PMC9843584 DOI: 10.1093/bioadv/vbac094] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/20/2022] [Revised: 10/24/2022] [Accepted: 12/08/2022] [Indexed: 12/14/2022]
Abstract
Summary: Machine learning-guided protein engineering is a rapidly advancing field. Despite major experimental and computational advances, collecting protein genotype (sequence) and phenotype (function) data remains time- and resource-intensive. As a result, the quality and quantity of training data are often a limiting factor in developing machine learning models. Data augmentation techniques have been successfully applied to the fields of computer vision and natural language processing; however, there is a lack of such augmentation techniques for biological sequence data. Towards this end, we develop nucleotide augmentation (NTA), which leverages natural nucleotide codon degeneracy to augment protein sequence data via synonymous codon substitution. As a proof of concept for protein engineering, we test several online and offline augmentation implementations to train machine learning models with benchmark datasets of protein genotype and phenotype, revealing performance gains on par with or surpassing benchmark models using a fraction of the training data. NTA also enables substantial improvements for classification tasks under heavy class imbalance. Availability and implementation: The code used in this study is publicly available at https://github.com/minotm/NTA. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Mason Minot
- Department of Biosystems Science and Engineering, ETH Zurich, 4058 Basel, Switzerland
| | | |
|
45
|
Efficient synthesis of 2-aryl benzothiazoles mediated by Vitreoscilla hemoglobin. Mol Catal 2022. [DOI: 10.1016/j.mcat.2022.112784] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
|
46
|
Wittmund M, Cadet F, Davari MD. Learning Epistasis and Residue Coevolution Patterns: Current Trends and Future Perspectives for Advancing Enzyme Engineering. ACS Catal 2022. [DOI: 10.1021/acscatal.2c01426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Affiliation(s)
- Marcel Wittmund
- Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
| | - Frederic Cadet
- Laboratory of Excellence LABEX GR, DSIMB, Inserm UMR S1134, University of Paris city & University of Reunion, Paris 75014, France
| | - Mehdi D. Davari
- Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
| |
|
47
|
Qiu Y, Wei GW. CLADE 2.0: Evolution-Driven Cluster Learning-Assisted Directed Evolution. J Chem Inf Model 2022; 62:4629-4641. [PMID: 36154171 DOI: 10.1021/acs.jcim.2c01046] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 11/30/2022]
Abstract
Directed evolution, a revolutionary biotechnology in protein engineering, optimizes protein fitness by searching an astronomical mutational space via expensive experiments. The cluster learning-assisted directed evolution (CLADE) efficiently explores the mutational space via a combination of unsupervised hierarchical clustering and supervised learning. However, the initial-stage sampling in CLADE treats all clusters equally despite many clusters containing a large portion of non-functional mutations. Recent statistical and deep learning tools enable evolutionary density modeling to assess protein fitness in an unsupervised manner. In this work, we construct an ensemble of multiple evolutionary scores to guide the initial sampling in CLADE. The resulting evolutionary score-enhanced CLADE, called CLADE 2.0, efficiently selects a training set within a small informative space using the evolution-driven clustering sampling. CLADE 2.0 is validated by using two benchmark libraries both having 160,000 sequences from four-site mutational combinations. Extensive computational experiments and comparisons with existing cutting-edge methods indicate that CLADE 2.0 is a new state-of-the-art tool for machine learning-assisted directed evolution.
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
| |
|
48
|
Medina-Ortiz D, Contreras S, Amado-Hinojosa J, Torres-Almonacid J, Asenjo JA, Navarrete M, Olivera-Nappa Á. Generalized Property-Based Encoders and Digital Signal Processing Facilitate Predictive Tasks in Protein Engineering. Front Mol Biosci 2022; 9:898627. [PMID: 35911960 PMCID: PMC9329607 DOI: 10.3389/fmolb.2022.898627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2022] [Accepted: 06/23/2022] [Indexed: 11/13/2022] Open
Abstract
Computational methods in protein engineering often require encoding amino acid sequences, i.e., converting them into numeric arrays. Physicochemical properties are a typical choice to define encoders, where we replace each amino acid by its value for a given property. However, what property (or group thereof) is best for a given predictive task remains an open problem. In this work, we generalize property-based encoding strategies to maximize the performance of predictive models in protein engineering. First, combining text mining and unsupervised learning, we partitioned the AAIndex database into eight semantically-consistent groups of properties. We then applied a non-linear PCA within each group to define a single encoder to represent it. Then, in several case studies, we assess the performance of predictive models for protein and peptide function, folding, and biological activity, trained using the proposed encoders and classical methods (One Hot Encoder and TAPE embeddings). Models trained on datasets encoded with our encoders and converted to signals through the Fast Fourier Transform (FFT) increased their precision and reduced their overfitting substantially, outperforming classical approaches in most cases. Finally, we propose a preliminary methodology to create de novo sequences with desired properties. All these results offer simple ways to increase the performance of general and complex predictive tasks in protein engineering without increasing their complexity.
Affiliation(s)
- David Medina-Ortiz
- Centre for Biotechnology and Bioengineering, Universidad de Chile, Santiago, Chile
- Departamento de Ingeniería en Computación, Universidad de Magallanes, Punta Arenas, Chile
| | - Sebastian Contreras
- Max Planck Institute for Dynamics and Self-Organization, Göttingen, Germany
- *Correspondence: Sebastian Contreras; Álvaro Olivera-Nappa
| | - Juan Amado-Hinojosa
- Centre for Biotechnology and Bioengineering, Universidad de Chile, Santiago, Chile
- Departamento de Ingeniería Química, Biotecnología y Materiales, Facultad de Ciencias Físicas y Matemáticas, Universidad de Chile, Santiago, Chile
| | - Jorge Torres-Almonacid
- Departamento de Ingeniería en Computación, Universidad de Magallanes, Punta Arenas, Chile
| | - Juan A. Asenjo
- Centre for Biotechnology and Bioengineering, Universidad de Chile, Santiago, Chile
- Departamento de Ingeniería Química, Biotecnología y Materiales, Facultad de Ciencias Físicas y Matemáticas, Universidad de Chile, Santiago, Chile
| | | | - Álvaro Olivera-Nappa
- Centre for Biotechnology and Bioengineering, Universidad de Chile, Santiago, Chile
- Departamento de Ingeniería Química, Biotecnología y Materiales, Facultad de Ciencias Físicas y Matemáticas, Universidad de Chile, Santiago, Chile
- *Correspondence: Sebastian Contreras; Álvaro Olivera-Nappa
| |
|
49
|
Pandi A, Diehl C, Yazdizadeh Kharrazi A, Scholz SA, Bobkova E, Faure L, Nattermann M, Adam D, Chapin N, Foroughijabbari Y, Moritz C, Paczia N, Cortina NS, Faulon JL, Erb TJ. A versatile active learning workflow for optimization of genetic and metabolic networks. Nat Commun 2022; 13:3876. [PMID: 35790733 PMCID: PMC9256728 DOI: 10.1038/s41467-022-31245-z] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 02/12/2022] [Accepted: 06/10/2022] [Indexed: 11/13/2022] Open
Abstract
Optimization of biological networks is often limited by wet lab labor and cost, and the lack of convenient computational tools. Here, we describe METIS, a versatile active machine learning workflow with a simple online interface for the data-driven optimization of biological targets with minimal experiments. We demonstrate our workflow for various applications, including cell-free transcription and translation, genetic circuits, and a 27-variable synthetic CO2-fixation cycle (CETCH cycle), improving these systems between one and two orders of magnitude. For the CETCH cycle, we explore 10^25 conditions with only 1,000 experiments to yield the most efficient CO2-fixation cascade described to date. Beyond optimization, our workflow also quantifies the relative importance of individual factors to the performance of a system, identifying unknown interactions and bottlenecks. Overall, our workflow opens the way for convenient optimization and prototyping of genetic and metabolic networks with customizable adjustments according to user experience, experimental setup, and laboratory facilities. Here, aimed at democratization and standardization, the authors describe METIS, a modular and versatile active machine learning workflow with a simple online interface for the optimization of biological target functions with minimal experimental datasets.
Affiliation(s)
- Amir Pandi
- Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany.
| | - Christoph Diehl
- Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
| | | | - Scott A Scholz
- Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
| | - Elizaveta Bobkova
- Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
| | - Léon Faure
- Micalis Institute, INRAE, AgroParisTech, University of Paris-Saclay, Jouy-en-Josas, France
| | - Maren Nattermann
- Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
| | - David Adam
- Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
| | - Nils Chapin
- Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
| | - Yeganeh Foroughijabbari
- Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
| | - Charles Moritz
- Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
| | - Nicole Paczia
- Core Facility for Metabolomics and Small Molecule Mass Spectrometry, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
| | - Niña Socorro Cortina
- Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
- LiVeritas Biosciences, Inc., 432N Canal St.; Ste. 20, South San Francisco, CA, 94080, USA
| | - Jean-Loup Faulon
- Micalis Institute, INRAE, AgroParisTech, University of Paris-Saclay, Jouy-en-Josas, France
- Genomique Metabolique, Genoscope, Institut Francois Jacob, CEA, CNRS, Univ Evry, University of Paris-Saclay, Evry, France
- Manchester Institute of Biotechnology, SYNBIOCHEM center, School of Chemistry, The University of Manchester, Manchester, UK
| | - Tobias J Erb
- Department of Biochemistry & Synthetic Metabolism, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
- SYNMIKRO Center of Synthetic Microbiology, Marburg, Germany
| |
|
50
|
Machine learning to navigate fitness landscapes for protein engineering. Curr Opin Biotechnol 2022; 75:102713. [DOI: 10.1016/j.copbio.2022.102713] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/19/2021] [Revised: 01/05/2022] [Accepted: 02/28/2022] [Indexed: 11/19/2022]
|