1
|
Zhou L, Tao C, Shen X, Sun X, Wang J, Yuan Q. Unlocking the potential of enzyme engineering via rational computational design strategies. Biotechnol Adv 2024; 73:108376. [PMID: 38740355 DOI: 10.1016/j.biotechadv.2024.108376] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2023] [Revised: 04/27/2024] [Accepted: 05/08/2024] [Indexed: 05/16/2024]
Abstract
Enzymes play a pivotal role in various industries by enabling efficient, eco-friendly, and sustainable chemical processes. However, the low turnover rates and poor substrate selectivity of enzymes limit their large-scale applications. Rational computational enzyme design, facilitated by computational algorithms, offers a more targeted and less labor-intensive approach. Although there has been notable advancement in employing rational computational protein engineering strategies to overcome these issues, this progress has not been comprehensively reviewed so far. This article reviews recent developments in rational computational enzyme design, categorizing them into three types: structure-based, sequence-based, and data-driven machine learning computational design. Case studies are presented to demonstrate successful enhancements in catalytic activity, stability, and substrate selectivity. Lastly, the article provides a thorough analysis of these approaches, highlights existing challenges and potential solutions, and offers insights into future development directions.
Affiliation(s)
- Lei Zhou
- State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, Beijing 100029, China
| | - Chunmeng Tao
- State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, Beijing 100029, China
| | - Xiaolin Shen
- State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, Beijing 100029, China
| | - Xinxiao Sun
- State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, Beijing 100029, China
| | - Jia Wang
- State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, Beijing 100029, China.
| | - Qipeng Yuan
- State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, Beijing 100029, China.
| |
|
2
|
Yang J, Li FZ, Arnold FH. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS Cent Sci 2024; 10:226-241. [PMID: 38435522 PMCID: PMC10906252 DOI: 10.1021/acscentsci.3c01275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Revised: 12/26/2023] [Accepted: 01/16/2024] [Indexed: 03/05/2024]
Abstract
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency, or even to unlock new catalytic activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some level of the desired activity followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting point discovery by functional annotation of known protein sequences or generating novel protein sequences with desired functions and (2) navigating protein fitness landscapes for fitness optimization by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to unlock improved engineering outcomes.
Affiliation(s)
- Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Francesca-Zhoufan Li
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Frances H. Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| |
|
3
|
Fu X, Suo H, Zhang J, Chen D. Machine-learning-guided Directed Evolution for AAV Capsid Engineering. Curr Pharm Des 2024; 30:811-824. [PMID: 38445704 DOI: 10.2174/0113816128286593240226060318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2023] [Revised: 02/07/2024] [Accepted: 02/13/2024] [Indexed: 03/07/2024]
Abstract
Targeted gene delivery is crucial to gene therapy. Adeno-associated virus (AAV) has emerged as a primary gene therapy vector due to its broad host range, long-term expression, and low pathogenicity. However, AAV vectors have some limitations, such as immunogenicity and insufficient targeting. Designing or modifying capsids is a potential method of improving the efficacy of gene delivery, but it is hindered by the weak biological basis of AAV, the complexity of the capsids, and the limitations of current screening methods. Artificial intelligence (AI), especially machine learning (ML), has great potential to accelerate and improve the optimization of capsid properties as well as decrease their development time and manufacturing costs. This review introduces the traditional methods of designing AAV capsids and the general steps of building a sequence-function ML model, highlights the applications of ML in the development workflow, and summarizes its advantages and challenges.
Affiliation(s)
- Xianrong Fu
- School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou 310018, China
| | - Hairui Suo
- School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou 310018, China
| | - Jiachen Zhang
- School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou 310018, China
| | - Dongmei Chen
- School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou 310018, China
| |
|
4
|
Vasina M, Kovar D, Damborsky J, Ding Y, Yang T, deMello A, Mazurenko S, Stavrakis S, Prokop Z. In-depth analysis of biocatalysts by microfluidics: An emerging source of data for machine learning. Biotechnol Adv 2023; 66:108171. [PMID: 37150331 DOI: 10.1016/j.biotechadv.2023.108171] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/24/2023] [Revised: 05/04/2023] [Accepted: 05/04/2023] [Indexed: 05/09/2023]
Abstract
Nowadays, the vastly increasing demand for novel biotechnological products is supported by the continuous development of biocatalytic applications which provide sustainable green alternatives to chemical processes. The success of a biocatalytic application is critically dependent on how quickly we can identify and characterize enzyme variants fitting the conditions of industrial processes. While miniaturization and parallelization have dramatically increased the throughput of next-generation sequencing systems, the subsequent characterization of the obtained candidates is still a limiting process in identifying the desired biocatalysts. Only a few commercial microfluidic systems for enzyme analysis are currently available, and the transformation of numerous published prototypes into commercial platforms is still to be streamlined. This review presents the state-of-the-art, recent trends, and perspectives in applying microfluidic tools in the functional and structural analysis of biocatalysts. We discuss the advantages and disadvantages of available technologies, their reproducibility and robustness, and readiness for routine laboratory use. We also highlight the unexplored potential of microfluidics to leverage the power of machine learning for biocatalyst development.
Affiliation(s)
- Michal Vasina
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 602 00 Brno, Czech Republic; International Clinical Research Centre, St. Anne's University Hospital, 656 91 Brno, Czech Republic
| | - David Kovar
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 602 00 Brno, Czech Republic; International Clinical Research Centre, St. Anne's University Hospital, 656 91 Brno, Czech Republic
| | - Jiri Damborsky
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 602 00 Brno, Czech Republic; International Clinical Research Centre, St. Anne's University Hospital, 656 91 Brno, Czech Republic
| | - Yun Ding
- Institute for Chemical and Bioengineering, ETH Zürich, 8093 Zürich, Switzerland
| | - Tianjin Yang
- Institute for Chemical and Bioengineering, ETH Zürich, 8093 Zürich, Switzerland; Department of Biochemistry, University of Zurich, 8057 Zurich, Switzerland
| | - Andrew deMello
- Institute for Chemical and Bioengineering, ETH Zürich, 8093 Zürich, Switzerland
| | - Stanislav Mazurenko
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 602 00 Brno, Czech Republic; International Clinical Research Centre, St. Anne's University Hospital, 656 91 Brno, Czech Republic.
| | - Stavros Stavrakis
- Institute for Chemical and Bioengineering, ETH Zürich, 8093 Zürich, Switzerland.
| | - Zbynek Prokop
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 602 00 Brno, Czech Republic; International Clinical Research Centre, St. Anne's University Hospital, 656 91 Brno, Czech Republic.
| |
|
5
|
Wittmund M, Cadet F, Davari MD. Learning Epistasis and Residue Coevolution Patterns: Current Trends and Future Perspectives for Advancing Enzyme Engineering. ACS Catal 2022. [DOI: 10.1021/acscatal.2c01426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Affiliation(s)
- Marcel Wittmund
- Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
| | - Frederic Cadet
- Laboratory of Excellence LABEX GR, DSIMB, Inserm UMR S1134, University of Paris city & University of Reunion, Paris 75014, France
| | - Mehdi D. Davari
- Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
| |
|
6
|
Computational enzyme redesign: large jumps in function. Trends Chem 2022. [DOI: 10.1016/j.trechm.2022.03.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
7
|
Wang Y, Xue P, Cao M, Yu T, Lane ST, Zhao H. Directed Evolution: Methodologies and Applications. Chem Rev 2021; 121:12384-12444. [PMID: 34297541 DOI: 10.1021/acs.chemrev.1c00260] [Citation(s) in RCA: 218] [Impact Index Per Article: 72.7] [Indexed: 02/06/2023]
Abstract
Directed evolution aims to expedite the natural evolution process of biological molecules and systems in a test tube through iterative rounds of gene diversifications and library screening/selection. It has become one of the most powerful and widespread tools for engineering improved or novel functions in proteins, metabolic pathways, and even whole genomes. This review describes the commonly used gene diversification strategies, screening/selection methods, and recently developed continuous evolution strategies for directed evolution. Moreover, we highlight some representative applications of directed evolution in engineering nucleic acids, proteins, pathways, genetic circuits, viruses, and whole cells. Finally, we discuss the challenges and future perspectives in directed evolution.
Affiliation(s)
- Yajie Wang
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
| | - Pu Xue
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
| | - Mingfeng Cao
- DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
| | - Tianhao Yu
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
| | - Stephan T Lane
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
| | - Huimin Zhao
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
| |
|
8
|
Siedhoff NE, Illig AM, Schwaneberg U, Davari MD. PyPEF-An Integrated Framework for Data-Driven Protein Engineering. J Chem Inf Model 2021; 61:3463-3476. [PMID: 34260225 DOI: 10.1021/acs.jcim.1c00099] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 12/12/2022]
Abstract
Data-driven strategies are gaining increased attention in protein engineering due to recent advances in access to large experimental databanks of proteins, next-generation sequencing (NGS), high-throughput screening (HTS) methods, and the development of artificial intelligence algorithms. However, the reliable prediction of beneficial amino acid substitutions, their combination, and the effect on functional properties remain the most significant challenges in protein engineering, which is applied to develop proteins and enzymes for biocatalysis, biomedicine, and life sciences. Here, we present a general-purpose framework (PyPEF: pythonic protein engineering framework) for performing data-driven protein engineering using machine learning methods combined with techniques from signal processing and statistical physics. PyPEF guides the identification and selection of beneficial proteins of a defined sequence space by systematically or randomly exploring the fitness of variants and by sampling random evolution pathways. The performance of PyPEF was evaluated concerning its predictive accuracy and throughput on four public protein and enzyme data sets using common regression models. It was proved that the program could efficiently predict the fitness of protein sequences for different target properties (predictive models with coefficient of determination values ranging from 0.58 to 0.92). By combining machine learning and protein evolution, PyPEF enabled the screening of proteins with various functions, reaching a screening capacity of more than 500,000 protein sequence variants in the timeframe of only a few minutes on a personal computer. PyPEF displayed significant accuracies on four public data sets (different proteins and properties) and underlined the potential of integrating data-driven technologies for covering different philosophies by either predicting the fitness of the variants to the highest accuracy accounting for epistatic effects or capturing the general trend of introduced mutations on the fitness in directed protein evolution campaigns. In essence, PyPEF can provide a powerful solution to current sequence exploration and combinatorial problems faced in protein engineering through exhaustive in silico screening of the sequence space.
Affiliation(s)
- Niklas E Siedhoff
- Institute of Biotechnology, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
| | | | - Ulrich Schwaneberg
- Institute of Biotechnology, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany; DWI-Leibniz Institute for Interactive Materials, Forckenbeckstraße 50, 52074 Aachen, Germany
| | - Mehdi D Davari
- Institute of Biotechnology, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
| |
|
9
|
Song W, Ko J, Choi YH, Hwang NS. Recent advancements in enzyme-mediated crosslinkable hydrogels: In vivo-mimicking strategies. APL Bioeng 2021; 5:021502. [PMID: 33834154 PMCID: PMC8018798 DOI: 10.1063/5.0037793] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 11/16/2020] [Accepted: 03/03/2021] [Indexed: 12/19/2022] Open
Abstract
Enzymes play a central role in fundamental biological processes and have been traditionally used to trigger various processes. In recent years, enzymes have been used to tune biomaterial responses and modify the chemical structures at desired sites. These chemical modifications have allowed the fabrication of various hydrogels for tissue engineering and therapeutic applications. This review provides a comprehensive overview of recent advancements in the use of enzymes for hydrogel fabrication. Strategies to enhance the enzyme function and improve biocompatibility are described. In addition, we describe future opportunities and challenges for the production of enzyme-mediated crosslinkable hydrogels.
Affiliation(s)
- Wonmoon Song
- School of Chemical and Biological Engineering, Institute of Chemical Processes, Seoul National University, Seoul 08826, Republic of Korea
| | - Junghyeon Ko
- School of Chemical and Biological Engineering, Institute of Chemical Processes, Seoul National University, Seoul 08826, Republic of Korea
| | - Young Hwan Choi
- School of Chemical and Biological Engineering, Institute of Chemical Processes, Seoul National University, Seoul 08826, Republic of Korea
| | - Nathaniel S. Hwang
- Author to whom correspondence should be addressed. Tel.: 82-2-880-1635. Fax: 82-2-880-7295
| |
|
10
|
Ferguson AL, Ranganathan R. 100th Anniversary of Macromolecular Science Viewpoint: Data-Driven Protein Design. ACS Macro Lett 2021; 10:327-340. [PMID: 35549066 DOI: 10.1021/acsmacrolett.0c00885] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Indexed: 12/13/2022]
Abstract
The design of synthetic proteins with the desired function is a long-standing goal in biomolecular science, with broad applications in biochemical engineering, agriculture, medicine, and public health. Rational de novo design and experimental directed evolution have achieved remarkable successes but are challenged by the requirement to find functional "needles" in the vast "haystack" of protein sequence space. Data-driven models for fitness landscapes provide a predictive map between protein sequence and function and can prospectively identify functional candidates for experimental testing to greatly improve the efficiency of this search. This Viewpoint reviews the applications of machine learning and, in particular, deep learning as part of data-driven protein engineering platforms. We highlight recent successes, review promising computational methodologies, and provide an outlook on future challenges and opportunities. The article is written for a broad audience comprising both polymer and protein scientists and computer and data scientists interested in an up-to-date review of recent innovations and opportunities in this rapidly evolving field.
Affiliation(s)
- Andrew L. Ferguson
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
| | - Rama Ranganathan
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
- Center for Physics of Evolving Systems, University of Chicago, Chicago, Illinois 60637, United States
- Biochemistry and Molecular Biology, University of Chicago, Chicago, Illinois 60637, United States
| |
|
11
|
Volk MJ, Lourentzou I, Mishra S, Vo LT, Zhai C, Zhao H. Biosystems Design by Machine Learning. ACS Synth Biol 2020; 9:1514-1533. [PMID: 32485108 DOI: 10.1021/acssynbio.0c00129] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Indexed: 12/16/2022]
Abstract
Biosystems such as enzymes, pathways, and whole cells have been increasingly explored for biotechnological applications. However, the intricate connectivity and resulting complexity of biosystems pose a major hurdle in designing biosystems with desirable features. As -omics and other high-throughput technologies have been rapidly developed, the promise of applying machine learning (ML) techniques in biosystems design has started to become a reality. ML models enable the identification of patterns within complicated biological data across multiple scales of analysis and can augment biosystems design applications by predicting new candidates for optimized performance. ML is being used at every stage of biosystems design to help find nonobvious engineering solutions with fewer design iterations. In this review, we first describe commonly used models and modeling paradigms within ML. We then discuss some applications of these models that have already shown success in biotechnological applications. Moreover, we discuss successful applications at all scales of biosystems design, including nucleic acids, genetic circuits, proteins, pathways, genomes, and bioprocesses. Finally, we discuss some limitations of these methods and potential solutions as well as prospects of the combination of ML and biosystems design.
|
12
|
Lennox M, Robertson N, Devereux B. Expanding the Vocabulary of a Protein: Application of Subword Algorithms to Protein Sequence Modelling. Annu Int Conf IEEE Eng Med Biol Soc 2020; 2020:2361-2367. [PMID: 33018481 DOI: 10.1109/embc44109.2020.9176380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Deep learning has proven to be a useful tool for modelling protein properties. However, given the variability in the length of proteins, it can be difficult to summarise the sequence of amino acids effectively. In many cases, as a result of using fixed-length representations, information about long proteins can be lost through truncation, or model training can be slow due to the use of excessive padding. In this work, we aim to overcome these problems by expanding upon the original vocabulary used to represent the protein sequence. To this end, we utilise two prominent subword algorithms that have been previously used to reach state-of-the-art results in various Natural Language Processing tasks. The algorithms are used to encode the original protein sequence into a set of subsequences before they are analysed by a Doc2Vec model. The pre-trained encodings produced by each algorithm are tested on a variety of downstream tasks: four protein property prediction tasks (plasma membrane localization, thermostability, peak absorption wavelength, enantioselectivity) as well as drug-target affinity prediction tasks over two datasets. Our results significantly improve on the state-of-the-art for these tasks, demonstrating the benefits of using subword compression algorithms for modelling proteins.
|
13
|
Qu G, Li A, Acevedo‐Rocha CG, Sun Z, Reetz MT. Die zentrale Rolle der Methodenentwicklung in der gerichteten Evolution selektiver Enzyme [The Crucial Role of Methodology Development in Directed Evolution of Selective Enzymes]. Angew Chem Int Ed Engl 2020. [DOI: 10.1002/ange.201901491] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Indexed: 12/27/2022]
Affiliation(s)
- Ge Qu
- Tianjin Institute of Industrial Biotechnology Chinese Academy of Sciences 32 West 7th Avenue, Tianjin Airport Economic Area Tianjin 300308 China
| | - Aitao Li
- State Key Laboratory of Biocatalysis and Enzyme Engineering Hubei Collaborative Innovation Center for Green Transformation of Bio-resources Hubei Key Laboratory of Industrial Biotechnology College of Life Sciences Hubei University 368 Youyi Road Wuchang Wuhan 430062 China
| | | | - Zhoutong Sun
- Tianjin Institute of Industrial Biotechnology Chinese Academy of Sciences 32 West 7th Avenue, Tianjin Airport Economic Area Tianjin 300308 China
| | - Manfred T. Reetz
- Tianjin Institute of Industrial Biotechnology Chinese Academy of Sciences 32 West 7th Avenue, Tianjin Airport Economic Area Tianjin 300308 China
- Max-Planck-Institut für Kohlenforschung Kaiser-Wilhelm-Platz 1 45470 Mülheim Germany
- Department of Chemistry, Hans-Meerwein-Straße 4 Philipps-Universität 35032 Marburg Germany
| |
|
14
|
Qu G, Li A, Acevedo‐Rocha CG, Sun Z, Reetz MT. The Crucial Role of Methodology Development in Directed Evolution of Selective Enzymes. Angew Chem Int Ed Engl 2020; 59:13204-13231. [PMID: 31267627 DOI: 10.1002/anie.201901491] [Citation(s) in RCA: 246] [Impact Index Per Article: 61.5] [Received: 02/02/2019] [Indexed: 12/14/2022]
Affiliation(s)
- Ge Qu
- Tianjin Institute of Industrial Biotechnology Chinese Academy of Sciences 32 West 7th Avenue, Tianjin Airport Economic Area Tianjin 300308 China
| | - Aitao Li
- State Key Laboratory of Biocatalysis and Enzyme Engineering Hubei Collaborative Innovation Center for Green Transformation of Bio-resources Hubei Key Laboratory of Industrial Biotechnology College of Life Sciences Hubei University 368 Youyi Road Wuchang Wuhan 430062 China
| | | | - Zhoutong Sun
- Tianjin Institute of Industrial Biotechnology Chinese Academy of Sciences 32 West 7th Avenue, Tianjin Airport Economic Area Tianjin 300308 China
| | - Manfred T. Reetz
- Tianjin Institute of Industrial Biotechnology Chinese Academy of Sciences 32 West 7th Avenue, Tianjin Airport Economic Area Tianjin 300308 China
- Max-Planck-Institut für Kohlenforschung Kaiser-Wilhelm-Platz 1 45470 Mülheim Germany
- Department of Chemistry, Hans-Meerwein-Strasse 4 Philipps-University 35032 Marburg Germany
| |
|
15
|
Yang KK, Wu Z, Arnold FH. Machine-learning-guided directed evolution for protein engineering. Nat Methods 2019; 16:687-694. [PMID: 31308553 DOI: 10.1038/s41592-019-0496-6] [Citation(s) in RCA: 471] [Impact Index Per Article: 94.2] [Received: 10/25/2018] [Accepted: 06/17/2019] [Indexed: 02/06/2023]
Abstract
Protein engineering through machine-learning-guided directed evolution enables the optimization of protein functions. Machine-learning approaches predict how sequence maps to function in a data-driven manner without requiring a detailed model of the underlying physics or biological pathways. Such methods accelerate directed evolution by learning from the properties of characterized variants and using that information to select sequences that are likely to exhibit improved properties. Here we introduce the steps required to build machine-learning sequence-function models and to use those models to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to the use of machine learning for protein engineering, as well as the current literature and applications of this engineering paradigm. We illustrate the process with two case studies. Finally, we look to future opportunities for machine learning to enable the discovery of unknown protein functions and uncover the relationship between protein sequence and function.
Affiliation(s)
- Kevin K Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Zachary Wu
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Frances H Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA.
| |
|
16
|
Yang KK, Wu Z, Bedbrook CN, Arnold FH. Learned protein embeddings for machine learning. Bioinformatics 2018; 34:2642-2648. [PMID: 29584811 PMCID: PMC6061698 DOI: 10.1093/bioinformatics/bty178] [Citation(s) in RCA: 138] [Impact Index Per Article: 23.0] [Received: 11/30/2017] [Revised: 03/20/2018] [Accepted: 03/22/2018] [Indexed: 12/26/2022] Open
Abstract
Motivation: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling. Results: The predictive power of Gaussian process models trained using embeddings is comparable to those trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows meaningful relationships between the embedded proteins are captured. Availability and implementation: The embedding vectors and code to reproduce the results are available at https://github.com/fhalab/embeddings_reproduction/. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kevin K Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Zachary Wu
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Claire N Bedbrook
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Frances H Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
| |
|