1
Minot M, Reddy ST. Meta learning addresses noisy and under-labeled data in machine learning-guided antibody engineering. Cell Syst 2024; 15:4-18.e4. [PMID: 38194961 DOI: 10.1016/j.cels.2023.12.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Revised: 07/21/2023] [Accepted: 12/07/2023] [Indexed: 01/11/2024]
Abstract
Machine learning-guided protein engineering is rapidly progressing; however, collecting high-quality, large datasets remains a bottleneck. Directed evolution and protein engineering studies often require extensive experimental processes to eliminate noise and label protein sequence-function data. Meta learning has proven effective in other fields in learning from noisy data via bi-level optimization given the availability of a small dataset with trusted labels. Here, we leverage meta learning approaches to overcome noisy and under-labeled data and expedite workflows in antibody engineering. We generate yeast display antibody mutagenesis libraries and screen them for target antigen binding followed by deep sequencing. We then create representative learning tasks, including learning from noisy training data, positive and unlabeled learning, and learning out of distribution properties. We demonstrate that meta learning has the potential to reduce experimental screening time and improve the robustness of machine learning models by training with noisy and under-labeled training data.
Affiliation(s)
- Mason Minot
  - ETH Zurich, Department of Biosystems Science and Engineering, Basel 4056, Switzerland.
- Sai T Reddy
  - ETH Zurich, Department of Biosystems Science and Engineering, Basel 4056, Switzerland.
2
Mardikoraem M, Woldring D. Machine Learning-driven Protein Library Design: A Path Toward Smarter Libraries. Methods Mol Biol 2022; 2491:87-104. [PMID: 35482186 DOI: 10.1007/978-1-0716-2285-8_5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/14/2023]
Abstract
Proteins are small yet valuable biomolecules that play a versatile role in therapeutics and diagnostics. The intricate sequence-structure-function paradigm in the realm of proteins opens the possibility for directly mapping amino acid sequence to function. However, the rugged nature of the protein fitness landscape and an astronomical number of possible mutations even for small proteins make navigating this system a daunting task. Moreover, the scarcity of functional proteins and the ease with which deleterious mutations are introduced, due to complex epistatic relationships, compound the existing challenges. This highlights the need for auxiliary tools in current techniques such as rational design and directed evolution. To that end, the state-of-the-art machine learning can offer time and cost efficiency in finding high fitness proteins, circumventing unnecessary wet-lab experiments. In the context of improving library design, machine learning provides valuable insights via its unique features such as high adaptation to complex systems, multi-tasking, and parallelism, and the ability to capture hidden trends in input data. Finally, both the advancements in computational resources and the rapidly increasing number of sequences in protein databases will allow more promising and detailed insights delivered from machine learning to protein library design. In this chapter, fundamental concepts and a method for machine learning-driven library design leveraging deep sequencing datasets will be discussed. We elaborate on (1) basic knowledge about machine learning algorithms, (2) the benefit of machine learning in library design, and (3) methodology for implementing machine learning in library design.
Affiliation(s)
- Mehrsa Mardikoraem
  - Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI, USA.
  - Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI, USA.
- Daniel Woldring
  - Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI, USA.
  - Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI, USA.
3
Currin A, Parker S, Robinson CJ, Takano E, Scrutton NS, Breitling R. The evolving art of creating genetic diversity: From directed evolution to synthetic biology. Biotechnol Adv 2021; 50:107762. [PMID: 34000294 PMCID: PMC8299547 DOI: 10.1016/j.biotechadv.2021.107762] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 10/23/2020] [Revised: 04/21/2021] [Accepted: 04/25/2021] [Indexed: 12/31/2022]
Abstract
The ability to engineer biological systems, whether to introduce novel functionality or improved performance, is a cornerstone of biotechnology and synthetic biology. Typically, this requires the generation of genetic diversity to explore variations in phenotype, a process that can be performed at many levels, from single molecule targets (i.e., in directed evolution of enzymes) to whole organisms (e.g., in chassis engineering). Recent advances in DNA synthesis technology and automation have enhanced our ability to create variant libraries with greater control and throughput. This review highlights the latest developments in approaches to create such a hierarchy of diversity from the enzyme level to entire pathways in vitro, with a focus on the creation of combinatorial libraries that are required to navigate a target's vast design space successfully to uncover significant improvements in function.
Affiliation(s)
- Andrew Currin
  - Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, United Kingdom.
- Steven Parker
  - Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, United Kingdom.
- Christopher J Robinson
  - Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, United Kingdom.
- Eriko Takano
  - Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, United Kingdom.
- Nigel S Scrutton
  - Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, United Kingdom.
- Rainer Breitling
  - Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), Manchester Institute of Biotechnology, The University of Manchester, Manchester M1 7DN, United Kingdom.
4
Abstract
Saturation mutagenesis is conveniently located between the two extremes of protein engineering, namely random mutagenesis and rational design. It involves mutating a confined number of target residues to other amino acids, and hence requires knowledge regarding the sites for mutagenesis, but not their final identity. There are many different strategies for performing and designing such experiments, ranging from simple single degenerate codons to codon collections that code for distinct sets of amino acids. Here, we provide detailed information on the Dynamic Management for Codon Compression (DYNAMCC) approaches that allow us to precisely define the desired amino acid composition to be introduced to a specific target site. DYNAMCC allows us to set usage thresholds and to eliminate undesirable stop and wild-type codons, thus allowing us to control library size and subsequently downstream screening efforts. The DYNAMCC algorithms are free of charge and are implemented in a website for easy access and usage: www.dynamcc.com.
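As a rough illustration of why codon choice affects library size and screening load, the sketch below compares how the number of DNA-level variants grows with the number of saturated sites. It assumes 32 codons per site for the common NNK scheme and, as an idealized lower bound that is not DYNAMCC output, 19 codons per site (one codon for each non-wild-type amino acid and no stop codon). The function name `library_size` is hypothetical.

```python
def library_size(codons_per_site: int, n_sites: int) -> int:
    """Number of distinct DNA variants when every site is varied independently."""
    return codons_per_site ** n_sites


NNK_CODONS = 32         # N = A/C/G/T at positions 1 and 2, K = G/T at position 3
IDEALIZED_CODONS = 19   # assumption: one codon per non-wild-type amino acid

for n_sites in (1, 2, 3, 4):
    nnk = library_size(NNK_CODONS, n_sites)
    ideal = library_size(IDEALIZED_CODONS, n_sites)
    # The ratio shows how quickly redundant codons inflate the library.
    print(f"{n_sites} site(s): NNK={nnk}, idealized={ideal}, ratio={nnk / ideal:.2f}")
```

Because the per-site counts multiply, even a modest per-site reduction compounds quickly as more sites are saturated at once.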
Affiliation(s)
- Gur Pines
  - Renewable and Sustainable Energy Institute (RASEI), University of Colorado Boulder, Boulder, CO, USA.
  - Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA.
- Ryan T Gill
  - Renewable and Sustainable Energy Institute (RASEI), University of Colorado Boulder, Boulder, CO, USA.
  - Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA.
5
Cahn JKB, Werlang CA, Baumschlager A, Brinkmann-Chen S, Mayo SL, Arnold FH. A General Tool for Engineering the NAD/NADP Cofactor Preference of Oxidoreductases. ACS Synth Biol 2017; 6:326-333. [PMID: 27648601 DOI: 10.1021/acssynbio.6b00188] [Citation(s) in RCA: 98] [Impact Index Per Article: 12.3] [Indexed: 11/30/2022]
Abstract
The ability to control enzymatic nicotinamide cofactor utilization is critical for engineering efficient metabolic pathways. However, the complex interactions that determine cofactor-binding preference render this engineering particularly challenging. Physics-based models have been insufficiently accurate and blind directed evolution methods too inefficient to be widely adopted. Building on a comprehensive survey of previous studies and our own prior engineering successes, we present a structure-guided, semirational strategy for reversing enzymatic nicotinamide cofactor specificity. This heuristic-based approach leverages the diversity and sensitivity of catalytically productive cofactor binding geometries to limit the problem to an experimentally tractable scale. We demonstrate the efficacy of this strategy by inverting the cofactor specificity of four structurally diverse NADP-dependent enzymes: glyoxylate reductase, cinnamyl alcohol dehydrogenase, xylose reductase, and iron-containing alcohol dehydrogenase. The analytical components of this approach have been fully automated and are available in the form of an easy-to-use web tool: Cofactor Specificity Reversal-Structural Analysis and Library Design (CSR-SALAD).
Affiliation(s)
- Jackson K. B. Cahn
  - Division of Chemistry and Chemical Engineering, and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States.
- Caroline A. Werlang
  - Division of Chemistry and Chemical Engineering, and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States.
- Armin Baumschlager
  - Division of Chemistry and Chemical Engineering, and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States.
- Sabine Brinkmann-Chen
  - Division of Chemistry and Chemical Engineering, and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States.
- Stephen L. Mayo
  - Division of Chemistry and Chemical Engineering, and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States.
- Frances H. Arnold
  - Division of Chemistry and Chemical Engineering, and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States.
6
Halweg-Edwards AL, Pines G, Winkler JD, Pines A, Gill RT. A Web Interface for Codon Compression. ACS Synth Biol 2016; 5:1021-3. [PMID: 27169595 DOI: 10.1021/acssynbio.6b00026] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]
Abstract
Saturation mutagenesis is widely used in protein engineering and other experiments. A common practice is to utilize the single degenerate codon NNK. However, this approach suffers from amino acid bias and the presence of a stop codon and of the wild-type amino acid. These extra features needlessly increase library size and consequently downstream screening load. Recently, we developed the DYNAMCC algorithms for codon compression that find the minimal set of degenerate codons, covering any defined set of amino acids, with no off-target codons and with redundancy control. Additionally, we experimentally demonstrated the advantages of this approach over the standard NNK method. While the code is freely available from our Web site, we have now made this method more accessible to a broader audience without any computational background by building a user-friendly web-based interface for those algorithms. The Web site can be accessed through: www.dynamcc.com.
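The NNK drawbacks named in this abstract (amino acid bias and a stop codon) can be checked directly against the standard genetic code. The Python sketch below expands NNK (N = A/C/G/T, K = G/T) and counts how many codons encode each amino acid. The helper `expand` and all variable names are illustrative and are not part of DYNAMCC; the codon table is the standard code written in the usual T, C, A, G order.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Only the IUPAC ambiguity codes needed for NNK are listed here.
IUPAC = {"N": "ACGT", "K": "GT"}


def expand(degenerate_codon: str) -> list[str]:
    """Enumerate every concrete codon matched by a degenerate codon."""
    choices = (IUPAC.get(base, base) for base in degenerate_codon)
    return ["".join(codon) for codon in product(*choices)]


nnk_codons = expand("NNK")
nnk_counts = Counter(CODON_TABLE[codon] for codon in nnk_codons)

print(f"codons: {len(nnk_codons)}, stop codons: {nnk_counts['*']}")
for aa, n in sorted(nnk_counts.items(), key=lambda item: (-item[1], item[0])):
    print(aa, n)
```

Under the standard code, NNK gives 32 codons covering all 20 amino acids plus the TAG stop codon. Leu, Ser, and Arg are each encoded three times, Ala, Gly, Pro, Thr, and Val twice, and the remaining twelve amino acids once, which is the bias the abstract refers to.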
Affiliation(s)
- Andrea L. Halweg-Edwards
  - Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, Colorado 80309, United States.
- Gur Pines
  - Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, Colorado 80309, United States.
- James D. Winkler
  - Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, Colorado 80309, United States.
- Ryan T. Gill
  - Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, Colorado 80309, United States.