1
Harteveld Z, Van Hall-Beauvais A, Morozova I, Southern J, Goverde C, Georgeon S, Rosset S, Defferrard M, Loukas A, Vandergheynst P, Bronstein MM, Correia BE. Exploring "dark-matter" protein folds using deep learning. Cell Syst 2024; 15:898-910.e5. [PMID: 39383860 DOI: 10.1016/j.cels.2024.09.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2023] [Revised: 06/13/2024] [Accepted: 09/16/2024] [Indexed: 10/11/2024]
Abstract
De novo protein design explores uncharted sequence and structure space to generate novel proteins not sampled by evolution. A main challenge in de novo design involves crafting "designable" structural templates to guide the sequence searches toward adopting target structures. We present a convolutional variational autoencoder that learns patterns of protein structure, dubbed Genesis. We coupled Genesis with trRosetta to design sequences for a set of protein folds and found that Genesis is capable of reconstructing native-like distance and angle distributions for five native folds and three novel, the so-called "dark-matter" folds as a demonstration of generalizability. We used a high-throughput assay to characterize the stability of the designs through protease resistance, obtaining encouraging success rates for folded proteins. Genesis enables exploration of the protein fold space within minutes, unrestricted by protein topologies. Our approach addresses the backbone designability problem, showing that small neural networks can efficiently learn structural patterns in proteins. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Zander Harteveld
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Alexandra Van Hall-Beauvais
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Irina Morozova
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Casper Goverde
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Stéphane Rosset
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Andreas Loukas
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
  - Prescient Design, gRED, Roche, Basel, Switzerland
- Bruno E Correia
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland

2
Draizen EJ, Veretnik S, Mura C, Bourne PE. Deep generative models of protein structure uncover distant relationships across a continuous fold space. Nat Commun 2024; 15:8094. [PMID: 39294145 PMCID: PMC11410806 DOI: 10.1038/s41467-024-52020-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Accepted: 08/23/2024] [Indexed: 09/20/2024] Open
Abstract
Our views of fold space implicitly rest upon many assumptions that impact how we analyze, interpret and understand protein structure, function and evolution. For instance, is there an optimal granularity in viewing protein structural similarities (e.g., architecture, topology or some other level)? Similarly, the discrete/continuous dichotomy of fold space is central, but remains unresolved. Discrete views of fold space bin similar folds into distinct, non-overlapping groups; unfortunately, such binning can miss remote relationships. While hierarchical systems like CATH are indispensable resources, less heuristic and more conceptually flexible approaches could enable more nuanced explorations of fold space. Building upon an Urfold model of protein structure, here we present a deep generative modeling framework, termed DeepUrfold, for analyzing protein relationships at scale. DeepUrfold's learned embeddings occupy high-dimensional latent spaces that can be distilled for a given protein in terms of an amalgamated representation uniting sequence, structure and biophysical properties. This approach is structure-guided, versus being purely structure-based, and DeepUrfold learns representations that, in a sense, define superfamilies. Deploying DeepUrfold with CATH reveals evolutionarily-remote relationships that evade existing methodologies, and suggests a mostly-continuous view of fold space, a view that extends beyond simple geometric similarity, towards the realm of integrated sequence ↔ structure ↔ function properties.
Affiliation(s)
- Eli J Draizen
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
- Stella Veretnik
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
- Cameron Mura
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
- Philip E Bourne
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA

3
Rosenberg AA, Yehishalom N, Marx A, Bronstein AM. An amino-domino model described by a cross-peptide-bond Ramachandran plot defines amino acid pairs as local structural units. Proc Natl Acad Sci U S A 2023; 120:e2301064120. [PMID: 37878722 PMCID: PMC10623034 DOI: 10.1073/pnas.2301064120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Accepted: 08/24/2023] [Indexed: 10/27/2023] Open
Abstract
Protein structure, both at the global and local level, dictates function. Proteins fold from chains of amino acids, forming secondary structures, α-helices and β-strands, that, at least for globular proteins, subsequently fold into a three-dimensional structure. Here, we show that a Ramachandran-type plot focusing on the two dihedral angles separated by the peptide bond, and entirely contained within an amino acid pair, defines a local structural unit. We further demonstrate the usefulness of this cross-peptide-bond Ramachandran plot by showing that it captures β-turn conformations in coil regions, that traditional Ramachandran plot outliers fall into occupied regions of our plot, and that thermophilic proteins prefer specific amino acid pair conformations. Further, we demonstrate experimentally that the effect of a point mutation on backbone conformation and protein stability depends on the amino acid pair context, i.e., the identity of the adjacent amino acid, in a manner predictable by our method.
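As a rough illustration of the plot described in this abstract, the sketch below pairs the two backbone dihedrals that flank a peptide bond. It reads "the two dihedral angles separated by the peptide bond" as ψ of residue i and φ of residue i+1, which is an interpretation of the abstract rather than the authors' code; the function names and the residue dictionary layout (`"N"`, `"CA"`, `"C"` coordinate tuples) are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))

def cross_bond_angles(residues):
    """Return one (psi_i, phi_{i+1}) pair per consecutive residue pair.

    `residues` is a list of dicts with "N", "CA" and "C" coordinates
    (a hypothetical layout chosen for this sketch).
    """
    pairs = []
    for cur, nxt in zip(residues, residues[1:]):
        psi = dihedral(cur["N"], cur["CA"], cur["C"], nxt["N"])
        phi_next = dihedral(cur["C"], nxt["N"], nxt["CA"], nxt["C"])
        pairs.append((psi, phi_next))
    return pairs
```

A chain of n residues yields n-1 such pairs, one per peptide bond, whereas the traditional plot pairs φ and ψ of the same residue.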
Affiliation(s)
- Aviv A. Rosenberg
  - Department of Computer Science, Technion–Israel Institute of Technology, Haifa 32000, Israel
- Nitsan Yehishalom
  - Faculty of Biology, Technion–Israel Institute of Technology, Haifa 32000, Israel
- Ailie Marx
  - Department of Computer Science, Technion–Israel Institute of Technology, Haifa 32000, Israel
- Alex M. Bronstein
  - Department of Computer Science, Technion–Israel Institute of Technology, Haifa 32000, Israel

4
Tsuboyama K, Dauparas J, Chen J, Laine E, Mohseni Behbahani Y, Weinstein JJ, Mangan NM, Ovchinnikov S, Rocklin GJ. Mega-scale experimental analysis of protein folding stability in biology and design. Nature 2023; 620:434-444. [PMID: 37468638 PMCID: PMC10412457 DOI: 10.1038/s41586-023-06328-6] [Citation(s) in RCA: 77] [Impact Index Per Article: 77.0] [Received: 01/05/2023] [Accepted: 06/14/2023] [Indexed: 07/21/2023]
Abstract
Advances in DNA sequencing and machine learning are providing insights into protein sequences and structures on an enormous scale [1]. However, the energetics driving folding are invisible in these structures and remain largely unknown [2]. The hidden thermodynamics of folding can drive disease [3,4], shape protein evolution [5-7] and guide protein engineering [8-10], and new approaches are needed to reveal these thermodynamics for every sequence and structure. Here we present cDNA display proteolysis, a method for measuring thermodynamic folding stability for up to 900,000 protein domains in a one-week experiment. From 1.8 million measurements in total, we curated a set of around 776,000 high-quality folding stabilities covering all single amino acid variants and selected double mutants of 331 natural and 148 de novo designed protein domains 40-72 amino acids in length. Using this extensive dataset, we quantified (1) environmental factors influencing amino acid fitness, (2) thermodynamic couplings (including unexpected interactions) between protein sites, and (3) the global divergence between evolutionary amino acid usage and protein folding stability. We also examined how our approach could identify stability determinants in designed proteins and evaluate design methods. The cDNA display proteolysis method is fast, accurate and uniquely scalable, and promises to reveal the quantitative rules for how amino acid sequences encode folding stability.
Affiliation(s)
- Kotaro Tsuboyama
  - Department of Pharmacology, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
  - Center for Synthetic Biology, Northwestern University, Evanston, IL, USA
  - PRESTO, Japan Science and Technology Agency, Tokyo, Japan
  - Institute of Industrial Science, The University of Tokyo, Tokyo, Japan
- Justas Dauparas
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
- Jonathan Chen
  - Department of Pharmacology, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
  - Center for Synthetic Biology, Northwestern University, Evanston, IL, USA
  - McCormick School of Engineering, Northwestern University, Evanston, IL, USA
- Elodie Laine
  - Sorbonne Université, CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Paris, France
- Yasser Mohseni Behbahani
  - Sorbonne Université, CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Paris, France
- Jonathan J Weinstein
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
- Niall M Mangan
  - Center for Synthetic Biology, Northwestern University, Evanston, IL, USA
  - Department of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston, IL, USA
- Sergey Ovchinnikov
  - John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, USA
- Gabriel J Rocklin
  - Department of Pharmacology, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
  - Center for Synthetic Biology, Northwestern University, Evanston, IL, USA

5
Li AJ, Lu M, Desta I, Sundar V, Grigoryan G, Keating AE. Neural network-derived Potts models for structure-based protein design using backbone atomic coordinates and tertiary motifs. Protein Sci 2023; 32:e4554. [PMID: 36564857 PMCID: PMC9854172 DOI: 10.1002/pro.4554] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2022] [Revised: 11/15/2022] [Accepted: 12/20/2022] [Indexed: 12/25/2022]
Abstract
Designing novel proteins to perform desired functions, such as binding or catalysis, is a major goal in synthetic biology. A variety of computational approaches can aid in this task. An energy-based framework rooted in the sequence-structure statistics of tertiary motifs (TERMs) can be used for sequence design on predefined backbones. Neural network models that use backbone coordinate-derived features provide another way to design new proteins. In this work, we combine the two methods to make neural structure-based models more suitable for protein design. Specifically, we supplement backbone-coordinate features with TERM-derived data, as inputs, and we generate energy functions as outputs. We present two architectures that generate Potts models over the sequence space: TERMinator, which uses both TERM-based and coordinate-based information, and COORDinator, which uses only coordinate-based information. Using these two models, we demonstrate that TERMs can be utilized to improve native sequence recovery performance of neural models. Furthermore, we demonstrate that sequences designed by TERMinator are predicted to fold to their target structures by AlphaFold. Finally, we show that both TERMinator and COORDinator learn notions of energetics, and these methods can be fine-tuned on experimental data to improve predictions. Our results suggest that using TERM-based and coordinate-based features together may be beneficial for protein design and that structure-based neural models that produce Potts energy tables have utility for flexible applications in protein science.
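To make the notion of a Potts model over sequence space concrete, here is a minimal sketch of how such an energy table scores a sequence. It assumes the common textbook form, a sum of per-position terms plus pairwise terms, and is not taken from TERMinator or COORDinator; the names `potts_energy`, `lowest_energy_sequence`, `h` and `J` and the toy numbers are hypothetical.

```python
from itertools import product

def potts_energy(seq, h, J):
    """Energy of `seq` under a Potts model (assumed textbook form).

    seq: list of state indices, one per position
    h:   h[i][a] is the self energy of state a at position i
    J:   dict mapping (i, j) with i < j to a table J[(i, j)][a][b]
    """
    energy = sum(h[i][a] for i, a in enumerate(seq))
    for (i, j), table in J.items():
        energy += table[seq[i]][seq[j]]
    return energy

def lowest_energy_sequence(h, J):
    """Brute-force minimum; only feasible for tiny toy problems."""
    n_states = [len(row) for row in h]
    candidates = product(*(range(n) for n in n_states))
    return min(candidates, key=lambda s: potts_energy(list(s), h, J))

# Toy table: two positions, two states each (made-up values).
h = [[0.0, 1.0], [0.5, -0.5]]
J = {(0, 1): [[0.0, 2.0], [-1.5, 0.0]]}
best = lowest_energy_sequence(h, J)
```

Real design problems have 20 states per position and far too many sequences to enumerate, so a search or sampling procedure would replace the brute-force step.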
Affiliation(s)
- Alex J. Li
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Mindren Lu
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Israel Desta
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Vikram Sundar
  - Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Gevorg Grigoryan
  - Department of Computer Science, Dartmouth College, Hanover, New Hampshire, USA
- Amy E. Keating
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
  - Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

6
Delaunay M, Ha-Duong T. Computational Tools and Strategies to Develop Peptide-Based Inhibitors of Protein-Protein Interactions. Methods in Molecular Biology (Clifton, N.J.) 2022; 2405:205-230. [PMID: 35298816 DOI: 10.1007/978-1-0716-1855-4_11] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 11/30/2022]
Abstract
Protein-protein interactions play crucial and subtle roles in many biological processes and modifications of their fine mechanisms generally result in severe diseases. Peptide derivatives are very promising therapeutic agents for modulating protein-protein associations with sizes and specificities between those of small compounds and antibodies. For the same reasons, rational design of peptide-based inhibitors naturally borrows and combines computational methods from both protein-ligand and protein-protein research fields. In this chapter, we aim to provide an overview of computational tools and approaches used for identifying and optimizing peptides that target protein-protein interfaces with high affinity and specificity. We hope that this review will help to implement appropriate in silico strategies for peptide-based drug design that builds on available information for the systems of interest.
Affiliation(s)
- Tâp Ha-Duong
  - Université Paris-Saclay, CNRS, BioCIS, Châtenay-Malabry, France

7
Boral A, Khamaru M, Mitra D. Designing synthetic transcription factors: A structural perspective. Advances in Protein Chemistry and Structural Biology 2022; 130:245-287. [PMID: 35534109 DOI: 10.1016/bs.apcsb.2021.12.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
In this chapter, we discuss different design strategies of synthetic proteins, especially synthetic transcription factors. Design and engineering of synthetic transcription factors is particularly relevant for the need-based manipulation of gene expression. With recent advances in structural biology techniques and with the emergence of other precision biochemical/physical tools, accurate knowledge on structure-function relations is increasingly becoming available. Besides discussing the underlying principles of design, we go through individual cases, especially those involving four major groups of transcription factors: basic leucine zippers, zinc fingers, helix-turn-helix and homeodomains. We further discuss how synthetic biology can come together with structural biology to alter the genetic blueprint of life.
Affiliation(s)
- Aparna Boral
  - Department of Life Sciences, Presidency University, Kolkata, West Bengal, India
- Madhurima Khamaru
  - Department of Life Sciences, Presidency University, Kolkata, West Bengal, India
- Devrani Mitra
  - Department of Life Sciences, Presidency University, Kolkata, West Bengal, India

8
Frappier V, Keating AE. Data-driven computational protein design. Curr Opin Struct Biol 2021; 69:63-69. [PMID: 33910104 DOI: 10.1016/j.sbi.2021.03.009] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 12/06/2020] [Revised: 03/18/2021] [Accepted: 03/19/2021] [Indexed: 01/28/2023]
Abstract
Computational protein design can generate proteins not found in nature that adopt desired structures and perform novel functions. Although proteins could, in theory, be designed with ab initio methods, practical success has come from using large amounts of data that describe the sequences, structures, and functions of existing proteins and their variants. We present recent creative uses of multiple-sequence alignments, protein structures, and high-throughput functional assays in computational protein design. Approaches range from enhancing structure-based design with experimental data to building regression models to training deep neural nets that generate novel sequences. Looking ahead, deep learning will be increasingly important for maximizing the value of data for protein design.
Affiliation(s)
- Vincent Frappier
  - Generate Biomedicines, 26 Landsdowne Street, Cambridge, MA, 02139, USA
- Amy E Keating
  - MIT Departments of Biology and Biological Engineering, 77 Massachusetts Ave., Cambridge, MA, 02139, USA

9
Norn C, Wicky BIM, Juergens D, Liu S, Kim D, Tischer D, Koepnick B, Anishchenko I, Baker D, Ovchinnikov S. Protein sequence design by conformational landscape optimization. Proc Natl Acad Sci U S A 2021; 118:e2017228118. [PMID: 33712545 PMCID: PMC7980421 DOI: 10.1073/pnas.2017228118] [Citation(s) in RCA: 77] [Impact Index Per Article: 25.7] [Indexed: 12/17/2022] Open
Abstract
The protein design problem is to identify an amino acid sequence that folds to a desired structure. Given Anfinsen's thermodynamic hypothesis of folding, this can be recast as finding an amino acid sequence for which the desired structure is the lowest energy state. As this calculation involves not only all possible amino acid sequences but also, all possible structures, most current approaches focus instead on the more tractable problem of finding the lowest-energy amino acid sequence for the desired structure, often checking by protein structure prediction in a second step that the desired structure is indeed the lowest-energy conformation for the designed sequence, and typically discarding a large fraction of designed sequences for which this is not the case. Here, we show that by backpropagating gradients through the transform-restrained Rosetta (trRosetta) structure prediction network from the desired structure to the input amino acid sequence, we can directly optimize over all possible amino acid sequences and all possible structures in a single calculation. We find that trRosetta calculations, which consider the full conformational landscape, can be more effective than Rosetta single-point energy estimations in predicting folding and stability of de novo designed proteins. We compare sequence design by conformational landscape optimization with the standard energy-based sequence design methodology in Rosetta and show that the former can result in energy landscapes with fewer alternative energy minima. We show further that more funneled energy landscapes can be designed by combining the strengths of the two approaches: the low-resolution trRosetta model serves to disfavor alternative states, and the high-resolution Rosetta model serves to create a deep energy minimum at the design target structure.
Affiliation(s)
- Christoffer Norn
  - Department of Biochemistry, University of Washington, Seattle, WA 98105
  - Institute for Protein Design, University of Washington, Seattle, WA 98105
- Basile I M Wicky
  - Department of Biochemistry, University of Washington, Seattle, WA 98105
  - Institute for Protein Design, University of Washington, Seattle, WA 98105
- David Juergens
  - Department of Biochemistry, University of Washington, Seattle, WA 98105
  - Institute for Protein Design, University of Washington, Seattle, WA 98105
  - Graduate Program in Molecular Engineering, University of Washington, Seattle, WA 98105
- Sirui Liu
  - Faculty of Arts and Sciences, Division of Science, Harvard University, Cambridge, MA 02138
- David Kim
  - Department of Biochemistry, University of Washington, Seattle, WA 98105
  - Institute for Protein Design, University of Washington, Seattle, WA 98105
- Doug Tischer
  - Department of Biochemistry, University of Washington, Seattle, WA 98105
  - Institute for Protein Design, University of Washington, Seattle, WA 98105
- Brian Koepnick
  - Department of Biochemistry, University of Washington, Seattle, WA 98105
  - Institute for Protein Design, University of Washington, Seattle, WA 98105
- Ivan Anishchenko
  - Department of Biochemistry, University of Washington, Seattle, WA 98105
  - Institute for Protein Design, University of Washington, Seattle, WA 98105
- David Baker
  - Department of Biochemistry, University of Washington, Seattle, WA 98105
  - Institute for Protein Design, University of Washington, Seattle, WA 98105
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA 98105
- Sergey Ovchinnikov
  - Faculty of Arts and Sciences, Division of Science, Harvard University, Cambridge, MA 02138
  - John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA 02138

10
Pan X, Kortemme T. Recent advances in de novo protein design: Principles, methods, and applications. J Biol Chem 2021; 296:100558. [PMID: 33744284 PMCID: PMC8065224 DOI: 10.1016/j.jbc.2021.100558] [Citation(s) in RCA: 93] [Impact Index Per Article: 31.0] [Received: 01/13/2021] [Revised: 03/12/2021] [Accepted: 03/16/2021] [Indexed: 02/06/2023] Open
Abstract
The computational de novo protein design is increasingly applied to address a number of key challenges in biomedicine and biological engineering. Successes in expanding applications are driven by advances in design principles and methods over several decades. Here, we review recent innovations in major aspects of the de novo protein design and include how these advances were informed by principles of protein architecture and interactions derived from the wealth of structures in the Protein Data Bank. We describe developments in de novo generation of designable backbone structures, optimization of sequences, design scoring functions, and the design of the function. The advances not only highlight design goals reachable now but also point to the challenges and opportunities for the future of the field.
Affiliation(s)
- Xingjie Pan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, USA
  - UC Berkeley - UCSF Graduate Program in Bioengineering, University of California San Francisco, San Francisco, California, USA
- Tanja Kortemme
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, USA
  - UC Berkeley - UCSF Graduate Program in Bioengineering, University of California San Francisco, San Francisco, California, USA
  - Quantitative Biosciences Institute (QBI), University of California San Francisco, San Francisco, California, USA

11
Zhou J, Panaitiu AE, Grigoryan G. A general-purpose protein design framework based on mining sequence-structure relationships in known protein structures. Proc Natl Acad Sci U S A 2020; 117:1059-1068. [PMID: 31892539 PMCID: PMC6969538 DOI: 10.1073/pnas.1908723117] [Citation(s) in RCA: 61] [Impact Index Per Article: 15.3] [Indexed: 12/23/2022] Open
Abstract
Current state-of-the-art approaches to computational protein design (CPD) aim to capture the determinants of structure from physical principles. While this has led to many successful designs, it does have strong limitations associated with inaccuracies in physical modeling, such that a reliable general solution to CPD has yet to be found. Here, we propose a design framework, one based on identifying and applying patterns of sequence-structure compatibility found in known proteins, rather than approximating them from models of interatomic interactions. We carry out extensive computational analyses and an experimental validation for our method. Our results strongly argue that the Protein Data Bank is now sufficiently large to enable proteins to be designed by using only examples of structural motifs from unrelated proteins. Because our method is likely to have orthogonal strengths relative to existing techniques, it could represent an important step toward removing remaining barriers to robust CPD.
Affiliation(s)
- Jianfu Zhou
  - Department of Computer Science, Dartmouth College, Hanover, NH 03755
- Gevorg Grigoryan
  - Department of Computer Science, Dartmouth College, Hanover, NH 03755
  - Department of Biological Sciences, Dartmouth College, Hanover, NH 03755

12
Frappier V, Jenson JM, Zhou J, Grigoryan G, Keating AE. Tertiary Structural Motif Sequence Statistics Enable Facile Prediction and Design of Peptides that Bind Anti-apoptotic Bfl-1 and Mcl-1. Structure 2019; 27:606-617.e5. [PMID: 30773399 PMCID: PMC6447450 DOI: 10.1016/j.str.2019.01.008] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 11/04/2018] [Revised: 12/20/2018] [Accepted: 01/18/2019] [Indexed: 12/25/2022]
Abstract
Understanding the relationship between protein sequence and structure well enough to design new proteins with desired functions is a longstanding goal in protein science. Here, we show that recurring tertiary structural motifs (TERMs) in the PDB provide rich information for protein-peptide interaction prediction and design. TERM statistics can be used to predict peptide binding energies for Bcl-2 family proteins as accurately as widely used structure-based tools. Furthermore, design using TERM energies (dTERMen) rapidly and reliably generates high-affinity peptide binders of anti-apoptotic proteins Bfl-1 and Mcl-1 with just 15%-38% sequence identity to any known native Bcl-2 family protein ligand. High-resolution structures of four designed peptides bound to their targets provide opportunities to analyze the strengths and limitations of the computational design method. Our results support dTERMen as a powerful approach that can complement existing tools for protein engineering.
Affiliation(s)
- Vincent Frappier
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Justin M Jenson
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Jianfu Zhou
  - Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA
- Gevorg Grigoryan
  - Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA
  - Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, USA
  - Department of Biological Sciences, Dartmouth College, Hanover, NH 03755, USA
- Amy E Keating
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  - Koch Center for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

13
Marcos E, Silva D. Essentials of de novo protein design: Methods and applications. Wiley Interdisciplinary Reviews: Computational Molecular Science 2018. [DOI: 10.1002/wcms.1374] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Indexed: 12/28/2022]
Affiliation(s)
- Enrique Marcos
  - Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Daniel-Adriano Silva
  - Department of Biochemistry, University of Washington, Seattle, Washington
  - Institute for Protein Design, University of Washington, Seattle, Washington

14
Holland J, Pan Q, Grigoryan G. Contact prediction is hardest for the most informative contacts, but improves with the incorporation of contact potentials. PLoS One 2018; 13:e0199585. [PMID: 29953468 PMCID: PMC6023208 DOI: 10.1371/journal.pone.0199585] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 11/21/2017] [Accepted: 06/11/2018] [Indexed: 11/18/2022] Open
Abstract
Co-evolution between pairs of residues in a multiple sequence alignment (MSA) of homologous proteins has long been proposed as an indicator of structural contacts. Recently, several methods, such as direct-coupling analysis (DCA) and MetaPSICOV, have been shown to achieve impressive rates of contact prediction by taking advantage of considerable sequence data. In this paper, we show that prediction success rates are highly sensitive to the structural definition of a contact, with more permissive definitions (i.e., those classifying more pairs as true contacts) naturally leading to higher positive predictive rates, but at the expense of the amount of structural information contributed by each contact. Thus, the remaining limitations of contact prediction algorithms are most noticeable in conjunction with geometrically restrictive contacts—precisely those that contribute more information in structure prediction. We suggest that to improve prediction rates for such “informative” contacts one could combine co-evolution scores with additional indicators of contact likelihood. Specifically, we find that when a pair of co-varying positions in an MSA is occupied by residue pairs with favorable statistical contact energies, that pair is more likely to represent a true contact. We show that combining a contact potential metric with DCA or MetaPSICOV performs considerably better than DCA or MetaPSICOV alone, respectively. This is true regardless of contact definition, but especially true for stricter and more informative contact definitions. In summary, this work outlines some remaining challenges to be addressed in contact prediction and proposes and validates a promising direction towards improvement.
Affiliation(s)
- Jack Holland
- Department of Computer Science, Dartmouth College, Hanover, NH 03755, United States of America
| | - Qinxin Pan
- Department of Computer Science, Dartmouth College, Hanover, NH 03755, United States of America
| | - Gevorg Grigoryan
- Department of Computer Science, Dartmouth College, Hanover, NH 03755, United States of America
- Department of Biological Sciences, Dartmouth College, Hanover, NH 03755, United States of America
| |
|
15
|
Rocklin GJ, Chidyausiku TM, Goreshnik I, Ford A, Houliston S, Lemak A, Carter L, Ravichandran R, Mulligan VK, Chevalier A, Arrowsmith CH, Baker D. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 2017; 357:168-175. [PMID: 28706065 PMCID: PMC5568797 DOI: 10.1126/science.aan0693] [Citation(s) in RCA: 296] [Impact Index Per Article: 49.3] [Received: 02/28/2017] [Accepted: 06/09/2017] [Indexed: 12/18/2022]
Abstract
Proteins fold into unique native structures stabilized by thousands of weak interactions that collectively overcome the entropic cost of folding. Although these forces are "encoded" in the thousands of known protein structures, "decoding" them is challenging because of the complexity of natural proteins that have evolved for function, not stability. We combined computational protein design, next-generation gene synthesis, and a high-throughput protease susceptibility assay to measure folding and stability for more than 15,000 de novo designed miniproteins, 1000 natural proteins, 10,000 point mutants, and 30,000 negative control sequences. This analysis identified more than 2500 stable designed proteins in four basic folds, a number sufficient to enable us to systematically examine how sequence determines folding and stability in uncharted protein space. Iteration between design and experiment increased the design success rate from 6% to 47%, produced stable proteins unlike those found in nature for topologies where design was initially unsuccessful, and revealed subtle contributions to stability as designs became increasingly optimized. Our approach achieves the long-standing goal of a tight feedback cycle between computation and experiment and has the potential to transform computational protein design into a data-driven science.
Affiliation(s)
- Gabriel J Rocklin
- Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Tamuka M Chidyausiku
- Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, WA 98195, USA; Graduate Program in Biological Physics, Structure, and Design, University of Washington, Seattle, WA 98195, USA
| | - Inna Goreshnik
- Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Alex Ford
- Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, WA 98195, USA; Graduate Program in Biological Physics, Structure, and Design, University of Washington, Seattle, WA 98195, USA
| | - Scott Houliston
- Princess Margaret Cancer Centre, Toronto, Ontario M5G 1L7, Canada; Structural Genomics Consortium, University of Toronto, Toronto, Ontario M5G 1L7, Canada
| | - Alexander Lemak
- Princess Margaret Cancer Centre, Toronto, Ontario M5G 1L7, Canada
| | - Lauren Carter
- Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Rashmi Ravichandran
- Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Vikram K Mulligan
- Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Aaron Chevalier
- Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Cheryl H Arrowsmith
- Princess Margaret Cancer Centre, Toronto, Ontario M5G 1L7, Canada; Structural Genomics Consortium, University of Toronto, Toronto, Ontario M5G 1L7, Canada; Department of Medical Biophysics, University of Toronto, Toronto, Ontario M5G 1L7, Canada
| | - David Baker
- Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, WA 98195, USA; Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
| |
|
16
|
Dang B, Wu H, Mulligan VK, Mravic M, Wu Y, Lemmin T, Ford A, Silva DA, Baker D, DeGrado WF. De novo design of covalently constrained mesosize protein scaffolds with unique tertiary structures. Proc Natl Acad Sci U S A 2017; 114:10852-10857. [PMID: 28973862 PMCID: PMC5642715 DOI: 10.1073/pnas.1710695114] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Indexed: 12/31/2022] Open
Abstract
The folding of natural proteins typically relies on hydrophobic packing, metal binding, or disulfide bond formation in the protein core. Alternatively, a 3D structure can be defined by incorporating a multivalent cross-linking agent, and this approach has been successfully developed for the selection of bicyclic peptides from large random-sequence libraries. By contrast, there is no general method for the de novo computational design of multicross-linked proteins with predictable and well-defined folds, including ones not found in nature. Here we use Rosetta and Tertiary Motifs (TERMs) to design small proteins that fold around multivalent cross-linkers. The hydrophobic cross-linkers stabilize the fold by macrocyclic restraints, and they also form an integral part of a small apolar core. The designed CovCore proteins were prepared by chemical synthesis, and their structures were determined by solution NMR or X-ray crystallography. These mesosized proteins, lying between conventional proteins and small peptides, are easily accessible either through biosynthetic precursors or chemical synthesis. The unique tertiary structures and ease of synthesis of CovCore proteins indicate that they should provide versatile templates for developing inhibitors of protein-protein interactions.
Affiliation(s)
- Bobo Dang
- Department of Pharmaceutical Chemistry, University of California, San Francisco, CA 94158
| | - Haifan Wu
- Department of Pharmaceutical Chemistry, University of California, San Francisco, CA 94158
| | | | - Marco Mravic
- Department of Pharmaceutical Chemistry, University of California, San Francisco, CA 94158
| | - Yibing Wu
- Department of Pharmaceutical Chemistry, University of California, San Francisco, CA 94158
| | - Thomas Lemmin
- Department of Pharmaceutical Chemistry, University of California, San Francisco, CA 94158
| | - Alexander Ford
- Department of Biochemistry, University of Washington, Seattle, WA 98195
| | | | - David Baker
- Department of Biochemistry, University of Washington, Seattle, WA 98195
| | - William F DeGrado
- Department of Pharmaceutical Chemistry, University of California, San Francisco, CA 94158
| |
|
17
|
Mackenzie CO, Grigoryan G. Protein structural motifs in prediction and design. Curr Opin Struct Biol 2017; 44:161-167. [PMID: 28460216 PMCID: PMC5513761 DOI: 10.1016/j.sbi.2017.03.012] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 11/08/2016] [Revised: 03/18/2017] [Accepted: 03/28/2017] [Indexed: 01/11/2023]
Abstract
The Protein Data Bank (PDB) has been an integral resource for shaping our fundamental understanding of protein structure and for the advancement of such applications as protein design and structure prediction. Over the years, information from the PDB has been used to generate models ranging from specific structural mechanisms to general statistical potentials. With accumulating structural data, it has become possible to mine for more complete and complex structural observations, deducing more accurate generalizations. Motif libraries, which capture recurring structural features along with their sequence preferences, have exposed modularity in the structural universe and found successful application in various problems of structural biology. Here we summarize recent achievements in this arena, focusing on subdomain level structural patterns and their applications to protein design and structure prediction, and suggest promising future directions as the structural database continues to grow.
Affiliation(s)
- Craig O Mackenzie
- Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, United States
| | - Gevorg Grigoryan
- Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, United States; Department of Computer Science, Dartmouth College, Hanover, NH 03755, United States.
| |
|
18
|
Sequence statistics of tertiary structural motifs reflect protein stability. PLoS One 2017; 12:e0178272. [PMID: 28552940 PMCID: PMC5446159 DOI: 10.1371/journal.pone.0178272] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 02/23/2017] [Accepted: 05/10/2017] [Indexed: 11/19/2022] Open
Abstract
The Protein Data Bank (PDB) has been a key resource for learning general rules of sequence-structure relationships in proteins. Quantitative insights have been gained by defining geometric descriptors of structure (e.g., distances, dihedral angles, solvent exposure, etc.) and observing their distributions and sequence preferences. Here we argue that as the PDB continues to grow, it may become unnecessary to reduce structure into a set of elementary descriptors. Instead, it could be possible to deduce quantitative sequence-structure relationships in the context of precisely-defined complex structural motifs by mining the PDB for closely matching backbone geometries. To validate this idea, we turned to the task of predicting changes in protein stability upon amino-acid substitution, a difficult problem of broad significance. We defined non-contiguous tertiary motifs (TERMs) around a protein site of interest and extracted sequence preferences from ensembles of closely-matching substructures in the PDB to predict mutational stability changes at the site, ΔΔGm. We demonstrate that these ensemble statistics predict ΔΔGm on par with state-of-the-art statistical and machine-learning methods on large thermodynamic datasets, and outperform these, along with a leading structure-based modeling approach, when tested in the context of unbiased diverse mutations. Further, we show that the performance of the TERM-based method is directly related to the amount of available relevant structural data, automatically improving with the growing PDB. This enables a means of estimating prediction accuracy. Our results clearly demonstrate that: 1) statistics of non-contiguous structural motifs in the PDB encode fundamental sequence-structure relationships related to protein thermodynamic stability, and 2) the PDB is now large enough that such statistics are already useful in practice, with their accuracy expected to continue increasing as the database grows. These observations suggest new ways of using structural data towards addressing problems of computational structural biology.
|
19
|
Abstract
Here, we systematically decompose the known protein structural universe into its basic elements, which we dub tertiary structural motifs (TERMs). A TERM is a compact backbone fragment that captures the secondary, tertiary, and quaternary environments around a given residue, comprising one or more disjoint segments (three on average). We seek the set of universal TERMs that capture all structure in the Protein Data Bank (PDB), finding remarkable degeneracy. Only ∼600 TERMs are sufficient to describe 50% of the PDB at sub-Angstrom resolution. However, rarer geometries also exist, and the overall structural coverage grows logarithmically with the number of TERMs. We go on to show that universal TERMs provide an effective mapping between sequence and structure. We demonstrate that TERM-based statistics alone are sufficient to recapitulate close-to-native sequences given either NMR or X-ray backbones. Furthermore, sequence variability predicted from TERM data agrees closely with evolutionary variation. Finally, locations of TERMs in protein chains can be predicted from sequence alone based on sequence signatures emergent from TERM instances in the PDB. For multisegment motifs, this method identifies spatially adjacent fragments that are not contiguous in sequence, a major bottleneck in structure prediction. Although all TERMs recur in diverse proteins, some appear specialized for certain functions, such as interface formation, metal coordination, or even water binding. Structural biology has benefited greatly from previously observed degeneracies in structure. The decomposition of the known structural universe into a finite set of compact TERMs offers exciting opportunities toward better understanding, design, and prediction of protein structure.
|
20
|
A topological and conformational stability alphabet for multipass membrane proteins. Nat Chem Biol 2016; 12:167-73. [PMID: 26780406 DOI: 10.1038/nchembio.2001] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 06/11/2015] [Accepted: 11/13/2015] [Indexed: 12/27/2022]
Abstract
Multipass membrane proteins perform critical signal transduction and transport across membranes. How transmembrane helix (TMH) sequences encode the topology and conformational flexibility regulating these functions remains poorly understood. Here we describe a comprehensive analysis of the sequence-structure relationships at multiple interacting TMHs from all membrane proteins with structures in the Protein Data Bank (PDB). We found that membrane proteins can be deconstructed in interacting TMH trimer units, which mostly fold into six distinct structural classes of topologies and conformations. Each class is enriched in recurrent sequence motifs from functionally unrelated proteins, revealing unforeseen consensus and evolutionary conserved networks of stabilizing interhelical contacts. Interacting TMHs' topology and local protein conformational flexibility were remarkably well predicted in a blinded fashion from the identified binding-hotspot motifs. Our results reveal universal sequence-structure principles governing the complex anatomy and plasticity of multipass membrane proteins that may guide de novo structure prediction, design, and studies of folding and dynamics.
|