1
Wells J, Hawkins-Hooker A, Bordin N, Sillitoe I, Paige B, Orengo C. Chainsaw: protein domain segmentation with fully convolutional neural networks. Bioinformatics 2024; 40:btae296. [PMID: 38718225 DOI: 10.1093/bioinformatics/btae296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Revised: 03/23/2024] [Accepted: 05/07/2024] [Indexed: 05/23/2024]
Abstract
MOTIVATION Protein domains are fundamental units of protein structure and play a pivotal role in understanding folding, function, evolution, and design. The advent of accurate structure prediction techniques has resulted in an influx of new structural data, making the partitioning of these structures into domains essential for inferring evolutionary relationships and functional classification. RESULTS This article presents Chainsaw, a supervised learning approach to domain parsing that achieves accuracy that surpasses current state-of-the-art methods. Chainsaw uses a fully convolutional neural network which is trained to predict the probability that each pair of residues is in the same domain. Domain predictions are then derived from these pairwise predictions using an algorithm that searches for the most likely assignment of residues to domains given the set of pairwise co-membership probabilities. Chainsaw matches CATH domain annotations in 78% of protein domains versus 72% for the next closest method. When predicting on AlphaFold models, expert human evaluators were twice as likely to prefer Chainsaw's predictions versus the next best method. AVAILABILITY AND IMPLEMENTATION github.com/JudeWells/Chainsaw.
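The step from pairwise co-membership probabilities to a domain assignment can be illustrated with a minimal sketch. It is not Chainsaw's actual search algorithm: it assumes the pairwise probabilities are independent, scores an assignment by its log-likelihood, and enumerates every labelling exhaustively, which is only feasible for toy inputs. The function names and the `max_domains` parameter are hypothetical.

```python
import math
from itertools import product


def assignment_log_likelihood(labels, prob):
    """Log-likelihood of a labelling, treating each residue pair as independent.

    prob[i][j] is the predicted probability that residues i and j share a domain.
    """
    eps = 1e-9  # clamp to avoid log(0)
    total = 0.0
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            p = min(max(prob[i][j], eps), 1.0 - eps)
            if labels[i] == labels[j]:
                total += math.log(p)
            else:
                total += math.log(1.0 - p)
    return total


def best_assignment(prob, max_domains=2):
    """Exhaustive search over labellings (toy sizes only); ties keep the first found."""
    n = len(prob)
    best_labels, best_score = None, -math.inf
    for labels in product(range(max_domains), repeat=n):
        score = assignment_log_likelihood(labels, prob)
        if score > best_score:
            best_labels, best_score = labels, score
    return list(best_labels), best_score
```

For a four-residue toy matrix in which residues 0 and 1 and residues 2 and 3 have high co-membership probability and all other pairs have low probability, `best_assignment` returns a two-domain split. A practical method would need a far more efficient search than enumeration.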
Affiliation(s)
- Jude Wells
  - Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
- Alex Hawkins-Hooker
  - Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
- Brooks Paige
  - Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom

2
Lau AM, Kandathil SM, Jones DT. Merizo: a rapid and accurate protein domain segmentation method using invariant point attention. Nat Commun 2023; 14:8445. [PMID: 38114456 PMCID: PMC10730818 DOI: 10.1038/s41467-023-43934-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Accepted: 11/24/2023] [Indexed: 12/21/2023] Open
Abstract
The AlphaFold Protein Structure Database, containing predictions for over 200 million proteins, has been met with enthusiasm over its potential in enriching structural biological research and beyond. Currently, access to the database is precluded by an urgent need for tools that allow the efficient traversal, discovery, and documentation of its contents. Identifying domain regions in the database is a non-trivial endeavour and doing so will aid our understanding of protein structure and function, while facilitating drug discovery and comparative genomics. Here, we describe a deep learning method for domain segmentation called Merizo, which learns to cluster residues into domains in a bottom-up manner. Merizo is trained on CATH domains and fine-tuned on AlphaFold2 models via self-distillation, enabling it to be applied to both experimental and AlphaFold2 models. As proof of concept, we apply Merizo to the human proteome, identifying 40,818 putative domains that can be matched to CATH representative domains.
Affiliation(s)
- Andy M Lau
  - Department of Computer Science, University College London, London, WC1E 6BT, UK
- Shaun M Kandathil
  - Department of Computer Science, University College London, London, WC1E 6BT, UK
- David T Jones
  - Department of Computer Science, University College London, London, WC1E 6BT, UK

3
Zhu K, Su H, Peng Z, Yang J. A unified approach to protein domain parsing with inter-residue distance matrix. Bioinformatics 2023; 39:7025502. [PMID: 36734597 PMCID: PMC9919455 DOI: 10.1093/bioinformatics/btad070] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/15/2022] [Revised: 01/02/2023] [Accepted: 02/01/2023] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION It is fundamental to cut multi-domain proteins into individual domains, for precise domain-based structural and functional studies. In the past, sequence-based and structure-based domain parsing was carried out independently with different methodologies. The recent progress in deep learning-based protein structure prediction provides the opportunity to unify sequence-based and structure-based domain parsing. RESULTS Based on the inter-residue distance matrix, which can be either derived from the input structure or predicted by trRosettaX, we can decode the domain boundaries under a unified framework. We name the proposed method UniDoc. The principle of UniDoc is based on the well-accepted physical concept of maximizing intra-domain interaction while minimizing inter-domain interaction. Comprehensive tests on five benchmark datasets indicate that UniDoc outperforms other state-of-the-art methods in terms of both accuracy and speed, for both sequence-based and structure-based domain parsing. The major contribution of UniDoc is providing a unified framework for structure-based and sequence-based domain parsing. We hope that UniDoc would be a convenient tool for protein domain analysis. AVAILABILITY AND IMPLEMENTATION https://yanglab.nankai.edu.cn/UniDoc/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
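The principle of maximizing intra-domain interaction while minimizing inter-domain interaction can be sketched on an inter-residue distance matrix. The code below is an illustration under stated assumptions, not UniDoc's algorithm: it considers only a single cut into two contiguous segments, defines a contact as a distance below a hypothetical 8 Å cutoff, and picks the cut with the fewest inter-segment contacts. With one cut the total contact count is fixed, so minimizing inter-segment contacts also maximizes intra-segment contacts. The function names, `cutoff`, and `min_len` are all assumptions.

```python
def contact_map(dist, cutoff=8.0):
    """Binary contact map: 1 where two distinct residues are closer than cutoff."""
    n = len(dist)
    return [
        [1 if i != j and dist[i][j] < cutoff else 0 for j in range(n)]
        for i in range(n)
    ]


def best_two_domain_cut(dist, cutoff=8.0, min_len=2):
    """Return (cut, inter_contacts) for the split [0, cut) / [cut, n).

    Only cuts leaving at least min_len residues on each side are considered;
    ties keep the earliest cut. Returns (None, None) if no cut is allowed.
    """
    contacts = contact_map(dist, cutoff)
    n = len(contacts)
    best_cut, best_inter = None, None
    for cut in range(min_len, n - min_len + 1):
        # Count contacts that cross the proposed boundary.
        inter = sum(contacts[i][j] for i in range(cut) for j in range(cut, n))
        if best_inter is None or inter < best_inter:
            best_cut, best_inter = cut, inter
    return best_cut, best_inter
```

On a six-residue toy chain made of two compact three-residue clusters joined by one sequential neighbour contact, the function places the cut between the clusters. Real domain parsers must also handle discontinuous domains and more than two domains, which this sketch ignores.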
Affiliation(s)
- Kun Zhu
  - School of Mathematical Sciences, Nankai University, Tianjin 300071, China
- Hong Su
  - School of Mathematical Sciences, Nankai University, Tianjin 300071, China
- Zhenling Peng
  - Ministry of Education Frontiers Science Center for Nonlinear Expectations, Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
- Jianyi Yang
  - Ministry of Education Frontiers Science Center for Nonlinear Expectations, Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China

4
Chu AE, Fernandez D, Liu J, Eguchi RR, Huang PS. De Novo Design of a Highly Stable Ovoid TIM Barrel: Unlocking Pocket Shape towards Functional Design. BioDesign Research 2022; 2022:9842315. [PMID: 37850141 PMCID: PMC10521652 DOI: 10.34133/2022/9842315] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/15/2022] [Accepted: 05/26/2022] [Indexed: 10/19/2023] Open
Abstract
The ability to finely control the structure of protein folds is an important prerequisite to functional protein design. The TIM barrel fold is an important target for these efforts as it is highly enriched for diverse functions in nature. Although a TIM barrel protein has been designed de novo, the ability to finely alter the curvature of the central beta barrel and the overall architecture of the fold remains elusive, limiting its utility for functional design. Here, we report the de novo design of a TIM barrel with ovoid (twofold) symmetry, drawing inspiration from natural beta and TIM barrels with ovoid curvature. We use an autoregressive backbone sampling strategy to implement our hypothesis for elongated barrel curvature, followed by an iterative enrichment sequence design protocol to obtain sequences which yield a high proportion of successfully folding designs. Designed sequences are highly stable and fold to the designed barrel curvature as determined by a 2.1 Å resolution crystal structure. The designs show robustness to drastic mutations, retaining high melting temperatures even when multiple charged residues are buried in the hydrophobic core or when the hydrophobic core is ablated to alanine. As a scaffold with a greater capacity for hosting diverse hydrogen bonding networks and installation of binding pockets or active sites, the ovoid TIM barrel represents a major step towards the de novo design of functional TIM barrels.
Affiliation(s)
- Alexander E Chu
  - Biophysics Program, Stanford University, Stanford, CA, USA
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
- Daniel Fernandez
  - Program in Chemistry, Engineering, And Medicine for Human Health (ChEM-H), Stanford University, Stanford, CA, USA
  - Stanford ChEM-H, Macromolecular Structure Knowledge Center, Stanford University, Stanford, CA, USA
- Jingjia Liu
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
- Raphael R Eguchi
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
  - Stanford ChEM-H, Macromolecular Structure Knowledge Center, Stanford University, Stanford, CA, USA
  - Department of Biochemistry, Stanford University, Stanford, CA, USA
- Po-Ssu Huang
  - Biophysics Program, Stanford University, Stanford, CA, USA
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
  - Stanford ChEM-H, Macromolecular Structure Knowledge Center, Stanford University, Stanford, CA, USA
  - Bio-X Institute, Stanford University, Stanford, CA, USA

5
Eguchi RR, Choe CA, Huang PS. Ig-VAE: Generative modeling of protein structure by direct 3D coordinate generation. PLoS Comput Biol 2022; 18:e1010271. [PMID: 35759518 PMCID: PMC9269947 DOI: 10.1371/journal.pcbi.1010271] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Received: 02/15/2022] [Revised: 07/08/2022] [Accepted: 06/01/2022] [Indexed: 12/26/2022] Open
Abstract
While deep learning models have seen increasing applications in protein science, few have been implemented for protein backbone generation, an important task in structure-based problems such as active site and interface design. We present a new approach to building class-specific backbones, using a variational auto-encoder to directly generate the 3D coordinates of immunoglobulins. Our model is torsion- and distance-aware, learns a high-resolution embedding of the dataset, and generates novel, high-quality structures compatible with existing design tools. We show that the Ig-VAE can be used with Rosetta to create a computational model of a SARS-CoV2-RBD binder via latent space sampling. We further demonstrate that the model’s generative prior is a powerful tool for guiding computational protein design, motivating a new paradigm under which backbone design is solved as a constrained optimization problem in the latent space of a generative model.

Many essential biochemical processes are governed by protein-protein interactions (PPIs), and our ability to make binding proteins that modulate PPIs is crucial to the creation of therapeutics and the study of cell-signaling. One critical aspect of PPI design is to capture protein conformational flexibility. Deep generative models are a class of mathematical models that are able to synthesize novel data from a finite set of training examples. Here, we make advances in computational protein design methodology by developing a deep generative model that creates protein backbones adopting the immunoglobulin fold, which is found in natural binding proteins such as antibodies. While generative models have been powerful in tasks such as image generation, using them to create proteins has remained a challenge. We solve this problem with a new model that allows for the direct generation of novel 3D molecules and show that they are of high chemical accuracy. Generated structures work well with existing protein design methods such as Rosetta, providing access to a large collection of novel immunoglobulin structures. Finally, we present a new protein design framework, called “generative design,” that shows how deep generative models such as ours can be applied to virtually any protein design problem.
Affiliation(s)
- Raphael R. Eguchi
  - Department of Biochemistry, Stanford University, Stanford, California, United States of America
  - Department of Statistics, Stanford University, Stanford, California, United States of America
- Christian A. Choe
  - Department of Bioengineering, Stanford University, Stanford, California, United States of America
- Po-Ssu Huang
  - Department of Bioengineering, Stanford University, Stanford, California, United States of America

6
Rudden LSP, Hijazi M, Barth P. Deep learning approaches for conformational flexibility and switching properties in protein design. Front Mol Biosci 2022; 9:928534. [PMID: 36032687 PMCID: PMC9399439 DOI: 10.3389/fmolb.2022.928534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2022] [Accepted: 07/15/2022] [Indexed: 11/30/2022] Open
Abstract
Following the hugely successful application of deep learning methods to protein structure prediction, an increasing number of design methods seek to leverage generative models to design proteins with improved functionality over native proteins or novel structure and function. The inherent flexibility of proteins, from side-chain motion to larger conformational reshuffling, poses a challenge to design methods, where the ideal approach must consider both the spatial and temporal evolution of proteins in the context of their functional capacity. In this review, we highlight existing methods for protein design before discussing how methods at the forefront of deep learning-based design accommodate flexibility and where the field could evolve in the future.
Affiliation(s)
- Patrick Barth
  - *Correspondence: Lucas S. P. Rudden; Patrick Barth

7
Ovchinnikov S, Huang PS. Structure-based protein design with deep learning. Curr Opin Chem Biol 2021; 65:136-144. [PMID: 34547592 PMCID: PMC8671290 DOI: 10.1016/j.cbpa.2021.08.004] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 06/07/2021] [Accepted: 08/13/2021] [Indexed: 12/11/2022]
Abstract
Since the first revelation of proteins functioning as macromolecular machines through their three dimensional structures, researchers have been intrigued by the marvelous ways the biochemical processes are carried out by proteins. The aspiration to understand protein structures has fueled extensive efforts across different scientific disciplines. In recent years, it has been demonstrated that proteins with new functionality or shapes can be designed via structure-based modeling methods, and the design strategies have combined all available information - but largely piece-by-piece - from sequence derived statistics to the detailed atomic-level modeling of chemical interactions. Despite the significant progress, incorporating data-derived approaches through the use of deep learning methods can be a game changer. In this review, we summarize current progress, compare the arc of developing the deep learning approaches with the conventional methods, and describe the motivation and concepts behind current strategies that may lead to potential future opportunities.
Affiliation(s)
- Sergey Ovchinnikov
  - John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, 02138, USA
- Po-Ssu Huang
  - Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA

8
Gidley F, Parmeggiani F. Repeat proteins: designing new shapes and functions for solenoid folds. Curr Opin Struct Biol 2021; 68:208-214. [PMID: 33721772 DOI: 10.1016/j.sbi.2021.02.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/19/2020] [Revised: 01/31/2021] [Accepted: 02/01/2021] [Indexed: 10/21/2022]
Abstract
The modular nature of repeat proteins has inspired the design of regular and completely novel sequences and structures. Research in the past years has provided a broad set of design approaches and new repeat proteins that have found applications in molecular recognition, taking advantage of the natural ability of some of these families to bind proteins, peptides and nucleic acids. Here, we provide an overview on the recent trends in design of repeat proteins, particularly solenoid folds, and their applications. By exploiting the intrinsic modularity of repeats, new architectures have been designed that combine different types of repeat, are easily scalable by changing the number of repeats and can be quickly generated by using existing modular building blocks.
Affiliation(s)
- Frances Gidley
  - School of Chemistry, School of Biochemistry, Bristol Biodesign Institute, University of Bristol, United Kingdom
- Fabio Parmeggiani
  - School of Chemistry, School of Biochemistry, Bristol Biodesign Institute, University of Bristol, United Kingdom

9
Abstract
Proteins are molecular machines whose function depends on their ability to achieve complex folds with precisely defined structural and dynamic properties. The rational design of proteins from first-principles, or de novo, was once considered to be impossible, but today proteins with a variety of folds and functions have been realized. We review the evolution of the field from its earliest days, placing particular emphasis on how this endeavor has illuminated our understanding of the principles underlying the folding and function of natural proteins, and is informing the design of macromolecules with unprecedented structures and properties. An initial set of milestones in de novo protein design focused on the construction of sequences that folded in water and membranes to adopt folded conformations. The first proteins were designed from first-principles using very simple physical models. As computers became more powerful, the use of the rotamer approximation allowed one to discover amino acid sequences that stabilize the desired fold. As the crystallographic database of protein structures expanded in subsequent years, it became possible to construct proteins by assembling short backbone fragments that frequently recur in Nature. The second set of milestones in de novo design involves the discovery of complex functions. Proteins have been designed to bind a variety of metals, porphyrins, and other cofactors. The design of proteins that catalyze hydrolysis and oxygen-dependent reactions has progressed significantly. However, de novo design of catalysts for energetically demanding reactions, or even proteins that bind with high affinity and specificity to highly functionalized complex polar molecules, remains an important challenge that is now being achieved. Finally, protein design has contributed significantly to our understanding of membrane protein folding and transport of ions across membranes. The area of membrane protein design, or more generally of biomimetic polymers that function in mixed or non-aqueous environments, is now becoming increasingly possible.