1
|
Genomic language model predicts protein co-regulation and function. Nat Commun 2024; 15:2880. [PMID: 38570504 PMCID: PMC10991518 DOI: 10.1038/s41467-024-46947-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2023] [Accepted: 03/13/2024] [Indexed: 04/05/2024]
Abstract
Deciphering the relationship between a gene and its genomic context is fundamental to understanding and engineering biological systems. Machine learning has shown promise in learning latent relationships underlying the sequence-structure-function paradigm from massive protein sequence datasets. However, to date, limited attempts have been made in extending this continuum to include higher order genomic context information. Evolutionary processes dictate the specificity of genomic contexts in which a gene is found across phylogenetic distances, and these emergent genomic patterns can be leveraged to uncover functional relationships between gene products. Here, we train a genomic language model (gLM) on millions of metagenomic scaffolds to learn the latent functional and regulatory relationships between genes. gLM learns contextualized protein embeddings that capture the genomic context as well as the protein sequence itself, and encode biologically meaningful and functionally relevant information (e.g. enzymatic function, taxonomy). Our analysis of the attention patterns demonstrates that gLM is learning co-regulated functional modules (i.e. operons). Our findings illustrate that gLM's unsupervised deep learning of the metagenomic corpus is an effective and promising approach to encode functional semantics and regulatory syntax of genes in their genomic contexts and uncover complex relationships between genes in a genomic region.
|
2
|
Computational design of soluble functional analogues of integral membrane proteins. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.05.09.540044. [PMID: 38496615 PMCID: PMC10942269 DOI: 10.1101/2023.05.09.540044] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 03/19/2024]
Abstract
De novo design of complex protein folds using solely computational means remains a significant challenge. Here, we use a robust deep learning pipeline to design complex folds and soluble analogues of integral membrane proteins. Unique membrane topologies, such as those from GPCRs, are not found in the soluble proteome and we demonstrate that their structural features can be recapitulated in solution. Biophysical analyses reveal high thermal stability of the designs and experimental structures show remarkable design accuracy. The soluble analogues were functionalized with native structural motifs, standing as a proof-of-concept for bringing membrane protein functions to the soluble proteome, potentially enabling new approaches in drug discovery. In summary, we designed complex protein topologies and enriched them with functionalities from membrane proteins, with high experimental success rates, leading to a de facto expansion of the functional soluble fold space.
|
3
|
An atlas of protein homo-oligomerization across domains of life. Cell 2024; 187:999-1010.e15. [PMID: 38325366 DOI: 10.1016/j.cell.2024.01.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Revised: 11/03/2023] [Accepted: 01/15/2024] [Indexed: 02/09/2024]
Abstract
Protein structures are essential to understanding cellular processes in molecular detail. While advances in artificial intelligence revealed the tertiary structure of proteins at scale, their quaternary structure remains mostly unknown. We devise a scalable strategy based on AlphaFold2 to predict homo-oligomeric assemblies across four proteomes spanning the tree of life. Our results suggest that approximately 45% of an archaeal proteome and a bacterial proteome and 20% of two eukaryotic proteomes form homomers. Our predictions accurately capture protein homo-oligomerization, recapitulate megadalton complexes, and unveil hundreds of homo-oligomer types, including three confirmed experimentally by structure determination. Integrating these datasets with omics information suggests that a majority of known protein complexes are symmetric. Finally, these datasets provide a structural context for interpreting disease mutations and reveal coiled-coil regions as major enablers of quaternary structure evolution in human. Our strategy is applicable to any organism and provides a comprehensive view of homo-oligomerization in proteomes.
|
4
|
NMPFamsDB: a database of novel protein families from microbial metagenomes and metatranscriptomes. Nucleic Acids Res 2024; 52:D502-D512. [PMID: 37811892 PMCID: PMC10767849 DOI: 10.1093/nar/gkad800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 09/19/2023] [Indexed: 10/10/2023]
Abstract
The Novel Metagenome Protein Families Database (NMPFamsDB) is a database of metagenome- and metatranscriptome-derived protein families, whose members have no hits to proteins of reference genomes or Pfam domains. Each protein family is accompanied by multiple sequence alignments, Hidden Markov Models, taxonomic information, ecosystem and geolocation metadata, sequence and structure predictions, as well as 3D structure models predicted with AlphaFold2. In its current version, NMPFamsDB hosts over 100 000 protein families, each with at least 100 members. The reported protein families significantly expand (more than double) the number of known protein sequence clusters from reference genomes and reveal new insights into their habitat distribution, origins, functions and taxonomy. We expect NMPFamsDB to be a valuable resource for microbial proteome-wide analyses and for further discovery and characterization of novel functions. NMPFamsDB is publicly available in http://www.nmpfamsdb.org/ or https://bib.fleming.gr/NMPFamsDB.
|
5
|
Predicting multiple conformations via sequence clustering and AlphaFold2. Nature 2024; 625:832-839. [PMID: 37956700 PMCID: PMC10808063 DOI: 10.1038/s41586-023-06832-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 17.0] [Received: 07/07/2023] [Accepted: 11/03/2023] [Indexed: 11/15/2023]
Abstract
AlphaFold2 (ref. 1) has revolutionized structural biology by accurately predicting single structures of proteins. However, a protein's biological function often depends on multiple conformational substates (ref. 2), and disease-causing point mutations often cause population changes within these substates (refs. 3,4). We demonstrate that clustering a multiple-sequence alignment by sequence similarity enables AlphaFold2 to sample alternative states of known metamorphic proteins with high confidence. Using this method, named AF-Cluster, we investigated the evolutionary distribution of predicted structures for the metamorphic protein KaiB (ref. 5) and found that predictions of both conformations were distributed in clusters across the KaiB family. We used nuclear magnetic resonance spectroscopy to confirm an AF-Cluster prediction: a cyanobacteria KaiB variant is stabilized in the opposite state compared with the more widely studied variant. To test AF-Cluster's sensitivity to point mutations, we designed and experimentally verified a set of three mutations predicted to flip KaiB from Rhodobacter sphaeroides from the ground to the fold-switched state. Finally, screening for alternative states in protein families without known fold switching identified a putative alternative state for the oxidoreductase Mpt53 in Mycobacterium tuberculosis. Further development of such bioinformatic methods in tandem with experiments will probably have a considerable impact on predicting protein energy landscapes, essential for illuminating biological function.
|
6
|
Abstract
Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities (refs. 1,2). Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database (ref. 3). Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.
|
7
|
Abstract
There has been considerable recent progress in designing new proteins using deep-learning methods (refs. 1-9). Despite this progress, a general deep-learning framework for protein design that enables solution of a wide range of design challenges, including de novo binder design and design of higher-order symmetric architectures, has yet to be described. Diffusion models (refs. 10,11) have had considerable success in image and language generative modelling but limited success when applied to protein modelling, probably due to the complexity of protein backbone geometry and sequence-structure relationships. Here we show that by fine-tuning the RoseTTAFold structure prediction network on protein structure denoising tasks, we obtain a generative model of protein backbones that achieves outstanding performance on unconditional and topology-constrained protein monomer design, protein binder design, symmetric oligomer design, enzyme active site scaffolding and symmetric motif scaffolding for therapeutic and metal-binding protein design. We demonstrate the power and generality of the method, called RoseTTAFold diffusion (RFdiffusion), by experimentally characterizing the structures and functions of hundreds of designed symmetric assemblies, metal-binding proteins and protein binders. The accuracy of RFdiffusion is confirmed by the cryogenic electron microscopy structure of a designed binder in complex with influenza haemagglutinin that is nearly identical to the design model. In a manner analogous to networks that produce images from user-specified inputs, RFdiffusion enables the design of diverse functional proteins from simple molecular specifications.
|
8
|
Mega-scale experimental analysis of protein folding stability in biology and design. Nature 2023; 620:434-444. [PMID: 37468638 PMCID: PMC10412457 DOI: 10.1038/s41586-023-06328-6] [Citation(s) in RCA: 29] [Impact Index Per Article: 29.0] [Received: 01/05/2023] [Accepted: 06/14/2023] [Indexed: 07/21/2023]
Abstract
Advances in DNA sequencing and machine learning are providing insights into protein sequences and structures on an enormous scale (ref. 1). However, the energetics driving folding are invisible in these structures and remain largely unknown (ref. 2). The hidden thermodynamics of folding can drive disease (refs. 3,4), shape protein evolution (refs. 5-7) and guide protein engineering (refs. 8-10), and new approaches are needed to reveal these thermodynamics for every sequence and structure. Here we present cDNA display proteolysis, a method for measuring thermodynamic folding stability for up to 900,000 protein domains in a one-week experiment. From 1.8 million measurements in total, we curated a set of around 776,000 high-quality folding stabilities covering all single amino acid variants and selected double mutants of 331 natural and 148 de novo designed protein domains 40-72 amino acids in length. Using this extensive dataset, we quantified (1) environmental factors influencing amino acid fitness, (2) thermodynamic couplings (including unexpected interactions) between protein sites, and (3) the global divergence between evolutionary amino acid usage and protein folding stability. We also examined how our approach could identify stability determinants in designed proteins and evaluate design methods. The cDNA display proteolysis method is fast, accurate and uniquely scalable, and promises to reveal the quantitative rules for how amino acid sequences encode folding stability.
|
9
|
Co-evolution-based prediction of metal-binding sites in proteomes by machine learning. Nat Chem Biol 2023; 19:548-555. [PMID: 36593274 DOI: 10.1038/s41589-022-01223-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/11/2022] [Accepted: 11/08/2022] [Indexed: 01/03/2023]
Abstract
Metal ions have various important biological roles in proteins, including structural maintenance, molecular recognition and catalysis. Previous methods of predicting metal-binding sites in proteomes were based on either sequence or structural motifs. Here we developed a co-evolution-based pipeline named 'MetalNet' to systematically predict metal-binding sites in proteomes. We applied MetalNet to proteomes of four representative prokaryotic species and predicted 4,849 potential metalloproteins, which substantially expands the currently annotated metalloproteomes. We biochemically and structurally validated previously unannotated metal-binding sites in several proteins, including apo-citrate lyase phosphoribosyl-dephospho-CoA transferase citX, an Escherichia coli enzyme lacking structural or sequence homology to any known metalloprotein (Protein Data Bank (PDB) codes: 7DCM and 7DCN). MetalNet also successfully recapitulated all known zinc-binding sites from the human spliceosome complex. The pipeline of MetalNet provides a unique and enabling tool for interrogating the hidden metalloproteome and studying metal biology.
|
10
|
De novo design of small beta barrel proteins. Proc Natl Acad Sci U S A 2023; 120:e2207974120. [PMID: 36897987 PMCID: PMC10089152 DOI: 10.1073/pnas.2207974120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2022] [Accepted: 01/27/2023] [Indexed: 03/12/2023]
Abstract
Small beta barrel proteins are attractive targets for computational design because of their considerable functional diversity despite their very small size (<70 amino acids). However, there are considerable challenges to designing such structures, and there has been little success thus far. Because of the small size, the hydrophobic core stabilizing the fold is necessarily very small, and the conformational strain of barrel closure can oppose folding; also intermolecular aggregation through free beta strand edges can compete with proper monomer folding. Here, we explore the de novo design of small beta barrel topologies using both Rosetta energy-based methods and deep learning approaches to design four small beta barrel folds: Src homology 3 (SH3) and oligonucleotide/oligosaccharide-binding (OB) topologies found in nature and five and six up-and-down-stranded barrels rarely if ever seen in nature. Both approaches yielded successful designs with high thermal stability and experimentally determined structures with less than 2.4 Å rmsd from the designed models. Using deep learning for backbone generation and Rosetta for sequence design yielded higher design success rates and increased structural diversity than Rosetta alone. The ability to design a large and structurally diverse set of small beta barrel proteins greatly increases the protein shape space available for designing binders to protein targets of interest.
|
11
|
Cyclic peptide structure prediction and design using AlphaFold. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.02.25.529956. [PMID: 36865323 PMCID: PMC9980166 DOI: 10.1101/2023.02.25.529956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/28/2023]
Abstract
Deep learning networks offer considerable opportunities for accurate structure prediction and design of biomolecules. While cyclic peptides have gained significant traction as a therapeutic modality, developing deep learning methods for designing such peptides has been slow, mostly due to the small number of available structures for molecules in this size range. Here, we report approaches to modify the AlphaFold network for accurate structure prediction and design of cyclic peptides. Our results show this approach can accurately predict the structures of native cyclic peptides from a single sequence, with 36 out of 49 cases predicted with high confidence (pLDDT > 0.85) matching the native structure with root mean squared deviation (RMSD) less than 1.5 Å. Further extending our approach, we describe computational methods for designing sequences of peptide backbones generated by other backbone sampling methods and for de novo design of new macrocyclic peptides. We extensively sampled the structural diversity of cyclic peptides between 7-13 amino acids, and identified around 10,000 unique design candidates predicted to fold into the designed structures with high confidence. X-ray crystal structures for seven sequences with diverse sizes and structures designed by our approach match very closely with the design models (root mean squared deviation < 1.0 Å), highlighting the atomic level accuracy in our approach. The computational methods and scaffolds developed here provide the basis for custom-designing peptides for targeted therapeutic applications.
|
12
|
End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman. Bioinformatics 2023; 39:6820925. [PMID: 36355460 PMCID: PMC9805565 DOI: 10.1093/bioinformatics/btac724] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/09/2022] [Revised: 09/28/2022] [Accepted: 11/08/2022] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Multiple sequence alignments (MSAs) of homologous sequences contain information on structural and functional constraints and their evolutionary histories. Despite their importance for many downstream tasks, such as structure prediction, MSA generation is often treated as a separate pre-processing step, without any guidance from the application it will be used for. RESULTS: Here, we implement a smooth and differentiable version of the Smith-Waterman pairwise alignment algorithm that enables jointly learning an MSA and a downstream machine learning system in an end-to-end fashion. To demonstrate its utility, we introduce SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction. We find that SMURF learns MSAs that mildly improve contact prediction on a diverse set of protein and RNA families. As a proof of concept, we demonstrate that by connecting our differentiable alignment module to AlphaFold2 and maximizing predicted confidence, we can learn MSAs that improve structure predictions over the initial MSAs. Interestingly, the alignments that improve AlphaFold predictions are self-inconsistent and can be viewed as adversarial. This work highlights the potential of differentiable dynamic programming to improve neural network pipelines that rely on an alignment and the potential dangers of optimizing predictions of protein sequences with methods that are not fully understood. AVAILABILITY AND IMPLEMENTATION: Our code and examples are available at: https://github.com/spetti/SMURF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
|
13
|
Exploring microbial functional biodiversity at the protein family level-From metagenomic sequence reads to annotated protein clusters. FRONTIERS IN BIOINFORMATICS 2023; 3:1157956. [PMID: 36959975 PMCID: PMC10029925 DOI: 10.3389/fbinf.2023.1157956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2023] [Accepted: 02/21/2023] [Indexed: 03/06/2023]
Abstract
Metagenomics has enabled accessing the genetic repertoire of natural microbial communities. Metagenome shotgun sequencing has become the method of choice for studying and classifying microorganisms from various environments. To this end, several methods have been developed to process and analyze the sequence data from raw reads to end-products such as predicted protein sequences or families. In this article, we provide a thorough review to simplify such processes and discuss the alternative methodologies that can be followed in order to explore biodiversity at the protein family level. We provide details for analysis tools and we comment on their scalability as well as their advantages and disadvantages. Finally, we report the available data repositories and recommend various approaches for protein family annotation related to phylogenetic distribution, structure prediction and metadata enrichment.
|
14
|
State-of-the-Art Estimation of Protein Model Accuracy Using AlphaFold. PHYSICAL REVIEW LETTERS 2022; 129:238101. [PMID: 36563190 DOI: 10.1103/physrevlett.129.238101] [Citation(s) in RCA: 43] [Impact Index Per Article: 21.5] [Received: 06/17/2022] [Accepted: 10/18/2022] [Indexed: 06/17/2023]
Abstract
The problem of predicting a protein's 3D structure from its primary amino acid sequence is a longstanding challenge in structural biology. Recently, approaches like AlphaFold have achieved remarkable performance on this task by combining deep learning techniques with coevolutionary data from multiple sequence alignments of related protein sequences. The use of coevolutionary information is critical to these models' accuracy, and without it their predictive performance drops considerably. In living cells, however, the 3D structure of a protein is fully determined by its primary sequence and the biophysical laws that cause it to fold into a low-energy configuration. Thus, it should be possible to predict a protein's structure from only its primary sequence by learning an approximate biophysical energy function. We provide evidence that AlphaFold has learned such an energy function, and uses coevolution data to solve the global search problem of finding a low-energy conformation. We demonstrate that AlphaFold's learned energy function can be used to rank the quality of candidate protein structures with state-of-the-art accuracy, without using any coevolution data. Finally, we explore several applications of this energy function, including the prediction of protein structures without multiple sequence alignments.
|
15
|
A structural biology community assessment of AlphaFold2 applications. Nat Struct Mol Biol 2022; 29:1056-1067. [PMID: 36344848 PMCID: PMC9663297 DOI: 10.1038/s41594-022-00849-w] [Citation(s) in RCA: 179] [Impact Index Per Article: 89.5] [Received: 10/25/2021] [Accepted: 09/20/2022] [Indexed: 11/09/2022]
Abstract
Most proteins fold into 3D structures that determine how they function and orchestrate the biological processes of the cell. Recent developments in computational methods for protein structure predictions have reached the accuracy of experimentally determined models. Although this has been independently verified, the implementation of these methods across structural-biology applications remains to be tested. Here, we evaluate the use of AlphaFold2 (AF2) predictions in the study of characteristic structural elements; the impact of missense variants; function and ligand binding site predictions; modeling of interactions; and modeling of experimental structural data. For 11 proteomes, an average of 25% additional residues can be confidently modeled when compared with homology modeling, identifying structural features rarely seen in the Protein Data Bank. AF2-based predictions of protein disorder and complexes surpass dedicated tools, and AF2 models can be used across diverse applications equally well compared with experimentally determined structures, when the confidence metrics are critically considered. In summary, we find that these advances are likely to have a transformative impact in structural biology and broader life-science research.
|
16
|
Temperature- and Field-Induced Transformation of the Magnetic State in Co2.5Ge0.5BO5. Inorg Chem 2022; 61:13034-13046. [PMID: 35947773 DOI: 10.1021/acs.inorgchem.2c01193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
A tetravalent-substituted cobalt ludwigite Co2.5Ge0.5BO5 has been synthesized using the flux method. The compound undergoes two magnetic transitions: a long-range antiferromagnetic transition at TN1 = 84 K and a metamagnetic one at TN2 = 36 K. The sample-oriented magnetization measurements revealed a fully compensated magnetic moment along the a- and c-axes and an uncompensated one along the b-axis leading to high uniaxial anisotropy. A field-induced enhancement of the ferromagnetic correlations at TN2 is observed in specific heat measurements. The DFT+GGA calculation predicts the spin configuration of (↑↓↓↑) as a ground state with a magnetic moment of 1.37 μB/f.u. The strong hybridization of Ge(4s, 4p) with O (2p) orbitals resulting from the high electronegativity of Ge4+ is assumed to cause an increase in the interlayer interaction, contributing to the long-range magnetic order. The effect of two super-superexchange pathways Co2+-O-B-O-Co2+ and Co2+-O-M4-O-Co2+ on the magnetic state is discussed.
|
17
|
Abstract
The binding and catalytic functions of proteins are generally mediated by a small number of functional residues held in place by the overall protein structure. Here, we describe deep learning approaches for scaffolding such functional sites without needing to prespecify the fold or secondary structure of the scaffold. The first approach, "constrained hallucination," optimizes sequences such that their predicted structures contain the desired functional site. The second approach, "inpainting," starts from the functional site and fills in additional sequence and structure to create a viable protein scaffold in a single forward pass through a specifically trained RoseTTAFold network. We use these two methods to design candidate immunogens, receptor traps, metalloproteins, enzymes, and protein-binding proteins and validate the designs using a combination of in silico and experimental tests.
|
18
|
Abstract
ColabFold offers accelerated prediction of protein structures and complexes by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold. ColabFold's 40-60-fold faster search and optimized model utilization enables prediction of close to 1,000 structures per day on a server with one graphics processing unit. Coupled with Google Colaboratory, ColabFold becomes a free and accessible platform for protein folding. ColabFold is open-source software available at https://github.com/sokrypton/ColabFold and its novel environmental databases are available at https://colabfold.mmseqs.com.
|
20
|
Interpreting Potts and Transformer Protein Models Through the Lens of Simplified Attention. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2022; 27:34-45. [PMID: 34890134 PMCID: PMC8752338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
Abstract
The established approach to unsupervised protein contact prediction estimates coevolving positions using undirected graphical models. This approach trains a Potts model on a Multiple Sequence Alignment. Increasingly large Transformers are being pretrained on unlabeled, unaligned protein sequence databases and showing competitive performance on protein contact prediction. We argue that attention is a principled model of protein interactions, grounded in real properties of protein family data. We introduce an energy-based attention layer, factored attention, which, in a certain limit, recovers a Potts model, and use it to contrast Potts and Transformers. We show that the Transformer leverages hierarchical signal in protein family databases not captured by single-layer models. This raises the exciting possibility for the development of powerful structured models of protein family databases.
|
21
|
Abstract
Protein-protein interactions play critical roles in biology, but the structures of many eukaryotic protein complexes are unknown, and there are likely many interactions not yet identified. We take advantage of advances in proteome-wide amino acid coevolution analysis and deep-learning–based structure modeling to systematically identify and build accurate models of core eukaryotic protein complexes within the Saccharomyces cerevisiae proteome. We use a combination of RoseTTAFold and AlphaFold to screen through paired multiple sequence alignments for 8.3 million pairs of yeast proteins, identify 1505 likely to interact, and build structure models for 106 previously unidentified assemblies and 806 that have not been structurally characterized. These complexes, which have as many as five subunits, play roles in almost all key processes in eukaryotic cells and provide broad insights into biological function.
|
22
|
Abstract
Since the first revelation of proteins functioning as macromolecular machines through their three-dimensional structures, researchers have been intrigued by the marvelous ways in which proteins carry out biochemical processes. The aspiration to understand protein structures has fueled extensive efforts across different scientific disciplines. In recent years, it has been demonstrated that proteins with new functionality or shapes can be designed via structure-based modeling methods, and the design strategies have combined all available information (largely piece by piece), from sequence-derived statistics to the detailed atomic-level modeling of chemical interactions. Despite this significant progress, incorporating data-derived approaches through the use of deep learning methods can be a game changer. In this review, we summarize current progress, compare the arc of developing the deep learning approaches with the conventional methods, and describe the motivation and concepts behind current strategies that may lead to potential future opportunities.
|
23
|
Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021; 373:871-876. [PMID: 34282049 PMCID: PMC7612213 DOI: 10.1126/science.abj8754] [Citation(s) in RCA: 2086] [Impact Index Per Article: 695.3] [Received: 06/07/2021] [Accepted: 07/07/2021] [Indexed: 01/17/2023]
Abstract
DeepMind presented notably accurate predictions at the recent 14th Critical Assessment of Structure Prediction (CASP14) conference. We explored network architectures that incorporate related ideas and obtained the best performance with a three-track network in which information at the one-dimensional (1D) sequence level, the 2D distance map level, and the 3D coordinate level is successively transformed and integrated. The three-track network produces structure predictions with accuracies approaching those of DeepMind in CASP14, enables the rapid solution of challenging x-ray crystallography and cryo-electron microscopy structure modeling problems, and provides insights into the functions of proteins of currently unknown structure. The network also enables rapid generation of accurate protein-protein complex models from sequence information alone, short-circuiting traditional approaches that require modeling of individual subunits followed by docking. We make the method available to the scientific community to speed biological research.
|
24
|
Abstract
The protein design problem is to identify an amino acid sequence that folds to a desired structure. Given Anfinsen's thermodynamic hypothesis of folding, this can be recast as finding an amino acid sequence for which the desired structure is the lowest energy state. As this calculation involves not only all possible amino acid sequences but also, all possible structures, most current approaches focus instead on the more tractable problem of finding the lowest-energy amino acid sequence for the desired structure, often checking by protein structure prediction in a second step that the desired structure is indeed the lowest-energy conformation for the designed sequence, and typically discarding a large fraction of designed sequences for which this is not the case. Here, we show that by backpropagating gradients through the transform-restrained Rosetta (trRosetta) structure prediction network from the desired structure to the input amino acid sequence, we can directly optimize over all possible amino acid sequences and all possible structures in a single calculation. We find that trRosetta calculations, which consider the full conformational landscape, can be more effective than Rosetta single-point energy estimations in predicting folding and stability of de novo designed proteins. We compare sequence design by conformational landscape optimization with the standard energy-based sequence design methodology in Rosetta and show that the former can result in energy landscapes with fewer alternative energy minima. We show further that more funneled energy landscapes can be designed by combining the strengths of the two approaches: the low-resolution trRosetta model serves to disfavor alternative states, and the high-resolution Rosetta model serves to create a deep energy minimum at the design target structure.
|
25
|
Solution NMR structure of Se0862, a highly conserved cyanobacterial protein involved in biofilm formation. Protein Sci 2020; 29:2274-2280. [PMID: 32949024 PMCID: PMC7586914 DOI: 10.1002/pro.3952] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/29/2020] [Revised: 09/08/2020] [Accepted: 09/12/2020] [Indexed: 12/13/2022]
Abstract
Biofilms are accumulations of microorganisms embedded in extracellular matrices that protect against external factors and stressful environments. Cyanobacterial biofilms are ubiquitous and have potential for treatment of wastewater and sustainable production of biofuels. But the underlying mechanisms regulating cyanobacterial biofilm formation are unclear. Here, we report the solution NMR structure of a protein, Se0862, conserved across diverse cyanobacterial species and involved in regulation of biofilm formation in the cyanobacterium Synechococcus elongatus PCC 7942. Se0862 is a class α+β protein with ααββββαα topology and roll architecture, consisting of a four-stranded β-sheet that is flanked by four α-helices on one side. Conserved surface residues constitute a hydrophobic pocket and charged regions that are likely also present in Se0862 orthologs.
|
26
|
Advances in Chromatin and Chromosome Research: Perspectives from Multiple Fields. Mol Cell 2020; 79:881-901. [PMID: 32768408 PMCID: PMC7888594 DOI: 10.1016/j.molcel.2020.07.003] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 02/03/2020] [Revised: 06/12/2020] [Accepted: 07/06/2020] [Indexed: 12/12/2022]
Abstract
Nucleosomes package genomic DNA into chromatin. By regulating DNA access for transcription, replication, DNA repair, and epigenetic modification, chromatin forms the nexus of most nuclear processes. In addition, dynamic organization of chromatin underlies both regulation of gene expression and evolution of chromosomes into individualized sister objects, which can segregate cleanly to different daughter cells at anaphase. This collaborative review shines a spotlight on technologies that will be crucial to interrogate key questions in chromatin and chromosome biology including state-of-the-art microscopy techniques, tools to physically manipulate chromatin, single-cell methods to measure chromatin accessibility, computational imaging with neural networks and analytical tools to interpret chromatin structure and dynamics. In addition, this review provides perspectives on how these tools can be applied to specific research fields such as genome stability and developmental biology and to test concepts such as phase separation of chromatin.
|
27
|
Macromolecular modeling and design in Rosetta: recent methods and frameworks. Nat Methods 2020; 17:665-680. [PMID: 32483333 PMCID: PMC7603796 DOI: 10.1038/s41592-020-0848-2] [Citation(s) in RCA: 373] [Impact Index Per Article: 93.3] [Received: 04/29/2019] [Accepted: 04/22/2020] [Indexed: 12/12/2022]
Abstract
The Rosetta software for macromolecular modeling, docking and design is extensively used in laboratories worldwide. During two decades of development by a community of laboratories at more than 60 institutions, Rosetta has been continuously refactored and extended. Its advantages are its performance and interoperability between broad modeling capabilities. Here we review tools developed in the last 5 years, including over 80 methods. We discuss improvements to the score function, user interfaces and usability. Rosetta is available at http://www.rosettacommons.org.
|
28
|
Structure determination of the HgcAB complex using metagenome sequence data: insights into microbial mercury methylation. Commun Biol 2020; 3:320. [PMID: 32561885 PMCID: PMC7305189 DOI: 10.1038/s42003-020-1047-5] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 01/27/2020] [Accepted: 05/27/2020] [Indexed: 11/09/2022] Open
Abstract
Bacteria and archaea possessing the hgcAB gene pair methylate inorganic mercury (Hg) to form highly toxic methylmercury. HgcA consists of a corrinoid binding domain and a transmembrane domain, and HgcB is a dicluster ferredoxin. However, their detailed structure and function have not been thoroughly characterized. We modeled the HgcAB complex by combining metagenome sequence data mining, coevolution analysis, and Rosetta structure calculations. In addition, we overexpressed HgcA and HgcB in Escherichia coli, confirmed spectroscopically that they bind cobalamin and [4Fe-4S] clusters, respectively, and incorporated these cofactors into the structural model. Surprisingly, the two domains of HgcA do not interact with each other, but HgcB forms extensive contacts with both domains. The model suggests that conserved cysteines in HgcB are involved in shuttling Hg(II), methylmercury, or both. These findings refine our understanding of the mechanism of Hg methylation and expand the known repertoire of corrinoid methyltransferases in nature. Connor J. Cooper et al. expressed HgcA and HgcB in Escherichia coli and modeled the structure of the HgcAB complex by combining metagenome sequence data, coevolution analysis, and ab initio structure calculations. This study provides insights into the biochemical mechanism of mercury (Hg) methylation.
|
29
|
Structural basis of ER-associated protein degradation mediated by the Hrd1 ubiquitin ligase complex. Science 2020; 368:eaaz2449. [PMID: 32327568 DOI: 10.1126/science.aaz2449] [Citation(s) in RCA: 119] [Impact Index Per Article: 29.8] [Received: 08/25/2019] [Revised: 01/18/2020] [Accepted: 03/11/2020] [Indexed: 12/13/2022]
Abstract
Misfolded luminal endoplasmic reticulum (ER) proteins undergo ER-associated degradation (ERAD-L): They are retrotranslocated into the cytosol, polyubiquitinated, and degraded by the proteasome. ERAD-L is mediated by the Hrd1 complex (composed of Hrd1, Hrd3, Der1, Usa1, and Yos9), but the mechanism of retrotranslocation remains mysterious. Here, we report a structure of the active Hrd1 complex, as determined by cryo-electron microscopy analysis of two subcomplexes. Hrd3 and Yos9 jointly create a luminal binding site that recognizes glycosylated substrates. Hrd1 and the rhomboid-like Der1 protein form two "half-channels" with cytosolic and luminal cavities, respectively, and lateral gates facing one another in a thinned membrane region. These structures, along with crosslinking and molecular dynamics simulation results, suggest how a polypeptide loop of an ERAD-L substrate moves through the ER membrane.
|
30
|
A demonstration of unsupervised machine learning in species delimitation. Mol Phylogenet Evol 2019; 139:106562. [PMID: 31323334 PMCID: PMC6880864 DOI: 10.1016/j.ympev.2019.106562] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 03/17/2019] [Revised: 07/03/2019] [Accepted: 07/15/2019] [Indexed: 01/13/2023]
Abstract
One major challenge to delimiting species with genetic data is successfully differentiating population structure from species-level divergence, an issue exacerbated in taxa inhabiting naturally fragmented habitats. Many fields of science are now using machine learning, and in evolutionary biology supervised machine learning has recently been used to infer species boundaries. These supervised methods require training data with associated labels. Conversely, unsupervised machine learning (UML) uses inherent data structure and does not require user-specified training labels, potentially providing more objectivity in species delimitation. In the context of integrative taxonomy, we demonstrate the utility of three UML approaches (random forests, variational autoencoders, t-distributed stochastic neighbor embedding) for species delimitation in an arachnid taxon with high population genetic structure (Opiliones, Laniatores, Metanonychus). We find that UML approaches successfully cluster samples according to species-level divergences and not high levels of population structure, while model-based validation methods severely over-split putative species. UML offers intuitive data visualization in two-dimensional space, the ability to accommodate various data types, and has potential in many areas of systematic and evolutionary biology. We argue that machine learning methods are ideally suited for species delimitation and may perform well in many natural systems and across taxa with diverse biological characteristics.
|
31
|
Template-based modeling by ClusPro in CASP13 and the potential for using co-evolutionary information in docking. Proteins 2019; 87:1241-1248. [PMID: 31444975 DOI: 10.1002/prot.25808] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 05/02/2019] [Revised: 07/21/2019] [Accepted: 07/30/2019] [Indexed: 12/29/2022]
Abstract
As a participant in the joint CASP13-CAPRI46 assessment, the ClusPro server debuted its new template-based modeling functionality. The addition of this feature, called ClusPro TBM, was motivated by the previous CASP-CAPRI assessments and by the proven ability of template-based methods to produce higher-quality models, provided templates are available. In prior assessments, ClusPro submissions consisted of models that were produced via free docking of pre-generated homology models. This method was successful in terms of the number of acceptable predictions across targets; however, analysis of results showed that purely template-based methods produced a substantially higher number of medium-quality models for targets for which there were good templates available. The addition of template-based modeling has expanded ClusPro's ability to produce higher accuracy predictions, primarily for homomeric but also for some heteromeric targets. Here we review the newest additions to the ClusPro web server and discuss examples of CASP-CAPRI targets that continue to drive further development. We also describe ongoing work not yet implemented in the server. This includes the development of methods to improve template-based models and the use of co-evolutionary information for data-assisted free docking.
|
32
|
Protein interaction networks revealed by proteome coevolution. Science 2019; 365:185-189. [PMID: 31296772 DOI: 10.1126/science.aaw6718] [Citation(s) in RCA: 111] [Impact Index Per Article: 22.2] [Received: 01/15/2019] [Accepted: 06/07/2019] [Indexed: 01/19/2023]
Abstract
Residue-residue coevolution has been observed across a number of protein-protein interfaces, but the extent of residue coevolution between protein families on the whole-proteome scale has not been systematically studied. We investigate coevolution between 5.4 million pairs of proteins in Escherichia coli and between 3.9 million pairs in Mycobacterium tuberculosis. We find strong coevolution for binary complexes involved in metabolism and weaker coevolution for larger complexes playing roles in genetic information processing. We take advantage of this coevolution, in combination with structure modeling, to predict protein-protein interactions (PPIs) with an accuracy that benchmark studies suggest is considerably higher than that of proteome-wide two-hybrid and mass spectrometry screens. We identify hundreds of previously uncharacterized PPIs in E. coli and M. tuberculosis that both add components to known protein complexes and networks and establish the existence of new ones.
|
33
|
A structural and data-driven approach to engineering a plant cytochrome P450 enzyme. Sci China Life Sci 2019; 62:873-882. [DOI: 10.1007/s11427-019-9538-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/21/2019] [Accepted: 02/26/2019] [Indexed: 10/26/2022]
|
34
|
Development of a dual-functional conjugate of antigenic peptide and Fc-III mimetics (DCAF) for targeted antibody blocking. Chem Sci 2019; 10:3271-3280. [PMID: 30996912 PMCID: PMC6429600 DOI: 10.1039/c8sc05273e] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/27/2018] [Accepted: 01/28/2019] [Indexed: 01/12/2023] Open
Abstract
Targeted antibody blocking enables characterization of binding sites on immunoglobulin G (IgG), and can efficiently eliminate harmful antibodies from organisms. In this report, we present a novel peptide, denoted as a dual-functional conjugate of antigenic peptide and Fc-III mimetics (DCAF), for targeted blocking of antibodies. Synthesis of DCAF was achieved by native chemical ligation, and the molecule consists of three functional parts: a specific antigenic peptide, a linker and the Fc-III mimetic peptide, which has a high affinity toward the Fc region of IgG molecules. We demonstrate that DCAF binds the cognate antibody with high selectivity by simultaneously binding to the Fab and Fc regions of IgG. Animal experiments revealed that DCAF molecules diminish the antibody-dependent enhancement effect in a dengue virus infection model, and rescue the acetylcholine receptor by inhibiting the complement cascade in a myasthenia gravis model. These results suggest that DCAFs could have utility in the development of new therapeutics against harmful antibodies.
|
35
|
Abstract
The regular arrangements of β-strands around a central axis in β-barrels and of α-helices in coiled coils contrast with the irregular tertiary structures of most globular proteins, and have fascinated structural biologists since they were first discovered. Simple parametric models have been used to design a wide range of α-helical coiled-coil structures, but to date there has been no success with β-barrels. Here we show that accurate de novo design of β-barrels requires considerable symmetry-breaking to achieve continuous hydrogen-bond connectivity and eliminate backbone strain. We then build ensembles of β-barrel backbone models with cavity shapes that match the fluorogenic compound DFHBI, and use a hierarchical grid-based search method to simultaneously optimize the rigid-body placement of DFHBI in these cavities and the identities of the surrounding amino acids to achieve high shape and chemical complementarity. The designs have high structural accuracy and bind and fluorescently activate DFHBI in vitro and in Escherichia coli, yeast and mammalian cells. This de novo design of small-molecule binding activity, using backbones custom-built to bind the ligand, should enable the design of increasingly sophisticated ligand-binding proteins, sensors and catalysts that are not limited by the backbone geometries available in known protein structures.
|
36
|
An analysis and evaluation of the WeFold collaborative for protein structure prediction and its pipelines in CASP11 and CASP12. Sci Rep 2018; 8:9939. [PMID: 29967418 PMCID: PMC6028396 DOI: 10.1038/s41598-018-26812-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 11/21/2017] [Accepted: 05/17/2018] [Indexed: 01/14/2023] Open
Abstract
Every two years, groups worldwide participate in the Critical Assessment of Protein Structure Prediction (CASP) experiment to blindly test the strengths and weaknesses of their computational methods. CASP has significantly advanced the field, but many hurdles remain, which may require new ideas and collaborations. In 2012, a web-based effort called WeFold was initiated to promote collaboration within the CASP community and attract researchers from other fields to contribute new ideas to CASP. Members of the WeFold coopetition (cooperation and competition) participated in CASP as individual teams, but also shared components of their methods to create hybrid pipelines and actively contributed to this effort. We assert that the scale and diversity of integrative prediction pipelines could not have been achieved by any individual lab or even by any collaboration among a few partners. The models contributed by the participating groups and generated by the pipelines are publicly available at the WeFold website, providing a wealth of data that remains to be tapped. Here, we analyze the results of the 2014 and 2016 pipelines, showing improvements according to the CASP assessment as well as areas that require further adjustments and research.
|
37
|
Abstract
Proteins fold to their lowest free-energy structures, and hence the most straightforward way to increase the accuracy of a partially incorrect protein structure model is to search for the lowest-energy nearby structure. This direct approach has met with little success for two reasons: first, energy function inaccuracies can lead to false energy minima, resulting in model degradation rather than improvement; and second, even with an accurate energy function, the search problem is formidable because the energy only drops considerably in the immediate vicinity of the global minimum, and there are a very large number of degrees of freedom. Here we describe a large-scale energy optimization-based refinement method that incorporates advances in both search and energy function accuracy that can substantially improve the accuracy of low-resolution homology models. The method refined low-resolution homology models into correct folds for 50 of 84 diverse protein families and generated improved models in recent blind structure prediction experiments. Analyses of the basis for these improvements reveal contributions from both the improvements in conformational sampling techniques and the energy function.
|
38
|
Automatic structure prediction of oligomeric assemblies using Robetta in CASP12. Proteins 2018; 86 Suppl 1:283-291. [PMID: 28913931 PMCID: PMC6019630 DOI: 10.1002/prot.25387] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Received: 07/13/2017] [Revised: 09/01/2017] [Accepted: 09/11/2017] [Indexed: 12/15/2022]
Abstract
Many naturally occurring protein systems function primarily as symmetric assemblies. Prediction of the quaternary structure of these assemblies is an important biological problem. This article describes automated tools we have developed for predicting the structures of symmetric protein assemblies in the Robetta structure prediction server. We assess the performance of this pipeline on a set of targets from the recent CASP12/CAPRI blind quaternary structure prediction experiment. Our approach successfully predicted 5 of 7 symmetric assemblies in this challenge, and was assessed as the best participating server group, and 1 of only 2 groups (human or server) with 2 predictions judged as high quality by the assessors. We also assess the method on a broader set of 22 natively symmetric CASP12 targets, where we show that oligomeric modeling can improve the accuracy of monomeric structure determination, particularly in highly intertwined oligomers.
|
39
|
Protein structure prediction using Rosetta in CASP12. Proteins 2017; 86 Suppl 1:113-121. [PMID: 28940798 DOI: 10.1002/prot.25390] [Citation(s) in RCA: 55] [Impact Index Per Article: 7.9] [Received: 07/12/2017] [Accepted: 09/18/2017] [Indexed: 12/20/2022]
Abstract
We describe several notable aspects of our structure predictions using Rosetta in CASP12 in the free modeling (FM) and refinement (TR) categories. First, we had previously generated (and published) models for most large protein families lacking experimentally determined structures using Rosetta guided by co-evolution based contact predictions, and for several targets these models proved better starting points for comparative modeling than any known crystal structure; our model database thus starts to fulfill one of the goals of the original protein structure initiative. Second, while our "human" group simply submitted ROBETTA models for most targets, for six targets expert intervention improved predictions considerably; the largest improvement was for T0886 where we correctly parsed two discontinuous domains guided by predicted contact maps to accurately identify a structural homolog of the same fold. Third, Rosetta all atom refinement followed by MD simulations led to consistent but small improvements when starting models were close to the native structure, and larger but less consistent improvements when starting models were further away.
|
40
|
Cryo-EM structure of the protein-conducting ERAD channel Hrd1 in complex with Hrd3. Nature 2017; 548:352-355. [PMID: 28682307 PMCID: PMC5736104 DOI: 10.1038/nature23314] [Citation(s) in RCA: 135] [Impact Index Per Article: 19.3] [Received: 04/23/2017] [Accepted: 06/30/2017] [Indexed: 12/16/2022]
Abstract
Misfolded endoplasmic reticulum (ER) proteins are retro-translocated through the membrane into the cytosol, where they are poly-ubiquitinated, extracted from the ER membrane, and degraded by the proteasome [1–4], a pathway termed ER-associated protein degradation (ERAD). Proteins with misfolded domains in the ER lumen or membrane are discarded through the ERAD-L and -M pathways, respectively. In S. cerevisiae, both pathways require the ubiquitin ligase Hrd1, a multi-spanning membrane protein with a cytosolic RING finger domain [5,6]. Hrd1 is the crucial membrane component for retro-translocation [7,8], but whether it forms a protein-conducting channel is unclear. Here, we report a cryo-electron microscopy (cryo-EM) structure of S. cerevisiae Hrd1 in complex with its ER luminal binding partner Hrd3. Hrd1 forms a dimer within the membrane with one or two Hrd3 molecules associated at its luminal side. Each Hrd1 molecule has eight trans-membrane segments, five of which form an aqueous cavity extending from the cytosol almost to the ER lumen, while a segment of the neighboring Hrd1 molecule forms a lateral seal. The aqueous cavity and lateral gate are reminiscent of features in protein-conducting conduits that facilitate polypeptide movement in the opposite direction, i.e. from the cytosol into or across membranes [9–11]. Our results suggest that Hrd1 forms a retro-translocation channel for the movement of misfolded polypeptides through the ER membrane.
|
41
|
Applications of contact predictions to structural biology. IUCrJ 2017; 4:291-300. [PMID: 28512576 PMCID: PMC5414403 DOI: 10.1107/s2052252517005115] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 12/12/2016] [Accepted: 04/03/2017] [Indexed: 06/07/2023]
Abstract
Evolutionary pressure on residue interactions, intramolecular or intermolecular, that are important for protein structure or function can lead to covariance between the two positions. Recent methodological advances allow much more accurate contact predictions to be derived from this evolutionary covariance signal. The practical application of contact predictions has largely been confined to structural bioinformatics, yet, as this work seeks to demonstrate, the data can be of enormous value to the structural biologist working in X-ray crystallography, cryo-EM or NMR. Integrative structural bioinformatics packages such as Rosetta can already exploit contact predictions in a variety of ways. The contribution of contact predictions begins at construct design, where structural domains may need to be expressed separately and contact predictions can help to predict domain limits. Structure solution by molecular replacement (MR) benefits from contact predictions in diverse ways: in difficult cases, more accurate search models can be constructed using ab initio modelling when predictions are available, while intermolecular contact predictions can allow the construction of larger, oligomeric search models. Furthermore, MR using supersecondary motifs or large-scale screens against the PDB can exploit information, such as the parallel or antiparallel nature of any β-strand pairing in the target, that can be inferred from contact predictions. Contact information will be particularly valuable in the determination of lower resolution structures by helping to assign sequence register. In large complexes, contact information may allow the identity of a protein responsible for a certain region of density to be determined and then assist in the orientation of an available model within that density. In NMR, predicted contacts can provide long-range information to extend the upper size limit of the technique in a manner analogous but complementary to experimental methods. Finally, predicted contacts can distinguish between biologically relevant interfaces and mere lattice contacts in a final crystal structure, and have potential in the identification of functionally important regions and in foreseeing the consequences of mutations.
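To make the "evolutionary covariance signal" described in this abstract concrete, the sketch below scores pairs of alignment columns by mutual information and ranks them. This is a deliberately simple stand-in, not one of the more accurate methods the abstract refers to, and it omits sequence weighting, pseudocounts and any correction for phylogenetic bias. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of an alignment.

    `msa` is a list of equal-length strings, one aligned sequence per entry.
    """
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


def rank_column_pairs(msa, min_separation=1):
    """Return ((i, j), score) tuples sorted from strongest to weakest covariation."""
    length = len(msa[0])
    scored = [((i, j), mutual_information(msa, i, j))
              for i in range(length)
              for j in range(i + min_separation, length)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy alignment: columns 0 and 2 vary together (A with L, G with V),
# column 3 is fully conserved and therefore carries no covariation signal.
toy_msa = ["AKLE", "AKLE", "GKVE", "GRVE", "ARLE", "GRVE"]

for pair, score in rank_column_pairs(toy_msa)[:3]:
    print(pair, round(score, 3))
```

In this toy alignment the top-ranked pair is the perfectly covarying columns 0 and 2; in practice, high-scoring pairs would be treated as candidate contacts and then passed to structure modelling or the crystallographic and NMR applications the abstract lists.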
|
42
|
Architectures of Lipid Transport Systems for the Bacterial Outer Membrane. Cell 2017; 169:273-285.e17. [PMID: 28388411 PMCID: PMC5467742 DOI: 10.1016/j.cell.2017.03.019] [Citation(s) in RCA: 140] [Impact Index Per Article: 20.0] [Received: 09/30/2016] [Revised: 01/07/2017] [Accepted: 03/14/2017] [Indexed: 10/19/2022]
Abstract
How phospholipids are trafficked between the bacterial inner and outer membranes through the hydrophilic space of the periplasm is not known. We report that members of the mammalian cell entry (MCE) protein family form hexameric assemblies with a central channel capable of mediating lipid transport. The E. coli MCE protein, MlaD, forms a ring associated with an ABC transporter complex in the inner membrane. A soluble lipid-binding protein, MlaC, ferries lipids between MlaD and an outer membrane protein complex. In contrast, EM structures of two other E. coli MCE proteins show that YebT forms an elongated tube consisting of seven stacked MCE rings, and PqiB adopts a syringe-like architecture. Both YebT and PqiB create channels of sufficient length to span the periplasmic space. This work reveals diverse architectures of highly conserved protein-based channels implicated in the transport of lipids between the membranes of bacteria and some eukaryotic organelles.
|
43
|
Overcoming an optimization plateau in the directed evolution of highly efficient nerve agent bioscavengers. Protein Eng Des Sel 2017; 30:333-345. [DOI: 10.1093/protein/gzx003] [Citation(s) in RCA: 46] [Impact Index Per Article: 6.6] [Received: 11/14/2016] [Accepted: 01/10/2017] [Indexed: 11/13/2022] Open
|
44
|
Architectures of Lipid Transport Systems for the Bacterial Outer Membrane. Biophys J 2017. [DOI: 10.1016/j.bpj.2016.11.107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022] Open
|
45
|
Protein structure determination using metagenome sequence data. Science 2017; 355:294-298. [PMID: 28104891 PMCID: PMC5493203 DOI: 10.1126/science.aah4043] [Citation(s) in RCA: 331] [Impact Index Per Article: 47.3] [Received: 06/22/2016] [Accepted: 11/22/2016] [Indexed: 01/30/2023]
Abstract
Despite decades of work by structural biologists, there are still ~5200 protein families with unknown structure outside the range of comparative modeling. We show that Rosetta structure prediction guided by residue-residue contacts inferred from evolutionary information can accurately model proteins that belong to large families and that metagenome sequence data more than triple the number of protein families with sufficient sequences for accurate modeling. We then integrate metagenome data, contact-based structure matching, and Rosetta structure calculations to generate models for 614 protein families with currently unknown structures; 206 are membrane proteins and 137 have folds not represented in the Protein Data Bank. This approach provides the representative models for large protein families originally envisioned as the goal of the Protein Structure Initiative at a fraction of the cost.
|
46
|
Structural insights into SAM domain-mediated tankyrase oligomerization. Protein Sci 2016; 25:1744-52. [PMID: 27328430 DOI: 10.1002/pro.2968] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 06/03/2016] [Accepted: 06/16/2016] [Indexed: 12/28/2022]
Abstract
Tankyrase 1 (TNKS1; a.k.a. ARTD5) and tankyrase 2 (TNKS2; a.k.a. ARTD6) are highly homologous poly(ADP-ribose) polymerases (PARPs) that function in a wide variety of cellular processes including Wnt signaling, Src signaling, Akt signaling, Glut4 vesicle translocation, telomere length regulation, and centriole and spindle pole maturation. Tankyrase proteins include a sterile alpha motif (SAM) domain that undergoes oligomerization in vitro and in vivo. However, the SAM domains of TNKS1 and TNKS2 have not been structurally characterized and the mode of oligomerization is not yet defined. Here we model the SAM domain-mediated oligomerization of tankyrase. The structural model, supported by mutagenesis and NMR analysis, demonstrates a helical, homotypic head-to-tail polymer that facilitates TNKS self-association. Furthermore, we show that TNKS1 and TNKS2 can form (TNKS1 SAM-TNKS2 SAM) hetero-oligomeric structures mediated by their SAM domains. Though wild-type tankyrase proteins have very low solubility, model-based mutations of the SAM oligomerization interface residues allowed us to obtain soluble TNKS proteins. These structural insights will be invaluable for the functional and biophysical characterization of TNKS1/2, including the role of TNKS oligomerization in protein poly(ADP-ribosyl)ation (PARylation) and PARylation-dependent ubiquitylation.
|
47
|
Structure of a bd oxidase indicates similar mechanisms for membrane-integrated oxygen reductases. Science 2016; 352:583-6. [PMID: 27126043 DOI: 10.1126/science.aaf2477] [Citation(s) in RCA: 106] [Impact Index Per Article: 13.3] [Received: 01/13/2016] [Accepted: 03/28/2016] [Indexed: 12/29/2022]
Abstract
The cytochrome bd oxidases are terminal oxidases that are present in bacteria and archaea. They reduce molecular oxygen (dioxygen) to water, avoiding the production of reactive oxygen species. In addition to their contribution to the proton motive force, they mediate viability under oxygen-related stress conditions and confer tolerance to nitric oxide, thus contributing to the virulence of pathogenic bacteria. Here we present the atomic structure of the bd oxidase from Geobacillus thermodenitrificans, revealing a pseudosymmetrical subunit fold. The arrangement and order of the heme cofactors support the conclusions from spectroscopic measurements that the cleavage of the dioxygen bond may be mechanistically similar to that in the heme-copper-containing oxidases, even though the structures are completely different.
|
48
|
Structure prediction using sparse simulated NOE restraints with Rosetta in CASP11. Proteins 2016; 84 Suppl 1:181-8. [PMID: 26857542 PMCID: PMC5490372 DOI: 10.1002/prot.25006] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 07/07/2015] [Revised: 01/11/2016] [Accepted: 02/02/2016] [Indexed: 12/17/2022]
Abstract
In CASP11 we generated protein structure models using simulated ambiguous and unambiguous nuclear Overhauser effect (NOE) restraints with a two-stage protocol. Low resolution models were generated guided by the unambiguous restraints using continuous chain folding for alpha and alpha-beta proteins, and iterative annealing for all-beta proteins to take advantage of the strand pairing information implicit in the restraints. The Rosetta fragment/model hybridization protocol was then used to recombine and regularize these models, and refine them in the Rosetta full atom energy function guided by both the unambiguous and the ambiguous restraints. Fifteen out of 19 targets were modeled with GDT-TS quality scores greater than 60 for Model 1, significantly improving upon the non-assisted predictions. Our results suggest that atomic level accuracy is achievable using sparse NOE data when there is at least one correctly assigned NOE for every residue.
|
49
|
Improved de novo structure prediction in CASP11 by incorporating coevolution information into Rosetta. Proteins 2016; 84 Suppl 1:67-75. [PMID: 26677056 PMCID: PMC5490371 DOI: 10.1002/prot.24974] [Citation(s) in RCA: 83] [Impact Index Per Article: 10.4] [Received: 07/06/2015] [Revised: 11/27/2015] [Accepted: 12/12/2015] [Indexed: 12/19/2022]
Abstract
We describe CASP11 de novo blind structure predictions made using the Rosetta structure prediction methodology with both automatic and human-assisted protocols. Model accuracy was generally improved using coevolution-derived residue-residue contact information as restraints during Rosetta conformational sampling and refinement, particularly when the number of sequences in the family was more than three times the length of the protein. The highlight was the human-assisted prediction of T0806, a large and topologically complex target with no homologs of known structure, which had unprecedented accuracy: <3.0 Å root-mean-square deviation (RMSD) from the crystal structure over 223 residues. For this target, we increased the amount of conformational sampling over our fully automated method by employing an iterative hybridization protocol. Our results clearly demonstrate, in a blind prediction scenario, that coevolution-derived contacts can considerably increase the accuracy of template-free structure modeling.
|
50
|
Catalytic efficiencies of directly evolved phosphotriesterase variants with structurally different organophosphorus compounds in vitro. Arch Toxicol 2015; 90:2711-2724. [DOI: 10.1007/s00204-015-1626-2] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 07/27/2015] [Accepted: 10/22/2015] [Indexed: 11/29/2022]
|