1
Medvedev KE, Schaeffer RD, Grishin NV. DrugDomain: The evolutionary context of drugs and small molecules bound to domains. Protein Sci 2024; 33:e5116. [PMID: 38979784 PMCID: PMC11231930 DOI: 10.1002/pro.5116] [Received: 03/22/2024] [Revised: 06/27/2024] [Accepted: 06/29/2024] [Indexed: 07/10/2024]
Abstract
Interactions between proteins and small organic compounds play a crucial role in regulating protein functions. These interactions can modulate various aspects of protein behavior, including enzymatic activity, signaling cascades, and structural stability. By binding to specific sites on proteins, small organic compounds can induce conformational changes, alter protein-protein interactions, or directly affect catalytic activity. Therefore, many drugs available on the market today are small molecules (72% of all approved drugs in the last 5 years). Proteins are composed of one or more domains: evolutionary units that convey function or fitness either singly or in concert with others. Understanding which domain(s) of the target protein binds to a drug can lead to additional opportunities for discovering novel targets. The evolutionary classification of protein domains (ECOD) classifies domains into an evolutionary hierarchy that focuses on distant homology. Previously, no structure-based protein domain classification existed that included information about both the interaction between small molecules or drugs and the structural domains of a target protein. This data is especially important for multidomain proteins and large complexes. Here, we present the DrugDomain database that reports the interaction between ECOD of human target proteins and DrugBank molecules and drugs. The pilot version of DrugDomain describes the interaction of 5160 DrugBank molecules associated with 2573 human proteins. It describes domains for all experimentally determined structures of these proteins and incorporates AlphaFold models when such structures are unavailable. The DrugDomain database is available online: http://prodata.swmed.edu/DrugDomain/.
Affiliation(s)
- Kirill E. Medvedev
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- R. Dustin Schaeffer
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Nick V. Grishin
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
2
Yagi S, Tagami S. An ancestral fold reveals the evolutionary link between RNA polymerase and ribosomal proteins. Nat Commun 2024; 15:5938. [PMID: 39025855 PMCID: PMC11258233 DOI: 10.1038/s41467-024-50013-9] [Received: 10/16/2023] [Accepted: 06/25/2024] [Indexed: 07/20/2024] Open
Abstract
Numerous molecular machines are required to drive the central dogma of molecular biology. However, the means by which these numerous proteins emerged in the early evolutionary stage of life remains enigmatic. Many of them possess small β-barrel folds with different topologies, represented by double-psi β-barrels (DPBBs) conserved in DNA and RNA polymerases, and similar but topologically distinct six-stranded β-barrel RIFT or five-stranded β-barrel folds such as OB and SH3 in ribosomal proteins. Here, we discover that the previously reconstructed ancient DPBB sequence could also adopt a β-barrel fold named Double-Zeta β-barrel (DZBB), as a metamorphic protein. The DZBB fold is not found in any modern protein, although its structure shares similarities with RIFT and OB. Indeed, DZBB could be transformed into them through simple engineering experiments. Furthermore, the OB designs could be further converted into SH3 by circular-permutation as previously predicted. These results indicate that these β-barrels diversified quickly from a common ancestor at the beginning of the central dogma evolution.
Affiliation(s)
- Sota Yagi
  - RIKEN Center for Biosystems Dynamics Research, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
  - Faculty of Human Sciences, Waseda University, 2-579-15, Mikajima, Tokorozawa, Saitama, 359-1192, Japan
- Shunsuke Tagami
  - RIKEN Center for Biosystems Dynamics Research, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
  - Graduate School of Medicine, Science and Technology, Shinshu University, 3-1-1 Asahi, Matsumoto City, Nagano, 390-8621, Japan
  - International Institute for Sustainability with Knotted Chiral Meta Matter (WPI-SKCM²), Hiroshima University, 1-3-1 Kagamiyama, Higashi-Hiroshima, Hiroshima, 739-8526, Japan
3
Cottle T, Joh L, Posner C, DeCosta A, Kardon JR. An adaptor for feedback regulation of heme biosynthesis by the mitochondrial protease CLPXP. bioRxiv 2024:2024.07.05.602318. [PMID: 39005287 PMCID: PMC11245108 DOI: 10.1101/2024.07.05.602318] [Indexed: 07/16/2024]
Abstract
Heme biosynthesis is tightly coordinated such that essential heme functions including oxygen transport, respiration, and catalysis are fully supplied without overproducing toxic heme precursors and depleting cellular iron. The initial heme biosynthetic enzyme, ALA synthase (ALAS), exhibits heme-induced degradation that is dependent on the mitochondrial AAA+ protease complex CLPXP, but the mechanism for this negative feedback regulation had not been elucidated. By biochemical reconstitution, we have discovered that POLDIP2 serves as a heme-sensing adaptor protein to deliver ALAS for degradation. Similarly, loss of POLDIP2 strongly impairs ALAS turnover in cells. POLDIP2 directly recognizes heme-bound ALAS to drive assembly of the degradation complex. The C-terminal element of ALAS, truncation of which leads to a form of porphyria (XLDPP), is dispensable for interaction with POLDIP2 but necessary for degradation. Our findings establish the molecular basis for heme-induced degradation of ALAS by CLPXP, establish POLDIP2 as a substrate adaptor for CLPXP, and provide mechanistic insight into two forms of erythropoietic protoporphyria linked to CLPX and ALAS.
4
Zhou Y, Myung Y, Rodrigues CHM, Ascher DB. DDMut-PPI: predicting effects of mutations on protein-protein interactions using graph-based deep learning. Nucleic Acids Res 2024; 52:W207-W214. [PMID: 38783112 PMCID: PMC11223791 DOI: 10.1093/nar/gkae412] [Received: 03/08/2024] [Revised: 04/30/2024] [Accepted: 05/02/2024] [Indexed: 05/25/2024] Open
Abstract
Protein-protein interactions (PPIs) play a vital role in cellular functions and are essential for therapeutic development and understanding diseases. However, current predictive tools often struggle to balance efficiency and precision in predicting the effects of mutations on these complex interactions. To address this, we present DDMut-PPI, a deep learning model that efficiently and accurately predicts changes in PPI binding free energy upon single and multiple point mutations. Building on the robust Siamese network architecture with graph-based signatures from our prior work, DDMut, the DDMut-PPI model was enhanced with a graph convolutional network operated on the protein interaction interface. We used residue-specific embeddings from ProtT5 protein language model as node features, and a variety of molecular interactions as edge features. By integrating evolutionary context with spatial information, this framework enables DDMut-PPI to achieve a robust Pearson correlation of up to 0.75 (root mean squared error: 1.33 kcal/mol) in our evaluations, outperforming most existing methods. Importantly, the model demonstrated consistent performance across mutations that increase or decrease binding affinity. DDMut-PPI offers a significant advancement in the field and will serve as a valuable tool for researchers probing the complexities of protein interactions. DDMut-PPI is freely available as a web server and an application programming interface at https://biosig.lab.uq.edu.au/ddmut_ppi.
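The two figures of merit quoted in this abstract, Pearson correlation and root mean squared error between predicted and measured changes in binding free energy, can be computed as in the sketch below. This is only an illustration of the metrics, not code from DDMut-PPI; the function names and the toy ΔΔG values are made up for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rmse(predicted, observed):
    """Root mean squared error, in the same units as the inputs (e.g. kcal/mol)."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical ddG values (kcal/mol), invented purely for illustration.
experimental = [0.5, -1.2, 2.3, 0.0, 1.1]
predicted = [0.7, -0.8, 1.9, 0.3, 1.4]

print(f"Pearson r = {pearson_r(experimental, predicted):.2f}")
print(f"RMSE = {rmse(predicted, experimental):.2f} kcal/mol")
```

Reporting both numbers is useful because correlation measures whether the ranking of mutations is right, while RMSE measures how far off the predicted energies are in absolute terms.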
Affiliation(s)
- Yunzhuo Zhou
  - The Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, St Lucia, Queensland 4072, Australia
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Victoria 3004, Australia
- YooChan Myung
  - The Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, St Lucia, Queensland 4072, Australia
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Victoria 3004, Australia
- Carlos H M Rodrigues
  - The Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, St Lucia, Queensland 4072, Australia
- David B Ascher
  - The Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, St Lucia, Queensland 4072, Australia
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Victoria 3004, Australia
5
Selim KA, Alva V. PII-like signaling proteins: a new paradigm in orchestrating cellular homeostasis. Curr Opin Microbiol 2024; 79:102453. [PMID: 38678827 DOI: 10.1016/j.mib.2024.102453] [Received: 10/03/2023] [Revised: 02/19/2024] [Accepted: 02/20/2024] [Indexed: 05/01/2024]
Abstract
Members of the PII superfamily are versatile, multitasking signaling proteins ubiquitously found in all domains of life. They adeptly monitor and synchronize the cell's carbon, nitrogen, energy, redox, and diurnal states, primarily by binding interdependently to adenyl-nucleotides, including charged nucleotides (ATP, ADP, and AMP) and second messengers such as cyclic adenosine monophosphate (cAMP), cyclic di-adenosine monophosphate (c-di-AMP), and S-adenosylmethionine-AMP (SAM-AMP). These proteins also undergo a variety of posttranslational modifications, such as phosphorylation, adenylation, uridylation, carboxylation, and disulfide bond formation, which further provide cues on the metabolic state of the cell. Serving as precise metabolic sensors, PII superfamily proteins transmit this information to diverse cellular targets, establishing dynamic regulatory assemblies that fine-tune cellular homeostasis. Recently discovered, PII-like proteins are emerging families of signaling proteins that, while related to canonical PII proteins, have evolved to fulfill a diverse range of cellular functions, many of which remain elusive. In this review, we focus on the evolution of PII-like proteins and summarize the molecular mechanisms governing the assembly dynamics of PII complexes, with a special emphasis on the PII-like protein SbtB.
Affiliation(s)
- Khaled A Selim
  - Microbiology / Molecular Physiology of Prokaryotes, Institute of Biology II, University of Freiburg, Schänzlestraße 1, 79104 Freiburg, Germany
  - Protein Evolution Department, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
- Vikram Alva
  - Protein Evolution Department, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
6
Medvedev KE, Zhang J, Schaeffer RD, Kinch LN, Cong Q, Grishin NV. Structure classification of the proteins from Salmonella enterica pangenome revealed novel potential pathogenicity islands. Sci Rep 2024; 14:12260. [PMID: 38806511 PMCID: PMC11133325 DOI: 10.1038/s41598-024-60991-x] [Received: 02/20/2024] [Accepted: 04/30/2024] [Indexed: 05/30/2024] Open
Abstract
Salmonella enterica is a pathogenic bacterium known for causing severe typhoid fever in humans, making it important to study due to its potential health risks and significant impact on public health. This study provides an evolutionary classification of proteins from the Salmonella enterica pangenome. We classified 17,238 domains from 13,147 proteins from 79,758 Salmonella enterica strains and studied in detail the domains of 272 proteins from 14 characterized Salmonella pathogenicity islands (SPIs). Among SPI-related proteins, 90 function in the secretion machinery, and 41% of the domains of SPI proteins have no previous sequence annotation. By comparing clinical and environmental isolates, we identified 3682 proteins that are overrepresented in the clinical group and that we consider potentially pathogenic. Only 50% of the domains of these potentially pathogenic proteins were previously annotated by sequence methods. Moreover, 36% (1330 out of 3682) of potentially pathogenic proteins cannot be classified into the Evolutionary Classification of Protein Domains database (ECOD). Among the classified domains of potentially pathogenic proteins, the most populated homology groups include helix-turn-helix (HTH), immunoglobulin-related, and P-loop domains-related. Functional analysis revealed overrepresentation of these proteins in biological processes related to viral entry into the host cell, antibiotic biosynthesis, and DNA metabolism and conformation change, and underrepresentation in translational processes. Analysis of the potentially pathogenic proteins indicates that they form 119 clusters, or novel potential pathogenicity islands (NPPIs), within the Salmonella genome, suggesting their potential contribution to the bacterium's virulence. One of the NPPIs revealed significant overrepresentation of potentially pathogenic proteins. Overall, our analysis revealed that the identified potentially pathogenic proteins are poorly studied.
Affiliation(s)
- Kirill E Medvedev
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Jing Zhang
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- R Dustin Schaeffer
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Lisa N Kinch
  - Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Qian Cong
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Nick V Grishin
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
  - Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
7
Burnim AA, Dufault-Thompson K, Jiang X. The three-sided right-handed β-helix is a versatile fold for glycan interactions. Glycobiology 2024; 34:cwae037. [PMID: 38767844 PMCID: PMC11129586 DOI: 10.1093/glycob/cwae037] [Received: 03/13/2024] [Revised: 05/13/2024] [Accepted: 05/17/2024] [Indexed: 05/22/2024] Open
Abstract
Interactions between proteins and glycans are critical to various biological processes. With databases of carbohydrate-interacting proteins and increasing amounts of structural data, the three-sided right-handed β-helix (RHBH) has emerged as a significant structural fold for glycan interactions. In this review, we provide an overview of the sequence, mechanistic, and structural features that enable the RHBH to interact with glycans. The RHBH is a prevalent fold that exists in eukaryotes, prokaryotes, and viruses associated with adhesin and carbohydrate-active enzyme (CAZyme) functions. An evolutionary trajectory analysis on structurally characterized RHBH-containing proteins shows that they likely evolved from carbohydrate-binding proteins with their carbohydrate-degrading activities evolving later. By examining three polysaccharide lyase and three glycoside hydrolase structures, we provide a detailed view of the modes of glycan binding in RHBH proteins. The 3-dimensional shape of the RHBH creates an electrostatically and spatially favorable glycan binding surface that allows for extensive hydrogen bonding interactions, leading to favorable and stable glycan binding. The RHBH is observed to be an adaptable domain capable of being modified with loop insertions and charge inversions to accommodate heterogeneous and flexible glycans and diverse reaction mechanisms. Understanding this prevalent protein fold can advance our knowledge of glycan binding in biological systems and help guide the efficient design and utilization of RHBH-containing proteins in glycobiology research.
Affiliation(s)
- Audrey A Burnim
  - National Library of Medicine, National Institutes of Health, Building 38A, Room 6N607, 8600 Rockville Pike, Bethesda, MD 20894 United States
- Keith Dufault-Thompson
  - National Library of Medicine, National Institutes of Health, Building 38A, Room 6N607, 8600 Rockville Pike, Bethesda, MD 20894 United States
- Xiaofang Jiang
  - National Library of Medicine, National Institutes of Health, Building 38A, Room 6N607, 8600 Rockville Pike, Bethesda, MD 20894 United States
8
Wells J, Hawkins-Hooker A, Bordin N, Sillitoe I, Paige B, Orengo C. Chainsaw: protein domain segmentation with fully convolutional neural networks. Bioinformatics 2024; 40:btae296. [PMID: 38718225 PMCID: PMC11256964 DOI: 10.1093/bioinformatics/btae296] [Received: 01/02/2024] [Revised: 03/23/2024] [Accepted: 05/07/2024] [Indexed: 05/23/2024] Open
Abstract
MOTIVATION: Protein domains are fundamental units of protein structure and play a pivotal role in understanding folding, function, evolution, and design. The advent of accurate structure prediction techniques has resulted in an influx of new structural data, making the partitioning of these structures into domains essential for inferring evolutionary relationships and functional classification.
RESULTS: This article presents Chainsaw, a supervised learning approach to domain parsing that achieves accuracy that surpasses current state-of-the-art methods. Chainsaw uses a fully convolutional neural network which is trained to predict the probability that each pair of residues is in the same domain. Domain predictions are then derived from these pairwise predictions using an algorithm that searches for the most likely assignment of residues to domains given the set of pairwise co-membership probabilities. Chainsaw matches CATH domain annotations in 78% of protein domains versus 72% for the next closest method. When predicting on AlphaFold models, expert human evaluators were twice as likely to prefer Chainsaw's predictions versus the next best method.
AVAILABILITY AND IMPLEMENTATION: github.com/JudeWells/Chainsaw.
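To make the step from pairwise co-membership probabilities to a domain assignment concrete, the sketch below uses a deliberately simpler stand-in: it thresholds the probability matrix and groups residues with a union-find structure. This is not Chainsaw's likelihood-based search, which the abstract does not spell out; the function name, the 0.5 threshold, and the toy matrix are all assumptions for illustration.

```python
def domains_from_pairwise(prob, threshold=0.5):
    """Assign a domain label to each residue from an n x n matrix of
    pairwise same-domain probabilities.

    Simplified stand-in: residues i and j are linked when prob[i][j] is at
    least `threshold`, and each connected group becomes one domain.
    """
    n = len(prob)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if prob[i][j] >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    assignment = []
    for i in range(n):
        root = find(i)
        if root not in labels:
            labels[root] = len(labels)
        assignment.append(labels[root])
    return assignment

# Hypothetical 4-residue example: residues 0-1 and 2-3 pair up strongly.
toy = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(domains_from_pairwise(toy))
```

Thresholding treats each pair independently, so one spurious high-probability pair can merge two domains; searching for the jointly most likely assignment, as the abstract describes, avoids that weakness.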
Affiliation(s)
- Jude Wells
  - Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
- Alex Hawkins-Hooker
  - Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
- Brooks Paige
  - Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
9
Goldford JE, Smith HB, Longo LM, Wing BA, McGlynn SE. Primitive purine biosynthesis connects ancient geochemistry to modern metabolism. Nat Ecol Evol 2024; 8:999-1009. [PMID: 38519634 DOI: 10.1038/s41559-024-02361-4] [Received: 11/21/2022] [Accepted: 02/06/2024] [Indexed: 03/25/2024]
Abstract
An unresolved question in the origin and evolution of life is whether a continuous path from geochemical precursors to the majority of molecules in the biosphere can be reconstructed from modern-day biochemistry. Here we identified a feasible path by simulating the evolution of biosphere-scale metabolism, using only known biochemical reactions and models of primitive coenzymes. We find that purine synthesis constitutes a bottleneck for metabolic expansion, which can be alleviated by non-autocatalytic phosphoryl coupling agents. Early phases of the expansion are enriched with enzymes that are metal dependent and structurally symmetric, supporting models of early biochemical evolution. This expansion trajectory suggests distinct hypotheses regarding the tempo, mode and timing of metabolic pathway evolution, including a late appearance of methane metabolisms and oxygenic photosynthesis consistent with the geochemical record. The concordance between biological and geological analyses suggests that this trajectory provides a plausible evolutionary history for the vast majority of core biochemistry.
Affiliation(s)
- Joshua E Goldford
  - Division of Geological and Planetary Sciences, California Institute of Technology, Pasadena, CA, USA
  - Physics of Living Systems, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Blue Marble Space Institute of Science, Seattle, WA, USA
- Harrison B Smith
  - Blue Marble Space Institute of Science, Seattle, WA, USA
  - Earth-Life Science Institute, Tokyo Institute of Technology, Tokyo, Japan
- Liam M Longo
  - Blue Marble Space Institute of Science, Seattle, WA, USA
  - Earth-Life Science Institute, Tokyo Institute of Technology, Tokyo, Japan
- Boswell A Wing
  - Department of Geological Sciences, University of Colorado, Boulder, CO, USA
- Shawn Erin McGlynn
  - Blue Marble Space Institute of Science, Seattle, WA, USA
  - Earth-Life Science Institute, Tokyo Institute of Technology, Tokyo, Japan
  - Biofunctional Catalyst Research Team, RIKEN Center for Sustainable Resource Science, Wako, Japan
10
Cisneros AF, Nielly-Thibault L, Mallik S, Levy ED, Landry CR. Mutational biases favor complexity increases in protein interaction networks after gene duplication. Mol Syst Biol 2024; 20:549-572. [PMID: 38499674 PMCID: PMC11066126 DOI: 10.1038/s44320-024-00030-z] [Received: 01/09/2024] [Revised: 02/27/2024] [Accepted: 02/28/2024] [Indexed: 03/20/2024] Open
Abstract
Biological systems can gain complexity over time. While some of these transitions are likely driven by natural selection, the extent to which they occur without providing an adaptive benefit is unknown. At the molecular level, one example is heteromeric complexes replacing homomeric ones following gene duplication. Here, we build a biophysical model and simulate the evolution of homodimers and heterodimers following gene duplication using distributions of mutational effects inferred from available protein structures. We keep the specific activity of each dimer identical, so their concentrations drift neutrally without new functions. We show that for more than 60% of tested dimer structures, the relative concentration of the heteromer increases over time due to mutational biases that favor the heterodimer. However, allowing mutational effects on synthesis rates and differences in the specific activity of homo- and heterodimers can limit or reverse the observed bias toward heterodimers. Our results show that the accumulation of more complex protein quaternary structures is likely under neutral evolution, and that natural selection would be needed to reverse this tendency.
Affiliation(s)
- Angel F Cisneros
  - Département de biochimie, de microbiologie et de bio-informatique, Faculté des sciences et de génie, Université Laval, G1V 0A6, Québec, Canada
  - Institut de biologie intégrative et des systèmes, Université Laval, G1V 0A6, Québec, Canada
  - PROTEO, Le regroupement québécois de recherche sur la fonction, l'ingénierie et les applications des protéines, Université Laval, G1V 0A6, Québec, Canada
  - Centre de recherche sur les données massives, Université Laval, G1V 0A6, Québec, Canada
  - Department of Chemical and Structural Biology, Weizmann Institute of Science, 7610001, Rehovot, Israel
- Lou Nielly-Thibault
  - Institut de biologie intégrative et des systèmes, Université Laval, G1V 0A6, Québec, Canada
  - PROTEO, Le regroupement québécois de recherche sur la fonction, l'ingénierie et les applications des protéines, Université Laval, G1V 0A6, Québec, Canada
  - Centre de recherche sur les données massives, Université Laval, G1V 0A6, Québec, Canada
  - Département de biologie, Faculté des sciences et de génie, Université Laval, G1V 0A6, Québec, Canada
- Saurav Mallik
  - Department of Chemical and Structural Biology, Weizmann Institute of Science, 7610001, Rehovot, Israel
- Emmanuel D Levy
  - Department of Chemical and Structural Biology, Weizmann Institute of Science, 7610001, Rehovot, Israel
- Christian R Landry
  - Département de biochimie, de microbiologie et de bio-informatique, Faculté des sciences et de génie, Université Laval, G1V 0A6, Québec, Canada
  - Institut de biologie intégrative et des systèmes, Université Laval, G1V 0A6, Québec, Canada
  - PROTEO, Le regroupement québécois de recherche sur la fonction, l'ingénierie et les applications des protéines, Université Laval, G1V 0A6, Québec, Canada
  - Centre de recherche sur les données massives, Université Laval, G1V 0A6, Québec, Canada
  - Département de biologie, Faculté des sciences et de génie, Université Laval, G1V 0A6, Québec, Canada
11
Wright E. Accurately clustering biological sequences in linear time by relatedness sorting. Nat Commun 2024; 15:3047. [PMID: 38589369 PMCID: PMC11001989 DOI: 10.1038/s41467-024-47371-9] [Received: 10/27/2022] [Accepted: 03/28/2024] [Indexed: 04/10/2024] Open
Abstract
Clustering biological sequences into similar groups is an increasingly important task as the number of available sequences continues to grow exponentially. Search-based approaches to clustering scale super-linearly with the number of input sequences, making it impractical to cluster very large sets of sequences. Approaches to clustering sequences in linear time currently lack the accuracy of super-linear approaches. Here, I set out to develop and characterize a strategy for clustering with linear time complexity that retains the accuracy of less scalable approaches. The resulting algorithm, named Clusterize, sorts sequences by relatedness to linearize the clustering problem. Clusterize produces clusters with accuracy rivaling popular programs (CD-HIT, MMseqs2, and UCLUST) but exhibits linear asymptotic scalability. Clusterize generates higher accuracy and oftentimes much larger clusters than Linclust, a fast linear time clustering algorithm. I demonstrate the utility of Clusterize by accurately solving different clustering problems involving millions of nucleotide or protein sequences.
Affiliation(s)
- Erik Wright
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
  - Center for Evolutionary Biology and Medicine, Pittsburgh, PA, USA
12
Corso G, Deng A, Fry B, Polizzi N, Barzilay R, Jaakkola T. Deep Confident Steps to New Pockets: Strategies for Docking Generalization. arXiv 2024:arXiv:2402.18396v1. [PMID: 38463508 PMCID: PMC10925391] [Indexed: 03/12/2024]
Abstract
Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, docking methods must generalize well across the proteome. Existing benchmarks, however, fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that existing machine learning-based docking models have very weak generalization abilities. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks. Further, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between diffusion and confidence models and exploits the multi-resolution generation process of diffusion models. We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes, edging closer to accurate and generalizable blind docking methods.
Affiliation(s)
- Benjamin Fry
  - Dana-Farber Cancer Institute and Harvard Medical School
13
Sakuma K, Kobayashi N, Sugiki T, Nagashima T, Fujiwara T, Suzuki K, Kobayashi N, Murata T, Kosugi T, Tatsumi-Koga R, Koga N. Design of complicated all-α protein structures. Nat Struct Mol Biol 2024; 31:275-282. [PMID: 38177681 DOI: 10.1038/s41594-023-01147-9] [Received: 09/10/2021] [Accepted: 10/04/2023] [Indexed: 01/06/2024]
Abstract
A wide range of de novo protein structure designs have been achieved, but the complexity of naturally occurring protein structures is still far beyond these designs. Here, to expand the diversity and complexity of de novo designed protein structures, we sought to develop a method for designing 'difficult-to-describe' α-helical protein structures composed of irregularly aligned α-helices like globins. Backbone structure libraries consisting of a myriad of α-helical structures with five or six helices were generated by combining 18 helix-loop-helix motifs and canonical α-helices, and five distinct topologies were selected for de novo design. The designs were found to be monomeric with high thermal stability in solution and fold into the target topologies with atomic accuracy. This study demonstrated that complicated α-helical proteins are created using typical building blocks. The method we developed will enable us to explore the universe of protein structures for designing novel functional proteins.
Affiliation(s)
- Koya Sakuma
- Department of Structural Molecular Science, School of Physical Sciences, SOKENDAI (The Graduate University for Advanced Studies), Hayama, Japan
- Naohiro Kobayashi
- RIKEN Center for Biosystems Dynamics Research, RIKEN, Yokohama, Japan
- Institute for Protein Research, Osaka University, Suita, Japan
- Toshio Nagashima
- RIKEN Center for Biosystems Dynamics Research, RIKEN, Yokohama, Japan
- Kano Suzuki
- Department of Chemistry, Graduate School of Science, Chiba University, Chiba, Japan
- Naoya Kobayashi
- Protein Design Group, Exploratory Research Center on Life and Living Systems (ExCELLS), National Institutes of Natural Sciences, Okazaki, Japan
- Takeshi Murata
- Department of Chemistry, Graduate School of Science, Chiba University, Chiba, Japan
- Membrane Protein Research Center, Chiba University, Chiba, Japan
- Structural Biology Research Center, Institute of Materials Structure Science, High Energy Accelerator Research Organization (KEK), Tsukuba, Japan
- Takahiro Kosugi
- Department of Structural Molecular Science, School of Physical Sciences, SOKENDAI (The Graduate University for Advanced Studies), Hayama, Japan
- Protein Design Group, Exploratory Research Center on Life and Living Systems (ExCELLS), National Institutes of Natural Sciences, Okazaki, Japan
- Research Center of Integrative Molecular Systems, Institute for Molecular Science, National Institutes of Natural Sciences, Okazaki, Japan
- Rie Tatsumi-Koga
- Protein Design Group, Exploratory Research Center on Life and Living Systems (ExCELLS), National Institutes of Natural Sciences, Okazaki, Japan
- Nobuyasu Koga
- Department of Structural Molecular Science, School of Physical Sciences, SOKENDAI (The Graduate University for Advanced Studies), Hayama, Japan
- Protein Design Group, Exploratory Research Center on Life and Living Systems (ExCELLS), National Institutes of Natural Sciences, Okazaki, Japan
- Research Center of Integrative Molecular Systems, Institute for Molecular Science, National Institutes of Natural Sciences, Okazaki, Japan
- Institute for Protein Research, Osaka University, Suita, Japan
|
14
|
Schaeffer RD, Zhang J, Medvedev KE, Kinch LN, Cong Q, Grishin NV. ECOD domain classification of 48 whole proteomes from AlphaFold Structure Database using DPAM2. PLoS Comput Biol 2024; 20:e1011586. [PMID: 38416793 PMCID: PMC10927120 DOI: 10.1371/journal.pcbi.1011586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Revised: 03/11/2024] [Accepted: 02/20/2024] [Indexed: 03/01/2024] Open
Abstract
Protein structure prediction has now been deployed widely across several different large protein sets. Large-scale domain annotation of these predictions can aid in the development of biological insights. Using our Evolutionary Classification of Protein Domains (ECOD) from experimental structures as a basis for classification, we describe the detection and cataloging of domains from 48 whole proteomes deposited in the AlphaFold Database. On average, we can provide positive classification (either of domains or other identifiable non-domain regions) for 90% of residues in all proteomes. We classified 746,349 domains from 536,808 proteins comprising over 226,424,000 amino acid residues. We examine the varying populations of homologous groups in both eukaryotes and bacteria. In addition to containing a higher fraction of disordered regions and unassigned domains, eukaryotes show a higher proportion of repeated proteins, both globular and small repeats. We enumerate those highly populated domains that are shared in both eukaryotes and bacteria, such as the Rossmann domains, TIM barrels, and P-loop domains. Additionally, we compare the sampling of homologous groups from this whole proteome set against our stable ECOD reference and discuss groups that have been enriched by structure predictions. Finally, we discuss the implication of these results for protein target selection for future classification strategies for very large protein sets.
Affiliation(s)
- R. Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Kirill E. Medvedev
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Lisa N. Kinch
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Nick V. Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
|
15
|
Rigden DJ, Fernández XM. The 2024 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Res 2024; 52:D1-D9. [PMID: 38035367 PMCID: PMC10767945 DOI: 10.1093/nar/gkad1173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Accepted: 11/23/2023] [Indexed: 12/02/2023] Open
Abstract
The 2024 Nucleic Acids Research database issue contains 180 papers from across biology and neighbouring disciplines. There are 90 papers reporting on new databases and 83 updates from resources previously published in the Issue. Updates from databases most recently published elsewhere account for a further seven. Nucleic acid databases include the new NAKB for structural information and updates from GenBank, ENA, GEO, TarBase and JASPAR. The Issue's Breakthrough Article concerns NMPFamsDB for novel prokaryotic protein families and the AlphaFold Protein Structure Database has an important update. Metabolism is covered by updates from Reactome, WikiPathways and MetaboLights. Microbes are covered by RefSeq, UNITE, SPIRE and P10K; viruses by ViralZone and PhageScope. Medically-oriented databases include the familiar COSMIC, DrugBank and TTD. Genomics-related resources include Ensembl, UCSC Genome Browser and Monarch. New arrivals cover plant imaging (OPIA and PlantPAD) and crop plants (SoyMD, TCOD and CropGS-Hub). The entire Database Issue is freely available online on the Nucleic Acids Research website (https://academic.oup.com/nar). Over the last year the NAR online Molecular Biology Database Collection has been updated, reviewing 1060 entries, adding 97 new resources and eliminating 388 discontinued URLs, bringing the current total to 1959 databases. It is available at http://www.oxfordjournals.org/nar/database/c/.
Affiliation(s)
- Daniel J Rigden
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Crown Street, Liverpool L69 7ZB, UK
|
16
|
McGuinness KN, Fehon N, Feehan R, Miller M, Mutter AC, Rybak LA, Nam J, AbuSalim JE, Atkinson JT, Heidari H, Losada N, Kim JD, Koder RL, Lu Y, Silberg JJ, Slusky JSG, Falkowski PG, Nanda V. The energetics and evolution of oxidoreductases in deep time. Proteins 2024; 92:52-59. [PMID: 37596815 DOI: 10.1002/prot.26563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 07/06/2023] [Indexed: 08/20/2023]
Abstract
The core metabolic reactions of life drive electrons through a class of redox protein enzymes, the oxidoreductases. The energetics of electron flow is determined by the redox potentials of organic and inorganic cofactors as tuned by the protein environment. Understanding how protein structure affects oxidation-reduction energetics is crucial for studying metabolism, creating bioelectronic systems, and tracing the history of biological energy utilization on Earth. We constructed ProtReDox (https://protein-redox-potential.web.app), a manually curated database of experimentally determined redox potentials. With over 500 measurements, we can begin to identify how proteins modulate oxidation-reduction energetics across the tree of life. By mapping redox potentials onto networks of oxidoreductase fold evolution, we can infer the evolution of electron transfer energetics over deep time. ProtReDox is designed to include user-contributed submissions with the intention of making it a valuable resource for researchers in this field.
Affiliation(s)
- Kenneth N McGuinness
- Department of Natural Sciences, Caldwell University, Caldwell, New Jersey, USA
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- Nolan Fehon
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Ryan Feehan
- Computational Biology Program, The University of Kansas, Lawrence, Kansas, USA
- Michelle Miller
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Andrew C Mutter
- Department of Physics, The City College of New York, New York, New York, USA
- Laryssa A Rybak
- Department of Physics, The City College of New York, New York, New York, USA
- Justin Nam
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- Jenna E AbuSalim
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- Joshua T Atkinson
- Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas, USA
- Hirbod Heidari
- Department of Chemistry, University of Texas at Austin, Austin, Texas, USA
- Natalie Losada
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- J Dongun Kim
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Ronald L Koder
- Department of Physics, The City College of New York, New York, New York, USA
- Yi Lu
- Department of Chemistry, University of Texas at Austin, Austin, Texas, USA
- Jonathan J Silberg
- Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas, USA
- Joanna S G Slusky
- Computational Biology Program, The University of Kansas, Lawrence, Kansas, USA
- Department of Molecular Biosciences, The University of Kansas, Lawrence, Kansas, USA
- Paul G Falkowski
- Environmental Biophysics and Molecular Ecology Program, Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Department of Earth and Planetary Sciences, Rutgers University, New Brunswick, New Jersey, USA
- Vikas Nanda
- Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, New Jersey, USA
- Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, Rutgers University, Piscataway, New Jersey, USA
|
17
|
Banach M. Structural Outlier Detection and Zernike-Canterakis Moments for Molecular Surface Meshes: Fast Implementation in Python. Molecules 2023; 29:52. [PMID: 38202635 PMCID: PMC10779519 DOI: 10.3390/molecules29010052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Revised: 12/06/2023] [Accepted: 12/12/2023] [Indexed: 01/12/2024] Open
Abstract
Object retrieval systems measure the degree of similarity of the shape of 3D models. They search for the elements of the 3D model databases that resemble the query model. In structural bioinformatics, the query model is a protein tertiary/quaternary structure and the objective is to find similarly shaped molecules in the Protein Data Bank (PDB). With the ever-growing size of the PDB, a direct atomic coordinate comparison with all its members is impractical. To overcome this problem, the shape of the molecules can be encoded by fixed-length feature vectors. The distance of a protein to the entire PDB can be measured in this low-dimensional domain in linear time. The state-of-the-art approaches utilize Zernike-Canterakis (ZC) moments for the shape encoding and supply the retrieval process with geometric data of the input structures. The BioZernike descriptors have been a standard utility of the PDB since 2020. However, when calculating the ZC moments locally, one encounters a shortage of libraries that are readily usable in custom programs (i.e., without relying on external binaries), particularly for programs written in Python. Here, a fast and well-documented Python implementation of the Pozo-Koehl (PK) algorithm is presented. In contrast to the more popular algorithm by Novotni and Klein, which is based on the voxelized volume, the PK algorithm produces ZC moments directly from the triangular surface meshes of 3D models. In particular, it can accept the molecular surfaces of proteins as its input. In the presented PK-Zernike library, owing to Numba's just-in-time compilation, a mesh with 50,000 facets is processed by a single thread in a second at the moment order 20. Since this is the first time the PK algorithm is used in structural bioinformatics, it is employed in a novel, simple, but efficient protein structure retrieval pipeline. The elimination of the outlying chain fragments via a fast PCA-based subroutine improves the discrimination ability, allowing this pipeline to achieve a 0.961 area under the ROC curve in the BioZernike validation suite (0.997 for the assemblies). The correlation between the results of the proposed approach and of the 3D Surfer program attains values up to 0.99.
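The retrieval idea in this abstract (encode each shape as a fixed-length vector, then compare a query against every stored vector in a single pass) can be illustrated with a minimal sketch. It does not use the PK-Zernike library or real ZC moments: the function name `rank_by_shape`, the three-component descriptors, the entry names, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_shape(
    query: Sequence[float],
    database: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k database entries closest to the query descriptor.

    Each stored vector is visited exactly once, so the cost grows linearly
    with the number of entries (for a fixed descriptor length and small top_k).
    """
    scored = ((euclidean(query, vec), name) for name, vec in database.items())
    best = heapq.nsmallest(top_k, scored)
    return [(name, dist) for dist, name in best]


# Hypothetical toy descriptors; real shape descriptors are much longer.
database = {
    "entry_a": [1.0, 0.0, 0.0],
    "entry_b": [0.0, 1.0, 0.0],
    "entry_c": [0.9, 0.1, 0.0],
}
query = [1.0, 0.0, 0.0]
hits = rank_by_shape(query, database, top_k=2)
```

In this toy data the two nearest entries are `entry_a` and `entry_c`; a real pipeline would replace the hand-written vectors with descriptors computed from molecular surface meshes.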
Affiliation(s)
- Mateusz Banach
- Department of Bioinformatics and Telemedicine, Faculty of Medicine, Jagiellonian University Medical College, Medyczna 7, 30-688 Kraków, Poland
|
18
|
Kinch LN, Schaeffer RD, Zhang J, Cong Q, Orth K, Grishin N. Insights into virulence: structure classification of the Vibrio parahaemolyticus RIMD mobilome. mSystems 2023; 8:e0079623. [PMID: 38014954 PMCID: PMC10734457 DOI: 10.1128/msystems.00796-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 10/17/2023] [Indexed: 11/29/2023] Open
Abstract
IMPORTANCE The pandemic Vibrio parahaemolyticus (Vpar) strain RIMD causes seafood-borne illness worldwide. Previous comparative genomic studies have revealed pathogenicity islands in RIMD that contribute to the success of the strain in infection. However, not all virulence determinants have been identified, and many of the proteins encoded in known pathogenicity islands are of unknown function. Based on the ECOD database, we used evolution-based classification of structure models for the RIMD proteome to improve our functional understanding of virulence determinants acquired by the pandemic strain. We further identify and classify previously unknown mobile protein domains as well as fast-evolving residue positions in structure models that contribute to virulence and adaptation with respect to a pre-pandemic strain. Our work highlights key contributions of phage in mediating seafood-borne illness, suggesting this strain balances its avoidance of phage predators with its successful colonization of human hosts.
Affiliation(s)
- Lisa N. Kinch
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- R. Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Kim Orth
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Nick Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
|
19
|
Lau AM, Kandathil SM, Jones DT. Merizo: a rapid and accurate protein domain segmentation method using invariant point attention. Nat Commun 2023; 14:8445. [PMID: 38114456 PMCID: PMC10730818 DOI: 10.1038/s41467-023-43934-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Accepted: 11/24/2023] [Indexed: 12/21/2023] Open
Abstract
The AlphaFold Protein Structure Database, containing predictions for over 200 million proteins, has been met with enthusiasm over its potential in enriching structural biological research and beyond. Currently, access to the database is precluded by an urgent need for tools that allow the efficient traversal, discovery, and documentation of its contents. Identifying domain regions in the database is a non-trivial endeavour and doing so will aid our understanding of protein structure and function, while facilitating drug discovery and comparative genomics. Here, we describe a deep learning method for domain segmentation called Merizo, which learns to cluster residues into domains in a bottom-up manner. Merizo is trained on CATH domains and fine-tuned on AlphaFold2 models via self-distillation, enabling it to be applied to both experimental and AlphaFold2 models. As proof of concept, we apply Merizo to the human proteome, identifying 40,818 putative domains that can be matched to CATH representative domains.
Affiliation(s)
- Andy M Lau
- Department of Computer Science, University College London, London, WC1E 6BT, UK
- Shaun M Kandathil
- Department of Computer Science, University College London, London, WC1E 6BT, UK
- David T Jones
- Department of Computer Science, University College London, London, WC1E 6BT, UK
|
20
|
Zayats V, Sikora M, Perlinska AP, Stasiulewicz A, Gren BA, Sulkowska JI. Conservation of knotted and slipknotted topology in transmembrane transporters. Biophys J 2023; 122:4528-4541. [PMID: 37919904 PMCID: PMC10719070 DOI: 10.1016/j.bpj.2023.10.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Revised: 08/25/2023] [Accepted: 10/30/2023] [Indexed: 11/04/2023] Open
Abstract
The existence of nontrivial topology is well accepted in globular proteins but not in membrane proteins. Our comprehensive topological analysis of the Protein Data Bank structures reveals 18 families of transmembrane proteins with nontrivial topology, showing that they constitute a significant number of membrane proteins. Moreover, we found that they comprise one of the largest groups of secondary active transporters. We classified them based on their knotted fingerprint into four groups: three slipknotted and one knotted. Unexpectedly, we found that the same protein can possess two distinct slipknot motifs that correspond to its outward- and inward-open conformational state. Based on the analysis of structures and knotted fingerprints, we show that slipknot topology is directly involved in the conformational transition and substrate transfer. Therefore, entanglement can be used to classify proteins and to find their structure-function relationship. Furthermore, based on the topological analysis of the transmembrane protein structures predicted by AlphaFold, we identified new potentially slipknotted protein families.
Affiliation(s)
- Vasilina Zayats
- Centre of New Technologies, University of Warsaw, Warsaw, Poland
- Maciej Sikora
- Centre of New Technologies, University of Warsaw, Warsaw, Poland; Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland
- Adam Stasiulewicz
- Centre of New Technologies, University of Warsaw, Warsaw, Poland; Department of Drug Chemistry, Faculty of Pharmacy, Medical University of Warsaw, Warsaw, Poland
- Bartosz A Gren
- Centre of New Technologies, University of Warsaw, Warsaw, Poland
|
21
|
Mogila I, Tamulaitiene G, Keda K, Timinskas A, Ruksenaite A, Sasnauskas G, Venclovas Č, Siksnys V, Tamulaitis G. Ribosomal stalk-captured CARF-RelE ribonuclease inhibits translation following CRISPR signaling. Science 2023; 382:1036-1041. [PMID: 38033086 DOI: 10.1126/science.adj2107] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 06/21/2023] [Accepted: 10/31/2023] [Indexed: 12/02/2023]
Abstract
Prokaryotic type III CRISPR-Cas antiviral systems employ cyclic oligoadenylate (cAn) signaling to activate a diverse range of auxiliary proteins that reinforce the CRISPR-Cas defense. Here we characterize a class of cAn-dependent effector proteins named CRISPR-Cas-associated messenger RNA (mRNA) interferase 1 (Cami1) consisting of a CRISPR-associated Rossmann fold sensor domain fused to winged helix-turn-helix and a RelE-family mRNA interferase domain. Upon activation by cyclic tetra-adenylate (cA4), Cami1 cleaves mRNA exposed at the ribosomal A-site thereby depleting mRNA and leading to cell growth arrest. The structures of apo-Cami1 and the ribosome-bound Cami1-cA4 complex delineate the conformational changes that lead to Cami1 activation and the mechanism of Cami1 binding to a bacterial ribosome, revealing unexpected parallels with eukaryotic ribosome-inactivating proteins.
Affiliation(s)
- Irmantas Mogila
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio av. 7, LT-10257 Vilnius, Lithuania
- Giedre Tamulaitiene
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio av. 7, LT-10257 Vilnius, Lithuania
- Konstanty Keda
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio av. 7, LT-10257 Vilnius, Lithuania
- Albertas Timinskas
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio av. 7, LT-10257 Vilnius, Lithuania
- Audrone Ruksenaite
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio av. 7, LT-10257 Vilnius, Lithuania
- Giedrius Sasnauskas
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio av. 7, LT-10257 Vilnius, Lithuania
- Česlovas Venclovas
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio av. 7, LT-10257 Vilnius, Lithuania
- Virginijus Siksnys
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio av. 7, LT-10257 Vilnius, Lithuania
- Gintautas Tamulaitis
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Saulėtekio av. 7, LT-10257 Vilnius, Lithuania
|
22
|
Kryshtafovych A, Rigden DJ. To split or not to split: CASP15 targets and their processing into tertiary structure evaluation units. Proteins 2023; 91:1558-1570. [PMID: 37254889 PMCID: PMC10687315 DOI: 10.1002/prot.26533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Revised: 05/02/2023] [Accepted: 05/18/2023] [Indexed: 06/01/2023]
Abstract
Processing of CASP15 targets into evaluation units (EUs) and assigning them to evolutionary-based prediction classes is presented in this study. The targets were first split into structural domains based on compactness and similarity to other proteins. Models were then evaluated against these domains and their combinations. The domains were joined into larger EUs if predictors' performance on the combined units was similar to that on individual domains. Alternatively, if most predictors performed better on the individual domains, then they were retained as EUs. As a result, 112 evaluation units were created from 77 tertiary structure prediction targets. The EUs were assigned to four prediction classes roughly corresponding to target difficulty categories in previous CASPs: TBM (template-based modeling, easy or hard), FM (free modeling), and the TBM/FM overlap category. More than a third of CASP15 EUs were attributed to the historically most challenging FM class, where homology or structural analogy to proteins of known fold cannot be detected.
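The split-or-join rule described in this abstract can be sketched as a majority vote over predictors. This is only an illustration under stated assumptions: the function name `should_split`, the 0-100 score scale, the use of a per-predictor mean over domains, and the 5-point tolerance are hypothetical choices and are not taken from the CASP15 assessment itself.

```python
from typing import Sequence


def should_split(
    combined_scores: Sequence[float],
    domain_scores: Sequence[Sequence[float]],
    tolerance: float = 5.0,
) -> bool:
    """Decide whether domains stay separate evaluation units (EUs).

    combined_scores[i] is predictor i's score on the joined unit;
    domain_scores[i] holds the same predictor's scores on each domain.
    Returns True when most predictors do better on the individual domains
    by more than `tolerance` (an assumed threshold); otherwise the domains
    would be joined into one larger EU.
    """
    better_on_domains = 0
    for combined, per_domain in zip(combined_scores, domain_scores):
        mean_domain = sum(per_domain) / len(per_domain)
        if mean_domain - combined > tolerance:
            better_on_domains += 1
    return better_on_domains > len(combined_scores) / 2


# Hypothetical scores for three predictors on a two-domain target.
split_case = should_split([40.0, 45.0, 80.0], [[70.0, 75.0], [68.0, 72.0], [82.0, 81.0]])
join_case = should_split([78.0, 80.0, 60.0], [[80.0, 79.0], [81.0, 82.0], [75.0, 70.0]])
```

In the first toy case two of three predictors gain substantially on the separate domains, so the domains are kept as individual EUs; in the second only one does, so the domains would be joined.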
Affiliation(s)
- Daniel J. Rigden
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
|
23
|
Ribeiro AJM, Riziotis IG, Borkakoti N, Thornton JM. Enzyme function and evolution through the lens of bioinformatics. Biochem J 2023; 480:1845-1863. [PMID: 37991346 PMCID: PMC10754289 DOI: 10.1042/bcj20220405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Revised: 11/09/2023] [Accepted: 11/14/2023] [Indexed: 11/23/2023]
Abstract
Enzymes have been shaped by evolution over billions of years to catalyse the chemical reactions that support life on earth. Dispersed in the literature, or organised in online databases, knowledge about enzymes can be structured in distinct dimensions, either related to their quality as biological macromolecules, such as their sequence and structure, or related to their chemical functions, such as the catalytic site, kinetics, mechanism, and overall reaction. The evolution of enzymes can only be understood when each of these dimensions is considered. In addition, many of the properties of enzymes only make sense in the light of evolution. We start this review by outlining the main paradigms of enzyme evolution, including gene duplication and divergence, convergent evolution, and evolution by recombination of domains. In the second part, we overview the current collective knowledge about enzymes, as organised by different types of data and collected in several databases. We also highlight some increasingly powerful computational tools that can be used to close gaps in understanding, in particular for types of data that require laborious experimental protocols. We believe that recent advances in protein structure prediction will be a powerful catalyst for the prediction of binding, mechanism, and ultimately, chemical reactions. A comprehensive mapping of enzyme function and evolution may be attainable in the near future.
Affiliation(s)
- Antonio J. M. Ribeiro
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K.
- Ioannis G. Riziotis
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K.
- Neera Borkakoti
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K.
- Janet M. Thornton
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K.
|
24
|
Smug BJ, Szczepaniak K, Rocha EPC, Dunin-Horkawicz S, Mostowy RJ. Ongoing shuffling of protein fragments diversifies core viral functions linked to interactions with bacterial hosts. Nat Commun 2023; 14:7460. [PMID: 38016962 PMCID: PMC10684548 DOI: 10.1038/s41467-023-43236-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/12/2023] [Accepted: 11/03/2023] [Indexed: 11/30/2023] Open
Abstract
Biological modularity enhances evolutionary adaptability. This principle is vividly exemplified by bacterial viruses (phages), which display extensive genomic modularity. Phage genomes are composed of independent functional modules that evolve separately and recombine in various configurations. While genomic modularity in phages has been extensively studied, less attention has been paid to protein modularity, in which proteins consist of distinct building blocks that can evolve and recombine, enhancing functional and genetic diversity. Here, we use a set of 133,574 representative phage proteins and highly sensitive homology detection to capture instances of domain mosaicism, defined as fragment sharing between two otherwise unrelated proteins, and to understand its relationship with functional diversity in phage genomes. We discover that unrelated proteins from diverse functional classes frequently share homologous domains. This phenomenon is particularly pronounced within receptor-binding proteins, endolysins, and DNA polymerases. We also identify multiple instances of recent diversification via domain shuffling in receptor-binding proteins, neck passage structures, endolysins and some members of the core replication machinery, often transcending distant taxonomic and ecological boundaries. Our findings suggest that ongoing diversification via domain shuffling is reflective of a co-evolutionary arms race, driven by the need to overcome various bacterial resistance mechanisms against phages.
Affiliation(s)
- Bogna J Smug
- Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
- Eduardo P C Rocha
- Institut Pasteur, Université Paris Cité, CNRS UMR3525, Microbial Evolutionary Genomics, Paris, France
- Stanislaw Dunin-Horkawicz
- Institute of Evolutionary Biology, Faculty of Biology & Biological and Chemical Research Centre, University of Warsaw, Żwirki i Wigury 101, 02-089, Warsaw, Poland
- Department of Protein Evolution, Max Planck Institute for Developmental Biology, Max-Planck-Ring 5, 72076, Tübingen, Germany
- Rafał J Mostowy
- Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
|
25
|
Avraham O, Tsaban T, Ben-Aharon Z, Tsaban L, Schueler-Furman O. Protein language models can capture protein quaternary state. BMC Bioinformatics 2023; 24:433. [PMID: 37964216 PMCID: PMC10647083 DOI: 10.1186/s12859-023-05549-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Accepted: 10/27/2023] [Indexed: 11/16/2023] Open
Abstract
BACKGROUND: Determining a protein's quaternary state, i.e. the number of monomers in a functional unit, is a critical step in protein characterization. Many proteins form multimers for their activity, and over 50% are estimated to naturally form homomultimers. Experimental quaternary state determination can be challenging and require extensive work. To complement these efforts, a number of computational tools have been developed for quaternary state prediction, often utilizing experimentally validated structural information. Recently, dramatic advances have been made in the field of deep learning for predicting protein structure and other characteristics. Protein language models, such as ESM-2, that apply computational natural-language models to proteins successfully capture secondary structure, protein cell localization and other characteristics, from a single sequence. Here we hypothesize that information about the protein quaternary state may be contained within protein sequences as well, allowing us to benefit from these novel approaches in the context of quaternary state prediction. RESULTS: We generated ESM-2 embeddings for a large dataset of proteins with quaternary state labels from the curated QSbio dataset. We trained a model for quaternary state classification and assessed it on a non-overlapping set of distinct folds (ECOD family level). Our model, named QUEEN (QUaternary state prediction using dEEp learNing), performs worse than approaches that include information from solved crystal structures. However, it successfully learned to distinguish multimers from monomers, and predicts the specific quaternary state with moderate success, better than simple sequence similarity-based annotation transfer. Our results demonstrate that complex, quaternary state related information is included in such embeddings. CONCLUSIONS: QUEEN is the first to investigate the power of embeddings for the prediction of the quaternary state of proteins. As such, it lays out strengths as well as limitations of a sequence-based protein language model approach, compared to structure-based approaches. Since it does not require any structural information and is fast, we anticipate that it will be of wide use both for in-depth investigation of specific systems, as well as for studies of large sets of protein sequences. A simple colab implementation is available at: https://colab.research.google.com/github/Furman-Lab/QUEEN/blob/main/QUEEN_prediction_notebook.ipynb.
Affiliation(s)
- Orly Avraham
- Department of Microbiology and Molecular Genetics, Faculty of Medicine, Institute for Biomedical Research Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Tomer Tsaban
- Department of Microbiology and Molecular Genetics, Faculty of Medicine, Institute for Biomedical Research Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Ziv Ben-Aharon
- Department of Microbiology and Molecular Genetics, Faculty of Medicine, Institute for Biomedical Research Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Linoy Tsaban
- Gaffin Center for Neuro-Oncology, Sharett Institute for Oncology, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
- The Wohl Institute for Translational Medicine, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
| | - Ora Schueler-Furman
- Department of Microbiology and Molecular Genetics, Faculty of Medicine, Institute for Biomedical Research Israel-Canada, The Hebrew University of Jerusalem, Jerusalem, Israel.
|
26
|
Wang Y, Gallagher LA, Andrade PA, Liu A, Humphreys IR, Turkarslan S, Cutler KJ, Arrieta-Ortiz ML, Li Y, Radey MC, McLean JS, Cong Q, Baker D, Baliga NS, Peterson SB, Mougous JD. Genetic manipulation of Patescibacteria provides mechanistic insights into microbial dark matter and the epibiotic lifestyle. Cell 2023; 186:4803-4817.e13. [PMID: 37683634 PMCID: PMC10633639 DOI: 10.1016/j.cell.2023.08.017] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/11/2023] [Revised: 07/06/2023] [Accepted: 08/16/2023] [Indexed: 09/10/2023]
Abstract
Patescibacteria, also known as the candidate phyla radiation (CPR), are a diverse group of bacteria that constitute a disproportionately large fraction of microbial dark matter. Its few cultivated members, belonging mostly to Saccharibacteria, grow as epibionts on host Actinobacteria. Due to a lack of suitable tools, the genetic basis of this lifestyle and other unique features of Patescibacteria remain unexplored. Here, we show that Saccharibacteria exhibit natural competence, and we exploit this property for their genetic manipulation. Imaging of fluorescent protein-labeled Saccharibacteria provides high spatiotemporal resolution of phenomena accompanying epibiotic growth, and a transposon-insertion sequencing (Tn-seq) genome-wide screen reveals the contribution of enigmatic Saccharibacterial genes to growth on their hosts. Finally, we leverage metagenomic data to provide cutting-edge protein structure-based bioinformatic resources that support the strain Southlakia epibionticum and its corresponding host, Actinomyces israelii, as a model system for unlocking the molecular underpinnings of the epibiotic lifestyle.
Affiliation(s)
- Yaxi Wang
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Larry A Gallagher
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Pia A Andrade
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Andi Liu
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Ian R Humphreys
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA; Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | | | - Kevin J Cutler
- Department of Physics, University of Washington, Seattle, WA 98195, USA
| | | | - Yaqiao Li
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA; Institute for Systems Biology, Seattle, WA 98109, USA
| | - Matthew C Radey
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Jeffrey S McLean
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA; Department of Periodontics, University of Washington, Seattle, WA 98195, USA
| | - Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
| | - David Baker
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA; Institute for Protein Design, University of Washington, Seattle, WA 98195, USA; Howard Hughes Medical Institute, University of Washington, Seattle, WA 98109, USA
| | | | - S Brook Peterson
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Joseph D Mougous
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA; Howard Hughes Medical Institute, University of Washington, Seattle, WA 98109, USA; Microbial Interactions and Microbiome Center, University of Washington, Seattle, WA 98195, USA.
|
27
|
Kojima Y, Mishiro-Sato E, Fujishita T, Satoh K, Kajino-Sakamoto R, Oze I, Nozawa K, Narita Y, Ogata T, Matsuo K, Muro K, Taketo MM, Soga T, Aoki M. Decreased liver B vitamin-related enzymes as a metabolic hallmark of cancer cachexia. Nat Commun 2023; 14:6246. [PMID: 37803016 PMCID: PMC10558488 DOI: 10.1038/s41467-023-41952-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/16/2022] [Accepted: 09/20/2023] [Indexed: 10/08/2023] Open
Abstract
Cancer cachexia is a complex metabolic disorder accounting for ~20% of cancer-related deaths, yet its metabolic landscape remains unexplored. Here, we report a decrease in B vitamin-related liver enzymes as a hallmark of systemic metabolic changes occurring in cancer cachexia. Metabolomics of multiple mouse models highlights cachexia-associated reductions of niacin, vitamin B6, and a glycine-related subset of one-carbon (C1) metabolites in the liver. Integration of proteomics and metabolomics reveals that liver enzymes related to niacin, vitamin B6, and glycine-related C1 enzymes dependent on B vitamins decrease linearly with their associated metabolites, likely reflecting stoichiometric cofactor-enzyme interactions. The decrease of B vitamin-related enzymes is also found to depend on protein abundance and cofactor subtype. These metabolic/proteomic changes and decreased protein malonylation, another cachexia feature identified by protein post-translational modification analysis, are reflected in blood samples from mouse models and gastric cancer patients with cachexia, underscoring the clinical relevance of our findings.
Affiliation(s)
- Yasushi Kojima
- Division of Pathophysiology, Aichi Cancer Center Research Institute, 1-1 Kanokoden, Chikusa-ku, Nagoya, Aichi, 464-8681, Japan.
| | - Emi Mishiro-Sato
- Division of Pathophysiology, Aichi Cancer Center Research Institute, 1-1 Kanokoden, Chikusa-ku, Nagoya, Aichi, 464-8681, Japan
| | - Teruaki Fujishita
- Division of Pathophysiology, Aichi Cancer Center Research Institute, 1-1 Kanokoden, Chikusa-ku, Nagoya, Aichi, 464-8681, Japan
| | - Kiyotoshi Satoh
- Institute for Advanced Biosciences, Keio University, 246-2 Mizukami, Kakuganji, Tsuruoka, Yamagata, 997-0052, Japan
| | - Rie Kajino-Sakamoto
- Division of Pathophysiology, Aichi Cancer Center Research Institute, 1-1 Kanokoden, Chikusa-ku, Nagoya, Aichi, 464-8681, Japan
| | - Isao Oze
- Division of Cancer Epidemiology and Prevention, Aichi Cancer Center Research Institute, 1-1 Kanokoden, Chikusa-ku, Nagoya, Aichi, 464-8681, Japan
| | - Kazuki Nozawa
- Department of Clinical Oncology, Aichi Cancer Center Hospital, 1-1 Kanokoden, Chikusa-ku, Nagoya, Aichi, 464-8681, Japan
| | - Yukiya Narita
- Department of Clinical Oncology, Aichi Cancer Center Hospital, 1-1 Kanokoden, Chikusa-ku, Nagoya, Aichi, 464-8681, Japan
| | - Takatsugu Ogata
- Department of Clinical Oncology, Aichi Cancer Center Hospital, 1-1 Kanokoden, Chikusa-ku, Nagoya, Aichi, 464-8681, Japan
| | - Keitaro Matsuo
- Division of Cancer Epidemiology and Prevention, Aichi Cancer Center Research Institute, 1-1 Kanokoden, Chikusa-ku, Nagoya, Aichi, 464-8681, Japan
| | - Kei Muro
- Department of Clinical Oncology, Aichi Cancer Center Hospital, 1-1 Kanokoden, Chikusa-ku, Nagoya, Aichi, 464-8681, Japan
| | - Makoto Mark Taketo
- Colon Cancer Project, Kyoto University Hospital-iACT, Kyoto University, Yoshida-Konoe-cho, Sakyo-ku, Kyoto, 606-8501, Japan
| | - Tomoyoshi Soga
- Institute for Advanced Biosciences, Keio University, 246-2 Mizukami, Kakuganji, Tsuruoka, Yamagata, 997-0052, Japan
| | - Masahiro Aoki
- Division of Pathophysiology, Aichi Cancer Center Research Institute, 1-1 Kanokoden, Chikusa-ku, Nagoya, Aichi, 464-8681, Japan.
- Department of Cancer Physiology, Nagoya University Graduate School of Medicine, 65 Tsurumai-cho, Showa-ku, Nagoya, Aichi, 466-8550, Japan.
|
28
|
Kaminski K, Ludwiczak J, Pawlicki K, Alva V, Dunin-Horkawicz S. pLM-BLAST: distant homology detection based on direct comparison of sequence representations from protein language models. Bioinformatics 2023; 39:btad579. [PMID: 37725369 PMCID: PMC10576641 DOI: 10.1093/bioinformatics/btad579] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 11/27/2022] [Revised: 07/09/2023] [Accepted: 09/15/2023] [Indexed: 09/21/2023] Open
Abstract
MOTIVATION: The detection of homology through sequence comparison is a typical first step in the study of protein function and evolution. In this work, we explore the applicability of protein language models to this task. RESULTS: We introduce pLM-BLAST, a tool inspired by BLAST, that detects distant homology by comparing single-sequence representations (embeddings) derived from a protein language model, ProtT5. Our benchmarks reveal that pLM-BLAST maintains a level of accuracy on par with HHsearch for both highly similar sequences (with >50% identity) and markedly divergent sequences (with <30% identity), while being significantly faster. Additionally, pLM-BLAST stands out among other embedding-based tools due to its ability to compute local alignments. We show that these local alignments, produced by pLM-BLAST, often connect highly divergent proteins, thereby highlighting its potential to uncover previously undiscovered homologous relationships and improve protein annotation. AVAILABILITY AND IMPLEMENTATION: pLM-BLAST is accessible via the MPI Bioinformatics Toolkit as a web server for searching precomputed databases (https://toolkit.tuebingen.mpg.de/tools/plmblast). It is also available as a standalone tool for building custom databases and performing batch searches (https://github.com/labstructbioinf/pLM-BLAST).
Affiliation(s)
- Kamil Kaminski
- Institute of Evolutionary Biology, Faculty of Biology, Biological and Chemical Research Centre, University of Warsaw, Warsaw 02-089, Poland
- Laboratory of Structural Bioinformatics, Centre of New Technologies, University of Warsaw, Warsaw 02-097, Poland
| | - Jan Ludwiczak
- Institute of Evolutionary Biology, Faculty of Biology, Biological and Chemical Research Centre, University of Warsaw, Warsaw 02-089, Poland
| | - Kamil Pawlicki
- Institute of Evolutionary Biology, Faculty of Biology, Biological and Chemical Research Centre, University of Warsaw, Warsaw 02-089, Poland
| | - Vikram Alva
- Department of Protein Evolution, Max Planck Institute for Biology Tübingen, Tübingen 72076, Germany
| | - Stanislaw Dunin-Horkawicz
- Institute of Evolutionary Biology, Faculty of Biology, Biological and Chemical Research Centre, University of Warsaw, Warsaw 02-089, Poland
- Department of Protein Evolution, Max Planck Institute for Biology Tübingen, Tübingen 72076, Germany
|
29
|
Ardern Z. Alternative Reading Frames are an Underappreciated Source of Protein Sequence Novelty. J Mol Evol 2023; 91:570-580. [PMID: 37326679 DOI: 10.1007/s00239-023-10122-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2022] [Accepted: 05/31/2023] [Indexed: 06/17/2023]
Abstract
Protein-coding DNA sequences can be translated into completely different amino acid sequences if the nucleotide triplets used are shifted by a non-triplet amount on the same DNA strand or by translating codons from the opposite strand. Such "alternative reading frames" of protein-coding genes are a major contributor to the evolution of novel protein products. Recent studies demonstrating this include examples across the three domains of cellular life and in viruses. These sequences increase the number of trials potentially available for the evolutionary invention of new genes and also have unusual properties which may facilitate gene origin. There is evidence that the structure of the standard genetic code contributes to the features and gene-likeness of some alternative frame sequences. These findings have important implications across diverse areas of molecular biology, including for genome annotation, structural biology, and evolutionary genomics.
|
30
|
El Khoury G, Azzam W, Rebehmed J. PyProtif: a PyMol plugin to retrieve and visualize protein motifs for structural studies. Amino Acids 2023; 55:1429-1436. [PMID: 37698713 DOI: 10.1007/s00726-023-03323-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2022] [Accepted: 08/24/2023] [Indexed: 09/13/2023]
Abstract
Proteins often possess several motifs, and proteins with similar motifs have been found to have similar biochemical properties and thus related biological functions. Accordingly, multiple databases have been developed to store information on such motifs. For instance, PDBsum stores the structural motifs generated by Promotif, and Pfam stores pre-computed patterns of functional domains. All this stored information is extremely useful, and its value can be increased further by integrating these motifs into visualization software. In this work, we have developed PyProtif, a plugin for the PyMOL molecular visualization program, which automatically retrieves protein structural and functional motifs from different databases and integrates them into PyMOL for visualization and analysis. Through an expandable menu and a user-friendly interface, the plugin lets users study multiple proteins simultaneously and select and manipulate each motif separately. Thus, this plugin will be of great interest for structural, evolutionary and classification studies of proteins.
Affiliation(s)
- Gilbert El Khoury
- Department of Computer Science and Mathematics, Lebanese American University, Beirut, Lebanon
| | - Wael Azzam
- Department of Computer Science and Mathematics, Lebanese American University, Beirut, Lebanon
| | - Joseph Rebehmed
- Department of Computer Science and Mathematics, Lebanese American University, Beirut, Lebanon.
|
31
|
Pei J, Cong Q. Computational analysis of regulatory regions in human protein kinases. Protein Sci 2023; 32:e4764. [PMID: 37632170 PMCID: PMC10503413 DOI: 10.1002/pro.4764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Revised: 08/08/2023] [Accepted: 08/22/2023] [Indexed: 08/27/2023]
Abstract
Eukaryotic proteins often feature modular domain structures comprising globular domains that are connected by linker regions and intrinsically disordered regions that may contain important functional motifs. The intramolecular interactions of globular domains and nonglobular regions can play critical roles in different aspects of protein function. However, studying these interactions and their regulatory roles can be challenging due to the flexibility of nonglobular regions, the long insertions separating interacting modules, and the transient nature of some interactions. Obtaining the experimental structures of multiple domains and functional regions is more difficult than determining the structures of individual globular domains. High-quality structural models generated by AlphaFold offer a unique opportunity to study intramolecular interactions in eukaryotic proteins. In this study, we systematically explored intramolecular interactions between human protein kinase domains (KDs) and potential regulatory regions, including globular domains, N- and C-terminal tails, long insertions, and distal nonglobular regions. Our analysis identified intramolecular interactions between human KDs and 35 different types of globular domains, exhibiting a variety of interaction modes that could contribute to orthosteric or allosteric regulation of kinase activity. We also identified prevalent interactions between human KDs and their flanking regions (N- and C-terminal tails). These interactions exhibit group-specific characteristics and can vary within each specific kinase group. Although long-range interactions between KDs and nonglobular regions are relatively rare, structural details of these interactions offer new insights into the regulation mechanisms of several kinases, such as HASPIN, MAPK7, MAPK15, and SIK1B.
Affiliation(s)
- Jimin Pei
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
|
32
|
Barrio-Hernandez I, Yeo J, Jänes J, Mirdita M, Gilchrist CLM, Wein T, Varadi M, Velankar S, Beltrao P, Steinegger M. Clustering predicted structures at the scale of the known protein universe. Nature 2023; 622:637-645. [PMID: 37704730 PMCID: PMC10584675 DOI: 10.1038/s41586-023-06510-w] [Citation(s) in RCA: 34] [Impact Index Per Article: 34.0] [Received: 03/23/2023] [Accepted: 08/02/2023] [Indexed: 09/15/2023]
Abstract
Proteins are key to all cellular processes and their structure is important in understanding their function and evolution. Sequence-based predictions of protein structures have increased in accuracy, and over 214 million predicted structures are available in the AlphaFold database. However, studying protein structures at this scale requires highly efficient methods. Here, we developed a structural-alignment-based clustering algorithm, Foldseek cluster, that can cluster hundreds of millions of structures. Using this method, we have clustered all of the structures in the AlphaFold database, identifying 2.30 million non-singleton structural clusters, of which 31% lack annotations, representing probable previously undescribed structures. Clusters without annotation tend to have few representatives, covering only 4% of all proteins in the AlphaFold database. Evolutionary analysis suggests that most clusters are ancient in origin but 4% seem to be species specific, representing lower-quality predictions or examples of de novo gene birth. We also show how structural comparisons can be used to predict domain families and their relationships, identifying examples of remote structural similarity. On the basis of these analyses, we identify several examples of human immune-related proteins with putative remote homology in prokaryotic species, illustrating the value of this resource for studying protein function and evolution across the tree of life.
Affiliation(s)
- Inigo Barrio-Hernandez
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
| | - Jingi Yeo
- School of Biological Sciences, Seoul National University, Seoul, South Korea
| | - Jürgen Jänes
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
| | - Milot Mirdita
- School of Biological Sciences, Seoul National University, Seoul, South Korea
| | | | - Tanita Wein
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
| | - Mihaly Varadi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
| | - Sameer Velankar
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK
| | - Pedro Beltrao
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland.
- Swiss Institute of Bioinformatics, Lausanne, Switzerland.
| | - Martin Steinegger
- School of Biological Sciences, Seoul National University, Seoul, South Korea.
- Artificial Intelligence Institute, Seoul National University, Seoul, South Korea.
- Institute of Molecular Biology and Genetics, Seoul National University, Seoul, South Korea.
|
33
|
Camargo AP, Roux S, Schulz F, Babinski M, Xu Y, Hu B, Chain PSG, Nayfach S, Kyrpides NC. Identification of mobile genetic elements with geNomad. Nat Biotechnol 2023:10.1038/s41587-023-01953-y. [PMID: 37735266 DOI: 10.1038/s41587-023-01953-y] [Citation(s) in RCA: 61] [Impact Index Per Article: 61.0] [Received: 03/06/2023] [Accepted: 08/17/2023] [Indexed: 09/23/2023]
Abstract
Identifying and characterizing mobile genetic elements in sequencing data is essential for understanding their diversity, ecology, biotechnological applications and impact on public health. Here we introduce geNomad, a classification and annotation framework that combines information from gene content and a deep neural network to identify sequences of plasmids and viruses. geNomad uses a dataset of more than 200,000 marker protein profiles to provide functional gene annotation and taxonomic assignment of viral genomes. Using a conditional random field model, geNomad also detects proviruses integrated into host genomes with high precision. In benchmarks, geNomad achieved high classification performance for diverse plasmids and viruses (Matthews correlation coefficient of 77.8% and 95.3%, respectively), substantially outperforming other tools. Leveraging geNomad's speed and scalability, we processed over 2.7 trillion base pairs of sequencing data, leading to the discovery of millions of viruses and plasmids that are available through the IMG/VR and IMG/PR databases. geNomad is available at https://portal.nersc.gov/genomad .
Affiliation(s)
- Antonio Pedro Camargo
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
| | - Simon Roux
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Frederik Schulz
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Michal Babinski
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA
| | - Yan Xu
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA
| | - Bin Hu
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA
| | - Patrick S G Chain
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA
| | - Stephen Nayfach
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Nikos C Kyrpides
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
|
34
|
Goldtzvik Y, Sen N, Lam SD, Orengo C. Protein diversification through post-translational modifications, alternative splicing, and gene duplication. Curr Opin Struct Biol 2023; 81:102640. [PMID: 37354790 DOI: 10.1016/j.sbi.2023.102640] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 02/13/2023] [Revised: 05/05/2023] [Accepted: 05/24/2023] [Indexed: 06/26/2023]
Abstract
Proteins provide the basis for cellular function. Having multiple versions of the same protein within a single organism provides a way of regulating its activity or developing novel functions. Post-translational modifications of proteins, by means of adding/removing chemical groups to amino acids, allow for a well-regulated and controlled way of generating functionally distinct protein species. Alternative splicing is another method with which organisms possibly generate new isoforms. Additionally, gene duplication events throughout evolution generate multiple paralogs of the same genes, resulting in multiple versions of the same protein within an organism. In this review, we discuss recent advancements in the study of these three methods of protein diversification and provide illustrative examples of how they affect protein structure and function.
Affiliation(s)
- Yonathan Goldtzvik
- Department of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Neeladri Sen
- Department of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Su Datt Lam
- Department of Structural and Molecular Biology, University College London, London, United Kingdom; Department of Applied Physics, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia
| | - Christine Orengo
- Department of Structural and Molecular Biology, University College London, London, United Kingdom.
|
35
|
Minami S, Kobayashi N, Sugiki T, Nagashima T, Fujiwara T, Tatsumi-Koga R, Chikenji G, Koga N. Exploration of novel αβ-protein folds through de novo design. Nat Struct Mol Biol 2023; 30:1132-1140. [PMID: 37400653 PMCID: PMC10442233 DOI: 10.1038/s41594-023-01029-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 11/29/2021] [Accepted: 05/30/2023] [Indexed: 07/05/2023]
Abstract
A fundamental question in protein evolution is whether nature has exhaustively sampled nearly all possible protein folds throughout evolution, or whether a large fraction of the possible folds remains unexplored. To address this question, we defined a set of rules for β-sheet topology to predict novel αβ-folds and carried out a systematic de novo protein design exploration of the novel αβ-folds predicted by the rules. The designs for all eight of the predicted novel αβ-folds with a four-stranded β-sheet, including a knot-forming one, folded into structures close to the design models. Further, the rules predicted more than 10,000 novel αβ-folds with five- to eight-stranded β-sheets; this number far exceeds the number of αβ-folds observed in nature so far. This result suggests that a vast number of αβ-folds are possible, but have not emerged or have become extinct due to evolutionary bias.
Affiliation(s)
- Shintaro Minami
- Protein Design Group, Exploratory Research Center on Life and Living Systems (ExCELLS), National Institutes of Natural Sciences (NINS), Okazaki, Japan
| | - Naohiro Kobayashi
- Institute for Protein Research (IPR), Osaka University, Osaka, Japan
- RIKEN Center for Biosystems Dynamics Research, RIKEN, Yokohama, Japan
| | - Toshihiko Sugiki
- Institute for Protein Research (IPR), Osaka University, Osaka, Japan
| | - Toshio Nagashima
- RIKEN Center for Biosystems Dynamics Research, RIKEN, Yokohama, Japan
| | | | - Rie Tatsumi-Koga
- Protein Design Group, Exploratory Research Center on Life and Living Systems (ExCELLS), National Institutes of Natural Sciences (NINS), Okazaki, Japan
| | - George Chikenji
- Department of Applied Physics, Graduate School of Engineering, Nagoya University, Nagoya, Japan
| | - Nobuyasu Koga
- Protein Design Group, Exploratory Research Center on Life and Living Systems (ExCELLS), National Institutes of Natural Sciences (NINS), Okazaki, Japan.
- SOKENDAI, The Graduate University for Advanced Studies, Hayama, Japan.
- Research Center of Integrative Molecular Systems, Institute for Molecular Science (IMS), National Institutes of Natural Sciences (NINS), Okazaki, Japan.
- Laboratory for Protein Design, Institute for Protein Research (IPR), Osaka University, Osaka, Japan.
|
36
|
Medvedev KE, Schaeffer RD, Chen KS, Grishin NV. Pan-cancer structurome reveals overrepresentation of beta sandwiches and underrepresentation of alpha helical domains. Sci Rep 2023; 13:11988. [PMID: 37491511 PMCID: PMC10368619 DOI: 10.1038/s41598-023-39273-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2023] [Accepted: 07/22/2023] [Indexed: 07/27/2023] Open
Abstract
The recent progress in the prediction of protein structures marked a historical milestone. AlphaFold predicted 200 million protein models with an accuracy comparable to experimental methods. Protein structures are widely used to understand evolution and to identify potential drug targets for the treatment of various diseases, including cancer. Thus, these recently predicted structures might convey previously unavailable information about cancer biology. Evolutionary classification of protein domains is challenging and different approaches exist. Recently our team presented a classification of domains from human protein models released by AlphaFold. Here we evaluated the pan-cancer structurome, domains from over- and under-expressed proteins in 21 cancer types, using the broadest levels of the ECOD classification: the architecture (A-groups) and possible homology (X-groups) levels. Our analysis reveals that AlphaFold has greatly increased the three-dimensional structural landscape for proteins that are differentially expressed in these 21 cancer types. We show that beta sandwich domains are significantly overrepresented and alpha helical domains are significantly underrepresented in the majority of cancer types. Our data suggest that the prevalence of the beta sandwiches is due to the high levels of immunoglobulins and immunoglobulin-like domains that arise during tumor development-related inflammation. On the other hand, proteins with exclusively alpha domains are important elements of homeostasis, apoptosis and transmembrane transport. Therefore, cancer cells tend to reduce representation of these proteins to promote successful oncogenesis.
Affiliation(s)
- Kirill E Medvedev
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| | - R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Kenneth S Chen
- Department of Pediatrics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Children's Medical Center Research Institute, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| |
|
37
|
Dhondge H, Chauvot de Beauchêne I, Devignes MD. CroMaSt: a workflow for assessing protein domain classification by cross-mapping of structural instances between domain databases and structural alignment. Bioinform Adv 2023; 3:vbad081. [PMID: 37431435 PMCID: PMC10329740 DOI: 10.1093/bioadv/vbad081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Revised: 06/16/2023] [Accepted: 06/26/2023] [Indexed: 07/12/2023]
Abstract
Motivation: Protein domains can be viewed as building blocks, essential for understanding structure-function relationships in proteins. However, each domain database classifies protein domains using its own methodology. Thus, in many cases, domain models and boundaries differ from one domain database to the other, raising the question of domain definition and enumeration of true domain instances. Results: We propose an automated iterative workflow to assess protein domain classification by cross-mapping domain structural instances between domain databases and by evaluating structural alignments. CroMaSt (for Cross-Mapper of domain Structural instances) classifies all experimental structural instances of a given domain type into four categories ('Core', 'True', 'Domain-like' and 'Failed'). CroMaSt is developed in Common Workflow Language and takes advantage of two well-known domain databases with wide coverage: Pfam and CATH. It uses the Kpax structural alignment tool with expert-adjusted parameters. CroMaSt was tested with the RNA Recognition Motif domain type and identifies 962 'True' and 541 'Domain-like' structural instances for this domain type. This method solves a crucial issue in domain-centric research and can generate essential information that could be used for synthetic biology and machine-learning approaches of protein domain engineering. Availability and implementation: The workflow and the Results archive for the CroMaSt runs presented in this article are available from WorkflowHub (doi: 10.48546/workflowhub.workflow.390.2). Supplementary information: Supplementary data are available at Bioinformatics Advances online.
|
38
|
Chakravarty D, Sreenivasan S, Swint-Kruse L, Porter LL. Identification of a covert evolutionary pathway between two protein folds. Nat Commun 2023; 14:3177. [PMID: 37264049 DOI: 10.1038/s41467-023-38519-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 12/07/2022] [Accepted: 05/03/2023] [Indexed: 06/03/2023] Open
Abstract
Although homologous protein sequences are expected to adopt similar structures, some amino acid substitutions can interconvert α-helices and β-sheets. Such fold switching may have occurred over evolutionary history, but supporting evidence has been limited by (1) the abundance and diversity of sequenced genes, (2) the quantity of experimentally determined protein structures, and (3) the assumptions underlying the statistical methods used to infer homology. Here, we overcome these barriers by applying multiple statistical methods to a family of ~600,000 bacterial response regulator proteins. We find that their homologous DNA-binding subunits assume divergent structures: helix-turn-helix versus α-helix + β-sheet (winged helix). Phylogenetic analyses, ancestral sequence reconstruction, and AlphaFold2 models indicate that amino acid substitutions facilitated a switch from helix-turn-helix into winged helix. This structural transformation likely expanded DNA-binding specificity. Our approach uncovers an evolutionary pathway between two protein folds and provides a methodology to identify secondary structure switching in other protein families.
Affiliation(s)
- Devlina Chakravarty
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
| | - Shwetha Sreenivasan
- Department of Biochemistry and Molecular Biology, The University of Kansas Medical Center, Kansas City, KS, 66160, USA
| | - Liskin Swint-Kruse
- Department of Biochemistry and Molecular Biology, The University of Kansas Medical Center, Kansas City, KS, 66160, USA
| | - Lauren L Porter
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
- Biochemistry and Biophysics Center, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD, 20892, USA.
| |
|
39
|
Liu S, Hur YH, Cai X, Cong Q, Yang Y, Xu C, Bilate AM, Gonzales KAU, Parigi SM, Cowley CJ, Hurwitz B, Luo JD, Tseng T, Gur-Cohen S, Sribour M, Omelchenko T, Levorse J, Pasolli HA, Thompson CB, Mucida D, Fuchs E. A tissue injury sensing and repair pathway distinct from host pathogen defense. Cell 2023; 186:2127-2143.e22. [PMID: 37098344 PMCID: PMC10321318 DOI: 10.1016/j.cell.2023.03.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2022] [Revised: 02/03/2023] [Accepted: 03/27/2023] [Indexed: 04/27/2023]
Abstract
Pathogen infection and tissue injury are universal insults that disrupt homeostasis. Innate immunity senses microbial infections and induces cytokines/chemokines to activate resistance mechanisms. Here, we show that, in contrast to most pathogen-induced cytokines, interleukin-24 (IL-24) is predominately induced by barrier epithelial progenitors after tissue injury and is independent of microbiome or adaptive immunity. Moreover, Il24 ablation in mice impedes not only epidermal proliferation and re-epithelialization but also capillary and fibroblast regeneration within the dermal wound bed. Conversely, ectopic IL-24 induction in the homeostatic epidermis triggers global epithelial-mesenchymal tissue repair responses. Mechanistically, Il24 expression depends upon both epithelial IL24-receptor/STAT3 signaling and hypoxia-stabilized HIF1α, which converge following injury to trigger autocrine and paracrine signaling involving IL-24-mediated receptor signaling and metabolic regulation. Thus, parallel to innate immune sensing of pathogens to resolve infections, epithelial stem cells sense injury signals to orchestrate IL-24-mediated tissue repair.
Affiliation(s)
- Siqi Liu
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - Yun Ha Hur
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - Xin Cai
- Cancer Biology and Genetics Program, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
| | - Qian Cong
- McDermott Center for Human Growth and Development, Department of Biophysics, and Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
| | - Yihao Yang
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - Chiwei Xu
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - Angelina M Bilate
- Laboratory of Mucosal Immunology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - Kevin Andrew Uy Gonzales
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - S Martina Parigi
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - Christopher J Cowley
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - Brian Hurwitz
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - Ji-Dung Luo
- Bioinformatics Resource Center, The Rockefeller University, New York, NY 10065, USA
| | - Tiffany Tseng
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - Shiri Gur-Cohen
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - Megan Sribour
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - Tatiana Omelchenko
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - John Levorse
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - Hilda Amalia Pasolli
- Electron Microscopy Resource Center, The Rockefeller University, New York, NY 10065, USA
| | - Craig B Thompson
- Cancer Biology and Genetics Program, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
| | - Daniel Mucida
- Laboratory of Mucosal Immunology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA
| | - Elaine Fuchs
- Robin Chemers Neustein Laboratory of Mammalian Development and Cell Biology, Howard Hughes Medical Institute, The Rockefeller University, New York, NY 10065, USA.
| |
|
40
|
Wang Y, Gallagher LA, Andrade PA, Liu A, Humphreys IR, Turkarslan S, Cutler KJ, Arrieta-Ortiz ML, Li Y, Radey MC, McLean JS, Cong Q, Baker D, Baliga NS, Peterson SB, Mougous JD. Genetic manipulation of candidate phyla radiation bacteria provides functional insights into microbial dark matter. bioRxiv 2023:2023.05.02.539146. [PMID: 37205512 PMCID: PMC10187176 DOI: 10.1101/2023.05.02.539146] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2023]
Abstract
The study of bacteria has yielded fundamental insights into cellular biology and physiology, biotechnological advances and many therapeutics. Yet due to a lack of suitable tools, the significant portion of bacterial diversity held within the candidate phyla radiation (CPR) remains inaccessible to such pursuits. Here we show that CPR bacteria belonging to the phylum Saccharibacteria exhibit natural competence. We exploit this property to develop methods for their genetic manipulation, including the insertion of heterologous sequences and the construction of targeted gene deletions. Imaging of fluorescent protein-labeled Saccharibacteria provides high spatiotemporal resolution of phenomena accompanying epibiotic growth, and a genome-wide transposon insertion sequencing screen reveals the contribution of enigmatic Saccharibacterial genes to growth on their Actinobacteria hosts. Finally, we leverage metagenomic data to provide cutting-edge protein structure-based bioinformatic resources that support the strain Southlakia epibionticum and its corresponding host, Actinomyces israelii, as a model system for unlocking the molecular underpinnings of the epibiotic lifestyle.
Affiliation(s)
- Yaxi Wang
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Larry A. Gallagher
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Pia A. Andrade
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Andi Liu
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Ian R. Humphreys
- Department of Biochemistry, University of Washington, Seattle, WA 98109, USA
- Institute for Protein Design, Seattle, WA 98109, USA
| | | | - Kevin J. Cutler
- Department of Physics, University of Washington, Seattle, WA 98195, USA
| | | | - Yaqiao Li
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
- Institute for Systems Biology, Seattle, WA 98109, USA
| | - Matthew C. Radey
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Jeffrey S. McLean
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
- Department of Periodontics, University of Washington, Seattle, WA 98195, USA
| | - Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - David Baker
- Department of Biochemistry, University of Washington, Seattle, WA 98109, USA
- Institute for Protein Design, Seattle, WA 98109, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
| | | | - S. Brook Peterson
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
| | - Joseph D. Mougous
- Department of Microbiology, University of Washington, Seattle, WA 98109, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
- Microbial Interactions and Microbiome Center, University of Washington, Seattle, WA 98109, USA
| |
|
41
|
Nawaz MS, Fournier-Viger P, He Y, Zhang Q. PSAC-PDB: Analysis and classification of protein structures. Comput Biol Med 2023; 158:106814. [PMID: 36989742 DOI: 10.1016/j.compbiomed.2023.106814] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/08/2022] [Revised: 03/09/2023] [Accepted: 03/20/2023] [Indexed: 03/29/2023]
Abstract
This paper presents a novel framework, called PSAC-PDB, for analyzing and classifying protein structures from the Protein Data Bank (PDB). PSAC-PDB first finds, analyzes, and identifies protein structures in PDB that are similar to a protein structure of interest using a protein structure comparison tool. Second, the amino acid (AA) sequences of identified protein structures (obtained from PDB), their aligned amino acids (AAA) and aligned secondary structure elements (ASSE) (obtained by structural alignment), and frequent AA (FAA) patterns (discovered by sequential pattern mining) are used for the reliable detection/classification of protein structures. Eleven classifiers are used and their performance is compared using six evaluation metrics. Results show that three classifiers perform well overall, and that FAA patterns can be used to efficiently classify protein structures in place of the whole AA sequences, AAA or ASSE. Furthermore, better classification results are obtained using AAA of protein structures rather than AA sequences. PSAC-PDB also performed better than state-of-the-art approaches for SARS-CoV-2 genome sequence classification.
|
42
|
Koehler Leman J, Szczerbiak P, Renfrew PD, Gligorijevic V, Berenberg D, Vatanen T, Taylor BC, Chandler C, Janssen S, Pataki A, Carriero N, Fisk I, Xavier RJ, Knight R, Bonneau R, Kosciolek T. Sequence-structure-function relationships in the microbial protein universe. Nat Commun 2023; 14:2351. [PMID: 37100781 PMCID: PMC10133388 DOI: 10.1038/s41467-023-37896-w] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 05/18/2022] [Accepted: 04/05/2023] [Indexed: 04/28/2023] Open
Abstract
For the past half-century, structural biologists relied on the notion that similar protein sequences give rise to similar structures and functions. While this assumption has driven research to explore certain parts of the protein universe, it disregards spaces that don't rely on this assumption. Here we explore areas of the protein universe where similar protein functions can be achieved by different sequences and different structures. We predict ~200,000 structures for diverse protein sequences from 1,003 representative genomes across the microbial tree of life and annotate them functionally on a per-residue basis. Structure prediction is accomplished using the World Community Grid, a large-scale citizen science initiative. The resulting database of structural models is complementary to the AlphaFold database with regard to domains of life as well as sequence diversity and sequence length. We identify 148 novel folds and describe examples where we map specific functions to structural motifs. We also show that the structural space is continuous and largely saturated, highlighting the need for a shift in focus across all branches of biology, from obtaining structures to putting them into context and from sequence-based to sequence-structure-function based meta-omics analyses.
Affiliation(s)
- Julia Koehler Leman
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA.
- Department of Biology, New York University, New York, NY, USA.
| | - Pawel Szczerbiak
- Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
| | - P Douglas Renfrew
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Department of Biology, New York University, New York, NY, USA
| | - Vladimir Gligorijevic
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
| | - Daniel Berenberg
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
- Center for Data Science, New York University, New York, NY, 10011, USA
- Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, NY, USA
| | - Tommi Vatanen
- Broad Institute, Cambridge, MA, USA
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Research Program for Clinical and Molecular Metabolism, Faculty of Medicine, 00014 University of Helsinki, Helsinki, Finland
| | - Bryn C Taylor
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- In Silico Discovery and External Innovation, Janssen Research and Development, San Diego, CA, 92122, USA
| | - Chris Chandler
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Stefan Janssen
- Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, 92093, USA
- Algorithmic Bioinformatics, Justus Liebig University Giessen, Giessen, Germany
| | - Andras Pataki
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Nick Carriero
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Ian Fisk
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Ramnik J Xavier
- Broad Institute, Cambridge, MA, USA
- Center for Microbiome Informatics and Therapeutics, MIT, Cambridge, MA, 02139, USA
| | - Rob Knight
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, 92093, USA
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
- Department of Bioengineering, University of California, San Diego, USA
| | - Richard Bonneau
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Department of Biology, New York University, New York, NY, USA
- Center for Data Science, New York University, New York, NY, 10011, USA
- Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, NY, USA
- Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
| | - Tomasz Kosciolek
- Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland.
| |
|
43
|
von Kügelgen A, van Dorst S, Yamashita K, Sexton DL, Tocheva EI, Murshudov G, Alva V, Bharat TAM. Interdigitated immunoglobulin arrays form the hyperstable surface layer of the extremophilic bacterium Deinococcus radiodurans. Proc Natl Acad Sci U S A 2023; 120:e2215808120. [PMID: 37043530 PMCID: PMC10120038 DOI: 10.1073/pnas.2215808120] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 09/15/2022] [Accepted: 03/14/2023] [Indexed: 04/13/2023] Open
Abstract
Deinococcus radiodurans is an atypical diderm bacterium with a remarkable ability to tolerate various environmental stresses, due in part to its complex cell envelope encapsulated within a hyperstable surface layer (S-layer). Despite decades of research on this cell envelope, atomic structural details of the S-layer have remained obscure. In this study, we report the electron cryomicroscopy structure of the D. radiodurans S-layer, showing how it is formed by the Hexagonally Packed Intermediate-layer (HPI) protein arranged in a planar hexagonal lattice. The HPI protein forms an array of immunoglobulin-like folds within the S-layer, with each monomer extending into the adjacent hexamer, resulting in a highly interconnected, stable, sheet-like arrangement. Using electron cryotomography and subtomogram averaging from focused ion beam-milled D. radiodurans cells, we have obtained a structure of the cellular S-layer, showing how this HPI S-layer coats native membranes on the surface of cells. Our S-layer structure from the diderm bacterium D. radiodurans shows similarities to immunoglobulin-like domain-containing S-layers from monoderm bacteria and archaea, highlighting common features in cell surface organization across different domains of life, with connotations on the evolution of immunoglobulin-based molecular recognition systems in eukaryotes.
Affiliation(s)
- Andriko von Kügelgen
- Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge CB2 0QH, United Kingdom
- Sir William Dunn School of Pathology, University of Oxford, Oxford OX1 3RE, United Kingdom
| | - Sofie van Dorst
- Sir William Dunn School of Pathology, University of Oxford, Oxford OX1 3RE, United Kingdom
| | - Keitaro Yamashita
- Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge CB2 0QH, United Kingdom
| | - Danielle L. Sexton
- Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
| | - Elitza I. Tocheva
- Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
| | - Garib Murshudov
- Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge CB2 0QH, United Kingdom
| | - Vikram Alva
- Department of Protein Evolution, Max Planck Institute for Biology Tübingen, Tübingen 72076, Germany
| | - Tanmay A. M. Bharat
- Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge CB2 0QH, United Kingdom
- Sir William Dunn School of Pathology, University of Oxford, Oxford OX1 3RE, United Kingdom
| |
|
44
|
S. G, E.R. V. Protein secondary structure prediction using Cascaded Feature Learning Model. Appl Soft Comput 2023. [DOI: 10.1016/j.asoc.2023.110242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/08/2023]
|
45
|
Varadi M, Bordin N, Orengo C, Velankar S. The opportunities and challenges posed by the new generation of deep learning-based protein structure predictors. Curr Opin Struct Biol 2023; 79:102543. [PMID: 36807079 DOI: 10.1016/j.sbi.2023.102543] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 10/11/2022] [Revised: 01/04/2023] [Accepted: 01/13/2023] [Indexed: 02/21/2023]
Abstract
The function of proteins can often be inferred from their three-dimensional structures. Experimental structural biologists spent decades studying these structures, but the accelerated pace of protein sequencing continuously increases the gaps between sequences and structures. The early 2020s saw the advent of a new generation of deep learning-based protein structure prediction tools that offer the potential to predict structures based on any number of protein sequences. In this review, we give an overview of the impact of this new generation of structure prediction tools, with examples of the impacted field in the life sciences. We discuss the novel opportunities and new scientific and technical challenges these tools present to the broader scientific community. Finally, we highlight some potential directions for the future of computational protein structure prediction.
Affiliation(s)
- Mihaly Varadi
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College, London, London, WC1E 6BT, UK. https://twitter.com/nicolabordin
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College, London, London, WC1E 6BT, UK
| | - Sameer Velankar
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| |
|
46
|
Bordin N, Dallago C, Heinzinger M, Kim S, Littmann M, Rauer C, Steinegger M, Rost B, Orengo C. Novel machine learning approaches revolutionize protein knowledge. Trends Biochem Sci 2023; 48:345-359. [PMID: 36504138 PMCID: PMC10570143 DOI: 10.1016/j.tibs.2022.11.001] [Citation(s) in RCA: 20] [Impact Index Per Article: 20.0] [Received: 07/14/2022] [Revised: 10/24/2022] [Accepted: 11/17/2022] [Indexed: 12/10/2022]
Abstract
Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.
Affiliation(s)
- Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, Gower St, WC1E 6BT London, UK
| | - Christian Dallago
- Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; VantAI, 151 W 42nd Street, New York, NY 10036, USA
| | - Michael Heinzinger
- Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
| | - Stephanie Kim
- School of Biological Sciences, Seoul National University, Seoul, South Korea; Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
| | - Maria Littmann
- Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
| | - Clemens Rauer
- Institute of Structural and Molecular Biology, University College London, Gower St, WC1E 6BT London, UK
| | - Martin Steinegger
- School of Biological Sciences, Seoul National University, Seoul, South Korea; Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
| | - Burkhard Rost
- Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany; TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, Gower St, WC1E 6BT London, UK.
| |
|
47
|
Brunori M, Miele AE. Modulation of Allosteric Control and Evolution of Hemoglobin. Biomolecules 2023; 13:biom13030572. [PMID: 36979507 PMCID: PMC10046315 DOI: 10.3390/biom13030572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2023] [Revised: 03/17/2023] [Accepted: 03/18/2023] [Indexed: 03/30/2023] Open
Abstract
Allostery arises when a ligand-induced change in the shape of a protein's binding site is coupled to a tertiary/quaternary conformational change, with a consequent modulation of functional properties. The two-state allosteric model of Monod, Wyman and Changeux [J. Mol. Biol. 1965; 12, 88-118] is an elegant and effective theory to account for protein regulation and control. Tetrameric hemoglobin (Hb), the oxygen transporter of all vertebrates, has for decades been the ideal system to test the validity of the MWC theory. The small ligands affecting Hb's behavior (organic phosphates, protons, bicarbonate) are produced by the red blood cell during metabolism. By binding to specific sites, these messengers enable Hb to sense its environment and react accordingly. HbI and HbIV from trout and human HbA are classical cooperative models, being similar yet different. They share many fundamental features, starting with the globin fold and the quaternary assembly, and reversible cooperative O2 binding. Nevertheless, they differ in ligand affinity, binding of allosteric effectors, and stability of the quaternary assembly. Here, we recall essential functional properties and correlate them with the tertiary and quaternary structures available in the Protein Data Bank to infer the molecular basis of the evolution of oxygen transporters.
Affiliation(s)
- Maurizio Brunori
- Accademia Nazionale dei Lincei, via della Lungara, 00165 Rome, Italy
- Department of Biochemical Sciences, Sapienza University of Rome, P.le Aldo Moro 5, 00185 Rome, Italy
| | - Adriana Erica Miele
- Department of Biochemical Sciences, Sapienza University of Rome, P.le Aldo Moro 5, 00185 Rome, Italy
- Institute of Analytical Sciences, UMR 5280 ISA CNRS UCBL, Université Claude Bernard Lyon 1, 5 Rue de la Doua, 69100 Villeurbanne, France
| |
|
48
|
Schaeffer RD, Zhang J, Kinch LN, Pei J, Cong Q, Grishin NV. Classification of domains in predicted structures of the human proteome. Proc Natl Acad Sci U S A 2023; 120:e2214069120. [PMID: 36917664 PMCID: PMC10041065 DOI: 10.1073/pnas.2214069120] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 08/16/2022] [Accepted: 02/06/2023] [Indexed: 03/16/2023] Open
Abstract
Recent advances in protein structure prediction have generated accurate structures of previously uncharacterized human proteins. Identifying domains in these predicted structures and classifying them into an evolutionary hierarchy can reveal biological insights. Here, we describe the detection and classification of domains from the human proteome. Our classification indicates that only 62% of residues are located in globular domains. We further classify these globular domains and observe that the majority (65%) can be classified among known folds by sequence, with a smaller fraction (33%) requiring structural data to refine the domain boundaries and/or to support their homology. A relatively small number (966 domains) cannot be confidently assigned using our automatic pipelines, thus demanding manual inspection. We classify 47,576 domains, of which only 23% have been included in experimental structures. A portion (6.3%) of these classified globular domains lack sequence-based annotation in InterPro. A quarter (23%) have not been structurally modeled by homology, and they contain 2,540 known disease-causing single amino acid variations whose pathogenesis can now be inferred using AF models. A comparison of classified domains from a series of model organisms revealed expansions of several immune response-related domains in humans and a depletion of olfactory receptors. Finally, we use this classification to expand well-known protein families of biological significance. These classifications are presented on the ECOD website (http://prodata.swmed.edu/ecod/index_human.php).
Affiliation(s)
- R. Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
| | - Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390
| | - Lisa N. Kinch
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, TX 75390
- HHMI, University of Texas Southwestern Medical Center, Dallas, TX 75390
| | - Jimin Pei
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390
| | - Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390
| | - Nick V. Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX 75390
| |
|
49
|
Zhao K, Xia Y, Zhang F, Zhou X, Li SZ, Zhang G. Protein structure and folding pathway prediction based on remote homologs recognition using PAthreader. Commun Biol 2023; 6:243. [PMID: 36871126 PMCID: PMC9985440 DOI: 10.1038/s42003-023-04605-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 11/09/2022] [Accepted: 02/16/2023] [Indexed: 03/06/2023] Open
Abstract
Recognition of remote homologous structures is a necessary module in AlphaFold2 and is also essential for the exploration of protein folding pathways. Here, we propose a method, PAthreader, to recognize remote templates and explore folding pathways. Firstly, we design a three-track alignment between predicted distance profiles and structure profiles extracted from PDB and AlphaFold DB, to improve the recognition accuracy of remote templates. Secondly, we improve the performance of AlphaFold2 using the templates identified by PAthreader. Thirdly, we explore protein folding pathways based on our conjecture that the dynamic folding information of a protein is implicitly contained in its remote homologs. The results show that the average accuracy of PAthreader templates is 11.6% higher than that of HHsearch. In terms of structure modelling, PAthreader outperforms AlphaFold2 and ranks first on the CAMEO blind test over the latest three months. Furthermore, we predict protein folding pathways for 37 proteins, in which the results of 7 proteins are almost consistent with those of biological experiments, and the other 30 human proteins have yet to be verified by biological experiments, revealing that folding information can be exploited from remote homologous structures.
Affiliation(s)
- Kailong Zhao
- College of Information Engineering, Zhejiang University of Technology, HangZhou, 310023, China
| | - Yuhao Xia
- College of Information Engineering, Zhejiang University of Technology, HangZhou, 310023, China
| | - Fujin Zhang
- College of Information Engineering, Zhejiang University of Technology, HangZhou, 310023, China
| | - Xiaogen Zhou
- College of Information Engineering, Zhejiang University of Technology, HangZhou, 310023, China
| | - Stan Z Li
- AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, 310030, Zhejiang, China.
| | - Guijun Zhang
- College of Information Engineering, Zhejiang University of Technology, HangZhou, 310023, China.
| |
|
50
|
van der Weg KJ, Gohlke H. TopEnzyme: a framework and database for structural coverage of the functional enzyme space. Bioinformatics 2023; 39:7072462. [PMID: 36883717 PMCID: PMC10023222 DOI: 10.1093/bioinformatics/btad116] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/04/2022] [Revised: 02/03/2023] [Accepted: 02/09/2023] [Indexed: 03/09/2023] Open
Abstract
MOTIVATION: TopEnzyme is a database of structural enzyme models created with TopModel and is linked to the SWISS-MODEL repository and AlphaFold Protein Structure Database to provide an overview of structural coverage of the functional enzyme space for over 200 000 enzyme models. It allows the user to quickly obtain representative structural models for 60% of all known enzyme functions. RESULTS: We assessed the models with TopScore and contributed 9039 good-quality and 1297 high-quality structures. Furthermore, we compared these models to AlphaFold2 models with TopScore and found that the TopScore differs only by 0.04 on average in favor of AlphaFold2. We tested TopModel and AlphaFold2 for targets not seen in the respective training databases and found that both methods create qualitatively similar structures. When no experimental structures are available, this database will facilitate quick access to structural models across the currently most extensive structural coverage of the functional enzyme space within Swiss-Prot. AVAILABILITY AND IMPLEMENTATION: We provide a full web interface to the database at https://cpclab.uni-duesseldorf.de/topenzyme/.
Affiliation(s)
- Karel J van der Weg
- John von Neumann Institute for Computing (NIC), Jülich Supercomputing Centre (JSC), and Institute of Bio- and Geosciences (IBG-4: Bioinformatics), Forschungszentrum Jülich GmbH, Jülich 52425, Germany
| | - Holger Gohlke
- Corresponding author. John von Neumann Institute for Computing (NIC), Jülich Supercomputing Centre (JSC), and Institute of Bio- and Geosciences (IBG-4: Bioinformatics), Forschungszentrum Jülich GmbH, Jülich 52425, Germany. E-mail:
| |
|