1
Li W, Almirantis Y, Provata A. Range-limited Heaps' law for functional DNA words in the human genome. J Theor Biol 2024; 592:111878. [PMID: 38901778 DOI: 10.1016/j.jtbi.2024.111878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 05/31/2024] [Accepted: 06/10/2024] [Indexed: 06/22/2024]
Abstract
Heaps' law, also called the Herdan-Heaps law, is a linguistic law stating that vocabulary (dictionary) size, the number of types, grows as a power-law function of word count, the number of tokens. Whether it holds in genomes under a given definition of DNA words is unclear, partly because the dictionary size in a genome could be much smaller than that in a human language. We define a DNA word as a coding region in a genome that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps' law in the human genome within a limited range. Our definition of words in a genomic or proteomic context differs from other definitions, such as over-represented k-mers, which are much shorter. Although an approximate power-law distribution of protein domain sizes due to gene duplication and the related Zipf's law is well known, their translation to Heaps' law for DNA words is not automatic. Several other animal genomes are also shown herein to exhibit range-limited Heaps' law under our definition of DNA words, though with various exponents. When tokens were randomly sampled and sample sizes reached the maximum level, a deviation from Heaps' law was observed, but a quadratic regression in the log-log type-token plot fits the data perfectly. Investigation of the type-token plot and its regression coefficients could provide an alternative narrative of reuse and redundancy of protein domains, as well as the creation of new protein domains, from a linguistic perspective.
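The type-token analysis described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the tokens are synthetic draws from a Zipf-like distribution (a hypothetical stand-in for protein-domain annotations), the curve is cumulative over one sample rather than built from chromosomes, and the function names `type_token_curve` and `fit_loglog` are invented for this example. A degree-1 fit in log-log space gives a Heaps-type exponent, and a degree-2 fit gives the quadratic regression the abstract mentions.

```python
import numpy as np

def type_token_curve(tokens):
    """Return (n, v): token count n and distinct-type count v after each token."""
    seen = set()
    types = []
    for tok in tokens:
        seen.add(int(tok))
        types.append(len(seen))
    n = np.arange(1, len(types) + 1)
    return n, np.array(types)

def fit_loglog(n, v, degree):
    """Least-squares polynomial fit of log10(v) against log10(n).

    degree=1 returns (slope, intercept); the slope is the Heaps-type exponent.
    degree=2 returns the quadratic, linear, and constant coefficients.
    """
    return np.polyfit(np.log10(n), np.log10(v), degree)

# Synthetic data (assumption): 20000 tokens over 5000 possible types,
# with type probabilities proportional to 1/rank.
rng = np.random.default_rng(0)
ranks = np.arange(1, 5001)
probs = 1.0 / ranks
probs /= probs.sum()
tokens = rng.choice(ranks, size=20000, p=probs)

n, v = type_token_curve(tokens)
slope, intercept = fit_loglog(n, v, 1)   # power-law (Heaps) fit
quad = fit_loglog(n, v, 2)               # quadratic fit in log-log space

print(f"Heaps exponent estimate: {slope:.3f}")
print(f"Quadratic coefficient:   {quad[0]:.3f}")
```

With a finite vocabulary the curve bends downward as sampling approaches saturation, so the quadratic coefficient comes out negative in this synthetic setting. Whether the same holds for real domain data is a question for the paper itself.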
Affiliation(s)
- Wentian Li
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA; The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA.
- Yannis Almirantis
- Theoretical Biology and Computational Genomics Laboratory, Institute of Bioscience and Applications, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
- Astero Provata
- Statistical Mechanics and Dynamical Systems Laboratory, Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
2
Ahdritz G, Bouatta N, Floristean C, Kadyan S, Xia Q, Gerecke W, O'Donnell TJ, Berenberg D, Fisk I, Zanichelli N, Zhang B, Nowaczynski A, Wang B, Stepniewska-Dziubinska MM, Zhang S, Ojewole A, Guney ME, Biderman S, Watkins AM, Ra S, Lorenzo PR, Nivon L, Weitzner B, Ban YEA, Chen S, Zhang M, Li C, Song SL, He Y, Sorger PK, Mostaque E, Zhang Z, Bonneau R, AlQuraishi M. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat Methods 2024; 21:1514-1524. [PMID: 38744917 DOI: 10.1038/s41592-024-02272-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 04/03/2024] [Indexed: 05/16/2024]
Abstract
AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein-ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set is deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community.
Affiliation(s)
- Gustaf Ahdritz
- Department of Systems Biology, Columbia University, New York, NY, USA
- Harvard University, Cambridge, MA, USA
- Nazim Bouatta
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA.
- Sachin Kadyan
- Department of Systems Biology, Columbia University, New York, NY, USA
- Qinghui Xia
- Department of Systems Biology, Columbia University, New York, NY, USA
- William Gerecke
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA
- Daniel Berenberg
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
- Ian Fisk
- Flatiron Institute, New York, NY, USA
- Bo Zhang
- Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT, USA
- Stella Biderman
- EleutherAI, New York, NY, USA
- Booz Allen Hamilton, McLean, VA, USA
- Stephen Ra
- Prescient Design, Genentech, New York, NY, USA
- Minjia Zhang
- University of Illinois at Urbana-Champaign, Champaign, IL, USA
- Peter K Sorger
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA
- Zhao Zhang
- Rutgers University, New Brunswick, NJ, USA
3
Medvedev KE, Schaeffer RD, Grishin NV. DrugDomain: The evolutionary context of drugs and small molecules bound to domains. Protein Sci 2024; 33:e5116. [PMID: 38979784 PMCID: PMC11231930 DOI: 10.1002/pro.5116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2024] [Revised: 06/27/2024] [Accepted: 06/29/2024] [Indexed: 07/10/2024]
Abstract
Interactions between proteins and small organic compounds play a crucial role in regulating protein functions. These interactions can modulate various aspects of protein behavior, including enzymatic activity, signaling cascades, and structural stability. By binding to specific sites on proteins, small organic compounds can induce conformational changes, alter protein-protein interactions, or directly affect catalytic activity. Therefore, many drugs available on the market today are small molecules (72% of all approved drugs in the last 5 years). Proteins are composed of one or more domains: evolutionary units that convey function or fitness either singly or in concert with others. Understanding which domain(s) of the target protein bind to a drug can lead to additional opportunities for discovering novel targets. The Evolutionary Classification of Protein Domains (ECOD) classifies domains into an evolutionary hierarchy that focuses on distant homology. Previously, no structure-based protein domain classification existed that included information about both the interaction between small molecules or drugs and the structural domains of a target protein. These data are especially important for multidomain proteins and large complexes. Here, we present the DrugDomain database, which reports the interactions between ECOD domains of human target proteins and DrugBank molecules and drugs. The pilot version of DrugDomain describes the interactions of 5160 DrugBank molecules associated with 2573 human proteins. It describes domains for all experimentally determined structures of these proteins and incorporates AlphaFold models when such structures are unavailable. The DrugDomain database is available online: http://prodata.swmed.edu/DrugDomain/.
Affiliation(s)
- Kirill E. Medvedev
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- R. Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Nick V. Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
4
Umuhire Juru A, Ghirlando R, Zhang J. Structural basis of tRNA recognition by the widespread OB fold. Nat Commun 2024; 15:6385. [PMID: 39075051 PMCID: PMC11286949 DOI: 10.1038/s41467-024-50730-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Accepted: 07/18/2024] [Indexed: 07/31/2024] Open
Abstract
The widespread oligonucleotide/oligosaccharide-binding (OB)-fold recognizes diverse substrates from sugars to nucleic acids and proteins, and plays key roles in genome maintenance, transcription, translation, and tRNA metabolism. OB-containing bacterial Trbp and yeast Arc1p proteins are thought to recognize the tRNA elbow or anticodon regions. Here we report a 2.6 Å co-crystal structure of Aquifex aeolicus Trbp111 bound to tRNA(Ile), which reveals that Trbp recognizes tRNAs solely by capturing their 3' ends. Structural, mutational, and biophysical analyses show that the Trbp/EMAPII-like OB fold precisely recognizes the single-stranded structure, 3' terminal location, and specific sequence of the 3' CA dinucleotide, a universal feature of mature tRNAs. Arc1p supplements its OB-tRNA 3' end interaction with additional contacts that involve an adjacent basic region and the tRNA body. This study uncovers a previously unrecognized mode of tRNA recognition by an ancient protein fold, and provides insights into protein-mediated tRNA aminoacylation, folding, localization, trafficking, and piracy.
Affiliation(s)
- Aline Umuhire Juru
- Laboratory of Molecular Biology, National Institute of Diabetes and Digestive and Kidney Diseases, Bethesda, MD, USA
- Rodolfo Ghirlando
- Laboratory of Molecular Biology, National Institute of Diabetes and Digestive and Kidney Diseases, Bethesda, MD, USA
- Jinwei Zhang
- Laboratory of Molecular Biology, National Institute of Diabetes and Digestive and Kidney Diseases, Bethesda, MD, USA.
5
Schiffrin B, Calabrese AN. Chaperones in concert: Orchestrating co-translational protein folding in the cell. Mol Cell 2024; 84:2403-2404. [PMID: 38996455 DOI: 10.1016/j.molcel.2024.06.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2024] [Revised: 06/18/2024] [Accepted: 06/19/2024] [Indexed: 07/14/2024]
Abstract
In this issue of Molecular Cell, Roeselová et al. [1] provide insights into co-translational folding of a multidomain protein in bacteria, revealing how the chaperones Trigger Factor, DnaJ, and DnaK work together to facilitate the folding of nascent chains.
Affiliation(s)
- Bob Schiffrin
- Astbury Centre for Structural Molecular Biology, School of Molecular and Cellular Biology, Faculty of Biological Sciences, University of Leeds, Leeds LS2 9JT, UK.
- Antonio N Calabrese
- Astbury Centre for Structural Molecular Biology, School of Molecular and Cellular Biology, Faculty of Biological Sciences, University of Leeds, Leeds LS2 9JT, UK.
6
Wang X, Zhang Y, Li Z, Duan Z, Guo M, Wang Z, Zhu F, Xue W. PROSCA: an online platform for humanized scaffold mining facilitating rational protein engineering. Nucleic Acids Res 2024; 52:W272-W279. [PMID: 38738624 PMCID: PMC11223824 DOI: 10.1093/nar/gkae384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 04/23/2024] [Accepted: 04/29/2024] [Indexed: 05/14/2024] Open
Abstract
Protein scaffolds with small size, high stability and low immunogenicity have important applications in protein engineering and design. However, no computational platform has yet been reported for mining scaffolds with the desired properties from the massive set of protein structures in the human body. Here, we developed PROSCA, a structure-based online platform dedicated to exploring the space of the entire human proteome and to discovering new privileged protein scaffolds with potential engineering value that have not previously been noticed. PROSCA accepts a protein structure as input, which can subsequently be aligned with a chosen class of protein structures (e.g. the human proteome, from either experimentally resolved or AlphaFold2-predicted structures, or the human proteins belonging to specific families or domains), and outputs humanized protein scaffolds that are structurally similar to the input protein, along with other important related information such as families, sequences, structures and expression levels in human tissues. Through PROSCA, the user can also visualize protein structures and expression overviews, and download figures and tables of results, which can be customized according to the user's needs. Along with advanced protein engineering and selection technologies, PROSCA will facilitate the rational design of new functional proteins with privileged scaffolds. PROSCA is freely available at https://idrblab.org/prosca/.
Affiliation(s)
- Xiaona Wang
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Department of Intensive Care Medicine, Army Medical Center of PLA, Chongqing 401331, China
- Yintao Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
- Zengpeng Li
- State Key Laboratory Breeding Base of Marine Genetic Resources, Third Institute of Oceanography Ministry of Natural Resources, Xiamen 361005, China
- Zixin Duan
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Menghan Guo
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Zhen Wang
- Department of Intensive Care Medicine, Army Medical Center of PLA, Chongqing 401331, China
- Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
- Weiwei Xue
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
7
Goverde CA, Pacesa M, Goldbach N, Dornfeld LJ, Balbi PEM, Georgeon S, Rosset S, Kapoor S, Choudhury J, Dauparas J, Schellhaas C, Kozlov S, Baker D, Ovchinnikov S, Vecchio AJ, Correia BE. Computational design of soluble and functional membrane protein analogues. Nature 2024; 631:449-458. [PMID: 38898281 PMCID: PMC11236705 DOI: 10.1038/s41586-024-07601-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Accepted: 05/23/2024] [Indexed: 06/21/2024]
Abstract
De novo design of complex protein folds using solely computational means remains a substantial challenge [1]. Here we use a robust deep learning pipeline to design complex folds and soluble analogues of integral membrane proteins. Unique membrane topologies, such as those from G-protein-coupled receptors [2], are not found in the soluble proteome, and we demonstrate that their structural features can be recapitulated in solution. Biophysical analyses demonstrate the high thermal stability of the designs, and experimental structures show remarkable design accuracy. The soluble analogues were functionalized with native structural motifs, as a proof of concept for bringing membrane protein functions to the soluble proteome, potentially enabling new approaches in drug discovery. In summary, we have designed complex protein topologies and enriched them with functionalities from membrane proteins, with high experimental success rates, leading to a de facto expansion of the functional soluble fold space.
Affiliation(s)
- Casper A Goverde
- Laboratory of Protein Design and Immunoengineering, École Polytechnique Fédérale de Lausanne and Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Martin Pacesa
- Laboratory of Protein Design and Immunoengineering, École Polytechnique Fédérale de Lausanne and Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Nicolas Goldbach
- Laboratory of Protein Design and Immunoengineering, École Polytechnique Fédérale de Lausanne and Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Lars J Dornfeld
- Laboratory of Protein Design and Immunoengineering, École Polytechnique Fédérale de Lausanne and Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Petra E M Balbi
- Laboratory of Protein Design and Immunoengineering, École Polytechnique Fédérale de Lausanne and Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Sandrine Georgeon
- Laboratory of Protein Design and Immunoengineering, École Polytechnique Fédérale de Lausanne and Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Stéphane Rosset
- Laboratory of Protein Design and Immunoengineering, École Polytechnique Fédérale de Lausanne and Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Srajan Kapoor
- Department of Structural Biology, University at Buffalo, Buffalo, NY, USA
- Jagrity Choudhury
- Department of Structural Biology, University at Buffalo, Buffalo, NY, USA
- Justas Dauparas
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Christian Schellhaas
- Laboratory of Protein Design and Immunoengineering, École Polytechnique Fédérale de Lausanne and Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Simon Kozlov
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
- David Baker
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Sergey Ovchinnikov
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Alex J Vecchio
- Department of Structural Biology, University at Buffalo, Buffalo, NY, USA
- Bruno E Correia
- Laboratory of Protein Design and Immunoengineering, École Polytechnique Fédérale de Lausanne and Swiss Institute of Bioinformatics, Lausanne, Switzerland.
8
Grossman AS, Gell DA, Wu DG, Carper DL, Hettich RL, Goodrich-Blair H. Bacterial hemophilin homologs and their specific type eleven secretor proteins have conserved roles in heme capture and are diversifying as a family. J Bacteriol 2024; 206:e0044423. [PMID: 38506530 DOI: 10.1128/jb.00444-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2024] [Accepted: 02/18/2024] [Indexed: 03/21/2024] Open
Abstract
Cellular life relies on enzymes that require metals, which must be acquired from extracellular sources. Bacteria utilize surface and secreted proteins to acquire such valuable nutrients from their environment. These include the cargo proteins of the type eleven secretion system (T11SS), which have been connected to host specificity, metal homeostasis, and nutritional immunity evasion. This Sec-dependent, Gram-negative secretion system is encoded by organisms throughout the phylum Proteobacteria, including human pathogens Neisseria meningitidis, Proteus mirabilis, Acinetobacter baumannii, and Haemophilus influenzae. Experimentally verified T11SS-dependent cargo include transferrin-binding protein B (TbpB), the hemophilin homologs heme receptor protein C (HrpC), hemophilin A (HphA), the immune evasion protein factor-H binding protein (fHbp), and the host symbiosis factor nematode intestinal localization protein C (NilC). Here, we examined the specificity of T11SS systems for their cognate cargo proteins using taxonomically distributed homolog pairs of T11SS and hemophilin cargo and explored the ligand binding ability of those hemophilin cargo homologs. In vivo expression in Escherichia coli of hemophilin homologs revealed that each is secreted in a specific manner by its cognate T11SS protein. Sequence analysis and structural modeling suggest that all hemophilin homologs share an N-terminal ligand-binding domain with the same topology as the ligand-binding domains of the Haemophilus haemolyticus heme binding protein (Hpl) and HphA. We term this signature feature of this group of proteins the hemophilin ligand-binding domain. 
Network analysis of hemophilin homologs revealed five subclusters and representatives from four of these showed variable heme-binding activities, which, combined with sequence-structure variation, suggests that hemophilins are diversifying in function.
IMPORTANCE: The secreted protein hemophilin and its homologs contribute to the survival of several bacterial symbionts within their respective host environments. Here, we compared taxonomically diverse hemophilin homologs and their paired Type 11 secretion systems (T11SS) to determine if heme binding and T11SS secretion are conserved characteristics of this family. We establish the existence of divergent hemophilin sub-families and describe structural features that contribute to distinct ligand-binding behaviors. Furthermore, we demonstrate that T11SS are specific for their cognate hemophilin family cargo proteins. Our work establishes that hemophilin homolog-T11SS pairs are diverging from each other, potentially evolving into novel ligand acquisition systems that provide competitive benefits in host niches.
Affiliation(s)
- Alex S Grossman
- Department of Microbiology, University of Tennessee Knoxville, Knoxville, Tennessee, USA
- David A Gell
- School of Medicine, University of Tasmania, Hobart, Tasmania, Australia
- Derek G Wu
- Department of Microbiology, University of Tennessee Knoxville, Knoxville, Tennessee, USA
- Department of Plant and Soil Sciences, University of Delaware, Newark, Delaware, USA
- Dana L Carper
- Bioscience Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
- Robert L Hettich
- Bioscience Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
- Heidi Goodrich-Blair
- Department of Microbiology, University of Tennessee Knoxville, Knoxville, Tennessee, USA
9
Hamamsy T, Morton JT, Blackwell R, Berenberg D, Carriero N, Gligorijevic V, Strauss CEM, Leman JK, Cho K, Bonneau R. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol 2024; 42:975-985. [PMID: 37679542 PMCID: PMC11180608 DOI: 10.1038/s41587-023-01917-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 09/07/2022] [Accepted: 07/26/2023] [Indexed: 09/09/2023]
Abstract
Exploiting sequence-structure-function relationships in biotechnology requires improved methods for aligning proteins that have low sequence similarity to previously annotated proteins. We develop two deep learning methods to address this gap, TM-Vec and DeepBLAST. TM-Vec allows searching for structure-structure similarities in large sequence databases. It is trained to accurately predict TM-scores as a metric of structural similarity directly from sequence pairs without the need for intermediate computation or solution of structures. Once structurally similar proteins have been identified, DeepBLAST can structurally align proteins using only sequence information by identifying structurally homologous regions between proteins. It outperforms traditional sequence alignment methods and performs similarly to structure-based alignment methods. We show the merits of TM-Vec and DeepBLAST on a variety of datasets, including better identification of remotely homologous proteins compared with state-of-the-art sequence alignment and structure prediction methods.
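For readers unfamiliar with the TM-score that TM-Vec is trained to predict, the sketch below evaluates the standard TM-score formula for one fixed superposition: the sum of 1 / (1 + (d_i / d0)^2) over aligned residue pairs, divided by the target length, with d0 = 1.24 (L - 15)^(1/3) - 1.8. It is not TM-Vec and not a full TM-score program: the true score maximizes this quantity over superpositions, and the function names and the 0.5 Å floor on d0 are assumptions made for this illustration.

```python
import numpy as np

def tm_d0(l_target):
    """Length-dependent distance scale d0 (in angstroms) for a target of l_target residues.

    The 0.5 floor for short chains is an assumption of this sketch.
    """
    if l_target <= 15:
        return 0.5
    return max(1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8, 0.5)

def tm_score_fixed(distances, l_target):
    """TM-score term for ONE given superposition.

    distances: distances (angstroms) between aligned residue pairs.
    l_target:  length of the target protein used for normalization.
    The actual TM-score is the maximum of this value over superpositions.
    """
    d = np.asarray(distances, dtype=float)
    d0 = tm_d0(l_target)
    return float(np.sum(1.0 / (1.0 + (d / d0) ** 2)) / l_target)

# Hypothetical inputs: a perfect match, and a case where only half
# of a 100-residue target is aligned (with zero deviation).
perfect = tm_score_fixed(np.zeros(100), 100)
half_aligned = tm_score_fixed(np.zeros(50), 100)
print(perfect, half_aligned)
```

Because the normalization uses the target length rather than the number of aligned pairs, unaligned residues lower the score, which is what makes the measure comparable across proteins of different sizes.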
Grants
- R35GM122515 National Science Foundation (NSF)
- IOS-1546218 National Science Foundation (NSF)
- R35 GM122515 NIGMS NIH HHS
- R01 DK103358 NIDDK NIH HHS
- CBET-1728858 National Science Foundation (NSF)
- R01 AI130945 NIAID NIH HHS
- This research was supported by NIH R01DK103358, the Simons Foundation, NSF-IOS-1546218, R35GM122515, NSF CBET-1728858, NIH R01AI130945, to T.H. This research was supported by the intramural research program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) to J.T.M. This research was supported by the Flatiron Institute as part of the Simons Foundation to Robert Blackwell, J.K.L., and N.C. This research was supported by Los Alamos National Lab to C.S. This research was supported by the Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI), Samsung Research (Improving Deep Learning using Latent Structure), and NSF Award 1922658 to K.C.
- Simons Foundation
- U.S. Department of Health & Human Services | NIH | Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD)
Affiliation(s)
- Tymor Hamamsy
- Center for Data Science, New York University, New York, NY, USA
- James T Morton
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
- Robert Blackwell
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Daniel Berenberg
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
- Prescient Design, New York, NY, USA
- Nicholas Carriero
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Julia Koehler Leman
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Kyunghyun Cho
- Center for Data Science, New York University, New York, NY, USA.
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA.
- Prescient Design, New York, NY, USA.
- CIFAR, Toronto, Ontario, Canada.
- Richard Bonneau
- Center for Data Science, New York University, New York, NY, USA.
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA.
- Prescient Design, New York, NY, USA.
- Department of Biology, New York University, New York, NY, USA.
10
Toledo-Patiño S, Goetz SK, Shanmugaratnam S, Höcker B, Farías-Rico JA. Molecular handcraft of a well-folded protein chimera. FEBS Lett 2024; 598:1375-1386. [PMID: 38508768 DOI: 10.1002/1873-3468.14856] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Revised: 02/11/2024] [Accepted: 02/12/2024] [Indexed: 03/22/2024]
Abstract
Modular assembly is a compelling pathway to create new proteins, a concept supported by protein engineering and millennia of evolution. Natural evolution provided a repository of building blocks, known as domains, which trace back to even shorter segments that underwent numerous 'copy-paste' processes culminating in the scaffolds we see today. Utilizing the subdomain-database Fuzzle, we constructed a fold-chimera by integrating a flavodoxin-like fragment into a periplasmic binding protein. This chimera is well-folded and a crystal structure reveals stable interfaces between the fragments. These findings demonstrate the adaptability of α/β-proteins and offer a stepping stone for optimization. By emphasizing the practicality of fragment databases, our work pioneers new pathways in protein engineering. Ultimately, the results substantiate the conjecture that periplasmic binding proteins originated from a flavodoxin-like ancestor.
Affiliation(s)
- Saacnicteh Toledo-Patiño
- Max Planck Institute for Developmental Biology, Tübingen, Germany
- Okinawa Institute of Science and Technology Graduate University, Japan
- Sooruban Shanmugaratnam
- Max Planck Institute for Developmental Biology, Tübingen, Germany
- Department of Biochemistry, University of Bayreuth, Germany
- Birte Höcker
- Max Planck Institute for Developmental Biology, Tübingen, Germany
- Department of Biochemistry, University of Bayreuth, Germany
- José Arcadio Farías-Rico
- Max Planck Institute for Developmental Biology, Tübingen, Germany
- Synthetic Biology Program, Center for Genome Sciences, National Autonomous University of Mexico, Cuernavaca, Mexico
11
Xia Y, Pan X, Shen HB. A comprehensive survey on protein-ligand binding site prediction. Curr Opin Struct Biol 2024; 86:102793. [PMID: 38447285 DOI: 10.1016/j.sbi.2024.102793] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Revised: 02/18/2024] [Accepted: 02/18/2024] [Indexed: 03/08/2024]
Abstract
Protein-ligand binding site prediction is critical for protein function annotation and drug discovery. Biological experiments are time-consuming and require significant equipment, materials, and labor resources. Developing accurate and efficient computational methods for protein-ligand interaction prediction is therefore essential. Here, we summarize the key challenges associated with ligand binding site (LBS) prediction and introduce recently published methods in terms of their input features, computational algorithms, and ligand types. Furthermore, we investigate the specificity of allosteric site identification as a particular LBS type. Finally, we discuss the prospective directions for machine learning-based LBS prediction in the near future.
Collapse
Affiliation(s)
- Ying Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.
|
12
|
Rajasekaran N, Kaiser CM. Navigating the complexities of multi-domain protein folding. Curr Opin Struct Biol 2024; 86:102790. [PMID: 38432063 DOI: 10.1016/j.sbi.2024.102790] [Received: 11/27/2023] [Revised: 02/11/2024] [Accepted: 02/12/2024] [Indexed: 03/05/2024]
Abstract
Proteome complexity has expanded tremendously over evolutionary time, enabling biological diversification. Much of this complexity is achieved by combining a limited set of structural units into long polypeptides. This widely used evolutionary strategy poses challenges for folding of the resulting multi-domain proteins. As a consequence, their folding differs from that of small single-domain proteins, which generally fold quickly and reversibly. Co-translational processes and chaperone interactions are important aspects of multi-domain protein folding. In this review, we discuss some of the recent experimental progress toward understanding these processes.
Affiliation(s)
| | - Christian M Kaiser
- Department of Biology, Johns Hopkins University, Baltimore, MD, United States; Bijvoet Center for Biomolecular Research, Utrecht University, Utrecht, Netherlands.
|
13
|
Choudhary P, Feng Z, Berrisford J, Chao H, Ikegawa Y, Peisach E, Piehl DW, Smith J, Tanweer A, Varadi M, Westbrook JD, Young JY, Patwardhan A, Morris KL, Hoch JC, Kurisu G, Velankar S, Burley SK. PDB NextGen Archive: centralizing access to integrated annotations and enriched structural information by the Worldwide Protein Data Bank. Database (Oxford) 2024; 2024:baae041. [PMID: 38803272 PMCID: PMC11130521 DOI: 10.1093/database/baae041] [Received: 10/26/2023] [Revised: 01/29/2024] [Accepted: 05/14/2024] [Indexed: 05/29/2024]
Abstract
The Protein Data Bank (PDB) is the global repository for public-domain experimentally determined 3D biomolecular structural information. The archival nature of the PDB presents certain challenges pertaining to updating or adding associated annotations from trusted external biodata resources. While each Worldwide PDB (wwPDB) partner has made best efforts to provide up-to-date external annotations, accessing and integrating information from disparate wwPDB data centers can be an involved process. To address this issue, the wwPDB has established the PDB Next Generation (or NextGen) Archive, developed to centralize and streamline access to enriched structural annotations from wwPDB partners and trusted external sources. At present, the NextGen Archive provides mappings between experimentally determined 3D structures of proteins and UniProt amino acid sequences, domain annotations from Pfam, SCOP2 and CATH databases and intra-molecular connectivity information. Since launch, the PDB NextGen Archive has seen substantial user engagement with over 3.5 million data file downloads, ensuring researchers have access to accurate, up-to-date and easily accessible structural annotations. Database URL: http://www.wwpdb.org/ftp/pdb-nextgen-archive-site.
Affiliation(s)
- Preeti Choudhary
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Zukang Feng
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
| | - John Berrisford
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Henry Chao
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
| | - Yasuyo Ikegawa
- Protein Data Bank Japan, Protein Research Foundation, 3-2, Yamadaoka, Minoh, Osaka 562-8686, Japan
| | - Ezra Peisach
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
| | - Dennis W Piehl
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
| | - James Smith
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
| | - Ahsan Tanweer
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Mihaly Varadi
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - John D Westbrook
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
| | - Jasmine Y Young
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
| | - Ardan Patwardhan
- The Electron Microscopy Data Bank, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Kyle L Morris
- The Electron Microscopy Data Bank, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Jeffrey C Hoch
- Biological Magnetic Resonance Data Bank, Department of Molecular Biology and Biophysics, UConn Health, 263 Farmington Avenue, Farmington, CT 06030-3305, USA
| | - Genji Kurisu
- Protein Data Bank Japan, Protein Research Foundation, 3-2, Yamadaoka, Minoh, Osaka 562-8686, Japan
- Protein Data Bank Japan, Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita-shi, Osaka 565-0871, Japan
| | - Sameer Velankar
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
| | - Stephen K Burley
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
- Rutgers Cancer Institute of New Jersey, 195 Little Albany St., New Brunswick, NJ 08901, USA
- Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, 123 Bevier Rd., Piscataway, NJ 08854, USA
|
14
|
Kazakov AS, Rastrygina VA, Vologzhannikova AA, Zemskova MY, Bobrova LA, Deryusheva EI, Permyakova ME, Sokolov AS, Litus EA, Shevelyova MP, Uversky VN, Permyakov EA, Permyakov SE. Recognition of granulocyte-macrophage colony-stimulating factor by specific S100 proteins. Cell Calcium 2024; 119:102869. [PMID: 38484433 DOI: 10.1016/j.ceca.2024.102869] [Received: 12/28/2023] [Revised: 03/01/2024] [Accepted: 03/03/2024] [Indexed: 04/05/2024]
Abstract
Granulocyte-macrophage colony-stimulating factor (GM-CSF) is a pleiotropic myelopoietic growth factor and proinflammatory cytokine, clinically used for multiple indications and serving as a promising target for treatment of many disorders, including cancer, multiple sclerosis, rheumatoid arthritis, psoriasis, asthma, and COVID-19. We have previously shown that dimeric Ca2+-bound forms of S100A6 and S100P proteins, members of the multifunctional S100 protein family, are specific to GM-CSF. To probe selectivity of these interactions, the affinity of recombinant human GM-CSF to dimeric Ca2+-loaded forms of 18 recombinant human S100 proteins was studied by surface plasmon resonance spectroscopy. Of them, only S100A4 protein specifically binds to GM-CSF with equilibrium dissociation constant, Kd, values of 0.3-2 μM, as confirmed by intrinsic fluorescence and chemical crosslinking data. Calcium removal prevents S100A4 binding to GM-CSF, whereas monomerization of S100A4/A6/P proteins disrupts S100A4/A6 interaction with GM-CSF and induces a slight decrease in S100P affinity for GM-CSF. Structural modelling indicates the presence in the GM-CSF molecule of a conserved S100A4/A6/P-binding site, consisting of the residues from its termini, helices I and III, some of which are involved in the interaction with GM-CSF receptors. The predicted involvement of the 'hinge' region and F89 residue of S100P in GM-CSF recognition was confirmed by mutagenesis. Examination of S100A4/A6/P ability to affect GM-CSF signaling showed that S100A4/A6 inhibit GM-CSF-induced suppression of viability of monocytic THP-1 cells. The ability of the S100 proteins to modulate GM-CSF activity is relevant to progression of various neoplasms and other diseases, according to bioinformatics analysis. The direct regulation of GM-CSF signaling by extracellular forms of the S100 proteins should be taken into account in the clinical use of GM-CSF and development of the therapeutic interventions targeting GM-CSF or its receptors.
Affiliation(s)
- Alexey S Kazakov
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia.
| | - Victoria A Rastrygina
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
| | - Alisa A Vologzhannikova
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
| | - Marina Y Zemskova
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia; Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, G.K. Skryabin Institute of Biochemistry and Physiology of Microorganisms, pr. Nauki, 5, Pushchino, Moscow Region 142290, Russia
| | - Lolita A Bobrova
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
| | - Evgenia I Deryusheva
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia.
| | - Maria E Permyakova
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
| | - Andrey S Sokolov
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
| | - Ekaterina A Litus
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
| | - Marina P Shevelyova
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
| | - Vladimir N Uversky
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia; Department of Molecular Medicine and USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA.
| | - Eugene A Permyakov
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
| | - Sergei E Permyakov
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia.
|
15
|
Ellaway JIJ, Anyango S, Nair S, Zaki HA, Nadzirin N, Powell HR, Gutmanas A, Varadi M, Velankar S. Identifying protein conformational states in the Protein Data Bank: Toward unlocking the potential of integrative dynamics studies. Structural Dynamics (Melville, N.Y.) 2024; 11:034701. [PMID: 38774441 PMCID: PMC11106648 DOI: 10.1063/4.0000251] [Received: 03/07/2024] [Accepted: 05/08/2024] [Indexed: 05/24/2024]
Abstract
Studying protein dynamics and conformational heterogeneity is crucial for understanding biomolecular systems and treating disease. Despite the deposition of over 215 000 macromolecular structures in the Protein Data Bank and the advent of AI-based structure prediction tools such as AlphaFold2, RoseTTAFold, and ESMFold, static representations are typically produced, which fail to fully capture macromolecular motion. Here, we discuss the importance of integrating experimental structures with computational clustering to explore the conformational landscapes that manifest protein function. We describe the method developed by the Protein Data Bank in Europe - Knowledge Base to identify distinct conformational states, demonstrate the resource's primary use cases, through examples, and discuss the need for further efforts to annotate protein conformations with functional information. Such initiatives will be crucial in unlocking the potential of protein dynamics data, expediting drug discovery research, and deepening our understanding of macromolecular mechanisms.
Affiliation(s)
- Joseph I. J. Ellaway
- Protein Data Bank in Europe, European Bioinformatics Institute, Hinxton, United Kingdom
| | - Stephen Anyango
- Protein Data Bank in Europe, European Bioinformatics Institute, Hinxton, United Kingdom
| | - Sreenath Nair
- Protein Data Bank in Europe, European Bioinformatics Institute, Hinxton, United Kingdom
| | - Hossam A. Zaki
- The Warren Alpert Medical School of Brown University, Providence, Rhode Island 02903, USA
| | - Nurul Nadzirin
- Protein Data Bank in Europe, European Bioinformatics Institute, Hinxton, United Kingdom
| | - Harold R. Powell
- Imperial College London, Department of Life Sciences, London, United Kingdom
| | - Aleksandras Gutmanas
- WaveBreak Therapeutics Ltd., Clarendon House, Clarendon Road, Cambridge, United Kingdom
| | - Mihaly Varadi
- Protein Data Bank in Europe, European Bioinformatics Institute, Hinxton, United Kingdom
| | - Sameer Velankar
- Protein Data Bank in Europe, European Bioinformatics Institute, Hinxton, United Kingdom
|
16
|
Liu R, Clayton J, Shen M, Bhatnagar S, Shen J. Machine Learning Models to Interrogate Proteome-Wide Covalent Ligandabilities Directed at Cysteines. JACS Au 2024; 4:1374-1384. [PMID: 38665640 PMCID: PMC11040703 DOI: 10.1021/jacsau.3c00749] [Received: 11/28/2023] [Revised: 02/22/2024] [Accepted: 02/23/2024] [Indexed: 04/28/2024]
Abstract
Machine learning (ML) identification of covalently ligandable sites may accelerate targeted covalent inhibitor design and help expand the druggable proteome space. Here, we report the rigorous development and validation of the tree-based models and convolutional neural networks (CNNs) trained on a newly curated database (LigCys3D) of over 1000 liganded cysteines in nearly 800 proteins represented by over 10,000 three-dimensional structures in the protein data bank. The unseen tests yielded 94 and 93% area under the receiver operating characteristic curves for the tree models and CNNs, respectively. Based on the AlphaFold2 predicted structures, the ML models recapitulated the newly liganded cysteines in the PDB with over 90% recall values. To assist the community of covalent drug discoveries, we report the predicted ligandable cysteines in 392 human kinases and their locations in the sequence-aligned kinase structure, including the PH and SH2 domains. Furthermore, we disseminate a searchable online database LigCys3D (https://ligcys.computchem.org/) and a web prediction server DeepCys (https://deepcys.computchem.org/), both of which will be continuously updated and improved by including newly published experimental data. The present work represents an important step toward the ML-led integration of big genome data and structure models to annotate the human proteome space for the next-generation covalent drug discoveries.
Affiliation(s)
- Ruibin Liu
- Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, Maryland 21201, United States
| | - Joseph Clayton
- Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, Maryland 21201, United States
- Division of Applied Regulatory Science, Office of Clinical Pharmacology, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland 20993, United States
| | - Mingzhe Shen
- Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, Maryland 21201, United States
| | - Shubham Bhatnagar
- Department of Computer Science, University of Maryland at College Park, College Park, Maryland 20742, United States
| | - Jana Shen
- Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, Maryland 21201, United States
|
17
|
Pan Z, Zhuo L, Wan TY, Chen RY, Li YZ. DnaK duplication and specialization in bacteria correlates with increased proteome complexity. mSystems 2024; 9:e0115423. [PMID: 38530057 PMCID: PMC11019930 DOI: 10.1128/msystems.01154-23] [Received: 10/28/2023] [Accepted: 03/10/2024] [Indexed: 03/27/2024]
Abstract
The chaperone 70 kDa heat shock protein (Hsp70) is important for cells from bacteria to humans to maintain proteostasis, and all eukaryotes and several prokaryotes encode Hsp70 paralogs. Although the mechanisms of Hsp70 function have been clearly illuminated, the function and evolution of Hsp70 paralogs are not well studied. DnaK is a highly conserved bacterial Hsp70 family. Here, we show that dnaK is present in 98.9% of bacterial genomes, and 6.4% of them possess two or more DnaK paralogs. We found that the duplication of dnaK is positively correlated with an increase in proteomic complexity (proteome size, number of domains). We identified the interactomes of the two DnaK paralogs of Myxococcus xanthus DK1622 (MxDnaKs), which revealed that they are mostly nonoverlapping, although both prefer α and β domain proteins. Consistent with the entire M. xanthus proteome, MxDnaK substrates have both significantly more multi-domain proteins and a higher isoelectric point than those of Escherichia coli, which encodes a single DnaK homolog. MxDnaK1 is transcriptionally upregulated in response to heat shock and prefers to bind cytosolic proteins, while MxDnaK2 is downregulated by heat shock and is more associated with membrane proteins. Using domain swapping, we show that the nucleotide-binding domain and the substrate-binding β domain are responsible for the significant differences in DnaK interactomes, and the nucleotide-binding domain also determines the dimerization of MxDnaK2, but not MxDnaK1. Our work suggests that bacterial DnaK has been duplicated in order to deal with a more complex proteome, and that this allows evolution of distinct domains to deal with different subsets of target proteins.
IMPORTANCE: All eukaryotic and ~40% of prokaryotic species encode multiple 70 kDa heat shock protein (Hsp70) homologs with similar but diversified functions. Here, we show that duplication of canonical Hsp70 (DnaK in prokaryotes) correlates with increasing proteomic complexity and evolution of particular regions of the protein. Using the Myxococcus xanthus DnaK duplicates as a case, we found that their substrate spectrums are mostly nonoverlapping, and are both consistent with that of Escherichia coli DnaK in structural and molecular characteristics, but show differential enrichment of membrane proteins. Domain/region swapping demonstrated that the nucleotide-binding domain and the β substrate-binding domain (SBDβ), but not the SBDα or disordered C-terminal tail region, are responsible for this functional divergence. This work provides the first direct evidence for regional evolution of DnaK paralogs.
Affiliation(s)
- Zhuo Pan
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao, China
| | - Li Zhuo
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao, China
- Suzhou Research Institute, Shandong University, Suzhou, China
| | - Tian-yu Wan
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao, China
| | - Rui-yun Chen
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao, China
| | - Yue-zhong Li
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao, China
|
18
|
Glidden-Handgis G, Wheeler TJ. WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences. Bioinformatics Advances 2024; 4:vbae052. [PMID: 38764475 PMCID: PMC11099658 DOI: 10.1093/bioadv/vbae052] [Received: 10/21/2023] [Revised: 03/31/2024] [Accepted: 04/04/2024] [Indexed: 05/21/2024]
Abstract
Background: Software for labeling biological sequences typically produces a theory-based statistic for each match (the E-value) that indicates the likelihood of seeing that match's score by chance. E-values accurately predict false match rate for comparisons of random (shuffled) sequences, and thus provide a reasoned mechanism for setting score thresholds that enable high sensitivity with low expected false match rate. This threshold-setting strategy is challenged by real biological sequences, which contain regions of local repetition and low sequence complexity that cause excess matches between non-homologous sequences. Knowing this, tool developers often develop benchmarks that use realistic-seeming decoy sequences to explore empirical tradeoffs between sensitivity and false match rate. A recent trend has been to employ reversed biological sequences as realistic decoys, because these preserve the distribution of letters and the existence of local repeats, while disrupting the original sequence's functional properties. However, we and others have observed that sequences appear to produce high scoring alignments to their reversals with surprising frequency, leading to overstatement of false match risk that may negatively affect downstream analysis.
Results: We demonstrate that an alignment between a sequence S and its (possibly mutated) reversal tends to produce higher scores than alignment between truly unrelated sequences, even when S is a shuffled string with no notable repetitive or low-complexity regions. This phenomenon is due to the unintuitive fact that (even randomly shuffled) sequences contain palindromes that are on average longer than the longest common substrings (LCS) shared between permuted variants of the same sequence. Though the expected palindrome length is only slightly larger than the expected LCS, the distribution of alignment scores involving reversed sequences is strongly right-shifted, leading to greatly increased frequency of high-scoring alignments to reversed sequences.
Impact: Overestimates of false match risk can motivate unnecessarily high score thresholds, leading to potentially reduced true match sensitivity. Also, when tool sensitivity is only reported up to the score of the first matched decoy sequence, a large decoy set consisting of reversed sequences can obscure sensitivity differences between tools. As a result of these observations, we advise that reversed biological sequences be used as decoys only when care is taken to remove positive matches in the original (un-reversed) sequences, or when overstatement of false labeling is not a concern. Though the primary focus of the analysis is on sequence annotation, we also demonstrate that the prevalence of internal palindromes may lead to an overstatement of the rate of false labels in protein identification with mass spectrometry.
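The comparison described in the Results portion of this abstract (longest palindrome within a random sequence versus longest common substring shared with a permuted copy of it) can be explored with a small simulation. The following is only a minimal sketch under stated assumptions, not the authors' method: it uses exact matches rather than alignment scores, and the function names, the 20-letter alphabet, the sequence length, and the trial count are all hypothetical choices made for illustration.

```python
import random


def longest_palindrome_len(s: str) -> int:
    """Length of the longest palindromic substring (expand around each center)."""
    best = 0
    n = len(s)
    for center in range(2 * n - 1):
        lo = center // 2
        hi = lo + center % 2  # even center: single letter; odd center: gap between letters
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
        best = max(best, hi - lo - 1)
    return best


def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b (row-wise DP)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def mean_lengths(n_trials: int = 100, length: int = 200,
                 alphabet: str = "ACDEFGHIKLMNPQRSTVWY", seed: int = 0):
    """Return (mean longest palindrome, mean LCS with a permuted copy).

    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    pal_total = 0
    lcs_total = 0
    for _ in range(n_trials):
        letters = [rng.choice(alphabet) for _ in range(length)]
        s = "".join(letters)
        rng.shuffle(letters)  # permuted variant with the same letter composition
        permuted = "".join(letters)
        pal_total += longest_palindrome_len(s)
        lcs_total += longest_common_substring_len(s, permuted)
    return pal_total / n_trials, lcs_total / n_trials


if __name__ == "__main__":
    mean_pal, mean_lcs = mean_lengths()
    print(f"mean longest palindrome: {mean_pal:.2f}")
    print(f"mean longest common substring with a permutation: {mean_lcs:.2f}")
```

Running the script prints the two averages side by side. Because a palindrome in S also appears verbatim in the reversal of S, the first number is a lower bound on the longest exact match between S and its reversal, which is the quantity relevant to reversed decoys.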
Affiliation(s)
| | - Travis J Wheeler
- R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ 85721, United States
|
19
|
Penteado RF, Iulek J. Crystal structure of Methionyl-tRNA Synthetase from Rickettsia typhi in complex with its cognate amino acid. Biochimie 2024; 219:63-73. [PMID: 37673171 DOI: 10.1016/j.biochi.2023.09.001] [Received: 03/14/2023] [Revised: 08/08/2023] [Accepted: 09/02/2023] [Indexed: 09/08/2023]
Abstract
Rickettsia typhi is the causative agent of murine typhus (endemic typhus), a febrile illness that can be self-contained, though in some cases it can progress to death. The three-dimensional structure of Methionyl-tRNA Synthetase from R. typhi (RtMetRS) in complex with its substrate L-methionine was solved by molecular replacement and refined at 2.30 Å resolution in space group P1 from one X-ray diffraction dataset. Processing and refinement trials were decisive in establishing the lower symmetry space group and indicated the presence of twinning with four domains. RtMetRS belongs to the MetRS1 family and was crystallized with the CP domain in an open conformation, which is distinct from other MetRS1 enzymes whose structures were solved with a bound L-methionine (therefore, in a closed conformation). This conformation resembles the ones observed in the MetRS2 family.
Affiliation(s)
- Renato Ferras Penteado
- Department of Chemistry, State University of Ponta Grossa, Ponta Grossa, PR, 84030-900, Brazil
| | - Jorge Iulek
- Department of Chemistry, State University of Ponta Grossa, Ponta Grossa, PR, 84030-900, Brazil.
|
20
|
Dutta A, Kanaujia SP. The Structural Features of MlaD Illuminate its Unique Ligand-Transporting Mechanism and Ancestry. Protein J 2024; 43:298-315. [PMID: 38347327 DOI: 10.1007/s10930-023-10179-5] [Accepted: 12/22/2023] [Indexed: 05/01/2024]
Abstract
The membrane-associated solute-binding protein (SBP) MlaD of the maintenance of lipid asymmetry (Mla) system has been reported to help the transport of phospholipids (PLs) between the outer and inner membranes of Gram-negative bacteria. Despite the availability of structural information, the molecular mechanism underlying the transport of PLs and the ancestry of the protein MlaD remain unclear. In this study, we report the crystal structures of the periplasmic region of MlaD from Escherichia coli (EcMlaD) at a resolution range of 2.3-3.2 Å. The EcMlaD protomer consists of two distinct regions, viz. an N-terminal β-barrel fold consisting of seven strands (referred to as the MlaD domain) and a C-terminal α-helical domain (HD). The protein EcMlaD oligomerizes to give rise to a homo-hexameric ring with a central channel that is hydrophobic and continuous with a variable diameter. Interestingly, the structural analysis revealed that the HD, instead of the MlaD domain, plays a critical role in determining the oligomeric state of the protein. Based on the analysis of available structural information, we propose a working mechanism of PL transport, viz. "asymmetric protomer movement (APM)", wherein half of the EcMlaD hexamer would rise on the periplasmic side along with an outward movement of pore loops, resulting in a change of the central channel geometry. Furthermore, this study highlights that, unlike typical SBPs, EcMlaD possesses a fold similar to EF/AMT-type beta(6)-barrel and a unique ancestry. Altogether, the findings firmly establish EcMlaD to be a non-canonical SBP with a unique ligand-transport mechanism.
Affiliation(s)
- Angshu Dutta
- Department of Biosciences and Bioengineering, Indian Institute of Technology Guwahati, Guwahati, Assam, 781039, India
| | - Shankar Prasad Kanaujia
- Department of Biosciences and Bioengineering, Indian Institute of Technology Guwahati, Guwahati, Assam, 781039, India.
|
21
|
Roel‐Touris J, Carcelén L, Marcos E. The structural landscape of the immunoglobulin fold by large-scale de novo design. Protein Sci 2024; 33:e4936. [PMID: 38501461 PMCID: PMC10949314 DOI: 10.1002/pro.4936] [Received: 09/28/2023] [Revised: 02/02/2024] [Accepted: 02/06/2024] [Indexed: 03/20/2024]
Abstract
The de novo design of immunoglobulin-like frameworks that allow for functional loop diversification shows great potential for crafting antibody-like scaffolds with fully customizable structures and functions. In this work, we combined de novo parametric design with deep-learning methods for protein structure prediction and design to explore the structural landscape of 7-stranded immunoglobulin domains. After screening folding of nearly 4 million designs, we have assembled a structurally diverse library of ~50,000 immunoglobulin domains with high-confidence AlphaFold2 predictions and structures diverging from naturally occurring ones. The designed dataset enabled us to identify structural requirements for the correct folding of immunoglobulin domains and to shed light on β-sheet-β-sheet rotational preferences and on how these are linked to functional properties. Our approach eliminates the need for preset loop conformations and opens the route to large-scale de novo design of immunoglobulin-like frameworks.
Affiliation(s)
- Jorge Roel‐Touris
- Protein Design and Modeling Lab, Department of Structural and Molecular Biology, Molecular Biology Institute of Barcelona (IBMB), CSIC, Barcelona, Spain
| | - Lourdes Carcelén
- Protein Design and Modeling Lab, Department of Structural and Molecular Biology, Molecular Biology Institute of Barcelona (IBMB), CSIC, Barcelona, Spain
| | - Enrique Marcos
- Protein Design and Modeling Lab, Department of Structural and Molecular Biology, Molecular Biology Institute of Barcelona (IBMB), CSIC, Barcelona, Spain
| |
Collapse
|
22
|
Rozano L, Jones DAB, Hane JK, Mancera RL. Template-Based Modelling of the Structure of Fungal Effector Proteins. Mol Biotechnol 2024; 66:784-813. [PMID: 36940017 PMCID: PMC11043172 DOI: 10.1007/s12033-023-00703-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2022] [Accepted: 02/14/2023] [Indexed: 03/21/2023]
Abstract
The discovery of new fungal effector proteins is necessary to enable the screening of cultivars for disease resistance. Sequence-based bioinformatics methods have been used for this purpose, but only a limited number of functional effector proteins have been successfully predicted and subsequently validated experimentally. A significant obstacle is that many fungal effector proteins discovered so far lack sequence similarity or conserved sequence motifs. The availability of experimentally determined three-dimensional (3D) structures of a number of effector proteins has recently highlighted structural similarities amongst groups of sequence-dissimilar fungal effectors, enabling the search for similar structural folds amongst effector sequence candidates. We have applied template-based modelling to predict the 3D structures of candidate effector sequences obtained from bioinformatics predictions and the PHI-BASE database. Structural matches were found not only with ToxA- and MAX-like effector candidates but also with non-fungal effector-like proteins, including plant defensins and animal venoms, suggesting the broad conservation of ancestral structural folds amongst cytotoxic peptides from a diverse range of distant species. Accurate modelling of fungal effectors was achieved using RaptorX. The utility of predicted structures of effector proteins lies in the prediction of their interactions with plant receptors through molecular docking, which will improve the understanding of effector-plant interactions.
Affiliation(s)
- Lina Rozano
  - Curtin Medical School, Curtin Health Innovation Research Institute, GPO Box U1987, Perth, WA, 6845, Australia
  - Curtin Institute for Computation, Curtin University, GPO Box U1987, Perth, WA, 6845, Australia
- Darcy A B Jones
  - Centre for Crop and Disease Management, School of Molecular and Life Sciences, Curtin University, GPO Box U1987, Perth, WA, 6845, Australia
  - Curtin Institute for Computation, Curtin University, GPO Box U1987, Perth, WA, 6845, Australia
- James K Hane
  - Centre for Crop and Disease Management, School of Molecular and Life Sciences, Curtin University, GPO Box U1987, Perth, WA, 6845, Australia
  - Curtin Institute for Computation, Curtin University, GPO Box U1987, Perth, WA, 6845, Australia
- Ricardo L Mancera
  - Curtin Medical School, Curtin Health Innovation Research Institute, GPO Box U1987, Perth, WA, 6845, Australia
  - Curtin Institute for Computation, Curtin University, GPO Box U1987, Perth, WA, 6845, Australia
|
23
|
Singleton MD, Eisen MB. Evolutionary analyses of intrinsically disordered regions reveal widespread signals of conservation. PLoS Comput Biol 2024; 20:e1012028. [PMID: 38662765 PMCID: PMC11075841 DOI: 10.1371/journal.pcbi.1012028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 05/07/2024] [Accepted: 03/28/2024] [Indexed: 05/08/2024] Open
Abstract
Intrinsically disordered regions (IDRs) are segments of proteins without stable three-dimensional structures. As this flexibility allows them to interact with diverse binding partners, IDRs play key roles in cell signaling and gene expression. Despite the prevalence and importance of IDRs in eukaryotic proteomes and various biological processes, associating them with specific molecular functions remains a significant challenge due to their high rates of sequence evolution. However, by comparing the observed values of various IDR-associated properties against those generated under a simulated model of evolution, a recent study found that most IDRs across the entire yeast proteome contain conserved features. Furthermore, it showed that clusters of IDRs with common "evolutionary signatures," i.e. patterns of conserved features, were associated with specific biological functions. To determine if similar patterns of conservation are found in the IDRs of other systems, in this work we applied a series of phylogenetic models to over 7,500 orthologous IDRs identified in the Drosophila genome to dissect the forces driving their evolution. By comparing models of unconstrained and constrained continuous trait evolution using the Brownian motion and Ornstein-Uhlenbeck models, respectively, we identified signals of widespread constraint, indicating that conservation of distributed features is a mechanism of IDR evolution common to multiple biological systems. In contrast to the previous study in yeast, however, we observed limited evidence of IDR clusters with specific biological functions, which suggests a more complex relationship between evolutionary constraints and function in the IDRs of multicellular organisms.
Affiliation(s)
- Marc D. Singleton
  - Howard Hughes Medical Institute, UC Berkeley, Berkeley, California, United States of America
- Michael B. Eisen
  - Howard Hughes Medical Institute, UC Berkeley, Berkeley, California, United States of America
  - Department of Molecular and Cell Biology, UC Berkeley, Berkeley, California, United States of America
|
24
|
Gupta MN, Uversky VN. Protein structure-function continuum model: Emerging nexuses between specificity, evolution, and structure. Protein Sci 2024; 33:e4968. [PMID: 38532700 DOI: 10.1002/pro.4968] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2023] [Revised: 02/18/2024] [Accepted: 03/05/2024] [Indexed: 03/28/2024]
Abstract
The rationale for replacing the old binary of structure-function with the trinity of structure, disorder, and function has gained considerable ground in recent years. A continuum model based on the expanded form of the existing paradigm can now subsume the importance of both conformational flexibility and intrinsic disorder in protein function. Disorder is critical for understanding the protein-protein interactions in many regulatory processes, the formation of membrane-less organelles, and our revised notions of specificity, as amply illustrated by moonlighting proteins. While its importance in the formation of amyloids and the function of prions is often discussed, the roles of intrinsic disorder in infectious diseases and in protein function under extreme conditions are also becoming clear. This review is an attempt to discuss how our current understanding of protein function, specificity, and evolution fits better with the continuum model. This integration of structure and disorder under a single model may bring greater clarity to our continuing quest to understand proteins and the molecular mechanisms of their functionality.
Affiliation(s)
- Munishwar Nath Gupta
  - Department of Biochemical Engineering and Biotechnology, Indian Institute of Technology, New Delhi, India
- Vladimir N Uversky
  - Department of Molecular Medicine and USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, Florida, USA
|
25
|
Dotan E, Jaschek G, Pupko T, Belinkov Y. Effect of tokenization on transformers for biological sequences. Bioinformatics 2024; 40:btae196. [PMID: 38608190 PMCID: PMC11055402 DOI: 10.1093/bioinformatics/btae196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Revised: 02/20/2024] [Accepted: 04/11/2024] [Indexed: 04/14/2024] Open
Abstract
MOTIVATION: Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English and French, in which segmentation of the text into separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text into a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA into single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins into specific families. RESULTS: We demonstrate that applying alternative tokenization algorithms can increase accuracy and, at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. AVAILABILITY AND IMPLEMENTATION: Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.
Affiliation(s)
- Edo Dotan
  - The Henry and Marilyn Taub Faculty of Computer Science, Technion – Israel Institute of Technology, Haifa 3200003, Israel
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Gal Jaschek
  - Department of Genetics, Yale University School of Medicine, New Haven, CT 06510, United States
- Tal Pupko
  - The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Yonatan Belinkov
  - The Henry and Marilyn Taub Faculty of Computer Science, Technion – Israel Institute of Technology, Haifa 3200003, Israel
|
26
|
Abbass J, Parisi C. Machine learning-based prediction of proteins' architecture using sequences of amino acids and structural alphabets. J Biomol Struct Dyn 2024:1-16. [PMID: 38505995 DOI: 10.1080/07391102.2024.2328736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Accepted: 03/05/2024] [Indexed: 03/21/2024]
Abstract
In addition to the growth of protein structures generated through wet laboratory experiments and deposited in the PDB repository, AlphaFold predictions have significantly contributed to the creation of a much larger database of protein structures. Annotating such a vast number of structures has become an increasingly challenging task. CATH is widely recognized as one of the most common platforms for addressing this challenge, as it classifies proteins based on their structural and evolutionary relationships, offering the scientific community an invaluable resource for uncovering various properties, including functional annotations. Because CATH annotation involves some human intervention, keeping up with the classification of the rapidly expanding repositories of protein structures has become exceedingly difficult. Therefore, there is a pressing need for a fully automated approach. In addition, the abundance of protein sequences stemming from next-generation sequencing technologies, which lack structural annotations, presents a further challenge to the scientific community. Consequently, 'pre-annotating' protein sequences with structural features, while ensuring a high level of precision, could prove highly advantageous. In this paper, after a thorough investigation, we introduce a novel machine-learning model capable of classifying any protein domain, whether it has a known structure or not, into one of the 40 main CATH Architectures. We achieve an F1 score of 0.92 using only the amino acid sequence and a score of 0.94 using both the sequence of amino acids and the sequence of structural alphabets. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Jad Abbass
  - School of Computer Science and Mathematics, Kingston University, London, UK
- Charles Parisi
  - School of Computer Science and Mathematics, Kingston University, London, UK
  - Telecom Physique Strasbourg, Strasbourg University, Strasbourg, France
|
27
|
Goverde CA, Pacesa M, Goldbach N, Dornfeld LJ, Balbi PEM, Georgeon S, Rosset S, Kapoor S, Choudhury J, Dauparas J, Schellhaas C, Kozlov S, Baker D, Ovchinnikov S, Vecchio AJ, Correia BE. Computational design of soluble functional analogues of integral membrane proteins. bioRxiv 2024:2023.05.09.540044. [PMID: 38496615 PMCID: PMC10942269 DOI: 10.1101/2023.05.09.540044] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Indexed: 03/19/2024]
Abstract
De novo design of complex protein folds using solely computational means remains a significant challenge. Here, we use a robust deep learning pipeline to design complex folds and soluble analogues of integral membrane proteins. Unique membrane topologies, such as those from GPCRs, are not found in the soluble proteome and we demonstrate that their structural features can be recapitulated in solution. Biophysical analyses reveal high thermal stability of the designs and experimental structures show remarkable design accuracy. The soluble analogues were functionalized with native structural motifs, standing as a proof-of-concept for bringing membrane protein functions to the soluble proteome, potentially enabling new approaches in drug discovery. In summary, we designed complex protein topologies and enriched them with functionalities from membrane proteins, with high experimental success rates, leading to a de facto expansion of the functional soluble fold space.
|
28
|
Makarova KS, Zhang C, Wolf YI, Karamycheva S, Whitaker RJ, Koonin EV. Computational analysis of genes with lethal knockout phenotype and prediction of essential genes in archaea. mBio 2024; 15:e0309223. [PMID: 38189270 PMCID: PMC10865827 DOI: 10.1128/mbio.03092-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 11/27/2023] [Indexed: 01/09/2024] Open
Abstract
The identification of microbial genes essential for survival as those with lethal knockout phenotype (LKP) is a common strategy for functional interrogation of genomes. However, interpretation of the LKP is complicated because a substantial fraction of the genes with this phenotype remains poorly functionally characterized. Furthermore, many genes can exhibit LKP not because their products perform essential cellular functions but because their knockout activates the toxicity of other genes (conditionally essential genes). We analyzed the sets of LKP genes for two archaea, Methanococcus maripaludis and Sulfolobus islandicus, using a variety of computational approaches aiming to differentiate between essential and conditionally essential genes and to predict at least a general function for as many of the proteins encoded by these genes as possible. This analysis allowed us to predict the functions of several LKP genes including previously uncharacterized subunit of the GINS protein complex with an essential function in genome replication and of the KEOPS complex that is responsible for an essential tRNA modification as well as GRP protease implicated in protein quality control. Additionally, several novel antitoxins (conditionally essential genes) were predicted, and this prediction was experimentally validated by showing that the deletion of these genes together with the adjacent genes apparently encoding the cognate toxins caused no growth defect. We applied principal component analysis based on sequence and comparative genomic features showing that this approach can separate essential genes from conditionally essential ones and used it to predict essential genes in other archaeal genomes. IMPORTANCE: Only a relatively small fraction of the genes in any bacterium or archaeon is essential for survival as demonstrated by the lethal effect of their disruption. The identification of essential genes and their functions is crucial for understanding fundamental cell biology. However, many of the genes with a lethal knockout phenotype remain poorly functionally characterized, and furthermore, many genes can exhibit this phenotype not because their products perform essential cellular functions but because their knockout activates the toxicity of other genes. We applied state-of-the-art computational methods to predict the functions of a number of uncharacterized genes with the lethal knockout phenotype in two archaeal species and developed a computational approach to predict genes involved in essential functions. These findings advance the current understanding of key functionalities of archaeal cells.
Affiliation(s)
- Kira S. Makarova
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Changyi Zhang
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Yuri I. Wolf
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Svetlana Karamycheva
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Rachel J. Whitaker
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Eugene V. Koonin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
|
29
|
Doni D, Cavallari E, Noguera ME, Gentili HG, Cavion F, Parisi G, Fornasari MS, Sartori G, Santos J, Bellanda M, Carbonera D, Costantini P, Bortolus M. Searching for Frataxin Function: Exploring the Analogy with Nqo15, the Frataxin-like Protein of Respiratory Complex I from Thermus thermophilus. Int J Mol Sci 2024; 25:1912. [PMID: 38339189 PMCID: PMC10855754 DOI: 10.3390/ijms25031912] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2023] [Revised: 01/26/2024] [Accepted: 02/02/2024] [Indexed: 02/12/2024] Open
Abstract
Nqo15 is a subunit of respiratory complex I of the bacterium Thermus thermophilus, with strong structural similarity to human frataxin (FXN), a protein involved in the mitochondrial disease Friedreich's ataxia (FRDA). Recently, we showed that the expression of recombinant Nqo15 can ameliorate the respiratory phenotype of FRDA patients' cells, and this prompted us to further characterize both the behavior of Nqo15 in solution and its potential functional overlap with FXN, using a combination of in silico and in vitro techniques. We studied the analogy of Nqo15 and FXN by performing extensive database searches based on sequence and structure. Nqo15's folding and flexibility were investigated by combining nuclear magnetic resonance (NMR), circular dichroism, and coarse-grained molecular dynamics simulations. Nqo15's iron-binding properties were studied using NMR, fluorescence, and specific assays, and its desulfurase activation by biochemical assays. We found that the recombinant Nqo15 isolated from complex I is monomeric, stable, folded in solution, and highly dynamic. Nqo15 does not share the iron-binding properties of FXN or its desulfurase activation function.
Affiliation(s)
- Davide Doni
  - Department of Biology, University of Padova, 35121 Padova, Italy
- Eva Cavallari
  - Department of Biology, University of Padova, 35121 Padova, Italy
  - Grenoble Alpes University, CNRS, CEA, INRAE, IRIG-LPCV, 38000 Grenoble, France
- Martin Ezequiel Noguera
  - Department of Physiology and Molecular and Cellular Biology, Institute of Biosciences, Biotechnology and Translational Biology (iB3), Faculty of Exact and Natural Sciences, University of Buenos Aires, Intendente Güiraldes 2160, Buenos Aires C1428EG, Argentina
  - Institute of Biological Chemistry and Physical Chemistry, Dr Alejandro Paladini (UBA-CONICET), University of Buenos Aires, Junín 956, Buenos Aires 1113AAD, Argentina
  - Department of Science and Technology, National University of Quilmes, Roque Saenz Peña 352, Bernal B1876BXD, Argentina
- Hernan Gustavo Gentili
  - Department of Physiology and Molecular and Cellular Biology, Institute of Biosciences, Biotechnology and Translational Biology (iB3), Faculty of Exact and Natural Sciences, University of Buenos Aires, Intendente Güiraldes 2160, Buenos Aires C1428EG, Argentina
- Federica Cavion
  - Department of Biology, University of Padova, 35121 Padova, Italy
- Gustavo Parisi
  - Department of Science and Technology, National University of Quilmes, Roque Saenz Peña 352, Bernal B1876BXD, Argentina
- Maria Silvina Fornasari
  - Department of Science and Technology, National University of Quilmes, Roque Saenz Peña 352, Bernal B1876BXD, Argentina
- Geppo Sartori
  - Department of Biomedical Sciences, University of Padova, 35121 Padova, Italy
- Javier Santos
  - Department of Physiology and Molecular and Cellular Biology, Institute of Biosciences, Biotechnology and Translational Biology (iB3), Faculty of Exact and Natural Sciences, University of Buenos Aires, Intendente Güiraldes 2160, Buenos Aires C1428EG, Argentina
- Massimo Bellanda
  - Department of Chemical Sciences, University of Padova, 35131 Padova, Italy
  - Consiglio Nazionale delle Ricerche Institute of Biomolecular Chemistry, 35131 Padova, Italy
- Donatella Carbonera
  - Department of Chemical Sciences, University of Padova, 35131 Padova, Italy
- Paola Costantini
  - Department of Biology, University of Padova, 35121 Padova, Italy
- Marco Bortolus
  - Department of Chemical Sciences, University of Padova, 35131 Padova, Italy
|
30
|
Satalkar V, Degaga GD, Li W, Pang YT, McShan AC, Gumbart JC, Mitchell JC, Torres MP. Generative β-hairpin design using a residue-based physicochemical property landscape. Biophys J 2024:S0006-3495(24)00070-5. [PMID: 38297834 DOI: 10.1016/j.bpj.2024.01.029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Revised: 12/20/2023] [Accepted: 01/25/2024] [Indexed: 02/02/2024] Open
Abstract
De novo peptide design is a new frontier that has broad application potential in the biological and biomedical fields. Most existing models for de novo peptide design are largely based on sequence homology that can be restricted based on evolutionarily derived protein sequences and lack the physicochemical context essential in protein folding. Generative machine learning for de novo peptide design is a promising way to synthesize theoretical data that are based on, but unique from, the observable universe. In this study, we created and tested a custom peptide generative adversarial network intended to design peptide sequences that can fold into the β-hairpin secondary structure. This deep neural network model is designed to establish a preliminary foundation of the generative approach based on physicochemical and conformational properties of 20 canonical amino acids, for example, hydrophobicity and residue volume, using extant structure-specific sequence data from the PDB. The beta generative adversarial network model robustly distinguishes secondary structures of β hairpin from α helix and intrinsically disordered peptides with an accuracy of up to 96% and generates artificial β-hairpin peptide sequences with minimum sequence identities around 31% and 50% when compared against the current NCBI PDB and nonredundant databases, respectively. These results highlight the potential of generative models specifically anchored by physicochemical and conformational property features of amino acids to expand the sequence-to-structure landscape of proteins beyond evolutionary limits.
Affiliation(s)
- Vardhan Satalkar
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia
- Gemechis D Degaga
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee
- Wei Li
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia
- Yui Tik Pang
  - School of Physics, Georgia Institute of Technology, Atlanta, Georgia
- Andrew C McShan
  - School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
- James C Gumbart
  - School of Physics, Georgia Institute of Technology, Atlanta, Georgia; School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
- Julie C Mitchell
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee
- Matthew P Torres
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia; School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
|
31
|
Sayin AZ, Abali Z, Senyuz S, Cankara F, Gursoy A, Keskin O. Conformational diversity and protein-protein interfaces in drug repurposing in Ras signaling pathway. Sci Rep 2024; 14:1239. [PMID: 38216592 PMCID: PMC10786864 DOI: 10.1038/s41598-023-50913-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 12/27/2023] [Indexed: 01/14/2024] Open
Abstract
We focus on drug repurposing in the Ras signaling pathway, considering structural similarities of protein-protein interfaces. The interfaces formed by physically interacting proteins are found from PDB if available and via PRISM (PRotein Interaction by Structural Matching) otherwise. The structural coverage of these interactions has been increased from 21 to 92% using PRISM. Multiple conformations of each protein are used to include protein dynamics and diversity. Next, we find FDA-approved drugs bound to structurally similar protein-protein interfaces. The results suggest that HIV protease inhibitors tipranavir, indinavir, and saquinavir may bind to EGFR and ERBB3/HER3 interface. Tipranavir and indinavir may also bind to EGFR and ERBB2/HER2 interface. Additionally, a drug used in Alzheimer's disease can bind to RAF1 and BRAF interface. Hence, we propose a methodology to find drugs to be potentially used for cancer using a dataset of structurally similar protein-protein interface clusters rather than pockets in a systematic way.
Affiliation(s)
- Ahenk Zeynep Sayin
  - Department of Chemical and Biological Engineering, College of Engineering, Koc University, Rumeli Feneri Yolu Sariyer, 34450, Istanbul, Turkey
- Zeynep Abali
  - Graduate School of Science and Engineering, Computational Sciences and Engineering, Koc University, 34450, Istanbul, Turkey
- Simge Senyuz
  - Graduate School of Science and Engineering, Computational Sciences and Engineering, Koc University, 34450, Istanbul, Turkey
- Fatma Cankara
  - Graduate School of Science and Engineering, Computational Sciences and Engineering, Koc University, 34450, Istanbul, Turkey
- Attila Gursoy
  - Department of Computer Engineering, Koc University, 34450, Istanbul, Turkey
- Ozlem Keskin
  - Department of Chemical and Biological Engineering, College of Engineering, Koc University, Rumeli Feneri Yolu Sariyer, 34450, Istanbul, Turkey
|
32
|
Liu R, Clayton J, Shen M, Bhatnagar S, Shen J. Machine Learning Models to Interrogate Proteomewide Covalent Ligandabilities Directed at Cysteines. bioRxiv 2024:2023.08.17.553742. [PMID: 37662346 PMCID: PMC10473668 DOI: 10.1101/2023.08.17.553742] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2023]
Abstract
Machine learning (ML) identification of covalently ligandable sites may accelerate targeted covalent inhibitor design and help expand the druggable proteome space. Here we report the rigorous development and validation of the tree-based models and convolutional neural networks (CNNs) trained on a newly curated database (LigCys3D) of over 1,000 liganded cysteines in nearly 800 proteins represented by over 10,000 three-dimensional structures in the protein data bank. The unseen tests yielded 94% and 93% AUCs (area under the receiver operating characteristic curve) for the tree models and CNNs, respectively. Based on the AlphaFold2 predicted structures, the ML models recapitulated the newly liganded cysteines in the PDB with over 90% recall values. To assist the community of covalent drug discoveries, we report the predicted ligandable cysteines in 392 human kinases and their locations in the sequence-aligned kinase structure including the PH and SH2 domains. Furthermore, we disseminate a searchable online database LigCys3D (https://ligcys.computchem.org/) and a web prediction server DeepCys (https://deepcys.computchem.org/), both of which will be continuously updated and improved by including newly published experimental data. The present work represents a first step towards the ML-led integration of big genome data and structure models to annotate the human proteome space for the next-generation covalent drug discoveries.
Collapse
Affiliation(s)
- Ruibin Liu
- Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, MD 21201, USA
| | - Joseph Clayton
- Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, MD 21201, USA
- Division of Applied Regulatory Science, Office of Clinical Pharmacology, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD 20993, USA
| | - Mingzhe Shen
- Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, MD 21201, USA
| | - Shubham Bhatnagar
- Department of Computer Science, University of Maryland at College Park, College Park, MD 20742, USA
| | - Jana Shen
- Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, MD 21201, USA
| |
|
33
|
Schierholz L, Brown CR, Helena-Bueno K, Uversky VN, Hirt RP, Barandun J, Melnikov SV. A Conserved Ribosomal Protein Has Entirely Dissimilar Structures in Different Organisms. Mol Biol Evol 2024; 41:msad254. [PMID: 37987564 PMCID: PMC10764239 DOI: 10.1093/molbev/msad254] [Received: 08/07/2023] [Revised: 10/23/2023] [Accepted: 11/16/2023] [Indexed: 11/22/2023] Open
Abstract
Ribosomes from different species can markedly differ in their composition by including dozens of ribosomal proteins that are unique to specific lineages but absent in others. However, it remains unknown how ribosomes acquire new proteins throughout evolution. Here, to help answer this question, we describe the evolution of the ribosomal protein msL1/msL2 that was recently found in ribosomes from the parasitic microorganism clade, microsporidia. We show that this protein has a conserved location in the ribosome but entirely dissimilar structures in different organisms: in each of the analyzed species, msL1/msL2 exhibits an altered secondary structure, an inverted orientation of the N-termini and C-termini on the ribosomal binding surface, and a completely transformed 3D fold. We then show that this fold switching is likely caused by changes in the ribosomal msL1/msL2-binding site, specifically, by variations in rRNA. These observations allow us to infer an evolutionary scenario in which a small, positively charged, de novo-born unfolded protein was first captured by rRNA to become part of the ribosome and subsequently underwent complete fold switching to optimize its binding to its evolving ribosomal binding site. Overall, our work provides a striking example of how a protein can switch its fold in the context of a complex biological assembly, while retaining its specificity for its molecular partner. This finding will help us better understand the origin and evolution of new protein components of complex molecular assemblies, thereby enhancing our ability to engineer biological molecules, identify protein homologs, and peer into the history of life on Earth.
Affiliation(s)
- Léon Schierholz
- Department of Molecular Biology, Laboratory for Molecular Infection Medicine Sweden, Umeå Centre for Microbial Research, Science for Life Laboratory, Umeå University, Umeå 901 87, Sweden
| | - Charlotte R Brown
- Biosciences Institute, Newcastle University School of Medicine, Newcastle upon Tyne NE2 4HH, UK
| | - Karla Helena-Bueno
- Biosciences Institute, Newcastle University School of Medicine, Newcastle upon Tyne NE2 4HH, UK
| | - Vladimir N Uversky
- Department of Molecular Medicine and USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA
| | - Robert P Hirt
- Biosciences Institute, Newcastle University School of Medicine, Newcastle upon Tyne NE2 4HH, UK
| | - Jonas Barandun
- Department of Molecular Biology, Laboratory for Molecular Infection Medicine Sweden, Umeå Centre for Microbial Research, Science for Life Laboratory, Umeå University, Umeå 901 87, Sweden
| | - Sergey V Melnikov
- Biosciences Institute, Newcastle University School of Medicine, Newcastle upon Tyne NE2 4HH, UK
| |
|
34
|
Pantolini L, Studer G, Pereira J, Durairaj J, Tauriello G, Schwede T. Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone. Bioinformatics 2024; 40:btad786. [PMID: 38175775 PMCID: PMC10792726 DOI: 10.1093/bioinformatics/btad786] [Received: 07/10/2023] [Revised: 10/27/2023] [Accepted: 12/29/2023] [Indexed: 01/06/2024] Open
Abstract
MOTIVATION: Language models are routinely used for text classification and generative tasks. Recently, the same architectures were applied to protein sequences, unlocking powerful new approaches in the bioinformatics field. Protein language models (pLMs) generate high-dimensional embeddings on a per-residue level and encode a "semantic meaning" of each individual amino acid in the context of the full protein sequence. These representations have been used as a starting point for downstream learning tasks and, more recently, for identifying distant homologous relationships between proteins. RESULTS: In this work, we introduce a new method that generates embedding-based protein sequence alignments (EBA) and show how these capture structural similarities even in the twilight zone, outperforming both classical methods as well as other approaches based on pLMs. The method shows excellent accuracy despite the absence of training and parameter optimization. We demonstrate that the combination of pLMs with alignment methods is a valuable approach for the detection of relationships between proteins in the twilight zone. AVAILABILITY AND IMPLEMENTATION: The code to run EBA and reproduce the analysis described in this article is available at: https://git.scicore.unibas.ch/schwede/EBA and https://git.scicore.unibas.ch/schwede/eba_benchmark.
Affiliation(s)
- Lorenzo Pantolini
- Biozentrum, University of Basel, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| | - Gabriel Studer
- Biozentrum, University of Basel, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| | - Joana Pereira
- Biozentrum, University of Basel, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| | - Janani Durairaj
- Biozentrum, University of Basel, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| | - Gerardo Tauriello
- Biozentrum, University of Basel, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| | - Torsten Schwede
- Biozentrum, University of Basel, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| |
|
35
|
Denessiouk K, Denesyuk AI, Permyakov SE, Permyakov EA, Johnson MS, Uversky VN. The active site of the SGNH hydrolase-like fold proteins: Nucleophile-oxyanion (Nuc-Oxy) and Acid-Base zones. Curr Res Struct Biol 2023; 7:100123. [PMID: 38235349 PMCID: PMC10792757 DOI: 10.1016/j.crstbi.2023.100123] [Received: 10/26/2023] [Revised: 12/25/2023] [Accepted: 12/27/2023] [Indexed: 01/19/2024] Open
Abstract
SGNH hydrolase-like fold proteins are serine proteases with the default Asp-His-Ser catalytic triad. Here, we show that these proteins share two unique conserved structural organizations around the active site: (1) the Nuc-Oxy Zone around the catalytic nucleophile and the oxyanion hole, and (2) the Acid-Base Zone around the catalytic acid and base. The Nuc-Oxy Zone consists of 14 amino acids cross-linked with eight conserved intra- and inter-block hydrogen bonds. The Acid-Base Zone is constructed from a single fragment of the polypeptide chain, which incorporates both the catalytic acid and base, and whose N- and C-terminal residues are linked together by a conserved hydrogen bond. The Nuc-Oxy and Acid-Base Zones are connected by an SHLink, a two-bond conserved interaction from amino acids, adjacent to the catalytic nucleophile and base.
Affiliation(s)
- Konstantin Denessiouk
- Institute for Biological Instrumentation of the Russian Academy of Sciences, Federal Research Center “Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences”, Pushchino, 142290, Russia
- Structural Bioinformatics Laboratory, Biochemistry, InFLAMES Research Flagship Center, Faculty of Science and Engineering, Biochemistry, Åbo Akademi University, Turku, 20520, Finland
| | - Alexander I. Denesyuk
- Institute for Biological Instrumentation of the Russian Academy of Sciences, Federal Research Center “Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences”, Pushchino, 142290, Russia
- Structural Bioinformatics Laboratory, Biochemistry, InFLAMES Research Flagship Center, Faculty of Science and Engineering, Biochemistry, Åbo Akademi University, Turku, 20520, Finland
| | - Sergei E. Permyakov
- Institute for Biological Instrumentation of the Russian Academy of Sciences, Federal Research Center “Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences”, Pushchino, 142290, Russia
| | - Eugene A. Permyakov
- Institute for Biological Instrumentation of the Russian Academy of Sciences, Federal Research Center “Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences”, Pushchino, 142290, Russia
| | - Mark S. Johnson
- Structural Bioinformatics Laboratory, Biochemistry, InFLAMES Research Flagship Center, Faculty of Science and Engineering, Biochemistry, Åbo Akademi University, Turku, 20520, Finland
| | - Vladimir N. Uversky
- Institute for Biological Instrumentation of the Russian Academy of Sciences, Federal Research Center “Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences”, Pushchino, 142290, Russia
- Department of Molecular Medicine and USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL, 33612, USA
| |
|
36
|
Subramanian AM, Thomson M. Unexplored regions of the protein sequence-structure map revealed at scale by a library of foldtuned language models. bioRxiv 2023:2023.12.22.573145. [PMID: 38187750 PMCID: PMC10769378 DOI: 10.1101/2023.12.22.573145] [Indexed: 01/09/2024]
Abstract
Nature has likely sampled only a fraction of all protein sequences and structures allowed by the laws of biophysics. However, the combinatorial scale of amino-acid sequence-space has traditionally precluded substantive study of the full protein sequence-structure map. In particular, it remains unknown how much of the vast uncharted landscape of far-from-natural sequences consists of alternate ways to encode the familiar ensemble of natural folds; proteins in this category also represent an opportunity to diversify candidates for downstream applications. Here, we characterize sequence-structure mapping in far-from-natural regions of sequence-space guided by the capacity of protein language models (pLMs) to explore sequences outside their natural training data through generation. We demonstrate that pretrained generative pLMs sample a limited structural snapshot of the natural protein universe, including >350 common (sub)domain elements. Incorporating pLM, structure prediction, and structure-based search techniques, we surpass this limitation by developing a novel "foldtuning" strategy that pushes a pretrained pLM into a generative regime that maintains structural similarity to a target protein fold (e.g. TIM barrel, thioredoxin, etc) while maximizing dissimilarity to natural amino-acid sequences. We apply "foldtuning" to build a library of pLMs for >700 naturally-abundant folds in the SCOP database, accessing swaths of proteins that take familiar structures yet lie far from known sequences, spanning targets that include enzymes, immune ligands, and signaling proteins. By revealing protein sequence-structure information at scale outside of the context of evolution, we anticipate that this work will enable future systematic searches for wholly novel folds and facilitate more immediate protein design goals in catalysis and medicine.
Affiliation(s)
- Arjuna M Subramanian
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125
| | - Matt Thomson
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125
| |
|
37
|
Lau AM, Kandathil SM, Jones DT. Merizo: a rapid and accurate protein domain segmentation method using invariant point attention. Nat Commun 2023; 14:8445. [PMID: 38114456 PMCID: PMC10730818 DOI: 10.1038/s41467-023-43934-4] [Received: 06/09/2023] [Accepted: 11/24/2023] [Indexed: 12/21/2023] Open
Abstract
The AlphaFold Protein Structure Database, containing predictions for over 200 million proteins, has been met with enthusiasm over its potential in enriching structural biological research and beyond. Currently, access to the database is precluded by an urgent need for tools that allow the efficient traversal, discovery, and documentation of its contents. Identifying domain regions in the database is a non-trivial endeavour and doing so will aid our understanding of protein structure and function, while facilitating drug discovery and comparative genomics. Here, we describe a deep learning method for domain segmentation called Merizo, which learns to cluster residues into domains in a bottom-up manner. Merizo is trained on CATH domains and fine-tuned on AlphaFold2 models via self-distillation, enabling it to be applied to both experimental and AlphaFold2 models. As proof of concept, we apply Merizo to the human proteome, identifying 40,818 putative domains that can be matched to CATH representative domains.
Affiliation(s)
- Andy M Lau
- Department of Computer Science, University College London, London, WC1E 6BT, UK
| | - Shaun M Kandathil
- Department of Computer Science, University College London, London, WC1E 6BT, UK
| | - David T Jones
- Department of Computer Science, University College London, London, WC1E 6BT, UK
| |
|
38
|
Tsuchiya Y, Yonezawa T, Yamamori Y, Inoura H, Osawa M, Ikeda K, Tomii K. PoSSuM v.3: A Major Expansion of the PoSSuM Database for Finding Similar Binding Sites of Proteins. J Chem Inf Model 2023; 63:7578-7587. [PMID: 38016694 PMCID: PMC10716853 DOI: 10.1021/acs.jcim.3c01405] [Received: 09/01/2023] [Revised: 10/28/2023] [Accepted: 11/01/2023] [Indexed: 11/30/2023]
Abstract
Information on structures of protein-ligand complexes, including comparisons of known and putative protein-ligand-binding pockets, is valuable for protein annotation and drug discovery and development. To facilitate biomedical and pharmaceutical research, we developed PoSSuM (https://possum.cbrc.pj.aist.go.jp/PoSSuM/), a database for identifying similar binding pockets in proteins. The current PoSSuM database includes 191 million similar pairs among almost 10 million identified pockets. PoSSuM drug search (PoSSuMds) is a resource for investigating ligand and receptor diversity among a set of pockets that can bind to an approved drug compound. The enhanced PoSSuMds covers pockets associated with both approved drugs and drug candidates in clinical trials from the latest release of ChEMBL. Additionally, we developed two new databases: PoSSuMAg for investigating antibody-antigen interactions and PoSSuMAF to simplify exploring putative pockets in AlphaFold human protein models.
Affiliation(s)
- Yuko Tsuchiya
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan
| | - Tomoki Yonezawa
- Division of Physics for Life Functions, Keio University Faculty of Pharmacy, 1-5-30 Shibakoen, Minato-ku, Tokyo 105-8512, Japan
| | - Yu Yamamori
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan
| | - Hiroko Inoura
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan
| | - Masanori Osawa
- Division of Physics for Life Functions, Keio University Faculty of Pharmacy, 1-5-30 Shibakoen, Minato-ku, Tokyo 105-8512, Japan
| | - Kazuyoshi Ikeda
- Division of Physics for Life Functions, Keio University Faculty of Pharmacy, 1-5-30 Shibakoen, Minato-ku, Tokyo 105-8512, Japan
- Medicinal Chemistry Applied AI Unit, HPC- and AI-driven Drug Development Platform Division, RIKEN Center for Computational Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
| | - Kentaro Tomii
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan
| |
|
39
|
Segura J, Rose Y, Bi C, Duarte J, Burley SK, Bittrich S. RCSB Protein Data Bank: visualizing groups of experimentally determined PDB structures alongside computed structure models of proteins. Frontiers in Bioinformatics 2023; 3:1311287. [PMID: 38111685 PMCID: PMC10726007 DOI: 10.3389/fbinf.2023.1311287] [Received: 10/09/2023] [Accepted: 11/17/2023] [Indexed: 12/20/2023] Open
Abstract
Recent advances in Artificial Intelligence and Machine Learning (e.g., AlphaFold, RosettaFold, and ESMFold) enable prediction of three-dimensional (3D) protein structures from amino acid sequences alone at accuracies comparable to lower-resolution experimental methods. These tools have been employed to predict structures across entire proteomes and the results of large-scale metagenomic sequence studies, yielding an exponential increase in available biomolecular 3D structural information. Given the enormous volume of this newly computed biostructure data, there is an urgent need for robust tools to manage, search, cluster, and visualize large collections of structures. Equally important is the capability to efficiently summarize and visualize metadata, biological/biochemical annotations, and structural features, particularly when working with vast numbers of protein structures of both experimental origin from the Protein Data Bank (PDB) and computationally-predicted models. Moreover, researchers require advanced visualization techniques that support interactive exploration of multiple sequences and structural alignments. This paper introduces a suite of tools provided on the RCSB PDB research-focused web portal RCSB.org, tailor-made for efficient management, search, organization, and visualization of this burgeoning corpus of 3D macromolecular structure data.
Affiliation(s)
- Joan Segura
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, San Diego, CA, United States
| | - Yana Rose
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, San Diego, CA, United States
| | - Chunxiao Bi
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, San Diego, CA, United States
| | - Jose Duarte
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, San Diego, CA, United States
| | - Stephen K. Burley
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, San Diego, CA, United States
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
- Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, United States
- Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
| | - Sebastian Bittrich
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, San Diego, CA, United States
| |
|
40
|
Midlik A, Nair S, Anyango S, Deshpande M, Sehnal D, Varadi M, Velankar S. PDBImages: a command-line tool for automated macromolecular structure visualization. Bioinformatics 2023; 39:btad744. [PMID: 38085238 PMCID: PMC10746859 DOI: 10.1093/bioinformatics/btad744] [Received: 08/03/2023] [Revised: 10/20/2023] [Accepted: 12/11/2023] [Indexed: 12/24/2023] Open
Abstract
SUMMARY: PDBImages is an innovative, open-source Node.js package that harnesses the power of the popular macromolecule structure visualization software Mol*. Designed for use by the scientific community, PDBImages provides a means to generate high-quality images for PDB and AlphaFold DB models. Its unique ability to render and save images directly to files in a browserless mode sets it apart, offering users a streamlined, automated process for macromolecular structure visualization. Here, we detail the implementation of PDBImages, enumerating its diverse image types, and elaborating on its user-friendly setup. This powerful tool opens a new gateway for researchers to visualize, analyse, and share their work, fostering a deeper understanding of bioinformatics. AVAILABILITY AND IMPLEMENTATION: PDBImages is available as an npm package from https://www.npmjs.com/package/pdb-images. The source code is available from https://github.com/PDBeurope/pdb-images.
Affiliation(s)
- Adam Midlik
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, United Kingdom
| | - Sreenath Nair
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, United Kingdom
| | - Stephen Anyango
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, United Kingdom
| | - Mandar Deshpande
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, United Kingdom
| | - David Sehnal
- Biological Data Management and Analysis Core Facility, Centre for Structural Biology, CEITEC—Central European Institute of Technology, Masaryk University, Brno 62500, Czech Republic
- National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno 62500, Czech Republic
| | - Mihaly Varadi
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, United Kingdom
| | - Sameer Velankar
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, United Kingdom
| |
|
41
|
Hamamsy T, Barot M, Morton JT, Steinegger M, Bonneau R, Cho K. Learning sequence, structure, and function representations of proteins with language models. bioRxiv 2023:2023.11.26.568742. [PMID: 38045331 PMCID: PMC10690258 DOI: 10.1101/2023.11.26.568742] [Indexed: 12/05/2023]
Abstract
The sequence-structure-function relationships that ultimately generate the diversity of extant observed proteins are complex, as proteins bridge the gap between multiple informational and physical scales involved in nearly all cellular processes. One limitation of existing protein annotation databases such as UniProt is that less than 1% of proteins have experimentally verified functions, and computational methods are needed to fill in the missing information. Here, we demonstrate that a multi-aspect framework based on protein language models can learn sequence-structure-function representations of amino acid sequences, and can provide the foundation for sensitive sequence-structure-function aware protein sequence search and annotation. Based on this model, we introduce a multi-aspect information retrieval system for proteins, Protein-Vec, covering sequence, structure, and function aspects, that enables computational protein annotation and function prediction at tree-of-life scales.
|
42
|
Wang T, Wang L, Zhang X, Shen C, Zhang O, Wang J, Wu J, Jin R, Zhou D, Chen S, Liu L, Wang X, Hsieh CY, Chen G, Pan P, Kang Y, Hou T. Comprehensive assessment of protein loop modeling programs on large-scale datasets: prediction accuracy and efficiency. Brief Bioinform 2023; 25:bbad486. [PMID: 38171930 PMCID: PMC10764206 DOI: 10.1093/bib/bbad486] [Received: 09/20/2023] [Revised: 12/04/2023] [Accepted: 12/05/2023] [Indexed: 01/05/2024] Open
Abstract
Protein loops play a critical role in the dynamics of proteins and are essential for numerous biological functions, and various computational approaches to loop modeling have been proposed over the past decades. However, a comprehensive understanding of the strengths and weaknesses of each method is lacking. In this work, we constructed two high-quality datasets (i.e. the General dataset and the CASP dataset) and systematically evaluated the accuracy and efficiency of 13 commonly used loop modeling approaches from the perspective of loop lengths, protein classes and residue types. The results indicate that the knowledge-based method FREAD generally outperforms the other tested programs in most cases, but encountered challenges when predicting loops longer than 15 and 30 residues on the CASP and General datasets, respectively. The ab initio method Rosetta NGK demonstrated exceptional modeling accuracy for short loops with four to eight residues and achieved the highest success rate on the CASP dataset. The well-known AlphaFold2 and RoseTTAFold require more resources for better performance, but they exhibit promise for predicting loops longer than 16 and 30 residues in the CASP and General datasets. These observations can provide valuable insights for selecting suitable methods for specific loop modeling tasks and contribute to future advancements in the field.
Collapse
Affiliation(s)
- Tianyue Wang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Langcheng Wang
- Department of Pathology, New York University Medical Center, 550 First Avenue, New York, NY 10016, USA
| | - Xujun Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Chao Shen
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Odin Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Jike Wang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Jialu Wu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Ruofan Jin
- College of Life Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Donghao Zhou
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, Guangdong, China
| | - Shicheng Chen
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Liwei Liu
- Advanced Computing and Storage Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd., Shenzhen 518129, Guangdong, China
| | - Xiaorui Wang
- State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Macao, China
| | - Chang-Yu Hsieh
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Guangyong Chen
- Zhejiang Lab, Zhejiang University, Hangzhou 311121, Zhejiang, China
| | - Peichen Pan
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Yu Kang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Tingjun Hou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| |
|
43
|
Cao W, Wu LY, Xia XY, Chen X, Wang ZX, Pan XM. A sequence-based evolutionary distance method for Phylogenetic analysis of highly divergent proteins. Sci Rep 2023; 13:20304. [PMID: 37985846 PMCID: PMC10662474 DOI: 10.1038/s41598-023-47496-9] [Received: 08/28/2023] [Accepted: 11/14/2023] [Indexed: 11/22/2023] Open
Abstract
Because of the limited effectiveness of prevailing phylogenetic methods when applied to highly divergent protein sequences, the phylogenetic analysis problem remains challenging. Here, we propose a sequence-based evolutionary distance algorithm termed sequence distance (SD), which innovatively incorporates site-to-site correlation within protein sequences into the distance estimation. In protein superfamilies, SD can effectively distinguish evolutionary relationships both within and between protein families, producing phylogenetic trees that closely align with those based on structural information, even with sequence identity less than 20%. SD is highly correlated with the similarity of the protein structure, and can calculate evolutionary distances for thousands of protein pairs within seconds using a single CPU, which is significantly faster than most protein structure prediction methods that demand high computational resources and long run times. The development of SD will significantly advance phylogenetics, providing researchers with a more accurate and reliable tool for exploring evolutionary relationships.
Affiliation(s)
- Wei Cao
- Key Laboratory of Ministry of Education for Protein Science, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Lu-Yun Wu
- Key Laboratory of Ministry of Education for Protein Science, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Xia-Yu Xia
- Key Laboratory of Ministry of Education for Protein Science, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Xiang Chen
- Key Laboratory of Ministry of Education for Protein Science, School of Life Sciences, Tsinghua University, Beijing, 100084, China
- Zhi-Xin Wang
- Key Laboratory of Ministry of Education for Protein Science, School of Life Sciences, Tsinghua University, Beijing, 100084, China.
- Xian-Ming Pan
- Key Laboratory of Ministry of Education for Protein Science, School of Life Sciences, Tsinghua University, Beijing, 100084, China.
44
Bale A, Rambo R, Prior C. The SKMT Algorithm: A method for assessing and comparing underlying protein entanglement. PLoS Comput Biol 2023; 19:e1011248. [PMID: 38011290 PMCID: PMC10703313 DOI: 10.1371/journal.pcbi.1011248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Revised: 12/07/2023] [Accepted: 11/06/2023] [Indexed: 11/29/2023]
Abstract
We present fast and simple-to-implement measures of the entanglement of protein tertiary structures which are appropriate for highly flexible structure comparison. These are performed using the SKMT algorithm, a novel method of smoothing the Cα backbone to achieve a minimal complexity curve representation of the manner in which the protein's secondary structure elements fold to form its tertiary structure. Its subsequent complexity is characterised using measures based on the writhe and crossing number quantities heavily utilised in DNA topology studies, and which have shown promising results when applied to proteins recently. The SKMT smoothing is used to derive empirical bounds on a protein's entanglement relative to its number of secondary structure elements. We show that large scale helical geometries dominantly account for the maximum growth in entanglement of protein monomers, and further that this large scale helical geometry is present in a large array of proteins, consistent across a number of different protein structure types and sequences. We also show how these bounds can be used to constrain the search space of protein structure prediction from small angle x-ray scattering experiments, a method highly suited to determining the likely structure of proteins in solution where crystal structure or machine learning based predictions often fail to match experimental data. Finally we develop a structural comparison metric based on the SKMT smoothing which is used in one specific case to demonstrate significant structural similarity between Rossmann fold and TIM Barrel proteins, a link which is potentially significant as attempts to engineer the latter have in the past produced the former. We provide the SWRITHE interactive python notebook to calculate these metrics.
Affiliation(s)
- Arron Bale
- Department of Mathematical Sciences, Durham University, Durham, United Kingdom
- Robert Rambo
- Diamond Light Source, Harwell Science and Innovation Campus, Didcot, United Kingdom
- Christopher Prior
- Department of Mathematical Sciences, Durham University, Durham, United Kingdom
45
Casier R, Duhamel J. Appraisal of blob-Based Approaches in the Prediction of Protein Folding Times. J Phys Chem B 2023; 127:8852-8859. [PMID: 37793094 DOI: 10.1021/acs.jpcb.3c04958] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/06/2023]
Abstract
A series of reports published in the last 3 years has illustrated that a blob-based model (BBM) can predict the folding time of proteins from their primary amino acid (aa) sequence based on three simple rules established to characterize the long-range backbone dynamics (LRBD) of racemic polypeptides. The sole use of LRBD to predict protein folding times with the BBM represents a radical departure from all other prediction methods currently applied to determine protein folding times, which rely instead on parameters such as the structure content, folding kinetics, chain length, amino acid properties, or contact topography of proteins. Furthermore, the built-in modularity of the BBM enables the parametrization and inclusion of new phenomena affecting the LRBD of polypeptides, while its conceptual simplicity makes it an interesting new mathematical tool for studying protein folding. However, its novelty implies that its relationship with many other methods used to predict protein folding times has not been well researched. Consequently, the purpose of this report is to uncover the physical phenomena encountered during protein folding that are best described by the BBM through the identification of parameters that have been recognized over the years as being strong predictors for protein folding, such as protein size, topology, structural class, and folding kinetics. This was accomplished by determining the parameters most strongly correlated with the folding times predicted by the BBM. While the BBM in its present form appears to be a good indicator of the folding times of the vast majority of the 195 proteins considered so far, this report finds that it excels for moderately large proteins that are primarily composed of locally formed structural motifs such as α-helices or for proteins that fold in multiple steps. 
Altogether, these observations based on the use of the BBM support the notion that proteins fold the way they do because the LRBD of polypeptides is mostly driven by the local interactions experienced between aa's within reach of one another.
Affiliation(s)
- Remi Casier
- Institute for Polymer Research, Waterloo Institute for Nanotechnology, Department of Chemistry, University of Waterloo, Waterloo, Ontario N2L3G1, Canada
- Jean Duhamel
- Institute for Polymer Research, Waterloo Institute for Nanotechnology, Department of Chemistry, University of Waterloo, Waterloo, Ontario N2L3G1, Canada
46
Ooka K, Arai M. Accurate prediction of protein folding mechanisms by simple structure-based statistical mechanical models. Nat Commun 2023; 14:6338. [PMID: 37857633 PMCID: PMC10587348 DOI: 10.1038/s41467-023-41664-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 05/24/2022] [Accepted: 09/10/2023] [Indexed: 10/21/2023]
Abstract
Recent breakthroughs in highly accurate protein structure prediction using deep neural networks have made considerable progress in solving the structure prediction component of the 'protein folding problem'. However, predicting detailed mechanisms of how proteins fold into specific native structures remains challenging, especially for multidomain proteins constituting most of the proteomes. Here, we develop a simple structure-based statistical mechanical model that introduces nonlocal interactions driving the folding of multidomain proteins. Our model successfully predicts protein folding processes consistent with experiments, without the limitations of protein size and shape. Furthermore, slight modifications of the model allow prediction of disulfide-oxidative and disulfide-intact protein folding. These predictions depict details of the folding processes beyond reproducing experimental results and provide a rationale for the folding mechanisms. Thus, our physics-based models enable accurate prediction of protein folding mechanisms with low computational complexity, paving the way for solving the folding process component of the 'protein folding problem'.
Affiliation(s)
- Koji Ooka
- Department of Physics, Graduate School of Science, The University of Tokyo, 3-8-1 Komaba, Meguro, Tokyo, 153-8902, Japan
- Komaba Organization for Educational Excellence, College of Arts and Sciences, The University of Tokyo, 3-8-1 Komaba, Meguro, Tokyo, 153-8902, Japan
- Munehito Arai
- Department of Physics, Graduate School of Science, The University of Tokyo, 3-8-1 Komaba, Meguro, Tokyo, 153-8902, Japan.
- Komaba Organization for Educational Excellence, College of Arts and Sciences, The University of Tokyo, 3-8-1 Komaba, Meguro, Tokyo, 153-8902, Japan.
- Department of Life Sciences, Graduate School of Arts and Sciences, The University of Tokyo, 3-8-1 Komaba, Meguro, Tokyo, 153-8902, Japan.
47
Bae DW, Lee SH, Park JH, Son SY, Lin Y, Lee J, Jang BR, Lee KH, Lee YH, Lee H, Kang S, Kim B, Cha SS. An archaeal transcription factor EnfR with a novel 'eighth note' fold controls hydrogen production of a hyperthermophilic archaeon Thermococcus onnurineus NA1. Nucleic Acids Res 2023; 51:10026-10040. [PMID: 37650645 PMCID: PMC10570040 DOI: 10.1093/nar/gkad699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2022] [Revised: 07/13/2023] [Accepted: 08/14/2023] [Indexed: 09/01/2023]
Abstract
Thermococcus onnurineus NA1, a hyperthermophilic carboxydotrophic archaeon, produces H2 through CO oxidation catalyzed by proteins encoded in a carbon monoxide dehydrogenase (CODH) gene cluster. TON_1525 with a DNA-binding helix-turn-helix (HTH) motif is a putative repressor regulating the transcriptional expression of the codh gene cluster. The T55I mutation in TON_1525 led to enhanced H2 production accompanied by the increased expression of genes in the codh cluster. Here, TON_1525 was demonstrated to be a dimer. Monomeric TON_1525 adopts a novel 'eighth note' symbol-like fold (referred to as 'eighth note' fold regulator, EnfR), and the dimerization mode of EnfR is unique in that it has no resemblance to structures in the Protein Data Bank. According to footprinting and gel shift assays, dimeric EnfR binds to a 36-bp pseudo-palindromic inverted repeat in the promoter region of the codh gene cluster, which is supported by an in silico EnfR/DNA complex model and mutational studies revealing the implication of N-terminal loops as well as HTH motifs in DNA recognition. The DNA-binding affinity of the T55I mutant was lowered by ∼15-fold, for which the conformational change of N-terminal loops is responsible. In addition, transcriptome analysis suggested that EnfR could regulate diverse metabolic processes besides H2 production.
Affiliation(s)
- Da-Woon Bae
- Department of Chemistry & Nanoscience, Ewha Womans University, Seoul 03760, Republic of Korea
- Seong Hyuk Lee
- Marine Biotechnology Research Center, Korea Institute of Ocean Science and Technology, Busan, South Korea
- Ji Hye Park
- Department of Food Science and Biotechnology, Ewha Womans University, Seoul 03760, Republic of Korea
- Se-Young Son
- Department of Chemistry & Nanoscience, Ewha Womans University, Seoul 03760, Republic of Korea
- Yuxi Lin
- Research Center for Bioconvergence Analysis, Korea Basic Science Institute (KBSI), Cheongju, Chungbuk 28119, Republic of Korea
- Jung Hyen Lee
- Department of Food Science and Biotechnology, Ewha Womans University, Seoul 03760, Republic of Korea
- Bo-Ram Jang
- Department of Life Science, Sogang University, 35 Baekbeom-Ro, Mapo-Gu, Seoul, South Korea
- Kyu-Ho Lee
- Department of Life Science, Sogang University, 35 Baekbeom-Ro, Mapo-Gu, Seoul, South Korea
- Young-Ho Lee
- Research Center for Bioconvergence Analysis, Korea Basic Science Institute (KBSI), Cheongju, Chungbuk 28119, Republic of Korea
- Bio-Analytical Science, University of Science and Technology, Daejeon 34113, Republic of Korea
- Department of Systems Biotechnology, Chung-Ang University, Anseong, Gyeonggi 17546, Republic of Korea
- Frontier Research Institute for Interdisciplinary Sciences, Tohoku University, Sendai, Miyagi 980-8578, Japan
- Hyun Sook Lee
- Marine Biotechnology Research Center, Korea Institute of Ocean Science and Technology, Busan, South Korea
- Department of Marine Biotechnology, KIOST School, University of Science and Technology, Daejeon, South Korea
- Sung Gyun Kang
- Marine Biotechnology Research Center, Korea Institute of Ocean Science and Technology, Busan, South Korea
- Department of Marine Biotechnology, KIOST School, University of Science and Technology, Daejeon, South Korea
- Byoung Sik Kim
- Department of Food Science and Biotechnology, Ewha Womans University, Seoul 03760, Republic of Korea
- Sun-Shin Cha
- Department of Chemistry & Nanoscience, Ewha Womans University, Seoul 03760, Republic of Korea
48
Pavlopoulos GA, Baltoumas FA, Liu S, Selvitopi O, Camargo AP, Nayfach S, Azad A, Roux S, Call L, Ivanova NN, Chen IM, Paez-Espino D, Karatzas E, Iliopoulos I, Konstantinidis K, Tiedje JM, Pett-Ridge J, Baker D, Visel A, Ouzounis CA, Ovchinnikov S, Buluç A, Kyrpides NC. Unraveling the functional dark matter through global metagenomics. Nature 2023; 622:594-602. [PMID: 37821698 PMCID: PMC10584684 DOI: 10.1038/s41586-023-06583-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 03/18/2022] [Accepted: 08/30/2023] [Indexed: 10/13/2023]
Abstract
Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities [1,2]. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database [3]. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.
Affiliation(s)
- Georgios A Pavlopoulos
- Institute for Fundamental Biomedical Research, Biomedical Science Research Center Alexander Fleming, Vari, Greece.
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
- Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens, Greece.
- Fotis A Baltoumas
- Institute for Fundamental Biomedical Research, Biomedical Science Research Center Alexander Fleming, Vari, Greece
- Sirui Liu
- John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, USA
- Oguz Selvitopi
- Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Antonio Pedro Camargo
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Stephen Nayfach
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Ariful Azad
- Luddy School of Informatics, Computing and Engineering, Indiana University Bloomington, Bloomington, IN, USA
- Simon Roux
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Lee Call
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Natalia N Ivanova
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- I Min Chen
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- David Paez-Espino
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Evangelos Karatzas
- Institute for Fundamental Biomedical Research, Biomedical Science Research Center Alexander Fleming, Vari, Greece
- Ioannis Iliopoulos
- Department of Basic Sciences, School of Medicine, University of Crete, Heraklion, Greece
- James M Tiedje
- Center for Microbial Ecology, Michigan State University, East Lansing, MI, USA
- Jennifer Pett-Ridge
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, USA
- David Baker
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Axel Visel
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Christos A Ouzounis
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Biological Computation & Process Laboratory, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas, Thessalonica, Greece
- Biological Computation & Computational Biology Group, Artificial Intelligence & Information Analysis Lab, School of Informatics, Aristotle University of Thessalonica, Thessalonica, Greece
- Sergey Ovchinnikov
- John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, USA
- Aydin Buluç
- Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
- Nikos C Kyrpides
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
49
Kazakov AS, Deryusheva EI, Rastrygina VA, Sokolov AS, Permyakova ME, Litus EA, Uversky VN, Permyakov EA, Permyakov SE. Interaction of S100A6 Protein with the Four-Helical Cytokines. Biomolecules 2023; 13:1345. [PMID: 37759746 PMCID: PMC10526228 DOI: 10.3390/biom13091345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Revised: 08/19/2023] [Accepted: 08/31/2023] [Indexed: 09/29/2023]
Abstract
S100 is a family of over 20 structurally homologous, but functionally diverse regulatory (calcium/zinc)-binding proteins of vertebrates. The involvement of S100 proteins in numerous vital (patho)physiological processes is mediated by their interaction with various (intra/extra)cellular protein partners, including cell surface receptors. Furthermore, recent studies have revealed the ability of specific S100 proteins to modulate cell signaling via direct interaction with cytokines. Previously, we revealed the binding of ca. 71% of the four-helical cytokines via the S100P protein, due to the presence in its molecule of a cytokine-binding site overlapping with the binding site for the S100P receptor. Here, we show that another S100 protein, S100A6 (that has a pairwise sequence identity with S100P of 35%), specifically binds numerous four-helical cytokines. We have studied the affinity of the recombinant forms of 35 human four-helical cytokines from all structural families of this fold to Ca2+-loaded recombinant human S100A6, using surface plasmon resonance spectroscopy. S100A6 recognizes 26 of the cytokines from all families of this fold, with equilibrium dissociation constants from 0.3 nM to 12 µM. Overall, S100A6 interacts with ca. 73% of the four-helical cytokines studied to date, with a selectivity equivalent to that for the S100P protein, with the differences limited to the binding of interleukin-2 and oncostatin M. The molecular docking study evidences the presence in the S100A6 molecule of a cytokine-binding site, analogous to that found in S100P. The findings argue the presence in some of the promiscuous members of the S100 family of a site specific to a wide range of four-helical cytokines. This unique feature of the S100 proteins potentially allows them to modulate the activity of the numerous four-helical cytokines in the disorders accompanied by an excessive release of the cytokines.
Affiliation(s)
- Alexey S. Kazakov
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
- Evgenia I. Deryusheva
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
- Victoria A. Rastrygina
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
- Andrey S. Sokolov
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
- Maria E. Permyakova
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
- Ekaterina A. Litus
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
- Vladimir N. Uversky
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
- Department of Molecular, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA
- USF Health Byrd Alzheimer’s Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA
- Eugene A. Permyakov
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
- Sergei E. Permyakov
- Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, Institute for Biological Instrumentation, Institutskaya str., 7, Pushchino, Moscow Region 142290, Russia
50
Xie L, Xie L. Elucidation of genome-wide understudied proteins targeted by PROTAC-induced degradation using interpretable machine learning. PLoS Comput Biol 2023; 19:e1010974. [PMID: 37590332 PMCID: PMC10464998 DOI: 10.1371/journal.pcbi.1010974] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2023] [Revised: 08/29/2023] [Accepted: 07/27/2023] [Indexed: 08/19/2023]
Abstract
Proteolysis-targeting chimeras (PROTACs) are hetero-bifunctional molecules that induce the degradation of target proteins by recruiting an E3 ligase. PROTACs have the potential to inactivate disease-related genes that are considered undruggable by small molecules, making them a promising therapy for the treatment of incurable diseases. However, only a few hundred proteins have been experimentally tested for their amenability to PROTACs, and it remains unclear which other proteins in the entire human genome can be targeted by PROTACs. In this study, we have developed PrePROTAC, an interpretable machine learning model based on a transformer-based protein sequence descriptor and random forest classification. PrePROTAC predicts genome-wide targets that can be degraded by CRBN, one of the E3 ligases. In the benchmark studies, PrePROTAC achieved a ROC-AUC of 0.81, an average precision of 0.84, and over 40% sensitivity at a false positive rate of 0.05. When evaluated by an external test set which comprised proteins from different structural folds than those in the training set, the performance of PrePROTAC did not drop significantly, indicating its generalizability. Furthermore, we developed an embedding SHapley Additive exPlanations (eSHAP) method, which extends conventional SHAP analysis for original features to an embedding space through in silico mutagenesis. This method allowed us to identify key residues in the protein structure that play critical roles in PROTAC activity. The identified key residues were consistent with existing knowledge. Using PrePROTAC, we identified over 600 novel understudied proteins that are potentially degradable by CRBN and proposed PROTAC compounds for three novel drug targets associated with Alzheimer's disease.
Affiliation(s)
- Li Xie
- Department of Computer Science, Hunter College, The City University of New York, New York City, New York, United States of America
- Lei Xie
- Department of Computer Science, Hunter College, The City University of New York, New York City, New York, United States of America
- Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York City, New York, United States of America
- Helen and Robert Appel Alzheimer’s Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University, New York City, New York, United States of America