Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Velankar S, Dana JM, Jacobsen J, van Ginkel G, Gane PJ, Luo J, Oldfield TJ, O'Donovan C, Martin MJ, Kleywegt GJ. SIFTS: Structure Integration with Function, Taxonomy and Sequences resource. Nucleic Acids Res 2012. [PMID: 23203869 PMCID: PMC3531078 DOI: 10.1093/nar/gks1258] [Citation(s) in RCA: 174] [Impact Index Per Article: 14.5] [Indexed: 12/04/2022]

Cited by Other Article(s)

Abali Z, Aydin Z, Khokhar M, Ates YC, Gursoy A, Keskin O. PPInterface: A Comprehensive Dataset of 3D Protein-Protein Interface Structures. J Mol Biol 2024;436:168686. [PMID: 38936693 DOI: 10.1016/j.jmb.2024.168686] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2024] [Revised: 05/25/2024] [Accepted: 06/20/2024] [Indexed: 06/29/2024]

Majila K, Viswanath S. StrIDR: a database of intrinsically disordered regions of proteins with experimentally resolved structures. bioRxiv 2024:2024.08.22.609111. [PMID: 39253485 PMCID: PMC11382991 DOI: 10.1101/2024.08.22.609111] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/11/2024]

Zhang Y, Leung AK, Kang JJ, Sun Y, Wu G, Li L, Sun J, Cheng L, Qiu T, Zhang J, Wierbowski S, Gupta S, Booth J, Yu H. A multiscale functional map of somatic mutations in cancer integrating protein structure and network topology. bioRxiv 2024:2023.03.06.531441. [PMID: 36945530 PMCID: PMC10028849 DOI: 10.1101/2023.03.06.531441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/09/2023]

Abstract

A major goal of cancer biology is to understand the mechanisms underlying tumorigenesis driven by somatically acquired mutations. Two distinct types of computational methodologies have emerged: one focuses on analyzing clustering of mutations within protein sequences and 3D structures, while the other characterizes mutations by leveraging the topology of protein-protein interaction network. Their insights are largely non-overlapping, offering complementary strengths. Here, we established a unified, end-to-end 3D structurally-informed protein interaction network propagation framework, NetFlow3D, that systematically maps the multiscale mechanistic effects of somatic mutations in cancer. The establishment of NetFlow3D hinges upon the Human Protein Structurome, a comprehensive repository we compiled that incorporates the 3D structures of every single protein as well as the binding interfaces of all known protein interactions in humans. NetFlow3D leverages the Structurome to integrate information across atomic, residue, protein and network levels: It conducts 3D clustering of mutations across atomic and residue levels on protein structures to identify potential driver mutations. It then anisotropically propagates their impacts across the protein interaction network, with propagation guided by the specific 3D structural interfaces involved, to identify significantly interconnected network "modules", thereby uncovering key biological processes underlying disease etiology. Applied to 1,038,899 somatic protein-altering mutations in 9,946 TCGA tumors across 33 cancer types, NetFlow3D identified 1,4444 significant 3D clusters throughout the Human Protein Structurome, of which ~55% would not have been found if using only experimentally-determined structures. It then identified 26 significantly interconnected modules that encompass ~8-fold more proteins than applying standard network analyses. NetFlow3D and our pan-cancer results can be accessed from http://netflow3d.yulab.org.

Affiliation(s)

Yingying Zhang Department of Computational Biology, Cornell University; Ithaca, 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA Department of Molecular Biology and Genetics, Cornell University; Ithaca, 14853, USA
Alden K. Leung Department of Computational Biology, Cornell University; Ithaca, 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
Jin Joo Kang Department of Computational Biology, Cornell University; Ithaca, 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
Yu Sun Department of Computational Biology, Cornell University; Ithaca, 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
Guanxi Wu College of Agriculture and Life Sciences, Cornell University; Ithaca, 14853, USA
Le Li Department of Computational Biology, Cornell University; Ithaca, 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
Jiayang Sun Department of Computational Biology, Cornell University; Ithaca, 14853, USA
Lily Cheng Department of Science and Technology Studies, Cornell University; Ithaca, 14853, USA
Tian Qiu School of Electrical and Computer Engineering, Cornell University; Ithaca, 14853, USA
Junke Zhang Department of Computational Biology, Cornell University; Ithaca, 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
Shayne Wierbowski Department of Computational Biology, Cornell University; Ithaca, 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
Shagun Gupta Department of Computational Biology, Cornell University; Ithaca, 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA
James Booth Department of Computational Biology, Cornell University; Ithaca, 14853, USA Department of Statistics and Data Science, Cornell University; Ithaca, 14853, USA
Haiyuan Yu Department of Computational Biology, Cornell University; Ithaca, 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University; Ithaca, 14853, USA

Geist JL, Lee CY, Strom JM, de Jesús Naveja J, Luck K. Generation of a high confidence set of domain-domain interface types to guide protein complex structure predictions by AlphaFold. Bioinformatics 2024;40:btae482. [PMID: 39171834 PMCID: PMC11361816 DOI: 10.1093/bioinformatics/btae482] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2024] [Revised: 07/10/2024] [Accepted: 08/20/2024] [Indexed: 08/23/2024]

Ye B, Tian W, Wang B, Liang J. CASTpFold: Computed Atlas of Surface Topography of the universe of protein Folds. Nucleic Acids Res 2024;52:W194-W199. [PMID: 38783102 PMCID: PMC11223844 DOI: 10.1093/nar/gkae415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2024] [Revised: 04/25/2024] [Accepted: 05/03/2024] [Indexed: 05/25/2024]

Choudhary P, Feng Z, Berrisford J, Chao H, Ikegawa Y, Peisach E, Piehl DW, Smith J, Tanweer A, Varadi M, Westbrook JD, Young JY, Patwardhan A, Morris KL, Hoch JC, Kurisu G, Velankar S, Burley SK. PDB NextGen Archive: centralizing access to integrated annotations and enriched structural information by the Worldwide Protein Data Bank. Database (Oxford) 2024;2024:baae041. [PMID: 38803272 PMCID: PMC11130521 DOI: 10.1093/database/baae041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 01/29/2024] [Accepted: 05/14/2024] [Indexed: 05/29/2024]

Affiliation(s)

Preeti Choudhary Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
Zukang Feng Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
John Berrisford Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
Henry Chao Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
Yasuyo Ikegawa Protein Data Bank Japan, Protein Research Foundation, 3-2, Yamadaoka, Minoh, Osaka 562-8686, Japan
Ezra Peisach Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
Dennis W Piehl Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
James Smith Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
Ahsan Tanweer Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
Mihaly Varadi Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
John D Westbrook Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
Jasmine Y Young Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA
Ardan Patwardhan The Electron Microscopy Data Bank, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
Kyle L Morris The Electron Microscopy Data Bank, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
Jeffrey C Hoch Biological Magnetic Resonance Data Bank, Department of Molecular Biology and Biophysics, UConn Health, 263 Farmington Avenue, Farmington, CT 06030-3305, USA
Genji Kurisu Protein Data Bank Japan, Protein Research Foundation, 3-2, Yamadaoka, Minoh, Osaka 562-8686, Japan Protein Data Bank Japan, Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita-shi, Osaka 565-0871, Japan
Sameer Velankar Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
Stephen K Burley Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Rd., Piscataway, NJ 08854, USA Rutgers Cancer Institute of New Jersey, 195 Little Albany St., New Brunswick, NJ 08901, USA Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, 123 Bevier Rd., Piscataway, NJ 08854, USA

Kellman BP, Mariethoz J, Zhang Y, Shaul S, Alteri M, Sandoval D, Jeffris M, Armingol E, Bao B, Lisacek F, Bojar D, Lewis NE. Decoding glycosylation potential from protein structure across human glycoproteins with a multi-view recurrent neural network. bioRxiv 2024:2024.05.15.594334. [PMID: 38798633 PMCID: PMC11118808 DOI: 10.1101/2024.05.15.594334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]

Affiliation(s)

Benjamin P. Kellman Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA Augment Biologics, La Jolla, CA 92092 Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, USA
Julien Mariethoz Proteome Informatics Group, Swiss Institute of Bioinformatics, CH-1227 Geneva, Switzerland
Yujie Zhang Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA
Sigal Shaul Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA
Mia Alteri Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA
Daniel Sandoval Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA 92093, USA
Mia Jeffris Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA
Erick Armingol Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
Bokan Bao Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
Frederique Lisacek Proteome Informatics Group, Swiss Institute of Bioinformatics, CH-1227 Geneva, Switzerland Computer Science Department & Section of Biology, University of Geneva, route de Drize 7, CH-1227, Geneva, Switzerland
Daniel Bojar Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, Gothenburg 41390, Sweden Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg 41390, Sweden
Nathan E. Lewis Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA Department of Bioengineering, University of California, San Diego, La Jolla, CA 92093, USA Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, USA

Zhao H, Petrey D, Murray D, Honig B. ZEPPI: Proteome-scale sequence-based evaluation of protein-protein interaction models. Proc Natl Acad Sci U S A 2024;121:e2400260121. [PMID: 38743624 PMCID: PMC11127014 DOI: 10.1073/pnas.2400260121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Accepted: 04/18/2024] [Indexed: 05/16/2024]

Ye B, Tian W, Wang B, Liang J. CASTpFold: Computed Atlas of Surface Topography of the universe of protein Folds. bioRxiv 2024:2024.05.04.592496. [PMID: 38766001 PMCID: PMC11100609 DOI: 10.1101/2024.05.04.592496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]

Omelchenko AA, Siwek JC, Chhibbar P, Arshad S, Nazarali I, Nazarali K, Rosengart A, Rahimikollu J, Tilstra J, Shlomchik MJ, Koes DR, Joglekar AV, Das J. Sliding Window INteraction Grammar (SWING): a generalized interaction language model for peptide and protein interactions. bioRxiv 2024:2024.05.01.592062. [PMID: 38746274 PMCID: PMC11092674 DOI: 10.1101/2024.05.01.592062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]

Abstract

The explosion of sequence data has allowed the rapid growth of protein language models (pLMs). pLMs have now been employed in many frameworks including variant-effect and peptide-specificity prediction. Traditionally, for protein-protein or peptide-protein interactions (PPIs), corresponding sequences are either co-embedded followed by post-hoc integration or the sequences are concatenated prior to embedding. Interestingly, no method utilizes a language representation of the interaction itself. We developed an interaction LM (iLM), which uses a novel language to represent interactions between protein/peptide sequences. Sliding Window Interaction Grammar (SWING) leverages differences in amino acid properties to generate an interaction vocabulary. This vocabulary is the input into a LM followed by a supervised prediction step where the LM's representations are used as features. SWING was first applied to predicting peptide:MHC (pMHC) interactions. SWING was not only successful at generating Class I and Class II models that have comparable prediction to state-of-the-art approaches, but the unique Mixed Class model was also successful at jointly predicting both classes. Further, the SWING model trained only on Class I alleles was predictive for Class II, a complex prediction task not attempted by any existing approach. For de novo data, using only Class I or Class II data, SWING also accurately predicted Class II pMHC interactions in murine models of SLE (MRL/lpr model) and T1D (NOD model), that were validated experimentally. To further evaluate SWING's generalizability, we tested its ability to predict the disruption of specific protein-protein interactions by missense mutations. Although modern methods like AlphaMissense and ESM1b can predict interfaces and variant effects/pathogenicity per mutation, they are unable to predict interaction-specific disruptions. 
SWING was successful at accurately predicting the impact of both Mendelian mutations and population variants on PPIs. This is the first generalizable approach that can accurately predict interaction-specific disruptions by missense mutations with only sequence information. Overall, SWING is a first-in-class generalizable zero-shot iLM that learns the language of PPIs.

Affiliation(s)

Alisa A. Omelchenko Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA The joint CMU-Pitt PhD program in computational biology, School of Medicine, University of Pittsburgh, PA, USA
Jane C. Siwek Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA The joint CMU-Pitt PhD program in computational biology, School of Medicine, University of Pittsburgh, PA, USA
Prabal Chhibbar Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Integrative systems biology PhD program, School of Medicine, University of Pittsburgh, PA, USA
Sanya Arshad Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
Iliyan Nazarali Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
Kiran Nazarali Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
AnnaElaine Rosengart Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
Javad Rahimikollu Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA The joint CMU-Pitt PhD program in computational biology, School of Medicine, University of Pittsburgh, PA, USA
Jeremy Tilstra Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Division of Rheumatology and Clinical Immunology, Department of Medicine, School of Medicine, University of Pittsburgh, PA, USA
Mark J. Shlomchik Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
David R. Koes Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
Alok V. Joglekar Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
Jishnu Das Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA

Ellaway JIJ, Anyango S, Nair S, Zaki HA, Nadzirin N, Powell HR, Gutmanas A, Varadi M, Velankar S. Identifying protein conformational states in the Protein Data Bank: Toward unlocking the potential of integrative dynamics studies. Struct Dyn 2024;11:034701. [PMID: 38774441 PMCID: PMC11106648 DOI: 10.1063/4.0000251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Accepted: 05/08/2024] [Indexed: 05/24/2024]

MacGowan SA, Madeira F, Britto-Borges T, Barton GJ. A unified analysis of evolutionary and population constraint in protein domains highlights structural features and pathogenic sites. Commun Biol 2024;7:447. [PMID: 38605212 PMCID: PMC11009406 DOI: 10.1038/s42003-024-06117-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 03/27/2024] [Indexed: 04/13/2024]

Reveguk I, Simonson T. Classifying protein kinase conformations with machine learning. Protein Sci 2024;33:e4918. [PMID: 38501429 PMCID: PMC10962494 DOI: 10.1002/pro.4918] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 01/02/2024] [Accepted: 01/22/2024] [Indexed: 03/20/2024]

Abstract

Protein kinases are key actors of signaling networks and important drug targets. They cycle between active and inactive conformations, distinguished by a few elements within the catalytic domain. One is the activation loop, whose conserved DFG motif can occupy DFG-in, DFG-out, and some rarer conformations. Annotation and classification of the structural kinome are important, as different conformations can be targeted by different inhibitors and activators. Valuable resources exist; however, large-scale applications will benefit from increased automation and interpretability of structural annotation. Interpretable machine learning models are described for this purpose, based on ensembles of decision trees. To train them, a set of catalytic domain sequences and structures was collected, somewhat larger and more diverse than existing resources. The structures were clustered based on the DFG conformation and manually annotated. They were then used as training input. Two main models were constructed, which distinguished active/inactive and in/out/other DFG conformations. They considered initially 1692 structural variables, spanning the whole catalytic domain, then identified ("learned") a small subset that sufficed for accurate classification. The first model correctly labeled all but 3 of 3289 structures as active or inactive, while the second assigned the correct DFG label to all but 17 of 8826 structures. The most potent classifying variables were all related to well-known structural elements in or near the activation loop and their ranking gives insights into the conformational preferences. The models were used to automatically annotate 3850 kinase structures predicted recently with the Alphafold2 tool, showing that Alphafold2 reproduced the active/inactive but not the DFG-in proportions seen in the Protein Data Bank. We expect the models will be useful for understanding and engineering kinases.

Xiong D, Qiu Y, Zhao J, Zhou Y, Lee D, Gupta S, Torres M, Lu W, Liang S, Kang JJ, Eng C, Loscalzo J, Cheng F, Yu H. Structurally-informed human interactome reveals proteome-wide perturbations by disease mutations. bioRxiv 2024:2023.04.24.538110. [PMID: 37162909 PMCID: PMC10168245 DOI: 10.1101/2023.04.24.538110] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]

Abstract

Human genome sequencing studies have identified numerous loci associated with complex diseases. However, translating human genetic and genomic findings to disease pathobiology and therapeutic discovery remains a major challenge at multiscale interactome network levels. Here, we present a deep-learning-based ensemble framework, termed PIONEER (Protein-protein InteractiOn iNtErfacE pRediction), that accurately predicts protein binding partner-specific interfaces for all known protein interactions in humans and seven other common model organisms, generating comprehensive structurally-informed protein interactomes. We demonstrate that PIONEER outperforms existing state-of-the-art methods. We further systematically validated PIONEER predictions experimentally through generating 2,395 mutations and testing their impact on 6,754 mutation-interaction pairs, confirming the high quality and validity of PIONEER predictions. We show that disease-associated mutations are enriched in PIONEER-predicted protein-protein interfaces after mapping mutations from ~60,000 germline exomes and ~36,000 somatic genomes. We identify 586 significant protein-protein interactions (PPIs) enriched with PIONEER-predicted interface somatic mutations (termed oncoPPIs) from pan-cancer analysis of ~11,000 tumor whole-exomes across 33 cancer types. We show that PIONEER-predicted oncoPPIs are significantly associated with patient survival and drug responses from both cancer cell lines and patient-derived xenograft mouse models. We identify a landscape of PPI-perturbing tumor alleles upon ubiquitination by E3 ligases, and we experimentally validate the tumorigenic KEAP1-NRF2 interface mutation p.Thr80Lys in non-small cell lung cancer. 
We show that PIONEER-predicted PPI-perturbing alleles alter protein abundance and correlate with drug responses and patient survival in colon and uterine cancers, as demonstrated by proteogenomic data from the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium. PIONEER, implemented as both a web server platform and a software package, identifies functional consequences of disease-associated alleles and offers a deep learning tool for precision medicine at multiscale interactome network levels.

Affiliation(s)

Dapeng Xiong Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA Center for Innovative Proteomics, Cornell University, Ithaca, NY 14853, USA
Yunguang Qiu Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
Junfei Zhao Department of Systems Biology, Herbert Irving Comprehensive Center, Columbia University, New York, NY 10032, USA
Yadi Zhou Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
Dongjin Lee Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
Shobhita Gupta Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA Center for Innovative Proteomics, Cornell University, Ithaca, NY 14853, USA Biophysics Program, Cornell University, Ithaca, NY 14853, USA
Mateo Torres Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA Center for Innovative Proteomics, Cornell University, Ithaca, NY 14853, USA
Weiqiang Lu Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
Siqi Liang Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
Jin Joo Kang Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA Center for Innovative Proteomics, Cornell University, Ithaca, NY 14853, USA
Charis Eng Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA
Joseph Loscalzo Channing Division of Network Medicine, Division of Cardiovascular Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
Feixiong Cheng Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA
Haiyuan Yu Department of Computational Biology, Cornell University, Ithaca, NY 14853, USA Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA Center for Innovative Proteomics, Cornell University, Ithaca, NY 14853, USA

Giulini M, Honorato RV, Rivera JL, Bonvin AMJJ. ARCTIC-3D: automatic retrieval and clustering of interfaces in complexes from 3D structural information. Commun Biol 2024;7:49. [PMID: 38184711 PMCID: PMC10771469 DOI: 10.1038/s42003-023-05718-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2023] [Accepted: 12/18/2023] [Indexed: 01/08/2024] Open

Wu F, Wu L, Radev D, Xu J, Li SZ. Integration of pre-trained protein language models into geometric deep learning networks. Commun Biol 2023;6:876. [PMID: 37626165 PMCID: PMC10457366 DOI: 10.1038/s42003-023-05133-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2023] [Accepted: 07/11/2023] [Indexed: 08/27/2023] Open

Matic M, Miglionico P, Tatsumi M, Inoue A, Raimondi F. GPCRome-wide analysis of G-protein-coupling diversity using a computational biology approach. Nat Commun 2023;14:4361. [PMID: 37468476 DOI: 10.1038/s41467-023-40045-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2022] [Accepted: 07/10/2023] [Indexed: 07/21/2023] Open

Mangione W, Falls Z, Samudrala R. Effective holistic characterization of small molecule effects using heterogeneous biological networks. Front Pharmacol 2023;14:1113007. [PMID: 37180722 PMCID: PMC10169664 DOI: 10.3389/fphar.2023.1113007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2022] [Accepted: 04/11/2023] [Indexed: 05/16/2023] Open

Abstract

The two most common reasons for attrition in therapeutic clinical trials are efficacy and safety. We integrated heterogeneous data to create a human interactome network to comprehensively describe drug behavior in biological systems, with the goal of accurate therapeutic candidate generation. The Computational Analysis of Novel Drug Opportunities (CANDO) platform for shotgun multiscale therapeutic discovery, repurposing, and design was enhanced by integrating drug side effects, protein pathways, protein-protein interactions, protein-disease associations, and the Gene Ontology, and complemented with its existing drug/compound, protein, and indication libraries. These integrated networks were reduced to a "multiscale interactomic signature" for each compound that describes its functional behavior as vectors of real values. These signatures are then used for relating compounds to each other with the hypothesis that similar signatures yield similar behavior. Our results indicated that there is significant biological information captured within our networks (particularly via side effects) which enhances the performance of our platform, as evaluated by performing all-against-all leave-one-out drug-indication association benchmarking as well as generating novel drug candidates for colon cancer and migraine disorders corroborated via literature search. Further, drug impacts on pathways derived from computed compound-protein interaction scores served as the features for a random forest machine learning model trained to predict drug-indication associations, with applications to mental disorders and cancer metastasis highlighted. This interactomic pipeline highlights the ability of Computational Analysis of Novel Drug Opportunities to accurately relate drugs in a multitarget and multiscale context, particularly for generating putative drug candidates using the information gleaned from indirect data such as side effect profiles and protein pathway information.

Choudhary P, Anyango S, Berrisford J, Tolchard J, Varadi M, Velankar S. Unified access to up-to-date residue-level annotations from UniProtKB and other biological databases for PDB data. Sci Data 2023;10:204. [PMID: 37045837 PMCID: PMC10097656 DOI: 10.1038/s41597-023-02101-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2022] [Accepted: 03/23/2023] [Indexed: 04/14/2023] Open

Trudeau SJ, Hwang H, Mathur D, Begum K, Petrey D, Murray D, Honig B. PrePCI: A structure- and chemical similarity-informed database of predicted protein compound interactions. Protein Sci 2023;32:e4594. [PMID: 36776141 PMCID: PMC10019447 DOI: 10.1002/pro.4594] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2022] [Revised: 02/07/2023] [Accepted: 02/09/2023] [Indexed: 02/14/2023]

Zhang J, Zhou F, Liang X, Yang G. SCAMPER: Accurate Type-Specific Prediction of Calcium-Binding Residues Using Sequence-Derived Features. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023;20:1406-1416. [PMID: 35536812 DOI: 10.1109/tcbb.2022.3173437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]

Burley SK, Bhikadiya C, Bi C, Bittrich S, Chao H, Chen L, Craig PA, Crichlow GV, Dalenberg K, Duarte JM, Dutta S, Fayazi M, Feng Z, Flatt JW, Ganesan S, Ghosh S, Goodsell DS, Green RK, Guranovic V, Henry J, Hudson BP, Khokhriakov I, Lawson CL, Liang Y, Lowe R, Peisach E, Persikova I, Piehl DW, Rose Y, Sali A, Segura J, Sekharan M, Shao C, Vallat B, Voigt M, Webb B, Westbrook JD, Whetstone S, Young JY, Zalevsky A, Zardecki C. RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Res 2023;51:D488-D508. [PMID: 36420884 PMCID: PMC9825554 DOI: 10.1093/nar/gkac1077] [Citation(s) in RCA: 189] [Impact Index Per Article: 189.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2022] [Revised: 10/17/2022] [Accepted: 11/02/2022] [Indexed: 11/27/2022] Open

Affiliation(s)

Stephen K Burley Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Rutgers Cancer Institute of New Jersey, New Brunswick, NJ 08901, USA Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, La Jolla, CA 92093, USA
Charmi Bhikadiya Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, La Jolla, CA 92093, USA
Chunxiao Bi Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, La Jolla, CA 92093, USA
Sebastian Bittrich Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, La Jolla, CA 92093, USA
Henry Chao Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Li Chen Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Paul A Craig School of Chemistry and Materials Science, Rochester Institute of Technology, Rochester, NY 14623, USA
Gregg V Crichlow Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Kenneth Dalenberg Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Jose M Duarte Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, La Jolla, CA 92093, USA
Shuchismita Dutta Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Rutgers Cancer Institute of New Jersey, New Brunswick, NJ 08901, USA
Maryam Fayazi Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Zukang Feng Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Justin W Flatt Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Sai Ganesan Research Collaboratory for Structural Bioinformatics Protein Data Bank, Department of Bioengineering and Therapeutic Sciences, Department of Pharmaceutical Chemistry, Quantitative Biosciences Institute, University of California San Francisco, San Francisco, CA 94158, USA
Sutapa Ghosh Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
David S Goodsell Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Rutgers Cancer Institute of New Jersey, New Brunswick, NJ 08901, USA Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
Rachel Kramer Green Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Vladimir Guranovic Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Jeremy Henry Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, La Jolla, CA 92093, USA
Brian P Hudson Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Igor Khokhriakov Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, La Jolla, CA 92093, USA
Catherine L Lawson Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Yuhe Liang Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Robert Lowe Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Ezra Peisach Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Irina Persikova Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Dennis W Piehl Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Yana Rose Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, La Jolla, CA 92093, USA
Andrej Sali Research Collaboratory for Structural Bioinformatics Protein Data Bank, Department of Bioengineering and Therapeutic Sciences, Department of Pharmaceutical Chemistry, Quantitative Biosciences Institute, University of California San Francisco, San Francisco, CA 94158, USA
Joan Segura Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, La Jolla, CA 92093, USA
Monica Sekharan Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Chenghua Shao Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Brinda Vallat Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Maria Voigt Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Ben Webb Research Collaboratory for Structural Bioinformatics Protein Data Bank, Department of Bioengineering and Therapeutic Sciences, Department of Pharmaceutical Chemistry, Quantitative Biosciences Institute, University of California San Francisco, San Francisco, CA 94158, USA
John D Westbrook Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Rutgers Cancer Institute of New Jersey, New Brunswick, NJ 08901, USA
Shamara Whetstone Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Jasmine Y Young Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
Arthur Zalevsky Research Collaboratory for Structural Bioinformatics Protein Data Bank, Department of Bioengineering and Therapeutic Sciences, Department of Pharmaceutical Chemistry, Quantitative Biosciences Institute, University of California San Francisco, San Francisco, CA 94158, USA
Christine Zardecki Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA

Wu TH, Lin PC, Chou HH, Shen MR, Hsieh SY. Pathogenicity Prediction of Single Amino Acid Variants With Machine Learning Model Based on Protein Structural Energies. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023;20:606-615. [PMID: 34962874 DOI: 10.1109/tcbb.2021.3139048] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]

Joseph AP, Malhotra S, Burnley T, Winn MD. Overview and applications of map and model validation tools in the CCP-EM software suite. Faraday Discuss 2022;240:196-209. [PMID: 35916020 PMCID: PMC9642004 DOI: 10.1039/d2fd00103a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023]

Velecký J, Hamsikova M, Stourac J, Musil M, Damborský J, Bednar D, Mazurenko S. SoluProtMutDB: a manually curated database of protein solubility changes upon mutations. Comput Struct Biotechnol J 2022;20:6339-6347. [DOI: 10.1016/j.csbj.2022.11.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2022] [Revised: 11/04/2022] [Accepted: 11/04/2022] [Indexed: 11/11/2022] Open

Ma W, Zhang S, Li Z, Jiang M, Wang S, Lu W, Bi X, Jiang H, Zhang H, Wei Z. Enhancing Protein Function Prediction Performance by Utilizing AlphaFold-Predicted Protein Structures. J Chem Inf Model 2022;62:4008-4017. [PMID: 36006049 DOI: 10.1021/acs.jcim.2c00885] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]

Bernhofer M, Rost B. TMbed: transmembrane proteins predicted through language model embeddings. BMC Bioinformatics 2022;23:326. [PMID: 35941534 PMCID: PMC9358067 DOI: 10.1186/s12859-022-04873-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2022] [Accepted: 08/03/2022] [Indexed: 12/30/2022] Open

Abstract

BACKGROUND

Despite the immense importance of transmembrane proteins (TMP) for molecular biology and medicine, experimental 3D structures for TMPs remain about 4-5 times underrepresented compared to non-TMPs. Today's top methods such as AlphaFold2 accurately predict 3D structures for many TMPs, but annotating transmembrane regions remains a limiting step for proteome-wide predictions.

RESULTS

Here, we present TMbed, a novel method inputting embeddings from protein Language Models (pLMs, here ProtT5), to predict for each residue one of four classes: transmembrane helix (TMH), transmembrane strand (TMB), signal peptide, or other. TMbed completes predictions for entire proteomes within hours on a single consumer-grade desktop machine at performance levels similar to or better than those of methods that use evolutionary information from multiple sequence alignments (MSAs) of protein families. On the per-protein level, TMbed correctly identified 94 ± 8% of the beta barrel TMPs (53 of 57) and 98 ± 1% of the alpha helical TMPs (557 of 571) in a non-redundant data set, at false positive rates well below 1% (erred on 30 of 5654 non-membrane proteins). On the per-segment level, TMbed correctly placed, on average, 9 of 10 transmembrane segments within five residues of the experimental observation. Our method can handle sequences of up to 4200 residues on standard graphics cards used in desktop PCs (e.g., NVIDIA GeForce RTX 3060).

CONCLUSIONS

Based on embeddings from pLMs and two novel filters (Gaussian and Viterbi), TMbed predicts alpha helical and beta barrel TMPs at least as accurately as any other method but at lower false positive rates. Given the few false positives and its outstanding speed, TMbed might be ideal to sieve through millions of 3D structures soon to be predicted, e.g., by AlphaFold2.

Moriwaki H, Saito S, Matsumoto T, Serizawa T, Kunimoto R. Global Analysis of Deep Learning Prediction Using Large-Scale In-House Kinome-Wide Profiling Data. ACS OMEGA 2022;7:18374-18381. [PMID: 35694454 PMCID: PMC9178758 DOI: 10.1021/acsomega.2c00664] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/01/2022] [Accepted: 05/12/2022] [Indexed: 06/11/2023]

Raimondi D, Codicè F, Orlando G, Schymkowitz J, Rousseau F, Moreau Y. HPMPdb: a machine learning-ready database of protein molecular phenotypes associated to human missense variants. Curr Res Struct Biol 2022;4:167-174. [PMID: 35669450 PMCID: PMC9166469 DOI: 10.1016/j.crstbi.2022.04.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2021] [Revised: 03/24/2022] [Accepted: 04/25/2022] [Indexed: 11/10/2022] Open

Abstract

Current human Single Amino acid Variants (SAVs) databases provide a link between SAVs and their effect on the carrier individual phenotype, often dividing them into Deleterious/Neutral variants. This is a very coarse-grained description of the genotype-to-phenotype relationship because it relies on unrealistic assumptions such as the perfect Mendelian behavior of each SAV and considers only dichotomic phenotypes. Moreover, the link between the effect of a SAV on a protein (its molecular phenotype) and the individual phenotype is often very complex, because multiple levels of biological abstraction connect the protein and individual level phenotypes. Here we present HPMPdb, a manually curated database containing human SAVs associated with the detailed description of the molecular phenotype they cause on the affected proteins. With particular regard to machine learning (ML), this database can be used to let researchers go beyond the existing Deleterious/Neutral prediction paradigm, allowing them to build molecular phenotype predictors instead. Our class labels describe in a succinct way the effects that each SAV has on 15 protein molecular phenotypes, such as protein-protein interaction, small-molecule binding, function, post-translational modifications (PTMs), sub-cellular localization, mimetic PTM, folding and protein expression. Moreover, we provide researchers with all necessary means to reproducibly train and test their models on our database. The webserver and the data described in this paper are available at hpmp.esat.kuleuven.be.

• Current variant-effect predictors perform coarse-grained modeling and rely on unrealistic assumptions.
• The link between the effect of a variant and the individual phenotype is complex. It would be more intuitive to predict the molecular phenotype that each variant causes on the carrier protein.
• HPMP is a manually curated database containing human variants associated with the molecular phenotype they cause on the affected proteins.
• We manually translated variants from UniProt into 15 Machine Learning-ready labels describing the affected protein molecular phenotype.
• The goal of HPMP is to allow researchers to go beyond the existing variant-effect prediction paradigm and allow them to build molecular phenotype predictors instead.
• The webserver and the data described in this paper are available at hpmp.esat.kuleuven.be

Ammar A, Cavill R, Evelo C, Willighagen E. PSnpBind: a database of mutated binding site protein-ligand complexes constructed using a multithreaded virtual screening workflow. J Cheminform 2022;14:8. [PMID: 35227289 PMCID: PMC8886843 DOI: 10.1186/s13321-021-00573-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2021] [Accepted: 11/18/2021] [Indexed: 11/15/2022] Open

Stringer B, de Ferrante H, Abeln S, Heringa J, Feenstra KA, Haydarlou R. PIPENN: protein interface prediction from sequence with an ensemble of neural nets. Bioinformatics 2022;38:2111-2118. [PMID: 35150231 PMCID: PMC9004643 DOI: 10.1093/bioinformatics/btac071] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2021] [Revised: 01/16/2022] [Accepted: 02/04/2022] [Indexed: 02/03/2023] Open

Abstract

MOTIVATION

The interactions between proteins and other molecules are essential to many biological and cellular processes. Experimental identification of interface residues is a time-consuming, costly and challenging task, while protein sequence data are ubiquitous. Consequently, many computational and machine learning approaches have been developed over the years to predict such interface residues from sequence. However, the effectiveness of different Deep Learning (DL) architectures and learning strategies for protein-protein, protein-nucleotide and protein-small molecule interface prediction has not yet been investigated in great detail. Therefore, we here explore the prediction of protein interface residues using six DL architectures and various learning strategies with sequence-derived input features.

RESULTS

We constructed a large dataset dubbed BioDL, comprising protein-protein interactions from the PDB, and DNA/RNA and small molecule interactions from the BioLip database. We also constructed six DL architectures, and evaluated them on the BioDL benchmarks. This shows that no single architecture performs best on all instances. An ensemble architecture, which combines all six architectures, does consistently achieve peak prediction accuracy. We confirmed these results on the published benchmark set by Zhang and Kurgan (ZK448), and on our own existing curated homo- and heteromeric protein interaction dataset. Our PIPENN sequence-based ensemble predictor outperforms current state-of-the-art sequence-based protein interface predictors on ZK448 on all interaction types, achieving an AUC-ROC of 0.718 for protein-protein, 0.823 for protein-nucleotide and 0.842 for protein-small molecule.

AVAILABILITY AND IMPLEMENTATION

Source code and datasets are available at https://github.com/ibivu/pipenn/.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Chen YC, Chen YH, Wright JD, Lim C. PPI-HotspotDB: Database of Protein-Protein Interaction Hot Spots. J Chem Inf Model 2022;62:1052-1060. [PMID: 35147037 DOI: 10.1021/acs.jcim.2c00025] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]

Modi V, Dunbrack RL. Kincore: a web resource for structural classification of protein kinases and their inhibitors. Nucleic Acids Res 2022;50:D654-D664. [PMID: 34643709 PMCID: PMC8728253 DOI: 10.1093/nar/gkab920] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2021] [Revised: 09/21/2021] [Accepted: 09/28/2021] [Indexed: 11/13/2022] Open

Kumar G, Srinivasan N, Sandhya S. Profiles of Natural and Designed Protein-Like Sequences Effectively Bridge Protein Sequence Gaps: Implications in Distant Homology Detection. Methods Mol Biol 2022;2449:149-167. [PMID: 35507261 DOI: 10.1007/978-1-0716-2095-3_5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]

Lu H, Li F, Yuan L, Domenzain I, Yu R, Wang H, Li G, Chen Y, Ji B, Kerkhoven EJ, Nielsen J. Yeast metabolic innovations emerged via expanded metabolic network and gene positive selection. Mol Syst Biol 2021;17:e10427. [PMID: 34676984 PMCID: PMC8532513 DOI: 10.15252/msb.202110427] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2021] [Revised: 10/02/2021] [Accepted: 10/04/2021] [Indexed: 12/24/2022] Open

Liu HF, Liu R. Structure-based prediction of post-translational modification cross-talk within proteins using complementary residue- and residue pair-based features. Brief Bioinform 2021;21:609-620. [PMID: 30649184 DOI: 10.1093/bib/bby123] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2018] [Revised: 11/26/2018] [Accepted: 11/30/2018] [Indexed: 02/07/2023] Open

Utgés JS, Tsenkov MI, Dietrich NJM, MacGowan SA, Barton GJ. Ankyrin repeats in context with human population variation. PLoS Comput Biol 2021;17:e1009335. [PMID: 34428215 PMCID: PMC8415598 DOI: 10.1371/journal.pcbi.1009335] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2021] [Revised: 09/03/2021] [Accepted: 08/10/2021] [Indexed: 11/19/2022] Open

Wang X, Zhang X, Peng C, Shi Y, Li H, Xu Z, Zhu W. D3DistalMutation: a Database to Explore the Effect of Distal Mutations on Enzyme Activity. J Chem Inf Model 2021;61:2499-2508. [PMID: 33938221 DOI: 10.1021/acs.jcim.1c00318] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]

Chukwudozie OS, Gray CM, Fagbayi TA, Chukwuanukwu RC, Oyebanji VO, Bankole TT, Adewole RA, Daniel EM. Immuno-informatics design of a multimeric epitope peptide based vaccine targeting SARS-CoV-2 spike glycoprotein. PLoS One 2021;16:e0248061. [PMID: 33730022 PMCID: PMC7968690 DOI: 10.1371/journal.pone.0248061] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2020] [Accepted: 02/18/2021] [Indexed: 12/20/2022] Open

Green AG, Elhabashy H, Brock KP, Maddamsetti R, Kohlbacher O, Marks DS. Large-scale discovery of protein interactions at residue resolution using co-evolution calculated from genomic sequences. Nat Commun 2021;12:1396. [PMID: 33654096 PMCID: PMC7925567 DOI: 10.1038/s41467-021-21636-z] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2020] [Accepted: 01/27/2021] [Indexed: 12/28/2022] Open

Monzon AM, Bonato P, Necci M, Tosatto SCE, Piovesan D. FLIPPER: Predicting and Characterizing Linear Interacting Peptides in the Protein Data Bank. J Mol Biol 2021;433:166900. [PMID: 33647288 DOI: 10.1016/j.jmb.2021.166900] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2020] [Revised: 02/22/2021] [Accepted: 02/22/2021] [Indexed: 12/31/2022]

Sayılgan JF, Haliloğlu T, Gönen M. Protein dynamics analysis identifies candidate cancer driver genes and mutations in TCGA data. Proteins 2021;89:721-730. [PMID: 33550612 DOI: 10.1002/prot.26054] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2020] [Revised: 01/04/2021] [Accepted: 01/31/2021] [Indexed: 11/09/2022]

Kooistra AJ, Mordalski S, Pándy-Szekeres G, Esguerra M, Mamyrbekov A, Munk C, Keserű GM, Gloriam D. GPCRdb in 2021: integrating GPCR sequence, structure and function. Nucleic Acids Res 2021;49:D335-D343. [PMID: 33270898 PMCID: PMC7778909 DOI: 10.1093/nar/gkaa1080] [Citation(s) in RCA: 220] [Impact Index Per Article: 73.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2020] [Revised: 10/20/2020] [Accepted: 10/22/2020] [Indexed: 01/27/2023] Open

Bradley D, Viéitez C, Rajeeve V, Selkrig J, Cutillas PR, Beltrao P. Sequence and Structure-Based Analysis of Specificity Determinants in Eukaryotic Protein Kinases. Cell Rep 2021;34:108602. [PMID: 33440154 PMCID: PMC7809594 DOI: 10.1016/j.celrep.2020.108602] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2018] [Revised: 11/03/2020] [Accepted: 12/14/2020] [Indexed: 01/04/2023] Open

Iqbal S, Pérez-Palma E, Jespersen JB, May P, Hoksza D, Heyne HO, Ahmed SS, Rifat ZT, Rahman MS, Lage K, Palotie A, Cottrell JR, Wagner FF, Daly MJ, Campbell AJ, Lal D. Comprehensive characterization of amino acid positions in protein structures reveals molecular effect of missense variants. Proc Natl Acad Sci U S A 2020;117:28201-28211. [PMID: 33106425 PMCID: PMC7668189 DOI: 10.1073/pnas.2002660117] [Citation(s) in RCA: 55] [Impact Index Per Article: 13.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022] Open

Abstract

Interpretation of the colossal number of genetic variants identified from sequencing applications is one of the major bottlenecks in clinical genetics, with the inference of the effect of amino acid-substituting missense variations on protein structure and function being especially challenging. Here we characterize the three-dimensional (3D) amino acid positions affected in pathogenic and population variants from 1,330 disease-associated genes using over 14,000 experimentally solved human protein structures. By measuring the statistical burden of variations (i.e., point mutations) from all genes on 40 3D protein features, accounting for the structural, chemical, and functional context of the variations' positions, we identify features that are generally associated with pathogenic and population missense variants. We then perform the same amino acid-level analysis individually for 24 protein functional classes, which reveals unique characteristics of the positions of the altered amino acids: We observe up to 46% divergence of the class-specific features from the general characteristics obtained by the analysis on all genes, which is consistent with the structural diversity of essential regions across different protein classes. We demonstrate that the function-specific 3D features of the variants match the readouts of mutagenesis experiments for BRCA1 and PTEN, and positively correlate with an independent set of clinically interpreted pathogenic and benign missense variants. Finally, we make our results available through a web server to foster accessibility and downstream research. Our findings represent a crucial step toward translational genetics, from highlighting the impact of mutations on protein structure to rationalizing the variants' pathogenicity in terms of the perturbed molecular mechanisms.

Affiliation(s)

Sumaiya Iqbal Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA 02142; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114
Eduardo Pérez-Palma Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195
Jakob B Jespersen Department of Bio and Health Informatics, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
Patrick May Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 4365 Esch-sur-Alzette, Luxembourg
David Hoksza Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 4365 Esch-sur-Alzette, Luxembourg; Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Prague 11636, Czech Republic
Henrike O Heyne Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114; Institute for Molecular Medicine Finland, University of Helsinki, 00100 Helsinki, Finland
Shehab S Ahmed Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1205, Bangladesh
Zaara T Rifat Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1205, Bangladesh
M Sohel Rahman Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1205, Bangladesh
Kasper Lage Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142; Department of Surgery, Massachusetts General Hospital, Boston, MA 02114
Aarno Palotie Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142; Institute for Molecular Medicine Finland, University of Helsinki, 00100 Helsinki, Finland
Jeffrey R Cottrell Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142
Florence F Wagner Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA 02142; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142
Mark J Daly Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142; Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114; Institute for Molecular Medicine Finland, University of Helsinki, 00100 Helsinki, Finland
Arthur J Campbell Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA 02142; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142
Dennis Lal Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142; Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195; Cologne Center for Genomics, University of Cologne, 50931 Cologne, Germany; Epilepsy Center, Neurological Institute, Cleveland Clinic, Cleveland, OH 44195

Qiu J, Nechaev D, Rost B. Protein-protein and protein-nucleic acid binding residues important for common and rare sequence variants in human. BMC Bioinformatics 2020;21:452. [PMID: 33050876 PMCID: PMC7557062 DOI: 10.1186/s12859-020-03759-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2020] [Accepted: 09/16/2020] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Any two unrelated people differ by about 20,000 missense mutations (also referred to as SAVs: Single Amino acid Variants or missense SNV). Many SAVs have been predicted to strongly affect molecular protein function. Common SAVs (> 5% of population) were predicted to have, on average, more effect on molecular protein function than rare SAVs (< 1% of population). We hypothesized that the prevalence of effect in common over rare SAVs might partially be caused by common SAVs more often occurring at interfaces of proteins with other proteins, DNA, or RNA, thereby creating subgroup-specific phenotypes. We analyzed SAVs from 60,706 people through the lens of two prediction methods, one (SNAP2) predicting the effects of SAVs on molecular protein function, the other (ProNA2020) predicting residues in DNA-, RNA- and protein-binding interfaces.

RESULTS

Three results stood out. Firstly, SAVs predicted to occur at binding interfaces were predicted to more likely affect molecular function than those predicted as not binding (p value < 2.2 × 10^-16). Secondly, for SAVs predicted to occur at binding interfaces, common SAVs were predicted more strongly with effect on protein function than rare SAVs (p value < 2.2 × 10^-16). Restriction to SAVs with experimental annotations confirmed all results, although the resulting subsets were too small to establish statistical significance for any result. Thirdly, the fraction of SAVs predicted at binding interfaces differed significantly between tissues, e.g. urinary bladder tissue was found abundant in SAVs predicted at protein-binding interfaces, and reproductive tissues (ovary, testis, vagina, seminal vesicle and endometrium) in SAVs predicted at DNA-binding interfaces.

CONCLUSIONS

Overall, the results suggested that residues at protein-, DNA-, and RNA-binding interfaces contributed toward predicting that common SAVs more likely affect molecular function than rare SAVs.

Lasso G, Honig B, Shapira SD. A Sweep of Earth's Virome Reveals Host-Guided Viral Protein Structural Mimicry and Points to Determinants of Human Disease. Cell Syst 2020;12:82-91.e3. [PMID: 33053371 PMCID: PMC7552982 DOI: 10.1016/j.cels.2020.09.006] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2020] [Revised: 09/03/2020] [Accepted: 09/18/2020] [Indexed: 12/17/2022]

Iqbal S, Hoksza D, Pérez-Palma E, May P, Jespersen JB, Ahmed SS, Rifat ZT, Heyne HO, Rahman MS, Cottrell JR, Wagner FF, Daly MJ, Campbell AJ, Lal D. MISCAST: MIssense variant to protein StruCture Analysis web SuiTe. Nucleic Acids Res 2020;48:W132-W139. [PMID: 32402084 PMCID: PMC7319582 DOI: 10.1093/nar/gkaa361] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2020] [Revised: 04/17/2020] [Accepted: 05/11/2020] [Indexed: 12/19/2022] Open

Affiliation(s)

Sumaiya Iqbal Center for Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA
David Hoksza Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg; Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
Eduardo Pérez-Palma Genomic Medicine Institute, Lerner Research Institute Cleveland Clinic, Cleveland, OH 44195, USA
Patrick May Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
Jakob B Jespersen Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark
Shehab S Ahmed Computer Science and Engineering, Bangladesh University of Engineering and Technology, ECE Building, West Palashi, Dhaka-1205, Bangladesh
Zaara T Rifat Computer Science and Engineering, Bangladesh University of Engineering and Technology, ECE Building, West Palashi, Dhaka-1205, Bangladesh
Henrike O Heyne Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA; Institute for Molecular Medicine Finland (FIMM), University of Helsinki, 00100 Helsinki, Finland
M Sohel Rahman Computer Science and Engineering, Bangladesh University of Engineering and Technology, ECE Building, West Palashi, Dhaka-1205, Bangladesh
Jeffrey R Cottrell Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
Florence F Wagner Center for Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
Mark J Daly Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA; Institute for Molecular Medicine Finland (FIMM), University of Helsinki, 00100 Helsinki, Finland
Arthur J Campbell Center for Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
Dennis Lal Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Genomic Medicine Institute, Lerner Research Institute Cleveland Clinic, Cleveland, OH 44195, USA; Cologne Center for Genomics, University of Cologne, Cologne, Germany; Epilepsy Center, Neurological Institute, Cleveland Clinic, Cleveland, OH 44195, USA

Hanson J, Litfin T, Paliwal K, Zhou Y. Identifying molecular recognition features in intrinsically disordered regions of proteins by transfer learning. Bioinformatics 2020;36:1107-1113. [PMID: 31504193 DOI: 10.1093/bioinformatics/btz691] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2019] [Revised: 07/24/2019] [Accepted: 08/31/2019] [Indexed: 11/12/2022] Open

Abstract

MOTIVATION

Protein intrinsic disorder describes the tendency of sequence residues to not fold into a rigid three-dimensional shape by themselves. However, some of these disordered regions can transition from disorder to order when interacting with another molecule in segments known as molecular recognition features (MoRFs). Previous analysis has shown that these MoRF regions are indirectly encoded within the prediction of residue disorder as low-confidence predictions [i.e. in a semi-disordered state P(D)≈0.5]. Thus, what has been learned for disorder prediction may be transferable to MoRF prediction. Transferring the internal characterization of protein disorder for the prediction of MoRF residues would allow us to take advantage of the large training set available for disorder prediction, enabling the training of larger analytical models than is currently feasible on the small number of currently available annotated MoRF proteins. In this paper, we propose a new method for MoRF prediction by transfer learning from the SPOT-Disorder2 ensemble models built for disorder prediction.

RESULTS

We confirm that directly training on the MoRF set with a randomly initialized model yields substantially poorer performance on independent test sets than by using the transfer-learning-based method SPOT-MoRF, for both deep and simple networks. Its comparison to current state-of-the-art techniques reveals its superior performance in identifying MoRF binding regions in proteins across two independent testing sets, including our new dataset of >800 protein chains. These test chains share <30% sequence similarity to all training and validation proteins used in SPOT-Disorder2 and SPOT-MoRF, and provide a much-needed large-scale update on the performance of current MoRF predictors. The method is expected to be useful in locating functional disordered regions in proteins.

AVAILABILITY AND IMPLEMENTATION

SPOT-MoRF and its data are available as a web server and as a standalone program at: http://sparks-lab.org/jack/server/SPOT-MoRF/index.php.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Tang ZZ, Sliwoski GR, Chen G, Jin B, Bush WS, Li B, Capra JA. PSCAN: Spatial scan tests guided by protein structures improve complex disease gene discovery and signal variant detection. Genome Biol 2020;21:217. [PMID: 32847609 PMCID: PMC7448521 DOI: 10.1186/s13059-020-02121-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2019] [Accepted: 07/27/2020] [Indexed: 12/25/2022] Open