26
|
Mezulis S, Sternberg MJE, Kelley LA. PhyreStorm: A Web Server for Fast Structural Searches Against the PDB. J Mol Biol 2015; 428:702-708. [PMID: 26517951 PMCID: PMC7610957 DOI: 10.1016/j.jmb.2015.10.017] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 07/30/2015] [Revised: 10/13/2015] [Accepted: 10/18/2015] [Indexed: 11/10/2022]
Abstract
The identification of structurally similar proteins can provide a range of biological insights, and accordingly, the alignment of a query protein to a database of experimentally determined protein structures is a technique commonly used in the fields of structural and evolutionary biology. The PhyreStorm Web server has been designed to provide comprehensive, up-to-date and rapid structural comparisons against the Protein Data Bank (PDB) combined with a rich and intuitive user interface. It is intended that this facility will enable biologists inexpert in bioinformatics access to a powerful tool for exploring protein structure relationships beyond what can be achieved by sequence analysis alone. By partitioning the PDB into similar structures, PhyreStorm is able to quickly discard the majority of structures that cannot possibly align well to a query protein, reducing the number of alignments required by an order of magnitude. PhyreStorm is capable of finding 93 ± 2% of all highly similar (TM-score > 0.7) structures in the PDB for each query structure, usually in less than 60 s. PhyreStorm is available at http://www.sbg.bio.ic.ac.uk/phyrestorm/.
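The pruning idea in this abstract (partition the database, then skip whole partitions that cannot contain a good match) can be illustrated with a minimal sketch. This is not PhyreStorm's implementation: the `Cluster` class, the `pruned_search` function, the one-dimensional "structures" and the absolute-difference distance are all hypothetical stand-ins, and the sketch assumes a dissimilarity measure that obeys the triangle inequality, which a real structural score may only approximate.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cluster:
    """A partition of the database (hypothetical layout, for illustration only)."""
    representative: float   # toy "structure": a point on a line
    members: List[float]    # all structures assigned to this cluster
    radius: float           # largest distance from the representative to a member


def distance(a: float, b: float) -> float:
    """Toy metric standing in for a structural dissimilarity measure."""
    return abs(a - b)


def pruned_search(query: float, clusters: List[Cluster],
                  cutoff: float) -> Tuple[List[float], int]:
    """Return members within `cutoff` of the query and the number of comparisons made."""
    hits: List[float] = []
    comparisons = 0
    for cluster in clusters:
        comparisons += 1
        d_rep = distance(query, cluster.representative)
        # Triangle inequality: every member m has d(query, m) >= d_rep - radius,
        # so if that lower bound exceeds the cutoff no member can be a hit.
        if d_rep - cluster.radius > cutoff:
            continue
        for member in cluster.members:
            comparisons += 1
            if distance(query, member) <= cutoff:
                hits.append(member)
    return hits, comparisons


if __name__ == "__main__":
    database = [
        Cluster(0.0, [0.0, 0.5, 1.0], 1.0),
        Cluster(10.0, [9.5, 10.0, 10.4], 0.5),
        Cluster(20.0, [19.0, 20.0, 21.0], 1.0),
    ]
    found, n = pruned_search(0.8, database, cutoff=0.5)
    print(found, n)
```

In this toy setting only one of the three clusters is scanned member by member; the other two are rejected after a single comparison against their representative, which is the source of the saving.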
|
27
|
Greener JG, Sternberg MJE. AlloPred: prediction of allosteric pockets on proteins using normal mode perturbation analysis. BMC Bioinformatics 2015; 16:335. [PMID: 26493317 PMCID: PMC4619270 DOI: 10.1186/s12859-015-0771-1] [Citation(s) in RCA: 67] [Impact Index Per Article: 7.4] [Received: 06/20/2015] [Accepted: 10/13/2015] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Despite being hugely important in biological processes, allostery is poorly understood and no universal mechanism has been discovered. Allosteric drugs are a largely unexplored prospect with many potential advantages over orthosteric drugs. Computational methods to predict allosteric sites on proteins are needed to aid the discovery of allosteric drugs, as well as to advance our fundamental understanding of allostery. RESULTS AlloPred, a novel method to predict allosteric pockets on proteins, was developed. AlloPred uses perturbation of normal modes alongside pocket descriptors in a machine learning approach that ranks the pockets on a protein. AlloPred ranked an allosteric pocket top for 23 out of 40 known allosteric proteins, showing comparable and complementary performance to two existing methods. In 28 of 40 cases an allosteric pocket was ranked first or second. The AlloPred web server, freely available at http://www.sbg.bio.ic.ac.uk/allopred/home, allows visualisation and analysis of predictions. The source code and dataset information are also available from this site. CONCLUSIONS Perturbation of normal modes can enhance our ability to predict allosteric sites on proteins. Computational methods such as AlloPred assist drug discovery efforts by suggesting sites on proteins for further experimental study.
|
28
|
Reynolds CR, Muggleton SH, Sternberg MJE. Incorporating Virtual Reactions into a Logic-based Ligand-based Virtual Screening Method to Discover New Leads. Mol Inform 2015; 34:615-625. [PMID: 26583052 PMCID: PMC4641463 DOI: 10.1002/minf.201400162] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 10/31/2014] [Accepted: 01/08/2015] [Indexed: 11/28/2022]
Abstract
The use of virtual screening has become increasingly central to the drug development pipeline, with ligand-based virtual screening used to screen databases of compounds to predict their bioactivity against a target. These databases can only represent a small fraction of chemical space, and this paper describes a method of exploring synthetic space by applying virtual reactions to promising compounds within a database, and generating focussed libraries of predicted derivatives. A ligand-based virtual screening tool Investigational Novel Drug Discovery by Example (INDDEx) is used as the basis for a system of virtual reactions. The use of virtual reactions is estimated to open up a potential space of 1.21 × 10^12 potential molecules. A de novo design algorithm known as Partial Logical-Rule Reactant Selection (PLoRRS) is introduced and incorporated into the INDDEx methodology. PLoRRS uses logical rules from the INDDEx model to select reactants for the de novo generation of potentially active products. The PLoRRS method is found to increase significantly the likelihood of retrieving molecules similar to known actives with a p-value of 0.016. Case studies demonstrate that the virtual reactions produce molecules highly similar to known actives, including known blockbuster drugs.
|
29
|
Cornish AJ, Filippis I, David A, Sternberg MJE. Exploring the cellular basis of human disease through a large-scale mapping of deleterious genes to cell types. Genome Med 2015; 7:95. [PMID: 26330083 PMCID: PMC4557825 DOI: 10.1186/s13073-015-0212-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 02/12/2015] [Accepted: 07/31/2015] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Each cell type found within the human body performs a diverse and unique set of functions, the disruption of which can lead to disease. However, there currently exists no systematic mapping between cell types and the diseases they can cause. METHODS In this study, we integrate protein-protein interaction data with high-quality cell-type-specific gene expression data from the FANTOM5 project to build the largest collection of cell-type-specific interactomes created to date. We develop a novel method, called gene set compactness (GSC), that contrasts the relative positions of disease-associated genes across 73 cell-type-specific interactomes to map genes associated with 196 diseases to the cell types they affect. We conduct text-mining of the PubMed database to produce an independent resource of disease-associated cell types, which we use to validate our method. RESULTS The GSC method successfully identifies known disease-cell-type associations, as well as highlighting associations that warrant further study. This includes mast cells and multiple sclerosis, a cell population currently being targeted in a multiple sclerosis phase 2 clinical trial. Furthermore, we build a cell-type-based diseasome using the cell types identified as manifesting each disease, offering insight into diseases linked through etiology. CONCLUSIONS The data set produced in this study represents the first large-scale mapping of diseases to the cell types in which they are manifested and will therefore be useful in the study of disease systems. Overall, we demonstrate that our approach links disease-associated genes to the phenotypes they produce, a key goal within systems medicine.
|
30
|
David A, Sternberg MJE. The Contribution of Missense Mutations in Core and Rim Residues of Protein-Protein Interfaces to Human Disease. J Mol Biol 2015; 427:2886-98. [PMID: 26173036 PMCID: PMC4548493 DOI: 10.1016/j.jmb.2015.07.004] [Citation(s) in RCA: 81] [Impact Index Per Article: 9.0] [Received: 05/01/2015] [Revised: 06/19/2015] [Accepted: 07/06/2015] [Indexed: 01/21/2023]
Abstract
Missense mutations at protein–protein interaction sites, called interfaces, are important contributors to human disease. Interfaces are non-uniform surface areas characterized by two main regions, “core” and “rim”, which differ in terms of evolutionary conservation and physicochemical properties. Moreover, within interfaces, only a small subset of residues (“hot spots”) is crucial for the binding free energy of the protein–protein complex. We performed a large-scale structural analysis of human single amino acid variations (SAVs) and demonstrated that disease-causing mutations are preferentially located within the interface core, as opposed to the rim (p < 0.01). In contrast, the interface rim is significantly enriched in polymorphisms, similar to the remaining non-interacting surface. Energetic hot spots tend to be enriched in disease-causing mutations compared to non-hot spots (p = 0.05), regardless of their occurrence in core or rim residues. For individual amino acids, the frequency of substitution into a polymorphism or disease-causing mutation differed to other amino acids and was related to its structural location, as was the type of physicochemical change introduced by the SAV. In conclusion, this study demonstrated the different distribution and properties of disease-causing SAVs and polymorphisms within different structural regions and in relation to the energetic contribution of amino acid in protein–protein interfaces, thus highlighting the importance of a structural system biology approach for predicting the effect of SAVs.
Protein–protein interactions are fundamental in all biological processes.
The distribution of deleterious and non-SAVs within protein interfaces is unknown.
The distribution of deleterious SAVs differs within different interface structural regions.
The distribution of SAVs differs in relation to interface residues energetic contribution.
Structural analysis of protein complexes enhances the understanding of deleterious SAVs.
|
31
|
Kelley LA, Sternberg MJE. Partial protein domains: evolutionary insights and bioinformatics challenges. Genome Biol 2015; 16:100. [PMID: 25986583 PMCID: PMC4436111 DOI: 10.1186/s13059-015-0663-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 11/10/2022] Open
Abstract
Protein domains are generally thought to correspond to units of evolution. New research raises questions about how such domains are defined with bioinformatics tools and sheds light on how evolution has enabled partial domains to be viable.
|
32
|
Kelley LA, Mezulis S, Yates CM, Wass MN, Sternberg MJE. The Phyre2 web portal for protein modeling, prediction and analysis. Nat Protoc 2015. [PMID: 25950237 DOI: 10.1038/nprot.2015.053] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Indexed: 11/09/2022]
Abstract
Phyre2 is a suite of tools available on the web to predict and analyze protein structure, function and mutations. The focus of Phyre2 is to provide biologists with a simple and intuitive interface to state-of-the-art protein bioinformatics tools. Phyre2 replaces Phyre, the original version of the server for which we previously published a paper in Nature Protocols. In this updated protocol, we describe Phyre2, which uses advanced remote homology detection methods to build 3D models, predict ligand binding sites and analyze the effect of amino acid variants (e.g., nonsynonymous SNPs (nsSNPs)) for a user's protein sequence. Users are guided through results by a simple interface at a level of detail they determine. This protocol will guide users from submitting a protein sequence to interpreting the secondary and tertiary structure of their models, their domain composition and model quality. A range of additional available tools is described to find a protein structure in a genome, to submit large numbers of sequences at once and to automatically run weekly searches for proteins that are difficult to model. The server is available at http://www.sbg.bio.ic.ac.uk/phyre2. A typical structure prediction will be returned between 30 min and 2 h after submission.
|
33
|
Irimia M, Weatheritt RJ, Ellis JD, Parikshak NN, Gonatopoulos-Pournatzis T, Babor M, Quesnel-Vallières M, Tapial J, Raj B, O'Hanlon D, Barrios-Rodiles M, Sternberg MJE, Cordes SP, Roth FP, Wrana JL, Geschwind DH, Blencowe BJ. A highly conserved program of neuronal microexons is misregulated in autistic brains. Cell 2015; 159:1511-23. [PMID: 25525873 DOI: 10.1016/j.cell.2014.11.035] [Citation(s) in RCA: 405] [Impact Index Per Article: 45.0] [Received: 08/07/2014] [Revised: 10/20/2014] [Accepted: 11/18/2014] [Indexed: 12/16/2022]
Abstract
Alternative splicing (AS) generates vast transcriptomic and proteomic complexity. However, which of the myriad of detected AS events provide important biological functions is not well understood. Here, we define the largest program of functionally coordinated, neural-regulated AS described to date in mammals. Relative to all other types of AS within this program, 3-15 nucleotide "microexons" display the most striking evolutionary conservation and switch-like regulation. These microexons modulate the function of interaction domains of proteins involved in neurogenesis. Most neural microexons are regulated by the neuronal-specific splicing factor nSR100/SRRM4, through its binding to adjacent intronic enhancer motifs. Neural microexons are frequently misregulated in the brains of individuals with autism spectrum disorder, and this misregulation is associated with reduced levels of nSR100. The results thus reveal a highly conserved program of dynamic microexon regulation associated with the remodeling of protein-interaction networks during neurogenesis, the misregulation of which is linked to autism.
|
34
|
Di Fruscia P, Zacharioudakis E, Liu C, Moniot S, Laohasinnarong S, Khongkow M, Harrison IF, Koltsida K, Reynolds CR, Schmidtkunz K, Jung M, Chapman KL, Steegborn C, Dexter DT, Sternberg MJE, Lam EWF, Fuchter MJ. The discovery of a highly selective 5,6,7,8-tetrahydrobenzo[4,5]thieno[2,3-d]pyrimidin-4(3H)-one SIRT2 inhibitor that is neuroprotective in an in vitro Parkinson's disease model. ChemMedChem 2014; 10:69-82. [PMID: 25395356 DOI: 10.1002/cmdc.201402431] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Received: 10/07/2014] [Indexed: 02/03/2023]
Abstract
Sirtuins, NAD(+)-dependent histone deacetylases (HDACs), have recently emerged as potential therapeutic targets for the treatment of a variety of diseases. The discovery of potent and isoform-selective inhibitors of this enzyme family should provide chemical tools to help determine the roles of these targets and validate their therapeutic value. Herein, we report the discovery of a novel class of highly selective SIRT2 inhibitors, identified by pharmacophore screening. We report the identification and validation of 3-((2-methoxynaphthalen-1-yl)methyl)-7-((pyridin-3-ylmethyl)amino)-5,6,7,8-tetrahydrobenzo[4,5]thieno[2,3-d]pyrimidin-4(3H)-one (ICL-SIRT078), a substrate-competitive SIRT2 inhibitor with a Ki value of 0.62 ± 0.15 μM and more than 50-fold selectivity against SIRT1, 3 and 5. Treatment of MCF-7 breast cancer cells with ICL-SIRT078 results in hyperacetylation of α-tubulin, an established SIRT2 biomarker, at doses comparable with the biochemical IC50 data, while suppressing MCF-7 proliferation at higher concentrations. In concordance with the recent reports that suggest SIRT2 inhibition is a potential strategy for the treatment of Parkinson's disease, we find that compound ICL-SIRT078 has a significant neuroprotective effect in a lactacystin-induced model of Parkinsonian neuronal cell death in the N27 cell line. These results encourage further investigation into the effects of ICL-SIRT078, or an optimised derivative thereof, as a candidate neuroprotective agent in in vivo models of Parkinson's disease.
|
35
|
Lewis TE, Sillitoe I, Andreeva A, Blundell TL, Buchan DWA, Chothia C, Cozzetto D, Dana JM, Filippis I, Gough J, Jones DT, Kelley LA, Kleywegt GJ, Minneci F, Mistry J, Murzin AG, Ochoa-Montaño B, Oates ME, Punta M, Rackham OJL, Stahlhacke J, Sternberg MJE, Velankar S, Orengo C. Genome3D: exploiting structure to help users understand their sequences. Nucleic Acids Res 2014; 43:D382-6. [PMID: 25348407 PMCID: PMC4384030 DOI: 10.1093/nar/gku973] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Indexed: 12/03/2022] Open
Abstract
Genome3D (http://www.genome3d.eu) is a collaborative resource that provides predicted domain annotations and structural models for key sequences. Since introducing Genome3D in a previous NAR paper, we have substantially extended and improved the resource. We have annotated representatives from Pfam families to improve coverage of diverse sequences and added a fast sequence search to the website to allow users to find Genome3D-annotated sequences similar to their own. We have improved and extended the Genome3D data, enlarging the source data set from three model organisms to 10, and adding VIVACE, a resource new to Genome3D. We have analysed and updated Genome3D's SCOP/CATH mapping. Finally, we have improved the superposition tools, which now give users a more powerful interface for investigating similarities and differences between structural models.
|
36
|
Talman AM, Prieto JH, Marques S, Ubaida-Mohien C, Lawniczak M, Wass MN, Xu T, Frank R, Ecker A, Stanway RS, Krishna S, Sternberg MJE, Christophides GK, Graham DR, Dinglasan RR, Yates JR, Sinden RE. Proteomic analysis of the Plasmodium male gamete reveals the key role for glycolysis in flagellar motility. Malar J 2014; 13:315. [PMID: 25124718 PMCID: PMC4150949 DOI: 10.1186/1475-2875-13-315] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Received: 06/11/2014] [Accepted: 07/28/2014] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Gametogenesis and fertilization play crucial roles in malaria transmission. While male gametes are thought to be amongst the simplest eukaryotic cells and are proven targets of transmission blocking immunity, little is known about their molecular organization. For example, the pathway of energy metabolism that powers motility, a feature that facilitates gamete encounter and fertilization, is unknown. METHODS Plasmodium berghei microgametes were purified and analysed by whole-cell proteomic analysis for the first time. Data are available via ProteomeXchange with identifier PXD001163. RESULTS 615 proteins were recovered, including all male gamete proteins described thus far. Amongst them were the 11 enzymes of the glycolytic pathway. The hexose transporter was localized to the gamete plasma membrane and it was shown that microgamete motility can be suppressed effectively by inhibitors of this transporter and of the glycolytic pathway. CONCLUSIONS This study describes the first whole-cell proteomic analysis of the malaria male gamete. It identifies glycolysis as the likely exclusive source of energy for flagellar beat, and provides new insights into original features of Plasmodium flagellar organization.
|
37
|
Yates CM, Filippis I, Kelley LA, Sternberg MJE. SuSPect: enhanced prediction of single amino acid variant (SAV) phenotype using network features. J Mol Biol 2014; 426:2692-701. [PMID: 24810707 PMCID: PMC4087249 DOI: 10.1016/j.jmb.2014.04.026] [Citation(s) in RCA: 159] [Impact Index Per Article: 15.9] [Received: 01/31/2014] [Revised: 04/23/2014] [Accepted: 04/28/2014] [Indexed: 11/16/2022]
Abstract
Whole-genome and exome sequencing studies reveal many genetic variants between individuals, some of which are linked to disease. Many of these variants lead to single amino acid variants (SAVs), and accurate prediction of their phenotypic impact is important. Incorporating sequence conservation and network-level features, we have developed a method, SuSPect (Disease-Susceptibility-based SAV Phenotype Prediction), for predicting how likely SAVs are to be associated with disease. SuSPect performs significantly better than other available batch methods on the VariBench benchmarking dataset, with a balanced accuracy of 82%. SuSPect is available at www.sbg.bio.ic.ac.uk/suspect. The Web site has been implemented in Perl and SQLite and is compatible with modern browsers. An SQLite database of possible missense variants in the human proteome is available to download at www.sbg.bio.ic.ac.uk/suspect/download.html.
Bioinformatics approaches are key for identification of disease-causing variants.
SAV phenotype prediction can be improved using network information.
A method including these features, SuSPect, outperforms tested methods.
SuSPect is available to use at www.sbg.bio.ic.ac.uk/suspect.
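For readers unfamiliar with the "balanced accuracy" metric quoted in this abstract, it is conventionally defined as the mean of sensitivity and specificity. The sketch below shows that definition; the function names and all counts are hypothetical illustrations, not SuSPect's evaluation code or the VariBench data.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity (true-positive rate) and specificity (true-negative rate)."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)   # fraction of disease variants called disease
    specificity = tn / (tn + fp)   # fraction of neutral variants called neutral
    return (sensitivity + specificity) / 2


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Plain accuracy, shown for contrast on imbalanced data."""
    return (tp + tn) / (tp + fn + tn + fp)


if __name__ == "__main__":
    # Hypothetical, heavily imbalanced set: 50 disease variants, 950 neutral ones,
    # scored by a predictor that labels every variant as neutral.
    print(accuracy(tp=0, fn=50, tn=950, fp=0))
    print(balanced_accuracy(tp=0, fn=50, tn=950, fp=0))
```

On an imbalanced benchmark a trivial predictor can score high plain accuracy while its balanced accuracy stays at chance level, which is why the balanced form is often reported for disease versus neutral variant classification.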
|
38
|
Yates CM, Sternberg MJE. The effects of non-synonymous single nucleotide polymorphisms (nsSNPs) on protein-protein interactions. J Mol Biol 2013; 425:3949-63. [PMID: 23867278 DOI: 10.1016/j.jmb.2013.07.012] [Citation(s) in RCA: 123] [Impact Index Per Article: 11.2] [Received: 04/28/2013] [Revised: 07/02/2013] [Accepted: 07/09/2013] [Indexed: 12/23/2022]
Abstract
Non-synonymous single nucleotide polymorphisms (nsSNPs) are single base changes leading to a change to the amino acid sequence of the encoded protein. Many of these variants are associated with disease, so nsSNPs have been well studied, with studies looking at the effects of nsSNPs on individual proteins, for example, on stability and enzyme active sites. In recent years, the impact of nsSNPs upon protein-protein interactions has also been investigated, giving a greater insight into the mechanisms by which nsSNPs can lead to disease. In this review, we summarize these studies, looking at the various mechanisms by which nsSNPs can affect protein-protein interactions. We focus on structural changes that can impair interaction, changes to disorder, gain of interaction, and post-translational modifications before looking at some examples of nsSNPs at human-pathogen protein-protein interfaces and the analysis of nsSNPs from a network perspective.
|
39
|
Bryant WA, Sternberg MJE, Pinney JW. AMBIENT: Active Modules for Bipartite Networks--using high-throughput transcriptomic data to dissect metabolic response. BMC Syst Biol 2013; 7:26. [PMID: 23531303 PMCID: PMC3656802 DOI: 10.1186/1752-0509-7-26] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 09/21/2012] [Accepted: 03/01/2013] [Indexed: 11/10/2022]
Abstract
BACKGROUND With the continued proliferation of high-throughput biological experiments, there is a pressing need for tools to integrate the data produced in ways that produce biologically meaningful conclusions. Many microarray studies have analysed transcriptomic data from a pathway perspective, for instance by testing for KEGG pathway enrichment in sets of upregulated genes. However, the increasing availability of species-specific metabolic models provides the opportunity to analyse these data in a more objective, system-wide manner. RESULTS Here we introduce ambient (Active Modules for Bipartite Networks), a simulated annealing approach to the discovery of metabolic subnetworks (modules) that are significantly affected by a given genetic or environmental change. The metabolic modules returned by ambient are connected parts of the bipartite network that change coherently between conditions, providing a more detailed view of metabolic changes than standard approaches based on pathway enrichment. CONCLUSIONS ambient is an effective and flexible tool for the analysis of high-throughput data in a metabolic context. The same approach can be applied to any system in which reactions (or metabolites) can be assigned a score based on some biological observation, without the limitation of predefined pathways. A Python implementation of ambient is available at http://www.theosysbio.bio.ic.ac.uk/ambient.
|
40
|
Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, Pandey G, Yunes JM, Talwalkar AS, Repo S, Souza ML, Piovesan D, Casadio R, Wang Z, Cheng J, Fang H, Gough J, Koskinen P, Törönen P, Nokso-Koivisto J, Holm L, Cozzetto D, Buchan DWA, Bryson K, Jones DT, Limaye B, Inamdar H, Datta A, Manjari SK, Joshi R, Chitale M, Kihara D, Lisewski AM, Erdin S, Venner E, Lichtarge O, Rentzsch R, Yang H, Romero AE, Bhat P, Paccanaro A, Hamp T, Kaßner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Björne J, Salakoski T, Wong A, Shatkay H, Gatzmann F, Sommer I, Wass MN, Sternberg MJE, Škunca N, Supek F, Bošnjak M, Panov P, Džeroski S, Šmuc T, Kourmpetis YAI, van Dijk ADJ, ter Braak CJF, Zhou Y, Gong Q, Dong X, Tian W, Falda M, Fontana P, Lavezzo E, Di Camillo B, Toppo S, Lan L, Djuric N, Guo Y, Vucetic S, Bairoch A, Linial M, Babbitt PC, Brenner SE, Orengo C, Rost B, Mooney SD, Friedberg I. A large-scale evaluation of computational protein function prediction. Nat Methods 2013; 10:221-7. [PMID: 23353650 PMCID: PMC3584181 DOI: 10.1038/nmeth.2340] [Citation(s) in RCA: 574] [Impact Index Per Article: 52.2] [Received: 04/02/2012] [Accepted: 12/10/2012] [Indexed: 01/03/2023]
Abstract
A report on the results of the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.
|
41
|
Yates CM, Sternberg MJE. Proteins and domains vary in their tolerance of non-synonymous single nucleotide polymorphisms (nsSNPs). J Mol Biol 2013; 425:1274-86. [PMID: 23357174 DOI: 10.1016/j.jmb.2013.01.026] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Received: 09/10/2012] [Revised: 01/11/2013] [Accepted: 01/19/2013] [Indexed: 02/05/2023]
Abstract
The widespread application of whole-genome sequencing is identifying numerous non-synonymous single nucleotide polymorphisms (nsSNPs), many of which are associated with disease. We analyzed nsSNPs from Humsavar and the 1000 Genomes Project to investigate why some proteins and domains are more tolerant of mutations than others. We identified 311 proteins and 112 Pfam families, corresponding to 2910 domains, as disease-susceptible and 32 proteins and 67 Pfam families (10,783 domains) as disease-resistant based on the relative numbers of disease-associated and neutral polymorphisms. Proteins with no significant difference from expected numbers of disease and polymorphism nsSNPs are classified as other. This classification takes into account the phenotypes of all known mutations in the protein or domain rather than simply classifying based on the presence or absence of disease nsSNPs. Of the two hypotheses suggested, our results support the model that disease-resistant domains and proteins are more able to tolerate mutations rather than having more lethal mutations that are not observed. Disease-resistant proteins and domains show significantly higher mutation rates and lower sequence conservation than disease-susceptible proteins and domains. Disease-susceptible proteins are more likely to be encoded by essential genes, are more central in protein-protein interaction networks and are less likely to contain loss-of-function mutations in healthy individuals. We use this classification for nsSNP phenotype prediction, predicting nsSNPs in disease-susceptible domains to be disease and those in disease-resistant domains to be polymorphism. In this way, we achieve higher accuracy than SIFT, a state-of-the-art algorithm.
|
42
|
Janin J, Sternberg MJE. Protein flexibility, not disorder, is intrinsic to molecular recognition. F1000 Biol Rep 2013; 5:2. [PMID: 23361309 PMCID: PMC3542771 DOI: 10.3410/b5-2] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.3] [Indexed: 01/13/2023]
Abstract
An 'intrinsically disordered protein' (IDP) is assumed to be unfolded in the cell and perform its biological function in that state. We contend that most intrinsically disordered proteins are in fact proteins waiting for a partner (PWPs), parts of a multi-component complex that do not fold correctly in the absence of other components. Flexibility, not disorder, is an intrinsic property of proteins, exemplified by X-ray structures of many enzymes and protein-protein complexes. Disorder is often observed with purified proteins in vitro and sometimes also in crystals, where it is difficult to distinguish from flexibility. In the crowded environment of the cell, disorder is not compatible with the known mechanisms of protein-protein recognition, and, foremost, with its specificity. The self-assembly of multi-component complexes may, nevertheless, involve the specific recognition of nascent polypeptide chains that are incompletely folded, but then disorder is transient, and it must remain under the control of molecular chaperones and of the quality control apparatus that obviates the toxic effects it can have on the cell.
|
43
|
David A, Kelley LA, Sternberg MJE. A new structural model of the acid-labile subunit: pathogenetic mechanisms of short stature-causing mutations. J Mol Endocrinol 2012; 49:213-20. [PMID: 22991227 DOI: 10.1530/jme-12-0086] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 01/23/2023]
Abstract
The acid-labile subunit (ALS) is the main regulator of IGF1 and IGF2 bioavailability. ALS deficiency caused by mutations in the ALS (IGFALS) gene often results in mild short stature in adulthood. Little is known about the ALS structure-function relationship. A structural model built in 1999 suggested a doughnut shape, which has never been observed in the leucine-rich repeat (LRR) superfamily, to which ALS belongs. In this study, we built a new ALS structural model, analysed its glycosylation and charge distribution and studied mechanisms by which missense mutations affect protein structure. We used three structure prediction servers and integrated their results with information derived from ALS experimental studies. The ALS model was built at high confidence using Toll-like receptor protein templates and resembled a horseshoe with an extensively negatively charged concave surface. Enrichment in prolines and disulphide bonds was found at the ALS N- and C-termini. Moreover, seven N-glycosylation sites were identified and mapped. ALS mutations were predicted to affect protein structure by causing loss of hydrophobic interactions (p.Leu134Gln), alteration of the amino acid backbone (p.Leu241Pro, p.Leu172Phe and p.Leu244Phe), loss of disulphide bridges (p.Cys60Ser and p.Cys540Arg), change in structural constraints (p.Pro73Leu), creation of novel glycosylation sites (p.Asp440Asn) or alteration of LRRs (p.Asn276Ser). In conclusion, our ALS structural model was identified as a highly confident prediction by three independent methods and disagrees with the previously published ALS model. The new model allowed us to analyse the ALS core and its caps and to interpret the potential structural effects of ALS mutations.
|
44
|
Lewis TE, Sillitoe I, Andreeva A, Blundell TL, Buchan DW, Chothia C, Cuff A, Dana JM, Filippis I, Gough J, Hunter S, Jones DT, Kelley LA, Kleywegt GJ, Minneci F, Mitchell A, Murzin AG, Ochoa-Montaño B, Rackham OJL, Smith J, Sternberg MJE, Velankar S, Yeats C, Orengo C. Genome3D: a UK collaborative project to annotate genomic sequences with predicted 3D structures based on SCOP and CATH domains. Nucleic Acids Res 2012. [PMID: 23203986 PMCID: PMC3531217 DOI: 10.1093/nar/gks1266] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Indexed: 11/18/2022] Open
Abstract
Genome3D, available at http://www.genome3d.eu, is a new collaborative project that integrates UK-based structural resources to provide a unique perspective on sequence–structure–function relationships. Leading structure prediction resources (DomSerf, FUGUE, Gene3D, pDomTHREADER, Phyre and SUPERFAMILY) provide annotations for UniProt sequences to indicate the locations of structural domains (structural annotations) and their 3D structures (structural models). Structural annotations and 3D model predictions are currently available for three model genomes (Homo sapiens, E. coli and baker’s yeast), and the project will extend to other genomes in the near future. As these resources exploit different strategies for predicting structures, the main aim of Genome3D is to enable comparisons between all the resources so that biologists can see where predictions agree and are therefore more trusted. Furthermore, as these methods differ in whether they build their predictions using CATH or SCOP, Genome3D also contains the first official mapping between these two databases. This has identified pairs of similar superfamilies from the two resources at various degrees of consensus (532 bronze pairs, 527 silver pairs and 370 gold pairs).
|
45
|
Sternberg MJE, Tamaddoni-Nezhad A, Lesk VI, Kay E, Hitchen PG, Cootes A, van Alphen LB, Lamoureux MP, Jarrell HC, Rawlings CJ, Soo EC, Szymanski CM, Dell A, Wren BW, Muggleton SH. Gene function hypotheses for the Campylobacter jejuni glycome generated by a logic-based approach. J Mol Biol 2012; 425:186-97. [PMID: 23103756 PMCID: PMC3546167 DOI: 10.1016/j.jmb.2012.10.014] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 08/16/2012] [Revised: 10/15/2012] [Accepted: 10/17/2012] [Indexed: 11/26/2022]
Abstract
Increasingly, experimental data on biological systems are obtained from several sources and computational approaches are required to integrate this information and derive models for the function of the system. Here, we demonstrate the power of a logic-based machine learning approach to propose hypotheses for gene function integrating information from two diverse experimental approaches. Specifically, we use inductive logic programming that automatically proposes hypotheses explaining the empirical data with respect to logically encoded background knowledge. We study the capsular polysaccharide biosynthetic pathway of the major human gastrointestinal pathogen Campylobacter jejuni. We consider several key steps in the formation of capsular polysaccharide consisting of 15 genes of which 8 have assigned function, and we explore the extent to which functions can be hypothesised for the remaining 7. Two sources of experimental data provide the information for learning—the results of knockout experiments on the genes involved in capsule formation and the absence/presence of capsule genes in a multitude of strains of different serotypes. The machine learning uses the pathway structure as background knowledge. We propose assignments of specific genes to five previously unassigned reaction steps. For four of these steps, there was an unambiguous optimal assignment of gene to reaction, and to the fifth, there were three candidate genes. Several of these assignments were consistent with additional experimental results. We therefore show that the logic-based methodology provides a robust strategy to integrate results from different experimental approaches and propose hypotheses for the behaviour of a biological system.
|
46
|
Santos JCA, Nassif H, Page D, Muggleton SH, Sternberg MJE. Automated identification of protein-ligand interaction features using Inductive Logic Programming: a hexose binding case study. BMC Bioinformatics 2012; 13:162. [PMID: 22783946 PMCID: PMC3458898 DOI: 10.1186/1471-2105-13-162] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 09/08/2011] [Accepted: 06/15/2012] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND: There is a need for automated methods to learn general features of the interactions of a ligand class with its diverse set of protein receptors. An appropriate machine learning approach is Inductive Logic Programming (ILP), which automatically generates comprehensible rules in addition to prediction. The development of ILP systems which can learn rules of the complexity required for studies on protein structure remains a challenge. In this work we use a new ILP system, ProGolem, and demonstrate its performance on learning features of hexose-protein interactions. RESULTS: The rules induced by ProGolem detect interactions mediated by aromatics and by planar-polar residues, in addition to less common features such as the aromatic sandwich. The rules also reveal a previously unreported dependency for residues cys and leu. They also specify interactions involving aromatic and hydrogen bonding residues. This paper shows that Inductive Logic Programming implemented in ProGolem can derive rules giving structural features of protein/ligand interactions. Several of these rules are consistent with descriptions in the literature. CONCLUSIONS: In addition to confirming literature results, ProGolem's model has a 10-fold cross-validated predictive accuracy that is superior, at the 95% confidence level, to another ILP system previously used to study protein/hexose interactions and is comparable with state-of-the-art statistical learners.
|
47
|
Wass MN, Barton G, Sternberg MJE. CombFunc: predicting protein function using heterogeneous data sources. Nucleic Acids Res 2012; 40:W466-70. [PMID: 22641853 PMCID: PMC3394346 DOI: 10.1093/nar/gks489] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.6] [Indexed: 12/19/2022] Open
Abstract
Only a small fraction of known proteins have been functionally characterized, making protein function prediction essential to propose annotations for uncharacterized proteins. In recent years many function prediction methods have been developed using various sources of biological data, from protein sequence and structure to gene expression data. Here we present the CombFunc web server, which makes Gene Ontology (GO)-based protein function predictions. CombFunc incorporates ConFunc, our existing function prediction method, with other approaches for function prediction that use protein sequence, gene expression and protein–protein interaction data. In benchmarking on a set of 1686 proteins, CombFunc obtains precision and recall of 0.71 and 0.64, respectively, for GO molecular function terms. For biological process GO terms, precision of 0.74 and recall of 0.41 are obtained. CombFunc is available at http://www.sbg.bio.ic.ac.uk/combfunc.
|
48
|
Reynolds CR, Amini AC, Muggleton SH, Sternberg MJE. Assessment of a Rule-Based Virtual Screening Technology (INDDEx) on a Benchmark Data Set. J Phys Chem B 2012; 116:6732-9. [DOI: 10.1021/jp212084f] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
|
49
|
Phan HTT, Sternberg MJE. PINALOG: a novel approach to align protein interaction networks--implications for complex detection and function prediction. Bioinformatics 2012; 28:1239-45. [PMID: 22419782 PMCID: PMC3338015 DOI: 10.1093/bioinformatics/bts119] [Citation(s) in RCA: 74] [Impact Index Per Article: 6.2] [Indexed: 11/29/2022]
Abstract
Motivation: Analysis of protein–protein interaction networks (PPINs) at the system level has become increasingly important in understanding biological processes. Comparison of the interactomes of different species not only provides a better understanding of species evolution but also helps with detecting conserved functional components and in function prediction. Method and Results: Here we report a PPIN alignment method, called PINALOG, which combines information from protein sequence, function and network topology. Alignment of human and yeast PPINs reveals several conserved subnetworks between them that participate in similar biological processes, notably the proteasome and transcription related processes. PINALOG has been tested for its power in protein complex prediction as well as function prediction. Comparison with PSI-BLAST in predicting protein function in the twilight zone also shows that PINALOG is valuable in predicting protein function. Availability and implementation: The PINALOG web-server is freely available from http://www.sbg.bio.ic.ac.uk/~pinalog. The PINALOG program and associated data are available from the Download section of the web-server. Contact: m.sternberg@imperial.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
|
50
|
Di Fruscia P, Ho KK, Laohasinnarong S, Khongkow M, Kroll SHB, Islam SA, Sternberg MJE, Schmidtkunz K, Jung M, Lam EWF, Fuchter MJ. The Discovery of Novel 10,11-Dihydro-5H-dibenz[b,f]azepine SIRT2 Inhibitors. MedChemComm 2012. [PMID: 24340169 DOI: 10.1039/c2md00290f] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 12/22/2022]
Abstract
Isoform-selective inhibitors of the sirtuins (NAD+-dependent histone deacetylases) should enable an in-depth study of the molecular biology underpinning these targets and how they are deregulated in diseases such as cancer and neurodegeneration. Herein, we present the discovery of structurally novel SIRT2 inhibitors. Hit molecule 8 was discovered through the chemical synthesis and biological characterization of a small-molecule compound library based around the 10,11-dihydro-5H-dibenz[b,f]azepine scaffold. In vitro screening assays revealed compound 8 to have an IC50 of 18 μM against SIRT2 and to exhibit more than 30-fold selectivity compared to SIRT1. Cellular assays, performed on MCF-7 cells, confirmed the in vitro selectivity and showed hit 8 to have antiproliferative activity at a concentration of 30 μM. Computational studies were performed to predict the SIRT2 binding mode and to rationalise the observed selectivity.
|