1
Blaber M. Variable and Conserved Regions of Secondary Structure in the β-Trefoil Fold: Structure Versus Function. Front Mol Biosci 2022; 9:889943. [PMID: 35517858 PMCID: PMC9062101 DOI: 10.3389/fmolb.2022.889943] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2022] [Accepted: 04/01/2022] [Indexed: 11/13/2022]
Abstract
β-trefoil proteins exhibit an approximate C3 rotational symmetry. An analysis of the secondary structure for members of this diverse superfamily of proteins indicates that it is comprised of remarkably conserved β-strands and highly-divergent turn regions. A fundamental “minimal” architecture can be identified that is devoid of heterogenous and extended turn regions, and is conserved among all family members. Conversely, the different functional families of β-trefoils can potentially be identified by their unique turn patterns (or turn “signature”). Such analyses provide clues as to the evolution of the β-trefoil family, suggesting a folding/stability role for the β-strands and a functional role for turn regions. This viewpoint can also guide de novo protein design of β-trefoil proteins having novel functionality.
Affiliation(s)
- Michael Blaber
- Department of Biomedical Sciences, College of Medicine, Florida State University, Tallahassee, FL, United States
2
Blaber M. Cooperative hydrophobic core interactions in the β-trefoil architecture. Protein Sci 2021; 30:956-965. [PMID: 33686691 DOI: 10.1002/pro.4059] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/08/2021] [Revised: 03/05/2021] [Accepted: 03/05/2021] [Indexed: 11/09/2022]
Abstract
Symmetric protein architectures have a compelling aesthetic that suggests a plausible evolutionary process (i.e., gene duplication/fusion) yielding complex architecture from a simpler structural motif. Furthermore, symmetry inspires a practical approach to computational protein design that substantially reduces the combinatorial explosion problem, and may provide practical solutions for structure optimization. Despite such broad relevance, the role of structural symmetry in the key area of hydrophobic core-packing cooperativity has not been adequately studied. In the present report, the threefold rotational symmetry intrinsic to the β-trefoil architecture is shown to form a geometric basis for highly-cooperative core-packing interactions that both stabilize the local repeating motif and promote oligomerization/long-range contacts in the folding process. Symmetry in the β-trefoil structure also permits tolerance towards mutational drift that involves a structural quasi-equivalence at several key core positions.
Affiliation(s)
- Michael Blaber
- Department of Biomedical Sciences, Florida State University, Tallahassee, Florida, USA
3
Yu JF, Cao Z, Yang Y, Wang CL, Su ZD, Zhao YW, Wang JH, Zhou Y. Natural protein sequences are more intrinsically disordered than random sequences. Cell Mol Life Sci 2016; 73:2949-57. [PMID: 26801222 PMCID: PMC4937073 DOI: 10.1007/s00018-016-2138-9] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 12/18/2015] [Revised: 01/10/2016] [Accepted: 01/11/2016] [Indexed: 11/16/2022]
Abstract
Most natural protein sequences have resulted from millions or even billions of years of evolution. How they differ from random sequences is not fully understood. Previous computational and experimental studies of random proteins generated from noncoding regions yielded inconclusive results due to species-dependent codon biases and GC contents. Here, we approach this problem by investigating 10,000 sequences randomized at the amino acid level. Using well-established predictors for protein intrinsic disorder, we found that natural sequences have more long disordered regions than random sequences, even when random and natural sequences have the same overall composition of amino acid residues. We also showed that random sequences are as structured as natural sequences according to contents and length distributions of predicted secondary structure, although the structures from random sequences may be in a molten globular-like state, according to molecular dynamics simulations. The bias of natural sequences toward more intrinsic disorder suggests that natural sequences are created and evolved to avoid protein aggregation and increase functional diversity.
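To make "randomized at the amino acid level" with the "same overall composition" concrete, the following is a minimal Python sketch that shuffles the residue order of a sequence. It is an assumption about one simple way to build such a control and is not claimed to be the procedure used in the study; the function name `shuffle_sequence` and the example sequence are hypothetical.

```python
import random
from collections import Counter


def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Return a permutation of seq; residue counts are unchanged by construction."""
    residues = list(seq)
    rng.shuffle(residues)  # in-place permutation of the residue order
    return "".join(residues)


# Hypothetical example sequence, used only for illustration.
natural = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
rng = random.Random(0)  # fixed seed so the sketch is repeatable
randomized = shuffle_sequence(natural, rng)

# The shuffled control has the same length and amino acid composition.
assert len(randomized) == len(natural)
assert Counter(randomized) == Counter(natural)
```

Because shuffling only permutes residues, any difference in predicted disorder between the two sets cannot be attributed to composition alone.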
Affiliation(s)
- Jia-Feng Yu
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, 253023, China
- Zanxia Cao
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, 253023, China
- College of Physics and Electronic Information, Dezhou University, Dezhou, 253023, China
- Yuedong Yang
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr, Southport, QLD, 4222, Australia
- Chun-Ling Wang
- College of Physics and Electronic Information, Dezhou University, Dezhou, 253023, China
- Zhen-Dong Su
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, 253023, China
- Ya-Wei Zhao
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, 253023, China
- Ji-Hua Wang
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, 253023, China
- College of Physics and Electronic Information, Dezhou University, Dezhou, 253023, China
- Yaoqi Zhou
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, 253023, China
- Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr, Southport, QLD, 4222, Australia
4
Ozdemir Isik G, Ozer AN. Prediction of substrate specificity in NS3/4A serine protease by biased sequence search threading. J Biomol Struct Dyn 2016; 35:1102-1114. [DOI: 10.1080/07391102.2016.1171801] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/16/2022]
Affiliation(s)
- Gonca Ozdemir Isik
- Department of Bioengineering, Marmara University, Goztepe, Kadikoy, 34722 Istanbul, Turkey
- A. Nevra Ozer
- Department of Bioengineering, Marmara University, Goztepe, Kadikoy, 34722 Istanbul, Turkey
5
Wagner JR, Lee CT, Durrant JD, Malmstrom RD, Feher VA, Amaro RE. Emerging Computational Methods for the Rational Discovery of Allosteric Drugs. Chem Rev 2016; 116:6370-90. [PMID: 27074285 PMCID: PMC4901368 DOI: 10.1021/acs.chemrev.5b00631] [Citation(s) in RCA: 158] [Impact Index Per Article: 19.8] [Indexed: 12/17/2022]
Abstract
Allosteric drug development holds promise for delivering medicines that are more selective and less toxic than those that target orthosteric sites. To date, the discovery of allosteric binding sites and lead compounds has been mostly serendipitous, achieved through high-throughput screening. Over the past decade, structural data has become more readily available for larger protein systems and more membrane protein classes (e.g., GPCRs and ion channels), which are common allosteric drug targets. In parallel, improved simulation methods now provide better atomistic understanding of the protein dynamics and cooperative motions that are critical to allosteric mechanisms. As a result of these advances, the field of predictive allosteric drug development is now on the cusp of a new era of rational structure-based computational methods. Here, we review algorithms that predict allosteric sites based on sequence data and molecular dynamics simulations, describe tools that assess the druggability of these pockets, and discuss how Markov state models and topology analyses provide insight into the relationship between protein dynamics and allosteric drug binding. In each section, we first provide an overview of the various method classes before describing relevant algorithms and software packages.
Affiliation(s)
- Jeffrey R Wagner
- Department of Chemistry & Biochemistry and National Biomedical Computation Resource, University of California, San Diego, La Jolla, California 92093, United States
- Christopher T Lee
- Department of Chemistry & Biochemistry and National Biomedical Computation Resource, University of California, San Diego, La Jolla, California 92093, United States
- Jacob D Durrant
- Department of Chemistry & Biochemistry and National Biomedical Computation Resource, University of California, San Diego, La Jolla, California 92093, United States
- Robert D Malmstrom
- Department of Chemistry & Biochemistry and National Biomedical Computation Resource, University of California, San Diego, La Jolla, California 92093, United States
- Victoria A Feher
- Department of Chemistry & Biochemistry and National Biomedical Computation Resource, University of California, San Diego, La Jolla, California 92093, United States
- Rommie E Amaro
- Department of Chemistry & Biochemistry and National Biomedical Computation Resource, University of California, San Diego, La Jolla, California 92093, United States
6
Natural vs. random protein sequences: Discovering combinatorics properties on amino acid words. J Theor Biol 2015; 391:13-20. [PMID: 26656109 DOI: 10.1016/j.jtbi.2015.11.022] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 02/02/2015] [Revised: 07/29/2015] [Accepted: 11/23/2015] [Indexed: 01/02/2023]
Abstract
Casual mutations and natural selection have driven the evolution of protein amino acid sequences that we observe at present in nature. The question about which is the dominant force of proteins evolution is still lacking of an unambiguous answer. Casual mutations tend to randomize protein sequences while, in order to have the correct functionality, one expects that selection mechanisms impose rigid constraints on amino acid sequences. Moreover, one also has to consider that the space of all possible amino acid sequences is so astonishingly large that it could be reasonable to have a well tuned amino acid sequence indistinguishable from a random one. In order to study the possibility to discriminate between random and natural amino acid sequences, we introduce different measures of association between pairs of amino acids in a sequence, and apply them to a dataset of 1047 natural protein sequences and 10,470 random sequences, carefully generated in order to preserve the relative length and amino acid distribution of the natural proteins. We analyze the multidimensional measures with machine learning techniques and show that, to a reasonable extent, natural protein sequences can be differentiated from random ones.
7
Wang J, Zuo Y, Man YG, Avital I, Stojadinovic A, Liu M, Yang X, Varghese RS, Tadesse MG, Ressom HW. Pathway and network approaches for identification of cancer signature markers from omics data. J Cancer 2015; 6:54-65. [PMID: 25553089 PMCID: PMC4278915 DOI: 10.7150/jca.10631] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 09/24/2014] [Accepted: 11/14/2014] [Indexed: 12/12/2022]
Abstract
The advancement of high throughput omic technologies during the past few years has made it possible to perform many complex assays in a much shorter time than the traditional approaches. The rapid accumulation and wide availability of omic data generated by these technologies offer great opportunities to unravel disease mechanisms, but also present significant challenges to extract knowledge from such massive data and to evaluate the findings. To address these challenges, a number of pathway and network based approaches have been introduced. This review article evaluates these methods and discusses their application in cancer biomarker discovery using hepatocellular carcinoma (HCC) as an example.
Affiliation(s)
- Jinlian Wang
- 1. Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, USA
- 7. Genetics and Genomics Science, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Yiming Zuo
- 1. Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, USA
- 6. Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA
- Yan-gao Man
- 2. Bon Secours Cancer Institute, Richmond VA, USA
- Alexander Stojadinovic
- 2. Bon Secours Cancer Institute, Richmond VA, USA
- 3. Division of Surgical Oncology, Walter Reed National Military Medical Center, Bethesda, MD, USA
- Meng Liu
- 4. Department of Public Health School of Hunter College, City University of New York, NYC, USA
- Xiaowei Yang
- 4. Department of Public Health School of Hunter College, City University of New York, NYC, USA
- Rency S. Varghese
- 1. Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, USA
- Mahlet G Tadesse
- 5. Department of Mathematics and Statistics, Georgetown University, Washington DC, USA
- Habtom W Ressom
- 1. Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, USA
8
Mahajan S, de Brevern AG, Sanejouand YH, Srinivasan N, Offmann B. Use of a structural alphabet to find compatible folds for amino acid sequences. Protein Sci 2014; 24:145-53. [PMID: 25297700 DOI: 10.1002/pro.2581] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 07/15/2014] [Accepted: 10/06/2014] [Indexed: 01/01/2023]
Abstract
The structural annotation of proteins with no detectable homologs of known 3D structure identified using sequence-search methods is a major challenge today. We propose an original method that computes the conditional probabilities for the amino-acid sequence of a protein to fit to known protein 3D structures using a structural alphabet, known as "Protein Blocks" (PBs). PBs constitute a library of 16 local structural prototypes that approximate every part of protein backbone structures. It is used to encode 3D protein structures into 1D PB sequences and to capture sequence to structure relationships. Our method relies on amino acid occurrence matrices, one for each PB, to score global and local threading of query amino acid sequences to protein folds encoded into PB sequences. It does not use any information from residue contacts or sequence-search methods or explicit incorporation of hydrophobic effect. The performance of the method was assessed with independent test datasets derived from SCOP 1.75A. With a Z-score cutoff that achieved 95% specificity (i.e., less than 5% false positives), global and local threading showed sensitivity of 64.1% and 34.2%, respectively. We further tested its performance on 57 difficult CASP10 targets that had no known homologs in PDB: 38 compatible templates were identified by our approach and 66% of these hits yielded correctly predicted structures. This method scales-up well and offers promising perspectives for structural annotations at genomic level. It has been implemented in the form of a web-server that is freely available at http://www.bo-protscience.fr/forsa.
Affiliation(s)
- Swapnil Mahajan
- Université de La Réunion, DSIMB, UMR-S S1134, Saint Denis Messag Cedex 09, La Réunion, F-97715, France; INSERM, UMR-S 1134, DSIMB, F-75739, Paris, France; Laboratoire d'Excellence, GR-Ex, Paris, F-75739, France; Université de Nantes, UFIP CNRS UMR 6286 Faculté des Sciences et Techniques, 2 rue de la Houssinière, 44392, Nantes Cedex 03, France
9
van der Linden MG, Ferreira DC, de Oliveira LC, Onuchic JN, Pereira de Araújo AF. Ab initio protein folding simulations using atomic burials as informational intermediates between sequence and structure. Proteins 2013; 82:1186-99. [DOI: 10.1002/prot.24483] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/04/2013] [Revised: 11/08/2013] [Accepted: 11/19/2013] [Indexed: 11/06/2022]
Affiliation(s)
- Marx Gomes van der Linden
- Departamento de Biologia Celular, Laboratório de Biologia Teórica e Computacional; Universidade de Brasília; Brasília-DF 70910-900 Brazil
- Diogo César Ferreira
- Departamento de Biologia Celular, Laboratório de Biologia Teórica e Computacional; Universidade de Brasília; Brasília-DF 70910-900 Brazil
- Leandro Cristante de Oliveira
- Departamento de Biologia Celular, Laboratório de Biologia Teórica e Computacional; Universidade de Brasília; Brasília-DF 70910-900 Brazil
- Departamento de Física; Instituto de Biociências, Letras e Ciências Exatas; UNESP - Univ Estadual Paulista; São José do Rio Preto-SP 15054-000 Brazil
- José N. Onuchic
- Center for Theoretical Biological Physics; Rice University; Houston Texas 77005
- Antônio F. Pereira de Araújo
- Departamento de Biologia Celular, Laboratório de Biologia Teórica e Computacional; Universidade de Brasília; Brasília-DF 70910-900 Brazil
10
Rocha JR, van der Linden MG, Ferreira DC, Azevêdo PH, Pereira de Araújo AF. Information-theoretic analysis and prediction of protein atomic burials: on the search for an informational intermediate between sequence and structure. Bioinformatics 2012; 28:2755-62. [PMID: 22923297 DOI: 10.1093/bioinformatics/bts512] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/11/2022]
Abstract
MOTIVATION: It has been recently suggested that atomic burials, as expressed by molecular central distances, contain sufficient information to determine the tertiary structure of small globular proteins. A possible approach to structural determination from sequence could therefore involve a sequence-to-burial intermediate prediction step whose accuracy, however, is theoretically limited by the mutual information between these two variables. We use a non-redundant set of globular protein structures to estimate the mutual information between local amino acid sequence and atomic burials. Discretizing central distances of Cα or Cβ atoms in equiprobable burial levels, we estimate relevant mutual information measures that are compared with actual predictions obtained from a Naive Bayesian Classifier (NBC) and a Hidden Markov Model (HMM). RESULTS: Mutual information density for 20 amino acids and two or three burial levels were estimated to be roughly 15% of the unconditional burial entropy density. Lower estimates for the mutual information between local amino acid sequence and burial of a single residue indicated an increase in mutual information with the number of burial levels up to at least five or six levels. Prediction schemes were found to efficiently extract the available burial information from local sequence. Lower estimates for the mutual information involving single burials are consistently approached by predictions from the NBC and actually surpassed by predictions from the HMM. Near-optimal prediction for the HMM is indicated by the agreement between its density of prediction information and the corresponding density of mutual information between input and output representations. AVAILABILITY: The dataset of protein structures and the prediction implementations are available at http://www.btc.unb.br/ (in 'Software').
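The two ingredients named in this abstract, discretizing a continuous burial measure into equiprobable levels and estimating mutual information with amino acid identity, can be sketched as follows. This is an illustrative plug-in estimator under simplifying assumptions (single residues, no local sequence window, no finite-sample bias correction); the function names and toy data are hypothetical and do not reproduce the authors' implementation.

```python
import math
from collections import Counter


def equiprobable_levels(values, n_levels):
    """Assign each value a level 0..n_levels-1 by rank, so levels are (nearly) equally populated."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, i in enumerate(order):
        levels[i] = rank * n_levels // len(values)
    return levels


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired observations."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


# Toy data (hypothetical): residue types and their central distances.
residues = list("LLLLKKKK")
distances = [3.1, 4.0, 2.5, 3.6, 9.8, 11.2, 10.1, 12.4]
levels = equiprobable_levels(distances, 2)  # two burial levels
info_bits = mutual_information(residues, levels)
```

In this toy case the residue type fully determines the burial level, so the estimate equals the one-bit entropy of a balanced two-level variable; with real data the plug-in estimate is biased upward for small samples.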
Affiliation(s)
- Juliana R Rocha
- Laboratório de Biologia Teórica e Computacional, Departamento de Biologia Celular, Universidade de Brasília, Brasília-DF 70910-900, Brazil
11
De Lucrezia D, Slanzi D, Poli I, Polticelli F, Minervini G. Do natural proteins differ from random sequences polypeptides? Natural vs. random proteins classification using an evolutionary neural network. PLoS One 2012; 7:e36634. [PMID: 22615786 PMCID: PMC3353917 DOI: 10.1371/journal.pone.0036634] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 12/19/2011] [Accepted: 04/04/2012] [Indexed: 11/19/2022]
Abstract
Are extant proteins the exquisite result of natural selection or are they random sequences slightly edited by evolution? This question has puzzled biochemists for long time and several groups have addressed this issue comparing natural protein sequences to completely random ones coming to contradicting conclusions. Previous works in literature focused on the analysis of primary structure in an attempt to identify possible signature of evolutionary editing. Conversely, in this work we compare a set of 762 natural proteins with an average length of 70 amino acids and an equal number of completely random ones of comparable length on the basis of their structural features. We use an ad hoc Evolutionary Neural Network Algorithm (ENNA) in order to assess whether and to what extent natural proteins are edited from random polypeptides employing 11 different structure-related variables (i.e. net charge, volume, surface area, coil, alpha helix, beta sheet, percentage of coil, percentage of alpha helix, percentage of beta sheet, percentage of secondary structure and surface hydrophobicity). The ENNA algorithm is capable to correctly distinguish natural proteins from random ones with an accuracy of 94.36%. Furthermore, we study the structural features of 32 random polypeptides misclassified as natural ones to unveil any structural similarity to natural proteins. Results show that random proteins misclassified by the ENNA algorithm exhibit a significant fold similarity to portions or subdomains of extant proteins at atomic resolution. Altogether, our results suggest that natural proteins are significantly edited from random polypeptides and evolutionary editing can be readily detected analyzing structural features. Furthermore, we also show that the ENNA, employing simple structural descriptors, can predict whether a protein chain is natural or random.
Affiliation(s)
- Davide De Lucrezia
- European Centre for Living Technology, University Ca’ Foscari Venice. Venice, Italy
- Debora Slanzi
- Dept. of Environmental Sciences, Informatics and Statistics, University Ca’ Foscari Venice, Venice, Italy
- Irene Poli
- European Centre for Living Technology, University Ca’ Foscari Venice. Venice, Italy
- Dept. of Environmental Sciences, Informatics and Statistics, University Ca’ Foscari Venice, Venice, Italy
- Fabio Polticelli
- Dept. of Biology, University of Roma Tre. Rome, Italy
- National Institute for Nuclear Physics, Roma Tre Section. Rome, Italy
- Giovanni Minervini
- European Centre for Living Technology, University Ca’ Foscari Venice. Venice, Italy
12
Pandini A, Fornili A, Fraternali F, Kleinjung J. Detection of allosteric signal transmission by information-theoretic analysis of protein dynamics. FASEB J 2012; 26:868-81. [PMID: 22071506 PMCID: PMC3290435 DOI: 10.1096/fj.11-190868] [Citation(s) in RCA: 82] [Impact Index Per Article: 6.8] [Indexed: 12/30/2022]
Abstract
Allostery offers a highly specific way to modulate protein function. Therefore, understanding this mechanism is of increasing interest for protein science and drug discovery. However, allosteric signal transmission is difficult to detect experimentally and to model because it is often mediated by local structural changes propagating along multiple pathways. To address this, we developed a method to identify communication pathways by an information-theoretical analysis of molecular dynamics simulations. Signal propagation was described as information exchange through a network of correlated local motions, modeled as transitions between canonical states of protein fragments. The method was used to describe allostery in two-component regulatory systems. In particular, the transmission from the allosteric site to the signaling surface of the receiver domain NtrC was shown to be mediated by a layer of hub residues. The location of hubs preferentially connected to the allosteric site was found in close agreement with key residues experimentally identified as involved in the signal transmission. The comparison with the networks of the homologues CheY and FixJ highlighted similarities in their dynamics. In particular, we showed that a preorganized network of fragment connections between the allosteric and functional sites exists already in the inactive state of all three proteins.
Affiliation(s)
- Alessandro Pandini
- Division of Mathematical Biology, Medical Research Council National Institute for Medical Research, London, UK; Randall Division of Cell and Molecular Biophysics, King's College London, London, UK. Correspondence: Division of Mathematical Biology, MRC National Institute for Medical Research, The Ridgeway, Mill Hill, NW7 1AA London, UK
- Arianna Fornili
- Randall Division of Cell and Molecular Biophysics, King's College London, London, UK
- Franca Fraternali
- Randall Division of Cell and Molecular Biophysics, King's College London, London, UK; The Thomas Young Centre for Theory and Simulation of Materials, London, UK
- Jens Kleinjung
- Division of Mathematical Biology, Medical Research Council National Institute for Medical Research, London, UK. Correspondence: Division of Mathematical Biology, MRC National Institute for Medical Research, The Ridgeway, Mill Hill, NW7 1AA London, UK
13
Rangwala H, Kauffman C, Karypis G. svmPRAT: SVM-based protein residue annotation toolkit. BMC Bioinformatics 2009; 10:439. [PMID: 20028521 PMCID: PMC2805646 DOI: 10.1186/1471-2105-10-439] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Received: 06/15/2009] [Accepted: 12/22/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Over the last decade several prediction methods have been developed for determining the structural and functional properties of individual protein residues using sequence and sequence-derived information. Most of these methods are based on support vector machines as they provide accurate and generalizable prediction models. RESULTS: We present a general purpose protein residue annotation toolkit (svmPRAT) to allow biologists to formulate residue-wise prediction problems. svmPRAT formulates the annotation problem as a classification or regression problem using support vector machines. One of the key features of svmPRAT is its ease of use in incorporating any user-provided information in the form of feature matrices. For every residue svmPRAT captures local information around the residue to create fixed length feature vectors. svmPRAT implements accurate and fast kernel functions, and also introduces a flexible window-based encoding scheme that accurately captures signals and pattern for training effective predictive models. CONCLUSIONS: In this work we evaluate svmPRAT on several classification and regression problems including disorder prediction, residue-wise contact order estimation, DNA-binding site prediction, and local structure alphabet prediction. svmPRAT has also been used for the development of state-of-the-art transmembrane helix prediction method called TOPTMH, and secondary structure prediction method called YASSPP. This toolkit developed provides practitioners an efficient and easy-to-use tool for a wide variety of annotation problems. AVAILABILITY: http://www.cs.gmu.edu/~mlbio/svmprat.
Affiliation(s)
- Huzefa Rangwala
- Computer Science Department, George Mason University, Fairfax, VA, USA.
14
Abstract
Empirical or knowledge-based potentials have many applications in structural biology such as the prediction of protein structure, protein-protein, and protein-ligand interactions and in the evaluation of stability for mutant proteins, the assessment of errors in experimentally solved structures, and the design of new proteins. Here, we describe a simple procedure to derive and use pairwise distance-dependent potentials that rely on the definition of effective atomic interactions, which attempt to capture interactions that are more likely to be physically relevant. Based on a difficult benchmark test composed of proteins with different secondary structure composition and representing many different folds, we show that the use of effective atomic interactions significantly improves the performance of potentials at discriminating between native and near-native conformations. We also found that, in agreement with previous reports, the potentials derived from the observed effective atomic interactions in native protein structures contain a larger amount of mutual information. A detailed analysis of the effective energy functions shows that atom connectivity effects, which mostly arise when deriving the potential by the incorporation of those indirect atomic interactions occurring beyond the first atomic shell, are clearly filtered out. The shape of the energy functions for direct atomic interactions representing hydrogen bonding and disulfide and salt bridges formation is almost unaffected when effective interactions are taken into account. On the contrary, the shape of the energy functions for indirect atom interactions (i.e., those describing the interaction between two atoms bound to a direct interacting pair) is clearly different when effective interactions are considered. Effective energy functions for indirect interacting atom pairs are not influenced by the shape or the energy minimum observed for the corresponding direct interacting atom pair. 
Our results suggest that the dependency between the signals in different energy functions is a key aspect that needs to be addressed when empirical energy functions are derived and used, and also highlight the importance of additivity assumptions in the use of potential energy functions.
Affiliation(s)
- Evandro Ferrada
- Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Alameda 340, Santiago, Chile
15
Lisewski AM. Random amino acid mutations and protein misfolding lead to Shannon limit in sequence-structure communication. PLoS One 2008; 3:e3110. [PMID: 18769673 PMCID: PMC2518838 DOI: 10.1371/journal.pone.0003110] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 04/20/2008] [Accepted: 07/28/2008] [Indexed: 11/18/2022]
Abstract
The transmission of genomic information from coding sequence to protein structure during protein synthesis is subject to stochastic errors. To analyze transmission limits in the presence of spurious errors, Shannon's noisy channel theorem is applied to a communication channel between amino acid sequences and their structures established from a large-scale statistical analysis of protein atomic coordinates. While Shannon's theorem confirms that in close to native conformations information is transmitted with limited error probability, additional random errors in sequence (amino acid substitutions) and in structure (structural defects) trigger a decrease in communication capacity toward a Shannon limit at 0.010 bits per amino acid symbol at which communication breaks down. In several controls, simulated error rates above a critical threshold and models of unfolded structures always produce capacities below this limiting value. Thus an essential biological system can be realistically modeled as a digital communication channel that is (a) sensitive to random errors and (b) restricted by a Shannon error limit. This forms a novel basis for predictions consistent with observed rates of defective ribosomal products during protein synthesis, and with the estimated excess of mutual information in protein contact potentials.
Affiliation(s)
- Andreas Martin Lisewski
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America.
|
16
|
Solis AD, Rackovsky S. Information and discrimination in pairwise contact potentials. Proteins 2008; 71:1071-87. [PMID: 18004788 DOI: 10.1002/prot.21733] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 11/05/2022]
Abstract
We examine the information-theoretic characteristics of statistical potentials that describe pairwise long-range contacts between amino acid residues in proteins. In our work, we seek to map out an efficient information-based strategy to detect and optimally utilize the structural information latent in empirical data, to make contact potentials, and other statistically derived folding potentials, more effective tools in protein structure prediction. Foremost, we establish fundamental connections between basic information-theoretic quantities (including the ubiquitous Z-score) and contact "energies" or scores used routinely in protein structure prediction, and demonstrate that the informatic quantity that mediates fold discrimination is the total divergence. We find that pairwise contacts between residues bear a moderate amount of fold information, and if optimized, can assist in the discrimination of native conformations from large ensembles of native-like decoys. Using an extensive battery of threading tests, we demonstrate that parameters that affect the information content of contact potentials (e.g., choice of atoms to define residue location and the cut-off distance between pairs) have a significant influence on their performance in fold recognition. We conclude that potentials that have been optimized for mutual information and that have a high number of score events per sequence-structure alignment are superior in identifying the correct fold. We derive the quantity "information product" that embodies these two critical factors. We demonstrate that the information product, which does not require explicit threading to compute, is as effective as the Z-score, which requires expensive decoy threading to evaluate. This new objective function may be able to speed up the multidimensional parameter search for better statistical potentials.
Lastly, by demonstrating the functional equivalence of quasi-chemically approximated "energies" to fundamental informatic quantities, we make statistical potentials less dependent on theoretically tenuous biophysical formalisms and more amenable to direct bioinformatic optimization.
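The Z-score referred to in this abstract is conventionally the gap between the native score and the mean decoy score, measured in units of the decoy standard deviation. A minimal sketch of that conventional definition follows, assuming lower scores are more favorable; the function name and the sample energies are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def native_z_score(native_energy: float, decoy_energies: list[float]) -> float:
    """Z-score of the native score relative to a decoy ensemble.

    Assumption: lower energy is better, so a strongly negative value
    means the native structure is well separated from the decoys.
    """
    if len(decoy_energies) < 2:
        raise ValueError("need at least two decoy energies")
    mu = mean(decoy_energies)
    sigma = pstdev(decoy_energies)  # population standard deviation
    if sigma == 0:
        raise ValueError("decoy energies have zero spread")
    return (native_energy - mu) / sigma


# Hypothetical scores for one native structure and three decoys.
z = native_z_score(-10.0, [-2.0, 0.0, 2.0])
print(round(z, 3))
```

The cost the abstract mentions comes from producing the decoy scores, since every decoy must be threaded before this function can be called.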
Affiliation(s)
- Armando D Solis
- Department of Pharmacology and Systems Therapeutics, Mount Sinai School of Medicine, New York, New York 10029, USA
|
17
|
Classification tree based protein structure distances for testing sequence–structure correlation. Comput Biol Med 2008; 38:469-74. [DOI: 10.1016/j.compbiomed.2008.01.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 05/27/2007] [Accepted: 01/15/2008] [Indexed: 11/21/2022]
|
18
|
Tang HY, Zhang ZG. Using C' deviation to study structures of central amino acids in peptide fragments. Amino Acids 2006; 33:689-93. [PMID: 17136509 DOI: 10.1007/s00726-006-0463-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 09/15/2006] [Accepted: 10/15/2006] [Indexed: 11/29/2022]
Abstract
In this investigation, we attempted to study the backbone geometry of amino acids in peptides using C' deviation. Diameters of distribution were used to describe the various atomic structures, and scatter graphs provided visual evaluation. The length of peptide fragments and the secondary structure of amino acids in the central position of the peptide fragments were also analyzed. The results showed that the atomic distribution of the central amino acids of five-residue peptide fragments was much more restricted than that of their corresponding three-residue peptide fragments. In identical three-residue fragments, atoms of central amino acids with different secondary structures were distributed in distinct areas.
Affiliation(s)
- H-Y Tang
- School of Basic Medicine, Peking Union Medical College, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, Beijing, China
|
19
|
Karypis G. YASSPP: better kernels and coding schemes lead to improvements in protein secondary structure prediction. Proteins 2006; 64:575-86. [PMID: 16763996 DOI: 10.1002/prot.21036] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.2] [Indexed: 11/12/2022]
Abstract
The accurate prediction of a protein's secondary structure plays an increasingly critical role in predicting its function and tertiary structure, as it is utilized by many of the current state-of-the-art methods for remote homology, fold recognition, and ab initio structure prediction. We developed a new secondary structure prediction algorithm called YASSPP, which uses a pair of cascaded models constructed from two sets of binary SVM-based models. YASSPP uses an input coding scheme that combines both position-specific and nonposition-specific information, utilizes a kernel function designed to capture the sequence conservation signals around the local window of each residue, and constructs a second-level model by incorporating both the three-state predictions produced by the first-level model and information about the original sequence. Experiments on three standard datasets (RS126, CB513, and EVA common subset 4) show that YASSPP produces higher Q3 and SOV scores than those achieved by existing widely used schemes such as PSIPRED, SSPro 4.0, and SAM-T99sec, as well as previously developed SVM-based schemes. On the EVA dataset it achieves Q3 and SOV scores of 79.34% and 78.65%, which are considerably higher than the best reported scores of 77.64% and 76.05%, respectively.
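The Q3 score reported in this abstract is the standard three-state per-residue accuracy: the percentage of residues whose helix, strand, or coil label is predicted correctly. A minimal sketch of that standard definition is given below; the function name and the example label strings are hypothetical, and the SOV score, which measures segment overlap, is not covered.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Three-state per-residue accuracy, as a percentage.

    Both arguments are strings over the labels H (helix), E (strand)
    and C (coil), aligned position by position.
    """
    if len(predicted) != len(observed):
        raise ValueError("label strings must have equal length")
    if not observed:
        raise ValueError("label strings must not be empty")
    # Count positions where the predicted label matches the observed one.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical 8-residue example with a single mismatch at the last position.
print(q3_score("HHHEECCC", "HHHEECCH"))
```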
Affiliation(s)
- George Karypis
- Department of Computer Science & Engineering, University of Minnesota, Army HPC Research Center, Minneapolis, Minnesota 55455, USA.
|
20
|
Ozer N, Haliloglu T, Schiffer CA. Substrate specificity in HIV-1 protease by a biased sequence search method. Proteins 2006; 64:444-56. [PMID: 16741993 DOI: 10.1002/prot.21023] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/11/2022]
Abstract
Drug resistance in HIV-1 protease can also occasionally confer a change in the substrate specificity. Through the use of computational techniques, a relationship can be determined between the substrate sequence and three-dimensional structure of HIV-1 protease, and be utilized to predict substrate specificity. In this study, we introduce a biased sequence search threading (BSST) methodology to analyze the preferences of substrate positions and correlations between them that might also identify which positions within known substrates can likely tolerate sequence variability and which cannot. The potential sequence space was efficiently explored using a low-resolution knowledge-based scoring function. The low-energy substrate sequences generated by the biased search are correlated with the natural substrates. Octameric sequences were predicted using the probabilities of residue positions in the sequences generated by BSST in three ways: considering each position in the substrate independently, considering pairwise interdependency, and considering triple-wise interdependency. The prediction of octameric sequences using the triple-wise conditional probabilities produces the most accurate results, reproducing most of the sequences for five of the nine natural substrates and implying that there is a complex interdependence between the different substrate residue positions. This likely reflects that HIV-1 protease recognizes the overall shape of the substrate more than its specific sequence.
Affiliation(s)
- Nevra Ozer
- Polymer Research Center and Chemical Engineering Department, Bogazici University, Bebek, Istanbul, Turkey
|
21
|
Chu W, Ghahramani Z, Podtelezhnikov A, Wild DL. Bayesian segmental models with multiple sequence alignment profiles for protein secondary structure and contact map prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2006; 3:98-113. [PMID: 17048397 DOI: 10.1109/tcbb.2006.17] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 05/12/2023]
Abstract
In this paper, we develop a segmental semi-Markov model (SSMM) for protein secondary structure prediction which incorporates multiple sequence alignment profiles with the purpose of improving the predictive performance. The segmental model is a generalization of the hidden Markov model where a hidden state generates segments of various length and secondary structure type. A novel parameterized model is proposed for the likelihood function that explicitly represents multiple sequence alignment profiles to capture the segmental conformation. Numerical results on benchmark data sets show that incorporating the profiles results in substantial improvements and the generalization performance is promising. By incorporating the information from long range interactions in beta-sheets, this model is also capable of carrying out inference on contact maps. This is an important advantage of probabilistic generative models over the traditional discriminative approach to protein secondary structure prediction. The Web server of our algorithm and supplementary materials are available at http://public.kgi.edu/-wild/bsm.html.
Affiliation(s)
- Wei Chu
- Gatsby Computational Neuroscience Unit, University College London, London, UK.
|
22
|
Yu ZG, Anh VV, Lau KS, Zhou LQ. Clustering of protein structures using hydrophobic free energy and solvent accessibility of proteins. Physical Review E, Statistical, Nonlinear, and Soft Matter Physics 2006; 73:031920. [PMID: 16605571 DOI: 10.1103/physreve.73.031920] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 11/22/2005] [Revised: 01/18/2006] [Indexed: 05/08/2023]
Abstract
The hydrophobic free energy and solvent accessibility of amino acids are used to study the relationship between the primary structure and structural classification of large proteins. A measure representation and a Z curve representation of protein sequences are proposed. Fractal analysis of the measure and Z curve representations of proteins and multifractal analysis of their hydrophobic free energy and solvent accessibility sequences indicate that the protein sequences possess correlations and multifractal scaling. The parameters from the fractal and multifractal analyses on these sequences are used to construct several parameter spaces. Each protein is represented by a point in these spaces. A method is proposed to distinguish and cluster proteins from the alpha, beta, alpha + beta, and alpha/beta structural classes in these parameter spaces. Fisher's linear discriminant algorithm is used to give a quantitative assessment of our clustering on the selected proteins. Numerical results indicate that the discriminant accuracies are satisfactory. In particular, they reach 94.12% and 88.89% in separating beta proteins from {alpha, alpha + beta, alpha/beta} proteins in a three-dimensional space.
Affiliation(s)
- Z G Yu
- Program in Statistics and Operations Research, Queensland University of Technology, GPO Box 2434, Brisbane, Queensland 4001, Australia.
|
23
|
Abstract
MOTIVATION: Standard algorithms for pairwise protein sequence alignment make the simplifying assumption that amino acid substitutions at neighboring sites are uncorrelated. This assumption allows implementation of fast algorithms for pairwise sequence alignment, but it ignores information that could conceivably increase the power of remote homolog detection. We examine the validity of this assumption by constructing extended substitution matrices that encapsulate the observed correlations between neighboring sites, by developing an efficient and rigorous algorithm for pairwise protein sequence alignment that incorporates these local substitution correlations, and by assessing the ability of this algorithm to detect remote homologies. RESULTS: Our analysis indicates that local correlations between substitutions are not strong on average. Furthermore, incorporating local substitution correlations into pairwise alignment did not lead to a statistically significant improvement in remote homology detection. Therefore, the standard assumption that individual residues within protein sequences evolve independently of neighboring positions appears to be an efficient and appropriate approximation.
Affiliation(s)
- Gavin E Crooks
- Department of Plant and Microbial Biology, 111 Koshland Hall #3102, University of California, Berkeley, CA 94720-3102, USA.
|
24
|
Wiederstein M, Sippl MJ. Protein sequence randomization: efficient estimation of protein stability using knowledge-based potentials. J Mol Biol 2004; 345:1199-212. [PMID: 15644215 DOI: 10.1016/j.jmb.2004.11.012] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.3] [Received: 09/21/2004] [Revised: 11/05/2004] [Accepted: 11/07/2004] [Indexed: 11/27/2022]
Abstract
Modifications of the amino acid sequence generally affect protein stability. Here, we use knowledge-based potentials to estimate the stability of protein structures under sequence variation. Calculations on a variety of protein scaffolds result in a clear distinction of known mutable regions from arbitrarily chosen control patches. For example, randomly changing the sequence of an antibody paratope yields a significantly lower number of destabilized mutants as compared to the randomization of comparable regions on the protein surface. The technique is computationally efficient and can be used to screen protein structures for regions that are amenable to molecular tinkering by preserving the stability of the mutated proteins.
Affiliation(s)
- Markus Wiederstein
- Center of Applied Molecular Engineering, University of Salzburg, Jakob Haringerstrasse 5, 5020 Salzburg, Austria
|
25
|
Abstract
MOTIVATION: The observed correlations between pairs of homologous protein sequences are typically explained in terms of a Markovian dynamic of amino acid substitution. This model assumes that every location on the protein sequence has the same background distribution of amino acids, an assumption that is incompatible with the observed heterogeneity of protein amino acid profiles and with the success of profile multiple sequence alignment. RESULTS: We propose an alternative model of amino acid replacement during protein evolution based upon the assumption that the variation of the amino acid background distribution from one residue to the next is sufficient to explain the observed sequence correlations of homologs. The resulting dynamical model of independent replacements drawn from heterogeneous backgrounds is simple and consistent, and provides a unified homology match score for sequence-sequence, sequence-profile and profile-profile alignment.
Affiliation(s)
- Gavin E Crooks
- Department of Plant and Microbial Biology, 111 Koshland Hall #3102, University of California, Berkeley, CA 94720-3102, USA.
|