1
Rackovsky S. Nonlinearities in protein space limit the utility of informatics in protein biophysics. Proteins 2015; 83:1923-8. [PMID: 26315852 DOI: 10.1002/prot.24916] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 05/26/2015] [Revised: 08/12/2015] [Accepted: 08/20/2015] [Indexed: 11/08/2022]
Abstract
We examine the utility of informatics-based methods in computational protein biophysics. To do so, we use newly developed metric functions to define completely independent sequence and structure spaces for a large database of proteins. By investigating the relationship between these spaces, we demonstrate quantitatively the limits of knowledge-based correlation between the sequences and structures of proteins. It is shown that there are well-defined, nonlinear regions of protein space in which dissimilar structures map onto similar sequences (the conformational switch), and dissimilar sequences map onto similar structures (remote homology). These nonlinearities are shown to be quite common: almost half the proteins in our database fall into one or the other of these two regions. They are not anomalies, but rather intrinsic properties of structural encoding in amino acid sequences. It follows that extreme care must be exercised in using bioinformatic data as a basis for computational structure prediction. The implications of these results for protein evolution are examined.
Affiliation(s)
- S Rackovsky
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York, 14853
- Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, New York, New York, 10029
2
Rocha JR, van der Linden MG, Ferreira DC, Azevêdo PH, Pereira de Araújo AF. Information-theoretic analysis and prediction of protein atomic burials: on the search for an informational intermediate between sequence and structure. Bioinformatics 2012; 28:2755-62. [PMID: 22923297 DOI: 10.1093/bioinformatics/bts512] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/11/2022]
Abstract
MOTIVATION: It has been recently suggested that atomic burials, as expressed by molecular central distances, contain sufficient information to determine the tertiary structure of small globular proteins. A possible approach to structural determination from sequence could therefore involve a sequence-to-burial intermediate prediction step whose accuracy, however, is theoretically limited by the mutual information between these two variables. We use a non-redundant set of globular protein structures to estimate the mutual information between local amino acid sequence and atomic burials. Discretizing central distances of Cα or Cβ atoms in equiprobable burial levels, we estimate relevant mutual information measures that are compared with actual predictions obtained from a Naive Bayesian Classifier (NBC) and a Hidden Markov Model (HMM). RESULTS: Mutual information density for 20 amino acids and two or three burial levels was estimated to be roughly 15% of the unconditional burial entropy density. Lower estimates for the mutual information between local amino acid sequence and burial of a single residue indicated an increase in mutual information with the number of burial levels up to at least five or six levels. Prediction schemes were found to efficiently extract the available burial information from local sequence. Lower estimates for the mutual information involving single burials are consistently approached by predictions from the NBC and actually surpassed by predictions from the HMM. Near-optimal prediction for the HMM is indicated by the agreement between its density of prediction information and the corresponding density of mutual information between input and output representations. AVAILABILITY: The dataset of protein structures and the prediction implementations are available at http://www.btc.unb.br/ (in 'Software').
Affiliation(s)
- Juliana R Rocha
- Laboratório de Biologia Teórica e Computacional, Departamento de Biologia Celular, Universidade de Brasília, Brasília-DF 70910-900, Brazil
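The mutual-information bound described in the abstract above can be illustrated with a minimal sketch. It assumes a plug-in (frequency-count) estimator and rank-based equiprobable binning; neither is confirmed as the authors' procedure, the function names are hypothetical, and the data in the usage example are invented toy values.

```python
import math
from collections import Counter


def equiprobable_levels(values, n_levels):
    """Map each value to a level 0..n_levels-1 by rank, so that the
    levels are (as nearly as possible) equally populated."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, idx in enumerate(order):
        levels[idx] = rank * n_levels // len(values)
    return levels


def entropy(labels):
    """Plug-in Shannon entropy in bits of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


if __name__ == "__main__":
    # Toy data (assumption): residue types and central distances in angstroms.
    residues = list("LLLLKKKKAAAADDDD")
    distances = [3.1, 4.0, 2.5, 5.2, 11.0, 12.4, 9.8, 10.5,
                 6.0, 7.1, 5.5, 8.2, 13.0, 9.1, 12.0, 10.9]
    burial = equiprobable_levels(distances, 2)
    mi = mutual_information(residues, burial)
    print(f"I(residue; burial) = {mi:.3f} bits, "
          f"fraction of burial entropy = {mi / entropy(burial):.2f}")
```

The plug-in estimator is biased upward for small samples, so a real analysis on a non-redundant structure set would need a correction or a much larger sample than this toy input.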
3
Bettella F, Rasinski D, Knapp EW. Protein Secondary Structure Prediction with SPARROW. J Chem Inf Model 2012; 52:545-56. [DOI: 10.1021/ci200321u] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 12/26/2022]
Affiliation(s)
- Francesco Bettella
- Freie Universität Berlin, Institut für Chemie, Fabeckstr. 36a, D-14195 Berlin, Germany
- deCODE genetics, Sturlugata 8, 101 Reykjavik, Iceland
- Dawid Rasinski
- Freie Universität Berlin, Institut für Chemie, Fabeckstr. 36a, D-14195 Berlin, Germany
- Ernst Walter Knapp
- Freie Universität Berlin, Institut für Chemie, Fabeckstr. 36a, D-14195 Berlin, Germany
4
Rackovsky S. Spectral analysis of a protein conformational switch. Phys Rev Lett 2011; 106:248101. [PMID: 21770602 DOI: 10.1103/physrevlett.106.248101] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Received: 01/12/2011] [Indexed: 05/31/2023]
Abstract
The existence of conformational switching in proteins, induced by single amino acid mutations, presents an important challenge to our understanding of the physics of protein folding. Sequence-local methods, commonly used to detect structural homology, are incapable of accounting for this phenomenon. We examine a set of proteins, derived from the G(A) and G(B) domains of Streptococcus protein G, which are known to show a dramatic conformational change as a result of single-residue replacement. It is shown that these sequences, which are almost identical locally, can have very different global patterns of physical properties. These differences are consistent with the observed complete change in conformation. These results suggest that sequence-local methods for identifying structural homology can be misleading. They point to the importance of global sequence analysis in understanding sequence-structure relationships.
Affiliation(s)
- S Rackovsky
- Department of Pharmacology and Systems Therapeutics, Mount Sinai School of Medicine of NYU, New York, New York 10029, USA.
5
Abstract
Computational studies of the relationships between protein sequence, structure, and folding have traditionally relied on purely local sequence representations. Here we show that global representations, based on parameters that encode information about complete sequences, contain otherwise inaccessible information about the organization of sequences. By studying the spectral properties of these parameters, we demonstrate that amino acid physical properties fall into two distinct classes. One class comprises properties that favor sequentially localized interaction clusters. The other class comprises properties that favor globally distributed interactions. This observation provides a bridge between two classic models of protein folding (the collapse model and the nucleation model) and provides a basis for understanding how any degree of intermediacy between these two extremes can occur.
6
7
Malkov SN, Zivković MV, Beljanski MV, Hall MB, Zarić SD. A reexamination of the propensities of amino acids towards a particular secondary structure: classification of amino acids based on their chemical structure. J Mol Model 2008; 14:769-75. [PMID: 18504624 DOI: 10.1007/s00894-008-0313-0] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.2] [Received: 12/07/2007] [Accepted: 04/08/2008] [Indexed: 10/22/2022]
Abstract
The correlation between the primary and secondary structures of proteins was analysed using a large data set from the Protein Data Bank. Clear preferences of amino acids towards certain secondary structures classify amino acids into four groups: alpha-helix preferrers, strand preferrers, turn and bend preferrers, and His and Cys (the latter two amino acids show no clear preference for any secondary structure). Amino acids in the same group have similar structural characteristics at their Cbeta and Cgamma atoms that predict their preference for a particular secondary structure. All alpha-helix preferrers have neither polar heteroatoms on the Cbeta and Cgamma atoms, nor a branching or aromatic group on the Cbeta atom. All strand preferrers have aromatic groups or branching groups on the Cbeta atom. All turn and bend preferrers have a polar heteroatom on the Cbeta or Cgamma atoms or do not have a Cbeta atom at all. These new rules could be helpful in making predictions about non-natural amino acids.
Affiliation(s)
- Sasa N Malkov
- Department of Mathematics, University of Belgrade, Studentski trg 16, 11000, Belgrade, Serbia
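The three structural rules stated in the abstract above can be written as a small rule table. This is only a sketch under stated assumptions: the `SideChain` descriptor and `matching_groups` function are hypothetical, the descriptor values for the four example residues are encodings supplied here rather than taken from the paper, and the requirement that helix preferrers have a Cbeta atom is an interpretation. Because the abstract gives no precedence between rules, the function returns every group whose rule matches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SideChain:
    """Hypothetical side-chain descriptor covering the features named in the abstract."""
    has_cbeta: bool
    polar_heteroatom_on_cbeta_or_cgamma: bool
    branched_or_aromatic_on_cbeta: bool


def matching_groups(sc):
    """Return every preference group whose stated rule the descriptor satisfies."""
    groups = []
    polar = sc.polar_heteroatom_on_cbeta_or_cgamma
    bulky = sc.branched_or_aromatic_on_cbeta
    # Helix: no polar heteroatom on Cbeta/Cgamma and no branching or aromatic
    # group on Cbeta (assumption: a Cbeta atom must be present).
    if sc.has_cbeta and not polar and not bulky:
        groups.append("helix")
    # Strand: aromatic or branching group on Cbeta.
    if sc.has_cbeta and bulky:
        groups.append("strand")
    # Turn/bend: polar heteroatom on Cbeta or Cgamma, or no Cbeta at all.
    if not sc.has_cbeta or polar:
        groups.append("turn/bend")
    return groups


# Example encodings (assumptions made for illustration).
ALA = SideChain(has_cbeta=True, polar_heteroatom_on_cbeta_or_cgamma=False,
                branched_or_aromatic_on_cbeta=False)
VAL = SideChain(has_cbeta=True, polar_heteroatom_on_cbeta_or_cgamma=False,
                branched_or_aromatic_on_cbeta=True)
SER = SideChain(has_cbeta=True, polar_heteroatom_on_cbeta_or_cgamma=True,
                branched_or_aromatic_on_cbeta=False)
GLY = SideChain(has_cbeta=False, polar_heteroatom_on_cbeta_or_cgamma=False,
                branched_or_aromatic_on_cbeta=False)

if __name__ == "__main__":
    for name, sc in [("Ala", ALA), ("Val", VAL), ("Ser", SER), ("Gly", GLY)]:
        print(name, matching_groups(sc))
```

A descriptor that is both polar and branched at Cbeta matches two groups here, which reflects the lack of a stated tie-breaking rule rather than a claim about any particular residue.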
8
Comparison of protein secondary structures based on backbone dihedral angles. J Theor Biol 2008; 250:382-7. [DOI: 10.1016/j.jtbi.2007.10.013] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 07/31/2007] [Revised: 10/12/2007] [Accepted: 10/12/2007] [Indexed: 11/22/2022]
9
Zhong S, Moix JM, Quirk S, Hernandez R. Dihedral-angle information entropy as a gauge of secondary structure propensity. Biophys J 2006; 91:4014-23. [PMID: 16980371 PMCID: PMC1635691 DOI: 10.1529/biophysj.106.089243] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 05/16/2006] [Accepted: 08/29/2006] [Indexed: 11/18/2022]
Abstract
Protein structural information can be uncovered using an information-theory-based entropy and auxiliary functions by taking advantage of high-quality correlation plots between the dihedral angles around a residue and those between sequential residues. A standard information entropy for a primary sequence has been defined using the values of the probabilities of the most likely dihedral angles along the sequence. The distribution of entropy differences relative to the standard for each protein in a reference set (a sublibrary of the Protein Data Bank at the 90% sequence redundancy level) appears to be nearly Gaussian. It gives rise to an auxiliary checking function whose value signals the extent to which the dihedral angle propensities differ from typical structures. Such deviations can arise either because of incorrect dihedral angle assignments or secondary structural propensities that are atypical of the structures in the reference set. This auxiliary checking function can be readily calculated at the public website (http://www.d2check.gatech.edu). Its utility is demonstrated here in an analysis displaying differences between experimentally and theoretically derived structures, and in the analysis of structures derived by homology modeling. A comparison of the new measure, D(2)Check, to other checking functions based on backbone conformation (namely, PROCHECK and WHAT_CHECK) is also provided.
Affiliation(s)
- Shi Zhong
- Center for Computational and Molecular Science and Technology, School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, GA 30332-0400, USA
10
Sen TZ, Cheng H, Kloczkowski A, Jernigan RL. A Consensus Data Mining secondary structure prediction by combining GOR V and Fragment Database Mining. Protein Sci 2006; 15:2499-506. [PMID: 17001039 PMCID: PMC2242411 DOI: 10.1110/ps.062125306] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 10/24/2022]
Abstract
The major aim of tertiary structure prediction is to obtain protein models with the highest possible accuracy. Fold recognition, homology modeling, and de novo prediction methods typically use predicted secondary structures as input, and all of these methods may significantly benefit from more accurate secondary structure predictions. Although there are many different secondary structure prediction methods available in the literature, their cross-validated prediction accuracy is generally <80%. In order to increase the prediction accuracy, we developed a novel hybrid algorithm called Consensus Data Mining (CDM) that combines our two previous successful methods: (1) Fragment Database Mining (FDM), which exploits the Protein Data Bank structures, and (2) GOR V, which is based on information theory, Bayesian statistics, and multiple sequence alignments (MSA). In CDM, the target sequence is dissected into smaller fragments that are compared with fragments obtained from related sequences in the PDB. For fragments with a sequence identity above a certain sequence identity threshold, the FDM method is applied for the prediction. The remainder of the fragments are predicted by GOR V. The results of the CDM are provided as a function of the upper sequence identities of aligned fragments and the sequence identity threshold. We observe that the value 50% is the optimum sequence identity threshold, and that the accuracy of the CDM method measured by Q(3) ranges from 67.5% to 93.2%, depending on the availability of known structural fragments with sufficiently high sequence identity. As the Protein Data Bank grows, it is anticipated that this consensus method will improve because it will rely more upon the structural fragments.
Affiliation(s)
- Taner Z Sen
- Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, Iowa 50011-3020, USA.
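The threshold-based dispatch and the Q(3) accuracy measure mentioned in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the published CDM implementation: `cdm_predict`, the fragment representation as `(sequence, best_identity)` pairs, the simple concatenation of non-overlapping fragment predictions, and the two stub predictors are all hypothetical. Only the 50% threshold and the FDM/GOR V split come from the abstract.

```python
def q3(predicted, observed):
    """Percentage of residues whose three-state label (H, E, C) is predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def cdm_predict(fragments, fdm_predict, gor_predict, identity_threshold=0.5):
    """Route each fragment to one of two predictors by its best sequence identity.

    fragments: iterable of (sequence, best_identity) pairs, where best_identity
    is the highest identity (0..1) to any fragment of known structure.
    Fragments above the threshold go to fdm_predict; the rest go to gor_predict.
    Assumption: fragments do not overlap, so predictions are concatenated.
    """
    parts = []
    for sequence, best_identity in fragments:
        if best_identity > identity_threshold:
            parts.append(fdm_predict(sequence))
        else:
            parts.append(gor_predict(sequence))
    return "".join(parts)


if __name__ == "__main__":
    # Stub predictors standing in for FDM and GOR V (assumptions).
    def fdm_stub(seq):
        return "H" * len(seq)

    def gor_stub(seq):
        return "C" * len(seq)

    fragments = [("MKVLA", 0.80), ("GSPDG", 0.20)]
    prediction = cdm_predict(fragments, fdm_stub, gor_stub)
    print(prediction, q3(prediction, "HHHHCCCCCC"))
```

Passing the predictors in as callables keeps the dispatch logic separate from either method, so the threshold can be varied independently when scanning for an optimum.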
11
Bodén M, Yuan Z, Bailey TL. Prediction of protein continuum secondary structure with probabilistic models based on NMR solved structures. BMC Bioinformatics 2006; 7:68. [PMID: 16478545 PMCID: PMC1386714 DOI: 10.1186/1471-2105-7-68] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Received: 06/22/2005] [Accepted: 02/14/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The structure of proteins may change as a result of the inherent flexibility of some protein regions. We develop and explore probabilistic machine learning methods for predicting a continuum secondary structure, i.e. assigning probabilities to the conformational states of a residue. We train our methods using data derived from high-quality NMR models. RESULTS: Several probabilistic models not only successfully estimate the continuum secondary structure, but also provide a categorical output on par with models directly trained on categorical data. Importantly, models trained on the continuum secondary structure are also better than their categorical counterparts at identifying the conformational state for structurally ambivalent residues. CONCLUSION: Cascaded probabilistic neural networks trained on the continuum secondary structure exhibit better accuracy in structurally ambivalent regions of proteins, while sustaining an overall classification accuracy on par with standard, categorical prediction methods.
Affiliation(s)
- Mikael Bodén
- School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072, St Lucia, Australia
- Zheng Yuan
- Institute of Molecular Bioscience, The University of Queensland, QLD 4072, St Lucia, Australia
- Timothy L Bailey
- Institute of Molecular Bioscience, The University of Queensland, QLD 4072, St Lucia, Australia
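The idea of a continuum secondary structure in the abstract above, a probability for each conformational state of each residue, can be illustrated with a short sketch. It assumes one plausible construction: each NMR model has already been assigned a three-state string (H, E, C), and the per-residue probability is the fraction of models in each state. Whether this matches the authors' derivation is not stated in the abstract, and the function names are hypothetical.

```python
from collections import Counter

STATES = "HEC"  # helix, strand, coil


def continuum_from_models(model_strings):
    """Per-residue state probabilities from per-model three-state assignments."""
    if not model_strings:
        raise ValueError("at least one model is required")
    length = len(model_strings[0])
    if any(len(s) != length for s in model_strings):
        raise ValueError("all models must cover the same residues")
    n_models = len(model_strings)
    continuum = []
    for column in zip(*model_strings):  # one column per residue
        counts = Counter(column)
        continuum.append({s: counts.get(s, 0) / n_models for s in STATES})
    return continuum


def categorical(continuum):
    """Collapse a continuum assignment to the most probable state per residue."""
    return "".join(max(STATES, key=lambda s: probs[s]) for probs in continuum)


if __name__ == "__main__":
    # Toy ensemble of four models over four residues (invented data).
    models = ["HHEC", "HHCC", "HECC", "HHCC"]
    continuum = continuum_from_models(models)
    for i, probs in enumerate(continuum):
        print(i, probs)
    print(categorical(continuum))
```

Residues whose probability mass is split across states are the structurally ambivalent positions that a purely categorical label would hide.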
12
Nielsen BG, Røgen P, Bohr HG. Gauss-integral based representation of protein structure for predicting the fold class from the sequence. Math Comput Model 2006. [DOI: 10.1016/j.mcm.2005.11.014] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/25/2022]