1
Aina A, Hsueh SCC, Plotkin SS. PROTHON: A Local Order Parameter-Based Method for Efficient Comparison of Protein Ensembles. J Chem Inf Model 2023. [PMID: 37178169 DOI: 10.1021/acs.jcim.3c00145] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 05/15/2023]
Abstract
The comparison of protein conformational ensembles is of central importance in structural biology. However, there are few computational methods for ensemble comparison, and those that are readily available, such as ENCORE, utilize methods that are sufficiently computationally expensive to be prohibitive for large ensembles. Here, a new method is presented for efficient representation and comparison of protein conformational ensembles. The method is based on the representation of a protein ensemble as a vector of probability distribution functions (pdfs), with each pdf representing the distribution of a local structural property such as the number of contacts between Cβ atoms. Dissimilarity between two conformational ensembles is quantified by the Jensen-Shannon distance between the corresponding set of probability distribution functions. The method is validated for conformational ensembles generated by molecular dynamics simulations of ubiquitin, as well as experimentally derived conformational ensembles of a 130 amino acid truncated form of human tau protein. In the ubiquitin ensemble data set, the method was up to 88 times faster than the existing ENCORE software, while simultaneously utilizing 48 times fewer computing cores. We make the method available as a Python package, called PROTHON, and provide a GitHub page with the Python source code at https://github.com/PlotkinLab/Prothon.
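The representation described in this abstract can be illustrated with a minimal sketch. It is not the PROTHON implementation: the function names, the fixed-width histogram binning, and the averaging of per-residue distances into a single score are assumptions made for illustration. Each ensemble is passed as a list of conformations, and each conformation holds one local property value per residue (for example, a Cβ contact count).

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (base-2 logs) between two discrete pdfs."""
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative round-off


def histogram(values, n_bins, lo, hi):
    """Normalized fixed-width histogram; out-of-range values go to the edge bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, n_bins - 1))] += 1
    return [c / len(values) for c in counts]


def ensemble_dissimilarity(ens_a, ens_b, n_bins=10, lo=0.0, hi=20.0):
    """Mean per-residue JS distance (the mean is an assumed aggregation).

    ens_a[k][i] is the local property of residue i in conformation k.
    """
    n_res = len(ens_a[0])
    distances = []
    for i in range(n_res):
        pdf_a = histogram([conf[i] for conf in ens_a], n_bins, lo, hi)
        pdf_b = histogram([conf[i] for conf in ens_b], n_bins, lo, hi)
        distances.append(js_distance(pdf_a, pdf_b))
    return sum(distances) / n_res
```

With base-2 logarithms the distance lies between 0 (identical distributions) and 1 (non-overlapping distributions). SciPy's `scipy.spatial.distance.jensenshannon` computes the same quantity when called with `base=2` and could replace `js_distance` here.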
Affiliation(s)
- Adekunle Aina: Department of Physics and Astronomy, The University of British Columbia, Vancouver, BC V6T 1Z1, Canada
- Shawn C C Hsueh: Department of Physics and Astronomy, The University of British Columbia, Vancouver, BC V6T 1Z1, Canada
- Steven S Plotkin: Department of Physics and Astronomy, The University of British Columbia, Vancouver, BC V6T 1Z1, Canada; Genome Science and Technology Program, The University of British Columbia, Vancouver, BC V6T 1Z1, Canada
2
Shi Q, Chen W, Huang S, Wang Y, Xue Z. Deep learning for mining protein data. Brief Bioinform 2019; 22:194-218. [PMID: 31867611 DOI: 10.1093/bib/bbz156] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 08/16/2019] [Revised: 10/21/2019] [Accepted: 11/07/2019] [Indexed: 01/16/2023] Open
Abstract
The recent emergence of deep learning to characterize complex patterns of protein big data reveals its potential to address the classic challenges in the field of protein data mining. Much research has revealed the promise of deep learning as a powerful tool to transform protein big data into valuable knowledge, leading to scientific discoveries and practical solutions. In this review, we summarize recent publications on deep learning predictive approaches in the field of mining protein data. The application architectures of these methods include multilayer perceptrons, stacked autoencoders, deep belief networks, two- or three-dimensional convolutional neural networks, recurrent neural networks, graph neural networks, and complex neural networks and are described from five perspectives: residue-level prediction, sequence-level prediction, three-dimensional structural analysis, interaction prediction, and mass spectrometry data mining. The advantages and deficiencies of these architectures are presented in relation to various tasks in protein data mining. Additionally, some practical issues and their future directions are discussed, such as robust deep learning for protein noisy data, architecture optimization for specific tasks, efficient deep learning for limited protein data, multimodal deep learning for heterogeneous protein data, and interpretable deep learning for protein understanding. This review provides comprehensive perspectives on general deep learning techniques for protein data analysis.
Affiliation(s)
- Qiang Shi: School of Software Engineering, Huazhong University of Science and Technology. His main interests cover machine learning, especially deep learning, protein data analysis, and big data mining
- Weiya Chen: School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests cover bioinformatics, virtual reality, and data visualization
- Siqi Huang: Software Engineering at Huazhong University of Science and Technology, focusing on machine learning and data mining
- Yan Wang: School of life, University of Science & Technology; her main interests cover protein structure and function prediction and big data mining
- Zhidong Xue: School of Software Engineering, Huazhong University of Science & Technology, Wuhan, China. His research interests cover bioinformatics, machine learning, and image processing
3
Kinjo AR. Cooperative "folding transition" in the sequence space facilitates function-driven evolution of protein families. J Theor Biol 2018; 443:18-27. [PMID: 29355538 DOI: 10.1016/j.jtbi.2018.01.019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/27/2017] [Revised: 01/16/2018] [Accepted: 01/17/2018] [Indexed: 12/23/2022]
Abstract
In the protein sequence space, natural proteins form clusters of families which are characterized by their unique native folds whereas the great majority of random polypeptides are neither clustered nor foldable to unique structures. Since a given polypeptide can be either foldable or unfoldable, a kind of "folding transition" is expected at the boundary of a protein family in the sequence space. By Monte Carlo simulations of a statistical mechanical model of protein sequence alignment that coherently incorporates both short-range and long-range interactions as well as variable-length insertions to reproduce the statistics of the multiple sequence alignment of a given protein family, we demonstrate the existence of such transition between natural-like sequences and random sequences in the sequence subspaces for 15 domain families of various folds. The transition was found to be highly cooperative and two-state-like. Furthermore, enforcing or suppressing consensus residues on a few of the well-conserved sites enhanced or diminished, respectively, the natural-like pattern formation over the entire sequence. In most families, the key sites included ligand binding sites. These results suggest some selective pressure on the key residues, such as ligand binding activity, may cooperatively facilitate the emergence of a protein family during evolution. From a more practical aspect, the present results highlight an essential role of long-range effects in precisely defining protein families, which are absent in conventional sequence models.
Affiliation(s)
- Akira R Kinjo: Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka 565-0871, Japan
4
Deng L, Fan C, Zeng Z. A sparse autoencoder-based deep neural network for protein solvent accessibility and contact number prediction. BMC Bioinformatics 2017; 18:569. [PMID: 29297299 PMCID: PMC5751690 DOI: 10.1186/s12859-017-1971-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND Direct prediction of the three-dimensional (3D) structures of proteins from one-dimensional (1D) sequences is a challenging problem. Significant structural characteristics such as solvent accessibility and contact number are essential for deriving restraints in modeling protein folding and protein 3D structure. Thus, accurately predicting these features is a critical step for 3D protein structure building. RESULTS In this study, we present DeepSacon, a computational method that can effectively predict protein solvent accessibility and contact number by using a deep neural network built on a stacked autoencoder and a dropout method. The results demonstrate that our proposed DeepSacon achieves a significant improvement in the prediction quality compared with the state-of-the-art methods. We obtain 0.70 three-state accuracy for solvent accessibility, 0.33 15-state accuracy and 0.74 Pearson Correlation Coefficient (PCC) for the contact number on the 5729 monomeric soluble globular protein dataset. We also evaluate the performance on the CASP11 benchmark dataset, where DeepSacon achieves 0.68 three-state accuracy and 0.69 PCC for solvent accessibility and contact number, respectively. CONCLUSIONS We have shown that DeepSacon can reliably predict solvent accessibility and contact number with a stacked sparse autoencoder and a dropout approach.
Affiliation(s)
- Lei Deng: School of Software, Central South University, No.22 Shaoshan South Road, Changsha, 410075 China
- Chao Fan: School of Software, Central South University, No.22 Shaoshan South Road, Changsha, 410075 China
- Zhiwen Zeng: School of Information Science and Engineering, Central South University, No.932 South Lushan Road, Changsha, 410083 China
5
Li H, Hou J, Adhikari B, Lyu Q, Cheng J. Deep learning methods for protein torsion angle prediction. BMC Bioinformatics 2017; 18:417. [PMID: 28923002 PMCID: PMC5604354 DOI: 10.1186/s12859-017-1834-2] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Received: 04/14/2017] [Accepted: 09/11/2017] [Indexed: 12/31/2022] Open
Abstract
Background Deep learning is one of the most powerful machine learning methods and has achieved state-of-the-art performance in many domains. Since deep learning was introduced to the field of bioinformatics in 2012, it has achieved success in a number of areas such as protein residue-residue contact prediction, secondary structure prediction, and fold recognition. In this work, we developed deep learning methods to improve the prediction of torsion (dihedral) angles of proteins. Results We design four different deep learning architectures to predict protein torsion angles. The architectures include a deep neural network (DNN) and a deep restricted Boltzmann machine (DRBM), as well as a deep recurrent neural network (DRNN) and a deep recurrent restricted Boltzmann machine (DReRBM), since protein torsion angle prediction is a sequence-related problem. In addition to existing protein features, two new features (predicted residue contact number and the error distribution of torsion angles extracted from sequence fragments) are used as input to each of the four deep learning architectures to predict phi and psi angles of protein backbone. The mean absolute error (MAE) of phi and psi angles predicted by DRNN, DReRBM, DRBM and DNN is about 20–21° and 29–30° on an independent dataset. The MAE of phi angle is comparable to the existing methods, but the MAE of psi angle is 29°, 2° lower than the existing methods. On the latest CASP12 targets, our methods also achieved performance better than or comparable to a state-of-the-art method. Conclusions Our experiment demonstrates that deep learning is a valuable method for predicting protein torsion angles. The deep recurrent network architecture performs slightly better than deep feed-forward architecture, and the predicted residue contact number and the error distribution of torsion angles extracted from sequence fragments are useful features for improving prediction accuracy.
Electronic supplementary material: The online version of this article (10.1186/s12859-017-1834-2) contains supplementary material, which is available to authorized users.
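The mean absolute error quoted in this abstract is taken over angles, which are periodic. The abstract does not state how the wraparound at ±180° was handled, so the following is only a sketch of one common convention, with hypothetical function names: the error between two angles is the shorter way around the circle.

```python
def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def angular_mae(predicted, observed):
    """Mean absolute angular error over paired lists of angles in degrees."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [angular_error(p, t) for p, t in zip(predicted, observed)]
    return sum(errors) / len(errors)
```

Under this convention a prediction of 179° for a true angle of -179° counts as a 2° error rather than 358°, which matters for psi angles that cluster near ±180° in β-strands.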
Affiliation(s)
- Haiou Li: Department of Computer Science and Technology, Soochow University, Suzhou, Jiangsu, 215006, China
- Jie Hou: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, 65211, USA
- Badri Adhikari: Department of Mathematics and Computer Science, University of Missouri-St. Louis, 1 University Blvd. 311 Express Scripts Hall, St. Louis, MO, 63121, USA
- Qiang Lyu: Department of Computer Science and Technology, Soochow University, Suzhou, Jiangsu, 215006, China
- Jianlin Cheng: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, 65211, USA
6
Li B, Mendenhall J, Nguyen ED, Weiner BE, Fischer AW, Meiler J. Improving prediction of helix-helix packing in membrane proteins using predicted contact numbers as restraints. Proteins 2017; 85:1212-1221. [PMID: 28263405 PMCID: PMC5476507 DOI: 10.1002/prot.25281] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 08/04/2016] [Revised: 01/20/2017] [Accepted: 02/17/2017] [Indexed: 01/21/2023]
Abstract
One of the challenging problems in tertiary structure prediction of helical membrane proteins (HMPs) is the determination of rotation of α-helices around the helix normal. Incorrect prediction of helix rotations substantially disrupts native residue-residue contacts while inducing only a relatively small effect on the overall fold. We previously developed a method for predicting residue contact numbers (CNs), which measure the local packing density of residues within the protein tertiary structure. In this study, we tested the idea of incorporating predicted CNs as restraints to guide the sampling of helix rotation. For a benchmark set of 15 HMPs with simple to rather complicated folds, the average contact recovery (CR) of best-sampled models was improved for all targets, the likelihood of sampling models with CR greater than 20% was increased for 13 targets, and the average RMSD100 of best-sampled models was improved for 12 targets. This study demonstrated that explicit incorporation of CNs as restraints improves the prediction of helix-helix packing.
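To make the notion of a contact number concrete, here is a minimal sketch that counts, for each residue, how many other residues lie within a distance cutoff. This is a simple hard-cutoff definition chosen for illustration. The abstract does not give the authors' exact definition, and the 8 Å default, the sequence-separation parameter, and the function name are all assumptions.

```python
import math


def contact_numbers(coords, cutoff=8.0, min_seq_sep=1):
    """Count neighbors within `cutoff` for each residue.

    coords      -- one (x, y, z) tuple per residue, e.g. Cβ positions in Å
    cutoff      -- distance threshold (assumed value, not from the paper)
    min_seq_sep -- ignore pairs closer than this along the sequence
    """
    n = len(coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + min_seq_sep, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                counts[i] += 1  # a contact is symmetric, so both
                counts[j] += 1  # residues gain one neighbor
    return counts
```

Buried residues in a packed core accumulate high counts and exposed residues low ones, which is why the quantity serves as a proxy for local packing density. Raising `min_seq_sep` restricts the count to contacts that are distant along the sequence.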
Affiliation(s)
- Bian Li: Department of Chemistry, Vanderbilt University, Nashville, TN 37232, USA; Center for Structural Biology, Vanderbilt University, Nashville, TN 37232, USA
- Jeffrey Mendenhall: Department of Chemistry, Vanderbilt University, Nashville, TN 37232, USA; Center for Structural Biology, Vanderbilt University, Nashville, TN 37232, USA
- Brian E. Weiner: Department of Chemistry, Vanderbilt University, Nashville, TN 37232, USA; Center for Structural Biology, Vanderbilt University, Nashville, TN 37232, USA
- Axel W. Fischer: Department of Chemistry, Vanderbilt University, Nashville, TN 37232, USA; Center for Structural Biology, Vanderbilt University, Nashville, TN 37232, USA
- Jens Meiler: Department of Chemistry, Vanderbilt University, Nashville, TN 37232, USA; Center for Structural Biology, Vanderbilt University, Nashville, TN 37232, USA
7
Arana-Daniel N, Gallegos AA, López-Franco C, Alanís AY, Morales J, López-Franco A. Support Vector Machines Trained with Evolutionary Algorithms Employing Kernel Adatron for Large Scale Classification of Protein Structures. Evol Bioinform Online 2016; 12:285-302. [PMID: 27980384 PMCID: PMC5140013 DOI: 10.4137/ebo.s40912] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 09/01/2016] [Revised: 10/19/2016] [Accepted: 10/20/2016] [Indexed: 11/05/2022] Open
Abstract
With the increasing power of computers, the amount of data that can be processed in small periods of time has grown exponentially, as has the importance of classifying large-scale data efficiently. Support vector machines have shown good results classifying large amounts of high-dimensional data, such as data generated by protein structure prediction, spam recognition, medical diagnosis, optical character recognition and text classification, etc. Most state of the art approaches for large-scale learning use traditional optimization methods, such as quadratic programming or gradient descent, which makes the use of evolutionary algorithms for training support vector machines an area to be explored. The present paper proposes an approach that is simple to implement based on evolutionary algorithms and Kernel-Adatron for solving large-scale classification problems, focusing on protein structure prediction. The functional properties of proteins depend upon their three-dimensional structures. Knowing the structures of proteins is crucial for biology and can lead to improvements in areas such as medicine, agriculture and biofuels.
Affiliation(s)
- Nancy Arana-Daniel: Centro Universitario de Ciencias Exactas e Ingenierías, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Alberto A Gallegos: Centro Universitario de Ciencias Exactas e Ingenierías, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Carlos López-Franco: Centro Universitario de Ciencias Exactas e Ingenierías, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Alma Y Alanís: Centro Universitario de Ciencias Exactas e Ingenierías, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Jacob Morales: Centro Universitario de Ciencias Exactas e Ingenierías, Universidad de Guadalajara, Guadalajara, Jalisco, México
- Adriana López-Franco: Centro Universitario de Ciencias Exactas e Ingenierías, Universidad de Guadalajara, Guadalajara, Jalisco, México
8
Rodríguez-Fdez I, Mucientes M, Bugarín A. S-FRULER: Scalable fuzzy rule learning through evolution for regression. Knowl Based Syst 2016. [DOI: 10.1016/j.knosys.2016.07.034] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 10/21/2022]
9
Li B, Mendenhall J, Nguyen ED, Weiner BE, Fischer AW, Meiler J. Accurate Prediction of Contact Numbers for Multi-Spanning Helical Membrane Proteins. J Chem Inf Model 2016; 56:423-34. [PMID: 26804342 PMCID: PMC5537626 DOI: 10.1021/acs.jcim.5b00517] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 12/20/2022]
Abstract
Prediction of the three-dimensional (3D) structures of proteins by computational methods is acknowledged as an unsolved problem. Accurate prediction of important structural characteristics such as contact number is expected to accelerate the otherwise slow progress being made in the prediction of 3D structure of proteins. Here, we present a dropout neural network-based method, TMH-Expo, for predicting the contact number of transmembrane helix (TMH) residues from sequence. Neuronal dropout is a strategy where certain neurons of the network are excluded from back-propagation to prevent co-adaptation of hidden-layer neurons. By using neuronal dropout, overfitting was significantly reduced and performance was noticeably improved. For multi-spanning helical membrane proteins, TMH-Expo achieved a remarkable Pearson correlation coefficient of 0.69 between predicted and experimental values and a mean absolute error of only 1.68. In addition, among those membrane protein-membrane protein interface residues, 76.8% were correctly predicted. Mapping of predicted contact numbers onto structures indicates that contact numbers predicted by TMH-Expo reflect the exposure patterns of TMHs and reveal membrane protein-membrane protein interfaces, reinforcing the potential of predicted contact numbers to be used as restraints for 3D structure prediction and protein-protein docking. TMH-Expo can be accessed via a Web server at www.meilerlab.org.
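The dropout strategy mentioned in this abstract can be sketched for a single hidden layer. This is a generic illustration, not the TMH-Expo code: it uses the widely used "inverted dropout" variant, in which surviving activations are rescaled during training so that no change is needed at inference time, and the function name and signature are hypothetical.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Apply inverted dropout to a list of hidden-layer activations.

    Returns (outputs, mask). Units with mask 0 are silenced for this
    training step, so no gradient would flow through them either.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # Inference: every unit is kept and nothing is rescaled.
        return list(activations), [1] * len(activations)
    keep_prob = 1.0 - drop_prob
    mask = [1 if rng.random() < keep_prob else 0 for _ in activations]
    # Dividing by keep_prob keeps the expected activation unchanged.
    outputs = [a * m / keep_prob for a, m in zip(activations, mask)]
    return outputs, mask
```

A call such as `dropout(hidden, 0.5, random.Random(0))` draws a fresh mask each training step, so no hidden unit can rely on a particular partner always being present, which is the co-adaptation the abstract refers to.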
Affiliation(s)
- Bian Li: Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37232, United States; Center for Structural Biology, Vanderbilt University, Nashville, Tennessee 37232, United States
- Jeffrey Mendenhall: Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37232, United States; Center for Structural Biology, Vanderbilt University, Nashville, Tennessee 37232, United States
- Elizabeth Dong Nguyen: Center for Structural Biology, Vanderbilt University, Nashville, Tennessee 37232, United States
- Brian E. Weiner: Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37232, United States; Center for Structural Biology, Vanderbilt University, Nashville, Tennessee 37232, United States
- Axel W. Fischer: Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37232, United States; Center for Structural Biology, Vanderbilt University, Nashville, Tennessee 37232, United States
- Jens Meiler: Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37232, United States; Center for Structural Biology, Vanderbilt University, Nashville, Tennessee 37232, United States
10
Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields. Sci Rep 2016; 6:18962. [PMID: 26752681 PMCID: PMC4707437 DOI: 10.1038/srep18962] [Citation(s) in RCA: 255] [Impact Index Per Article: 31.9] [Received: 06/28/2015] [Accepted: 11/26/2015] [Indexed: 12/29/2022] Open
Abstract
Protein secondary structure (SS) prediction is important for studying protein structure and function. When only the sequence (profile) information is used as input feature, currently the best predictors can obtain ~80% Q3 accuracy, which has not been improved in the past decade. Here we present DeepCNF (Deep Convolutional Neural Fields) for protein SS prediction. DeepCNF is a Deep Learning extension of Conditional Neural Fields (CNF), which is an integration of Conditional Random Fields (CRF) and shallow neural networks. DeepCNF can model not only complex sequence-structure relationship by a deep hierarchical architecture, but also interdependency between adjacent SS labels, so it is much more powerful than CNF. Experimental results show that DeepCNF can obtain ~84% Q3 accuracy, ~85% SOV score, and ~72% Q8 accuracy, respectively, on the CASP and CAMEO test proteins, greatly outperforming currently popular predictors. As a general framework, DeepCNF can be used to predict other protein structure properties such as contact number, disorder regions, and solvent accessibility.
11
AcconPred: Predicting Solvent Accessibility and Contact Number Simultaneously by a Multitask Learning Framework under the Conditional Neural Fields Model. BIOMED RESEARCH INTERNATIONAL 2015; 2015:678764. [PMID: 26339631 PMCID: PMC4538422 DOI: 10.1155/2015/678764] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 12/27/2014] [Accepted: 03/11/2015] [Indexed: 12/14/2022]
Abstract
Motivation. The solvent accessibility of protein residues is one of the driving forces of protein folding, while the contact number of protein residues limits the possibilities of protein conformations. The de novo prediction of these properties from protein sequence is important for the study of protein structure and function. Although these two properties are certainly related to each other, it is challenging to exploit this dependency for the prediction. Method. We present a method, AcconPred, for predicting solvent accessibility and contact number simultaneously, which is based on a shared weight multitask learning framework under the CNF (conditional neural fields) model. The multitask learning framework on a collection of related tasks provides more accurate prediction than the framework trained only on a single task. The CNF method not only models the complex relationship between the input features and the predicted labels, but also exploits the interdependency among adjacent labels. Results. Trained on 5729 monomeric soluble globular protein datasets, AcconPred could reach 0.68 three-state accuracy for solvent accessibility and 0.75 correlation for contact number. Tested on the 105 CASP11 domain datasets for solvent accessibility, AcconPred could reach 0.64 accuracy, which outperforms existing methods.
12
Feng Y, Luo L. Using long-range contact number information for protein secondary structure prediction. INT J BIOMATH 2014. [DOI: 10.1142/s1793524514500521] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
In this paper, we first combine tetra-peptide structural words with contact number for protein secondary structure prediction. We used the method of increment of diversity combined with quadratic discriminant analysis to predict the structure of the central residue of a sequence fragment. The method uses tetra-peptide structural words and long-range contact number as information sources. The Q3 accuracy is over 83% on 194 proteins. The accuracies of predicted secondary structures for the 20 amino acid residues range from 81% to 88%. Moreover, we have introduced the residue long-range contact, which directly indicates the separation of contacting residues in terms of their position in the sequence, and examined the negative influence of long-range residue interactions on predicting secondary structure in a protein. The method is also compared with existing prediction methods. The results show that our method is more effective in protein secondary structure prediction.
Affiliation(s)
- Yonge Feng: College of Science, Inner Mongolia Agriculture University, Hohhot 010018, P. R. China
- Liaofu Luo: School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, P. R. China
13
Specific non-local interactions are not necessary for recovering native protein dynamics. PLoS One 2014; 9:e91347. [PMID: 24625758 PMCID: PMC3953337 DOI: 10.1371/journal.pone.0091347] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 10/22/2013] [Accepted: 02/11/2014] [Indexed: 11/25/2022] Open
Abstract
The elastic network model (ENM) is a widely used method to study native protein dynamics by normal mode analysis (NMA). In ENM we need information about all pairwise distances, and the distance between contacting atoms is restrained to the native value. Therefore ENM requires O(N²) information to realize its dynamics for a protein consisting of N amino acid residues. To see if (or to what extent) such a large amount of specific structural information is required to realize native protein dynamics, here we introduce a novel model based on only O(N) restraints. This model, named the ‘contact number diffusion’ model (CND), includes specific distance restraints for only local (along the amino acid sequence) atom pairs, and semi-specific non-local restraints imposed on each atom, rather than atom pairs. The semi-specific non-local restraints are defined in terms of the non-local contact numbers of atoms. The CND model exhibits the dynamic characteristics comparable to ENM and more correlated with the explicit-solvent molecular dynamics simulation than ENM. Moreover, unrealistic surface fluctuations often observed in ENM were suppressed in CND. On the other hand, in some ligand-bound structures CND showed larger fluctuations of buried protein atoms interacting with the ligand compared to ENM. In addition, fluctuations from CND and ENM show comparable correlations with the experimental B-factor. Although there are some indications of the importance of some specific non-local interactions, the semi-specific non-local interactions are mostly sufficient for reproducing the native protein dynamics.
14
Nicolau DV, Paszek E, Fulga F, Nicolau DV. Protein molecular surface mapped at different geometrical resolutions. PLoS One 2013; 8:e58896. [PMID: 23516572 PMCID: PMC3597524 DOI: 10.1371/journal.pone.0058896] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 07/22/2012] [Accepted: 02/08/2013] [Indexed: 01/08/2023] Open
Abstract
Many areas of biochemistry and molecular biology, both fundamental and applications-orientated, require an accurate construction, representation and understanding of the protein molecular surface and its interaction with other, usually small, molecules. There are, however, many situations in which the protein molecular surface comes into physical contact with larger objects, either biological, such as membranes, or artificial, such as nanoparticles. The contribution presents a methodology for describing and quantifying the molecular properties of proteins, by geometrical and physico-chemical mapping of the molecular surfaces, with several analytical relationships being proposed for molecular surface properties. The relevance of the molecular surface-derived properties has been demonstrated through the calculation of the statistical strength of the prediction of protein adsorption. It is expected that the extension of this methodology to other phenomena involving proteins near solid surfaces, in particular the protein interaction with nanoparticles, will result in important benefits in the understanding and design of protein-specific solid surfaces.
Affiliation(s)
- Dan V Nicolau: Department of Electrical Engineering & Electronics, University of Liverpool, Liverpool, United Kingdom
15
Kauffman C, Karypis G. Coarse- and fine-grained models for proteins: Evaluation by decoy discrimination. Proteins 2013. [DOI: 10.1002/prot.24222] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/03/2023]
Affiliation(s)
- Chris Kauffman: Department of Computer Science, George Mason University, Fairfax, Virginia 22030, USA
16
17
Bacardit J, Widera P, Márquez-Chamorro A, Divina F, Aguilar-Ruiz JS, Krasnogor N. Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features. Bioinformatics 2012; 28:2441-8. [PMID: 22833524 DOI: 10.1093/bioinformatics/bts472] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.9] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION The prediction of a protein's contact map has in recent years become a crucial stepping stone for the prediction of the complete 3D structure of a protein. In this article, we describe a methodology for this problem that was shown to be successful in CASP8 and CASP9. The methodology is based on (i) the fusion of the prediction of a variety of structural aspects of protein residues, (ii) an ensemble strategy used to facilitate the training process and (iii) a rule-based machine learning system from which we can extract human-readable explanations of the predictor and derive useful information about the contact map representation. RESULTS The main part of the evaluation is the comparison against the sequence-based contact prediction methods from CASP9, where our method presented the best rank in five out of the six evaluated metrics. We also assess the impact of the size of the ensemble used in our predictor to show the trade-off between performance and training time of our method. Finally, we also study the rule sets generated by our machine learning system. From this analysis, we are able to estimate the contribution of the attributes in our representation and how these interact to derive contact predictions. AVAILABILITY http://icos.cs.nott.ac.uk/servers/psp.html. CONTACT natalio.krasnogor@nottingham.ac.uk SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jaume Bacardit
- Interdisciplinary Computing and Complex Systems research group, School of Computer Science, University of Nottingham, Nottingham, NG8 1BB, UK
|
18
|
iFC²: an integrated web-server for improved prediction of protein structural class, fold type, and secondary structure content. Amino Acids 2010; 40:963-73. [PMID: 20730460 DOI: 10.1007/s00726-010-0721-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 03/09/2010] [Accepted: 08/06/2010] [Indexed: 10/19/2022]
Abstract
Several descriptors of protein structure at the sequence and residue levels have been recently proposed. They are widely adopted in the analysis and prediction of structural and functional characteristics of proteins. Numerous in silico methods have been developed for sequence-based prediction of these descriptors. However, many of them do not have a public web-server and only a few integrate multiple descriptors to improve the predictions. We introduce the iFC² (integrated prediction of fold, class, and content) server, which is the first to integrate three modern predictors of sequence-level descriptors. They concern fold type (PFRES), structural class (SCEC), and secondary structure content (PSSC-core). The server exploits relations between the three descriptors to implement a cross-evaluation procedure that improves over the predictions of the individual methods. iFC² annotates fold and class predictions as potentially correct/incorrect. When tested on datasets with low-similarity chains, for the fold prediction iFC² labels 82% of the PFRES predictions as correct and the accuracy of these predictions equals 72%. The accuracy of the remaining 28% of the PFRES predictions equals 38%. Similarly, our server assigns correct labels for over 79% of SCEC predictions, which are shown to be 98% accurate, while the remaining SCEC predictions are only 15% accurate. These results are shown to be competitive when contrasted against recent relevant web-servers. Predictions on CASP8 targets show that the content predicted by iFC² is competitive when compared with the content computed from the tertiary structures predicted by the three best-performing methods in CASP8. The iFC² server is available at http://biomine.ece.ualberta.ca/1D/1D.html.
|
19
|
Shah AA, Folino G, Krasnogor N. Toward High-Throughput, Multicriteria Protein-Structure Comparison and Analysis. IEEE Trans Nanobioscience 2010; 9:144-55. [DOI: 10.1109/tnb.2010.2043851] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/07/2022]
|
20
|
Teichert F, Minning J, Bastolla U, Porto M. High quality protein sequence alignment by combining structural profile prediction and profile alignment using SABERTOOTH. BMC Bioinformatics 2010; 11:251. [PMID: 20470364 PMCID: PMC2885375 DOI: 10.1186/1471-2105-11-251] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 08/07/2009] [Accepted: 05/14/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND Protein alignments are an essential tool for many bioinformatics analyses. While sequence alignments are accurate for proteins of high sequence similarity, they become unreliable as they approach the so-called 'twilight zone' where sequence similarity becomes indistinguishable from random. For such distant pairs, structure alignment is of much better quality. Nevertheless, sequence alignment is the only choice in the majority of cases where structural data is not available. This situation demands development of methods that extend the applicability of accurate sequence alignment to distantly related proteins. RESULTS We develop a sequence alignment method that combines the prediction of a structural profile based on the protein's sequence with the alignment of that profile using our recently published alignment tool SABERTOOTH. In particular, we predict the contact vector of protein structures using an artificial neural network based on position-specific scoring matrices generated by PSI-BLAST and align these predicted contact vectors. The resulting sequence alignments are assessed using two different tests: First, we assess the alignment quality by measuring the derived structural similarity for cases in which structures are available. In a second test, we quantify the ability of the significance score of the alignments to recognize structural and evolutionary relationships. As a benchmark we use a representative set of the SCOP (structural classification of proteins) database, with similarities ranging from closely related proteins at SCOP family level, to very distantly related proteins at SCOP fold level. Comparing these results with some prominent sequence alignment tools, we find that SABERTOOTH produces sequence alignments of better quality than those of Clustal W, T-Coffee, MUSCLE, and PSI-BLAST. HHpred, one of the most sophisticated and computationally expensive tools available, outperforms our alignment algorithm at family and superfamily levels, while the use of SABERTOOTH is advantageous for alignments at fold level. Our alignment scheme will profit from future improvements of structural profiles prediction. CONCLUSIONS We present the automatic sequence alignment tool SABERTOOTH that computes pairwise sequence alignments of very high quality. SABERTOOTH is especially advantageous when applied to alignments of remotely related proteins. The source code is available at http://www.fkp.tu-darmstadt.de/sabertooth_project/, free for academic users upon request.
Affiliation(s)
- Florian Teichert
- Institut für Festkörperphysik, Technische Universität Darmstadt, Hochschulstr, Darmstadt, Germany
|
21
|
Rangwala H, Kauffman C, Karypis G. svmPRAT: SVM-based protein residue annotation toolkit. BMC Bioinformatics 2009; 10:439. [PMID: 20028521 PMCID: PMC2805646 DOI: 10.1186/1471-2105-10-439] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Received: 06/15/2009] [Accepted: 12/22/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND Over the last decade several prediction methods have been developed for determining the structural and functional properties of individual protein residues using sequence and sequence-derived information. Most of these methods are based on support vector machines as they provide accurate and generalizable prediction models. RESULTS We present a general purpose protein residue annotation toolkit (svmPRAT) to allow biologists to formulate residue-wise prediction problems. svmPRAT formulates the annotation problem as a classification or regression problem using support vector machines. One of the key features of svmPRAT is its ease of use in incorporating any user-provided information in the form of feature matrices. For every residue svmPRAT captures local information around the residue to create fixed length feature vectors. svmPRAT implements accurate and fast kernel functions, and also introduces a flexible window-based encoding scheme that accurately captures signals and patterns for training effective predictive models. CONCLUSIONS In this work we evaluate svmPRAT on several classification and regression problems including disorder prediction, residue-wise contact order estimation, DNA-binding site prediction, and local structure alphabet prediction. svmPRAT has also been used for the development of a state-of-the-art transmembrane helix prediction method called TOPTMH, and a secondary structure prediction method called YASSPP. The toolkit provides practitioners with an efficient and easy-to-use tool for a wide variety of annotation problems. AVAILABILITY http://www.cs.gmu.edu/~mlbio/svmprat.
Affiliation(s)
- Huzefa Rangwala
- Computer Science Department, George Mason University, Fairfax, VA, USA.
|
22
|
Song J, Tan H, Mahmood K, Law RHP, Buckle AM, Webb GI, Akutsu T, Whisstock JC. Prodepth: predict residue depth by support vector regression approach from protein sequences only. PLoS One 2009; 4:e7072. [PMID: 19759917 PMCID: PMC2742725 DOI: 10.1371/journal.pone.0007072] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.3] [Received: 04/14/2009] [Accepted: 08/20/2009] [Indexed: 11/24/2022]
Abstract
Residue depth (RD) is a solvent exposure measure that complements the information provided by conventional accessible surface area (ASA) and describes to what extent a residue is buried in the protein structure space. Previous studies have established that RD is correlated with several protein properties, such as protein stability, residue conservation and amino acid types. Accurate prediction of RD has many potentially important applications in the field of structural bioinformatics, for example, facilitating the identification of functionally important residues, or residues in the folding nucleus, or enzyme active sites from sequence information. In this work, we introduce an efficient approach that uses support vector regression to quantify the relationship between RD and protein sequence. We systematically investigated eight different sequence encoding schemes including both local and global sequence characteristics and examined their respective prediction performances. For the objective evaluation of our approach, we used 5-fold cross-validation to assess the prediction accuracies and showed that the overall best performance could be achieved with a correlation coefficient (CC) of 0.71 between the observed and predicted RD values and a root mean square error (RMSE) of 1.74, after incorporating the relevant multiple sequence features. The results suggest that residue depth could be reliably predicted solely from protein primary sequences: local sequence environments are the major determinants, while global sequence features could influence the prediction performance marginally. We highlight two examples as a comparison in order to illustrate the applicability of this approach. We also discuss the potential implications of this new structural parameter in the field of protein structure prediction and homology modeling. This method might prove to be a powerful tool for sequence analysis.
Affiliation(s)
- Jiangning Song
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, Japan
- * E-mail: (JS); (JCW)
| | - Hao Tan
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
| | - Khalid Mahmood
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- ARC Centre of Excellence for Structural and Functional Microbial Genomics, Monash University, Clayton, Melbourne, Victoria, Australia
| | - Ruby H. P. Law
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
| | - Ashley M. Buckle
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
| | - Geoffrey I. Webb
- Faculty of Information Technology, Monash University, Clayton, Melbourne, Victoria, Australia
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, Japan
| | - James C. Whisstock
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- ARC Centre of Excellence for Structural and Functional Microbial Genomics, Monash University, Clayton, Melbourne, Victoria, Australia
- * E-mail: (JS); (JCW)
|
23
|
Bacardit J, Stout M, Hirst JD, Valencia A, Smith RE, Krasnogor N. Automated alphabet reduction for protein datasets. BMC Bioinformatics 2009; 10:6. [PMID: 19126227 PMCID: PMC2646702 DOI: 10.1186/1471-2105-10-6] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Received: 08/08/2008] [Accepted: 01/06/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets. Reducing alphabet cardinality without losing key biochemical information opens the door to potentially faster machine learning, data mining and optimization applications in structural bioinformatics. Furthermore, reduced but informative alphabets often result in, e.g., more compact and human-friendly classification/clustering rules. In this paper we propose a robust and sophisticated alphabet reduction protocol based on mutual information and state-of-the-art optimization techniques. RESULTS We applied this protocol to the prediction of two protein structural features: contact number and relative solvent accessibility. For both features we generated alphabets of two, three, four and five letters. The five-letter alphabets gave prediction accuracies statistically similar to those obtained using the full amino acid alphabet. Moreover, the automatically designed alphabets were compared against other reduced alphabets taken from the literature or human-designed, outperforming them. The differences between our alphabets and the alphabets taken from the literature were quantitatively analyzed. All of the above was performed using a primary sequence representation of proteins. As a final experiment, we extrapolated the obtained five-letter alphabet to reduce a much richer protein representation based on evolutionary information for the prediction of the same two features. Again, the performance gap between the full representation and the reduced representation was small, showing that the results of our automated alphabet reduction protocol, even if they were obtained using a simple representation, are also able to capture the crucial information needed for state-of-the-art protein representations. CONCLUSION Our automated alphabet reduction protocol generates competent reduced alphabets tailored specifically for a variety of protein datasets. This process is done without any domain knowledge, using information theory metrics instead. The reduced alphabets contain some unexpected (but sound) groups of amino acids, thus suggesting new ways of interpreting the data.
Affiliation(s)
- Jaume Bacardit
- ASAP research group, School of Computer Science, University of Nottingham, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB, UK.
|
24
|
Stout M, Bacardit J, Hirst JD, Smith RE, Krasnogor N. Prediction of topological contacts in proteins using learning classifier systems. Soft Comput 2008. [DOI: 10.1007/s00500-008-0318-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 10/21/2022]
|
25
|
Afonnikov DA, Morozov AV, Kolchanov NA. Prediction of contact numbers of amino acid residues using a neural network regression algorithm. Biophysics (Nagoya-shi) 2008. [DOI: 10.1134/s0006350906070128] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/23/2022]
|
26
|
Shi Y, Zhou J, Arndt D, Wishart DS, Lin G. Protein contact order prediction from primary sequences. BMC Bioinformatics 2008; 9:255. [PMID: 18513429 PMCID: PMC2440764 DOI: 10.1186/1471-2105-9-255] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 08/20/2007] [Accepted: 05/30/2008] [Indexed: 11/11/2022]
Abstract
Background Contact order is a topological descriptor that has been shown to be correlated with several interesting protein properties such as protein folding rates and protein transition state placements. Contact order has also been used to select for viable protein folds from ab initio protein structure prediction programs. For proteins of known three-dimensional structure, their contact order can be calculated directly. However, for proteins with unknown three-dimensional structure, there is no effective prediction method currently available. Results In this paper, we propose several simple yet very effective methods to predict contact order from the amino acid sequence only. One set of methods is based on a weighted linear combination of predicted secondary structure content and amino acid composition. Depending on the number of components used in these equations it is possible to achieve a correlation coefficient of 0.857–0.870 between the observed and predicted contact order. A second method, based on sequence similarity to known three-dimensional structures, is able to achieve a correlation coefficient of 0.977. We have also developed a much more robust implementation for calculating contact order directly from PDB coordinates that works for > 99% of PDB files. All of these contact order predictors and calculators have been implemented as a web server (see Availability and requirements section for URL). Conclusion Protein contact order can be effectively predicted from the primary sequence, in the absence of three-dimensional structure. Three factors, percentage of residues in alpha helices, percentage of residues in beta strands, and sequence length, appear to be strongly correlated with the absolute contact order.
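For readers unfamiliar with the descriptor, the direct calculation of contact order from coordinates can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `contact_order`, the use of one representative atom per residue, the 8 Å cutoff, and the minimum sequence separation are all assumptions. Absolute contact order is taken here as the mean sequence separation over contacting residue pairs, and relative contact order divides that value by the chain length.

```python
import numpy as np


def contact_order(coords, cutoff=8.0, min_separation=1):
    """Return (absolute, relative) contact order for a single chain.

    coords: (L, 3) array with one representative atom per residue.
    cutoff and min_separation are illustrative choices, not values
    taken from the paper.
    """
    coords = np.asarray(coords, dtype=float)
    n_res = len(coords)
    # Pairwise Euclidean distances between all residues.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Index pairs (i < j) separated by at least min_separation in sequence.
    i, j = np.triu_indices(n_res, k=min_separation)
    in_contact = dist[i, j] < cutoff
    n_contacts = int(in_contact.sum())
    if n_contacts == 0:
        return 0.0, 0.0
    # Mean sequence separation over contacting pairs.
    absolute = float((j - i)[in_contact].sum()) / n_contacts
    relative = absolute / n_res
    return absolute, relative


# Toy chain: five residues on a straight line, 3.8 Å apart.
toy_chain = np.array([[3.8 * k, 0.0, 0.0] for k in range(5)])
abs_co, rel_co = contact_order(toy_chain)
```

Real structures would need a parser to extract coordinates and a decision on which atoms define a contact; both are left out of this sketch.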
Affiliation(s)
- Yi Shi
- Department of Computing Science, University of Alberta, Edmonton, Alberta, T6G 2E8, Canada.
|
27
|
Song J, Tan H, Takemoto K, Akutsu T. HSEpred: predict half-sphere exposure from protein sequences. Bioinformatics 2008; 24:1489-97. [DOI: 10.1093/bioinformatics/btn222] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.9] [Indexed: 11/12/2022]
|
28
|
Miyazawa S, Kinjo AR. Properties of contact matrices induced by pairwise interactions in proteins. Phys Rev E Stat Nonlin Soft Matter Phys 2008; 77:051910. [PMID: 18643105 DOI: 10.1103/physreve.77.051910] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/23/2008] [Indexed: 05/26/2023]
Abstract
The properties of contact matrices (C matrices) needed for native proteins to be the lowest-energy conformations are considered in relation to a contact energy matrix (E matrix). The total conformational energy is assumed to consist of pairwise interaction energies between atoms or residues, each of which is expressed as a product of a conformation-dependent function (an element of the C matrix) and a sequence-dependent energy parameter (an element of the E matrix). Such pairwise interactions in proteins force native C matrices to be in a relationship as if the interactions are a Go-like potential [N. Go, Annu. Rev. Biophys. Bioeng. 12, 183 (1983)] for the native C matrix, because the lowest bound of the total energy function is equal to the total energy of the native conformation interacting in a Go-like pairwise potential. This relationship between C and E matrices corresponds to (a) a parallel relationship between the eigenvectors of the C and E matrices and a linear relationship between their eigenvalues and (b) a parallel relationship between a contact number vector and the principal eigenvectors of the C and E matrices, where the E matrix is expanded in a series of eigenspaces with an additional constant term. The additional constant term in the spectral expansion of the E matrix is indicated by the lowest bound of the total energy function to correspond to a threshold of contact energy that approximately separates native contacts from non-native ones. Inner products between the principal eigenvector of the C matrix, that of the E matrix, and a contact number vector have been examined for 182 proteins, each of which is a representative from each family of the SCOP database [Murzin, J. Mol. Biol. 247, 536 (1995)], and the results indicate the parallel tendencies between those vectors. A statistical contact potential [S. Miyazawa and R. L. Jernigan, Proteins 34, 49 (1999); S. Miyazawa and R. L. Jernigan, Proteins 50, 35 (2003)] estimated from protein crystal structures was used to evaluate pairwise residue-residue interactions in the proteins. In addition, the spectral representation of C and E matrices reveals that pairwise residue-residue interactions, which depend only on the types of interacting amino acids, but not on other residues in a protein, are insufficient and other interactions including residue connectivities and steric hindrance are needed to make native structures unique lowest-energy conformations.
Affiliation(s)
- Sanzo Miyazawa
- Graduate School of Engineering, Gunma University, Kiryu, Gunma 376-8515, Japan.
|
29
|
Kinjo AR, Nakamura H. Nature of protein family signatures: insights from singular value analysis of position-specific scoring matrices. PLoS One 2008; 3:e1963. [PMID: 18398479 PMCID: PMC2276316 DOI: 10.1371/journal.pone.0001963] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 01/17/2008] [Accepted: 03/05/2008] [Indexed: 11/19/2022]
Abstract
Position-specific scoring matrices (PSSMs) are useful for detecting weak homology in protein sequence analysis, and they are thought to contain some essential signatures of the protein families. In order to elucidate what kind of ingredients constitute such family-specific signatures, we apply singular value decomposition to a set of PSSMs and examine the properties of dominant right and left singular vectors. The first right singular vectors were correlated with various amino acid indices including relative mutability, amino acid composition in protein interior, hydropathy, or turn propensity, depending on proteins. A significant correlation between the first left singular vector and a measure of site conservation was observed. It is shown that the contribution of the first singular component to the PSSMs acts to disfavor potentially but falsely functionally important residues at conserved sites. The second right singular vectors were highly correlated with hydrophobicity scales, and the corresponding left singular vectors with contact numbers of protein structures. It is suggested that sequence alignment with a PSSM is essentially equivalent to threading supplemented with functional information. In addition, singular vectors may be useful for analyzing and annotating the characteristics of conserved sites in protein families.
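The decomposition described above can be illustrated with a short sketch. It assumes a PSSM stored as an array with one row per alignment position and one column per amino acid type, so that left singular vectors run over sites and right singular vectors run over the 20 amino acids. The function name `pssm_singular_components` and the random toy matrix are hypothetical, and the paper's own preprocessing is not reproduced here.

```python
import numpy as np


def pssm_singular_components(pssm, n_components=2):
    """Return the leading left vectors, singular values, and right vectors.

    pssm: (n_positions, 20) array (layout assumed for illustration).
    Left singular vectors have one entry per site; right singular
    vectors have one entry per amino acid type.
    """
    pssm = np.asarray(pssm, dtype=float)
    # Thin SVD: pssm = u @ diag(s) @ vt, singular values in descending order.
    u, s, vt = np.linalg.svd(pssm, full_matrices=False)
    return u[:, :n_components], s[:n_components], vt[:n_components, :]


# Toy stand-in for a real PSSM: 50 positions by 20 amino acid types.
rng = np.random.default_rng(0)
toy_pssm = rng.normal(size=(50, 20))
left, sigma, right = pssm_singular_components(toy_pssm)
```

Each row of `right` could then be compared against an amino acid index, and each column of `left` against per-site quantities such as conservation or contact number, for example with `np.corrcoef`.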
Affiliation(s)
- Akira R Kinjo
- Institute for Protein Research, Osaka University, Suita, Osaka, Japan.
|
30
|
Stout M, Bacardit J, Hirst JD, Krasnogor N. Prediction of recursive convex hull class assignments for protein residues. Bioinformatics 2008; 24:916-23. [DOI: 10.1093/bioinformatics/btn050] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Indexed: 11/14/2022]
|
31
|
Fourty G, Callebaut I, Mornon JP. Characterization of non-trivial neighborhood fold constraints from protein sequences using generalized topohydrophobicity. Bioinform Biol Insights 2008; 2:47-66. [PMID: 19812765 PMCID: PMC2735972 DOI: 10.4137/bbi.s426] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022]
Abstract
Prediction of key features of protein structures, such as secondary structure, solvent accessibility and number of contacts between residues, provides useful structural constraints for comparative modeling, fold recognition, ab-initio fold prediction and detection of remote relationships. In this study, we aim at characterizing the number of non-trivial close neighbors, or long-range contacts of a residue, as a function of its “topohydrophobic” index deduced from multiple sequence alignments and of the secondary structure in which it is embedded. The “topohydrophobic” index is calculated using a two-class distribution of amino acids, based on their mean atom depths. From a large set of structural alignments processed from the FSSP database, we selected 1485 structural sub-families including at least 8 members, with accurate alignments and limited redundancy. We show that residues within helices, even when deeply buried, have few non-trivial neighbors (0–2), whereas β-strand residues clearly exhibit a multimodal behavior, dominated by the local geometry of the tetrahedron (3 non-trivial close neighbors associated with one tetrahedron; 6 with two tetrahedra). This observed behavior allows the distinction, from sequence profiles, between edge and central β-strands within β-sheets. Useful topological constraints on the immediate neighborhood of an amino acid, but also on its correlated solvent accessibility, can thus be derived using this approach, from the simple knowledge of multiple sequence alignments.
Affiliation(s)
- Guillaume Fourty
- Département de Biologie Structurale, Institut de Minéralogie et de Physique des Milieux Condensés, CNRS UMR 7590 - Universités Paris 6/Paris 7, France
|
32
|
Pietropaolo A, Muccioli L, Berardi R, Zannoni C. A chirality index for investigating protein secondary structures and their time evolution. Proteins 2008; 70:667-77. [PMID: 17879347 DOI: 10.1002/prot.21578] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.1] [Indexed: 11/06/2022]
Abstract
We propose a methodology for the description of the secondary structure of proteins, based on assigning a chirality parameter to short amino acid sequences according to their arrangement in space at a certain time. We validated the method on ideal and crystalline structures, showing that it can assign secondary structures and that this assignment is robust with respect to random conformational perturbations. From the values of the index and its pattern along a sequence it is possible to recognize many structural motifs of a protein, and in particular poly-L-proline II left-handed helices, often not detected by secondary structure assignment algorithms. Assigning an instantaneous chirality index to the fragments also allows the dynamics to be studied. For this purpose, molecular dynamics simulations were carried out in water for selected hemoglobin (110 ns) and immunoglobulin antigen fragments (50 ns), showing the capability of the chiral index in identifying the stable secondary structure elements, as well as in following their time evolution and conformational changes during the trajectory.
Affiliation(s)
- Adriana Pietropaolo
- Dipartimento di Chimica Fisica ed Inorganica and INSTM, Università di Bologna, V.le Risorgimento, 4, I-40136 Bologna, Italy
|
34
|
Taylor WR. Protein knots and fold complexity: Some new twists. Comput Biol Chem 2007; 31:151-62. [PMID: 17500039 DOI: 10.1016/j.compbiolchem.2007.03.002] [Citation(s) in RCA: 78] [Impact Index Per Article: 4.6] [Received: 03/16/2007] [Accepted: 03/17/2007] [Indexed: 10/23/2022]
Abstract
The current knowledge on topological knots in protein structure is reviewed, considering in turn knots with three, four and five strand crossings. The latter is the most recent to be identified and has two distinct topological forms. The knot observed in the protein structure is the form that requires the fewest strand crossings to become un-knotted. The position of the chain termini must also correspond to a position that allows (un)knotting in one move. This is postulated as a general property of protein knots and other more complex knots with this property are proposed as the next most likely knots that might be found in a protein. It is also noted that the "Jelly-roll" fold found in some all-beta proteins would provide likely candidates. Alternative measures of knottedness and entanglement are reviewed, including the occurrence of slip-knots. These measures are related to the complexity of the protein fold and may provide useful filters for selecting predicted model structures.
Affiliation(s)
- William R Taylor
- Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK.
|
35
|
Paluszewski M, Hamelryck T, Winter P. Reconstructing protein structure from solvent exposure using tabu search. Algorithms Mol Biol 2006; 1:20. [PMID: 17069644 PMCID: PMC1635054 DOI: 10.1186/1748-7188-1-20] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 03/30/2006] [Accepted: 10/27/2006] [Indexed: 11/10/2022]
Abstract
Background A new, promising solvent exposure measure, called half-sphere-exposure (HSE), has recently been proposed. Here, we study the reconstruction of a protein's Cα trace solely from structure-derived HSE information. This problem is of relevance for de novo structure prediction using predicted HSE measures. For comparison, we also consider the well-established contact number (CN) measure. We define energy functions based on the HSE- or CN-vectors and minimize them using two conformational search heuristics: Monte Carlo simulation (MCS) and tabu search (TS). While MCS has been the dominant conformational search heuristic in the literature, TS has been applied only a few times. To discretize the conformational space, we use lattice models of varying complexity. Results The proposed TS heuristic with a novel tabu definition generally performs better than MCS for this problem. Our experiments show that, at least for small proteins (up to 35 amino acids), it is possible to reconstruct the protein backbone solely from the HSE or CN information. In general, the HSE measure leads to better models than the CN measure, as judged by the RMSD and the angle correlation with the native structure. The angle correlation, a measure of structural similarity, evaluates whether equivalent residues in two structures have the same general orientation. Our results indicate that the HSE measure is potentially very useful to represent solvent exposure in protein structure prediction, design and simulation.
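The two exposure measures compared above can be sketched in a few lines. This is an illustration only and covers just the computation of the measures, not the reconstruction procedure. It assumes Cα and Cβ coordinates are available for every residue (glycine would need a virtual Cβ), uses the Cα→Cβ vector to split the sphere, and picks a 13 Å radius as an example value. The function name `contact_number_and_hse` is hypothetical.

```python
import numpy as np


def contact_number_and_hse(ca, cb, radius=13.0):
    """Return (CN, HSE-up, HSE-down) arrays, one value per residue.

    ca, cb: (L, 3) arrays of Cα and Cβ coordinates.
    radius is an illustrative sphere radius in Å.
    """
    ca = np.asarray(ca, dtype=float)
    cb = np.asarray(cb, dtype=float)
    n_res = len(ca)
    cn = np.zeros(n_res, dtype=int)
    up = np.zeros(n_res, dtype=int)
    down = np.zeros(n_res, dtype=int)
    for i in range(n_res):
        vec = ca - ca[i]                 # vectors from residue i to every Cα
        dist = np.linalg.norm(vec, axis=1)
        near = dist < radius
        near[i] = False                  # a residue is not its own neighbor
        cn[i] = near.sum()
        # Sign of the projection onto the Cα→Cβ direction picks the half-sphere.
        side = vec @ (cb[i] - ca[i])
        up[i] = (near & (side >= 0)).sum()
        down[i] = (near & (side < 0)).sum()
    return cn, up, down


# Toy example: three residues on a line, side chains pointing along +x.
toy_ca = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [-5.0, 0.0, 0.0]])
toy_cb = toy_ca + np.array([1.0, 0.0, 0.0])
cn, hse_up, hse_down = contact_number_and_hse(toy_ca, toy_cb)
```

By construction the contact number equals the sum of the two half-sphere counts, which is one way to see that HSE carries strictly more information than CN.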
Affiliation(s)
- Martin Paluszewski
- Department of Computer Science, University of Copenhagen, Universitetsparken 1, 2100 Copenhagen, Denmark
| | - Thomas Hamelryck
- Bioinformatics Center, Institute of Molecular Biology, University of Copenhagen, Universitetsparken 15 building 10, 2100 Copenhagen, Denmark
- Pawel Winter
- Department of Computer Science, University of Copenhagen, Universitetsparken 1, 2100 Copenhagen, Denmark
36
|
Song J, Burrage K. Predicting residue-wise contact orders in proteins by support vector regression. BMC Bioinformatics 2006; 7:425. [PMID: 17014735 PMCID: PMC1618864 DOI: 10.1186/1471-2105-7-425] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.7] [Received: 05/26/2006] [Accepted: 10/03/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The residue-wise contact order (RWCO) describes the sequence separations between the residue of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable information for reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. RESULTS We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences. Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance, raising the CC to 0.57 and lowering the RMSE to 0.79. In addition, adding the secondary structure predicted by PSIPRED was found to significantly improve the prediction performance and yielded the best prediction accuracy, with a CC of 0.60 and RMSE of 0.78, which is at least comparable to the other existing methods. CONCLUSION The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool in extracting the protein sequence-structure relationship and in estimating the protein structural profiles from amino acid sequences.
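The two figures of merit quoted in this abstract, the Pearson correlation coefficient (CC) and the root mean square error (RMSE) between predicted and observed values, can be sketched as follows. This is a minimal illustration using the standard textbook formulas; the function names and the toy numbers are hypothetical, and any value normalization the authors may have applied before scoring is not reproduced here.

```python
import math

def pearson_cc(pred, obs):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_p * var_o)

def rmse(pred, obs):
    """Root mean square error between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred))

# Toy values (hypothetical, not from the paper).
predicted = [1.0, 2.0, 3.0, 4.0]
observed = [1.5, 1.5, 3.5, 3.5]
print(round(pearson_cc(predicted, observed), 3), round(rmse(predicted, observed), 3))
```

The two scores answer different questions: CC measures whether predictions rise and fall with the observed profile, while RMSE measures how far off the predicted values are in absolute terms, so reporting both is informative.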
Affiliation(s)
- Jiangning Song
- Advanced Computational Modelling Centre, The University of Queensland, Brisbane Qld 4072, Australia
- Kevin Burrage
- Advanced Computational Modelling Centre, The University of Queensland, Brisbane Qld 4072, Australia
37
|
Ishida T, Nakamura S, Shimizu K. Potential for assessing quality of protein structure based on contact number prediction. Proteins 2006; 64:940-7. [PMID: 16788993 DOI: 10.1002/prot.21047] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/06/2022]
Abstract
We developed a novel knowledge-based residue environment potential for assessing the quality of protein structures in protein structure prediction. The potential uses the contact number of residues in a protein structure and the absolute contact number of residues predicted from its amino acid sequence using a new prediction method based on support vector regression (SVR). The contact number of an amino acid residue in a protein structure is defined by the number of residues around a given residue. First, the contact number of each residue is predicted using SVR from the amino acid sequence of a target protein. Then, the potential of the protein structure is calculated from the probability distribution of the native contact numbers corresponding to the predicted ones. The performance of this potential is compared with other score functions using decoy structures, both for distinguishing the native structure from other structures and for distinguishing near-native structures from nonnative structures. This potential improves not only the ability to identify native structures from other structures but also the ability to discriminate near-native structures from nonnative structures.
Affiliation(s)
- Takashi Ishida
- Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan.
38
|
Kinjo AR, Nishikawa K. CRNPRED: highly accurate prediction of one-dimensional protein structures by large-scale critical random networks. BMC Bioinformatics 2006; 7:401. [PMID: 16952323 PMCID: PMC1578593 DOI: 10.1186/1471-2105-7-401] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.5] [Received: 04/07/2006] [Accepted: 09/05/2006] [Indexed: 11/28/2022] Open
Abstract
Background One-dimensional protein structures such as secondary structures or contact numbers are useful for three-dimensional structure prediction and helpful for intuitive understanding of the sequence-structure relationship. Accurate prediction methods will serve as a basis for these and other purposes. Results We implemented a program CRNPRED which predicts secondary structures, contact numbers and residue-wise contact orders. This program is based on a novel machine learning scheme called critical random networks. Unlike most conventional one-dimensional structure prediction methods which are based on local windows of an amino acid sequence, CRNPRED takes into account the whole sequence. CRNPRED achieves, on average per chain, Q3 = 81% for secondary structure prediction, and correlation coefficients of 0.75 and 0.61 for contact number and residue-wise contact order predictions, respectively. Conclusion CRNPRED will be a useful tool for computational as well as experimental biologists who need accurate one-dimensional protein structure predictions.
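For readers unfamiliar with the Q3 score quoted above, it is the percentage of residues whose predicted three-state secondary structure (conventionally helix H, strand E, coil C) matches the observed state. A minimal sketch of a per-chain average follows; the function names and the example strings are hypothetical and are not part of CRNPRED.

```python
def q3(predicted, observed):
    """Percentage of positions where the three-state labels agree."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

def mean_q3_per_chain(chains):
    """Average Q3 over chains; `chains` is a list of (predicted, observed) pairs."""
    return sum(q3(p, o) for p, o in chains) / len(chains)

# Toy three-state strings (hypothetical).
example_chains = [
    ("HHHEECCC", "HHHEEECC"),  # 7 of 8 positions agree
    ("CCEE", "CCEE"),          # all 4 positions agree
]
print(mean_q3_per_chain(example_chains))
```

Averaging per chain weights every protein equally regardless of length, which can give a different number from pooling all residues into one count.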
Affiliation(s)
- Akira R Kinjo
- Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Mishima, 411-8540, Japan
- Department of Genetics, The Graduate University for Advanced Studies (SOKENDAI), Mishima 411-8540, Japan
- Research Center for Structural and Functional Proteomics, Institute for Protein Research, Osaka University, 3-2 Suita, 565-0871, Japan
- Ken Nishikawa
- Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Mishima, 411-8540, Japan
- Department of Genetics, The Graduate University for Advanced Studies (SOKENDAI), Mishima 411-8540, Japan
39
|
Kinjo AR, Nishikawa K. Predicting secondary structures, contact numbers, and residue-wise contact orders of native protein structures from amino acid sequences using critical random networks. Biophysics (Nagoya-shi) 2005; 1:67-74. [PMID: 27857554 PMCID: PMC5036631 DOI: 10.2142/biophysics.1.67] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 07/22/2005] [Accepted: 10/20/2005] [Indexed: 12/01/2022] Open
Abstract
Predictions of one-dimensional protein structures such as secondary structures and contact numbers are useful for predicting three-dimensional structure and important for understanding the sequence-structure relationship. Here we present a new machine-learning method, critical random networks (CRNs), for predicting one-dimensional structures, and apply it, with position-specific scoring matrices, to the prediction of secondary structures (SS), contact numbers (CN), and residue-wise contact orders (RWCO). The present method achieves, on average, Q3 accuracy of 77.8% for SS, and correlation coefficients of 0.726 and 0.601 for CN and RWCO, respectively. The accuracy of the SS prediction is comparable to that obtained with other state-of-the-art methods, and accuracy of the CN prediction is a significant improvement over that with previous methods. We give a detailed formulation of the critical random networks-based prediction scheme, and examine the context-dependence of prediction accuracies. In order to study the nonlinear and multi-body effects, we compare the CRNs-based method with a purely linear method based on position-specific scoring matrices. Although not superior to the CRNs-based method, the surprisingly good accuracy achieved by the linear method highlights the difficulty in extracting structural features of higher order from an amino acid sequence beyond the information provided by the position-specific scoring matrices.
Affiliation(s)
- Akira R Kinjo
- Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Mishima 411-8540, Japan; Department of Genetics, The Graduate University for Advanced Studies (SOKENDAI), Mishima 411-8540, Japan
- Ken Nishikawa
- Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Mishima 411-8540, Japan; Department of Genetics, The Graduate University for Advanced Studies (SOKENDAI), Mishima 411-8540, Japan
40
|
Yuan Z. Better prediction of protein contact number using a support vector regression analysis of amino acid sequence. BMC Bioinformatics 2005; 6:248. [PMID: 16221309 PMCID: PMC1277819 DOI: 10.1186/1471-2105-6-248] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.6] [Received: 07/04/2005] [Accepted: 10/13/2005] [Indexed: 11/10/2022] Open
Abstract
Background Protein tertiary structure can be partly characterized via each amino acid's contact number measuring how residues are spatially arranged. The contact number of a residue in a folded protein is a measure of its exposure to the local environment, and is defined as the number of Cβ atoms in other residues within a sphere around the Cβ atom of the residue of interest. Contact number is partly conserved between protein folds and thus is useful for protein fold and structure prediction. In turn, each residue's contact number can be partially predicted from primary amino acid sequence, assisting tertiary fold analysis from sequence data. In this study, we provide a more accurate contact number prediction method from protein primary sequence. Results We predict contact number from protein sequence using a novel support vector regression algorithm. Using protein local sequences with multiple sequence alignments (PSI-BLAST profiles), we demonstrate a correlation coefficient between predicted and observed contact numbers of 0.70, which outperforms previously achieved accuracies. Including additional information about sequence weight and amino acid composition further improves prediction accuracies significantly with the correlation coefficient reaching 0.73. If residues are classified as being either "contacted" or "non-contacted", the prediction accuracies are all greater than 77%, regardless of the choice of classification thresholds. Conclusion The successful application of support vector regression to the prediction of protein contact number reported here, together with previous applications of this approach to the prediction of protein accessible surface area and B-factor profile, suggests that a support vector regression approach may be very useful for determining the structure-function relation between primary protein sequence and higher order consecutive protein structural and functional properties.
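The contact number definition given in this abstract (the number of Cβ atoms of other residues within a sphere around a residue's Cβ atom) translates directly into a short computation. The sketch below is illustrative only: the function name `contact_numbers`, the 12 Å default radius, and the toy coordinates are assumptions, since the abstract does not state the radius used.

```python
import math

def contact_numbers(cb_coords, radius=12.0):
    """Count, for each residue, the C-beta atoms of other residues within `radius`.

    cb_coords: list of (x, y, z) tuples, one C-beta position per residue.
    radius: sphere radius in angstroms (illustrative assumption).
    """
    n = len(cb_coords)
    counts = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            # Contacts are symmetric, so each qualifying pair counts for both residues.
            if math.dist(cb_coords[i], cb_coords[j]) <= radius:
                counts[i] += 1
                counts[j] += 1
    return counts

# Toy example (hypothetical): four pseudo-residues on a line, 5 angstroms apart.
toy_cb_line = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
print(contact_numbers(toy_cb_line))
```

The two-state "contacted" versus "non-contacted" labels mentioned in the abstract would then follow from comparing each count with a chosen threshold, for example `[c >= 3 for c in contact_numbers(toy_cb_line)]`, where the threshold of 3 is again only a placeholder.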
Affiliation(s)
- Zheng Yuan
- Institute for Molecular Bioscience, ARC Centre in Bioinformatics, The University of Queensland, St. Lucia, 4072, Australia.
41
|
Kinjo AR, Nishikawa K. Recoverable one-dimensional encoding of three-dimensional protein structures. Bioinformatics 2005; 21:2167-70. [PMID: 15722374 DOI: 10.1093/bioinformatics/bti330] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Indexed: 11/13/2022] Open
Abstract
One-dimensional (1D) structures of proteins such as secondary structure and contact number provide intuitive pictures to understand how the native three-dimensional (3D) structure of a protein is encoded in the amino acid sequence. However, it is still not clear whether a given set of 1D structures contains sufficient information for recovering the underlying 3D structure. Here we show that the 3D structure of a protein can be recovered from a set of three types of 1D structures, namely, secondary structure, contact number and residue-wise contact order, which is introduced here for the first time. Using simulated annealing molecular dynamics simulations, the structures satisfying the given native 1D structural restraints were sought for 16 proteins of various structural classes and of sizes ranging from 56 to 146 residues. By selecting the structures best satisfying the restraints, all the proteins showed a coordinate RMS deviation of <4 Å from the native structure, and, for most of them, the deviation was even <2 Å. The present result opens a new possibility for protein structure prediction and for our understanding of the sequence-structure relationship.
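The abstract introduces residue-wise contact order without giving its formula. As a hedged sketch, one simple reading is that each residue accumulates the sequence separation |i - j| to every residue j it contacts in space. The hard distance cutoff, the minimum sequence separation, the optional division by chain length, and the function name below are all illustrative assumptions and may differ from the exact definition used in the paper.

```python
import math

def residue_wise_contact_order(coords, cutoff=12.0, min_separation=3, normalize=True):
    """Sum of sequence separations |i - j| over spatial contacts of each residue.

    coords: one (x, y, z) tuple per residue (for example C-beta positions).
    cutoff, min_separation, normalize: illustrative assumptions, not paper values.
    """
    n = len(coords)
    rwco = [0.0] * n
    for i in range(n):
        for j in range(n):
            separation = abs(i - j)
            # Skip the residue itself and its near neighbours along the chain.
            if separation < min_separation:
                continue
            if math.dist(coords[i], coords[j]) <= cutoff:
                rwco[i] += separation
    if normalize:
        rwco = [value / n for value in rwco]
    return rwco

# Toy hairpin (hypothetical): two antiparallel strands, 4 angstroms apart.
toy_hairpin = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (8, 4, 0), (4, 4, 0), (0, 4, 0)]
print(residue_wise_contact_order(toy_hairpin, cutoff=5.0, normalize=False))
```

In the toy hairpin the chain ends score highest because their spatial partners are far away in sequence, which is the kind of long-range information that a plain contact count does not capture.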
Affiliation(s)
- Akira R Kinjo
- Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Mishima 411-8540, Japan.