1
Meysman P, Zhou C, Cule B, Goethals B, Laukens K. Mining the entire Protein DataBank for frequent spatially cohesive amino acid patterns. BioData Min 2015; 8:4. [PMID: 25657820 PMCID: PMC4318390 DOI: 10.1186/s13040-015-0038-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 07/09/2014] [Accepted: 01/18/2015] [Indexed: 11/10/2022]
Abstract
BACKGROUND The three-dimensional structure of a protein is an essential aspect of its functionality. Despite the large diversity in protein structures and functionality, it is known that there are common patterns and preferences in the contacts between amino acid residues, or between residues and other biomolecules, such as DNA. The discovery and characterization of these patterns is an important research topic within structural biology as it can give fundamental insight into protein structures and can aid in the prediction of unknown structures.
RESULTS Here we apply an efficient spatial pattern miner to search for sets of amino acids that occur frequently in close spatial proximity in the protein structures of the Protein DataBank. This allowed us to mine for a new class of amino acid patterns, that we term FreSCOs (Frequent Spatially Cohesive Component sets), which feature synergetic combinations. To demonstrate the relevance of these FreSCOs, they were compared in relation to the thermostability of the protein structure and the interaction preferences of DNA-protein complexes. In both cases, the results matched well with prior investigations using more complex methods on smaller data sets.
CONCLUSIONS The currently characterized protein structures feature a diverse set of frequent amino acid patterns that can be related to the stability of the protein molecular structure and that are independent from protein function or specific conserved domains.
Affiliation(s)
- Pieter Meysman
- Advanced Database Research and Modelling (ADReM), Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
- Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
- Cheng Zhou
- Advanced Database Research and Modelling (ADReM), Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
- Boris Cule
- Advanced Database Research and Modelling (ADReM), Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
- Bart Goethals
- Advanced Database Research and Modelling (ADReM), Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
- Kris Laukens
- Advanced Database Research and Modelling (ADReM), Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
- Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
2
Sun W, He J. Understanding on the residue contact network using the log-normal cluster model and the multilevel wheel diagram. Biopolymers 2010; 93:904-16. [DOI: 10.1002/bip.21494] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/06/2022]
3
Zhong W, Altun G, Harrison R, Tai PC, Pan Y. Improved K-means clustering algorithm for exploring local protein sequence motifs representing common structural property. IEEE Trans Nanobioscience 2005; 4:255-65. [PMID: 16220690 DOI: 10.1109/tnb.2005.853667] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.1] [Indexed: 11/06/2022]
Abstract
Information about local protein sequence motifs is very important to the analysis of biologically significant conserved regions of protein sequences. These conserved regions can potentially determine the diverse conformation and activities of proteins. In this work, recurring sequence motifs of proteins are explored with an improved K-means clustering algorithm on a new dataset. The structural similarity of these recurring sequence clusters to produce sequence motifs is studied in order to evaluate the relationship between sequence motifs and their structures. To the best of our knowledge, the dataset used by our research is the most updated dataset among similar studies for sequence motifs. A new greedy initialization method for the K-means algorithm is proposed to improve traditional K-means clustering techniques. The new initialization method tries to choose suitable initial points, which are well separated and have the potential to form high-quality clusters. Our experiments indicate that the improved K-means algorithm satisfactorily increases the percentage of sequence segments belonging to clusters with high structural similarity. Careful comparison of sequence motifs obtained by the improved and traditional algorithms also suggests that the improved K-means clustering algorithm may discover some relatively weak and subtle sequence motifs, which are undetectable by the traditional K-means algorithms. Many biochemical tests reported in the literature show that these sequence motifs are biologically meaningful. Experimental results also indicate that the improved K-means algorithm generates more detailed sequence motifs representing common structures than previous research. Furthermore, these motifs are universally conserved sequence patterns across protein families, overcoming some weak points of other popular sequence motifs. 
The satisfactory result of the experiment suggests that this new K-means algorithm may be applied to other areas of bioinformatics research in order to explore the underlying relationships between data samples more effectively.
Affiliation(s)
- Wei Zhong
- Computer Science Department, Georgia State University, Atlanta, GA 30303-4110, USA.
4
Gromiha MM, Selvaraj S. Inter-residue interactions in protein folding and stability. Prog Biophys Mol Biol 2004; 86:235-77. [PMID: 15288760 DOI: 10.1016/j.pbiomolbio.2003.09.003] [Citation(s) in RCA: 225] [Impact Index Per Article: 11.3] [Indexed: 12/01/2022]
Abstract
During the process of protein folding, the amino acid residues along the polypeptide chain interact with each other in a cooperative manner to form the stable native structure. The knowledge about inter-residue interactions in protein structures is very helpful to understand the mechanism of protein folding and stability. In this review, we introduce the classification of inter-residue interactions into short, medium and long range based on a simple geometric approach. The features of these interactions in different structural classes of globular and membrane proteins, and in various folds have been delineated. The development of contact potentials and the application of inter-residue contacts for predicting the structural class and secondary structures of globular proteins, solvent accessibility, fold recognition and ab initio tertiary structure prediction have been evaluated. Further, the relationship between inter-residue contacts and protein-folding rates has been highlighted. Moreover, the importance of inter-residue interactions in protein-folding kinetics and for understanding the stability of proteins has been discussed. In essence, the information gained from the studies on inter-residue interactions provides valuable insights for understanding protein folding and de novo protein design.
Affiliation(s)
- M Michael Gromiha
- Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Aomi Frontier Building 17F, 2-43 Aomi, Koto-ku, Tokyo 135-0064, Japan.
5
Kumarevel TS, Gromiha MM, Selvaraj S, Gayatri K, Kumar PKR. Influence of medium- and long-range interactions in different folding types of globular proteins. Biophys Chem 2002; 99:189-98. [PMID: 12377369 DOI: 10.1016/s0301-4622(02)00183-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Indexed: 10/27/2022]
Abstract
Recognition of protein fold from amino acid sequence is a challenging task. The structure and stability of proteins from different fold are mainly dictated by inter-residue interactions. In our earlier work, we have successfully used the medium- and long-range contacts for predicting the protein folding rates, discriminating globular and membrane proteins and for distinguishing protein structural classes. In this work, we analyze the role of inter-residue interactions in commonly occurring folds of globular proteins in order to understand their folding mechanisms. In the medium-range contacts, the globin fold and four-helical bundle proteins have more contacts than that of DNA-RNA fold although they all belong to all-alpha class. In long-range contacts, only the ribonuclease fold prefers 4-10 range and the other folding types prefer the range 21-30 in alpha/beta class proteins. Further, the preferred residues and residue pairs influenced by these different folds are discussed. The information about the preference of medium- and long-range contacts exhibited by the 20 amino acid residues can be effectively used to predict the folding type of each protein.
Affiliation(s)
- T S Kumarevel
- National Institute of Advanced Industrial Science and Technology (AIST), Institute of Molecular and Cell Biology, Functional Nucleic Acids Group, Tsukuba Central 6, 1-1 Higashi, Tsukuba Science City, Ibaraki, Japan.
6
Fariselli P, Olmea O, Valencia A, Casadio R. Prediction of contact maps with neural networks and correlated mutations. Protein Eng 2001; 14:835-43. [PMID: 11742102 DOI: 10.1093/protein/14.11.835] [Citation(s) in RCA: 149] [Impact Index Per Article: 6.5] [Indexed: 11/12/2022]
Abstract
Contact maps of proteins are predicted with neural network-based methods, using as input codings of increasing complexity including evolutionary information, sequence conservation, correlated mutations and predicted secondary structures. Neural networks are trained on a data set comprising the contact maps of 173 non-homologous proteins as computed from their well resolved three-dimensional structures. Proteins are selected from the Protein Data Bank database provided that they align with at least 15 similar sequences in the corresponding families. The predictors are trained to learn the association rules between the covalent structure of each protein and its contact map with a standard back propagation algorithm and tested on the same protein set with a cross-validation procedure. Our results indicate that the method can assign protein contacts with an average accuracy of 0.21 and with an improvement over a random predictor of a factor >6, which is higher than that previously obtained with methods only based either on neural networks or on correlated mutations. Furthermore, filtering the network outputs with a procedure based on the residue coordination numbers, the accuracy of predictions increases up to 0.25 for all the proteins, with an 8-fold deviation from a random predictor. These scores are the highest reported so far for predicting protein contact maps.
Affiliation(s)
- P Fariselli
- CIRB and Department of Biology, University of Bologna, via Irnerio 42, Bologna, Italy
7
Simon I, Fiser A, Tusnády GE. Predicting protein conformation by statistical methods. Biochim Biophys Acta 2001; 1549:123-36. [PMID: 11690649 DOI: 10.1016/s0167-4838(01)00253-9] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 10/18/2022]
Abstract
The unique folded structure makes a polypeptide a functional protein. The number of known sequences is about a hundred times larger than the number of known structures and the gap is increasing rapidly. The primary goal of all structure prediction methods is to obtain structure-related information on proteins, whose structures have not been determined experimentally. Besides this goal, the development of accurate prediction methods helps to reveal principles of protein folding. Here we present a brief survey of protein structure predictions based on statistical analyses of known sequence and structure data. We discuss the background of these methods and attempt to elucidate principles, which govern structure formation of soluble and membrane proteins.
Affiliation(s)
- I Simon
- Institute of Enzymology, BRC, Hungarian Academy of Sciences, Budapest, Hungary.
8
Webber CL, Giuliani A, Zbilut JP, Colosimo A. Elucidating protein secondary structures using alpha-carbon recurrence quantifications. Proteins 2001; 44:292-303. [PMID: 11455602 DOI: 10.1002/prot.1094] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.4] [Indexed: 11/07/2022]
Abstract
Secondary structures of proteins were studied by recurrence quantification analysis (RQA). High-resolution, 3-dimensional coordinates of alpha-carbon atoms comprising a set of 68 proteins were downloaded from the Protein Data Bank. By fine-tuning four recurrence parameters (radius, line, residue, separation), it was possible to establish excellent agreement between percent contribution of alpha-helix and beta-sheet structures determined independently by RQA and that of the DSSP algorithm (Define Secondary Structure of Proteins). These results indicate that there is an equivalency between these two techniques, which are based upon totally different pattern recognition strategies. RQA enhances qualitative contact maps by quantifying the arrangements of recurrent points of alpha carbons close in 3-dimensional space. For example, the radius was systematically increased, moving the analysis beyond local alpha-carbon neighborhoods in order to capture super-secondary and tertiary structures. However, differences between proteins could only be detected within distances up to about 6-11 Å, but not higher. This result underscores the complexity of alpha-carbon spacing when super-secondary structures appear at larger distances. Finally, RQA-defined secondary structures were found to be robust against random displacement of alpha carbons upwards of 1 Å. This finding has potential import for the dynamic functions of proteins in motion.
Affiliation(s)
- C L Webber
- Department of Physiology, Loyola University Chicago, Stritch School of Medicine, Maywood, Illinois 60153, USA
9
Gromiha MM, Thangakani AM. Role of medium- and long-range interactions to the stability of the mutants of T4 lysozyme. Prep Biochem Biotechnol 2001; 31:217-27. [PMID: 11513088 DOI: 10.1081/pb-100104905] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 11/03/2022]
Abstract
Inter-residue interactions play an important role to the folding and stability of protein molecules. In this work, we analyze the role of medium- and long-range interactions to the stability of T4 lysozyme mutants. We found that, in buried mutations, the increase in long-range contacts upon mutations destabilizes the protein, whereas, in surface mutations, the increase in long-range contacts increases the stability, indicating the importance of surrounding polar residues to the stability of surface mutations. Further, the increase in medium-range contacts decreases the stability of buried and surface mutations and a direct relationship is observed between the increase of medium-range contacts and increase in stability for partially buried/exposed mutations. Moreover, the relationship between amino acid properties and stability of T4 lysozyme mutants at positions Ile3, Phe53, and Leu99 showed that the effect of medium- and long-range contacts is less for buried mutations and the inter-residue contacts have significant correlation with the stability of partially buried mutations.
Affiliation(s)
- M M Gromiha
- RIKEN Tsukuba Institute, The Institute of Physical and Chemical Research, Ibaraki, Japan.
10
Gromiha MM, Selvaraj S. Influence of medium and long range interactions in protein folding. Prep Biochem Biotechnol 1999; 29:339-51. [PMID: 10548251 DOI: 10.1080/10826069908544933] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Indexed: 10/22/2022]
Abstract
Protein structures are stabilized by both local and long range interactions. In this work, we analyze the residue-residue contacts and the role of medium- and long-range interactions in globular proteins belonging to different structural classes. The results show that while medium range interactions predominate in all-alpha class proteins, long-range interactions predominate in all-beta class. Based on this, we analyze the performance of several structure prediction methods in different structural classes of globular proteins and found that all the methods predict the secondary structures of all-alpha proteins more accurately than other classes. Also, we observed that the residues occurring in the range of 21-30 residues apart contribute more towards long-range contacts and about 85% of residues are involved in long-range contacts. Further, the preference of residue pairs to the folding and stability of globular proteins is discussed.
Affiliation(s)
- M M Gromiha
- RIKEN Life Science Center, The Institute of Physical and Chemical Research, Tsukuba, Ibaraki, Japan
11
Abstract
Long-range interactions play an active role in the stability of protein molecules. In this work, we have analyzed the importance of long-range interactions in different structural classes of globular proteins in terms of residue distances. We found that 85% of residues are involved in long-range contacts. The residues occurring in the range of 4-10 residues apart contribute more towards long-range contacts in all-alpha proteins while the range is 11-20 in all-beta proteins. The hydrophobic residues Cys, Ile and Val prefer the 11-20 range and all other residues prefer the 4-10 range. The residues in all-beta proteins have an average of 3-8 long-range contacts whereas the residues in other classes have 1-4 long-range contacts. Furthermore, the preference of residue pairs to the folding and stability will be discussed.
Affiliation(s)
- M M Gromiha
- Institute of Physical and Chemical Research (RIKEN), Tsukuba Life Science Center, Ibaraki, Japan.
12
Fariselli P, Casadio R. A neural network based predictor of residue contacts in proteins. Protein Eng 1999; 12:15-21. [PMID: 10065706 DOI: 10.1093/protein/12.1.15] [Citation(s) in RCA: 115] [Impact Index Per Article: 4.6] [Indexed: 11/12/2022]
Abstract
We describe a method based on neural networks for predicting contact maps of proteins using as input chemicophysical and evolutionary information. Neural networks are trained on a data set comprising the contact maps of 200 non-homologous proteins of well resolved three-dimensional structures. The systems learn the association rules between the covalent structure of each protein and its correspondent contact map by means of a standard back propagation algorithm. Validation of the predictor on the training set and on 408 proteins of known structure which are not homologous to those contained in the training set indicate that this method scores higher than statistical approaches previously described and based on correlated mutations and sequence information.
Affiliation(s)
- P Fariselli
- Biocomputing Group (Centro Interdipartimentale per le Ricerche Biotecnologiche), Bologna, Italy