1
Chen Y, Chen F, Yang JY, Yang MQ. Ensemble voting system for multiclass protein fold recognition. Int J Pattern Recogn 2011. [DOI: 10.1142/s0218001408006454] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 11/18/2022]
Abstract
Protein structure classification is an important issue in understanding the associations between sequence and structure as well as possible functional and evolutionary relationships. Recently, structural genomics initiatives and other high-throughput experiments have populated the biological databases at a rapid pace. In this paper, three types of classifiers (k nearest neighbors, class center and nearest neighbor, and probabilistic neural networks) and their homogeneous ensembles are first evaluated on the multiclass protein fold recognition problem, and then a heterogeneous ensemble voting system is designed for the same problem. Different features and/or their combinations extracted from the protein fold dataset are used in these classification models. The heterogeneous classification results are then put into a voting system to get the final result. The experimental results show that the proposed method can improve prediction accuracy by 4%–10% on a benchmark dataset containing 27 SCOP folds.
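The voting step described in this abstract can be illustrated with a minimal sketch. The abstract does not state the exact voting rule, so plain plurality voting with ties broken in favor of the earliest classifier is an assumption here, and the function name and fold labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among the base classifiers' predictions.

    Assumption: plurality voting; on a tie, the label proposed by the
    earliest classifier in the list wins.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    # Scan in classifier order so ties resolve deterministically.
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical fold labels from three heterogeneous base classifiers.
votes = ["fold_7", "fold_3", "fold_7"]
print(majority_vote(votes))
```

A weighted variant, in which each classifier's vote is scaled by its validation accuracy, would be a natural extension, but nothing in the abstract confirms that the authors used one.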
Affiliation(s)
- Yuehui Chen
- School of Information Science and Engineering, University of Jinan, 106 Jiwei Road, 250022 Jinan, P. R. China
- Feng Chen
- School of Software, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
- Jack Y. Yang
- Harvard Medical School, Harvard University, P.O. Box 400888, Cambridge, MA 02140, USA
- Mary Qu Yang
- National Human Genome Research Institute, National Institutes of Health, US Department of Health and Human Services, Bethesda, MD 20852, USA
2
Taguchi YH, Gromiha MM. Application of amino acid occurrence for discriminating different folding types of globular proteins. BMC Bioinformatics 2007; 8:404. [PMID: 17953741 PMCID: PMC2174517 DOI: 10.1186/1471-2105-8-404] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.9] [Received: 06/01/2007] [Accepted: 10/22/2007] [Indexed: 11/10/2022]
Abstract
Background: Predicting the three-dimensional structure of a protein from its amino acid sequence is a long-standing goal in computational/molecular biology. The discrimination of different structural classes and folding types are intermediate steps in protein structure prediction. Results: In this work, we have proposed a method based on linear discriminant analysis (LDA) for discriminating 30 different folding types of globular proteins using amino acid occurrence. Our method was tested with a non-redundant set of 1612 proteins and it discriminated them with the accuracy of 38%, which is comparable to or better than other methods in the literature. A web server has been developed for discriminating the folding type of a query protein from its amino acid sequence and it is available at http://granular.com/PROLDA/. Conclusion: Amino acid occurrence has been successfully used to discriminate different folding types of globular proteins. The discrimination accuracy obtained with amino acid occurrence is better than that obtained with amino acid composition and/or amino acid properties. In addition, the method is very fast to obtain the results.
Affiliation(s)
- Y-h Taguchi
- Department of Physics, Faculty of Science and Technology, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan.
3
Gromiha MM. Motifs in outer membrane protein sequences: Applications for discrimination. Biophys Chem 2005; 117:65-71. [PMID: 15905018 DOI: 10.1016/j.bpc.2005.04.005] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Received: 02/24/2005] [Revised: 04/01/2005] [Accepted: 04/01/2005] [Indexed: 12/31/2022]
Abstract
Discriminating outer membrane proteins (OMPs) from other folding types of globular and membrane proteins is an important problem for predicting their secondary and tertiary structures and detecting outer membrane proteins from genomic sequences as well. In this work, we have systematically analyzed the distribution of amino acid residues in the sequences of globular and outer membrane proteins with several motifs, such as A*B, A**B, etc. We observed that the motifs E*L, A*K and L*E occur frequently in globular proteins while S*S, N*S and R*D predominantly occur in OMPs. We have devised a statistical method based on frequently occurring motifs in globular proteins and OMPs and obtained an accuracy of 96% and 82% for correctly identifying OMPs and excluding globular proteins, respectively. Further, we noticed that the motifs of transmembrane helical (TMH) proteins are different from those of OMPs. While I*A, I*L and L*I are preferred in TMH proteins, S*S, N*S and N*N predominantly occur in OMPs. The information about the occurrence of A*B motifs in TMH proteins and OMPs could discriminate them with an accuracy of 80% for excluding OMPs and 100% for identifying OMPs. The influence of protein size and structural class for discrimination is discussed.
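Counting motifs of the A*B and A**B kind can be sketched as follows. This reads each `*` as a wildcard for exactly one arbitrary residue, which is an interpretation of the abstract rather than something it states; the function name and the example sequences are hypothetical.

```python
from collections import Counter


def gapped_motif_counts(sequence, gap=1):
    """Count residue pairs separated by `gap` arbitrary residues.

    With gap=1 the keys look like "S*S"; with gap=2 they look like "E**L".
    Assumption: each "*" stands for exactly one residue of any type.
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")
    step = gap + 1
    counts = Counter()
    for i in range(len(sequence) - step):
        motif = sequence[i] + "*" * gap + sequence[i + step]
        counts[motif] += 1
    return counts


# Hypothetical short sequence, used only to show the counting.
print(gapped_motif_counts("SASNS"))
```

Comparing such counts between two sets of proteins (for example, normalizing by sequence length and ranking motifs by the difference) would be one way to find discriminating motifs, though the abstract does not describe the statistical procedure in enough detail to reproduce it.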
Affiliation(s)
- M Michael Gromiha
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), AIST Tokyo Waterfront Bio-IT Research Building, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
4
Zhang GZ, Huang DS. Prediction of inter-residue contacts map based on genetic algorithm optimized radial basis function neural network and binary input encoding scheme. J Comput Aided Mol Des 2005; 18:797-810. [PMID: 16075311 DOI: 10.1007/s10822-005-0578-7] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Received: 04/29/2004] [Accepted: 12/14/2004] [Indexed: 10/25/2022]
Abstract
Inter-residue contacts map prediction is one of the most important intermediate steps to the protein folding problem. In this paper, we focus on the problem of protein inter-residue contacts map prediction based on neural network technique. Firstly, we use a genetic algorithm (GA) to optimize the radial basis function widths and hidden centers of a radial basis function neural network (RBFNN), then a novel binary encoding scheme is employed to train the network for the purpose of learning and predicting the inter-residue contact patterns of protein sequences obtained from the Protein Data Bank (PDB). The experimental evidence indicates the utility of our proposed encoding strategy and GA optimized RBFNN. Moreover, the simulation results demonstrate that the network performs better for proteins whose residue length falls in the range (100, 300), and that the prediction accuracy with a contact threshold of 7 Angstroms is higher than with the other three thresholds of 5, 6, and 8 Angstroms.
Affiliation(s)
- Guang-Zheng Zhang
- Intelligent Computing Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences
5
Gromiha MM, Ahmad S, Suwa M. Application of residue distribution along the sequence for discriminating outer membrane proteins. Comput Biol Chem 2005; 29:135-42. [PMID: 15833441 DOI: 10.1016/j.compbiolchem.2005.02.006] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.3] [Received: 02/01/2005] [Revised: 02/22/2005] [Accepted: 02/22/2005] [Indexed: 12/01/2022]
Abstract
Discriminating outer membrane proteins from other folding types of globular and membrane proteins is an important problem both for detecting outer membrane proteins from genomic sequences and for the successful prediction of their secondary and tertiary structures. In this work, we have systematically analyzed the distribution of amino acid residues in the sequences of globular and outer membrane proteins. We observed that the occurrence of two neighboring aliphatic and polar residues is significantly higher in outer membrane proteins than in globular proteins. From the information about the dipeptide composition we have devised a statistical method for discriminating outer membrane proteins from other globular and membrane proteins. Our approach correctly picked up the outer membrane proteins with an accuracy of 95% for the training set of 337 proteins. On the other hand, our method has correctly excluded the globular proteins at an accuracy of 79% in a non-redundant dataset of 674 proteins. Furthermore, the present method is able to correctly exclude alpha-helical membrane proteins up to an accuracy of 87%. These accuracy levels are comparable to other methods in the literature. The influence of protein size and structural class for discrimination is discussed.
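The dipeptide composition mentioned in this abstract can be computed along these lines. This is a generic sketch that normalizes overlapping dipeptide counts by the number of dipeptides in the sequence; whether the authors used exactly this normalization is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def dipeptide_composition(sequence):
    """Return the fraction of each overlapping dipeptide in a sequence.

    Assumption: counts are normalized by the total number of overlapping
    dipeptides, which is len(sequence) - 1.
    """
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    if not pairs:
        return {}
    counts = Counter(pairs)
    total = len(pairs)
    return {dipeptide: n / total for dipeptide, n in counts.items()}


# Hypothetical toy sequence with alternating aliphatic and polar residues.
print(dipeptide_composition("AVSVAS"))
```

A discrimination rule could then compare a query protein's composition against average compositions of each protein class, but the abstract does not specify the scoring function.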
Affiliation(s)
- M Michael Gromiha
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), AIST Tokyo Waterfront Bio-IT Research Building, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan.
6
Chen J, Chaudhari N. Bidirectional segmented-memory recurrent neural network for protein secondary structure prediction. Soft Comput 2005. [DOI: 10.1007/s00500-005-0489-5] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Indexed: 11/29/2022]
7
Gromiha MM, Selvaraj S. Inter-residue interactions in protein folding and stability. Prog Biophys Mol Biol 2004; 86:235-77. [PMID: 15288760 DOI: 10.1016/j.pbiomolbio.2003.09.003] [Citation(s) in RCA: 225] [Impact Index Per Article: 11.3] [Indexed: 12/01/2022]
Abstract
During the process of protein folding, the amino acid residues along the polypeptide chain interact with each other in a cooperative manner to form the stable native structure. The knowledge about inter-residue interactions in protein structures is very helpful to understand the mechanism of protein folding and stability. In this review, we introduce the classification of inter-residue interactions into short, medium and long range based on a simple geometric approach. The features of these interactions in different structural classes of globular and membrane proteins, and in various folds have been delineated. The development of contact potentials and the application of inter-residue contacts for predicting the structural class and secondary structures of globular proteins, solvent accessibility, fold recognition and ab initio tertiary structure prediction have been evaluated. Further, the relationship between inter-residue contacts and protein-folding rates has been highlighted. Moreover, the importance of inter-residue interactions in protein-folding kinetics and for understanding the stability of proteins has been discussed. In essence, the information gained from the studies on inter-residue interactions provides valuable insights for understanding protein folding and de novo protein design.
Affiliation(s)
- M Michael Gromiha
- Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Aomi Frontier Building 17F, 2-43 Aomi, Koto-ku, Tokyo 135-0064, Japan.
8
Gromiha MM, Suwa M. A simple statistical method for discriminating outer membrane proteins with better accuracy. Bioinformatics 2004; 21:961-8. [PMID: 15531602 DOI: 10.1093/bioinformatics/bti126] [Citation(s) in RCA: 87] [Impact Index Per Article: 4.4] [Indexed: 11/13/2022]
Abstract
Motivation: Discriminating outer membrane proteins from other folding types of globular and membrane proteins is an important task both for identifying outer membrane proteins from genomic sequences and for the successful prediction of their secondary and tertiary structures. Results: We have systematically analyzed the amino acid composition of globular proteins from different structural classes and outer membrane proteins. We found that the residues, Glu, His, Ile, Cys, Gln, Asn and Ser, show a significant difference between globular and outer membrane proteins. Based on this information, we have devised a statistical method for discriminating outer membrane proteins from other globular and membrane proteins. Our approach correctly picked up the outer membrane proteins with an accuracy of 89% for the training set of 337 proteins. On the other hand, our method has correctly excluded the globular proteins at an accuracy of 79% in a non-redundant dataset of 674 proteins. Furthermore, the present method is able to correctly exclude alpha-helical membrane proteins up to an accuracy of 80%. These accuracy levels are comparable to other methods in the literature, and this is a simple method, which could be used for dissecting outer membrane proteins from genomic sequences. The influence of protein size, structural class and specific residues for discrimination is discussed.
Affiliation(s)
- M Michael Gromiha
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST) Aomi Frontier Building 17F, 2-43 Aomi, Koto-ku, Tokyo 135-0064, Japan.
9
Ahmad S, Gromiha MM, Sarai A. Real value prediction of solvent accessibility from amino acid sequence. Proteins 2003; 50:629-35. [PMID: 12577269 DOI: 10.1002/prot.10328] [Citation(s) in RCA: 159] [Impact Index Per Article: 7.6] [Indexed: 11/12/2022]
Abstract
The solvent accessibility of amino acid residues has been predicted in the past by classifying them into exposure states with varying thresholds. This classification provides a wide range of values for the accessible surface area (ASA) within which a residue may fall. Thus far, no attempt has been made to predict real values of ASA from the sequence information without a priori classification into exposure states. Here, we present a new method with which to predict real value ASAs for residues, based on neighborhood information. Our real value prediction neural network could estimate the ASA for four different nonhomologous, nonredundant data sets of varying size, with 18.0-19.5% mean absolute error, defined as per residue absolute difference between the predicted and experimental values of relative ASA. Correlation between the predicted and experimental values ranged from 0.47 to 0.50. It was observed that the ASA of a residue could be predicted within a 23.7% mean absolute error, even when no information about its neighbors is included. Prediction of real values answers the issue of arbitrary choice of ASA state thresholds, and carries more information than category prediction. Prediction error for each residue type strongly correlates with the variability in its experimental ASA values.
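The error measure defined in this abstract, the per-residue absolute difference between predicted and experimental relative ASA averaged over residues, can be written as a short sketch. The function name and the sample values are hypothetical and serve only to show the arithmetic.

```python
def mean_absolute_error(predicted, observed):
    """Average per-residue absolute difference between two value lists.

    Both inputs are relative ASA values in percent, one entry per residue.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - o) for p, o in zip(predicted, observed))
    return total / len(predicted)


# Hypothetical relative ASA values (percent) for four residues.
predicted_asa = [10.0, 50.0, 35.0, 80.0]
observed_asa = [20.0, 30.0, 35.0, 70.0]
print(mean_absolute_error(predicted_asa, observed_asa))
```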
10
Luo RY, Feng ZP, Liu JK. Prediction of protein structural class by amino acid and polypeptide composition. Eur J Biochem 2002; 269:4219-25. [PMID: 12199700 DOI: 10.1046/j.1432-1033.2002.03115.x] [Citation(s) in RCA: 110] [Impact Index Per Article: 5.0] [Indexed: 11/20/2022]
Abstract
A new approach to predicting structural classes of protein domain sequences is presented in this paper. Besides the amino acid composition, the compositions of several dipeptides, tripeptides, tetrapeptides, pentapeptides and hexapeptides are taken into account based on stepwise discriminant analysis. The result of the jackknife test shows that this new approach can lead to higher predictive sensitivity and specificity for reduced sequence similarity datasets. Considering the dataset PDB40-B constructed by Brenner and colleagues, 75.2% of protein domain sequences are correctly assigned in the jackknife test for the four structural classes: all-alpha, all-beta, alpha/beta and alpha + beta, which is improved by 19.4% in the jackknife test and 25.5% in the resubstitution test, in contrast with the component-coupled algorithm using amino acid composition alone (AAC approach) for the same dataset. In the cross-validation test with dataset PDB40-J constructed by Park and colleagues, more than 80% predictive accuracy is obtained. Furthermore, for the dataset constructed by Chou and Maggiora, accuracies of 100% and 99.7% can be easily achieved, respectively, in the resubstitution test and in the jackknife test merely taking the composition of dipeptides into account. Therefore, this new method provides an effective tool to extract valuable information from protein sequences, which can be used for the systematic analysis of small or medium size protein sequences. The computer programs used in this paper are available on request.
Affiliation(s)
- Rui-yan Luo
- Department of Mathematics, Tianjin University, Tianjin 300 072, China
11
Gromiha MM, Selvaraj S. Important amino acid properties for determining the transition state structures of two-state protein mutants. FEBS Lett 2002; 526:129-34. [PMID: 12208519 DOI: 10.1016/s0014-5793(02)03122-8] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.5] [Indexed: 11/18/2022]
Abstract
Understanding the mechanism in the folding pathways of proteins is an important problem in molecular biology. The Phi-value analysis provides insight into the transition state structures during protein folding. In this work, we have analyzed the relationship between the observed Phi values upon mutations in two-state proteins (FK506 binding protein, chymotrypsin inhibitor and src SH3 domain) and the changes in 48 various physico-chemical, energetic and conformational properties. We found that the classification of mutations based on solvent accessibility improved the correlation significantly. The relationship between conformational properties and Phi values determines the presence/absence of secondary structures in the transition state. In buried mutations, the physical properties volume, shape and flexibility, and the thermodynamic properties enthalpy, entropy and free-energy change have significant correlation with Phi. The short and medium-range non-bonded energy in partially buried mutations and average long-range contacts in exposed mutations showed a strong correlation with Phi values. Multiple regression analysis incorporating combinations of three properties from among all possible combinations of the 48 properties increased the correlation coefficient up to 0.99, by an average rise of 20% for all the data sets. Information about local sequence and structure is more important in surface mutations than those in buried mutations for explaining the transition state structures of two-state proteins. Further, the implications of our results for understanding the process of protein folding have been discussed.
Affiliation(s)
- M Michael Gromiha
- Computational Biology Research Center, AIST, 2-41-6 Aomi, Koto-ku, Tokyo 135-0064, Japan.