101
|
Prediction of protein structural classes by Chou’s pseudo amino acid composition: approached using continuous wavelet transform and principal component analysis. Amino Acids 2008; 37:415-25. [DOI: 10.1007/s00726-008-0170-2] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.9] [Received: 07/08/2008] [Accepted: 08/03/2008] [Indexed: 10/21/2022]
|
102
|
Xiao X, Wang P, Chou KC. Predicting protein structural classes with pseudo amino acid composition: an approach using geometric moments of cellular automaton image. J Theor Biol 2008; 254:691-6. [PMID: 18634802 DOI: 10.1016/j.jtbi.2008.06.016] [Citation(s) in RCA: 89] [Impact Index Per Article: 5.2] [Received: 04/02/2008] [Revised: 06/18/2008] [Accepted: 06/18/2008] [Indexed: 11/28/2022]
Abstract
A novel approach was developed for predicting the structural classes of proteins based on their sequences. It was assumed that proteins belonging to the same structural class must bear some sort of similar texture on the images generated by the cellular automaton evolving rule [Wolfram, S., 1984. Cellular automata as models of complexity. Nature 311, 419-424]. Based on this, two geometric invariant moment factors derived from the image functions were used as the pseudo amino acid components [Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Struct., Funct., Genet. (Erratum: ibid., 2001, vol. 44, 60) 43, 246-255] to formulate the protein samples for statistical prediction. The success rates thus obtained on a previously constructed benchmark dataset are quite promising, implying that the cellular automaton image can help to reveal some inherent and subtle features deeply hidden in a pile of long and complicated amino acid sequences.
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 33300, China.
|
103
|
Feng Y, Luo L. Use of tetrapeptide signals for protein secondary-structure prediction. Amino Acids 2008; 35:607-14. [PMID: 18431531 DOI: 10.1007/s00726-008-0089-7] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.4] [Received: 12/15/2007] [Accepted: 03/04/2008] [Indexed: 10/22/2022]
Abstract
This paper develops a novel sequence-based method, tetra-peptide-based increment of diversity with quadratic discriminant analysis (TPIDQD for short), for protein secondary-structure prediction. The proposed TPIDQD method is based on tetra-peptide signals and is used to predict the structure of the central residue of a sequence fragment. The three-state overall per-residue accuracy (Q3) is about 80% in the threefold cross-validated test for 21-residue fragments in the CB513 dataset. The accuracy can be further improved by taking long-range sequence information (fragments of more than 21 residues) into account in prediction. The results show that tetra-peptide signals can indeed reflect some relationship between an amino acid sequence and its secondary structure, indicating the importance of tetra-peptide signals as a protein folding code in protein structure prediction.
Affiliation(s)
- Yonge Feng
- Laboratory of Theoretical Biophysics, Faculty of Science and Technology, Inner Mongolia University, Hohhot, 010021, China.
|
104
|
Zhang SW, Chen W, Yang F, Pan Q. Using Chou's pseudo amino acid composition to predict protein quaternary structure: a sequence-segmented PseAAC approach. Amino Acids 2008; 35:591-8. [PMID: 18427713 DOI: 10.1007/s00726-008-0086-x] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.2] [Received: 02/19/2008] [Accepted: 02/28/2008] [Indexed: 12/11/2022]
Abstract
In the protein universe, many proteins are composed of two or more polypeptide chains, generally referred to as subunits, which associate through noncovalent interactions and, occasionally, disulfide bonds to form protein quaternary structures. It has long been known that the functions of proteins are closely related to their quaternary structures; some examples include enzymes, hemoglobin, DNA polymerase, and ion channels. However, it is extremely labor-expensive and even impossible to quickly determine the structures of hundreds of thousands of protein sequences solely from experiments. Since the number of protein sequences entering databanks is increasing rapidly, it is highly desirable to develop computational methods for classifying the quaternary structures of proteins from their primary sequences. Since the concept of Chou's pseudo amino acid composition (PseAAC) was introduced, a variety of approaches, such as residue conservation scores, von Neumann entropy, multiscale energy, autocorrelation function, moment descriptors, and cellular automata, have been utilized to formulate the PseAAC for predicting different attributes of proteins. Here, in a different approach, a sequence-segmented PseAAC is introduced to represent protein samples. Meanwhile, multiclass SVM classifier modules were adopted to classify protein quaternary structures. As a demonstration, the dataset constructed by Chou and Cai [(2003) Proteins 53:282-289] was adopted as a benchmark dataset. The overall jackknife success rates thus obtained were 88.2-89.1%, indicating that the new approach is quite promising for predicting protein quaternary structure.
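The idea of a sequence-segmented PseAAC can be illustrated with a minimal sketch. It assumes a simplified pseudo amino acid composition built from a single caller-supplied property scale (`prop`), equal-length segments, and illustrative defaults for `lam` and `w`; the abstract does not state the segmentation scheme or the properties actually used, so all of these are hypothetical choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pseaac(seq, prop, lam=2, w=0.05):
    """Simplified PseAAC: 20 composition terms plus lam sequence-order terms.

    prop maps each residue to one numeric property value (an assumption;
    the scale and its normalisation are left to the caller).
    """
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    freqs = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    thetas = []
    for k in range(1, lam + 1):
        # Average squared property difference between residues k apart.
        pairs = zip(seq, seq[k:])
        thetas.append(sum((prop[a] - prop[b]) ** 2 for a, b in pairs) / (len(seq) - k))
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]

def segmented_pseaac(seq, prop, n_segments=2, lam=2, w=0.05):
    """Concatenate the PseAAC vectors of n_segments consecutive segments."""
    size = len(seq) // n_segments
    features = []
    for i in range(n_segments):
        # The last segment absorbs any leftover residues.
        end = (i + 1) * size if i < n_segments - 1 else len(seq)
        features.extend(pseaac(seq[i * size:end], prop, lam, w))
    return features
```

Each segment contributes 20 + `lam` components that sum to 1, so the concatenated vector keeps coarse positional information that a whole-sequence composition would average away.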
Affiliation(s)
- Shao-Wu Zhang
- College of Automation, Northwestern Polytechnical University, 710072, Xi'an, China.
|
105
|
Zhao XM, Chen L, Aihara K. Protein function prediction with high-throughput data. Amino Acids 2008; 35:517-30. [PMID: 18427717 DOI: 10.1007/s00726-008-0077-y] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 01/31/2008] [Accepted: 03/13/2008] [Indexed: 12/12/2022]
Abstract
Protein function prediction is one of the main challenges in the post-genomic era. The availability of large amounts of high-throughput data provides an alternative approach to handling this problem from the computational viewpoint. In this review, we provide a comprehensive description of the computational methods that are currently applicable to protein function prediction, especially from the perspective of machine learning. Machine learning techniques can generally be classified as supervised learning, semi-supervised learning and unsupervised learning. By classifying the existing computational methods for protein annotation into these three groups, we are able to present a comprehensive framework for protein annotation based on machine learning techniques. In addition to describing recently developed theoretical methodologies, we also cover representative databases and software tools that are widely utilized in the prediction of protein function.
Affiliation(s)
- Xing-Ming Zhao
- ERATO Aihara Complexity Modelling Project, JST, Tokyo, 151-0064, Japan
|
106
|
Prediction of protein structure class by coupling improved genetic algorithm and support vector machine. Amino Acids 2008; 35:581-90. [DOI: 10.1007/s00726-008-0084-z] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Received: 12/10/2007] [Accepted: 01/31/2008] [Indexed: 10/22/2022]
|
107
|
An ensemble of support vector machines for predicting the membrane protein type directly from the amino acid sequence. Amino Acids 2008; 35:573-80. [DOI: 10.1007/s00726-008-0083-0] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Received: 01/21/2008] [Accepted: 02/26/2008] [Indexed: 11/26/2022]
|
108
|
Liu ZP, Wu LY, Wang Y, Zhang XS, Chen L. Bridging protein local structures and protein functions. Amino Acids 2008; 35:627-50. [PMID: 18421562 PMCID: PMC7088341 DOI: 10.1007/s00726-008-0088-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 02/21/2008] [Accepted: 03/10/2008] [Indexed: 12/11/2022]
Abstract
One of the major goals of molecular and evolutionary biology is to understand the functions of proteins by extracting functional information from protein sequences, structures and interactions. In this review, we summarize the repertoire of methods currently being applied and report recent progress in the field of in silico annotation of protein function based on the accumulation of vast amounts of sequence and structure data. In particular, we emphasize the newly developed structure-based methods, which are able to identify local structural motifs and reveal their relationship with protein functions. These methods include computational tools to identify the structural motifs and reveal the strong relationship between these pre-computed local structures and protein functions. We also discuss remaining problems and possible directions for this exciting and challenging area.
Affiliation(s)
- Zhi-Ping Liu
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 100080, Beijing, China
|
109
|
Du P, Li Y. Prediction of C-to-U RNA editing sites in plant mitochondria using both biochemical and evolutionary information. J Theor Biol 2008; 253:579-86. [PMID: 18511083 DOI: 10.1016/j.jtbi.2008.04.006] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.4] [Received: 01/30/2008] [Revised: 04/06/2008] [Accepted: 04/09/2008] [Indexed: 01/30/2023]
Abstract
Although cytidine-to-uridine conversions in plant mitochondria were discovered 18 years ago, the process remains enigmatic. As sequencing projects make more plant mitochondrial genome sequences available, the need to identify C-to-U RNA editing sites computationally is also increasing. By incorporating both evolutionary and biochemical information, we developed a novel algorithm for predicting C-to-U RNA editing sites in plant mitochondria. The algorithm has been implemented as an online service called CURE (Cytidine-to-Uridine Recognizing Editor). CURE performs better than other methods that are based on only biochemical or only evolutionary information. CURE can also predict C-to-U RNA editing sites in non-coding regions and synonymous C-to-U RNA editing sites in coding regions, which other methods cannot do. Furthermore, CURE can carry out prediction directly on the entire mitochondrial genome sequence. The prediction results of CURE suggest the functional importance of synonymous RNA editing sites, which was previously neglected. The CURE service can be accessed at http://bioinfo.au.tsinghua.edu.cn/cure.
Affiliation(s)
- Pufeng Du
- MOE Key Laboratory of Bioinformatics, Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China.
|
110
|
Chen FF, Tang YN, Wang SL, Gao HW. Binding of brilliant red compound to lysozyme: insights into the enzyme toxicity of water-soluble aromatic chemicals. Amino Acids 2008; 36:399-407. [DOI: 10.1007/s00726-008-0069-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 03/15/2008] [Accepted: 03/18/2008] [Indexed: 10/22/2022]
|
111
|
An ensemble of reduced alphabets with protein encoding based on grouped weight for predicting DNA-binding proteins. Amino Acids 2008; 36:167-75. [DOI: 10.1007/s00726-008-0044-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Received: 11/30/2007] [Accepted: 02/07/2008] [Indexed: 10/22/2022]
|
112
|
Lin H. The modified Mahalanobis Discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition. J Theor Biol 2008; 252:350-6. [PMID: 18355838 DOI: 10.1016/j.jtbi.2008.02.004] [Citation(s) in RCA: 182] [Impact Index Per Article: 10.7] [Received: 11/08/2007] [Revised: 12/02/2007] [Accepted: 02/04/2008] [Indexed: 11/15/2022]
Abstract
The outer membrane proteins (OMPs) are beta-barrel membrane proteins that perform many biological functions. Discriminating OMPs from non-OMPs is an important task for understanding some biochemical processes. In this study, a method that combines increment of diversity with a modified Mahalanobis discriminant, called IDQD, is presented to predict 208 OMPs, 206 transmembrane helical proteins (TMHPs) and 673 globular proteins (GPs) by using Chou's pseudo amino acid compositions as parameters. The overall accuracies of jackknife cross-validation are 93.2% and 96.1% for the three-class dataset (OMPs, TMHPs and GPs) and the two-class dataset (OMPs and non-OMPs), respectively. These results suggest that the method can be effectively applied to discriminate OMPs, TMHPs and GPs. They also indicate that the pseudo amino acid composition reflects the core features of membrane proteins better than the classical amino acid composition.
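The "increment of diversity" ingredient of IDQD can be sketched as follows. This assumes the form of the measure commonly given in the literature, D(X) = N ln N - sum(n_i ln n_i) and ID(X, Y) = D(X + Y) - D(X) - D(Y); the abstract does not spell out the formula, and the modified Mahalanobis discriminant step is not shown here.

```python
import math

def diversity(counts):
    """D(X) = N*ln(N) - sum(n_i*ln(n_i)), treating 0*ln(0) as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)

def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) for two count vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("count vectors must have the same length")
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)
```

Under this definition, two count vectors with the same proportions give an increment of zero, and the value grows as the two distributions diverge, so a smaller increment indicates a closer match between a query protein and a class profile.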
Affiliation(s)
- Hao Lin
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
|
113
|
Using Chou's pseudo amino acid composition based on approximate entropy and an ensemble of AdaBoost classifiers to predict protein subnuclear location. Amino Acids 2008; 34:669-75. [PMID: 18256886 DOI: 10.1007/s00726-008-0034-9] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.4] [Received: 11/19/2007] [Accepted: 01/17/2008] [Indexed: 10/22/2022]
Abstract
Knowledge of subnuclear localization in eukaryotic cells is essential for understanding the function of the nucleus. Because of the special characteristics of the cell nucleus, developing prediction methods and tools for protein subnuclear localization has become an important research field in protein science. In this study, a novel approach is proposed to predict protein subnuclear localization. Each protein sample is represented by a pseudo amino acid (PseAA) composition based on the approximate entropy (ApEn) concept, which reflects the complexity of a time series. A novel ensemble classifier is designed that incorporates three AdaBoost classifiers. The base classifiers in the three AdaBoost classifiers are decision stumps, a fuzzy K-nearest-neighbor classifier, and radial-basis support vector machines, respectively. Different PseAA compositions are used as the input data of the different AdaBoost classifiers in the ensemble. A genetic algorithm is used to optimize the dimension and weight factor of the PseAA composition. Two datasets often used in published works are used to validate the performance of the proposed approach. The results of the jackknife cross-validation test are higher and more balanced than those of other methods on the same datasets. The promising results indicate that the proposed approach is effective and practical and might become a useful tool for predicting protein subnuclear localization. The Matlab software and supplementary materials are freely available by contacting the corresponding author.
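Approximate entropy itself can be sketched on a plain numeric series. This follows the standard ApEn definition (window length `m`, tolerance `r`, self-matches included); how the paper turns a protein sequence into a numeric series is not stated in the abstract, so the input encoding and the default parameter values here are assumptions.

```python
import math

def _phi(series, m, r):
    """Average log-fraction of length-m windows within tolerance r of each window."""
    n = len(series) - m + 1
    windows = [series[i:i + m] for i in range(n)]
    total = 0.0
    for w in windows:
        # Chebyshev distance; the window always matches itself, so matches >= 1.
        matches = sum(
            1 for v in windows
            if max(abs(a - b) for a, b in zip(w, v)) <= r
        )
        total += math.log(matches / n)
    return total / n

def approximate_entropy(series, m=2, r=0.2):
    """ApEn(m, r) = phi_m(r) - phi_{m+1}(r); lower values mean a more regular series."""
    if len(series) < m + 2:
        raise ValueError("series is too short for the chosen m")
    return _phi(series, m, r) - _phi(series, m + 1, r)
```

A constant series gives an ApEn of zero and a strictly periodic one gives a value close to zero, which is the sense in which the measure "reflects the complexity of a time series".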
|
114
|
Yan C, Hu J, Wang Y. Discrimination of outer membrane proteins using a K-nearest neighbor method. Amino Acids 2008; 35:65-73. [PMID: 18219549 DOI: 10.1007/s00726-007-0628-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 09/17/2007] [Accepted: 10/28/2007] [Indexed: 11/29/2022]
Abstract
Identification of outer membrane proteins (OMPs) from genomes is an important task. This paper presents a K-nearest neighbor (K-NN) method for discriminating OMPs. The method makes predictions based on a weighted Euclidean distance that is computed from residue composition. The method achieves 89.1% accuracy with 0.668 MCC (Matthews correlation coefficient) in discriminating OMPs and non-OMPs. The performance of the method is improved by incorporating homology information into the calculation of residue composition. The final method achieves an accuracy of 96.1%, with 0.873 MCC, 87.5% sensitivity, and 98.2% specificity. Comparisons with multiple recently published methods show that the method proposed in this study outperforms the others.
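A K-NN classifier over residue composition with a weighted Euclidean distance might look like the sketch below. The per-residue `weights`, the value of `k`, and the toy training data are illustrative assumptions; the abstract does not say how the weights are derived or how homology information enters the composition.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """20-dimensional residue composition (fractions of the standard amino acids)."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]

def weighted_distance(x, y, weights):
    """Euclidean distance with one non-negative weight per dimension."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

def knn_predict(query, training, weights, k=3):
    """Majority vote among the k training sequences closest to the query.

    training is a list of (sequence, label) pairs.
    """
    q = composition(query)
    ranked = sorted(
        training,
        key=lambda item: weighted_distance(q, composition(item[0]), weights),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With uniform weights this reduces to ordinary K-NN on composition; non-uniform weights let residues that separate the classes well count for more in the distance.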
Affiliation(s)
- C Yan
- Department of Computer Science, Utah State University, Logan, UT 84322-4205, USA.
|
115
|
Nanni L, Lumini A. Genetic programming for creating Chou’s pseudo amino acid based features for submitochondria localization. Amino Acids 2008; 34:653-60. [PMID: 18175047 DOI: 10.1007/s00726-007-0018-1] [Citation(s) in RCA: 131] [Impact Index Per Article: 7.7] [Received: 11/26/2007] [Accepted: 12/11/2007] [Indexed: 01/25/2023]
|
116
|
Nanni L, Lumini A. Combing ontologies and dipeptide composition for predicting DNA-binding proteins. Amino Acids 2008; 34:635-41. [PMID: 18175049 DOI: 10.1007/s00726-007-0016-3] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.5] [Received: 11/18/2007] [Accepted: 12/06/2007] [Indexed: 12/11/2022]
Abstract
Given a novel protein, it is very important to know whether it is a DNA-binding protein, because DNA-binding proteins play a fundamental role in regulating gene expression. In this work, we propose a parallel fusion between a classifier trained using the features extracted from the gene ontology database and a classifier trained using the dipeptide composition of the protein. The support vector machine (SVM) and the 1-nearest neighbour are used as classifiers. The Matthews correlation coefficient obtained by our fusion method is approximately 0.97 when jackknife cross-validation is used; this result outperforms the best performance reported in the literature (0.924) on the same dataset, where the SVM is trained using only Chou's pseudo amino acid based features. The area under the ROC curve (AUC) is also reported, and our results show that the fusion achieves an AUC of 0.995. In particular, we want to stress that our fusion obtains a 5% false negative rate with a 0% false positive rate. The Matthews correlation coefficient obtained using the single best GO-number is only 0.7211, and hence it is not possible to use the gene ontology database as a simple lookup table. Finally, we test the complementarity of the two feature extraction methods using the Q-statistic. We obtain a value of 0.58, which means that the features extracted from the gene ontology database and the features extracted from the amino acid sequence are partially independent and that their parallel fusion deserves further study.
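The dipeptide composition and the two evaluation measures named in this abstract can be sketched as follows. The Matthews correlation coefficient uses its standard confusion-matrix formula; for the Q-statistic, the sketch assumes Yule's Q as commonly used to measure the diversity of two classifiers, since the abstract does not give the formula.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(seq):
    """400-dimensional vector of overlapping dipeptide frequencies."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(pairs[d] for d in DIPEPTIDES) or 1
    return [pairs[d] / total for d in DIPEPTIDES]

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def q_statistic(correct_a, correct_b):
    """Yule's Q from per-sample correctness flags of two classifiers."""
    n11 = sum(1 for a, b in zip(correct_a, correct_b) if a and b)
    n00 = sum(1 for a, b in zip(correct_a, correct_b) if not a and not b)
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if not a and b)
    denom = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / denom if denom else 0.0
```

Under this definition Q ranges from -1 to 1, with 0 for classifiers whose errors are statistically independent and 1 for classifiers that are always right or wrong together, which is consistent with reading the reported 0.58 as partial independence.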
Affiliation(s)
- Loris Nanni
- DEIS, IEIIT-CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy.
|
117
|
Zhang SW, Zhang YL, Yang HF, Zhao CH, Pan Q. Using the concept of Chou's pseudo amino acid composition to predict protein subcellular localization: an approach by incorporating evolutionary information and von Neumann entropies. Amino Acids 2007; 34:565-72. [PMID: 18074191 DOI: 10.1007/s00726-007-0010-9] [Citation(s) in RCA: 116] [Impact Index Per Article: 6.4] [Received: 10/23/2007] [Accepted: 11/15/2007] [Indexed: 11/24/2022]
Abstract
The rapidly increasing number of sequences entering the genome databanks calls for automated methods to analyze them. Information on the subcellular localization of newly found protein sequences is important for revealing their functions in a timely manner and for studying systems biology at the cellular level. Based on the concept of Chou's pseudo-amino acid composition, a series of useful information and techniques, such as residue conservation scores, von Neumann entropies, multi-scale energy, and weighted auto-correlation function, were utilized to generate the pseudo-amino acid components for representing the protein samples. Based on such an infrastructure, a hybridization predictor was developed for identifying uncharacterized proteins among the following 12 subcellular localizations: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracell, Golgi apparatus, lysosome, mitochondria, nucleus, peroxisome, plasma membrane, and vacuole. Compared with the results reported by previous investigators, higher success rates were obtained, suggesting that the current approach is quite promising and may become a useful high-throughput tool in the relevant areas.
Affiliation(s)
- Shao-Wu Zhang
- College of Automation, Northwestern Polytechnical University, No. 127 Youyi West Road, Xi'an 710072, China.
|