51
|
Hayat M, Khan A. MemHyb: Predicting membrane protein types by hybridizing SAAC and PSSM. J Theor Biol 2012; 292:93-102. [DOI: 10.1016/j.jtbi.2011.09.026] [Citation(s) in RCA: 73] [Impact Index Per Article: 6.1] [Received: 06/21/2011] [Revised: 09/21/2011] [Accepted: 09/22/2011] [Indexed: 01/08/2023]
|
52
|
Fan GL, Li QZ. Predicting protein submitochondria locations by combining different descriptors into the general form of Chou's pseudo amino acid composition. Amino Acids 2011; 43:545-55. [PMID: 22102053 DOI: 10.1007/s00726-011-1143-4] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.7] [Received: 06/28/2011] [Accepted: 10/27/2011] [Indexed: 10/15/2022]
Abstract
Knowledge of the submitochondria location of a protein is integral to understanding its function and is a necessity in the proteomics era. In this work, a new submitochondria data set is constructed, and an approach for predicting protein submitochondria locations is proposed by combining the amino acid composition, dipeptide composition, reduced physicochemical properties, gene ontology, evolutionary information, and pseudo-average chemical shift. The overall prediction accuracy is 93.57% for the submitochondria location and 97.79% for the three membrane protein types in the mitochondria inner membrane, using the algorithm of the increment of diversity combined with the support vector machine. The performance of the pseudo-average chemical shift is excellent. For comparison, the method is also used to predict submitochondria locations in the data set constructed by Du and Li; an accuracy of 94.95% is obtained by our method, which is better than that of other existing methods.
Affiliation(s)
- Guo-Liang Fan
- Department of Physics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
|
53
|
Chen W, Feng P, Lin H. Prediction of ketoacyl synthase family using reduced amino acid alphabets. J Ind Microbiol Biotechnol 2011; 39:579-84. [PMID: 22042516 DOI: 10.1007/s10295-011-1047-z] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Received: 08/10/2011] [Accepted: 10/04/2011] [Indexed: 11/28/2022]
Abstract
Ketoacyl synthases are enzymes involved in fatty acid synthesis and can be classified into five families based on primary sequence similarity. Different families have different catalytic mechanisms. Developing cost-effective computational models to identify the family of ketoacyl synthases will be helpful for enzyme engineering and in knowing individual enzymes' catalytic mechanisms. In this work, a support vector machine-based method was developed to predict ketoacyl synthase family using the n-peptide composition of reduced amino acid alphabets. In jackknife cross-validation, the model based on the 2-peptide composition of a reduced amino acid alphabet of size 13 yielded the best overall accuracy of 96.44% with average accuracy of 93.36%, which is superior to other state-of-the-art methods. This result suggests that the information provided by n-peptide compositions of reduced amino acid alphabets provides efficient means for enzyme family classification and that the proposed model can be efficiently used for ketoacyl synthase family annotation.
Affiliation(s)
- Wei Chen
- Department of Physics, College of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China.
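As a minimal sketch of the feature described in the abstract above, the following Python computes the n-peptide composition of a sequence after mapping it onto a reduced amino acid alphabet. The grouping `EXAMPLE_GROUPS` (10 clusters) and the function names are assumptions made for illustration; the abstract does not list the size-13 alphabet the authors used, and the SVM training step is omitted.

```python
from itertools import product

# Hypothetical 10-cluster grouping of the 20 standard amino acids.
# It is NOT the size-13 alphabet from the paper, which the abstract does not list.
EXAMPLE_GROUPS = ["AG", "C", "DE", "FWY", "H", "ILMV", "KR", "NQ", "P", "ST"]


def reduce_sequence(seq, groups=EXAMPLE_GROUPS):
    """Replace each residue by the first letter of its group; drop unknown residues."""
    mapping = {aa: group[0] for group in groups for aa in group}
    return "".join(mapping[aa] for aa in seq.upper() if aa in mapping)


def n_peptide_composition(seq, n=2, groups=EXAMPLE_GROUPS):
    """Return the frequency of every possible n-peptide over the reduced alphabet."""
    reduced = reduce_sequence(seq, groups)
    letters = [group[0] for group in groups]
    counts = {"".join(p): 0 for p in product(letters, repeat=n)}
    total = max(len(reduced) - n + 1, 0)  # number of overlapping windows
    for i in range(total):
        counts[reduced[i:i + n]] += 1
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


if __name__ == "__main__":
    features = n_peptide_composition("ACDEFGHIKL", n=2)
    print(len(features), "features; non-zero entries:",
          {k: round(v, 3) for k, v in features.items() if v > 0})
```

With a reduced alphabet of size m, the n-peptide composition has m^n dimensions, so a smaller alphabet keeps the feature vector compact compared with the 400 dipeptides of the full 20-letter alphabet.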
|
54
|
Lu JL, Hu XH, Hu DG. A new hybrid fractal algorithm for predicting thermophilic nucleotide sequences. J Theor Biol 2011; 293:74-81. [PMID: 22001320 DOI: 10.1016/j.jtbi.2011.09.028] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 07/10/2011] [Revised: 09/23/2011] [Accepted: 09/26/2011] [Indexed: 01/20/2023]
Abstract
Knowledge of the thermophilic mechanisms of organisms whose optimum growth temperature (OGT) ranges from 50 to 80 °C plays a major role in helping design stable proteins. How to predict whether a DNA sequence is thermophilic is a long-standing problem that has not been fully resolved. Chaos game representation (CGR) can investigate the patterns hidden in DNA sequences and can visually reveal previously unknown structure. Fractal dimensions are good tools to measure the sizes of complex, highly irregular geometric objects. In this paper, we convert every DNA sequence into a high-dimensional vector by the CGR algorithm and fractal dimension, and then predict the thermostability of the DNA sequence from these fractal features with a support vector machine (SVM). We conducted experiments on three groups: 17-dimensional, 65-dimensional, and 257-dimensional vectors. Each group is evaluated by a 10-fold cross-validation test. The 257-dimensional group gives the best results: the average accuracy is 0.9456 and the average MCC is 0.8878. The results are also compared with previous work using single CGR features. The comparison shows the high effectiveness of the new hybrid fractal algorithm.
Affiliation(s)
- Jin-Long Lu
- College of Science, Huazhong Agricultural University, Wuhan, PR China
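The chaos game representation mentioned in the abstract above can be sketched as follows. This is an illustrative version only: the corner assignment in `CORNERS`, the function names, and the grid-frequency feature are assumptions, and the abstract does not say how the 17-, 65-, and 257-dimensional vectors or the fractal dimension are actually computed.

```python
# Assumed corner layout of the unit square; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory: each point is the midpoint between the
    previous point and the corner of the current base (start at the centre)."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_grid_frequencies(seq, k=2):
    """Count CGR points in a 2^k x 2^k grid and return relative frequencies.
    After at least k bases, the cell of a point depends only on the last k bases,
    so the grid is a k-mer frequency table in disguise. Requires k >= 1."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    points = cgr_points(seq)[k - 1:]  # keep points preceded by at least k bases
    for x, y in points:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    total = len(points)
    return [c / total if total else 0.0 for row in grid for c in row]


if __name__ == "__main__":
    print(cgr_points("ACGT"))
    print(cgr_grid_frequencies("ACGTACGTTTGA", k=2))
```

A flattened grid of this kind yields 4^k values per sequence, which could then be passed to any standard classifier.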
|
55
|
Jingbo X, Silan Z, Feng S, Huijuan X, Xuehai H, Xiaohui N, Zhi L. Using the concept of pseudo amino acid composition to predict resistance gene against Xanthomonas oryzae pv. oryzae in rice: An approach from chaos games representation. J Theor Biol 2011; 284:16-23. [DOI: 10.1016/j.jtbi.2011.06.003] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 12/08/2010] [Revised: 06/02/2011] [Accepted: 06/03/2011] [Indexed: 10/18/2022]
|
56
|
Nanni L, Lumini A, Gupta D, Garg A. Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou's pseudo amino acid composition and on evolutionary information. IEEE/ACM Trans Comput Biol Bioinform 2011; 9:467-475. [PMID: 21860064 DOI: 10.1109/tcbb.2011.117] [Citation(s) in RCA: 113] [Impact Index Per Article: 8.7] [Indexed: 05/31/2023]
Abstract
The availability of a reliable method for predicting bacterial virulent proteins has several important applications in research efforts aimed at finding novel drug targets, identifying vaccine candidates, and understanding virulence mechanisms in pathogens. In this work, we have studied several feature extraction approaches for representing proteins and propose a novel bacterial virulent protein prediction method based on an ensemble of classifiers, where the features are extracted directly from the amino acid sequence and from the evolutionary information of a given protein. We have evaluated and compared several ensembles obtained by combining six feature extraction methods and several classification approaches based on two general-purpose classifiers (i.e., Support Vector Machine and a variant of input decimated ensemble) and their random subspace versions. An extensive evaluation was performed according to a blind testing protocol, where the parameters of the system are optimized using the training set and the system is validated on three different independent data sets, allowing selection of the best-performing system and demonstrating the validity of the proposed method. Based on the results obtained using the blind test protocol, it is interesting to note that even though the best-performing stand-alone method is not always the same in each independent data set, the fusion of different methods enhances prediction efficiency in all the tested independent data sets.
|
57
|
Hayat M, Khan A, Yeasin M. Prediction of membrane proteins using split amino acid and ensemble classification. Amino Acids 2011; 42:2447-60. [DOI: 10.1007/s00726-011-1053-5] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.8] [Received: 04/01/2011] [Accepted: 07/29/2011] [Indexed: 02/01/2023]
|
58
|
Kandaswamy KK, Pugalenthi G, Hazrati MK, Kalies KU, Martinetz T. BLProt: prediction of bioluminescent proteins based on support vector machine and relieff feature selection. BMC Bioinformatics 2011; 12:345. [PMID: 21849049 PMCID: PMC3176267 DOI: 10.1186/1471-2105-12-345] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Received: 11/22/2010] [Accepted: 08/17/2011] [Indexed: 11/29/2022]
Abstract
Background: Bioluminescence is a process in which light is emitted by a living organism. Most creatures that emit light are sea creatures, but some insects, plants, fungi, etc., also emit light. The biotechnological application of bioluminescence has become routine and is considered essential for many medical and general technological advances. Identification of bioluminescent proteins is challenging due to their poor similarity in sequence. So far, no specific method has been reported to identify bioluminescent proteins from primary sequence. Results: In this paper, we propose a novel predictive method that uses a Support Vector Machine (SVM) and physicochemical properties to predict bioluminescent proteins. BLProt was trained using a dataset consisting of 300 bioluminescent proteins and 300 non-bioluminescent proteins, and evaluated on an independent set of 141 bioluminescent proteins and 18202 non-bioluminescent proteins. To identify the most prominent features, we carried out feature selection with three different filter approaches: ReliefF, infogain, and mRMR. We selected five different feature subsets by decreasing the number of features, and the performance of each feature subset was evaluated. Conclusion: BLProt achieves 80% accuracy from training (5-fold cross-validation) and 80.06% accuracy from testing. The performance of BLProt was compared with BLAST and HMM. High prediction accuracy and successful prediction of hypothetical proteins suggest that BLProt can be a useful approach to identify bioluminescent proteins from sequence information, irrespective of their sequence similarity. The BLProt software is available at http://www.inb.uni-luebeck.de/tools-demos/bioluminescent%20protein/BLProt
|
59
|
Yu X, Zheng X, Liu T, Dou Y, Wang J. Predicting subcellular location of apoptosis proteins with pseudo amino acid composition: approach from amino acid substitution matrix and auto covariance transformation. Amino Acids 2011; 42:1619-25. [DOI: 10.1007/s00726-011-0848-8] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Received: 09/10/2010] [Accepted: 02/09/2011] [Indexed: 12/13/2022]
|
60
|
Hayat M, Khan A. Predicting membrane protein types by fusing composite protein sequence features into pseudo amino acid composition. J Theor Biol 2011; 271:10-7. [DOI: 10.1016/j.jtbi.2010.11.017] [Citation(s) in RCA: 125] [Impact Index Per Article: 9.6] [Received: 07/29/2010] [Revised: 11/10/2010] [Accepted: 11/10/2010] [Indexed: 11/28/2022]
|
61
|
Pugalenthi G, Kandaswamy KK, Suganthan PN, Sowdhamini R, Martinetz T, Kolatkar PR. SMpred: a support vector machine approach to identify structural motifs in protein structure without using evolutionary information. J Biomol Struct Dyn 2011; 28:405-14. [PMID: 20919755 DOI: 10.1080/07391102.2010.10507369] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/28/2022]
Abstract
Knowledge of three-dimensional structure is essential to understand the function of a protein. Although the overall fold is determined by the whole of its sequence, a small group of residues, often called structural motifs, plays a crucial role in determining the protein fold and its stability. Identification of such structural motifs requires a sufficient number of sequence and structural homologs to define conservation and evolutionary information. Unfortunately, many structures in the protein structure databases have no homologous structures or sequences. In this work, we report an SVM method, SMpred, to identify structural motifs from a single protein structure without using sequence and structural homologs. The SMpred method was trained and tested using 132 protein domains containing 581 motifs. The SMpred method achieved 78.79% accuracy with 79.06% sensitivity and 78.53% specificity. The performance of SMpred was evaluated with MegaMotifBase using 188 proteins containing 1161 motifs. Out of 1161 motifs, SMpred correctly identified 1503 structural motifs reported in MegaMotifBase. Further, we showed that SMpred is a useful approach for the length-deviant superfamilies and single-member superfamilies. This result suggests the usefulness of our approach for facilitating the identification of structural motifs in protein structure in the absence of sequence and structural homologs. The dataset and executable for the SMpred algorithm are available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/SMpred.htm.
Affiliation(s)
- Ganesan Pugalenthi
- Laboratory of Structural Biochemistry, Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672
|
62
|
Lin H, Ding H. Predicting ion channels and their types by the dipeptide mode of pseudo amino acid composition. J Theor Biol 2011; 269:64-9. [DOI: 10.1016/j.jtbi.2010.10.019] [Citation(s) in RCA: 110] [Impact Index Per Article: 8.5] [Received: 07/13/2010] [Revised: 08/31/2010] [Accepted: 10/15/2010] [Indexed: 12/11/2022]
|
63
|
Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol 2010; 273:236-47. [PMID: 21168420 PMCID: PMC7125570 DOI: 10.1016/j.jtbi.2010.12.024] [Citation(s) in RCA: 966] [Impact Index Per Article: 69.0] [Received: 10/21/2010] [Revised: 12/08/2010] [Accepted: 12/13/2010] [Indexed: 11/29/2022]
Abstract
With the completion of human genome sequencing, the number of sequence-known proteins has increased explosively. In contrast, the pace is much slower in determining their biological attributes. As a consequence, the gap between sequence-known proteins and attribute-known proteins has become increasingly large. This imbalance, which has critically limited our ability to make timely use of newly discovered proteins for basic research and drug development, has called for developing computational methods or high-throughput automated tools for quickly and reliably identifying various attributes of uncharacterized proteins based on their sequence information alone. Indeed, during the last two decades or so, many methods in this regard have been established in the hope of bridging such a gap. In the course of developing these methods, the following often needed to be considered: (1) benchmark dataset construction, (2) protein sample formulation, (3) operating algorithm (or engine), (4) anticipated accuracy, and (5) web-server establishment. In this review, we discuss each of the five procedures, with a special focus on the introduction of pseudo amino acid composition (PseAAC), its different modes and applications, and its recent development, particularly how to use the general formulation of PseAAC to reflect the core and essential features that are deeply hidden in complicated protein sequences.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA.
|
64
|
AFP-Pred: A random forest approach for predicting antifreeze proteins from sequence-derived properties. J Theor Biol 2010; 270:56-62. [PMID: 21056045 DOI: 10.1016/j.jtbi.2010.10.037] [Citation(s) in RCA: 191] [Impact Index Per Article: 13.6] [Received: 08/18/2010] [Revised: 10/29/2010] [Accepted: 10/29/2010] [Indexed: 12/11/2022]
Abstract
Some creatures living in extremely low temperatures can produce special materials called "antifreeze proteins" (AFPs), which can prevent the cell and body fluids from freezing. AFPs are present in vertebrates, invertebrates, plants, bacteria, fungi, etc. Although AFPs have a common function, they show a high degree of diversity in sequences and structures. Therefore, sequence similarity based search methods often fail to predict AFPs from sequence databases. In this work, we report a random forest approach, "AFP-Pred", for the prediction of antifreeze proteins from protein sequence. AFP-Pred was trained on a dataset containing 300 AFPs and 300 non-AFPs and tested on a dataset containing 181 AFPs and 9193 non-AFPs. AFP-Pred achieved 81.33% accuracy from training and 83.38% from testing. The performance of AFP-Pred was compared with BLAST and HMM. High prediction accuracy and successful prediction of hypothetical proteins suggest that AFP-Pred can be a useful approach to identify antifreeze proteins from sequence information, irrespective of their sequence similarity.
|
65
|
Xie Z, Zhang T, Wang JF, Chou KC, Wei DQ. The computational model to predict accurately inhibitory activity for inhibitors towards CYP3A4. Comput Biol Med 2010; 40:845-52. [DOI: 10.1016/j.compbiomed.2010.09.004] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 05/11/2010] [Revised: 08/09/2010] [Accepted: 09/21/2010] [Indexed: 10/18/2022]
|
66
|
Zakeri P, Moshiri B, Sadeghi M. Prediction of protein submitochondria locations based on data fusion of various features of sequences. J Theor Biol 2010; 269:208-16. [PMID: 21040732 DOI: 10.1016/j.jtbi.2010.10.026] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.0] [Received: 08/10/2010] [Revised: 10/16/2010] [Accepted: 10/22/2010] [Indexed: 01/16/2023]
Abstract
In this study, the predictors are developed for protein submitochondria locations based on various features of sequences. Information about the submitochondria location for a mitochondria protein can provide much better understanding about its function. We use ten representative models of protein samples such as pseudo amino acid composition, dipeptide composition, functional domain composition, the combining discrete model based on prediction of solvent accessibility and secondary structure elements, the discrete model of pairwise sequence similarity, etc. We construct a predictor based on support vector machines (SVMs) for each representative model. The overall prediction accuracy by the leave-one-out cross validation test obtained by the predictor which is based on the discrete model of pairwise sequence similarity is 1% better than the best computational system that exists for this problem. Moreover, we develop a method based on ordered weighted averaging (OWA) which is one of the fusion data operators. Therefore, OWA is applied on the 11 best SVM-based classifiers that are constructed based on various features of sequence. This method is called Mito-Loc. The overall leave-one-out cross validation accuracy obtained by Mito-Loc is about 95%. This indicates that our proposed approach (Mito-Loc) is superior to the result of the best existing approach which has already been reported.
Affiliation(s)
- Pooya Zakeri
- Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, Iran
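The ordered weighted averaging (OWA) operator named in the abstract above can be illustrated with a short sketch. The operator itself is standard: the inputs are sorted in descending order and then combined with a fixed weight vector that sums to 1. Everything else here is an assumption for illustration, including the weights, the class labels, the helper `fuse_classifiers`, and the idea of fusing per-class scores; the abstract does not describe how Mito-Loc derives its weights.

```python
def owa(values, weights):
    """Ordered weighted average: weights apply to rank positions, not to sources."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)  # largest value gets the first weight
    return sum(w * v for w, v in zip(weights, ordered))


def fuse_classifiers(scores_per_classifier, weights):
    """Fuse per-class scores from several classifiers with OWA (hypothetical helper).
    scores_per_classifier: list of dicts mapping class label -> score."""
    classes = scores_per_classifier[0].keys()
    fused = {c: owa([s[c] for s in scores_per_classifier], weights) for c in classes}
    best = max(fused, key=fused.get)
    return best, fused


if __name__ == "__main__":
    # Made-up scores from three classifiers for two hypothetical classes.
    scores = [
        {"inner": 0.7, "outer": 0.3},
        {"inner": 0.4, "outer": 0.6},
        {"inner": 0.8, "outer": 0.2},
    ]
    print(fuse_classifiers(scores, weights=[0.5, 0.3, 0.2]))
```

Choosing the weight vector moves the operator between the maximum (all weight on the first rank) and the minimum (all weight on the last rank), with the plain mean in between.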
|
67
|
Yu L, Guo Y, Li Y, Li G, Li M, Luo J, Xiong W, Qin W. SecretP: identifying bacterial secreted proteins by fusing new features into Chou's pseudo-amino acid composition. J Theor Biol 2010; 267:1-6. [PMID: 20691704 DOI: 10.1016/j.jtbi.2010.08.001] [Citation(s) in RCA: 98] [Impact Index Per Article: 7.0] [Received: 05/31/2010] [Revised: 07/30/2010] [Accepted: 08/01/2010] [Indexed: 11/17/2022]
Abstract
Protein secretion plays an important role in bacterial lifestyles. Secreted proteins are crucial for bacterial pathogenesis by making bacteria interact with their environments, particularly delivering pathogenic and symbiotic bacteria into their eukaryotic hosts. Therefore, identification of bacterial secreted proteins becomes an important process for the study of various diseases and the corresponding drugs. In this paper, by fusing several new features into Chou's pseudo-amino acid composition (PseAAC), two support vector machine (SVM)-based ternary classifiers are developed to predict secreted proteins of Gram-negative and Gram-positive bacteria. For the two types of bacteria, high accuracies of 94.03% and 94.36% are obtained by our method in distinguishing classically secreted, non-classically secreted, and non-secreted proteins. To compare the practical ability of our method in identifying bacterial secreted proteins with that of six published methods, proteins in Escherichia coli and Bacillus subtilis are collected to construct the test sets of Gram-negative and Gram-positive bacteria, and the prediction results of our method are comparable to those of existing methods. When applied to two public independent data sets for predicting NCSPs, it also yields satisfactory results for Gram-negative bacterial proteins. The prediction server SecretP can be accessed at http://cic.scu.edu.cn/bioinformatics/secretPV2/index.htm.
Affiliation(s)
- Lezheng Yu
- College of Chemistry, Sichuan University, Chengdu 610064, PR China
|
68
|
Jahandideh S, Abdolmaleki P. Prediction of melatonin excretion patterns in the rat exposed to ELF magnetic fields based on support vector machine and linear discriminant analysis. Micron 2010; 41:882-5. [PMID: 20554210 DOI: 10.1016/j.micron.2010.04.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2010] [Revised: 04/03/2010] [Accepted: 04/05/2010] [Indexed: 10/19/2022]
Abstract
The bioeffects of magnetic field exposure have motivated various studies. However, no consensus or guideline is yet available for experimental designs relating to exposure conditions. In the present work, in order to analyze and predict the melatonin excretion patterns in the rat exposed to extremely low frequency magnetic fields (ELF-MF), linear discriminant analysis (LDA) and support vector machines (SVMs) were utilized. Subsequently, the performances of LDA and SVMs were compared through resubstitution and jackknife tests on a database containing 33 experiments. The predictor variables were the more effective parameters, including frequency, polarization, exposure duration, and strength of the magnetic fields. Also, five performance measures, including accuracy, sensitivity, specificity, Matthews correlation coefficient (MCC), and normalized percentage better than random (S), were used to evaluate the performance of the models. LDA, as a conventional model, obtained poor prediction performance. On the other hand, SVMs, a more powerful model that had not previously been applied to predicting melatonin excretion patterns in the rat exposed to ELF-MF, showed an MCC value of 0.38 in the jackknife test, which confirms the higher reliability of the SVMs.
Affiliation(s)
- Samad Jahandideh
- Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
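Four of the five performance measures listed in the abstract above have standard definitions in terms of a binary confusion matrix, sketched below. The function name and the example counts are assumptions for illustration and are not taken from the paper; the fifth measure (S, normalized percentage better than random) is left out because the abstract does not give its formula.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and Matthews correlation coefficient
    from confusion-matrix counts. Undefined ratios are reported as 0.0."""
    def ratio(num, den):
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    sensitivity = ratio(tp, tp + fn)   # true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, denom)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts, not results from the study.
    for name, value in binary_metrics(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.4f}")
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, which makes it more informative than accuracy alone on unbalanced data.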
|
69
|
Yu L, Guo Y, Zhang Z, Li Y, Li M, Li G, Xiong W, Zeng Y. SecretP: a new method for predicting mammalian secreted proteins. Peptides 2010; 31:574-8. [PMID: 20045033 DOI: 10.1016/j.peptides.2009.12.026] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Received: 11/13/2009] [Revised: 12/17/2009] [Accepted: 12/17/2009] [Indexed: 11/19/2022]
Abstract
In contrast to the large number of classically secreted proteins (CSPs) and non-secreted proteins (NSPs), only a few proteins have been experimentally proved to enter non-classical secretory pathways. It is therefore difficult to identify non-classically secreted proteins (NCSPs), and no methods are available for distinguishing the three types of proteins simultaneously. To solve this problem, data mining was first carried out, and mammalian proteins exported via ER-Golgi-independent pathways were collected through extensive literature searches. In this paper, a support vector machine (SVM)-based ternary classifier named SecretP is proposed to predict mammalian secreted proteins by using pseudo-amino acid composition (PseAA) and five additional features. When distinguishing the three types of proteins, SecretP yielded an accuracy of 88.79%. In an evaluation on an independent test set of 92 human proteins, 76 were correctly predicted as NCSPs. When applied to another public independent data set, the prediction results of SecretP are comparable to those of other existing computational methods. Therefore, SecretP can be a useful supplementary tool for future secretome studies. The web server SecretP and all supplementary tables listed in this paper are freely available at http://cic.scu.edu.cn/bioinformatics/secretp/index.htm.
Affiliation(s)
- Lezheng Yu
- College of Chemistry, Sichuan University, Chengdu 610064, PR China
|
70
|
Zhang N, Duan G, Gao S, Ruan J, Zhang T. Prediction of the parallel/antiparallel orientation of beta-strands using amino acid pairing preferences and support vector machines. J Theor Biol 2010; 263:360-8. [PMID: 20035768 DOI: 10.1016/j.jtbi.2009.12.019] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Received: 09/22/2009] [Revised: 11/05/2009] [Accepted: 12/17/2009] [Indexed: 10/20/2022]
|
71
|
Nanni L, Shi JY, Brahnam S, Lumini A. Protein classification using texture descriptors extracted from the protein backbone image. J Theor Biol 2010; 264:1024-32. [PMID: 20307550 DOI: 10.1016/j.jtbi.2010.03.020] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Received: 12/07/2009] [Revised: 01/28/2010] [Accepted: 03/11/2010] [Indexed: 10/19/2022]
Abstract
In this work, we propose a method for protein classification that combines different texture descriptors extracted from the 2-D distance matrix obtained from the 3-D tertiary structure of a given protein. Instead of considering all atoms in the protein, the distance matrix is calculated by considering only those atoms that belong to the protein backbone. The positive results reported in this paper offer further experimental confirmation that the distance matrix contains sufficient information for describing a protein. Moreover, we show that combining features extracted from the primary structure with features extracted from the distance matrix increases the performance of our classification system. We demonstrate this finding by comparing the performance of an ensemble of classifiers that uses the combined features. The classifiers used in our experiments are support vector machines and random subspace of support vector machines. The experimental results, validated using three different datasets (protein fold recognition, DNA-binding proteins recognition, biological processes, and molecular functions recognition) along with different texture feature extraction methods (variants of local binary patterns, Radon feature transform based approaches, and Haralick descriptors) demonstrate the effectiveness of the proposed approach. Particularly interesting are the results in the classification of 27 types of structural properties: our proposed approach achieves significant improvement compared with other reported methods.
Affiliation(s)
- Loris Nanni
- DEIS, IEIIT-CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy.
|
72
|
Huang C, Zhang R, Chen Z, Jiang Y, Shang Z, Sun P, Zhang X, Li X. Predict potential drug targets from the ion channel proteins based on SVM. J Theor Biol 2009; 262:750-6. [PMID: 19903486 DOI: 10.1016/j.jtbi.2009.11.002] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Received: 08/06/2009] [Revised: 11/04/2009] [Accepted: 11/04/2009] [Indexed: 11/28/2022]
Abstract
The identification of molecular targets is a critical step in the drug discovery and development process. Ion channel proteins represent highly attractive drug targets implicated in a diverse range of disorders, in particular in the cardiovascular and central nervous systems. Due to the limits of experimental technique and the low-throughput nature of patch-clamp electrophysiology, they remain a target class waiting to be exploited. In our study, we combined three types of protein features (primary sequence, secondary structure, and subcellular localization) to predict potential drug targets from ion channel proteins by applying the classical support vector machine (SVM) method. Our prediction comprised two stages. In stage 1, we predicted ion channel target proteins based on whole-genome target protein characteristics. We first performed feature selection by the Mann-Whitney U test, then made predictions to identify potential ion channel targets by SVM, and designed a new evaluating indicator Q to prioritize results. In stage 2, we made a prediction based on known ion channel target protein characteristics. A genetic algorithm was used to select features, and SVM was used to predict ion channel targets. We then integrated the results of the two stages and found that five ion channel proteins appeared in both prediction results, including CGMP-gated cation channel beta subunit and Gamma-aminobutyric acid receptor subunit alpha-5, four of which were related to some nerve diseases. This suggests that these five proteins are potential targets for drug discovery and that our prediction strategies are effective.
Affiliation(s)
- Chen Huang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150086, China
|
73
|
Chen K, Jiang Y, Du L, Kurgan L. Prediction of integral membrane protein type by collocated hydrophobic amino acid pairs. J Comput Chem 2009; 30:163-72. [DOI: 10.1002/jcc.21053] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.7] [Indexed: 11/10/2022]
|
74
|
Shazman S, Mandel-Gutfreund Y. Classifying RNA-binding proteins based on electrostatic properties. PLoS Comput Biol 2008; 4:e1000146. [PMID: 18716674 PMCID: PMC2518515 DOI: 10.1371/journal.pcbi.1000146] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.8] [Received: 09/12/2007] [Accepted: 06/26/2008] [Indexed: 01/15/2023] [Open Access]
Abstract
Protein structure can provide new insight into the biological function of a protein and can enable the design of better experiments to learn its biological roles. Moreover, deciphering the interactions of a protein with other molecules can contribute to the understanding of the protein's function within cellular processes. In this study, we apply a machine learning approach for classifying RNA-binding proteins based on their three-dimensional structures. The method is based on characterizing unique properties of electrostatic patches on the protein surface. Using an ensemble of general protein features and specific properties extracted from the electrostatic patches, we have trained a support vector machine (SVM) to distinguish RNA-binding proteins from other positively charged proteins that do not bind nucleic acids. Specifically, the method was applied on proteins possessing the RNA recognition motif (RRM) and successfully classified RNA-binding proteins from RRM domains involved in protein–protein interactions. Overall the method achieves 88% accuracy in classifying RNA-binding proteins, yet it cannot distinguish RNA from DNA binding proteins. Nevertheless, by applying a multiclass SVM approach we were able to classify the RNA-binding proteins based on their RNA targets, specifically, whether they bind a ribosomal RNA (rRNA), a transfer RNA (tRNA), or messenger RNA (mRNA). Finally, we present here an innovative approach that does not rely on sequence or structural homology and could be applied to identify novel RNA-binding proteins with unique folds and/or binding motifs. Gene expression in all living organisms is regulated by a complex set of events at both transcriptional and posttranscriptional levels. RNA-binding proteins play a key role in posttranscriptional events including splicing, stability, transport, and translation. Nowadays, there is increasing evidence that many other cellular processes may be mediated by RNA. 
Identifying new proteins involved in interaction with RNA is thus essential to unraveling the cellular processes in which these interactions are involved. In the current study we present a successful computational approach for classifying RNA-binding proteins and distinguishing them from other proteins based on structural and electrostatic properties. We test the method on a unique protein domain, the RNA recognition motif (RRM), which mediates both RNA and protein interactions. We show that we can discriminate RNA-binding RRMs from protein-binding RRMs. Further, we demonstrate that we can classify known RNA-binding proteins based on their RNA target (mRNA, rRNA, or tRNA). Our method does not rely on any kind of evolutionary information and thus can be applied to identify RNA-binding proteins with novel modes of RNA recognition.
Affiliation(s)
- Shula Shazman
- Faculty of Biology, Technion—Israel Institute of Technology, Haifa, Israel
|
75
|
Prediction of protein structure class by coupling improved genetic algorithm and support vector machine. Amino Acids 2008; 35:581-90. [DOI: 10.1007/s00726-008-0084-z] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Received: 12/10/2007] [Accepted: 01/31/2008] [Indexed: 10/22/2022]
|
76
|
Shen HB, Yang J, Chou KC. Methodology development for predicting subcellular localization and other attributes of proteins. Expert Rev Proteomics 2007; 4:453-63. [PMID: 17705704 DOI: 10.1586/14789450.4.4.453] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Indexed: 11/08/2022]
Abstract
Facing the explosion of newly generated protein sequences in the postgenomic age, we are challenged to develop computational methods for the fast and accurate identification of their subcellular localization and other attributes. This review summarizes recent methodology developments, with a focus on artificial neural networks, the statistical learning and support vector machine, the fuzzy logic-based algorithm and the evidence-theory-based algorithm, as well as the ensemble classifier approach. Meanwhile, an outline of the use of different descriptors for protein samples is given. In addition, a series of web servers established recently based on various ensemble classifiers are also briefly introduced.
Affiliation(s)
- Hong-Bin Shen
- Shanghai Jiaotong University, Institute of Image Processing & Pattern Recognition, Shanghai, China.
|
77
|
Chen YL, Li QZ. Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo-amino acid composition. J Theor Biol 2007; 248:377-81. [PMID: 17572445 DOI: 10.1016/j.jtbi.2007.05.019] [Citation(s) in RCA: 113] [Impact Index Per Article: 6.6] [Received: 03/03/2007] [Revised: 04/18/2007] [Accepted: 05/10/2007] [Indexed: 10/23/2022]
Abstract
Apoptosis proteins are very important for understanding the mechanism of programmed cell death. The localization of an apoptosis protein can provide valuable information about its molecular function, but predicting that localization is a challenging task. In our previous work we proposed an increment of diversity (ID) method using protein sequence information for this prediction task. In this work, based on the concept of Chou's pseudo-amino acid composition [Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct. Funct. Genet. (Erratum: Chou, K.C., 2001, vol. 44, 60) 43, 246-255, Chou, K.C., 2005. Using amphiphilic pseudo-amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10-19], a different pseudo-amino acid composition that uses hydropathy distribution information is introduced. A novel ID_SVM algorithm, which combines ID with the support vector machine (SVM), is proposed. This method is applied to three data sets (317 apoptosis proteins, 225 apoptosis proteins and 98 apoptosis proteins). Jackknife tests yield higher predictive success rates than the previous algorithms.
Affiliation(s)
- Ying-Li Chen
- Laboratory of Theoretical Biophysics, Department of Physics, College of Sciences and Technology, Inner Mongolia University, Hohhot 010021, China
|
78
|
Pu X, Guo J, Leung H, Lin Y. Prediction of membrane protein types from sequences and position-specific scoring matrices. J Theor Biol 2007; 247:259-65. [PMID: 17433369 DOI: 10.1016/j.jtbi.2007.01.016] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.9] [Received: 07/27/2006] [Revised: 12/22/2006] [Accepted: 01/18/2007] [Indexed: 11/15/2022]
Abstract
Membrane proteins play an important role in biochemical processes such as signal transduction and transmembrane transport. Membrane proteins are usually classified into five types [Chou, K.C., Elrod, D.W., 1999. Prediction of membrane protein types and subcellular locations. Proteins: Struct. Funct. Genet. 34, 137-153] or six types [Chou, K.C., Cai, Y.D., 2005. J. Chem. Inf. Modelling 45, 407-413]. Designing in silico methods to identify and classify membrane proteins can help us understand the structure and function of unknown proteins. This paper introduces an integrative approach, IAMPC, to classify membrane proteins based on protein sequences and protein profiles. Its modules extract the amino acid composition of the whole profiles, the amino acid composition of N-terminal and C-terminal profiles, the amino acid composition of profile segments and the dipeptide composition of the whole profiles. In the computational experiment, the overall accuracy of the proposed approach is comparable with that of the functional-domain-based method. In addition, the performance of the proposed approach is complementary to that of the functional-domain-based method for different membrane protein types.
Affiliation(s)
- Xian Pu
- Department of Computer Sciences, The City University of Hong Kong, Hong Kong
|
79
|
Liu B, Li S, Wang Y, Lu L, Li Y, Cai Y. Predicting the protein SUMO modification sites based on Properties Sequential Forward Selection (PSFS). Biochem Biophys Res Commun 2007; 358:136-9. [PMID: 17470363 DOI: 10.1016/j.bbrc.2007.04.097] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Received: 04/05/2007] [Accepted: 04/12/2007] [Indexed: 11/24/2022]
Abstract
Protein SUMO modification is an important post-translational modification, and the optimization of prediction methods remains a challenge. Here, using the Support Vector Machine (SVM) algorithm, a novel computational method was developed for SUMO modification site prediction based on Sequential Forward Selection (SFS) of hundreds of amino acid properties collected from the Amino Acid Index database (http://www.genome.jp/aaindex). Our method is also compared with the 0/1 system, in which the 20 amino acids are represented by 20-dimensional vectors (A = 00000000000000000001, C = 00000000000000000010 and so on). The overall accuracy of leave-one-out cross-validation for our method reaches 89.18%, which is higher than that of the 0/1 system. This indicates that SUMO modification prediction is highly related to amino acid properties and that this approach provides a helpful tool for further investigation of SUMO modification and identification of sumoylation sites in proteins. The software is available at http://www.biosino.org/sumo.
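The 0/1 system mentioned in this abstract can be sketched as follows. Only the codes for A and C are given in the abstract; the alphabetical ordering of the remaining one-letter codes and the names `encode_residue` and `encode_window` are assumptions made for illustration.

```python
# Assumed residue order: alphabetical by one-letter code. The abstract only
# fixes A and C ("and so on"), so the rest of the order is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_residue(residue):
    """Return the 20-character 0/1 code for a single amino acid."""
    index = AMINO_ACIDS.index(residue)  # raises ValueError for unknown letters
    bits = ["0"] * 20
    bits[19 - index] = "1"  # A sets the rightmost bit, C the next one, ...
    return "".join(bits)


def encode_window(window):
    """Concatenate the codes of all residues in a sequence window."""
    return "".join(encode_residue(r) for r in window)


print(encode_residue("A"))
print(encode_residue("C"))
print(len(encode_window("AKE")))
```

A window of n residues becomes a 20n-dimensional binary vector, which is the kind of baseline input the property-based features are compared against.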
Affiliation(s)
- Boshu Liu
- Bioinformatics Center, Key Lab of Systems Biology, Shanghai Institute for Biological Sciences, Chinese Academy of Sciences, China
|
80
|
Lu L, Qian Z, Cai YD, Li Y. ECS: an automatic enzyme classifier based on functional domain composition. Comput Biol Chem 2007; 31:226-32. [PMID: 17500036 DOI: 10.1016/j.compbiolchem.2007.03.008] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Received: 01/11/2007] [Accepted: 03/26/2007] [Indexed: 11/19/2022]
Abstract
Classification of enzymes is a prerequisite for understanding their function. Here, an automatic enzyme identifier based on a support vector machine (SVM) with feature vectors from protein functional domain composition was built to identify enzymes, together with a classifier that assigns enzymes to six different classes: oxidoreductase, transferase, hydrolase, lyase, isomerase and ligase. The jackknife cross-validation test was adopted to evaluate the performance of our classifier. A success rate of 86.03% was achieved for enzyme/non-enzyme identification and 91.32% for enzyme classification, which is much better than that of the BLAST and PSI-BLAST based methods and also outperforms several existing works. The results indicate that protein functional domain composition is able to capture the major features which facilitate the identification/classification of proteins, thus demonstrating that our predictor could be a more effective and promising high-throughput method in enzyme research. Moreover, a web-based software, Enzyme Classification System (ECS), for identification as well as classification of enzymes can be accessed at: http://pcal.biosino.org/.
Affiliation(s)
- Lingyi Lu
- Bioinformatics Center, Key Lab of Molecular Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China
|
81
|
Zhang GQ, Cao ZW, Luo QM, Cai YD, Li YX. Operon prediction based on SVM. Comput Biol Chem 2006; 30:233-40. [PMID: 16716751 DOI: 10.1016/j.compbiolchem.2006.03.002] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.3] [Received: 12/02/2005] [Revised: 03/17/2006] [Accepted: 03/24/2006] [Indexed: 11/27/2022]
Abstract
The operon is a specific functional organization of genes found in bacterial genomes. Most genes within operons share common features. The support vector machine (SVM) approach is here used to predict operons at the genomic level. Four features were chosen as SVM input vectors: the intergenic distances, the number of common pathways, the number of conserved gene pairs and the mutual information of phylogenetic profiles. The analysis reveals that these common properties are indeed characteristic of the genes within operons and differ from those of non-operonic genes. Jackknife testing indicates that these input feature vectors, employed with an RBF kernel SVM, achieve high accuracy. To validate the method, Escherichia coli K12 and Bacillus subtilis were taken as benchmark genomes of known operon structure, and the prediction results in both show that the SVM can detect operon genes in target genomes efficiently and offers a satisfactory balance between sensitivity and specificity.
Affiliation(s)
- Guo-qing Zhang
- Hubei Bioinformatics and Molecular Imaging Key Laboratory, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
|
82
|
Shen HB, Yang J, Chou KC. Fuzzy KNN for predicting membrane protein types from pseudo-amino acid composition. J Theor Biol 2006; 240:9-13. [PMID: 16197963 DOI: 10.1016/j.jtbi.2005.08.016] [Citation(s) in RCA: 140] [Impact Index Per Article: 7.8] [Received: 07/13/2005] [Revised: 08/15/2005] [Accepted: 08/18/2005] [Indexed: 11/30/2022]
Abstract
Cell membranes are vitally important to the life of a cell. Although the basic structure of biological membrane is provided by the lipid bilayer, membrane proteins perform most of the specific functions. Membrane proteins are putatively classified into five different types. Identification of their types is currently an important topic in bioinformatics and proteomics. In this paper, based on the concept of representing protein samples in terms of their pseudo-amino acid composition, the fuzzy K-nearest neighbors (KNN) algorithm has been introduced to predict membrane protein types, and high success rates were observed. It is anticipated that, the current approach, which is based on a branch of fuzzy mathematics and represents a new strategy, may play an important complementary role to the existing methods in this area. The novel approach may also have notable impact on prediction of the other attributes, such as protein structural class, protein subcellular localization, and enzyme family class, among many others.
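A minimal sketch of a fuzzy K-nearest-neighbours rule, assuming the common formulation in which each of the k nearest training samples votes for its own class with weight 1/d^(2/(m-1)) and the votes are normalised into class memberships. The abstract gives no parameters, so the function name `fuzzy_knn`, the defaults `k=3` and `m=2.0`, the crisp training labels and the 2-D toy vectors (standing in for pseudo-amino acid composition vectors) are all assumptions.

```python
import math


def fuzzy_knn(train, query, k=3, m=2.0):
    """Return a dict mapping each class label to its fuzzy membership.

    train is a list of (vector, label) pairs; m > 1 is the fuzzifier.
    """
    # Keep the k training samples closest to the query (Euclidean distance).
    neighbours = sorted(
        (math.dist(vec, query), label) for vec, label in train
    )[:k]
    exponent = 2.0 / (m - 1.0)
    # Inverse-distance weights; the floor avoids division by zero.
    weights = [(1.0 / max(d, 1e-12) ** exponent, label) for d, label in neighbours]
    total = sum(w for w, _ in weights)
    memberships = {}
    for w, label in weights:
        memberships[label] = memberships.get(label, 0.0) + w / total
    return memberships


# Hypothetical training set with two membrane protein types.
train = [
    ((0.0, 0.0), "type I"),
    ((0.0, 1.0), "type I"),
    ((4.0, 4.0), "multipass"),
    ((5.0, 4.0), "multipass"),
]
memberships = fuzzy_knn(train, (1.0, 0.0))
predicted = max(memberships, key=memberships.get)
print(predicted, memberships)
```

Unlike plain KNN, the output is a graded membership per class, so a borderline protein shows up as such instead of being forced into a single hard vote.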
Affiliation(s)
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, 200030 Shanghai, China
|
83
|
Yu X, Cao J, Cai Y, Shi T, Li Y. Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines. J Theor Biol 2006; 240:175-84. [PMID: 16274699 DOI: 10.1016/j.jtbi.2005.09.018] [Citation(s) in RCA: 98] [Impact Index Per Article: 5.4] [Received: 02/09/2005] [Revised: 09/09/2005] [Accepted: 09/09/2005] [Indexed: 11/18/2022]
Abstract
In the post-genome era, the prediction of protein function is one of the most demanding tasks in bioinformatics. Machine learning methods, such as support vector machines (SVMs), greatly help to improve the classification of protein function. In this work, we integrated SVMs, protein sequence amino acid composition, and associated physicochemical properties into the prediction of nucleic-acid-binding proteins. We developed binary classifiers for rRNA-, RNA-, and DNA-binding proteins, which play an important role in the control of many cell processes. Each SVM predicts whether a protein belongs to the rRNA-, RNA-, or DNA-binding protein class. Self-consistency and jackknife tests were performed on protein data sets in which the sequence identity was < 25%. Test results show that the accuracies of the rRNA-, RNA-, and DNA-binding SVM predictions are approximately 84%, 78%, and 72%, respectively. The predictions were also performed on the ambiguous and negative data sets. The results demonstrate that the scores predicted for proteins in the ambiguous data set by the RNA- and DNA-binding SVM models were distributed around zero, while most proteins in the negative data set were assigned negative scores by all three SVMs. The score distributions agree well with the prior knowledge of those proteins and show the effectiveness of sequence-associated physicochemical properties in protein function prediction. The software is available from the author upon request.
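The difference between the self-consistency and jackknife tests named in this abstract can be sketched as follows. A nearest-centroid classifier is used as a lightweight stand-in for the SVM, and the one-dimensional toy data set and all function names are assumptions made for illustration.

```python
def nearest_centroid_predict(train, x):
    """Predict the label whose class centroid is closest to x.

    This is a simple stand-in classifier, not the SVM used in the paper.
    """
    sums, counts = {}, {}
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1

    def sq_dist(label):
        return sum((s / counts[label] - xi) ** 2 for s, xi in zip(sums[label], x))

    return min(sums, key=sq_dist)


def self_consistency_accuracy(data):
    """Train on all samples and test on the same samples."""
    hits = sum(nearest_centroid_predict(data, x) == y for x, y in data)
    return hits / len(data)


def jackknife_accuracy(data):
    """Leave each sample out in turn, train on the rest, test on it."""
    hits = 0
    for i, (x, y) in enumerate(data):
        held_out = data[:i] + data[i + 1:]
        hits += nearest_centroid_predict(held_out, x) == y
    return hits / len(data)


# Hypothetical one-feature data set with two classes.
data = [
    ([0.0], "binding"), ([1.0], "binding"), ([2.4], "binding"),
    ([3.0], "non-binding"), ([4.0], "non-binding"), ([5.0], "non-binding"),
]
print(self_consistency_accuracy(data), jackknife_accuracy(data))
```

Because the jackknife never scores a sample with a model that has seen it, its accuracy is usually the lower and more trustworthy of the two figures.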
Affiliation(s)
- Xiaojing Yu
- Bioinformatics Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Graduate School of the Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, PR China
|
84
|
Liu H, Yang J, Wang M, Xue L, Chou KC. Using Fourier spectrum analysis and pseudo amino acid composition for prediction of membrane protein types. Protein J 2006; 24:385-9. [PMID: 16323044 DOI: 10.1007/s10930-005-7592-4] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.2] [Indexed: 11/30/2022]
Abstract
Membrane proteins are generally classified into the following five types: (1) type I membrane protein, (2) type II membrane protein, (3) multipass transmembrane proteins, (4) lipid chain-anchored membrane proteins, and (5) GPI-anchored membrane proteins. Given the sequence of an uncharacterized membrane protein, how can we identify which one of the above five types it belongs to? This is important because the biological function of a membrane protein is closely correlated with its type. Particularly, with the explosion of protein sequences entering into databanks, it is in high demand to develop an automated method to address this problem. To realize this, the key is to catch the statistical characteristics for each of the five types. However, it is not easy because they are buried in a pile of long and complicated sequences. In this paper, based on the concept of the pseudo amino acid composition (Chou, K. C. (2001). PROTEINS: Structure, Function, and Genetics 43: 246-255), the technique of Fourier spectrum analysis is introduced. By doing so, the sample of a protein is represented by a set of discrete components that can incorporate a considerable amount of the sequence order effects as well as its amino acid composition information. On the basis of such a statistical frame, the support vector machine (SVM) is introduced to perform predictions. High success rates were yielded by the self-consistency test, jackknife test, and independent dataset test, suggesting that the current approach holds a promising potential to become a high throughput tool for membrane protein type prediction as well as other related areas.
Affiliation(s)
- Hui Liu
- Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, 200030, China
|
85
|
Yeh JI, Mao L. Prediction of Membrane Proteins in Mycobacterium tuberculosis Using a Support Vector Machine Algorithm. J Comput Biol 2006; 13:126-9. [PMID: 16472026 DOI: 10.1089/cmb.2006.13.126] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022] [Open Access]
Abstract
We report our finding of linear clustering of signal sequences at the N-terminus of M.tb membrane proteins, directing membrane localization. Although it is widely accepted that membrane proteins have signal peptides at the N-terminus, statistical ensemble analysis of Support Vector Machine prediction results indicates that M.tb membrane proteins have embedded N-terminal sequence patterns beyond the signal peptides previously identified in E. coli. The additional patterns at the N-terminus of M.tb membrane proteins may correlate with their unique enzymatic functions and unusual characteristics such as membrane interaction in pathogens.
Affiliation(s)
- Joanne I Yeh
- Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, RI 20906, USA.
|
86
|
Cai YD, Feng KY, Lu WC, Chou KC. Using LogitBoost classifier to predict protein structural classes. J Theor Biol 2006; 238:172-6. [PMID: 16043193 DOI: 10.1016/j.jtbi.2005.05.034] [Citation(s) in RCA: 156] [Impact Index Per Article: 8.7] [Received: 04/05/2005] [Revised: 05/04/2005] [Accepted: 05/05/2005] [Indexed: 11/19/2022]
Abstract
Prediction of protein classification is an important topic in molecular biology. This is because it is able to not only provide useful information from the viewpoint of structure itself, but also greatly stimulate the characterization of many other features of proteins that may be closely correlated with their biological functions. In this paper, the LogitBoost, one of the boosting algorithms developed recently, is introduced for predicting protein structural classes. It performs classification using a regression scheme as the base learner, which can handle multi-class problems and is particularly superior in coping with noisy data. It was demonstrated that the LogitBoost outperformed the support vector machines in predicting the structural classes for a given dataset, indicating that the new classifier is very promising. It is anticipated that the power in predicting protein structural classes as well as many other bio-macromolecular attributes will be further strengthened if the LogitBoost and some other existing algorithms can be effectively complemented with each other.
Affiliation(s)
- Yu-Dong Cai
- Department of Chemistry, College of Sciences, Shanghai University, 99 Shang-Da Road, Shanghai 200436, China
|
87
|
Bhasin M, Raghava GPS. GPCRsclass: a web tool for the classification of amine type of G-protein-coupled receptors. Nucleic Acids Res 2005; 33:W143-7. [PMID: 15980444 PMCID: PMC1160112 DOI: 10.1093/nar/gki351] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.4] [Indexed: 11/24/2022] [Open Access]
Abstract
The receptors of the amine subfamily are major drug targets for the therapy of nervous disorders and psychiatric diseases. The recognition of novel amine-type receptors and their cognate ligands is of paramount interest for pharmaceutical companies. In the past, Chou and co-workers have shown that different types of amine receptors are correlated with their amino acid composition and are predictable on its basis with considerable accuracy [Elrod and Chou (2002) Protein Eng., 15, 713–715]. This motivated us to develop a better method for the recognition of novel amine receptors and for their further classification. The method was developed on the basis of the amino acid composition and dipeptide composition of proteins using a support vector machine. The method was trained and tested on 167 proteins of the amine subfamily of G-protein-coupled receptors (GPCRs). The method discriminated the amine subfamily of GPCRs from globular proteins with a Matthews correlation coefficient of 0.98 and 0.99 using amino acid composition and dipeptide composition, respectively. In classifying different types of amine receptors using amino acid composition and dipeptide composition, the method achieved an accuracy of 89.8 and 96.4%, respectively. The performance of the method was evaluated using 5-fold cross-validation. The dipeptide composition based method predicted 67.6% of protein sequences with an accuracy of 100% with a reliability index ≥5. A web server, GPCRsclass, has been developed for predicting amine-binding receptors from their amino acid sequences [ and (mirror site)].
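The two sequence descriptors named in this abstract can be sketched as follows, assuming the usual definitions: amino acid composition as the fraction of each of the 20 residues, and dipeptide composition as the fraction of each of the 400 ordered residue pairs among all overlapping pairs. The function names, the residue ordering and the toy sequence are assumptions; the abstract does not specify the normalisation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def amino_acid_composition(seq):
    """Return a 20-dimensional vector of residue fractions."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return a 400-dimensional vector of overlapping dipeptide fractions."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = {}
    for pair in pairs:
        counts[pair] = counts.get(pair, 0) + 1
    return [
        counts.get(a + b, 0) / len(pairs)
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]


# Toy sequence, not a real receptor.
sequence = "ACAC"
aac = amino_acid_composition(sequence)
dpc = dipeptide_composition(sequence)
print(len(aac), len(dpc))
```

The dipeptide vector keeps some local sequence-order information that the 20-dimensional composition discards, which is one plausible reason for the higher accuracy reported with it.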
MESH Headings
- Artificial Intelligence
- Dipeptides/chemistry
- Internet
- Receptors, Adrenergic/chemistry
- Receptors, Adrenergic/classification
- Receptors, Biogenic Amine/chemistry
- Receptors, Biogenic Amine/classification
- Receptors, Cholinergic/chemistry
- Receptors, Cholinergic/classification
- Receptors, Dopamine/chemistry
- Receptors, Dopamine/classification
- Receptors, G-Protein-Coupled/chemistry
- Receptors, G-Protein-Coupled/classification
- Receptors, Serotonin/chemistry
- Receptors, Serotonin/classification
- Sequence Analysis, Protein
- Software
Affiliation(s)
- G. P. S. Raghava
- To whom the correspondence should be addressed. Tel: +91 172 2690557/2695225; Fax: +91 172 2690632/2690585
|
88
|
Shen H, Chou KC. Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types. Biochem Biophys Res Commun 2005; 334:288-92. [PMID: 16002049 DOI: 10.1016/j.bbrc.2005.06.087] [Citation(s) in RCA: 128] [Impact Index Per Article: 6.7] [Received: 06/09/2005] [Accepted: 06/14/2005] [Indexed: 10/25/2022]
Abstract
Knowledge of membrane protein type often provides crucial hints toward determining the function of an uncharacterized membrane protein. With the avalanche of new protein sequences emerging during the post-genomic era, it is highly desirable to develop an automated method that can serve as a high throughput tool in identifying the types of newly found membrane proteins according to their primary sequences, so as to timely make the relevant annotations on them for the reference usage in both basic research and drug discovery. Based on the concept of pseudo-amino acid composition [K.C. Chou, Proteins: Struct. Funct. Genet. 43 (2001) 246-255; Erratum: Proteins: Struct. Funct. Genet. 44 (2001) 60] that has made it possible to incorporate a considerable amount of sequence-order effects by representing a protein sample in terms of a set of discrete numbers, a novel predictor, the so-called "optimized evidence-theoretic K-nearest neighbor" or "OET-KNN" classifier, was proposed. It was demonstrated via the self-consistency test, jackknife test, and independent dataset test that the new predictor, compared with many previous ones, yielded higher success rates in most cases. The new predictor can also be used to improve the prediction quality for, among many other protein attributes, structural class, subcellular localization, enzyme family class, and G-protein coupled receptor type. The OET-KNN classifier will be available as a web-server at http://www.pami.sjtu.edu.cn/kcchou.
Affiliation(s)
- Hongbin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, China
|
89
|
Cai YD, Chou KC. Predicting membrane protein type by functional domain composition and pseudo-amino acid composition. J Theor Biol 2005; 238:395-400. [PMID: 16040052 DOI: 10.1016/j.jtbi.2005.05.035] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.8] [Received: 04/11/2005] [Revised: 05/25/2005] [Accepted: 05/26/2005] [Indexed: 10/25/2022]
Abstract
Given the sequence of a protein, how can we predict whether it is a membrane protein or a non-membrane protein? If it is a membrane protein, which type does it belong to? Since these questions are closely relevant to the function of an uncharacterized protein, their importance is self-evident. Particularly, with the explosion of protein sequences entering databanks and the much slower progress in using biochemical experiments to determine their functions, it is highly desirable to develop an automated method that can give fast answers to these questions. By hybridizing the functional domain (FunD) and pseudo-amino acid composition (PseAA), a new strategy called the FunD-PseAA predictor was introduced. To test the power of the predictor, a highly non-homologous data set was constructed in which none of the proteins has 25% sequence identity to any other. The overall success rates obtained with the FunD-PseAA predictor on this data set by the jackknife cross-validation test were 85% for distinguishing membrane proteins from non-membrane proteins, and 91% for identifying the membrane protein type among the following 5 categories: (1) type-1 membrane protein, (2) type-2 membrane protein, (3) multipass transmembrane protein, (4) lipid chain-anchored membrane protein, and (5) GPI-anchored membrane protein. These rates are much higher than those obtained by the other methods on the same stringent data set, indicating that the FunD-PseAA predictor may become a useful high throughput tool in bioinformatics and proteomics.
Affiliation(s)
- Yu-Dong Cai
- Biomolecular Sciences Department, University of Manchester Institute of Science & Technology, P.O. Box 88, Manchester, M60 1QD, UK.
|
90
|
Chou KC, Cai YD. Using GO-PseAA predictor to identify membrane proteins and their types. Biochem Biophys Res Commun 2005; 327:845-7. [PMID: 15649422 DOI: 10.1016/j.bbrc.2004.12.069] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.4] [Received: 12/04/2004] [Indexed: 11/21/2022]
Abstract
Cell membranes are crucial to the life of a cell. Although the basic structure of biological membranes is provided by the lipid bilayer, most of the specific functions are carried out by membrane proteins. Knowledge of membrane protein type often offers important clues toward determining the function of an uncharacterized protein. Therefore, predicting the type of a membrane protein from its primary sequence, or even just identifying whether an uncharacterized protein is a membrane protein or not, is an important and challenging problem in bioinformatics and proteomics. To deal with these problems, the GO-PseAA predictor is introduced, which operates in a hybridization space that combines the gene ontology and pseudo amino acid composition. To test the prediction quality, a dataset was constructed that contains 6476 non-membrane proteins and 5122 membrane proteins classified into five different types. To avoid redundancy and bias, none of the proteins included has ≥40% sequence identity to any other. The overall success rate by the jackknife cross-validation test was 94.76% in distinguishing non-membrane proteins from membrane proteins, and 95.84% in identifying the five membrane protein types. The high success rates suggest that the GO-PseAA predictor can catch the core feature of the statistical samples concerned and may become an automated high throughput tool in molecular and cell biology.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, CA 92130, USA.
|
91
|
Wang L, Chen K, Ong YS. Bio-kernel Self-organizing Map for HIV Drug Resistance Classification. Lecture Notes in Computer Science 2005. [PMCID: PMC7122014 DOI: 10.1007/11539087_20] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/28/2022]
Abstract
Kernel self-organizing map has been recently studied by Fyfe and his colleagues [1]. This paper investigates the use of a novel bio-kernel function for the kernel self-organizing map. For verification, the application of the proposed new kernel self-organizing map to HIV drug resistance classification using mutation patterns in protease sequences is presented. The original self-organizing map together with the distributed encoding method was compared. It has been found that the use of the kernel self-organizing map with the novel bio-kernel function leads to better classification and faster convergence rate ...
Affiliation(s)
- Lipo Wang
- School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, Nanyang Avenue, 639798 Singapore
- Ke Chen
- School of Software, Sun Yat-Sen University, 510275 Guangzhou, China
- Yew Soon Ong
- School of Computer Engineering, Nanyang Technological University, BLK N4, 2b-39, Nanyang Avenue, 639798 Singapore
|
92
|
Jiang-Ning S, Wei-Jiang L, Wen-Bo X. Cooperativity of the oxidization of cysteines in globular proteins. J Theor Biol 2004; 231:85-95. [PMID: 15363931 DOI: 10.1016/j.jtbi.2004.06.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 11/25/2003] [Revised: 06/01/2004] [Accepted: 06/07/2004] [Indexed: 11/17/2022]
Abstract
Based on 639 non-homologous proteins with 2910 cysteine-containing segments of well-resolved three-dimensional structures, a novel approach has been proposed to predict the disulfide-bonding state of cysteines in proteins by constructing a two-stage classifier combining a first global linear discriminator based on their amino acid composition and a second local support vector machine classifier. The overall prediction accuracy of this hybrid classifier for the disulfide-bonding state of cysteines in proteins reached 84.1% and 80.1% when measured on a per-cysteine and per-protein basis, respectively, using the rigorous jackknife procedure. It shows that whether cysteines should form disulfide bonds depends not only on the global structural features of proteins but also on the local sequence environment of proteins. The result demonstrates the applicability of this novel method, which provides prediction performance comparable to that of existing methods for the prediction of the oxidation states of cysteines in proteins.
Affiliation(s)
- Song Jiang-Ning
- The Key Laboratory of Industrial Biotechnology, Ministry of Education, Southern Yangtze University, 170 Huihe Road, Wuxi 214036, China.
|
93
|
Bhasin M, Raghava GPS. Classification of nuclear receptors based on amino acid composition and dipeptide composition. J Biol Chem 2004; 279:23262-6. [PMID: 15039428 DOI: 10.1074/jbc.m401932200] [Citation(s) in RCA: 175] [Impact Index Per Article: 8.8] [Indexed: 11/06/2022]
Abstract
Nuclear receptors are key transcription factors that regulate crucial gene networks responsible for cell growth, differentiation, and homeostasis. Nuclear receptors form a superfamily of phylogenetically related proteins and control functions associated with major diseases (e.g. diabetes, osteoporosis, and cancer). In this study, a novel method has been developed for classifying the subfamilies of nuclear receptors. The classification was achieved on the basis of amino acid and dipeptide composition from a sequence of receptors using support vector machines. Training and testing were done on a non-redundant data set of 282 proteins obtained from the NucleaRDB database (1). The performance of all classifiers was evaluated using a 5-fold cross-validation test. In the 5-fold cross-validation, the data set was randomly partitioned into five equal sets and evaluated five times on each distinct set while keeping the remaining four sets for training. It was found that different subfamilies of nuclear receptors were quite closely correlated in terms of amino acid composition as well as dipeptide composition. The overall accuracies of the amino acid composition-based and dipeptide composition-based classifiers were 82.6% and 97.5%, respectively. Therefore, our results prove that different subfamilies of nuclear receptors are predictable with considerable accuracy using amino acid or dipeptide composition. Furthermore, based on the above approach, an online web service, NRpred, was developed, which is available at www.imtech.res.in/raghava/nrpred.
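The two feature encodings and the 5-fold partitioning described in this abstract can be sketched as follows. This is an illustrative sketch only: the function names, the handling of non-standard residues, and the fixed random seed are assumptions rather than details taken from the paper, and the support vector machine step is omitted.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def amino_acid_composition(seq):
    """Fraction of each standard residue in the sequence (20 values)."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair (400 values).

    Pairs containing a non-standard character are skipped (assumption).
    """
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for a random k-fold partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]
```

Each sequence would be encoded with one of the two composition functions, and a classifier would then be trained on the `train` indices and scored on the `test` indices of every split, so that each protein is tested exactly once.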
Affiliation(s)
- Manoj Bhasin
- Institute of Microbial Technology, Chandigarh 160036, India
|