51
|
Solis AD, Rackovsky SR. Fold homology detection using sequence fragment composition profiles of proteins. Proteins 2011; 78:2745-56. [PMID: 20635424 DOI: 10.1002/prot.22788] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/05/2022]
Abstract
The effectiveness of sequence alignment in detecting structural homology among protein sequences decreases markedly when pairwise sequence identity is low (the so-called "twilight zone" problem of sequence alignment). Alternative sequence comparison strategies able to detect structural kinship among highly divergent sequences are necessary to address this need. Among them are alignment-free methods, which use global sequence properties (such as amino acid composition) to identify structural homology in a rapid and straightforward way. We explore the viability of using tetramer sequence fragment composition profiles in finding structural relationships that lie undetected by traditional alignment. We establish a strategy to recast any given protein sequence into a tetramer sequence fragment composition profile, using a series of amino acid clustering steps that have been optimized for mutual information. Our method has the effect of compressing the set of 160,000 unique tetramers (if using the 20-letter amino acid alphabet) into a more tractable number of reduced tetramers (approximately 15-30), so that a meaningful tetramer composition profile can be constructed. We test remote homology detection at the topology and fold superfamily levels using a comprehensive set of fold homologs, culled from the CATH database that share low pairwise sequence similarity. Using the receiver-operating characteristic measure, we demonstrate potentially significant improvement in using information-optimized reduced tetramer composition, over methods relying only on the raw amino acid composition or on traditional sequence alignment, in homology detection at or below the "twilight zone".
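The recasting step described in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical two-group (hydrophobic/polar) reduction of the amino acid alphabet, which yields 2^4 = 16 reduced tetramers; the paper derives its groupings by optimizing mutual information, and that procedure is not reproduced here. The names `REDUCED` and `reduced_tetramer_profile` are illustrative only.

```python
from collections import Counter
from itertools import product

# Hypothetical 2-letter reduction of the 20-letter alphabet (an assumption for
# illustration; the paper's clusters are optimized for mutual information).
REDUCED = {aa: "H" for aa in "ACFILMVW"}            # hydrophobic-like group
REDUCED.update({aa: "P" for aa in "DEGHKNPQRSTY"})  # polar-like group


def reduced_tetramer_profile(seq, mapping=REDUCED, k=4):
    """Return the relative frequency of every reduced k-mer in `seq`."""
    # Residues outside the mapping (e.g. 'X') are dropped before windowing,
    # so their neighbours become adjacent in the reduced string.
    reduced = "".join(mapping[aa] for aa in seq.upper() if aa in mapping)
    counts = Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))
    total = sum(counts.values())
    letters = sorted(set(mapping.values()))
    keys = ["".join(p) for p in product(letters, repeat=k)]
    # Every possible reduced k-mer gets an entry, so profiles of different
    # sequences are directly comparable vectors.
    return {key: (counts[key] / total if total else 0.0) for key in keys}


profile = reduced_tetramer_profile("MKTAYIAKQRQISFVKSHFSRQ")
print(len(profile))  # number of reduced tetramers in this toy alphabet
```

With the full 20-letter alphabet the same windowing would produce 20^4 = 160,000 possible tetramers, which is why the reduction step is needed before a composition profile becomes meaningful.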
Affiliation(s)
- Armando D Solis
- Department of Biological Sciences, New York City College of Technology, The City University of New York, Brooklyn, New York 11201, USA.
|
52
|
Zhang S, Ding S, Wang T. High-accuracy prediction of protein structural class for low-similarity sequences based on predicted secondary structure. Biochimie 2011; 93:710-4. [PMID: 21237245 DOI: 10.1016/j.biochi.2011.01.001] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.8] [Received: 12/04/2010] [Accepted: 01/04/2011] [Indexed: 11/30/2022]
Abstract
Information on the structural classes of proteins has been proven to be important in many fields of bioinformatics. Prediction of protein structural class for low-similarity sequences is a challenging problem. In this study, 11 features (including 8 re-used features and 3 newly designed features) are used to reflect the general contents and spatial arrangements of the secondary structural elements of a given protein sequence. To evaluate the performance of the proposed method, jackknife cross-validation tests are performed on two widely used benchmark datasets, 1189 and 25PDB, with sequence similarity lower than 40% and 25%, respectively. Comparison of our results with other methods shows that our proposed method is very promising and may provide a cost-effective alternative for predicting protein structural class, in particular for low-similarity datasets.
Affiliation(s)
- Shengli Zhang
- School of Mathematical Sciences, Dalian University of Technology, Ganjingzi District, Dalian, Liaoning, PR China.
|
53
|
Liu T, Zheng X, Wang J. Prediction of protein structural class for low-similarity sequences using support vector machine and PSI-BLAST profile. Biochimie 2010; 92:1330-4. [DOI: 10.1016/j.biochi.2010.06.013] [Citation(s) in RCA: 98] [Impact Index Per Article: 7.0] [Received: 03/02/2010] [Accepted: 06/16/2010] [Indexed: 11/25/2022]
|
54
|
Zheng X, Li C, Wang J. An information-theoretic approach to the prediction of protein structural class. J Comput Chem 2010; 31:1201-6. [PMID: 19777491 DOI: 10.1002/jcc.21406] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/06/2022]
Abstract
An information-theoretic approach, which combines a sequence decomposition technique and a fuzzy clustering algorithm, is proposed for prediction of protein structural class. This approach can bypass the process of selecting and comparing sequence features used in previous methods. First, distances between each pair of protein sequences are estimated using a conditional decomposition technique from information theory. Then, the fuzzy k-nearest neighbor algorithm is used to identify the structural class of a protein given a set of sample sequences. To verify the strength of our method, we choose three widely used datasets constructed by Chou and Zhou. The jackknife test shows that our approach improves prediction accuracy over existing methods.
Affiliation(s)
- Xiaoqi Zheng
- Department of Mathematics, Shanghai Normal University, Shanghai 200234, China
|
55
|
Chen L, Lu L, Feng K, Li W, Song J, Zheng L, Yuan Y, Zeng Z, Feng K, Lu W, Cai Y. Multiple classifier integration for the prediction of protein structural classes. J Comput Chem 2010; 30:2248-54. [PMID: 19274708 DOI: 10.1002/jcc.21230] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/11/2022]
Abstract
Supervised classifiers, such as artificial neural networks, partition trees, and support vector machines, are often used for the prediction and analysis of biological data. However, choosing an appropriate classifier is not straightforward because each classifier has its own strengths and weaknesses, and each biological dataset has its own characteristics. By integrating many classifiers together, one can avoid the dilemma of choosing an individual classifier out of many to achieve optimized classification results (Rahman et al., Multiple Classifier Combination for Character Recognition: Revisiting the Majority Voting System and Its Variation, Springer, Berlin, 2002, 167-178). The classification algorithms come from Weka, a collection of software tools for machine learning algorithms (Witten and Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, 2005). By integrating many predictors (classifiers) together through simple voting, the correct prediction (classification) rates are 65.21% and 65.63% for a basic training dataset and an independent test set, respectively. These results are better than those of any single machine learning algorithm collected in Weka when exactly the same data are used. Furthermore, we introduce an integration strategy that accounts for both classifier weightings and classifier redundancy. A feature selection strategy, called minimum redundancy maximum relevance (mRMR), is transferred to algorithm selection to deal with classifier redundancy in this research, and the weightings are based on the performance of each classifier. The best classification results are obtained when 11 algorithms are selected by the mRMR method and integrated through majority votes with weightings. As a result, the correct prediction rates are 68.56% and 69.29% for the basic training dataset and the independent test dataset, respectively. The web server is available at http://chemdata.shu.edu.cn/protein_st/.
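The integration step described in this abstract (simple voting, then majority voting with performance-based weightings) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `weighted_vote`, the use of per-classifier accuracies as weights, and the alphabetical tie-break are all hypothetical choices, not details taken from the paper, and the mRMR selection of classifiers is not shown.

```python
from collections import defaultdict


def weighted_vote(predictions, weights=None):
    """Combine one predicted label per classifier into a single label.

    predictions: list of class labels, one from each classifier.
    weights: optional list of non-negative weights (assumed here to be
             something like each classifier's validation accuracy).
             If omitted, every classifier counts equally (simple voting).
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per prediction")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # Ties are broken alphabetically so the result is deterministic
    # (an arbitrary choice made for this sketch).
    return max(sorted(scores), key=lambda label: scores[label])


votes = ["alpha", "beta", "beta"]
print(weighted_vote(votes))                   # simple majority
print(weighted_vote(votes, [0.9, 0.3, 0.3]))  # weighted majority
```

With equal weights the two "beta" votes win; with the example weights the single strong "alpha" vote (0.9) outweighs the two weaker "beta" votes (0.6 in total), which is the behaviour that weighting is meant to allow.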
Affiliation(s)
- Lei Chen
- Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, People's Republic of China
|
56
|
Yang JY, Peng ZL, Chen X. Prediction of protein structural classes for low-homology sequences based on predicted secondary structure. BMC Bioinformatics 2010; 11 Suppl 1:S9. [PMID: 20122246 PMCID: PMC3009544 DOI: 10.1186/1471-2105-11-s1-s9] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.8] [Indexed: 11/13/2022] Open
Abstract
Background: Prediction of protein structural classes (α, β, α + β and α/β) from amino acid sequences is of great importance, as it benefits the study of protein function, regulation and interactions. Many methods have been developed for high-homology protein sequences, and their prediction accuracies can reach 90%. However, for low-homology sequences whose average pairwise sequence identity lies between 20% and 40%, they perform relatively poorly, with prediction accuracy often below 60%.
Results: We propose a new method to predict protein structural classes on the basis of features extracted from the predicted secondary structures of proteins rather than directly from their amino acid sequences. It first uses PSIPRED to predict the secondary structure for each protein sequence. Then, the chaos game representation is employed to represent the predicted secondary structure as two time series, from which we generate a comprehensive set of 24 features using recurrence quantification analysis, K-string based information entropy and segment-based analysis. The resulting feature vectors are finally fed into a simple yet powerful Fisher's discriminant algorithm for the prediction of protein structural classes. We tested the proposed method on three low-homology benchmark datasets and achieved overall prediction accuracies of 82.9%, 83.1% and 81.3%, respectively. Comparisons with ten existing methods showed that our method consistently performs better for all the tested datasets, with overall accuracy improvements ranging from 2.3% to 27.5%. A web server that implements the proposed method is freely available at http://www1.spms.ntu.edu.sg/~chenxin/RKS_PPSC/.
Conclusion: The high prediction accuracy achieved by our proposed method is attributed to the design of a comprehensive feature set on the predicted secondary structure sequences, which is capable of characterizing the sequence order information, local interactions of the secondary structural elements, and spatial arrangements of α helices and β strands. Thus, it is a valuable method for predicting protein structural classes, particularly for low-homology amino acid sequences.
Affiliation(s)
- Jian-Yi Yang
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, 21 Nanyang Link, Singapore.
|
57
|
Mizianty MJ, Kurgan L. Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences. BMC Bioinformatics 2009; 10:414. [PMID: 20003388 PMCID: PMC2805645 DOI: 10.1186/1471-2105-10-414] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.3] [Received: 08/06/2009] [Accepted: 12/13/2009] [Indexed: 11/13/2022] Open
Abstract
Background: Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high-identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences.
Results: The proposed MODular Approach to Structural class prediction (MODAS) method is unique in that it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes, including conservation of residues and arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozen modern competing predictors show that MODAS provides the best overall accuracy, which ranges between 80% and 96.7% (83.5% for the twilight-zone datasets), depending on the dataset. This translates into 19% and 8% error rate reductions when compared against the best-performing competing method on the two largest datasets. The proposed predictor achieves 58% accuracy for the membrane proteins class, which is not considered by the majority of existing methods, even though this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes.
Conclusions: The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes, which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/.
Affiliation(s)
- Marcin J Mizianty
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada.
|
58
|
Song J, Tan H, Mahmood K, Law RHP, Buckle AM, Webb GI, Akutsu T, Whisstock JC. Prodepth: predict residue depth by support vector regression approach from protein sequences only. PLoS One 2009; 4:e7072. [PMID: 19759917 PMCID: PMC2742725 DOI: 10.1371/journal.pone.0007072] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.3] [Received: 04/14/2009] [Accepted: 08/20/2009] [Indexed: 11/24/2022] Open
Abstract
Residue depth (RD) is a solvent exposure measure that complements the information provided by conventional accessible surface area (ASA) and describes to what extent a residue is buried in the protein structure space. Previous studies have established that RD is correlated with several protein properties, such as protein stability, residue conservation and amino acid types. Accurate prediction of RD has many potentially important applications in the field of structural bioinformatics, for example, facilitating the identification of functionally important residues, or residues in the folding nucleus, or enzyme active sites from sequence information. In this work, we introduce an efficient approach that uses support vector regression to quantify the relationship between RD and protein sequence. We systematically investigated eight different sequence encoding schemes including both local and global sequence characteristics and examined their respective prediction performances. For the objective evaluation of our approach, we used 5-fold cross-validation to assess the prediction accuracies and showed that the overall best performance could be achieved with a correlation coefficient (CC) of 0.71 between the observed and predicted RD values and a root mean square error (RMSE) of 1.74, after incorporating the relevant multiple sequence features. The results suggest that residue depth could be reliably predicted solely from protein primary sequences: local sequence environments are the major determinants, while global sequence features could influence the prediction performance marginally. We highlight two examples as a comparison in order to illustrate the applicability of this approach. We also discuss the potential implications of this new structural parameter in the field of protein structure prediction and homology modeling. This method might prove to be a powerful tool for sequence analysis.
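The two evaluation measures quoted in this abstract, the correlation coefficient (CC) and the root mean square error (RMSE) between observed and predicted residue depth, can be computed as in the sketch below. It assumes that CC denotes the Pearson correlation coefficient, which the abstract does not state explicitly; the function names and the example values are hypothetical.

```python
import math


def pearson_cc(observed, predicted):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    # Undefined (division by zero) if either list is constant.
    return cov / (sd_o * sd_p)


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Made-up depth values for five residues, for illustration only.
observed_rd = [3.2, 5.1, 4.0, 7.8, 6.3]
predicted_rd = [3.9, 4.6, 4.4, 6.9, 6.8]
print(pearson_cc(observed_rd, predicted_rd), rmse(observed_rd, predicted_rd))
```

CC measures how well the predictions track the ranking and trend of the observed depths, while RMSE measures the typical size of the error in the same units as the depth values, so the two are usually reported together.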
Affiliation(s)
- Jiangning Song
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, Japan
- * Corresponding author
- Hao Tan
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- Khalid Mahmood
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- ARC Centre of Excellence for Structural and Functional Microbial Genomics, Monash University, Clayton, Melbourne, Victoria, Australia
- Ruby H. P. Law
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- Ashley M. Buckle
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- Geoffrey I. Webb
- Faculty of Information Technology, Monash University, Clayton, Melbourne, Victoria, Australia
- Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, Japan
- James C. Whisstock
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- ARC Centre of Excellence for Structural and Functional Microbial Genomics, Monash University, Clayton, Melbourne, Victoria, Australia
- * Corresponding author
|
59
|
Sequence physical properties encode the global organization of protein structure space. Proc Natl Acad Sci U S A 2009; 106:14345-8. [PMID: 19706520 DOI: 10.1073/pnas.0903433106] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Indexed: 11/18/2022] Open
Abstract
It is demonstrated that, properly represented, the amino acid composition of protein sequences contains the information necessary to delineate the global properties of protein structure space. A numerical representation of amino acid sequence in terms of a set of property factors is used, and the values of those property factors are averaged over individual sequences and then over sets of sequences belonging to structurally defined groups. These sequence sets then can be viewed as points in a 10-dimensional space, and the organization of that space, determined only by sequence properties, is similar at both local and global scales to that of the space of protein structures determined previously.
|
60
|
Qiu JD, Luo SH, Huang JH, Liang RP. Using support vector machines for prediction of protein structural classes based on discrete wavelet transform. J Comput Chem 2009; 30:1344-50. [DOI: 10.1002/jcc.21115] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.3] [Indexed: 11/11/2022]
|
61
|
Mielke SP, Krishnan V. Characterization of protein secondary structure from NMR chemical shifts. Progress in Nuclear Magnetic Resonance Spectroscopy 2009; 54:141-165. [PMID: 20160946 PMCID: PMC2766081 DOI: 10.1016/j.pnmrs.2008.06.002] [Citation(s) in RCA: 84] [Impact Index Per Article: 5.6] [Indexed: 05/10/2023]
Affiliation(s)
- Steven P. Mielke
- UC Davis Genome Center, University of California, Davis, California
- V.V. Krishnan
- Department of Applied Science and Center for Comparative Medicine, University of California, Davis, California
- Department of Chemistry, California State University, Fresno, California
|
62
|
Yang JY, Peng ZL, Yu ZG, Zhang RJ, Anh V, Wang D. Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation. J Theor Biol 2009; 257:618-26. [DOI: 10.1016/j.jtbi.2008.12.027] [Citation(s) in RCA: 92] [Impact Index Per Article: 6.1] [Received: 08/21/2008] [Revised: 11/07/2008] [Accepted: 12/19/2008] [Indexed: 11/17/2022]
|
63
|
Prediction of protein structural classes using hybrid properties. Mol Divers 2008; 12:171-9. [PMID: 18953662 DOI: 10.1007/s11030-008-9093-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 08/05/2008] [Accepted: 09/25/2008] [Indexed: 10/21/2022]
Abstract
In this paper, amino acid compositions are combined with some protein sequence properties (physicochemical properties) to predict protein structural classes. We are able to predict protein structural classes using a mathematical model that combines the nearest neighbor algorithm (NNA), mRMR (minimum redundancy, maximum relevance), and a feature forward searching strategy. Jackknife cross-validation is used to evaluate the prediction accuracy. As a result, the prediction success rate improves to 68.8%, which is better than the 62.2% obtained when using only amino acid compositions. Therefore, we conclude that the physicochemical properties are factors that contribute to protein folding, and that the most contributing features are the amino acid compositions. We expect that prediction accuracy will improve further as more sequence information comes to light. A web server for predicting the protein structural classes is available at http://app3.biosino.org:8080/liwenjin/index.jsp.
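The evaluation protocol named in this abstract, a nearest neighbor classifier scored by jackknife (leave-one-out) cross-validation, can be sketched as follows. The sketch assumes squared Euclidean distance between feature vectors, which the abstract does not specify, and it omits the mRMR ranking and forward feature search; the function name and toy data are hypothetical.

```python
def jackknife_nn_accuracy(features, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier.

    features: list of equal-length numeric feature vectors.
    labels:   list of class labels, one per feature vector.
    """
    def sq_dist(a, b):
        # Squared Euclidean distance (an assumed metric for this sketch).
        return sum((x - y) ** 2 for x, y in zip(a, b))

    correct = 0
    for i, (x, y) in enumerate(zip(features, labels)):
        # Each sample is held out in turn and classified by its nearest
        # neighbour among all remaining samples.
        others = (j for j in range(len(features)) if j != i)
        nearest = min(others, key=lambda j: sq_dist(x, features[j]))
        correct += labels[nearest] == y
    return correct / len(features)


# Toy two-feature vectors for two well-separated classes.
toy_features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
toy_labels = ["alpha", "alpha", "beta", "beta"]
print(jackknife_nn_accuracy(toy_features, toy_labels))
```

Because every sample is tested exactly once against a model built from all the others, the jackknife gives a single deterministic accuracy for a given dataset, which is one reason it is widely used in the structural class prediction literature.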
|
64
|
Prediction of the protein structural class by specific peptide frequencies. Biochimie 2008; 91:226-9. [PMID: 18957316 DOI: 10.1016/j.biochi.2008.09.005] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.0] [Received: 07/10/2008] [Accepted: 09/18/2008] [Indexed: 11/21/2022]
Abstract
We evaluated the occurrence frequencies of i-peptides in the protein sequences of two datasets, which include proteins with sequence similarity lower than 25% and 40%, respectively. We developed a new structural class prediction algorithm using the most frequent i-peptides (with i = 2, 3, 4) that characterize the four structural classes. The best results were obtained with tri-peptides, which capture much more structural information from sequences than di-peptides. Compared with other methods similarly based on peptide occurrence frequencies, our method achieves the best prediction accuracy. We also compared it with methods based on more sophisticated computational approaches.
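Finding the most frequent i-peptides in a set of sequences, as this abstract describes, amounts to counting overlapping windows of length i. The sketch below shows only that counting step for the sequences of one structural class; how the paper turns the resulting peptide lists into a class prediction is not described in the abstract and is not attempted here. The function name and the toy sequences are hypothetical.

```python
from collections import Counter


def top_peptides(sequences, i=3, top=5):
    """Return the `top` most frequent overlapping i-peptides in `sequences`."""
    counts = Counter()
    for seq in sequences:
        # Slide a window of length i along the sequence; sequences shorter
        # than i contribute nothing.
        counts.update(seq[j:j + i] for j in range(len(seq) - i + 1))
    return [peptide for peptide, _ in counts.most_common(top)]


# Toy sequences standing in for the members of one structural class.
class_sequences = ["AAAA", "AAAB"]
print(top_peptides(class_sequences, i=3, top=1))
```

Running the same count separately for each structural class and comparing the resulting lists is one plausible way to look for class-specific peptides, though the selection criteria used in the paper may differ.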
|
65
|
Chen K, Kurgan LA, Ruan J. Prediction of protein structural class using novel evolutionary collocation-based sequence representation. J Comput Chem 2008; 29:1596-604. [PMID: 18293306 DOI: 10.1002/jcc.20918] [Citation(s) in RCA: 131] [Impact Index Per Article: 8.2] [Indexed: 11/11/2022]
Abstract
Knowledge of structural classes is useful in understanding folding patterns in proteins. Although existing structural class prediction methods applied virtually all state-of-the-art classifiers, many of them use a relatively simple protein sequence representation that often includes amino acid (AA) composition. To this end, we propose a novel sequence representation that incorporates evolutionary information encoded using PSI-BLAST profile-based collocation of AA pairs. We used six benchmark datasets and five representative classifiers to quantify and compare the quality of the structural class prediction with the proposed representation. The best classifier, support vector machine, achieved 61-96% accuracy on the six datasets. These predictions were comprehensively compared with a wide range of recently proposed methods for prediction of structural classes. Our comprehensive comparison shows the superiority of the proposed representation, which results in error rate reductions that range between 14% and 26% when compared with predictions of the best-performing, previously published classifiers on the considered datasets. The study also shows that, for the benchmark datasets that include sequences characterized by low identity (i.e., 25%, 30%, and 40%), the prediction accuracies are 20-35% lower than for the other three datasets, which include sequences with a higher degree of similarity. In conclusion, the proposed representation is shown to substantially improve the accuracy of structural class prediction. A web server that implements the presented prediction method is freely available at http://biomine.ece.ualberta.ca/Structural_Class/SCEC.html.
Affiliation(s)
- Ke Chen
- Department of Electrical and Computer Engineering, ECERF, University of Alberta, Edmonton, Alberta, Canada
|
66
|
Gu F, Chen H, Ni J. Protein structural class prediction based on an improved statistical strategy. BMC Bioinformatics 2008; 9 Suppl 6:S5. [PMID: 18541058 PMCID: PMC2423446 DOI: 10.1186/1471-2105-9-s6-s5] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022] Open
Abstract
Background: Protein structural class (PSC) is among the most basic but important classifications of protein structures. Techniques for predicting protein structural class have been under development for decades. Two popular types of index, proposed in many works, are those based on amino acid frequency (AAF) and those based on amino acid arrangement (AAA) with long-term correlation (LTC). Both have their pros and cons. For example, the AAF index focuses on statistical analysis, while the AAA-LTC index emphasizes long-term biological significance. Unfortunately, the datasets used in previous work were not very reliable, because they contained a small number of sequences with high sequence similarity.
Results: By modifying a statistical strategy, we propose a new index method that combines probability and information theory with long-term correlation. We also propose a numerically and biologically reliable dataset that includes more than 5700 sequences with low sequence similarity. The results show that the proposed approach achieves high accuracy. Compared with the amino acid composition (AAC) index using a distance method, our approach improves accuracy by 16-20% in the re-substitution test and by about 6-11% in the cross-validation test. The corresponding values are about 23% and 15% for the component coupled method (CCM).
Conclusion: A new index method combining probability and information theory with long-term correlation is proposed in this paper. The statistical method is improved significantly by the new index. A cross-validation test was conducted, and the results show that the proposed method provides a substantial improvement.
Affiliation(s)
- Fei Gu
- Department of Biotechnology, College of Life Sciences, Zhejiang University, Hangzhou, 310027, China.
|
67
|
SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences. BMC Bioinformatics 2008; 9:226. [PMID: 18452616 PMCID: PMC2391167 DOI: 10.1186/1471-2105-9-226] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.4] [Received: 10/31/2007] [Accepted: 05/01/2008] [Indexed: 11/16/2022] Open
Abstract
Background: Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds, and thus determining structural similarity without sequence similarity would be desirable for structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for datasets in which the sequence identity of any pair of sequences belongs to the twilight zone. We propose the SCPRED method, which improves prediction accuracy for sequences that share twilight-zone pairwise similarity with the sequences used for the prediction.
Results: SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on an extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior to over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble-of-classifiers predictors.
Conclusion: SCPRED can accurately find similar structures for sequences that share low identity with the sequences used for the prediction. The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that SCPRED's predictions can be successfully used as a post-processing filter to improve the performance of modern fold classification methods.
|
68
|
Kurgan LA, Zhang T, Zhang H, Shen S, Ruan J. Secondary structure-based assignment of the protein structural classes. Amino Acids 2008; 35:551-64. [DOI: 10.1007/s00726-008-0080-3] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.0] [Received: 01/31/2008] [Accepted: 02/27/2008] [Indexed: 11/24/2022]
|
69
|
Anand A, Pugalenthi G, Suganthan PN. Predicting protein structural class by SVM with class-wise optimized features and decision probabilities. J Theor Biol 2008; 253:375-80. [PMID: 18423492 DOI: 10.1016/j.jtbi.2008.02.031] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.9] [Received: 12/19/2007] [Revised: 01/24/2008] [Accepted: 02/25/2008] [Indexed: 11/27/2022]
Abstract
Determination of protein structural class solely from sequence information is a challenging task. Several attempts to solve this problem using various methods can be found in the literature. We present a support vector machine (SVM) approach in which probability-based decisions are used along with class-wise optimized feature sets. This approach has two characteristics that distinguish it from earlier attempts: (1) it uses class-wise optimized features and (2) decisions of different SVM classifiers are coupled with probability estimates to make the final prediction. The algorithm was tested on three datasets, containing 498 domains, 1092 domains and 5261 domains. Ten-fold external cross-validation was performed to assess the performance of the algorithm. A high accuracy of 92.89% was obtained for the 498-dataset. We achieved 54.67% accuracy for the dataset with 1092 domains, which is better than the previously reported best accuracy of 53.8%. We obtained 59.43% prediction accuracy for the larger and less redundant 5261-dataset. We also investigated the advantage of using class-wise features over the union of these features (the conventional approach) in a one-vs.-all SVM framework. Our results clearly show the advantage of using class-wise optimized features. A brief analysis of the selected class-wise features indicates their biological significance.
Affiliation(s)
- Ashish Anand
- School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
|
70
|
Xu J, He Y, Qiang B, Yuan J, Peng X, Pan XM. A novel method for high accuracy sumoylation site prediction from protein sequences. BMC Bioinformatics 2008; 9:8. [PMID: 18179724 PMCID: PMC2245905 DOI: 10.1186/1471-2105-9-8] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.4] [Received: 04/05/2007] [Accepted: 01/08/2008] [Indexed: 11/21/2022]
Abstract
Background: Protein sumoylation is an essential, dynamic and reversible post-translational modification that plays a role in dozens of cellular activities, especially the regulation of gene expression and the maintenance of genomic stability. Currently, the complexities of the sumoylation mechanism cannot be fully resolved by experimental approaches. Computational approaches might therefore be a promising way to direct the experimental identification of sumoylation sites and to shed light on the reaction mechanism.
Results: Here we present a statistical method for sumoylation site prediction. A 5-fold cross-validation test over the experimentally identified sumoylation sites yielded excellent prediction performance, with correlation coefficient, specificity, sensitivity and accuracy equal to 0.6364, 97.67%, 73.96% and 96.71%, respectively. Additionally, the predictor's performance is maintained when high-level homologs are removed.
Conclusion: Using a statistical method, we have developed a new SUMO site prediction method, SUMOpre, which performs well in terms of correlation coefficient, specificity, sensitivity and accuracy.
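The four performance figures quoted in the abstract above can all be derived from a single binary confusion matrix. The following is a minimal Python sketch of how such measures are conventionally computed. The function name `binary_metrics` and the example counts are illustrative assumptions, not SUMOpre's code or data, and the "correlation coefficient" is assumed here to be the Matthews correlation coefficient.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification measures from confusion counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)          # fraction of true sites recovered
    specificity = tn / (tn + fp)          # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = binary_metrics(tp=40, fp=10, tn=90, fn=10)
```

When negatives greatly outnumber positives, as is typical for modification-site data, accuracy and specificity can both be high while sensitivity is much lower, which is why the correlation coefficient is usually reported alongside them.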
Affiliation(s)
- Jialin Xu
- The Key Laboratory of Bioinformatics, Ministry of Education, China, Department of Biological Sciences and Biotechnology, Tsinghua University, Beijing, 100084, China.
|
71
|
Taguchi YH, Gromiha MM. Application of amino acid occurrence for discriminating different folding types of globular proteins. BMC Bioinformatics 2007; 8:404. [PMID: 17953741 PMCID: PMC2174517 DOI: 10.1186/1471-2105-8-404] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.9] [Received: 06/01/2007] [Accepted: 10/22/2007] [Indexed: 11/10/2022]
Abstract
Background: Predicting the three-dimensional structure of a protein from its amino acid sequence is a long-standing goal in computational/molecular biology. The discrimination of different structural classes and folding types is an intermediate step in protein structure prediction.
Results: In this work, we propose a method based on linear discriminant analysis (LDA) for discriminating 30 different folding types of globular proteins using amino acid occurrence. Our method was tested with a non-redundant set of 1612 proteins and discriminated them with an accuracy of 38%, which is comparable to or better than other methods in the literature. A web server has been developed for discriminating the folding type of a query protein from its amino acid sequence, and it is available at http://granular.com/PROLDA/.
Conclusion: Amino acid occurrence has been successfully used to discriminate different folding types of globular proteins. The discrimination accuracy obtained with amino acid occurrence is better than that obtained with amino acid composition and/or amino acid properties. In addition, the method produces results very quickly.
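The abstract above contrasts amino acid occurrence with amino acid composition. The sketch below illustrates one plausible reading of that distinction, in which occurrence means raw residue counts and composition means counts normalized by sequence length. That reading, and the names `occurrence` and `composition`, are assumptions made for illustration and are not taken from the paper's code.

```python
from collections import Counter

# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def occurrence(sequence):
    """Raw count of each standard amino acid (a 20-dimensional vector)."""
    counts = Counter(sequence.upper())
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def composition(sequence):
    """Counts divided by the number of standard residues (fractions sum to 1)."""
    occ = occurrence(sequence)
    total = sum(occ)
    return [c / total for c in occ] if total else [0.0] * len(AMINO_ACIDS)


short = "AAC"
doubled = "AACAAC"  # same composition as `short`, twice the occurrence
```

Under this reading, `short` and `doubled` have identical composition vectors but different occurrence vectors, so occurrence retains sequence-length information that composition discards.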
Affiliation(s)
- Y-h Taguchi
- Department of Physics, Faculty of Science and Technology, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan.
|
72
|
Abstract
We propose a simple model for calculating the pKa values of ionizable residues in proteins. It is based on the premise that the pKa shift of an ionizable residue is linearly correlated with the interaction between that residue and the local environment created by the surrounding residues. Despite its simplicity, the model displays good prediction performance. In a sixfold cross test over a data set of 405 experimental pKa values in 73 protein chains with known structures, the root-mean-square deviation (RMSD) between the experimental and calculated pKa values was 0.77. The accuracy of the model increases with the size of the data set: the RMSD is 0.609 for glutamate (the largest data set, with 141 sites) and approximately 1 pH unit for lysine, with a data set containing 45 sites.
Affiliation(s)
- Yun He
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China
|
73
|
Li Z, Wu S, Chen Z, Ye N, Yang S, Liao C, Zhang M, Yang L, Mei H, Yang Y, Zhao N, Zhou Y, Zhou P, Xiong Q, Xu H, Liu S, Ling Z, Chen G, Li G. Structural parameterization and functional prediction of antigenic polypeptome sequences with biological activity through quantitative sequence-activity models (QSAM) by molecular electronegativity edge-distance vector (VMED). Sci China C Life Sci 2007; 50:706-16. [PMID: 17879071 PMCID: PMC7089106 DOI: 10.1007/s11427-007-0080-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 09/30/2006] [Accepted: 06/14/2007] [Indexed: 11/18/2022]
Abstract
A new set of descriptors, the molecular electronegativity edge-distance vector (VMED), was proposed that requires only the primary structures of peptides. It was applied to describing and characterizing the molecular structures of oligopeptides and polypeptides, based on the electronegativity of each atom or the electronic charge index (ECI) of atomic clusters and the bonding distance between atom pairs. Here, the molecular structures of antigenic polypeptides were expressed in this way in order to propose an automated technique for the computerized identification of helper T lymphocyte (Th) epitopes. Furthermore, a modified MED vector was proposed from the primary structures of polypeptides, based on the ECI and the relative bonding distance of the fundamental skeleton groups. The side chain of each amino acid was treated as a pseudo-atom. The developed VMED was easy to calculate and to apply. A quantitative model was established for 28 immunogenic or antigenic polypeptides (AGPP), with 14 A(d)-restricted sequences (1-14) and 14 sequences with other restricted activities assigned as "1" (+) and "0" (-), respectively. The latter comprised 6 A(b) (15-20), 3 A(k) (21-23), 3 E(k) (24-26) and 2 H-2(k) (27 and 28) restricted sequences. Good results were obtained, with 90% correct classification (2 wrong out of 20 training samples) and 100% correct prediction (none wrong out of 8 test samples); alternatively, 100% correct classification (none wrong out of 20 training samples) was obtained together with 88% correct prediction (1 wrong out of 8 test samples). Both stochastic sampling and cross-validation were performed to demonstrate good performance. The described method may also be suitable for estimating and predicting class I and class II major histocompatibility complex (MHC) epitopes in humans. It should be useful in the immune identification and recognition of proteins and genes and in the design and development of subunit vaccines.
Several quantitative structure-activity relationship (QSAR) models were developed by the multiple linear regression (MLR) method for various oligopeptides and polypeptides, including 58 dipeptides and 31 pentapeptides with angiotensin-converting enzyme (ACE) inhibition. To examine how well the descriptors characterize the molecular structure of polypeptides, a molecular modeling investigation was performed for the functional prediction of polypeptide sequences with antigenic activity and heptapeptide sequences with tachykinin activity through quantitative sequence-activity models (QSAMs) based on VMED. The results showed that VMED exhibited both excellent structural selectivity and good activity prediction. Moreover, VMED behaved well for both QSAR and QSAM of poly- and oligopeptides, with estimation ability and prediction power equal to or better than those reported in previous references. A preliminary conclusion was drawn that both the classical and the modified MED vectors are very useful structural descriptors. Some suggestions were proposed for further studies on QSAR/QSAM of proteins in various fields.
Affiliation(s)
- ZhiLiang Li, ShiRong Wu, ZeCong Chen, Nancy Ye, ShengXi Yang, ChunYang Liao, MengJun Zhang, Li Yang, Hu Mei, Yan Yang, Na Zhao, Yuan Zhou, Ping Zhou, Qing Xiong, Hong Xu, ShuShen Liu, ZiHua Ling, Gang Chen, GenRong Li
- College of Chemistry and Chemical Engineering/Key Laboratory for Chemobiomedical Science and Engineering under Chongqing Municipality, College of Life Science and Biological Engineering/Key Laboratory for Biomechanics and Tissue Engineering under Ministry of Education, Chongqing University, Chongqing, 400044 China
- State Key Laboratory for Chemobiosensors and Chemobiometrics under MOST at Hunan University, Changsha, 410012 China
- MengJun Zhang (additional affiliation)
- Department of Medical Analysis/PLA Center of Bioinformatics Immunology, Surgeon Third University, Chongqing, 400031 China
- Hu Mei, Gang Chen (additional affiliation)
- Technology Centre for Life Sciences, Singapore Polytechnic, 500 Dover Road, Singapore, 139651 Singapore
|
74
|
Prediction protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern. J Theor Biol 2007; 250:186-93. [PMID: 17959199 DOI: 10.1016/j.jtbi.2007.09.014] [Citation(s) in RCA: 132] [Impact Index Per Article: 7.8] [Received: 06/19/2007] [Revised: 09/08/2007] [Accepted: 09/10/2007] [Indexed: 11/21/2022]
Abstract
Compared with the conventional amino acid (AA) composition, the pseudo-amino acid (PseAA) composition, originally introduced for protein subcellular location prediction, can incorporate much more information about a protein sequence and so markedly enhance the power of a discrete model to predict various attributes of a protein. In this study, based on the concept of PseAA composition, the approximate entropy and hydrophobicity pattern of a protein sequence are used to characterize the PseAA components. In addition, the immune genetic algorithm (IGA) is applied to search for the optimal weight factors in generating the PseAA composition. Thus, for a given protein sequence sample, a 27-D (dimensional) PseAA composition is generated as its descriptor. The fuzzy K nearest neighbors (FKNN) classifier is adopted as the prediction engine. The results obtained in predicting protein structural class are quite encouraging, indicating that the current approach may also be used to improve the prediction quality of other protein attributes, or at least can play a complementary role to the existing methods in the relevant areas. Our algorithm is written in Matlab and is available by contacting the corresponding author.
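Approximate entropy, one of the sequence descriptors named in the abstract above, is a regularity measure for a numeric series. The Python sketch below follows the standard textbook definition (template length `m`, tolerance `r`, Chebyshev distance, self-matches counted). How the authors convert a protein sequence into a numeric series, and which `m` and `r` they use, is not stated in the abstract, so the defaults and the example series here are assumptions.

```python
import math


def approximate_entropy(series, m=2, r=0.2):
    """Approximate entropy ApEn(m, r) of a numeric series.

    Requires len(series) > m + 1. Self-matches are counted, so every
    logarithm below has a strictly positive argument.
    """
    n = len(series)
    if n <= m + 1:
        raise ValueError("series is too short for the chosen m")

    def phi(length):
        count = n - length + 1
        templates = [series[i:i + length] for i in range(count)]
        total = 0.0
        for a in templates:
            # Number of templates within tolerance r of template a.
            matches = sum(
                1
                for b in templates
                if max(abs(x - y) for x, y in zip(a, b)) <= r
            )
            total += math.log(matches / count)
        return total / count

    return phi(m) - phi(m + 1)


# Two perfectly regular series: both should give values close to zero.
constant_apen = approximate_entropy([1.0] * 20)
alternating_apen = approximate_entropy([0.0, 1.0] * 15)
```

Regular series give values near zero, while irregular series give larger values, which is what makes the quantity usable as one component of a fixed-length sequence descriptor.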
|
75
|
Lampros C, Exarchos TP, Fotiadis DI. Sequence-based protein structure prediction using a reduced state-space hidden Markov model. Comput Biol Med 2007; 37:1211-24. [PMID: 17161834 DOI: 10.1016/j.compbiomed.2006.10.014] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 05/24/2006] [Revised: 10/24/2006] [Accepted: 10/30/2006] [Indexed: 10/23/2022]
Abstract
This work describes a hidden Markov model (HMM) with a reduced number of states that simultaneously learns amino acid sequence and secondary structure for proteins of known three-dimensional structure. It is used for two tasks: protein class prediction and fold recognition. The Protein Data Bank and the annotation of the SCOP database are used for training and evaluation of the proposed HMM for a number of protein classes and folds. Results demonstrate that the reduced state-space HMM performs equivalently, or even better in some cases, in classifying proteins than an HMM trained with the amino acid sequence. The major advantage of the proposed approach is that a small number of states is employed and the training algorithm is of low complexity and thus relatively fast.
Affiliation(s)
- Christos Lampros
- Unit of Medical Technology and Intelligent Information Systems, Department of Computer Science, University of Ioannina, GR 45110 Ioannina, Greece
|
76
|
Kurgan L, Chen K. Prediction of protein structural class for the twilight zone sequences. Biochem Biophys Res Commun 2007; 357:453-60. [PMID: 17433260 DOI: 10.1016/j.bbrc.2007.03.164] [Citation(s) in RCA: 70] [Impact Index Per Article: 4.1] [Received: 03/21/2007] [Accepted: 03/26/2007] [Indexed: 11/26/2022]
Abstract
Structural class characterizes the overall folding type of a protein or its domain. This paper develops an accurate method for in silico prediction of structural classes from low homology (twilight zone) protein sequences. The proposed LLSC-PRED method applies a linear logistic regression classifier and a custom-designed, feature-based sequence representation to provide predictions. The main advantages of LLSC-PRED are its comprehensive representation, which includes 58 features describing the composition and physicochemical properties of the sequences, and the transparency of the prediction model. The representation also includes predicted secondary structure content, thus for the first time exploring the synergy between these two related predictions. Based on tests performed with a large set of 1673 twilight zone domains, LLSC-PRED's prediction accuracy, which exceeds 62%, is shown to be better than the accuracy of over a dozen recently published competing in silico methods and similar to the accuracy of other, non-transparent classifiers that use the proposed representation.
Affiliation(s)
- Lukasz Kurgan
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada.
|
77
|
Exarchos TP, Papaloukas C, Lampros C, Fotiadis DI. Mining sequential patterns for protein fold recognition. J Biomed Inform 2007; 41:165-79. [PMID: 17573243 DOI: 10.1016/j.jbi.2007.05.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 11/20/2006] [Revised: 04/06/2007] [Accepted: 05/05/2007] [Indexed: 10/23/2022]
Abstract
Protein data contain discriminative patterns that can be used in many beneficial applications if they are defined correctly. In this work sequential pattern mining (SPM) is utilized for sequence-based fold recognition. Protein classification in terms of fold recognition plays an important role in computational protein analysis, since it can contribute to the determination of the function of a protein whose structure is unknown. Specifically, one of the most efficient SPM algorithms, cSPADE, is employed for the analysis of protein sequence. A classifier uses the extracted sequential patterns to classify proteins in the appropriate fold category. For training and evaluating the proposed method we used the protein sequences from the Protein Data Bank and the annotation of the SCOP database. The method exhibited an overall accuracy of 25% in a classification problem with 36 candidate categories. The classification performance reaches up to 56% when the five most probable protein folds are considered.
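The abstract above reports two figures: accuracy when only the single best-ranked fold is counted (25%) and accuracy when the true fold may appear anywhere among the five most probable folds (56%). The sketch below shows how such top-k accuracy is conventionally computed. The function name `top_k_accuracy` and the toy labels are hypothetical and are not taken from the paper.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the first k ranked guesses.

    ranked_predictions[i] lists candidate labels for sample i, most probable first.
    """
    if len(ranked_predictions) != len(true_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Toy data with three hypothetical fold labels.
ranked = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["a", "c", "b"],
]
truth = ["a", "a", "a", "b"]

top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
top3 = top_k_accuracy(ranked, truth, k=3)
```

Top-k accuracy can only stay the same or rise as k grows, so the gap between the top-1 and top-5 figures indicates how often the correct fold is ranked close to, but not at, the top.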
Affiliation(s)
- Themis P Exarchos
- Department of Medical Physics, Medical School, University of Ioannina, GR 45110 Ioannina, Greece
|
78
|
Kurgan L, Kedarisetti KD. Sequence representation and prediction of protein secondary structure for structural motifs in twilight zone proteins. Protein J 2007; 25:463-74. [PMID: 17115254 DOI: 10.1007/s10930-006-9029-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/28/2022]
Abstract
Characterizing and classifying regularities in protein structure is an important element in uncovering the mechanisms that regulate protein structure, function and evolution. Recent research concentrates on the analysis of structural motifs that can be used to describe larger, fold-sized structures based on homologous primary sequences. At the same time, the accuracy of protein secondary structure prediction based on multiple sequence alignment drops significantly when low homology (twilight zone) sequences are considered. This paper therefore addresses the problem of providing an alternative sequence representation that would improve the ability to distinguish secondary structures for twilight zone sequences without using alignment. We consider a novel classification problem in which structural motifs, referred to as structural fragments (SFs), are defined as uniform strand, helix and coil fragments. Classification of SFs allows us to design novel sequence representations and to investigate which other factors and prediction algorithms may result in improved discrimination. Comprehensive experimental results show that a statistically significant improvement in classification accuracy can be achieved by (1) improving sequence representations and (2) removing possible noise on the terminal residues in the SFs. Combining these two approaches reduces the error rate on average by 15% when compared to classification using the standard representation and noisy information on the terminal residues, bringing the classification accuracy to over 70%. Finally, we show that certain prediction algorithms, such as neural networks and boosted decision trees, are superior to other algorithms.
Affiliation(s)
- Lukasz Kurgan
- Electrical and Computer Engineering Department, University of Alberta, Edmonton, Alberta, Canada, T6G 2V4.
|
79
|
Wu Z, Wang Y, Feng E, Chen L. A new geometric-topological method to measure protein fold similarity. Chem Phys Lett 2007. [DOI: 10.1016/j.cplett.2006.11.071] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/24/2022]
|
80
|
Kedarisetti KD, Kurgan L, Dick S. Classifier ensembles for protein structural class prediction with varying homology. Biochem Biophys Res Commun 2006; 348:981-8. [PMID: 16904630 DOI: 10.1016/j.bbrc.2006.07.141] [Citation(s) in RCA: 104] [Impact Index Per Article: 5.8] [Received: 07/18/2006] [Accepted: 07/23/2006] [Indexed: 11/17/2022]
Abstract
Structural class characterizes the overall folding type of a protein or its domain. A number of computational methods have been proposed to predict structural class based on primary sequences; however, the accuracy of these methods is strongly affected by sequence homology. This paper proposes an ensemble classification method and a compact feature-based sequence representation. This method improves prediction accuracy for the four main structural classes compared to competing methods, and provides highly accurate predictions for sequences of widely varying homologies. The experimental evaluation of the proposed method shows superior results across sequences that span the entire homology spectrum, ranging from 25% to 90% homology. The error rates were reduced by over 20% when compared with using individual prediction methods and the most commonly used composition vector representation of protein sequences. Comparisons with competing methods on three large benchmark datasets consistently show the superiority of the proposed method.
Affiliation(s)
- Kanaka Durga Kedarisetti
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alta., Canada
|
81
|
Nielsen BG, Røgen P, Bohr HG. Gauss-integral based representation of protein structure for predicting the fold class from the sequence. Math Comput Model 2006. [DOI: 10.1016/j.mcm.2005.11.014] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/25/2022]
|
82
|
Ruan J, Wang K, Yang J, Kurgan LA, Cios K. Highly accurate and consistent method for prediction of helix and strand content from primary protein sequences. Artif Intell Med 2005; 35:19-35. [PMID: 16081261 DOI: 10.1016/j.artmed.2005.02.006] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 11/14/2004] [Revised: 01/22/2005] [Accepted: 02/22/2005] [Indexed: 11/25/2022]
Abstract
Objective: One of the interesting computational topics in bioinformatics is the prediction of the secondary structure of proteins. Over 30 years of research has been devoted to the topic, but we are still far from having reliable prediction methods. A critical piece of information for accurate prediction of secondary structure is the helix and strand content of a given protein sequence. The ability to accurately predict the content of those two secondary structures has good potential to improve the accuracy of secondary structure prediction. Most of the existing methods use the composition vector to predict the content. Their underlying assumption is that the vector can be used to provide a functional mapping between primary sequence and helix/strand content. While this is true for small sets of proteins, we show that for larger protein sets such mappings are inconsistent, i.e. the same composition vectors correspond to different contents. We therefore propose a method for predicting helix/strand content from primary protein sequences that is fundamentally different from currently available methods.
Methods and material: Our method is accurate and uses a novel approach to obtain information from the primary sequence based on a composition moment vector, a measure that includes information about both the composition of a given primary sequence and the position of amino acids in the sequence. In contrast to the composition vector, we show that it provides a functional mapping between primary sequence and the helix/strand content.
Results: A set of benchmarks involving a large dataset of over 11,000 protein sequences from the Protein Data Bank was performed to validate the method. Prediction by a neural network had an average accuracy of 91.5% for the helix content and 94.5% for the strand content. We also show that using the new measure results in about a 40% reduction in error rates when compared with the composition vector results.
Conclusions: The developed method has much better accuracy than other existing methods, as shown on a large body of proteins, in contrast to other reported results that often target small sets of specific protein types, such as globular proteins.
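The abstract above describes the composition moment vector only in words: a measure that combines residue composition with residue position. The Python sketch below illustrates that idea. The exact formula used here, in which each residue type sums its 1-based positions raised to the power `order` and the sum is divided by the product of (N - d) for d = 1..order, is an assumption about the definition and is not given in the abstract. The function name is likewise hypothetical.

```python
# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_moment_vector(sequence, order):
    """Position-aware composition of a sequence (assumed definition).

    order = 0 reduces to plain residue counts; higher orders weight each
    residue by its 1-based position raised to that power.
    """
    n = len(sequence)
    if order < 0 or order >= n:
        raise ValueError("order must satisfy 0 <= order < len(sequence)")
    # Assumed normalization: product of (N - d) for d = 1..order.
    norm = 1
    for d in range(1, order + 1):
        norm *= n - d
    moments = {aa: 0.0 for aa in AMINO_ACIDS}
    for position, aa in enumerate(sequence.upper(), start=1):
        if aa in moments:  # non-standard residue codes are ignored
            moments[aa] += position ** order
    return {aa: total / norm for aa, total in moments.items()}


# Two toy sequences with the same composition but a different residue order.
first = composition_moment_vector("AAC", order=1)
second = composition_moment_vector("CAA", order=1)
```

The two toy sequences share the same order-0 vector (two A, one C) yet differ at order 1, which illustrates the abstract's point that a plain composition vector can map distinct sequences to the same representation while a position-aware one need not.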
Affiliation(s)
- Jishou Ruan
- College of Mathematics and LPMC, Nankai University, Tianjin 300071, PR China
|
83
|
Gromiha MM, Ahmad S, Suwa M. TMBETA-NET: discrimination and prediction of membrane spanning beta-strands in outer membrane proteins. Nucleic Acids Res 2005; 33:W164-7. [PMID: 15980447 PMCID: PMC1160128 DOI: 10.1093/nar/gki367] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.3] [Indexed: 11/14/2022] Open
Abstract
We have developed a web server, TMBETA-NET, for discriminating outer membrane proteins and predicting their membrane-spanning β-strand segments. The amino acid compositions of globular and outer membrane proteins have been systematically analyzed and a statistical method has been proposed for discriminating outer membrane proteins. The prediction of membrane-spanning segments is mainly based on a feed-forward neural network and refined with β-strand length. Our program takes the amino acid sequence as input and displays the type of the protein along with membrane-spanning β-strand segments as a stretch of highlighted amino acid residues. Further, the probability of each residue being in a transmembrane β-strand is provided with a coloring scheme. We observed that outer membrane proteins were discriminated with an accuracy of 89% and their membrane-spanning β-strand segments at an accuracy of 73% just from amino acid sequence information. The prediction server is available at .
Affiliation(s)
- M Michael Gromiha
- Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST), AIST Tokyo Waterfront Bio-IT Research Building, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan.
|
84
|
Gromiha MM. Motifs in outer membrane protein sequences: Applications for discrimination. Biophys Chem 2005; 117:65-71. [PMID: 15905018 DOI: 10.1016/j.bpc.2005.04.005] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Received: 02/24/2005] [Revised: 04/01/2005] [Accepted: 04/01/2005] [Indexed: 12/31/2022]
Abstract
Discriminating outer membrane proteins (OMPs) from other folding types of globular and membrane proteins is an important problem for predicting their secondary and tertiary structures and for detecting outer membrane proteins in genomic sequences. In this work, we have systematically analyzed the distribution of amino acid residues in the sequences of globular and outer membrane proteins with several motifs, such as A*B, A**B, etc. We observed that the motifs E*L, A*K and L*E occur frequently in globular proteins while S*S, N*S and R*D predominantly occur in OMPs. We have devised a statistical method based on frequently occurring motifs in globular proteins and OMPs and obtained an accuracy of 96% and 82% for correctly identifying OMPs and excluding globular proteins, respectively. Further, we noticed that the motifs of transmembrane helical (TMH) proteins are different from those of OMPs. While I*A, I*L and L*I are preferred in TMH proteins, S*S, N*S and N*N predominantly occur in OMPs. The information about the occurrence of A*B motifs in TMH proteins and OMPs could discriminate them with an accuracy of 80% for excluding OMPs and 100% for identifying OMPs. The influence of protein size and structural class for discrimination is discussed.
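Counting motifs of the form A*B or A**B can be sketched as follows. This is a minimal illustration that assumes `*` stands for any single residue; the function name `gapped_motif_counts` is hypothetical, and the statistical discrimination step built on these counts is not reproduced because the abstract does not specify it.

```python
from collections import Counter

def gapped_motif_counts(seq, gap=1):
    """Count motifs made of residue A, `gap` arbitrary residues, then residue B.

    gap=1 yields A*B motifs, gap=2 yields A**B motifs, and so on.
    """
    step = gap + 1
    return Counter(
        seq[i] + "*" * gap + seq[i + step]
        for i in range(len(seq) - step)
    )

# Toy sequence chosen for illustration only.
counts = gapped_motif_counts("SASNSS", gap=1)
```

For the toy sequence the counter holds S*S twice and A*N and N*S once each; sequences shorter than the motif span give an empty counter.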
Affiliation(s)
- M Michael Gromiha
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), AIST Tokyo Waterfront Bio-IT Research Building, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
|
85
|
Gromiha MM, Ahmad S, Suwa M. Application of residue distribution along the sequence for discriminating outer membrane proteins. Comput Biol Chem 2005; 29:135-42. [PMID: 15833441 DOI: 10.1016/j.compbiolchem.2005.02.006] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.3] [Received: 02/01/2005] [Revised: 02/22/2005] [Accepted: 02/22/2005] [Indexed: 12/01/2022]
Abstract
Discriminating outer membrane proteins from other folding types of globular and membrane proteins is an important problem both for detecting outer membrane proteins from genomic sequences and for the successful prediction of their secondary and tertiary structures. In this work, we have systematically analyzed the distribution of amino acid residues in the sequences of globular and outer membrane proteins. We observed that the occurrence of two neighboring aliphatic and polar residues is significantly higher in outer membrane proteins than in globular proteins. From the information about the dipeptide composition we have devised a statistical method for discriminating outer membrane proteins from other globular and membrane proteins. Our approach correctly picked up the outer membrane proteins with an accuracy of 95% for the training set of 337 proteins. On the other hand, our method has correctly excluded the globular proteins at an accuracy of 79% in a non-redundant dataset of 674 proteins. Furthermore, the present method is able to correctly exclude alpha-helical membrane proteins up to an accuracy of 87%. These accuracy levels are comparable to other methods in the literature. The influence of protein size and structural class for discrimination is discussed.
Affiliation(s)
- M Michael Gromiha
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), AIST Tokyo Waterfront Bio-IT Research Building, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan.
|
86
|
Gromiha MM, Suwa M. A simple statistical method for discriminating outer membrane proteins with better accuracy. Bioinformatics 2004; 21:961-8. [PMID: 15531602 DOI: 10.1093/bioinformatics/bti126] [Citation(s) in RCA: 87] [Impact Index Per Article: 4.4] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Discriminating outer membrane proteins from other folding types of globular and membrane proteins is an important task both for identifying outer membrane proteins from genomic sequences and for the successful prediction of their secondary and tertiary structures. RESULTS We have systematically analyzed the amino acid composition of globular proteins from different structural classes and outer membrane proteins. We found that the residues, Glu, His, Ile, Cys, Gln, Asn and Ser, show a significant difference between globular and outer membrane proteins. Based on this information, we have devised a statistical method for discriminating outer membrane proteins from other globular and membrane proteins. Our approach correctly picked up the outer membrane proteins with an accuracy of 89% for the training set of 337 proteins. On the other hand, our method has correctly excluded the globular proteins at an accuracy of 79% in a non-redundant dataset of 674 proteins. Furthermore, the present method is able to correctly exclude alpha-helical membrane proteins up to an accuracy of 80%. These accuracy levels are comparable to other methods in the literature, and this is a simple method, which could be used for dissecting outer membrane proteins from genomic sequences. The influence of protein size, structural class and specific residues for discrimination is discussed.
Affiliation(s)
- M Michael Gromiha
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST) Aomi Frontier Building 17F, 2-43 Aomi, Koto-ku, Tokyo 135-0064, Japan.
|
87
|
|
88
|
Jin L, Fang W, Tang H. Prediction of protein structural classes by a new measure of information discrepancy. Comput Biol Chem 2003; 27:373-80. [PMID: 12927111 DOI: 10.1016/s1476-9271(02)00087-7] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.7] [Indexed: 10/27/2022]
Abstract
Since it was observed that the structural class of a protein is related to its amino acid composition, various methods based on amino acid composition have been proposed to predict protein structural classes. Though those methods are effective to some degree, their predictive quality is limited because amino acid composition cannot sufficiently capture the information in protein sequences. In this paper, a measure of information discrepancy is applied to the prediction of protein structural classes. Unlike the previous methods, this new approach is based on comparisons of subsequence distributions; therefore, the effect of residue order on protein structure is taken into account. The predictive results of the new approach on the same data set are better than those of the previous methods. For a data set of 1401 sequences with no more than 30% redundancy, the overall correctness rates of the resubstitution test and the jackknife test are 99.4% and 75.02%, respectively, and similar results are obtained for other data sets. All tests demonstrate that the residue order along protein sequences plays an important role in the recognition of protein structural classes, especially for alpha/beta proteins and alpha+beta proteins. In addition, the tests also show that the new method is simple and efficient.
Affiliation(s)
- Lixia Jin
- Institute of Computational Biology and Bioinformatics, Dalian University of Technology, 116025, Dalian, People's Republic of China.
|
89
|
Kumarevel TS, Gromiha MM, Selvaraj S, Gayatri K, Kumar PKR. Influence of medium- and long-range interactions in different folding types of globular proteins. Biophys Chem 2002; 99:189-98. [PMID: 12377369 DOI: 10.1016/s0301-4622(02)00183-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Indexed: 10/27/2022]
Abstract
Recognition of protein fold from amino acid sequence is a challenging task. The structure and stability of proteins from different folds are mainly dictated by inter-residue interactions. In our earlier work, we have successfully used the medium- and long-range contacts for predicting protein folding rates, discriminating globular and membrane proteins and distinguishing protein structural classes. In this work, we analyze the role of inter-residue interactions in commonly occurring folds of globular proteins in order to understand their folding mechanisms. In the medium-range contacts, the globin fold and four-helical bundle proteins have more contacts than the DNA-RNA fold, although they all belong to the all-alpha class. In long-range contacts, only the ribonuclease fold prefers the 4-10 range, whereas the other folding types in the alpha/beta class prefer the 21-30 range. Further, the preferred residues and residue pairs influenced by these different folds are discussed. The information about the preference of medium- and long-range contacts exhibited by the 20 amino acid residues can be effectively used to predict the folding type of each protein.
Affiliation(s)
- T S Kumarevel
- National Institute of Advanced Industrial Science and Technology (AIST), Institute of Molecular and Cell Biology, Functional Nucleic Acids Group, Tsukuba Central 6, 1-1 Higashi, Tsukuba Science City, Ibaraki, Japan.
|
90
|
Luo RY, Feng ZP, Liu JK. Prediction of protein structural class by amino acid and polypeptide composition. Eur J Biochem 2002; 269:4219-25. [PMID: 12199700 DOI: 10.1046/j.1432-1033.2002.03115.x] [Citation(s) in RCA: 110] [Impact Index Per Article: 5.0] [Indexed: 11/20/2022]
Abstract
A new approach to predicting structural classes of protein domain sequences is presented in this paper. Besides the amino acid composition, the compositions of several dipeptides, tripeptides, tetrapeptides, pentapeptides and hexapeptides are taken into account based on stepwise discriminant analysis. The result of the jackknife test shows that this new approach can lead to higher predictive sensitivity and specificity for reduced sequence similarity datasets. Considering the dataset PDB40-B constructed by Brenner and colleagues, 75.2% of protein domain sequences are correctly assigned in the jackknife test for the four structural classes: all-alpha, all-beta, alpha/beta and alpha + beta, which is improved by 19.4% in the jackknife test and 25.5% in the resubstitution test, in contrast with the component-coupled algorithm using amino acid composition alone (AAC approach) for the same dataset. In the cross-validation test with dataset PDB40-J constructed by Park and colleagues, more than 80% predictive accuracy is obtained. Furthermore, for the dataset constructed by Chou and Maggiora, accuracies of 100% and 99.7% can be easily achieved in the resubstitution test and in the jackknife test, respectively, merely by taking the composition of dipeptides into account. Therefore, this new method provides an effective tool to extract valuable information from protein sequences, which can be used for the systematic analysis of small or medium size protein sequences. The computer programs used in this paper are available on request.
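The polypeptide composition features mentioned here can be sketched as relative frequencies of contiguous k-residue fragments (k=2 for dipeptides, up to k=6 for hexapeptides). The function name `kpeptide_composition` is hypothetical, and the stepwise discriminant analysis that selects informative fragments is outside the scope of this sketch.

```python
from collections import Counter

def kpeptide_composition(seq, k=2):
    """Relative frequency of each contiguous k-residue fragment found in seq."""
    total = len(seq) - k + 1  # number of overlapping windows of length k
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {fragment: n / total for fragment, n in counts.items()}

# Toy sequence chosen for illustration only.
dipeptides = kpeptide_composition("AAAG", k=2)
```

Only fragments that actually occur are stored, which keeps the representation small even for hexapeptides, where the full space has 20^6 possible fragments.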
Affiliation(s)
- Rui-yan Luo
- Department of Mathematics, Tianjin University, Tianjin 300 072, China
|
91
|
Abstract
EVA is a web-based server that evaluates automatic structure prediction servers continuously and objectively. Since June 2000, EVA has collected more than 20,000 secondary structure predictions. The EVA sets sufficed to conclude that the field of secondary structure prediction has advanced again. Accuracy increased substantially in the 1990s through using evolutionary information taken from the divergence of proteins in the same structural family. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than 4% to its current height of around 76% of all residues predicted correctly in one of the three states: helix, strand, or other. The best current methods solved most of the problems raised at earlier CASP meetings: all good methods now get segments right and perform well on strands. Is the recent increase in accuracy significant enough to make predictions even more useful? We believe the answer is affirmative. What is the limit of prediction accuracy? We shall see. All data are available through the EVA web site at [cubic.bioc.columbia.edu/eva/]. The raw data for the results presented are available at [eva]/sec/bup_common/2001_02_22/.
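The accuracy figure quoted in this abstract, the percentage of residues predicted correctly in one of three states, can be computed as in the sketch below. It assumes per-residue state strings that use H for helix, E for strand and C for other; these letters and the function name `three_state_accuracy` are illustrative choices, not EVA's own interface.

```python
def three_state_accuracy(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("at least one residue is required")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

# Toy strings: the eighth residue is predicted as strand but observed as other.
accuracy = three_state_accuracy("HHHHEEEECC", "HHHHEEECCC")
```

With one mismatch in ten residues the toy example evaluates to 90.0. A per-residue score of this kind does not by itself show whether whole segments are placed correctly.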
Affiliation(s)
- B Rost
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA.
|
92
|
Abstract
The ultimate goal of structural genomics is to obtain the structure of each protein coded by each gene within a genome to determine gene function. Because of cost and time limitations, it remains impractical to solve the structure for every gene product experimentally. Up to a point, reasonably accurate three-dimensional structures can be deduced for proteins with homologous sequences by using comparative modeling. Beyond this, fold recognition or threading methods can be used for proteins showing little homology to any known fold, although this is relatively time-consuming and limited by the library of template folds currently available. Therefore, it is appropriate to develop methods that can increase our knowledge base, expanding our fold libraries by earmarking potentially "novel" folds for experimental structure determination. How can we sift through proteomic data rapidly and yet reliably identify novel folds as targets for structural genomics? We have analyzed a number of simple methods that discriminate between "novel" and "known" folds. We propose that simple alignments of secondary structure elements using predicted secondary structure could potentially be a more selective method than both a simple fold recognition method (GenTHREADER) and standard sequence alignment at finding novel folds when sequences show no detectable homology to proteins with known structures.
Affiliation(s)
- Liam J McGuffin
- Institute of Cancer Genetics and Pharmacogenomics, Department of Biological Sciences, Brunel University, Uxbridge, Middlesex, United Kingdom
|
93
|
Abstract
The paradox recently raised by Wang and Yuan (Proteins 2000;38:165-175) in protein structural class prediction is actually a misinterpretation of the data reported in the literature. The Bayes decision rule, which was deemed by Wang and Yuan to be the most powerful method for predicting protein structural classes based on the amino acid composition, and applied by these investigators to derive the upper limit of prediction rate for structural classes, is actually completely the same as the component-coupled algorithm proposed by previous investigators (Chou et al., Proteins 1998;31:97-103). Owing to lack of a complete or near-complete training data set, the upper limit rate thus derived by these investigators might be both invalid and misleading. Clarification of these points will further stimulate investigation of this interesting area.
Affiliation(s)
- Y D Cai
- Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences, Shanghai, People's Republic of China
|
94
|
Wang ZX. The prediction accuracy for protein structural class by the component-coupled method is around 60%. Proteins 2001; 43:339-40. [PMID: 11288185 DOI: 10.1002/prot.1046] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 11/05/2022]
Affiliation(s)
- Z X Wang
- National Laboratory of Biomacromolecules, Institute of Biophysics, Academia Sinica, Beijing, China
|
95
|
Abstract
Methods predicting protein secondary structure improved substantially in the 1990s through the use of evolutionary information taken from the divergence of proteins in the same structural family. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than four percentage points to its current height of around 76% of all residues predicted correctly in one of the three states, helix, strand, and other. The past year also brought successful new concepts to the field. These new methods may be particularly interesting in light of the improvements achieved through simple combining of existing methods. Divergent evolutionary profiles contain enough information not only to substantially improve prediction accuracy, but also to correctly predict long stretches of identical residues observed in alternative secondary structure states depending on nonlocal conditions. An example is a method automatically identifying structural switches and thus finding a remarkable connection between predicted secondary structure and aspects of function. Secondary structure predictions are increasingly becoming the work horse for numerous methods aimed at predicting protein structure and function. Is the recent increase in accuracy significant enough to make predictions even more useful? Because the recent improvement yields a better prediction of segments, and in particular of beta strands, I believe the answer is affirmative. What is the limit of prediction accuracy? We shall see.
Affiliation(s)
- B Rost
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, New York 10032, USA
|
96
|
Lin Z, Pan XM. Accurate prediction of protein secondary structural content. J Protein Chem 2001; 20:217-20. [PMID: 11565901 DOI: 10.1023/a:1010967008838] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.0] [Indexed: 11/12/2022]
Abstract
An improved multiple linear regression (MLR) method is proposed to predict a protein's secondary structural content based on its primary sequence. The amino acid composition, the autocorrelation function, and the interaction function of side-chain mass derived from the primary sequence are taken into account. The average absolute errors of prediction over 704 unrelated proteins with the jackknife test are 0.088, 0.081, and 0.059 with standard deviations 0.073, 0.066, and 0.055 for alpha-helix, beta-sheet, and coil, respectively. That the sum of predicted secondary structure content should be close to 1.0 was introduced as a criterion to evaluate whether the prediction is acceptable. While only the predictions with the sum of predicted secondary structure content between 0.99 and 1.01 are accepted (about 11% of all proteins), the absolute errors are 0.058 for alpha-helix, 0.054 for beta-sheet, and 0.045 for coil.
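The two evaluation ideas in this abstract, the average absolute error and the acceptance criterion on the sum of predicted contents, can be sketched as below. The function names are hypothetical and the example values are made up for illustration; the regression model itself is not reproduced.

```python
def mean_absolute_error(predicted, observed):
    """Average of |predicted - observed| over paired content fractions."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

def sum_is_acceptable(helix, sheet, coil, low=0.99, high=1.01):
    """True when the three predicted fractions add up to a value in [low, high]."""
    return low <= helix + sheet + coil <= high

# Made-up predictions: the first sums to about 1.00, the second to about 1.10.
accepted = sum_is_acceptable(0.40, 0.25, 0.35)
rejected = sum_is_acceptable(0.45, 0.30, 0.35)
```

The filter trades coverage for reliability: according to the abstract, only about 11% of proteins pass it, but the errors for those are lower.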
Affiliation(s)
- Z Lin
- National Laboratory of Biomacromolecules, Institute of Biophysics, Academia Sinica, Beijing, China
|
97
|
Abstract
It has been quite clear that the success rate for predicting protein structural class can be improved significantly by using algorithms that incorporate the coupling effect among different amino acid components of a protein. However, there is still a lot of confusion about the relationship between these advanced algorithms, such as the least Mahalanobis distance algorithm, the component-coupled algorithm, and the Bayes decision rule. In this communication, a simple, rigorous derivation is provided to prove that the Bayes decision rule introduced recently for protein structural class prediction is completely the same as the earlier component-coupled algorithm. Meanwhile, it is also very clear from the derived equations that the least Mahalanobis distance algorithm is an approximation of the component-coupled algorithm, also known as the covariant-discriminant algorithm introduced by Chou and Elrod in protein subcellular location prediction (Protein Engineering, 1999; 12:107-118). Clarification of the confusion will help in using these powerful algorithms effectively and in correctly interpreting the results obtained by them, and should thereby contribute to further development not only in the structural prediction area, but in some other relevant areas of protein science as well.
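The relationship stated in this abstract can be sketched in a few lines. The notation is assumed here rather than taken from the paper: $X$ is the composition vector of a query protein, and class $\xi$ has mean vector $\bar{X}^{\xi}$ and covariance matrix $C_{\xi}$. The sketch further assumes multivariate normal class-conditional densities and equal prior probabilities, neither of which the abstract states.

```latex
% Squared Mahalanobis distance of X from class \xi
D_{\mathrm{Mah}}^{2}(X, \bar{X}^{\xi})
  = (X - \bar{X}^{\xi})^{\mathsf{T}} C_{\xi}^{-1} (X - \bar{X}^{\xi})

% Gaussian log-likelihood of X under class \xi (d = dimension of X)
\ln p(X \mid \xi)
  = -\tfrac{1}{2}\left[ D_{\mathrm{Mah}}^{2}(X, \bar{X}^{\xi}) + \ln |C_{\xi}| \right]
    - \tfrac{d}{2}\ln(2\pi)

% With equal priors, the Bayes decision rule maximizes \ln p(X \mid \xi),
% which is the same as minimizing the discriminant
F(X, \bar{X}^{\xi}) = D_{\mathrm{Mah}}^{2}(X, \bar{X}^{\xi}) + \ln |C_{\xi}|,
\qquad
\hat{\xi} = \arg\min_{\xi} F(X, \bar{X}^{\xi})
```

Under these assumptions the Bayes rule and a discriminant of the form $F$ select the same class, and dropping the $\ln |C_{\xi}|$ term leaves the least Mahalanobis distance rule, which agrees with $F$ exactly only when all classes have the same covariance determinant.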
Affiliation(s)
- G P Zhou
- Department of Structural Biology, Burnham Institute, La Jolla, California, USA.
|