1
Otaki JM, Tsutsumi M, Gotoh T, Yamamoto H. Secondary structure characterization based on amino acid composition and availability in proteins. J Chem Inf Model 2010; 50:690-700. [PMID: 20210310 DOI: 10.1021/ci900452z] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Indexed: 11/30/2022]
Abstract
The importance of thorough analyses of the secondary structures in proteins as basic structural units cannot be overemphasized. Although recent computational methods have achieved reasonably high accuracy for predicting secondary structures from amino acid sequences, a simple and fundamental empirical approach to characterize the amino acid composition of secondary structures was performed mainly in the 1970s, with a small number of analyzed structures. To extend this classical approach using a large number of analyzed structures, here we characterized the amino acid sequences of secondary structures (12 154 alpha-helix units, 4592 3(10)-helix units, 16 787 beta-strand units, and 30 811 "other" units), using the representative three-dimensional protein structure records (1641 protein chains) from the Protein Data Bank. We first examined the length and the amino acid compositions of secondary structures, including rank order differences and assignment relationships among amino acids. These compositional results were largely, but not entirely, consistent with the previous studies. In addition, we examined the frequency of 400 amino acid doublets and 8000 triplets in secondary structures based on their relative counts, termed the availability. We identified not only some triplets that were specific to a certain secondary structure but also so-called zero-count triplets, which did not occur in a given secondary structure at all, even though they were probabilistically predicted to occur several times. Taken together, the present study revealed essential features of secondary structures and suggests potential applications in secondary structure prediction and the functional design of protein sequences.
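As a rough illustration of the zero-count idea described in this abstract, the following Python sketch counts triplets in a set of secondary-structure segments and flags triplets that never occur even though a simple independence model predicts at least one occurrence. The function names, the `min_expected` threshold, and the independence-based expectation are assumptions made for illustration; the abstract does not state the authors' exact definition of availability or of the expected counts.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_counts(segments, k):
    """Count every overlapping k-mer across all segments."""
    counts = Counter()
    for seg in segments:
        for i in range(len(seg) - k + 1):
            counts[seg[i:i + k]] += 1
    return counts


def expected_triplet_counts(segments):
    """Expected triplet counts if residues were independent (an assumption)."""
    residue_counts = kmer_counts(segments, 1)
    total_residues = sum(residue_counts[a] for a in AMINO_ACIDS)
    if total_residues == 0:
        raise ValueError("segments contain no standard residues")
    total_triplets = sum(max(len(s) - 2, 0) for s in segments)
    freq = {a: residue_counts[a] / total_residues for a in AMINO_ACIDS}
    return {
        "".join(t): total_triplets * freq[t[0]] * freq[t[1]] * freq[t[2]]
        for t in product(AMINO_ACIDS, repeat=3)
    }


def zero_count_triplets(segments, min_expected=1.0):
    """Triplets never observed although expected at least min_expected times."""
    observed = kmer_counts(segments, 3)
    expected = expected_triplet_counts(segments)
    return sorted(
        t for t, e in expected.items()
        if observed[t] == 0 and e >= min_expected
    )
```

Here `segments` would be the residue strings of one secondary-structure class (for example, all alpha-helix units), so the function is run once per class.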
Affiliation(s)
- Joji M Otaki
- The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology, and Marine Science, University of the Ryukyus, Nishihara, Okinawa 903-0213, Japan.
2
Liang G, Zhao W. Using factor analysis scales of generalized amino acid information for prediction and characteristic analysis of β-turns in proteins based on a support vector machine model. Sci China Chem 2010. [DOI: 10.1007/s11426-010-0165-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
3
Homaeian L, Kurgan LA, Ruan J, Cios KJ, Chen K. Prediction of protein secondary structure content for the twilight zone sequences. Proteins 2007; 69:486-98. [PMID: 17623861 DOI: 10.1002/prot.21527] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Indexed: 11/06/2022]
Abstract
Secondary protein structure carries information about local structural arrangements, which include three major conformations: alpha-helices, beta-strands, and coils. A significant majority of successful methods for predicting secondary structure are based on multiple sequence alignment. However, multiple alignment fails to provide accurate results when a sequence comes from the twilight zone, that is, when it is characterized by low (<30%) homology. To this end, we propose a novel method for prediction of secondary structure content through comprehensive sequence representation, called PSSC-core. The method uses a multiple linear regression model and introduces a comprehensive feature-based sequence representation to predict the amount of helices and strands for sequences from the twilight zone. The PSSC-core method was tested and compared with two other state-of-the-art prediction methods on a set of 2187 twilight zone sequences. The results indicate that our method provides better predictions for both helix and strand content. The PSSC-core is shown to provide statistically significantly better results when compared with the competing methods, reducing the prediction error by 5-7% for helix and 7-9% for strand content predictions. The proposed feature-based sequence representation uses a comprehensive set of physicochemical properties that are custom-designed for each of the helix and strand content predictions. It includes composition and composition moment vectors, frequency of tetra-peptides associated with helical and strand conformations, various property-based groups like exchange groups, chemical groups of the side chains and hydrophobic group, auto-correlations based on hydrophobicity, side-chain masses, hydropathy, and conformational patterns for beta-sheets. The PSSC-core method provides an alternative for predicting the secondary structure content that can be used to validate and constrain results of other structure prediction methods. At the same time, it also provides useful insight into the design of successful protein sequence representations that can be used in developing new methods related to prediction of different aspects of the secondary protein structure.
Affiliation(s)
- Leila Homaeian
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada
4
Huang JT, Cheng JP. Prediction of folding transition-state position (βT) of small, two-state proteins from local secondary structure content. Proteins 2007; 68:218-22. [PMID: 17469192 DOI: 10.1002/prot.21411] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/08/2022]
Abstract
Folding kinetics of proteins is governed by the free energy and position of transition states. But attempts to predict the position of folding transition state on reaction pathway from protein structure have been met with only limited success, unlike the folding-rate prediction. Here, we find that the folding transition-state position is related to the secondary structure content of native two-state proteins. We present a simple method for predicting the transition-state position from their alpha-helix, turn and polyproline secondary structures. The method achieves 81% correlation with experiment over 24 small, two-state proteins, suggesting that the local secondary structure content, especially for content of alpha-helix, is a determinant of the solvent accessibility of the transition state ensemble and size of folding nucleus.
Affiliation(s)
- Ji-Tao Huang
- College of Chemistry and State Key Laboratory of Elemento-Organic Chemistry, Nankai University, Tianjin 300071, China
5
Ruan J, Wang K, Yang J, Kurgan LA, Cios K. Highly accurate and consistent method for prediction of helix and strand content from primary protein sequences. Artif Intell Med 2005; 35:19-35. [PMID: 16081261 DOI: 10.1016/j.artmed.2005.02.006] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 11/14/2004] [Revised: 01/22/2005] [Accepted: 02/22/2005] [Indexed: 11/25/2022]
Abstract
OBJECTIVE One of the interesting computational topics in bioinformatics is the prediction of the secondary structure of proteins. Over 30 years of research has been devoted to the topic, but we are still far away from having reliable prediction methods. A critical piece of information for accurate prediction of secondary structure is the helix and strand content of a given protein sequence. The ability to accurately predict the content of those two secondary structures has good potential to improve the accuracy of secondary structure prediction. Most of the existing methods use the composition vector to predict the content. Their underlying assumption is that the vector can be used to provide a functional mapping between primary sequence and helix/strand content. While this is true for small sets of proteins, we show that for larger protein sets such mappings are inconsistent, i.e., the same composition vectors correspond to different contents. To this end, we propose a method for prediction of helix/strand content from primary protein sequences that is fundamentally different from currently available methods.
METHODS AND MATERIAL Our method is accurate and uses a novel approach to obtain information from primary sequence based on a composition moment vector, which is a measure that includes information about both the composition of a given primary sequence and the position of amino acids in the sequence. In contrast to the composition vector, we show that it provides a functional mapping between primary sequence and the helix/strand content.
RESULTS A set of benchmarks involving a large protein dataset consisting of over 11,000 protein sequences from the Protein Data Bank was performed to validate the method. Prediction done by a neural network had an average accuracy of 91.5% for the helix and 94.5% for the strand contents. We also show that using the new measure results in about a 40% reduction of error rates when compared with the composition vector results.
CONCLUSIONS The developed method has much better accuracy when compared with other existing methods, as shown on a large body of proteins, in contrast to other reported results that often target small sets of specific protein types, such as globular proteins.
Affiliation(s)
- Jishou Ruan
- College of Mathematics and LPMC, Nankai University, Tianjin 300071, PR China
6
Prediction of Secondary Protein Structure Content from Primary Sequence Alone – A Feature Selection Based Approach. ACTA ACUST UNITED AC 2005. [DOI: 10.1007/11510888_33] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2]
7
Chou KC, Cai YD. Prediction and classification of protein subcellular location-sequence-order effect and pseudo amino acid composition. J Cell Biochem 2003; 90:1250-60. [PMID: 14635197 DOI: 10.1002/jcb.10719] [Citation(s) in RCA: 136] [Impact Index Per Article: 6.5] [Indexed: 11/08/2022]
Abstract
Given a protein sequence, how to identify its subcellular location? With the rapid increase in newly found protein sequences entering into databanks, the problem has become more and more important because the function of a protein is closely correlated with its localization. To practically deal with the challenge, a dataset has been established that allows the identification performed among the following 14 subcellular locations: (1) cell wall, (2) centriole, (3) chloroplast, (4) cytoplasm, (5) cytoskeleton, (6) endoplasmic reticulum, (7) extracellular, (8) Golgi apparatus, (9) lysosome, (10) mitochondria, (11) nucleus, (12) peroxisome, (13) plasma membrane, and (14) vacuole. Compared with the datasets constructed by the previous investigators, the current one represents the largest in the scope of localizations covered, and hence many proteins which were totally out of picture in the previous treatments, can now be investigated. Meanwhile, to enhance the potential and flexibility in taking into account the sequence-order effect, the series-mode pseudo-amino-acid-composition has been introduced as a representation for a protein. High success rates are obtained by the re-substitution test, jackknife test, and independent dataset test, respectively. It is anticipated that the current automated method can be developed to a high throughput tool for practical usage in both basic research and pharmaceutical industry.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, CA 92130, USA
8
Abstract
In the protein universe, many proteins are composed of two or more polypeptide chains, generally referred to as subunits, that associate through noncovalent interactions and, occasionally, disulfide bonds. With the number of protein sequences entering into data banks rapidly increasing, we are confronted with a challenge: how to develop an automated method to identify the quaternary attribute for a new polypeptide chain (i.e., whether it is formed just as a monomer, or as a dimer, trimer, or any other oligomer). This is important, because the functions of proteins are closely related to their quaternary attribute. For example, some critical ligands only bind to dimers but not to monomers; some marvelous allosteric transitions only occur in tetramers but not other oligomers; and some ion channels are formed by tetramers, whereas others are formed by pentamers. To explore this problem, we adopted the pseudo amino acid composition originally proposed for improving the prediction of protein subcellular location (Chou, Proteins, 2001; 43:246-255). The advantage of using the pseudo amino acid composition to represent a protein is that it has paved a way that can take into account a considerable amount of sequence-order effects to significantly improve prediction quality. Results obtained by resubstitution, jack-knife, and independent data set tests, have indicated that the current approach might be quite promising in dealing with such an extremely complicated and difficult problem.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Kalamazoo, Michigan 49009, USA.
9
Cai YD, Liu XJ, Xu XB, Chou KC. Artificial neural network method for predicting protein secondary structure content. Comput Chem 2002; 26:347-50. [PMID: 12139417 DOI: 10.1016/s0097-8485(01)00125-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 11/17/2022]
Abstract
In this paper, the neural network method was applied to predict the content of protein secondary structure elements based on the 'pair-coupled amino acid composition', in which the sequence coupling effects are explicitly included through a series of conditional probability elements. The prediction was examined by a self-consistency test and an independent-dataset test. Both indicated that good results were obtained when using the neural network method to predict the contents of alpha-helix, beta-sheet, parallel beta-sheet strand, antiparallel beta-sheet strand, beta-bridge, 3(10)-helix, pi-helix, H-bonded turn, bend, and random coil.
Affiliation(s)
- Yu-Dong Cai
- Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences.
10
Lin Z, Pan XM. Accurate prediction of protein secondary structural content. J Protein Chem 2001; 20:217-20. [PMID: 11565901 DOI: 10.1023/a:1010967008838] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.0] [Indexed: 11/12/2022]
Abstract
An improved multiple linear regression (MLR) method is proposed to predict a protein's secondary structural content based on its primary sequence. The amino acid composition, the autocorrelation function, and the interaction function of side-chain mass derived from the primary sequence are taken into account. The average absolute errors of prediction over 704 unrelated proteins with the jackknife test are 0.088, 0.081, and 0.059 with standard deviations 0.073, 0.066, and 0.055 for alpha-helix, beta-sheet, and coil, respectively. That the sum of predicted secondary structure content should be close to 1.0 was introduced as a criterion to evaluate whether the prediction is acceptable. While only the predictions with the sum of predicted secondary structure content between 0.99 and 1.01 are accepted (about 11% of all proteins), the absolute errors are 0.058 for alpha-helix, 0.054 for beta-sheet, and 0.045 for coil.
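The acceptance criterion in this abstract can be written as a short filter. The sketch below is only an illustration: the function name and the `tolerance` parameter are hypothetical, and the three content values are assumed to come from whatever regression model produced the prediction.

```python
def accept_prediction(helix, sheet, coil, tolerance=0.01):
    """Accept a prediction only if the three contents sum to about 1.0.

    With tolerance=0.01 this corresponds to the 0.99-1.01 window
    quoted in the abstract; the parameter name is an assumption.
    """
    total = helix + sheet + coil
    return abs(total - 1.0) <= tolerance
```

For example, `accept_prediction(0.35, 0.40, 0.25)` accepts the prediction, whereas a prediction summing to 1.10 is rejected. According to the abstract, only about 11% of proteins pass this filter, so it trades coverage for lower error.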
Affiliation(s)
- Z Lin
- National Laboratory of Biomacromolecules, Institute of Biophysics, Academia Sinica, Beijing, China
11
Zhang Z, Sun ZR, Zhang CT. A new approach to predict the helix/strand content of globular proteins. J Theor Biol 2001; 208:65-78. [PMID: 11162053 DOI: 10.1006/jtbi.2000.2201] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.2] [Indexed: 11/22/2022]
Abstract
An improved multiple linear regression method has been proposed to predict the content of alpha-helix and beta-strand of a globular protein based on its primary sequence and structural class. The amino acid composition and the auto-correlation functions derived from the hydrophobicity profile of the primary sequence have been taken into account. However, only the compositions of a subset of the amino acids and a subset of the auto-correlation functions are selected as the regression terms, namely those that lead to the least prediction error. The resubstitution test shows that the average absolute errors are 0.052 and 0.047 with the standard deviations 0.050 and 0.047 for the prediction of helix/strand content, respectively. A rigorous cross-validation test, the jackknife test, shows that the average absolute errors are 0.058 and 0.053 with the standard deviations 0.057 and 0.053 for the prediction of helix/strand content, respectively. Both tests indicate the self-consistency and the extrapolating effectiveness of the new method. The high prediction accuracy means that the method is suitable for practical applications.
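To make the notion of auto-correlation functions of a hydrophobicity profile concrete, here is a minimal Python sketch. It assumes the Kyte-Doolittle hydropathy scale and a commonly used lag-n average of products; the abstract does not say which scale or which exact normalization the authors used, so both choices, along with the function names, are assumptions.

```python
# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def autocorrelation(sequence, lag, scale=KYTE_DOOLITTLE):
    """Mean of h[i] * h[i + lag] over the hydrophobicity profile h."""
    profile = [scale[res] for res in sequence]
    n = len(profile)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(sequence)")
    products = (profile[i] * profile[i + lag] for i in range(n - lag))
    return sum(products) / (n - lag)


def autocorrelation_features(sequence, max_lag, scale=KYTE_DOOLITTLE):
    """Auto-correlation values for lags 1..max_lag, usable as regression terms."""
    return [autocorrelation(sequence, lag, scale) for lag in range(1, max_lag + 1)]
```

In a regression setting such as the one described, these values would be combined with selected amino acid compositions as input terms; the selection step itself is not shown here.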
Affiliation(s)
- Z Zhang
- Chemical Engineering Research Center, Tianjin University, Tianjin 300072, China
12
Chou KC. Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem Biophys Res Commun 2000; 278:477-83. [PMID: 11097861 DOI: 10.1006/bbrc.2000.3815] [Citation(s) in RCA: 213] [Impact Index Per Article: 8.9] [Indexed: 11/22/2022]
Abstract
How to incorporate the sequence order effect is a key and logical step for improving the prediction quality of protein subcellular location, but meanwhile it is a very difficult problem as well. This is because the number of possible sequence order patterns in proteins is extremely large, which has posed a formidable barrier to construct an effective training data set for statistical treatment based on the current knowledge. That is why most of the existing prediction algorithms are operated based on the amino-acid composition alone. In this paper, based on the physicochemical distance between amino acids, a set of sequence-order-coupling numbers was introduced to reflect the sequence order effect, or in a rigorous term, the quasi-sequence-order effect. Furthermore, the covariant discriminant algorithm by Chou and Elrod (Protein Eng. 12, 107-118, 1999) developed recently was augmented to allow the prediction performed by using the input of both the sequence-order-coupling numbers and amino-acid composition. A remarkable improvement was observed in the prediction quality using the augmented covariant discriminant algorithm. The approach described here represents one promising step forward in the efforts of incorporating sequence order effect in protein subcellular location prediction. It is anticipated that the current approach may also have a series of impacts on the prediction of other protein features by statistical approaches.
Affiliation(s)
- K C Chou
- Computer-Aided Drug Discovery, Pharmacia, Kalamazoo, Michigan 49007-4940, USA
13
Abstract
A tight turn in protein structure is defined as a site where (i) a polypeptide chain reverses its overall direction, i.e., leads the chain to fold back on itself by nearly 180 degrees, and (ii) the amino acid residues directly involved in forming the turn are no more than six. Tight turns are generally categorized as delta-turn, gamma-turn, beta-turn, alpha-turn, and pi-turn, which are formed by two-, three-, four-, five-, and six-amino-acid residues, respectively. According to the folding mode, each of such tight turns can be further classified into several different types. Tight turns play an important role in globular proteins from both the structural and functional points of view. In view of this, various efforts have been made to predict tight turns and their types. This Review summarizes the development in this area, with an emphasis focused on the most recent work concerned that is featured by the sequence-coupled model. Meanwhile, the future challenge in this area has also been briefly addressed.
Affiliation(s)
- K C Chou
- Computer-Aided Drug Discovery, Pharmacia & Upjohn, Kalamazoo, Michigan, 49007-4940, USA
14
Bu WS, Feng ZP, Zhang Z, Zhang CT. Prediction of protein (domain) structural classes based on amino-acid index. Eur J Biochem 1999; 266:1043-9. [PMID: 10583400 DOI: 10.1046/j.1432-1327.1999.00947.x] [Citation(s) in RCA: 44] [Impact Index Per Article: 1.8] [Indexed: 11/20/2022]
Abstract
A protein (domain) is usually classified into one of the following four structural classes: all-alpha, all-beta, alpha/beta and alpha + beta. In this paper, a new formulation is proposed to predict the structural class of a protein (domain) from its primary sequence. Instead of the amino-acid composition used widely in the previous structural class prediction work, the auto-correlation functions based on the profile of amino-acid index along the primary sequence of the query protein (domain) are used for the structural class prediction. Consequently, the overall predictive accuracy is remarkably improved. For the same training database consisting of 359 proteins (domains) and the same component-coupled algorithm [Chou, K.C. & Maggiora, G.M. (1998) Protein Eng. 11, 523-538], the overall predictive accuracy of the new method for the jackknife test is 5-7% higher than the accuracy based only on the amino-acid composition. The overall predictive accuracy finally obtained for the jackknife test is as high as 90.5%, implying that a significant improvement has been achieved by making full use of the information contained in the primary sequence for the class prediction. This improvement depends on the size of the training database, the auto-correlation functions selected and the amino-acid index used. We have found that the amino-acid index proposed by Oobatake and Ooi, i.e. the average nonbonded energy per residue, leads to the optimal predictive result in the case for the database sets studied in this paper. This study may be considered as an alternative step towards making the structural class prediction more practical.
Affiliation(s)
- W S Bu
- Department of Physics, Tianjin University, China
15
Abstract
All existing algorithms for predicting the content of protein secondary structure elements have been based on the conventional amino-acid-composition, where no sequence coupling effects are taken into account. In this article, an algorithm was developed for predicting the content of protein secondary structure elements that was based on a new amino-acid-composition, in which the sequence coupling effects are explicitly included through a series of conditional probability elements. The prediction was examined by a self-consistency test and an independent dataset test. Both indicated a remarkable improvement obtained when using the current algorithm to predict the contents of alpha-helix, beta-sheet, beta-bridge, 3(10)-helix, pi-helix, H-bonded turn, bend and random coil. Examples of the improved accuracy by introducing the new amino-acid-composition, as well as its impact on the study of protein structural class and biological function, are discussed.
Affiliation(s)
- W Liu
- Computer-Aided Drug Discovery, Pharmacia and Upjohn, Kalamazoo, MI 49007-4940, USA
16
Abstract
The three-dimensional structure of a protein is uniquely dictated by its primary sequence. However, owing to the very high degenerative nature of the sequence-structure relationship, proteins are generally folded into one of only a few structural classes that are closely correlated with the amino-acid composition. This suggests that the interaction among the components of amino acid composition may play a considerable role in determining the structural class of a protein. To quantitatively test such a hypothesis at a deeper level, three potential functions, U((0)), U((1)), and U((2)), were formulated that respectively represent the 0th-order, 1st-order, and 2nd-order approximations for the interaction among the components of the amino acid composition in a protein. It was observed that the correct rates in recognizing protein structural classes by U((2)) are significantly higher than those by U((0)) and U((1)), indicating that an algorithm that can more completely incorporate the interaction contributions will yield better recognition quality, and hence further demonstrate that the interaction among the components of amino acid composition is an important driving force in determining the structural class of a protein during the sequence folding process.
Affiliation(s)
- K C Chou
- Computer-Aided Drug Discovery, Pharmacia and Upjohn, Kalamazoo, Michigan, 49007-4940, USA
17
Chou KC. Using pair-coupled amino acid composition to predict protein secondary structure content. J Protein Chem 1999; 18:473-80. [PMID: 10449044 DOI: 10.1023/a:1020696810938] [Citation(s) in RCA: 64] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022]
Abstract
The pair-coupled amino acid composition is introduced to predict the secondary structure contents of a protein. Compared with the existing methods all based on singlewise amino acid composition as defined in a 20D (dimensional) space, this represents a step forward to the consideration of the sequence coupling effect. The test results indicate that the introduction of the pair-coupled amino acid composition can significantly improve the prediction quality. It is anticipated that the concept of the pair-coupled amino acid composition can be used to simplify the formulation of sequence coupling (or sequence order) effects and to study many other features of proteins as well.
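One plausible reading of a pair-coupled composition is the usual 20 single-residue frequencies supplemented by 400 conditional probabilities of a residue given its predecessor. The Python sketch below follows that reading; it is an assumption for illustration, not the formulation from the paper, and the function name and return layout are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def pair_coupled_composition(sequence):
    """Return (single-residue frequencies, conditional pair probabilities).

    conditional["AB"] estimates P(next residue is B | current residue is A)
    from adjacent pairs in the sequence; it is 0.0 when A never appears
    in a leading position.
    """
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    single = Counter(sequence)
    pairs = Counter(sequence[i:i + 2] for i in range(n - 1))
    leading = Counter(sequence[:-1])  # residues that have a successor
    composition = {a: single[a] / n for a in AMINO_ACIDS}
    conditional = {
        a + b: (pairs[a + b] / leading[a] if leading[a] else 0.0)
        for a in AMINO_ACIDS
        for b in AMINO_ACIDS
    }
    return composition, conditional
```

Compared with the 20-dimensional composition alone, the 400 conditional terms retain some nearest-neighbor order information, which is the sequence coupling effect the abstract refers to.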
Affiliation(s)
- K C Chou
- Computer-Aided Drug Discovery, Pharmacia & Upjohn, Kalamazoo, Michigan 49007-4940, USA
18
Abstract
Tight turns play an important role in globular proteins from both the structural and functional points of view. Of tight turns, beta-turns and gamma-turns have been extensively studied, but alpha-turns were little investigated. Recently, a systematic search for alpha-turns was conducted by V. Pavone et al. [(1996) Biopolymers, Vol. 38, pp. 705-721] from 190 proteins (221 protein chains). They found 356 alpha-turns that were classified into nine different types according to their backbone trajectory features. In view of this new discovery, a sequence-coupled model based on Markov chain theory is proposed for predicting the alpha-turn types in proteins. The high rates of correct prediction by resubstitution test and jackknife test imply that the formation of different alpha-turn types is evidently correlated with the sequence of a pentapeptide, and hence can be approximately predicted based on the sequence information of the pentapeptide alone, although the role of its interaction with the other part of a protein cannot be completely ignored. The algorithm presented here can also be used to conduct the prediction in which a distinction between alpha-turns and non-alpha-turns is also required.
Affiliation(s)
- K C Chou
- Computer-Aided Drug Discovery, Pharmacia & Upjohn, Kalamazoo, MI 49007-4940, USA