151
Qiu J, Sheffler W, Baker D, Noble WS. Ranking predicted protein structures with support vector regression. Proteins 2007; 71:1175-82. [PMID: 18004754 DOI: 10.1002/prot.21809] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.6] [Indexed: 11/05/2022]
Affiliation(s)
- Jian Qiu
- Department of Genome Sciences, University of Washington, Seattle, Washington, USA
152
Chen J, Chaudhari N. Cascaded bidirectional recurrent neural networks for protein secondary structure prediction. IEEE/ACM Trans Comput Biol Bioinform 2007; 4:572-582. [PMID: 17975269 DOI: 10.1109/tcbb.2007.1055] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 05/25/2023]
Abstract
Protein secondary structure (PSS) prediction is an important topic in bioinformatics. Our study on a large set of non-homologous proteins shows that long-range interactions commonly exist and negatively affect PSS prediction. We also reveal strong correlations between secondary structure (SS) elements. In order to take into account the long-range interactions and SS-SS correlations, we propose a novel prediction system based on a cascaded bidirectional recurrent neural network (BRNN). We compare the cascaded BRNN against two other BRNN architectures, namely the original BRNN architecture used for speech recognition and Pollastri's BRNN, which was proposed for PSS prediction. Our cascaded BRNN achieves an overall three-state accuracy Q3 of 74.38% and reaches a high Segment OVerlap (SOV) of 66.0455. It outperforms the original BRNN and Pollastri's BRNN in both Q3 and SOV. Specifically, it improves the SOV score by 4-6%.
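For readers unfamiliar with the Q3 measure quoted in this abstract, the following is a minimal sketch of how a three-state per-residue accuracy is usually computed: the percentage of residues whose predicted state (helix H, strand E, coil C) matches the observed state. The function name `q3_accuracy` and the example strings are hypothetical and are not taken from the paper.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose predicted state matches the observed one.

    Both arguments are equal-length strings over a three-letter alphabet,
    for example H (helix), E (strand) and C (coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("at least one residue is required")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 10-residue example: 8 of 10 states agree.
observed = "HHHHCCEEEE"
predicted = "HHHCCCEEEC"
score = q3_accuracy(predicted, observed)
print(f"Q3 = {score:.1f}%")  # prints "Q3 = 80.0%"
```

Q3 counts residues independently, which is why segment-level measures such as SOV are reported alongside it.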
153
Xiong H, Zhang Y, Chen XW. Data-dependent kernel machines for microarray data classification. IEEE/ACM Trans Comput Biol Bioinform 2007; 4:583-595. [PMID: 17975270 DOI: 10.1109/tcbb.2007.1048] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 05/25/2023]
Abstract
One important application of gene expression analysis is to classify tissue samples according to their gene expression levels. Gene expression data are typically characterized by high dimensionality and small sample size, which makes the classification task quite challenging. In this paper, we present a data-dependent kernel for microarray data classification. This kernel function is engineered so that the class separability of the training data is maximized. A bootstrapping-based resampling scheme is introduced to reduce the possible training bias. The effectiveness of this adaptive kernel for microarray data classification is illustrated with a k-Nearest Neighbor (KNN) classifier. Our experimental study shows that the data-dependent kernel leads to a significant improvement in the accuracy of KNN classifiers. Furthermore, this kernel-based KNN scheme has been demonstrated to be competitive with, if not better than, more sophisticated classifiers such as Support Vector Machines (SVMs) and Uncorrelated Linear Discriminant Analysis (ULDA) for classifying gene expression data.
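As a rough illustration of how a kernel can drive a KNN classifier, the sketch below ranks neighbours by the squared distance induced by a kernel in feature space, d²(x, y) = k(x, x) - 2k(x, y) + k(y, y). It is not the paper's data-dependent kernel: a plain RBF kernel is swapped in as a stand-in, and all names (`rbf_kernel`, `kernel_knn_predict`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def rbf_kernel(x, y, gamma=0.5):
    """Standard RBF kernel, used here only as a stand-in for a learned kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_distance_sq(x, y, kernel):
    """Squared distance between x and y in the feature space induced by `kernel`."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)


def kernel_knn_predict(query, samples, labels, k=3, kernel=rbf_kernel):
    """Majority vote among the k training samples closest to `query` in feature space."""
    ranked = sorted(
        range(len(samples)),
        key=lambda i: kernel_distance_sq(query, samples[i], kernel),
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy two-gene expression profiles (hypothetical values).
samples = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (2.0, 2.1), (2.2, 1.9), (1.9, 2.0)]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
print(kernel_knn_predict((0.1, 0.1), samples, labels))
print(kernel_knn_predict((2.0, 2.0), samples, labels))
```

Any kernel function with the same call signature can be passed in place of `rbf_kernel`, which is where a kernel tuned for class separability would plug in.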
154
Zarei R, Arab S, Sadeghi M. A method for protein accessibility prediction based on residue types and conformational states. Comput Biol Chem 2007; 31:384-8. [PMID: 17888743 DOI: 10.1016/j.compbiolchem.2007.08.006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/18/2007] [Revised: 08/14/2007] [Accepted: 08/15/2007] [Indexed: 11/30/2022]
Abstract
Prediction of protein accessibility from sequence, like prediction of protein secondary structure, is an intermediate step toward predicting the structures and consequently the functions of proteins. Most currently used methods are based on single-residue prediction, either by statistical means or from evolutionary information, and predict the accessibility state of the central residue in a window. Taking advantage of the expansion of databases of proteins with known 3D structures, we extracted information on pairwise residue types and the conformational states of the pairs simultaneously. To resolve the ambiguity in state prediction that arises when the window slides by one residue, we used a dynamic programming algorithm to find the path with the maximum score. The three-state overall per-residue accuracy, Q(3), of this method in a jackknife test on a dataset of known proteins is more than 65%, which is an improvement on the results of methods based on evolutionary information.
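The maximum-score path search mentioned in this abstract can be sketched as a Viterbi-style dynamic program. The abstract does not give the scoring scheme, so the per-position scores, the pairwise scores between neighbouring states, the two-state alphabet (B for buried, E for exposed) and the function name `best_state_path` are all assumptions made for illustration.

```python
def best_state_path(site_scores, pair_scores):
    """Find the state sequence with the maximum total score.

    site_scores: list of dicts, one per residue, mapping state -> score.
    pair_scores: dict mapping (previous_state, current_state) -> score.
    Returns (path_string, total_score).
    """
    states = list(site_scores[0])
    # best[s] = best score of any path ending in state s at the current position
    best = {s: site_scores[0][s] for s in states}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for scores in site_scores[1:]:
        new_best, pointers = {}, {}
        for cur in states:
            prev = max(states, key=lambda p: best[p] + pair_scores[(p, cur)])
            new_best[cur] = best[prev] + pair_scores[(prev, cur)] + scores[cur]
            pointers[cur] = prev
        back.append(pointers)
        best = new_best
    # Trace back from the best final state.
    last = max(states, key=best.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return "".join(path), best[last]


# Hypothetical scores for a three-residue fragment.
site_scores = [{"B": 2, "E": 0}, {"B": 0, "E": 1}, {"B": 0, "E": 2}]
pair_scores = {("B", "B"): 1, ("B", "E"): 0, ("E", "B"): 0, ("E", "E"): 1}
print(best_state_path(site_scores, pair_scores))  # prints ('BEE', 6)
```

The run time grows linearly with sequence length and quadratically with the number of states, so the search stays cheap even though the number of candidate paths grows exponentially.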
155
Won KJ, Hamelryck T, Prügel-Bennett A, Krogh A. An evolutionary method for learning HMM structure: prediction of protein secondary structure. BMC Bioinformatics 2007; 8:357. [PMID: 17888163 PMCID: PMC2072961 DOI: 10.1186/1471-2105-8-357] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.9] [Received: 04/28/2007] [Accepted: 09/21/2007] [Indexed: 11/24/2022]
Abstract
Background: The prediction of the secondary structure of proteins is one of the most studied problems in bioinformatics. Despite their success in many problems of biological sequence analysis, Hidden Markov Models (HMMs) have not been used much for this problem, as the complexity of the task makes manual design of HMMs difficult. Therefore, we have developed a method for evolving the structure of HMMs automatically, using Genetic Algorithms (GAs). Results: In the GA procedure, populations of HMMs are assembled from biologically meaningful building blocks. Mutation and crossover operators were designed to explore the space of such Block-HMMs. After each step of the GA, the standard HMM estimation algorithm (the Baum-Welch algorithm) was used to update model parameters. The final HMM captures several features of protein sequence and structure, with its own HMM grammar. In contrast to neural network based predictors, the evolved HMM also calculates the probabilities associated with the predictions. We carefully examined the performance of the HMM based predictor under both the multiple- and single-sequence conditions. Conclusion: We have shown that the proposed evolutionary method can automatically design the topology of HMMs. The method reads the grammar of protein sequences and converts it into the grammar of an HMM. It improved on previously suggested evolutionary methods and increased the prediction quality. In particular, it shows good performance under the single-sequence condition and provides probabilistic information on the prediction result. The protein secondary structure predictor using HMMs (P.S.HMM) is available online at http://www.binf.ku.dk/~won/pshmm.htm. It runs under the single-sequence condition.
Affiliation(s)
- Kyoung-Jae Won
- Bioinformatics Centre, Department of Molecular Biology, University of Copenhagen, Ole Maaloes Vej 5, DK-2200 Copenhagen, Denmark
- School of Electronics and Computer Science, University of Southampton, SO17 1BJ, UK
- Department of Chemistry & Biochemistry, UCSD, 9500 Gilman Drive, Mail Code 0359, La Jolla, CA, 92093-0359, USA
- Thomas Hamelryck
- Bioinformatics Centre, Department of Molecular Biology, University of Copenhagen, Ole Maaloes Vej 5, DK-2200 Copenhagen, Denmark
- Adam Prügel-Bennett
- School of Electronics and Computer Science, University of Southampton, SO17 1BJ, UK
- Anders Krogh
- Bioinformatics Centre, Department of Molecular Biology, University of Copenhagen, Ole Maaloes Vej 5, DK-2200 Copenhagen, Denmark
156
Tong J, Liu S. Three-Dimensional Holographic Vector of Atomic Interaction Field Applied in QSAR of Anti-HIV HEPT Analogues. QSAR Comb Sci 2007. [DOI: 10.1002/qsar.200710076] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022]
157
Hu HJ, Holley J, He J, Harrison RW, Yang H, Tai PC, Pan Y. To be or not to be: predicting soluble SecAs as membrane proteins. IEEE Trans Nanobioscience 2007; 6:168-79. [PMID: 17695753 DOI: 10.1109/tnb.2007.897486] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Indexed: 11/07/2022]
Abstract
SecA is an important component of protein translocation in bacteria, and exists in soluble and membrane-integrated forms. Most membrane prediction programs predict SecA as being a soluble protein, with the exception of TMpred and TopPred. However, the membrane-associated segments predicted by TMpred and TopPred are inconsistent across bacterial species in spite of high sequence homology. In this paper we describe a new method for membrane protein prediction, PSSM_SVM, which provides consistent results for integral membrane domains of SecAs across bacterial species. This PSSM encoding scheme demonstrates the highest accuracy in terms of Q2 among the common prediction methods, and produces consistent results on blind test data. None of the previously described methods showed this kind of consistency when tested against the same blind test set. This scheme predicts traditional transmembrane segments and most of the soluble proteins accurately. The PSSM scheme applied to the membrane-associated protein SecA shows characteristic features. In the set of 223 known SecA sequences, the PSSM_SVM prediction scheme predicts membrane-embedded segments of eight to nine residues. This predicted region is part of a 12-residue helix in known X-ray crystal structures of SecAs. This information could be important for determining the structure of SecA proteins in the membrane, which have different conformational properties from other transmembrane proteins, as well as other soluble proteins that may similarly integrate into lipid bilayers.
Affiliation(s)
- Hae-Jin Hu
- Molecular Basis of Disease Program, Georgia State University, Atlanta, GA 30303, USA.
158
Guo JT, Jaromczyk JW, Xu Y. Analysis of chameleon sequences and their implications in biological processes. Proteins 2007; 67:548-58. [PMID: 17299764 DOI: 10.1002/prot.21285] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.5] [Indexed: 11/05/2022]
Abstract
Chameleon sequences have been implicated in amyloid-related diseases. Here we report an analysis of two types of chameleon sequences, chameleon-HS (Helix vs. Strand) and chameleon-HE (Helix vs. Sheet), based on known structures in the Protein Data Bank. Our survey shows that the longest chameleon-HS is eight residues while the longest chameleon-HE is seven residues. We have done a detailed analysis of the local and global environment that might contribute to the unique conformation of a chameleon sequence. We found that the existence of chameleon sequences does not present a problem for secondary structure prediction programs, including first-generation prediction programs, such as the Chou-Fasman algorithm, and third-generation prediction programs that utilize evolutionary information. We have also investigated the possible implication of chameleon sequences in structural conservation and functional diversity of alternatively spliced protein isoforms.
Affiliation(s)
- Jun-Tao Guo
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, University of Georgia, Athens, Georgia 30602, USA
159
Liang G, Li Z. Scores of generalized base properties for quantitative sequence-activity modelings for E. coli promoters based on support vector machine. J Mol Graph Model 2007; 26:269-81. [PMID: 17291800 DOI: 10.1016/j.jmgm.2006.12.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 06/30/2006] [Revised: 11/18/2006] [Accepted: 12/10/2006] [Indexed: 10/23/2022]
Abstract
A novel base sequence representation technique, namely SGBP (scores of generalized base properties), was derived from principal component analysis of a matrix of 1209 property parameters including 0D, 1D, 2D and 3D information for the five bases A, C, G, T and U. It was then employed to represent the sequence structures of E. coli promoters. Variables used as inputs to partial least squares (PLS) and support vector machine (SVM) models were selected by genetic algorithm-partial least squares. According to a D-optimal design, all samples were divided into a training set, which was used to develop quantitative sequence-activity modelings (QSAMs), and a test set, which was used to validate the predictive power of the resulting models. Investigation of the QSAM by PLS showed that the properties of the bases at positions -42, -34, -31, -33, -41, -46 and -29 may have more influence on promoter strength, which points toward the design of strong promoters. Parameters of the SVM were determined by response surface methodology. The results indicated that the simulative and predictive abilities of the QSAM by SVM for the internal and external samples were better than those of PLS. These results showed that SGBP is a useful structural representation methodology in QSAMs due to its many advantages, including plentiful structural information, easy manipulation, and high characterization competence. Moreover, the SGBP-GA-SVM route can be further applied to sequence design and activity prediction for DNA or RNA.
Affiliation(s)
- Guizhao Liang
- College of Bioengineering, Chongqing University, Chongqing 400030, PR China
160
Sivan S, Filo O, Siegelmann H. Application of expert networks for predicting proteins secondary structure. Biomol Eng 2007; 24:237-43. [PMID: 17236807 DOI: 10.1016/j.bioeng.2006.12.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 11/05/2006] [Revised: 12/05/2006] [Accepted: 12/06/2006] [Indexed: 02/02/2023]
Abstract
The present study utilizes expert neural networks for the prediction of protein secondary structure. We use three independent networks, one for each structure (alpha, beta and coil), as the first-level processing unit; the decision on the structure chosen for each residue is made by a second-level, post-processing unit, which uses the Chou and Fasman frequency values Falpha and Fbeta to strengthen or weaken the probability of the specific structure under investigation. The best case reached a prediction accuracy of 76%. Our method requires primitive computational means and a relatively small training set, while still being comparable to previous work. It is not meant to be an alternative to the determination of secondary structure by means of free energy minimization, integration of dynamic equations of motion or crystallography, which are expensive, time-consuming and complicated, but to provide additional constraints, which might be considered and incorporated into larger computing setups in order to reduce the initial search space for the above methods.
Affiliation(s)
- Sarit Sivan
- Department of Biomedical Engineering, Technion, Israel Institute of Technology, IIT, Haifa 32000, Israel.
161
Xu JR, Zhang JX, Han BC, Liang L, Ji ZL. CytoSVM: an advanced server for identification of cytokine-receptor interactions. Nucleic Acids Res 2007; 35:W538-42. [PMID: 17526528 PMCID: PMC1933174 DOI: 10.1093/nar/gkm254] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Abstract
The interactions between cytokines and their complementary receptors are the gateways to properly understanding a large variety of cytokine-specific cellular activities such as immunological responses and cell differentiation. To discover novel cytokine-receptor interactions, an advanced support vector machine (SVM) model, CytoSVM, was constructed in this study. This model was iteratively trained using 449 mammalian (non-rat) cytokine-receptor interactions and about 1 million virtually generated positive and negative vectors in an enriched way. A final independent evaluation on rat data gave a sensitivity of 97.4%, a specificity of 99.2% and a Matthews correlation coefficient (MCC) of 0.89. This performance is better than that of normal SVM-based models. Based on this optimized model, a web-based server was created that accepts a primary protein sequence and reports its probability of interacting with one or several cytokines. Moreover, this model was applied to identify putative cytokine-receptor pairs in the whole genomes of human and mouse. Excluding currently known cytokine-receptor interactions, a total of 1609 novel cytokine-receptor pairs were discovered in the human genome with probability ∼80% after further transmembrane analysis. These cover 220 novel receptors (excluding their isoforms) for 126 human cytokines. The screening results have been deposited in a database. Both the server and the database can be freely accessed at http://bioinf.xmu.edu.cn/software/cytosvm/cytosvm.php.
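The three evaluation measures quoted in this abstract follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not the paper's data.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true interactions recovered
    specificity = tn / (tn + fp)  # fraction of non-interactions rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical, strongly imbalanced test set: 100 positives, 1000 negatives.
sens, spec, mcc = binary_metrics(tp=90, fp=50, tn=950, fn=10)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} MCC={mcc:.3f}")
```

Because the MCC uses all four counts, it can stay well below 1 on imbalanced data even when sensitivity and specificity are both high, which is why it is commonly reported next to them.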
Affiliation(s)
- Jin-Rui Xu
- Key Laboratory for Cell Biology & Tumor Cell Engineering, the Ministry of Education of China, School of Life Sciences and The Key Laboratory for Chemical Biology of Fujian Province, Xiamen University, Xiamen 361005, FuJian Province, P R China
- Jing-Xian Zhang
- Key Laboratory for Cell Biology & Tumor Cell Engineering, the Ministry of Education of China, School of Life Sciences and The Key Laboratory for Chemical Biology of Fujian Province, Xiamen University, Xiamen 361005, FuJian Province, P R China
- Bu-Cong Han
- Key Laboratory for Cell Biology & Tumor Cell Engineering, the Ministry of Education of China, School of Life Sciences and The Key Laboratory for Chemical Biology of Fujian Province, Xiamen University, Xiamen 361005, FuJian Province, P R China
- Liang Liang
- Key Laboratory for Cell Biology & Tumor Cell Engineering, the Ministry of Education of China, School of Life Sciences and The Key Laboratory for Chemical Biology of Fujian Province, Xiamen University, Xiamen 361005, FuJian Province, P R China
- Zhi-Liang Ji
- Key Laboratory for Cell Biology & Tumor Cell Engineering, the Ministry of Education of China, School of Life Sciences and The Key Laboratory for Chemical Biology of Fujian Province, Xiamen University, Xiamen 361005, FuJian Province, P R China
- *To whom correspondence should be addressed. 86-0592-2182897; 86-0592-2181015
162
Gassend B, O'Donnell CW, Thies W, Lee A, van Dijk M, Devadas S. Learning biophysically-motivated parameters for alpha helix prediction. BMC Bioinformatics 2007; 8 Suppl 5:S3. [PMID: 17570862 PMCID: PMC1892091 DOI: 10.1186/1471-2105-8-s5-s3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022]
Abstract
Background: Our goal is to develop a state-of-the-art protein secondary structure predictor, with an intuitive and biophysically-motivated energy model. We treat structure prediction as an optimization problem, using parameterizable cost functions representing biological "pseudo-energies". Machine learning methods are applied to estimate the values of the parameters to correctly predict known protein structures. Results: Focusing on the prediction of alpha helices in proteins, we show that a model with 302 parameters can achieve a Qα value of 77.6% and an SOVα value of 73.4%. Such performance numbers are among the best for techniques that do not rely on external databases (such as multiple sequence alignments). Further, it is easier to extract biological significance from a model with so few parameters. Conclusion: The method presented shows promise for the prediction of protein secondary structure. Biophysically-motivated elementary free-energies can be learned using SVM techniques to construct an energy cost function whose predictive performance rivals the state of the art. This method is general and can be extended beyond the all-alpha case described here.
Affiliation(s)
- Blaise Gassend
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Charles W O'Donnell
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- William Thies
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Andrew Lee
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Marten van Dijk
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Srinivas Devadas
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
163
Abstract
Disulfide bonds play an important role in stabilizing protein structure and regulating protein function. Therefore, the ability to infer disulfide connectivity from protein sequences will be valuable in structural modeling and functional analysis. However, predicting disulfide connectivity directly from sequences presents a challenge to computational biologists due to the nonlocal nature of disulfide bonds, i.e., the close spatial proximity of the cysteine pair that forms the disulfide bond does not necessarily imply a short sequence separation of the cysteine residues. Recently, Chen and Hwang (Proteins 2005;61:507-512) treated this problem as a multiple class classification by defining each distinct disulfide pattern as a class. They used multiple support vector machines based on a variety of sequence features to predict the disulfide patterns. Their results compare favorably with those in the literature for a benchmark dataset sharing less than 30% sequence identity. However, since the number of disulfide patterns grows rapidly when the number of disulfide bonds increases, their method performs unsatisfactorily for proteins with a large number of disulfide bonds. In this work, we propose a novel method to represent disulfide connectivity in terms of cysteine pairs, instead of disulfide patterns. Since the number of bonding states of the cysteine pairs is independent of that of disulfide bonds, the problem of class explosion is avoided. The bonding states of the cysteine pairs are predicted using support vector machines together with genetic algorithm optimization for feature selection. The complete disulfide patterns are then determined from the connectivity matrices that are constructed from the predicted bonding states of the cysteine pairs. Our approach outperforms the current approaches in the literature.
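To make the "class explosion" argument concrete, the sketch below enumerates every complete disulfide pattern (perfect matching) for a set of cysteines and selects the one with the highest summed pair score. The abstract does not say how patterns are derived from the connectivity matrices, so the exhaustive search, the score values and the names `perfect_matchings` and `best_disulfide_pattern` are assumptions for illustration only.

```python
def perfect_matchings(items):
    """Yield every way to split `items` into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail


def best_disulfide_pattern(pair_scores, n_cys):
    """Pick the pattern whose cysteine pairs have the highest total score.

    pair_scores maps (i, j) with i < j to a predicted bonding score.
    """
    if n_cys % 2:
        raise ValueError("an even number of bonded cysteines is required")
    cysteines = list(range(n_cys))
    return max(
        perfect_matchings(cysteines),
        key=lambda pattern: sum(pair_scores[pair] for pair in pattern),
    )


# Hypothetical pair scores for four cysteines (two disulfide bonds).
pair_scores = {
    (0, 1): 0.2, (0, 2): 0.9, (0, 3): 0.1,
    (1, 2): 0.3, (1, 3): 0.8, (2, 3): 0.4,
}
print(best_disulfide_pattern(pair_scores, 4))  # prints [(0, 2), (1, 3)]

# Number of distinct patterns for 2, 3 and 4 bonds: 3, 15, 105.
print([sum(1 for _ in perfect_matchings(list(range(2 * b)))) for b in (2, 3, 4)])
```

The pattern count grows as (2B - 1)!! for B bonds, whereas each cysteine pair only ever has two bonding states, which is the contrast the abstract draws between pattern-level and pair-level classification.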
Affiliation(s)
- Chih-Hao Lu
- Institute of Bioinformatics, National Chiao Tung University, Hsinchu 30050, Taiwan
164
Holloway DT, Kon M, DeLisi C. Machine learning for regulatory analysis and transcription factor target prediction in yeast. Syst Synth Biol 2007; 1:25-46. [PMID: 19003435 PMCID: PMC2533145 DOI: 10.1007/s11693-006-9003-3] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 12/19/2022]
Abstract
High-throughput technologies, including array-based chromatin immunoprecipitation, have rapidly increased our knowledge of transcriptional maps, that is, the identity and location of regulatory binding sites within genomes. Still, the full identification of sites, even in lower eukaryotes, remains largely incomplete. In this paper we develop a supervised learning approach to site identification using support vector machines (SVMs) to combine 26 different data types. A comparison with the standard approach to site identification using position specific scoring matrices (PSSMs) for a set of 104 Saccharomyces cerevisiae regulators indicates that our SVM-based target classification is more sensitive (73 vs. 20%) when specificity and positive predictive value are the same. We have applied our SVM classifier for each transcriptional regulator to all promoters in the yeast genome to obtain thousands of new targets, which are currently being analyzed and refined to limit the risk of classifier over-fitting. For the purpose of illustration we discuss several results, including biochemical pathway predictions for Gcn4 and Rap1. For both transcription factors SVM predictions match well with the known biology of control mechanisms, and possible new roles for these factors are suggested, such as a function for Rap1 in regulating fermentative growth. We also examine the promoter melting temperature curves for the targets of YJR060W, and show that targets of this TF have potentially unique physical properties which distinguish them from other genes. The SVM output automatically provides the means to rank dataset features to identify important biological elements. We use this property to rank classifying k-mers, thereby reconstructing known binding sites for several TFs, and to rank expression experiments, determining the conditions under which Fhl1, the factor responsible for expression of ribosomal protein genes, is active.
We can see that targets of Fhl1 are differentially expressed in the chosen conditions as compared to the expression of average and negative set genes. SVM-based classifiers provide a robust framework for analysis of regulatory networks. Processing of classifier outputs can provide high quality predictions and biological insight into functions of particular transcription factors. Future work on this method will focus on increasing the accuracy and quality of predictions using feature reduction and clustering strategies. Since predictions have been made on only 104 TFs in yeast, new classifiers will be built for the remaining 100 factors which have available binding data.
Affiliation(s)
- Dustin T. Holloway
- Molecular Biology Cell Biology and Biochemistry, Boston University, Boston, MA 02215 USA
- Mark Kon
- Department of Mathematics and Statistics, Boston University, Boston, MA 02215 USA
- Bioinformatics and Systems Biology, Boston University, Boston, MA 02215 USA
- Charles DeLisi
- Bioinformatics and Systems Biology, Boston University, Boston, MA 02215 USA
165
Zhong W, Altun G, Tian X, Harrison R, Tai PC, Pan Y. Parallel protein secondary structure prediction based on neural networks. Conf Proc IEEE Eng Med Biol Soc 2007; 2004:2968-71. [PMID: 17270901 DOI: 10.1109/iembs.2004.1403842] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/13/2023]
Abstract
Protein secondary structure prediction has a fundamental influence on today's bioinformatics research. In this work, binary and tertiary classifiers for protein secondary structure prediction are implemented on the Denoeux belief neural network (DBNN) architecture. A hydrophobicity matrix, an orthogonal matrix, BLOSUM62 and PSSM (position specific scoring matrix) are tested separately as the encoding schemes for DBNN. The experimental results contribute to the design of new encoding schemes. A new binary classifier for Helix versus not Helix (~H) for DBNN produces a prediction accuracy of 87% when PSSM is used for the input profile. The performance of the DBNN binary classifier is comparable to the best other prediction methods. The good test results for binary classifiers open a new approach for protein structure prediction with neural networks. Because training the neural networks is time-consuming, Pthread and OpenMP are employed to parallelize DBNN on the hyperthreading-enabled Intel architecture. The speedup for 16 Pthreads is 4.9 and the speedup for 16 OpenMP threads is 4 on the 4-processor shared memory architecture. The speedups of both OpenMP and Pthread are superior to those reported in other research. With the new parallel training algorithm, thousands of amino acids can be processed in a reasonable amount of time. Our research also shows that hyperthreading technology for the Intel architecture is efficient for parallel biological algorithms.
Affiliation(s)
- Wei Zhong
- Dept. of Comput. Sci., Georgia State Univ., Atlanta, GA, USA
166
Bi R, Zhou Y, Lu F, Wang W. Predicting Gene Ontology functions based on support vector machines and statistical significance estimation. Neurocomputing 2007. [DOI: 10.1016/j.neucom.2006.10.006] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
167
Youn E, Peters B, Radivojac P, Mooney SD. Evaluation of features for catalytic residue prediction in novel folds. Protein Sci 2006. [PMID: 17189479 DOI: 10.1110/ps.062523907] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 09/29/2022]
Abstract
Structural genomics projects are determining the three-dimensional structure of proteins without full characterization of their function. A critical part of the annotation process involves appropriate knowledge representation and prediction of functionally important residue environments. We have developed a method to extract features from sequence, sequence alignments, three-dimensional structure, and structural environment conservation, and used support vector machines to annotate homologous and nonhomologous residue positions based on a specific training set of residue functions. In order to evaluate this pipeline for automated protein annotation, we applied it to the challenging problem of prediction of catalytic residues in enzymes. We also ranked the features based on their ability to discriminate catalytic from noncatalytic residues. When applying our method to a well-annotated set of protein structures, we found that top-ranked features were a measure of sequence conservation, a measure of structural conservation, a degree of uniqueness of a residue's structural environment, solvent accessibility, and residue hydrophobicity. We also found that features based on structural conservation were complementary to those based on sequence conservation and that they were capable of increasing predictor performance. Using a family nonredundant version of the ASTRAL 40 v1.65 data set, we estimated that the true catalytic residues were correctly predicted in 57.0% of the cases, with a precision of 18.5%. When testing on proteins containing novel folds not used in training, the best features were highly correlated with the training on families, thus validating the approach to nonhomologous catalytic residue prediction in general. We then applied the method to 2781 coordinate files from the structural genomics target pipeline and identified both highly ranked and highly clustered groups of predicted catalytic residues.
Affiliation(s)
- Eunseog Youn
- Center for Computational Biology and Bioinformatics, Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
|
168
|
Youn E, Peters B, Radivojac P, Mooney SD. Evaluation of features for catalytic residue prediction in novel folds. Protein Sci 2006; 16:216-26. [PMID: 17189479 PMCID: PMC2203287 DOI: 10.1110/ps.062523907] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.7] [Indexed: 01/08/2023]
Abstract
Structural genomics projects are determining the three-dimensional structure of proteins without full characterization of their function. A critical part of the annotation process involves appropriate knowledge representation and prediction of functionally important residue environments. We have developed a method to extract features from sequence, sequence alignments, three-dimensional structure, and structural environment conservation, and used support vector machines to annotate homologous and nonhomologous residue positions based on a specific training set of residue functions. In order to evaluate this pipeline for automated protein annotation, we applied it to the challenging problem of prediction of catalytic residues in enzymes. We also ranked the features based on their ability to discriminate catalytic from noncatalytic residues. When applying our method to a well-annotated set of protein structures, we found that top-ranked features were a measure of sequence conservation, a measure of structural conservation, a degree of uniqueness of a residue's structural environment, solvent accessibility, and residue hydrophobicity. We also found that features based on structural conservation were complementary to those based on sequence conservation and that they were capable of increasing predictor performance. Using a family nonredundant version of the ASTRAL 40 v1.65 data set, we estimated that the true catalytic residues were correctly predicted in 57.0% of the cases, with a precision of 18.5%. When testing on proteins containing novel folds not used in training, the best features were highly correlated with the training on families, thus validating the approach to nonhomologous catalytic residue prediction in general. We then applied the method to 2781 coordinate files from the structural genomics target pipeline and identified both highly ranked and highly clustered groups of predicted catalytic residues.
Affiliation(s)
- Eunseog Youn
- Center for Computational Biology and Bioinformatics, Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
|
169
|
Wee LJK, Tan TW, Ranganathan S. SVM-based prediction of caspase substrate cleavage sites. BMC Bioinformatics 2006; 7 Suppl 5:S14. [PMID: 17254298 PMCID: PMC1764470 DOI: 10.1186/1471-2105-7-s5-s14] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.6] [Indexed: 01/11/2023] Open
Abstract
BACKGROUND Caspases belong to a class of cysteine proteases which function as critical effectors in apoptosis and inflammation by cleaving substrates immediately after unique sites. Prediction of such cleavage sites will complement structural and functional studies on substrate cleavage as well as the discovery of new substrates. Recently, different computational methods have been developed to predict the cleavage sites of caspase substrates with varying degrees of success. As the support vector machine (SVM) algorithm has been shown to be useful in several biological classification problems, we have implemented an SVM-based method to investigate its applicability to this domain. RESULTS A set of unique caspase substrate cleavage sites was obtained from the literature and used for evaluating the SVM method. Datasets containing (i) the tetrapeptide cleavage sites, (ii) the tetrapeptide cleavage sites augmented by the two adjacent residues, the P1' and P2' amino acids, and (iii) the tetrapeptide cleavage sites with ten additional upstream and downstream flanking sequences (where available) were tested. The SVM method achieved an accuracy ranging from 81.25% to 97.92% on independent test sets. The SVM method successfully predicted the cleavage of a novel caspase substrate and its mutants. CONCLUSION This study presents an SVM approach for predicting caspase substrate cleavage sites based on the cleavage sites and the downstream and upstream flanking sequences. The method shows an improvement over existing methods and may be useful for predicting hitherto undiscovered cleavage sites.
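The three dataset layouts described in this abstract can be illustrated with a minimal Python sketch that extracts the P4-P1 tetrapeptide and optional flanking residues around a known cleavage position. The function name `cleavage_window`, its parameters, and the toy sequence are hypothetical and not taken from the paper; clipping the flanks at the sequence ends is only one assumed reading of "where available".

```python
def cleavage_window(sequence, p1_index, upstream=0, downstream=0):
    """Return the P4-P1 tetrapeptide plus optional flanking residues.

    p1_index is the 0-based position of the P1 residue; cleavage is assumed
    to occur immediately after it. Flanks are clipped at the sequence ends
    (an assumption, not a detail confirmed by the abstract).
    """
    if p1_index < 3 or p1_index >= len(sequence):
        raise ValueError("P1 position does not allow a full P4-P1 tetrapeptide")
    start = max(0, p1_index - 3 - upstream)            # P4 minus upstream flank
    end = min(len(sequence), p1_index + 1 + downstream)  # P1 plus downstream flank
    return sequence[start:end]


# Toy sequence (made up) with a DEVD motif whose P1 aspartate is at index 7.
toy = "MAGSDEVDGVDEK"
print(cleavage_window(toy, 7))                # (i) tetrapeptide only
print(cleavage_window(toy, 7, downstream=2))  # (ii) tetrapeptide + P1', P2'
print(cleavage_window(toy, 7, 10, 10))        # (iii) up to ten residues each side
```

Each window would still need to be encoded numerically (for example by one-hot encoding of residues) before being passed to an SVM; the encoding used in the paper is not stated in the abstract.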
Affiliation(s)
- Lawrence JK Wee
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Tin Wee Tan
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Shoba Ranganathan
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Department of Chemistry and Biomolecular Sciences & Biotechnology Research Institute, Macquarie University, Sydney, Australia
|
170
|
Baten AKMA, Chang BCH, Halgamuge SK, Li J. Splice site identification using probabilistic parameters and SVM classification. BMC Bioinformatics 2006; 7 Suppl 5:S15. [PMID: 17254299 PMCID: PMC1764471 DOI: 10.1186/1471-2105-7-s5-s15] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.7] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Recent advances and automation in DNA sequencing technology have created a vast amount of DNA sequence data. This increasing growth of sequence data demands better and more efficient analysis methods. Identifying genes in this newly accumulated data is an important issue in bioinformatics, and it requires the prediction of the complete gene structure. Accurate identification of splice sites in DNA sequences plays a central role in gene structure prediction in eukaryotes. Effective detection of splice sites requires knowledge of the characteristics, dependencies, and relationships of nucleotides in the region surrounding the splice site. A higher-order Markov model is generally regarded as a useful technique for modeling higher-order dependencies. However, its implementation requires estimating a large number of parameters, which is computationally expensive. RESULTS The proposed method for splice site detection consists of two stages: a first-order Markov model (MM1) is used in the first stage and a support vector machine (SVM) with a polynomial kernel is used in the second stage. The MM1 serves as a pre-processing step for the SVM and takes DNA sequences as its input. It models the compositional features and dependencies of nucleotides in terms of probabilistic parameters around splice site regions. The probabilistic parameters are then fed into the SVM, which combines them nonlinearly to predict splice sites. When the proposed MM1-SVM model is compared with other existing standard splice site detection methods, it shows superior performance in all cases. CONCLUSION We proposed an effective pre-processing scheme for the SVM and applied it to the identification of splice sites. This is a simple yet effective splice site detection method, which shows better classification accuracy and computational speed than some other more complex methods.
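The first stage of this two-stage design can be sketched roughly in Python as follows: position-specific first-order transition probabilities are estimated from aligned training windows and then looked up to turn a candidate window into a numeric feature vector. This is only an illustrative approximation. The function names `train_mm1` and `mm1_features`, the add-one pseudocount, the restriction to the letters A, C, G, T, and the toy sequences are all assumptions; the abstract does not give the paper's exact parameterization.

```python
from collections import defaultdict

NUCLEOTIDES = "ACGT"


def train_mm1(sequences, pseudocount=1.0):
    """Estimate position-specific first-order transition probabilities.

    sequences are equal-length windows aligned on the splice site.
    Entry i of the result maps (prev, cur) to P(cur at i+1 | prev at i).
    """
    length = len(sequences[0])
    model = []
    for i in range(length - 1):
        counts = defaultdict(float)
        for seq in sequences:
            counts[(seq[i], seq[i + 1])] += 1
        probs = {}
        for prev in NUCLEOTIDES:
            # Smoothed row total so unseen transitions keep a nonzero probability.
            row_total = sum(counts[(prev, cur)] for cur in NUCLEOTIDES)
            row_total += pseudocount * len(NUCLEOTIDES)
            for cur in NUCLEOTIDES:
                probs[(prev, cur)] = (counts[(prev, cur)] + pseudocount) / row_total
        model.append(probs)
    return model


def mm1_features(model, seq):
    """Map a window to one transition probability per adjacent position pair."""
    return [model[i][(seq[i], seq[i + 1])] for i in range(len(seq) - 1)]


# Toy aligned windows (made up) standing in for true splice-site examples.
training = ["AGGT", "AGGT", "CGGT"]
model = train_mm1(training)
print(mm1_features(model, "AGGT"))
```

In the second stage, vectors like these would be passed to an SVM with a polynomial kernel, for example scikit-learn's `SVC(kernel="poly")`; that step is omitted here to keep the sketch free of dependencies.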
Affiliation(s)
- AKMA Baten
- Dynamic Systems and Control Research Group, DoMME, The University of Melbourne, Victoria 3010, Australia
- BCH Chang
- Dynamic Systems and Control Research Group, DoMME, The University of Melbourne, Victoria 3010, Australia
- SK Halgamuge
- Dynamic Systems and Control Research Group, DoMME, The University of Melbourne, Victoria 3010, Australia
- Jason Li
- Dynamic Systems and Control Research Group, DoMME, The University of Melbourne, Victoria 3010, Australia
|
171
|
Sen TZ, Cheng H, Kloczkowski A, Jernigan RL. A Consensus Data Mining secondary structure prediction by combining GOR V and Fragment Database Mining. Protein Sci 2006; 15:2499-506. [PMID: 17001039 PMCID: PMC2242411 DOI: 10.1110/ps.062125306] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Received: 01/30/2006] [Revised: 05/11/2006] [Accepted: 07/31/2006] [Indexed: 10/24/2022]
Abstract
The major aim of tertiary structure prediction is to obtain protein models with the highest possible accuracy. Fold recognition, homology modeling, and de novo prediction methods typically use predicted secondary structures as input, and all of these methods may significantly benefit from more accurate secondary structure predictions. Although there are many different secondary structure prediction methods available in the literature, their cross-validated prediction accuracy is generally <80%. In order to increase the prediction accuracy, we developed a novel hybrid algorithm called Consensus Data Mining (CDM) that combines our two previous successful methods: (1) Fragment Database Mining (FDM), which exploits the Protein Data Bank structures, and (2) GOR V, which is based on information theory, Bayesian statistics, and multiple sequence alignments (MSA). In CDM, the target sequence is dissected into smaller fragments that are compared with fragments obtained from related sequences in the PDB. For fragments with a sequence identity above a certain sequence identity threshold, the FDM method is applied for the prediction. The remainder of the fragments are predicted by GOR V. The results of the CDM are provided as a function of the upper sequence identities of aligned fragments and the sequence identity threshold. We observe that the value 50% is the optimum sequence identity threshold, and that the accuracy of the CDM method measured by Q(3) ranges from 67.5% to 93.2%, depending on the availability of known structural fragments with sufficiently high sequence identity. As the Protein Data Bank grows, it is anticipated that this consensus method will improve because it will rely more upon the structural fragments.
Affiliation(s)
- Taner Z Sen
- Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, Iowa 50011-3020, USA.
|
172
|
Xue Y, Chen H, Jin C, Sun Z, Yao X. NBA-Palm: prediction of palmitoylation site implemented in Naïve Bayes algorithm. BMC Bioinformatics 2006; 7:458. [PMID: 17044919 PMCID: PMC1624852 DOI: 10.1186/1471-2105-7-458] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.3] [Received: 05/09/2006] [Accepted: 10/17/2006] [Indexed: 11/16/2022] Open
Abstract
Background Protein palmitoylation, an essential and reversible post-translational modification (PTM), has been implicated in cellular dynamics and plasticity. Although numerous experimental studies have been performed to explore the molecular mechanisms underlying palmitoylation processes, the intrinsic feature of substrate specificity has remained elusive. Thus, computational approaches for palmitoylation prediction are highly desirable for further experimental design. Results In this work, we present NBA-Palm, a novel computational method based on the Naïve Bayes algorithm for prediction of palmitoylation sites. The training data is curated from scientific literature (PubMed) and includes 245 palmitoylated sites from 105 distinct proteins after redundancy elimination. The proper window length for a potential palmitoylated peptide is optimized as six. To evaluate the prediction performance of NBA-Palm, 3-fold cross-validation, 8-fold cross-validation and Jack-Knife validation have been carried out. Prediction accuracies reach 85.79% for 3-fold cross-validation, 86.72% for 8-fold cross-validation and 86.74% for Jack-Knife validation. Two more algorithms, RBF network and support vector machine (SVM), have also been employed and compared with NBA-Palm. Conclusion Taken together, our analyses demonstrate that NBA-Palm is a useful computational program that provides insights for further experimentation. The accuracy of NBA-Palm is comparable with our previously described tool CSS-Palm. The NBA-Palm is freely accessible from: .
Affiliation(s)
- Yu Xue
- Laboratory of Cellular Dynamics, Hefei National Laboratory for Physical Sciences, and the University of Science and Technology of China, Hefei, China 230027
- Hu Chen
- Institute of Bioinformatics and Systems Biology, MOE Key Laboratory of Bioinformatics, State Key Laboratory of Biomembrane and Membrane Biotechnology, Department of Biological Sciences and Biotechnology, Tsinghua University, Beijing, China 100084
- Changjiang Jin
- Laboratory of Cellular Dynamics, Hefei National Laboratory for Physical Sciences, and the University of Science and Technology of China, Hefei, China 230027
- Zhirong Sun
- Institute of Bioinformatics and Systems Biology, MOE Key Laboratory of Bioinformatics, State Key Laboratory of Biomembrane and Membrane Biotechnology, Department of Biological Sciences and Biotechnology, Tsinghua University, Beijing, China 100084
- Xuebiao Yao
- Laboratory of Cellular Dynamics, Hefei National Laboratory for Physical Sciences, and the University of Science and Technology of China, Hefei, China 230027
- Department of Physiology and Cancer Research Program, Morehouse School of Medicine, Atlanta, GA 30310, USA
|
173
|
Bauer DC, Bodén M, Thier R, Gillam EM. STAR: predicting recombination sites from amino acid sequence. BMC Bioinformatics 2006; 7:437. [PMID: 17026775 PMCID: PMC1624854 DOI: 10.1186/1471-2105-7-437] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 04/29/2006] [Accepted: 10/08/2006] [Indexed: 11/12/2022] Open
Abstract
Background Designing novel proteins with site-directed recombination has enormous prospects. By locating effective recombination sites for swapping sequence parts, the probability that hybrid sequences have the desired properties is increased dramatically. The prohibitive requirements for applying current tools led us to investigate machine learning to assist in finding useful recombination sites from amino acid sequence alone. Results We present STAR, Site Targeted Amino acid Recombination predictor, which produces a score indicating the structural disruption caused by recombination for each position in an amino acid sequence. Example predictions, contrasted with those of alternative tools, illustrate STAR's utility in determining useful recombination sites. Overall, the correlation coefficient between the output of the experimentally validated protein design algorithm SCHEMA and the prediction of STAR is very high (0.89). Conclusion STAR allows the user to explore useful recombination sites in amino acid sequences with unknown structure and unknown evolutionary origin. The predictor service is available from .
Affiliation(s)
- Denis C Bauer
- Institute for Molecular Bioscience, The University of Queensland, QLD 4072, Australia
- Mikael Bodén
- School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072, Australia
- Ricarda Thier
- School of Biomedical Sciences, The University of Queensland, QLD 4072, Australia
- Elizabeth M Gillam
- School of Biomedical Sciences, The University of Queensland, QLD 4072, Australia
|
174
|
Karypis G. YASSPP: better kernels and coding schemes lead to improvements in protein secondary structure prediction. Proteins 2006; 64:575-86. [PMID: 16763996 DOI: 10.1002/prot.21036] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.2] [Indexed: 11/12/2022]
Abstract
The accurate prediction of a protein's secondary structure plays an increasingly critical role in predicting its function and tertiary structure, as it is utilized by many of the current state-of-the-art methods for remote homology, fold recognition, and ab initio structure prediction. We developed a new secondary structure prediction algorithm called YASSPP, which uses a pair of cascaded models constructed from two sets of binary SVM-based models. YASSPP uses an input coding scheme that combines both position-specific and nonposition-specific information, utilizes a kernel function designed to capture the sequence conservation signals around the local window of each residue, and constructs a second-level model by incorporating both the three-state predictions produced by the first-level model and information about the original sequence. Experiments on three standard datasets (RS126, CB513, and EVA common subset 4) show that YASSPP produces higher Q3 and SOV scores than those achieved by existing widely used schemes such as PSIPRED, SSPro 4.0, and SAM-T99sec, as well as previously developed SVM-based schemes. On the EVA dataset it achieves Q3 and SOV scores of 79.34% and 78.65%, which are considerably higher than the best reported scores of 77.64% and 76.05%, respectively.
Affiliation(s)
- George Karypis
- Department of Computer Science & Engineering, University of Minnesota, Army HPC Research Center, Minneapolis, Minnesota 55455, USA.
|
175
|
Abstract
Because a protein's function is usually related to its subcellular localization, the ability to predict subcellular localization directly from protein sequences will be useful for inferring protein functions. Recent years have seen a surging interest in the development of novel computational tools to predict subcellular localization. At present, these approaches, based on a wide range of algorithms, have achieved varying degrees of success for specific organisms and for certain localization categories. A number of authors have noticed that sequence similarity is useful in predicting subcellular localization. For example, Nair and Rost (Protein Sci 2002;11:2836-2847) have carried out extensive analysis of the relation between sequence similarity and identity in subcellular localization, and have found a close relationship between them above a certain similarity threshold. However, many existing benchmark data sets used for the prediction accuracy assessment contain highly homologous sequences, with some data sets comprising sequences of up to 80-90% sequence identity. Using these benchmark test data will surely lead to overestimation of the performance of the methods considered. Here, we develop an approach based on a two-level support vector machine (SVM) system: the first level comprises a number of SVM classifiers, each based on a specific type of feature vector derived from sequences; the second-level SVM classifier functions as the jury machine to generate the probability distribution of decisions for possible localizations. We compare our approach with a global sequence alignment approach and other existing approaches for two benchmark data sets, one comprising prokaryotic sequences and the other eukaryotic sequences. Furthermore, we carried out all-against-all sequence alignment for several data sets to investigate the relationship between sequence homology and subcellular localization.
Our results, which are consistent with previous studies, indicate that the homology search approach performs well down to 30% sequence identity, although its performance deteriorates considerably for sequences sharing lower sequence identity. A data set of high homology levels will undoubtedly lead to biased assessment of the performances of the predictive approaches, especially those relying on homology search or sequence annotations. Our two-level classification system based on SVM does not rely on homology search; therefore, its performance remains relatively unaffected by sequence homology. When compared with other approaches, our approach performed significantly better. Furthermore, we also develop a practical hybrid method, which combines the two-level SVM classifier and the homology search method, as a general tool for the sequence annotation of subcellular localization.
Affiliation(s)
- Chin-Sheng Yu
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan, Republic of China
|
176
|
Song J, Burrage K. Predicting residue-wise contact orders in proteins by support vector regression. BMC Bioinformatics 2006; 7:425. [PMID: 17014735 PMCID: PMC1618864 DOI: 10.1186/1471-2105-7-425] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.5] [Received: 05/26/2006] [Accepted: 10/03/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The residue-wise contact order (RWCO) describes the sequence separations between the residue of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable information for reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. RESULTS We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and a root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences.
Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance to a CC of 0.57 and an RMSE of 0.79. In addition, incorporating the secondary structure predicted by PSIPRED was found to significantly improve the prediction performance and could yield the best prediction accuracy with a CC of 0.60 and an RMSE of 0.78, which is at least comparable to the performance of other existing methods. CONCLUSION The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool in extracting the protein sequence-structure relationship and in estimating the protein structural profiles from amino acid sequences.
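The two evaluation measures quoted in this abstract, the Pearson correlation coefficient (CC) and the root mean square error (RMSE), can be computed as in the short Python sketch below. The formulas are the standard textbook definitions; the function names and the toy values are illustrative assumptions, and any normalization of RWCO values applied in the paper before scoring is not reproduced here.

```python
import math


def pearson_cc(predicted, observed):
    """Pearson correlation coefficient between two equal-length value lists."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


# Toy per-residue values (made up), not data from the study.
observed = [0.2, 1.1, 0.7, 1.9, 0.4]
predicted = [0.3, 0.9, 0.8, 1.5, 0.6]
print(pearson_cc(predicted, observed), rmse(predicted, observed))
```

The two measures are complementary: CC reports how well the predicted profile tracks the shape of the observed one, while RMSE reports the typical size of the per-residue error.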
Affiliation(s)
- Jiangning Song
- Advanced Computational Modelling Centre, The University of Queensland, Brisbane Qld 4072, Australia
- Kevin Burrage
- Advanced Computational Modelling Centre, The University of Queensland, Brisbane Qld 4072, Australia
|
177
|
Zhang T, Ding Y, Chou KC. Prediction of protein subcellular location using hydrophobic patterns of amino acid sequence. Comput Biol Chem 2006; 30:367-71. [PMID: 16963318 DOI: 10.1016/j.compbiolchem.2006.08.003] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.2] [Received: 03/24/2006] [Accepted: 08/03/2006] [Indexed: 11/17/2022]
Abstract
The function of a eukaryotic protein is closely correlated with its subcellular location. The number of newly found protein sequences entering data banks is rapidly increasing with the success of the human genome project. It is highly desirable to predict a protein's subcellular location automatically from its amino acid sequence. In this paper, amino acid hydrophobic patterns and average power-spectral density (APSD) are introduced to define pseudo amino acid composition. The covariant-discriminant predictor is used to predict subcellular location. An immune-genetic algorithm (IGA) is used to find the fittest weight factors, which are very important in this method. As such, high success rates are obtained by both the self-consistency test (86%) and the jackknife test (73%). More than 80% predictive accuracy is achieved in an independent dataset test. The results demonstrate that the proposed method is practical. The method also shows that protein subcellular location can be predicted from the surface physicochemical characteristics of protein folding.
Affiliation(s)
- Tongliang Zhang
- Bio-Informatics Research Center, College of Information Sciences and Technology, Donghua University, Shanghai 201620, PR China
|
178
|
Abstract
MOTIVATION Most secondary structure prediction programs target only alpha helix and beta sheet structures and summarize all other structures in the random coil pseudo class. However, such an assignment often ignores existing local ordering in so-called random coil regions. Signatures for such ordering are distinct dihedral angle patterns. For this reason, we propose, as an alternative approach, to directly predict dihedral regions for each residue, as this yields more structural information. RESULTS We propose a multi-step support vector machine (SVM) procedure, dihedral prediction (DHPRED), to predict the dihedral angle state of residues from sequence. Trained on 20,000 residues, our approach leads to dihedral region predictions whose accuracy, in regions without alpha helices or beta sheets, is higher than that of secondary structure prediction programs. AVAILABILITY DHPRED has been implemented as a web service, which academic researchers can access from our webpage http://www.fz-juelich.de/nic/cbb
Affiliation(s)
- Olav Zimmermann
- John v. Neumann Institute for Computing, FZ Jülich, 52425 Jülich, Germany
|
179
|
Wang Y, Xue Z, Xu J. Better prediction of the location of alpha-turns in proteins with support vector machine. Proteins 2006; 65:49-54. [PMID: 16894602 DOI: 10.1002/prot.21062] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/08/2022]
Abstract
We have developed a novel method named AlphaTurn to predict alpha-turns in proteins based on the support vector machine (SVM). The prediction was done on a data set of 469 nonhomologous proteins containing 967 alpha-turns. A great improvement in prediction performance was achieved by using multiple sequence alignment generated by PSI-BLAST as input instead of the single amino acid sequence. The introduction of secondary structure information predicted by PSIPRED also improved the prediction performance. Moreover, we handled the very uneven data set by combining the cost factor j with the "state-shifting" rule. This further promoted the prediction quality of our method. The final SVM model yielded a Matthews correlation coefficient (MCC) of 0.25 by a 10-fold cross-validation. To our knowledge, this MCC value is the highest obtained so far for predicting alpha-turns. An online Web server based on this method has been developed and can be freely accessed at http://bmc.hust.edu.cn/bioinformatics/ or http://210.42.106.80/.
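The Matthews correlation coefficient reported in this abstract is a common choice for strongly unbalanced data sets, where plain accuracy can look high even for a predictor that never finds the rare class. The Python sketch below uses the standard definition of MCC from confusion-matrix counts; the function names and the counts in the example are made up for illustration and are not results from the study.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of all residues classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical unbalanced case: 5 turn residues, 95 non-turn residues,
# and a predictor that labels every residue as non-turn.
print(accuracy(tp=0, tn=95, fp=0, fn=5))     # high despite finding no turns
print(matthews_cc(tp=0, tn=95, fp=0, fn=5))  # no better than chance
```

Because MCC uses all four cells of the confusion matrix, it only rewards a predictor that does well on both the rare and the common class.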
Affiliation(s)
- Yan Wang
- Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan City, China
|
180
|
|
181
|
Wang Y, Xue ZD, Shi XH, Xu J. Prediction of π-turns in proteins using PSI-BLAST profiles and secondary structure information. Biochem Biophys Res Commun 2006; 347:574-80. [PMID: 16844090 DOI: 10.1016/j.bbrc.2006.06.066] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 05/25/2006] [Accepted: 06/14/2006] [Indexed: 11/28/2022]
Abstract
Due to the structural and functional importance of tight turns, some methods have been proposed to predict gamma-turns, beta-turns, and alpha-turns in proteins. In the past, studies of pi-turns were made, but not a single prediction approach has been developed so far. It will be useful to develop a method for identifying pi-turns in a protein sequence. In this paper, the support vector machine (SVM) method has been introduced to predict pi-turns from the amino acid sequence. The training and testing of this approach is performed with a newly collected data set of 640 non-homologous protein chains containing 1931 pi-turns. Different sequence encoding schemes have been explored in order to investigate their effects on the prediction performance. With multiple sequence alignment and predicted secondary structure, the final SVM model yields a Matthews correlation coefficient (MCC) of 0.556 by a 7-fold cross-validation. A web server implementing the prediction method is available at the following URL: http://210.42.106.80/piturn/.
Affiliation(s)
- Yan Wang
- Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan City, China.
|
182
|
Chen L, Wang W, Ling S, Jia C, Wang F. KemaDom: a web server for domain prediction using kernel machine with local context. Nucleic Acids Res 2006; 34:W158-63. [PMID: 16844982 PMCID: PMC1538912 DOI: 10.1093/nar/gkl331] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Indexed: 11/13/2022] Open
Abstract
Predicting domains of proteins is an important and challenging problem in computational biology because of its significant role in understanding the complexity of proteomes. Although many template-based prediction servers have been developed, ab initio methods should be designed and further improved to complement the template-based methods. In this paper, we present a novel domain prediction system, KemaDom, which ensembles three kernel machines with the local context information among neighboring amino acids. KemaDom, an alternative ab initio predictor, can achieve high performance in predicting the number of domains in proteins. It is freely accessible at and .
Affiliation(s)
- Lusheng Chen
- Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, PR China
- Department of Computer Science and Engineering, School of Life Science, Fudan University, Shanghai, PR China
- Wei Wang
- Institute of Genetics, School of Life Science, Fudan University, Shanghai, PR China
- Shaoping Ling
- College of Information Engineering, Xiangtan University, Xiangtan, Hunan, PR China
- Caiyan Jia
- Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, PR China
- Department of Computer Science and Engineering, School of Life Science, Fudan University, Shanghai, PR China
- Fei Wang
- Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, PR China
- Department of Computer Science and Engineering, School of Life Science, Fudan University, Shanghai, PR China
- To whom correspondence should be addressed. Tel: +86 21 5566 4712; Fax: +86 21 6565 4253
|
183
|
|
184
|
Sun XD, Huang RB. Prediction of protein structural classes using support vector machines. Amino Acids 2006; 30:469-75. [PMID: 16622605 DOI: 10.1007/s00726-005-0239-0] [Citation(s) in RCA: 87] [Impact Index Per Article: 4.6] [Received: 06/01/2005] [Accepted: 07/12/2005] [Indexed: 11/24/2022]
Abstract
The support vector machine, a machine-learning method, is used to predict the four structural classes, i.e. mainly alpha, mainly beta, alpha-beta and few secondary structures (fss), at the topology level of the CATH protein structure database. For binary classification, any two structural classes that do not share secondary structure elements, such as alpha and beta, can be classified with accuracy as high as 90%. The accuracy, however, decreases to less than 70% if the structural classes to be classified contain structure elements in common. Our study also shows that feature spaces of dimension 20^2 = 400 (for dipeptides) and 20^3 = 8000 (for tripeptides) give nearly the same prediction accuracy. Among these 4 structural classes, multi-class classification gives an overall accuracy of about 52%, indicating that the multi-class classification technique for support vector machines may still need to be improved in future investigation.
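The 400- and 8000-dimensional feature spaces mentioned in this abstract follow from counting all possible dipeptides and tripeptides over the 20 standard amino acids. The sketch below shows one common way to build such a composition vector; the function name, the normalization, and the handling of non-standard residues are assumptions, not details from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def kmer_composition(sequence, k=2):
    """Return normalized k-mer frequencies as a vector of length 20**k."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in index:  # skip windows containing non-standard residues
            counts[index[kmer]] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]

print(len(kmer_composition("MKTAYIAKQR", k=2)))  # dipeptide vector length
print(len(kmer_composition("MKTAYIAKQR", k=3)))  # tripeptide vector length
```

A vector of this kind can be passed to any SVM implementation as the input features for a protein.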
Affiliation(s)
- X-D Sun
- College of Life Science and Biotechnology, Guangxi University, Nanning, Guangxi, China
|
185
|
Aydin Z, Altunbasak Y, Borodovsky M. Protein secondary structure prediction for a single-sequence using hidden semi-Markov models. BMC Bioinformatics 2006; 7:178. [PMID: 16571137 PMCID: PMC1479840 DOI: 10.1186/1471-2105-7-178] [Citation(s) in RCA: 55] [Impact Index Per Article: 2.9] [Received: 04/16/2005] [Accepted: 03/30/2006] [Indexed: 11/10/2022]
Abstract
Background: The accuracy of protein secondary structure prediction has been improving steadily towards the 88% estimated theoretical limit. There are two types of prediction algorithms: single-sequence prediction algorithms assume that information about other (homologous) proteins is not available, while algorithms of the second type assume that information about homologous proteins is available and use it intensively. The single-sequence algorithms could make an important contribution to studies of proteins with no detected homologs; however, the accuracy of protein secondary structure prediction from a single sequence is not as high as when the additional evolutionary information is present. Results: In this paper, we further refine and extend the hidden semi-Markov model (HSMM) initially considered in the BSPSS algorithm. We introduce an improved residue dependency model by considering the patterns of statistically significant amino acid correlation at structural segment borders. We also derive models that specialize on different sections of the dependency structure and incorporate them into the HSMM. In addition, we implement an iterative training method to refine estimates of HSMM parameters. The three-state-per-residue accuracy and other accuracy measures of the new method, IPSSP, are shown to be comparable to or better than those for BSPSS as well as for PSIPRED, tested under the single-sequence condition. Conclusions: We have shown that new dependency models and training methods bring further improvements to single-sequence protein secondary structure prediction. The results are obtained under cross-validation conditions using a dataset with no pair of sequences having significant sequence similarity. As new sequences are added to the database it is possible to augment the dependency structure and obtain even higher accuracy. Current and future advances should contribute to the improvement of function prediction for orphan proteins inscrutable to current similarity search methods.
Affiliation(s)
- Zafer Aydin
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250, USA
- Yucel Altunbasak
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250, USA
- Mark Borodovsky
- School of Biology, the Wallace H. Coulter Department of Biomedical Engineering and the Center for Bioinformatics and Computational Biology, Georgia Institute of Technology, Atlanta, GA 30332-0230, USA
|
186
|
Kuznetsov IB, Gou Z, Li R, Hwang S. Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins. Proteins 2006; 64:19-27. [PMID: 16568445 DOI: 10.1002/prot.20977] [Citation(s) in RCA: 110] [Impact Index Per Article: 5.8] [Indexed: 11/11/2022]
Abstract
Proteins that interact with DNA are involved in a number of fundamental biological activities such as DNA replication, transcription, and repair. A reliable identification of DNA-binding sites in DNA-binding proteins is important for functional annotation, site-directed mutagenesis, and modeling protein-DNA interactions. We apply Support Vector Machine (SVM), a supervised pattern recognition method, to predict DNA-binding sites in DNA-binding proteins using the following features: amino acid sequence, profile of evolutionary conservation of sequence positions, and low-resolution structural information. We use a rigorous statistical approach to study the performance of predictors that utilize different combinations of features and how this performance is affected by structural and sequence properties of proteins. Our results indicate that an SVM predictor based on a properly scaled profile of evolutionary conservation in the form of a position specific scoring matrix (PSSM) significantly outperforms a PSSM-based neural network predictor. The highest accuracy is achieved by SVM predictor that combines the profile of evolutionary conservation with low-resolution structural information. Our results also show that knowledge-based predictors of DNA-binding sites perform significantly better on proteins from mainly-alpha structural class and that the performance of these predictors is significantly correlated with certain structural and sequence properties of proteins. These observations suggest that it may be possible to assign a reliability index to the overall accuracy of the prediction of DNA-binding sites in any given protein using its sequence and structural properties. A web-server implementation of the predictors is freely available online at http://lcg.rit.albany.edu/dp-bind/.
Affiliation(s)
- Igor B Kuznetsov
- Gen*NY*sis Center for Excellence in Cancer Genomics, Department of Epidemiology and Biostatistics, University at Albany, Rensselaer, New York 12144, USA.
|
187
|
Abstract
The ability of physicians to effectively treat and cure cancer is directly dependent on their ability to detect cancers at their early stages. The early detection of cancer has the potential to dramatically reduce mortality. Recently, the use of mass spectrometry to develop profiles of patient serum proteins has been reported as a promising method to achieve this goal. In this paper, we analyzed the ovarian cancer and prostate cancer data sets using a support vector machine (SVM) to detect cancer at the early stages based on serum proteomic patterns. The results showed that the SVM, in general, performed well on these two data sets, as measured by sensitivity, specificity, positive predictive value, negative predictive value, and accuracy. The linear kernel worked best on the ovarian cancer data, with a sensitivity of 0.99 and an accuracy of 0.97, while the polynomial kernel worked best on the prostate cancer data, with a sensitivity of 0.79 and an accuracy of 0.82. When the radial kernel was applied to either of the two data sets, all the samples were predicted as cancer samples, giving a sensitivity of 1 and a specificity of 0. Furthermore, feature selection did not improve SVM performance.
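The degenerate radial-kernel result described in this abstract (every sample predicted as cancer) can be reproduced arithmetically. The following is a minimal sketch, assuming labels coded 1 for cancer and 0 for control; the function name and the sample labels are hypothetical and are not data from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for labels coded 1 (cancer) / 0 (control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }

# Hypothetical labels: a classifier that calls every sample "cancer".
y_true = [1, 1, 1, 0, 0]
y_pred = [1] * len(y_true)
print(binary_metrics(y_true, y_pred))
```

With all predictions positive there are no false negatives and no true negatives, so sensitivity is 1 and specificity is 0 regardless of the class balance.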
Affiliation(s)
- Ying Liu
- Laboratory for Bioinformatics and Medical Informatics, Department of Computer Science, Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, 75083, USA.
|
188
|
Song J, Burrage K, Yuan Z, Huber T. Prediction of cis/trans isomerization in proteins using PSI-BLAST profiles and secondary structure information. BMC Bioinformatics 2006; 7:124. [PMID: 16526956 PMCID: PMC1450308 DOI: 10.1186/1471-2105-7-124] [Citation(s) in RCA: 67] [Impact Index Per Article: 3.5] [Received: 12/13/2005] [Accepted: 03/09/2006] [Indexed: 11/18/2022]
Abstract
Background: The majority of peptide bonds in proteins occur in the trans conformation. However, for proline residues, a considerable fraction of prolyl peptide bonds adopt the cis form. Proline cis/trans isomerization is known to play a critical role in protein folding, splicing, cell signaling and transmembrane active transport. Accurate prediction of proline cis/trans isomerization in proteins would have many important applications towards the understanding of protein structure and function. Results: In this paper, we propose a new approach to predict the proline cis/trans isomerization in proteins using a support vector machine (SVM). The preliminary results indicated that using Radial Basis Function (RBF) kernels could lead to better prediction performance than polynomial and linear kernel functions. We used single sequence information of different local window sizes, amino acid compositions of different local sequences, multiple sequence alignment obtained from PSI-BLAST and the secondary structure information predicted by PSIPRED. We explored these different sequence encoding schemes in order to investigate their effects on the prediction performance. The training and testing of this approach was performed on a newly enlarged dataset of 2424 non-homologous proteins determined by the X-ray diffraction method, using 5-fold cross-validation. Selecting a window size of 11 provided the best performance for determining the proline cis/trans isomerization based on the single amino acid sequence. It was found that using multiple sequence alignments in the form of PSI-BLAST profiles could significantly improve the prediction performance: the prediction accuracy increased from 62.8% with single sequence to 69.8%, and the Matthews Correlation Coefficient (MCC) improved from 0.26 with single local sequence to 0.40. Furthermore, if coupled with the secondary structure information predicted by PSIPRED, our method yielded a prediction accuracy of 71.5% and an MCC of 0.43, which are 9% and 0.17 higher, respectively, than the values achieved with single sequence information. Conclusion: A new method has been developed to predict the proline cis/trans isomerization in proteins based on a support vector machine, which used the single amino acid sequence with different local window sizes, the amino acid compositions of local sequences flanking the central proline residues, the position-specific scoring matrices (PSSMs) extracted by PSI-BLAST and the predicted secondary structures generated by PSIPRED. The successful application of the SVM approach in this study reinforced that SVM is a powerful tool in predicting proline cis/trans isomerization in proteins and in biological sequence analysis.
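The local-window encoding described in this abstract starts by cutting a fixed-length window around each proline. The sketch below illustrates that step for a window size of 11; the function name, the padding character `X` for positions beyond the chain ends, and the example sequence are assumptions rather than details from the paper.

```python
def proline_windows(sequence, window=11, pad="X"):
    """Return (position, window) pairs centered on each proline residue."""
    if window % 2 == 0:
        raise ValueError("window size must be odd so the proline sits in the center")
    half = window // 2
    padded = pad * half + sequence + pad * half
    # Residue i of the sequence sits at index i + half of the padded string,
    # so the window centered on it is padded[i : i + window].
    return [(i, padded[i:i + window])
            for i, residue in enumerate(sequence) if residue == "P"]

# Hypothetical sequence for illustration only.
for position, win in proline_windows("MKTPYIAKQRPLSEG"):
    print(position, win)
```

Each window would then be turned into numeric features (for example one-hot residues or PSSM rows) before being passed to the classifier.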
Affiliation(s)
- Jiangning Song
- Advanced Computational Modelling Centre, The University of Queensland, Brisbane Qld 4072, Australia
- Kevin Burrage
- Advanced Computational Modelling Centre, The University of Queensland, Brisbane Qld 4072, Australia
- Zheng Yuan
- Institute for Molecular Bioscience and ARC Centre in Bioinformatics, The University of Queensland, Brisbane Qld 4072, Australia
- Thomas Huber
- Advanced Computational Modelling Centre, The University of Queensland, Brisbane Qld 4072, Australia
|
189
|
Abstract
We present DESTRUCT, a new method of protein secondary structure prediction, which achieves a three-state accuracy (Q3) of 79.4% in a cross-validated trial on a nonredundant set of 513 proteins. An iterative set of cascade-correlation neural networks is used to predict both secondary structure and psi dihedral angles, with predicted values enhancing the subsequent iteration. Predictive accuracies of 80.7% and 81.7% are achieved on the CASP4 and CASP5 targets, respectively. Our approach is significantly more accurate than other contemporary methods, due to feedback and a novel combination of structural representations.
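The three-state accuracy (Q3) reported in this abstract is the percentage of residues whose predicted state (helix H, strand E, or coil C) matches the observed state. A minimal sketch follows; the function name and the example strings are illustrative and are not taken from the paper.

```python
def q3_accuracy(observed, predicted):
    """Percentage of residues whose predicted H/E/C state matches the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have the same length")
    if not observed:
        return 0.0
    correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * correct / len(observed)

# Hypothetical assignments for a 10-residue fragment.
print(q3_accuracy("HHHHEEECCC", "HHHCEEECCH"))
```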
Affiliation(s)
- Matthew J Wood
- School of Chemistry, University of Nottingham, Nottingham, United Kingdom
|
190
|
He J, Hu HJ, Harrison R, Tai PC, Pan Y. Rule Generation for Protein Secondary Structure Prediction With Support Vector Machines and Decision Tree. IEEE Trans Nanobioscience 2006; 5:46-53. [PMID: 16570873 DOI: 10.1109/tnb.2005.864021] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Indexed: 11/09/2022]
Abstract
Support vector machines (SVMs) have shown strong generalization ability in a number of application areas, including protein structure prediction. However, their poor comprehensibility hinders the success of the SVM for protein structure prediction. An explanation of how a decision is made is important for the acceptance of machine learning technology, especially in applications such as bioinformatics. A reasonable interpretation is useful to guide "wet experiments," and the extracted rules also help to integrate computational intelligence with symbolic AI systems for advanced deduction. A decision tree, on the other hand, has good comprehensibility. In this paper, a novel approach to rule generation for protein secondary structure prediction that integrates the merits of both the SVM and the decision tree is presented. This approach combines the SVM with a decision tree into a new algorithm called SVM_DT, which proceeds in three steps. The algorithm first trains an SVM. Then, a new training set is generated through careful selection from the output of the SVM. Finally, the obtained training set is used to train a decision tree learning system and to extract the corresponding rule sets. The results of protein secondary structure prediction experiments on the RS126 data set show that the comprehensibility of SVM_DT is much better than that of the SVM. Moreover, the generalization ability of SVM_DT is better than that of C4.5 decision trees and is similar to that of the SVM. Hence, SVM_DT can be used not only for prediction, but also for guiding biological experiments.
Affiliation(s)
- Jieyue He
- Computer Science and Engineering Department, Southeast University, Nanjing 210096, China.
|
191
|
Bodén M, Yuan Z, Bailey TL. Prediction of protein continuum secondary structure with probabilistic models based on NMR solved structures. BMC Bioinformatics 2006; 7:68. [PMID: 16478545 PMCID: PMC1386714 DOI: 10.1186/1471-2105-7-68] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Received: 06/22/2005] [Accepted: 02/14/2006] [Indexed: 11/10/2022]
Abstract
Background: The structure of proteins may change as a result of the inherent flexibility of some protein regions. We develop and explore probabilistic machine learning methods for predicting a continuum secondary structure, i.e. assigning probabilities to the conformational states of a residue. We train our methods using data derived from high-quality NMR models. Results: Several probabilistic models not only successfully estimate the continuum secondary structure, but also provide a categorical output on par with models directly trained on categorical data. Importantly, models trained on the continuum secondary structure are also better than their categorical counterparts at identifying the conformational state for structurally ambivalent residues. Conclusion: Cascaded probabilistic neural networks trained on the continuum secondary structure exhibit better accuracy in structurally ambivalent regions of proteins, while sustaining an overall classification accuracy on par with standard, categorical prediction methods.
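A continuum secondary structure assigns each residue a probability for every conformational state. One simple way to derive such targets, sketched here as an assumption rather than as the authors' procedure, is to take the fraction of models in an NMR ensemble that place the residue in each state; the function name and the example assignments are hypothetical.

```python
from collections import Counter

def continuum_secondary_structure(model_assignments, states="HEC"):
    """Per-residue state probabilities from several models of the same protein."""
    if not model_assignments:
        return []
    length = len(model_assignments[0])
    if any(len(model) != length for model in model_assignments):
        raise ValueError("all models must cover the same residues")
    n_models = len(model_assignments)
    profile = []
    for pos in range(length):
        counts = Counter(model[pos] for model in model_assignments)
        profile.append({state: counts[state] / n_models for state in states})
    return profile

# Hypothetical three-state assignments from four models of a 3-residue fragment.
models = ["HHC", "HEC", "HCC", "HEC"]
for residue_probs in continuum_secondary_structure(models):
    print(residue_probs)
```

A residue whose probabilities are spread over several states corresponds to what the abstract calls a structurally ambivalent residue.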
Affiliation(s)
- Mikael Bodén
- School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072, St Lucia, Australia
- Zheng Yuan
- Institute of Molecular Bioscience, The University of Queensland, QLD 4072, St Lucia, Australia
- Timothy L Bailey
- Institute of Molecular Bioscience, The University of Queensland, QLD 4072, St Lucia, Australia
|
192
|
Mittal A, Gupta S. Automatic content-based retrieval and semantic classification of video content. International Journal on Digital Libraries 2006. [DOI: 10.1007/s00799-005-0119-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/28/2022]
|
193
|
Cui J, Han LY, Cai CZ, Zheng CJ, Ji ZL, Chen YZ. Prediction of functional class of novel bacterial proteins without the use of sequence similarity by a statistical learning method. J Mol Microbiol Biotechnol 2006; 9:86-100. [PMID: 16319498 DOI: 10.1159/000088839] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
Abstract
A substantial percentage of the putative protein-encoding open reading frames (ORFs) in bacterial genomes have no homolog of known function, and their function cannot be confidently assigned on the basis of sequence similarity. Methods not based on sequence similarity are needed and are being developed. One method, SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi), predicts protein functional family irrespective of sequence similarity (Nucleic Acids Res. 2003;31:3692-3697). While it has been tested on a large number of proteins, its capability for non-homologous proteins has so far been evaluated for a relatively small number of proteins, and additional tests are needed to more fully assess SVMProt. In this work, 90 novel bacterial proteins (non-homologous to known proteins) are used to evaluate the capability of SVMProt. These proteins are such that none of their homologs are in the Swiss-Prot database, their functions are not clearly described in the literature, and they themselves and their homologs are not included in the training sets of SVMProt. They represent proteins whose function cannot be confidently predicted by sequence similarity methods at present. The predicted functional class of 76.7% of these proteins shows various levels of consistency with the literature-described function, compared to the overall accuracy of 87% for the SVMProt functional class assignment of 34,582 proteins that have at least one homolog of known function. Our study suggests that SVMProt is capable of assigning functional class for novel bacterial proteins at a level not much lower than that of sequence alignment methods for homologous proteins.
Affiliation(s)
- J Cui
- Bioinformatics and Drug Design Group, Department of Computational Science, National University of Singapore, Singapore
|
194
|
Fuzzy k-Nearest Neighbor Method for Protein Secondary Structure Prediction and Its Parallel Implementation. Computational Intelligence and Bioinformatics 2006. [DOI: 10.1007/11816102_48] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 12/02/2022]
|
195
|
Wang B, Chen P, Huang DS, Li JJ, Lok TM, Lyu MR. Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Lett 2005; 580:380-4. [PMID: 16376878 DOI: 10.1016/j.febslet.2005.11.081] [Citation(s) in RCA: 102] [Impact Index Per Article: 5.1] [Received: 10/17/2005] [Revised: 11/29/2005] [Accepted: 11/30/2005] [Indexed: 12/01/2022]
Abstract
This paper proposes a novel method that can predict protein interaction sites in heterocomplexes using residue spatial sequence profile and evolution rate approaches. The former represents the information of multiple sequence alignments, while the latter corresponds to a residue's evolutionary conservation score based on a phylogenetic tree. Three predictors using a support vector machine algorithm are constructed to predict whether a surface residue is part of a protein-protein interface. The efficiency and effectiveness of our proposed approach are verified by its better prediction performance compared with other models. The study is based on a non-redundant data set of heterodimers consisting of 69 protein chains.
Affiliation(s)
- Bing Wang
- Intelligent Computing Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031, China
|
196
|
Tsai CJ, Nussinov R. The implications of higher (or lower) success in secondary structure prediction of chain fragments. Protein Sci 2005; 14:1943-4. [PMID: 16046621 PMCID: PMC2279305 DOI: 10.1110/ps.051581805] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/25/2022]
Affiliation(s)
- Chung-Jung Tsai
- Basic Research Program, SAIC-Frederick Inc., Laboratory of Experimental and Computational Biology, NCI-Frederick, Frederick, MD 21701, USA
|
197
|
|
198
|
Han LY, Zheng CJ, Lin HH, Cui J, Li H, Zhang HL, Tang ZQ, Chen YZ. Prediction of functional class of novel plant proteins by a statistical learning method. The New Phytologist 2005; 168:109-21. [PMID: 16159326 DOI: 10.1111/j.1469-8137.2005.01482.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 05/04/2023]
Abstract
In plant genomes, the function of a substantial percentage of the putative protein-coding open reading frames (ORFs) is unknown. These ORFs have no significant sequence similarity to known proteins, which complicates the task of functional study of these proteins. Efforts are being made to explore methods that are complementary to, or may be used in combination with, sequence alignment and clustering methods. A web-based protein functional class prediction software, SVMProt, has shown some capability for predicting functional class of distantly related proteins. Here the usefulness of SVMProt for functional study of novel plant proteins is evaluated. To test SVMProt, 49 plant proteins (without a sequence homolog in the Swiss-Prot protein database, not in the SVMProt training set, and with functional indications provided in the literature) were selected from a comprehensive search of MEDLINE abstracts and Swiss-Prot databases in 1999-2004. These represent unique proteins the function of which, at present, cannot be confidently predicted by sequence alignment and clustering methods. The predicted functional class of 31 proteins was consistent, and that of four other proteins was weakly consistent, with published functions. Overall, the functional class of 71.4% of these proteins was consistent, or weakly consistent, with functional indications described in the literature. SVMProt shows a certain level of ability to provide useful hints about the functions of novel plant proteins with no similarity to known proteins.
Affiliation(s)
- L Y Han
- Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543
|
199
|
Abstract
The difficulties in predicting disulfide connectivity from protein sequences lie in the nonlocal properties of the disulfide bridges, which involve cysteine pairs at large sequence separation. Though some progress has recently been made in the prediction of disulfide connectivity, the current methods predict less than half of the disulfide patterns for a data set sharing less than 30% sequence identity. In this report, we use support vector machines based on sequence features such as the coupling between the local sequence environments of the cysteine pair, the cysteine sequence separations, and global sequence descriptors such as amino acid content. Our approach is able to predict 55% of the disulfide patterns of proteins with two to five disulfide bridges, which is 11-26% higher than other methods in the literature.
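To see why whole-pattern prediction is hard, it helps to count how many connectivity patterns are possible. The sketch below assumes that all 2n cysteines are bonded and that a pattern is a perfect pairing of them, which gives (2n)! / (n! * 2^n) patterns for n bridges; this counting argument is added for illustration and is not stated in the abstract, and the function name is hypothetical.

```python
import math

def count_disulfide_patterns(n_bridges):
    """Number of ways to pair 2n bonded cysteines into n bridges: (2n)! / (n! * 2**n)."""
    if n_bridges < 0:
        raise ValueError("number of bridges must be non-negative")
    n = n_bridges
    return math.factorial(2 * n) // (math.factorial(n) * 2 ** n)

# Pattern counts for the two-to-five bridge range covered in the abstract.
for n in range(2, 6):
    print(n, count_disulfide_patterns(n))
```

The count grows from 3 patterns for two bridges to 945 for five, so a random guess is correct far less often as the number of bridges increases.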
Affiliation(s)
- Yu-Ching Chen
- Institute of Bioinformatics, National Chiao Tung University, Taiwan, Republic of China
|
200
|
Lo SL, Cai CZ, Chen YZ, Chung MCM. Effect of training datasets on support vector machine prediction of protein-protein interactions. Proteomics 2005; 5:876-84. [PMID: 15717327 DOI: 10.1002/pmic.200401118] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.0] [Indexed: 11/09/2022]
Abstract
Knowledge of protein-protein interaction is useful for elucidating protein function via the concept of 'guilt-by-association'. A statistical learning method, Support Vector Machine (SVM), has recently been explored for the prediction of protein-protein interactions using artificial shuffled sequences as hypothetical noninteracting proteins, and it has shown promising results (Bock, J. R., Gough, D. A., Bioinformatics 2001, 17, 455-460). It remains unclear, however, how the prediction accuracy is affected if real protein sequences are used to represent noninteracting proteins. In this work, this effect is assessed by comparing the results derived from the use of real protein sequences with those derived from the use of shuffled sequences. The real protein sequences of hypothetical noninteracting proteins are generated from an exclusion analysis in combination with subcellular localization information of interacting proteins found in the Database of Interacting Proteins. Prediction accuracy using real protein sequences is 76.9%, compared to 94.1% using artificial shuffled sequences. The discrepancy likely arises because separating two sets of real protein sequences is expected to be more difficult than separating a set of real protein sequences from a set of artificial sequences. The use of real protein sequences for training an SVM classification system is expected to give better prediction results in practical cases. This is tested by using both SVM systems to predict putative protein partners of a set of thioredoxin-related proteins. The prediction results are consistent with observations, suggesting that real sequences are more practically useful in the development of SVM classification systems for protein-protein interaction prediction.
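The artificial negatives discussed in this abstract are produced by shuffling real sequences, which keeps the amino acid composition but destroys the residue order. A minimal sketch, assuming a plain uniform shuffle; the function name, the seed parameter, and the example sequence are hypothetical and the exact shuffling scheme of the cited work may differ.

```python
import random

def shuffled_negative(sequence, seed=None):
    """Return a shuffled copy of a sequence: same composition, random residue order."""
    rng = random.Random(seed)  # local generator so the global random state is untouched
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)

# Hypothetical sequence for illustration only.
real = "MKTAYIAKQRQISFVKSHFSRQ"
fake = shuffled_negative(real, seed=0)
print(fake)
print(sorted(real) == sorted(fake))  # composition is preserved
```

Because shuffled sequences lack the local patterns of real proteins, they are easier to separate from real ones, which is consistent with the gap between the 94.1% and 76.9% accuracies reported in the abstract.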
Affiliation(s)
- Siaw Ling Lo
- Department of Biological Sciences, National University of Singapore
|