1
Information entropy-based differential evolution with extremely randomized trees and LightGBM for protein structural class prediction. Appl Soft Comput 2023. [DOI: 10.1016/j.asoc.2023.110064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
2
Peyravi F, Latif A, Moshtaghioun SM. Protein tertiary structure prediction using hidden Markov model based on lattice. J Bioinform Comput Biol 2019; 17:1950007. [PMID: 31057069 DOI: 10.1142/s0219720019500070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The prediction of protein structure from its amino acid sequence is one of the most prominent problems in computational biology. The biological function of a protein depends on its tertiary structure, which is determined by its amino acid sequence via the process of protein folding. We propose a novel fold recognition method for protein tertiary structure prediction based on a hidden Markov model and the 3D coordinates of amino acid residues. The method introduces states based on the basis vectors of Bravais cubic lattices to learn the path of the amino acids of the proteins in each fold. Three hidden Markov models are considered, based on simple cubic, body-centered cubic (BCC) and face-centered cubic (FCC) lattices. A 10-fold cross-validation was performed on a SCOP dataset of 42 folds. The proposed composite methodology is compared with HMM-based fold recognition methods that use only the amino acid sequence or only the secondary structure. In the overall experiment, the model based on face-centered cubic lattices is more accurate than SAM, 3-HMM optimized and Markov chain optimized. The large amount of 3D spatial data helps the model outperform methods that use only primary or only secondary structure.
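The lattice-based state definition described in this abstract can be illustrated with a minimal Python sketch. It assumes that the states correspond to the nearest-neighbour moves of each cubic lattice (6 for simple cubic, 8 for BCC, 12 for FCC) and that each step between consecutive residues is assigned to the move with the highest cosine similarity. The function names and the assignment rule are illustrative assumptions, not the authors' published procedure.

```python
from itertools import product
import math

def lattice_moves(kind):
    """Return the nearest-neighbour displacement vectors of a cubic lattice."""
    if kind == "sc":    # simple cubic: unit steps along one axis
        return [v for v in product((-1, 0, 1), repeat=3)
                if sum(abs(c) for c in v) == 1]
    if kind == "bcc":   # body-centered cubic: steps along body diagonals
        return list(product((-1, 1), repeat=3))
    if kind == "fcc":   # face-centered cubic: steps along face diagonals
        return [v for v in product((-1, 0, 1), repeat=3)
                if sum(abs(c) for c in v) == 2]
    raise ValueError(f"unknown lattice: {kind}")

def nearest_state(step, moves):
    """Index of the lattice move most closely aligned with `step`.

    Assumes `step` is non-zero (consecutive residues have distinct coordinates).
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    return max(range(len(moves)), key=lambda i: cosine(step, moves[i]))

def encode_path(coords, kind="fcc"):
    """Map a chain of 3D residue coordinates to a sequence of lattice states."""
    moves = lattice_moves(kind)
    steps = [tuple(b[i] - a[i] for i in range(3))
             for a, b in zip(coords, coords[1:])]
    return [nearest_state(s, moves) for s in steps]
```

A state sequence of this kind could then serve as the observation or hidden-state alphabet of an HMM trained per fold. How the states actually enter the authors' model is not specified in the abstract.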
Affiliation(s)
- Farzad Peyravi
- Department of Computer Engineering, Yazd University, Yazd, Iran
3
Kong L, Zhang L, Han X, Lv J. Protein Structural Class Prediction Based on Distance-related Statistical Features from Graphical Representation of Predicted Secondary Structure. Lett Org Chem 2019. [DOI: 10.2174/1570178615666180914110451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Protein structural class prediction is beneficial to protein structure and function analysis. Exploring good feature representations is a key step for this prediction task. Prior works have demonstrated the effectiveness of secondary-structure-based feature extraction methods, especially for low-similarity protein sequences. However, the prediction accuracies remain limited. To explore the potential of secondary structure information, a novel feature extraction method based on a generalized chaos game representation of predicted secondary structure is proposed. Each protein sequence is converted into a 20-dimensional distance-related statistical feature vector that characterizes the distribution of secondary structure elements and segments. The feature vectors are then fed into a support vector machine classifier to predict the protein structural class. Our experiments on three widely used low-similarity benchmark datasets (25PDB, 1189 and 640) show that the proposed method achieves superior performance to the state-of-the-art methods. It is anticipated that our method could be extended to other graphical representations of protein sequence and be helpful in future protein research.
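As a rough illustration of the graphical representation mentioned in this abstract, the following Python sketch applies a standard chaos game iteration to a three-state secondary structure string (H, E, C). The triangle vertex layout, the starting point and the function name are assumptions made for illustration. The authors' generalized representation and their 20 distance-related features are not reproduced here.

```python
# Assumed vertex layout: one triangle corner per secondary structure state.
VERTICES = {"H": (0.0, 0.0), "E": (1.0, 0.0), "C": (0.5, 1.0)}

def cgr_points(ss, start=(0.5, 0.5)):
    """Chaos game representation of a secondary structure string.

    Each point is the midpoint between the previous point and the
    vertex assigned to the current state.
    """
    x, y = start
    points = []
    for state in ss:
        vx, vy = VERTICES[state]
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points
```

Statistics over the resulting point cloud, such as distances between points, could then be collected into a fixed-length feature vector for a classifier.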
Affiliation(s)
- Liang Kong
- School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao, China
- Lichao Zhang
- College of Sciences, Northeastern University, Shenyang, China
- Jinfeng Lv
- School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao, China
4
Oldfield CJ, Chen K, Kurgan L. Computational Prediction of Secondary and Supersecondary Structures from Protein Sequences. Methods Mol Biol 2019; 1958:73-100. [PMID: 30945214 DOI: 10.1007/978-1-4939-9161-7_4] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 12/12/2022]
Abstract
Many new methods for the sequence-based prediction of the secondary and supersecondary structures have been developed over the last several years. These and older sequence-based predictors are widely applied for the characterization and prediction of protein structure and function. These efforts have produced countless accurate predictors, many of which rely on state-of-the-art machine learning models and evolutionary information generated from multiple sequence alignments. We describe and motivate both types of predictions. We introduce concepts related to the annotation and computational prediction of the three-state and eight-state secondary structure as well as several types of supersecondary structures, such as β hairpins, coiled coils, and α-turn-α motifs. We review 34 predictors focusing on recent tools and provide detailed information for a selected set of 14 secondary structure and 3 supersecondary structure predictors. We conclude with several practical notes for the end users of these predictive methods.
Affiliation(s)
- Christopher J Oldfield
- Department of Computer Science, College of Engineering, Virginia Commonwealth University, Richmond, VA, USA
- Ke Chen
- School of Computer Science and Software Engineering, Tianjin Polytechnic University, Tianjin, People's Republic of China
- Lukasz Kurgan
- Department of Computer Science, College of Engineering, Virginia Commonwealth University, Richmond, VA, USA
5
Zhu XJ, Feng CQ, Lai HY, Chen W, Hao L. Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowl Based Syst 2019. [DOI: 10.1016/j.knosys.2018.10.007] [Citation(s) in RCA: 69] [Impact Index Per Article: 13.8] [Indexed: 12/15/2022]
6
A Composite Approach to Protein Tertiary Structure Prediction: Hidden Markov Model Based on Lattice. Bull Math Biol 2018; 81:899-918. [DOI: 10.1007/s11538-018-00542-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/16/2018] [Accepted: 11/28/2018] [Indexed: 11/25/2022]
7
Zhang L, Kong L. iRSpot-ADPM: Identify recombination spots by incorporating the associated dinucleotide product model into Chou's pseudo components. J Theor Biol 2018; 441:1-8. [PMID: 29305179 DOI: 10.1016/j.jtbi.2017.12.025] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Received: 09/19/2017] [Revised: 12/18/2017] [Accepted: 12/24/2017] [Indexed: 10/18/2022]
Abstract
Gene recombination is a key process that produces hereditary differences. Recombination spot identification plays an important role in revealing genome evolution and promoting the study of DNA function. However, traditional experiments cannot keep pace with the huge amounts of DNA sequence data generated by sequencing. Some machine learning methods have been proposed to speed up this identification process. However, the correlations between nucleotide pairs at different positions along the DNA sequence, which reflect important sequence order information, are often ignored. To address this, this study proposes a novel feature extraction method, called iRSpot-ADPM, based on DNA properties in a given DNA sequence. A total of 85 features are selected from the original feature set according to the weights calculated by a support vector machine. Five-fold cross-validation tests on two widely used benchmark datasets indicate that the proposed method outperforms its existing counterparts on individual specificity (Spec), Matthews correlation coefficient (MCC) and overall accuracy (OA). The experimental results show that the proposed method is effective for accurate recombination spot identification. Moreover, it is anticipated that the proposed method could be extended to other biological sequences and be helpful in future research. The datasets and Matlab source code can be downloaded from the URL: http://stxy.neuq.edu.cn/info/1095/1157.htm.
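The three evaluation measures named in this abstract have standard definitions for binary classification, sketched below in Python. The function name and the example counts are illustrative and are not taken from the paper. The sketch assumes at least one negative sample, so that specificity is defined.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Specificity, Matthews correlation coefficient and overall accuracy
    computed from the four confusion-matrix counts."""
    spec = tn / (tn + fp)                      # true negative rate
    oa = (tp + tn) / (tp + fp + tn + fn)       # fraction classified correctly
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Spec": spec, "MCC": mcc, "OA": oa}
```

With hypothetical counts of 40 true positives, 10 false positives, 40 true negatives and 10 false negatives, this gives Spec = 0.8, OA = 0.8 and MCC = 0.6.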
Affiliation(s)
- Lichao Zhang
- School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao 066004, PR China.
- Liang Kong
- School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao 066004, PR China
8
Liang Y, Zhang S. Predict protein structural class by incorporating two different modes of evolutionary information into Chou's general pseudo amino acid composition. J Mol Graph Model 2017; 78:110-117. [DOI: 10.1016/j.jmgm.2017.10.003] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 08/30/2017] [Revised: 10/03/2017] [Accepted: 10/03/2017] [Indexed: 11/27/2022]
9
Yu B, Lou L, Li S, Zhang Y, Qiu W, Wu X, Wang M, Tian B. Prediction of protein structural class for low-similarity sequences using Chou’s pseudo amino acid composition and wavelet denoising. J Mol Graph Model 2017; 76:260-273. [DOI: 10.1016/j.jmgm.2017.07.012] [Citation(s) in RCA: 60] [Impact Index Per Article: 8.6] [Received: 04/21/2017] [Revised: 07/11/2017] [Accepted: 07/12/2017] [Indexed: 11/25/2022]
10
Yuan M, Yang Z, Huang G, Ji G. Feature selection by maximizing correlation information for integrated high-dimensional protein data. Pattern Recognit Lett 2017. [DOI: 10.1016/j.patrec.2017.03.011] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 11/27/2022]
11
Meher PK, Sahu TK, Rao AR, Wahi SD. A computational approach for prediction of donor splice sites with improved accuracy. J Theor Biol 2016; 404:285-294. [PMID: 27302911 DOI: 10.1016/j.jtbi.2016.06.013] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/18/2015] [Revised: 04/18/2016] [Accepted: 06/09/2016] [Indexed: 11/24/2022]
Abstract
Identification of splice sites is important due to their key role in predicting the exon-intron structure of protein coding genes. Though several approaches have been developed for the prediction of splice sites, further improvement in prediction accuracy will help predict gene structure more accurately. This paper presents a computational approach for prediction of donor splice sites with higher accuracy. In this approach, true and false splice sites were first encoded into numeric vectors and then used as input to an artificial neural network (ANN), a support vector machine (SVM) and a random forest (RF) for prediction. ANN and SVM performed equally well, and better than RF, when tested on the HS3D and NN269 datasets. Further, the performance of ANN, SVM and RF was analyzed using an independent test set of 50 genes, on which the prediction accuracy of ANN was higher than that of SVM and RF. All the predictors achieved higher accuracy than existing methods such as NNsplice, MEM, MDD, WMM, MM1, FSPLICE, GeneID and ASSP on the independent test set. We have also developed an online prediction server (PreDOSS), available at http://cabgrid.res.in:8080/predoss, for prediction of donor splice sites using the proposed approach.
Affiliation(s)
- Prabina Kumar Meher
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
- Tanmaya Kumar Sahu
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
- A R Rao
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
- S D Wahi
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
12
Olyaee MH, Yaghoubi A, Yaghoobi M. Predicting protein structural classes based on complex networks and recurrence analysis. J Theor Biol 2016; 404:375-382. [DOI: 10.1016/j.jtbi.2016.06.018] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 02/17/2016] [Revised: 05/25/2016] [Accepted: 06/15/2016] [Indexed: 11/24/2022]
13
Zhang L, Kong L, Han X, Lv J. Structural class prediction of protein using novel feature extraction method from chaos game representation of predicted secondary structure. J Theor Biol 2016; 400:1-10. [DOI: 10.1016/j.jtbi.2016.04.011] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 01/11/2016] [Revised: 03/18/2016] [Accepted: 04/08/2016] [Indexed: 11/30/2022]
14
Kong L, Kong L, Jing R. Improving the Prediction of Protein Structural Class for Low-Similarity Sequences by Incorporating Evolutionary and Structural Information. Journal of Advanced Computational Intelligence and Intelligent Informatics 2016. [DOI: 10.20965/jaciii.2016.p0402] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Abstract
Protein structural class prediction is beneficial for studying protein function, regulation and interactions. However, protein structural class prediction for low-similarity sequences (i.e., below 40% pairwise sequence similarity) remains a challenging problem. In this study, a novel computational method is proposed to accurately predict protein structural class for low-similarity sequences. This method is based on a support vector machine in conjunction with integrated features from predicted secondary structure and from evolutionary information generated with the position-specific iterative basic local alignment search tool (PSI-BLAST). Prediction accuracies evaluated by jackknife tests are reported on two widely used low-similarity benchmark datasets (25PDB and 1189), reaching overall accuracies of 89.3% and 87.9%, respectively, which are significantly higher than those achieved by state-of-the-art methods in protein structural class prediction. The experimental results suggest that our method could serve as an effective alternative to existing methods in protein structural classification, especially for low-similarity sequences.
15
Wu CY, Li QZ, Feng ZX. Non-coding RNA identification based on topology secondary structure and reading frame in organelle genome level. Genomics 2015; 107:9-15. [PMID: 26697761 DOI: 10.1016/j.ygeno.2015.12.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 10/16/2015] [Revised: 12/08/2015] [Accepted: 12/12/2015] [Indexed: 10/22/2022]
Abstract
Non-coding RNA (ncRNA) genes produce transcripts just as protein-coding genes do, but ncRNAs function directly as RNAs rather than serving as blueprints for proteins. As the function of ncRNA is closely related to organelle genomes, it is desirable to explore ncRNA function by confirming its provenance. In this paper, the topology secondary structure, motifs and the triplets under three reading frames are considered as parameters of ncRNAs. A support vector machine (SVM) combined with the increment of diversity (ID) algorithm is used to construct the classifier. When the method is applied to an ncRNA dataset with less than 80% sequence identity, the overall accuracies reach 95.57% and 96.40% in the five-fold cross-validation and the jackknife test, respectively. Further, on the independent testing dataset, the average prediction success rate of our method is 93.24%. The higher predictive success rates indicate that our method is very helpful for distinguishing ncRNAs from various organelle genomes.
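One of the parameter sets named in this abstract, the triplets under three reading frames, can be sketched in Python as follows. The function name and the choice of raw counts as the feature are assumptions for illustration. The abstract does not state how the authors normalise or combine these counts.

```python
from collections import Counter

def frame_triplets(seq):
    """Count non-overlapping triplets in each of the three reading frames.

    Returns a dict mapping frame offset (0, 1, 2) to a Counter of triplets.
    Trailing bases that do not fill a complete triplet are ignored.
    """
    seq = seq.upper()
    counts = {}
    for frame in range(3):
        triplets = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        counts[frame] = Counter(triplets)
    return counts
```

For the sequence `AUGGCC`, frame 0 contains `AUG` and `GCC`, frame 1 contains `UGG`, and frame 2 contains `GGC`.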
Affiliation(s)
- Cheng-Yan Wu
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Qian-Zhong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Zhen-Xing Feng
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
16
Li X, Liu T, Tao P, Wang C, Chen L. A highly accurate protein structural class prediction approach using auto cross covariance transformation and recursive feature elimination. Comput Biol Chem 2015; 59 Pt A:95-100. [DOI: 10.1016/j.compbiolchem.2015.08.012] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 11/14/2014] [Revised: 08/30/2015] [Accepted: 08/30/2015] [Indexed: 12/11/2022]