1
Xiao H, Zou Y, Wang J, Wan S. A Review for Artificial Intelligence Based Protein Subcellular Localization. Biomolecules 2024; 14:409. [PMID: 38672426 PMCID: PMC11048326 DOI: 10.3390/biom14040409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 03/21/2024] [Accepted: 03/25/2024] [Indexed: 04/28/2024] Open
Abstract
Proteins need to be located in appropriate spatiotemporal contexts to carry out their diverse biological functions. Mislocalized proteins may lead to a broad range of diseases, such as cancer and Alzheimer's disease. Knowing where a target protein resides within a cell will give insights into tailored drug design for a disease. As the gold validation standard, the conventional wet lab uses fluorescent microscopy imaging, immunoelectron microscopy, and fluorescent biomarker tags for protein subcellular location identification. However, the booming era of proteomics and high-throughput sequencing generates tons of newly discovered proteins, making protein subcellular localization by wet-lab experiments a mission impossible. To tackle this concern, in the past decades, artificial intelligence (AI) and machine learning (ML), especially deep learning methods, have made significant progress in this research area. In this article, we review the latest advances in AI-based method development in three typical types of approaches, including sequence-based, knowledge-based, and image-based methods. We also elaborately discuss existing challenges and future directions in AI-based method development in this research field.
Affiliation(s)
- Hanyu Xiao
  - Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Yijin Zou
  - College of Veterinary Medicine, China Agricultural University, Beijing 100193, China
- Jieqiong Wang
  - Department of Neurological Sciences, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Shibiao Wan
  - Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, USA

2
Yi W, Sun A, Liu M, Liu X, Zhang W, Dai Q. Comparative Study on Feature Selection in Protein Structure and Function Prediction. Computational and Mathematical Methods in Medicine 2022; 2022:1650693. [PMID: 36267316 PMCID: PMC9578875 DOI: 10.1155/2022/1650693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Accepted: 09/14/2022] [Indexed: 11/18/2022]
Abstract
Many effective methods extract and fuse different protein features to study the relationship between protein sequence, structure, and function. Because different methods favor different features, designing more effective predictors requires selecting the features that contribute most. This work focuses on feature selection methods in the study of protein structure and function, and systematically compares the efficiency of different feature selection methods in the prediction of protein structures, protein disorder, protein molecular chaperones, and protein solubility. The results show that the feature selection method based on nonlinear SVM performs best in protein structure prediction, protein disorder prediction, protein molecular chaperone prediction, and protein solubility prediction. After selection, prediction accuracy improves by 13.16% to 71%, especially for the Kmer and PSSM features of proteins.
Affiliation(s)
- Wenjing Yi
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Ao Sun
  - College of Informatics Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Manman Liu
  - College of Informatics Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Xiaoqing Liu
  - College of Sciences, Hangzhou Dianzi University, Hangzhou 310018, China
- Wei Zhang
  - College of Informatics Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Qi Dai
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China

3
Liu J, Tang X, Cui S, Guan X. Predicting the function of rice proteins through Multi-instance Multi-label Learning based on multiple features fusion. Brief Bioinform 2022; 23:6553933. [PMID: 35325033 DOI: 10.1093/bib/bbac095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2021] [Revised: 02/17/2022] [Accepted: 02/23/2022] [Indexed: 11/13/2022] Open
Abstract
Rice contains a large number of unannotated proteins with unknown functions, which are difficult to verify by biological experiments. Computational methods are therefore among the mainstream approaches to rice protein function prediction. Two representative groups of rice proteins, indica and japonica proteins, are selected as the experimental dataset. In this paper, two feature extraction methods (the residue couple model method and the pseudo amino acid composition method) are combined with Principal Component Analysis to design protein descriptive features. Moreover, based on the state-of-the-art MIML algorithm EnMIMLNN, a novel MIML learning framework, MK-EnMIMLNN, is proposed. The MK-EnMIMLNN algorithm is designed by learning a multiple kernel fusion function neural network. The experimental results show that the hybrid feature extraction method is better than either single feature extraction method. More importantly, the MK-EnMIMLNN algorithm is superior to most classic MIML learning algorithms, which demonstrates its effectiveness in rice protein function prediction.
Affiliation(s)
- Jing Liu
  - Information Engineering College, Shanghai Maritime University, 201306 Shanghai, China
- Xinghua Tang
  - Information Engineering College, Shanghai Maritime University, 201306 Shanghai, China
- Shuanglong Cui
  - Information Engineering College, Shanghai Maritime University, 201306 Shanghai, China
- Xiao Guan
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, 200093 Shanghai, China

4
Ge F, Zhang Y, Xu J, Muhammad A, Song J, Yu DJ. Prediction of disease-associated nsSNPs by integrating multi-scale ResNet models with deep feature fusion. Brief Bioinform 2022; 23:bbab530. [PMID: 34953462 PMCID: PMC8769912 DOI: 10.1093/bib/bbab530] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 09/02/2021] [Revised: 11/13/2021] [Accepted: 11/16/2021] [Indexed: 11/13/2022] Open
Abstract
More than 6000 human diseases have been recorded to be caused by non-synonymous single nucleotide polymorphisms (nsSNPs). Rapid and accurate prediction of pathogenic nsSNPs can improve our understanding of the principle and design of new drugs, which remains an unresolved challenge. In the present work, a new computational approach, termed MSRes-MutP, is proposed based on ResNet blocks with multi-scale kernel size to predict disease-associated nsSNPs. When fed the serial concatenation of the four extracted feature types, MSRes-MutP does not improve noticeably. To address this, a second model, FFMSRes-MutP, is developed, which utilizes a deep feature fusion strategy and multi-scale 2D-ResNet and 1D-ResNet blocks to extract relevant two-dimensional features and physicochemical properties. FFMSRes-MutP with the concatenated features achieves better performance than with individual features. The performance of FFMSRes-MutP is benchmarked on five different datasets. It achieves a Matthews correlation coefficient (MCC) of 0.593 and 0.618 on the PredictSNP and MMP datasets, which are 0.101 and 0.210 higher than those of the existing best method, PredictSNP1. When tested on the HumDiv and HumVar datasets, it achieves MCC of 0.9605 and 0.9507, and area under curve (AUC) of 0.9796 and 0.9748, which are 0.1747 and 0.2669, 0.0853 and 0.1335, respectively, higher than the existing best methods PolyPhen-2 and FATHMM (weighted). In addition, in a blind test using a third-party dataset, FFMSRes-MutP performs as the second-best predictor (with MCC and AUC of 0.5215 and 0.7633, respectively) when compared with the other four predictors. Extensive benchmarking experiments demonstrate that FFMSRes-MutP achieves effective feature fusion and can be explored as a useful approach for predicting disease-associated nsSNPs. The webserver is freely available at http://csbio.njust.edu.cn/bioinf/ffmsresmutp/ for academic use.
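The MCC values quoted in this abstract come from binary confusion-matrix counts. The following is a minimal sketch of the standard MCC formula, not code from the paper; the function name and the example counts are illustrative assumptions.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    tp/tn/fp/fn are the numbers of true positives, true negatives,
    false positives and false negatives.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is defined as 0 when any marginal is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    print(round(matthews_corrcoef(tp=90, tn=80, fp=20, fn=10), 4))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside AUC on imbalanced variant datasets.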
Affiliation(s)
- Fang Ge
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Ying Zhang
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Jian Xu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Arif Muhammad
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Jiangning Song
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China

5
Wang Y, Xu Y, Yang Z, Liu X, Dai Q. Using Recursive Feature Selection with Random Forest to Improve Protein Structural Class Prediction for Low-Similarity Sequences. Computational and Mathematical Methods in Medicine 2021; 2021:5529389. [PMID: 34055035 PMCID: PMC8123985 DOI: 10.1155/2021/5529389] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 02/07/2021] [Accepted: 04/28/2021] [Indexed: 11/20/2022]
Abstract
Many combinations of protein features are used to improve protein structural class prediction, but information redundancy is often ignored. To select the important features with strong classification ability, we propose a recursive feature selection method with random forest to improve protein structural class prediction. We evaluated the proposed method in four experiments and compared it with available competing prediction methods. The results indicate that the proposed feature selection method effectively improves the efficiency of protein structural class prediction. Fewer than 5% of the features are used, yet prediction accuracy improves by 4.6-13.3%. We further compared different protein features and found that the predicted secondary structural features achieve the best performance. This understanding can be used to design more powerful prediction methods for the protein structural class.
Affiliation(s)
- Yaoxin Wang
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Yingjie Xu
  - Qixin School, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Zhenyu Yang
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Xiaoqing Liu
  - College of Sciences, Hangzhou Dianzi University, Hangzhou 310018, China
- Qi Dai
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, China

6
Cui BL, Ding Y. Accurate Identification of Human Phosphorylated Proteins by Ensembling Supervised Kernel Self-organizing Maps. Mol Inform 2020; 39:e1900141. [PMID: 31994832 DOI: 10.1002/minf.201900141] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/20/2019] [Accepted: 12/20/2019] [Indexed: 12/15/2022]
Abstract
Protein phosphorylation is a vital physiological process, which plays a critical role in controlling survival differentiation, cell growth, metabolism and apoptosis. Accurately identifying whether a protein will be phosphorylated from its sequence alone is especially useful for both basic research and drug development. In this study, a new predictor specifically designed for the prediction of human phosphorylated proteins is proposed. The proposed method first trains two supervised kernel self-organizing maps (SKSOMs): one is trained with features from the protein physicochemical composition view, while the other is trained with features from the protein evolutionary information view. Then, the two trained SKSOMs are ensembled to perform the final prediction. Rigorous computational experiments show that the proposed method achieves 78.75% and 0.561 on ACC and MCC, which are 6.96% and 12.5% higher than those of the state-of-the-art predictor. Overall, the study demonstrates a new sensitive avenue to identify human phosphorylated proteins and could be readily extended to recognize phosphorylated proteins in other species.
Affiliation(s)
- Bei-Liang Cui
  - Network Information Center, Nanjing TECH University, Nanjing, 211816, P. R. China
- Yong Ding
  - Information Center, Nanjing Polytechnic Institute, Nanjing, 210084, P. R. China

7
Zhou H, Li Z, Zheng J, Long Q, Li Y, Liu T, Han B. Identification of Polygonatum odoratum based on support vector machine. Pharmacogn Mag 2020. [DOI: 10.4103/pm.pm_410_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2022] Open

8
Wang S, Wang X. Prediction of protein structural classes by different feature expressions based on 2-D wavelet denoising and fusion. BMC Bioinformatics 2019; 20:701. [PMID: 31874617 PMCID: PMC6929547 DOI: 10.1186/s12859-019-3276-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND: Protein structural class prediction is a heavily researched subject in bioinformatics that plays a vital role in protein functional analysis, protein folding recognition, rational drug design and other related fields. However, when traditional feature expression methods are adopted, the features usually contain considerable redundant information, which leads to a very low recognition rate of protein structural classes. RESULTS: We constructed a prediction model based on wavelet denoising using different feature expression methods. A new fusion idea, first fuse and then denoise, is proposed in this article. Two types of pseudo amino acid compositions are utilized to distill feature vectors. Then, a two-dimensional (2-D) wavelet denoising algorithm is used to remove the redundant information from the two extracted feature vectors. The two feature vectors based on parallel 2-D wavelet denoising are fused; the result is known as PWD-FU-PseAAC. The related source codes are available at https://github.com/Xiaoheng-Wang12/Wang-xiaoheng/tree/master. CONCLUSIONS: Experimental verification on three low-similarity datasets suggests that the proposed model achieves notably good results in the prediction of protein structural classes.
Affiliation(s)
- Shunfang Wang
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, People's Republic of China
- Xiaoheng Wang
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, People's Republic of China

9
Yang L, Gao H, Liu Z, Tang L. Identification of Phage Virion Proteins by Using the g-gap Tripeptide Composition. Lett Org Chem 2019. [DOI: 10.2174/1570178615666180910112813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Phages are widely distributed in locations populated by bacterial hosts. Phage proteins can be divided into two main categories with different functions: virion and non-virion proteins. In practice, phage virion proteins are mainly used to clarify the lysis mechanism of bacterial cells and to develop new antibacterial drugs. Accurate identification of phage virion proteins is therefore essential to understanding the phage lysis mechanism. Although some computational methods have focused on identifying virion proteins, the results are not satisfactory, leaving room for improvement. In this study, a new sequence-based method was proposed to identify phage virion proteins using g-gap tripeptide composition. In this approach, the protein features were first extracted from the g-gap tripeptide composition. Subsequently, we obtained an optimal feature subset by performing incremental feature selection (IFS) with information gain. Finally, the support vector machine (SVM) was used as the classifier to discriminate virion proteins from non-virion proteins. In a 10-fold cross-validation test, our proposed method achieved an accuracy of 97.40% with an AUC of 0.9958, which outperforms state-of-the-art methods. The result reveals that our proposed method could be a promising method for phage virion protein identification.
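The feature extraction step described in this abstract can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that a g-gap tripeptide takes three residues with exactly g residues skipped between consecutive picks, and that frequencies are normalized by the number of windows made up of standard amino acids. The function name is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_tripeptide_composition(sequence, g=0):
    """Return a dict mapping each of the 8000 tripeptides to its frequency.

    Assumption: the three residues sit at positions i, i+g+1 and i+2g+2,
    so g=0 gives ordinary adjacent tripeptides.
    """
    step = g + 1
    span = 2 * step  # offset between the first and third residue
    counts = {"".join(t): 0 for t in product(AMINO_ACIDS, repeat=3)}
    total = 0
    for i in range(len(sequence) - span):
        tri = sequence[i] + sequence[i + step] + sequence[i + span]
        if tri in counts:  # skip windows with non-standard residues
            counts[tri] += 1
            total += 1
    if total == 0:
        return {k: 0.0 for k in counts}
    return {k: v / total for k, v in counts.items()}


if __name__ == "__main__":
    comp = g_gap_tripeptide_composition("ACDACD", g=1)
    print({k: v for k, v in comp.items() if v > 0})
```

The resulting 8000-dimensional vector is what a feature ranking step such as IFS would then prune before classification.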
Affiliation(s)
- Liangwei Yang
  - School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hui Gao
  - School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Zhen Liu
  - School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Lixia Tang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China

10
Contreras-Torres E. Predicting structural classes of proteins by incorporating their global and local physicochemical and conformational properties into general Chou's PseAAC. J Theor Biol 2018; 454:139-145. [DOI: 10.1016/j.jtbi.2018.05.033] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Received: 04/25/2018] [Revised: 05/23/2018] [Accepted: 05/28/2018] [Indexed: 11/24/2022]

11
Liang Y, Zhang S. Identify Gram-negative bacterial secreted protein types by incorporating different modes of PSSM into Chou’s general PseAAC via Kullback–Leibler divergence. J Theor Biol 2018; 454:22-29. [DOI: 10.1016/j.jtbi.2018.05.035] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 04/10/2018] [Revised: 05/19/2018] [Accepted: 05/29/2018] [Indexed: 12/14/2022]

12
OOgenesis_Pred: A sequence-based method for predicting oogenesis proteins by six different modes of Chou's pseudo amino acid composition. J Theor Biol 2017; 414:128-136. [DOI: 10.1016/j.jtbi.2016.11.028] [Citation(s) in RCA: 68] [Impact Index Per Article: 9.7] [Received: 09/17/2016] [Revised: 11/25/2016] [Accepted: 11/29/2016] [Indexed: 12/22/2022]

13
Hu J, Han K, Li Y, Yang JY, Shen HB, Yu DJ. TargetCrys: protein crystallization prediction by fusing multi-view features with two-layered SVM. Amino Acids 2016; 48:2533-2547. [DOI: 10.1007/s00726-016-2274-4] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 07/30/2015] [Accepted: 06/07/2016] [Indexed: 12/12/2022]

14
Xiufeng W, Lei Z, Rongbo H, Qinghua W, Jianxin M, Na M, Laicheng L. [Regulatory mechanism of hormones of the pituitary-target gland axes in kidney-Yang deficiency based on a support vector machine model]. J Tradit Chin Med 2015; 35:238-43. [PMID: 25975060 DOI: 10.1016/s0254-6272(15)30035-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 11/20/2022]
Abstract
OBJECTIVE: To study the development mechanism of kidney-Yang deficiency by establishing support vector machine models of the relevant hormones of the pituitary-target gland axes in rats with kidney-Yang deficiency syndrome. METHODS: The kidney-Yang deficiency rat model was created by intramuscular injection of hydrocortisone. The contents of the hormones of the pituitary-thyroid axis (thyroid stimulating hormone (TSH), 3,3',5-triiodothyronine (T3) and thyroxine (T4)), the pituitary-adrenal gland axis (adrenocorticotropic hormone (ACTH) and cortisol (CORT)), and the pituitary-gonadal axis (luteinizing hormone (LH), follicle-stimulating hormone (FSH), and testosterone (T)) were determined in the early, middle, and advanced stages. Ten support vector regression (SVR) models of the hormones were established to analyze the mutual relationships among the hormones of the three axes. RESULTS: The feedback control action of the pituitary-adrenal axis began to lose efficacy from the middle stage of kidney-Yang deficiency. The contents of all hormones of the three pituitary-target gland axes decreased in the advanced stage. The relative errors of the jackknife test of the SVR models were all less than 10%. CONCLUSION: Imbalances in mutual regulation among the hormones of the pituitary-target gland axes, especially loss of effectiveness of the pituitary-adrenal axis, are one pathogenic mechanism of kidney-Yang deficiency. The SVR model can accurately reflect the complicated non-linear relationships among pituitary-target gland axes in rats with kidney-Yang deficiency.

15
Marrero-Ponce Y, Contreras-Torres E, García-Jacas CR, Barigye SJ, Cubillán N, Alvarado YJ. Novel 3D bio-macromolecular bilinear descriptors for protein science: Predicting protein structural classes. J Theor Biol 2015; 374:125-37. [DOI: 10.1016/j.jtbi.2015.03.026] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 09/17/2014] [Revised: 02/23/2015] [Accepted: 03/20/2015] [Indexed: 12/11/2022]

16
Using protein granularity to extract the protein sequence features. J Theor Biol 2013; 331:48-53. [DOI: 10.1016/j.jtbi.2013.04.019] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/09/2012] [Revised: 04/16/2013] [Accepted: 04/18/2013] [Indexed: 11/21/2022]

17
Learning protein multi-view features in complex space. Amino Acids 2013; 44:1365-79. [DOI: 10.1007/s00726-013-1472-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 06/13/2012] [Accepted: 02/13/2013] [Indexed: 12/11/2022]

18
Yu D, Wu X, Shen H, Yang J, Tang Z, Qi Y, Yang J. Enhancing Membrane Protein Subcellular Localization Prediction by Parallel Fusion of Multi-View Features. IEEE Trans Nanobioscience 2012; 11:375-85. [PMID: 22875262 DOI: 10.1109/tnb.2012.2208473] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Indexed: 11/06/2022]
Affiliation(s)
- Dongjun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China

19
Karaçali B. Hierarchical motif vectors for prediction of functional sites in amino acid sequences using quasi-supervised learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1432-1441. [PMID: 22585139 DOI: 10.1109/tcbb.2012.68] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 05/31/2023]
Abstract
We propose hierarchical motif vectors to represent local amino acid sequence configurations for predicting the functional attributes of amino acid sites on a global scale in a quasi-supervised learning framework. The motif vectors are constructed via wavelet decomposition on the variations of physico-chemical amino acid properties along the sequences. We then formulate a prediction scheme for the functional attributes of amino acid sites in terms of the respective motif vectors using the quasi-supervised learning algorithm that carries out predictions for all sites in consideration using only the experimentally verified sites. We have carried out comparative performance evaluation of the proposed method on the prediction of N-glycosylation of 55,184 sites possessing the consensus N-glycosylation sequon identified over 15,104 human proteins, out of which only 1,939 were experimentally verified N-glycosylation sites. In the experiments, the proposed method achieved better predictive performance than the alternative strategies from the literature. In addition, the predicted N-glycosylation sites showed good agreement with existing potential annotations, while the novel predictions belonged to proteins known to be modified by glycosylation.
Affiliation(s)
- Bilge Karaçali
  - Department of Electrical and Electronics Engineering, Izmir Institute of Technology, Urla Izmir, Turkey

20

21
Chen C, Li SX, Wang SM, Liang SW. Multiple information contents derived from the chromatograms and their application to the modeling of quantitative profile–efficacy relationship. Anal Chim Acta 2012; 713:30-5. [DOI: 10.1016/j.aca.2011.11.029] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 09/01/2011] [Revised: 10/24/2011] [Accepted: 11/14/2011] [Indexed: 11/30/2022]

22
Du P, Li T, Wang X. Recent progress in predicting protein sub-subcellular locations. Expert Rev Proteomics 2011; 8:391-404. [PMID: 21679119 DOI: 10.1586/epr.11.20] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.5] [Indexed: 11/08/2022]
Abstract
In the last two decades, the number of the known protein sequences increased very rapidly. However, a knowledge of protein function only exists for a small portion of these sequences. Since the experimental approaches for determining protein functions are costly and time consuming, in silico methods have been introduced to bridge the gap between knowledge of protein sequences and their functions. Knowing the subcellular location of a protein is considered to be a critical step in understanding its biological functions. Many efforts have been undertaken to predict the protein subcellular locations in silico. With the accumulation of available data, the substructures of some subcellular organelles, such as the cell nucleus, mitochondria and chloroplasts, have been taken into consideration by several studies in recent years. These studies create a new research topic, namely 'protein sub-subcellular location prediction', which goes one level deeper than classic protein subcellular location prediction.
Affiliation(s)
- Pufeng Du
  - School of Computer Science and Technology, Tianjin University, Tianjin 300072, China

23
Qiu JD, Sun XY, Suo SB, Shi SP, Huang SY, Liang RP, Zhang L. Predicting homo-oligomers and hetero-oligomers by pseudo-amino acid composition: An approach from discrete wavelet transformation. Biochimie 2011; 93:1132-8. [DOI: 10.1016/j.biochi.2011.03.010] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 11/17/2010] [Accepted: 03/28/2011] [Indexed: 12/16/2022]

24
A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites. PLoS One 2011; 6:e20592. [PMID: 21698097 PMCID: PMC3117797 DOI: 10.1371/journal.pone.0020592] [Citation(s) in RCA: 176] [Impact Index Per Article: 13.5] [Received: 02/26/2011] [Accepted: 05/04/2011] [Indexed: 11/21/2022] Open
Abstract
Prediction of protein subcellular localization is a challenging problem, particularly when the system concerned contains both singleplex and multiplex proteins. In this paper, by introducing the “multi-label scale” and hybridizing the information of gene ontology with the sequential evolution information, a novel predictor called iLoc-Gneg is developed for predicting the subcellular localization of Gram-negative bacterial proteins with both single-location and multiple-location sites. To facilitate comparison, the same stringent benchmark dataset used to estimate the accuracy of Gneg-mPLoc was adopted to demonstrate the power of iLoc-Gneg. The dataset contains 1,392 Gram-negative bacterial proteins classified into the following eight locations: (1) cytoplasm, (2) extracellular, (3) fimbrium, (4) flagellum, (5) inner membrane, (6) nucleoid, (7) outer membrane, and (8) periplasm. Of the 1,392 proteins, 1,328 each have only one subcellular location and the other 64 each have two subcellular locations, but none of the proteins included has pairwise sequence identity to any other in the same subset (subcellular location). The overall success rate of iLoc-Gneg by jackknife test on this stringent benchmark dataset was over 91%, which is about 6% higher than that of Gneg-mPLoc. As a user-friendly web-server, iLoc-Gneg is freely accessible to the public at http://icpr.jci.edu.cn/bioinfo/iLoc-Gneg. A step-by-step guide is also provided on how to use the web-server to get the desired results. Furthermore, for the user's convenience, the iLoc-Gneg web-server accepts batch job submission, which is not available in the existing version of the Gneg-mPLoc web-server. It is anticipated that iLoc-Gneg may become a useful high-throughput tool for Molecular Cell Biology, Proteomics, System Biology, and Drug Development.

25
Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol 2010; 273:236-47. [PMID: 21168420 PMCID: PMC7125570 DOI: 10.1016/j.jtbi.2010.12.024] [Citation(s) in RCA: 966] [Impact Index Per Article: 69.0] [Received: 10/21/2010] [Revised: 12/08/2010] [Accepted: 12/13/2010] [Indexed: 11/29/2022]
Abstract
With the accomplishment of human genome sequencing, the number of sequence-known proteins has increased explosively. In contrast, the pace is much slower in determining their biological attributes. As a consequence, the gap between sequence-known proteins and attribute-known proteins has become increasingly large. The unbalanced situation, which has critically limited our ability to utilize the newly discovered proteins in a timely manner for basic research and drug development, has called for developing computational methods or high-throughput automated tools for quickly and reliably identifying various attributes of uncharacterized proteins based on their sequence information alone. Indeed, during the last two decades or so, many methods in this regard have been established in the hope of bridging such a gap. In the course of developing these methods, the following aspects often needed to be considered: (1) benchmark dataset construction, (2) protein sample formulation, (3) operating algorithm (or engine), (4) anticipated accuracy, and (5) web-server establishment. In this review, we discuss each of the five procedures, with a special focus on the introduction of pseudo amino acid composition (PseAAC), its different modes and applications as well as its recent development, particularly in how to use the general formulation of PseAAC to reflect the core and essential features that are deeply hidden in complicated protein sequences.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA.
|
26
|
AFP-Pred: A random forest approach for predicting antifreeze proteins from sequence-derived properties. J Theor Biol 2010; 270:56-62. [PMID: 21056045 DOI: 10.1016/j.jtbi.2010.10.037] [Citation(s) in RCA: 191] [Impact Index Per Article: 13.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2010] [Revised: 10/29/2010] [Accepted: 10/29/2010] [Indexed: 12/11/2022]
Abstract
Some creatures living in extremely low temperatures can produce special materials called "antifreeze proteins" (AFPs), which can prevent the cell and body fluids from freezing. AFPs are present in vertebrates, invertebrates, plants, bacteria, fungi, etc. Although AFPs have a common function, they show a high degree of diversity in sequences and structures. Therefore, sequence-similarity-based search methods often fail to predict AFPs from sequence databases. In this work, we report a random forest approach, "AFP-Pred", for the prediction of antifreeze proteins from protein sequence. AFP-Pred was trained on a dataset containing 300 AFPs and 300 non-AFPs and tested on a dataset containing 181 AFPs and 9193 non-AFPs. AFP-Pred achieved 81.33% accuracy from training and 83.38% from testing. The performance of AFP-Pred was compared with BLAST and HMM. The high prediction accuracy and the successful prediction of hypothetical proteins suggest that AFP-Pred can be a useful approach to identify antifreeze proteins from sequence information, irrespective of their sequence similarity.
|
27
|
iFC²: an integrated web-server for improved prediction of protein structural class, fold type, and secondary structure content. Amino Acids 2010; 40:963-73. [PMID: 20730460 DOI: 10.1007/s00726-010-0721-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2010] [Accepted: 08/06/2010] [Indexed: 10/19/2022]
Abstract
Several descriptors of protein structure at the sequence and residue levels have been recently proposed. They are widely adopted in the analysis and prediction of structural and functional characteristics of proteins. Numerous in silico methods have been developed for sequence-based prediction of these descriptors. However, many of them do not have a public web-server and only a few integrate multiple descriptors to improve the predictions. We introduce the iFC² (integrated prediction of fold, class, and content) server, which is the first to integrate three modern predictors of sequence-level descriptors. They concern fold type (PFRES), structural class (SCEC), and secondary structure content (PSSC-core). The server exploits relations between the three descriptors to implement a cross-evaluation procedure that improves over the predictions of the individual methods. The iFC² server annotates fold and class predictions as potentially correct/incorrect. When tested on datasets with low-similarity chains, for the fold prediction iFC² labels 82% of the PFRES predictions as correct and the accuracy of these predictions equals 72%. The accuracy of the remaining PFRES predictions equals 38%. Similarly, our server assigns correct labels for over 79% of SCEC predictions, which are shown to be 98% accurate, while the remaining SCEC predictions are only 15% accurate. These results are shown to be competitive when contrasted against recent relevant web-servers. Predictions on CASP8 targets show that the content predicted by iFC² is competitive when compared with the content computed from the tertiary structures predicted by the three best-performing methods in CASP8. The iFC² server is available at http://biomine.ece.ualberta.ca/1D/1D.html.
|
28
|
Li Z, Zhou X, Dai Z, Zou X. Classification of G-protein coupled receptors based on support vector machine with maximum relevance minimum redundancy and genetic algorithm. BMC Bioinformatics 2010; 11:325. [PMID: 20550715 PMCID: PMC2905366 DOI: 10.1186/1471-2105-11-325] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2009] [Accepted: 06/16/2010] [Indexed: 11/25/2022] Open
Abstract
Background Because a priori knowledge about the function of G protein-coupled receptors (GPCRs) can provide useful information to pharmaceutical research, the determination of their function is a meaningful topic in protein science. However, with the rapid increase of GPCR sequences entering into databanks, the gap between the number of known sequences and the number of known functions is widening rapidly, and it is both time-consuming and expensive to determine their function based only on experimental techniques. Therefore, it is vitally important to develop a computational method for quick and accurate classification of GPCRs. Results In this study, a novel three-layer predictor based on support vector machine (SVM) and feature selection is developed for predicting and classifying GPCRs directly from amino acid sequence data. The maximum relevance minimum redundancy (mRMR) method is applied to pre-evaluate features with discriminative information, while a genetic algorithm (GA) is utilized to find the optimized feature subsets. SVM is used for the construction of classification models. The overall accuracies of the three-layer predictor at the levels of superfamily, family and subfamily are obtained by cross-validation tests on two non-redundant datasets. The results are about 0.5% to 16% higher than those of GPCR-CA and GPCRPred. Conclusion The results with high success rates indicate that the proposed predictor is a useful automated tool in predicting GPCRs. GPCR-SVMFS, a corresponding executable program for GPCR prediction and classification, can be acquired freely on request from the authors.
Affiliation(s)
- Zhanchao Li
- School of Chemistry and Chemical Engineering, Sun Yat-Sen University, Guangzhou 510275, PR China
|
29
|
Esmaeili M, Mohabatkar H, Mohsenzadeh S. Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses. J Theor Biol 2010; 263:203-9. [DOI: 10.1016/j.jtbi.2009.11.016] [Citation(s) in RCA: 241] [Impact Index Per Article: 17.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2009] [Revised: 11/18/2009] [Accepted: 11/20/2009] [Indexed: 01/25/2023]
|
30
|
Xi L, Du J, Li S, Li J, Liu H, Yao X. A combined molecular modeling study on gelatinases and their potent inhibitors. J Comput Chem 2010; 31:24-42. [DOI: 10.1002/jcc.21279] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023]
|
31
|
Mizianty MJ, Kurgan L. Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences. BMC Bioinformatics 2009; 10:414. [PMID: 20003388 PMCID: PMC2805645 DOI: 10.1186/1471-2105-10-414] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2009] [Accepted: 12/13/2009] [Indexed: 11/13/2022] Open
Abstract
Background Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences. Results The proposed MODular Approach to Structural class prediction (MODAS) method is unique as it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes including conservation of residues and arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozen modern competing predictors show that MODAS provides the best overall accuracy, which ranges between 80% and 96.7% (83.5% for the twilight-zone datasets), depending on the dataset. This translates into 19% and 8% error rate reductions when compared against the best-performing competing method on the two largest datasets. The proposed predictor provides predictions at 58% accuracy for the membrane proteins class, which is not considered by the majority of existing methods, even though this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes.
Conclusions The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/.
Affiliation(s)
- Marcin J Mizianty
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada.
|
32
|
A network-QSAR model for prediction of genetic-component biomarkers in human colorectal cancer. J Theor Biol 2009; 261:449-58. [DOI: 10.1016/j.jtbi.2009.07.031] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2009] [Revised: 07/20/2009] [Accepted: 07/25/2009] [Indexed: 11/23/2022]
|
33
|
Prediction of mitochondrial proteins of malaria parasite using split amino acid composition and PSSM profile. Amino Acids 2009; 39:101-10. [DOI: 10.1007/s00726-009-0381-1] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2009] [Accepted: 10/20/2009] [Indexed: 10/20/2022]
|
34
|
Gao QB, Ye XF, Jin ZC, He J. Improving discrimination of outer membrane proteins by fusing different forms of pseudo amino acid composition. Anal Biochem 2009; 398:52-9. [PMID: 19874797 DOI: 10.1016/j.ab.2009.10.040] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2009] [Revised: 10/21/2009] [Accepted: 10/22/2009] [Indexed: 10/20/2022]
Abstract
Integral membrane proteins are central to many cellular processes and constitute approximately 50% of potential targets for novel drugs. However, the number of outer membrane proteins (OMPs) present in the public structure database is very limited due to the difficulties in determining structure with experimental methods. Therefore, discriminating OMPs from non-OMPs with computational methods is of medical importance as well as genome sequencing necessity. In this study, some sequence-derived structural and physicochemical features of proteins were incorporated with amino acid composition to discriminate OMPs from non-OMPs using support vector machines. The discrimination performance of the proposed method is evaluated on a benchmark dataset of 208 OMPs, 673 globular proteins, and 206 alpha-helical membrane proteins. A high overall accuracy of 97.8% was observed in the 5-fold cross-validation test. In addition, the current method distinguished OMPs from globular proteins and alpha-helical membrane proteins with overall accuracies of 98.2 and 96.4%, respectively. The prediction performance is superior to the state-of-the-art methods in the literature. It is anticipated that the current method might be a powerful tool for the discrimination of OMPs.
Affiliation(s)
- Qing-Bin Gao
- Department of Health Statistics, Second Military Medical University, No. 800 Xiangyin Road, Shanghai 200433, China.
|
35
|
Xiao X, Wang P, Chou KC. GPCR-CA: A cellular automaton image approach for predicting G-protein-coupled receptor functional classes. J Comput Chem 2009; 30:1414-23. [PMID: 19037861 DOI: 10.1002/jcc.21163] [Citation(s) in RCA: 126] [Impact Index Per Article: 8.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
Given an uncharacterized protein sequence, how can we identify whether it is a G-protein-coupled receptor (GPCR) or not? If it is, which functional family class does it belong to? It is important to address these questions because GPCRs are among the most frequent targets of therapeutic drugs and the information thus obtained is very useful for "comparative and evolutionary pharmacology," a technique often used for drug development. Here, we present a web-server predictor called "GPCR-CA," where "CA" stands for "Cellular Automaton" (Wolfram, S. Nature 1984, 311, 419), meaning that the CA images have been utilized to reveal the pattern features hidden in piles of long and complicated protein sequences. Meanwhile, the gray-level co-occurrence matrix factors extracted from the CA images are used to represent the samples of proteins through their pseudo amino acid composition (Chou, K.C. Proteins 2001, 43, 246). GPCR-CA is a two-layer predictor: the first-layer prediction engine is for identifying a query protein as GPCR or non-GPCR; if it is a GPCR protein, the process will be automatically continued with the second-layer prediction engine to further identify its type among the following six functional classes: (a) rhodopsin-like, (b) secretin-like, (c) metabotrophic/glutamate/pheromone, (d) fungal pheromone, (e) cAMP receptor, and (f) frizzled/smoothened family. The overall success rates by the predictor for the first and second layers are over 91% and 83%, respectively; these were obtained through rigorous jackknife cross-validation tests on a newly constructed stringent benchmark dataset in which none of the proteins has ≥40% pairwise sequence identity to any other in a same subset. GPCR-CA is freely accessible at http://218.65.61.89:8080/bioinfo/GPCR-CA, by which one can get the desired two-layer results for a query protein sequence within about 20 seconds.
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 33300, China.
|
36
|
Anand A, Suganthan PN. Multiclass cancer classification by support vector machines with class-wise optimized genes and probability estimates. J Theor Biol 2009; 259:533-40. [PMID: 19406131 DOI: 10.1016/j.jtbi.2009.04.013] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2008] [Revised: 02/11/2009] [Accepted: 04/20/2009] [Indexed: 11/15/2022]
Abstract
We investigate the multiclass classification of cancer microarray samples. In contrast to classification of two cancer types from gene expression data, multiclass classification of more than two cancer types is a relatively hard and less-studied problem. We used class-wise optimized genes with a corresponding one-versus-all support vector machine (OVA-SVM) classifier to maximize the utilization of selected genes. The final prediction was made by using probability scores from all classifiers. We used three different methods of estimating probability from the decision value. Among the three probability methods, Platt's approach was more consistent, whereas the isotonic approach performed better for datasets with unequal proportions of samples in different classes. A probability-based decision not only gives a true and fair comparison between different one-versus-all (OVA) classifiers but also gives the possibility of using them for any post analysis. Several ensemble experiments, an example of post analysis, of the three probability methods were implemented to study their effect in improving the classification accuracy. We observe that the ensemble did help in improving the predictive accuracy on cancer datasets, especially those involving unbalanced samples. A four-fold external stratified cross-validation experiment was performed on the six multiclass cancer datasets to obtain unbiased estimates of prediction accuracies. Analysis of class-wise frequently selected genes on two cancer datasets demonstrated that the approach was able to select important and relevant genes consistent with the literature. This study demonstrates successful implementation of the framework of class-wise feature selection and multiclass classification for prediction of cancer subtypes on six datasets.
Affiliation(s)
- Ashish Anand
- School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, S2-B2a-21, Singapore 639798, Singapore
|
37
|
Yang JY, Peng ZL, Yu ZG, Zhang RJ, Anh V, Wang D. Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation. J Theor Biol 2009; 257:618-26. [DOI: 10.1016/j.jtbi.2008.12.027] [Citation(s) in RCA: 92] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2008] [Revised: 11/07/2008] [Accepted: 12/19/2008] [Indexed: 11/17/2022]
|
38
|
Liu T, Zheng X, Wang J. Prediction of protein structural class using a complexity-based distance measure. Amino Acids 2009; 38:721-8. [PMID: 19330425 DOI: 10.1007/s00726-009-0276-1] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2008] [Accepted: 03/11/2009] [Indexed: 11/30/2022]
Abstract
Knowledge of structural class plays an important role in understanding protein folding patterns, so it is necessary to develop effective and reliable computational methods for prediction of protein structural class. To this end, we present a new method called NN-CDM, a nearest neighbor classifier with a complexity-based distance measure. Instead of extracting features from protein sequences as done previously, the distance between each pair of protein sequences is directly evaluated by a complexity measure of symbol sequences. Then the nearest neighbor classifier is adopted as the predictive engine. To verify the performance of this method, jackknife cross-validation tests are performed on several benchmark datasets. Results show that our approach achieves higher prediction accuracy than some classical methods.
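As a rough illustration of the idea described in this abstract (a nearest neighbor classifier driven by a complexity-based distance and evaluated with a jackknife test), the sketch below may help. It is not the NN-CDM implementation: the abstract does not specify the complexity measure or the distance formula, so the Lempel-Ziv phrase count and the normalized distance used here are assumptions, and all function names and toy sequences are hypothetical.

```python
from typing import List, Tuple

def lz_complexity(s: str) -> int:
    """Count phrases in a simple Lempel-Ziv (1976-style) parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the current phrase while it already occurs in the preceding text.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def complexity_distance(x: str, y: str) -> float:
    """Assumed normalized distance: how much complexity y adds on top of x.

    Both sequences are assumed to be non-empty.
    """
    cx, cy, cxy = lz_complexity(x), lz_complexity(y), lz_complexity(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def predict(query: str, training: List[Tuple[str, str]]) -> str:
    """Return the label of the training sequence closest to the query."""
    return min(training, key=lambda item: complexity_distance(query, item[0]))[1]

def jackknife_accuracy(data: List[Tuple[str, str]]) -> float:
    """Leave-one-out test: each sequence is predicted from all the others."""
    hits = 0
    for i, (seq, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += predict(seq, rest) == label
    return hits / len(data)

if __name__ == "__main__":
    # Toy sequences with made-up class labels, for illustration only.
    toy = [("AAAAAAAA", "alpha"), ("AAAAAA", "alpha"),
           ("VTVTVTVT", "beta"), ("VTVTVT", "beta")]
    print(predict("AAAAAAA", toy))
    print(jackknife_accuracy(toy))
```

No features are extracted here; the only input to the classifier is the pairwise distance between raw sequences, which is the design point the abstract emphasizes.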
Affiliation(s)
- Taigang Liu
- Department of Applied Mathematics, Dalian University of Technology, 116024 Dalian, China.
|
39
|
Predicting DNA- and RNA-binding proteins from sequences with kernel methods. J Theor Biol 2009; 258:289-93. [PMID: 19490865 DOI: 10.1016/j.jtbi.2009.01.024] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2008] [Revised: 12/08/2008] [Accepted: 01/19/2009] [Indexed: 11/20/2022]
Abstract
In this paper, support vector machines (SVMs) are applied to predict nucleic-acid-binding proteins. We constructed two classifiers to differentiate DNA/RNA-binding proteins from non-nucleic-acid-binding proteins by using a conjoint triad feature, which extracts information directly from the amino acid sequence of a protein. Both self-consistency and jackknife tests show promising results on protein datasets in which the sequence identity is less than 25%. In the self-consistency test, the predictive accuracy is 90.37% for DNA-binding proteins and 89.70% for RNA-binding proteins. In the jackknife test, the predictive accuracies are 78.93% and 76.75%, respectively. Comparison results show that our method is very competitive, outperforming other previously published sequence-based prediction methods.
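For readers unfamiliar with the conjoint triad feature mentioned in this abstract, a minimal sketch follows. It assumes the commonly used seven-class grouping of the 20 amino acids and a min-max normalization of triad counts; the abstract itself states neither, so both are assumptions, and the names GROUPS, CLASS_OF, TRIADS and conjoint_triad are hypothetical.

```python
from itertools import product
from typing import List

# Assumed seven-class grouping of amino acids (by dipole and side-chain volume).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: str(k) for k, group in enumerate(GROUPS) for aa in group}
# All 7 * 7 * 7 = 343 possible class triads, in a fixed order.
TRIADS = ["".join(t) for t in product("0123456", repeat=3)]

def conjoint_triad(seq: str) -> List[float]:
    """Map a protein sequence to a 343-dimensional triad-frequency vector."""
    # Replace each residue by its class; skip non-standard symbols such as X.
    encoded = [CLASS_OF[aa] for aa in seq.upper() if aa in CLASS_OF]
    counts = dict.fromkeys(TRIADS, 0)
    # Slide a window of three consecutive residues along the sequence.
    for i in range(len(encoded) - 2):
        counts["".join(encoded[i:i + 3])] += 1
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    # Min-max normalization (an assumption) reduces the effect of sequence length.
    return [(counts[t] - lo) / span if span else 0.0 for t in TRIADS]

if __name__ == "__main__":
    vec = conjoint_triad("MKRAGVDEC")  # made-up sequence for illustration
    print(len(vec))
```

The resulting fixed-length vector can be passed to any standard classifier; the abstract reports using SVMs for that step.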
|
40
|
Kirillova S, Kumar S, Carugo O. Protein domain boundary predictions: a structural biology perspective. Open Biochem J 2009; 3:1-8. [PMID: 19401756 PMCID: PMC2669640 DOI: 10.2174/1874091x00903010001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2008] [Revised: 11/27/2008] [Accepted: 11/29/2008] [Indexed: 11/22/2022] Open
Abstract
Structural biology is one of the important fields in which computational tools for domain boundary prediction can be applied. They can be used to design protein constructs that must be expressed in a stable and functional form and must produce diffraction-quality crystals. However, prediction of protein domain boundaries on the basis of amino acid sequences is still very problematic. In the present study, the performance of several computational approaches is compared. It is observed that the statistical significance of most of the predictions is rather poor. Nevertheless, when the number of domains is correctly predicted, domain boundaries are predicted within very few residues of their real location. It can be concluded that prediction methods cannot yet be used as routine tools in structural biology, though some of them are rather promising.
Affiliation(s)
- Svetlana Kirillova
- Department of Biomolecular Structural Chemistry, Max F. Perutz Laboratories, Vienna University, Campus Vienna, Biocenter 5, A-1030, Vienna
|