1
Shen J, Xia Y, Lu Y, Lu W, Qian M, Wu H, Fu Q, Chen J. Identification of membrane protein types via deep residual hypergraph neural network. Mathematical Biosciences and Engineering: MBE 2023; 20:20188-20212. [PMID: 38052642 DOI: 10.3934/mbe.2023894] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2023]
Abstract
A membrane protein's function is closely associated with its type, so identifying the types of membrane proteins is crucial. Conventional computational methods for identifying membrane protein types tend to ignore two issues: high-order correlations among membrane proteins and multi-modal representations of membrane proteins, which leads to information loss. To tackle these two issues, we propose a deep residual hypergraph neural network (DRHGNN), which enhances the hypergraph neural network (HGNN) with initial residual and identity mapping. We carried out extensive experiments on four benchmark datasets of membrane proteins and compared DRHGNN with recently developed advanced methods. The experimental results showed that DRHGNN performed better on the membrane protein classification task on the four datasets. The experiments also showed that, compared with HGNN, DRHGNN can handle the over-smoothing issue as the number of model layers increases. The code is available at https://github.com/yunfighting/Identification-of-Membrane-Protein-Types-via-deep-residual-hypergraph-neural-network.
Affiliation(s)
- Jiyun Shen
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Yiyi Xia
  - Tianping College of Suzhou University of Science and Technology, Suzhou, China
- Yiming Lu
  - Tianping College of Suzhou University of Science and Technology, Suzhou, China
- Weizhong Lu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
  - Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, China
- Meiling Qian
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Hongjie Wu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Qiming Fu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Jing Chen
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
2
Sun M, Hu H, Pang W, Zhou Y. ACP-BC: A Model for Accurate Identification of Anticancer Peptides Based on Fusion Features of Bidirectional Long Short-Term Memory and Chemically Derived Information. Int J Mol Sci 2023; 24:15447. [PMID: 37895128 PMCID: PMC10607064 DOI: 10.3390/ijms242015447] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/12/2023] [Revised: 09/10/2023] [Accepted: 10/20/2023] [Indexed: 10/29/2023]
Abstract
Anticancer peptides (ACPs) have been proven to possess potent anticancer activities. Although computational methods have emerged for rapid ACP identification, their accuracy still needs improvement. In this study, we propose ACP-BC, a three-channel end-to-end model that utilizes various combinations of data augmentation techniques. In the first channel, features are extracted from the raw sequence using a bidirectional long short-term memory network. In the second channel, the entire sequence is converted into a chemical molecular formula, which is further simplified using Simplified Molecular Input Line Entry System (SMILES) notation to obtain deep abstract features through Bidirectional Encoder Representations from Transformers (BERT). In the third channel, we manually selected four effective features: dipeptide composition, binary profile feature, k-mer sparse matrix, and pseudo amino acid composition. Notably, the application of chemical BERT to ACP prediction is novel and was successfully integrated into our model. To validate the performance of our model, we selected two benchmark datasets, ACPs740 and ACPs240. ACP-BC achieved prediction accuracies of 87% and 90% on these two datasets, respectively, representing improvements of 1.3% and 7% over existing state-of-the-art methods on these datasets. These systematic comparative experiments show that ACP-BC can effectively identify anticancer peptides.
Affiliation(s)
- Mingwei Sun
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Haoyuan Hu
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Wei Pang
  - School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, UK
- You Zhou
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
  - College of Software, Jilin University, Changchun 130012, China
3
Arican OC, Gumus O. PredDRBP-MLP: Prediction of DNA-binding proteins and RNA-binding proteins by multilayer perceptron. Comput Biol Med 2023; 164:107317. [PMID: 37562328 DOI: 10.1016/j.compbiomed.2023.107317] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/02/2023] [Revised: 07/27/2023] [Accepted: 08/07/2023] [Indexed: 08/12/2023]
Abstract
Proteins interact with many molecules in order to maintain vital activities in cells. Proteins that interact with DNA are called DNA-binding proteins (DBPs), and proteins that interact with RNA are called RNA-binding proteins (RBPs). Since DBPs and RBPs are involved in critical biological processes, their classification is important. Although the convolutional neural network and bidirectional long short-term memory hybrid model (CNN-BiLSTM) is very popular in DBP and RBP classification, it requires high processing power and long training times. Therefore, a multilayer perceptron (MLP) based predictor, PredDRBP-MLP (Predictor of DNA-Binding Proteins and RNA-Binding Proteins - Multilayer Perceptron), was developed in this study. PredDRBP-MLP is a machine learning model that performs multi-class classification of DBPs, RBPs and non-nucleic acid-binding proteins (NNABPs). Compared with existing predictors, PredDRBP-MLP achieved better results on the independent dataset, specifically in the NNABP class, while requiring less processing power and training faster than CNN-BiLSTM based predictors. In the NNABP class, PredDRBP-MLP achieved 0.578 precision, 0.522 recall and 0.549 F1-score, while the other multi-class predictor achieved 0.486 precision, 0.183 recall and 0.266 F1-score. A desktop application was developed for PredDRBP-MLP. The application is freely accessible at https://sourceforge.net/projects/preddrbp-mlp.
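The F1-scores quoted in this abstract can be checked against the quoted precision and recall values, assuming the standard definition of F1 as the harmonic mean of precision and recall (the abstract does not state the formula). The function name `f1_score` is illustrative and is not taken from the PredDRBP-MLP software.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (assumed F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# NNABP-class values as quoted in the abstract.
preddrbp_f1 = f1_score(0.578, 0.522)
other_f1 = f1_score(0.486, 0.183)

print(round(preddrbp_f1, 3))  # 0.549
print(round(other_f1, 3))     # 0.266
```

Both rounded values agree with the F1-scores reported in the abstract, so the three figures given for each predictor are internally consistent.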
Affiliation(s)
- Ozgur Can Arican
  - Department of Health Bioinformatics, Ege University, 35100, Izmir, Turkey.
- Ozgur Gumus
  - Department of Computer Engineering, Ege University, 35100, Izmir, Turkey.
4
Qian Y, Ding Y, Zou Q, Guo F. Multi-View Kernel Sparse Representation for Identification of Membrane Protein Types. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1234-1245. [PMID: 35857734 DOI: 10.1109/tcbb.2022.3191325] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Membrane proteins are the main carriers of biomembrane functions and play a vital role in many biological activities of organisms. Predicting membrane protein types greatly helps in determining the function of proteins and understanding the interactions of membrane proteins. However, biochemical experiments are expensive and not suitable for large-scale identification of membrane protein types. Therefore, computational methods have been used to improve the efficiency of biological experiments. Most existing computational methods use only a single protein feature, or use multiple features without integrating them well. In our study, the protein sequence is described via three different views (features): amino acid composition, evolutionary information and physicochemical properties of amino acids. To exploit information among all views (features), we introduce a coupling strategy for Kernel Sparse Representation based Classification (KSRC) and construct a new model called Multi-view KSRC (MvKSRC). We evaluate our method on 4 benchmark data sets of membrane proteins. The comparison results indicate that our method is much superior to all existing methods.
5
Wang H, Kwong CF, Liu Q, Liu Z, Chen Z. A Novel Artificial Intelligence System in Formulation Dissolution Prediction. Computational Intelligence and Neuroscience 2022; 2022:8640115. [PMID: 35978897 PMCID: PMC9377879 DOI: 10.1155/2022/8640115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2022] [Revised: 06/20/2022] [Accepted: 06/24/2022] [Indexed: 11/29/2022]
Abstract
Artificial neural network (ANN) techniques are widely used to screen data and predict experimental results in pharmaceutical studies. In this study, a novel dissolution result prediction and screening system based on a backpropagation network and regression methods was modeled. For this purpose, 21 groups of dissolution data were used to train and verify the ANN model. Because of the way the input data were designed, the related data could still be used to train the ANN model when the formulation composition was changed. Two regression methods, the effective data regression method (EDRM) and the reference line regression method (RLRM), allow this system to predict dissolution results with a high accuracy rate while using less data than the orthogonal experiment. A data screening function based on a decision tree is also implemented in this system. This ANN model provides a novel drug prediction system that reduces time and cost and facilitates the design of new formulations.
Affiliation(s)
- Haoyu Wang
  - Department of Electrical and Electronic Engineering, University of Nottingham Ningbo China, Ningbo, China
- Chiew Foong Kwong
  - Department of Electrical and Electronic Engineering, University of Nottingham Ningbo China, Ningbo, China
- Qianyu Liu
  - International Doctoral Innovation Centre, NingboTech University, Ningbo, China
- Zhixin Liu
  - Department of Outpatient, Liaoning Thrombus Treatment Center of Integrated Chinese and Western Medicine, Shenyang, China
- Zhiyuan Chen
  - Department of Mechanical, Materials and Manufacture, University of Nottingham Ningbo China, Ningbo, China
6
Hu J, Rao L, Zhu YH, Zhang GJ, Yu DJ. TargetDBP+: Enhancing the Performance of Identifying DNA-Binding Proteins via Weighted Convolutional Features. J Chem Inf Model 2021; 61:505-515. [PMID: 33410688 DOI: 10.1021/acs.jcim.0c00735] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/28/2022]
Abstract
Protein-DNA interactions exist ubiquitously and play important roles in the life cycles of living cells. The accurate identification of DNA-binding proteins (DBPs) is one of the key steps in understanding the mechanisms of protein-DNA interactions. Although many DBP identification methods have been proposed, the current performance is still unsatisfactory. In this study, a new method, called TargetDBP+, is developed to further enhance the performance of identifying DBPs. In TargetDBP+, five convolutional features are first extracted from five feature sources, i.e., amino acid one-hot matrix (AAOHM), position-specific scoring matrix (PSSM), predicted secondary structure probability matrix (PSSPM), predicted solvent accessibility probability matrix (PSAPM), and predicted probabilities of DNA-binding sites (PPDBSs); second, the five features are serially combined using weights for all of the elements, learned by the differential evolution algorithm; and finally, the DBP identification model of TargetDBP+ is trained using the support vector machine (SVM) algorithm. To evaluate TargetDBP+ and compare it with other existing methods, a new gold-standard benchmark data set, called UniSwiss, is constructed, which consists of 4881 DBPs and 4881 non-DBPs extracted from the UniProtKB/Swiss-Prot database. Experimental results demonstrate that TargetDBP+ obtains an accuracy of 85.83% and a precision of 88.45% while covering 82.41% of all DBP data on the independent validation subset of UniSwiss, with an MCC value (0.718) significantly higher than those of other state-of-the-art control methods. The web server of TargetDBP+ is accessible at http://csbio.njust.edu.cn/bioinf/targetdbpplus/; the UniSwiss data set and stand-alone program of TargetDBP+ are accessible at https://github.com/jun-csbio/TargetDBPplus.
Affiliation(s)
- Jun Hu
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, P. R. China
  - Key Laboratory of Data Science and Intelligence Application, Fujian Province University, Zhangzhou 363000, P. R. China
- Liang Rao
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, P. R. China
- Yi-Heng Zhu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing 210094, P. R. China
- Gui-Jun Zhang
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, P. R. China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing 210094, P. R. China
7
Zhang J, Lv L, Lu D, Kong D, Al-Alashaari MAA, Zhao X. Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors. BMC Bioinformatics 2020; 21:480. [PMID: 33109082 PMCID: PMC7590791 DOI: 10.1186/s12859-020-03826-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/21/2020] [Accepted: 10/19/2020] [Indexed: 12/13/2022]
Abstract
Background: Classification of certain proteins with specific functions is momentous for biological research. Encoding approaches of protein sequences for feature extraction play an important role in protein classification. Many computational methods (namely classifiers) are used for classification of protein sequences according to various encoding approaches. Commonly, protein sequences keep certain labels corresponding to different categories of biological functions (e.g., bacterial type IV secreted effectors or not), which makes protein prediction a fantasy. As to protein prediction, a kernel set of protein sequences keeping certain labels certified by biological experiments should exist in advance. However, this has hardly ever been seen in prevailing research. Therefore, unsupervised learning rather than supervised learning (e.g., classification) should be considered. As to protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. Besides, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered. Results: Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset containing 1947 protein sequences as a case, composed of 399 T4SE and 1548 non-T4SE, experiments are made to identify bacterial type IV secreted effectors (T4SE) from protein sequences. Comparable and quantified results are obtained using only certain components of the encoded feature, i.e., the position-specific scoring matrix, which indicates the effectiveness of our method. Conclusions: Certain variables, rather than the whole encoded feature they belong to, do work for discrimination between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers do achieve a better classification result.
Affiliation(s)
- Jian Zhang
  - College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Lixin Lv
  - College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Donglei Lu
  - College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Denan Kong
  - College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China
- Xudong Zhao
  - College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China.
8
Abstract
Background: Thermophilic proteins maintain good activity at high temperatures; studying them is therefore important for understanding the thermal stability of proteins.
Objective: To address the low precision and low efficiency of thermophilic protein prediction, a prediction method based on feature fusion and machine learning is proposed in this paper.
Methods: For the selected thermophilic data sets, the thermophilic protein sequences were first characterized by feature fusion, combining g-gap dipeptide, entropy density and autocorrelation coefficient. Then, Kernel Principal Component Analysis (KPCA) was used to reduce the dimension of the resulting protein sequence features in order to reduce the training time and improve efficiency. Finally, the classification model was built using a classification algorithm.
Results: A variety of classification algorithms were trained and tested on the selected thermophilic dataset. By comparison, the accuracy of the Support Vector Machine (SVM) under the jackknife method was over 92%. The other evaluation indicators also showed that the SVM performed best.
Conclusion: Because it uses an effective feature representation method and a robust classifier, the proposed method is suitable for predicting thermophilic proteins and is superior to most reported methods.
Affiliation(s)
- Xian-Fang Wang
  - School of Computer and Information Engineering, Henan Normal University, Henan, China
- Peng Gao
  - School of Computer and Information Engineering, Henan Normal University, Henan, China
- Yi-Feng Liu
  - School of Computer and Information Engineering, Henan Normal University, Henan, China
- Hong-Fei Li
  - School of Computer and Information Engineering, Henan Normal University, Henan, China
- Fan Lu
  - School of Computer and Information Engineering, Henan Normal University, Henan, China
9
Alphonse AS, Mary NAB, Starvin MS. Classification of membrane protein using Tetra Peptide Pattern. Anal Biochem 2020; 606:113845. [PMID: 32739352 DOI: 10.1016/j.ab.2020.113845] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/16/2020] [Revised: 06/17/2020] [Accepted: 06/22/2020] [Indexed: 11/29/2022]
Abstract
Membrane proteins play an important role in the life activities of organisms. The mechanism of cell structures and biological activities can be identified only by knowing the functional types of membrane proteins, which accelerates the process. Therefore, it is necessary to build computational approaches for timely and accurate prediction of the functional types of membrane proteins. The proposed method analyzes the structure of membrane proteins using a novel Tetra Peptide Pattern (TPP)-based feature extraction technique. A frequency occurrence matrix is created, from which a feature vector is formed. This feature vector captures the pattern among amino acids in a membrane protein sequence. The dimension of the feature vector is reduced using General Kernel-based Supervised Principal Component Analysis (GKSPCA). Stacked Restricted Boltzmann Machines (RBMs) in a Deep Belief Network (DBN) are used for classification. The RBM is the building block of the Deep Belief Network. The proposed method achieves good results on two datasets. Its performance was analyzed using accuracy, specificity, sensitivity and the Matthews correlation coefficient. The proposed method achieves good results when compared to other state-of-the-art techniques.
Affiliation(s)
- M S Starvin
  - University College of Engineering, Nagercoil, 629004, India.
10
Zhang X, Chen L. Prediction of membrane protein types by fusing protein-protein interaction and protein sequence information. Biochimica et Biophysica Acta - Proteins and Proteomics 2020; 1868:140524. [PMID: 32858174 DOI: 10.1016/j.bbapap.2020.140524] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/26/2020] [Revised: 07/17/2020] [Accepted: 07/30/2020] [Indexed: 11/30/2022]
Abstract
Membrane proteins are gatekeepers to the cell and essential for determining the function of cells. Identifying the types of membrane proteins is an essential problem in cell biology. It is time-consuming and expensive to identify the type of membrane proteins with traditional experimental methods. The alternative is to design effective computational methods, which can provide quick and reliable predictions. To date, several computational methods have been proposed in this regard. Several of them used features extracted from the sequence information of individual proteins. Recently, networks have become increasingly popular for tackling different protein-related problems, as they can organize proteins at a system level and give an overview of all proteins. However, such a representation weakens the essential properties of proteins, such as their sequence information. In this study, a novel feature fusion scheme was proposed, which integrated the information of protein sequences and the protein-protein interaction network. The fused features of a protein were defined as the linear combination of the sequence features of all proteins in the network, where the combination coefficients were the probabilities yielded by the random walk with restart algorithm with the protein as the seed node. Several models with such fused features and different classification algorithms were built and evaluated. Their performance in predicting the type of membrane proteins was improved compared with models using only the sequence features or only the network information.
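The fusion scheme described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the network is given as a symmetric adjacency matrix with no isolated nodes, the transition matrix is column-normalized, and the restart probability of 0.8 is an arbitrary illustrative value. The function names `random_walk_with_restart` and `fuse_features` are hypothetical.

```python
def random_walk_with_restart(adj, seed, restart=0.8, tol=1e-12, max_iter=1000):
    """Return the stationary visiting probabilities of a walk restarted at `seed`.

    `adj` is a square adjacency matrix (list of lists). The restart value is an
    assumption chosen for illustration; the abstract does not give one.
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    start = [1.0 if i == seed else 0.0 for i in range(n)]
    probs = start[:]
    for _ in range(max_iter):
        updated = []
        for i in range(n):
            # One step of the walk along column-normalized edges.
            walk = sum(adj[i][j] / col_sums[j] * probs[j]
                       for j in range(n) if col_sums[j] > 0)
            updated.append((1.0 - restart) * walk + restart * start[i])
        converged = sum(abs(a - b) for a, b in zip(updated, probs)) < tol
        probs = updated
        if converged:
            break
    return probs


def fuse_features(adj, features, seed, restart=0.8):
    """Linear combination of all proteins' sequence features, weighted by RWR."""
    probs = random_walk_with_restart(adj, seed, restart)
    dim = len(features[0])
    return [sum(probs[k] * features[k][d] for k in range(len(features)))
            for d in range(dim)]


# Toy network: three proteins on a path 0 - 1 - 2, with 2-dimensional features.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
sequence_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_features(adjacency, sequence_features, seed=0)
print(fused)
```

With a high restart probability the fused vector stays close to the seed protein's own sequence features, while a lower value lets network neighbours contribute more; how that trade-off was set in the paper is not stated in the abstract.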
Affiliation(s)
- Xiaolin Zhang
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China.
11
12
Qian L, Wen Y, Han G. Identification of Cancerlectins Using Support Vector Machines With Fusion of G-Gap Dipeptide. Front Genet 2020; 11:275. [PMID: 32318092 PMCID: PMC7147460 DOI: 10.3389/fgene.2020.00275] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 11/25/2019] [Accepted: 03/06/2020] [Indexed: 12/13/2022]
Abstract
Cancerlectins play an important role in the initiation, survival, growth, metastasis, and spread of cancer. Studying the function of cancerlectins is therefore highly significant, because it can help to identify tumor markers and support tumor prevention, treatment, and prognosis. However, numerous studies have generated a large amount of protein data, and traditional prediction methods have been unable to meet the needs of analysis. Developing powerful computational models based on these data to discriminate cancerlectins from non-cancerlectins on a large scale has been treated as one of the most important topics. In this study, we developed a feature extraction method to identify cancerlectins based on the fusion of g-gap dipeptides. Analysis of variance was used to select the optimal feature set, and a support vector machine was used to classify the data. The rigorous nested 10-fold cross-validation results demonstrated that our method obtained a prediction accuracy of 83.91% and a sensitivity of 83.15%. In addition, to evaluate the performance of the classification model constructed in this work, we constructed a new data set, on which the prediction accuracy reaches 83.3%. Experimental results show that the performance of our method is better than that of state-of-the-art methods.
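As an illustration of the feature type named in this abstract, the sketch below computes a g-gap dipeptide composition: the normalized frequency of each ordered pair of amino acids separated by `g` intervening residues. It assumes the 20 standard amino acids, skips pairs containing any other symbol, and "fuses" several gaps by simple concatenation; the abstract does not specify the fusion rule, so that choice and the function names are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def g_gap_dipeptide_composition(sequence, g):
    """Return 400 frequencies of residue pairs separated by g other residues."""
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:  # ignore pairs with non-standard residues
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in PAIRS]


def fuse_g_gaps(sequence, gaps):
    """Concatenate g-gap compositions for several gaps (assumed fusion rule)."""
    fused = []
    for g in gaps:
        fused.extend(g_gap_dipeptide_composition(sequence, g))
    return fused


vector = fuse_g_gaps("MKTAYIAKQR", gaps=[0, 1, 2])
print(len(vector))  # 1200
```

With `g = 0` this reduces to the ordinary adjacent dipeptide composition. A feature selection step such as the analysis of variance mentioned in the abstract would then operate on the concatenated vector.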
Affiliation(s)
- Lili Qian
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China
- Yaping Wen
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China
- Guosheng Han
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China
13
Identification of membrane protein types via multivariate information fusion with Hilbert–Schmidt Independence Criterion. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.11.103] [Citation(s) in RCA: 88] [Impact Index Per Article: 22.0] [Indexed: 11/24/2022]
14
Some illuminating remarks on molecular genetics and genomics as well as drug development. Mol Genet Genomics 2020; 295:261-274. [PMID: 31894399 DOI: 10.1007/s00438-019-01634-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/21/2019] [Accepted: 12/05/2019] [Indexed: 02/07/2023]
Abstract
Facing the explosive growth of biological sequences unearthed in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector while still retaining considerable sequence-order information or its special pattern. To deal with this challenging problem, the ideas of "pseudo amino acid components" and "pseudo K-tuple nucleotide composition" have been proposed. These ideas and their approaches have further stimulated the birth of the "distorted key theory" and the "wenxiang diagram", have substantially strengthened the power to treat multi-label systems, and have led to the establishment of the famous "5-steps rule". All these logical developments are quite natural and are very useful not only for theoretical scientists but also for experimental scientists conducting genetics/genomics analysis and drug development. Also presented in this review paper are their future perspectives; i.e., their impacts will become even more significant and profound.
15
Shao YT, Liu XX, Lu Z, Chou KC. pLoc_Deep-mHum: Predict Subcellular Localization of Human Proteins by Deep Learning. Natural Science 2020. [DOI: 10.4236/ns.2020.127042] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/20/2022]
16
Shao Y, Chou KC. pLoc_Deep-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by Deep Learning. Natural Science 2020. [DOI: 10.4236/ns.2020.126034] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/20/2022]
17
Guo L, Wang S, Li M, Cao Z. Accurate classification of membrane protein types based on sequence and evolutionary information using deep learning. BMC Bioinformatics 2019; 20:700. [PMID: 31874615 PMCID: PMC6929490 DOI: 10.1186/s12859-019-3275-6] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 12/18/2022]
Abstract
Background: Membrane proteins play an important role in the life activities of organisms. Knowing membrane protein types provides clues for understanding the structure and function of proteins. Though various computational methods for predicting membrane protein types have been developed, the results still do not meet the expectations of researchers. Results: We propose two deep learning models to process sequence information and evolutionary information, respectively. Both models obtained better results than traditional machine learning models. Furthermore, to improve the performance of the sequence information model, we also provide a new vector representation method to replace the one-hot encoding, whose overall success rate improved by 3.81% and 6.55% on two datasets. Finally, a more effective model is obtained by fusing the above two models, whose overall success rate reached 95.68% and 92.98% on two datasets. Conclusion: The final experimental results show that our method is more effective than existing methods for predicting membrane protein types, which can help laboratory researchers to identify the type of novel membrane proteins.
Affiliation(s)
- Lei Guo
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, People's Republic of China
- Shunfang Wang
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, People's Republic of China.
- Mingyuan Li
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, People's Republic of China
- Zicheng Cao
  - School of Public Health (Shenzhen), Sun Yat-sen University, Guangzhou, 510006, People's Republic of China
18
Chou KC. Advances in Predicting Subcellular Localization of Multi-label Proteins and its Implication for Developing Multi-target Drugs. Curr Med Chem 2019; 26:4918-4943. [PMID: 31060481 DOI: 10.2174/0929867326666190507082559] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Received: 09/28/2018] [Revised: 01/29/2019] [Accepted: 01/31/2019] [Indexed: 12/16/2022]
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most of the functions critical to the cell’s survival are performed by these proteins located in its different organelles, usually called “subcellular locations”. Information on the subcellular localization of a protein can provide useful clues about its function. To reveal the intricate pathways at the cellular level, knowledge of the subcellular localization of proteins in a cell is a prerequisite. Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing and selecting the right targets for drug development. Unfortunately, it is both time-consuming and costly to determine the subcellular locations of proteins purely based on experiments. With the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to develop computational methods for rapidly and effectively identifying the subcellular locations of uncharacterized proteins based on their sequence information alone. Considerable progress has been achieved in this regard. This review focuses on those methods that can deal with multi-label proteins, which may simultaneously exist in two or more subcellular location sites. Protein molecules with this characteristic are vitally important for finding multi-target drugs, a current hot trend in drug development. The review also focuses on those methods that have user-friendly web servers established, so that the majority of experimental scientists can use them to get the desired results without needing to go through the detailed mathematics involved.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
|
19
|
|
20
|
Chou KC. Proposing Pseudo Amino Acid Components is an Important Milestone for Proteome and Genome Analyses. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09910-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
|
21
|
|
22
|
Jayapriya K, Mary NAB. Employing a novel 2-gram subgroup intra pattern (2GSIP) with stacked auto encoder for membrane protein classification. Mol Biol Rep 2019; 46:2259-2272. [PMID: 30778923 DOI: 10.1007/s11033-019-04680-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2018] [Accepted: 02/07/2019] [Indexed: 12/01/2022]
Abstract
Cell membrane proteins play an essential role in regulating the behaviour of cells. Examination of amino acid sequences can offer useful insights into the tertiary structures of proteins and their biological functions. One of the important problems in amino acid analysis is the difficulty of establishing a digital coding system that better reflects the properties of amino acids and their degeneracy. To overcome this drawback, the proposed method is a novel representation of protein sequences that incorporates a new feature named 2-gram subgroup intra pattern. Classifying the functional types of membrane proteins helps to explain their biological functions. For classification, a Stacked Auto Encoder deep learning method is applied. The performance of the proposed method is evaluated on two benchmark data sets. The method was assessed using the self-consistency test, accuracy, specificity, sensitivity, Matthews correlation coefficient, the jackknife test and an independent data set test, and on these it outperformed other existing techniques generally used in the literature.
Affiliation(s)
- K Jayapriya
- Vin Solutions, Tirunelveli, Tamilnadu, India
|
23
|
Prediction of membrane protein types by exploring local discriminative information from evolutionary profiles. Anal Biochem 2019; 564-565:123-132. [DOI: 10.1016/j.ab.2018.10.027] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2018] [Revised: 10/23/2018] [Accepted: 10/25/2018] [Indexed: 11/17/2022]
|
24
|
Sankari ES, Manimegalai D. Predicting membrane protein types by incorporating a novel feature set into Chou's general PseAAC. J Theor Biol 2018; 455:319-328. [DOI: 10.1016/j.jtbi.2018.07.032] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2018] [Revised: 06/27/2018] [Accepted: 07/23/2018] [Indexed: 10/28/2022]
|
25
|
Butt AH, Rasool N, Khan YD. Predicting membrane proteins and their types by extracting various sequence features into Chou's general PseAAC. Mol Biol Rep 2018; 45:2295-2306. [PMID: 30238411 DOI: 10.1007/s11033-018-4391-5] [Citation(s) in RCA: 45] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2018] [Accepted: 09/14/2018] [Indexed: 11/30/2022]
Abstract
For many biological functions membrane proteins (MPs) are considered crucial. Due to this nature of MPs, many pharmaceutical agents have reflected them as attractive targets. It bears indispensable importance that MPs are predicted with accurate measures using effective and efficient computational models (CMs). Annotation of MPs using in vitro analytical techniques is time-consuming and expensive; and in some cases, it can prove to be intractable. Due to this scenario, automated prediction and annotation of MPs through CM based techniques have appeared to be useful. Based on the use of computational intelligence and statistical moments based feature set, an MP prediction framework is proposed. Furthermore, the previously used dataset has been enhanced by incorporating new MPs from the latest release of UniProtKB. Rigorous experimentation proves that the use of statistical moments with a multilayer neural network, trained using back-propagation based prediction techniques allows more thorough results.
Affiliation(s)
- Ahmad Hassan Butt
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, C-II, Johar Town, P.O. Box 10033, Lahore, 54770, Pakistan
- Nouman Rasool
- Department of Life Sciences, School of Science, University of Management and Technology, C-II, Johar Town, P.O. Box 10033, Lahore, 54770, Pakistan
- Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, C-II, Johar Town, P.O. Box 10033, Lahore, 54770, Pakistan
|
26
|
iMem-2LSAAC: A two-level model for discrimination of membrane proteins and their types by extending the notion of SAAC into chou's pseudo amino acid composition. J Theor Biol 2018; 442:11-21. [DOI: 10.1016/j.jtbi.2018.01.008] [Citation(s) in RCA: 83] [Impact Index Per Article: 13.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2017] [Revised: 12/23/2017] [Accepted: 01/10/2018] [Indexed: 02/08/2023]
|
27
|
Chen QY, Tang J, Du PF. Predicting protein lysine phosphoglycerylation sites by hybridizing many sequence based features. MOLECULAR BIOSYSTEMS 2018; 13:874-882. [PMID: 28396891 DOI: 10.1039/c6mb00875e] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Abstract
Post-translational modification (PTM) is essential for many biological processes. Covalent and generally enzymatic modification of proteins can impact the activity of proteins. Modified proteins would have more complex structures and functions. Knowing whether a specific residue is modified or not is significant to unravel the function and structure of this protein. As experimental approaches to discover protein PTM sites are always costly and time consuming, computational prediction methods are desirable alternative methods. Lysine phosphoglycerylation is a type of newly discovered PTM that is related to glycolytic process and glucose metabolism. Since the lysine phosphoglycerylation process requires no catalytic enzyme, its site selectivity mechanism is still not fully understood. In this study, we designed a novel computational method, namely PhoglyPred, to identify lysine phosphoglycerylation sites. By utilizing several different protein sequence descriptors, PhoglyPred achieved an overall accuracy of 90.3% in a Jackknife test, which is better than other state-of-the-art predictors. By analyzing the importance of different features using the F-score, we found several important sequence features, which may benefit future studies in understanding the site selectivity mechanism of lysine phosphoglycerylation.
Affiliation(s)
- Qing-Yun Chen
- School of Computer Science and Technology, Tianjin University, Tianjin 300350, China
|
28
|
iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget 2017; 7:16895-909. [PMID: 26942877 PMCID: PMC4941358 DOI: 10.18632/oncotarget.7815] [Citation(s) in RCA: 311] [Impact Index Per Article: 44.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2016] [Accepted: 02/11/2016] [Indexed: 02/07/2023] Open
Abstract
Cancer remains a major killer worldwide. Traditional methods of cancer treatment are expensive and have some deleterious side effects on normal cells. Fortunately, the discovery of anticancer peptides (ACPs) has paved a new way for cancer treatment. With the explosive growth of peptide sequences generated in the post genomic age, it is highly desired to develop computational methods for rapidly and effectively identifying ACPs, so as to speed up their application in treating cancer. Here we report a sequence-based predictor called iACP developed by the approach of optimizing the g-gap dipeptide components. It was demonstrated by rigorous cross-validations that the new predictor remarkably outperformed the existing predictors for the same purpose in both overall accuracy and stability. For the convenience of most experimental scientists, a publicly accessible web-server for iACP has been established at http://lin.uestc.edu.cn/server/iACP, by which users can easily obtain their desired results.
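The abstract above mentions g-gap dipeptide components without defining them. As a rough illustration, the sketch below computes a g-gap dipeptide composition under the common definition: the frequency of each ordered residue pair separated by g intervening positions, giving a 400-dimensional vector. The function name, the handling of non-standard residues, and the normalisation are assumptions made for this example, not details taken from the iACP paper, and the feature optimisation step the abstract refers to is not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dipeptide_composition(sequence: str, g: int) -> dict[str, float]:
    """Return the frequency of every residue pair separated by g positions.

    g = 0 gives the ordinary (adjacent) dipeptide composition.
    Pairs containing a non-standard residue are skipped (an assumption).
    """
    if g < 0:
        raise ValueError("g must be non-negative")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:  # sequence too short for this gap
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}


# Toy peptide (made up for illustration): with g = 1 the pairs are A_A and C_C.
composition = g_gap_dipeptide_composition("ACAC", g=1)
print(composition["AA"], composition["CC"])
```

In practice the gap g would be treated as a tunable parameter and the resulting vector passed to a classifier.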
|
29
|
Sankari ES, Manimegalai D. Predicting membrane protein types using various decision tree classifiers based on various modes of general PseAAC for imbalanced datasets. J Theor Biol 2017; 435:208-217. [PMID: 28941868 DOI: 10.1016/j.jtbi.2017.09.018] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2017] [Revised: 09/15/2017] [Accepted: 09/18/2017] [Indexed: 12/19/2022]
Abstract
Predicting membrane protein types is an important and challenging research area in bioinformatics and proteomics. Traditional biophysical methods are used to classify membrane protein types. Due to large exploration of uncharacterized protein sequences in databases, traditional methods are very time consuming, expensive and susceptible to errors. Hence, it is highly desirable to develop a robust, reliable, and efficient method to predict membrane protein types. Imbalanced datasets and large datasets are often handled well by decision tree classifiers. Since imbalanced datasets are taken, the performance of various decision tree classifiers such as Decision Tree (DT), Classification And Regression Tree (CART), C4.5, Random tree, REP (Reduced Error Pruning) tree, ensemble methods such as Adaboost, RUS (Random Under Sampling) boost, Rotation forest and Random forest are analysed. Among the various decision tree classifiers Random forest performs well in less time with good accuracy of 96.35%. Another inference is RUS boost decision tree classifier is able to classify one or two samples in the class with very less samples while the other classifiers such as DT, Adaboost, Rotation forest and Random forest are not sensitive for the classes with fewer samples. Also the performance of decision tree classifiers is compared with SVM (Support Vector Machine) and Naive Bayes classifier.
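The point about imbalanced datasets in the abstract above can be made concrete: a high overall accuracy may coexist with a minority class that is never recognised. The sketch below uses made-up labels (not data from the paper) and hypothetical helper names to compare overall accuracy with per-class recall.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each class: correctly predicted samples / samples in the class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in totals.items()}


# Toy labels (made up): 95 samples of a majority type, 5 of a minority type.
y_true = ["type_1"] * 95 + ["type_2"] * 5
y_pred = ["type_1"] * 100  # a classifier that never predicts the minority type

print(overall_accuracy(y_true, y_pred))  # 0.95
print(per_class_recall(y_true, y_pred))  # {'type_1': 1.0, 'type_2': 0.0}
```

Reporting per-class recall alongside overall accuracy shows whether a classifier is sensitive to the classes with few samples, which is the distinction the abstract draws between RUS boost and the other classifiers.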
Affiliation(s)
- E Siva Sankari
- Department of CSE, Government College of Engineering, Tirunelveli, Tamil Nadu, India
- D Manimegalai
- Department of IT, National Engineering College, Kovilpatti, Tamil Nadu, India
|
30
|
Liu B, Yang F, Huang DS, Chou KC. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics 2017; 34:33-40. [DOI: 10.1093/bioinformatics/btx579] [Citation(s) in RCA: 235] [Impact Index Per Article: 33.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2017] [Accepted: 09/13/2017] [Indexed: 12/30/2022] Open
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- The Gordon Life Science Institute, Boston, MA, USA
- Fan Yang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
- Kuo-Chen Chou
- The Gordon Life Science Institute, Boston, MA, USA
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Faculty of Computing and Information Technology in Rabigh, King Abdulaziz University, Jeddah, Saudi Arabia
|
31
|
Kumar R, Kumari B, Kumar M. Prediction of endoplasmic reticulum resident proteins using fragmented amino acid composition and support vector machine. PeerJ 2017; 5:e3561. [PMID: 28890846 PMCID: PMC5588793 DOI: 10.7717/peerj.3561] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2017] [Accepted: 06/20/2017] [Indexed: 12/15/2022] Open
Abstract
Background The endoplasmic reticulum plays an important role in many cellular processes, which include protein synthesis, folding and post-translational processing of newly synthesized proteins. It is also the site for quality control of misfolded proteins and the entry point of extracellular proteins to the secretory pathway. Hence at any given point of time, the endoplasmic reticulum contains two different cohorts of proteins: (i) proteins involved in endoplasmic reticulum-specific function, which reside in the lumen of the endoplasmic reticulum, called endoplasmic reticulum resident proteins, and (ii) proteins which are in the process of moving to the extracellular space. Thus, endoplasmic reticulum resident proteins must somehow be distinguished from newly synthesized secretory proteins, which pass through the endoplasmic reticulum on their way out of the cell. Only approximately 50% of the proteins used in this study as training data had an endoplasmic reticulum retention signal, which shows that these signals are not necessarily present in all endoplasmic reticulum resident proteins. This also strongly indicates the role of additional factors in retention of endoplasmic reticulum-specific proteins inside the endoplasmic reticulum. Methods This is a support vector machine based method, in which we used different forms of protein features as inputs for the support vector machine to develop the prediction models. During training, the leave-one-out approach of cross-validation was used. Maximum performance was obtained with a combination of amino acid compositions of different parts of proteins. Results In this study, we have reported a novel support vector machine based method for predicting endoplasmic reticulum resident proteins, named ERPred. During training we achieved a maximum accuracy of 81.42% with the leave-one-out approach of cross-validation. When evaluated on an independent dataset, ERPred achieved a sensitivity of 72.31% and a specificity of 83.69%. We have also annotated six different proteomes to predict the candidate endoplasmic reticulum resident proteins in them. A webserver, ERPred, was developed to make the method available to the scientific community, which can be accessed at http://proteininformatics.org/mkumar/erpred/index.html. Discussion We found that out of 124 proteins of the training dataset, only 66 proteins had endoplasmic reticulum retention signals, which shows that these signals are not an absolute necessity for endoplasmic reticulum resident proteins to remain inside the endoplasmic reticulum. This observation also strongly indicates the role of additional factors in retention of proteins inside the endoplasmic reticulum. Our proposed predictor, ERPred, is a signal independent tool. It is tuned for the prediction of endoplasmic reticulum resident proteins, even if the query protein does not contain a specific ER-retention signal.
Affiliation(s)
- Ravindra Kumar
- Department of Biophysics, University of Delhi South Campus, New Delhi, India
- Current affiliation: Newe-Ya'ar Research Center, Agricultural Research Organization, Ramat Yishay, Israel
- Bandana Kumari
- Department of Biophysics, University of Delhi South Campus, New Delhi, India
- Manish Kumar
- Department of Biophysics, University of Delhi South Campus, New Delhi, India
|
32
|
Wu H, Wang K, Lu L, Xue Y, Lyu Q, Jiang M. Deep Conditional Random Field Approach to Transmembrane Topology Prediction and Application to GPCR Three-Dimensional Structure Modeling. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:1106-1114. [PMID: 27576262 DOI: 10.1109/tcbb.2016.2602872] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Transmembrane proteins play important roles in cellular energy production, signal transmission, and metabolism. Many shallow machine learning methods have been applied to transmembrane topology prediction, but the performance was limited by the large size of membrane proteins and the complex biological evolution information behind the sequence. In this paper, we proposed a novel deep approach based on conditional random fields named as dCRF-TM for predicting the topology of transmembrane proteins. Conditional random fields take into account more complicated interrelation between residue labels in full-length sequence than HMM and SVM-based methods. Three widely-used datasets were employed in the benchmark. DCRF-TM had the accuracy 95 percent over helix location prediction and the accuracy 78 percent over helix number prediction. DCRF-TM demonstrated a more robust performance on large size proteins (>350 residues) against 11 state-of-the-art predictors. Further dCRF-TM was applied to ab initio modeling three-dimensional structures of seven-transmembrane receptors, also known as G protein-coupled receptors. The predictions on 24 solved G protein-coupled receptors and unsolved vasopressin V2 receptor illustrated that dCRF-TM helped abGPCR-I-TASSER to improve TM-score 34.3 percent rather than using the random transmembrane definition. Two out of five predicted models caught the experimental verified disulfide bonds in vasopressin V2 receptor.
|
33
|
Liu B, Yang F, Chou KC. 2L-piRNA: A Two-Layer Ensemble Classifier for Identifying Piwi-Interacting RNAs and Their Function. MOLECULAR THERAPY-NUCLEIC ACIDS 2017. [PMID: 28624202 PMCID: PMC5415553 DOI: 10.1016/j.omtn.2017.04.008] [Citation(s) in RCA: 194] [Impact Index Per Article: 27.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
Involved with important cellular or gene functions and implicated with many kinds of cancers, piRNAs, or piwi-interacting RNAs, are of small non-coding RNA with around 19–33 nt in length. Given a small non-coding RNA molecule, can we predict whether it is of piRNA according to its sequence information alone? Furthermore, there are two types of piRNA: one has the function of instructing target mRNA deadenylation, and the other does not. Can we discriminate one from the other? With the avalanche of RNA sequences emerging in the postgenomic age, it is urgent to address the two problems for both basic research and drug development. Unfortunately, to the best of our knowledge, so far no computational methods whatsoever could be used to deal with the second problem, let alone deal with the two problems together. Here, by incorporating the physicochemical properties of nucleotides into the pseudo K-tuple nucleotide composition (PseKNC), we proposed a powerful predictor called 2L-piRNA. It is a two-layer ensemble classifier, in which the first layer is for identifying whether a query RNA molecule is piRNA or non-piRNA, and the second layer for identifying whether a piRNA is with or without the function of instructing target mRNA deadenylation. Rigorous cross-validations have indicated that the success rates achieved by the proposed predictor are quite high. For the convenience of most biologists and drug development scientists, the web server for 2L-piRNA has been established at http://bioinformatics.hitsz.edu.cn/2L-piRNA/, by which users can easily get their desired results without the need to go through the mathematical details.
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China; Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China; Gordon Life Science Institute, Belmont, MA 02478, USA
- Fan Yang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China
- Kuo-Chen Chou
- Gordon Life Science Institute, Belmont, MA 02478, USA; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Center of Excellence in Genomic Medicine Research, King Abdulaziz University, Jeddah 21589, Saudi Arabia
|
34
|
Liu B, Wu H, Chou KC. Pse-in-One 2.0: An Improved Package of Web Servers for Generating Various Modes of Pseudo Components of DNA, RNA, and Protein Sequences. Natural Science 2017. [DOI: 10.4236/ns.2017.94007] [Citation(s) in RCA: 91] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
35
|
Butt AH, Rasool N, Khan YD. A Treatise to Computational Approaches Towards Prediction of Membrane Protein and Its Subtypes. J Membr Biol 2016; 250:55-76. [DOI: 10.1007/s00232-016-9937-7] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2016] [Accepted: 11/02/2016] [Indexed: 10/20/2022]
|
36
|
Abstract
The structural organization of a protein family is investigated by devising a method based on the random matrix theory (RMT), which uses the physiochemical properties of the amino acid with multiple sequence alignment. A graphical method to represent protein sequences using physiochemical properties is devised that gives a fast, easy, and informative way of comparing the evolutionary distances between protein sequences. A correlation matrix associated with each property is calculated, where the noise reduction and information filtering is done using RMT involving an ensemble of Wishart matrices. The analysis of the eigenvalue statistics of the correlation matrix for the β-lactamase family shows the universal features as observed in the Gaussian orthogonal ensemble (GOE). The property-based approach captures the short- as well as the long-range correlation (approximately following GOE) between the eigenvalues, whereas the previous approach (treating amino acids as characters) gives the usual short-range correlations, while the long-range correlations are the same as that of an uncorrelated series. The distribution of the eigenvector components for the eigenvalues outside the bulk (RMT bound) deviates significantly from RMT observations and contains important information about the system. The information content of each eigenvector of the correlation matrix is quantified by introducing an entropic estimate, which shows that for the β-lactamase family the smallest eigenvectors (low eigenmodes) are highly localized as well as informative. These small eigenvectors when processed gives clusters involving positions that have well-defined biological and structural importance matching with experiments. The approach is crucial for the recognition of structural motifs as shown in β-lactamase (and other families) and selectively identifies the important positions for targets to deactivate (activate) the enzymatic actions.
Affiliation(s)
- Pradeep Bhadola
- Department of Physics and Astrophysics, University of Delhi, Delhi 110007, India
- Nivedita Deo
- Department of Physics and Astrophysics, University of Delhi, Delhi 110007, India
|
37
|
A Prediction Model for Membrane Proteins Using Moments Based Features. BIOMED RESEARCH INTERNATIONAL 2016; 2016:8370132. [PMID: 26966690 PMCID: PMC4761391 DOI: 10.1155/2016/8370132] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/27/2015] [Accepted: 01/12/2016] [Indexed: 01/29/2023]
Abstract
The most expedient unit of the human body is its cell. Encapsulated within the cell are many infinitesimal entities and molecules which are protected by a cell membrane. The proteins that are associated with this lipid based bilayer cell membrane are known as membrane proteins and are considered to play a significant role. These membrane proteins exhibit their effect in cellular activities inside and outside of the cell. According to the scientists in pharmaceutical organizations, these membrane proteins perform key task in drug interactions. In this study, a technique is presented that is based on various computationally intelligent methods used for the prediction of membrane protein without the experimental use of mass spectrometry. Statistical moments were used to extract features and furthermore a Multilayer Neural Network was trained using backpropagation for the prediction of membrane proteins. Results show that the proposed technique performs better than existing methodologies.
|
38
|
Jia J, Liu Z, Xiao X, Liu B, Chou KC. Identification of protein-protein binding sites by incorporating the physicochemical properties and stationary wavelet transforms into pseudo amino acid composition. J Biomol Struct Dyn 2015; 34:1946-61. [PMID: 26375780 DOI: 10.1080/07391102.2015.1095116] [Citation(s) in RCA: 88] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
Abstract
With the explosive growth of protein sequences entering into protein data banks in the post-genomic era, it is highly demanded to develop automated methods for rapidly and effectively identifying the protein-protein binding sites (PPBSs) based on the sequence information alone. To address this problem, we proposed a predictor called iPPBS-PseAAC, in which each amino acid residue site of the proteins concerned was treated as a 15-tuple peptide segment generated by sliding a window along the protein chains with its center aligned with the target residue. The working peptide segment is further formulated by a general form of pseudo amino acid composition via the following procedures: (1) it is converted into a numerical series via the physicochemical properties of amino acids; (2) the numerical series is subsequently converted into a 20-D feature vector by means of the stationary wavelet transform technique. Formed by many individual "Random Forest" classifiers, the operation engine to run prediction is a two-layer ensemble classifier, with the 1st-layer voting out the best training data-set from many bootstrap systems and the 2nd-layer voting out the most relevant one from seven physicochemical properties. Cross-validation tests indicate that the new predictor is very promising, meaning that many important key features, which are deeply hidden in complicated protein sequences, can be extracted via the wavelets transform approach, quite consistent with the facts that many important biological functions of proteins can be elucidated with their low-frequency internal motions. The web server of iPPBS-PseAAC is accessible at http://www.jci-bioinfo.cn/iPPBS-PseAAC , by which users can easily acquire their desired results without the need to follow the complicated mathematical equations involved.
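The sliding-window step described in the abstract above, in which each residue is represented by a 15-residue segment centred on it, can be sketched as follows. The function name and the padding of chain termini with a dummy residue "X" are assumptions made for this illustration; the abstract does not say how iPPBS-PseAAC treats residues near the ends of a chain, and the later physicochemical encoding and wavelet transform are not shown.

```python
def residue_windows(sequence: str, width: int = 15, pad: str = "X") -> list[str]:
    """Return one window per residue, centred on that residue.

    Termini are padded with a dummy residue so that every window has the
    same length (the padding scheme is an assumption, not from the paper).
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]


# Toy chain (made up) with a short window so the output is easy to read.
for window in residue_windows("ACDEFG", width=5):
    print(window)
```

Each window has the target residue at its middle position, so the list can be paired one-to-one with per-residue binding-site labels.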
Collapse
Affiliation(s)
- Jianhua Jia
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
- Zi Liu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China; Gordon Life Science Institute, Boston, MA 02478, USA
- Bingxiang Liu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
- Kuo-Chen Chou
- Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia; Gordon Life Science Institute, Boston, MA 02478, USA
|
39
|
Ali F, Hayat M. Classification of membrane protein types using Voting Feature Interval in combination with Chou's Pseudo Amino Acid Composition. J Theor Biol 2015; 384:78-83. [PMID: 26297889 DOI: 10.1016/j.jtbi.2015.07.034] [Citation(s) in RCA: 112] [Impact Index Per Article: 12.4] [Received: 05/08/2015] [Revised: 07/15/2015] [Accepted: 07/29/2015] [Indexed: 12/11/2022]
Abstract
Membrane proteins are a major constituent of the cell and perform numerous crucial functions. These functions are closely related to the membrane protein's type. Initially, membrane protein types were classified through traditional methods, with reasonable results. However, given the rapid growth of protein sequences in databases, classification through conventional methods is very difficult or sometimes impossible, because it is laborious and time-consuming. Therefore, a powerful discriminating model is needed to classify membrane protein types with high precision. In this work, a promising classification model with effective discriminating power for membrane protein types is developed. In our classification model, salient features of protein sequences are extracted via Pseudo Amino Acid Composition. Five classification algorithms were utilized. Among them, Voting Feature Interval obtained outstanding performance on all three datasets. The accuracy of the proposed model is 93.9% on dataset S1, 89.33% on S2 and 86.9% on S3, using the 10-fold cross-validation test. These success rates show that the proposed model outperforms the existing models reported in the literature so far, and it may play a substantial role in drug design and the pharmaceutical industry.
Affiliation(s)
- Farman Ali
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
|
40
|
iPPI-Esml: An ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC. J Theor Biol 2015; 377:47-56. [DOI: 10.1016/j.jtbi.2015.04.011] [Citation(s) in RCA: 243] [Impact Index Per Article: 27.0] [Received: 01/29/2015] [Revised: 04/07/2015] [Accepted: 04/09/2015] [Indexed: 12/24/2022]
|
41
|
Liu B, Chen J, Wang X. Protein remote homology detection by combining Chou’s distance-pair pseudo amino acid composition and principal component analysis. Mol Genet Genomics 2015; 290:1919-31. [DOI: 10.1007/s00438-015-1044-4] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.8] [Received: 03/05/2015] [Accepted: 04/06/2015] [Indexed: 02/07/2023]
|
42
|
Yang R, Zhang C, Gao R, Zhang L. An ensemble method with hybrid features to identify extracellular matrix proteins. PLoS One 2015; 10:e0117804. [PMID: 25680094 PMCID: PMC4334504 DOI: 10.1371/journal.pone.0117804] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 10/06/2014] [Accepted: 01/02/2015] [Indexed: 12/29/2022] Open
Abstract
The extracellular matrix (ECM) is a dynamic composite of secreted proteins that play important roles in numerous biological processes such as tissue morphogenesis, differentiation and homeostasis. Furthermore, various diseases are caused by the dysfunction of ECM proteins. Therefore, identifying these important ECM proteins may assist in understanding related biological processes and drug development. In view of the serious imbalance in the training dataset, a Random Forest-based ensemble method with hybrid features is developed in this paper to identify ECM proteins. Hybrid features are employed by incorporating sequence composition, physicochemical properties, evolutionary and structural information. The Information Gain Ratio and Incremental Feature Selection (IGR-IFS) methods are adopted to select the optimal features. Finally, the resulting predictor termed IECMP (Identify ECM Proteins) achieves a balanced accuracy of 86.4% using the 10-fold cross-validation on the training dataset, which is much higher than results obtained by other methods (ECMPRED: 71.0%, ECMPP: 77.8%). Moreover, when tested on a common independent dataset, our method also achieves significantly improved performance over ECMPP and ECMPRED. These results indicate that IECMP is an effective method for ECM protein prediction, which has a more balanced prediction capability for positive and negative samples. It is anticipated that the proposed method will provide significant information to fully decipher the molecular mechanisms of ECM-related biological processes and discover candidate drug targets. For public access, we develop a user-friendly web server for ECM protein identification that is freely accessible at http://iecmp.weka.cc.
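The abstract reports balanced accuracy without stating a formula. A minimal sketch, assuming the usual binary definition (the mean of sensitivity and specificity), shows why this measure suits an imbalanced training set better than plain accuracy. The function names and the confusion-matrix counts below are illustrative and are not taken from the paper.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity for a binary classifier.

    Assumption: this is the standard definition; the abstract does not
    spell out the formula the authors used.
    """
    sensitivity = tp / (tp + fn)  # fraction of positives recovered
    specificity = tn / (tn + fp)  # fraction of negatives recovered
    return (sensitivity + specificity) / 2


def plain_accuracy(tp, fn, tn, fp):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical counts: 100 positives, 900 negatives, and a classifier
# that finds only 10 of the positives but never mislabels a negative.
counts = dict(tp=10, fn=90, tn=900, fp=0)
acc = plain_accuracy(**counts)
bal = balanced_accuracy(**counts)
```

With these made-up counts, plain accuracy looks high because the negatives dominate, while balanced accuracy exposes the poor recovery of positives.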
Affiliation(s)
- Runtao Yang
- School of Control Science and Engineering, Shandong University, Jinan, China
- Chengjin Zhang
- School of Control Science and Engineering, Shandong University, Jinan, China
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, China
- * E-mail: (CJZ); (RG)
- Rui Gao
- School of Control Science and Engineering, Shandong University, Jinan, China
- * E-mail: (CJZ); (RG)
- Lina Zhang
- School of Control Science and Engineering, Shandong University, Jinan, China
|
43
|
Liu Z, Xiao X, Qiu WR, Chou KC. iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition. Anal Biochem 2015; 474:69-77. [PMID: 25596338 DOI: 10.1016/j.ab.2014.12.009] [Citation(s) in RCA: 212] [Impact Index Per Article: 23.6] [Received: 10/04/2014] [Revised: 12/05/2014] [Accepted: 12/08/2014] [Indexed: 12/11/2022]
Abstract
Predominantly occurring on cytosine, DNA methylation is a process by which cells can modify their DNAs to change the expression of gene products. It plays very important roles not only in life development but also in forming nearly all types of cancer. Therefore, knowledge of DNA methylation sites is significant for both basic research and drug development. Given an uncharacterized DNA sequence containing many cytosine residues, which one can be methylated and which one cannot? With the avalanche of DNA sequences generated during the postgenomic age, it is highly desirable to develop computational methods for accurately identifying the methylation sites in DNA. Using the trinucleotide composition, pseudo amino acid components, and a dataset-optimizing technique, we have developed a new predictor called "iDNA-Methyl" that has achieved remarkably higher success rates in identifying the DNA methylation sites than the existing predictors. A user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/iDNA-Methyl, where users can easily get their desired results. We anticipate that the web-server predictor will become a very useful high-throughput tool for basic research and drug development and that the novel approach and technique can also be used to investigate many other DNA-related problems and genome analysis.
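The trinucleotide composition mentioned in this abstract can be illustrated with a short sketch. It computes only the plain 64-dimensional frequency vector of overlapping trinucleotides; the pseudo components and the dataset-optimizing technique of iDNA-Methyl are not reproduced. The function name and the decision to skip trimers containing non-ACGT characters are assumptions made for illustration.

```python
from itertools import product

# The 64 trinucleotides in a fixed (lexicographic) order.
TRIMERS = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_composition(dna):
    """Return normalized frequencies of overlapping trinucleotides.

    Assumption: trimers with characters outside A/C/G/T (for example N)
    are skipped rather than counted.
    """
    counts = dict.fromkeys(TRIMERS, 0)
    total = 0
    for i in range(len(dna) - 2):
        trimer = dna[i:i + 3]
        if trimer in counts:
            counts[trimer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TRIMERS)
    return [counts[t] / total for t in TRIMERS]


vector = trinucleotide_composition("ACGTACGT")
```

The fixed ordering of `TRIMERS` keeps feature positions comparable across sequences, which a downstream classifier would need.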
Affiliation(s)
- Zi Liu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China; Gordon Life Science Institute, Boston, MA 02478, USA
- Wang-Ren Qiu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, USA; Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
|
44
|
Kumar R, Srivastava A, Kumari B, Kumar M. Prediction of β-lactamase and its class by Chou’s pseudo-amino acid composition and support vector machine. J Theor Biol 2015; 365:96-103. [DOI: 10.1016/j.jtbi.2014.10.008] [Citation(s) in RCA: 125] [Impact Index Per Article: 13.9] [Received: 08/05/2014] [Revised: 10/01/2014] [Accepted: 10/06/2014] [Indexed: 01/01/2023]
|
45
|
Chen W, Lin H, Chou KC. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. MOLECULAR BIOSYSTEMS 2015; 11:2620-34. [DOI: 10.1039/c5mb00155b] [Citation(s) in RCA: 262] [Impact Index Per Article: 29.1] [Indexed: 11/21/2022]
Abstract
With the avalanche of DNA/RNA sequences generated in the post-genomic age, it is urgent to develop automated methods for analyzing the relationship between the sequences and their functions.
Affiliation(s)
- Wei Chen
- Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000
- Hao Lin
- Gordon Life Science Institute, Boston, USA; Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics
- Kuo-Chen Chou
- Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000
|
46
|
Li L, Yu S, Xiao W, Li Y, Huang L, Zheng X, Zhou S, Yang H. Sequence-based identification of recombination spots using pseudo nucleic acid representation and recursive feature extraction by linear kernel SVM. BMC Bioinformatics 2014; 15:340. [PMID: 25409550 PMCID: PMC4289199 DOI: 10.1186/1471-2105-15-340] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 07/09/2014] [Accepted: 09/29/2014] [Indexed: 02/08/2023] Open
Abstract
Background: Identification of the recombination hot/cold spots is critical for understanding the mechanism of recombination as well as the genome evolution process. However, experimental identification of recombination spots is both time-consuming and costly. Developing an accurate and automated method for reliably and quickly identifying recombination spots is thus urgently needed. Results: Here we proposed a novel approach by fusing features from pseudo nucleic acid composition (PseNAC), including NAC, n-tier NAC and pseudo dinucleotide composition (PseDNC). A recursive feature extraction by linear kernel support vector machine (SVM) was then used to rank the integrated feature vectors and extract optimal features. SVM was adopted for identifying recombination spots based on these optimal features. To evaluate the performance of the proposed method, a jackknife cross-validation test was employed on a benchmark dataset. The overall accuracy of this approach was 84.09%, which was higher (by 0.37% to 3.79%) than those of state-of-the-art tools. Conclusions: Comparison results suggested that linear kernel SVM is a useful vehicle for identifying recombination hot/cold spots.
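The jackknife (leave-one-out) test used for evaluation here can be sketched generically. This is an illustration under stated assumptions: `jackknife_accuracy` and `nearest_neighbor` are hypothetical names, and the toy 1-nearest-neighbor classifier merely stands in for the linear kernel SVM of the paper, which is not reproduced.

```python
def jackknife_accuracy(samples, labels, train_and_predict):
    """Leave-one-out accuracy: each sample is held out once, the
    classifier is trained on the rest, and the held-out label is predicted."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if train_and_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def nearest_neighbor(train_x, train_y, query):
    """Toy stand-in classifier (NOT the SVM from the paper): return the
    label of the training vector closest to `query`."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x
    ]
    return train_y[distances.index(min(distances))]


# Made-up one-dimensional feature vectors for illustration only.
features = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
classes = ["cold", "cold", "cold", "hot", "hot", "hot"]
accuracy = jackknife_accuracy(features, classes, nearest_neighbor)
```

Because every sample is held out exactly once, the jackknife test gives a single deterministic result for a given dataset, unlike k-fold splits that depend on how the folds are drawn.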
Affiliation(s)
- Xiaoqi Zheng
- Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing 400037, China.
|
47
|
Qiu WR, Xiao X, Lin WZ, Chou KC. iUbiq-Lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model. J Biomol Struct Dyn 2014; 33:1731-42. [PMID: 25248923 DOI: 10.1080/07391102.2014.968875] [Citation(s) in RCA: 126] [Impact Index Per Article: 12.6] [Indexed: 10/24/2022]
Abstract
As one of the most important posttranslational modifications (PTMs), ubiquitination plays an important role in regulating a variety of biological processes, such as signal transduction, cell division, apoptosis, and immune response. Ubiquitination is also named "lysine ubiquitination" because it occurs when a ubiquitin is covalently attached to lysine (K) residues of target proteins. Given an uncharacterized protein sequence that contains many lysine residues, which one of them is a ubiquitination site, and which one is a non-ubiquitination site? With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable for both basic research and drug development to develop an automated method for rapidly and accurately annotating the ubiquitination sites in proteins. In view of this, a new predictor called "iUbiq-Lys" was developed based on the evolutionary information, gray system model, as well as the general form of pseudo-amino acid composition. It was demonstrated via rigorous cross-validations that the new predictor remarkably outperformed all its counterparts. As a web-server, iUbiq-Lys is accessible to the public at http://www.jci-bioinfo.cn/iUbiq-Lys. For the convenience of most experimental scientists, we have further provided a step-by-step guide, by which users can easily get their desired results without the need to follow the complicated mathematics that were presented in this paper just for the integrity of its development process.
Affiliation(s)
- Wang-Ren Qiu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
|
48
|
Xu R, Zhou J, Liu B, He Y, Zou Q, Wang X, Chou KC. Identification of DNA-binding proteins by incorporating evolutionary information into pseudo amino acid composition via the top-n-gram approach. J Biomol Struct Dyn 2014; 33:1720-30. [PMID: 25252709 DOI: 10.1080/07391102.2014.968624] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.6] [Indexed: 10/24/2022]
Abstract
DNA-binding proteins are crucial for various cellular processes and hence have become an important target for both basic research and drug development. With the avalanche of protein sequences generated in the postgenomic age, it is highly desired to establish an automated method for rapidly and accurately identifying DNA-binding proteins based on their sequence information alone. Owing to the fact that all biological species have developed beginning from a very limited number of ancestral species, it is important to take into account the evolutionary information in developing such a high-throughput tool. In view of this, a new predictor was proposed by incorporating the evolutionary information into the general form of pseudo amino acid composition via the top-n-gram approach. It was observed by comparing the new predictor with the existing methods via both jackknife test and independent data-set test that the new predictor outperformed its counterparts. It is anticipated that the new predictor may become a useful vehicle for identifying DNA-binding proteins. It has not escaped our notice that the novel approach to extract evolutionary information into the formulation of statistical samples can be used to identify many other protein attributes as well.
Affiliation(s)
- Ruifeng Xu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen 518055, Guangdong, China
|
49
|
Du P, Gu S, Jiao Y. PseAAC-General: fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets. Int J Mol Sci 2014; 15:3495-506. [PMID: 24577312 PMCID: PMC3975349 DOI: 10.3390/ijms15033495] [Citation(s) in RCA: 242] [Impact Index Per Article: 24.2] [Received: 01/20/2014] [Revised: 02/13/2014] [Accepted: 02/14/2014] [Indexed: 11/16/2022] Open
Abstract
The general form pseudo-amino acid composition (PseAAC) has been widely used to represent protein sequences in predicting protein structural and functional attributes. We developed the program PseAAC-General to generate various different modes of Chou’s general PseAAC, such as the gene ontology mode, the functional domain mode, and the sequential evolution mode. This program allows the users to define their own desired modes. In every mode, 544 physicochemical properties of the amino acids are available for choosing. The computing efficiency is at least 100 times that of existing programs, which makes it able to facilitate the extensive studies on proteins and peptides. The PseAAC-General is freely available via SourceForge. It runs on both Linux and Windows.
Affiliation(s)
- Pufeng Du
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
- Shuwang Gu
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
- Yasen Jiao
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
|