101
|
Using radial basis function on the general form of Chou's pseudo amino acid composition and PSSM to predict subcellular locations of proteins with both single and multiple sites. Biosystems 2013; 113:50-7. [DOI: 10.1016/j.biosystems.2013.04.005] [Citation(s) in RCA: 71] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2012] [Revised: 04/10/2013] [Accepted: 04/24/2013] [Indexed: 12/22/2022]
|
102
|
Feng PM, Ding H, Chen W, Lin H. Naïve Bayes classifier with feature selection to identify phage virion proteins. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2013; 2013:530696. [PMID: 23762187 PMCID: PMC3671239 DOI: 10.1155/2013/530696] [Citation(s) in RCA: 107] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 03/10/2013] [Revised: 04/16/2013] [Accepted: 04/28/2013] [Indexed: 12/31/2022]
Abstract
Knowledge about the protein composition of phage virions is a key step to understand the functions of phage virion proteins. However, the experimental method to identify virion proteins is time consuming and expensive. Thus, it is highly desirable to develop novel computational methods for phage virion protein identification. In this study, a Naïve Bayes based method was proposed to predict phage virion proteins using amino acid composition and dipeptide composition. In order to remove redundant information, a novel feature selection technique was employed to single out optimized features. In the jackknife test, the proposed method achieved an accuracy of 79.15% for phage virion and nonvirion proteins classification, which are superior to that of other state-of-the-art classifiers. These results indicate that the proposed method could be as an effective and promising high-throughput method in phage proteomics research.
Collapse
Affiliation(s)
- Peng-Mian Feng
- School of Public Health, Hebei United University, Tangshan 063000, China
| | - Hui Ding
- Key Laboratory for Neuroinformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Chen
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China
| | - Hao Lin
- Key Laboratory for Neuroinformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
Collapse
|
103
|
LIN HAO, DING CHEN, YUAN LUFENG, CHEN WEI, DING HUI, LI ZIQIANG, GUO FENGBIAO, HUANG JIAN, RAO NINI. PREDICTING SUBCHLOROPLAST LOCATIONS OF PROTEINS BASED ON THE GENERAL FORM OF CHOU'S PSEUDO AMINO ACID COMPOSITION: APPROACHED FROM OPTIMAL TRIPEPTIDE COMPOSITION. INT J BIOMATH 2013. [DOI: 10.1142/s1793524513500034] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Chloroplasts are organelles found in plant cells that conduct photosynthesis. The subchloroplast locations of proteins are correlated with their functions. With the availability of a great number of protein data, it is highly desired to develop a computational method to predict the subchloroplast locations of chloroplast proteins. In this study, we proposed a novel method to predict subchloroplast locations of proteins using tripeptide compositions. It first used the binomial distribution to optimize the feature sets. Then the support vector machine was selected to perform the prediction of subchloroplast locations of proteins. The proposed method was tested on a reliable and rigorous dataset including 259 chloroplast proteins with sequence identity ≤ 25%. In the jack-knife cross-validation, 92.21% envelope proteins, 93.20% thylakoid membrane, 52.63% thylakoid lumen and 85.00% stroma can be correctly identified. The overall accuracy achieves 88.03% which is higher than that of other models. Based on this method, a predictor called ChloPred has been built and can be freely available from http://cobi.uestc.edu.cn/people/hlin/tools/ChloPred/ . The predictor will provide important information for theoretical and experimental research of chloroplast proteins.
Collapse
Affiliation(s)
- HAO LIN
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| | - CHEN DING
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| | - LU-FENG YUAN
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| | - WEI CHEN
- Center for Genomics and Computational Biology, Department of Physics, College of Sciences, Hebei United University, Tangshan 063000, P. R. China
| | - HUI DING
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| | - ZI-QIANG LI
- School of Information and Engineering, Sichuan Agricultural University, Yaan 625014, P. R. China
| | - FENG-BIAO GUO
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| | - JIAN HUANG
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| | - NI-NI RAO
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| |
Collapse
|
104
|
Huang C, Yuan JQ. A Multilabel Model Based on Chou’s Pseudo–Amino Acid Composition for Identifying Membrane Proteins with Both Single and Multiple Functional Types. J Membr Biol 2013; 246:327-34. [DOI: 10.1007/s00232-013-9536-9] [Citation(s) in RCA: 63] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2012] [Accepted: 03/11/2013] [Indexed: 11/24/2022]
|
105
|
Wang X, Li GZ. Multilabel learning via random label selection for protein subcellular multilocations prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:436-446. [PMID: 23929867 DOI: 10.1109/tcbb.2013.21] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]
Abstract
Prediction of protein subcellular localization is an important but challenging problem, particularly when proteins may simultaneously exist at, or move between, two or more different subcellular location sites. Most of the existing protein subcellular localization methods are only used to deal with the single-location proteins. In the past few years, only a few methods have been proposed to tackle proteins with multiple locations. However, they only adopt a simple strategy, that is, transforming the multilocation proteins to multiple proteins with single location, which does not take correlations among different subcellular locations into account. In this paper, a novel method named random label selection (RALS) (multilabel learning via RALS), which extends the simple binary relevance (BR) method, is proposed to learn from multilocation proteins in an effective and efficient way. RALS does not explicitly find the correlations among labels, but rather implicitly attempts to learn the label correlations from data by augmenting original feature space with randomly selected labels as its additional input features. Through the fivefold cross-validation test on a benchmark data set, we demonstrate our proposed method with consideration of label correlations obviously outperforms the baseline BR method without consideration of label correlations, indicating correlations among different subcellular locations really exist and contribute to improvement of prediction performance. Experimental results on two benchmark data sets also show that our proposed methods achieve significantly higher performance than some other state-of-the-art methods in predicting subcellular multilocations of proteins. The prediction web server is available at >http://levis.tongji.edu.cn:8080/bioinfo/MLPred-Euk/ for the public usage.
Collapse
Affiliation(s)
- Xiao Wang
- Key Laboratory of Embedded System and Service Computing, Ministry of Education, Department of Control Science and Engineering, Tongji University, Shanghai 201804, China
| | | |
Collapse
|
106
|
Xu Y, Ding J, Wu LY, Chou KC. iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS One 2013; 8:e55844. [PMID: 23409062 PMCID: PMC3567014 DOI: 10.1371/journal.pone.0055844] [Citation(s) in RCA: 282] [Impact Index Per Article: 25.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2012] [Accepted: 01/02/2013] [Indexed: 11/18/2022] Open
Abstract
Posttranslational modifications (PTMs) of proteins are responsible for sensing and transducing signals to regulate various cellular functions and signaling events. S-nitrosylation (SNO) is one of the most important and universal PTMs. With the avalanche of protein sequences generated in the post-genomic age, it is highly desired to develop computational methods for timely identifying the exact SNO sites in proteins because this kind of information is very useful for both basic research and drug development. Here, a new predictor, called iSNO-PseAAC, was developed for identifying the SNO sites in proteins by incorporating the position-specific amino acid propensity (PSAAP) into the general form of pseudo amino acid composition (PseAAC). The predictor was implemented using the conditional random field (CRF) algorithm. As a demonstration, a benchmark dataset was constructed that contains 731 SNO sites and 810 non-SNO sites. To reduce the homology bias, none of these sites were derived from the proteins that had [Formula: see text] pairwise sequence identity to any other. It was observed that the overall cross-validation success rate achieved by iSNO-PseAAC in identifying nitrosylated proteins on an independent dataset was over 90%, indicating that the new predictor is quite promising. Furthermore, a user-friendly web-server for iSNO-PseAAC was established at http://app.aporc.org/iSNO-PseAAC/, by which users can easily obtain the desired results without the need to follow the mathematical equations involved during the process of developing the prediction method. It is anticipated that iSNO-PseAAC may become a useful high throughput tool for identifying the SNO sites, or at the very least play a complementary role to the existing methods in this area.
Collapse
Affiliation(s)
- Yan Xu
- Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, China
- Gordon Life Science Institute, San Diego, California, United States of America
| | - Jun Ding
- Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, China
| | - Ling-Yun Wu
- Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, California, United States of America
| |
Collapse
|
107
|
Xiao X, Wang P, Lin WZ, Jia JH, Chou KC. iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal Biochem 2013; 436:168-77. [PMID: 23395824 DOI: 10.1016/j.ab.2013.01.019] [Citation(s) in RCA: 347] [Impact Index Per Article: 31.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2012] [Revised: 01/10/2013] [Accepted: 01/21/2013] [Indexed: 12/14/2022]
Abstract
Antimicrobial peptides (AMPs), also called host defense peptides, are an evolutionarily conserved component of the innate immune response and are found among all classes of life. According to their special functions, AMPs are generally classified into ten categories: Antibacterial Peptides, Anticancer/tumor Peptides, Antifungal Peptides, Anti-HIV Peptides, Antiviral Peptides, Antiparasital Peptides, Anti-protist Peptides, AMPs with Chemotactic Activity, Insecticidal Peptides, and Spermicidal Peptides. Given a query peptide, how can we identify whether it is an AMP or non-AMP? If it is, can we identify which functional type or types it belong to? Particularly, how can we deal with the multi-type problem since an AMP may belong to two or more functional types? To address these problems, which are obviously very important to both basic research and drug development, a multi-label classifier was developed based on the pseudo amino acid composition (PseAAC) and fuzzy K-nearest neighbor (FKNN) algorithm, where the components of PseAAC were featured by incorporating five physicochemical properties. The novel classifier is called iAMP-2L, where "2L" means that it is a 2-level predictor. The 1st-level is to answer the 1st question above, while the 2nd-level is to answer the 2nd and 3rd questions that are beyond the reach of any existing methods in this area. For the conveniences of users, a user-friendly web-server for iAMP-2L was established at http://www.jci-bioinfo.cn/iAMP-2L.
Collapse
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China.
| | | | | | | | | |
Collapse
|
108
|
Characterization of structure–antioxidant activity relationship of peptides in free radical systems using QSAR models: Key sequence positions and their amino acid properties. J Theor Biol 2013; 318:29-43. [DOI: 10.1016/j.jtbi.2012.10.029] [Citation(s) in RCA: 130] [Impact Index Per Article: 11.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2012] [Revised: 10/21/2012] [Accepted: 10/22/2012] [Indexed: 11/22/2022]
|
109
|
Chen W, Feng PM, Lin H, Chou KC. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res 2013; 41:e68. [PMID: 23303794 PMCID: PMC3616736 DOI: 10.1093/nar/gks1450] [Citation(s) in RCA: 468] [Impact Index Per Article: 42.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
Abstract
Meiotic recombination is an important biological process. As a main driving force of evolution, recombination provides natural new combinations of genetic variations. Rather than randomly occurring across a genome, meiotic recombination takes place in some genomic regions (the so-called ‘hotspots’) with higher frequencies, and in the other regions (the so-called ‘coldspots’) with lower frequencies. Therefore, the information of the hotspots and coldspots would provide useful insights for in-depth studying of the mechanism of recombination and the genome evolution process as well. So far, the recombination regions have been mainly determined by experiments, which are both expensive and time-consuming. With the avalanche of genome sequences generated in the postgenomic age, it is highly desired to develop automated methods for rapidly and effectively identifying the recombination regions. In this study, a predictor, called ‘iRSpot-PseDNC’, was developed for identifying the recombination hotspots and coldspots. In the new predictor, the samples of DNA sequences are formulated by a novel feature vector, the so-called ‘pseudo dinucleotide composition’ (PseDNC), into which six local DNA structural properties, i.e. three angular parameters (twist, tilt and roll) and three translational parameters (shift, slide and rise), are incorporated. It was observed by the rigorous jackknife test that the overall success rate achieved by iRSpot-PseDNC was >82% in identifying recombination spots in Saccharomyces cerevisiae, indicating the new predictor is promising or at least may become a complementary tool to the existing methods in this area. Although the benchmark data set used to train and test the current method was from S. cerevisiae, the basic approaches can also be extended to deal with all the other genomes. Particularly, it has not escaped our notice that the PseDNC approach can be also used to study many other DNA-related problems. As a user-friendly web-server, iRSpot-PseDNC is freely accessible at http://lin.uestc.edu.cn/server/iRSpot-PseDNC.
Collapse
Affiliation(s)
- Wei Chen
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan, China.
| | | | | | | |
Collapse
|
110
|
EuLoc: a web-server for accurately predict protein subcellular localization in eukaryotes by incorporating various features of sequence segments into the general form of Chou's PseAAC. J Comput Aided Mol Des 2013; 27:91-103. [PMID: 23283513 DOI: 10.1007/s10822-012-9628-0] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2012] [Accepted: 12/17/2012] [Indexed: 01/25/2023]
Abstract
The function of a protein is generally related to its subcellular localization. Therefore, knowing its subcellular localization is helpful in understanding its potential functions and roles in biological processes. This work develops a hybrid method for computationally predicting the subcellular localization of eukaryotic protein. The method is called EuLoc and incorporates the Hidden Markov Model (HMM) method, homology search approach and the support vector machines (SVM) method by fusing several new features into Chou's pseudo-amino acid composition. The proposed SVM module overcomes the shortcoming of the homology search approach in predicting the subcellular localization of a protein which only finds low-homologous or non-homologous sequences in a protein subcellular localization annotated database. The proposed HMM modules overcome the shortcoming of SVM in predicting subcellular localizations using few data on protein sequences. Several features of a protein sequence are considered, including the sequence-based features, the biological features derived from PROSITE, NLSdb and Pfam, the post-transcriptional modification features and others. The overall accuracy and location accuracy of EuLoc are 90.5 and 91.2 %, respectively, revealing a better predictive performance than obtained elsewhere. Although the amounts of data of the various subcellular location groups in benchmark dataset differ markedly, the accuracies of 12 subcellular localizations of EuLoc range from 82.5 to 100 %, indicating that this tool is much more balanced than other tools. EuLoc offers a high, balanced predictive power for each subcellular localization. EuLoc is now available on the web at http://euloc.mbc.nctu.edu.tw/.
Collapse
|
111
|
Lin WZ, Fang JA, Xiao X, Chou KC. iLoc-Animal: a multi-label learning classifier for predicting subcellular localization of animal proteins. MOLECULAR BIOSYSTEMS 2013; 9:634-44. [DOI: 10.1039/c3mb25466f] [Citation(s) in RCA: 218] [Impact Index Per Article: 19.8] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
|
112
|
Pirhadi S, Ghasemi JB. Pharmacophore Identification, Molecular Docking, Virtual Screening, and In Silico ADME Studies of Non-Nucleoside Reverse Transcriptase Inhibitors. Mol Inform 2012; 31:856-66. [PMID: 27476739 DOI: 10.1002/minf.201200018] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2012] [Accepted: 11/19/2012] [Indexed: 01/26/2023]
Abstract
Non-nucleoside reverse transcriptase inhibitors (NNRTIs) have gained a definitive place due to their unique antiviral potency, high specificity and low toxicity in antiretroviral combination therapies used to treat HIV. In this study, chemical feature based pharmacophore models of different classes of NNRT inhibitors of HIV-1 have been developed. The best HypoRefine pharmacophore model, Hypo 1, which has the best correlation coefficient (0.95) and the lowest RMS (0.97), contains two hydrogen bond acceptors, one hydrophobic and one ring aromatic feature, as well as four excluded volumes. Hypo 1 was further validated by test set and Fischer validation method. The best pharmacophore model was then utilized as a 3D search query to perform a virtual screening to retrieve potential inhibitors. The hit compounds were subsequently subjected to filtering by Lipinski's rule of five and docking studies by Libdock and Gold methods to refine the retrieved hits. Finally, 7 top ranked compounds based on Gold score fitness function were subjected to in silico ADME studies to investigate for compliance with the standard ranges.
Collapse
Affiliation(s)
- Somayeh Pirhadi
- Chemistry Department, Faculty of Sciences, K. N. Toosi University of Technology, Tehran, Iran fax: +98-21-22853650; tel: +98-21-22850266
| | - Jahan B Ghasemi
- Chemistry Department, Faculty of Sciences, K. N. Toosi University of Technology, Tehran, Iran fax: +98-21-22853650; tel: +98-21-22850266.
| |
Collapse
|
113
|
Predicting secretory proteins of malaria parasite by incorporating sequence evolution information into pseudo amino acid composition via grey system model. PLoS One 2012. [PMID: 23189138 PMCID: PMC3506597 DOI: 10.1371/journal.pone.0049040] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2023] Open
Abstract
The malaria disease has become a cause of poverty and a major hindrance to economic development. The culprit of the disease is the parasite, which secretes an array of proteins within the host erythrocyte to facilitate its own survival. Accordingly, the secretory proteins of malaria parasite have become a logical target for drug design against malaria. Unfortunately, with the increasing resistance to the drugs thus developed, the situation has become more complicated. To cope with the drug resistance problem, one strategy is to timely identify the secreted proteins by malaria parasite, which can serve as potential drug targets. However, it is both expensive and time-consuming to identify the secretory proteins of malaria parasite by experiments alone. To expedite the process for developing effective drugs against malaria, a computational predictor called "iSMP-Grey" was developed that can be used to identify the secretory proteins of malaria parasite based on the protein sequence information alone. During the prediction process a protein sample was formulated with a 60D (dimensional) feature vector formed by incorporating the sequence evolution information into the general form of PseAAC (pseudo amino acid composition) via a grey system model, which is particularly useful for solving complicated problems that are lack of sufficient information or need to process uncertain information. It was observed by the jackknife test that iSMP-Grey achieved an overall success rate of 94.8%, remarkably higher than those by the existing predictors in this area. As a user-friendly web-server, iSMP-Grey is freely accessible to the public at http://www.jci-bioinfo.cn/iSMP-Grey. Moreover, for the convenience of most experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results without the need to follow the complicated mathematical equations involved in this paper.
Collapse
|
114
|
A novel statistical measure for sequence comparison on the basis of k-word counts. J Theor Biol 2012; 318:91-100. [PMID: 23147229 DOI: 10.1016/j.jtbi.2012.10.035] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2011] [Revised: 10/10/2012] [Accepted: 10/31/2012] [Indexed: 11/24/2022]
Abstract
Numerous efficient methods based on word counts for sequence analysis have been proposed to characterize DNA sequences to help in comparison, retrieval from the databases and reconstructing evolutionary relations. However, most of them seem unrelated to any intrinsic characteristics of DNA. In this paper, we proposed a novel statistical measure for sequence comparison on the basis of k-word counts. This new measure removed the influence of sequences' lengths and uncovered bulk property of DNA sequences. The proposed measure was tested by similarity search and phylogenetic analysis. The experimental assessment demonstrated that our similarity measure was efficient.
Collapse
|
115
|
Kandaswamy KK, Pugalenthi G, Kalies KU, Hartmann E, Martinetz T. EcmPred: prediction of extracellular matrix proteins based on random forest with maximum relevance minimum redundancy feature selection. J Theor Biol 2012; 317:377-83. [PMID: 23123454 DOI: 10.1016/j.jtbi.2012.10.015] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2012] [Revised: 10/08/2012] [Accepted: 10/09/2012] [Indexed: 12/11/2022]
Abstract
The extracellular matrix (ECM) is a major component of tissues of multicellular organisms. It consists of secreted macromolecules, mainly polysaccharides and glycoproteins. Malfunctions of ECM proteins lead to severe disorders such as marfan syndrome, osteogenesis imperfecta, numerous chondrodysplasias, and skin diseases. In this work, we report a random forest approach, EcmPred, for the prediction of ECM proteins from protein sequences. EcmPred was trained on a dataset containing 300 ECM and 300 non-ECM and tested on a dataset containing 145 ECM and 4187 non-ECM proteins. EcmPred achieved 83% accuracy on the training and 77% on the test dataset. EcmPred predicted 15 out of 20 experimentally verified ECM proteins. By scanning the entire human proteome, we predicted novel ECM proteins validated with gene ontology and InterPro. The dataset and standalone version of the EcmPred software is available at http://www.inb.uni-luebeck.de/tools-demos/Extracellular_matrix_proteins/EcmPred.
Collapse
|
116
|
Jahandideh S, Srinivasasainagendra V, Zhi D. Comprehensive comparative analysis and identification of RNA-binding protein domains: multi-class classification and feature selection. J Theor Biol 2012; 312:65-75. [PMID: 22884576 DOI: 10.1016/j.jtbi.2012.07.013] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/29/2012] [Revised: 07/09/2012] [Accepted: 07/13/2012] [Indexed: 01/11/2023]
Abstract
RNA-protein interaction plays an important role in various cellular processes, such as protein synthesis, gene regulation, post-transcriptional gene regulation, alternative splicing, and infections by RNA viruses. In this study, using Gene Ontology Annotated (GOA) and Structural Classification of Proteins (SCOP) databases an automatic procedure was designed to capture structurally solved RNA-binding protein domains in different subclasses. Subsequently, we applied tuned multi-class SVM (TMCSVM), Random Forest (RF), and multi-class ℓ1/ℓq-regularized logistic regression (MCRLR) for analysis and classifying RNA-binding protein domains based on a comprehensive set of sequence and structural features. In this study, we compared prediction accuracy of three different state-of-the-art predictor methods. From our results, TMCSVM outperforms the other methods and suggests the potential of TMCSVM as a useful tool for facilitating the multi-class prediction of RNA-binding protein domains. On the other hand, MCRLR by elucidating importance of features for their contribution in predictive accuracy of RNA-binding protein domains subclasses, helps us to provide some biological insights into the roles of sequences and structures in protein-RNA interactions.
Collapse
Affiliation(s)
- Samad Jahandideh
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA.
| | - Vinodh Srinivasasainagendra
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Degui Zhi
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA.
| |
Collapse
|
117
|
Wang X, Li GZ. A multi-label predictor for identifying the subcellular locations of singleplex and multiplex eukaryotic proteins. PLoS One 2012; 7:e36317. [PMID: 22629314 PMCID: PMC3358325 DOI: 10.1371/journal.pone.0036317] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2011] [Accepted: 04/01/2012] [Indexed: 01/30/2023] Open
Abstract
Subcellular locations of proteins are important functional attributes. An effective and efficient subcellular localization predictor is necessary for rapidly and reliably annotating subcellular locations of proteins. Most of existing subcellular localization methods are only used to deal with single-location proteins. Actually, proteins may simultaneously exist at, or move between, two or more different subcellular locations. To better reflect characteristics of multiplex proteins, it is highly desired to develop new methods for dealing with them. In this paper, a new predictor, called Euk-ECC-mPLoc, by introducing a powerful multi-label learning approach which exploits correlations between subcellular locations and hybridizing gene ontology with dipeptide composition information, has been developed that can be used to deal with systems containing both singleplex and multiplex eukaryotic proteins. It can be utilized to identify eukaryotic proteins among the following 22 locations: (1) acrosome, (2) cell membrane, (3) cell wall, (4) centrosome, (5) chloroplast, (6) cyanelle, (7) cytoplasm, (8) cytoskeleton, (9) endoplasmic reticulum, (10) endosome, (11) extracellular, (12) Golgi apparatus, (13) hydrogenosome, (14) lysosome, (15) melanosome, (16) microsome, (17) mitochondrion, (18) nucleus, (19) peroxisome, (20) spindle pole body, (21) synapse, and (22) vacuole. Experimental results on a stringent benchmark dataset of eukaryotic proteins by jackknife cross validation test show that the average success rate and overall success rate obtained by Euk-ECC-mPLoc were 69.70% and 81.54%, respectively, indicating that our approach is quite promising. Particularly, the success rates achieved by Euk-ECC-mPLoc for small subsets were remarkably improved, indicating that it holds a high potential for simulating the development of the area. As a user-friendly web-server, Euk-ECC-mPLoc is freely accessible to the public at the website http://levis.tongji.edu.cn:8080/bioinfo/Euk-ECC-mPLoc/. We believe that Euk-ECC-mPLoc may become a useful high-throughput tool, or at least play a complementary role to the existing predictors in identifying subcellular locations of eukaryotic proteins.
Collapse
Affiliation(s)
| | - Guo-Zheng Li
- The MOE Key Laboratory of Embedded System and Service Computing, Department of Control Science and Engineering, Tongji University, Shanghai, China
| |
Collapse
|
118
|
Du P, Wang X, Xu C, Gao Y. PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. Anal Biochem 2012; 425:117-9. [PMID: 22459120 DOI: 10.1016/j.ab.2012.03.015] [Citation(s) in RCA: 223] [Impact Index Per Article: 18.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2012] [Revised: 03/17/2012] [Accepted: 03/20/2012] [Indexed: 01/05/2023]
Abstract
The pseudo-amino acid composition has been widely used to convert complicated protein sequences with various lengths to fixed length digital feature vectors while keeping considerable sequence order information. However, so far the only software available to the public is the web server PseAAC (http://www.csbio.sjtu.edu.cn/bioinf/PseAAC), which has some limitations in dealing with large-scale datasets. Here, we propose a new cross-platform stand-alone software program, called PseAAC-Builder (http://www.pseb.sf.net), which can be used to generate various modes of Chou's pseudo-amino acid composition in a much more efficient and flexible way. It is anticipated that PseAAC-Builder may become a useful tool for studying various protein attributes.
Collapse
Affiliation(s)
- Pufeng Du
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China.
| | | | | | | |
Collapse
|
119
|
Xiao X, Wang P, Chou KC. iNR-PhysChem: a sequence-based predictor for identifying nuclear receptors and their subfamilies via physical-chemical property matrix. PLoS One 2012; 7:e30869. [PMID: 22363503 PMCID: PMC3283608 DOI: 10.1371/journal.pone.0030869] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2011] [Accepted: 12/22/2011] [Indexed: 11/30/2022] Open
Abstract
Nuclear receptors (NRs) form a family of ligand-activated transcription factors that regulate a wide variety of biological processes, such as homeostasis, reproduction, development, and metabolism. Human genome contains 48 genes encoding NRs. These receptors have become one of the most important targets for therapeutic drug development. According to their different action mechanisms or functions, NRs have been classified into seven subfamilies. With the avalanche of protein sequences generated in the postgenomic age, we are facing the following challenging problems. Given an uncharacterized protein sequence, how can we identify whether it is a nuclear receptor? If it is, what subfamily it belongs to? To address these problems, we developed a predictor called iNR-PhysChem in which the protein samples were expressed by a novel mode of pseudo amino acid composition (PseAAC) whose components were derived from a physical-chemical matrix via a series of auto-covariance and cross-covariance transformations. It was observed that the overall success rate achieved by iNR-PhysChem was over 98% in identifying NRs or non-NRs, and over 92% in identifying NRs among the following seven subfamilies: NR1thyroid hormone like, NR2HNF4-like, NR3estrogen like, NR4nerve growth factor IB-like, NR5fushi tarazu-F1 like, NR6germ cell nuclear factor like, and NR0knirps like. These rates were derived by the jackknife tests on a stringent benchmark dataset in which none of protein sequences included has pairwise sequence identity to any other in a same subset. As a user-friendly web-server, iNR-PhysChem is freely accessible to the public at either http://www.jci-bioinfo.cn/iNR-PhysChem or http://icpr.jci.edu.cn/bioinfo/iNR-PhysChem. Also a step-by-step guide is provided on how to use the web-server to get the desired results without the need to follow the complicated mathematics involved in developing the predictor. It is anticipated that iNR-PhysChem may become a useful high throughput tool for both basic research and drug design.
Collapse
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, China.
| | | | | |
Collapse
|