1
Khan YD, Amin N, Hussain W, Rasool N, Khan SA, Chou KC. iProtease-PseAAC(2L): A two-layer predictor for identifying proteases and their types using Chou's 5-step-rule and general PseAAC. Anal Biochem 2019; 588:113477. [PMID: 31654612 DOI: 10.1016/j.ab.2019.113477] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 02/04/2019] [Revised: 10/02/2019] [Accepted: 10/18/2019] [Indexed: 12/16/2022]
Abstract
Proteases are enzymes that carry out proteolysis, the degradation of proteins and peptides, which is crucial for the survival, growth and wellbeing of a cell. Moreover, proteases have a strong association with therapeutics and drug development. Proteases are classified into five types according to their nature and physicochemical characteristics. Most methods used to differentiate proteases from other proteins and to identify their class require a clinical test, which is usually time-consuming and operator dependent. Herein, we report a classifier named iProtease-PseAAC(2L) for identifying proteases and their classes. The predictor is developed following the 5-step rule, starting with the collection of a benchmark dataset and ending with the development of the predictor. Rigorous verification and validation tests are performed, and metrics are collected to assess the reliability of the trained model. Self-consistency validation gives 98.32% accuracy, cross-validation gives 90.71%, and the jackknife test gives 96.07%. The average accuracy for level 2, i.e. protease classification, is 95.77%. Based on these results, it is concluded that iProtease-PseAAC(2L) is well able to identify proteases and their classes from a given protein sequence.
Affiliation(s)
- Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore, 54770, Pakistan.
- Najm Amin
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore, 54770, Pakistan
- Waqar Hussain
- National Center of Artificial Intelligence, Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan
- Nouman Rasool
- Dr Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi, 75270, Pakistan
- Sher Afzal Khan
- Faculty of Computing and Information Technology in Rabigh, Jeddah, 21577, Saudi Arabia; Abdul Wali Khan University, Department of Computer Sciences, Mardan, Pakistan
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA, 02478, USA
2
Masso M, Bansal A, Prem P, Gajjala A, Vaisman II. Fitness of unregulated human Ras mutants modeled by implementing computational mutagenesis and machine learning techniques. Heliyon 2019; 5:e01884. [PMID: 31211262 PMCID: PMC6562371 DOI: 10.1016/j.heliyon.2019.e01884] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/02/2019] [Revised: 04/23/2019] [Accepted: 05/30/2019] [Indexed: 10/26/2022]
Abstract
Ras proteins play a pivotal role as oncogenes by participating in diverse signaling events, including those linked to cell growth, differentiation, and proliferation. Using experimental fitness data and implementing artificial intelligence and a computational mutagenesis technique, we developed models that reliably predict fitness for all single residue mutants of H-ras proto-oncogene protein p21. The computational mutagenesis generated a feature vector of protein structural changes for each variant, and these data correlated well with fitness. Random forest classification and tree regression machine learning algorithms were implemented for training predictive models. Cross-validations were used to evaluate model performance, and control experiments were performed to assess statistical significance. Classification models revealed a balanced accuracy rate as high as 82%, with a Matthews correlation coefficient of 0.63 and an area under the ROC curve of 0.90. Similarly, regression models displayed Pearson's correlation reaching 0.79. On the other hand, control data sets led to performance values consistent with random guessing. Comparisons with several related state-of-the-art methods reflected favorably on our trained models. This H-Ras proof-of-principle study suggests a complementary approach for understanding mechanisms with which other proteins are involved in oncogenesis, including related Ras isoforms, and for providing useful insights into designing future diagnostic and treatment modalities.
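For readers unfamiliar with the two classification metrics quoted in this abstract, the following is a minimal sketch of how balanced accuracy and the Matthews correlation coefficient are computed from a binary confusion matrix. The function names and the counts in the usage note are illustrative assumptions, not the authors' code or data.

```python
import math


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity and specificity for a binary classifier."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

With hypothetical counts of 40 true positives, 10 false positives, 40 true negatives and 10 false negatives, `balanced_accuracy` evaluates to 0.8 and `matthews_corrcoef` to 0.6.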
Affiliation(s)
- Majid Masso
- School of Systems Biology, George Mason University, 10900 University Blvd. MS 5B3, Manassas, Virginia, 20110, USA
- Arnav Bansal
- School of Systems Biology, George Mason University, 10900 University Blvd. MS 5B3, Manassas, Virginia, 20110, USA
- Preethi Prem
- School of Systems Biology, George Mason University, 10900 University Blvd. MS 5B3, Manassas, Virginia, 20110, USA
- Akhil Gajjala
- School of Systems Biology, George Mason University, 10900 University Blvd. MS 5B3, Manassas, Virginia, 20110, USA
- Iosif I Vaisman
- School of Systems Biology, George Mason University, 10900 University Blvd. MS 5B3, Manassas, Virginia, 20110, USA
3
Masso M. All-atom four-body knowledge-based statistical potential to distinguish native tertiary RNA structures from nonnative folds. J Theor Biol 2018; 453:58-67. [PMID: 29782930 DOI: 10.1016/j.jtbi.2018.05.022] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 11/15/2017] [Revised: 04/05/2018] [Accepted: 05/17/2018] [Indexed: 11/16/2022]
Abstract
Scientific breakthroughs in recent decades have uncovered the capability of RNA molecules to fulfill a wide array of structural, functional, and regulatory roles in living cells, leading to a concomitantly significant increase in both the number and diversity of experimentally determined RNA three-dimensional (3D) structures. Atomic coordinates from a representative training set of solved RNA structures, displaying low sequence and structure similarity, facilitate derivation of knowledge-based energy functions. Here we develop an all-atom four-body statistical potential and evaluate its capacity to distinguish native RNA 3D structures from nonnative folds based on calculated free energy scores. Atomic four-body nearest-neighbors are objectively identified by their occurrence as tetrahedral vertices in the Delaunay tessellations of RNA structures, and rates of atomic quadruplet interactions expected by chance are obtained from a multinomial reference distribution. Our four-body energy function, referred to as RAMP (ribonucleic acids multibody potential), is subsequently derived by applying the inverted Boltzmann principle to the frequency data, yielding an energy score for each type of atomic quadruplet interaction. Several well-known benchmark datasets reveal that RAMP is comparable with, and often outperforms, existing knowledge- and physics-based energy functions. To the best of our knowledge, this is the first study detailing an RNA tertiary structure-based multibody statistical potential and its comparative evaluation.
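The abstract above describes two ingredients: a multinomial reference distribution for the chance rate of each atomic quadruplet, and an inverted Boltzmann step that converts observed and expected rates into an energy score. A minimal sketch of that general idea follows. The function names, the toy atom-type frequencies, the natural logarithm and the absence of any kT scaling are all assumptions made for illustration; the paper's exact definitions may differ.

```python
import math
from collections import Counter


def multinomial_expected(quadruplet, type_freq):
    """Chance probability of an unordered quadruplet of atom types,
    assuming types are drawn independently with frequencies type_freq."""
    counts = Counter(quadruplet)
    coefficient = math.factorial(4)
    probability = 1.0
    for atom_type, n in counts.items():
        coefficient //= math.factorial(n)  # 4! / (n1! n2! ...)
        probability *= type_freq[atom_type] ** n
    return coefficient * probability


def inverse_boltzmann_energy(observed_freq, expected_freq):
    """Energy score: negative log ratio of observed to expected frequency
    (assumed natural log, dimensionless units)."""
    return -math.log(observed_freq / expected_freq)
```

Under this convention, a quadruplet type observed more often than expected by chance receives a negative (favorable) score, and one observed less often receives a positive score.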
Affiliation(s)
- Majid Masso
- School of Systems Biology, 10900 University Blvd. MS 5B3, George Mason University, Manassas, VA 20110 USA.
4
Khan S, Naseem I, Togneri R, Bennamoun M. RAFP-Pred: Robust Prediction of Antifreeze Proteins Using Localized Analysis of n-Peptide Compositions. IEEE/ACM Trans Comput Biol Bioinform 2018; 15:244-250. [PMID: 28113406 DOI: 10.1109/tcbb.2016.2617337] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Indexed: 06/06/2023]
Abstract
In extreme cold weather, living organisms produce Antifreeze Proteins (AFPs) to counter the otherwise lethal intracellular formation of ice. Structures and sequences of various AFPs exhibit a high degree of heterogeneity; consequently, the prediction of AFPs is considered a challenging task. In this research, we propose to handle this arduous manifold learning task using the notion of localized processing. In particular, an AFP sequence is segmented into two sub-segments, each of which is analyzed for amino acid and di-peptide compositions. We select only the most significant features using information gain (IG), followed by a random forest classification approach. The proposed RAFP-Pred achieved excellent performance on a number of standard datasets. We report a high Youden's index (sensitivity+specificity-1) value of 0.75 on the standard independent test data set, outperforming AFP-PseAAC, AFP_PSSM, AFP-Pred, and iAFP by margins of 0.05, 0.06, 0.14, and 0.68, respectively. The verification rate on the UniProtKB dataset is found to be 83.19 percent, which is substantially superior to the 57.18 percent reported for the iAFP method.
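The Youden's index defined in this abstract (sensitivity + specificity - 1) can be computed directly from confusion-matrix counts. The sketch below is illustrative only; the function name and the counts in the usage note are hypothetical and are not taken from the RAFP-Pred evaluation.

```python
def youden_index(tp, fn, tn, fp):
    """Youden's J statistic: sensitivity + specificity - 1.

    Ranges from -1 to 1; 0 corresponds to chance-level performance.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1
```

For example, a classifier with hypothetical counts of 90 true positives, 10 false negatives, 85 true negatives and 15 false positives has sensitivity 0.90 and specificity 0.85, giving an index of 0.75.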
5
Cardoso JGR, Andersen MR, Herrgård MJ, Sonnenschein N. Analysis of genetic variation and potential applications in genome-scale metabolic modeling. Front Bioeng Biotechnol 2015; 3:13. [PMID: 25763369 PMCID: PMC4329917 DOI: 10.3389/fbioe.2015.00013] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 11/25/2014] [Accepted: 01/22/2015] [Indexed: 11/13/2022]
Abstract
Genetic variation is the motor of evolution and allows organisms to overcome the environmental challenges they encounter. It can be both beneficial and harmful in the process of engineering cell factories for the production of proteins and chemicals. Throughout the history of biotechnology, there have been efforts to exploit genetic variation in our favor to create strains with favorable phenotypes. Genetic variation can either be present in natural populations or it can be artificially created by mutagenesis and selection or adaptive laboratory evolution. On the other hand, unintended genetic variation during a long term production process may lead to significant economic losses and it is important to understand how to control this type of variation. With the emergence of next-generation sequencing technologies, genetic variation in microbial strains can now be determined on an unprecedented scale and resolution by re-sequencing thousands of strains systematically. In this article, we review challenges in the integration and analysis of large-scale re-sequencing data, present an extensive overview of bioinformatics methods for predicting the effects of genetic variants on protein function, and discuss approaches for interfacing existing bioinformatics approaches with genome-scale models of cellular processes in order to predict effects of sequence variation on cellular phenotypes.
Affiliation(s)
- João G. R. Cardoso
- The Novo Nordisk Foundation Center of Biosustainability, Technical University of Denmark, Hørsholm, Denmark
- Markus J. Herrgård
- The Novo Nordisk Foundation Center of Biosustainability, Technical University of Denmark, Hørsholm, Denmark
- Nikolaus Sonnenschein
- The Novo Nordisk Foundation Center of Biosustainability, Technical University of Denmark, Hørsholm, Denmark
6
Masso M. Modeling functional changes to Escherichia coli thymidylate synthase upon single residue replacements: a structure-based approach. PeerJ 2015; 3:e721. [PMID: 25648456 PMCID: PMC4304848 DOI: 10.7717/peerj.721] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/20/2014] [Accepted: 12/18/2014] [Indexed: 11/30/2022]
Abstract
Escherichia coli thymidylate synthase (TS) is an enzyme that is indispensable to DNA synthesis and cell division, as it provides the only de novo source of dTMP by catalyzing the reductive methylation of dUMP, thus making it a key target for chemotherapeutic agents. High resolution X-ray crystallographic structures are available for TS and, owing to its relatively small size, successful experimental mutagenesis studies have been conducted on the enzyme. In this study, an in silico mutagenesis technique is used to investigate the effects of single amino acid substitutions in TS on enzymatic activity, one that employs the TS protein structure as well as a knowledge-based, four-body statistical potential. For every single residue TS variant, this approach yields both a global structural perturbation score and a set of local environmental perturbation scores that characterize the mutated position as well as all structurally neighboring residues. Global scores for the TS variants are capable of uniquely characterizing groups of residue positions in the enzyme according to their physicochemical, functional, or structural properties. Additionally, these global scores elucidate a statistically significant structure–function relationship among a collection of 372 single residue TS variants whose activity levels have been experimentally determined. Predictive models of TS variant activity are subsequently trained on this dataset of experimental mutants, whose respective feature vectors encode information regarding the mutated position as well as its six nearest residue neighbors in the TS structure, including their environmental perturbation scores.
Affiliation(s)
- Majid Masso
- Laboratory for Structural Bioinformatics, School of Systems Biology, George Mason University, Manassas, VA, USA
7
AUTO-MUTE 2.0: A Portable Framework with Enhanced Capabilities for Predicting Protein Functional Consequences upon Mutation. Adv Bioinformatics 2014; 2014:278385. [PMID: 25197272 PMCID: PMC4150472 DOI: 10.1155/2014/278385] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Received: 06/11/2014] [Revised: 07/29/2014] [Accepted: 07/29/2014] [Indexed: 11/18/2022]
Abstract
The AUTO-MUTE 2.0 stand-alone software package includes a collection of programs for predicting functional changes to proteins upon single residue substitutions, developed by combining structure-based features with trained statistical learning models. Three of the predictors evaluate changes to protein stability upon mutation, each complementing a distinct experimental approach. Two additional classifiers are available, one for predicting activity changes due to residue replacements and the other for determining the disease potential of mutations associated with nonsynonymous single nucleotide polymorphisms (nsSNPs) in human proteins. These five command-line driven tools, as well as all the supporting programs, complement those that run our AUTO-MUTE web-based server. Nevertheless, all the codes have been rewritten and substantially altered for the new portable software, and they incorporate several new features based on user feedback. Included among these upgrades is the ability to perform three highly requested tasks: to run "big data" batch jobs; to generate predictions using modified protein data bank (PDB) structures, and unpublished personal models prepared using standard PDB file formatting; and to utilize NMR structure files that contain multiple models.
8
Mao R, Raj Kumar PK, Guo C, Zhang Y, Liang C. Comparative analyses between retained introns and constitutively spliced introns in Arabidopsis thaliana using random forest and support vector machine. PLoS One 2014; 9:e104049. [PMID: 25110928 PMCID: PMC4128822 DOI: 10.1371/journal.pone.0104049] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 02/24/2014] [Accepted: 07/06/2014] [Indexed: 01/04/2023]
Abstract
One of the important modes of pre-mRNA post-transcriptional modification is alternative splicing. Alternative splicing allows creation of many distinct mature mRNA transcripts from a single gene by utilizing different splice sites. In plants like Arabidopsis thaliana, the most common type of alternative splicing is intron retention. Many past studies have focused on the positional distribution of retained introns (RIs) among different genic regions and on the regulation of their expression, while little systematic classification of RIs from constitutively spliced introns (CSIs) has been conducted using machine learning approaches. We used random forest and a support vector machine (SVM) with a radial basis function (RBF) kernel to differentiate these two types of introns in Arabidopsis. By comparing coordinates of introns of all annotated mRNAs from TAIR10, we obtained our high-quality experimental data. To distinguish RIs from CSIs, we investigated the unique characteristics of RIs in comparison with CSIs and finally extracted 37 quantitative features: local and global nucleotide sequence features of introns, frequent motifs, the signal strength of splice sites, and the similarity between sequences of introns and their flanking regions. We demonstrated that our proposed feature extraction approach classified RIs from CSIs more accurately than four other approaches. The optimal penalty parameter C and the RBF kernel parameter in the SVM were set using a particle swarm optimization algorithm (PSOSVM). Our classification performance showed an F-measure of 80.8% (random forest) and 77.4% (PSOSVM). Not only were the basic sequence features and positional distribution characteristics of RIs obtained, but putative regulatory motifs in intron splicing were also predicted based on our feature extraction approach. Our study will facilitate a better understanding of the underlying mechanisms involved in intron retention.
Affiliation(s)
- Rui Mao
- College of Mechanical and Electronic Engineering, Northwest A&F University, Yangling, Shaanxi, China
- College of Information Engineering, Northwest A&F University, Yangling, Shaanxi, China
- Department of Biology, Miami University, Oxford, Ohio, United States of America
- Cheng Guo
- Department of Biology, Miami University, Oxford, Ohio, United States of America
- Yang Zhang
- College of Mechanical and Electronic Engineering, Northwest A&F University, Yangling, Shaanxi, China
- College of Information Engineering, Northwest A&F University, Yangling, Shaanxi, China
- Chun Liang
- Department of Biology, Miami University, Oxford, Ohio, United States of America
- Department of Computer Sciences and Software Engineering, Miami University, Oxford, Ohio, United States of America
9
Masso M, Chuang G, Hao K, Jain S, Vaisman II. Structure-based predictors of resistance to the HIV-1 integrase inhibitor Elvitegravir. Antiviral Res 2014; 106:5-12. [DOI: 10.1016/j.antiviral.2014.03.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 01/14/2014] [Revised: 03/14/2014] [Accepted: 03/17/2014] [Indexed: 12/15/2022]
10
Zhao N, Han JG, Shyu CR, Korkin D. Determining effects of non-synonymous SNPs on protein-protein interactions using supervised and semi-supervised learning. PLoS Comput Biol 2014; 10:e1003592. [PMID: 24784581 PMCID: PMC4006705 DOI: 10.1371/journal.pcbi.1003592] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.8] [Received: 10/07/2013] [Accepted: 03/13/2014] [Indexed: 12/31/2022]
Abstract
Single nucleotide polymorphisms (SNPs) are among the most common types of genetic variation in complex genetic disorders. A growing number of studies link the functional role of SNPs with the networks and pathways mediated by the disease-associated genes. For example, many non-synonymous missense SNPs (nsSNPs) have been found near or inside the protein-protein interaction (PPI) interfaces. Determining whether such an nsSNP will disrupt or preserve a PPI is a challenging task to address, both experimentally and computationally. Here, we present this task as three related classification problems, and develop a new computational method, called the SNP-IN tool (non-synonymous SNP INteraction effect predictor). Our method predicts the effects of nsSNPs on PPIs, given the interaction's structure. It leverages supervised and semi-supervised feature-based classifiers, including our new Random Forest self-learning protocol. The classifiers are trained on a dataset of comprehensive mutagenesis studies for 151 PPI complexes, with experimentally determined binding affinities of the mutant and wild-type interactions. Three classification problems were considered: (1) a 2-class problem (strengthening/weakening PPI mutations), (2) another 2-class problem (mutations that disrupt/preserve a PPI), and (3) a 3-class classification (detrimental/neutral/beneficial mutation effects). In total, 11 different supervised and semi-supervised classifiers were trained and assessed, resulting in promising performance, with the weighted f-measure ranging from 0.87 for Problem 1 to 0.70 for the most challenging Problem 3. By integrating prediction results of the 2-class classifiers into the 3-class classifier, we further improved its performance for Problem 3. To demonstrate the utility of the SNP-IN tool, it was applied to study the nsSNP-induced rewiring of two disease-centered networks.
The accurate and balanced performance of the SNP-IN tool makes it readily applicable to studying the rewiring of large-scale protein-protein interaction networks, and it can be useful for functional annotation of disease-associated SNPs. The SNP-IN tool is freely accessible as a web-server at http://korkinlab.org/snpintool/.
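The weighted f-measure reported in this abstract is commonly defined as the per-class F1 score averaged with weights proportional to each class's share of the true labels. The sketch below follows that common definition, which is an assumption here since the abstract does not spell out the formula; the function name and the example labels are hypothetical.

```python
def weighted_f_measure(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    total = len(y_true)
    score = 0.0
    for label in set(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Weight each class by its support (number of true instances).
        score += f1 * (tp + fn) / total
    return score
```

For a 3-class toy case with true labels `["d", "d", "n", "b"]` and predictions `["d", "n", "n", "b"]`, the per-class F1 values are 2/3, 2/3 and 1 with supports 2, 1 and 1, so the weighted f-measure is 0.75.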
Affiliation(s)
- Nan Zhao
- Informatics Institute, University of Missouri, Columbia, Missouri, United States of America
- Jing Ginger Han
- Informatics Institute, University of Missouri, Columbia, Missouri, United States of America
- Chi-Ren Shyu
- Informatics Institute, University of Missouri, Columbia, Missouri, United States of America
- Department of Computer Science, University of Missouri, Columbia, Missouri, United States of America
- Dmitry Korkin
- Informatics Institute, University of Missouri, Columbia, Missouri, United States of America
- Department of Computer Science, University of Missouri, Columbia, Missouri, United States of America
- Bond Life Science Center, University of Missouri, Columbia, Missouri, United States of America
11
Computational Approaches and Resources in Single Amino Acid Substitutions Analysis Toward Clinical Research. Adv Protein Chem Struct Biol 2014; 94:365-423. [DOI: 10.1016/b978-0-12-800168-4.00010-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 02/08/2023]
12
Predicting DNA binding proteins using support vector machine with hybrid fractal features. J Theor Biol 2013; 343:186-92. [PMID: 24189096 DOI: 10.1016/j.jtbi.2013.10.009] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 05/14/2013] [Revised: 08/12/2013] [Accepted: 10/17/2013] [Indexed: 11/20/2022]
Abstract
DNA-binding proteins play a vitally important role in many biological processes. Prediction of DNA-binding proteins from amino acid sequence is a significant but not yet fully resolved scientific problem. Chaos game representation (CGR) investigates the patterns hidden in protein sequences and visually reveals previously unknown structure. Fractal dimensions (FD) are good tools for measuring the sizes of complex, highly irregular geometric objects. In order to extract the intrinsic correlation with the DNA-binding property from protein sequences, the CGR algorithm, fractal dimension and amino acid composition are applied in this paper to formulate the numerical features of protein samples. Seven groups of features are extracted, all of which can be computed directly from the primary sequence, and each group is evaluated by the 10-fold cross-validation test and the jackknife test. Comparing the results of numerical experiments, the group combining amino acid composition and fractal dimension (a 21-dimensional vector) gives the best result, with an average accuracy of 81.82% and an average Matthews correlation coefficient (MCC) of 0.6017. The resulting predictor is also compared with the existing method DNA-Prot and shows better performance.
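Of the features named in this abstract, amino acid composition is the simplest to illustrate: it is the 20-dimensional vector of relative frequencies of the standard amino acids in a sequence. The sketch below shows only that component; the function name is an assumption, the handling of non-standard residue letters (ignored) is an assumption, and the fractal-dimension feature that completes the 21-dimensional vector is not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the 20 relative residue frequencies, in AMINO_ACIDS order.

    Letters outside the 20 standard residues are ignored (an assumption).
    """
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]
```

For the toy sequence `"ACCA"`, the entries for A and C are each 0.5 and the remaining 18 entries are 0.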
13
Using protein granularity to extract the protein sequence features. J Theor Biol 2013; 331:48-53. [DOI: 10.1016/j.jtbi.2013.04.019] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/09/2012] [Revised: 04/16/2013] [Accepted: 04/18/2013] [Indexed: 11/21/2022]
14
Jahandideh S, Zhi D. Systematic investigation of predicted effect of nonsynonymous SNPs in human prion protein gene: a molecular modeling and molecular dynamics study. J Biomol Struct Dyn 2013; 32:289-300. [PMID: 23527686 DOI: 10.1080/07391102.2012.763216] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 12/12/2022]
Abstract
Nonsynonymous mutations in the human prion protein (HuPrP) gene contribute to the conversion of HuPrP(C) to HuPrP(Sc) and to amyloid formation, which in turn leads to prion diseases such as familial Creutzfeldt-Jakob disease and Gerstmann-Straussler-Scheinker disease. In order to better understand and predict the role of HuPrP mutations, we developed the following procedure. First, we consulted the Human Genome Variation and dbSNP databases and reviewed the literature to retrieve aggregation-related nsSNPs of the HuPrP gene. Next, we used three different methods - Polymorphism Phenotyping (PolyPhen), PANTHER, and Auto-Mute - to predict the effect of nsSNPs on the phenotype. We compared the predictions against experimentally reported effects of these nsSNPs to evaluate the accuracy of the three methods: PolyPhen predicted 17 out of 22 nsSNPs as "probably damaging" or "possibly damaging"; PANTHER predicted 8 out of 22 nsSNPs as "Deleterious"; and Auto-Mute predicted 9 out of 20 nsSNPs as "Disease". Finally, the native protein was compared structurally with mutated models using molecular modeling and molecular dynamics (MD) simulation methods. In addition to comparing predictor methods, our results show the applicability of our procedure for the prediction of damaging nsSNPs. Our study also elucidates the relationship between predicted values of aggregation-related nsSNPs in the HuPrP gene and the results of molecular modeling and MD simulations. In conclusion, this procedure would enable researchers to select outstanding candidates for extensive MD simulations in order to decipher more details of HuPrP aggregation. An animated interactive 3D complement (I3DC) is available in Proteopedia at http://proteopedia.org/w/Journal:JBSD:34.
Affiliation(s)
- Samad Jahandideh
- Section on Statistical Genetics, Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL, 35294, USA
15
Chiechi A, Novello C, Magagnoli G, Petricoin EF, Deng J, Benassi MS, Picci P, Vaisman I, Espina V, Liotta LA. Elevated TNFR1 and serotonin in bone metastasis are correlated with poor survival following bone metastasis diagnosis for both carcinoma and sarcoma primary tumors. Clin Cancer Res 2013; 19:2473-85. [PMID: 23493346 DOI: 10.1158/1078-0432.ccr-12-3416] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Indexed: 12/15/2022]
Abstract
PURPOSE: There is an urgent need for therapies that will reduce the mortality of patients with bone metastasis. In this study, we profiled the protein signal pathway networks of the human bone metastasis microenvironment. The goal was to identify sets of interacting proteins that correlate with survival time following the first diagnosis of bone metastasis.
EXPERIMENTAL DESIGN: Using Reverse Phase Protein Microarray technology, we measured the expression of 88 end points in the bone microenvironment of 159 bone metastasis tissue samples derived from patients with primary carcinomas and sarcomas.
RESULTS: Metastases originating from different primary tumors showed similar levels of cell signaling across tissue types for the majority of proteins analyzed, suggesting that the bone microenvironment strongly influences the metastatic tumor signaling profiles. In a training set (72 samples), TNF receptor 1, alone (P = 0.0013) or combined with serotonin (P = 0.0004), TNFα (P = 0.0214), and RANK (P = 0.0226), was associated with poor survival, regardless of the primary tumor of origin. Results were confirmed by (i) analysis of an independent validation set (71 samples) and (ii) independent bioinformatic analysis using a support vector machine learning model. Spearman rho analysis revealed a highly significant number of interactions intersecting with ERα S118, serotonin, TNFα, RANKL, and matrix metalloproteinase in the bone metastasis signaling network, regardless of the primary tumor. The interaction network pattern was significantly different in the short versus long survivors.
CONCLUSIONS: TNF receptor 1 and neuroendocrine-regulated protein signal pathways seem to play an important role in bone metastasis and may constitute a novel drug-targetable mechanism of seed-soil cross talk in bone metastasis.
Affiliation(s)
- Antonella Chiechi
- Laboratory of Experimental Oncology, Istituto Ortopedico Rizzoli, Bologna, Italy
16
Qiu Z, Qin C, Jiu M, Wang X. A simple iterative method to optimize protein–ligand-binding residue prediction. J Theor Biol 2013; 317:219-23. [DOI: 10.1016/j.jtbi.2012.10.028] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 08/16/2012] [Revised: 10/19/2012] [Accepted: 10/22/2012] [Indexed: 11/15/2022]
17
Analyzing effects of naturally occurring missense mutations. Comput Math Methods Med 2012; 2012:805827. [PMID: 22577471 PMCID: PMC3346971 DOI: 10.1155/2012/805827] [Citation(s) in RCA: 82] [Impact Index Per Article: 6.8] [Received: 12/21/2011] [Revised: 02/01/2012] [Accepted: 02/01/2012] [Indexed: 11/17/2022]
Abstract
A single-point mutation in the genome, for example a single-nucleotide polymorphism (SNP) or a rare genetic mutation, is the change of a single nucleotide for another in the genome sequence. Some of these mutations produce an amino acid substitution in the corresponding protein sequence (missense mutations); others do not. This paper focuses on genetic mutations that change the amino acid sequence of the corresponding protein and on how to assess their effects on the characteristics of the wild-type protein. The existing methods and approaches for predicting the effects of mutation on protein stability, structure, and dynamics are outlined and discussed with respect to their underlying principles. Available resources, either as stand-alone applications or webservers, are pointed out as well. It is emphasized that understanding the molecular mechanisms behind the effects of these missense mutations is of critical importance for detecting disease-causing mutations. The paper also provides several examples of applying 3D structure-based methods to model the effects of missense mutations on protein stability and protein-protein interactions.
|
18
|
Chou KC, Wu ZC, Xiao X. iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. MOLECULAR BIOSYSTEMS 2012; 8:629-41. [PMID: 22134333 DOI: 10.1039/c1mb05420a] [Citation(s) in RCA: 270] [Impact Index Per Article: 22.5] [Indexed: 01/11/2023]
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, California 92130, USA.
|
19
|
Sequence-dependent prediction of recombination hotspots in Saccharomyces cerevisiae. J Theor Biol 2012; 293:49-54. [DOI: 10.1016/j.jtbi.2011.10.004] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.1] [Received: 06/22/2011] [Revised: 10/04/2011] [Accepted: 10/04/2011] [Indexed: 11/18/2022]
|
20
|
Henriksen SB, Mortensen RJ, Geertz-Hansen HM, Neves-Petersen MT, Arnason O, Söring J, Petersen SB. Hyperdimensional analysis of amino acid pair distributions in proteins. PLoS One 2011; 6:e25638. [PMID: 22174733 PMCID: PMC3235099 DOI: 10.1371/journal.pone.0025638] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/26/2011] [Accepted: 09/08/2011] [Indexed: 01/06/2023]
Abstract
Our manuscript presents a novel approach to protein structure analyses. We have organized an 8-dimensional data cube with protein 3D-structural information from 8706 high-resolution non-redundant protein chains with the aim of identifying packing rules at the amino acid pair level. The cube contains information about amino acid type, solvent accessibility, spatial and sequence distance, secondary structure and sequence length. We are able to pose structural queries to the data cube using the program ProPack. The response is a 1D, 2D or 3D graph. Whereas the response is of a statistical nature, the user can obtain an instant list of all PDB structures where such a pair is found. The user may select a particular structure, which is displayed highlighting the pair in question. The user may pose millions of different queries and will receive the answer to each one in a few seconds. In order to demonstrate the capabilities of the data cube as well as the programs, we have selected well-known structural features, disulphide bridges and salt bridges, where we illustrate how the queries are posed and how answers are given. Motifs involving cysteines such as disulphide bridges, zinc-fingers and iron-sulfur clusters are clearly identified and differentiated. ProPack also reveals that whereas pairs of Lys residues virtually never appear in close spatial proximity, pairs of Arg are abundant and appear at close spatial distance, contrasting the belief that electrostatic repulsion would prevent this juxtaposition and that Arg-Lys is perceived as a conservative mutation. The presented programs can find and visualize novel packing preferences in protein structures, allowing the user to unravel correlations between pairs of amino acids. The new tools allow the user to view statistical information and instantly visualize the structures that underpin it, which is far from trivial with most other software tools for protein structure analysis.
Affiliation(s)
- Svend B. Henriksen
- NanoBiotechnology Group, Department of Physics and Nanotechnology, Aalborg University, Aalborg, Denmark
- Rasmus J. Mortensen
- NanoBiotechnology Group, Department of Physics and Nanotechnology, Aalborg University, Aalborg, Denmark
- Henrik M. Geertz-Hansen
- NanoBiotechnology Group, Department of Physics and Nanotechnology, Aalborg University, Aalborg, Denmark
- Maria Teresa Neves-Petersen
- International Iberian Nanotechnology Laboratory (INL), Braga, Portugal
- Nanobiotechnology Group, Department of Biotechnology, Chemistry and Environmental Sciences, University of Aalborg, Aalborg, Denmark
- Omar Arnason
- NanoBiotechnology Group, Department of Physics and Nanotechnology, Aalborg University, Aalborg, Denmark
- Jón Söring
- NanoBiotechnology Group, Department of Physics and Nanotechnology, Aalborg University, Aalborg, Denmark
- Steffen B. Petersen
- Nanobiotechnology Group, Department of Health Science and Technology, Aalborg University, Aalborg, Denmark
- The Institute for Lasers, Photonics and Biophotonics, University at Buffalo, The State University of New York, Buffalo, New York, United States of America
|
21
|
Wavelet images and Chou’s pseudo amino acid composition for protein classification. Amino Acids 2011; 43:657-65. [DOI: 10.1007/s00726-011-1114-9] [Citation(s) in RCA: 72] [Impact Index Per Article: 5.5] [Received: 04/05/2010] [Accepted: 09/28/2011] [Indexed: 10/16/2022]
|
22
|
Lu JL, Hu XH, Hu DG. A new hybrid fractal algorithm for predicting thermophilic nucleotide sequences. J Theor Biol 2011; 293:74-81. [PMID: 22001320 DOI: 10.1016/j.jtbi.2011.09.028] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 07/10/2011] [Revised: 09/23/2011] [Accepted: 09/26/2011] [Indexed: 01/20/2023]
Abstract
Knowledge of the thermophilic mechanisms of organisms whose optimum growth temperature (OGT) ranges from 50 to 80 degrees plays a major role in helping to design stable proteins. How to predict whether a DNA sequence is thermophilic is a long-standing problem that has not been satisfactorily resolved. Chaos game representation (CGR) can investigate the patterns hidden in DNA sequences and can visually reveal previously unknown structure. Fractal dimensions are good tools for measuring the sizes of complex, highly irregular geometric objects. In this paper, we convert every DNA sequence into a high-dimensional vector using the CGR algorithm and fractal dimension, and then predict the thermostability of the DNA sequence from these fractal features with a support vector machine (SVM). We have conducted experiments on three groups: 17-dimensional, 65-dimensional, and 257-dimensional vectors. Each group is evaluated by the 10-fold cross-validation test. The 257-dimensional group gives the best results: the average accuracy is 0.9456 and the average MCC is 0.8878. The results are also compared with previous work using CGR features alone. The comparison shows the high effectiveness of the new hybrid fractal algorithm.
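As an illustration of the chaos game representation mentioned in this abstract, the following is a minimal Python sketch of the textbook CGR construction, not the authors' implementation. The corner assignment, the function names `cgr_points` and `cgr_grid`, and the grid parameter `k` are assumptions made for the example. The 17-, 65- and 257-dimensional vectors reported above would be consistent with 4^k grid counts (k = 2, 3, 4) plus one fractal-dimension feature, but the abstract does not state this, so treat it as an inference.

```python
# Assumed corner layout for the unit square; other layouts are also in use.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of a DNA sequence as a list of (x, y) points."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the corner of this base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cgr_grid(seq, k=2):
    """Count CGR points in a 2^k x 2^k grid (4^k cells in total)."""
    n = 2 ** k
    grid = [[0] * n for _ in range(n)]
    for x, y in cgr_points(seq):
        col = min(int(x * n), n - 1)
        row = min(int(y * n), n - 1)
        grid[row][col] += 1
    return grid
```

Flattening `cgr_grid(seq, k)` gives a fixed-length count vector that could be passed to any classifier; the SVM step described in the abstract is not reproduced here.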
Affiliation(s)
- Jin-Long Lu
- College of Science, Huazhong Agricultural University, Wuhan, PR China
|
23
|
OligoPred: A web-server for predicting homo-oligomeric proteins by incorporating discrete wavelet transform into Chou's pseudo amino acid composition. J Mol Graph Model 2011; 30:129-34. [DOI: 10.1016/j.jmgm.2011.06.014] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.5] [Received: 04/06/2011] [Revised: 06/18/2011] [Accepted: 06/30/2011] [Indexed: 01/13/2023]
|
24
|
Jingbo X, Silan Z, Feng S, Huijuan X, Xuehai H, Xiaohui N, Zhi L. Using the concept of pseudo amino acid composition to predict resistance gene against Xanthomonas oryzae pv. oryzae in rice: An approach from chaos games representation. J Theor Biol 2011; 284:16-23. [DOI: 10.1016/j.jtbi.2011.06.003] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 12/08/2010] [Revised: 06/02/2011] [Accepted: 06/03/2011] [Indexed: 10/18/2022]
|
25
|
Cheng F, Theodorescu D, Schulman IG, Lee JK. In vitro transcriptomic prediction of hepatotoxicity for early drug discovery. J Theor Biol 2011; 290:27-36. [PMID: 21884709 DOI: 10.1016/j.jtbi.2011.08.009] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Received: 05/27/2011] [Revised: 07/27/2011] [Accepted: 08/11/2011] [Indexed: 01/08/2023]
Abstract
Liver toxicity (hepatotoxicity) is a critical issue in drug discovery and development. Standard preclinical evaluation of drug hepatotoxicity is generally performed using in vivo animal systems. However, only a small number of preselected compounds can be examined in vivo due to high experimental costs. A more efficient yet accurate screening technique that can identify potentially hepatotoxic compounds in the early stages of drug development would thus be valuable. Here, we develop and apply a novel genomic prediction technique for screening hepatotoxic compounds based on in vitro human liver cell tests. Using a training set of in vivo rodent experiments for drug hepatotoxicity evaluation, we discovered common biomarkers of drug-induced liver toxicity among six heterogeneous compounds. This gene set was further triaged to a subset of 32 genes that can be used as a multi-gene expression signature to predict hepatotoxicity. This multi-gene predictor was independently validated and showed consistently high prediction performance on five test sets of in vitro human liver cell and in vivo animal toxicity experiments. The predictor also demonstrated utility in evaluating different degrees of toxicity in response to drug concentrations, which may be useful not only for discerning a compound's general hepatotoxicity but also for determining its toxic concentration.
Affiliation(s)
- Feng Cheng
- Department of Biophysics, University of Virginia, Charlottesville, VA, USA.
|
26
|
Nanni L, Lumini A, Gupta D, Garg A. Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou's pseudo amino acid composition and on evolutionary information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 9:467-475. [PMID: 21860064 DOI: 10.1109/tcbb.2011.117] [Citation(s) in RCA: 113] [Impact Index Per Article: 8.7] [Indexed: 05/31/2023]
Abstract
A reliable method for predicting bacterial virulent proteins has several important applications in research efforts aimed at finding novel drug targets and vaccine candidates and at understanding virulence mechanisms in pathogens. In this work, we have studied several feature extraction approaches for representing proteins and propose a novel bacterial virulent protein prediction method, based on an ensemble of classifiers where the features are extracted directly from the amino acid sequence and from the evolutionary information of a given protein. We have evaluated and compared several ensembles obtained by combining six feature extraction methods and several classification approaches based on two general-purpose classifiers (i.e., Support Vector Machine and a variant of input decimated ensemble) and their random subspace versions. An extensive evaluation was performed according to a blind testing protocol, in which the parameters of the system are optimized using the training set and the system is validated on three different independent data sets, allowing selection of the best-performing system and demonstrating the validity of the proposed method. Based on the results obtained using the blind test protocol, it is interesting to note that even though the best-performing stand-alone method is not always the same in each independent data set, the fusion of different methods enhances prediction efficiency in all the tested independent data sets.
|
27
|
Wang P, Xiao X, Chou KC. NR-2L: a two-level predictor for identifying nuclear receptor subfamilies based on sequence-derived features. PLoS One 2011; 6:e23505. [PMID: 21858146 PMCID: PMC3156231 DOI: 10.1371/journal.pone.0023505] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.5] [Received: 02/17/2011] [Accepted: 07/19/2011] [Indexed: 11/18/2022]
Abstract
Nuclear receptors (NRs) are one of the most abundant classes of transcriptional regulators in animals. They regulate diverse functions, such as homeostasis, reproduction, development and metabolism. Therefore, NRs are a very important target for drug development. Nuclear receptors form a superfamily of phylogenetically related proteins and have been subdivided into different subfamilies due to their domain diversity. In this study, a two-level predictor, called NR-2L, was developed that can be used to identify whether a query protein is a nuclear receptor based on its sequence information alone; if it is, the prediction automatically continues to identify it among the following seven subfamilies: (1) thyroid hormone like (NR1), (2) HNF4-like (NR2), (3) estrogen like, (4) nerve growth factor IB-like (NR4), (5) fushi tarazu-F1 like (NR5), (6) germ cell nuclear factor like (NR6), and (7) knirps like (NR0). The identification was made by the Fuzzy K nearest neighbor (FK-NN) classifier based on the pseudo amino acid composition formed by incorporating various physicochemical and statistical features derived from the protein sequences, such as amino acid composition, dipeptide composition, complexity factor, and low-frequency Fourier spectrum components. As a demonstration, it was shown on benchmark datasets with low redundancy derived from the NucleaRDB and UniProt that the overall success rates achieved by the jackknife test were about 93% and 89% at the first and second level, respectively. The high success rates indicate that the novel two-level predictor can be a useful vehicle for identifying NRs and their subfamilies. As a user-friendly web server, NR-2L is freely accessible at either http://icpr.jci.edu.cn/bioinfo/NR2L or http://www.jci-bioinfo.cn/NR2L. Each job submitted to NR-2L can contain up to 500 query protein sequences and be finished in less than 2 minutes. The fewer the query proteins, the shorter the time usually is. All the program codes for NR-2L are available for non-commercial purposes upon request.
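Two of the sequence-derived features named in this abstract, amino acid composition and dipeptide composition, can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the NR-2L code: the function names, the fixed alphabetical ordering of the 20 standard residues, and the choice to skip non-standard residues are all hypothetical, and the complexity factor and Fourier spectrum components are not covered.

```python
from itertools import product

# The 20 standard amino acids in a fixed (assumed) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the 20 normalized occurrence frequencies of the standard residues."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    total = len(residues)
    return [residues.count(a) / total if total else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return the 400 normalized frequencies of adjacent residue pairs."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    # Slide a window of width 2 along the sequence.
    for a, b in zip(residues, residues[1:]):
        counts[a + b] += 1
    total = max(len(residues) - 1, 0)
    return [counts[p] / total if total else 0.0 for p in pairs]
```

Concatenating the two lists gives a 420-dimensional vector of the kind a nearest-neighbor classifier could consume; how NR-2L weights or combines its features is not specified in the abstract.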
Affiliation(s)
- Pu Wang
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, China
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, China
- Gordon Life Science Institute, San Diego, California, United States of America
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, California, United States of America
|
28
|
Hu LL, Huang T, Cai YD, Chou KC. Prediction of body fluids where proteins are secreted into based on protein interaction network. PLoS One 2011; 6:e22989. [PMID: 21829572 PMCID: PMC3146524 DOI: 10.1371/journal.pone.0022989] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Received: 04/25/2011] [Accepted: 07/08/2011] [Indexed: 12/27/2022]
Abstract
Determining the body fluids into which proteins can be secreted is important for protein function annotation and disease biomarker discovery. In this study, we developed a network-based method to predict which kinds of body fluids human proteins can be secreted into. For a newly constructed benchmark dataset that consists of 529 human secreted proteins, the prediction accuracy for the most likely body fluid location predicted by our method via the jackknife test was 79.02%, significantly higher than the success rate of a random guess (29.36%). The likelihood that the top four predicted body fluids contain all the true body fluids into which a protein can be secreted is 62.94%. Our method was further demonstrated with two independent datasets: one contains 57 proteins that can be secreted into blood, while the other contains 61 proteins that can be secreted into plasma/serum and are possible biomarkers associated with various cancers. Of the 57 proteins in the first dataset, 55 were correctly predicted as blood-secreted proteins. Of the 61 proteins in the second dataset, 58 were predicted to be most likely in plasma/serum. These encouraging results indicate that the network-based prediction method is quite promising. It is anticipated that the method will benefit the relevant areas of both basic research and drug development.
Affiliation(s)
- Le-Le Hu
- Institute of Systems Biology, Shanghai University, Shanghai, China
- Department of Chemistry, College of Sciences, Shanghai University, Shanghai, China
- Tao Huang
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Shanghai Center for Bioinformation Technology, Shanghai, China
- Yu-Dong Cai
- Institute of Systems Biology, Shanghai University, Shanghai, China
- Centre for Computational Systems Biology, Fudan University, Shanghai, China
- Gordon Life Science Institute, San Diego, California, United States of America
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, California, United States of America
|
29
|
A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites. PLoS One 2011; 6:e20592. [PMID: 21698097 PMCID: PMC3117797 DOI: 10.1371/journal.pone.0020592] [Citation(s) in RCA: 176] [Impact Index Per Article: 13.5] [Received: 02/26/2011] [Accepted: 05/04/2011] [Indexed: 11/21/2022]
Abstract
Prediction of protein subcellular localization is a challenging problem, particularly when the system concerned contains both singleplex and multiplex proteins. In this paper, by introducing the “multi-label scale” and hybridizing the information of gene ontology with the sequential evolution information, a novel predictor called iLoc-Gneg is developed for predicting the subcellular localization of Gram-negative bacterial proteins with both single-location and multiple-location sites. To facilitate comparison, the same stringent benchmark dataset used to estimate the accuracy of Gneg-mPLoc was adopted to demonstrate the power of iLoc-Gneg. The dataset contains 1,392 Gram-negative bacterial proteins classified into the following eight locations: (1) cytoplasm, (2) extracellular, (3) fimbrium, (4) flagellum, (5) inner membrane, (6) nucleoid, (7) outer membrane, and (8) periplasm. Of the 1,392 proteins, 1,328 each have only one subcellular location and the other 64 each have two subcellular locations, but none of the proteins included has pairwise sequence identity to any other in the same subset (subcellular location). It was observed that the overall success rate of iLoc-Gneg by the jackknife test on such a stringent benchmark dataset was over 91%, which is about 6% higher than that of Gneg-mPLoc. As a user-friendly web-server, iLoc-Gneg is freely accessible to the public at http://icpr.jci.edu.cn/bioinfo/iLoc-Gneg. Meanwhile, a step-by-step guide is provided on how to use the web-server to get the desired results. Furthermore, for the user's convenience, the iLoc-Gneg web-server also accepts batch job submission, which is not available in the existing version of the Gneg-mPLoc web-server. It is anticipated that iLoc-Gneg may become a useful high-throughput tool for Molecular Cell Biology, Proteomics, System Biology, and Drug Development.
|
30
|
Roterman I, Konieczny L, Jurkowski W, Prymula K, Banach M. Two-intermediate model to characterize the structure of fast-folding proteins. J Theor Biol 2011; 283:60-70. [PMID: 21635900 DOI: 10.1016/j.jtbi.2011.05.027] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Received: 12/04/2010] [Revised: 05/17/2011] [Accepted: 05/18/2011] [Indexed: 01/15/2023]
Abstract
This paper introduces a new model that enables researchers to conduct protein folding simulations. A two-step in silico process is used in the course of structural analysis of a set of fast-folding proteins. The model assumes an early stage (ES) that depends solely on the backbone conformation, as described by its geometrical properties--specifically, by the V-angle between two sequential peptide bond planes (which determines the radius of curvature, also called R-radius, according to a second-degree polynomial form). The agreement between the structure under consideration and the assumed model is measured in terms of the magnitude of dispersion of both parameters with respect to idealized values. The second step, called late-stage folding (LS), is based on the "fuzzy oil drop" model, which involves an external hydrophobic force field described by a three-dimensional Gauss function. The degree of conformance between the structure under consideration and its idealized model is expressed quantitatively by means of the Kullback-Leibler entropy, which is a measure of disparity between the observed and expected hydrophobicity distributions. A set of proteins, representative of the fast-folding group - specifically, cold shock proteins - is shown to agree with the proposed model.
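The Kullback-Leibler measure used in this abstract to compare observed and expected hydrophobicity distributions can be illustrated with a short Python sketch. It assumes both distributions are given as non-negative per-residue values that are normalized to sum to one before comparison; the function name `kl_divergence` and the use of base-2 logarithms are choices made for this example, and the three-dimensional Gauss function that would supply the expected values in the "fuzzy oil drop" model is not reproduced.

```python
import math


def kl_divergence(observed, expected):
    """Kullback-Leibler divergence D(P || Q) in bits.

    `observed` and `expected` are sequences of non-negative values of equal
    length (for example, one hydrophobicity value per residue). Both are
    normalized to probability distributions before the comparison.
    """
    if len(observed) != len(expected):
        raise ValueError("distributions must have the same length")
    sum_obs, sum_exp = sum(observed), sum(expected)
    if sum_obs <= 0 or sum_exp <= 0:
        raise ValueError("distributions must have positive total mass")
    p = [o / sum_obs for o in observed]
    q = [e / sum_exp for e in expected]
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with zero observed mass contribute nothing
        if qi == 0:
            raise ValueError("expected value is zero where observed is positive")
        total += pi * math.log2(pi / qi)
    return total
```

A value of zero means the observed distribution matches the expected one exactly; larger values indicate a greater departure from the idealized model.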
Affiliation(s)
- I Roterman
- Department of Bioinformatics and Telemedicine, Jagiellonian University-Medical College, Lazarza 16, 31-530 Krakow, Poland.
|
31
|
Feature importance analysis in guide strand identification of microRNAs. Comput Biol Chem 2011; 35:131-6. [PMID: 21704258 DOI: 10.1016/j.compbiolchem.2011.04.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 01/22/2011] [Revised: 03/22/2011] [Accepted: 04/23/2011] [Indexed: 11/22/2022]
Abstract
MicroRNA (miRNA) is a negative regulator of gene expression, also known as the guide strand of the transient miRNA:miRNA* duplex. It is critical in maintaining normal physiological processes such as development, differentiation, and apoptosis in many organisms. With increasing miRNA data, it is desirable to design methods to identify the guide strand based on machine learning algorithms. In this study, random forest models based on local sequence-structure features were proposed to identify miRNA in four species. The accuracies achieved were 86.51% for Homo sapiens, 81.66% for Ornithorhynchus anatinus, 82.33% for Mus musculus and 85.71% for Schmidtea mediterranea. Furthermore, an importance analysis of the feature elements was carried out using the conditional feature importance strategy. The analysis revealed that most of the significant elements were related to the guanine-cytosine (GC) base pair. We believe that our method could be beneficial for annotating the function of miRNA and could help further the understanding of the RNA interference mechanism.
|
32
|
iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS One 2011; 6:e18258. [PMID: 21483473 PMCID: PMC3068162 DOI: 10.1371/journal.pone.0018258] [Citation(s) in RCA: 241] [Impact Index Per Article: 18.5] [Received: 12/17/2010] [Accepted: 02/24/2011] [Indexed: 12/26/2022]
Abstract
Predicting protein subcellular localization is an important and difficult problem, particularly when query proteins may have the multiplex character, i.e., simultaneously exist at, or move between, two or more different subcellular location sites. Most of the existing protein subcellular location predictors can only be used to deal with single-location or “singleplex” proteins. Actually, multiple-location or “multiplex” proteins should not be ignored because they usually possess some unique biological functions worthy of our special notice. By introducing the “multi-labeled learning” and “accumulation-layer scale”, a new predictor, called iLoc-Euk, has been developed that can be used to deal with systems containing both singleplex and multiplex proteins. As a demonstration, the jackknife cross-validation was performed with iLoc-Euk on a benchmark dataset of eukaryotic proteins classified into the following 22 location sites: (1) acrosome, (2) cell membrane, (3) cell wall, (4) centriole, (5) chloroplast, (6) cyanelle, (7) cytoplasm, (8) cytoskeleton, (9) endoplasmic reticulum, (10) endosome, (11) extracellular, (12) Golgi apparatus, (13) hydrogenosome, (14) lysosome, (15) melanosome, (16) microsome, (17) mitochondrion, (18) nucleus, (19) peroxisome, (20) spindle pole body, (21) synapse, and (22) vacuole, where none of the proteins included has pairwise sequence identity to any other in the same subset. The overall success rate thus obtained by iLoc-Euk was 79%, which is significantly higher than that of any of the existing predictors that also have the capacity to deal with such a complicated and stringent system. As a user-friendly web-server, iLoc-Euk is freely accessible to the public at the web-site http://icpr.jci.edu.cn/bioinfo/iLoc-Euk. It is anticipated that iLoc-Euk may become a useful bioinformatics tool for Molecular Cell Biology, Proteomics, System Biology, and Drug Development. Also, its novel approach will further stimulate the development of methods for predicting other protein attributes.
|
33
|
Wu ZC, Xiao X, Chou KC. iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites. MOLECULAR BIOSYSTEMS 2011; 7:3287-97. [PMID: 21984117 DOI: 10.1039/c1mb05232b] [Citation(s) in RCA: 181] [Impact Index Per Article: 13.9] [Indexed: 11/21/2022]
Affiliation(s)
- Zhi-Cheng Wu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333046, China
|
34
|
AFP-Pred: A random forest approach for predicting antifreeze proteins from sequence-derived properties. J Theor Biol 2010; 270:56-62. [PMID: 21056045 DOI: 10.1016/j.jtbi.2010.10.037] [Citation(s) in RCA: 191] [Impact Index Per Article: 13.6] [Received: 08/18/2010] [Revised: 10/29/2010] [Accepted: 10/29/2010] [Indexed: 12/11/2022]
Abstract
Some creatures living in extremely low temperatures can produce special materials called "antifreeze proteins" (AFPs), which can prevent the cell and body fluids from freezing. AFPs are present in vertebrates, invertebrates, plants, bacteria, fungi, etc. Although AFPs have a common function, they show a high degree of diversity in sequences and structures. Therefore, search methods based on sequence similarity often fail to predict AFPs from sequence databases. In this work, we report a random forest approach, "AFP-Pred", for the prediction of antifreeze proteins from protein sequence. AFP-Pred was trained on a dataset containing 300 AFPs and 300 non-AFPs and tested on a dataset containing 181 AFPs and 9193 non-AFPs. AFP-Pred achieved 81.33% accuracy in training and 83.38% in testing. The performance of AFP-Pred was compared with BLAST and HMM. The high prediction accuracy and the successful prediction of hypothetical proteins suggest that AFP-Pred can be a useful approach to identify antifreeze proteins from sequence information, irrespective of their sequence similarity.
|