1
Patiyal S, Tiwari P, Ghai M, Dhapola A, Dhall A, Raghava GPS. A hybrid approach for predicting transcription factors. Frontiers in Bioinformatics 2024; 4:1425419. [PMID: 39119181 PMCID: PMC11306938 DOI: 10.3389/fbinf.2024.1425419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Accepted: 07/03/2024] [Indexed: 08/10/2024] Open
Abstract
Transcription factors are essential DNA-binding proteins that regulate the transcription rate of several genes and control the expression of genes inside a cell. The prediction of transcription factors with high precision is important for understanding biological processes such as cell differentiation, intracellular signaling, and cell-cycle control. In this study, we developed a hybrid method that combines alignment-based and alignment-free methods for predicting transcription factors with higher accuracy. All models have been trained, tested, and evaluated on a large dataset that contains 19,406 transcription factors and 523,560 non-transcription factor protein sequences. To avoid biases in evaluation, the datasets were divided into training and validation/independent datasets, where 80% of the data was used for training, and the remaining 20% was used for external validation. In the case of alignment-free methods, models were developed using machine learning techniques and the composition-based features of a protein. Our best alignment-free model obtained an AUC of 0.97 on an independent dataset. In the case of the alignment-based method, we used BLAST at different cut-offs to predict the transcription factors. Although the alignment-based method demonstrated excellent performance, it was unable to cover all transcription factors due to instances of no hits. To combine the strengths of both methods, we developed a hybrid method that combines alignment-free and alignment-based methods. In the hybrid method, we added the scores of the alignment-free and alignment-based methods and achieved a maximum AUC of 0.99 on the independent dataset. The method proposed in this study performs better than existing methods. We incorporated the best models in the webserver/Python Package Index/standalone package of "TransFacPred" (https://webs.iiitd.edu.in/raghava/transfacpred).
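The score addition described in this abstract can be illustrated with a minimal sketch. The abstract only states that the alignment-free and alignment-based scores are added; the specific BLAST score values used here (+0.5 when the top hit is a transcription factor, -0.5 when it is not, 0 when there is no hit), the 0.5 decision threshold, and all function names are assumptions for illustration and are not taken from TransFacPred.

```python
from typing import Optional


def blast_score(top_hit_label: Optional[str], weight: float = 0.5) -> float:
    """Map a BLAST top-hit label to a score (assumed scheme, not from the paper).

    None means BLAST returned no hit, so the alignment-based part contributes 0.
    """
    if top_hit_label is None:
        return 0.0
    return weight if top_hit_label == "TF" else -weight


def hybrid_score(ml_probability: float, top_hit_label: Optional[str]) -> float:
    """Add the alignment-free (ML) score and the alignment-based (BLAST) score."""
    return ml_probability + blast_score(top_hit_label)


def predict(ml_probability: float, top_hit_label: Optional[str],
            threshold: float = 0.5) -> bool:
    """Call a sequence a transcription factor if the hybrid score reaches the threshold."""
    return hybrid_score(ml_probability, top_hit_label) >= threshold
```

With this scheme a sequence with no BLAST hit falls back to the machine learning score alone, which is how the hybrid covers the "no hits" cases the abstract mentions.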
Affiliation(s)
- Gajendra P. S. Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
2
Anteghini M, Santos VAMD, Saccenti E. PortPred: Exploiting deep learning embeddings of amino acid sequences for the identification of transporter proteins and their substrates. J Cell Biochem 2023; 124:1803-1824. [PMID: 37877557 DOI: 10.1002/jcb.30490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Revised: 09/29/2023] [Accepted: 10/03/2023] [Indexed: 10/26/2023]
Abstract
The physiology of every living cell is regulated at some level by transporter proteins which constitute a relevant portion of membrane-bound proteins and are involved in the movement of ions, small and macromolecules across bio-membranes. The importance of transporter proteins is unquestionable. The prediction and study of previously unknown transporters can lead to the discovery of new biological pathways, drugs and treatments. Here we present PortPred, a tool to accurately identify transporter proteins and their substrate starting from the protein amino acid sequence. PortPred successfully combines pre-trained deep learning-based protein embeddings and machine learning classification approaches and outperforms other state-of-the-art methods. In addition, we present a comparison of the most promising protein sequence embeddings (Unirep, SeqVec, ProteinBERT, ESM-1b) and their performances for this specific task.
Affiliation(s)
- Marco Anteghini
  - LifeGlimmer GmbH, Berlin, Germany
  - Department of Systems and Synthetic Biology, Wageningen University & Research, Wageningen WE, The Netherlands
  - Department of Visual and Data-Centric Computing, Zuse Institute Berlin, Berlin, Germany
- Vitor Ap Martins Dos Santos
  - LifeGlimmer GmbH, Berlin, Germany
  - Department of Bioprocess Engineering, Wageningen University & Research, Wageningen WE, The Netherlands
- Edoardo Saccenti
  - Department of Systems and Synthetic Biology, Wageningen University & Research, Wageningen WE, The Netherlands
3
Dixit R, Khambhati K, Supraja KV, Singh V, Lederer F, Show PL, Awasthi MK, Sharma A, Jain R. Application of machine learning on understanding biomolecule interactions in cellular machinery. Bioresource Technology 2023; 370:128522. [PMID: 36565819 DOI: 10.1016/j.biortech.2022.128522] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 10/29/2022] [Revised: 12/17/2022] [Accepted: 12/20/2022] [Indexed: 06/17/2023]
Abstract
Machine learning (ML) applications have become ubiquitous in all fields of research, including protein science and engineering. Apart from protein structure and mutation prediction, scientists are focusing on knowledge gaps with respect to the molecular mechanisms involved in protein binding and interactions with other components in experimental setups or the human body. Researchers are working on several wet-lab techniques and generating data for a better understanding of the concepts and mechanics involved. The volume of information on biomolecular structure, binding affinities, and structural fluctuations and movements is enormous, and it can be handled and analyzed by ML. This review therefore highlights the significance of ML in understanding biomolecular interactions while assisting in various fields of research such as drug discovery, nanomedicine, nanotoxicity and material science. The way ahead is for laboratory work and computational techniques to go hand in hand.
Affiliation(s)
- Rewati Dixit
  - Waste Treatment Laboratory, Department of Biochemical Engineering and Biotechnology, Indian Institute of Technology Delhi, Haus-khas, New Delhi 110016, India
- Khushal Khambhati
  - Department of Biosciences, School of Science, Indrashil University, Rajpur, Mehsana 382715, Gujarat, India
- Kolli Venkata Supraja
  - Waste Treatment Laboratory, Department of Biochemical Engineering and Biotechnology, Indian Institute of Technology Delhi, Haus-khas, New Delhi 110016, India
- Vijai Singh
  - Department of Biosciences, School of Science, Indrashil University, Rajpur, Mehsana 382715, Gujarat, India
- Franziska Lederer
  - Helmholtz-Zentrum Dresden-Rossendorf, Helmholtz Institute Freiberg for Resource Technology, Bautzner landstrasse 400, 01328 Dresden, Germany
- Pau-Loke Show
  - Zhejiang Provincial Key Laboratory for Subtropical Water Environment and Marine Biological Resources Protection, Wenzhou University, Wenzhou 325035, China; Department of Sustainable Engineering, Saveetha School of Engineering, SIMATS, Chennai 602105, India; Department of Chemical and Environmental Engineering, University of Nottingham, Malaysia, 43500 Semenyih, Selangor Darul Ehsan, Malaysia
- Mukesh Kumar Awasthi
  - College of Natural Resources and Environment, Northwest A&F University, Yangling 712100, China
- Abhinav Sharma
  - Institute Theory of Polymers, Leibniz Institute for Polymer Research, Hohe Strasse 6, 01069 Dresden, Germany
- Rohan Jain
  - Helmholtz-Zentrum Dresden-Rossendorf, Helmholtz Institute Freiberg for Resource Technology, Bautzner landstrasse 400, 01328 Dresden, Germany
4
Bagchi A. Latest trends in structure based drug design with protein targets. Advances in Protein Chemistry and Structural Biology 2019; 121:1-23. [PMID: 32312418 DOI: 10.1016/bs.apcsb.2019.11.008] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 12/13/2022]
Abstract
Structure-based drug design is an important endeavor in the field of structural bioinformatics. Previously, the entire process depended on wet-lab experiments to build libraries of ligand molecules, which were then tested to determine their binding efficacies with the protein target. However, that process is very lengthy and, above all, highly expensive. With the advent of supercomputers and increasing computational power, the search for suitable ligand molecules against target proteins has become more streamlined and cost-effective. Now the entire ligand search process is performed in silico with the help of virtual screening, molecular docking simulations and molecular dynamics studies. In the present chapter, a brief overview of the computational techniques involved in structure-based drug design is presented, with special emphasis on the thermodynamic principles behind the molecular interactions.
Affiliation(s)
- Angshuman Bagchi
  - Department of Biochemistry and Biophysics, University of Kalyani, Kalyani, West Bengal, India
5
NeuroPIpred: a tool to predict, design and scan insect neuropeptides. Sci Rep 2019; 9:5129. [PMID: 30914676 PMCID: PMC6435694 DOI: 10.1038/s41598-019-41538-x] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 10/12/2018] [Accepted: 03/05/2019] [Indexed: 12/15/2022] Open
Abstract
Insect neuropeptides and their associated receptors are among the potential targets for pest control. The present study describes in silico models developed using natural and modified insect neuropeptides for predicting and designing new neuropeptides. Amino acid composition analysis revealed a preference for residues C, D, E, F, G, N, S, and Y in insect neuropeptides. The positional residue preference analysis shows that in natural neuropeptides, residues like A, N, F, D, P, S, and I are preferred at the N terminus and residues like L, R, P, F, N, and G are preferred at the C terminus. Prediction models were developed using input features such as amino acid and dipeptide composition and binary profiles, and by implementing different machine learning techniques. The dipeptide composition-based SVM model performed best among all the models. For NeuroPIpred_DS1, the model achieved 86.50% accuracy and 0.73 MCC on the training dataset and 83.71% accuracy and 0.67 MCC on the validation dataset, whereas for NeuroPIpred_DS2, the model achieved 97.47% accuracy and 0.95 MCC on the training dataset and 97.93% accuracy and 0.96 MCC on the validation dataset. To assist researchers, we created a standalone version and a user-friendly web server, NeuroPIpred, available at https://webs.iiitd.edu.in/raghava/neuropipred.
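The composition features named in this abstract can be sketched as follows. This is a generic illustration of amino acid and dipeptide composition as percentages, under the common convention of dividing residue counts by sequence length and dipeptide counts by the number of overlapping pairs; the function names are hypothetical and the exact normalisation used by NeuroPIpred is not stated in the abstract.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq: str) -> dict:
    """Percentage of each standard residue in the sequence (20 features)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {aa: 100.0 * seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq: str) -> dict:
    """Percentage of each overlapping residue pair (400 features)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must have at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = len(seq) - 1  # number of overlapping pairs
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are ignored
            counts[pair] += 1
    return {dp: 100.0 * c / total for dp, c in counts.items()}
```

The resulting 20- or 400-dimensional vectors are the kind of fixed-length input a classifier such as an SVM expects, regardless of peptide length.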
6
Zhao H, Taherzadeh G, Zhou Y, Yang Y. Computational Prediction of Carbohydrate-Binding Proteins and Binding Sites. Curr Protoc Protein Sci 2018; 94:e75. [PMID: 30106511 DOI: 10.1002/cpps.75] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/27/2022]
Abstract
Protein-carbohydrate interaction is essential for biological systems, and carbohydrate-binding proteins (CBPs) are important targets when designing antiviral and anticancer drugs. Due to the high cost and difficulty associated with experimental approaches, many computational methods have been developed as complementary approaches to predict CBPs or carbohydrate-binding sites. However, most of these computational methods are not publicly available. Here, we provide a comprehensive review of related studies and demonstrate our two recently developed bioinformatics methods. The method SPOT-CBP is a template-based method for detecting CBPs based on structure through structural homology search combined with a knowledge-based scoring function. This method can yield model complex structure in addition to accurate prediction of CBPs. Furthermore, it has been observed that similarly accurate predictions can be made using structures from homology modeling, which has significantly expanded its applicability. The other method, SPRINT-CBH, is a de novo approach that predicts binding residues directly from protein sequences by using sequence information and predicted structural properties. This approach does not need structurally similar templates and thus is not limited by the current database of known protein-carbohydrate complex structures. These two complementary methods are available at https://sparks-lab.org. © 2018 by John Wiley & Sons, Inc.
Affiliation(s)
- Huiying Zhao
  - Sun Yat-Sen Memorial Hospital, Sun Yat-sen University, Guangzhou, China
- Ghazaleh Taherzadeh
  - School of Information and Communication Technology, Griffith University, Gold Coast, Queensland, Australia
- Yaoqi Zhou
  - School of Information and Communication Technology, Griffith University, Gold Coast, Queensland, Australia
  - Institute for Glycomics, Griffith University, Gold Coast, Queensland, Australia
- Yuedong Yang
  - School of Information and Communication Technology, Griffith University, Gold Coast, Queensland, Australia
  - Institute for Glycomics, Griffith University, Gold Coast, Queensland, Australia
  - School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
7
Harnessing the evolutionary information on oxygen binding proteins through Support Vector Machines based modules. BMC Res Notes 2018; 11:290. [PMID: 29751818 PMCID: PMC5948687 DOI: 10.1186/s13104-018-3383-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/16/2018] [Accepted: 04/30/2018] [Indexed: 02/06/2023] Open
Abstract
Objectives: With the arrival of free oxygen on the globe, aerobic life became possible. It has since become very clear that oxygen-binding proteins are widespread in the biosphere and are found in all groups of organisms, including prokaryotes and eukaryotes, as well as in fungi, plants, and animals. The exponential growth and availability of freshly annotated protein sequences in the databases motivated us to develop an improved version of “Oxypred” for identifying oxygen-binding proteins. Results: In this study, we have proposed a method for identifying oxy-proteins with two different sequence similarity cutoffs, 50 and 90%. Different amino acid composition-based Support Vector Machine models were developed, including evolutionary profiles in the form of a position-specific scoring matrix (PSSM). Fivefold cross-validation was applied to evaluate the prediction performance. We also compared our models with existing methods, which show nearly 97% recognition, whereas our newly developed models recognized almost 99.99 and 100% with the oxy-50 and oxy-90% similarity models, respectively. Our results show that our approaches are faster and achieve better prediction performance than the existing methods. The web server Oxypred2 was developed as an alternative method for identifying oxy-proteins, with additional modules including PSSM, and is available at http://bioinfo.imtech.res.in/servers/muthu/oxypred2/home.html. Electronic supplementary material: The online version of this article (10.1186/s13104-018-3383-9) contains supplementary material, which is available to authorized users.
8
Banno M, Komiyama Y, Cao W, Oku Y, Ueki K, Sumikoshi K, Nakamura S, Terada T, Shimizu K. Development of a sugar-binding residue prediction system from protein sequences using support vector machine. Comput Biol Chem 2016; 66:36-43. [PMID: 27889654 DOI: 10.1016/j.compbiolchem.2016.10.009] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 03/08/2016] [Revised: 10/05/2016] [Accepted: 10/23/2016] [Indexed: 11/16/2022]
Abstract
Several methods have been proposed for protein-sugar binding site prediction using machine learning algorithms. However, they are not effective at learning the various properties of binding site residues that arise from the various interactions between proteins and sugars. In this study, we classified sugars into acidic and nonacidic sugars and showed that their binding sites have different amino acid occurrence frequencies. Using this result, we developed sugar-binding residue predictors dedicated to the two classes of sugars: an acidic sugar binding predictor and a nonacidic sugar binding predictor. We also developed a combination predictor which combines the results of the two predictors. We showed that when a sugar is known to be an acidic sugar, the acidic sugar binding predictor achieves the best performance, and that when a sugar is known to be a nonacidic sugar or is not known to belong to either class, the combination predictor achieves the best performance. Our method uses only amino acid sequences for prediction. A support vector machine was used as the machine learning algorithm, and the position-specific scoring matrix created by the position-specific iterative basic local alignment search tool was used as the feature vector. We evaluated the performance of the predictors using five-fold cross-validation. We have launched our system as an open source freeware tool on the GitHub repository (https://doi.org/10.5281/zenodo.61513).
Affiliation(s)
- Masaki Banno
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-Ward, Tokyo 113-8657, Japan
- Yusuke Komiyama
  - Digital Content and Media Sciences Research Division, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-Ward, Tokyo 101-8430, Japan
- Wei Cao
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-Ward, Tokyo 113-8657, Japan
- Yuya Oku
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-Ward, Tokyo 113-8657, Japan
- Kokoro Ueki
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-Ward, Tokyo 113-8657, Japan
- Kazuya Sumikoshi
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-Ward, Tokyo 113-8657, Japan
- Shugo Nakamura
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-Ward, Tokyo 113-8657, Japan
- Tohru Terada
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-Ward, Tokyo 113-8657, Japan
- Kentaro Shimizu
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-Ward, Tokyo 113-8657, Japan
9
Taherzadeh G, Zhou Y, Liew AWC, Yang Y. Sequence-Based Prediction of Protein-Carbohydrate Binding Sites Using Support Vector Machines. J Chem Inf Model 2016; 56:2115-2122. [PMID: 27623166 DOI: 10.1021/acs.jcim.6b00320] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Indexed: 01/09/2023]
Abstract
Carbohydrate-binding proteins play significant roles in many diseases including cancer. Here, we established a machine-learning-based method (called sequence-based prediction of residue-level interaction sites of carbohydrates, SPRINT-CBH) to predict carbohydrate-binding sites in proteins using support vector machines (SVMs). We found that integrating evolution-derived sequence profiles with additional information on sequence and predicted solvent accessible surface area leads to a reasonably accurate, robust, and predictive method, with area under the receiver operating characteristic curve (AUC) of 0.78 and 0.77 and Matthews correlation coefficient of 0.34 and 0.29 for 10-fold cross validation and independent test, respectively, without balancing binding and nonbinding residues. The quality of the method is further demonstrated by having statistically significantly more binding residues predicted for carbohydrate-binding proteins than presumptive nonbinding proteins in the human proteome, and by the bias of rare alleles toward predicted carbohydrate-binding sites for nonsynonymous mutations from the 1000 genome project. SPRINT-CBH is available as an online server at http://sparks-lab.org/server/SPRINT-CBH.
Affiliation(s)
- Ghazaleh Taherzadeh
  - School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
- Yaoqi Zhou
  - School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
- Alan Wee-Chung Liew
  - School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
- Yuedong Yang
  - School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
10
Singh H, Srivastava HK, Raghava GPS. A web server for analysis, comparison and prediction of protein ligand binding sites. Biol Direct 2016; 11:14. [PMID: 27016210 PMCID: PMC4807588 DOI: 10.1186/s13062-016-0118-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 11/06/2015] [Accepted: 03/22/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: One of the major challenges in the field of system biology is to understand the interaction between a wide range of proteins and ligands. In the past, methods have been developed for predicting binding sites in a protein for a limited number of ligands. RESULTS: In order to address this problem, we developed a web server named 'LPIcom' to help users understand protein-ligand interaction. Analysis, comparison and prediction modules are available in the 'LPIcom' server to predict protein-ligand interacting residues for 824 ligands. Each ligand must have at least 30 protein binding sites in PDB. The analysis module of the server can identify the residues preferred in interaction and the binding motif for a given ligand; for example, the residues glycine, lysine and arginine are preferred in ATP binding sites. The comparison module of the server allows comparing protein-binding sites of multiple ligands to understand the similarity between ligands based on their binding sites. This module indicates that the ATP, ADP and GTP ligands are in the same cluster and thus their binding sites or interacting residues exhibit a high level of similarity. A propensity-based prediction module has been developed for predicting ligand-interacting residues in a protein for more than 800 ligands. In addition, a number of web-based tools have been integrated to facilitate users in creating web logo and two-sample between ligand interacting and non-interacting residues. CONCLUSIONS: In summary, this manuscript presents a web server for the analysis of ligand-interacting residues. This server is available for public use from URL http://crdd.osdd.net/raghava/lpicom.
Affiliation(s)
- Harinder Singh
  - Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, 160036, India
- Gajendra P S Raghava
  - Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, 160036, India
11
BacHbpred: Support Vector Machine Methods for the Prediction of Bacterial Hemoglobin-Like Proteins. Adv Bioinformatics 2016; 2016:8150784. [PMID: 27034664 PMCID: PMC4789356 DOI: 10.1155/2016/8150784] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/26/2015] [Revised: 01/21/2016] [Accepted: 01/26/2016] [Indexed: 11/27/2022] Open
Abstract
The recent upsurge in microbial genome data has revealed that hemoglobin-like (HbL) proteins may be widely distributed among bacteria and that some organisms may carry more than one HbL encoding gene. However, the discovery of HbL proteins has been limited to a small number of bacteria only. This study describes the prediction of HbL proteins and their domain classification using a machine learning approach. Support vector machine (SVM) models were developed for predicting HbL proteins based upon amino acid composition (AC), dipeptide composition (DC), hybrid method (AC + DC), and position specific scoring matrix (PSSM). In addition, we introduce for the first time a new prediction method based on max to min amino acid residue (MM) profiles. The average accuracy, standard deviation (SD), false positive rate (FPR), confusion matrix, and receiver operating characteristic (ROC) were analyzed. We also compared the performance of our proposed models in homology detection databases. The performance of the different approaches was estimated using fivefold cross-validation techniques. Prediction accuracy was further investigated through confusion matrix and ROC curve analysis. All experimental results indicate that the proposed BacHbpred can be a perspective predictor for determination of HbL related proteins. BacHbpred, a web tool, has been developed for HbL prediction.
12
Pai PP, Mondal S. MOWGLI: prediction of protein-MannOse interacting residues With ensemble classifiers usinG evoLutionary Information. J Biomol Struct Dyn 2015; 34:2069-83. [PMID: 26457920 DOI: 10.1080/07391102.2015.1106978] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 10/23/2022]
Abstract
Proteins interact with carbohydrates to perform various cellular interactions. Of the many carbohydrate ligands that proteins bind, mannose constitutes an important class, playing important roles in host defense mechanisms. Accurate identification of mannose-interacting residues (MIR) may provide important clues to decipher the underlying mechanisms of protein-mannose interactions during infections. This study proposes an approach using an ensemble of base classifiers for prediction of MIR using their evolutionary information in the form of a position-specific scoring matrix. The base classifiers are random forests trained on different subsets of the training data set Dset128 using 10-fold cross-validation. The optimized ensemble of base classifiers, MOWGLI, is then used to predict MIR on protein chains of the test data set Dtestset29, where it showed a promising performance with 92.0% accurate prediction. An overall improvement of 26.6% in precision was observed upon comparison with the state of the art. It is hoped that this approach, yielding enhanced predictions, could eventually be used for applications in drug design and vaccine development.
Affiliation(s)
- Priyadarshini P Pai
  - Department of Biological Sciences, Birla Institute of Technology and Science-Pilani, K.K. Birla Goa Campus, Near NH17 Bypass Road, Zuarinagar, Goa 403726, India
- Sukanta Mondal
  - Department of Biological Sciences, Birla Institute of Technology and Science-Pilani, K.K. Birla Goa Campus, Near NH17 Bypass Road, Zuarinagar, Goa 403726, India
13
Singh H, Singh S, Raghava GPS. In silico platform for predicting and initiating β-turns in a protein at desired locations. Proteins 2015; 83:910-21. [DOI: 10.1002/prot.24783] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 11/09/2014] [Revised: 02/09/2015] [Accepted: 02/14/2015] [Indexed: 11/09/2022]
Affiliation(s)
- Harinder Singh
  - Bioinformatics Center, Institute of Microbial Technology; Chandigarh India
- Sandeep Singh
  - Bioinformatics Center, Institute of Microbial Technology; Chandigarh India
14
Panwar B, Raghava GPS. Identification of protein-interacting nucleotides in a RNA sequence using composition profile of tri-nucleotides. Genomics 2015; 105:197-203. [PMID: 25640448 DOI: 10.1016/j.ygeno.2015.01.005] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 04/02/2014] [Revised: 01/21/2015] [Accepted: 01/23/2015] [Indexed: 10/24/2022]
Abstract
RNA-protein interactions play diverse roles in cells, so identifying the RNA-protein interface is essential for biologists to understand their function. In the past, several methods have been developed for predicting RNA-interacting residues in proteins, but limited efforts have been made to identify protein-interacting nucleotides in RNAs. To discriminate protein-interacting from non-interacting nucleotides, we used various classifiers (NaiveBayes, NaiveBayesMultinomial, BayesNet, ComplementNaiveBayes, MultilayerPerceptron, J48, SMO, RandomForest and SVM(light)) to develop prediction models using various features, and achieved a highest sensitivity of 83.92%, specificity of 84.82%, accuracy of 84.62% and Matthews correlation coefficient of 0.62 with SVM(light)-based models. We observed that certain tri-nucleotides such as ACA, ACC, AGA, CAC, CCA, GAG, UGA, and UUU are preferred in protein interaction. All the models were developed using a non-redundant dataset and evaluated using five-fold cross-validation. A web server called RNApin has been developed for the scientific community (http://crdd.osdd.net/raghava/rnapin/).
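The four measures reported in this abstract (sensitivity, specificity, accuracy and Matthews correlation coefficient) all derive from the binary confusion matrix. The sketch below uses the standard definitions; the function name and the convention of returning 0.0 for an undefined ratio are choices made here for illustration, and rates are returned as fractions rather than percentages.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary classification measures from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Convention chosen here: an undefined ratio (zero denominator) is 0.0.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),          # true positive rate
        "specificity": ratio(tn, tn + fp),          # true negative rate
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),   # ranges from -1 to +1
    }
```

Unlike accuracy, the Matthews correlation coefficient stays informative when interacting and non-interacting classes are imbalanced, which is why it is commonly reported alongside the other three.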
Affiliation(s)
- Bharat Panwar
  - Bioinformatics Centre, CSIR-Institute of Microbial Technology, Sector 39A, Chandigarh, India
- Gajendra P S Raghava
  - Bioinformatics Centre, CSIR-Institute of Microbial Technology, Sector 39A, Chandigarh, India. http://www.imtech.res.in/raghava/
15
Zuo YC, Su WX, Zhang SH, Wang SS, Wu CY, Yang L, Li GP. Discrimination of membrane transporter protein types using K-nearest neighbor method derived from the similarity distance of total diversity measure. Molecular BioSystems 2015; 11:950-7. [PMID: 25607774 DOI: 10.1039/c4mb00681j] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Indexed: 01/08/2023]
Abstract
Membrane transporters play crucial roles in the fundamental cellular processes of living organisms. Computational techniques are needed to annotate transporter functions. In this study, a multi-class K-nearest neighbor classifier based on the increment of diversity (KNN-ID) was developed to discriminate membrane transporter types, with the increment of diversity (ID) introduced as a novel similarity distance. Comparisons with multiple recently published methods showed that the proposed KNN-ID method outperformed the other methods, with more than 20% improvement in overall accuracy. The overall prediction accuracy reached 83.1% when K was set to 2. The prediction sensitivity reached 76.7%, 89.1% and 80.1% for channels/pores, electrochemical potential-driven transporters and primary active transporters, respectively. Discrimination and comparison between any two different classes of transporters further demonstrated that the proposed method is a potential classifier and will play a complementary role in facilitating the functional assignment of transporters.
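A K-nearest neighbor classifier that uses the increment of diversity as its distance can be sketched as below. The diversity measure D(X) = N log N - Σ n_i log n_i and ID(X, Y) = D(X + Y) - D(X) - D(Y) are the usual definitions of these quantities; representing each protein as a vector of non-negative counts (for example residue counts), the function names and the tie-breaking rule are assumptions made for this illustration and are not details given in the abstract.

```python
import math
from collections import Counter


def diversity(counts) -> float:
    """D(X) = N log N - sum(n_i log n_i) for a vector of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(c * math.log(c) for c in counts if c > 0)


def increment_of_diversity(x, y) -> float:
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); small values mean similar distributions."""
    combined = [a + b for a, b in zip(x, y)]
    return diversity(combined) - diversity(x) - diversity(y)


def knn_id_predict(query, train_vectors, train_labels, k: int = 2):
    """Majority vote among the k training vectors with the smallest ID to the query."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: increment_of_diversity(query, train_vectors[i]))
    votes = Counter(train_labels[i] for i in ranked[:k])
    # On a tied vote the label of the nearer neighbor wins, because Counter
    # keeps insertion order and the neighbors are inserted nearest first.
    return votes.most_common(1)[0][0]
```

For example, `knn_id_predict([8, 2, 0], [[10, 0, 0], [9, 1, 0], [0, 10, 0], [0, 9, 1]], ["a", "a", "b", "b"])` selects the two "a" vectors as nearest neighbors. The default `k=2` mirrors the value of K reported in the abstract.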
Affiliation(s)
- Yong-Chun Zuo
- The Key Laboratory of Mammalian Reproductive Biology and Biotechnology of the Ministry of Education, College of Life Sciences, Inner Mongolia University, Hohhot, 010021, China.
|
16
|
Gupta S, Chavan S, Deobagkar DN, Deobagkar DD. Bio/chemoinformatics in India: an outlook. Brief Bioinform 2014; 16:710-31. [PMID: 25159593 DOI: 10.1093/bib/bbu028] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/21/2014] [Accepted: 07/28/2014] [Indexed: 12/25/2022] Open
Abstract
With the advent of significant establishment and development of Internet facilities and computational infrastructure, an overview of bio/chemoinformatics is presented along with its multidisciplinary facts, promises and challenges. The Government of India has paved the way for more profound research in the biological field through computational facilities and schemes/projects for collaboration among scientists from different disciplines. Simultaneously, the growth of available biomedical data has provided fresh insight into the nature of redundant and compensatory data. Today, bioinformatics research in India is characterized by powerful grid computing systems, a great variety of biological questions addressed and close collaborations between scientists and clinicians, with a full spectrum of focuses ranging from database building and methods development to biological discoveries. This outlook provides a resourceful platform highlighting the funding agencies, institutes and industries working in this direction, which would be of great help to students seeking a career in bioinformatics. In short, this review highlights the current trends, education, status, diverse applicability and demands for further development in bio/chemoinformatics.
|
17
|
Mishra NK, Chang J, Zhao PX. Prediction of membrane transport proteins and their substrate specificities using primary sequence information. PLoS One 2014; 9:e100278. [PMID: 24968309 PMCID: PMC4072671 DOI: 10.1371/journal.pone.0100278] [Citation(s) in RCA: 74] [Impact Index Per Article: 6.7] [Received: 03/03/2014] [Accepted: 05/23/2014] [Indexed: 11/18/2022] Open
Abstract
Background: Membrane transport proteins (transporters) move hydrophilic substrates across hydrophobic membranes and play vital roles in most cellular functions. Transporters represent a diverse group of proteins that differ in topology, energy coupling mechanism, and substrate specificity as well as sequence similarity. Among the functional annotations of transporters, information about their transporting substrates is especially important. The experimental identification and characterization of transporters is currently costly and time-consuming. The development of robust bioinformatics-based methods for the prediction of membrane transport proteins and their substrate specificities is therefore an important and urgent task. Results: Support vector machine (SVM)-based computational models, which comprehensively utilize integrative protein sequence features such as amino acid composition, dipeptide composition, physico-chemical composition, biochemical composition, and position-specific scoring matrices (PSSM), were developed to predict the substrate specificity of seven transporter classes: amino acid, anion, cation, electron, protein/mRNA, sugar, and other transporters. An additional model to differentiate transporters from non-transporters was also developed. Among the developed models, the biochemical composition and PSSM hybrid model outperformed other models and achieved an overall average prediction accuracy of 76.69% with a Matthews correlation coefficient (MCC) of 0.49 and a receiver operating characteristic area under the curve (AUC) of 0.833 on our main dataset. This model also achieved an overall average prediction accuracy of 78.88% and MCC of 0.41 on an independent dataset. Conclusions: Our analyses suggest that evolutionary information (i.e., the PSSM) and the AAIndex are key features for the substrate specificity prediction of transport proteins. In comparison, similarity-based methods such as BLAST, PSI-BLAST, and hidden Markov models do not provide accurate predictions for the substrate specificity of membrane transport proteins. TrSSP: The Transporter Substrate Specificity Prediction Server, a web server that implements the SVM models developed in this paper, is freely available at http://bioinfo.noble.org/TrSSP.
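Two of the sequence features named in this abstract, amino acid composition and dipeptide composition, can be sketched as below. This is a minimal illustration that reports fractions over the 20 standard residues; the paper's exact normalization and handling of non-standard residues are not stated in the abstract, so those choices and the function names are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """20-dimensional vector: fraction of each standard residue in `seq`."""
    residues = [aa for aa in seq.upper() if aa in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(aa) / n if n else 0.0 for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """400-dimensional vector: fraction of each overlapping residue pair."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Pairs containing a non-standard residue are skipped (an assumption).
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    n = len(pairs)
    return [pairs.count(dp) / n if n else 0.0 for dp in DIPEPTIDES]


# Toy sequence, for illustration only.
aac = amino_acid_composition("AAC")
dpc = dipeptide_composition("AAC")
```

Both vectors have a fixed length regardless of sequence length, which is what lets them be fed directly to a classifier such as an SVM.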
Affiliation(s)
- Nitish K. Mishra
- Plant Biology Division, The Samuel Roberts Noble Foundation, Ardmore, Oklahoma, United States of America
- Junil Chang
- Plant Biology Division, The Samuel Roberts Noble Foundation, Ardmore, Oklahoma, United States of America
- Patrick X. Zhao
- Plant Biology Division, The Samuel Roberts Noble Foundation, Ardmore, Oklahoma, United States of America
|
18
|
Malik A, Lee J, Lee J. Community-based network study of protein-carbohydrate interactions in plant lectins using glycan array data. PLoS One 2014; 9:e95480. [PMID: 24755681 PMCID: PMC3995809 DOI: 10.1371/journal.pone.0095480] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 12/10/2013] [Accepted: 03/27/2014] [Indexed: 12/14/2022] Open
Abstract
Lectins play major roles in biological processes such as immune recognition and regulation, inflammatory responses, cytokine signaling, and cell adhesion. Recently, glycan microarrays have been shown to play key roles in understanding glycobiology, allowing us to study the relationship between the specificities of glycan-binding proteins and their natural ligands at the omics scale. However, one of the drawbacks in utilizing glycan microarray data is the lack of systematic analysis tools to extract information. In this work, we attempt to group various lectins and their interacting carbohydrates by using community-based analysis of a lectin-carbohydrate network. The network consists of 1119 nodes and 16769 edges, and we have identified 3 lectins with large degrees of connectivity that play the roles of hubs. The community-based network analysis provides an easy way to obtain a general picture of the lectin-glycan interaction and many statistically significant functional groups.
Affiliation(s)
- Adeel Malik
- Center for In Silico Protein Science, School of Computational Sciences, Korea Institute for Advanced Study, Seoul, Korea
- Juyong Lee
- Center for In Silico Protein Science, School of Computational Sciences, Korea Institute for Advanced Study, Seoul, Korea
- Jooyoung Lee
- Center for In Silico Protein Science, School of Computational Sciences, Korea Institute for Advanced Study, Seoul, Korea
|
19
|
Singh H, Ansari HR, Raghava GPS. Improved method for linear B-cell epitope prediction using antigen's primary sequence. PLoS One 2013; 8:e62216. [PMID: 23667458 PMCID: PMC3646881 DOI: 10.1371/journal.pone.0062216] [Citation(s) in RCA: 207] [Impact Index Per Article: 17.3] [Received: 02/15/2013] [Accepted: 03/18/2013] [Indexed: 11/18/2022] Open
Abstract
One of the major challenges in designing a peptide-based vaccine is the identification of antigenic regions in an antigen that can stimulate a B-cell response, also called B-cell epitopes. In the past, several methods have been developed for the prediction of conformational and linear (or continuous) B-cell epitopes. However, the existing methods for predicting linear B-cell epitopes are far from perfect. In this study, an attempt has been made to develop an improved method for predicting linear B-cell epitopes. We retrieved experimentally validated B-cell epitopes as well as non-B-cell epitopes from the Immune Epitope Database and derived two types of datasets, called Lbtope_Variable and Lbtope_Fixed length datasets. The Lbtope_Variable dataset contains 14876 B-cell epitopes and 23321 non-epitopes of variable length, whereas the Lbtope_Fixed length dataset contains 12063 B-cell epitopes and 20589 non-epitopes of fixed length. We also evaluated the performance of models on the above datasets after removing highly identical peptides from the datasets. In addition, we derived a third dataset, Lbtope_Confirm, having 1042 epitopes and 1795 non-epitopes, where each epitope or non-epitope has been experimentally validated in at least two studies. A number of models have been developed to discriminate epitopes and non-epitopes using different machine-learning techniques such as Support Vector Machine and K-Nearest Neighbor. We achieved accuracy from ∼54% to 86% using diverse features such as binary profile, dipeptide composition, and AAP (amino acid pair) profile. In this study, for the first time, experimentally validated non-B-cell epitopes have been used for developing a method for predicting linear B-cell epitopes. In previous studies, random peptides have been used as non-B-cell epitopes. In order to provide a service to the scientific community, a web server, LBtope, has been developed for predicting and designing B-cell epitopes (http://crdd.osdd.net/raghava/lbtope/).
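The binary profile feature mentioned in this abstract is usually a one-hot encoding of each residue in a fixed-length peptide. The sketch below assumes 20 positions per residue in alphabetical one-letter order and an all-zero block for any non-standard residue; neither detail is specified in the abstract, and the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_profile(peptide):
    """Concatenate a 20-bit one-hot vector for each residue of `peptide`.

    A peptide of length L gives a vector of length 20 * L. Residues outside
    the 20 standard amino acids are encoded as all zeros (an assumption).
    """
    vector = []
    for aa in peptide.upper():
        one_hot = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(aa)
        if index >= 0:
            one_hot[index] = 1
        vector.extend(one_hot)
    return vector


# Toy two-residue peptide, for illustration only.
profile = binary_profile("AC")
```

Because the vector length depends on peptide length, this encoding fits the fixed-length dataset naturally, while composition-based features suit variable-length peptides.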
Affiliation(s)
- Harinder Singh
- Bioinformatics Center, Institute of Microbial Technology, Chandigarh, India
|
20
|
Panwar B, Gupta S, Raghava GPS. Prediction of vitamin interacting residues in a vitamin binding protein using evolutionary information. BMC Bioinformatics 2013; 14:44. [PMID: 23387468 PMCID: PMC3577447 DOI: 10.1186/1471-2105-14-44] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 07/18/2012] [Accepted: 01/31/2013] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND: Vitamins are important cofactors in various enzymatic reactions. In the past, many inhibitors have been designed against vitamin-binding pockets in order to inhibit vitamin-protein interactions. Thus, it is important to identify vitamin-interacting residues in a protein. It is possible to detect vitamin-binding pockets on a protein if its tertiary structure is known. Unfortunately, tertiary structures are available for only a limited number of proteins. Therefore, it is important to develop in silico models for predicting vitamin-interacting residues in a protein from its primary structure. RESULTS: In this study, we first compared protein-interacting residues of vitamins with those of other ligands using Two Sample Logo (TSL). It was observed that ATP, GTP, NAD, FAD and mannose preferred {G,R,K,S,H}, {G,K,T,S,D,N}, {T,G,Y}, {G,Y,W} and {Y,D,W,N,E} residues respectively, whereas vitamins preferred {Y,F,S,W,T,G,H} residues for the interaction with proteins. Furthermore, compositional information of preferred and non-preferred residues, along with pattern specificity, was also observed within different vitamin classes. Vitamins A, B and B6 preferred {F,I,W,Y,L,V}, {S,Y,G,T,H,W,N,E} and {S,T,G,H,Y,N} interacting residues respectively. This suggested that protein-binding patterns of vitamins differ from those of other ligands, and motivated us to develop separate predictors for vitamins and their sub-classes. Four prediction modules, (i) vitamin interacting residues (VIRs), (ii) vitamin-A interacting residues (VAIRs), (iii) vitamin-B interacting residues (VBIRs) and (iv) pyridoxal-5-phosphate (vitamin B6) interacting residues (PLPIRs), have been developed. We applied various classifiers, including SVM, BayesNet, NaiveBayes, ComplementNaiveBayes, NaiveBayesMultinomial, RandomForest and IBk, as machine learning techniques, using binary and Position-Specific Scoring Matrix (PSSM) features of protein sequences. Finally, we selected the best-performing SVM modules and obtained the highest MCC of 0.53, 0.48, 0.61 and 0.81 for VIRs, VAIRs, VBIRs and PLPIRs respectively, using PSSM-based evolutionary information. All the modules developed in this study have been trained and tested on non-redundant datasets and evaluated using five-fold cross-validation. The performances were also evaluated on balanced and different independent datasets. CONCLUSIONS: This study demonstrates that it is possible to predict VIRs, VAIRs, VBIRs and PLPIRs from the evolutionary information of a protein sequence. In order to provide a service to the scientific community, we have developed a web server and standalone software, VitaPred (http://crdd.osdd.net/raghava/vitapred/).
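Five-fold cross-validation, the evaluation scheme named in this abstract, splits the data into five parts and tests on each part in turn while training on the other four. The sketch below shows only the index bookkeeping; the shuffling, the seed and the function name are assumptions, and the paper's own fold construction (for example, splitting by protein versus by residue) is not described in the abstract.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Indices are shuffled once with a fixed seed, then dealt round-robin
    into k folds so that fold sizes differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Ten hypothetical samples split into five folds.
splits = list(k_fold_indices(10, k=5))
```

Each sample appears in exactly one test fold, so averaging the per-fold scores uses every sample once for testing.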
Affiliation(s)
- Bharat Panwar
- Bioinformatics Centre, Institute of Microbial Technology (CSIR), Chandigarh, India
|
21
|
Chauhan JS, Bhat AH, Raghava GPS, Rao A. GlycoPP: a webserver for prediction of N- and O-glycosites in prokaryotic protein sequences. PLoS One 2012; 7:e40155. [PMID: 22808107 PMCID: PMC3392279 DOI: 10.1371/journal.pone.0040155] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.4] [Received: 03/14/2012] [Accepted: 06/01/2012] [Indexed: 12/30/2022] Open
Abstract
Glycosylation is one of the most abundant post-translational modifications (PTMs) required for various structure/function modulations of proteins in a living cell. Although elucidated only recently in prokaryotes, this type of PTM is present across all three domains of life. In prokaryotes, two types of protein-glycan linkage are most widespread, namely N-linked, where a glycan moiety is attached to the amide group of Asn, and O-linked, where a glycan moiety is attached to the hydroxyl group of Ser/Thr/Tyr. Owing to their biologically ubiquitous nature, significance, and technological applications, the study of prokaryotic glycoproteins is a fast-emerging area of research. Here we describe new Support Vector Machine (SVM) based algorithms (models) developed for predicting glycosylated residues (glycosites) with high accuracy in prokaryotic protein sequences. The models are based on binary profile of patterns, composition profile of patterns, and position-specific scoring matrix profile of patterns as training features. The study employs an extensive dataset of 107 N-linked and 116 O-linked glycosites extracted from 59 experimentally characterized glycoproteins of prokaryotes. This dataset includes validated N-glycosites from phyla Crenarchaeota, Euryarchaeota (domain Archaea), Proteobacteria (domain Bacteria) and validated O-glycosites from phyla Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria (domain Bacteria). In view of the current understanding that glycosylation occurs on folded proteins in bacteria, hybrid models have been developed using information on predicted secondary structures and accessible surface area in various combinations with training features. Using these models, N-glycosites and O-glycosites could be predicted with an accuracy of 82.71% (MCC 0.65) and 73.71% (MCC 0.48), respectively. An evaluation of the best-performing models with 28 independent prokaryotic glycoproteins confirms the suitability of these models for predicting N- and O-glycosites in potential glycoproteins from the aforementioned organisms with reasonably high confidence. A web server, GlycoPP, implementing these models is freely available at http://www.imtech.res.in/raghava/glycopp/.
Affiliation(s)
- Jagat S. Chauhan
- Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research, Chandigarh, India
- Adil H. Bhat
- Protein Science and Engineering, Institute of Microbial Technology, Council of Scientific and Industrial Research, Chandigarh, India
- Gajendra P. S. Raghava
- Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research, Chandigarh, India
- Alka Rao
- Protein Science and Engineering, Institute of Microbial Technology, Council of Scientific and Industrial Research, Chandigarh, India
|