1
Jia P, Zhang F, Wu C, Li M. A comprehensive review of protein-centric predictors for biomolecular interactions: from proteins to nucleic acids and beyond. Brief Bioinform 2024; 25:bbae162. [PMID: 38739759 PMCID: PMC11089422 DOI: 10.1093/bib/bbae162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2024] [Revised: 02/17/2024] [Accepted: 03/31/2024] [Indexed: 05/16/2024] Open
Abstract
Proteins interact with diverse ligands to perform a large number of biological functions, such as gene expression and signal transduction. Accurate identification of these protein-ligand interactions is crucial to the understanding of molecular mechanisms and the development of new drugs. However, traditional biological experiments are time-consuming and expensive. With the development of high-throughput technologies, an increasing amount of protein data is available. In the past decades, many computational methods have been developed to predict protein-ligand interactions. Here, we review a comprehensive set of over 160 protein-ligand interaction predictors, which cover protein-protein, protein-nucleic acid, protein-peptide and protein-other ligands (nucleotide, heme, ion) interactions. We have carried out a comprehensive analysis of the above four types of predictors from several significant perspectives, including their inputs, feature profiles, models, availability, etc. The current methods primarily rely on protein sequences, especially utilizing evolutionary information. The significant improvement in predictions is attributed to deep learning methods. Additionally, sequence-based pretrained models and structure-based approaches are emerging as new trends.
Affiliation(s)
- Pengzhen Jia
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- Fuhao Zhang
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
  - College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi 712100, China
- Chaojin Wu
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- Min Li
  - School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
2
Rao B, Yu X, Bai J, Hu J. E2EATP: Fast and High-Accuracy Protein-ATP Binding Residue Prediction via Protein Language Model Embedding. J Chem Inf Model 2024; 64:289-300. [PMID: 38127815 DOI: 10.1021/acs.jcim.3c01298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2023]
Abstract
Identifying the ATP-binding sites of proteins is fundamentally important for uncovering the mechanisms of protein function and for drug discovery. Many computational methods have been proposed to predict ATP-binding sites. However, because the quality of feature representation is limited, prediction performance still has considerable room for improvement. In this study, we propose an end-to-end deep learning model, E2EATP, to extract more discriminative information from a protein sequence and improve ATP-binding site prediction. Concretely, we employ a pretrained deep learning-based protein language model (ESM2) to automatically extract high-latent discriminative representations of protein sequences relevant for protein functions. Based on ESM2, we design a residual convolutional neural network to train a protein-ATP binding site prediction model. Furthermore, a weighted focal loss function is used to reduce the negative impact of imbalanced data on the model training stage. Experimental results on the two independent testing data sets demonstrate that E2EATP could achieve higher Matthews correlation coefficient and AUC values than most existing state-of-the-art prediction methods. E2EATP is also much faster (about 0.05 s per protein) than the other existing prediction methods. Detailed data analyses show that the major advantage of E2EATP lies in its use of the pretrained protein language model, which extracts more discriminative information from the protein sequence alone. The standalone package of E2EATP is freely available for academic use at https://github.com/jun-csbio/e2eatp/.
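The abstract names a weighted focal loss but does not give its formula. As a rough illustration only, the sketch below uses the common alpha-balanced focal loss, in which each residue contributes -alpha_t * (1 - p_t)^gamma * log(p_t). The function name and the default values of `alpha` and `gamma` are assumptions for illustration, not values taken from E2EATP.

```python
import math


def weighted_focal_loss(probs, labels, alpha=0.75, gamma=2.0, eps=1e-7):
    """Mean alpha-balanced focal loss over per-residue binding probabilities.

    probs  -- predicted probability that each residue binds ATP
    labels -- 1 for binding residues, 0 for non-binding residues
    alpha  -- weight on the (rare) positive class; hypothetical default
    gamma  -- focusing exponent that down-weights easy residues; hypothetical default
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equally long")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)          # keep log() finite
        p_t = p if y == 1 else 1.0 - p           # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)
```

With `gamma=0` the expression reduces to a class-weighted cross-entropy; larger `gamma` values shrink the contribution of residues the model already classifies confidently, which is why this family of losses is often used on imbalanced binding-site data.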
Affiliation(s)
- Bing Rao
  - School of Information and Electrical Engineering, Hangzhou City University, Hangzhou 310015, China
- Xuan Yu
  - Glasgow College, University of Electronic Science and Technology of China, Chengdu 611731, China
- Jie Bai
  - School of Information and Electrical Engineering, Hangzhou City University, Hangzhou 310015, China
- Jun Hu
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
3
Roy BG, Choi J, Fuchs MF. Predictive Modeling of Proteins Encoded by a Plant Virus Sheds a New Light on Their Structure and Inherent Multifunctionality. Biomolecules 2024; 14:62. [PMID: 38254661 PMCID: PMC10813169 DOI: 10.3390/biom14010062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Revised: 12/29/2023] [Accepted: 12/30/2023] [Indexed: 01/24/2024] Open
Abstract
Plant virus genomes encode proteins that are involved in replication, encapsidation, cell-to-cell, and long-distance movement, avoidance of host detection, counter-defense, and transmission from host to host, among other functions. Even though the multifunctionality of plant viral proteins is well documented, contemporary functional repertoires of individual proteins are incomplete. However, these can be enhanced by modeling tools. Here, predictive modeling of proteins encoded by the two genomic RNAs, i.e., RNA1 and RNA2, of grapevine fanleaf virus (GFLV) and their satellite RNAs by a suite of protein prediction software confirmed not only previously validated functions (suppressor of RNA silencing [VSR], viral genome-linked protein [VPg], protease [Pro], symptom determinant [Sd], homing protein [HP], movement protein [MP], coat protein [CP], and transmission determinant [Td]) and previously identified putative functions (helicase [Hel] and RNA-dependent RNA polymerase [Pol]), but also predicted novel functions with varying levels of confidence. These include a T3/T7-like RNA polymerase domain for protein 1AVSR, a short-chain reductase for protein 1BHel/VSR, a parathyroid hormone family domain for protein 1EPol/Sd, overlapping domains of unknown function and an ABC transporter domain for protein 2BMP, and DNA topoisomerase domains, transcription factor FBXO25 domain, or DNA Pol subunit cdc27 domain for the satellite RNA protein. Structural predictions for proteins 2AHP/Sd, 2BMP, and 3A? had low confidence, while predictions for proteins 1AVSR, 1BHel*/VSR, 1CVPg, 1DPro, 1EPol*/Sd, and 2CCP/Td retained higher confidence in at least one prediction. This research provided new insights into the structure and functions of GFLV proteins and their satellite protein. Future work is needed to validate these findings.
Affiliation(s)
- Brandon G. Roy
  - Plant Pathology and Plant-Microbe Biology Section, School of Integrative Plant Science, Cornell University, 15 Castle Creek Drive, Geneva, NY 14456, USA; (J.C.); (M.F.F.)
4
Guan S, Zou Q, Wu H, Ding Y. Protein-DNA Binding Residues Prediction Using a Deep Learning Model With Hierarchical Feature Extraction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:2619-2628. [PMID: 35834447 DOI: 10.1109/tcbb.2022.3190933] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Biologically important effects occur when proteins bind to other substances, of which binding to DNA is a crucial one. Therefore, accurate identification of protein-DNA binding residues is important for further understanding of the protein-DNA interaction mechanism. Although wet-lab methods can accurately locate bound residues, they require significant human, financial and time costs. There is thus an urgent need to develop efficient computational methods. Most current state-of-the-art methods are two-step approaches: the first step uses a sliding window technique to extract residue features; the second step uses each residue as an input to the model for prediction. This has a negative impact on the efficiency of prediction and ease of use. In this study, we propose a sequence-to-sequence (seq2seq) model that takes the entire protein sequence of variable length as input and uses two modules, Transformer Encoder Block and Feature Extracting Block, for hierarchical feature extraction: Transformer Encoder Block extracts global features, and Feature Extracting Block then extracts local features to further improve the recognition capability of the model. The comparison results on two benchmark datasets, namely PDNA-543 and PDNA-41, demonstrate the effectiveness of our method in identifying protein-DNA binding residues.
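The two-step approach described in this abstract starts by cutting a fixed-size window around every residue. A minimal sketch of that first step follows, assuming zero-padding at the sequence ends and a flattened window per residue; the function name, the padding choice and the window size are illustrative and not taken from any of the cited methods.

```python
def sliding_windows(features, window=5):
    """Return one flattened feature window per residue.

    features -- list of per-residue feature vectors (lists of floats)
    window   -- odd window size; the target residue sits in the middle
    Positions beyond either end of the sequence are zero-padded (an assumption).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so the residue is centred")
    if not features:
        return []
    half = window // 2
    pad = [0.0] * len(features[0])
    padded = [pad] * half + list(features) + [pad] * half
    # Window i covers padded[i : i + window], centred on original residue i.
    return [
        [value for row in padded[i:i + window] for value in row]
        for i in range(len(features))
    ]
```

Each residue then becomes a separate model input, so a protein of length L yields L overlapping inputs. That per-residue duplication is the overhead a whole-sequence model avoids.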
5
Bartas M, Slychko K, Červeň J, Pečinka P, Arndt-Jovin DJ, Jovin TM. Extensive Bioinformatics Analyses Reveal a Phylogenetically Conserved Winged Helix (WH) Domain (Zτ) of Topoisomerase IIα, Elucidating Its Very High Affinity for Left-Handed Z-DNA and Suggesting Novel Putative Functions. Int J Mol Sci 2023; 24:10740. [PMID: 37445918 DOI: 10.3390/ijms241310740] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Revised: 06/13/2023] [Accepted: 06/22/2023] [Indexed: 07/15/2023] Open
Abstract
The dynamic processes operating on genomic DNA, such as gene expression and cellular division, lead inexorably to topological challenges in the form of entanglements, catenanes, knots, "bubbles", R-loops, and other outcomes of supercoiling and helical disruption. The resolution of toxic topological stress is the function attributed to DNA topoisomerases. A prominent example is the negative supercoiling (nsc) trailing processive enzymes such as DNA and RNA polymerases. The multiple equilibrium states that nscDNA can adopt by redistribution of helical twist and writhe include the left-handed double-helical conformation known as Z-DNA. Thirty years ago, one of our labs isolated a protein from Drosophila cells and embryos with a 100-fold greater affinity for Z-DNA than for B-DNA, and identified it as topoisomerase II (gene Top2, orthologous to the human UniProt proteins TOP2A and TOP2B). GTP increased the affinity and selectivity for Z-DNA even further and also led to inhibition of the isomerase enzymatic activity. An allosteric mechanism was proposed, in which topoII acts as a Z-DNA-binding protein (ZBP) to stabilize given states of topological (sub)domains and associated multiprotein complexes. We have now explored this possibility by comprehensive bioinformatic analyses of the available protein sequences of topoII representing organisms covering the whole tree of life. Multiple alignment of these sequences revealed an extremely high level of evolutionary conservation, including a winged-helix protein segment, here denoted as Zτ, constituting the putative structural homolog of Zα, the canonical Z-DNA/Z-RNA binding domain previously identified in the interferon-inducible RNA Adenosine-to-Inosine-editing deaminase, ADAR1p150. In contrast to Zα, which is separate from the protein segment responsible for catalysis, Zτ encompasses the active site tyrosine of topoII; a GTP-binding site and a GxxG sequence motif are in close proximity. 
Quantitative Zτ-Zα similarity comparisons and molecular docking with interaction scoring further supported the "B-Z-topoII hypothesis" and have led to an expanded mechanism for topoII function incorporating the recognition of Z-DNA segments ("Z-flipons") as an inherent and essential element. We further propose that the two Zτ domains of the topoII homodimer exhibit a single-turnover "conformase" activity on given G(ate) B-DNA segments ("Z-flipins"), inducing their transition to the left-handed Z-conformation. Inasmuch as the topoII-Z-DNA complexes are isomerase inactive, we infer that they fulfill important structural roles in key processes such as mitosis. Topoisomerases are preeminent targets of anti-cancer drug discovery, and we anticipate that detailed elucidation of their structural-functional interactions with Z-DNA and GTP will facilitate the design of novel, more potent and selective anti-cancer chemotherapeutic agents.
Affiliation(s)
- Martin Bartas
  - Department of Biology and Ecology, University of Ostrava, 710 00 Ostrava, Czech Republic
- Kristyna Slychko
  - Department of Biology and Ecology, University of Ostrava, 710 00 Ostrava, Czech Republic
- Jiří Červeň
  - Department of Biology and Ecology, University of Ostrava, 710 00 Ostrava, Czech Republic
- Petr Pečinka
  - Department of Biology and Ecology, University of Ostrava, 710 00 Ostrava, Czech Republic
- Donna J Arndt-Jovin
  - Emeritus Laboratory of Cellular Dynamics, Max Planck Institute for Multidisciplinary Sciences, 37077 Göttingen, Germany
- Thomas M Jovin
  - Emeritus Laboratory of Cellular Dynamics, Max Planck Institute for Multidisciplinary Sciences, 37077 Göttingen, Germany
6
Halgasova N, Javorova R, Bocanova L, Krajcikova D, Bauer JA, Bukovska G. Characterization of a newly discovered putative DNA replication initiator from Paenibacillus polymyxa phage phiBP. Microbiol Res 2023; 274:127437. [PMID: 37327604 DOI: 10.1016/j.micres.2023.127437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Revised: 06/08/2023] [Accepted: 06/10/2023] [Indexed: 06/18/2023]
Abstract
The bacteriophage phiBP contains a newly discovered putative replisome organizer, a helicase loader, and a beta clamp, which together may serve to replicate its DNA. Bioinformatics analysis of the phiBP replisome organizer sequence showed that it belongs to a recently identified family of putative initiator proteins. We prepared and isolated a wild type-like recombinant protein, gpRO-HC, and a mutant protein gpRO-HCK8A, containing a lysine to alanine substitution at position 8. gpRO-HC had low ATPase activity regardless of the presence of DNA, while the ATPase activity of the mutant was significantly higher. gpRO-HC bound to both single- and double-stranded DNA substrates. Different methods showed that gpRO-HC forms higher oligomers containing about 12 subunits. This work provides the first information about another group of phage initiator proteins, which trigger DNA replication in phages infecting low GC Gram-positive bacteria.
Affiliation(s)
- Nora Halgasova
  - Department of Genomics and Biotechnology, Institute of Molecular Biology, Slovak Academy of Sciences, Dubravska cesta 21, 845 51 Bratislava, Slovakia.
- Rachel Javorova
  - Department of Genomics and Biotechnology, Institute of Molecular Biology, Slovak Academy of Sciences, Dubravska cesta 21, 845 51 Bratislava, Slovakia.
- Lucia Bocanova
  - Department of Genomics and Biotechnology, Institute of Molecular Biology, Slovak Academy of Sciences, Dubravska cesta 21, 845 51 Bratislava, Slovakia.
- Daniela Krajcikova
  - Department of Microbial Genetics, Institute of Molecular Biology, Slovak Academy of Sciences, Dubravska cesta 21, 845 51 Bratislava, Slovakia.
- Jacob A Bauer
  - Department of Biochemistry and Protein Structure, Institute of Molecular Biology, Slovak Academy of Sciences, Dubravska cesta 21, 845 51 Bratislava, Slovakia.
- Gabriela Bukovska
  - Department of Genomics and Biotechnology, Institute of Molecular Biology, Slovak Academy of Sciences, Dubravska cesta 21, 845 51 Bratislava, Slovakia.
7
Azucenas CR, Ruwe TA, Bonamer JP, Qiao B, Ganz T, Jormakka M, Nemeth E, Mackenzie B. Comparative analysis of the functional properties of human and mouse ferroportin. Am J Physiol Cell Physiol 2023; 324:C1110-C1118. [PMID: 36939203 PMCID: PMC10191125 DOI: 10.1152/ajpcell.00063.2023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2023] [Revised: 03/16/2023] [Accepted: 03/16/2023] [Indexed: 03/21/2023]
Abstract
Ferroportin (Fpn), expressed at the plasma membrane of macrophages, enterocytes, and hepatocytes, mediates the transfer of cellular iron into the blood plasma. Under the control of the iron-regulatory hormone hepcidin, Fpn serves a critical role in systemic iron homeostasis. Although we have previously characterized human Fpn, a great deal of research in iron homeostasis and disorders uses mouse models. By way of example, the flatiron mouse, a model of classical ferroportin disease, bears the mutation H32R in Fpn and is characterized by systemic iron deficiency and macrophage iron retention. The flatiron mouse also appears to exhibit a manganese phenotype, raising the possibility that mouse Fpn serves a role in manganese metabolism. At odds with this observation, we have found that human Fpn does not transport manganese, so we considered the possibility that a species difference could explain this discrepancy. We tested the hypothesis that mouse but not human Fpn can transport manganese and performed a comparative analysis of mouse and human Fpn. We examined the functional properties of human Fpn, mouse Fpn, and mutant mouse Fpn by using radiotracer assays in RNA-injected Xenopus oocytes. We found that neither mouse nor human Fpn transports manganese. Mouse and human Fpn share identical properties with respect to substrate profile, calcium dependence, optimal pH, and hepcidin sensitivity. We have also demonstrated that Fpn is not an ATPase pump. Our findings validate the use of mouse models of ferroportin function in iron homeostasis and disease.
Affiliation(s)
- Corbin R Azucenas
  - Department of Pharmacology & Systems Physiology, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States
  - Medical Sciences Program, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States
  - Systems Biology & Physiology Program, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States
- T Alex Ruwe
  - Department of Pharmacology & Systems Physiology, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States
  - Systems Biology & Physiology Program, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States
- John P Bonamer
  - Department of Pharmacology & Systems Physiology, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States
- Bo Qiao
  - Department of Medicine, David Geffen School of Medicine at UCLA, Los Angeles, California, United States
- Tomas Ganz
  - Department of Medicine, David Geffen School of Medicine at UCLA, Los Angeles, California, United States
  - Department of Pathology, David Geffen School of Medicine at UCLA, Los Angeles, California, United States
- Mika Jormakka
  - Department of Cell Biology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Elizabeta Nemeth
  - Department of Medicine, David Geffen School of Medicine at UCLA, Los Angeles, California, United States
- Bryan Mackenzie
  - Department of Pharmacology & Systems Physiology, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States
  - Medical Sciences Program, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States
  - Systems Biology & Physiology Program, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States
8
Rashid S, Sundaram S, Kwoh CK. Empirical Study of Protein Feature Representation on Deep Belief Networks Trained With Small Data for Secondary Structure Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:955-966. [PMID: 35439138 DOI: 10.1109/tcbb.2022.3168676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Protein secondary structure (SS) prediction is a classic problem of computational biology and is widely used in structural characterization and to infer homology. While most SS predictors have been trained on thousands of sequences, a previous approach had developed a compact model of training proteins that used a C-Alpha, C-Beta Side Chain (CABS)-algorithm derived energy based feature representation. Here, the previous approach is extended to Deep Belief Networks (DBN). Deep learning methods are notorious for requiring large datasets, and there is a wide consensus that training deep models from scratch on small datasets works poorly. By contrast, we demonstrate a simple DBN architecture containing a single hidden layer, trained only on the CB513 dataset. Testing on an independent set of G Switch proteins improved the Q3 score of the previous compact model by almost 3%. The findings are further confirmed by comparison to several deep learning models which are trained on thousands of proteins. Finally, the DBN performance is also compared with Position Specific Scoring Matrix (PSSM)-profile based feature representation. The importance of (i) structural information in protein feature representation and (ii) complementary small dataset learning approaches for detection of structural fold switching are demonstrated.
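The Q3 score mentioned in this abstract is the standard three-state accuracy: the percentage of residues whose predicted state (helix, strand or coil) matches the observed state. A minimal sketch follows, assuming both structures are given as equal-length strings over the letters H, E and C; the function name and the example strings are illustrative.

```python
def q3_score(predicted, observed):
    """Percentage of residues whose predicted 3-state label matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and equally long")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Toy example: five of the six residues agree.
print(round(q3_score("HHHEEC", "HHHECC"), 1))  # 83.3
```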
9
Liao J, Wang Q, Wu F, Huang Z. In Silico Methods for Identification of Potential Active Sites of Therapeutic Targets. Molecules 2022; 27:7103. [PMID: 36296697 PMCID: PMC9609013 DOI: 10.3390/molecules27207103] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/30/2022] [Revised: 08/12/2022] [Accepted: 08/25/2022] [Indexed: 07/30/2023] Open
Abstract
Target identification is an important step in drug discovery, and computer-aided drug target identification methods are attracting more attention compared with traditional drug target identification methods, which are time-consuming and costly. Computer-aided drug target identification methods can greatly reduce the searching scope of experimental targets and associated costs by identifying the disease-related targets and their binding sites and evaluating the druggability of the predicted active sites for clinical trials. In this review, we introduce the principles of computer-based active site identification methods, including the identification of binding sites and assessment of druggability. We provide some guidelines for selecting methods for the identification of binding sites and assessment of druggability. In addition, we list the databases and tools commonly used with these methods, present examples of individual and combined applications, and compare the methods and tools. Finally, we discuss the challenges and limitations of binding site identification and druggability assessment at the current stage and provide some recommendations and future perspectives.
Affiliation(s)
- Jianbo Liao
  - Key Laboratory of Big Data Mining and Precision Drug Design of Guangdong Medical University, Key Laboratory of Computer-Aided Drug Design of Dongguan City, Key Laboratory for Research and Development of Natural Drugs of Guangdong Province, School of Pharmacy, Guangdong Medical University, Dongguan 523808, China
  - The Second School of Clinical Medicine, Guangdong Medical University, Dongguan 523808, China
- Qinyu Wang
  - Key Laboratory of Big Data Mining and Precision Drug Design of Guangdong Medical University, Key Laboratory of Computer-Aided Drug Design of Dongguan City, Key Laboratory for Research and Development of Natural Drugs of Guangdong Province, School of Pharmacy, Guangdong Medical University, Dongguan 523808, China
- Fengxu Wu
  - Hubei Key Laboratory of Wudang Local Chinese Medicine Research, School of Pharmaceutical Sciences, Hubei University of Medicine, Shiyan 442000, China
- Zunnan Huang
  - Key Laboratory of Big Data Mining and Precision Drug Design of Guangdong Medical University, Key Laboratory of Computer-Aided Drug Design of Dongguan City, Key Laboratory for Research and Development of Natural Drugs of Guangdong Province, School of Pharmacy, Guangdong Medical University, Dongguan 523808, China
  - Marine Biomedical Research Institute of Guangdong Zhanjiang, Zhanjiang 524023, China
10
Premachandran K, Srinivasan TS. In silico modelling and interactive profiling of BPH resistance NBS-LRR proteins with salivary specific proteins of rice planthoppers. Gene Reports 2022. [DOI: 10.1016/j.genrep.2022.101648] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
11
Patiyal S, Dhall A, Raghava GPS. A deep learning-based method for the prediction of DNA interacting residues in a protein. Brief Bioinform 2022; 23:6658239. [PMID: 35943134 DOI: 10.1093/bib/bbac322] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/25/2022] [Revised: 07/01/2022] [Accepted: 07/15/2022] [Indexed: 11/13/2022] Open
Abstract
DNA-protein interaction is one of the most crucial interactions in biological systems and decides the fate of many processes such as transcription, regulation and splicing of genes. In this study, we trained our models on a training dataset of 646 DNA-binding proteins having 15 636 DNA interacting and 298 503 non-interacting residues. Our trained models were evaluated on an independent dataset of 46 DNA-binding proteins having 965 DNA interacting and 9911 non-interacting residues. All proteins in the independent dataset have less than 30% sequence similarity with proteins in the training dataset. A wide range of models based on traditional machine learning and deep learning (1D-CNN) techniques have been developed using binary, physicochemical properties and Position-Specific Scoring Matrix (PSSM)/evolutionary profiles. Among the machine learning techniques, an eXtreme Gradient Boosting-based model achieved a maximum area under the receiver operating characteristic (AUROC) curve of 0.77 on the independent dataset using the PSSM profile. The deep learning-based model achieved the highest AUROC of 0.79 on the independent dataset using a combination of all three profiles. We evaluated the performance of existing methods on the independent dataset and observed that our proposed method outperformed all the existing methods. To serve the scientific community, we developed standalone software and a web server, which are accessible from https://webs.iiitd.edu.in/raghava/dbpred.
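The training set described here holds roughly 19 non-interacting residues for every interacting one (298 503 versus 15 636), which is one reason a threshold-free metric such as AUROC is reported. As an illustration of what that number measures, the sketch below computes AUROC directly from its pairwise definition: the fraction of (interacting, non-interacting) residue pairs in which the interacting residue receives the higher score, with ties counted as one half. The function name and the toy scores are assumptions; the quadratic loop is only suitable for small examples.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```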
Affiliation(s)
- Sumeet Patiyal
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Anjali Dhall
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
- Gajendra P S Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
12
Shi W, Singha M, Pu L, Srivastava G, Ramanujam J, Brylinski M. GraphSite: Ligand Binding Site Classification with Deep Graph Learning. Biomolecules 2022; 12:1053. [PMID: 36008947 PMCID: PMC9405584 DOI: 10.3390/biom12081053] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/03/2022] [Revised: 07/18/2022] [Accepted: 07/20/2022] [Indexed: 12/10/2022] Open
Abstract
The binding of small organic molecules to protein targets is fundamental to a wide array of cellular functions. It is also routinely exploited to develop new therapeutic strategies against a variety of diseases. On that account, the ability to effectively detect and classify ligand binding sites in proteins is of paramount importance to modern structure-based drug discovery. These complex and non-trivial tasks require sophisticated algorithms from the field of artificial intelligence to achieve a high prediction accuracy. In this communication, we describe GraphSite, a deep learning-based method utilizing a graph representation of local protein structures and a state-of-the-art graph neural network to classify ligand binding sites. Using neural weighted message passing layers to effectively capture the structural, physicochemical, and evolutionary characteristics of binding pockets mitigates model overfitting and improves the classification accuracy. Indeed, comprehensive cross-validation benchmarks against a large dataset of binding pockets belonging to 14 diverse functional classes demonstrate that GraphSite yields the class-weighted F1-score of 81.7%, outperforming other approaches such as molecular docking and binding site matching. Further, it also generalizes well to unseen data with the F1-score of 70.7%, which is the expected performance in real-world applications. We also discuss new directions to improve and extend GraphSite in the future.
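The abstract reports a class-weighted F1-score without defining it. A minimal sketch follows, assuming it means the per-class F1 averaged with weights proportional to each class's number of true examples (the convention scikit-learn calls `average="weighted"`). The function name and the toy labels are illustrative, not taken from GraphSite.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and equally long")
    support = Counter(y_true)                     # true examples per class
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0     # F1 = 2TP / (2TP + FP + FN)
        total += n * f1
    return total / len(y_true)
```

Weighting by support keeps a large pocket class from being drowned out by many small ones, but it also means rare classes contribute little to the headline number.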
Affiliation(s)
- Wentao Shi
  - Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA 70803, USA; (W.S.); (J.R.)
- Manali Singha
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA; (M.S.); (G.S.)
- Limeng Pu
  - Center for Computation and Technology, Louisiana State University, Baton Rouge, LA 70803, USA;
- Gopal Srivastava
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA; (M.S.); (G.S.)
- Jagannathan Ramanujam
  - Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA 70803, USA; (W.S.); (J.R.)
  - Center for Computation and Technology, Louisiana State University, Baton Rouge, LA 70803, USA;
- Michal Brylinski
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA; (M.S.); (G.S.)
  - Center for Computation and Technology, Louisiana State University, Baton Rouge, LA 70803, USA;
  - Correspondence: ; Tel.: +1-(225)-578-2791; Fax: +1-(225)-578-2597
13
Yamaguchi S, Nakashima H, Moriwaki Y, Terada T, Shimizu K. Prediction of protein mononucleotide binding sites using AlphaFold2 and machine learning. Comput Biol Chem 2022; 100:107744. [DOI: 10.1016/j.compbiolchem.2022.107744] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2022] [Revised: 07/12/2022] [Accepted: 07/22/2022] [Indexed: 11/26/2022]
14
Chelur VR, Priyakumar UD. BiRDS - Binding Residue Detection from Protein Sequences Using Deep ResNets. J Chem Inf Model 2022; 62:1809-1818. [PMID: 35414182 DOI: 10.1021/acs.jcim.1c00972] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 11/29/2022]
Abstract
Protein-drug interactions play important roles in many biological processes and therapeutics. Predicting the binding sites of a protein helps to discover such interactions. New drugs can be designed to optimize these interactions, improving protein function. The tertiary structure of a protein decides the binding sites available to the drug molecule, but the determination of the 3D structure is slow and expensive. Conversely, the determination of the amino acid sequence is swift and economical. Although quick and accurate prediction of the binding site using just the sequence is challenging, the application of Deep Learning, which has been hugely successful in several biochemical tasks, makes it feasible. BiRDS is a Residual Neural Network that predicts the protein's most active binding site using sequence information. SC-PDB, an annotated database of druggable binding sites, is used for training the network. Multiple Sequence Alignments of the proteins in the database are generated using DeepMSA, and features such as Position-Specific Scoring Matrix, Secondary Structure, and Relative Solvent Accessibility are extracted. During training, a weighted binary cross-entropy loss function is used to counter the substantial imbalance in the two classes of binding and nonbinding residues. A novel test set SC6K is introduced to compare binding-site prediction methods. BiRDS achieves an AUROC score of 0.87, and the centers of 25% of its predicted binding sites lie within 4 Å of the center of the actual binding site.
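The weighted binary cross-entropy mentioned in this abstract can be illustrated with a short sketch. It assumes the common formulation in which the positive (binding) term is multiplied by a weight; the abstract does not state the exact weighting BiRDS uses, so the `pos_weight` heuristic and the toy values below are assumptions.

```python
import math


def weighted_bce(y_true, y_prob, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with the positive (binding) term up-weighted."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Toy example: one binding residue among eight (hypothetical values).
labels = [0, 0, 0, 0, 0, 0, 0, 1]
probs = [0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1, 0.4]

# Assumed heuristic: weight positives by the negative-to-positive ratio.
pos_weight = labels.count(0) / labels.count(1)

print(weighted_bce(labels, probs, pos_weight=1.0))         # unweighted loss
print(weighted_bce(labels, probs, pos_weight=pos_weight))  # weighted loss
```

With the weight applied, a missed binding residue costs as much as several misclassified nonbinding residues, which discourages a model from predicting the majority class everywhere.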
Affiliation(s)
- Vineeth R Chelur
- Center for Computational Natural Sciences & Bioinformatics International Institute of Information Technology Hyderabad 500032, India
| | - U Deva Priyakumar
- Center for Computational Natural Sciences & Bioinformatics International Institute of Information Technology Hyderabad 500032, India
| |
|
15
|
Nguyen TTD, Ho QT, Tarn YC, Ou YY. MFPS_CNN: Multi-filter pattern scanning from position-specific scoring matrix with convolutional neural network for efficient prediction of ion transporters. Mol Inform 2022; 41:e2100271. [PMID: 35322557 DOI: 10.1002/minf.202100271] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/09/2021] [Accepted: 03/23/2022] [Indexed: 11/08/2022]
Abstract
In cellular transportation mechanisms, the movement of ions across the cell membrane and its proper control are important for cells, especially for life processes. Ion transporters/pumps and ion channel proteins work as border guards controlling the incessant traffic of ions across cell membranes. We revisited the study of classification of transporters and ion channels from membrane proteins with a more efficient deep learning approach. Specifically, we applied multi-window scanning filters of convolutional neural networks on almost full-length position-specific scoring matrices for extracting useful information. In this way, we were able to retain important evolutionary information of the proteins. Our experimental results show that a convolutional neural network with a minimum number of convolutional layers can be enough to extract the conserved information of proteins, which leads to higher performance. Our best prediction models were obtained after examining different data imbalance handling techniques and different protein encoding methods. We also showed that our models were superior to traditional deep learning approaches on the same datasets as well as other machine learning classification algorithms.
|
16
|
Ding Y, Yang C, Tang J, Guo F. Identification of protein-nucleotide binding residues via graph regularized k-local hyperplane distance nearest neighbor model. Appl Intell 2021. [DOI: 10.1007/s10489-021-02737-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 10/20/2022]
|
17
|
Yang YH, Wang JS, Yuan SS, Liu ML, Su W, Lin H, Zhang ZY. A Survey for Predicting ATP Binding Residues of Proteins Using Machine Learning Methods. Curr Med Chem 2021; 29:789-806. [PMID: 34514982 DOI: 10.2174/0929867328666210910125802] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/24/2021] [Revised: 06/29/2021] [Accepted: 07/04/2021] [Indexed: 11/22/2022]
Abstract
Protein-ligand interactions are necessary for the majority of protein functions. Adenosine-5'-triphosphate (ATP) is one such ligand that plays a vital role as a coenzyme in providing energy for cellular activities, catalyzing biological reactions, and signaling. Knowing the ATP binding residues of proteins is helpful for the annotation of protein function and for drug design. However, because of the huge influx of protein sequences into databases in the post-genome era, experimentally identifying ATP binding residues is cost-ineffective and time-consuming. To address this problem, computational methods have been developed to predict ATP binding residues. In this review, we briefly summarize the application of machine learning methods in detecting ATP binding residues of proteins. We expect this review will be helpful for further research.
Affiliation(s)
- Yu-He Yang
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Jia-Shu Wang
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Shi-Shi Yuan
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Meng-Lu Liu
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Su
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hao Lin
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zhao-Yue Zhang
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
|
18
|
Aggarwal R, Gupta A, Chelur V, Jawahar CV, Priyakumar UD. DeepPocket: Ligand Binding Site Detection and Segmentation using 3D Convolutional Neural Networks. J Chem Inf Model 2021; 62:5069-5079. [PMID: 34374539 DOI: 10.1021/acs.jcim.1c00799] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Indexed: 02/06/2023]
Abstract
A structure-based drug design pipeline involves the development of potential drug molecules or ligands that form stable complexes with a given receptor at its binding site. A prerequisite to this is finding druggable and functionally relevant binding sites on the 3D structure of the protein. Although several methods for detecting binding sites have been developed beforehand, a majority of them surprisingly fail in the identification and ranking of binding sites accurately. The rapid adoption and success of deep learning algorithms in various sections of structural biology beckons the usage of such algorithms for accurate binding site detection. As a combination of geometry based software and deep learning, we report a novel framework, DeepPocket that utilizes 3D convolutional neural networks for the rescoring of pockets identified by Fpocket and further segments these identified cavities on the protein surface. Apart from this, we also propose another data set SC6K containing protein structures submitted in the Protein Data Bank (PDB) from January 1st, 2018, until February 28th, 2020, for ligand binding site (LBS) detection. DeepPocket's results on various binding site data sets and SC6K highlight its better performance over current state-of-the-art methods and good generalization ability over novel structures.
Affiliation(s)
- Rishal Aggarwal
- International Institute of Information Technology, Hyderabad 500 032, India
| | - Akash Gupta
- International Institute of Information Technology, Hyderabad 500 032, India
| | - Vineeth Chelur
- International Institute of Information Technology, Hyderabad 500 032, India
| | - C V Jawahar
- International Institute of Information Technology, Hyderabad 500 032, India
| | - U Deva Priyakumar
- International Institute of Information Technology, Hyderabad 500 032, India
| |
|
19
|
Nguyen TTD, Nguyen DK, Ou YY. Addressing data imbalance problems in ligand-binding site prediction using a variational autoencoder and a convolutional neural network. Brief Bioinform 2021; 22:6329407. [PMID: 34322702 DOI: 10.1093/bib/bbab277] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/31/2021] [Revised: 06/29/2021] [Accepted: 06/30/2021] [Indexed: 11/14/2022] Open
Abstract
Since 2015, a fast growing number of deep learning-based methods have been proposed for protein-ligand binding site prediction and many have achieved promising performance. These methods, however, neglect the imbalanced nature of binding site prediction problems. Traditional data-based approaches for handling data imbalance employ linear interpolation of minority class samples. Such approaches may not be fully exploited by deep neural networks on downstream tasks. We present a novel technique for balancing input classes by developing a deep neural network-based variational autoencoder (VAE) that aims to learn important attributes of the minority classes concerning nonlinear combinations. After learning, the trained VAE was used to generate new minority class samples that were later added to the original data to create a balanced dataset. Finally, a convolutional neural network was used for classification, for which we assumed that the nonlinearity could be fully integrated. As a case study, we applied our method to the identification of FAD- and FMN-binding sites of electron transport proteins. Compared with the best classifiers that use traditional machine learning algorithms, our models obtained a great improvement on sensitivity while maintaining similar or higher levels of accuracy and specificity. We also demonstrate that our method is better than other data imbalance handling techniques, such as SMOTE, ADASYN, and class weight adjustment. Additionally, our models also outperform existing predictors in predicting the same binding types. Our method is general and can be applied to other data types for prediction problems with moderate-to-heavy data imbalances.
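The "linear interpolation of minority class samples" that this abstract attributes to traditional approaches such as SMOTE can be sketched as below. This is a simplified, SMOTE-style illustration of the baseline idea only, not the authors' VAE method and not a reference SMOTE implementation; the function name, the neighbour count `k`, and the toy points are hypothetical.

```python
import random


def interpolate_minority(samples, n_new, k=2, seed=0):
    """Create synthetic points on segments between a minority sample
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = sorted(
            (s for s in samples if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(a + lam * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Four hypothetical minority-class feature vectors.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = interpolate_minority(minority, n_new=3)
print(new_points)
```

Every synthetic point lies on a straight segment between two existing samples, which is the linearity the abstract contrasts with the nonlinear combinations a VAE can learn.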
Affiliation(s)
| | - Duc-Khanh Nguyen
- Department of Information Management, Yuan Ze University, Taiwan
| | - Yu-Yen Ou
- Department of Computer Science and Engineering, Graduate Program in Biomedical Informatics, Yuan Ze University, Taiwan
| |
|
20
|
Hu J, Zheng LL, Bai YS, Zhang KW, Yu DJ, Zhang GJ. Accurate prediction of protein-ATP binding residues using position-specific frequency matrix. Anal Biochem 2021; 626:114241. [PMID: 33971164 DOI: 10.1016/j.ab.2021.114241] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/05/2021] [Revised: 04/27/2021] [Accepted: 05/01/2021] [Indexed: 10/21/2022]
Abstract
Knowledge of protein-ATP interactions can help with protein functional annotation and drug discovery. Accurately identifying protein-ATP binding residues is an important but challenging step toward understanding protein-ATP interactions, especially when only protein sequence information is given. In this study, we propose a novel method, named DeepATPseq, to predict protein-ATP binding residues without using any information about protein three-dimensional structure or sequence-derived structural information. In DeepATPseq, the HHBlits-generated position-specific frequency matrix (PSFM) profile is first employed to extract the feature information of each residue. Then, for each residue, the PSFM-based feature is fed into two prediction models, which are generated by the algorithms of deep convolutional neural network (DCNN) and support vector machine (SVM) separately. The final ATP-binding probability of the corresponding residue is calculated as the weighted sum of the output values of the DCNN-based and SVM-based models. Experimental results on the independent validation data set demonstrate that DeepATPseq could achieve an accuracy of 77.71%, covering 57.42% of all ATP-binding residues, while achieving a Matthews correlation coefficient value (0.655) that is significantly higher than that of existing sequence-based methods and comparable to that of the state-of-the-art structure-based predictors. Detailed data analysis shows that the major advantage of DeepATPseq lies in the combined use of DCNN and SVM, which helps dig out more discriminative information from the PSFM profiles. The online server and standalone package of DeepATPseq are freely available at: https://jun-csbio.github.io/DeepATPseq/ for academic use.
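The final combination step described in this abstract, a weighted sum of the DCNN-based and SVM-based outputs, might look like the sketch below. The weight `w_dcnn`, the decision threshold, and the scores are hypothetical placeholders; the abstract does not report the values DeepATPseq actually uses.

```python
def ensemble_probability(p_dcnn, p_svm, w_dcnn=0.5):
    """Weighted sum of two per-residue binding probabilities."""
    if not 0.0 <= w_dcnn <= 1.0:
        raise ValueError("w_dcnn must lie in [0, 1]")
    return w_dcnn * p_dcnn + (1.0 - w_dcnn) * p_svm


def predict_binding(dcnn_scores, svm_scores, w_dcnn=0.5, threshold=0.5):
    """Label each residue as binding when the combined score reaches the threshold."""
    return [
        ensemble_probability(d, s, w_dcnn) >= threshold
        for d, s in zip(dcnn_scores, svm_scores)
    ]


# Hypothetical per-residue scores from the two base models.
dcnn_scores = [0.9, 0.2, 0.6]
svm_scores = [0.7, 0.1, 0.3]
print(predict_binding(dcnn_scores, svm_scores))
```

Constraining the two weights to sum to one keeps the combined value in the range of a probability whenever both inputs are probabilities.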
Affiliation(s)
- Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China.
| | - Lin-Lin Zheng
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
| | - Yan-Song Bai
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
| | - Ke-Wen Zhang
- College of Mechanical Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing, 210094, China.
| | - Gui-Jun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China.
| |
|
21
|
Yang C, Ding Y, Meng Q, Tang J, Guo F. Granular multiple kernel learning for identifying RNA-binding protein residues via integrating sequence and structure information. Neural Comput Appl 2021. [DOI: 10.1007/s00521-020-05573-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
|
22
|
Prediction of Protein-ATP Binding Residues Based on Ensemble of Deep Convolutional Neural Networks and LightGBM Algorithm. Int J Mol Sci 2021; 22:ijms22020939. [PMID: 33477866 PMCID: PMC7832895 DOI: 10.3390/ijms22020939] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 12/25/2020] [Revised: 01/13/2021] [Accepted: 01/16/2021] [Indexed: 12/13/2022] Open
Abstract
Accurately identifying protein-ATP binding residues is important for protein function annotation and drug design. Previous studies have used classic machine-learning algorithms like support vector machine (SVM) and random forest to predict protein-ATP binding residues; however, as new machine-learning techniques are being developed, the prediction performance could be further improved. In this paper, an ensemble predictor that combines deep convolutional neural networks and LightGBM through an ensemble learning algorithm is proposed. Three subclassifiers have been developed, including a multi-incepResNet-based predictor, a multi-Xception-based predictor, and a LightGBM predictor. The final prediction result is the combination of outputs from the three subclassifiers with an optimized weight distribution. We examined the performance of our proposed predictor using two datasets: a classic ATP-binding benchmark dataset and a newly proposed ATP-binding dataset. Our predictor achieved area under the curve (AUC) values of 0.925 and 0.902 and Matthews Correlation Coefficient (MCC) values of 0.639 and 0.642, respectively, which are both better than other state-of-the-art prediction methods.
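Because MCC is the headline metric in this and several neighbouring abstracts, a brief sketch of its standard definition from confusion-matrix counts may help. The counts below are made up for illustration and do not come from any of the cited benchmarks.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical imbalanced result: 80 binding residues among 1000.
tp, tn, fp, fn = 50, 900, 20, 30
accuracy = (tp + tn) / (tp + tn + fp + fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"accuracy={accuracy:.3f}, MCC={mcc:.3f}")
```

On imbalanced residue-level data, accuracy is inflated by the many correctly rejected nonbinding residues, whereas MCC uses all four counts, which is one reason binding-residue predictors tend to report it.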
|
23
|
CFAP45 deficiency causes situs abnormalities and asthenospermia by disrupting an axonemal adenine nucleotide homeostasis module. Nat Commun 2020; 11:5520. [PMID: 33139725 PMCID: PMC7606486 DOI: 10.1038/s41467-020-19113-0] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 11/03/2019] [Accepted: 09/25/2020] [Indexed: 11/08/2022] Open
Abstract
Axonemal dynein ATPases direct ciliary and flagellar beating via adenosine triphosphate (ATP) hydrolysis. The modulatory effect of adenosine monophosphate (AMP) and adenosine diphosphate (ADP) on flagellar beating is not fully understood. Here, we describe a deficiency of cilia and flagella associated protein 45 (CFAP45) in humans and mice that presents a motile ciliopathy featuring situs inversus totalis and asthenospermia. CFAP45-deficient cilia and flagella show normal morphology and axonemal ultrastructure. Proteomic profiling links CFAP45 to an axonemal module including dynein ATPases and adenylate kinase as well as CFAP52, whose mutations cause a similar ciliopathy. CFAP45 binds AMP in vitro, consistent with structural modelling that identifies an AMP-binding interface between CFAP45 and AK8. Microtubule sliding of dyskinetic sperm from Cfap45−/− mice is rescued with the addition of either AMP or ADP with ATP, compared to ATP alone. We propose that CFAP45 supports mammalian ciliary and flagellar beating via an adenine nucleotide homeostasis module. The mechanism by which adenosine monophosphate modulates dynein ATPase-mediated ciliary and flagellar beating remains obscure. Here the authors identify an axonemal module including cilia and flagella associated protein 45 that supports adenine nucleotide homeostasis and underlies a human ciliopathy
|
24
|
Xia CQ, Pan X, Shen HB. Protein-ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data. Bioinformatics 2020; 36:3018-3027. [PMID: 32091580 DOI: 10.1093/bioinformatics/btaa110] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 10/17/2019] [Revised: 01/19/2020] [Accepted: 02/18/2020] [Indexed: 01/02/2023] Open
Abstract
MOTIVATION Knowledge of protein-ligand binding residues is important for understanding the functions of proteins and their interaction mechanisms. Accurately identifying the potential binding sites of a specific ligand on a protein from its experimentally solved structure is still a challenging problem. Compared with structure-alignment-based methods, machine learning algorithms provide an alternative flexible solution which is less dependent on annotated homogeneous protein structures. Several factors are important for an efficient protein-ligand prediction model, e.g. discriminative feature representation and an effective learning architecture to deal with both large-scale and severely imbalanced data. RESULTS In this study, we propose a novel deep-learning-based method called DELIA for protein-ligand binding residue prediction. In DELIA, a hybrid deep neural network is designed to integrate 1D sequence-based features with 2D structure-based amino acid distance matrices. To overcome the problem of severe data imbalance between the binding and nonbinding residues, strategies of oversampling in mini-batch, random undersampling and stacking ensemble are designed to enhance the model. Experimental results on five benchmark datasets demonstrate the effectiveness of the proposed DELIA pipeline. AVAILABILITY AND IMPLEMENTATION The web server of DELIA is available at www.csbio.sjtu.edu.cn/bioinf/delia/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chun-Qiu Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University
- Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University
- Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University
- Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China
| |
|
25
|
Identification of ligand-binding residues using protein sequence profile alignment and query-specific support vector machine model. Anal Biochem 2020; 604:113799. [DOI: 10.1016/j.ab.2020.113799] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/01/2020] [Revised: 05/23/2020] [Accepted: 05/26/2020] [Indexed: 12/23/2022]
|
26
|
Zhu YH, Hu J, Qi Y, Song XN, Yu DJ. Boosting Granular Support Vector Machines for the Accurate Prediction of Protein-Nucleotide Binding Sites. Comb Chem High Throughput Screen 2020; 22:455-469. [PMID: 31553288 DOI: 10.2174/1386207322666190925125524] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/04/2019] [Revised: 06/21/2019] [Accepted: 08/23/2019] [Indexed: 11/22/2022]
Abstract
AIM AND OBJECTIVE The accurate identification of protein-ligand binding sites helps elucidate protein function and facilitate the design of new drugs. Machine-learning-based methods have been widely used for the prediction of protein-ligand binding sites. Nevertheless, the severe class imbalance phenomenon, where the number of nonbinding (majority) residues is far greater than that of binding (minority) residues, has a negative impact on the performance of such machine-learning-based predictors. MATERIALS AND METHODS In this study, we aim to relieve the negative impact of class imbalance by Boosting Multiple Granular Support Vector Machines (BGSVM). In BGSVM, each base SVM is trained on a granular training subset consisting of all minority samples and some reasonably selected majority samples. The efficacy of BGSVM for dealing with class imbalance was validated by benchmarking it with several typical imbalance learning algorithms. We further implemented a protein-nucleotide binding site predictor, called BGSVM-NUC, with the BGSVM algorithm. RESULTS Rigorous cross-validation and independent validation tests for five types of protein-nucleotide interactions demonstrated that the proposed BGSVM-NUC achieves promising prediction performance and outperforms several popular sequence-based protein-nucleotide binding site predictors. The BGSVM-NUC web server is freely available at http://csbio.njust.edu.cn/bioinf/BGSVM-NUC/ for academic use.
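The construction of granular training subsets described above (all minority samples plus a selection of majority samples) can be sketched as follows. The abstract does not specify how majority samples are "reasonably selected", so this sketch substitutes plain random sampling and omits the boosting step; the function name and the sample identifiers are hypothetical.

```python
import random


def granular_subsets(minority, majority, n_subsets, seed=0):
    """Build balanced training subsets: every minority sample plus an
    equally sized random draw of majority samples.

    Random sampling is a stand-in for the paper's selection rule.
    """
    rng = random.Random(seed)
    size = min(len(minority), len(majority))
    subsets = []
    for _ in range(n_subsets):
        picked = rng.sample(majority, size)  # sampled without replacement
        subsets.append(list(minority) + picked)
    return subsets


# Hypothetical residue identifiers: 2 binding, 5 nonbinding.
binding = ["m1", "m2"]
nonbinding = ["n1", "n2", "n3", "n4", "n5"]
subsets = granular_subsets(binding, nonbinding, n_subsets=3)
for subset in subsets:
    print(subset)
```

Each subset would then train one base classifier, so every base model sees balanced data while the ensemble as a whole still covers much of the majority class.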
Affiliation(s)
- Yi-Heng Zhu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Yong Qi
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Xiao-Ning Song
- School of Internet of Things, Jiangnan University, Wuxi 214122, China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| |
|
27
|
Nguyen T, Le N, Ho Q, Phan D, Ou Y. Using Language Representation Learning Approach to Efficiently Identify Protein Complex Categories in Electron Transport Chain. Mol Inform 2020; 39:e2000033. [DOI: 10.1002/minf.202000033] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/03/2020] [Accepted: 06/26/2020] [Indexed: 11/10/2022]
Affiliation(s)
| | - Nguyen‐Quoc‐Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine College of Medicine, Taipei Medical University Taipei City 106 Taiwan
- Research Center for Artificial Intelligence in Medicine Taipei Medical University Taipei City 106 Taiwan
| | - Quang‐Thai Ho
- Department of Computer Science and Engineering Yuan Ze University Chung-Li Taiwan 32003
| | - Dinh‐Van Phan
- University of Economics University of Danang 41 Leduan St Danang City 550000 Vietnam
| | - Yu‐Yen Ou
- Department of Computer Science and Engineering Yuan Ze University Chung-Li Taiwan 32003
| |
|
28
|
Langton M, Pandelia ME. Hepatitis B Virus Oncoprotein HBx Is Not an ATPase. ACS Omega 2020; 5:16772-16778. [PMID: 32685845 PMCID: PMC7364715 DOI: 10.1021/acsomega.0c01762] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/16/2020] [Accepted: 06/12/2020] [Indexed: 06/11/2023]
Abstract
HBx is the smallest gene product of the Hepatitis B virus (HBV) and an oncogenic stimulus in chronic infections leading to liver disease. HBx interacts and interferes with numerous cellular processes, but its modes of action remain poorly understood. It has been invoked that HBx employs nucleotide hydrolysis to regulate molecular pathways or protein-protein interactions. In the present study, we reinvestigate the (d)NTP hydrolysis of recombinant HBx to explore its potential as a biochemical probe for antiviral studies. For our investigations, we employed existing soluble constructs (i.e., GST-HBx, MBP-HBx) and engineered new fusion proteins (i.e., DsbC-HBx, NusA-HBx), which are shown to serve as better systems for in vitro research. We performed mutational scanning of the computationally predicted NTP-binding domain, which includes residues associated with clinical cases. Steady-state and end-point activity assays, in tandem with mass-spectrometric analyses, reveal that the observed hydrolysis of all alleged HBx substrates, ATP, dATP, and GTP, is contingent on the presence of the GroEL chaperone, which preferentially copurifies as a contaminant with GST-HBx and MBP-HBx. Collectively, our findings provide new technical standards for recombinant HBx studies and reveal that nucleotide hydrolysis is not an operant mechanism by which HBx contributes to viral HBV carcinogenesis.
|
29
|
A Biological and Immunological Characterization of Schistosoma Japonicum Heat Shock Proteins 40 and 90α. Int J Mol Sci 2020; 21:ijms21114034. [PMID: 32512920 PMCID: PMC7312537 DOI: 10.3390/ijms21114034] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/13/2020] [Revised: 05/27/2020] [Accepted: 06/03/2020] [Indexed: 12/12/2022] Open
Abstract
We characterized Schistosoma japonicum HSP40 (Sjp40) and HSP90α (Sjp90α) in this study. Western blot analysis revealed both are present in soluble egg antigens and egg secretory proteins, implicating them in triggering the host immune response after secretion from eggs into host tissues. These observations were confirmed by immunolocalization showing both HSPs are located in the Reynolds’ layer within mature eggs, suggesting they are secreted by miracidia and accumulate between the envelope and the eggshell. Both HSPs are present in the musculature and parenchyma of adult males and in the vitelline cells of females; only Sjp90α is present on the tegument of adults. Sjp40 was able to enhance the expression of macrophages, dendritic cells, and eosinophilic cells in mouse liver non-parenchymal cells, whereas rSjp90α only stimulated the expression of dendritic cells. T helper 1 (Th1), Th2, and Th17 responses were increased upon rSjp40 stimulation in vitro, but rSjp90 only stimulated an increased Th17 response. Sjp40 has an important role in reducing the expression of fibrogenic gene markers in hepatic stellate cells in vitro. Overall, these findings provide new information on HSPs in S. japonicum, improving our understanding of the pathological roles they play in their interaction with host immune cells.
|
30
|
Vignolle GA, Mach RL, Mach-Aigner AR, Derntl C. Novel approach in whole genome mining and transcriptome analysis reveal conserved RiPPs in Trichoderma spp. BMC Genomics 2020; 21:258. [PMID: 32216757 PMCID: PMC7099791 DOI: 10.1186/s12864-020-6653-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 11/29/2019] [Accepted: 03/04/2020] [Indexed: 01/10/2023] Open
Abstract
Background Ribosomally synthesized and post-translationally modified peptides (RiPPs) are a highly diverse group of secondary metabolites (SM) of bacterial and fungal origin. While RiPPs have been intensively studied in bacteria, little is known about fungal RiPPs. In fungi, only six classes of RiPPs are described. Current strategies for genome mining are based on these six known classes. However, the genes involved in the biosynthesis of these RiPPs are normally organized in biosynthetic gene clusters (BGC) in fungi. Results Here we describe a comprehensive strategy to mine fungal genomes for RiPPs by combining and adapting existing tools (e.g. antiSMASH and RiPPMiner) followed by extensive manual curation based on conserved domain identification, (comparative) phylogenetic analysis, and RNASeq data. Deploying this strategy, we could successfully rediscover already known fungal RiPPs. Further, we analysed four fungal genomes from the Trichoderma genus. We were able to find novel potential RiPP BGCs in Trichoderma using our unconventional mining approach. Conclusion We demonstrate that the unusual mining approach using tools developed for bacteria can be used in fungi when carefully curated. Our study is the first report of the potential of Trichoderma to produce RiPPs; the detected clusters encode novel uncharacterized RiPPs. The method described in our study will lead to further mining efforts in all subdivisions of the fungal kingdom.
Affiliation(s)
- Gabriel A Vignolle
- Institute of Chemical, Environmental and Bioscience Engineering, TU Wien, Gumpendorfer Strasse 1a, 1060, Wien, Austria
| | - Robert L Mach
- Institute of Chemical, Environmental and Bioscience Engineering, TU Wien, Gumpendorfer Strasse 1a, 1060, Wien, Austria
| | - Astrid R Mach-Aigner
- Institute of Chemical, Environmental and Bioscience Engineering, TU Wien, Gumpendorfer Strasse 1a, 1060, Wien, Austria
| | - Christian Derntl
- Institute of Chemical, Environmental and Bioscience Engineering, TU Wien, Gumpendorfer Strasse 1a, 1060, Wien, Austria.
| |
|
31
|
Chen CW, Lin MH, Liao CC, Chang HP, Chu YW. iStable 2.0: Predicting protein thermal stability changes by integrating various characteristic modules. Comput Struct Biotechnol J 2020; 18:622-630. [PMID: 32226595 PMCID: PMC7090336 DOI: 10.1016/j.csbj.2020.02.021] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Received: 10/15/2019] [Revised: 02/25/2020] [Accepted: 02/27/2020] [Indexed: 11/15/2022] Open
Abstract
Protein mutations can lead to structural changes that affect protein function and result in disease occurrence. In the protein engineering, drug design, and optimization industries, mutations are often used to improve protein stability or to change protein properties while maintaining stability. To provide possible candidates for novel protein design, several computational tools for predicting protein stability changes have been developed. Although many prediction tools are available, each tool employs different algorithms and features. This can produce conflicting prediction results that make it difficult for users to decide upon the correct protein design. Therefore, this study proposes an integrated prediction tool, iStable 2.0, which integrates 11 sequence-based and structure-based prediction tools by machine learning and adds protein sequence information as features. Three coding modules are designed for the system, an Online Server Module, a Stand-alone Module and a Sequence Coding Module, to improve the prediction performance of the previous version of the system. The final integrated structure-based classification model has a higher Matthews correlation coefficient than that of the single prediction tool (0.708 vs 0.547, respectively), and the Pearson correlation coefficient of the regression model likewise improves from 0.669 to 0.714. The sequence-based model not only successfully integrates off-the-shelf predictors but also improves the Matthews correlation coefficient of the best single prediction tool by at least 0.161, which is better than the individual structure-based prediction tools. In addition, both the Sequence Coding Module and the Stand-alone Module maintain performance with only a 5% decrease of the Matthews correlation coefficient when the integrated online tools are unavailable. iStable 2.0 is available at http://ncblab.nchu.edu.tw/iStable2.
Affiliation(s)
- Chi-Wei Chen
- Department of Computer Science and Engineering, National Chung-Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan
- Institute of Genomics and Bioinformatics, National Chung Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan
| | - Meng-Han Lin
- Institute of Genomics and Bioinformatics, National Chung Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan
| | - Chi-Chou Liao
- Institute of Genomics and Bioinformatics, National Chung Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan
- Institute of Molecular Biology, National Chung Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan
| | - Hsung-Pin Chang
- Department of Computer Science and Engineering, National Chung-Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan
| | - Yen-Wei Chu
- Institute of Genomics and Bioinformatics, National Chung Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan
- Institute of Molecular Biology, National Chung Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan
- Agricultural Biotechnology Center, National Chung Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan
- Biotechnology Center, National Chung Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan
- Ph.D. Program in Translational Medicine, National Chung Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan
- Rong Hsing Research Center for Translational Medicine, National Chung Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan
- Corresponding author at: Institute of Genomics and Bioinformatics, National Chung Hsing University, 145 Xingda Rd., South Dist., Taichung City 402, Taiwan.
|
32
|
Zhao J, Cao Y, Zhang L. Exploring the computational methods for protein-ligand binding site prediction. Comput Struct Biotechnol J 2020; 18:417-426. [PMID: 32140203 PMCID: PMC7049599 DOI: 10.1016/j.csbj.2020.02.008] [Citation(s) in RCA: 82] [Impact Index Per Article: 20.5] [Received: 10/14/2019] [Revised: 01/23/2020] [Accepted: 02/11/2020] [Indexed: 12/21/2022] Open
Abstract
Proteins participate in various essential processes in vivo via interactions with other molecules. Identifying the residues participating in these interactions not only provides biological insights for protein function studies but also has great significance for drug discoveries. Therefore, predicting protein-ligand binding sites has long been under intense research in the fields of bioinformatics and computer aided drug discovery. In this review, we first introduce the research background of predicting protein-ligand binding sites and then classify the methods into four categories, namely, 3D structure-based, template similarity-based, traditional machine learning-based and deep learning-based methods. We describe representative algorithms in each category and elaborate on machine learning and deep learning-based prediction methods in more detail. Finally, we discuss the trends and challenges of the current research such as molecular dynamics simulation based cryptic binding sites prediction, and highlight prospective directions for the near future.
Affiliation(s)
- Jingtian Zhao
- College of Computer Science, Sichuan University, Chengdu 610065, China
| | - Yang Cao
- Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu 610065, China
| | - Le Zhang
- College of Computer Science, Sichuan University, Chengdu 610065, China
|
33
|
Le NQK, Ho QT, Ou YY. Using two-dimensional convolutional neural networks for identifying GTP binding sites in Rab proteins. J Bioinform Comput Biol 2020; 17:1950005. [PMID: 30866734 DOI: 10.1142/s0219720019500057] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 01/05/2023]
Abstract
Deep learning has been increasingly and widely used to solve numerous problems in various fields with state-of-the-art performance. It can also be applied in bioinformatics to reduce the requirement for feature extraction and reach high performance. This study attempts to use deep learning to predict GTP binding sites in Rab proteins; GTP binding is one of the most vital molecular functions in life science. A functional loss of GTP binding sites in Rab proteins has been implicated in a variety of human diseases (choroideremia, intellectual disability, cancer, Parkinson's disease). Therefore, creating a precise model to identify their functions is a crucial problem for understanding these diseases and designing drug targets. Our deep learning model, which combines a two-dimensional convolutional neural network with position-specific scoring matrix profiles, identified GTP binding residues with a sensitivity of 92.3%, specificity of 99.8%, accuracy of 99.5%, and MCC of 0.92 on an independent dataset. Compared with other published works, this approach achieved a significant improvement. Through this study, we provide an effective model for predicting GTP binding sites in Rab proteins and a basis for further research that can apply deep learning in bioinformatics, especially in nucleotide binding site prediction.
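The four figures quoted in this abstract (sensitivity, specificity, accuracy, MCC) all derive from the same binary confusion matrix. The following is a minimal Python sketch of those standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn count binding residues predicted correctly/incorrectly;
    tn/fp count non-binding residues predicted correctly/incorrectly.
    Ratios with an empty denominator are reported as 0.0 (a convention
    chosen for this sketch).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)            # recall on binding residues
    specificity = ratio(tn, tn + fp)            # recall on non-binding residues
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    # Matthews correlation coefficient: ranges from -1 to +1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, denom)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical counts, for illustration only
print(binary_metrics(tp=90, tn=950, fp=50, fn=10))
```

Because binding residues are usually a small minority, accuracy and specificity can stay high even when many binding residues are missed, which is one reason abstracts in this list report MCC alongside them.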
Affiliation(s)
- Nguyen Quoc Khanh Le
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan 32003, R. O. C.
- School of Humanities, Nanyang Technological University, 48 Nanyang Ave, Singapore 639798, Singapore
| | - Quang-Thai Ho
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan 32003, R. O. C.
| | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan 32003, R. O. C.
|
34
|
Agrawal P, Mishra G, Raghava GPS. SAMbinder: A Web Server for Predicting S-Adenosyl-L-Methionine Binding Residues of a Protein From Its Amino Acid Sequence. Front Pharmacol 2020; 10:1690. [PMID: 32082172 PMCID: PMC7002541 DOI: 10.3389/fphar.2019.01690] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/13/2019] [Accepted: 12/24/2019] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION S-adenosyl-L-methionine (SAM) is an essential cofactor present in the biological system and plays a key role in many diseases. There is a need to develop a method for predicting SAM binding sites in a protein for designing drugs against SAM associated disease. To the best of our knowledge, there is no method that can predict the binding site of SAM in a given protein sequence. RESULT This manuscript describes a method SAMbinder, developed for predicting SAM interacting residue in a protein from its primary sequence. All models were trained, tested, and evaluated on 145 SAM binding protein chains where no two chains have more than 40% sequence similarity. Firstly, models were developed using different machine learning techniques on a balanced data set containing 2,188 SAM interacting and an equal number of non-interacting residues. Our random forest based model developed using binary profile feature got maximum Matthews Correlation Coefficient (MCC) 0.42 with area under receiver operating characteristics (AUROC) 0.79 on the validation data set. The performance of our models improved significantly from MCC 0.42 to 0.61, when evolutionary information in the form of the position-specific scoring matrix (PSSM) profile is used as a feature. We also developed models on a realistic data set containing 2,188 SAM interacting and 40,029 non-interacting residues and got maximum MCC 0.61 with AUROC of 0.89. In order to evaluate the performance of our models, we used internal as well as external cross-validation technique. AVAILABILITY AND IMPLEMENTATION https://webs.iiitd.edu.in/raghava/sambinder/.
Affiliation(s)
- Piyush Agrawal
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Bioinformatics Center, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Gaurav Mishra
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
- Department of Electrical Engineering, Shiv Nadar University, Greater Noida, India
| | - Gajendra P. S. Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
|
35
|
Oldfield CJ, Fan X, Wang C, Dunker AK, Kurgan L. Computational Prediction of Intrinsic Disorder in Protein Sequences with the disCoP Meta-predictor. Methods Mol Biol 2020; 2141:21-35. [PMID: 32696351 DOI: 10.1007/978-1-0716-0524-0_2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/12/2022]
Abstract
Intrinsically disordered proteins are either entirely disordered or contain disordered regions in their native state. These proteins and regions function without the prerequisite of a stable structure and were found to be abundant across all kingdoms of life. Experimental annotation of disorder lags behind the rapidly growing number of sequenced proteins, motivating the development of computational methods that predict disorder in protein sequences. DisCoP is a user-friendly webserver that provides accurate sequence-based prediction of protein disorder. It relies on meta-architecture in which the outputs generated by multiple disorder predictors are combined together to improve predictive performance. The architecture of disCoP is presented, and its accuracy relative to several other disorder predictors is briefly discussed. We describe usage of the web interface and explain how to access and read results generated by this computational tool. We also provide an example of prediction results and interpretation. The disCoP's webserver is publicly available at http://biomine.cs.vcu.edu/servers/disCoP/ .
Affiliation(s)
| | - Xiao Fan
- Department of Pediatrics, Columbia University, New York, NY, USA
| | - Chen Wang
- Department of Medicine, Columbia University, New York, NY, USA
| | - A Keith Dunker
- Department of Biochemistry and Molecular Biology, Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, USA
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA.
|
36
|
Patiyal S, Agrawal P, Kumar V, Dhall A, Kumar R, Mishra G, Raghava GP. NAGbinder: An approach for identifying N-acetylglucosamine interacting residues of a protein from its primary sequence. Protein Sci 2020; 29:201-210. [PMID: 31654438 PMCID: PMC6933864 DOI: 10.1002/pro.3761] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 07/29/2019] [Revised: 10/24/2019] [Accepted: 10/24/2019] [Indexed: 12/14/2022]
Abstract
N-acetylglucosamine (NAG) is one of the eight essential saccharides that are required to maintain the optimal health and precise functioning of systems ranging from bacteria to humans. In the present study, we have developed a method, NAGbinder, which predicts the NAG-interacting residues in a protein from its primary sequence information. We extracted 231 NAG-interacting nonredundant protein chains from the Protein Data Bank, where no two sequences share more than 40% sequence identity. All prediction models were trained, validated, and evaluated on these 231 protein chains. At first, prediction models were developed on balanced data consisting of 1,335 NAG-interacting and noninteracting residues, using various window sizes. The Random Forest model that uses binary profiles with window size 9 to identify NAG-interacting residues performed best among the models. It achieved the highest Matthews Correlation Coefficient (MCC) of 0.31 and 0.25, and Area Under Receiver Operating Curve (AUROC) of 0.73 and 0.70 on the training and validation data sets, respectively. We also developed prediction models on a realistic data set (1,335 NAG-interacting and 47,198 noninteracting residues) using the same principle, where the model achieved MCC of 0.26 and 0.27, and AUROC of 0.70 and 0.71, on the training and validation data sets, respectively. The success of our method can be appraised by the fact that, if a sequence of 1,000 amino acids is analyzed with our approach, 10 residues will be predicted as NAG-interacting, out of which five are correct. The best models were incorporated in the standalone version and in the webserver available at https://webs.iiitd.edu.in/raghava/nagbinder/.
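To make the "binary profiles with window size 9" idea concrete, here is a minimal Python sketch that one-hot encodes a sliding window centred on each residue. It is an illustration under assumptions, not the NAGbinder implementation: the function name, the padding symbol `X` for positions beyond the termini, and the resulting 21-letter alphabet are all hypothetical choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues
PAD = "X"                              # assumed dummy residue for terminal padding
ALPHABET = AMINO_ACIDS + PAD

def binary_profile_windows(sequence, window=9):
    """Return one feature vector per residue.

    Each vector concatenates the one-hot encodings of the `window`
    residues centred on that position, so its length is
    window * len(ALPHABET). Non-standard residue letters raise ValueError.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    padded = PAD * half + sequence + PAD * half
    features = []
    for i in range(len(sequence)):
        vector = []
        for residue in padded[i:i + window]:
            one_hot = [0] * len(ALPHABET)
            one_hot[ALPHABET.index(residue)] = 1
            vector.extend(one_hot)
        features.append(vector)
    return features

# Hypothetical nine-residue sequence, for illustration only
vectors = binary_profile_windows("ACDEFGHIK", window=9)
print(len(vectors), len(vectors[0]))
```

Each vector would then be paired with a binding or non-binding label for its central residue and passed to a classifier such as a random forest.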
Affiliation(s)
- Sumeet Patiyal
- Department of Computational Biology, Indraprastha Institute of Information Technology, Delhi, India
| | - Piyush Agrawal
- Department of Computational Biology, Indraprastha Institute of Information Technology, Delhi, India
- Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Vinod Kumar
- Department of Computational Biology, Indraprastha Institute of Information Technology, Delhi, India
- Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Anjali Dhall
- Department of Computational Biology, Indraprastha Institute of Information Technology, Delhi, India
| | - Rajesh Kumar
- Department of Computational Biology, Indraprastha Institute of Information Technology, Delhi, India
- Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh, India
| | - Gaurav Mishra
- Department of Electrical Engineering, Shiv Nadar University, Greater Noida, Gautam Buddha Nagar, India
| | - Gajendra P.S. Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, Delhi, India
|
37
|
Abstract
Intrinsically disordered regions (IDRs) are estimated to be highly abundant in nature. While only several thousand proteins are annotated with experimentally derived IDRs, computational methods can be used to predict IDRs for the millions of currently uncharacterized protein chains. Several dozen disorder predictors were developed over the last few decades. While some of these methods provide accurate predictions, unavoidably they also make some mistakes. Consequently, one of the challenges facing users of these methods is how to decide which predictions can be trusted and which are likely incorrect. This practical problem can be solved using quality assessment (QA) scores that predict correctness of the underlying (disorder) predictions at a residue level. We motivate and describe a first-of-its-kind toolbox of QA methods, QUARTER (QUality Assessment for pRotein inTrinsic disordEr pRedictions), which provides the scores for a diverse set of ten disorder predictors. QUARTER is available to the end users as a free and convenient webserver at http://biomine.cs.vcu.edu/servers/QUARTER/ . We briefly describe the predictive architecture of QUARTER and provide detailed instructions on how to use the webserver. We also explain how to interpret results produced by QUARTER with the help of a case study.
|
38
|
Song J, Liu G, Song C, Jiang J. A novel sequence-based prediction method for ATP-binding sites using fusion of SMOTE algorithm and random forests classifier. Biotechnol Biotec Eq 2020. [DOI: 10.1080/13102818.2020.1840436] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/21/2023] Open
Affiliation(s)
- Jiazhi Song
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, PR China
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, PR China
- College of Computer Science and Technology, Inner Mongolia University for Nationalities, Tongliao, Inner Mongolia, PR China
| | - Guixia Liu
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, PR China
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, PR China
| | - Chuyi Song
- College of Mathematics and Physics, Inner Mongolia University for Nationalities, Tongliao, Inner Mongolia, PR China
| | - Jingqing Jiang
- College of Computer Science and Technology, Inner Mongolia University for Nationalities, Tongliao, Inner Mongolia, PR China
|
39
|
Mishra PKK, Nimmanapalli R. In silico characterization of Leptospira interrogans DNA ligase A and delineation of its antimicrobial stretches. Ann Microbiol 2019. [DOI: 10.1007/s13213-019-01516-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022] Open
|
40
|
Zhao Z, Xu Y, Zhao Y. SXGBsite: Prediction of Protein-Ligand Binding Sites Using Sequence Information and Extreme Gradient Boosting. Genes (Basel) 2019; 10:E965. [PMID: 31771119 PMCID: PMC6947422 DOI: 10.3390/genes10120965] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 09/05/2019] [Revised: 10/19/2019] [Accepted: 11/19/2019] [Indexed: 12/13/2022] Open
Abstract
The prediction of protein-ligand binding sites is important in drug discovery and drug design. Protein-ligand binding site prediction computational methods are inexpensive and fast compared with experimental methods. This paper proposes a new computational method, SXGBsite, which includes the synthetic minority over-sampling technique (SMOTE) and the Extreme Gradient Boosting (XGBoost). SXGBsite uses the position-specific scoring matrix discrete cosine transform (PSSM-DCT) and predicted solvent accessibility (PSA) to extract features containing sequence information. A new balanced dataset was generated by SMOTE to improve classifier performance, and a prediction model was constructed using XGBoost. The parallel computing and regularization techniques enabled high-quality and fast predictions and mitigated overfitting caused by SMOTE. An evaluation using 12 different types of ligand binding site independent test sets showed that SXGBsite performs similarly to the existing methods on eight of the independent test sets with a faster computation time. SXGBsite may be applied as a complement to biological experiments.
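The balancing step mentioned in this abstract can be pictured with a small sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. The Python code below is a simplified illustration, assuming plain lists of floats as feature vectors; the function name `smote_sketch` and its parameters are hypothetical, and SXGBsite itself presumably relies on an established implementation rather than anything like this.

```python
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from a list of minority-class vectors.

    Requires at least two minority samples. Neighbours are ranked by
    squared Euclidean distance within the minority class only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by distance to the base sample
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Interpolate: base + lam * (neighbour - base), with lam in [0, 1)
        lam = rng.random()
        synthetic.append([a + lam * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical two-dimensional minority samples, for illustration only
binding = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
print(smote_sketch(binding, n_new=3, k=2))
```

Because the synthetic points are interpolations of existing minority samples, a flexible classifier can overfit to them, which is consistent with the abstract's remark that regularization was used to mitigate overfitting caused by SMOTE.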
Affiliation(s)
| | - Yonghong Xu
- School of Electrical Engineering, Yanshan University, Qinhuangdao 066004, China
|
41
|
Wang W, Li K, Lv H, Zhang H, Wang S, Huang J. SmoPSI: Analysis and Prediction of Small Molecule Binding Sites Based on Protein Sequence Information. Computational and Mathematical Methods in Medicine 2019; 2019:1926156. [PMID: 31814842 PMCID: PMC6877956 DOI: 10.1155/2019/1926156] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 06/27/2019] [Revised: 09/16/2019] [Accepted: 09/26/2019] [Indexed: 11/20/2022]
Abstract
The analysis and prediction of small molecule binding sites is very important for drug discovery and drug design. The traditional experimental methods for detecting small molecule binding sites are usually expensive and time consuming, and the tools for single species small molecule research are equally inefficient. In recent years, some algorithms for predicting binding sites of protein-small molecules have been developed based on the geometric and sequence characteristics of proteins. In this paper, we have proposed SmoPSI, a classification model based on the XGBoost algorithm for predicting the binding sites of small molecules, using protein sequence information. The model achieved better results with an AUC of 0.918 and an ACC of 0.913. The experimental results demonstrate that our method achieves high performance and outperforms many existing predictors. In addition, we analyzed the binding and nonbinding residues and found that the PSSM, hydrophilicity, hydrophobicity, charge, and hydrogen bonding have clearly different effects on the binding-site predictions.
Affiliation(s)
- Wei Wang
- Department of Computer Science and Technology, College of Computer and Information Engineering, Henan Normal University, 453007 Xinxiang, Henan Province, China
- Laboratory of Computation Intelligence and Information Processing, Engineering Technology Research Center for Computing Intelligence and Data Mining, 453007 Xinxiang, Henan Province, China
| | - Keliang Li
- Department of Computer Science and Technology, College of Computer and Information Engineering, Henan Normal University, 453007 Xinxiang, Henan Province, China
| | - Hehe Lv
- Department of Computer Science and Technology, College of Computer and Information Engineering, Henan Normal University, 453007 Xinxiang, Henan Province, China
| | - Hongjun Zhang
- School of Aviation Engineering, Anyang University, 455000 Anyang, Henan Province, China
| | - Shixun Wang
- Department of Computer Science and Technology, College of Computer and Information Engineering, Henan Normal University, 453007 Xinxiang, Henan Province, China
| | - Junwei Huang
- Department of Computer Science and Technology, College of Computer and Information Engineering, Henan Normal University, 453007 Xinxiang, Henan Province, China
|
42
|
Ugidos N, Mena J, Baquero S, Alloza I, Azkargorta M, Elortza F, Vandenbroeck K. Interactome of the Autoimmune Risk Protein ANKRD55. Front Immunol 2019; 10:2067. [PMID: 31620119 PMCID: PMC6759997 DOI: 10.3389/fimmu.2019.02067] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 03/12/2019] [Accepted: 08/15/2019] [Indexed: 01/03/2023] Open
Abstract
The ankyrin repeat domain-55 (ANKRD55) gene contains intronic single nucleotide polymorphisms (SNPs) associated with risk to contract multiple sclerosis, rheumatoid arthritis or other autoimmune disorders. Risk alleles of these SNPs are associated with higher levels of ANKRD55 in CD4+ T cells. The biological function of ANKRD55 is unknown, but given that ankyrin repeat domains constitute one of the most common protein-protein interaction platforms in nature, it is likely to function in complex with other proteins. Thus, identification of its protein interactomes may provide clues. We identified ANKRD55 interactomes via recombinant overexpression in HEK293 or HeLa cells and mass spectrometry. One hundred forty-eight specifically interacting proteins were found in total protein extracts and 22 in extracts of sucrose gradient-purified nuclei. Bioinformatic analysis suggested that the ANKRD55-protein partners from total protein extracts were related to nucleotide and ATP binding, enriched in nuclear transport terms and associated with cell cycle and RNA, lipid and amino acid metabolism. The enrichment analysis of the ANKRD55-protein partners from nuclear extracts is related to sumoylation, RNA binding, processes associated with cell cycle, RNA transport, nucleotide and ATP binding. The interactions between overexpressed ANKRD55 isoform 001 and endogenous RPS3, the cohesins SMC1A and SMC3, CLTC, PRKDC, VIM, β-tubulin isoforms, and 14-3-3 isoforms were validated by western blot, reverse immunoprecipitation and/or confocal microscopy. We also identified three phosphorylation sites in ANKRD55, with S436 exhibiting the highest score as likely 14-3-3 binding phosphosite. Our study suggests that ANKRD55 may exert function(s) in the formation or architecture of multiple protein complexes, and is regulated by (de)phosphorylation reactions. Based on interactome and subcellular localization analysis, ANKRD55 is likely transported into the nucleus by the classical nuclear import pathway and is involved in mitosis, probably via effects associated with mitotic spindle dynamics.
Affiliation(s)
- Nerea Ugidos
- Neurogenomiks Group, Department of Neuroscience, University of the Basque Country (UPV/EHU), Leioa, Spain
- Achucarro Basque Center for Neuroscience, Leioa, Spain
| | - Jorge Mena
- Neurogenomiks Group, Department of Neuroscience, University of the Basque Country (UPV/EHU), Leioa, Spain
- Achucarro Basque Center for Neuroscience, Leioa, Spain
| | - Sara Baquero
- Neurogenomiks Group, Department of Neuroscience, University of the Basque Country (UPV/EHU), Leioa, Spain
- Achucarro Basque Center for Neuroscience, Leioa, Spain
| | - Iraide Alloza
- Neurogenomiks Group, Department of Neuroscience, University of the Basque Country (UPV/EHU), Leioa, Spain
- Achucarro Basque Center for Neuroscience, Leioa, Spain
| | - Mikel Azkargorta
- Proteomics Platform, CIC bioGUNE, CIBERehd, ProteoRed-ISCIII, Derio, Spain
| | - Felix Elortza
- Proteomics Platform, CIC bioGUNE, CIBERehd, ProteoRed-ISCIII, Derio, Spain
| | - Koen Vandenbroeck
- Neurogenomiks Group, Department of Neuroscience, University of the Basque Country (UPV/EHU), Leioa, Spain
- Achucarro Basque Center for Neuroscience, Leioa, Spain
- IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
|
43
|
Nguyen TTD, Le NQK, Kusuma RMI, Ou YY. Prediction of ATP-binding sites in membrane proteins using a two-dimensional convolutional neural network. J Mol Graph Model 2019; 92:86-93. [PMID: 31344547 DOI: 10.1016/j.jmgm.2019.07.003] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 04/30/2019] [Revised: 06/24/2019] [Accepted: 07/13/2019] [Indexed: 12/28/2022]
Abstract
Membrane proteins, the most important drug targets, account for around 30% of total proteins encoded by the genome of living organisms. An important role of these proteins is to bind adenosine triphosphate (ATP), facilitating crucial biological processes such as metabolism and cell signaling. There are several reports elucidating ATP-binding sites within proteins. However, such studies on membrane proteins are limited. Our prediction tool, DeepATP, combines evolutionary information in the form of Position Specific Scoring Matrix and two-dimensional Convolutional Neural Network to predict ATP-binding sites in membrane proteins with an MCC of 0.89 and an AUC of 99%. Compared to recently published ATP-binding site predictors and classifiers that use traditional machine learning algorithms, our approach performs significantly better. We suggest this method as a reliable tool for biologists for ATP-binding site prediction in membrane proteins.
Affiliation(s)
| | - Nguyen-Quoc-Khanh Le
- School of Humanities, Nanyang Technological University, 48 Nanyang Ave, Singapore 639798, Singapore
| | | | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan.
|
44
|
Lu C, Liu Z, Zhang E, He F, Ma Z, Wang H. MPLs-Pred: Predicting Membrane Protein-Ligand Binding Sites Using Hybrid Sequence-Based Features and Ligand-Specific Models. Int J Mol Sci 2019; 20:ijms20133120. [PMID: 31247932 PMCID: PMC6651575 DOI: 10.3390/ijms20133120] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 05/08/2019] [Revised: 06/23/2019] [Accepted: 06/23/2019] [Indexed: 02/07/2023] Open
Abstract
Membrane proteins (MPs) are involved in many essential biomolecule mechanisms as a pivotal factor in enabling the small molecule and signal transport between the two sides of the biological membrane; this is the reason that a large portion of modern medicinal drugs target MPs. Therefore, accurately identifying the membrane protein-ligand binding sites (MPLs) will significantly improve drug discovery. In this paper, we propose a sequence-based MPLs predictor called MPLs-Pred, where evolutionary profiles, topology structure, physicochemical properties, and primary sequence segment descriptors are combined as features applied to a random forest classifier, and an under-sampling scheme is used to enhance the classification capability with imbalanced samples. Additional ligand-specific models were taken into consideration in refining the prediction. The corresponding experimental results based on our method achieved an appreciable performance, with 0.63 MCC (Matthews correlation coefficient) as the overall prediction precision, and those values were 0.604, 0.7, and 0.692, respectively, for the three main types of ligands: drugs, metal ions, and biomacromolecules. MPLs-Pred is freely accessible at http://icdtools.nenu.edu.cn/.
Affiliation(s)
- Chang Lu
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
| | - Zhe Liu
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
| | - Enju Zhang
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China
| | - Fei He
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China.
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China.
| | - Zhiqiang Ma
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China.
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China.
| | - Han Wang
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China.
- Institute of Computational Biology, Northeast Normal University, Changchun 130117, China.
|
45
|
Abstract
Background: B-cell epitope prediction is an essential tool for a variety of immunological studies. For identifying such epitopes, several computational predictors have been proposed in the past 10 years. Objective: In this review, we summarized the representative computational approaches developed for the identification of linear B-cell epitopes. Methods: We mainly discuss the datasets, feature extraction methods and classification methods used in the previous work. Results: The performance of the existing methods was not very satisfying, and so more effective approaches should be proposed by considering the structural information of proteins. Conclusion: We consider existing challenges and future perspectives for developing reliable methods for predicting linear B-cell epitopes.
Affiliation(s)
- Cangzhi Jia
- School of Science, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China
| | - Hongyan Gong
- School of Science, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China
| | - Yan Zhu
- School of Science, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China
| | - Yixia Shi
- Department of Mathematics and Statistics, Lingnan Normal University, Zhanjiang, China
| |
|
46
|
Qiao L, Xie D. MIonSite: Ligand-specific prediction of metal ion-binding sites via enhanced AdaBoost algorithm with protein sequence information. Anal Biochem 2019; 566:75-88. [DOI: 10.1016/j.ab.2018.11.009] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 05/06/2018] [Revised: 10/15/2018] [Accepted: 11/07/2018] [Indexed: 11/24/2022]
|
47
|
Oldfield CJ, Chen K, Kurgan L. Computational Prediction of Secondary and Supersecondary Structures from Protein Sequences. Methods Mol Biol 2019; 1958:73-100. [PMID: 30945214 DOI: 10.1007/978-1-4939-9161-7_4] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 12/12/2022]
Abstract
Many new methods for the sequence-based prediction of the secondary and supersecondary structures have been developed over the last several years. These and older sequence-based predictors are widely applied for the characterization and prediction of protein structure and function. These efforts have produced countless accurate predictors, many of which rely on state-of-the-art machine learning models and evolutionary information generated from multiple sequence alignments. We describe and motivate both types of predictions. We introduce concepts related to the annotation and computational prediction of the three-state and eight-state secondary structure as well as several types of supersecondary structures, such as β hairpins, coiled coils, and α-turn-α motifs. We review 34 predictors focusing on recent tools and provide detailed information for a selected set of 14 secondary structure and 3 supersecondary structure predictors. We conclude with several practical notes for the end users of these predictive methods.
Affiliation(s)
- Christopher J Oldfield
- Department of Computer Science, College of Engineering, Virginia Commonwealth University, Richmond, VA, USA
| | - Ke Chen
- School of Computer Science and Software Engineering, Tianjin Polytechnic University, Tianjin, People's Republic of China
| | - Lukasz Kurgan
- Department of Computer Science, College of Engineering, Virginia Commonwealth University, Richmond, VA, USA.
| |
|
48
|
Agrawal P, Patiyal S, Kumar R, Kumar V, Singh H, Raghav PK, Raghava GPS. ccPDB 2.0: an updated version of datasets created and compiled from Protein Data Bank. Database: The Journal of Biological Databases and Curation 2019; 2019:5298333. [PMID: 30689843 PMCID: PMC6343045 DOI: 10.1093/database/bay142] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/31/2018] [Accepted: 12/09/2018] [Indexed: 12/20/2022]
Abstract
ccPDB 2.0 (http://webs.iiitd.edu.in/raghava/ccpdb) is an updated version of the manually curated database ccPDB that maintains datasets required for developing methods to predict the structure and function of proteins. The number of datasets compiled from literature increased from 45 to 141 in ccPDB 2.0. Similarly, the number of protein structures used for creating datasets also increased from ~74 000 to ~137 000 (PDB March 2018 release). ccPDB 2.0 provides the same web services and flexible tools which were present in the previous version of the database. In the updated version, links of the number of methods developed in the past few years have also been incorporated. This updated resource is built on responsive templates which is compatible with smartphones (mobile, iPhone, iPad, tablets etc.) and large screen gadgets. In summary, ccPDB 2.0 is a user-friendly web-based platform that provides comprehensive as well as updated information about datasets.
Affiliation(s)
- Piyush Agrawal
- Bioinformatics Center, CSIR-Institute of Microbial Technology, India; Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Industrial Estate, Phase III, New Delhi, India
| | - Sumeet Patiyal
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Industrial Estate, Phase III, New Delhi, India
| | - Rajesh Kumar
- Bioinformatics Center, CSIR-Institute of Microbial Technology, India; Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Industrial Estate, Phase III, New Delhi, India
| | - Vinod Kumar
- Bioinformatics Center, CSIR-Institute of Microbial Technology, India; Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Industrial Estate, Phase III, New Delhi, India
| | - Harinder Singh
- J. Craig Venter Institute, 9605 Medical Center Drive, Suite 150, Rockville, MD, USA
| | - Pawan Kumar Raghav
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Industrial Estate, Phase III, New Delhi, India
| | - Gajendra P S Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Industrial Estate, Phase III, New Delhi, India
| |
|
49
|
Hu J, Li Y, Zhang Y, Yu DJ. ATPbind: Accurate Protein-ATP Binding Site Prediction by Combining Sequence-Profiling and Structure-Based Comparisons. J Chem Inf Model 2018; 58:501-510. [PMID: 29361215 DOI: 10.1021/acs.jcim.7b00397] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Indexed: 12/29/2022]
Abstract
Protein-ATP interactions are ubiquitous in a wide variety of biological processes. Correctly locating ATP binding sites from protein information is an important but challenging task for protein function annotation and drug discovery. However, there is no method that can optimally identify ATP binding sites for different proteins. In this study, we report a new composite predictor, ATPbind, for ATP binding sites by integrating the outputs of two template-based predictors (i.e., S-SITE and TM-SITE) and three discriminative sequence-driven features of proteins: position specific scoring matrix, predicted secondary structure, and predicted solvent accessibility. In ATPbind, we assembled multiple support vector machines (SVMs) based on a random undersampling technique to cope with the serious imbalance phenomenon between the numbers of ATP binding sites and of non-ATP binding sites. We also constructed a new gold-standard benchmark data set consisting of 429 ATP binding proteins from the PDB database to evaluate and compare the proposed ATPbind with other existing predictors. Starting from a query sequence and predicted I-TASSER models, ATPbind can achieve an average accuracy of 72%, covering 62% of all ATP binding sites while achieving a Matthews correlation coefficient value that is significantly higher than that of other state-of-the-art predictors.
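The random undersampling step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the ATPbind implementation: the function names `balanced_subsets` and `ensemble_score`, the toy labels, and the choice to average model outputs are all assumptions made for illustration (the abstract does not say how the SVM outputs are combined). Each subset would be used to train one base classifier.

```python
import random


def balanced_subsets(labels, n_subsets, seed=0):
    """Build n_subsets index lists for training an undersampling ensemble.

    Each subset keeps every binding (label 1) residue and draws an
    equal-sized random sample of non-binding (label 0) residues.
    Assumes binding residues are the minority class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(pos), len(neg))
    return [sorted(pos + rng.sample(neg, k)) for _ in range(n_subsets)]


def ensemble_score(model_scores):
    """Average the scores the individual models assign to one residue.

    Averaging is an assumption; other combination rules are possible.
    """
    return sum(model_scores) / len(model_scores)


# Hypothetical toy labels: 2 binding residues among 10.
labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
subsets = balanced_subsets(labels, n_subsets=3)
```

Because every subset contains all binding residues, no positive example is discarded, while each base model sees a 1:1 class ratio.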
Affiliation(s)
- Jun Hu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing, 210094, P. R. China; Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw, Ann Arbor, Michigan 48109-2218, United States
| | - Yang Li
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing, 210094, P. R. China; Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw, Ann Arbor, Michigan 48109-2218, United States
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw, Ann Arbor, Michigan 48109-2218, United States
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing, 210094, P. R. China
| |
|
50
|
Zhang J, Ma Z, Kurgan L. Comprehensive review and empirical analysis of hallmarks of DNA-, RNA- and protein-binding residues in protein chains. Brief Bioinform 2017; 20:1250-1268. [DOI: 10.1093/bib/bbx168] [Citation(s) in RCA: 60] [Impact Index Per Article: 8.6] [Received: 08/30/2017] [Revised: 11/15/2017] [Indexed: 11/13/2022] Open
Abstract
Proteins interact with a variety of molecules including proteins and nucleic acids. We review a comprehensive collection of over 50 studies that analyze and/or predict these interactions. While majority of these studies address either solely protein–DNA or protein–RNA binding, only a few have a wider scope that covers both protein–protein and protein–nucleic acid binding. Our analysis reveals that binding residues are typically characterized with three hallmarks: relative solvent accessibility (RSA), evolutionary conservation and propensity of amino acids (AAs) for binding. Motivated by drawbacks of the prior studies, we perform a large-scale analysis to quantify and contrast the three hallmarks for residues that bind DNA-, RNA-, protein- and (for the first time) multi-ligand-binding residues that interact with DNA and proteins, and with RNA and proteins. Results generated on a well-annotated data set of over 23 000 proteins show that conservation of binding residues is higher for nucleic acid- than protein-binding residues. Multi-ligand-binding residues are more conserved and have higher RSA than single-ligand-binding residues. We empirically show that each hallmark discriminates between binding and nonbinding residues, even predicted RSA, and that combining them improves discriminatory power for each of the five types of interactions. Linear scoring functions that combine these hallmarks offer good predictive performance of residue-level propensity for binding and provide intuitive interpretation of predictions. Better understanding of these residue-level interactions will facilitate development of methods that accurately predict binding in the exponentially growing databases of protein sequences.
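As a rough illustration of what a linear scoring function over the three hallmarks could look like, here is a minimal sketch. The function name `binding_score`, the unit weights, the zero bias, and the residue values are hypothetical; the abstract does not report the fitted coefficients, so none are reproduced here.

```python
def binding_score(rsa, conservation, aa_propensity,
                  weights=(1.0, 1.0, 1.0), bias=0.0):
    """Linear combination of the three residue-level hallmarks.

    The weights and bias are placeholders (assumptions), not published values.
    """
    w_rsa, w_cons, w_aa = weights
    return bias + w_rsa * rsa + w_cons * conservation + w_aa * aa_propensity


# Hypothetical residues: (RSA, conservation, amino acid propensity).
residues = {"R12": (0.8, 0.9, 0.7), "L40": (0.1, 0.3, 0.2)}

# Rank residues from most to least likely to bind under this toy scoring.
ranked = sorted(residues, key=lambda r: binding_score(*residues[r]), reverse=True)
```

A form like this is easy to interpret because each weight shows directly how much one hallmark contributes to the residue's score.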
|