1
|
Zhang L, Liu T. PDNAPred: Interpretable prediction of protein-DNA binding sites based on pre-trained protein language models. Int J Biol Macromol 2024; 281:136147. [PMID: 39357703 DOI: 10.1016/j.ijbiomac.2024.136147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2024] [Revised: 09/11/2024] [Accepted: 09/27/2024] [Indexed: 10/04/2024]
Abstract
Protein-DNA interactions play critical roles in various biological processes and are essential for drug discovery. However, traditional experimental methods are labor-intensive and unable to keep pace with the increasing volume of protein sequences, leading to a substantial number of proteins lacking DNA-binding annotations. Therefore, developing an efficient computational method to identify protein-DNA binding sites is crucial. Unfortunately, most existing computational methods rely on manually selected features or protein structure information, making these methods inapplicable to large-scale prediction tasks. In this study, we introduced PDNAPred, a sequence-based method that combines two pre-trained protein language models with a designed CNN-GRU network to identify DNA-binding sites. Additionally, to tackle the issue of imbalanced dataset samples, we employed focal loss. Our comprehensive experiments demonstrated that PDNAPred significantly improved the accuracy of DNA-binding site prediction, outperforming existing state-of-the-art sequence-based methods. Remarkably, PDNAPred also achieved results comparable to advanced structure-based methods. The designed CNN-GRU network enhances its capability to detect DNA-binding sites accurately. Furthermore, we validated the versatility of PDNAPred by training it on RNA-binding site datasets, showing its potential as a general framework for amino acid binding site prediction. Finally, we conducted model interpretability analysis to elucidate the reasons behind PDNAPred's outstanding performance.
Collapse
Affiliation(s)
- Lingrong Zhang
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
| | - Taigang Liu
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China.
| |
Collapse
|
2
|
Daanial Khan Y, Alkhalifah T, Alturise F, Hassan Butt A. DeepDBS: Identification of DNA-binding sites in protein sequences by using deep representations and random forest. Methods 2024; 231:26-36. [PMID: 39270885 DOI: 10.1016/j.ymeth.2024.09.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2024] [Revised: 08/26/2024] [Accepted: 09/04/2024] [Indexed: 09/15/2024] Open
Abstract
Interactions of biological molecules in organisms are considered to be primary factors for the lifecycle of that organism. Various important biological functions are dependent on such interactions and among different kinds of interactions, the protein DNA interactions are very important for the processes of transcription, regulation of gene expression, DNA repairing and packaging. Thus, keeping the knowledge of such interactions and the sites of those interactions is necessary to study the mechanism of various biological processes. As experimental identification through biological assays is quite resource-demanding, costly and error-prone, scientists opt for the computational methods for efficient and accurate identification of such DNA-protein interaction sites. Thus, herein, we propose a novel and accurate method namely DeepDBS for the identification of DNA-binding sites in proteins, using primary amino acid sequences of proteins under study. From protein sequences, deep representations were computed through a one-dimensional convolution neural network (1D-CNN), recurrent neural network (RNN) and long short-term memory (LSTM) network and were further used to train a Random Forest classifier. Random Forest with LSTM-based features outperformed the other models, as well as the existing state-of-the-art methods with an accuracy score of 0.99 for self-consistency test, 10-fold cross-validation, 5-fold cross-validation, and jackknife validation while 0.92 for independent dataset testing. It is concluded based on results that the DeepDBS can help accurate and efficient identification of DNA binding sites (DBS) in proteins.
Collapse
Affiliation(s)
- Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Punjab 54770, Pakistan
| | - Tamim Alkhalifah
- Department of Computer Engineering, College of Computer, Qassim University, Buraydah, Saudi Arabia
| | - Fahad Alturise
- Department of Cybersecurity, College of Computer, Qassim University, Buraydah 52571, Saudi Arabia
| | - Ahmad Hassan Butt
- Department of Computer Science, Faculty of Computing and Information Technology, University of the Punjab, Lahore 54000, Punjab, Pakistan.
| |
Collapse
|
3
|
Wang B, Li W. Advances in the Application of Protein Language Modeling for Nucleic Acid Protein Binding Site Prediction. Genes (Basel) 2024; 15:1090. [PMID: 39202449 PMCID: PMC11353971 DOI: 10.3390/genes15081090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2024] [Revised: 08/13/2024] [Accepted: 08/14/2024] [Indexed: 09/03/2024] Open
Abstract
Protein and nucleic acid binding site prediction is a critical computational task that benefits a wide range of biological processes. Previous studies have shown that feature selection holds particular significance for this prediction task, making the generation of more discriminative features a key area of interest for many researchers. Recent progress has shown the power of protein language models in handling protein sequences, in leveraging the strengths of attention networks, and in successful applications to tasks such as protein structure prediction. This naturally raises the question of the applicability of protein language models in predicting protein and nucleic acid binding sites. Various approaches have explored this potential. This paper first describes the development of protein language models. Then, a systematic review of the latest methods for predicting protein and nucleic acid binding sites is conducted by covering benchmark sets, feature generation methods, performance comparisons, and feature ablation studies. These comparisons demonstrate the importance of protein language models for the prediction task. Finally, the paper discusses the challenges of protein and nucleic acid binding site prediction and proposes possible research directions and future trends. The purpose of this survey is to furnish researchers with actionable suggestions for comprehending the methodologies used in predicting protein-nucleic acid binding sites, fostering the creation of protein-centric language models, and tackling real-world obstacles encountered in this field.
Collapse
Affiliation(s)
| | - Wenjin Li
- Institute for Advanced Study, Shenzhen University, Shenzhen 518061, China;
| |
Collapse
|
4
|
Zhang B, Hou Z, Yang Y, Wong KC, Zhu H, Li X. SOFB is a comprehensive ensemble deep learning approach for elucidating and characterizing protein-nucleic-acid-binding residues. Commun Biol 2024; 7:679. [PMID: 38830995 PMCID: PMC11148103 DOI: 10.1038/s42003-024-06332-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2023] [Accepted: 05/15/2024] [Indexed: 06/05/2024] Open
Abstract
Proteins and nucleic-acids are essential components of living organisms that interact in critical cellular processes. Accurate prediction of nucleic acid-binding residues in proteins can contribute to a better understanding of protein function. However, the discrepancy between protein sequence information and obtained structural and functional data renders most current computational models ineffective. Therefore, it is vital to design computational models based on protein sequence information to identify nucleic acid binding sites in proteins. Here, we implement an ensemble deep learning model-based nucleic-acid-binding residues on proteins identification method, called SOFB, which characterizes protein sequences by learning the semantics of biological dynamics contexts, and then develop an ensemble deep learning-based sequence network to learn feature representation and classification by explicitly modeling dynamic semantic information. Among them, the language learning model, which is constructed from natural language to biological language, captures the underlying relationships of protein sequences, and the ensemble deep learning-based sequence network consisting of different convolutional layers together with Bi-LSTM refines various features for optimal performance. Meanwhile, to address the imbalanced issue, we adopt ensemble learning to train multiple models and then incorporate them. Our experimental results on several DNA/RNA nucleic-acid-binding residue datasets demonstrate that our proposed model outperforms other state-of-the-art methods. In addition, we conduct an interpretability analysis of the identified nucleic acid binding residue sequences based on the attention weights of the language learning model, revealing novel insights into the dynamic semantic information that supports the identified nucleic acid binding residues. SOFB is available at https://github.com/Encryptional/SOFB and https://figshare.com/articles/online_resource/SOFB_figshare_rar/25499452 .
Collapse
Affiliation(s)
- Bin Zhang
- School of Artificial Intelligence, Jilin University, Changchun, China
| | - Zilong Hou
- School of Artificial Intelligence, Jilin University, Changchun, China
| | - Yuning Yang
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong, Hong Kong SAR
| | - Haoran Zhu
- School of Artificial Intelligence, Jilin University, Changchun, China.
| | - Xiangtao Li
- School of Artificial Intelligence, Jilin University, Changchun, China.
| |
Collapse
|
5
|
Xia Y, Pan X, Shen HB. A comprehensive survey on protein-ligand binding site prediction. Curr Opin Struct Biol 2024; 86:102793. [PMID: 38447285 DOI: 10.1016/j.sbi.2024.102793] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2023] [Revised: 02/18/2024] [Accepted: 02/18/2024] [Indexed: 03/08/2024]
Abstract
Protein-ligand binding site prediction is critical for protein function annotation and drug discovery. Biological experiments are time-consuming and require significant equipment, materials, and labor resources. Developing accurate and efficient computational methods for protein-ligand interaction prediction is essential. Here, we summarize the key challenges associated with ligand binding site (LBS) prediction and introduce recently published methods from their input features, computational algorithms, and ligand types. Furthermore, we investigate the specificity of allosteric site identification as a particular LBS type. Finally, we discuss the prospective directions for machine learning-based LBS prediction in the near future.
Collapse
Affiliation(s)
- Ying Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.
| |
Collapse
|
6
|
Sun C, Feng Y. EPDRNA: A Model for Identifying DNA-RNA Binding Sites in Disease-Related Proteins. Protein J 2024; 43:513-521. [PMID: 38491248 DOI: 10.1007/s10930-024-10183-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 02/02/2024] [Indexed: 03/18/2024]
Abstract
Protein-DNA and protein-RNA interactions are involved in many biological processes and regulate many cellular functions. Moreover, they are related to many human diseases. To understand the molecular mechanism of protein-DNA binding and protein-RNA binding, it is important to identify which residues in the protein sequence bind to DNA and RNA. At present, there are few methods for specifically identifying the binding sites of disease-related protein-DNA and protein-RNA. In this study, so we combined four machine learning algorithms into an ensemble classifier (EPDRNA) to predict DNA and RNA binding sites in disease-related proteins. The dataset used in model was collated from UniProt and PDB database, and PSSM, physicochemical properties and amino acid type were used as features. The EPDRNA adopted soft voting and achieved the best AUC value of 0.73 at the DNA binding sites, and the best AUC value of 0.71 at the RNA binding sites in 10-fold cross validation in the training sets. In order to further verify the performance of the model, we assessed EPDRNA for the prediction of DNA-binding sites and the prediction of RNA-binding sites on the independent test dataset. The EPDRNA achieved 85% recall rate and 25% precision on the protein-DNA interaction independent test set, and achieved 82% recall rate and 27% precision on the protein-RNA interaction independent test set. The online EPDRNA webserver is freely available at http://www.s-bioinformatics.cn/epdrna .
Collapse
Affiliation(s)
- CanZhuang Sun
- College of Science, Inner Mongolia Agriculture University, Hohhot, 010018, People's Republic of China
| | - YongE Feng
- College of Science, Inner Mongolia Agriculture University, Hohhot, 010018, People's Republic of China.
| |
Collapse
|
7
|
Zheng M, Sun G, Li X, Fan Y. EGPDI: identifying protein-DNA binding sites based on multi-view graph embedding fusion. Brief Bioinform 2024; 25:bbae330. [PMID: 38975896 PMCID: PMC11229037 DOI: 10.1093/bib/bbae330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2024] [Revised: 06/08/2024] [Accepted: 06/26/2024] [Indexed: 07/09/2024] Open
Abstract
Mechanisms of protein-DNA interactions are involved in a wide range of biological activities and processes. Accurately identifying binding sites between proteins and DNA is crucial for analyzing genetic material, exploring protein functions, and designing novel drugs. In recent years, several computational methods have been proposed as alternatives to time-consuming and expensive traditional experiments. However, accurately predicting protein-DNA binding sites still remains a challenge. Existing computational methods often rely on handcrafted features and a single-model architecture, leaving room for improvement. We propose a novel computational method, called EGPDI, based on multi-view graph embedding fusion. This approach involves the integration of Equivariant Graph Neural Networks (EGNN) and Graph Convolutional Networks II (GCNII), independently configured to profoundly mine the global and local node embedding representations. An advanced gated multi-head attention mechanism is subsequently employed to capture the attention weights of the dual embedding representations, thereby facilitating the integration of node features. Besides, extra node features from protein language models are introduced to provide more structural information. To our knowledge, this is the first time that multi-view graph embedding fusion has been applied to the task of protein-DNA binding site prediction. The results of five-fold cross-validation and independent testing demonstrate that EGPDI outperforms state-of-the-art methods. Further comparative experiments and case studies also verify the superiority and generalization ability of EGPDI.
Collapse
Affiliation(s)
- Mengxin Zheng
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
| | - Guicong Sun
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
| | - Xueping Li
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
| | - Yongxian Fan
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
| |
Collapse
|
8
|
Jia P, Zhang F, Wu C, Li M. A comprehensive review of protein-centric predictors for biomolecular interactions: from proteins to nucleic acids and beyond. Brief Bioinform 2024; 25:bbae162. [PMID: 38739759 PMCID: PMC11089422 DOI: 10.1093/bib/bbae162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2024] [Revised: 02/17/2024] [Accepted: 03/31/2024] [Indexed: 05/16/2024] Open
Abstract
Proteins interact with diverse ligands to perform a large number of biological functions, such as gene expression and signal transduction. Accurate identification of these protein-ligand interactions is crucial to the understanding of molecular mechanisms and the development of new drugs. However, traditional biological experiments are time-consuming and expensive. With the development of high-throughput technologies, an increasing amount of protein data is available. In the past decades, many computational methods have been developed to predict protein-ligand interactions. Here, we review a comprehensive set of over 160 protein-ligand interaction predictors, which cover protein-protein, protein-nucleic acid, protein-peptide and protein-other ligands (nucleotide, heme, ion) interactions. We have carried out a comprehensive analysis of the above four types of predictors from several significant perspectives, including their inputs, feature profiles, models, availability, etc. The current methods primarily rely on protein sequences, especially utilizing evolutionary information. The significant improvement in predictions is attributed to deep learning methods. Additionally, sequence-based pretrained models and structure-based approaches are emerging as new trends.
Collapse
Affiliation(s)
- Pengzhen Jia
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
| | - Fuhao Zhang
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi 712100, China
| | - Chaojin Wu
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
| | - Min Li
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
| |
Collapse
|
9
|
Roche R, Moussad B, Shuvo MH, Tarafder S, Bhattacharya D. EquiPNAS: improved protein-nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks. Nucleic Acids Res 2024; 52:e27. [PMID: 38281252 PMCID: PMC10954458 DOI: 10.1093/nar/gkae039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2023] [Revised: 12/22/2023] [Accepted: 01/11/2024] [Indexed: 01/30/2024] Open
Abstract
Protein language models (pLMs) trained on a large corpus of protein sequences have shown unprecedented scalability and broad generalizability in a wide range of predictive modeling tasks, but their power has not yet been harnessed for predicting protein-nucleic acid binding sites, critical for characterizing the interactions between proteins and nucleic acids. Here, we present EquiPNAS, a new pLM-informed E(3) equivariant deep graph neural network framework for improved protein-nucleic acid binding site prediction. By combining the strengths of pLM and symmetry-aware deep graph learning, EquiPNAS consistently outperforms the state-of-the-art methods for both protein-DNA and protein-RNA binding site prediction on multiple datasets across a diverse set of predictive modeling scenarios ranging from using experimental input to AlphaFold2 predictions. Our ablation study reveals that the pLM embeddings used in EquiPNAS are sufficiently powerful to dramatically reduce the dependence on the availability of evolutionary information without compromising on accuracy, and that the symmetry-aware nature of the E(3) equivariant graph-based neural architecture offers remarkable robustness and performance resilience. EquiPNAS is freely available at https://github.com/Bhattacharya-Lab/EquiPNAS.
Collapse
Affiliation(s)
- Rahmatullah Roche
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
| | - Bernard Moussad
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
| | - Md Hossain Shuvo
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
| | - Sumit Tarafder
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
| | | |
Collapse
|
10
|
Zhang J, Chen Q, Liu B. iNucRes-ASSH: Identifying nucleic acid-binding residues in proteins by using self-attention-based structure-sequence hybrid neural network. Proteins 2024; 92:395-410. [PMID: 37915276 DOI: 10.1002/prot.26626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2023] [Revised: 09/27/2023] [Accepted: 10/17/2023] [Indexed: 11/03/2023]
Abstract
Interaction between proteins and nucleic acids is crucial to many cellular activities. Accurately detecting nucleic acid-binding residues (NABRs) in proteins can help researchers better understand the interaction mechanism between proteins and nucleic acids. Structure-based methods can generally make more accurate predictions than sequence-based methods. However, the existing structure-based methods are sensitive to protein conformational changes, causing limited generalizability. More effective and robust approaches should be further explored. In this study, we propose iNucRes-ASSH to identify nucleic acid-binding residues with a self-attention-based structure-sequence hybrid neural network. It improves the generalizability and robustness of NABR prediction from two levels: residue representation and prediction model. Experimental results show that iNucRes-ASSH can predict the nucleic acid-binding residues even when the experimentally validated structures are unavailable and outperforms five competing methods on a recent benchmark dataset and a widely used test dataset.
Collapse
Affiliation(s)
- Jun Zhang
- National Engineering Laboratory for Big Data System Computing Technology, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong, China
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Qingcai Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
11
|
Chuah CW, He W, Huang DS. GMean-a semi-supervised GRU and K-mean model for predicting the TF binding site. Sci Rep 2024; 14:2539. [PMID: 38291225 PMCID: PMC10827707 DOI: 10.1038/s41598-024-52933-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2023] [Accepted: 01/25/2024] [Indexed: 02/01/2024] Open
Abstract
The transcription factor binding site is a deoxyribonucleic acid sequence that binds to transcription factors. Transcription factors are proteins that regulate the transcription gene. Abnormal turnover of transcription factors can lead to uncontrolled cell growth. Therefore, discovering the relationships between transcription factors and deoxyribonucleic acid sequences is an important component of bioinformatics research. Numerous deep learning and machine learning language models have been developed to accomplish these tasks. Our goal in this work is to propose a GMean model for predicting unlabelled deoxyribonucleic acid sequences. The GMean model is a hybrid model with a combination of gated recurrent unit and K-mean clustering. The GMean model is developed in three phases. The labelled and unlabelled data are processed based on k-mers and tokenization. The labelled data is used for training. The unlabelled data are used for testing and prediction. The experimental data consists of deoxyribonucleic acid experimental of GM12878, K562 and HepG2. The experimental results show that GMean is feasible and effective in predicting deoxyribonucleic acid sequences, as the highest accuracy is 91.85% in predicting K562 and HepG2. This is followed by the prediction of the sequence between GM12878 and K562 with an accuracy of 89.13%. The lowest accuracy is the prediction of the sequence between HepG2 and GM12828, which is 88.80%.
Collapse
Affiliation(s)
- Chai Wen Chuah
- Guangdong University of Science and Technology, Songsan Hu, Dongguang, 523070, Guangdong, China.
| | - Wanxian He
- Guangxi Academy of Sciences, 98 Daling Road, Nanning, 530007, Guangxi, China
| | - De-Shuang Huang
- Eastern Institute for Advanced Study, Eastern Institute of Technology, Tongxin Road No. 568, Ningbo, 315201, China
| |
Collapse
|
12
|
Zhu YH, Liu Z, Liu Y, Ji Z, Yu DJ. ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein-DNA binding site prediction. Brief Bioinform 2024; 25:bbae040. [PMID: 38349057 PMCID: PMC10939370 DOI: 10.1093/bib/bbae040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2023] [Revised: 01/02/2024] [Accepted: 01/22/2024] [Indexed: 02/15/2024] Open
Abstract
Efficient and accurate recognition of protein-DNA interactions is vital for understanding the molecular mechanisms of related biological processes and further guiding drug discovery. Although the current experimental protocols are the most precise way to determine protein-DNA binding sites, they tend to be labor-intensive and time-consuming. There is an immediate need to design efficient computational approaches for predicting DNA-binding sites. Here, we proposed ULDNA, a new deep-learning model, to deduce DNA-binding sites from protein sequences. This model leverages an LSTM-attention architecture, embedded with three unsupervised language models that are pre-trained on large-scale sequences from multiple database sources. To prove its effectiveness, ULDNA was tested on 229 protein chains with experimental annotation of DNA-binding sites. Results from computational experiments revealed that ULDNA significantly improves the accuracy of DNA-binding site prediction in comparison with 17 state-of-the-art methods. In-depth data analyses showed that the major strength of ULDNA stems from employing three transformer language models. Specifically, these language models capture complementary feature embeddings with evolution diversity, in which the complex DNA-binding patterns are buried. Meanwhile, the specially crafted LSTM-attention network effectively decodes evolution diversity-based embeddings as DNA-binding results at the residue level. Our findings demonstrated a new pipeline for predicting DNA-binding sites on a large scale with high accuracy from protein sequence alone.
Collapse
Affiliation(s)
- Yi-Heng Zhu
- College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210095, China
| | - Zi Liu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Yan Liu
- School of Information Engineering, Yangzhou University, Yangzhou 225000, China
| | - Zhiwei Ji
- College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210095, China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| |
Collapse
|
13
|
Keith AD, Sawyer EB, Choy DCY, Xie Y, Biggs GS, Klein OJ, Brear PD, Wales DJ, Barker PD. Combining experiment and energy landscapes to explore anaerobic heme breakdown in multifunctional hemoproteins. Phys Chem Chem Phys 2024; 26:695-712. [PMID: 38053511 DOI: 10.1039/d3cp03897a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/07/2023]
Abstract
To survive, many pathogens extract heme from their host organism and break down the porphyrin scaffold to sequester the Fe2+ ion via a heme oxygenase. Recent studies have revealed that certain pathogens can anaerobically degrade heme. Our own research has shown that one such pathway proceeds via NADH-dependent heme degradation, which has been identified in a family of hemoproteins from a range of bacteria. HemS, from Yersinia enterocolitica, is the main focus of this work, along with HmuS (Yersinia pestis), ChuS (Escherichia coli) and ShuS (Shigella dysenteriae). We combine experiments, Energy Landscape Theory, and a bioinformatic investigation to place these homologues within a wider phylogenetic context. A subset of these hemoproteins are known to bind certain DNA promoter regions, suggesting not only that they can catalytically degrade heme, but that they are also involved in transcriptional modulation responding to heme flux. Many of the bacterial species responsible for these hemoproteins (including those that produce HemS, ChuS and ShuS) are known to specifically target oxygen-depleted regions of the gastrointestinal tract. A deeper understanding of anaerobic heme breakdown processes exploited by these pathogens could therefore prove useful in the development of future strategies for disease prevention.
Collapse
Affiliation(s)
- Alasdair D Keith
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW, UK.
| | - Elizabeth B Sawyer
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW, UK.
| | - Desmond C Y Choy
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW, UK.
| | - Yuhang Xie
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW, UK.
| | - George S Biggs
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW, UK.
| | - Oskar James Klein
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW, UK.
| | - Paul D Brear
- Department of Biochemistry, University of Cambridge, Sanger Building, Cambridge CB2 1GA, UK
| | - David J Wales
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW, UK.
| | - Paul D Barker
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW, UK.
| |
Collapse
|
14
|
Zhao H, Zhu B, Jiang T, Cui Z, Wu H. Identification of DNA-protein binding residues through integration of Transformer encoder and Bi-directional Long Short-Term Memory. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:170-185. [PMID: 38303418 DOI: 10.3934/mbe.2024008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/03/2024]
Abstract
DNA-protein binding is crucial for the normal development and function of organisms. The significance of accurately identifying DNA-protein binding sites lies in its role in disease prevention and the development of innovative approaches to disease treatment. In the present study, we introduce a precise and robust identifier for DNA-protein binding residues. In the context of protein representation, we combine the evolutionary information of the protein, represented by its position-specific scoring matrix, with the spatial information of the protein's secondary structure, enriching the overall informational content. This approach initially employs a combination of Bi-directional Long Short-Term Memory and Transformer encoder to jointly extract the interdependencies among residues within the protein sequence. Subsequently, convolutional operations are applied to the resulting feature matrix to capture local features of the residues. Experimental results on the benchmark dataset demonstrate that our method exhibits a higher level of competitiveness when compared to contemporary classifiers. Specifically, our method achieved an MCC of 0.349, SP of 96.50%, SN of 44.03% and ACC of 94.59% on the PDNA-41 dataset.
Collapse
Affiliation(s)
- Haipeng Zhao
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Baozhong Zhu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | | | - Zhiming Cui
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Hongjie Wu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| |
Collapse
|
15
|
Rodrigues CHM, Ascher DB. CSM-Potential2: A comprehensive deep learning platform for the analysis of protein interacting interfaces. Proteins 2023. [PMID: 37870486 DOI: 10.1002/prot.26615] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2023] [Revised: 10/04/2023] [Accepted: 10/05/2023] [Indexed: 10/24/2023]
Abstract
Proteins are molecular machinery that participate in virtually all essential biological functions within the cell, which are tightly related to their 3D structure. The importance of understanding protein structure-function relationship is highlighted by the exponential growth of experimental structures, which has been greatly expanded by recent breakthroughs in protein structure prediction, most notably RosettaFold, and AlphaFold2. These advances have prompted the development of several computational approaches that leverage these data sources to explore potential biological interactions. However, most methods are generally limited to analysis of single types of interactions, such as protein-protein or protein-ligand interactions, and their complexity limits the usability to expert users. Here we report CSM-Potential2, a deep learning platform for the analysis of binding interfaces on protein structures. In addition to prediction of protein-protein interactions binding sites and classification of biological ligands, our new platform incorporates prediction of interactions with nucleic acids at the residue level and allows for ligand transplantation based on sequence and structure similarity to experimentally determined structures. We anticipate our platform to be a valuable resource that provides easy access to a range of state-of-the-art methods to expert and non-expert users for the study of biological interactions. Our tool is freely available as an easy-to-use web server and API available at https://biosig.lab.uq.edu.au/csm_potential.
Collapse
Affiliation(s)
- Carlos H M Rodrigues
- Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
- School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, Queensland, Australia
| | - David B Ascher
- Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
- School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, Queensland, Australia
| |
Collapse
|
16
|
Song Y, Yuan Q, Zhao H, Yang Y. Accurately identifying nucleic-acid-binding sites through geometric graph learning on language model predicted structures. Brief Bioinform 2023; 24:bbad360. [PMID: 37824738 DOI: 10.1093/bib/bbad360] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2023] [Revised: 09/18/2023] [Accepted: 09/18/2023] [Indexed: 10/14/2023] Open
Abstract
The interactions between nucleic acids and proteins are important in diverse biological processes. The high-quality prediction of nucleic-acid-binding sites continues to pose a significant challenge. Presently, the predictive efficacy of sequence-based methods is constrained by their exclusive consideration of sequence context information, whereas structure-based methods are unsuitable for proteins lacking known tertiary structures. Though protein structures predicted by AlphaFold2 could be used, the extensive computing requirement of AlphaFold2 hinders its use for genome-wide applications. Based on the recent breakthrough of ESMFold for fast prediction of protein structures, we have developed GLMSite, which accurately identifies DNA- and RNA-binding sites using geometric graph learning on ESMFold predicted structures. Here, the predicted protein structures are employed to construct protein structural graph with residues as nodes and spatially neighboring residue pairs for edges. The node representations are further enhanced through the pre-trained language model ProtTrans. The network was trained using a geometric vector perceptron, and the geometric embeddings were subsequently fed into a common network to acquire common binding characteristics. Finally, these characteristics were input into two fully connected layers to predict binding sites with DNA and RNA, respectively. Through comprehensive tests on DNA/RNA benchmark datasets, GLMSite was shown to surpass the latest sequence-based methods and be comparable with structure-based methods. Moreover, the prediction was shown useful for inferring nucleic-acid-binding proteins, demonstrating its potential for protein function discovery. The datasets, codes, and trained models are available at https://github.com/biomed-AI/nucleic-acid-binding.
Collapse
Affiliation(s)
- Yidong Song
- Key Laboratory of Machine Intelligence and Advanced Computing of MOE, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
| | - Qianmu Yuan
- Key Laboratory of Machine Intelligence and Advanced Computing of MOE, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
| | - Huiying Zhao
- Key Laboratory of Machine Intelligence and Advanced Computing of MOE, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
| | - Yuedong Yang
- Key Laboratory of Machine Intelligence and Advanced Computing of MOE, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
| |
Collapse
|
17
|
Roche R, Moussad B, Shuvo MH, Tarafder S, Bhattacharya D. EquiPNAS: improved protein-nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.09.14.557719. [PMID: 37745556 PMCID: PMC10515942 DOI: 10.1101/2023.09.14.557719] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/26/2023]
Abstract
Protein language models (pLMs) trained on a large corpus of protein sequences have shown unprecedented scalability and broad generalizability in a wide range of predictive modeling tasks, but their power has not yet been harnessed for predicting protein-nucleic acid binding sites, critical for characterizing the interactions between proteins and nucleic acids. Here we present EquiPNAS, a new pLM-informed E(3) equivariant deep graph neural network framework for improved protein-nucleic acid binding site prediction. By combining the strengths of pLM and symmetry-aware deep graph learning, EquiPNAS consistently outperforms the state-of-the-art methods for both protein-DNA and protein-RNA binding site prediction on multiple datasets across a diverse set of predictive modeling scenarios ranging from using experimental input to AlphaFold2 predictions. Our ablation study reveals that the pLM embeddings used in EquiPNAS are sufficiently powerful to dramatically reduce the dependence on the availability of evolutionary information without compromising on accuracy, and that the symmetry-aware nature of the E(3) equivariant graph-based neural architecture offers remarkable robustness and performance resilience. EquiPNAS is freely available at https://github.com/Bhattacharya-Lab/EquiPNAS.
Collapse
Affiliation(s)
- Rahmatullah Roche
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States of America
| | - Bernard Moussad
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States of America
| | - Md Hossain Shuvo
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States of America
| | - Sumit Tarafder
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States of America
| | - Debswapna Bhattacharya
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States of America
| |
Collapse
|
18
|
Jiang Z, Shen YY, Liu R. Structure-based prediction of nucleic acid binding residues by merging deep learning- and template-based approaches. PLoS Comput Biol 2023; 19:e1011428. [PMID: 37672551 PMCID: PMC10482303 DOI: 10.1371/journal.pcbi.1011428] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2023] [Accepted: 08/11/2023] [Indexed: 09/08/2023] Open
Abstract
Accurate prediction of nucleic binding residues is essential for the understanding of transcription and translation processes. Integration of feature- and template-based strategies could improve the prediction of these key residues in proteins. Nevertheless, traditional hybrid algorithms have been surpassed by recently developed deep learning-based methods, and the possibility of integrating deep learning- and template-based approaches to improve performance remains to be explored. To address these issues, we developed a novel structure-based integrative algorithm called NABind that can accurately predict DNA- and RNA-binding residues. A deep learning module was built based on the diversified sequence and structural descriptors and edge aggregated graph attention networks, while a template module was constructed by transforming the alignments between the query and its multiple templates into features for supervised learning. Furthermore, the stacking strategy was adopted to integrate the above two modules for improving prediction performance. Finally, a post-processing module dependent on the random walk algorithm was proposed to further correct the integrative predictions. Extensive evaluations indicated that our approach could not only achieve excellent performance on both native and predicted structures but also outperformed existing hybrid algorithms and recent deep learning methods. The NABind server is available at http://liulab.hzau.edu.cn/NABind/.
Collapse
Affiliation(s)
- Zheng Jiang
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Yue-Yue Shen
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Rong Liu
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| |
Collapse
|
19
|
Guan S, Zou Q, Wu H, Ding Y. Protein-DNA Binding Residues Prediction Using a Deep Learning Model With Hierarchical Feature Extraction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2619-2628. [PMID: 35834447 DOI: 10.1109/tcbb.2022.3190933] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Biologically important effects occur when proteins bind to other substances, of which binding to DNA is a crucial one. Therefore, accurate identification of protein-DNA binding residues is important for further understanding of the protein-DNA interaction mechanism. Although wet-lab methods can accurately obtain the location of bound residues, it requires significant human, financial and time costs. There is thus an urgent need to develop efficient computational-based methods. Most current state-of-the-art methods are two-step approaches: the first step uses a sliding window technique to extract residue features; the second step uses each residue as an input to the model for prediction. This has a negative impact on the efficiency of prediction and ease of use. In this study, we propose a sequence-to-sequence (seq2seq) model that can input the entire protein sequence of variable length and use two modules, Transformer Encoder Block and Feature Extracting Block, for hierarchical feature extraction, where Transformer Encoder Block is used to extract global features, and then Feature Extracting Block is used to extract local features to further improve the recognition capability of the model. The comparison results on two benchmark datasets, namely PDNA-543 and PDNA-41, prove the effectiveness of our method in identifying protein-DNA binding residues.
Collapse
|
20
|
Durairaj J, de Ridder D, van Dijk AD. Beyond sequence: Structure-based machine learning. Comput Struct Biotechnol J 2022; 21:630-643. [PMID: 36659927 PMCID: PMC9826903 DOI: 10.1016/j.csbj.2022.12.039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2022] [Revised: 12/21/2022] [Accepted: 12/21/2022] [Indexed: 12/31/2022] Open
Abstract
Recent breakthroughs in protein structure prediction demarcate the start of a new era in structural bioinformatics. Combined with various advances in experimental structure determination and the uninterrupted pace at which new structures are published, this promises an age in which protein structure information is as prevalent and ubiquitous as sequence. Machine learning in protein bioinformatics has been dominated by sequence-based methods, but this is now changing to make use of the deluge of rich structural information as input. Machine learning methods making use of structures are scattered across literature and cover a number of different applications and scopes; while some try to address questions and tasks within a single protein family, others aim to capture characteristics across all available proteins. In this review, we look at the variety of structure-based machine learning approaches, how structures can be used as input, and typical applications of these approaches in protein biology. We also discuss current challenges and opportunities in this all-important and increasingly popular field.
Collapse
Affiliation(s)
- Janani Durairaj
- Biozentrum, University of Basel, Basel, Switzerland
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
| | - Dick de Ridder
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
| | - Aalt D.J. van Dijk
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
| |
Collapse
|
21
|
de Francisco Martínez P, Morgante V, González-Pastor JE. Isolation of novel cold-tolerance genes from rhizosphere microorganisms of Antarctic plants by functional metagenomics. Front Microbiol 2022; 13:1026463. [DOI: 10.3389/fmicb.2022.1026463] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2022] [Accepted: 10/25/2022] [Indexed: 11/19/2022] Open
Abstract
The microorganisms that thrive in Antarctica, one of the coldest environments on the planet, have developed diverse adaptation mechanisms to survive in these extreme conditions. Through functional metagenomics, in this work, 29 new genes related to cold tolerance have been isolated and characterized from metagenomic libraries of microorganisms from the rhizosphere of two Antarctic plants. Both libraries were hosted in two cold-sensitive strains of Escherichia coli: DH10B ΔcsdA and DH10B ΔcsdA Δrnr. The csdA gene encodes a DEAD-box RNA helicase and rnr gene encodes an exoribonuclease, both essential for cold-adaptation. Cold-tolerance tests have been carried out in solid and liquid media at 15°C. Among the cold-tolerance genes identified, 12 encode hypothetical and unknown proteins, and 17 encode a wide variety of different proteins previously related to other well-characterized ones involved in metabolism reactions, transport and membrane processes, or genetic information processes. Most of them have been connected to cold-tolerance mechanisms. Interestingly, 13 genes had no homologs in E. coli, thus potentially providing entirely new adaptation strategies for this bacterium. Moreover, ten genes also conferred resistance to UV-B radiation, another extreme condition in Antarctica.
Collapse
|
22
|
Hu J, Bai YS, Zheng LL, Jia NX, Yu DJ, Zhang GJ. Protein-DNA Binding Residue Prediction via Bagging Strategy and Sequence-Based Cube-Format Feature. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3635-3645. [PMID: 34714748 DOI: 10.1109/tcbb.2021.3123828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Protein-DNA interactions play an important role in diverse biological processes. Accurately identifying protein-DNA binding residues is a critical but challenging task for protein function annotations and drug design. Although wet-lab experimental methods are the most accurate way to identify protein-DNA binding residues, they are time consuming and labor intensive. There is an urgent need to develop computational methods to rapidly and accurately predict protein-DNA binding residues. In this study, we propose a novel sequence-based method, named PredDBR, for predicting DNA-binding residues. In PredDBR, for each query protein, its position-specific frequency matrix (PSFM), predicted secondary structure (PSS), and predicted probabilities of ligand-binding residues (PPLBR) are first generated as three feature sources. Secondly, for each feature source, the sliding window technique is employed to extract the matrix-format feature of each residue. Then, we design two strategies, i.e., square root (SR) and average (AVE), to separately transform PSFM-based and two predicted feature source-based, i.e., PSS-based and PPLBR-based, matrix-format features of each residue into three corresponding cube-format features. Finally, after serially combining the three cube-format features, the ensemble classifier is generated via applying bagging strategy to multiple base classifiers built by the framework of 2D convolutional neural network. The computational experimental results demonstrate that the proposed PredDBR achieves an average overall accuracy of 93.7% and a Mathew's correlation coefficient of 0.405 on two independent validation datasets and outperforms several state-of-the-art sequenced-based protein-DNA binding residue predictors. The PredDBR web-server is available at https://jun-csbio.github.io/PredDBR/.
Collapse
|
23
|
Patiyal S, Dhall A, Raghava GPS. A deep learning-based method for the prediction of DNA interacting residues in a protein. Brief Bioinform 2022; 23:6658239. [PMID: 35943134 DOI: 10.1093/bib/bbac322] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2022] [Revised: 07/01/2022] [Accepted: 07/15/2022] [Indexed: 11/13/2022] Open
Abstract
DNA-protein interaction is one of the most crucial interactions in the biological system, which decides the fate of many processes such as transcription, regulation and splicing of genes. In this study, we trained our models on a training dataset of 646 DNA-binding proteins having 15 636 DNA interacting and 298 503 non-interacting residues. Our trained models were evaluated on an independent dataset of 46 DNA-binding proteins having 965 DNA interacting and 9911 non-interacting residues. All proteins in the independent dataset have less than 30% of sequence similarity with proteins in the training dataset. A wide range of traditional machine learning and deep learning (1D-CNN) techniques-based models have been developed using binary, physicochemical properties and Position-Specific Scoring Matrix (PSSM)/evolutionary profiles. In the case of machine learning technique, eXtreme Gradient Boosting-based model achieved a maximum area under the receiver operating characteristics (AUROC) curve of 0.77 on the independent dataset using PSSM profile. Deep learning-based model achieved the highest AUROC of 0.79 on the independent dataset using a combination of all three profiles. We evaluated the performance of existing methods on the independent dataset and observed that our proposed method outperformed all the existing methods. In order to facilitate scientific community, we developed standalone software and web server, which are accessible from https://webs.iiitd.edu.in/raghava/dbpred.
Collapse
Affiliation(s)
- Sumeet Patiyal
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
| | - Anjali Dhall
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
| | - Gajendra P S Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi-110020, India
| |
Collapse
|
24
|
Yamaguchi S, Nakashima H, Moriwaki Y, Terada T, Shimizu K. Prediction of protein mononucleotide binding sites using AlphaFold2 and machine learning. Comput Biol Chem 2022; 100:107744. [DOI: 10.1016/j.compbiolchem.2022.107744] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2022] [Revised: 07/12/2022] [Accepted: 07/22/2022] [Indexed: 11/26/2022]
|
25
|
Wang N, Yan K, Zhang J, Liu B. iDRNA-ITF: identifying DNA- and RNA-binding residues in proteins based on induction and transfer framework. Brief Bioinform 2022; 23:6609520. [PMID: 35709747 DOI: 10.1093/bib/bbac236] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2022] [Revised: 05/06/2022] [Accepted: 05/20/2022] [Indexed: 11/14/2022] Open
Abstract
Protein-DNA and protein-RNA interactions are involved in many biological activities. In the post-genome era, accurate identification of DNA- and RNA-binding residues in protein sequences is of great significance for studying protein functions and promoting new drug design and development. Therefore, some sequence-based computational methods have been proposed for identifying DNA- and RNA-binding residues. However, they failed to fully utilize the functional properties of residues, leading to limited prediction performance. In this paper, a sequence-based method iDRNA-ITF was proposed to incorporate the functional properties in residue representation by using an induction and transfer framework. The properties of nucleic acid-binding residues were induced by the nucleic acid-binding residue feature extraction network, and then transferred into the feature integration modules of the DNA-binding residue prediction network and the RNA-binding residue prediction network for the final prediction. Experimental results on four test sets demonstrate that iDRNA-ITF achieves the state-of-the-art performance, outperforming the other existing sequence-based methods. The webserver of iDRNA-ITF is freely available at http://bliulab.net/iDRNA-ITF.
Collapse
Affiliation(s)
- Ning Wang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Jun Zhang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China.,Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China
| |
Collapse
|
26
|
Arya A, Mary Varghese D, Kumar Verma A, Ahmad S. Inadequacy of evolutionary profiles vis-a-vis single sequences in predicting transient DNA-binding sites in proteins. J Mol Biol 2022; 434:167640. [PMID: 35597551 DOI: 10.1016/j.jmb.2022.167640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2022] [Revised: 05/01/2022] [Accepted: 05/16/2022] [Indexed: 10/18/2022]
Abstract
Sequence-based prediction of DNA-binding residues in a protein is a widely studied problem for which machine learning methods with continuously improving predictive power have been developed. Concatenated rows within a sliding window of a Position Specific Substitution Matrix (PSSM) of the protein concerned are currently used as the primary feature set in almost all the methods of predicting DNA-binding residues. Here we report that these evolutionary profiles are powerful, only for identifying conserved binding sites and fall short for the residue positions which undergo binding to non-binding transitions in closely related proteins. We created a database of highly similar protein pairs with known protein-DNA complexes and investigated differential predictability of conserved and transient binding within each pair. Retraining machine learning models uniformly, we compared the predictive powers of the models trained on PSSMs against similarly trained models on sparse-encoded single sequences. We found that the transient binding site predictions from evolutionary profiles are outperformed by single sequence based models under controlled training and test experiments by as much as 8 percentage points. Thus, we conclude that the PSSM-based models are inadequate to predict high specificity DNA-binding residues. These findings are of critical significance for the design of mutant- and species-specific DNA ligands and for homology based modeling of protein-DNA complexes.
Collapse
Affiliation(s)
- Ajay Arya
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, INDIA
| | - Dana Mary Varghese
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, INDIA
| | - Ajay Kumar Verma
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, INDIA
| | - Shandar Ahmad
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, INDIA.
| |
Collapse
|
27
|
Salmela H, Harwood GP, Münch D, Elsik CG, Herrero-Galán E, Vartiainen MK, Amdam GV. Nuclear translocation of vitellogenin in the honey bee ( Apis mellifera). APIDOLOGIE 2022; 53:13. [PMID: 35309709 PMCID: PMC8924143 DOI: 10.1007/s13592-022-00914-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/02/2021] [Revised: 11/10/2021] [Accepted: 12/18/2021] [Indexed: 05/27/2023]
Abstract
UNLABELLED Vitellogenin (Vg) is a conserved protein used by nearly all oviparous animals to produce eggs. It is also pleiotropic and performs functions in oxidative stress resistance, immunity, and, in honey bees, behavioral development of the worker caste. It has remained enigmatic how Vg affects multiple traits. Here, we asked whether Vg enters the nucleus and acts via DNA-binding. We used cell fractionation, immunohistology, and cell culture to show that a structural subunit of honey bee Vg translocates into cell nuclei. We then demonstrated Vg-DNA binding theoretically and empirically with prediction software and chromatin immunoprecipitation with sequencing (ChIP-seq), finding binding sites at genes influencing immunity and behavior. Finally, we investigated the immunological and enzymatic conditions affecting Vg cleavage and nuclear translocation and constructed a 3D structural model. Our data are the first to show Vg in the nucleus and suggest a new fundamental regulatory role for this ubiquitous protein. SUPPLEMENTARY INFORMATION The online version contains supplementary material available at 10.1007/s13592-022-00914-9.
Collapse
Affiliation(s)
- Heli Salmela
- Organismal and Evolutionary Biology Research Program, University of Helsinki, Viikinkaari 1, 00014 Helsinki, FI Finland
| | - Gyan P. Harwood
- Department of Entomology, University of Illinois at Urbana-Champaign, 320 Morrill Hall 505 South Goodwin Avenue, Urbana, IL 61801 USA
| | - Daniel Münch
- Faculty of Environmental Sciences and Natural Resource Management, Norwegian University of Life Sciences, N-1432 Aas, Norway
| | - Christine G. Elsik
- Division of Animal Sciences, University of Missouri, S108 Animal Sciences Research Center (ASRC), Colombia, MO 65211 USA
| | | | - Maria K. Vartiainen
- Institute of Biotechnology, Helsinki Institute of Life Science, University of Helsinki, Viikinkaari 5, 00014 Helsinki, Finland
| | - Gro V. Amdam
- Faculty of Environmental Sciences and Natural Resource Management, Norwegian University of Life Sciences, N-1432 Aas, Norway
- School of Life Sciences, Arizona State University, 427 East Tyler Mall, Tempe, AZ 85281 USA
| |
Collapse
|
28
|
Yuan Q, Chen S, Rao J, Zheng S, Zhao H, Yang Y. AlphaFold2-aware protein-DNA binding site prediction using graph transformer. Brief Bioinform 2022; 23:6509729. [PMID: 35039821 DOI: 10.1093/bib/bbab564] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2021] [Revised: 11/24/2021] [Accepted: 12/09/2021] [Indexed: 12/13/2022] Open
Abstract
Protein-DNA interactions play crucial roles in the biological systems, and identifying protein-DNA binding sites is the first step for mechanistic understanding of various biological activities (such as transcription and repair) and designing novel drugs. How to accurately identify DNA-binding residues from only protein sequence remains a challenging task. Currently, most existing sequence-based methods only consider contextual features of the sequential neighbors, which are limited to capture spatial information. Based on the recent breakthrough in protein structure prediction by AlphaFold2, we propose an accurate predictor, GraphSite, for identifying DNA-binding residues based on the structural models predicted by AlphaFold2. Here, we convert the binding site prediction problem into a graph node classification task and employ a transformer-based variant model to take the protein structural information into account. By leveraging predicted protein structures and graph transformer, GraphSite substantially improves over the latest sequence-based and structure-based methods. The algorithm is further confirmed on the independent test set of 181 proteins, where GraphSite surpasses the state-of-the-art structure-based method by 16.4% in area under the precision-recall curve and 11.2% in Matthews correlation coefficient, respectively. We provide the datasets, the predicted structures and the source codes along with the pre-trained models of GraphSite at https://github.com/biomed-AI/GraphSite. The GraphSite web server is freely available at https://biomed.nscc-gz.cn/apps/GraphSite.
Collapse
Affiliation(s)
- Qianmu Yuan
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
| | - Sheng Chen
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
| | - Jiahua Rao
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
| | - Shuangjia Zheng
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
| | - Huiying Zhao
- Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou 510000, China
| | - Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Key Laboratory of Machine Intelligence and Advanced Computing of MOE, Sun Yat-sen University, Guangzhou 510000, China
| |
Collapse
|
29
|
Jiang Z, Xiao SR, Liu R. Dissecting and predicting different types of binding sites in nucleic acids based on structural information. Brief Bioinform 2021; 23:6384399. [PMID: 34624074 PMCID: PMC8769709 DOI: 10.1093/bib/bbab411] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2021] [Revised: 08/26/2021] [Accepted: 09/07/2021] [Indexed: 12/16/2022] Open
Abstract
The biological functions of DNA and RNA generally depend on their interactions with other molecules, such as small ligands, proteins and nucleic acids. However, our knowledge of the nucleic acid binding sites for different interaction partners is very limited, and identification of these critical binding regions is not a trivial work. Herein, we performed a comprehensive comparison between binding and nonbinding sites and among different categories of binding sites in these two nucleic acid classes. From the structural perspective, RNA may interact with ligands through forming binding pockets and contact proteins and nucleic acids using protruding surfaces, while DNA may adopt regions closer to the middle of the chain to make contacts with other molecules. Based on structural information, we established a feature-based ensemble learning classifier to identify the binding sites by fully using the interplay among different machine learning algorithms, feature spaces and sample spaces. Meanwhile, we designed a template-based classifier by exploiting structural conservation. The complementarity between the two classifiers motivated us to build an integrative framework for improving prediction performance. Moreover, we utilized a post-processing procedure based on the random walk algorithm to further correct the integrative predictions. Our unified prediction framework yielded promising results for different binding sites and outperformed existing methods.
Collapse
Affiliation(s)
- Zheng Jiang
- College of Informatics, Huazhong Agricultural University, Wuhan, P. R. China
| | - Si-Rui Xiao
- College of Informatics, Huazhong Agricultural University, Wuhan, P. R. China
| | - Rong Liu
- College of Informatics, Huazhong Agricultural University, Wuhan, P. R. China
| |
Collapse
|
30
|
Zhang S, Zhao L, Zheng CH, Xia J. A feature-based approach to predict hot spots in protein-DNA binding interfaces. Brief Bioinform 2021; 21:1038-1046. [PMID: 30957840 DOI: 10.1093/bib/bbz037] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2019] [Revised: 02/20/2019] [Accepted: 03/07/2019] [Indexed: 12/21/2022] Open
Abstract
DNA-binding hot spot residues of proteins are dominant and fundamental interface residues that contribute most of the binding free energy of protein-DNA interfaces. As experimental methods for identifying hot spots are expensive and time consuming, computational approaches are urgently required in predicting hot spots on a large scale. In this work, we systematically assessed a wide variety of 114 features from a combination of the protein sequence, structure, network and solvent accessible information and their combinations along with various feature selection strategies for hot spot prediction. We then trained and compared four commonly used machine learning models, namely, support vector machine (SVM), random forest, Naïve Bayes and k-nearest neighbor, for the identification of hot spots using 10-fold cross-validation and the independent test set. Our results show that (1) features based on the solvent accessible surface area have significant effect on hot spot prediction; (2) different but complementary features generally enhance the prediction performance; and (3) SVM outperforms other machine learning methods on both training and independent test sets. In an effort to improve predictive performance, we developed a feature-based method, namely, PrPDH (Prediction of Protein-DNA binding Hot spots), for the prediction of hot spots in protein-DNA binding interfaces using SVM based on the selected 10 optimal features. Comparative results on benchmark data sets indicate that our predictor is able to achieve generally better performance in predicting hot spots compared to the state-of-the-art predictors. A user-friendly web server for PrPDH is well established and is freely available at http://bioinfo.ahu.edu.cn:8080/PrPDH.
Collapse
Affiliation(s)
- Sijia Zhang
- Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| | - Le Zhao
- Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| | - Chun-Hou Zheng
- Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| | - Junfeng Xia
- Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| |
Collapse
|
31
|
Erokhina TN, Ryazantsev DY, Samokhvalova LV, Mozhaev AA, Orsa AN, Zavriev SK, Morozov SY. Activity of Chemically Synthesized Peptide Encoded by the miR156A Precursor and Conserved in the Brassicaceae Family Plants. BIOCHEMISTRY (MOSCOW) 2021; 86:551-562. [PMID: 33993858 DOI: 10.1134/s0006297921050047] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/16/2023]
Abstract
It was recently found that the primary transcripts of some microRNA genes (pri-miRNAs) are able to express peptides with 12 to 40 residues in length. These peptides, called miPEPs, participate in the transcriptional regulation of their own pri-miRNAs. In our previous studies, we used bioinformatic approach for comparative analysis of pri-miRNA sequences in plant genomes to identify a new group of miPEPs (miPEP-156a peptides) encoded by pri-miR156a in several dozen species of the Brassicaceae family. Exogenous miPEP-156a peptides could efficiently penetrate into the plant seedlings through the root system and spread systemically to the leaves. The peptides produced moderate morphological effect accelerating primary root growth. In parallel, the miPEP-156a peptides upregulated expression of their own pri-miR156a. Importantly, the observed effects at both morphological and molecular levels correlated with the peptide ability to quickly translocate into the cell nucleus and to bind chromatin. In this work, we established secondary structure of the miPEP-156a and demonstrated its changes induced by formation of the peptide complex with DNA.
Collapse
Affiliation(s)
- Tatiana N Erokhina
- Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, 117997, Russia
| | - Dmitry Yu Ryazantsev
- Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, 117997, Russia
| | - Larisa V Samokhvalova
- Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, 117997, Russia
| | - Andrey A Mozhaev
- Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, 117997, Russia
| | - Alexander N Orsa
- Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, 117997, Russia
| | - Sergey K Zavriev
- Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, 117997, Russia
| | - Sergey Yu Morozov
- Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, 119992, Russia.
| |
Collapse
|
32
|
Xia Y, Xia CQ, Pan X, Shen HB. GraphBind: protein structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucleic Acids Res 2021; 49:e51. [PMID: 33577689 PMCID: PMC8136796 DOI: 10.1093/nar/gkab044] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2020] [Accepted: 02/09/2021] [Indexed: 11/24/2022] Open
Abstract
Knowledge of the interactions between proteins and nucleic acids is the basis of understanding various biological activities and designing new drugs. How to accurately identify the nucleic-acid-binding residues remains a challenging task. In this paper, we propose an accurate predictor, GraphBind, for identifying nucleic-acid-binding residues on proteins based on an end-to-end graph neural network. Considering that binding sites often behave in highly conservative patterns on local tertiary structures, we first construct graphs based on the structural contexts of target residues and their spatial neighborhood. Then, hierarchical graph neural networks (HGNNs) are used to embed the latent local patterns of structural and bio-physicochemical characteristics for binding residue recognition. We comprehensively evaluate GraphBind on DNA/RNA benchmark datasets. The results demonstrate the superior performance of GraphBind than state-of-the-art methods. Moreover, GraphBind is extended to other ligand-binding residue prediction to verify its generalization capability. Web server of GraphBind is freely available at http://www.csbio.sjtu.edu.cn/bioinf/GraphBind/.
Collapse
Affiliation(s)
- Ying Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Chun-Qiu Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China.,School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| |
Collapse
|
33
|
Hendrix SG, Chang KY, Ryu Z, Xie ZR. DeepDISE: DNA Binding Site Prediction Using a Deep Learning Method. Int J Mol Sci 2021; 22:ijms22115510. [PMID: 34073705 PMCID: PMC8197219 DOI: 10.3390/ijms22115510] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2021] [Revised: 04/30/2021] [Accepted: 05/19/2021] [Indexed: 11/18/2022] Open
Abstract
It is essential for future research to develop a new, reliable prediction method of DNA binding sites because DNA binding sites on DNA-binding proteins provide critical clues about protein function and drug discovery. However, the current prediction methods of DNA binding sites have relatively poor accuracy. Using 3D coordinates and the atom-type of surface protein atom as the input, we trained and tested a deep learning model to predict how likely a voxel on the protein surface is to be a DNA-binding site. Based on three different evaluation datasets, the results show that our model not only outperforms several previous methods on two commonly used datasets, but also demonstrates its robust performance to be consistent among the three datasets. The visualized prediction outcomes show that the binding sites are also mostly located in correct regions. We successfully built a deep learning model to predict the DNA binding sites on target proteins. It demonstrates that 3D protein structures plus atom-type information on protein surfaces can be used to predict the potential binding sites on a protein. This approach should be further extended to develop the binding sites of other important biological molecules.
Collapse
Affiliation(s)
- Samuel Godfrey Hendrix
- Computational Drug Discovery Laboratory, School of Electrical and Computer Engineering, College of Engineering, University of Georgia, Athens, GA 30602, USA; (S.G.H.); (Z.R.)
| | - Kuan Y. Chang
- Department of Computer Science and Engineering, National Taiwan Ocean University, Keelung 202, Taiwan;
| | - Zeezoo Ryu
- Computational Drug Discovery Laboratory, School of Electrical and Computer Engineering, College of Engineering, University of Georgia, Athens, GA 30602, USA; (S.G.H.); (Z.R.)
- Department of Computer Science, Franklin College of Arts and Sciences, University of Georgia, Athens, GA 30602, USA
| | - Zhong-Ru Xie
- Computational Drug Discovery Laboratory, School of Electrical and Computer Engineering, College of Engineering, University of Georgia, Athens, GA 30602, USA; (S.G.H.); (Z.R.)
- Correspondence:
| |
Collapse
|
34
|
Fan BL, Jiang Z, Sun J, Liu R. Systematic characterization and prediction of coenzyme A-associated proteins using sequence and network information. Brief Bioinform 2020; 22:6012866. [PMID: 33253385 DOI: 10.1093/bib/bbaa308] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2020] [Revised: 09/08/2020] [Accepted: 10/12/2020] [Indexed: 01/11/2023] Open
Abstract
Coenzyme A-associated proteins (CAPs) are a category of functionally important proteins involved in multiple biological processes through interactions with coenzyme A (CoA). To date, unfortunately, the specific differences between CAPs and other proteins have yet to be systemically investigated. Moreover, there are no computational methods that can be used specifically to predict these proteins. Herein, we characterized CAPs from multifaceted viewpoints and revealed their specific preferences. Compared with other proteins, CAPs were more likely to possess binding regions for CoA and its derivatives, were evolutionarily highly conserved, exhibited ordered and hydrophobic structural conformations, and tended to be densely located in protein-protein interaction networks. Based on these biological insights, we built seven classifiers using predicted CoA-binding residue distributions, word embedding vectors, remote homolog numbers, evolutionary conservation, amino acid composition, predicted structural features and network properties. These classifiers could effectively identify CAPs in Homo sapiens, Mus musculus and Arabidopsis thaliana. The complementarity among the individual classifiers prompted us to build a two-layer stacking model named CAPE for improving prediction performance. We applied CAPE to identify some high-confidence candidates in the three species, which were tightly associated with the known functions of CAPs. Finally, we extended our algorithm to cross-species prediction, thereby developing a generic CAP prediction model. In summary, this work provides a comprehensive survey and an effective predictor for CAPs, which can help uncover the interplay between CoA and functionally relevant proteins.
Collapse
Affiliation(s)
- Bing-Liang Fan
- College of Informatics, Huazhong Agricultural University
| | - Zheng Jiang
- College of Informatics, Huazhong Agricultural University
| | - Jun Sun
- College of Informatics, Huazhong Agricultural University
| | - Rong Liu
- College of Informatics, Huazhong Agricultural University
| |
Collapse
|
35
|
Zhu X, Liu L, He J, Fang T, Xiong Y, Mitchell JC. iPNHOT: a knowledge-based approach for identifying protein-nucleic acid interaction hot spots. BMC Bioinformatics 2020; 21:289. [PMID: 32631222 PMCID: PMC7336410 DOI: 10.1186/s12859-020-03636-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2019] [Accepted: 06/25/2020] [Indexed: 12/20/2022] Open
Abstract
Background The interaction between proteins and nucleic acids plays pivotal roles in various biological processes such as transcription, translation, and gene regulation. Hot spots are a small set of residues that contribute most to the binding affinity of a protein-nucleic acid interaction. Compared to the extensive studies of the hot spots on protein-protein interfaces, the hot spot residues within protein-nucleic acids interfaces remain less well-studied, in part because mutagenesis data for protein-nucleic acids interaction are not as abundant as that for protein-protein interactions. Results In this study, we built a new computational model, iPNHOT, to effectively predict hot spot residues on protein-nucleic acids interfaces. One training data set and an independent test set were collected from dbAMEPNI and some recent literature, respectively. To build our model, we generated 97 different sequential and structural features and used a two-step strategy to select the relevant features. The final model was built based only on 7 features using a support vector machine (SVM). The features include two unique features such as ∆SASsa1/2 and esp3, which are newly proposed in this study. Based on the cross validation results, our model gave F1 score and AUROC as 0.725 and 0.807 on the subset collected from ProNIT, respectively, compared to 0.407 and 0.670 of mCSM-NA, a state-of-the art model to predict the thermodynamic effects of protein-nucleic acid interaction. The iPNHOT model was further tested on the independent test set, which showed that our model outperformed other methods. Conclusion In this study, by collecting data from a recently published database dbAMEPNI, we proposed a new model, iPNHOT, to predict hotspots on both protein-DNA and protein-RNA interfaces. The results show that our model outperforms the existing state-of-art models. Our model is available for users through a webserver: http://zhulab.ahu.edu.cn/iPNHOT/.
Collapse
Affiliation(s)
- Xiaolei Zhu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China. .,School of Life Sciences, Anhui University, Hefei, Anhui, China.
| | - Ling Liu
- School of Life Sciences, Anhui University, Hefei, Anhui, China
| | - Jingjing He
- School of Life Sciences, Anhui University, Hefei, Anhui, China
| | - Ting Fang
- School of Life Sciences, Anhui University, Hefei, Anhui, China
| | - Yi Xiong
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
| | - Julie C Mitchell
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA.
| |
Collapse
|
36
|
Lamprecht-Grandío M, Cortesão M, Mirete S, de la Cámara MB, de Figueras CG, Pérez-Pantoja D, White JJ, Farías ME, Rosselló-Móra R, González-Pastor JE. Novel Genes Involved in Resistance to Both Ultraviolet Radiation and Perchlorate From the Metagenomes of Hypersaline Environments. Front Microbiol 2020; 11:453. [PMID: 32292392 PMCID: PMC7135895 DOI: 10.3389/fmicb.2020.00453] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2019] [Accepted: 03/03/2020] [Indexed: 12/25/2022] Open
Abstract
Microorganisms that thrive in hypersaline environments on the surface of our planet are exposed to the harmful effects of ultraviolet radiation. Therefore, for their protection, they have sunscreen pigments and highly efficient DNA repair and protection systems. The present study aimed to identify new genes involved in UV radiation resistance from these microorganisms, many of which cannot be cultured in the laboratory. Thus, a functional metagenomic approach was used and for this, small-insert libraries were constructed with DNA isolated from microorganisms of high-altitude Andean hypersaline lakes in Argentina (Diamante and Ojo Seco lakes, 4,589 and 3,200 m, respectively) and from the Es Trenc solar saltern in Spain. The libraries were hosted in a UV radiation-sensitive strain of Escherichia coli (recA mutant) and they were exposed to UVB. The resistant colonies were analyzed and as a result, four clones were identified with environmental DNA fragments containing five genes that conferred resistance to UV radiation in E. coli. One gene encoded a RecA-like protein, complementing the mutation in recA that makes the E. coli host strain more sensitive to UV radiation. Two other genes from the same DNA fragment encoded a TATA-box binding protein and an unknown protein, both responsible for UV resistance. Interestingly, two other genes from different and remote environments, the Ojo Seco Andean lake and the Es Trenc saltern, encoded two hypothetical proteins that can be considered homologous based on their significant amino acid similarity (49%). All of these genes also conferred resistance to 4-nitroquinoline 1-oxide (4-NQO), a compound that mimics the effect of UV radiation on DNA, and also to perchlorate, a powerful oxidant that can induce DNA damage. Furthermore, the hypothetical protein from the Es Trenc salterns was localized as discrete foci possibly associated with damaged sites in the DNA in cells treated with 4-NQO, so it could be involved in the repair of damaged DNA. In summary, novel genes involved in resistance to UV radiation, 4-NQO and perchlorate have been identified in this work and two of them encoding hypothetical proteins that could be involved in DNA damage repair activities not previously described.
Collapse
Affiliation(s)
| | - Marta Cortesão
- Department of Molecular Evolution, Centro de Astrobiología (CSIC-INTA), Madrid, Spain
| | - Salvador Mirete
- Department of Molecular Evolution, Centro de Astrobiología (CSIC-INTA), Madrid, Spain
| | | | | | - Danilo Pérez-Pantoja
- Programa Institucional de Fomento a la Investigación, Desarrollo e Innovación, Universidad Tecnológica Metropolitana, Santiago, Chile
| | - Joseph John White
- Department of Molecular Evolution, Centro de Astrobiología (CSIC-INTA), Madrid, Spain
| | - María Eugenia Farías
- Laboratorio de Investigaciones Microbiológicas de Lagunas Andinas (LIMLA), Planta Piloto de Procesos Industriales y Microbiológicos (PROIMI), Centro Científico Tecnológico, Consejo Nacional de Investigaciones Científicas y Técnicas, San Miguel de Tucumán, Argentina
| | - Ramon Rosselló-Móra
- Marine Microbiology Group, Department of Ecology and Marine Resources, Mediterranean Institute for Advanced Studies (IMEDEA, CSIC-UIB), Esporles, Spain
| | | |
Collapse
|
37
|
Zhou J, Lu Q, Xu R, Gui L, Wang H. EL_LSTM: Prediction of DNA-Binding Residue from Protein Sequence by Combining Long Short-Term Memory and Ensemble Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:124-135. [PMID: 30040656 DOI: 10.1109/tcbb.2018.2858806] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Most past works for DNA-binding residue prediction did not consider the relationships between residues. In this paper, we propose a novel approach for DNA-binding residue prediction, referred to as EL_LSTM, which includes two main components. The first component is the Long Short-Term Memory (LSTM), which learns pairwise relationships between residues through a bi-gram model and then learns feature vectors for all residues. The second component is an ensemble learning based classifier introduced to tackle the data imbalance problem in binding residue predictions. We use a variant of the bagging strategy in ensemble learning to achieve balanced samples. Evaluations on PDNA-224 and DBP-123 show that adding feature relationships performs better than classifiers without feature relationships by at least 0.028 on MCC, 1.18 percent on ST and 0.012 on AUC. This indicates the usefulness of feature relationships for DNA-binding residue predictions. Evaluation on using ensemble learning indicates that the improvement can reach at least 0.021 on MCC, 1.32 percent on ST, and 0.018 on AUC compared to the use of a single LSTM classifier. Comparisons with the state-of-the-art predictors show that our proposed EL_LSTM outperforms them significantly. Further feature analysis validates the effectiveness of LSTM for the prediction of DNA-binding residues.
Collapse
|
38
|
Yang R, Zhang C, Gao R, Zhang L, Song Q. Predicting FAD Interacting Residues with Feature Selection and Comprehensive Sequence Descriptors. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:2046-2056. [PMID: 29993986 DOI: 10.1109/tcbb.2018.2824332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
The function of a flavoprotein is determined to a great extent by the binding sites on its surface that interacts with flavin adenine dinucleotide (FAD). Malfunction or dysregulation of FAD binding leads to a series of diseases. Therefore, accurately identifying FAD interacting residues (FIRs) provides insights into the molecular mechanisms of flavoprotein-related biological processes and disease progression. In this paper, a new computational method is proposed for identifying FIRs from protein sequences. Various sequence-derived discriminative features are explored. We analyze the distinctions of these features between FIRs and non-FIRs. We also investigate the predictive capabilities of both individual features and combinations of features. A relief algorithm followed by incremental feature selection (relief-IFS) is then adopted to search the optimal features. Finally, a random forest (RF) module is used to predict FIRs based on the optimal features. Using a 5-fold cross-validation test, the proposed method performs well, with a sensitivity of 0.847, a specificity of 0.933, an accuracy of 0.890, and a Matthews correlation coefficient (MCC) of 0.782, thereby outperforming previous methods. These results indicate that our method is relatively successful at predicting FIRs.
Collapse
|
39
|
Poddar S, Chakravarty D, Chakrabarti P. Structural changes in DNA-binding proteins on complexation. Nucleic Acids Res 2019. [PMID: 29534202 PMCID: PMC6283420 DOI: 10.1093/nar/gky170] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open
Abstract
Characterization and prediction of the DNA-biding regions in proteins are essential for our understanding of how proteins recognize/bind DNA. We analyze the unbound (U) and the bound (B) forms of proteins from the protein–DNA docking benchmark that contains 66 binary protein–DNA complexes along with their unbound counterparts. Proteins binding DNA undergo greater structural changes on complexation (in particular, those in the enzyme category) than those involved in protein–protein interactions (PPI). While interface atoms involved in PPI exhibit an increase in their solvent-accessible surface area (ASA) in the bound form in the majority of the cases compared to the unbound interface, protein–DNA interactions indicate increase and decrease in equal measure. In 25% structures, the U form has missing residues which are located in the interface in the B form. The missing atoms contribute more toward the buried surface area compared to other interface atoms. Lys, Gly and Arg are prominent in disordered segments that get ordered in the interface on complexation. In going from U to B, there may be an increase in coil and helical content at the expense of turns and strands. Consideration of flexibility cannot distinguish the interface residues from the surface residues in the U form.
Collapse
Affiliation(s)
- Sayan Poddar
- Department of Biochemistry, Bose Institute, P1/12 CIT Scheme VIIM, Kolkata 700054, India
| | - Devlina Chakravarty
- Bioinformatics Centre, Bose Institute, P1/12CIT Scheme VIIM, Kolkata 700054, India
| | - Pinak Chakrabarti
- Department of Biochemistry, Bose Institute, P1/12 CIT Scheme VIIM, Kolkata 700054, India.,Bioinformatics Centre, Bose Institute, P1/12CIT Scheme VIIM, Kolkata 700054, India
| |
Collapse
|
40
|
Chauhan S, Ahmad S. Enabling full‐length evolutionary profiles based deep convolutional neural network for predicting DNA‐binding proteins from sequence. Proteins 2019; 88:15-30. [DOI: 10.1002/prot.25763] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2019] [Revised: 06/01/2019] [Accepted: 06/15/2019] [Indexed: 12/22/2022]
Affiliation(s)
- Sucheta Chauhan
- School of Computational and Integrative SciencesJawaharlal Nehru University New Delhi India
| | - Shandar Ahmad
- School of Computational and Integrative SciencesJawaharlal Nehru University New Delhi India
| |
Collapse
|
41
|
Zhu YH, Hu J, Song XN, Yu DJ. DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines. J Chem Inf Model 2019; 59:3057-3071. [PMID: 30943723 DOI: 10.1021/acs.jcim.8b00749] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]
Abstract
Accurate identification of protein-DNA binding sites is significant for both understanding protein function and drug design. Machine-learning-based methods have been extensively used for the prediction of protein-DNA binding sites. However, the data imbalance problem, in which the number of nonbinding residues (negative-class samples) is far larger than that of binding residues (positive-class samples), seriously restricts the performance improvements of machine-learning-based predictors. In this work, we designed a two-stage imbalanced learning algorithm, called ensembled hyperplane-distance-based support vector machines (E-HDSVM), to improve the prediction performance of protein-DNA binding sites. The first stage of E-HDSVM designs a new iterative sampling algorithm, called hyperplane-distance-based under-sampling (HD-US), to extract multiple subsets from the original imbalanced data set, each of which is used to train a support vector machine (SVM). Unlike traditional sampling algorithms, HD-US selects samples by calculating the distances between the samples and the separating hyperplane of the SVM. The second stage of E-HDSVM proposes an enhanced AdaBoost (EAdaBoost) algorithm to ensemble multiple trained SVMs. As an enhanced version of the original AdaBoost algorithm, EAdaBoost overcomes the overfitting problem. Stringent cross-validation and independent tests on benchmark data sets demonstrated the superiority of E-HDSVM over several popular imbalanced learning algorithms. Based on the proposed E-HDSVM algorithm, we further implemented a sequence-based protein-DNA binding site predictor, called DNAPred, which is freely available at http://csbio.njust.edu.cn/bioinf/dnapred/ for academic use. The computational experimental results showed that our predictor achieved an average overall accuracy of 91.7% and a Mathew's correlation coefficient of 0.395 on five benchmark data sets and outperformed several state-of-the-art sequence-based protein-DNA binding site predictors.
Collapse
Affiliation(s)
- Yi-Heng Zhu
- School of Computer Science and Engineering , Nanjing University of Science and Technology , Xiaolingwei 200 , Nanjing 210094 , P. R. China
| | - Jun Hu
- College of Information Engineering , Zhejiang University of Technology , Hangzhou 310023 , P. R. China
| | - Xiao-Ning Song
- School of Internet of Things , Jiangnan University , 1800 Lihu Road , Wuxi 214122 , P. R. China
| | - Dong-Jun Yu
- School of Computer Science and Engineering , Nanjing University of Science and Technology , Xiaolingwei 200 , Nanjing 210094 , P. R. China
| |
Collapse
|
42
|
Gil N, Fiser A. The choice of sequence homologs included in multiple sequence alignments has a dramatic impact on evolutionary conservation analysis. Bioinformatics 2019; 35:12-19. [PMID: 29947739 PMCID: PMC6298051 DOI: 10.1093/bioinformatics/bty523] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2018] [Revised: 04/20/2018] [Accepted: 06/26/2018] [Indexed: 11/12/2022] Open
Abstract
Motivation The analysis of sequence conservation patterns has been widely utilized to identify functionally important (catalytic and ligand-binding) protein residues for over a half-century. Despite decades of development, on average state-of-the-art non-template-based functional residue prediction methods must predict ∼25% of a protein's total residues to correctly identify half of the protein's functional site residues. The overwhelming proportion of false positives results in reported 'F-Scores' of ∼0.3. We investigated the limits of current approaches, focusing on the so-far neglected impact of the specific choice of homologs included in multiple sequence alignments (MSAs). Results The limits of conservation-based functional residue prediction were explored by surveying the binding sites of 1023 proteins. A straightforward conservation analysis of MSAs composed of randomly selected homologs sampled from a PSI-BLAST search achieves average F-Scores of ∼0.3, a performance matching that reported by state-of-the-art methods, which often consider additional features for the prediction in a machine learning setting. Interestingly, we found that a simple combinatorial MSA sampling algorithm will in almost every case produce an MSA with an optimal set of homologs whose conservation analysis reaches average F-Scores of ∼0.6, doubling state-of-the-art performance. We also show that this is nearly at the theoretical limit of possible performance given the agreement between different binding site definitions. Additionally, we showcase the progress in this direction made by Selection of Alignment by Maximal Mutual Information (SAMMI), an information-theory-based approach to identifying biologically informative MSAs. This work highlights the importance and the unused potential of optimally composed MSAs for conservation analysis. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Nelson Gil
- Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA
| | - Andras Fiser
- Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA
| |
Collapse
|
43
|
Hu W, Qin L, Li M, Pu X, Guo Y. Individually double minimum-distance definition of protein-RNA binding residues and application to structure-based prediction. J Comput Aided Mol Des 2018; 32:1363-1373. [PMID: 30478757 DOI: 10.1007/s10822-018-0177-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2018] [Accepted: 11/14/2018] [Indexed: 01/01/2023]
Abstract
Identifying protein-RNA binding residues is essential for understanding the mechanism of protein-RNA interactions. So far, rigid distance thresholds are commonly used to define protein-RNA binding residues. However, after investigating 182 non-redundant protein-RNA complexes, we find that it would be unsuitable for a certain amount of complexes since the distances between proteins and RNAs vary widely. In this work, a novel definition method was proposed based on a flexible distance cutoff. This method can fully consider the individual differences among complexes by setting a variable tolerance limit of protein-RNA interactions, i.e. the double minimum-distance by which different distance thresholds are achieved for different complexes. In order to validate our method, a comprehensive comparison between our flexible method and traditional rigid methods was implemented in terms of interface structure, amino acid composition, interface area and interaction force, etc. The results indicate that this method is more reasonable because it incorporates the specificity of different complexes by extracting the important residues lost by rigid distance methods and discarding some redundant residues. Finally, to further test our double minimum-distance definition strategy, we developed a classifier to predict those binding sites derived from our new method by using structural features and a random forest machine learning algorithm. The model achieved a satisfactory prediction performance and the accuracy on independent data sets reaches to 85.0%. To the best of our knowledge, it is the first prediction model to define positive and negative samples using a flexible cutoff. So the comparison analysis and modeling results have demonstrated that our method would be a very promising strategy for more precisely defining protein-RNA binding sites.
Collapse
Affiliation(s)
- Wen Hu
- College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, People's Republic of China
| | - Liu Qin
- College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, People's Republic of China
| | - Menglong Li
- College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, People's Republic of China
| | - Xuemei Pu
- College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, People's Republic of China
| | - Yanzhi Guo
- College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, People's Republic of China.
| |
Collapse
|
44
|
Abstract
The increasing number of protein structures with uncharacterized function necessitates the development of in silico prediction methods for functional annotations on proteins. In this chapter, different kinds of computational approaches are briefly introduced to predict DNA-binding residues on surface of DNA-binding proteins, and the merits and limitations of these methods are mainly discussed. This chapter focuses on the structure-based approaches and mainly discusses the framework of machine learning methods in application to DNA-binding prediction task.
Collapse
|
45
|
Zhang J, Ma Z, Kurgan L. Comprehensive review and empirical analysis of hallmarks of DNA-, RNA- and protein-binding residues in protein chains. Brief Bioinform 2017; 20:1250-1268. [DOI: 10.1093/bib/bbx168] [Citation(s) in RCA: 60] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2017] [Revised: 11/15/2017] [Indexed: 11/13/2022] Open
Abstract
Abstract
Proteins interact with a variety of molecules including proteins and nucleic acids. We review a comprehensive collection of over 50 studies that analyze and/or predict these interactions. While majority of these studies address either solely protein–DNA or protein–RNA binding, only a few have a wider scope that covers both protein–protein and protein–nucleic acid binding. Our analysis reveals that binding residues are typically characterized with three hallmarks: relative solvent accessibility (RSA), evolutionary conservation and propensity of amino acids (AAs) for binding. Motivated by drawbacks of the prior studies, we perform a large-scale analysis to quantify and contrast the three hallmarks for residues that bind DNA-, RNA-, protein- and (for the first time) multi-ligand-binding residues that interact with DNA and proteins, and with RNA and proteins. Results generated on a well-annotated data set of over 23 000 proteins show that conservation of binding residues is higher for nucleic acid- than protein-binding residues. Multi-ligand-binding residues are more conserved and have higher RSA than single-ligand-binding residues. We empirically show that each hallmark discriminates between binding and nonbinding residues, even predicted RSA, and that combining them improves discriminatory power for each of the five types of interactions. Linear scoring functions that combine these hallmarks offer good predictive performance of residue-level propensity for binding and provide intuitive interpretation of predictions. Better understanding of these residue-level interactions will facilitate development of methods that accurately predict binding in the exponentially growing databases of protein sequences.
Collapse
|
46
|
Chen J, Xie ZR, Wu Y. Understand protein functions by comparing the similarity of local structural environments. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2016; 1865:142-152. [PMID: 27884635 DOI: 10.1016/j.bbapap.2016.11.008] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/17/2016] [Revised: 11/03/2016] [Accepted: 11/17/2016] [Indexed: 12/20/2022]
Abstract
The three-dimensional structures of proteins play an essential role in regulating binding between proteins and their partners, offering a direct relationship between structures and functions of proteins. It is widely accepted that the function of a protein can be determined if its structure is similar to other proteins whose functions are known. However, it is also observed that proteins with similar global structures do not necessarily correspond to the same function, while proteins with very different folds can share similar functions. This indicates that function similarity is originated from the local structural information of proteins instead of their global shapes. We assume that proteins with similar local environments prefer binding to similar types of molecular targets. In order to testify this assumption, we designed a new structural indicator to define the similarity of local environment between residues in different proteins. This indicator was further used to calculate the probability that a given residue binds to a specific type of structural neighbors, including DNA, RNA, small molecules and proteins. After applying the method to a large-scale non-redundant database of proteins, we show that the positive signal of binding probability calculated from the local structural indicator is statistically meaningful. In summary, our studies suggested that the local environment of residues in a protein is a good indicator to recognize specific binding partners of the protein. The new method could be a potential addition to a suite of existing template-based approaches for protein function prediction.
Collapse
Affiliation(s)
- Jiawen Chen
- Department of Systems and Computational Biology, Albert Einstein College of Medicine of Yeshiva University, 1300 Morris Park Avenue, Bronx, NY 10461, United States
| | - Zhong-Ru Xie
- Department of Systems and Computational Biology, Albert Einstein College of Medicine of Yeshiva University, 1300 Morris Park Avenue, Bronx, NY 10461, United States
| | - Yinghao Wu
- Department of Systems and Computational Biology, Albert Einstein College of Medicine of Yeshiva University, 1300 Morris Park Avenue, Bronx, NY 10461, United States.
| |
Collapse
|
47
|
CRHunter: integrating multifaceted information to predict catalytic residues in enzymes. Sci Rep 2016; 6:34044. [PMID: 27665935 PMCID: PMC5036049 DOI: 10.1038/srep34044] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2016] [Accepted: 09/07/2016] [Indexed: 11/08/2022] Open
Abstract
A variety of algorithms have been developed for catalytic residue prediction based on either feature- or template-based methodology. However, no studies have systematically compared these two strategies and further considered whether their combination could improve the prediction performance. Herein, we developed an integrative algorithm named CRHunter by simultaneously using the complementarity between feature- and template-based methodologies and that between structural and sequence information. Several novel structural features were generated by the Delaunay triangulation and Laplacian transformation of enzyme structures. Combining these features with traditional descriptors, we invented two support vector machine feature predictors based on both structural and sequence information. Furthermore, we established two template predictors using structure and profile alignments. Evaluated on datasets with different levels of homology, our feature predictors achieve relatively stable performance, whereas our template predictors yield poor results when the homological relationships become weak. Nevertheless, the hybrid algorithm CRHunter consistently achieves optimal performance among all our predictors. We also illustrate that our methodology can be applied to the predicted structures of enzymes. Compared with state-of-the-art methods, CRHunter yields comparable or better performance on various datasets. Finally, the application of this algorithm to structural genomics targets sheds light on solved protein structures with unknown functions.
Collapse
|
48
|
Zhou J, Xu R, He Y, Lu Q, Wang H, Kong B. PDNAsite: Identification of DNA-binding Site from Protein Sequence by Incorporating Spatial and Sequence Context. Sci Rep 2016; 6:27653. [PMID: 27282833 PMCID: PMC4901350 DOI: 10.1038/srep27653] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2015] [Accepted: 05/18/2016] [Indexed: 02/01/2023] Open
Abstract
Protein-DNA interactions are involved in many fundamental biological processes essential for cellular function. Most of the existing computational approaches employed only the sequence context of the target residue for its prediction. In the present study, for each target residue, we applied both the spatial context and the sequence context to construct the feature space. Subsequently, Latent Semantic Analysis (LSA) was applied to remove the redundancies in the feature space. Finally, a predictor (PDNAsite) was developed through the integration of the support vector machines (SVM) classifier and ensemble learning. Results on the PDNA-62 and the PDNA-224 datasets demonstrate that features extracted from spatial context provide more information than those from sequence context and the combination of them gives more performance gain. An analysis of the number of binding sites in the spatial context of the target site indicates that the interactions between binding sites next to each other are important for protein-DNA recognition and their binding ability. The comparison between our proposed PDNAsite method and the existing methods indicate that PDNAsite outperforms most of the existing methods and is a useful tool for DNA-binding site identification. A web-server of our predictor (http://hlt.hitsz.edu.cn:8080/PDNAsite/) is made available for free public accessible to the biological research community.
Collapse
Affiliation(s)
- Jiyun Zhou
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China.,Department of Computing, the Hong Kong Polytechnic University, Hong Kong
| | - Ruifeng Xu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China.,Shenzhen Engineering Laboratory of Performance Robots at Digital Stage, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China
| | - Yulan He
- School of Engineering and Applied Science, Aston University, UK
| | - Qin Lu
- Department of Computing, the Hong Kong Polytechnic University, Hong Kong
| | - Hongpeng Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Bing Kong
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| |
Collapse
|
49
|
Miao Z, Westhof E. A Large-Scale Assessment of Nucleic Acids Binding Site Prediction Programs. PLoS Comput Biol 2015; 11:e1004639. [PMID: 26681179 PMCID: PMC4683125 DOI: 10.1371/journal.pcbi.1004639] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2015] [Accepted: 10/30/2015] [Indexed: 11/18/2022] Open
Abstract
Computational prediction of nucleic acid binding sites in proteins are necessary to disentangle functional mechanisms in most biological processes and to explore the binding mechanisms. Several strategies have been proposed, but the state-of-the-art approaches display a great diversity in i) the definition of nucleic acid binding sites; ii) the training and test datasets; iii) the algorithmic methods for the prediction strategies; iv) the performance measures and v) the distribution and availability of the prediction programs. Here we report a large-scale assessment of 19 web servers and 3 stand-alone programs on 41 datasets including more than 5000 proteins derived from 3D structures of protein-nucleic acid complexes. Well-defined binary assessment criteria (specificity, sensitivity, precision, accuracy…) are applied. We found that i) the tools have been greatly improved over the years; ii) some of the approaches suffer from theoretical defects and there is still room for sorting out the essential mechanisms of binding; iii) RNA binding and DNA binding appear to follow similar driving forces and iv) dataset bias may exist in some methods.
Collapse
Affiliation(s)
- Zhichao Miao
- Architecture et Réactivité de l'ARN, Université de Strasbourg, Institut de Biologie Moléculaire et Cellulaire du CNRS, Strasbourg, France
| | - Eric Westhof
- Architecture et Réactivité de l'ARN, Université de Strasbourg, Institut de Biologie Moléculaire et Cellulaire du CNRS, Strasbourg, France
| |
Collapse
|
50
|
SNBRFinder: A Sequence-Based Hybrid Algorithm for Enhanced Prediction of Nucleic Acid-Binding Residues. PLoS One 2015; 10:e0133260. [PMID: 26176857 PMCID: PMC4503397 DOI: 10.1371/journal.pone.0133260] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2015] [Accepted: 06/25/2015] [Indexed: 11/19/2022] Open
Abstract
Protein-nucleic acid interactions are central to various fundamental biological processes. Automated methods capable of reliably identifying DNA- and RNA-binding residues in protein sequence are assuming ever-increasing importance. The majority of current algorithms rely on feature-based prediction, but their accuracy remains to be further improved. Here we propose a sequence-based hybrid algorithm SNBRFinder (Sequence-based Nucleic acid-Binding Residue Finder) by merging a feature predictor SNBRFinderF and a template predictor SNBRFinderT. SNBRFinderF was established using the support vector machine whose inputs include sequence profile and other complementary sequence descriptors, while SNBRFinderT was implemented with the sequence alignment algorithm based on profile hidden Markov models to capture the weakly homologous template of query sequence. Experimental results show that SNBRFinderF was clearly superior to the commonly used sequence profile-based predictor and SNBRFinderT can achieve comparable performance to the structure-based template methods. Leveraging the complementary relationship between these two predictors, SNBRFinder reasonably improved the performance of both DNA- and RNA-binding residue predictions. More importantly, the sequence-based hybrid prediction reached competitive performance relative to our previous structure-based counterpart. Our extensive and stringent comparisons show that SNBRFinder has obvious advantages over the existing sequence-based prediction algorithms. The value of our algorithm is highlighted by establishing an easy-to-use web server that is freely accessible at http://ibi.hzau.edu.cn/SNBRFinder.
Collapse
|