1
|
Hu X, Zhang X, Sun W, Liu C, Deng P, Cao Y, Zhang C, Xu N, Zhang T, Zhang Y, Liu JJ, Wang H. Systematic discovery of DNA-binding tandem repeat proteins. Nucleic Acids Res 2024; 52:10464-10489. [PMID: 39189466 PMCID: PMC11417379 DOI: 10.1093/nar/gkae710] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2024] [Revised: 07/30/2024] [Accepted: 08/07/2024] [Indexed: 08/28/2024] Open
Abstract
Tandem repeat proteins (TRPs) are widely distributed and bind to a wide variety of ligands. DNA-binding TRPs such as zinc finger (ZNF) and transcription activator-like effector (TALE) play important roles in biology and biotechnology. In this study, we first conducted an extensive analysis of TRPs in public databases, and found that the enormous diversity of TRPs is largely unexplored. We then focused our efforts on identifying novel TRPs possessing DNA-binding capabilities. We established a protein language model for DNA-binding protein prediction (PLM-DBPPred), and predicted a large number of DNA-binding TRPs. A subset was then selected for experimental screening, leading to the identification of 11 novel DNA-binding TRPs, with six showing sequence specificity. Notably, members of the STAR (Short TALE-like Repeat proteins) family can be programmed to target specific 9 bp DNA sequences with high affinity. Leveraging this property, we generated artificial transcription factors using reprogrammed STAR proteins and achieved targeted activation of endogenous gene sets. Furthermore, the members of novel families such as MOON (Marine Organism-Originated DNA binding protein) and pTERF (prokaryotic mTERF-like protein) exhibit unique features and distinct DNA-binding characteristics, revealing interesting biological clues. Our study expands the diversity of DNA-binding TRPs, and demonstrates that a systematic approach greatly enhances the discovery of new biological insights and tools.
Collapse
Affiliation(s)
- Xiaoxuan Hu
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Xuechun Zhang
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Wen Sun
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
- Beijing Institute for Stem Cell and Regenerative Medicine, Beijing 100101, China
| | - Chunhong Liu
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Pujuan Deng
- State Key Laboratory of Membrane Biology, Beijing Frontier Research Center for Biological Structure, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Center for Life Sciences, Tsinghua University, Beijing 100084, China
| | - Yuanwei Cao
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Chenze Zhang
- National Key Laboratory of Efficacy and Mechanism on Chinese Medicine for Metabolic Diseases, Beijing University of Chinese Medicine, Beijing 100029, China
| | - Ning Xu
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Tongtong Zhang
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Yong E Zhang
- University of Chinese Academy of Sciences, Beijing 100049, China
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
| | - Jun-Jie Gogo Liu
- State Key Laboratory of Membrane Biology, Beijing Frontier Research Center for Biological Structure, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Center for Life Sciences, Tsinghua University, Beijing 100084, China
| | - Haoyi Wang
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
- Beijing Institute for Stem Cell and Regenerative Medicine, Beijing 100101, China
| |
Collapse
|
2
|
Hu J, Zeng WW, Jia NX, Arif M, Yu DJ, Zhang GJ. Improving DNA-Binding Protein Prediction Using Three-Part Sequence-Order Feature Extraction and a Deep Neural Network Algorithm. J Chem Inf Model 2023; 63:1044-1057. [PMID: 36719781 DOI: 10.1021/acs.jcim.2c00943] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023]
Abstract
Identification of the DNA-binding protein (DBP) helps dig out information embedded in the DNA-protein interaction, which is significant to understanding the mechanisms of DNA replication, transcription, and repair. Although existing computational methods for predicting the DBPs based on protein sequences have obtained great success, there is still room for improvement since the sequence-order information is not fully mined in these methods. In this study, a new three-part sequence-order feature extraction (called TPSO) strategy is developed to extract more discriminative information from protein sequences for predicting the DBPs. For each query protein, TPSO first divides its primary sequence features into N- and C-terminal fragments and then extracts the numerical pseudo features of three parts including the full sequence and these two fragments, respectively. Based on TPSO, a novel deep learning-based method, called TPSO-DBP, is proposed, which employs the sequence-based single-view features, the bidirectional long short-term memory (BiLSTM) and fully connected (FC) neural networks to learn the DBP prediction model. Empirical outcomes reveal that TPSO-DBP can achieve an accuracy of 87.01%, covering 85.30% of all DBPs, while achieving a Matthew's correlation coefficient value (0.741) that is significantly higher than most existing state-of-the-art DBP prediction methods. Detailed data analyses have indicated that the advantages of TPSO-DBP lie in the utilization of TPSO, which helps extract more concealed prominent patterns, and the deep neural network framework composed of BiLSTM and FC that learns the nonlinear relationships between input features and DBPs. The standalone package and web server of TPSO-DBP are freely available at https://jun-csbio.github.io/TPSO-DBP/.
Collapse
Affiliation(s)
- Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou310023, China
| | - Wen-Wu Zeng
- College of Information Engineering, Zhejiang University of Technology, Hangzhou310023, China
| | - Ning-Xin Jia
- College of Information Engineering, Zhejiang University of Technology, Hangzhou310023, China
| | - Muhammad Arif
- School of Systems and Technology, Department of Informatics and Systems, University of Management and Technology, Lahore54770, Pakistan
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing210094, China
| | - Gui-Jun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou310023, China
| |
Collapse
|
3
|
Wang N, Zhang J, Liu B. iDRBP-EL: Identifying DNA- and RNA- Binding Proteins Based on Hierarchical Ensemble Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:432-441. [PMID: 34932484 DOI: 10.1109/tcbb.2021.3136905] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Identification of DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) from the primary sequences is essential for further exploring protein-nucleic acid interactions. Previous studies have shown that machine-learning-based methods can efficiently identify DBPs or RBPs. However, the information used in these methods is slightly unitary, and most of them only can predict DBPs or RBPs. In this study, we proposed a computational predictor iDRBP-EL to identify DNA- and RNA- binding proteins, and introduced hierarchical ensemble learning to integrate three level information. The method can integrate the information of different features, machine learning algorithms and data into one multi-label model. The ablation experiment showed that the fusion of different information can improve the prediction performance and overcome the cross-prediction problem. Experimental results on the independent datasets showed that iDRBP-EL outperformed all the other competing methods. Moreover, we established a user-friendly webserver iDRBP-EL (http://bliulab.net/iDRBP-EL), which can predict both DBPs and RBPs only based on protein sequences.
Collapse
|
4
|
Varghese DM, Nussinov R, Ahmad S. Predictive modeling of moonlighting DNA-binding proteins. NAR Genom Bioinform 2022; 4:lqac091. [PMID: 36474806 PMCID: PMC9716651 DOI: 10.1093/nargab/lqac091] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2022] [Revised: 10/25/2022] [Accepted: 11/11/2022] [Indexed: 09/10/2024] Open
Abstract
Moonlighting proteins are multifunctional, single-polypeptide chains capable of performing multiple autonomous functions. Most moonlighting proteins have been discovered through work unrelated to their multifunctionality. We believe that prediction of moonlighting proteins from first principles, that is, using sequence, predicted structure, evolutionary profiles, and global gene expression profiles, for only one functional class of proteins in a single organism at a time will significantly advance our understanding of multifunctional proteins. In this work, we investigated human moonlighting DNA-binding proteins (mDBPs) in terms of properties that distinguish them from other (non-moonlighting) proteins with the same DNA-binding protein (DBP) function. Following a careful and comprehensive analysis of discriminatory features, a machine learning model was developed to assess the predictability of mDBPs from other DBPs (oDBPs). We observed that mDBPs can be discriminated from oDBPs with high accuracy of 74% AUC of ROC using these first principles features. A number of novel predicted mDBPs were found to have literature support for their being moonlighting and others are proposed as candidates, for which the moonlighting function is currently unknown. We believe that this work will help in deciphering and annotating novel moonlighting DBPs and scale up other functions. The source codes and data sets used for this work are freely available at https://zenodo.org/record/7299265#.Y2pO3ctBxPY.
Collapse
Affiliation(s)
- Dana Mary Varghese
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, India
| | - Ruth Nussinov
- Computational Structural Biology Section, Cancer Innovation Laboratory, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
- Department of Human Molecular Genetics and Biochemistry, Sackler School of Medicine, Tel Aviv University, Israel
| | - Shandar Ahmad
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, India
| |
Collapse
|
5
|
Asim MN, Ibrahim MA, Zehe C, Trygg J, Dengel A, Ahmed S. BoT-Net: a lightweight bag of tricks-based neural network for efficient LncRNA–miRNA interaction prediction. Interdiscip Sci 2022; 14:841-862. [PMID: 35947255 PMCID: PMC9581873 DOI: 10.1007/s12539-022-00535-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2021] [Revised: 06/16/2022] [Accepted: 07/12/2022] [Indexed: 11/30/2022]
Abstract
Background and objective: Interactions of long non-coding ribonucleic acids (lncRNAs) with micro-ribonucleic acids (miRNAs) play an essential role in gene regulation, cellular metabolic, and pathological processes. Existing purely sequence based computational approaches lack robustness and efficiency mainly due to the high length variability of lncRNA sequences. Hence, the prime focus of the current study is to find optimal length trade-offs between highly flexible length lncRNA sequences. Method The paper at hand performs in-depth exploration of diverse copy padding, sequence truncation approaches, and presents a novel idea of utilizing only subregions of lncRNA sequences to generate fixed-length lncRNA sequences. Furthermore, it presents a novel bag of tricks-based deep learning approach “Bot-Net” which leverages a single layer long-short-term memory network regularized through DropConnect to capture higher order residue dependencies, pooling to retain most salient features, normalization to prevent exploding and vanishing gradient issues, learning rate decay, and dropout to regularize precise neural network for lncRNA–miRNA interaction prediction. Results BoT-Net outperforms the state-of-the-art lncRNA–miRNA interaction prediction approach by 2%, 8%, and 4% in terms of accuracy, specificity, and matthews correlation coefficient. Furthermore, a case study analysis indicates that BoT-Net also outperforms state-of-the-art lncRNA–protein interaction predictor on a benchmark dataset by accuracy of 10%, sensitivity of 19%, specificity of 6%, precision of 14%, and matthews correlation coefficient of 26%. Conclusion In the benchmark lncRNA–miRNA interaction prediction dataset, the length of the lncRNA sequence varies from 213 residues to 22,743 residues and in the benchmark lncRNA–protein interaction prediction dataset, lncRNA sequences vary from 15 residues to 1504 residues. For such highly flexible length sequences, fixed length generation using copy padding introduces a significant level of bias which makes a large number of lncRNA sequences very much identical to each other and eventually derail classifier generalizeability. Empirical evaluation reveals that within 50 residues of only the starting region of long lncRNA sequences, a highly informative distribution for lncRNA–miRNA interaction prediction is contained, a crucial finding exploited by the proposed BoT-Net approach to optimize the lncRNA fixed length generation process. Availability: BoT-Net web server can be accessed at https://sds_genetic_analysis.opendfki.de/lncmiRNA/. Graphic Abstract ![]()
Collapse
Affiliation(s)
- Muhammad Nabeel Asim
- Department of Computer Science, Technical University of Kaiserslautern, 67663, Kaiserslautern, Rhineland-Palatinate, Germany.
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Rhineland-Palatinate, Germany.
| | - Muhammad Ali Ibrahim
- Department of Computer Science, Technical University of Kaiserslautern, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
| | - Christoph Zehe
- Sartorius Stedim Cellca GmbH, 88471, Laupheim, Baden-Wurttemberg, Germany
| | - Johan Trygg
- Sartorius Stedim Cellca GmbH, 88471, Laupheim, Baden-Wurttemberg, Germany
- Computational Life Science Cluster (CLiC), Umea University, 90187, Umea, Sweden
| | - Andreas Dengel
- Department of Computer Science, Technical University of Kaiserslautern, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
- Computational Life Science Cluster (CLiC), Umea University, 90187, Umea, Sweden
| |
Collapse
|
6
|
Cao L, Liu P, Chen J, Deng L. Prediction of Transcription Factor Binding Sites Using a Combined Deep Learning Approach. Front Oncol 2022; 12:893520. [PMID: 35719916 PMCID: PMC9204005 DOI: 10.3389/fonc.2022.893520] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2022] [Accepted: 04/11/2022] [Indexed: 12/04/2022] Open
Abstract
In the process of regulating gene expression and evolution, such as DNA replication and mRNA transcription, the binding of transcription factors (TFs) to TF binding sites (TFBS) plays a vital role. Precisely modeling the specificity of genes and searching for TFBS are helpful to explore the mechanism of cell expression. In recent years, computational and deep learning methods searching for TFBS have become an active field of research. However, existing methods generally cannot meet high performance and interpretability simultaneously. Here, we develop an accurate and interpretable attention-based hybrid approach, DeepARC, that combines a convolutional neural network (CNN) and recurrent neural network (RNN) to predict TFBS. DeepARC employs a positional embedding method to extract the hidden embedding from DNA sequences, including the positional information from OneHot encoding and the distributed embedding from DNA2Vec. DeepARC feeds the positional embedding of the DNA sequence into a CNN-BiLSTM-Attention-based framework to complete the task of finding the motif. Taking advantage of the attention mechanism, DeepARC can gain greater access to valuable information about the motif and bring interpretability to the work of searching for motifs through the attention weight graph. Moreover, DeepARC achieves promising performances with an average area under the receiver operating characteristic curve (AUC) score of 0.908 on five cell lines (A549, GM12878, Hep-G2, H1-hESC, and Hela) in the benchmark dataset. We also compare the positional embedding with OneHot and DNA2Vec and gain a competitive advantage.
Collapse
Affiliation(s)
- Linan Cao
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Pei Liu
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Jialong Chen
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha, China
| |
Collapse
|
7
|
Research on DNA-Binding Protein Identification Method Based on LSTM-CNN Feature Fusion. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:9705275. [PMID: 35693256 PMCID: PMC9184165 DOI: 10.1155/2022/9705275] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/25/2021] [Revised: 12/23/2021] [Accepted: 04/27/2022] [Indexed: 11/29/2022]
Abstract
Protein is closely related to life activities. As a kind of protein, DNA-binding protein plays an irreplaceable role in life activities. Therefore, it is very important to study DNA-binding protein, which is a subject worthy of study. Although traditional biotechnology has high precision, its cost and efficiency are increasingly unable to meet the needs of modern society. Machine learning methods can make up for the deficiencies of biological experimental techniques to a certain extent, but they are not as simple and fast as deep learning for data processing. In this paper, a deep learning framework based on parallel long and short-term memory(LSTM) and convolutional neural networks(CNN) was proposed to identify DNA-binding protein. This model can not only further extract the information and features of protein sequences, but also the features of evolutionary information. Finally, the two features are combined for training and testing. On the PDB2272 dataset, compared with PDBP_Fusion model, Accuracy(ACC) and Matthew's Correlation Coefficient (MCC) increased by 3.82% and 7.98% respectively. The experimental results of this model have certain advantages.
Collapse
|
8
|
Zhang J, Yan K, Chen Q, Liu B. PreRBP-TL: prediction of species-specific RNA-binding proteins based on transfer learning. Bioinformatics 2022; 38:2135-2143. [PMID: 35176130 DOI: 10.1093/bioinformatics/btac106] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2021] [Revised: 11/18/2021] [Accepted: 02/15/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION RNA-binding proteins (RBPs) play crucial roles in post-transcriptional regulation. Accurate identification of RBPs helps to understand gene expression, regulation, etc. In recent years, some computational methods were proposed to identify RBPs. However, these methods fail to accurately identify RBPs from some specific species with limited data, such as bacteria. RESULTS In this study, we introduce a computational method called PreRBP-TL for identifying species-specific RBPs based on transfer learning. The weights of the prediction model were initialized by pretraining with the large general RBP dataset and then fine-tuned with the small species-specific RPB dataset by using transfer learning. The experimental results show that the PreRBP-TL achieves better performance for identifying the species-specific RBPs from Human, Arabidopsis, Escherichia coli and Salmonella, outperforming eight state-of-the-art computational methods. It is anticipated PreRBP-TL will become a useful method for identifying RBPs. AVAILABILITY AND IMPLEMENTATION For the convenience of researchers to identify RBPs, the web server of PreRBP-TL was established, freely available at http://bliulab.net/PreRBP-TL. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jun Zhang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Qingcai Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China.,School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China.,Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China
| |
Collapse
|
9
|
Convolutional neural networks with image representation of amino acid sequences for protein function prediction. Comput Biol Chem 2021; 92:107494. [PMID: 33930742 DOI: 10.1016/j.compbiolchem.2021.107494] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2021] [Accepted: 04/21/2021] [Indexed: 01/11/2023]
Abstract
Proteins are one of the most important molecules that govern the cellular processes in most of the living organisms. Various functions of the proteins are of paramount importance to understand the basics of life. Several supervised learning approaches are applied in this field to predict the functionality of proteins. In this paper, we propose a convolutional neural network based approach ProtConv to predict the functionality of proteins by converting the amino-acid sequences to a two dimensional image. We have used a protein embedding technique using transfer learning to generate the feature vector. Feature vector is then converted into a square sized single channel image to be fed into a convolutional network. The neural network architecture used here is a combination of convolutional filters and average pooling layers followed by dense fully connected layers to predict a binary function. We have performed experiments on standard benchmark datasets taken from two very important protein function prediction task: proinflammatory cytokines and anticancer peptides. Our experiments show that the proposed method, ProtConv achieves state-of-the-art performances on both of the datasets. All necessary details about implementation with source code and datasets are made available at: https://github.com/swakkhar/ProtConv.
Collapse
|
10
|
Zhang Q, Liu P, Wang X, Zhang Y, Han Y, Yu B. StackPDB: Predicting DNA-binding proteins based on XGB-RFE feature optimization and stacked ensemble classifier. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2020.106921] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/08/2023]
|
11
|
Zhang J, Chen Q, Liu B. iDRBP_MMC: Identifying DNA-Binding Proteins and RNA-Binding Proteins Based on Multi-Label Learning Model and Motif-Based Convolutional Neural Network. J Mol Biol 2020; 432:5860-5875. [DOI: 10.1016/j.jmb.2020.09.008] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2020] [Revised: 08/12/2020] [Accepted: 09/04/2020] [Indexed: 11/28/2022]
|
12
|
Predicting Hot Spot Residues at Protein-DNA Binding Interfaces Based on Sequence Information. Interdiscip Sci 2020; 13:1-11. [PMID: 33068261 DOI: 10.1007/s12539-020-00399-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2020] [Revised: 09/27/2020] [Accepted: 10/01/2020] [Indexed: 10/23/2022]
Abstract
Hot spot residues at protein-DNA binding interfaces are hugely important for investigating the underlying mechanism of molecular recognition. Currently, there are a few tools available for identifying the hot spot residues in the protein-DNA complexes. In addition, the three-dimensional protein structures are needed in these tools. However, it is well known that the three-dimensional structures are unavailable for most proteins. Considering the limitation, we proposed a method, named SPDH, for predicting hot spot residues only based on protein sequences. Firstly, we obtained 133 features from physicochemical property, conservation, predicted solvent accessible surface area and structure. Then, we systematically assessed these features based on various feature selection methods to obtain the optimal feature subset and compared the models using four classical machine learning algorithms (support vector machine, random forest, logistic regression, and k-nearest neighbor) on the training dataset. We found that the variability of physicochemical property features between wild and mutative types was important on improving the performance of the prediction model. On the independent test set, our method achieved the performance with AUC of 0.760 and sensitivity of 0.808, and outperformed other methods. The data and source code can be downloaded at https://github.com/xialab-ahu/SPDH .
Collapse
|
13
|
PredDBP-Stack: Prediction of DNA-Binding Proteins from HMM Profiles using a Stacked Ensemble Method. BIOMED RESEARCH INTERNATIONAL 2020; 2020:7297631. [PMID: 32352006 PMCID: PMC7174956 DOI: 10.1155/2020/7297631] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/02/2020] [Accepted: 04/01/2020] [Indexed: 12/02/2022]
Abstract
DNA-binding proteins (DBPs) play vital roles in all aspects of genetic activities. However, the identification of DBPs by using wet-lab experimental approaches is often time-consuming and laborious. In this study, we develop a novel computational method, called PredDBP-Stack, to predict DBPs solely based on protein sequences. First, amino acid composition (AAC) and transition probability composition (TPC) extracted from the hidden markov model (HMM) profile are adopted to represent a protein. Next, we establish a stacked ensemble model to identify DBPs, which involves two stages of learning. In the first stage, the four base classifiers are trained with the features of HMM-based compositions. In the second stage, the prediction probabilities of these base classifiers are used as inputs to the meta-classifier to perform the final prediction of DBPs. Based on the PDB1075 benchmark dataset, we conduct a jackknife cross validation with the proposed PredDBP-Stack predictor and obtain a balanced sensitivity and specificity of 92.47% and 92.36%, respectively. This outcome outperforms most of the existing classifiers. Furthermore, our method also achieves superior performance and model robustness on the PDB186 independent dataset. This demonstrates that the PredDBP-Stack is an effective classifier for accurately identifying DBPs based on protein sequence information alone.
Collapse
|
14
|
HMMPred: Accurate Prediction of DNA-Binding Proteins Based on HMM Profiles and XGBoost Feature Selection. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2020; 2020:1384749. [PMID: 32300371 PMCID: PMC7142336 DOI: 10.1155/2020/1384749] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/29/2020] [Accepted: 03/16/2020] [Indexed: 02/08/2023]
Abstract
Prediction of DNA-binding proteins (DBPs) has become a popular research topic in protein science due to its crucial role in all aspects of biological activities. Even though considerable efforts have been devoted to developing powerful computational methods to solve this problem, it is still a challenging task in the field of bioinformatics. A hidden Markov model (HMM) profile has been proved to provide important clues for improving the prediction performance of DBPs. In this paper, we propose a method, called HMMPred, which extracts the features of amino acid composition and auto- and cross-covariance transformation from the HMM profiles, to help train a machine learning model for identification of DBPs. Then, a feature selection technique is performed based on the extreme gradient boosting (XGBoost) algorithm. Finally, the selected optimal features are fed into a support vector machine (SVM) classifier to predict DBPs. The experimental results tested on two benchmark datasets show that the proposed method is superior to most of the existing methods and could serve as an alternative tool to identify DBPs.
Collapse
|
15
|
Computational Identification and Analysis of Ubiquinone-Binding Proteins. Cells 2020; 9:cells9020520. [PMID: 32102444 PMCID: PMC7072731 DOI: 10.3390/cells9020520] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2020] [Revised: 02/21/2020] [Accepted: 02/21/2020] [Indexed: 12/15/2022] Open
Abstract
Ubiquinone is an important cofactor that plays vital and diverse roles in many biological processes. Ubiquinone-binding proteins (UBPs) are receptor proteins that dock with ubiquinones. Analyzing and identifying UBPs via a computational approach will provide insights into the pathways associated with ubiquinones. In this work, we were the first to propose a UBPs predictor (UBPs-Pred). The optimal feature subset selected from three categories of sequence-derived features was fed into the extreme gradient boosting (XGBoost) classifier, and the parameters of XGBoost were tuned by multi-objective particle swarm optimization (MOPSO). The experimental results over the independent validation demonstrated considerable prediction performance with a Matthews correlation coefficient (MCC) of 0.517. After that, we analyzed the UBPs using bioinformatics methods, including the statistics of the binding domain motifs and protein distribution, as well as an enrichment analysis of the gene ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway.
Collapse
|