1
|
Pradhan UK, Naha S, Das R, Gupta A, Parsad R, Meher PK. RBProkCNN: Deep learning on appropriate contextual evolutionary information for RNA binding protein discovery in prokaryotes. Comput Struct Biotechnol J 2024; 23:1631-1640. [PMID: 38660008 PMCID: PMC11039349 DOI: 10.1016/j.csbj.2024.04.034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2024] [Revised: 04/12/2024] [Accepted: 04/12/2024] [Indexed: 04/26/2024] Open
Abstract
RNA-binding proteins (RBPs) are central to key functions such as post-transcriptional regulation, mRNA stability, and adaptation to varied environmental conditions in prokaryotes. While the majority of research has concentrated on eukaryotic RBPs, recent developments underscore the crucial involvement of prokaryotic RBPs. Although computational methods have emerged in recent years to identify RBPs, they have fallen short in accurately identifying prokaryotic RBPs due to their generic nature. To bridge this gap, we introduce RBProkCNN, a novel machine learning-driven computational model meticulously designed for the accurate prediction of prokaryotic RBPs. The prediction process involves the utilization of eight shallow learning algorithms and four deep learning models, incorporating PSSM-based evolutionary features. By leveraging a convolutional neural network (CNN) and evolutionarily significant features selected through extreme gradient boosting variable importance measure, RBProkCNN achieved the highest accuracy in five-fold cross-validation, yielding 98.04% auROC and 98.19% auPRC. Furthermore, RBProkCNN demonstrated robust performance with an independent dataset, showcasing a commendable 95.77% auROC and 95.78% auPRC. Noteworthy is its superior predictive accuracy when compared to several state-of-the-art existing models. RBProkCNN is available as an online prediction tool (https://iasri-sg.icar.gov.in/rbprokcnn/), offering free access to interested users. This tool represents a substantial contribution, enriching the array of resources available for the accurate and efficient prediction of prokaryotic RBPs.
Collapse
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Ritwika Das
- Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| |
Collapse
|
2
|
Li X, Wei Z, Hu Y, Zhu X. GraphNABP: Identifying nucleic acid-binding proteins with protein graphs and protein language models. Int J Biol Macromol 2024:135599. [PMID: 39276905 DOI: 10.1016/j.ijbiomac.2024.135599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2024] [Revised: 09/11/2024] [Accepted: 09/11/2024] [Indexed: 09/17/2024]
Abstract
The computational identification of nucleic acid-binding proteins (NABP) is of great significance for understanding the mechanisms of these biological activities and drug discovery. Although a bunch of sequence-based methods have been proposed to predict NABP and achieved promising performance, the structure information is often overlooked. On the other hand, the power of popular protein language models (pLM) has seldom been harnessed for predicting NABPs. In this study, we propose a novel framework called GraphNABP, to predict NABP by integrating sequence and predicted 3D structure information. Specifically, sequence embeddings and protein molecular graphs were first obtained from ProtT5 protein language model and predicted 3D structures, respectively. Then, graph attention (GAT) and bidirectional long short-term memory (BiLSTM) neural networks were used to enhance feature representations. Finally, a fully connected layer is used to predict NABPs. To the best of our knowledge, this is the first time to integrate AlphaFold and protein language models for the prediction of NABPs. The performances on multiple independent test sets indicate that GraphNABP outperforms other state-of-the-art methods. Our results demonstrate the effectiveness of pLM embeddings and structural information for NABP prediction. The codes and data used in this study are available at https://github.com/lixiangli01/GraphNABP.
Collapse
Affiliation(s)
- Xiang Li
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Zhuoyu Wei
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Yueran Hu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Xiaolei Zhu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China.
| |
Collapse
|
3
|
Zeng W, Dou Y, Pan L, Xu L, Peng S. Improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein. Nat Commun 2024; 15:7838. [PMID: 39244557 PMCID: PMC11380688 DOI: 10.1038/s41467-024-52293-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2023] [Accepted: 08/29/2024] [Indexed: 09/09/2024] Open
Abstract
DNA-protein interactions exert the fundamental structure of many pivotal biological processes, such as DNA replication, transcription, and gene regulation. However, accurate and efficient computational methods for identifying these interactions are still lacking. In this study, we propose a method ESM-DBP through refining the DNA-binding protein sequence repertory and domain-adaptive pretraining based the general protein language model. Our method considers the lacking exploration of general language model for DNA-binding protein domain-specific knowledge, so we screen out 170,264 DNA-binding protein sequences to construct the domain-adaptive language model. Experimental results on four downstream tasks show that ESM-DBP provides a better feature representation of DNA-binding protein compared to the original language model, resulting in improved prediction performance and outperforming the state-of-the-art methods. Moreover, ESM-DBP can still perform well even for those sequences with only a few homologous sequences. ChIP-seq on two predicted cases further support the validity of the proposed method.
Collapse
Affiliation(s)
- Wenwu Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China
| | - Yutao Dou
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China
| | - Liangrui Pan
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China
| | - Liwen Xu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China.
| | - Shaoliang Peng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China.
| |
Collapse
|
4
|
Hu X, Zhang X, Sun W, Liu C, Deng P, Cao Y, Zhang C, Xu N, Zhang T, Zhang YE, Liu JJG, Wang H. Systematic discovery of DNA-binding tandem repeat proteins. Nucleic Acids Res 2024:gkae710. [PMID: 39189466 DOI: 10.1093/nar/gkae710] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2024] [Revised: 07/30/2024] [Accepted: 08/07/2024] [Indexed: 08/28/2024] Open
Abstract
Tandem repeat proteins (TRPs) are widely distributed and bind to a wide variety of ligands. DNA-binding TRPs such as zinc finger (ZNF) and transcription activator-like effector (TALE) play important roles in biology and biotechnology. In this study, we first conducted an extensive analysis of TRPs in public databases, and found that the enormous diversity of TRPs is largely unexplored. We then focused our efforts on identifying novel TRPs possessing DNA-binding capabilities. We established a protein language model for DNA-binding protein prediction (PLM-DBPPred), and predicted a large number of DNA-binding TRPs. A subset was then selected for experimental screening, leading to the identification of 11 novel DNA-binding TRPs, with six showing sequence specificity. Notably, members of the STAR (Short TALE-like Repeat proteins) family can be programmed to target specific 9 bp DNA sequences with high affinity. Leveraging this property, we generated artificial transcription factors using reprogrammed STAR proteins and achieved targeted activation of endogenous gene sets. Furthermore, the members of novel families such as MOON (Marine Organism-Originated DNA binding protein) and pTERF (prokaryotic mTERF-like protein) exhibit unique features and distinct DNA-binding characteristics, revealing interesting biological clues. Our study expands the diversity of DNA-binding TRPs, and demonstrates that a systematic approach greatly enhances the discovery of new biological insights and tools.
Collapse
Affiliation(s)
- Xiaoxuan Hu
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Xuechun Zhang
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Wen Sun
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
- Beijing Institute for Stem Cell and Regenerative Medicine, Beijing 100101, China
| | - Chunhong Liu
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Pujuan Deng
- State Key Laboratory of Membrane Biology, Beijing Frontier Research Center for Biological Structure, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Center for Life Sciences, Tsinghua University, Beijing 100084, China
| | - Yuanwei Cao
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Chenze Zhang
- National Key Laboratory of Efficacy and Mechanism on Chinese Medicine for Metabolic Diseases, Beijing University of Chinese Medicine, Beijing 100029, China
| | - Ning Xu
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Tongtong Zhang
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Yong E Zhang
- University of Chinese Academy of Sciences, Beijing 100049, China
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
| | - Jun-Jie Gogo Liu
- State Key Laboratory of Membrane Biology, Beijing Frontier Research Center for Biological Structure, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Center for Life Sciences, Tsinghua University, Beijing 100084, China
| | - Haoyi Wang
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
- Beijing Institute for Stem Cell and Regenerative Medicine, Beijing 100101, China
| |
Collapse
|
5
|
Pradhan UK, Meher PK, Naha S, Sharma NK, Agarwal A, Gupta A, Parsad R. DBPMod: a supervised learning model for computational recognition of DNA-binding proteins in model organisms. Brief Funct Genomics 2024; 23:363-372. [PMID: 37651627 DOI: 10.1093/bfgp/elad039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2023] [Revised: 08/09/2023] [Accepted: 08/15/2023] [Indexed: 09/02/2023] Open
Abstract
DNA-binding proteins (DBPs) play critical roles in many biological processes, including gene expression, DNA replication, recombination and repair. Understanding the molecular mechanisms underlying these processes depends on the precise identification of DBPs. In recent times, several computational methods have been developed to identify DBPs. However, because of the generic nature of the models, these models are unable to identify species-specific DBPs with higher accuracy. Therefore, a species-specific computational model is needed to predict species-specific DBPs. In this paper, we introduce the computational DBPMod method, which makes use of a machine learning approach to identify species-specific DBPs. For prediction, both shallow learning algorithms and deep learning models were used, with shallow learning models achieving higher accuracy. Additionally, the evolutionary features outperformed sequence-derived features in terms of accuracy. Five model organisms, including Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Homo sapiens and Mus musculus, were used to assess the performance of DBPMod. Five-fold cross-validation and independent test set analyses were used to evaluate the prediction accuracy in terms of area under receiver operating characteristic curve (auROC) and area under precision-recall curve (auPRC), which was found to be ~89-92% and ~89-95%, respectively. The comparative results demonstrate that the DBPMod outperforms 12 current state-of-the-art computational approaches in identifying the DBPs for all five model organisms. We further developed the web server of DBPMod to make it easier for researchers to detect DBPs and is publicly available at https://iasri-sg.icar.gov.in/dbpmod/. DBPMod is expected to be an invaluable tool for discovering DBPs, supplementing the current experimental and computational methods.
Collapse
Affiliation(s)
- Upendra K Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Nitesh K Sharma
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, 1540 Alcazar Street, Los Angeles, CA 90033, USA
| | - Aarushi Agarwal
- Amity Institute of Biotechnology, Amity University, Noida, Uttar Pradesh 201313, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| |
Collapse
|
6
|
Pradhan UK, Meher PK, Naha S, Das R, Gupta A, Parsad R. ProkDBP: Toward more precise identification of prokaryotic DNA binding proteins. Protein Sci 2024; 33:e5015. [PMID: 38747369 PMCID: PMC11094783 DOI: 10.1002/pro.5015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2023] [Revised: 04/18/2024] [Accepted: 04/21/2024] [Indexed: 05/19/2024]
Abstract
Prokaryotic DNA binding proteins (DBPs) play pivotal roles in governing gene regulation, DNA replication, and various cellular functions. Accurate computational models for predicting prokaryotic DBPs hold immense promise in accelerating the discovery of novel proteins, fostering a deeper understanding of prokaryotic biology, and facilitating the development of therapeutics targeting for potential disease interventions. However, existing generic prediction models often exhibit lower accuracy in predicting prokaryotic DBPs. To address this gap, we introduce ProkDBP, a novel machine learning-driven computational model for prediction of prokaryotic DBPs. For prediction, a total of nine shallow learning algorithms and five deep learning models were utilized, with the shallow learning models demonstrating higher performance metrics compared to their deep learning counterparts. The light gradient boosting machine (LGBM), coupled with evolutionarily significant features selected via random forest variable importance measure (RF-VIM) yielded the highest five-fold cross-validation accuracy. The model achieved the highest auROC (0.9534) and auPRC (0.9575) among the 14 machine learning models evaluated. Additionally, ProkDBP demonstrated substantial performance with an independent dataset, exhibiting higher values of auROC (0.9332) and auPRC (0.9371). Notably, when benchmarked against several cutting-edge existing models, ProkDBP showcased superior predictive accuracy. Furthermore, to promote accessibility and usability, ProkDBP (https://iasri-sg.icar.gov.in/prokdbp/) is available as an online prediction tool, enabling free access to interested users. This tool stands as a significant contribution, enhancing the repertoire of resources for accurate and efficient prediction of prokaryotic DBPs.
Collapse
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical GeneticsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Prabina Kumar Meher
- Division of Statistical GeneticsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Sanchita Naha
- Division of Computer ApplicationsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Ritwika Das
- Division of Agricultural BioinformaticsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Ajit Gupta
- Division of Statistical GeneticsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Rajender Parsad
- ICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| |
Collapse
|
7
|
Sun C, Feng Y. EPDRNA: A Model for Identifying DNA-RNA Binding Sites in Disease-Related Proteins. Protein J 2024; 43:513-521. [PMID: 38491248 DOI: 10.1007/s10930-024-10183-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 02/02/2024] [Indexed: 03/18/2024]
Abstract
Protein-DNA and protein-RNA interactions are involved in many biological processes and regulate many cellular functions. Moreover, they are related to many human diseases. To understand the molecular mechanism of protein-DNA binding and protein-RNA binding, it is important to identify which residues in the protein sequence bind to DNA and RNA. At present, there are few methods for specifically identifying the binding sites of disease-related protein-DNA and protein-RNA. In this study, so we combined four machine learning algorithms into an ensemble classifier (EPDRNA) to predict DNA and RNA binding sites in disease-related proteins. The dataset used in model was collated from UniProt and PDB database, and PSSM, physicochemical properties and amino acid type were used as features. The EPDRNA adopted soft voting and achieved the best AUC value of 0.73 at the DNA binding sites, and the best AUC value of 0.71 at the RNA binding sites in 10-fold cross validation in the training sets. In order to further verify the performance of the model, we assessed EPDRNA for the prediction of DNA-binding sites and the prediction of RNA-binding sites on the independent test dataset. The EPDRNA achieved 85% recall rate and 25% precision on the protein-DNA interaction independent test set, and achieved 82% recall rate and 27% precision on the protein-RNA interaction independent test set. The online EPDRNA webserver is freely available at http://www.s-bioinformatics.cn/epdrna .
Collapse
Affiliation(s)
- CanZhuang Sun
- College of Science, Inner Mongolia Agriculture University, Hohhot, 010018, People's Republic of China
| | - YongE Feng
- College of Science, Inner Mongolia Agriculture University, Hohhot, 010018, People's Republic of China.
| |
Collapse
|
8
|
García Sánchez N, Ugarte Carro E, Prieto-Santamaría L, Rodríguez-González A. Protein sequence analysis in the context of drug repurposing. BMC Med Inform Decis Mak 2024; 24:122. [PMID: 38741115 DOI: 10.1186/s12911-024-02531-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2023] [Accepted: 05/08/2024] [Indexed: 05/16/2024] Open
Abstract
MOTIVATION Drug repurposing speeds up the development of new treatments, being less costly, risky, and time consuming than de novo drug discovery. There are numerous biological elements that contribute to the development of diseases and, as a result, to the repurposing of drugs. METHODS In this article, we analysed the potential role of protein sequences in drug repurposing scenarios. For this purpose, we embedded the protein sequences by performing four state of the art methods and validated their capacity to encapsulate essential biological information through visualization. Then, we compared the differences in sequence distance between protein-drug target pairs of drug repurposing and non - drug repurposing data. Thus, we were able to uncover patterns that define protein sequences in repurposing cases. RESULTS We found statistically significant sequence distance differences between protein pairs in the repurposing data and the rest of protein pairs in non-repurposing data. In this manner, we verified the potential of using numerical representations of sequences to generate repurposing hypotheses in the future.
Collapse
Affiliation(s)
- Natalia García Sánchez
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, 28223, Spain
| | - Esther Ugarte Carro
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, 28223, Spain
| | - Lucía Prieto-Santamaría
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, 28223, Spain
- ETS de Ingenieros Informáticos, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, 28660, Spain
| | - Alejandro Rodríguez-González
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, 28223, Spain.
- ETS de Ingenieros Informáticos, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, 28660, Spain.
| |
Collapse
|
9
|
Li X, Qu W, Yan J, Tan J. RPI-EDLCN: An Ensemble Deep Learning Framework Based on Capsule Network for ncRNA-Protein Interaction Prediction. J Chem Inf Model 2024; 64:2221-2235. [PMID: 37158609 DOI: 10.1021/acs.jcim.3c00377] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/10/2023]
Abstract
Noncoding RNAs (ncRNAs) play crucial roles in many cellular life activities by interacting with proteins. Identification of ncRNA-protein interactions (ncRPIs) is key to understanding the function of ncRNAs. Although a number of computational methods for predicting ncRPIs have been developed, the problem of predicting ncRPIs remains challenging. It has always been the focus of ncRPIs research to select suitable feature extraction methods and develop a deep learning architecture with better recognition performance. In this work, we proposed an ensemble deep learning framework, RPI-EDLCN, based on a capsule network (CapsuleNet) to predict ncRPIs. In terms of feature input, we extracted the sequence features, secondary structure sequence features, motif information, and physicochemical properties of ncRNA/protein. The sequence and secondary structure sequence features of ncRNA/protein are encoded by the conjoint k-mer method and then input into an ensemble deep learning model based on CapsuleNet by combining the motif information and physicochemical properties. In this model, the encoding features are processed by convolution neural network (CNN), deep neural network (DNN), and stacked autoencoder (SAE). Then the advanced features obtained from the processing are input into the CapsuleNet for further feature learning. Compared with other state-of-the-art methods under 5-fold cross-validation, the performance of RPI-EDLCN is the best, and the accuracy of RPI-EDLCN on RPI1807, RPI2241, and NPInter v2.0 data sets was 93.8%, 88.2%, and 91.9%, respectively. The results of the independent test indicated that RPI-EDLCN can effectively predict potential ncRPIs in different organisms. In addition, RPI-EDLCN successfully predicted hub ncRNAs and proteins in Mus musculus ncRNA-protein networks. Overall, our model can be used as an effective tool to predict ncRPIs and provides some useful guidance for future biological studies.
Collapse
Affiliation(s)
- Xiaoyi Li
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
| | - Wenyan Qu
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
| | - Jing Yan
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
| | - Jianjun Tan
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
| |
Collapse
|
10
|
Jia P, Zhang F, Wu C, Li M. A comprehensive review of protein-centric predictors for biomolecular interactions: from proteins to nucleic acids and beyond. Brief Bioinform 2024; 25:bbae162. [PMID: 38739759 PMCID: PMC11089422 DOI: 10.1093/bib/bbae162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2024] [Revised: 02/17/2024] [Accepted: 03/31/2024] [Indexed: 05/16/2024] Open
Abstract
Proteins interact with diverse ligands to perform a large number of biological functions, such as gene expression and signal transduction. Accurate identification of these protein-ligand interactions is crucial to the understanding of molecular mechanisms and the development of new drugs. However, traditional biological experiments are time-consuming and expensive. With the development of high-throughput technologies, an increasing amount of protein data is available. In the past decades, many computational methods have been developed to predict protein-ligand interactions. Here, we review a comprehensive set of over 160 protein-ligand interaction predictors, which cover protein-protein, protein-nucleic acid, protein-peptide and protein-other ligands (nucleotide, heme, ion) interactions. We have carried out a comprehensive analysis of the above four types of predictors from several significant perspectives, including their inputs, feature profiles, models, availability, etc. The current methods primarily rely on protein sequences, especially utilizing evolutionary information. The significant improvement in predictions is attributed to deep learning methods. Additionally, sequence-based pretrained models and structure-based approaches are emerging as new trends.
Collapse
Affiliation(s)
- Pengzhen Jia
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
| | - Fuhao Zhang
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi 712100, China
| | - Chaojin Wu
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
| | - Min Li
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
| |
Collapse
|
11
|
Wilson B, Esmaeili F, Parsons M, Salah W, Su Z, Dutta A. sRNA-Effector: A tool to expedite discovery of small RNA regulators. iScience 2024; 27:109300. [PMID: 38469560 PMCID: PMC10926228 DOI: 10.1016/j.isci.2024.109300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2023] [Revised: 11/08/2023] [Accepted: 02/16/2024] [Indexed: 03/13/2024] Open
Abstract
microRNAs (miRNAs) are small regulatory RNAs that repress target mRNA transcripts through base pairing. Although the mechanisms of miRNA production and function are clearly established, new insights into miRNA regulation or miRNA-mediated gene silencing are still emerging. In order to facilitate the discovery of miRNA regulators or effectors, we have developed sRNA-Effector, a machine learning algorithm trained on enhanced crosslinking and immunoprecipitation sequencing and RNA sequencing data following knockdown of specific genes. sRNA-Effector can accurately identify known miRNA biogenesis and effector proteins and identifies 9 putative regulators of miRNA function, including serine/threonine kinase STK33, splicing factor SFPQ, and proto-oncogene BMI1. We validated the role of STK33, SFPQ, and BMI1 in miRNA regulation, showing that sRNA-Effector is useful for identifying new players in small RNA biology. sRNA-Effector will be a web tool available for all researchers to identify potential miRNA regulators in any cell line of interest.
Collapse
Affiliation(s)
- Briana Wilson
- Department of Biochemistry and Molecular Genetics, University of Virginia School of Medicine, Charlottesville, VA 22901, USA
| | - Fatemeh Esmaeili
- Department of Genetics, University of Alabama at Birmingham, Birmingham, AL 35233, USA
| | - Matthew Parsons
- Department of Biochemistry and Molecular Genetics, University of Virginia School of Medicine, Charlottesville, VA 22901, USA
| | - Wafa Salah
- Department of Biochemistry and Molecular Genetics, University of Virginia School of Medicine, Charlottesville, VA 22901, USA
| | - Zhangli Su
- Department of Genetics, University of Alabama at Birmingham, Birmingham, AL 35233, USA
| | - Anindya Dutta
- Department of Genetics, University of Alabama at Birmingham, Birmingham, AL 35233, USA
| |
Collapse
|
12
|
Zhang J, Chen Q, Liu B. iNucRes-ASSH: Identifying nucleic acid-binding residues in proteins by using self-attention-based structure-sequence hybrid neural network. Proteins 2024; 92:395-410. [PMID: 37915276 DOI: 10.1002/prot.26626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2023] [Revised: 09/27/2023] [Accepted: 10/17/2023] [Indexed: 11/03/2023]
Abstract
Interaction between proteins and nucleic acids is crucial to many cellular activities. Accurately detecting nucleic acid-binding residues (NABRs) in proteins can help researchers better understand the interaction mechanism between proteins and nucleic acids. Structure-based methods can generally make more accurate predictions than sequence-based methods. However, the existing structure-based methods are sensitive to protein conformational changes, causing limited generalizability. More effective and robust approaches should be further explored. In this study, we propose iNucRes-ASSH to identify nucleic acid-binding residues with a self-attention-based structure-sequence hybrid neural network. It improves the generalizability and robustness of NABR prediction from two levels: residue representation and prediction model. Experimental results show that iNucRes-ASSH can predict the nucleic acid-binding residues even when the experimentally validated structures are unavailable and outperforms five competing methods on a recent benchmark dataset and a widely used test dataset.
Collapse
Affiliation(s)
- Jun Zhang
- National Engineering Laboratory for Big Data System Computing Technology, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong, China
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Qingcai Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
13
|
Wang J, Zhao X, Wang Q, Zheng X, Simayi D, Zhao J, Yang P, Mao Q, Xia H. FAM76B regulates PI3K/Akt/NF-κB-mediated M1 macrophage polarization by influencing the stability of PIK3CD mRNA. Cell Mol Life Sci 2024; 81:107. [PMID: 38421448 PMCID: PMC10904503 DOI: 10.1007/s00018-024-05133-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2023] [Revised: 01/17/2024] [Accepted: 01/17/2024] [Indexed: 03/02/2024]
Abstract
Macrophage polarization is closely related to inflammation development, yet how macrophages are polarized remains unclear. In our study, the number of M1 macrophages was markedly increased in Fam76b knockout U937 cells vs. wild-type U937 cells, and FAM76B expression was decreased in M1 macrophages induced from different sources of macrophages. Moreover, Fam76b knockout enhanced the mRNA and protein levels of M1 macrophage-associated marker genes. These results suggest that FAM76B inhibits M1 macrophage polarization. We then further explored the mechanism by which FAM76B regulates macrophage polarization. We found that FAM76B can regulate PI3K/Akt/NF-κB pathway-mediated M1 macrophage polarization by stabilizing PIK3CD mRNA. Finally, FAM76B was proven to protect against inflammatory bowel disease (IBD) by inhibiting M1 macrophage polarization through the PI3K/Akt/NF-κB pathway in vivo. In summary, FAM76B regulates M1 macrophage polarization through the PI3K/Akt/NF-κB pathway in vitro and in vivo, which may inform the development of future therapeutic strategies for IBD and other inflammatory diseases.
Collapse
Affiliation(s)
- Juan Wang
- Laboratory of Gene Therapy, Department of Biochemistry, College of Life Sciences, Shaanxi Normal University, 199 South Chang'an Road, Xi'an, 710062, Shaanxi Province, People's Republic of China
- Department of Pathology, School of Basic Medical Science, Ningxia Medical University, Yinchuan, 750004, People's Republic of China
| | - Xinyue Zhao
- Laboratory of Gene Therapy, Department of Biochemistry, College of Life Sciences, Shaanxi Normal University, 199 South Chang'an Road, Xi'an, 710062, Shaanxi Province, People's Republic of China
| | - Qizhi Wang
- Laboratory of Gene Therapy, Department of Biochemistry, College of Life Sciences, Shaanxi Normal University, 199 South Chang'an Road, Xi'an, 710062, Shaanxi Province, People's Republic of China
| | - Xiaojing Zheng
- Laboratory of Gene Therapy, Department of Biochemistry, College of Life Sciences, Shaanxi Normal University, 199 South Chang'an Road, Xi'an, 710062, Shaanxi Province, People's Republic of China
| | - Dilihumaer Simayi
- Laboratory of Gene Therapy, Department of Biochemistry, College of Life Sciences, Shaanxi Normal University, 199 South Chang'an Road, Xi'an, 710062, Shaanxi Province, People's Republic of China
| | - Junli Zhao
- Laboratory of Gene Therapy, Department of Biochemistry, College of Life Sciences, Shaanxi Normal University, 199 South Chang'an Road, Xi'an, 710062, Shaanxi Province, People's Republic of China
| | - Peiyan Yang
- Laboratory of Gene Therapy, Department of Biochemistry, College of Life Sciences, Shaanxi Normal University, 199 South Chang'an Road, Xi'an, 710062, Shaanxi Province, People's Republic of China
| | - Qinwen Mao
- Department of Pathology, University of Utah, Huntsman Cancer Institute, 2000 Circle of Hope Drive, Salt Lake City, UT, 84112, USA
| | - Haibin Xia
- Laboratory of Gene Therapy, Department of Biochemistry, College of Life Sciences, Shaanxi Normal University, 199 South Chang'an Road, Xi'an, 710062, Shaanxi Province, People's Republic of China.
| |
Collapse
|
14
|
Mahmud SMH, Goh KOM, Hosen MF, Nandi D, Shoombuatong W. Deep-WET: a deep learning-based approach for predicting DNA-binding proteins using word embedding techniques with weighted features. Sci Rep 2024; 14:2961. [PMID: 38316843 PMCID: PMC10844231 DOI: 10.1038/s41598-024-52653-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2023] [Accepted: 01/22/2024] [Indexed: 02/07/2024] Open
Abstract
DNA-binding proteins (DBPs) play a significant role in all phases of genetic processes, including DNA recombination, repair, and modification. They are often utilized in drug discovery as fundamental elements of steroids, antibiotics, and anticancer drugs. Predicting them poses the most challenging task in proteomics research. Conventional experimental methods for DBP identification are costly and sometimes biased toward prediction. Therefore, developing powerful computational methods that can accurately and rapidly identify DBPs from sequence information is an urgent need. In this study, we propose a novel deep learning-based method called Deep-WET to accurately identify DBPs from primary sequence information. In Deep-WET, we employed three powerful feature encoding schemes containing Global Vectors, Word2Vec, and fastText to encode the protein sequence. Subsequently, these three features were sequentially combined and weighted using the weights obtained from the elements learned through the differential evolution (DE) algorithm. To enhance the predictive performance of Deep-WET, we applied the SHapley Additive exPlanations approach to remove irrelevant features. Finally, the optimal feature subset was input into convolutional neural networks to construct the Deep-WET predictor. Both cross-validation and independent tests indicated that Deep-WET achieved superior predictive performance compared to conventional machine learning classifiers. In addition, in extensive independent test, Deep-WET was effective and outperformed than several state-of-the-art methods for DBP prediction, with accuracy of 78.08%, MCC of 0.559, and AUC of 0.805. This superior performance shows that Deep-WET has a tremendous predictive capacity to predict DBPs. The web server of Deep-WET and curated datasets in this study are available at https://deepwet-dna.monarcatechnical.com/ . The proposed Deep-WET is anticipated to serve the community-wide effort for large-scale identification of potential DBPs.
Collapse
Affiliation(s)
- S M Hasan Mahmud
- Department of Computer Science, American International University-Bangladesh (AIUB), Kuratoli, Dhaka, 1229, Bangladesh.
- Centre for Advanced Machine Learning and Applications (CAMLAs), Dhaka, 1229, Bangladesh.
| | - Kah Ong Michael Goh
- Faculty of Information Science & Technology (FIST), Multimedia University, Jalan Ayer Keroh Lama, 75450, Melaka, Malaysia.
| | - Md Faruk Hosen
- Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh
| | - Dip Nandi
- Department of Computer Science, American International University-Bangladesh (AIUB), Kuratoli, Dhaka, 1229, Bangladesh
| | - Watshara Shoombuatong
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
| |
Collapse
|
15
|
Zhang J, Basu S, Kurgan L. HybridDBRpred: improved sequence-based prediction of DNA-binding amino acids using annotations from structured complexes and disordered proteins. Nucleic Acids Res 2024; 52:e10. [PMID: 38048333 PMCID: PMC10810184 DOI: 10.1093/nar/gkad1131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2023] [Accepted: 11/10/2023] [Indexed: 12/06/2023] Open
Abstract
Current predictors of DNA-binding residues (DBRs) from protein sequences belong to two distinct groups, those trained on binding annotations extracted from structured protein-DNA complexes (structure-trained) vs. intrinsically disordered proteins (disorder-trained). We complete the first empirical analysis of predictive performance across the structure- and disorder-annotated proteins for a representative collection of ten predictors. Majority of the structure-trained tools perform well on the structure-annotated proteins while doing relatively poorly on the disorder-annotated proteins, and vice versa. Several methods make accurate predictions for the structure-annotated proteins or the disorder-annotated proteins, but none performs highly accurately for both annotation types. Moreover, most predictors make excessive cross-predictions for the disorder-annotated proteins, where residues that interact with non-DNA ligand types are predicted as DBRs. Motivated by these results, we design, validate and deploy an innovative meta-model, hybridDBRpred, that uses deep transformer network to combine predictions generated by three best current predictors. HybridDBRpred provides accurate predictions and low levels of cross-predictions across the two annotation types, and is statistically more accurate than each of the ten tools and baseline meta-predictors that rely on averaging and logistic regression. We deploy hybridDBRpred as a convenient web server at http://biomine.cs.vcu.edu/servers/hybridDBRpred/ and provide the corresponding source code at https://github.com/jianzhang-xynu/hybridDBRpred.
Collapse
Affiliation(s)
- Jian Zhang
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, PR China
| | - Sushmita Basu
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| |
Collapse
|
16
|
Chen S, Yan K, Liu B. PDB-BRE: A ligand-protein interaction binding residue extractor based on Protein Data Bank. Proteins 2024; 92:145-153. [PMID: 37750380 DOI: 10.1002/prot.26596] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2023] [Revised: 08/13/2023] [Accepted: 09/11/2023] [Indexed: 09/27/2023]
Abstract
Proteins typically exert their biological functions by interacting with other biomolecules or ligands. The study of ligand-protein interactions is crucial in elucidating the biological mechanisms of proteins. Most existing studies have focused on analyzing ligand-protein interactions, and they ignore the additional situational of inserted and modified residues. Besides, the resources often support only a single ligand type and cannot obtain satisfied results in analyzing novel complexes. Therefore, it is important to develop a general analytical tool to extract the binding residues of ligand-protein interactions in complexes fully. In this study, we propose a ligand-protein interaction binding residue extractor (PDB-BRE), which can be used to automatically extract interacting ligand or protein-binding residues from complex three-dimensional (3D) structures based on the RCSB Protein Data Bank (RCSB PDB). PDB-BRE offers a notable advantage in its comprehensive support for analyzing six distinct types of ligands, including proteins, peptides, DNA, RNA, mixed DNA and RNA entities, and non-polymeric entities. Moreover, it takes into account the consideration of inserted and modified residues within complexes. Compared to other state-of-the-art methods, PDB-BRE is more suitable for massively parallel batch analysis, and can be directly applied for downstream tasks, such as predicting binding residues of novel complexes. PDB-BRE is freely available at http://bliulab.net/PDB-BRE.
Collapse
Affiliation(s)
- Shutao Chen
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
17
|
Pradhan UK, Meher PK, Naha S, Pal S, Gupta S, Gupta A, Parsad R. RBPLight: a computational tool for discovery of plant-specific RNA-binding proteins using light gradient boosting machine and ensemble of evolutionary features. Brief Funct Genomics 2023; 22:401-410. [PMID: 37158175 DOI: 10.1093/bfgp/elad016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2022] [Revised: 04/12/2023] [Accepted: 04/21/2023] [Indexed: 05/10/2023] Open
Abstract
RNA-binding proteins (RBPs) are essential for post-transcriptional gene regulation in eukaryotes, including splicing control, mRNA transport and decay. Thus, accurate identification of RBPs is important to understand gene expression and regulation of cell state. In order to detect RBPs, a number of computational models have been developed. These methods made use of datasets from several eukaryotic species, specifically from mice and humans. Although some models have been tested on Arabidopsis, these techniques fall short of correctly identifying RBPs for other plant species. Therefore, the development of a powerful computational model for identifying plant-specific RBPs is needed. In this study, we presented a novel computational model for locating RBPs in plants. Five deep learning models and ten shallow learning algorithms were utilized for prediction with 20 sequence-derived and 20 evolutionary feature sets. The highest repeated five-fold cross-validation accuracy, 91.24% AU-ROC and 91.91% AU-PRC, was achieved by light gradient boosting machine. While evaluated using an independent dataset, the developed approach achieved 94.00% AU-ROC and 94.50% AU-PRC. The proposed model achieved significantly higher accuracy for predicting plant-specific RBPs as compared to the currently available state-of-art RBP prediction models. Despite the fact that certain models have already been trained and assessed on the model organism Arabidopsis, this is the first comprehensive computer model for the discovery of plant-specific RBPs. The web server RBPLight was also developed, which is publicly accessible at https://iasri-sg.icar.gov.in/rbplight/, for the convenience of researchers to identify RBPs in plants.
Collapse
Affiliation(s)
- Upendra K Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Soumen Pal
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sagar Gupta
- CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur (HP) 176061, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| |
Collapse
|
18
|
Arican OC, Gumus O. PredDRBP-MLP: Prediction of DNA-binding proteins and RNA-binding proteins by multilayer perceptron. Comput Biol Med 2023; 164:107317. [PMID: 37562328 DOI: 10.1016/j.compbiomed.2023.107317] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2023] [Revised: 07/27/2023] [Accepted: 08/07/2023] [Indexed: 08/12/2023]
Abstract
Proteins interact with many molecules in order to maintain the vital activities in cells. Proteins that interact with DNA are called DNA-binding proteins (DBP), and proteins that interact with RNA are called RNA-binding proteins (RBP). Since DBPs and RBPs are involved in critical biological processes, their classification is quite important. Although the convolutional neural network and bidirectional long-short-term memory hybrid model (CNN-BiLSTM) is very popular in DBP and RBP classification, it has problems such as requirement of high processing power and long training time. Therefore, a multilayer perceptron (MLP) based predictor, PredDRBP-MLP (Predictor of DNA-Binding Proteins and RNA-Binding Proteins - Multilayer Perceptron) was developed in this study. PredDRBP-MLP is an artificial learning model that performs multi-class classification of DBPs, RBPs and non-nucleic acid-binding proteins (NNABP). PredDRBP-MLP achieved quite successful results on the independent dataset, specifically in the NNABP class, compared to the existing predictors, in addition to requiring lower processing power and being able to train quicker compared to CNN-BiLSTM based predictors. In NNABP class, PredDRBP-MLP predictor achieved 0.578 precision, 0.522 recall and 0.549 F1-score, while other multi-class predictor achieved 0.486 precision, 0.183 recall and 0.266 F1-score. A desktop application was developed for PredDRBP-MLP. The application is freely accessible at https://sourceforge.net/projects/preddrbp-mlp.
Collapse
Affiliation(s)
- Ozgur Can Arican
- Department of Health Bioinformatics, Ege University, 35100, Izmir, Turkey.
| | - Ozgur Gumus
- Department of Computer Engineering, Ege University, 35100, Izmir, Turkey.
| |
Collapse
|
19
|
Deng Z, Yang Z, Liu X, Dai X, Zhang J, Deng K. Genome-Wide Identification and Expression Analysis of C3H Zinc Finger Family in Potato ( Solanum tuberosum L.). Int J Mol Sci 2023; 24:12888. [PMID: 37629069 PMCID: PMC10454627 DOI: 10.3390/ijms241612888] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2023] [Revised: 08/11/2023] [Accepted: 08/15/2023] [Indexed: 08/27/2023] Open
Abstract
Transcription factors containing a CCCH structure (C3H) play important roles in plant growth and development, and their stress response, but research on the C3H gene family in potato has not been reported yet. In this study, we used bioinformatics to identify 50 C3H genes in potato and named them StC3H-1 to StC3H-50 according to their location on chromosomes, and we analyzed their physical and chemical properties, chromosome location, phylogenetic relationship, gene structure, collinearity relationship, and cis-regulatory element. The gene expression pattern analysis showed that many StC3H genes are involved in potato growth and development, and their response to diverse environmental stresses. Furthermore, RT-qPCR data showed that the expression of many StC3H genes was induced by high temperatures, indicating that StC3H genes may play important roles in potato response to heat stress. In addition, Some StC3H genes were predominantly expressed in the stolon and developing tubers, suggesting that these StC3H genes may be involved in the regulation of tuber development. Together, these results provide new information on StC3H genes and will be helpful for further revealing the function of StC3H genes in the heat stress response and tuber development in potato.
Collapse
Affiliation(s)
- Zeyi Deng
- College of Agronomy and Biotechnology, Southwest University, Chongqing 400715, China; (Z.D.); (Z.Y.); (X.L.); (X.D.); (J.Z.)
| | - Zhijiang Yang
- College of Agronomy and Biotechnology, Southwest University, Chongqing 400715, China; (Z.D.); (Z.Y.); (X.L.); (X.D.); (J.Z.)
| | - Xinyan Liu
- College of Agronomy and Biotechnology, Southwest University, Chongqing 400715, China; (Z.D.); (Z.Y.); (X.L.); (X.D.); (J.Z.)
| | - Xiumei Dai
- College of Agronomy and Biotechnology, Southwest University, Chongqing 400715, China; (Z.D.); (Z.Y.); (X.L.); (X.D.); (J.Z.)
- Engineering Research Center of South Upland Agriculture, Ministry of Education, Chongqing 400715, China
| | - Jiankui Zhang
- College of Agronomy and Biotechnology, Southwest University, Chongqing 400715, China; (Z.D.); (Z.Y.); (X.L.); (X.D.); (J.Z.)
- Engineering Research Center of South Upland Agriculture, Ministry of Education, Chongqing 400715, China
| | - Kexuan Deng
- College of Agronomy and Biotechnology, Southwest University, Chongqing 400715, China; (Z.D.); (Z.Y.); (X.L.); (X.D.); (J.Z.)
- Engineering Research Center of South Upland Agriculture, Ministry of Education, Chongqing 400715, China
| |
Collapse
|
20
|
Computational prediction of disordered binding regions. Comput Struct Biotechnol J 2023; 21:1487-1497. [PMID: 36851914 PMCID: PMC9957716 DOI: 10.1016/j.csbj.2023.02.018] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2022] [Revised: 02/08/2023] [Accepted: 02/08/2023] [Indexed: 02/12/2023] Open
Abstract
One of the key features of intrinsically disordered regions (IDRs) is their ability to interact with a broad range of partner molecules. Multiple types of interacting IDRs were identified including molecular recognition fragments (MoRFs), short linear sequence motifs (SLiMs), and protein-, nucleic acids- and lipid-binding regions. Prediction of binding IDRs in protein sequences is gaining momentum in recent years. We survey 38 predictors of binding IDRs that target interactions with a diverse set of partners, such as peptides, proteins, RNA, DNA and lipids. We offer a historical perspective and highlight key events that fueled efforts to develop these methods. These tools rely on a diverse range of predictive architectures that include scoring functions, regular expressions, traditional and deep machine learning and meta-models. Recent efforts focus on the development of deep neural network-based architectures and extending coverage to RNA, DNA and lipid-binding IDRs. We analyze availability of these methods and show that providing implementations and webservers results in much higher rates of citations/use. We also make several recommendations to take advantage of modern deep network architectures, develop tools that bundle predictions of multiple and different types of binding IDRs, and work on algorithms that model structures of the resulting complexes.
Collapse
|
21
|
Pradhan UK, Meher PK, Naha S, Pal S, Gupta A, Parsad R. PlDBPred: a novel computational model for discovery of DNA binding proteins in plants. Brief Bioinform 2023; 24:6840070. [PMID: 36416116 DOI: 10.1093/bib/bbac483] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2022] [Revised: 10/10/2022] [Accepted: 10/11/2022] [Indexed: 11/24/2022] Open
Abstract
DNA-binding proteins (DBPs) play crucial roles in numerous cellular processes including nucleotide recognition, transcriptional control and the regulation of gene expression. Majority of the existing computational techniques for identifying DBPs are mainly applicable to human and mouse datasets. Even though some models have been tested on Arabidopsis, they produce poor accuracy when applied to other plant species. Therefore, it is imperative to develop an effective computational model for predicting plant DBPs. In this study, we developed a comprehensive computational model for plant specific DBPs identification. Five shallow learning and six deep learning models were initially used for prediction, where shallow learning methods outperformed deep learning algorithms. In particular, support vector machine achieved highest repeated 5-fold cross-validation accuracy of 94.0% area under receiver operating characteristic curve (AUC-ROC) and 93.5% area under precision recall curve (AUC-PR). With an independent dataset, the developed approach secured 93.8% AUC-ROC and 94.6% AUC-PR. While compared with the state-of-art existing tools by using an independent dataset, the proposed model achieved much higher accuracy. Overall results suggest that the developed computational model is more efficient and reliable as compared to the existing models for the prediction of DBPs in plants. For the convenience of the majority of experimental scientists, the developed prediction server PlDBPred is publicly accessible at https://iasri-sg.icar.gov.in/pldbpred/.The source code is also provided at https://iasri-sg.icar.gov.in/pldbpred/source_code.php for prediction using a large-size dataset.
Collapse
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi-110012, India
| | - Soumen Pal
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi-110012, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi-110012, India
| |
Collapse
|
22
|
Du X, Hu J. Deep Multi-Label Joint Learning for RNA and DNA-Binding Proteins Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:307-320. [PMID: 35148267 DOI: 10.1109/tcbb.2022.3150280] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
The recognition of DNA- (DBPs) and RNA-binding proteins (RBPs) is not only conducive to understanding cell function, but also a challenging task. Previous studies have shown that these proteins are usually considered separately due to different binding domains. In addition, due to the high similarity between DBPs and RBPs, it is possible for DBPs predictor to predict RBPs as DBPs, and vice versa, which leads to high cross-prediction rate. In this study, we creatively propose a novel deep multi-label joint learning framework to leverage the relationship between multiple labels and binding proteins. First, a multi-label variant network is designed to explore multi-scale context hidden information. Then, multi-label Long Short-Term Memory (multiLSTM) is used to mine the potential relationship between labels. Finally, the calibrated hidden features from variant network are considered for different levels of joint learning so that multiLSTM can better explore the correlation between them. Extensive experiments are also carried out to compare the proposed method with other existing methods. Furthermore, we also provide further insights into the importance of the relevant bioanalysis of proteins obtained from our model and summarize these binding proteins that are significantly related to a disease. Our method is freely available at http://39.108.90.186/dmlj.
Collapse
|
23
|
Yan Y, Huang T. The Interactome of Protein, DNA, and RNA. Methods Mol Biol 2023; 2695:89-110. [PMID: 37450113 DOI: 10.1007/978-1-0716-3346-5_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/18/2023]
Abstract
Proteins participate in many processes of the organism and are very important for maintaining the health of the organism. However, proteins cannot function independently in the body. They must interact with proteins, DNA, RNA, and other substances to perform biological functions and maintain the body's health. At present, there are many experimental methods and software tools that can detect and predict the interaction between proteins and other substances. There are also many databases that record the interaction between proteins and other substances. This article mainly describes protein-protein, protein-DNA, and protein-RNA interactions in detail by introducing some commonly used experimental methods, the software tools produced with the accumulation of experimental data and the rapid development of machine learning, and the related databases that record the relationship between proteins and some substances. By this review, we hope that through the analysis and summary of various aspects, it will be convenient for researchers to conduct further research on protein interactions.
Collapse
Affiliation(s)
- Yuyao Yan
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Tao Huang
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China.
| |
Collapse
|
24
|
Wu Z, Basu S, Wu X, Kurgan L. qNABpredict: Quick, accurate, and taxonomy-aware sequence-based prediction of content of nucleic acid binding amino acids. Protein Sci 2023; 32:e4544. [PMID: 36519304 PMCID: PMC9798252 DOI: 10.1002/pro.4544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2022] [Revised: 12/07/2022] [Accepted: 12/08/2022] [Indexed: 12/23/2022]
Abstract
Protein sequence-based predictors of nucleic acid (NA)-binding include methods that predict NA-binding proteins and NA-binding residues. The residue-level tools produce more details but suffer high computational cost since they must predict every amino acid in the input sequence and rely on multiple sequence alignments. We propose an alternative approach that predicts content (fraction) of the NA-binding residues, offering more information than the protein-level prediction and much shorter runtime than the residue-level tools. Our first-of-its-kind content predictor, qNABpredict, relies on a small, rationally designed and fast-to-compute feature set that represents relevant characteristics extracted from the input sequence and a well-parametrized support vector regression model. We provide two versions of qNABpredict, a taxonomy-agnostic model that can be used for proteins of unknown taxonomic origin and more accurate taxonomy-aware models that are tailored to specific taxonomic kingdoms: archaea, bacteria, eukaryota, and viruses. Empirical tests on a low-similarity test dataset show that qNABpredict is 100 times faster and generates statistically more accurate content predictions when compared to the content extracted from results produced by the residue-level predictors. We also show that qNABpredict's content predictions can be used to improve results generated by the residue-level predictors. We release qNABpredict as a convenient webserver and source code at http://biomine.cs.vcu.edu/servers/qNABpredict/. This new tool should be particularly useful to predict details of protein-NA interactions for large protein families and proteomes.
Collapse
Affiliation(s)
- Zhonghua Wu
- School of Mathematical Sciences and LPMCNankai UniversityTianjinChina
| | - Sushmita Basu
- Department of Computer ScienceVirginia Commonwealth UniversityRichmondVirginiaUSA
| | - Xuantai Wu
- School of Mathematical Sciences and LPMCNankai UniversityTianjinChina
| | - Lukasz Kurgan
- Department of Computer ScienceVirginia Commonwealth UniversityRichmondVirginiaUSA
| |
Collapse
|
25
|
Wang N, Zhang J, Liu B. iDRBP-EL: Identifying DNA- and RNA- Binding Proteins Based on Hierarchical Ensemble Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:432-441. [PMID: 34932484 DOI: 10.1109/tcbb.2021.3136905] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Identification of DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) from the primary sequences is essential for further exploring protein-nucleic acid interactions. Previous studies have shown that machine-learning-based methods can efficiently identify DBPs or RBPs. However, the information used in these methods is slightly unitary, and most of them only can predict DBPs or RBPs. In this study, we proposed a computational predictor iDRBP-EL to identify DNA- and RNA- binding proteins, and introduced hierarchical ensemble learning to integrate three level information. The method can integrate the information of different features, machine learning algorithms and data into one multi-label model. The ablation experiment showed that the fusion of different information can improve the prediction performance and overcome the cross-prediction problem. Experimental results on the independent datasets showed that iDRBP-EL outperformed all the other competing methods. Moreover, we established a user-friendly webserver iDRBP-EL (http://bliulab.net/iDRBP-EL), which can predict both DBPs and RBPs only based on protein sequences.
Collapse
|
26
|
Pang Y, Liu B. DMFpred: Predicting protein disorder molecular functions based on protein cubic language model. PLoS Comput Biol 2022; 18:e1010668. [PMID: 36315580 PMCID: PMC9674156 DOI: 10.1371/journal.pcbi.1010668] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2022] [Revised: 11/18/2022] [Accepted: 10/19/2022] [Indexed: 11/05/2022] Open
Abstract
Intrinsically disordered proteins and regions (IDP/IDRs) are widespread in living organisms and perform various essential molecular functions. These functions are summarized as six general categories, including entropic chain, assembler, scavenger, effector, display site, and chaperone. The alteration of IDP functions is responsible for many human diseases. Therefore, identifying the function of disordered proteins is helpful for the studies of drug target discovery and rational drug design. Experimental identification of the molecular functions of IDP in the wet lab is an expensive and laborious procedure that is not applicable on a large scale. Some computational methods have been proposed and mainly focus on predicting the entropic chain function of IDRs, while the computational predictive methods for the remaining five important categories of disordered molecular functions are desired. Motivated by the growing numbers of experimental annotated functional sequences and the need to expand the coverage of disordered protein function predictors, we proposed DMFpred for disordered molecular functions prediction, covering disordered assembler, scavenger, effector, display site and chaperone. DMFpred employs the Protein Cubic Language Model (PCLM), which incorporates three protein language models for characterizing sequences, structural and functional features of proteins, and attention-based alignment for understanding the relationship among three captured features and generating a joint representation of proteins. The PCLM was pre-trained with large-scaled IDR sequences and fine-tuned with functional annotation sequences for molecular function prediction. The predictive performance evaluation on five categories of functional and multi-functional residues suggested that DMFpred provides high-quality predictions. The web-server of DMFpred can be freely accessed from http://bliulab.net/DMFpred/.
Collapse
Affiliation(s)
- Yihe Pang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
- * E-mail:
| |
Collapse
|
27
|
Brandt AL, Garai S, Zagzoog A, Hurst DP, Stevenson LA, Pertwee RG, Imler GH, Reggio PH, Thakur GA, Laprairie RB. Pharmacological evaluation of enantiomerically separated positive allosteric modulators of cannabinoid 1 receptor, GAT591 and GAT593. Front Pharmacol 2022; 13:919605. [PMID: 36386195 PMCID: PMC9640980 DOI: 10.3389/fphar.2022.919605] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2022] [Accepted: 10/07/2022] [Indexed: 11/09/2023] Open
Abstract
Positive allosteric modulation of the type 1 cannabinoid receptor (CB1R) has substantial potential to treat both neurological and immune disorders. To date, a few studies have evaluated the structure-activity relationship (SAR) for CB1R positive allosteric modulators (PAMs). In this study, we separated the enantiomers of the previously characterized two potent CB1R ago-PAMs GAT591 and GAT593 to determine their biochemical activity at CB1R. Separating the enantiomers showed that the R-enantiomers (GAT1665 and GAT1667) displayed mixed allosteric agonist-PAM activity at CB1R while the S-enantiomers (GAT1664 and GAT1666) showed moderate activity. Furthermore, we observed that the R and S-enantiomers had distinct binding sites on CB1R, which led to their distinct behavior both in vitro and in vivo. The R-enantiomers (GAT1665 and GAT1667) produced ago-PAM effects in vitro, and PAM effects in the in vivo behavioral triad, indicating that the in vivo activity of these ligands may occur via PAM rather than agonist-based mechanisms. Overall, this study provides mechanistic insight into enantiospecific interaction of 2-phenylindole class of CB1R allosteric modulators, which have shown therapeutic potential in the treatment of pain, epilepsy, glaucoma, and Huntington's disease.
Collapse
Affiliation(s)
- Asher L. Brandt
- College of Pharmacy and Nutrition, University of Saskatchewan, Saskatoon, SK, Canada
| | - Sumanta Garai
- Department of Pharmaceutical Sciences, Bouvé College of Health Sciences, Boston, MA, United States
| | - Ayat Zagzoog
- College of Pharmacy and Nutrition, University of Saskatchewan, Saskatoon, SK, Canada
| | - Dow P. Hurst
- Center for Drug Discovery, University of North Carolina Greensboro, Greensboro, NC, United States
| | - Lesley A. Stevenson
- School of Medicine, Medical Sciences and Nutrition, Institute of Medical Sciences, University of Aberdeen, Aberdeen, Scotland, United Kingdom
| | - Roger G. Pertwee
- School of Medicine, Medical Sciences and Nutrition, Institute of Medical Sciences, University of Aberdeen, Aberdeen, Scotland, United Kingdom
| | - Gregory H. Imler
- Centre for Biomolecular Science and Engineering, Naval Research Laboratory, Washington, DC, United States
| | - Patricia H. Reggio
- Center for Drug Discovery, University of North Carolina Greensboro, Greensboro, NC, United States
| | - Ganesh A. Thakur
- Department of Pharmaceutical Sciences, Bouvé College of Health Sciences, Boston, MA, United States
| | - Robert B. Laprairie
- College of Pharmacy and Nutrition, University of Saskatchewan, Saskatoon, SK, Canada
- Department of Pharmacology, Faculty of Medicine, Dalhousie University, Halifax, NS, Canada
| |
Collapse
|
28
|
Wang C, Zong X, Wu F, Leung RWT, Hu Y, Qin J. DNA- and RNA-Binding Proteins Linked Transcriptional Control and Alternative Splicing Together in a Two-Layer Regulatory Network System of Chronic Myeloid Leukemia. Front Mol Biosci 2022; 9:920492. [PMID: 36052164 PMCID: PMC9425088 DOI: 10.3389/fmolb.2022.920492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2022] [Accepted: 06/24/2022] [Indexed: 11/30/2022] Open
Abstract
DNA- and RNA-binding proteins (DRBPs) typically possess multiple functions to bind both DNA and RNA and regulate gene expression from more than one level. They are controllers for post-transcriptional processes, such as splicing, polyadenylation, transportation, translation, and degradation of RNA transcripts in eukaryotic organisms, as well as regulators on the transcriptional level. Although DRBPs are reported to play critical roles in various developmental processes and diseases, it is still unclear how they work with DNAs and RNAs simultaneously and regulate genes at the transcriptional and post-transcriptional levels. To investigate the functional mechanism of DRBPs, we collected data from a variety of databases and literature and identified 118 DRBPs, which function as both transcription factors (TFs) and splicing factors (SFs), thus called DRBP-SF. Extensive investigations were conducted on four DRBP-SFs that were highly expressed in chronic myeloid leukemia (CML), heterogeneous nuclear ribonucleoprotein K (HNRNPK), heterogeneous nuclear ribonucleoprotein L (HNRNPL), non-POU domain–containing octamer–binding protein (NONO), and TAR DNA-binding protein 43 (TARDBP). By integrating and analyzing ChIP-seq, CLIP-seq, RNA-seq, and shRNA-seq data in K562 using binding and expression target analysis and Statistical Utility for RBP Functions, we discovered a two-layer regulatory network system centered on these four DRBP-SFs and proposed three possible regulatory models where DRBP-SFs can connect transcriptional and alternative splicing regulatory networks cooperatively in CML. The exploration of the identified DRBP-SFs provides new ideas for studying DRBP and regulatory networks, holding promise for further mechanistic discoveries of the two-layer gene regulatory system that may play critical roles in the occurrence and development of CML.
Collapse
Affiliation(s)
- Chuhui Wang
- School of Pharmaceutical Sciences (Shenzhen), Sun Yat-sen University, Shenzhen, China
| | - Xueqing Zong
- School of Pharmaceutical Sciences (Shenzhen), Sun Yat-sen University, Shenzhen, China
| | - Fanjie Wu
- School of Pharmaceutical Sciences (Shenzhen), Sun Yat-sen University, Shenzhen, China
| | - Ricky Wai Tak Leung
- School of Pharmaceutical Sciences (Shenzhen), Sun Yat-sen University, Shenzhen, China
- College of Professional and Continuing Education, The Hong Kong Polytechnic University, Hong Kong, China
| | - Yaohua Hu
- Shenzhen Key Laboratory of Advanced Machine Learning and Applications, College of Mathematics and Statistics, Shenzhen University, Shenzhen, China
- *Correspondence: Yaohua Hu, ; Jing Qin,
| | - Jing Qin
- School of Pharmaceutical Sciences (Shenzhen), Sun Yat-sen University, Shenzhen, China
- *Correspondence: Yaohua Hu, ; Jing Qin,
| |
Collapse
|
29
|
Feng J, Wang N, Zhang J, Liu B. iDRBP-ECHF: Identifying DNA- and RNA-binding proteins based on extensible cubic hybrid framework. Comput Biol Med 2022; 149:105940. [PMID: 36044786 DOI: 10.1016/j.compbiomed.2022.105940] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2022] [Revised: 07/10/2022] [Accepted: 08/06/2022] [Indexed: 11/28/2022]
Abstract
Proteins interact with nucleic acids to regulate the life activities of organisms. Therefore, how to accurately and efficiently identify nucleic acid-binding proteins (NABPs) is particularly significant. Some sequence-based computational methods have been proposed to identify DNA- and RNA-binding proteins in previous studies. However, the benchmark datasets used by these methods ignore the proportion of NABPs in the real world, and some integration methods only integrate traditional machine learning algorithms, resulting in limited prediction performance. In this study, we proposed a sequence-based method called iDRBP-ECHF to predict the DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs). We constructed a benchmark dataset by considering the proportion of positive and negative samples in the real world, and used down-sampling to generate three relatively balanced datasets to train the iDRBP-ECHF. In addition, we incorporated the deep learning algorithms into the framework to obtain a more compact high-level feature representation of the input data. The results on two independent datasets show that it achieves the most advanced performance and is superior to the other existing sequence-based DBP and RBP prediction methods. In addition, we set up a webserver iDRBP-ECHF, which can be accessed at http://bliulab.net/iDRBP-ECHF.
Collapse
Affiliation(s)
- Jiawei Feng
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
| | - Ning Wang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
| | - Jun Zhang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, 518055, China.
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, 100081, China.
| |
Collapse
|
30
|
Qiu XY, Wu H, Shao J. TALE-cmap: Protein function prediction based on a TALE-based architecture and the structure information from contact map. Comput Biol Med 2022; 149:105938. [DOI: 10.1016/j.compbiomed.2022.105938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2022] [Revised: 07/26/2022] [Accepted: 08/06/2022] [Indexed: 11/03/2022]
|
31
|
Wang N, Zhang J, Liu B. IDRBP-PPCT: Identifying Nucleic Acid-Binding Proteins Based on Position-Specific Score Matrix and Position-Specific Frequency Matrix Cross Transformation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2284-2293. [PMID: 33780341 DOI: 10.1109/tcbb.2021.3069263] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) are two important nucleic acid-binding proteins (NABPs), which play important roles in biological processes such as replication, translation and transcription of genetic material. Some proteins (DRBPs) bind to both DNA and RNA, also play a key role in gene expression. Identification of DBPs, RBPs and DRBPs is important to study protein-nucleic acid interactions. Computational methods are increasingly being proposed to automatically identify DNA- or RNA-binding proteins based only on protein sequences. One challenge is to design an effective protein representation method to convert protein sequences into fixed-dimension feature vectors. In this study, we proposed a novel protein representation method called Position-Specific Scoring Matrix (PSSM) and Position-Specific Frequency Matrix (PSFM) Cross Transformation (PPCT) to represent protein sequences. This method contains the evolutionary information in PSSM and PSFM, and their correlations. A new computational predictor called IDRBP-PPCT was proposed by combining PPCT and the two-layer framework based on the random forest algorithm to identify DBPs, RBPs and DRBPs. The experimental results on the independent dataset and the tomato genome proved the effectiveness of the proposed method. A user-friendly web-server of IDRBP-PPCT was constructed, which is freely available at http://bliulab.net/IDRBP-PPCT.
Collapse
|
32
|
Comparative Analysis on Alignment-Based and Pretrained Feature Representations for the Identification of DNA-Binding Proteins. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:5847242. [PMID: 35799660 PMCID: PMC9256349 DOI: 10.1155/2022/5847242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/10/2022] [Accepted: 06/07/2022] [Indexed: 11/17/2022]
Abstract
The interaction between DNA and protein is vital for the development of a living body. Previous numerous studies on in silico identification of DNA-binding proteins (DBPs) usually include features extracted from the alignment-based (pseudo) position-specific scoring matrix (PSSM), leading to limited application due to its time-consuming generation. Few researchers have paid attention to the application of pretrained language models at the scale of evolution to the identification of DBPs. To this end, we present comprehensive insights into a comparison study on alignment-based PSSM and pretrained evolutionary scale modeling (ESM) representations in the field of DBP classification. The comparison is conducted by extracting information from PSSM and ESM representations using four unified averaging operations and by performing various feature selection (FS) methods. Experimental results demonstrate that the pretrained ESM representation outperforms the PSSM-derived features in a fair comparison perspective. The pretrained feature presentation deserves wide application to the area of in silico DBP identification as well as other function annotation issues. Finally, it is also confirmed that an ensemble scheme by aggregating various trained FS models can significantly improve the classification performance of DBPs.
Collapse
|
33
|
Peng X, Wang X, Guo Y, Ge Z, Li F, Gao X, Song J. RBP-TSTL is a two-stage transfer learning framework for genome-scale prediction of RNA-binding proteins. Brief Bioinform 2022; 23:6596984. [PMID: 35649392 PMCID: PMC9294422 DOI: 10.1093/bib/bbac215] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2022] [Revised: 04/25/2022] [Accepted: 05/06/2022] [Indexed: 11/27/2022] Open
Abstract
RNA binding proteins (RBPs) are critical for the post-transcriptional control of RNAs and play vital roles in a myriad of biological processes, such as RNA localization and gene regulation. Therefore, computational methods that are capable of accurately identifying RBPs are highly desirable and have important implications for biomedical and biotechnological applications. Here, we propose a two-stage deep transfer learning-based framework, termed RBP-TSTL, for accurate prediction of RBPs. In the first stage, the knowledge from the self-supervised pre-trained model was extracted as feature embeddings and used to represent the protein sequences, while in the second stage, a customized deep learning model was initialized based on an annotated pre-training RBPs dataset before being fine-tuned on each corresponding target species dataset. This two-stage transfer learning framework can enable the RBP-TSTL model to be effectively trained to learn and improve the prediction performance. Extensive performance benchmarking of the RBP-TSTL models trained using the features generated by the self-supervised pre-trained model and other models trained using hand-crafting encoding features demonstrated the effectiveness of the proposed two-stage knowledge transfer strategy based on the self-supervised pre-trained models. Using the best-performing RBP-TSTL models, we further conducted genome-scale RBP predictions for Homo sapiens, Arabidopsis thaliana, Escherichia coli, and Salmonella and established a computational compendium containing all the predicted putative RBPs candidates. We anticipate that the proposed RBP-TSTL approach will be explored as a useful tool for the characterization of RNA-binding proteins and exploration of their sequence–structure–function relationships.
Collapse
Affiliation(s)
- Xinxin Peng
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia.,Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
| | - Xiaoyu Wang
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia.,Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
| | - Yuming Guo
- Department of Epidemiology and Preventive Medicine, School of Public Health and Preventive Medicine, Monash University, Melbourne, Victoria 3004, Australia
| | - Zongyuan Ge
- Monash e-Research Centre and Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia
| | - Fuyi Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia.,Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia.,College of Information Engineering, Northwest A&F University, Yangling, 712100, China
| | - Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia.,KAUST Computational Bioscience Research Center, King Abdullah University of Science and Technology
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia.,Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
| |
Collapse
|
34
|
Yan J, Jiang T, Liu J, Lu Y, Guan S, Li H, Wu H, Ding Y. DNA-binding protein prediction based on deep transfer learning. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:7719-7736. [PMID: 35801442 DOI: 10.3934/mbe.2022362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
The study of DNA binding proteins (DBPs) is of great importance in the biomedical field and plays a key role in this field. At present, many researchers are working on the prediction and detection of DBPs. Traditional DBP prediction mainly uses machine learning methods. Although these methods can obtain relatively high pre-diction accuracy, they consume large quantities of human effort and material resources. Transfer learning has certain advantages in dealing with such prediction problems. Therefore, in the present study, two features were extracted from a protein sequence, a transfer learning method was used, and two classical transfer learning algorithms were compared to transfer samples and construct data sets. In the final step, DBPs are detected by building a deep learning neural network model in a way that uses attention mechanisms.
Collapse
Affiliation(s)
- Jun Yan
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Tengsheng Jiang
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Junkai Liu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Yaoyao Lu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Shixuan Guan
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Haiou Li
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Hongjie Wu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Suzhou Smart City Research Institute, Suzhou University of Science and Technology, Suzhou, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| |
Collapse
|
35
|
Zhang J, Yan K, Chen Q, Liu B. PreRBP-TL: prediction of species-specific RNA-binding proteins based on transfer learning. Bioinformatics 2022; 38:2135-2143. [PMID: 35176130 DOI: 10.1093/bioinformatics/btac106] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2021] [Revised: 11/18/2021] [Accepted: 02/15/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION RNA-binding proteins (RBPs) play crucial roles in post-transcriptional regulation. Accurate identification of RBPs helps to understand gene expression, regulation, etc. In recent years, some computational methods were proposed to identify RBPs. However, these methods fail to accurately identify RBPs from some specific species with limited data, such as bacteria. RESULTS In this study, we introduce a computational method called PreRBP-TL for identifying species-specific RBPs based on transfer learning. The weights of the prediction model were initialized by pretraining with the large general RBP dataset and then fine-tuned with the small species-specific RPB dataset by using transfer learning. The experimental results show that the PreRBP-TL achieves better performance for identifying the species-specific RBPs from Human, Arabidopsis, Escherichia coli and Salmonella, outperforming eight state-of-the-art computational methods. It is anticipated PreRBP-TL will become a useful method for identifying RBPs. AVAILABILITY AND IMPLEMENTATION For the convenience of researchers to identify RBPs, the web server of PreRBP-TL was established, freely available at http://bliulab.net/PreRBP-TL. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jun Zhang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Qingcai Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China.,School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China.,Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China
| |
Collapse
|
36
|
|
37
|
Chen Z, Jiao S, Zhao D, Zou Q, Xu L, Zhang L, Su X. The Characterization of Structure and Prediction for Aquaporin in Tumour Progression by Machine Learning. Front Cell Dev Biol 2022; 10:845622. [PMID: 35178393 PMCID: PMC8844512 DOI: 10.3389/fcell.2022.845622] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2021] [Accepted: 01/17/2022] [Indexed: 11/21/2022] Open
Abstract
Recurrence and new cases of cancer constitute a challenging human health problem. Aquaporins (AQPs) can be expressed in many types of tumours, including the brain, breast, pancreas, colon, skin, ovaries, and lungs, and the histological grade of cancer is positively correlated with AQP expression. Therefore, the identification of aquaporins is an area to explore. Computational tools play an important role in aquaporin identification. In this research, we propose reliable, accurate and automated sequence predictor iAQPs-RF to identify AQPs. In this study, the feature extraction method was 188D (global protein sequence descriptor, GPSD). Six common classifiers, including random forest (RF), NaiveBayes (NB), support vector machine (SVM), XGBoost, logistic regression (LR) and decision tree (DT), were used for AQP classification. The classification results show that the random forest (RF) algorithm is the most suitable machine learning algorithm, and the accuracy was 97.689%. Analysis of Variance (ANOVA) was used to analyse these characteristics. Feature rank based on the ANOVA method and IFS strategy was applied to search for the optimal features. The classification results suggest that the 26th feature (neutral/hydrophobic) and 21st feature (hydrophobic) are the two most powerful and informative features that distinguish AQPs from non-AQPs. Previous studies reported that plasma membrane proteins have hydrophobic characteristics. Aquaporin subcellular localization prediction showed that all aquaporins were plasma membrane proteins with highly conserved transmembrane structures. In addition, the 3D structure of aquaporins was consistent with the localization results. Therefore, these studies confirmed that aquaporins possess hydrophobic properties. Although aquaporins are highly conserved transmembrane structures, the phylogenetic tree shows the diversity of aquaporins during evolution. The PCA showed that positive and negative samples were well separated by 54D features, indicating that the 54D feature can effectively classify aquaporins. The online prediction server is accessible at http://lab.malab.cn/∼acy/iAQP.
Collapse
Affiliation(s)
- Zheng Chen
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, Shenzhen, China.,Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Da Zhao
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, Shenzhen, China.,Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.,Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Lijun Zhang
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, Shenzhen, China
| | - Xi Su
- Foshan Maternal and Child Health Hospital, Foshan, China
| |
Collapse
|
38
|
Li X, Lu L, Chen L. Identification of protein functions in mouse with a label space partition method. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:3820-3842. [PMID: 35341276 DOI: 10.3934/mbe.2022176] [Citation(s) in RCA: 25] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
Protein is very important for almost all living creatures because it participates in most complicated and essential biological processes. Determining the functions of given proteins is one of the most essential problems in protein science. Such determination can be conducted through traditional experiments. However, the experimental methods are always time-consuming and of high costs. In recent years, computational methods give useful aids for identification of protein functions. This study presented a new multi-label classifier for identifying functions of mouse proteins. Due to the number of functional types, which were termed as labels in the classification procedure, a label space partition method was employed to divide labels into some partitions. On each partition, a multi-label classifier was constructed. The classifiers based on all partitions were integrated in the proposed classifier. The cross-validation results proved that the proposed classifier was of good performance. Classifiers with label partition were superior to those without label partition or with random label partition.
Collapse
Affiliation(s)
- Xuan Li
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| | - Lin Lu
- Department of Radiology, Columbia University Medical Center, New York 10032, USA
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
| |
Collapse
|
39
|
Lin C, Wang L, Shi L. AAPred-CNN: accurate predictor based on deep convolution neural network for identification of anti-angiogenic peptides. Methods 2022; 204:442-448. [PMID: 35031486 DOI: 10.1016/j.ymeth.2022.01.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2021] [Revised: 12/28/2021] [Accepted: 01/09/2022] [Indexed: 12/13/2022] Open
Abstract
Recently, deep learning techniques have been developed for various bioactive peptide prediction tasks. However, there are only conventional machine learning-based methods for the prediction of anti-angiogenic peptides (AAP), which play an important role in cancer treatment. The main reason why no deep learning method has been involved in this field is that there are too few experimentally validated AAPs to support the training of deep models but researchers have believed that deep learning seriously depends on the amounts of labeled data. In this paper, as a tentative work, we try to predict AAP by constructing different classical deep learning models and propose the first deep convolution neural network-based predictor (AAPred-CNN) for AAP. Contrary to intuition, the experimental results show that deep learning models can achieve superior or comparable performance to the state-of-the-art model, although they are given a few labeled sequences to train. We also decipher the influence of hyper-parameters and training samples on the performance of deep learning models to help understand how the model work. Furthermore, we also visualize the learned embeddings by dimension reduction to increase the model interpretability and reveal the residue propensity of AAP through the statistics of convolutional features for different residues. In summary, this work demonstrates the powerful representation ability of AAPred-CNNfor AAP prediction, further improving the prediction accuracy of AAP.
Collapse
Affiliation(s)
- Changhang Lin
- School of Big Data and Artificial Intelligence, Fujian Polytechnic Normal University, Fuzhou, China
| | - Lei Wang
- Beidahuang Industry Group General Hospital, Harbin, China.
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China.
| |
Collapse
|
40
|
Cui F, Li S, Zhang Z, Sui M, Cao C, El-Latif Hesham A, Zou Q. DeepMC-iNABP: Deep learning for multiclass identification and classification of nucleic acid-binding proteins. Comput Struct Biotechnol J 2022; 20:2020-2028. [PMID: 35521556 PMCID: PMC9065708 DOI: 10.1016/j.csbj.2022.04.029] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Revised: 04/06/2022] [Accepted: 04/20/2022] [Indexed: 11/29/2022] Open
Abstract
Nucleic acid-binding proteins (NABPs), including DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs), play vital roles in gene expression. Accurate identification of these proteins is crucial. However, there are two existing challenges: one is the problem of ignoring DNA- and RNA-binding proteins (DRBPs), and the other is a cross-predicting problem referring to DBP predictors predicting DBPs as RBPs, and vice versa. In this study, we proposed a computational predictor, called DeepMC-iNABP, with the goal of solving these difficulties by utilizing a multiclass classification strategy and deep learning approaches. DBPs, RBPs, DRBPs and non-NABPs as separate classes of data were used for training the DeepMC-iNABP model. The results on test data collected in this study and two independent test datasets showed that DeepMC-iNABP has a strong advantage in identifying the DRBPs and has the ability to alleviate the cross-prediction problem to a certain extent. The web-server of DeepMC-iNABP is freely available at http://www.deepmc-inabp.net/. The datasets used in this research can also be downloaded from the website.
Collapse
Affiliation(s)
- Feifei Cui
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Shuang Li
- Beidahuang Industry Group General Hospital, Harbin 150001, China
| | - Zilong Zhang
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Miaomiao Sui
- Graduate School Agricultural and Life Science, The University of Tokyo, Tokyo 1138657, Japan
| | - Chen Cao
- School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China
| | - Abd El-Latif Hesham
- Genetics Department, Faculty of Agriculture, Beni-Suef University, Beni-Suef 62511, Egypt
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Corresponding author at: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China.
| |
Collapse
|
41
|
RBPSpot: Learning on appropriate contextual information for RBP binding sites discovery. iScience 2021; 24:103381. [PMID: 34841226 PMCID: PMC8605353 DOI: 10.1016/j.isci.2021.103381] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2021] [Revised: 09/01/2021] [Accepted: 10/27/2021] [Indexed: 11/29/2022] Open
Abstract
Identifying the factors determining the RBP-RNA interactions remains a big challenge. It involves sparse binding motifs and a suitable sequence context for binding. The present work describes an approach to detect RBP binding sites in RNAs using an ultra-fast inexact k-mers search for statistically significant seeds. The seeds work as an anchor to evaluate the context and binding potential using flanking region information while leveraging from Deep Feed-forward Neural Network. The developed models also received support from MD-simulation studies. The implemented software, RBPSpot, scored consistently high for all the performance metrics including average accuracy of ∼90% across a large number of validated datasets. It outperformed the compared tools, including some with much complex deep-learning models, during a comprehensive benchmarking process. RBPSpot can identify RBP binding sites in the human system and can also be used to develop new models, making it a valuable resource in the area of regulatory system studies. Efficient motif anchoring helps to get good quality contextual information on binding Realistic and high granularity datasets ensure better performance of the classifiers DNN models on the contextual features outperform more complex deep learning tools RBPSpot algorithm may be used to develop RBP binding models for other species also
Collapse
|
42
|
Li HL, Pang YH, Liu B. BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models. Nucleic Acids Res 2021; 49:e129. [PMID: 34581805 PMCID: PMC8682797 DOI: 10.1093/nar/gkab829] [Citation(s) in RCA: 87] [Impact Index Per Article: 29.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2021] [Revised: 08/24/2021] [Accepted: 09/09/2021] [Indexed: 01/08/2023] Open
Abstract
In order to uncover the meanings of ‘book of life’, 155 different biological language models (BLMs) for DNA, RNA and protein sequence analysis are discussed in this study, which are able to extract the linguistic properties of ‘book of life’. We also extend the BLMs into a system called BioSeq-BLM for automatically representing and analyzing the sequence data. Experimental results show that the predictors generated by BioSeq-BLM achieve comparable or even obviously better performance than the exiting state-of-the-art predictors published in literatures, indicating that BioSeq-BLM will provide new approaches for biological sequence analysis based on natural language processing technologies, and contribute to the development of this very important field. In order to help the readers to use BioSeq-BLM for their own experiments, the corresponding web server and stand-alone package are established and released, which can be freely accessed at http://bliulab.net/BioSeq-BLM/.
Collapse
Affiliation(s)
- Hong-Liang Li
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Yi-He Pang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China.,Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
43
|
Jin X, Liao Q, Liu B. S2L-PSIBLAST: a supervised two-layer search framework based on PSI-BLAST for protein remote homology detection. Bioinformatics 2021; 37:4321-4327. [PMID: 34170287 DOI: 10.1093/bioinformatics/btab472] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2020] [Revised: 05/29/2021] [Accepted: 06/24/2021] [Indexed: 01/26/2023] Open
Abstract
MOTIVATION Protein remote homology detection is a challenging task for the studies of protein evolutionary relationships. PSI-BLAST is an important and fundamental search method for detecting homology proteins. Although many improved versions of PSI-BLAST have been proposed, their performance is limited by the search processes of PSI-BLAST. RESULTS For further improving the performance of PSI-BLAST for protein remote homology detection, a supervised two-layer search framework based on PSI-BLAST (S2L-PSIBLAST) is proposed. S2L-PSIBLAST consists of a two-level search: the first-level search provides high-quality search results by using SMI-BLAST framework and double-link strategy to filter the non-homology protein sequences, the second-level search detects more homology proteins by profile-link similarity, and more accurate ranking lists for those detected protein sequences are obtained by learning to rank strategy. Experimental results on the updated version of Structural Classification of Proteins-extended benchmark dataset show that S2L-PSIBLAST not only obviously improves the performance of PSI-BLAST, but also achieves better performance on two improved versions of PSI-BLAST: DELTA-BLAST and PSI-BLASTexB. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Xiaopeng Jin
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Qing Liao
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China.,School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China.,Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
44
|
Wang J, Zhao Y, Gong W, Liu Y, Wang M, Huang X, Tan J. EDLMFC: an ensemble deep learning framework with multi-scale features combination for ncRNA-protein interaction prediction. BMC Bioinformatics 2021; 22:133. [PMID: 33740884 PMCID: PMC7980572 DOI: 10.1186/s12859-021-04069-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2021] [Accepted: 03/05/2021] [Indexed: 11/29/2022] Open
Abstract
Background Non-coding RNA (ncRNA) and protein interactions play essential roles in various physiological and pathological processes. The experimental methods used for predicting ncRNA–protein interactions are time-consuming and labor-intensive. Therefore, there is an increasing demand for computational methods to accurately and efficiently predict ncRNA–protein interactions. Results In this work, we presented an ensemble deep learning-based method, EDLMFC, to predict ncRNA–protein interactions using the combination of multi-scale features, including primary sequence features, secondary structure sequence features, and tertiary structure features. Conjoint k-mer was used to extract protein/ncRNA sequence features, integrating tertiary structure features, then fed into an ensemble deep learning model, which combined convolutional neural network (CNN) to learn dominating biological information with bi-directional long short-term memory network (BLSTM) to capture long-range dependencies among the features identified by the CNN. Compared with other state-of-the-art methods under five-fold cross-validation, EDLMFC shows the best performance with accuracy of 93.8%, 89.7%, and 86.1% on RPI1807, NPInter v2.0, and RPI488 datasets, respectively. The results of the independent test demonstrated that EDLMFC can effectively predict potential ncRNA–protein interactions from different organisms. Furtherly, EDLMFC is also shown to predict hub ncRNAs and proteins presented in ncRNA–protein networks of Mus musculus successfully. Conclusions In general, our proposed method EDLMFC improved the accuracy of ncRNA–protein interaction predictions and anticipated providing some helpful guidance on ncRNA functions research. The source code of EDLMFC and the datasets used in this work are available at https://github.com/JingjingWang-87/EDLMFC. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04069-9.
Collapse
Affiliation(s)
- Jingjing Wang
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
| | - Yanpeng Zhao
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
| | - Weikang Gong
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
| | - Yang Liu
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
| | - Mei Wang
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
| | - Xiaoqian Huang
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
| | - Jianjun Tan
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China.
| |
Collapse
|