1
Wu S, Xu J, Guo JT. Accurate prediction of nucleic acid binding proteins using protein language model. Bioinformatics Advances 2025; 5:vbaf008. [PMID: 39990254 PMCID: PMC11845279 DOI: 10.1093/bioadv/vbaf008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2024] [Revised: 12/20/2024] [Accepted: 01/15/2025] [Indexed: 02/25/2025]
Abstract
Motivation: Nucleic acid binding proteins (NABPs) play critical roles in various and essential biological processes. Many machine learning-based methods have been developed to predict different types of NABPs. However, most of these studies have limited applications in predicting the types of NABPs for any given protein with unknown functions, due to several factors such as dataset construction, prediction scope and features used for training and testing. In addition, single-stranded DNA binding proteins (DBP) (SSBs) have not been extensively investigated for identifying novel SSBs from proteins with unknown functions.
Results: To improve prediction accuracy of different types of NABPs for any given protein, we developed hierarchical and multi-class models with machine learning-based methods and a feature extracted from protein language model ESM2. Our results show that by combining the feature from ESM2 and machine learning methods, we can achieve high prediction accuracy up to 95% for each stage in the hierarchical approach, and 85% for overall prediction accuracy from the multi-class approach. More importantly, besides the much improved prediction of other types of NABPs, the models can be used to accurately predict single-stranded DBPs, which is underexplored.
Availability and implementation: The datasets and code can be found at https://figshare.com/projects/Prediction_of_nucleic_acid_binding_proteins_using_protein_language_model/211555.
Affiliation(s)
- Siwen Wu
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, United States
- Jinbo Xu
  - Toyota Technological Institute at Chicago, Chicago, IL 60637, United States
- Jun-tao Guo
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, United States
2
Pradhan UK, Naha S, Das R, Gupta A, Parsad R, Meher PK. RBProkCNN: Deep learning on appropriate contextual evolutionary information for RNA binding protein discovery in prokaryotes. Comput Struct Biotechnol J 2024; 23:1631-1640. [PMID: 38660008 PMCID: PMC11039349 DOI: 10.1016/j.csbj.2024.04.034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Revised: 04/12/2024] [Accepted: 04/12/2024] [Indexed: 04/26/2024]
Abstract
RNA-binding proteins (RBPs) are central to key functions such as post-transcriptional regulation, mRNA stability, and adaptation to varied environmental conditions in prokaryotes. While the majority of research has concentrated on eukaryotic RBPs, recent developments underscore the crucial involvement of prokaryotic RBPs. Although computational methods have emerged in recent years to identify RBPs, they have fallen short in accurately identifying prokaryotic RBPs due to their generic nature. To bridge this gap, we introduce RBProkCNN, a novel machine learning-driven computational model meticulously designed for the accurate prediction of prokaryotic RBPs. The prediction process involves the utilization of eight shallow learning algorithms and four deep learning models, incorporating PSSM-based evolutionary features. By leveraging a convolutional neural network (CNN) and evolutionarily significant features selected through extreme gradient boosting variable importance measure, RBProkCNN achieved the highest accuracy in five-fold cross-validation, yielding 98.04% auROC and 98.19% auPRC. Furthermore, RBProkCNN demonstrated robust performance with an independent dataset, showcasing a commendable 95.77% auROC and 95.78% auPRC. Noteworthy is its superior predictive accuracy when compared to several state-of-the-art existing models. RBProkCNN is available as an online prediction tool (https://iasri-sg.icar.gov.in/rbprokcnn/), offering free access to interested users. This tool represents a substantial contribution, enriching the array of resources available for the accurate and efficient prediction of prokaryotic RBPs.
Affiliation(s)
- Upendra Kumar Pradhan
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Sanchita Naha
  - Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Ritwika Das
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Ajit Gupta
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Rajender Parsad
  - ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Prabina Kumar Meher
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
3
Krautwurst S, Lamkiewicz K. RNA-protein interaction prediction without high-throughput data: An overview and benchmark of in silico tools. Comput Struct Biotechnol J 2024; 23:4036-4046. [PMID: 39610906 PMCID: PMC11603007 DOI: 10.1016/j.csbj.2024.11.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2024] [Revised: 11/05/2024] [Accepted: 11/05/2024] [Indexed: 11/30/2024]
Abstract
RNA-protein interactions (RPIs) are crucial for the accurate operation of various processes in and between organisms across kingdoms of life. Mutual detection of RPI partner molecules depends on distinct sequential, structural, or thermodynamic features, which can be determined via experimental and bioinformatic methods. Still, the underlying molecular mechanisms of many RPIs are poorly understood. It is further hypothesized that many RPIs have not even been described yet. Computational RPI prediction is continuously challenged by the lack of data and of detailed research on very specific examples. With the discovery of novel RPI complexes in all kingdoms of life, adaptations of existing RPI prediction methods are necessary. Continuously improving computational RPI prediction is key to advancing the detailed understanding of RPIs and supplementing experimental RPI determination. The growing amount of data covering more species and detailed mechanisms supports the accuracy of prediction tools, which in turn support specific experimental research on RPIs. Here, we give an overview of RPI prediction tools that do not use high-throughput data as the user's input. We review the tools according to their input, usability, and output. We then apply the tools to known RPI examples across different kingdoms of life. Our comparison shows that the investigated prediction tools do not favor a certain species and equip the user with results varying in degree of information, from an overall RPI score to detailed interacting residues. Furthermore, we provide a guide tree to help users choose the RPI prediction tool appropriate for their available input data and desired output.
Affiliation(s)
- Sarah Krautwurst
  - RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany
  - European Virus Bioinformatics Center, Leutragraben 1, 07743 Jena, Germany
- Kevin Lamkiewicz
  - RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany
  - European Virus Bioinformatics Center, Leutragraben 1, 07743 Jena, Germany
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Puschstr. 4, 04103 Leipzig, Germany
4
Street LA, Rothamel KL, Brannan KW, Jin W, Bokor BJ, Dong K, Rhine K, Madrigal A, Al-Azzam N, Kim JK, Ma Y, Gorhe D, Abdou A, Wolin E, Mizrahi O, Ahdout J, Mujumdar M, Doron-Mandel E, Jovanovic M, Yeo GW. Large-scale map of RNA-binding protein interactomes across the mRNA life cycle. Mol Cell 2024; 84:3790-3809.e8. [PMID: 39303721 PMCID: PMC11530141 DOI: 10.1016/j.molcel.2024.08.030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Revised: 04/18/2024] [Accepted: 08/26/2024] [Indexed: 09/22/2024]
Abstract
mRNAs interact with RNA-binding proteins (RBPs) throughout their processing and maturation. While efforts have assigned RBPs to RNA substrates, less exploration has leveraged protein-protein interactions (PPIs) to study proteins in mRNA life-cycle stages. We generated an RNA-aware, RBP-centric PPI map across the mRNA life cycle in human cells by immunopurification-mass spectrometry (IP-MS) of ∼100 endogenous RBPs with and without RNase, augmented by size exclusion chromatography-mass spectrometry (SEC-MS). We identify 8,742 known and 20,802 unreported interactions between 1,125 proteins and determine that 73% of the IP-MS-identified interactions are RNA regulated. Our interactome links many proteins, some with unknown functions, to specific mRNA life-cycle stages, with nearly half associated with multiple stages. We demonstrate the value of this resource by characterizing the splicing and export functions of enhancer of rudimentary homolog (ERH), and by showing that small nuclear ribonucleoprotein U5 subunit 200 (SNRNP200) interacts with stress granule proteins and binds cytoplasmic RNA differently during stress.
Affiliation(s)
- Lena A Street
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Katherine L Rothamel
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Center for RNA Technologies and Therapeutics, University of California, San Diego, La Jolla, CA, USA
- Kristopher W Brannan
  - Center for RNA Therapeutics, Houston Methodist Research Institute, Houston, TX, USA; Department of Cardiovascular Sciences, Houston Methodist Research Institute, Houston, TX, USA
- Wenhao Jin
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA, USA
- Benjamin J Bokor
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Kevin Dong
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA, USA
- Kevin Rhine
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA, USA
- Assael Madrigal
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA, USA
- Norah Al-Azzam
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA, USA
- Jenny Kim Kim
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Yanzhe Ma
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Darvesh Gorhe
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Ahmed Abdou
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Erica Wolin
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Orel Mizrahi
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
- Joshua Ahdout
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA, USA
- Mayuresh Mujumdar
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA, USA
- Ella Doron-Mandel
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Marko Jovanovic
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Gene W Yeo
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Center for RNA Technologies and Therapeutics, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA, USA; Sanford Laboratories for Innovative Medicines, San Diego, CA, USA; Sanford Stem Cell Institute, Innovation Center, San Diego, CA, USA
5
Li X, Wei Z, Hu Y, Zhu X. GraphNABP: Identifying nucleic acid-binding proteins with protein graphs and protein language models. Int J Biol Macromol 2024; 280:135599. [PMID: 39276905 DOI: 10.1016/j.ijbiomac.2024.135599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2024] [Revised: 09/11/2024] [Accepted: 09/11/2024] [Indexed: 09/17/2024]
Abstract
The computational identification of nucleic acid-binding proteins (NABPs) is of great significance for understanding the mechanisms of these biological activities and for drug discovery. Although many sequence-based methods have been proposed to predict NABPs and have achieved promising performance, structural information is often overlooked. Moreover, the power of popular protein language models (pLMs) has seldom been harnessed for predicting NABPs. In this study, we propose a novel framework called GraphNABP to predict NABPs by integrating sequence and predicted 3D structure information. Specifically, sequence embeddings and protein molecular graphs were first obtained from the ProtT5 protein language model and predicted 3D structures, respectively. Then, graph attention (GAT) and bidirectional long short-term memory (BiLSTM) neural networks were used to enhance feature representations. Finally, a fully connected layer was used to predict NABPs. To the best of our knowledge, this is the first study to integrate AlphaFold and protein language models for the prediction of NABPs. Performance on multiple independent test sets indicates that GraphNABP outperforms other state-of-the-art methods. Our results demonstrate the effectiveness of pLM embeddings and structural information for NABP prediction. The codes and data used in this study are available at https://github.com/lixiangli01/GraphNABP.
Affiliation(s)
- Xiang Li
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Zhuoyu Wei
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Yueran Hu
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Xiaolei Zhu
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
6
Wu S, Guo JT. Improved prediction of DNA and RNA binding proteins with deep learning models. Brief Bioinform 2024; 25:bbae285. [PMID: 38856168 PMCID: PMC11163377 DOI: 10.1093/bib/bbae285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Revised: 05/20/2024] [Accepted: 05/31/2024] [Indexed: 06/11/2024]
Abstract
Nucleic acid-binding proteins (NABPs), including DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs), play important roles in essential biological processes. To facilitate functional annotation and accurate prediction of different types of NABPs, many machine learning-based computational approaches have been developed. However, the datasets used for training and testing as well as the prediction scopes in these studies have limited their applications. In this paper, we developed new strategies to overcome these limitations by generating more accurate and robust datasets and developing deep learning-based methods including both hierarchical and multi-class approaches to predict the types of NABPs for any given protein. The deep learning models employ two layers of convolutional neural network and one layer of long short-term memory. Our approaches outperform existing DBP and RBP predictors with a balanced prediction between DBPs and RBPs, and are more practically useful in identifying novel NABPs. The multi-class approach greatly improves the prediction accuracy of DBPs and RBPs, especially for the DBPs with ~12% improvement. Moreover, we explored the prediction accuracy of single-stranded DNA binding proteins and their effect on the overall prediction accuracy of NABP predictions.
Affiliation(s)
- Siwen Wu
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, United States
- Jun-tao Guo
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, United States
7
Arif M, Fang G, Ghulam A, Musleh S, Alam T. DPI_CDF: druggable protein identifier using cascade deep forest. BMC Bioinformatics 2024; 25:145. [PMID: 38580921 PMCID: PMC11334562 DOI: 10.1186/s12859-024-05744-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Accepted: 03/13/2024] [Indexed: 04/07/2024]
Abstract
BACKGROUND: Drug targets in living beings play pivotal roles in the discovery of potential drugs. Conventional wet-lab characterization of drug targets is accurate but generally expensive, slow, and resource intensive. Computational methods are therefore highly desirable as an alternative to expedite the large-scale identification of druggable proteins (DPs); however, the performance of existing in silico predictors is still not satisfactory.
METHODS: In this study, we developed a novel deep learning-based model, DPI_CDF, for predicting DPs based on protein sequence only. DPI_CDF utilizes evolutionary-based (i.e., histograms of oriented gradients for position-specific scoring matrix), physiochemical-based (i.e., component protein sequence representation), and compositional-based (i.e., normalized qualitative characteristic) properties of protein sequence to generate features. A hierarchical deep forest model then fuses these three encoding schemes to build the proposed model DPI_CDF.
RESULTS: The empirical outcomes of 10-fold cross-validation demonstrate that the proposed model achieved 99.13% accuracy and a Matthews correlation coefficient (MCC) of 0.982 on the training dataset. The generalization power of the trained model was further examined on an independent dataset, where it achieved a maximum accuracy of 95.01% and an MCC of 0.900. Compared with current state-of-the-art methods, DPI_CDF improves accuracy by 4.27% and 4.31% on the training and testing datasets, respectively. We believe DPI_CDF will support the research community in identifying druggable proteins and accelerate the drug discovery process.
AVAILABILITY: The benchmark datasets and source codes are available on GitHub: http://github.com/Muhammad-Arif-NUST/DPI_CDF.
Affiliation(s)
- Muhammad Arif
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Ge Fang
  - State Key Laboratory for Organic Electronics and Information Displays, Institute of Advanced Materials (IAM), Nanjing 210023, P. R. China
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Ali Ghulam
  - Information Technology Centre, Sindh Agriculture University, Sindh, Pakistan
- Saleh Musleh
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Tanvir Alam
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
8
Yan Y, Li W, Wang S, Huang T. Seq-RBPPred: Predicting RNA-Binding Proteins from Sequence. ACS Omega 2024; 9:12734-12742. [PMID: 38524500 PMCID: PMC10955590 DOI: 10.1021/acsomega.3c08381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 12/18/2023] [Accepted: 12/28/2023] [Indexed: 03/26/2024]
Abstract
RNA-binding proteins (RBPs) can interact with RNAs to regulate RNA translation, modification, splicing, and other important biological processes. The accurate identification of RBPs is of paramount importance for gaining insights into the intricate mechanisms underlying organismal life activities. Traditional experimental methods to identify RBPs require a lot of time and money, so it is important to develop computational methods to predict RBPs. However, the existing approaches for RBP prediction still require further improvement because RBPs remain unidentified in many species. In this study, we present Seq-RBPPred (predicting RBPs from sequence), a novel method that utilizes a comprehensive feature representation encompassing both biophysical properties and hidden-state features derived from protein sequences. Comprehensive performance evaluations of Seq-RBPPred demonstrate its superiority compared with state-of-the-art methods: on the testing set it achieved an overall accuracy of 0.922, a sensitivity of 0.926, a specificity of 0.903, and a Matthews correlation coefficient (MCC) of 0.757. The data and code of Seq-RBPPred are available at https://github.com/yaoyao-11/Seq-RBPPred.
Affiliation(s)
- Yuyao Yan
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Wenran Li
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Sijia Wang
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Tao Huang
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
9
Wilson B, Esmaeili F, Parsons M, Salah W, Su Z, Dutta A. sRNA-Effector: A tool to expedite discovery of small RNA regulators. iScience 2024; 27:109300. [PMID: 38469560 PMCID: PMC10926228 DOI: 10.1016/j.isci.2024.109300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Revised: 11/08/2023] [Accepted: 02/16/2024] [Indexed: 03/13/2024]
Abstract
microRNAs (miRNAs) are small regulatory RNAs that repress target mRNA transcripts through base pairing. Although the mechanisms of miRNA production and function are clearly established, new insights into miRNA regulation or miRNA-mediated gene silencing are still emerging. In order to facilitate the discovery of miRNA regulators or effectors, we have developed sRNA-Effector, a machine learning algorithm trained on enhanced crosslinking and immunoprecipitation sequencing and RNA sequencing data following knockdown of specific genes. sRNA-Effector can accurately identify known miRNA biogenesis and effector proteins and identifies 9 putative regulators of miRNA function, including serine/threonine kinase STK33, splicing factor SFPQ, and proto-oncogene BMI1. We validated the role of STK33, SFPQ, and BMI1 in miRNA regulation, showing that sRNA-Effector is useful for identifying new players in small RNA biology. sRNA-Effector will be a web tool available for all researchers to identify potential miRNA regulators in any cell line of interest.
Affiliation(s)
- Briana Wilson
  - Department of Biochemistry and Molecular Genetics, University of Virginia School of Medicine, Charlottesville, VA 22901, USA
- Fatemeh Esmaeili
  - Department of Genetics, University of Alabama at Birmingham, Birmingham, AL 35233, USA
- Matthew Parsons
  - Department of Biochemistry and Molecular Genetics, University of Virginia School of Medicine, Charlottesville, VA 22901, USA
- Wafa Salah
  - Department of Biochemistry and Molecular Genetics, University of Virginia School of Medicine, Charlottesville, VA 22901, USA
- Zhangli Su
  - Department of Genetics, University of Alabama at Birmingham, Birmingham, AL 35233, USA
- Anindya Dutta
  - Department of Genetics, University of Alabama at Birmingham, Birmingham, AL 35233, USA
10
Avila-Lopez P, Lauberth SM. Exploring new roles for RNA-binding proteins in epigenetic and gene regulation. Curr Opin Genet Dev 2024; 84:102136. [PMID: 38128453 PMCID: PMC11245729 DOI: 10.1016/j.gde.2023.102136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2023] [Revised: 11/12/2023] [Accepted: 11/15/2023] [Indexed: 12/23/2023]
Abstract
A significant portion of the human proteome comprises RNA-binding proteins (RBPs) that play fundamental roles in numerous biological processes. In the last decade, there has been a staggering increase in RBP identification and classification, which has fueled interest in the evolving roles of RBPs and RBP-driven molecular mechanisms. Here, we focus on recent insights into RBP-dependent regulation of the epigenetic and transcriptional landscape. We describe advances in methodologies that define the RNA-protein interactome and machine-learning algorithms that are streamlining RBP discovery and predicting new RNA-binding regions. Finally, we present how RBP dysregulation leads to alterations in tumor-promoting gene expression and discuss the potential for targeting these RBPs for the development of new cancer therapeutics.
Affiliation(s)
- Pedro Avila-Lopez
  - Simpson Querrey Institute for Epigenetics, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA; Department of Biochemistry and Molecular Genetics, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
- Shannon M Lauberth
  - Simpson Querrey Institute for Epigenetics, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA; Department of Biochemistry and Molecular Genetics, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
11
Pradhan UK, Meher PK, Naha S, Pal S, Gupta S, Gupta A, Parsad R. RBPLight: a computational tool for discovery of plant-specific RNA-binding proteins using light gradient boosting machine and ensemble of evolutionary features. Brief Funct Genomics 2023; 22:401-410. [PMID: 37158175 DOI: 10.1093/bfgp/elad016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2022] [Revised: 04/12/2023] [Accepted: 04/21/2023] [Indexed: 05/10/2023]
Abstract
RNA-binding proteins (RBPs) are essential for post-transcriptional gene regulation in eukaryotes, including splicing control, mRNA transport and decay. Thus, accurate identification of RBPs is important to understand gene expression and regulation of cell state. A number of computational models have been developed to detect RBPs. These methods made use of datasets from several eukaryotic species, specifically from mice and humans. Although some models have been tested on Arabidopsis, these techniques fall short of correctly identifying RBPs for other plant species. Therefore, the development of a powerful computational model for identifying plant-specific RBPs is needed. In this study, we present a novel computational model for identifying RBPs in plants. Five deep learning models and ten shallow learning algorithms were utilized for prediction with 20 sequence-derived and 20 evolutionary feature sets. The highest repeated five-fold cross-validation accuracy, 91.24% AU-ROC and 91.91% AU-PRC, was achieved by light gradient boosting machine. When evaluated using an independent dataset, the developed approach achieved 94.00% AU-ROC and 94.50% AU-PRC. The proposed model achieved significantly higher accuracy for predicting plant-specific RBPs as compared to the currently available state-of-the-art RBP prediction models. Although certain models have already been trained and assessed on the model organism Arabidopsis, this is the first comprehensive computational model for the discovery of plant-specific RBPs. The web server RBPLight was also developed, which is publicly accessible at https://iasri-sg.icar.gov.in/rbplight/, for the convenience of researchers to identify RBPs in plants.
Affiliation(s)
- Upendra K Pradhan
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Prabina K Meher
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Sanchita Naha
  - Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Soumen Pal
  - Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Sagar Gupta
  - CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur (HP) 176061, India
- Ajit Gupta
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
- Rajender Parsad
  - ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
12
Molzahn C, Kuechler ER, Zemlyankina I, Nierves L, Ali T, Cole G, Wang J, Albu RF, Zhu M, Cashman NR, Gilch S, Karsan A, Lange PF, Gsponer J, Mayor T. Shift of the insoluble content of the proteome in the aging mouse brain. Proc Natl Acad Sci U S A 2023; 120:e2310057120. [PMID: 37906643 PMCID: PMC10636323 DOI: 10.1073/pnas.2310057120] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/16/2023] [Accepted: 09/24/2023] [Indexed: 11/02/2023]
Abstract
During aging, the cellular response to unfolded proteins is believed to decline, resulting in diminished proteostasis. In model organisms, such as Caenorhabditis elegans, proteostatic decline with age has been linked to proteome solubility shifts and the onset of protein aggregation. However, this correlation has not been extensively characterized in aging mammals. To uncover age-dependent changes in the insoluble portion of a mammalian proteome, we analyzed the detergent-insoluble fraction of mouse brain tissue by mass spectrometry. We identified a group of 171 proteins, including the small heat shock protein α-crystallin, that become enriched in the detergent-insoluble fraction obtained from old mice. To enhance our ability to detect features associated with proteins in that fraction, we complemented our data with a meta-analysis of studies reporting the detergent-insoluble proteins in various mouse models of aging and neurodegeneration. Strikingly, insoluble proteins from young and old mice are distinct in several features in our study and across the collected literature data. In younger mice, proteins are more likely to be disordered, part of membraneless organelles, and involved in RNA binding. These traits become less prominent with age, as an increased number of structured proteins enter the pellet fraction. This analysis suggests that age-related changes to proteome organization lead a group of proteins with specific features to become detergent-insoluble. Importantly, these features are not consistent with those associated with proteins driving membraneless organelle formation. We see no evidence in our system of a general increase of condensate proteins in the detergent-insoluble fraction with age.
Affiliation(s)
- Cristen Molzahn
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Edward Leong Center for Healthy Aging, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
- Erich R. Kuechler
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Irina Zemlyankina
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Lorenz Nierves
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Michael Cuccione Childhood Cancer Research Program, British Columbia Children's Hospital Research Institute, Vancouver, BC V5Z 4H4, Canada
- Tahir Ali
- Faculty of Veterinary Medicine and Hotchkiss Brain Institute, University of Calgary, Calgary, AB T2N 4Z6, Canada
- Grace Cole
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- British Columbia Cancer Research Institute, Vancouver, BC V5Z 1L3, Canada
- Jing Wang
- Division of Neurology and Djavad Mowafaghian Centre for Brain Health, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
- Razvan F. Albu
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Mang Zhu
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Neil R. Cashman
- Division of Neurology and Djavad Mowafaghian Centre for Brain Health, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
- Sabine Gilch
- Faculty of Veterinary Medicine and Hotchkiss Brain Institute, University of Calgary, Calgary, AB T2N 4Z6, Canada
- Aly Karsan
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- British Columbia Cancer Research Institute, Vancouver, BC V5Z 1L3, Canada
- Philipp F. Lange
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Michael Cuccione Childhood Cancer Research Program, British Columbia Children's Hospital Research Institute, Vancouver, BC V5Z 4H4, Canada
- British Columbia Cancer Research Institute, Vancouver, BC V5Z 1L3, Canada
- Jörg Gsponer
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Thibault Mayor
- Department of Biochemistry and Molecular Biology, Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Edward Leong Center for Healthy Aging, University of British Columbia, Vancouver, BC V6T 1Z3, Canada
13
Liu X, Duan Y, Hong X, Xie J, Liu S. Challenges in structural modeling of RNA-protein interactions. Curr Opin Struct Biol 2023; 81:102623. [PMID: 37301066 DOI: 10.1016/j.sbi.2023.102623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2023] [Revised: 05/14/2023] [Accepted: 05/16/2023] [Indexed: 06/12/2023]
Abstract
In the past few years, the number of known RNA-binding proteins (RBPs) and RNA-RBP interactions has increased significantly. Here, we review recent developments in the methodology for protein-RNA and protein-protein complex structure modeling with deep learning and co-evolution, and discuss the challenges and opportunities for building a reliable approach for protein-RNA complex structure modeling. Protein Data Bank (PDB) and cross-linking immunoprecipitation (CLIP) data could be combined and used to infer the 2D geometry of protein-RNA interactions by deep learning.
Affiliation(s)
- Xudong Liu
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Yingtian Duan
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Xu Hong
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Juan Xie
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Shiyong Liu
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
14
Jin W, Brannan KW, Kapeli K, Park SS, Tan HQ, Gosztyla ML, Mujumdar M, Ahdout J, Henroid B, Rothamel K, Xiang JS, Wong L, Yeo GW. HydRA: Deep-learning models for predicting RNA-binding capacity from protein interaction association context and protein sequence. Mol Cell 2023; 83:2595-2611.e11. [PMID: 37421941 PMCID: PMC11098078 DOI: 10.1016/j.molcel.2023.06.019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/03/2023] [Revised: 03/20/2023] [Accepted: 06/13/2023] [Indexed: 07/10/2023]
Abstract
RNA-binding proteins (RBPs) control RNA metabolism to orchestrate gene expression and, when dysfunctional, underlie human diseases. Proteome-wide discovery efforts predict thousands of RBP candidates, many of which lack canonical RNA-binding domains (RBDs). Here, we present a hybrid ensemble RBP classifier (HydRA), which leverages information from both intermolecular protein interactions and internal protein sequence patterns to predict RNA-binding capacity with unparalleled specificity and sensitivity using support vector machines (SVMs), convolutional neural networks (CNNs), and Transformer-based protein language models. Occlusion mapping by HydRA robustly detects known RBDs and predicts hundreds of uncharacterized RNA-binding associated domains. Enhanced CLIP (eCLIP) for HydRA-predicted RBP candidates reveals transcriptome-wide RNA targets and confirms RNA-binding activity for HydRA-predicted RNA-binding associated domains. HydRA accelerates construction of a comprehensive RBP catalog and expands the diversity of RNA-binding associated domains.
Affiliation(s)
- Wenhao Jin
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
- Kristopher W Brannan
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
- Katannya Kapeli
- Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Samuel S Park
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
- Hui Qing Tan
- Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Maya L Gosztyla
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
- Mayuresh Mujumdar
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
- Joshua Ahdout
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
- Bryce Henroid
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
- Katherine Rothamel
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
- Joy S Xiang
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
- Limsoon Wong
- Department of Computer Science, National University of Singapore, Singapore, Singapore
- Gene W Yeo
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
15
Yan K, Feng J, Huang J, Wu H. iDRPro-SC: identifying DNA-binding proteins and RNA-binding proteins based on subfunction classifiers. Brief Bioinform 2023:bbad251. [PMID: 37405873 DOI: 10.1093/bib/bbad251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 06/10/2023] [Accepted: 06/12/2023] [Indexed: 07/07/2023] Open
Abstract
Nucleic acid-binding proteins interact with DNA and RNA to regulate gene expression and transcriptional control. The pathogenesis of many human diseases is related to abnormal gene expression. Therefore, recognizing nucleic acid-binding proteins accurately and efficiently has important implications for disease research. To address this question, methods have been proposed that use sequence information to identify nucleic acid-binding proteins. However, different types of nucleic acid-binding proteins have different subfunctions, and these methods ignore their internal differences, so the performance of the predictors can be further improved. In this study, we proposed a new method, called iDRPro-SC, to predict the type of nucleic acid-binding proteins based on sequence information. iDRPro-SC considers the internal differences of nucleic acid-binding proteins and combines their subfunctions to build a complete dataset. Additionally, we used ensemble learning to characterize and predict nucleic acid-binding proteins. The results on the test dataset showed that iDRPro-SC achieved the best prediction performance and was superior to the other existing nucleic acid-binding protein prediction methods. We have established a web server that can be accessed online: http://bliulab.net/iDRPro-SC.
Affiliation(s)
- Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Jiawei Feng
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Jing Huang
- Huajian Yutong Technology (Beijing) Co., Ltd
- State Key Laboratory of Media Convergence Production Technology and Systems, Beijing, China, 100803
- Xinhua New Media Culture Communication Co., Ltd
- Hao Wu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
16
Zhao R, Fang X, Mai Z, Chen X, Mo J, Lin Y, Xiao R, Bao X, Weng X, Zhou X. Transcriptome-wide identification of single-stranded RNA binding proteins. Chem Sci 2023; 14:4038-4047. [PMID: 37063799 PMCID: PMC10094363 DOI: 10.1039/d3sc00957b] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/20/2023] [Accepted: 03/07/2023] [Indexed: 04/18/2023] Open
Abstract
RNA-protein interactions are precisely regulated by RNA secondary structures in various biological processes. Large-scale identification of proteins that interact with particular RNA structures is important to the RBPome. Herein, a kethoxal-assisted single-stranded RNA interactome capture (KASRIC) strategy was developed to globally identify single-stranded RNA binding proteins (ssRBPs). This approach combines RNA secondary structure probing technology with the conventional method of RNA-binding protein profiling, enabling the transcriptome-wide identification of ssRBPs. Applying KASRIC, we identified 3180 candidate RBPs and 244 candidate ssRBPs in HeLa cells. Importantly, the 244 candidate ssRBPs contained 55 previously reported ssRBPs and 189 novel ssRBPs. Functional analysis of the candidate ssRBPs showed enrichment in cellular processes related to RNA splicing and RNA degradation. The KASRIC strategy will facilitate the investigation of RNA-protein interactions.
Affiliation(s)
- Ruiqi Zhao
- College of Chemistry and Molecular Sciences, Key Laboratory of Biomedical Polymers-Ministry of Education, Wuhan University Wuhan Hubei 430072 P. R. China
- Xin Fang
- College of Chemistry and Molecular Sciences, Key Laboratory of Biomedical Polymers-Ministry of Education, Wuhan University Wuhan Hubei 430072 P. R. China
- Zhibiao Mai
- Laboratory of RNA Molecular Biology, Guangdong Provincial Key Laboratory of Stem Cell and Regenerative Medicine, CAS Key Laboratory of Regenerative Biology, GIBH-CUHK Joint Research Laboratory on Stem Cell and Regenerative Medicine, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences Guangzhou Guangdong Province 510530 China
- Xi Chen
- College of Chemistry and Molecular Sciences, Key Laboratory of Biomedical Polymers-Ministry of Education, Wuhan University Wuhan Hubei 430072 P. R. China
- Jing Mo
- College of Chemistry and Molecular Sciences, Key Laboratory of Biomedical Polymers-Ministry of Education, Wuhan University Wuhan Hubei 430072 P. R. China
- Yingying Lin
- Laboratory of RNA Molecular Biology, Guangdong Provincial Key Laboratory of Stem Cell and Regenerative Medicine, CAS Key Laboratory of Regenerative Biology, GIBH-CUHK Joint Research Laboratory on Stem Cell and Regenerative Medicine, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences Guangzhou Guangdong Province 510530 China
- Rui Xiao
- Frontier Science Center for Immunology and Metabolism, Medical Research Institute, Wuhan University Wuhan Hubei 430071 China
- TaiKang Center for Life and Medical Sciences, Wuhan University Wuhan Hubei 430071 China
- Xichen Bao
- Laboratory of RNA Molecular Biology, Guangdong Provincial Key Laboratory of Stem Cell and Regenerative Medicine, CAS Key Laboratory of Regenerative Biology, GIBH-CUHK Joint Research Laboratory on Stem Cell and Regenerative Medicine, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences Guangzhou Guangdong Province 510530 China
- Xiaocheng Weng
- College of Chemistry and Molecular Sciences, Key Laboratory of Biomedical Polymers-Ministry of Education, Wuhan University Wuhan Hubei 430072 P. R. China
- Xiang Zhou
- College of Chemistry and Molecular Sciences, Key Laboratory of Biomedical Polymers-Ministry of Education, Wuhan University Wuhan Hubei 430072 P. R. China
- TaiKang Center for Life and Medical Sciences, Wuhan University Wuhan Hubei 430071 China
17
Zhang X, Zhu W, Sun H, Ding Y, Liu L. Prediction of CTCF loop anchor based on machine learning. Front Genet 2023; 14:1181956. [PMID: 37077544 PMCID: PMC10106609 DOI: 10.3389/fgene.2023.1181956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 03/24/2023] [Indexed: 04/05/2023] Open
Abstract
Introduction: Various activities in biological cells are affected by three-dimensional genome structure. Insulators play an important role in the organization of higher-order structure. CTCF is a representative mammalian insulator, which can produce barriers that prevent the continuous extrusion of chromatin loops. As a multifunctional protein, CTCF has tens of thousands of binding sites in the genome, but only a portion of them can be used as anchors of chromatin loops. It is still unclear how cells select the anchors in the process of chromatin looping. Methods: In this paper, a comparative analysis is performed to investigate the sequence preference and binding strength of anchor and non-anchor CTCF binding sites. Furthermore, a machine learning model based on CTCF binding intensity and DNA sequence is proposed to predict which CTCF sites can form chromatin loop anchors. Results: The accuracy of the machine learning model that we constructed for predicting the anchors of CTCF-mediated chromatin loops reached 0.8646. We also find that the formation of loop anchors is mainly influenced by CTCF binding strength and binding pattern (which can be interpreted as the binding of different zinc fingers). Discussion: In conclusion, our results suggest that the CTCF core motif and its flanking sequence may be responsible for the binding specificity. This work contributes to understanding the mechanism of loop anchor selection and provides a reference for the prediction of CTCF-mediated chromatin loops.
Affiliation(s)
- Xiao Zhang
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Wen Zhu
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- *Correspondence: Wen Zhu,
- Huimin Sun
- School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
- Yijie Ding
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Li Liu
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
18
Comparison of Biomolecular Condensate Localization and Protein Phase Separation Predictors. Biomolecules 2023; 13:biom13030527. [PMID: 36979462 PMCID: PMC10046894 DOI: 10.3390/biom13030527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2023] [Revised: 03/07/2023] [Accepted: 03/10/2023] [Indexed: 03/17/2023] Open
Abstract
Research in the field of biochemistry and cellular biology has entered a new phase due to the discovery of phase separation driving the formation of biomolecular condensates, or membraneless organelles, in cells. The implications of this novel principle of cellular organization are vast and can be applied at multiple scales, spawning exciting research questions in numerous directions. Of fundamental importance are the molecular mechanisms that underlie biomolecular condensate formation within cells and whether insights gained into these mechanisms provide a gateway for accurate predictions of protein phase behavior. Within the last six years, a significant number of predictors for protein phase separation and condensate localization have emerged. Herein, we compare a collection of state-of-the-art predictors on different tasks related to protein phase behavior. We show that the tested methods achieve high AUCs in the identification of biomolecular condensate drivers and scaffolds, as well as in the identification of proteins able to phase separate in vitro. However, our benchmark tests reveal that their performance is poorer when used to predict protein segments that are involved in phase separation or to classify amino acid substitutions as phase-separation-promoting or -inhibiting mutations. Our results suggest that the phenomenological approach used by most predictors is insufficient to fully grasp the complexity of the phenomenon within biological contexts and make reliable predictions related to protein phase behavior at the residue level.
19
Du X, Hu J. Deep Multi-Label Joint Learning for RNA and DNA-Binding Proteins Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:307-320. [PMID: 35148267 DOI: 10.1109/tcbb.2022.3150280] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/04/2023]
Abstract
The recognition of DNA- (DBPs) and RNA-binding proteins (RBPs) is not only conducive to understanding cell function, but also a challenging task. Previous studies have usually considered these proteins separately because of their different binding domains. In addition, due to the high similarity between DBPs and RBPs, a DBP predictor may predict RBPs as DBPs, and vice versa, which leads to a high cross-prediction rate. In this study, we propose a novel deep multi-label joint learning framework to leverage the relationship between multiple labels and binding proteins. First, a multi-label variant network is designed to explore multi-scale hidden context information. Then, multi-label Long Short-Term Memory (multiLSTM) is used to mine the potential relationship between labels. Finally, the calibrated hidden features from the variant network are considered for different levels of joint learning so that multiLSTM can better explore the correlation between them. Extensive experiments are also carried out to compare the proposed method with other existing methods. Furthermore, we provide further insights into the importance of the relevant bioanalysis of proteins obtained from our model and summarize the binding proteins that are significantly related to a disease. Our method is freely available at http://39.108.90.186/dmlj.
20
Wang N, Zhang J, Liu B. iDRBP-EL: Identifying DNA- and RNA- Binding Proteins Based on Hierarchical Ensemble Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:432-441. [PMID: 34932484 DOI: 10.1109/tcbb.2021.3136905] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/04/2023]
Abstract
Identification of DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) from the primary sequences is essential for further exploring protein-nucleic acid interactions. Previous studies have shown that machine-learning-based methods can efficiently identify DBPs or RBPs. However, the information used in these methods is rather limited in variety, and most of them can predict only DBPs or only RBPs. In this study, we proposed a computational predictor, iDRBP-EL, to identify DNA- and RNA-binding proteins, and introduced hierarchical ensemble learning to integrate three levels of information. The method can integrate the information of different features, machine learning algorithms and data into one multi-label model. The ablation experiment showed that the fusion of different information can improve the prediction performance and overcome the cross-prediction problem. Experimental results on the independent datasets showed that iDRBP-EL outperformed all the other competing methods. Moreover, we established a user-friendly webserver, iDRBP-EL (http://bliulab.net/iDRBP-EL), which can predict both DBPs and RBPs based only on protein sequences.
21
Selvaraj MK, Kaur J. Computational method for aromatase-related proteins using machine learning approach. PLoS One 2023; 18:e0283567. [PMID: 36989252 PMCID: PMC10057777 DOI: 10.1371/journal.pone.0283567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2022] [Accepted: 03/12/2023] [Indexed: 03/30/2023] Open
Abstract
Human aromatase enzyme is a microsomal cytochrome P450 and catalyzes aromatization of androgens into estrogens during steroidogenesis. For breast cancer therapy, third-generation aromatase inhibitors (AIs) have proven to be effective; however, patients acquire resistance to current AIs. Thus, there is a need to predict aromatase-related proteins to develop efficacious AIs. A machine learning method was established to identify aromatase-related proteins using a five-fold cross-validation technique. In this study, different SVM-based models were built using amino acid composition, dipeptide composition, hybrid features, and evolutionary profiles in the form of a position-specific scoring matrix (PSSM), with maximum accuracies of 87.42%, 84.05%, 85.12%, and 92.02%, respectively. Based on the primary sequence, the developed method predicts aromatase-related proteins with high accuracy. Prediction score graphs were developed using the known dataset to check the performance of the method. Based on the approach described above, a webserver for predicting aromatase-related proteins from primary sequence data was developed and implemented at https://bioinfo.imtech.res.in/servers/muthu/aromatase/home.html. We hope that the developed method will be useful for aromatase protein related research.
Affiliation(s)
- Jasmeet Kaur
- Department of Biophysics, Postgraduate Institute of Medical Education and Research (PGIMER), Chandigarh, India
22
Sun Z, Zheng S, Zhao H, Niu Z, Lu Y, Pan Y, Yang Y. To Improve Prediction of Binding Residues With DNA, RNA, Carbohydrate, and Peptide Via Multi-Task Deep Neural Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3735-3743. [PMID: 34637380 DOI: 10.1109/tcbb.2021.3118916] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 06/13/2023]
Abstract
MOTIVATION The interactions of proteins with DNA, RNA, peptide, and carbohydrate play key roles in various biological processes. The study of uncharacterized protein-molecule interactions could be aided by accurate predictions of residues that bind with partner molecules. However, the existing methods for predicting binding residues on proteins remain of relatively low accuracy due to the limited number of complex structures in databases. As different types of molecules partially share chemical mechanisms, the predictions for each molecular type should benefit from the binding information of other molecule types. RESULTS In this study, we employed a multi-task deep learning strategy to develop a new sequence-based method, named MTDsite, for simultaneously predicting binding residues/sites with multiple important molecule types. By combining four training sets for DNA, RNA, peptide, and carbohydrate-binding proteins, our method yielded accurate and robust predictions with AUC values of 0.852, 0.836, 0.758, and 0.776 on their respective independent test sets, which are 0.52 to 6.6% better than other state-of-the-art methods. To the best of our knowledge, this is the first method using a multi-task framework to predict multiple molecular binding sites simultaneously.
23
Balcerak A, Macech-Klicka E, Wakula M, Tomecki R, Goryca K, Rydzanicz M, Chmielarczyk M, Szostakowska-Rodzos M, Wisniewska M, Lyczek F, Helwak A, Tollervey D, Kudla G, Grzybowska EA. The RNA-Binding Landscape of HAX1 Protein Indicates Its Involvement in Translation and Ribosome Assembly. Cells 2022; 11:cells11192943. [PMID: 36230905 PMCID: PMC9564044 DOI: 10.3390/cells11192943] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2022] [Revised: 09/13/2022] [Accepted: 09/15/2022] [Indexed: 11/18/2022] Open
Abstract
HAX1 is a human protein with no known homologues or structural domains. Mutations in the HAX1 gene cause severe congenital neutropenia through mechanisms that are poorly understood. Previous studies reported the RNA-binding capacity of HAX1, but the role of this binding in physiology and pathology remains unexplained. Here, we report the transcriptome-wide characterization of HAX1 RNA targets using RIP-seq and CRAC, indicating that HAX1 binds transcripts involved in translation, ribosome biogenesis, and rRNA processing. Using CRISPR knockouts, we find that HAX1 RNA targets partially overlap with transcripts downregulated in HAX1 KO, implying a role in mRNA stabilization. Gene ontology analysis demonstrated that genes differentially expressed in HAX1 KO (including genes involved in ribosome biogenesis and translation) are also enriched in a subset of genes whose expression correlates with HAX1 expression in four analyzed neoplasms. The functional connection to ribosome biogenesis was also demonstrated by gradient sedimentation ribosome profiles, which revealed differences in the small subunit:monosome ratio in HAX1 WT/KO. We speculate that changes in HAX1 expression may be important for the etiology of HAX1-linked diseases through dysregulation of translation.
Affiliation(s)
- Anna Balcerak
- Molecular and Translational Oncology, Maria Sklodowska-Curie National Research Institute of Oncology, 02-781 Warsaw, Poland
- Ewelina Macech-Klicka
- Molecular and Translational Oncology, Maria Sklodowska-Curie National Research Institute of Oncology, 02-781 Warsaw, Poland
- Maciej Wakula
- Molecular and Translational Oncology, Maria Sklodowska-Curie National Research Institute of Oncology, 02-781 Warsaw, Poland
- Rafal Tomecki
- Laboratory of RNA Processing and Decay, Institute of Biochemistry and Biophysics, Polish Academy of Sciences, 02-106 Warsaw, Poland
- Faculty of Biology, Institute of Genetics and Biotechnology, University of Warsaw, 02-106 Warsaw, Poland
- Krzysztof Goryca
- Genomics Core Facility, Centre of New Technologies University of Warsaw, 02-097 Warsaw, Poland
- Malgorzata Rydzanicz
- Department of Medical Genetics, Medical University of Warsaw, 02-106 Warsaw, Poland
- Mateusz Chmielarczyk
- Molecular and Translational Oncology, Maria Sklodowska-Curie National Research Institute of Oncology, 02-781 Warsaw, Poland
- Malgorzata Szostakowska-Rodzos
- Molecular and Translational Oncology, Maria Sklodowska-Curie National Research Institute of Oncology, 02-781 Warsaw, Poland
- Marta Wisniewska
- Laboratory of Biological Chemistry of Metal Ions, Institute of Biochemistry and Biophysics, Polish Academy of Sciences, 02-106 Warsaw, Poland
- Filip Lyczek
- Molecular and Translational Oncology, Maria Sklodowska-Curie National Research Institute of Oncology, 02-781 Warsaw, Poland
- Aleksandra Helwak
- Wellcome Centre for Cell Biology, University of Edinburgh, Edinburgh EH9 3BF, UK
- David Tollervey
- Wellcome Centre for Cell Biology, University of Edinburgh, Edinburgh EH9 3BF, UK
- Grzegorz Kudla
- MRC Human Genetics Unit, University of Edinburgh, Edinburgh EH4 2XU, UK
- Ewa A. Grzybowska
- Molecular and Translational Oncology, Maria Sklodowska-Curie National Research Institute of Oncology, 02-781 Warsaw, Poland
- Correspondence:
|
24
|
Chu LC, Arede P, Li W, Urdaneta EC, Ivanova I, McKellar SW, Wills JC, Fröhlich T, von Kriegsheim A, Beckmann BM, Granneman S. The RNA-bound proteome of MRSA reveals post-transcriptional roles for helix-turn-helix DNA-binding and Rossmann-fold proteins. Nat Commun 2022; 13:2883. [PMID: 35610211 PMCID: PMC9130240 DOI: 10.1038/s41467-022-30553-8] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 04/09/2021] [Accepted: 05/06/2022] [Indexed: 01/21/2023]
Abstract
RNA-binding proteins play key roles in controlling gene expression in many organisms, but relatively few have been identified and characterised in detail in Gram-positive bacteria. Here, we globally analyse RNA-binding proteins in methicillin-resistant Staphylococcus aureus (MRSA) using two complementary biochemical approaches. We identify hundreds of putative RNA-binding proteins, many containing unconventional RNA-binding domains such as Rossmann-fold domains. Remarkably, more than half of the proteins containing helix-turn-helix (HTH) domains, which are frequently found in prokaryotic transcription factors, bind RNA in vivo. In particular, the CcpA transcription factor, a master regulator of carbon metabolism, uses its HTH domain to bind hundreds of RNAs near intrinsic transcription terminators in vivo. We propose that CcpA, besides acting as a transcription factor, post-transcriptionally regulates the stability of many RNAs.
Affiliation(s)
- Liang-Cui Chu
- Centre for Synthetic and Systems Biology, University of Edinburgh, Edinburgh, EH9 3BF, UK
| | - Pedro Arede
- Centre for Synthetic and Systems Biology, University of Edinburgh, Edinburgh, EH9 3BF, UK
| | - Wei Li
- Centre for Synthetic and Systems Biology, University of Edinburgh, Edinburgh, EH9 3BF, UK
| | - Erika C Urdaneta
- IRI Life Sciences, Humboldt University Berlin, 10115, Berlin, Germany
| | - Ivayla Ivanova
- Centre for Synthetic and Systems Biology, University of Edinburgh, Edinburgh, EH9 3BF, UK
| | - Stuart W McKellar
- Centre for Synthetic and Systems Biology, University of Edinburgh, Edinburgh, EH9 3BF, UK
| | - Jimi C Wills
- Cancer Research UK Edinburgh Centre, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, EH4 2XR, UK
| | - Theresa Fröhlich
- Centre for Synthetic and Systems Biology, University of Edinburgh, Edinburgh, EH9 3BF, UK
| | - Alexander von Kriegsheim
- Cancer Research UK Edinburgh Centre, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, EH4 2XR, UK
| | | | - Sander Granneman
- Centre for Synthetic and Systems Biology, University of Edinburgh, Edinburgh, EH9 3BF, UK.
|
25
|
Dong C, Rao N, Du W, Gao F, Lv X, Wang G, Zhang J. mRBioM: An Algorithm for the Identification of Potential mRNA Biomarkers From Complete Transcriptomic Profiles of Gastric Adenocarcinoma. Front Genet 2021; 12:679612. [PMID: 34386038 PMCID: PMC8354214 DOI: 10.3389/fgene.2021.679612] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/12/2021] [Accepted: 05/06/2021] [Indexed: 12/09/2022]
Abstract
Purpose In this work, an algorithm named mRBioM was developed for the identification of potential mRNA biomarkers (PmBs) from complete transcriptomic RNA profiles of gastric adenocarcinoma (GA). Methods mRBioM initially extracts differentially expressed (DE) RNAs (mRNAs, miRNAs, and lncRNAs). Next, mRBioM calculates the total information amount of each DE mRNA based on the coexpression network, including three types of RNAs and the protein-protein interaction network encoded by DE mRNAs. Finally, PmBs were identified according to the variation trend of total information amount of all DE mRNAs. Four PmB-based classifiers without learning and with learning were designed to discriminate the sample types to confirm the reliability of PmBs identified by mRBioM. PmB-based survival analysis was performed. Finally, three other cancer datasets were used to confirm the generalization ability of mRBioM. Results mRBioM identified 55 PmBs (41 upregulated and 14 downregulated) related to GA. The list included thirteen PmBs that have been verified as biomarkers or potential therapeutic targets of gastric cancer, and some PmBs were newly identified. Most PmBs were primarily enriched in the pathways closely related to the occurrence and development of gastric cancer. Cancer-related factors without learning achieved sensitivity, specificity, and accuracy of 0.90, 1, and 0.90, respectively, in the classification of the GA and control samples. Average accuracy, sensitivity, and specificity of the three classifiers with machine learning ranged within 0.94–0.98, 0.94–0.97, and 0.97–1, respectively. The prognostic risk score model constructed by 4 PmBs was able to correctly and significantly (∗∗∗p < 0.001) classify 269 GA patients into the high-risk (n = 134) and low-risk (n = 135) groups. 
Classification performance equivalent to that for GA was achieved on the complete transcriptomic RNA profiles of colon adenocarcinoma, lung adenocarcinoma, and hepatocellular carcinoma using PmBs identified by mRBioM. Conclusions GA-related PmBs have high specificity and sensitivity and strong prognostic risk prediction ability. mRBioM also generalizes well. These PmBs may have good application prospects for the early diagnosis of GA and may help to elucidate the mechanism governing the occurrence and development of GA. Additionally, mRBioM is expected to be applicable to the identification of other cancer-related biomarkers.
Affiliation(s)
- Changlong Dong
- Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,Key Laboratory for NeuroInformation of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Nini Rao
- Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,Key Laboratory for NeuroInformation of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Wenju Du
- Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,Key Laboratory for NeuroInformation of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Fenglin Gao
- Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,Key Laboratory for NeuroInformation of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Xiaoqin Lv
- Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,Key Laboratory for NeuroInformation of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Guangbin Wang
- Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,Key Laboratory for NeuroInformation of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
| | - Junpeng Zhang
- Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China.,Key Laboratory for NeuroInformation of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, China
|
26
|
Zhang J, Chen Q, Liu B. DeepDRBP-2L: A New Genome Annotation Predictor for Identifying DNA-Binding Proteins and RNA-Binding Proteins Using Convolutional Neural Network and Long Short-Term Memory. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1451-1463. [PMID: 31722485 DOI: 10.1109/tcbb.2019.2952338] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Indexed: 06/10/2023]
Abstract
DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) are two kinds of crucial proteins, which are associated with various cellular activities and some important diseases. Accurate identification of DBPs and RBPs facilitates both theoretical research and real-world applications. Existing sequence-based DBP predictors can accurately identify DBPs but incorrectly predict many RBPs as DBPs, and vice versa, resulting in low prediction precision. Moreover, some proteins (DRBPs) that interact with both DNA and RNA play important roles in gene expression and cannot be identified by existing computational methods. In this study, a two-level predictor named DeepDRBP-2L was proposed by combining a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). It is the first computational method that is able to identify DBPs, RBPs and DRBPs. Rigorous cross-validations and independent tests showed that DeepDRBP-2L is able to overcome the shortcomings of the existing methods and can go one step further to identify DRBPs. Application of DeepDRBP-2L to the tomato genome further demonstrated its performance. The webserver of DeepDRBP-2L is freely available at http://bliulab.net/DeepDRBP-2L.
|
27
|
Song J, Tian S, Yu L, Xing Y, Yang Q, Duan X, Dai Q. AC-Caps: Attention Based Capsule Network for Predicting RBP Binding Sites of LncRNA. Interdiscip Sci 2020; 12:414-423. [PMID: 32572768 DOI: 10.1007/s12539-020-00379-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/11/2020] [Revised: 05/18/2020] [Accepted: 05/30/2020] [Indexed: 01/03/2023]
Abstract
Long non-coding RNA (lncRNA) is a class of non-coding RNA longer than 200 nucleotides that has no protein-coding function. LncRNA plays a key role in many biological processes. Studying the RNA-binding protein (RBP) binding sites on the lncRNA chain helps to reveal epigenetic and post-transcriptional mechanisms, to explore the physiological and pathological processes of cancer, and to discover new therapeutic breakthroughs. To improve the recognition rate of RBP binding sites and reduce experimental time and cost, many computational methods based on domain knowledge have emerged to predict RBP binding sites. However, these prediction methods are independent of nucleotides and do not take nucleotide statistics into account. In this paper, we use a high-order statistics-based encoding scheme; the encoded lncRNA sequences are then fed into a hybrid deep learning architecture named AC-Caps. It consists of a joint processing layer (composed of an attention mechanism and a convolutional neural network) and a capsule network. The AC-Caps model was evaluated using 31 independent experimental data sets from 12 lncRNA-binding proteins. In experiments, our method achieves excellent performance, with an average area under the curve (AUC) of 0.967 and an average accuracy (ACC) of 92.5%, which exceed HOCCNNLB by 0.014 and 2.3%, iDeepS by 0.261 and 28.9%, and DeepBind by 0.189 and 21.8%, respectively. The results show that the AC-Caps method can reliably process large-scale RBP binding site data on the lncRNA chain, and that its prediction performance is better than that of existing deep-learning models. The source code of AC-Caps and the datasets used in this paper are available at https://github.com/JinmiaoS/AC-Caps.
Affiliation(s)
- Jinmiao Song
- School of Information Science and Engineering, Xinjiang University, Urumqi, 830008, China
- Dalian Key Lab of Digital Technology for National Culture, Dalian Minzu University, Dalian, 116600, China
| | - Shengwei Tian
- School of Software, Xinjiang University, Urumqi, 830046, China.
| | - Long Yu
- Network Center, Xinjiang University, Urumqi, 830046, China
| | - Yan Xing
- Imaging Center, Xinjiang Medical University Affiliated First Hospital, Urumqi, 830011, China.
| | - Qimeng Yang
- School of Information Science and Engineering, Xinjiang University, Urumqi, 830008, China
| | - Xiaodong Duan
- Dalian Key Lab of Digital Technology for National Culture, Dalian Minzu University, Dalian, 116600, China
| | - Qiguo Dai
- Dalian Key Lab of Digital Technology for National Culture, Dalian Minzu University, Dalian, 116600, China
|
28
|
Kaur D, Arora C, Raghava GPS. A Hybrid Model for Predicting Pattern Recognition Receptors Using Evolutionary Information. Front Immunol 2020; 11:71. [PMID: 32082326 PMCID: PMC7002473 DOI: 10.3389/fimmu.2020.00071] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 11/18/2019] [Accepted: 01/13/2020] [Indexed: 12/17/2022]
Abstract
This study describes a method developed for predicting pattern recognition receptors (PRRs), which are an integral part of the immune system. The models developed here were trained and evaluated on the largest possible non-redundant set of PRRs, obtained from PRRDB 2.0, and non-pattern recognition receptors (Non-PRRs), obtained from Swiss-Prot. First, a similarity-based approach using BLAST was used to predict PRRs; it had limited success due to a large number of no-hits. Second, machine learning-based models were developed using sequence composition and achieved a maximum MCC of 0.63. In addition, models were developed using evolutionary information in the form of PSSM composition and achieved a maximum MCC of 0.66. Finally, we developed hybrid models that combined the similarity-based approach using BLAST with machine learning-based models. Our best model, which combined BLAST and the PSSM-based model, achieved a maximum MCC of 0.82 with an AUROC of 0.95, utilizing the potential of both similarity-based search and machine learning techniques. To serve the scientific community, we also developed a web server, "PRRpred", based on the best model developed in this study (http://webs.iiitd.edu.in/raghava/prrpred/).
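The abstract above does not say how the BLAST result and the machine learning score are combined. As a minimal sketch of one common scheme (an assumption, not the authors' documented rule), a similarity hit decides the label when one exists, and the model score is used only for no-hit sequences. The function name `hybrid_predict`, its parameters, and the 0.5 threshold are hypothetical.

```python
from typing import Optional


def hybrid_predict(blast_label: Optional[int], ml_score: float,
                   threshold: float = 0.5) -> int:
    """Return 1 (PRR) or 0 (non-PRR) for one protein.

    blast_label: label of the best similarity hit, or None for a no-hit.
    ml_score:    probability-like score from a trained model (hypothetical).
    threshold:   illustrative cutoff, not taken from the paper.
    """
    # Assumed rule: trust the similarity search whenever it returns a hit.
    if blast_label is not None:
        return blast_label
    # Otherwise fall back to the machine learning score.
    return 1 if ml_score >= threshold else 0


# A no-hit sequence is decided by the model score alone.
print(hybrid_predict(None, 0.7))
```

A fallback of this kind addresses the weakness named in the abstract: a similarity search alone leaves every no-hit sequence unclassified.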
Affiliation(s)
- Dilraj Kaur
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
| | - Chakit Arora
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
| | - Gajendra P S Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
|
29
|
Abstract
BACKGROUND Interactions between protein and nucleic acid molecules are essential to a variety of cellular processes. The large amount of interaction data generated by high-throughput technologies has triggered the development of several computational methods either to predict binding sites in a sequence or to determine whether a pair of sequences interacts or not. Most of these methods treat the problem of the interaction of nucleic acids with proteins as a classification problem rather than a generation problem. RESULTS We developed a generative model for constructing single-stranded nucleic acids that bind to a target protein using a long short-term memory (LSTM) neural network. Experimental results of the generative model are promising in the sense that DNA and RNA sequences generated by the model for several target proteins show high specificity and that motifs present in the generated sequences are similar to known protein-binding motifs. CONCLUSIONS Although these are preliminary results of our ongoing research, our approach can be used to generate nucleic acid sequences that bind to a target protein. In particular, it will help design efficient in vitro experiments by constructing an initial pool of potential aptamers that bind to a target protein with high affinity and specificity.
Affiliation(s)
- Jinho Im
- Department of Computer Engineering, Inha University, Incheon, 22212, South Korea
| | - Byungkyu Park
- Department of Computer Engineering, Inha University, Incheon, 22212, South Korea
| | - Kyungsook Han
- Department of Computer Engineering, Inha University, Incheon, 22212, South Korea.
|
30
|
Tong X, Liu S. CPPred: coding potential prediction based on the global description of RNA sequence. Nucleic Acids Res 2019; 47:e43. [PMID: 30753596 PMCID: PMC6486542 DOI: 10.1093/nar/gkz087] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Received: 07/18/2018] [Revised: 01/26/2019] [Accepted: 02/01/2019] [Indexed: 11/12/2022]
Abstract
A rapid and accurate approach to distinguishing between coding RNAs and ncRNAs plays a critical role in analyzing the thousands of novel transcripts generated in recent years by next-generation sequencing technology. The previously developed methods CPAT, CPC2 and PLEK distinguish coding RNAs from ncRNAs very well, but distinguish poorly between small coding RNAs and small ncRNAs. Herein, we report an approach, CPPred (coding potential prediction), which is based on an SVM classifier and multiple sequence features, including novel RNA features encoded by the global description. Owing to the addition of the novel RNA features, CPPred distinguishes better than the state-of-the-art methods not only between coding RNAs and ncRNAs, but also between small coding RNAs and small ncRNAs. A recent study proposed 1335 novel human coding RNAs from a large number of RNA-seq datasets. However, only 119 transcripts are predicted as coding RNAs by CPPred. In fact, almost all proposed novel coding RNAs are ncRNAs (91.1%), which is consistent with previous reports. Remarkably, we also reveal that the global description of encoding features (T2, C0 and GC) plays an important role in the prediction of coding potential.
Affiliation(s)
- Xiaoxue Tong
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Shiyong Liu
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
|
31
|
Bressin A, Schulte-Sasse R, Figini D, Urdaneta EC, Beckmann BM, Marsico A. TriPepSVM: de novo prediction of RNA-binding proteins based on short amino acid motifs. Nucleic Acids Res 2019; 47:4406-4417. [PMID: 30923827 PMCID: PMC6511874 DOI: 10.1093/nar/gkz203] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 11/30/2018] [Revised: 02/20/2019] [Accepted: 03/18/2019] [Indexed: 12/26/2022]
Abstract
In recent years, hundreds of novel RNA-binding proteins (RBPs) have been identified, leading to the discovery of novel RNA-binding domains. Furthermore, unstructured or disordered low-complexity regions of RBPs have been identified to play an important role in interactions with nucleic acids. However, these advances in understanding RBPs are limited mainly to eukaryotic species and we only have limited tools to faithfully predict RNA-binders in bacteria. Here, we describe a support vector machine-based method, called TriPepSVM, for the prediction of RNA-binding proteins. TriPepSVM applies string kernels to directly handle protein sequences using tri-peptide frequencies. Testing the method in human and bacteria, we find that several RBP-enriched tri-peptides occur more often in structurally disordered regions of RBPs. TriPepSVM outperforms existing applications, which consider classical structural features of RNA-binding or homology, in the task of RBP prediction in both human and bacteria. Finally, we predict 66 novel RBPs in Salmonella Typhimurium and validate the bacterial proteins ClpX, DnaJ and UbiG to associate with RNA in vivo.
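The tri-peptide frequencies mentioned in the abstract above can be illustrated with a short sketch. It only counts overlapping three-residue windows in a single sequence; it is not the authors' implementation, and the string kernel and SVM training are omitted. The function name `tripeptide_frequencies` and the example sequence are hypothetical.

```python
from collections import Counter


def tripeptide_frequencies(sequence: str) -> dict[str, float]:
    """Relative frequencies of overlapping tri-peptides in a protein sequence."""
    seq = sequence.upper()
    # Slide a window of width 3 across the sequence (overlapping 3-mers).
    kmers = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if not kmers:
        # Sequences shorter than three residues contain no tri-peptide.
        return {}
    counts = Counter(kmers)
    total = len(kmers)
    return {kmer: n / total for kmer, n in counts.items()}


# Made-up, RGG-rich example sequence (illustrative only).
freqs = tripeptide_frequencies("MKRGGRGGRGGK")
print(sorted(freqs.items(), key=lambda kv: -kv[1])[:3])
```

Frequencies of this kind form a fixed vocabulary of at most 8,000 features for the 20 standard amino acids, which is what makes them usable as input to a kernel method.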
Affiliation(s)
- Annkatrin Bressin
- Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany
| | - Roman Schulte-Sasse
- Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany
| | - Davide Figini
- IRI Life Sciences, Humboldt University Berlin, Philippstrasse 13, 10115 Berlin, Germany
| | - Erika C Urdaneta
- IRI Life Sciences, Humboldt University Berlin, Philippstrasse 13, 10115 Berlin, Germany
| | - Benedikt M Beckmann
- IRI Life Sciences, Humboldt University Berlin, Philippstrasse 13, 10115 Berlin, Germany
| | - Annalisa Marsico
- Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, 14195 Berlin, Germany.,Free University of Berlin, Takustrasse 9, 14195 Berlin, Germany.,Institute of Computational Biology (ICB), Helmholtz Zentrum Munich, Ingolstaedter Landstr. 1 85764 Neuherberg, Germany
|
32
|
Wekesa JS, Luan Y, Chen M, Meng J. A Hybrid Prediction Method for Plant lncRNA-Protein Interaction. Cells 2019; 8:E521. [PMID: 31151273 PMCID: PMC6627874 DOI: 10.3390/cells8060521] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 04/26/2019] [Revised: 05/22/2019] [Accepted: 05/29/2019] [Indexed: 01/23/2023]
Abstract
Long non-protein-coding RNAs (lncRNAs) identification and analysis are pervasive in transcriptome studies due to their roles in biological processes. In particular, lncRNA-protein interaction has plausible relevance to gene expression regulation and in cellular processes such as pathogen resistance in plants. While lncRNA-protein interaction has been studied in animals, there has yet to be extensive research in plants. In this paper, we propose a novel plant lncRNA-protein interaction prediction method, namely PLRPIM, which combines deep learning and shallow machine learning methods. The selection of an optimal feature subset and subsequent efficient compression are significant challenges for deep learning models. The proposed method adopts k-mer and extracts high-level abstraction sequence-based features using stacked sparse autoencoder. Based on the extracted features, the fusion of random forest (RF) and light gradient boosting machine (LGBM) is used to build the prediction model. The performances are evaluated on Arabidopsis thaliana and Zea mays datasets. Results from experiments demonstrate PLRPIM's superiority compared with other prediction tools on the two datasets. Based on 5-fold cross-validation, we obtain 89.98% and 93.44% accuracy, 0.954 and 0.982 AUC for Arabidopsis thaliana and Zea mays, respectively. PLRPIM predicts potential lncRNA-protein interaction pairs effectively, which can facilitate lncRNA related research including function prediction.
Affiliation(s)
- Jael Sanyanda Wekesa
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116023, Liaoning, China.
- Department of Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi 62000-00200, Kenya.
| | - Yushi Luan
- School of Bioengineering, Dalian University of Technology, Dalian 116023, Liaoning, China.
| | - Ming Chen
- College of Life Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.
| | - Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116023, Liaoning, China.
|
33
|
Deep-RBPPred: Predicting RNA binding proteins in the proteome scale based on deep learning. Sci Rep 2018; 8:15264. [PMID: 30323214 PMCID: PMC6189057 DOI: 10.1038/s41598-018-33654-x] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Received: 05/17/2018] [Accepted: 09/28/2018] [Indexed: 12/22/2022]
Abstract
RNA-binding proteins (RBPs) play an important role in cellular processes. Identifying RBPs by both computation and experiment is essential. Recently, an RBP predictor, RBPPred, was proposed by our group. However, RBPPred is too slow because it needs to generate a PSSM as a feature. Herein, based on the protein features of RBPPred and a Convolutional Neural Network (CNN), we develop a deep learning model called Deep-RBPPred. With balanced and imbalanced training sets, we obtain the Deep-RBPPred-balance and Deep-RBPPred-imbalance models. Deep-RBPPred has three advantages compared with previous methods: (1) it needs only a few physicochemical properties derived from protein sequences; (2) it runs much faster; (3) it has good generalization ability. At the same time, Deep-RBPPred is still as good as the state-of-the-art method. Tested on the A. thaliana, S. cerevisiae and H. sapiens proteomes, the MCC values are 0.82 (0.82), 0.65 (0.69) and 0.85 (0.80), respectively, for the balance model (imbalance model) when the score cutoff is set to 0.5. On the same testing dataset, different machine learning algorithms (CNN and SVM) are also compared. The results show that the CNN-based model can identify more RBPs than the SVM-based model. In comparing the balance and imbalance models, both the CNN-based and SVM-based models tend to favor the majority class in the imbalanced set. Deep-RBPPred predicts 280 (balance model) and 265 (imbalance model) of 299 new RBPs. The sensitivity of the balance model is about 7% higher than that of the state-of-the-art method. We also apply Deep-RBPPred to 30 eukaryote and 109 bacterial proteomes downloaded from UniProt to estimate all possible RBPs. The estimates show that the proportions of RBPs in eukaryote proteomes are much higher than in bacterial proteomes.
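The MCC values reported above follow the standard Matthews correlation coefficient for binary classification, which stays informative on imbalanced sets. A minimal sketch of the formula from confusion-matrix counts is given here; the counts in the usage line are invented for illustration and are not results from the paper, and the zero-denominator convention is a common choice rather than something the abstract states.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention (assumed here): return 0 when any marginal is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Invented counts for illustration only.
print(round(mcc(tp=90, tn=85, fp=15, fn=10), 3))
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction).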
|
34
|
Middleton SA, Illuminati J, Kim J. Complete fold annotation of the human proteome using a novel structural feature space. Sci Rep 2017; 7:46321. [PMID: 28406174 PMCID: PMC5390313 DOI: 10.1038/srep46321] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 01/04/2017] [Accepted: 03/14/2017] [Indexed: 11/11/2022]
Abstract
Recognition of protein structural fold is the starting point for many structure prediction tools and protein function inference. Fold prediction is computationally demanding and recognizing novel folds is difficult such that the majority of proteins have not been annotated for fold classification. Here we describe a new machine learning approach using a novel feature space that can be used for accurate recognition of all 1,221 currently known folds and inference of unknown novel folds. We show that our method achieves better than 94% accuracy even when many folds have only one training example. We demonstrate the utility of this method by predicting the folds of 34,330 human protein domains and showing that these predictions can yield useful insights into potential biological function, such as prediction of RNA-binding ability. Our method can be applied to de novo fold prediction of entire proteomes and identify candidate novel fold families.
Affiliation(s)
- Sarah A Middleton
- Genomics and Computational Biology Program, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Joseph Illuminati
- Department of Computer Science, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Junhyong Kim
- Genomics and Computational Biology Program, University of Pennsylvania, Philadelphia, PA 19104, USA.,Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA
|