1
Gao Y, Shi R, Yu G, Huang Y, Yang Y. ZeRPI: A graph neural network model for zero-shot prediction of RNA-protein interactions. Methods 2025; 235:45-52. [PMID: 39892680 DOI: 10.1016/j.ymeth.2025.01.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2024] [Revised: 12/29/2024] [Accepted: 01/16/2025] [Indexed: 02/04/2025] Open
Abstract
RNA-protein interactions are crucial for biological functions across multiple levels. RNA binding proteins (RBPs) intricately engage in diverse biological processes through specific RNA molecule interactions. Previous studies have revealed the indispensable role of RBPs in both health and disease development. With the increase of experimental data, machine-learning methods have been widely used to predict RNA-protein interactions. However, most current methods either train models for individual RBPs or develop multi-task models for a fixed set of multiple RBPs. These approaches are incapable of predicting interactions with previously unseen RBPs. In this study, we present ZeRPI, a zero-shot method for predicting RNA-protein interactions. Based on a graph neural network model, ZeRPI integrates RNA and protein information to generate detailed representations, using a novel loss function based on contrastive learning principles to augment the alignment between interacting pairs in feature space. ZeRPI demonstrates competitive performance in predicting RNA-protein interactions across a wide array of RBPs. Notably, our model exhibits remarkable versatility in accurately predicting interactions for unseen RBPs, demonstrating its capacity to transfer knowledge learned from known RBPs.
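The abstract does not spell out the ZeRPI loss, so the following is only a generic illustration of the contrastive idea it mentions: pulling embeddings of interacting RNA-protein pairs together while pushing non-interacting pairs apart. It is a minimal InfoNCE-style sketch, not the authors' loss; the function names, the temperature value, and the toy 2-D embeddings are all assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Assumes a non-zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce_loss(rna_embs, protein_embs, temperature=0.1):
    """Mean InfoNCE loss; rna_embs[i] and protein_embs[i] form an interacting pair.

    Every other protein in the batch acts as a negative for RNA i.
    (Generic contrastive loss, assumed here for illustration only.)
    """
    rna = [normalize(v) for v in rna_embs]
    prot = [normalize(v) for v in protein_embs]
    total = 0.0
    for i, r in enumerate(rna):
        # Cosine similarity to every protein, scaled by the temperature.
        logits = [dot(r, p) / temperature for p in prot]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the true partner.
        total += log_denominator - logits[i]
    return total / len(rna)

# Toy embeddings: pairs aligned in feature space vs. pairs swapped.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is close to zero when each RNA embedding points in the same direction as its partner protein, and large when the partners are swapped, which is the alignment behaviour the abstract describes.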
Affiliation(s)
- Yifei Gao
- SJTU Paris Elite Institute of Technology (SPEIT), Shanghai, 200240, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
- Runhan Shi
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
- Gufeng Yu
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
- Yuyang Huang
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
- Yang Yang
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China.
2
Qiao Y, Yang R, Liu Y, Chen J, Zhao L, Huo P, Wang Z, Bu D, Wu Y, Zhao Y. DeepFusion: A deep bimodal information fusion network for unraveling protein-RNA interactions using in vivo RNA structures. Comput Struct Biotechnol J 2024; 23:617-625. [PMID: 38274994 PMCID: PMC10808905 DOI: 10.1016/j.csbj.2023.12.040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 12/04/2023] [Accepted: 12/26/2023] [Indexed: 01/27/2024] Open
Abstract
RNA-binding proteins (RBPs) are key post-transcriptional regulators, and the malfunctions of RBP-RNA binding lead to diverse human diseases. However, prediction of RBP binding sites is largely based on RNA sequence features, whereas in vivo RNA structural features based on high-throughput sequencing are rarely incorporated. Here, we designed a deep bimodal information fusion network called DeepFusion for unraveling protein-RNA interactions by incorporating structural features derived from DMS-seq data. DeepFusion integrates two sub-models to extract local motif-like information and long-term context information. We show that DeepFusion performs best compared with other cutting-edge methods with only sequence inputs on two datasets. DeepFusion's performance is further improved with bimodal input after adding in vivo DMS-seq structural features. Furthermore, DeepFusion can be used for analyzing RNA degradation, demonstrating significantly different RBP-binding scores in genes with slow degradation rates versus those with rapid degradation rates. DeepFusion thus provides enhanced abilities for further analysis of functional RNAs. DeepFusion's code and data are available at http://bioinfo.org/deepfusion/.
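As a rough illustration of what "bimodal input" can mean in practice, the sketch below concatenates a one-hot sequence encoding with one per-nucleotide structure value (for example, a normalized DMS reactivity). The abstract does not describe DeepFusion's actual encoding, so the channel layout, the function name `encode_bimodal`, and the example reactivities are assumptions.

```python
NUCLEOTIDES = "ACGU"

def encode_bimodal(sequence, reactivities):
    """Return one row per nucleotide: 4 one-hot values plus 1 structure value.

    Hypothetical encoding for illustration; unknown bases (e.g. N) get an
    all-zero one-hot part.
    """
    if len(sequence) != len(reactivities):
        raise ValueError("sequence and reactivities must have the same length")
    rows = []
    # Treat DNA-style T as U so that both alphabets are accepted.
    for base, reactivity in zip(sequence.upper().replace("T", "U"), reactivities):
        one_hot = [1.0 if base == n else 0.0 for n in NUCLEOTIDES]
        rows.append(one_hot + [float(reactivity)])
    return rows

# Made-up reactivities for a four-nucleotide toy sequence.
features = encode_bimodal("ACGU", [0.9, 0.1, 0.0, 0.5])
```

A sequence-only model would consume just the first four columns; a bimodal model would receive all five.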
Affiliation(s)
- Yixuan Qiao
- Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Rui Yang
- Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Yang Liu
- Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Jiaxin Chen
- Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Lianhe Zhao
- Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Peipei Huo
- Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Zhihao Wang
- Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Dechao Bu
- Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Yang Wu
- Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Yi Zhao
- Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- University of Chinese Academy of Sciences, Beijing 100049, China
3
Yang Y, Li G, Pang K, Cao W, Zhang Z, Li X. Deciphering 3'UTR Mediated Gene Regulation Using Interpretable Deep Representation Learning. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2024; 11:e2407013. [PMID: 39159140 PMCID: PMC11497048 DOI: 10.1002/advs.202407013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2024] [Revised: 07/23/2024] [Indexed: 08/21/2024]
Abstract
The 3' untranslated regions (3'UTRs) of messenger RNAs contain many important cis-regulatory elements that are under functional and evolutionary constraints. It is hypothesized that these constraints are similar to grammars and syntaxes in human languages and can be modeled by advanced natural language techniques such as Transformers, which have been very effective in modeling complex protein sequences and structures. Here 3UTRBERT is described, which implements an attention-based language model, i.e., Bidirectional Encoder Representations from Transformers (BERT). 3UTRBERT is pre-trained on aggregated 3'UTR sequences of human mRNAs in a task-agnostic manner; the pre-trained model is then fine-tuned for specific downstream tasks such as identifying RBP binding sites and m6A RNA modification sites, and predicting RNA sub-cellular localizations. Benchmark results show that 3UTRBERT generally outperformed other contemporary methods in each of these tasks. More importantly, the self-attention mechanism within 3UTRBERT allows direct visualization of the semantic relationship between sequence elements and effectively identifies regions with important regulatory potential. It is expected that the 3UTRBERT model can serve as a foundational tool for analyzing various sequence labeling tasks within the 3'UTR field, thus enhancing the decipherability of post-transcriptional regulatory mechanisms.
Affiliation(s)
- Yuning Yang
- School of Information Science and Technology, Northeast Normal University, Changchun, Jilin 130117, China
- Gen Li
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
- Kuan Pang
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
- Wuxinhao Cao
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
- Zhaolei Zhang
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
- Department of Computer Science, University of Toronto, Toronto, ON M5S 3E1, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 3E1, Canada
- Xiangtao Li
- School of Artificial Intelligence, Jilin University, Changchun, Jilin 130012, China
4
Saha R, Vázquez-Salazar A, Nandy A, Chen IA. Fitness Landscapes and Evolution of Catalytic RNA. Annu Rev Biophys 2024; 53:109-125. [PMID: 39013026 DOI: 10.1146/annurev-biophys-030822-025038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/18/2024]
Abstract
The relationship between genotype and phenotype, or the fitness landscape, is the foundation of genetic engineering and evolution. However, mapping fitness landscapes poses a major technical challenge due to the amount of quantifiable data that is required. Catalytic RNA is a special topic in the study of fitness landscapes due to its relatively small sequence space combined with its importance in synthetic biology. The combination of in vitro selection and high-throughput sequencing has recently provided empirical maps of both complete and local RNA fitness landscapes, but the astronomical size of sequence space limits purely experimental investigations. Next steps are likely to involve data-driven interpolation and extrapolation over sequence space using various machine learning techniques. We discuss recent progress in understanding RNA fitness landscapes, particularly with respect to protocells and machine representations of RNA. The confluence of technical advances may significantly impact synthetic biology in the near future.
Affiliation(s)
- Ranajay Saha
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, California, USA
- Alberto Vázquez-Salazar
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, California, USA
- Aditya Nandy
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, California, USA
- Department of Chemistry, The University of Chicago, Chicago, Illinois, USA
- The James Franck Institute, The University of Chicago, Chicago, Illinois, USA
- Irene A Chen
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, California, USA
- Department of Chemistry and Biochemistry, University of California, Los Angeles, California, USA
5
Hwang H, Jeon H, Yeo N, Baek D. Big data and deep learning for RNA biology. Exp Mol Med 2024; 56:1293-1321. [PMID: 38871816 PMCID: PMC11263376 DOI: 10.1038/s12276-024-01243-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/31/2024] [Revised: 02/27/2024] [Accepted: 03/05/2024] [Indexed: 06/15/2024] Open
Abstract
The exponential growth of big data in RNA biology (RB) has led to the development of deep learning (DL) models that have driven crucial discoveries. As constantly evidenced by DL studies in other fields, the successful implementation of DL in RB depends heavily on the effective utilization of large-scale datasets from public databases. In achieving this goal, data encoding methods, learning algorithms, and techniques that align well with biological domain knowledge have played pivotal roles. In this review, we provide guiding principles for applying these DL concepts to various problems in RB by demonstrating successful examples and associated methodologies. We also discuss the remaining challenges in developing DL models for RB and suggest strategies to overcome these challenges. Overall, this review aims to illuminate the compelling potential of DL for RB and ways to apply this powerful technology to investigate the intriguing biology of RNA more effectively.
Affiliation(s)
- Hyeonseo Hwang
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- Hyeonseong Jeon
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Genome4me Inc., Seoul, Republic of Korea
- Nagyeong Yeo
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- Daehyun Baek
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea.
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea.
- Genome4me Inc., Seoul, Republic of Korea.
6
Rennie S. Deep Learning for Elucidating Modifications to RNA-Status and Challenges Ahead. Genes (Basel) 2024; 15:629. [PMID: 38790258 PMCID: PMC11121098 DOI: 10.3390/genes15050629] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2024] [Revised: 05/11/2024] [Accepted: 05/11/2024] [Indexed: 05/26/2024] Open
Abstract
RNA-binding proteins and chemical modifications to RNA play vital roles in the co- and post-transcriptional regulation of genes. In order to fully decipher their biological roles, it is an essential task to catalogue their precise target locations along with their preferred contexts and sequence-based determinants. Recently, deep learning approaches have significantly advanced in this field. These methods can predict the presence or absence of modification at specific genomic regions based on diverse features, particularly sequence and secondary structure, allowing us to decipher the highly non-linear sequence patterns and structures that underlie site preferences. This article provides an overview of how deep learning is being applied to this area, with a particular focus on the problem of mRNA-RBP binding, while also considering other types of chemical modification to RNA. It discusses how different types of model can handle sequence-based and/or secondary-structure-based inputs, the process of model training, including choice of negative regions and separating sets for testing and training, and offers recommendations for developing biologically relevant models. Finally, it highlights four key areas that are crucial for advancing the field.
Affiliation(s)
- Sarah Rennie
- Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200 Copenhagen, Denmark
7
Lim D, Baek C, Blanchette M. Graphylo: A deep learning approach for predicting regulatory DNA and RNA sites from whole-genome multiple alignments. iScience 2024; 27:109002. [PMID: 38362268 PMCID: PMC10867641 DOI: 10.1016/j.isci.2024.109002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2023] [Revised: 12/17/2023] [Accepted: 01/19/2024] [Indexed: 02/17/2024] Open
Abstract
This study focuses on enhancing the prediction of regulatory functional sites in DNA and RNA sequences, a crucial aspect of gene regulation. Current methods, such as motif overrepresentation and machine learning, often lack specificity. To address this issue, the study leverages evolutionary information and introduces Graphylo, a deep-learning approach for predicting transcription factor binding sites in the human genome. Graphylo combines Convolutional Neural Networks for DNA sequences with Graph Convolutional Networks on phylogenetic trees, using information from placental mammals' genomes and evolutionary history. The research demonstrates that Graphylo consistently outperforms both single-species deep learning techniques and methods that incorporate inter-species conservation scores on a wide range of datasets. It achieves this by utilizing a species-based attention model for evolutionary insights and an integrated gradient approach for nucleotide-level model interpretability. This innovative approach offers a promising avenue for improving the accuracy of regulatory site prediction in genomics.
8
Horlacher M, Cantini G, Hesse J, Schinke P, Goedert N, Londhe S, Moyon L, Marsico A. A systematic benchmark of machine learning methods for protein-RNA interaction prediction. Brief Bioinform 2023; 24:bbad307. [PMID: 37635383 PMCID: PMC10516373 DOI: 10.1093/bib/bbad307] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/31/2023] [Revised: 06/15/2023] [Accepted: 07/18/2023] [Indexed: 08/29/2023] Open
Abstract
RNA-binding proteins (RBPs) are central actors of RNA post-transcriptional regulation. Experiments to profile-binding sites of RBPs in vivo are limited to transcripts expressed in the experimental cell type, creating the need for computational methods to infer missing binding information. While numerous machine-learning based methods have been developed for this task, their use of heterogeneous training and evaluation datasets across different sets of RBPs and CLIP-seq protocols makes a direct comparison of their performance difficult. Here, we compile a set of 37 machine learning (primarily deep learning) methods for in vivo RBP-RNA interaction prediction and systematically benchmark a subset of 11 representative methods across hundreds of CLIP-seq datasets and RBPs. Using homogenized sample pre-processing and two negative-class sample generation strategies, we evaluate methods in terms of predictive performance and assess the impact of neural network architectures and input modalities on model performance. We believe that this study will not only enable researchers to choose the optimal prediction method for their tasks at hand, but also aid method developers in developing novel, high-performing methods by introducing a standardized framework for their evaluation.
Affiliation(s)
- Marc Horlacher
- Computational Health Center, Helmholtz Center Munich, Germany
- School of Computation, Information and Technology, Technical University Munich (TUM), Germany
- Giulia Cantini
- Computational Health Center, Helmholtz Center Munich, Germany
- Julian Hesse
- Computational Health Center, Helmholtz Center Munich, Germany
- Patrick Schinke
- Computational Health Center, Helmholtz Center Munich, Germany
- Nicolas Goedert
- Computational Health Center, Helmholtz Center Munich, Germany
- Lambert Moyon
- Computational Health Center, Helmholtz Center Munich, Germany
9
Yang E, Zhang H, Zang Z, Zhou Z, Wang S, Liu Z, Liu Y. GCNfold: A novel lightweight model with valid extractors for RNA secondary structure prediction. Comput Biol Med 2023; 164:107246. [PMID: 37487383 DOI: 10.1016/j.compbiomed.2023.107246] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/28/2023] [Revised: 06/23/2023] [Accepted: 07/07/2023] [Indexed: 07/26/2023]
Abstract
RNA secondary structure is essential for predicting the tertiary structure and understanding RNA function. Recent research tends to stack numerous modules to design large deep-learning models. This can raise accuracy above 70%, but at the expense of significant training costs and reduced prediction efficiency. We proposed a model with three feature extractors called GCNfold. Structure Extractor utilizes a three-layer Graph Convolutional Network (GCN) to mine the structural information of RNA, such as stems, hairpins, and internal loops. Structure and Sequence Fusion embeds structural information into sequences with Transformer Encoders. Long-distance Dependency Extractor captures long-range pairwise relationships with a UNet. The experiments indicate that GCNfold has a small number of parameters, a fast inference speed, and a high accuracy among all models with over 80% accuracy. Additionally, GCNfold-Small takes only 90 ms to infer an RNA secondary structure and can achieve close to 90% accuracy on average. The GCNfold code is available on GitHub at https://github.com/EnbinYang/GCNfold.
Affiliation(s)
- Enbin Yang
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
- Hao Zhang
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China; College of Software, Jilin University, Changchun, 130012, China
- Zinan Zang
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
- Zhiyong Zhou
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
- Shuo Wang
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
- Zhen Liu
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China; Graduate School of Engineering, Nagasaki Institute of Applied Science, 536 Aba-machi, Nagasaki 851-0193, Japan
- Yuanning Liu
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China; College of Software, Jilin University, Changchun, 130012, China.
10
Zhou Y, Wu J, Yao S, Xu Y, Zhao W, Tong Y, Zhou Z. DeepCIP: A multimodal deep learning method for the prediction of internal ribosome entry sites of circRNAs. Comput Biol Med 2023; 164:107288. [PMID: 37542919 DOI: 10.1016/j.compbiomed.2023.107288] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/26/2023] [Revised: 07/05/2023] [Accepted: 07/28/2023] [Indexed: 08/07/2023]
Abstract
Circular RNAs (circRNAs) have been found to have the ability to encode proteins through internal ribosome entry sites (IRESs), which are essential RNA regulatory elements for cap-independent translation. Identification of IRES elements in circRNA is crucial for understanding its function. Previous studies have presented IRES predictors based on machine learning techniques, but they were mainly designed for linear RNA IRES. In this study, we proposed DeepCIP (Deep learning method for CircRNA IRES Prediction), a multimodal deep learning approach that employs both sequence and structural information for circRNA IRES prediction. Our results demonstrate the effectiveness of the sequence and structure models used by DeepCIP in feature extraction and suggest that integrating sequence and structural information efficiently improves the accuracy of prediction. The comparison studies indicate that DeepCIP outperforms other comparative methods on the test set and real circRNA IRES dataset. Furthermore, through the integration of an interpretable analysis mechanism, we elucidate the sequence patterns learned by our model, which align with the previous discovery of motifs that facilitate circRNA translation. Thus, DeepCIP has the potential to enhance the study of the coding potential of circRNAs and contribute to the design of circRNA-based drugs. DeepCIP as a standalone program is freely available at https://github.org/zjupgx/DeepCIP.
Affiliation(s)
- Yuxuan Zhou
- Innovation Institute for Artificial Intelligence in Medicine and Zhejiang Provincial Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; Zhejiang University Innovation Institute for Artificial Intelligence in Medicine - Aoming (Hangzhou) Biomedical Co., Ltd. Joint Laboratory, Hangzhou, 310018, China
- Jingcheng Wu
- Innovation Institute for Artificial Intelligence in Medicine and Zhejiang Provincial Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- Shihao Yao
- College of Life Sciences, China Jiliang University, Hangzhou, 310018, China; China Jiliang University - Aoming (Hangzhou) Biomedical Co., Ltd. Joint Laboratory, Hangzhou, 310018, China
- Yulian Xu
- College of Life Sciences, China Jiliang University, Hangzhou, 310018, China; China Jiliang University - Aoming (Hangzhou) Biomedical Co., Ltd. Joint Laboratory, Hangzhou, 310018, China
- Wenbin Zhao
- Innovation Institute for Artificial Intelligence in Medicine and Zhejiang Provincial Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; Zhejiang University Innovation Institute for Artificial Intelligence in Medicine - Aoming (Hangzhou) Biomedical Co., Ltd. Joint Laboratory, Hangzhou, 310018, China
- Yunguang Tong
- Innovation Institute for Artificial Intelligence in Medicine and Zhejiang Provincial Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; College of Life Sciences, China Jiliang University, Hangzhou, 310018, China; Aoming (Hangzhou) Biomedical Co., Ltd., Hangzhou, 310018, China; Zhejiang University Innovation Institute for Artificial Intelligence in Medicine - Aoming (Hangzhou) Biomedical Co., Ltd. Joint Laboratory, Hangzhou, 310018, China; China Jiliang University - Aoming (Hangzhou) Biomedical Co., Ltd. Joint Laboratory, Hangzhou, 310018, China.
- Zhan Zhou
- Innovation Institute for Artificial Intelligence in Medicine and Zhejiang Provincial Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, 322000, China.
11
Horlacher M, Wagner N, Moyon L, Kuret K, Goedert N, Salvatore M, Ule J, Gagneur J, Winther O, Marsico A. Towards in silico CLIP-seq: predicting protein-RNA interaction via sequence-to-signal learning. Genome Biol 2023; 24:180. [PMID: 37542318 PMCID: PMC10403857 DOI: 10.1186/s13059-023-03015-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/23/2022] [Accepted: 07/17/2023] [Indexed: 08/06/2023] Open
Abstract
We present RBPNet, a novel deep learning method, which predicts CLIP-seq crosslink count distribution from RNA sequence at single-nucleotide resolution. By training on up to a million regions, RBPNet achieves high generalization on eCLIP, iCLIP and miCLIP assays, outperforming state-of-the-art classifiers. RBPNet performs bias correction by modeling the raw signal as a mixture of the protein-specific and background signal. Through model interrogation via Integrated Gradients, RBPNet identifies predictive sub-sequences that correspond to known and novel binding motifs and enables variant-impact scoring via in silico mutagenesis. Together, RBPNet improves imputation of protein-RNA interactions, as well as mechanistic interpretation of predictions.
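The variant-impact scoring by in silico mutagenesis mentioned above can be sketched generically: substitute each position with every alternative base and record how the model score changes relative to the reference. The scorer below is a hypothetical stand-in (a simple motif counter with an arbitrary motif), not RBPNet, and all names are assumptions made for illustration.

```python
def toy_score(sequence):
    """Hypothetical stand-in for a trained model: counts 'UGCAUG' occurrences."""
    motif = "UGCAUG"
    return float(sum(sequence[i:i + len(motif)] == motif
                     for i in range(len(sequence) - len(motif) + 1)))

def in_silico_mutagenesis(sequence, score_fn, alphabet="ACGU"):
    """Return {(position, alt_base): score(mutant) - score(reference)}."""
    reference_score = score_fn(sequence)
    effects = {}
    for pos, ref_base in enumerate(sequence):
        for alt in alphabet:
            if alt == ref_base:
                continue  # Skip the reference base itself.
            # Build the single-nucleotide mutant and score it.
            mutant = sequence[:pos] + alt + sequence[pos + 1:]
            effects[(pos, alt)] = score_fn(mutant) - reference_score
    return effects

# Toy sequence with the motif at positions 2-7 (0-based).
effects = in_silico_mutagenesis("AAUGCAUGAA", toy_score)
```

With this toy scorer, substitutions inside the motif lower the score and substitutions in the flanks leave it unchanged; with a real model, `score_fn` would wrap the model's prediction for the region of interest.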
Affiliation(s)
- Marc Horlacher
- Computational Health Center, Helmholtz Center Munich, Munich, Germany.
- Department of Biology, University of Copenhagen, Copenhagen, Denmark.
- Department of Informatics, Technical University of Munich, Garching, Germany.
- Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany.
- Nils Wagner
- Department of Informatics, Technical University of Munich, Garching, Germany
- Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany
- Lambert Moyon
- Computational Health Center, Helmholtz Center Munich, Munich, Germany
- Klara Kuret
- National Institute of Chemistry, Ljubljana, Slovenia
- The Francis Crick Institute, London, UK
- Jozef Stefan International Postgraduate School, Jamova cesta 39, 1000, Ljubljana, Slovenia
- Nicolas Goedert
- Computational Health Center, Helmholtz Center Munich, Munich, Germany
- Marco Salvatore
- Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Jernej Ule
- National Institute of Chemistry, Ljubljana, Slovenia
- The Francis Crick Institute, London, UK
- Julien Gagneur
- Computational Health Center, Helmholtz Center Munich, Munich, Germany
- Department of Informatics, Technical University of Munich, Garching, Germany
- Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany
- Ole Winther
- Department of Biology, University of Copenhagen, Copenhagen, Denmark.
- Annalisa Marsico
- Computational Health Center, Helmholtz Center Munich, Munich, Germany.
- Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany.
12
Acera Mateos P, Zhou Y, Zarnack K, Eyras E. Concepts and methods for transcriptome-wide prediction of chemical messenger RNA modifications with machine learning. Brief Bioinform 2023; 24:7150742. [PMID: 37139545 DOI: 10.1093/bib/bbad163] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/06/2023] [Revised: 03/03/2023] [Indexed: 05/05/2023] Open
Abstract
The expanding field of epitranscriptomics might rival the epigenome in the diversity of biological processes impacted. In recent years, the development of new high-throughput experimental and computational techniques has been a key driving force in discovering the properties of RNA modifications. Machine learning applications, such as for classification, clustering or de novo identification, have been critical in these advances. Nonetheless, various challenges remain before the full potential of machine learning for epitranscriptomics can be leveraged. In this review, we provide a comprehensive survey of machine learning methods to detect RNA modifications using diverse input data sources. We describe strategies to train and test machine learning methods and to encode and interpret features that are relevant for epitranscriptomics. Finally, we identify some of the current challenges and open questions about RNA modification analysis, including the ambiguity in predicting RNA modifications in transcript isoforms or in single nucleotides, or the lack of complete ground truth sets to test RNA modifications. We believe this review will inspire and benefit the rapidly developing field of epitranscriptomics in addressing the current limitations through the effective use of machine learning.
Affiliation(s)
- Pablo Acera Mateos
  - EMBL Australia Partner Laboratory Network at the Australian National University, Canberra, Australia
  - The Shine-Dalgarno Centre for RNA Innovation, The John Curtin School of Medical Research, Australian National University, Canberra, Australia
  - The Centre for Computational Biomedical Sciences, The John Curtin School of Medical Research, Australian National University, Canberra, Australia
- You Zhou
  - Buchmann Institute for Molecular Life Sciences (BMLS), Goethe University Frankfurt, Max-von-Laue-Str. 15, 60438 Frankfurt a.M., Germany
  - Institute of Molecular Biosciences, Goethe University Frankfurt, Max-von-Laue-Str. 15, 60438 Frankfurt a.M., Germany
- Kathi Zarnack
  - Buchmann Institute for Molecular Life Sciences (BMLS), Goethe University Frankfurt, Max-von-Laue-Str. 15, 60438 Frankfurt a.M., Germany
  - Institute of Molecular Biosciences, Goethe University Frankfurt, Max-von-Laue-Str. 15, 60438 Frankfurt a.M., Germany
- Eduardo Eyras
  - EMBL Australia Partner Laboratory Network at the Australian National University, Canberra, Australia
  - The Shine-Dalgarno Centre for RNA Innovation, The John Curtin School of Medical Research, Australian National University, Canberra, Australia
  - The Centre for Computational Biomedical Sciences, The John Curtin School of Medical Research, Australian National University, Canberra, Australia
|
13
|
Wang X, Zhang M, Long C, Yao L, Zhu M. Self-Attention Based Neural Network for Predicting RNA-Protein Binding Sites. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1469-1479. [PMID: 36067103 DOI: 10.1109/tcbb.2022.3204661] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/04/2023]
Abstract
Proteins that bind to ribonucleic acid (RNA) inside cells are called RNA-binding proteins (RBPs), which play a crucial role in gene regulation. Identifying RNA-protein binding sites helps to better understand the function of RBPs. Although many computational methods have been developed to predict RNA-protein binding sites, their prediction accuracy on small sample datasets needs improvement. To overcome this limitation, we propose a novel model called SA-Net, which utilizes k-mer embedding to encode RNA sequences and a self-attention-based neural network to extract sequence features. K-mer embedding helps the model discover significant subsequence fragments associated with binding sites. The self-attention mechanism captures contextual information from the entire input sequence globally, performing well in small sample sequence learning. Experimental results demonstrate that SA-Net attains state-of-the-art results on the RBP-24 dataset. We find that 4-mer embedding helps the model achieve optimal performance. We also show that the self-attention network outperforms the commonly used CNN and CNN-BLSTM models in sequence feature extraction.
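The k-mer encoding step mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the SA-Net implementation: the function names, the stride of 1, and the integer vocabulary are assumptions made for illustration, since the abstract only states that k-mer embedding is used and that 4-mers performed best.

```python
from itertools import product


def kmer_tokens(seq, k=4):
    """Split an RNA sequence into overlapping k-mers (stride 1, an assumption)."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_vocab(k=4, alphabet="ACGU"):
    """Map every possible k-mer over the alphabet to an integer index."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def encode(seq, k=4):
    """Turn a sequence into k-mer indices; k-mers with unknown bases are skipped."""
    vocab = kmer_vocab(k)
    return [vocab[t] for t in kmer_tokens(seq, k) if t in vocab]


if __name__ == "__main__":
    print(kmer_tokens("ACGUAC"))  # three overlapping 4-mers
    print(encode("ACGUAC"))       # their integer indices
```

The resulting index list is what an embedding layer would typically consume; how SA-Net trains or initializes those embeddings is not stated in the abstract.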
|
14
|
Zhang L, Lu C, Zeng M, Li Y, Wang J. CRMSS: predicting circRNA-RBP binding sites based on multi-scale characterizing sequence and structure features. Brief Bioinform 2023; 24:6889442. [PMID: 36511222 DOI: 10.1093/bib/bbac530] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/21/2022] [Revised: 11/01/2022] [Accepted: 11/07/2022] [Indexed: 12/14/2022] Open
Abstract
Circular RNAs (circRNAs) are reverse-spliced and covalently closed RNAs. Their interactions with RNA-binding proteins (RBPs) have multiple effects on the progress of many diseases. Some computational methods are proposed to identify RBP binding sites on circRNAs but suffer from insufficient accuracy, robustness and explanation. In this study, we first take the characteristics of both RNA and RBP into consideration. We propose a method for discriminating circRNA-RBP binding sites based on multi-scale characterizing sequence and structure features, called CRMSS. For circRNAs, we use sequence k-mer embedding and the forming probabilities of local secondary structures as features. For RBPs, we combine sequence and structure frequencies of RNA-binding domain regions to generate features. We capture binding patterns with multi-scale residual blocks. With BiLSTM and attention mechanism, we obtain the contextual information of high-level representation for circRNA-RBP binding. To validate the effectiveness of CRMSS, we compare its predictive performance with other methods on 37 RBPs. Taking the properties of both circRNAs and RBPs into account, CRMSS achieves superior performance over state-of-the-art methods. In the case study, our model provides reliable predictions and correctly identifies experimentally verified circRNA-RBP pairs. The code of CRMSS is freely available at https://github.com/BioinformaticsCSU/CRMSS.
Affiliation(s)
- Lishen Zhang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, China
- Chengqian Lu
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, China
- Min Zeng
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, China
- Yaohang Li
  - Department of Computer Science at Old Dominion University, USA
- Jianxin Wang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, China
|
15
|
Bheemireddy S, Sandhya S, Srinivasan N, Sowdhamini R. Computational tools to study RNA-protein complexes. Front Mol Biosci 2022; 9:954926. [PMID: 36275618 PMCID: PMC9585174 DOI: 10.3389/fmolb.2022.954926] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/27/2022] [Accepted: 09/20/2022] [Indexed: 11/19/2022] Open
Abstract
RNA is the key player in many cellular processes such as signal transduction, replication, transport, cell division, transcription, and translation. These diverse functions are accomplished through interactions of RNA with proteins. However, protein–RNA interactions are still poorly understood in contrast to protein–protein and protein–DNA interactions. This knowledge gap can be attributed to the limited availability of protein-RNA structures along with the experimental difficulties in studying these complexes. Recent progress in computational resources has expanded the number of tools available for studying protein-RNA interactions at various molecular levels. These include tools for predicting interacting residues from primary sequences, modelling of protein-RNA complexes, predicting hotspots in these complexes and insights into understanding the dynamics of their interactions. Each of these tools has its strengths and limitations, which makes it important to select an optimal approach for the question of interest. Here we present a mini review of computational tools to study different aspects of protein-RNA interactions, with focus on overall application, development of the field and the future perspectives.
Affiliation(s)
- Sneha Bheemireddy
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
- Sankaran Sandhya
  - Department of Biotechnology, Faculty of Life and Allied Health Sciences, M.S. Ramaiah University of Applied Sciences, Bengaluru, India
- Ramanathan Sowdhamini
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
  - National Centre for Biological Sciences, TIFR, GKVK Campus, Bangalore, India
  - Institute of Bioinformatics and Applied Biotechnology, Bangalore, India

*Correspondence: Sankaran Sandhya; Ramanathan Sowdhamini
|
16
|
Muneer A, Fati SM, Arifin Akbar N, Agustriawan D, Tri Wahyudi S. iVaccine-Deep: Prediction of COVID-19 mRNA vaccine degradation using deep learning. Journal of King Saud University. Computer and Information Sciences 2022; 34:7419-7432. [PMID: 38620874 PMCID: PMC8513509 DOI: 10.1016/j.jksuci.2021.10.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/15/2021] [Revised: 08/29/2021] [Accepted: 10/05/2021] [Indexed: 12/14/2022]
Abstract
Messenger RNA (mRNA) has emerged as a critical global technology that requires global joint efforts from different entities to develop a COVID-19 vaccine. However, the chemical properties of RNA pose a challenge in utilizing mRNA as a vaccine candidate. For instance, the molecules are prone to degradation, which has a negative impact on the distribution of mRNA among patients. In addition, little is known of the degradation properties of individual RNA bases in a molecule. Therefore, this study aims to investigate whether hybrid deep learning can predict RNA degradation from RNA sequences. Two deep hybrid neural network models were proposed, namely GCN_GRU and GCN_CNN. The first model is based on graph convolutional neural networks (GCNs) and gated recurrent unit (GRU). The second model is based on GCN and convolutional neural networks (CNNs). Both models were computed over the structural graph of the mRNA molecule. The experimental results showed that the GCN_GRU hybrid model outperforms the GCN_CNN model by a large margin at test time. Validation of the proposed hybrid models is performed with well-known evaluation measures. Among different deep neural networks, the GCN_GRU-based model achieved the best public and private MCRMSE test scores, 0.22614 and 0.34152, respectively. Finally, the GCN_GRU pre-trained model achieved the highest AUC score of 0.938. Such proven outperformance of GCNs indicates that modeling RNA molecules using graphs is critical in understanding molecule degradation mechanisms, which helps in minimizing the aforementioned issues. To show the importance of the proposed GCN_GRU hybrid model, in silico experiments were conducted. The in silico results showed that our model pays local attention when predicting a given position's reactivity and exhibits interesting behavior on neighboring bases in the sequence.
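The abstract reports MCRMSE scores without defining the metric. Assuming the acronym has its usual reading, mean column-wise root mean squared error, a minimal Python sketch is given below. The function name and the toy arrays are hypothetical and do not come from the paper.

```python
import math


def mcrmse(y_true, y_pred):
    """Mean column-wise RMSE: compute RMSE per target column, then average.

    y_true and y_pred are equally shaped lists of rows, one column per
    predicted target (for example, one per degradation condition).
    """
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    total = 0.0
    for j in range(n_cols):
        mse = sum((y_true[i][j] - y_pred[i][j]) ** 2 for i in range(n_rows)) / n_rows
        total += math.sqrt(mse)
    return total / n_cols


if __name__ == "__main__":
    truth = [[0.0, 0.0], [0.0, 0.0]]
    guess = [[1.0, 2.0], [1.0, 2.0]]
    print(mcrmse(truth, guess))  # column RMSEs are 1.0 and 2.0
```

Averaging per-column RMSEs, rather than pooling all errors, keeps a target with a large value range from dominating the score.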
Affiliation(s)
- Amgad Muneer
  - Department of Computer and Information Sciences, Universiti Teknologi PETRONAS, Seri Iskandar 32160, Malaysia
- Suliman Mohamed Fati
  - Information Systems Department, College of Computer and Information Sciences, Prince Sultan University, Riyadh 11586, Saudi Arabia
- Nur Arifin Akbar
  - Research Department, Idenitive Mashable Prototyping, Banyumas 53124, Indonesia
- David Agustriawan
  - Faculty of Bioinformatics, Indonesia International Institute for Life Sciences, Jakarta Timur 13210, Indonesia
|
17
|
Ma H, Wen H, Xue Z, Li G, Zhang Z. RNANetMotif: Identifying sequence-structure RNA network motifs in RNA-protein binding sites. PLoS Comput Biol 2022; 18:e1010293. [PMID: 35819951 PMCID: PMC9275694 DOI: 10.1371/journal.pcbi.1010293] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 09/16/2021] [Accepted: 06/09/2022] [Indexed: 11/19/2022] Open
Abstract
RNA molecules can adopt stable secondary and tertiary structures, which are essential in mediating physical interactions with other partners such as RNA binding proteins (RBPs) and in carrying out their cellular functions. In vivo and in vitro experiments such as RNAcompete and eCLIP have revealed in vitro binding preferences of RBPs to RNA oligomers and in vivo binding sites in cells. Analysis of these binding data showed that the structure properties of the RNAs in these binding sites are important determinants of the binding events; however, it has been a challenge to incorporate the structure information into an interpretable model. Here we describe a new approach, RNANetMotif, which takes predicted secondary structure of thousands of RNA sequences bound by an RBP as input and uses a graph theory approach to recognize enriched subgraphs. These enriched subgraphs are in essence shared sequence-structure elements that are important in RBP-RNA binding. To validate our approach, we performed RNA structure modeling via coarse-grained molecular dynamics folding simulations for 4 selected RBPs, and RNA-protein docking for LIN28B. The simulation results, e.g., solvent accessibility and energetics, further support the biological relevance of the discovered network subgraphs. RNA binding proteins (RBPs) regulate every aspect of RNA biology, including splicing, translation, transportation, and degradation. High-throughput technologies such as eCLIP have identified thousands of binding sites for a given RBP throughout the genome. It has been shown by earlier studies that, in addition to nucleotide sequences, the structure and conformation of RNAs also play an important role in RBP-RNA interactions. Analogous to protein-protein interactions or protein-DNA interactions, it is likely that there exist intrinsic sequence-structure motifs common to these RNAs that underlie their binding specificity to specific RBPs.
It is known that RNAs form energetically favorable secondary structures, which can be represented as graphs, with nucleotides being nodes and backbone covalent bonds and base-pairing hydrogen bonds representing edges. We hypothesize that these graphs can be mined by graph theory approaches to identify sequence-structure motifs as enriched sub-graphs. In this article, we describe the details of this approach, termed RNANetMotif, and the associated new concepts, namely EKS (Extended K-mer Subgraph) and the GraphK graph algorithm. To test the utility of our approach, we conducted 3D structure modeling of selected RNA sequences through molecular dynamics (MD) folding simulation and evaluated the significance of the discovered RNA motifs by comparing their spatial exposure with other regions on the RNA. We believe that this approach has the novelty of treating the RNA sequence as a graph and RBP binding sites as enriched subgraphs, which has broader applications beyond RBP-RNA interactions.
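The graph representation described here (nucleotides as nodes, backbone bonds and base pairs as edges) can be sketched in a few lines of Python. This is only an illustration, not the RNANetMotif code: the function name is hypothetical, and the use of dot-bracket notation for the predicted secondary structure is an assumption, since the abstract does not say how structures are encoded.

```python
def rna_graph(seq, dot_bracket):
    """Build (nodes, edges) from a sequence and a dot-bracket structure.

    Nodes are the nucleotides in order. Each edge is a tuple (i, j, kind),
    where kind is "backbone" for consecutive positions or "basepair" for
    a matched pair of parentheses.
    """
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    # Backbone edges link each nucleotide to its 3' neighbour.
    edges = [(i, i + 1, "backbone") for i in range(len(seq) - 1)]
    stack = []  # positions of currently unmatched "("
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            edges.append((stack.pop(), i, "basepair"))
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return list(seq), edges


if __name__ == "__main__":
    nodes, edges = rna_graph("GGGAAACCC", "(((...)))")
    print(len(nodes), [e for e in edges if e[2] == "basepair"])
```

A subgraph-mining step such as the one the authors describe would operate on many graphs of this kind; that step is not reproduced here.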
Affiliation(s)
- Hongli Ma
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
  - School of Mathematics, Shandong University, Jinan, China
- Han Wen
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Zhiyuan Xue
  - West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Guojun Li
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
  - School of Mathematics, Shandong University, Jinan, China
  - School of Mathematical Science, Liaocheng University, Liaocheng, China
- Zhaolei Zhang
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
|
18
|
Fu X, Bates PA. Application of deep learning methods: From molecular modelling to patient classification. Exp Cell Res 2022; 418:113278. [PMID: 35810775 DOI: 10.1016/j.yexcr.2022.113278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Revised: 06/16/2022] [Accepted: 07/05/2022] [Indexed: 11/28/2022]
Abstract
We are now well into the information-driven age, with complex, heterogeneous datasets in the biological sciences continuing to grow at a rapid pace. Moreover, efforts to distil such datasets to find new governing principles are underway. Leading the surge are new and exciting algorithmic developments in computer simulation and machine learning, most notably for the latter, those centred on deep learning. However, practical applications of cell centric computations within the biological sciences, even when carefully benchmarked against existing experimental datasets, remain challenging. Here we discuss the application of deep learning methodologies to support our understanding of cell functionality and as an aid to patient classification. Whilst comprehensive end-to-end deep learning approaches that utilise knowledge of the cell and its molecular components to aid human disease classification are yet to be implemented, important for opening the door to more effective molecular and cell-based therapies, we illustrate that many deep learning applications have been developed to tackle components of such an ambitious pipeline. We end our discussion on what the future may hold, especially how an integrated framework of computer simulations and deep learning, in conjunction with wet-bench experimentation, could help reveal the governing principles underlying cell functionalities within the tissue environments in which cells operate.
Affiliation(s)
- Xiao Fu
  - Biomolecular Modelling Laboratory, The Francis Crick Institute, 1 Midland Rd, London, NW1 1AT, UK.
- Paul A Bates
  - Biomolecular Modelling Laboratory, The Francis Crick Institute, 1 Midland Rd, London, NW1 1AT, UK.
|
19
|
Du X, Zhao X, Zhang Y. DeepBtoD: Improved RNA-binding proteins prediction via integrated deep learning. J Bioinform Comput Biol 2022; 20:2250006. [PMID: 35451938 DOI: 10.1142/s0219720022500068] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022]
Abstract
RNA-binding proteins (RBPs) have crucial roles in various cellular processes such as alternative splicing and gene regulation. Therefore, the analysis and identification of RBPs is an essential issue. However, although many computational methods have been developed for predicting RBPs, few studies simultaneously consider local and global information from the perspective of the RNA sequence. Facing this challenge, we present a novel method called DeepBtoD, which predicts RBPs directly from RNA sequences. First, a [Formula: see text]-BtoD encoding is designed, which takes into account the composition of [Formula: see text]-nucleotides and their relative positions and forms a local module. Second, we designed a multi-scale convolutional module embedded with a self-attentive mechanism, the ms-focusCNN, which is used to further learn more effective, diverse, and discriminative high-level features. Finally, global information is considered to supplement local modules with ensemble learning to predict whether the target RNA binds to RBPs. Our preliminary results on 24 independent test datasets show that our proposed method can classify RBPs with an area under the curve of 0.933. Remarkably, DeepBtoD shows competitive results against seven state-of-the-art methods, suggesting that RBPs can be highly recognized by integrating local [Formula: see text]-BtoD and global information only from RNA sequences. Hence, our integrative method may be useful to improve the power of RBPs prediction, which might be particularly useful for modeling protein-nucleic acid interactions in systems biology studies. Our DeepBtoD server can be accessed at http://175.27.228.227/DeepBtoD/.
Affiliation(s)
- XiuQuan Du
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, Anhui, P. R. China
  - School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, P. R. China
- XiuJuan Zhao
  - School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, P. R. China
- YanPing Zhang
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, Anhui, P. R. China
|
20
|
Yamada K, Hamada M. Prediction of RNA-protein interactions using a nucleotide language model. Bioinformatics Advances 2022; 2:vbac023. [PMID: 36699410 PMCID: PMC9710633 DOI: 10.1093/bioadv/vbac023] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 07/26/2021] [Revised: 02/28/2022] [Accepted: 04/05/2022] [Indexed: 01/28/2023]
Abstract
Motivation: The accumulation of sequencing data has enabled researchers to predict the interactions between RNA sequences and RNA-binding proteins (RBPs) using novel machine learning techniques. However, existing models are often difficult to interpret and require information in addition to sequences. Bidirectional encoder representations from transformers (BERT) is a language-based deep learning model that is highly interpretable. Therefore, a model based on BERT architecture can potentially overcome such limitations. Results: Here, we propose BERT-RBP as a model to predict RNA-RBP interactions by adapting the BERT architecture pretrained on a human reference genome. Our model outperformed state-of-the-art prediction models using the eCLIP-seq data of 154 RBPs. The detailed analysis further revealed that BERT-RBP could recognize both the transcript region type and RNA secondary structure only based on sequence information. Overall, the results provide insights into the fine-tuning mechanism of BERT in biological contexts and provide evidence of the applicability of the model to other RNA-related problems. Availability and implementation: Python source codes are freely available at https://github.com/kkyamada/bert-rbp. The datasets underlying this article were derived from sources in the public domain: [RBPsuite (http://www.csbio.sjtu.edu.cn/bioinf/RBPsuite/), Ensembl Biomart (http://asia.ensembl.org/biomart/martview/)]. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Keisuke Yamada
  - Department of Electrical Engineering and Bioscience, School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Michiaki Hamada
  - Department of Electrical Engineering and Bioscience, School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Okubo, Shinjuku, Tokyo 169-8555, Japan
|
21
|
Arora V, Sanguinetti G. Challenges for machine learning in RNA-protein interaction prediction. Stat Appl Genet Mol Biol 2022; 21:sagmb-2021-0087. [PMID: 35073469 DOI: 10.1515/sagmb-2021-0087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2021] [Accepted: 01/02/2022] [Indexed: 11/15/2022]
Abstract
RNA-protein interactions have long been recognised as crucial regulators of gene expression. Recently, the development of scalable experimental techniques to measure these interactions has revolutionised the field, leading to the production of large-scale datasets which offer both opportunities and challenges for machine learning techniques. In this brief note, we will discuss some of the major stumbling blocks towards the use of machine learning in computational RNA biology, focusing specifically on the problem of predicting RNA-protein interactions from next-generation sequencing data.
Affiliation(s)
- Viplove Arora
  - Data Science, Department of Physics, International School for Advanced Studies (SISSA), Trieste 34136, Italy
- Guido Sanguinetti
  - Data Science, Department of Physics, International School for Advanced Studies (SISSA), Trieste 34136, Italy
|
22
|
Wei J, Chen S, Zong L, Gao X, Li Y. Protein-RNA interaction prediction with deep learning: structure matters. Brief Bioinform 2022; 23:bbab540. [PMID: 34929730 PMCID: PMC8790951 DOI: 10.1093/bib/bbab540] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 09/29/2021] [Revised: 11/14/2021] [Accepted: 11/22/2021] [Indexed: 12/11/2022] Open
Abstract
Protein-RNA interactions are of vital importance to a variety of cellular activities. Both experimental and computational techniques have been developed to study the interactions. Because of the limitations of previous databases, especially the lack of protein structure data, most of the existing computational methods rely heavily on the sequence data, with only a small portion of the methods utilizing the structural information. Recently, AlphaFold has revolutionized the entire protein and biology field. Foreseeably, protein-RNA interaction prediction will also be advanced significantly in the upcoming years. In this work, we give a thorough review of this field, surveying both the binding site and binding preference prediction problems and covering the commonly used datasets, features and models. We also point out the potential challenges and opportunities in this field. This survey summarizes the development of the RNA-binding protein-RNA interaction field in the past and foresees its future development in the post-AlphaFold era.
Affiliation(s)
- Junkang Wei
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), 999077, Hong Kong SAR, China
- Siyuan Chen
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), 23955-6900, Thuwal, Saudi Arabia
- Licheng Zong
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), 999077, Hong Kong SAR, China
- Xin Gao
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), 23955-6900, Thuwal, Saudi Arabia
- Yu Li
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), 999077, Hong Kong SAR, China
  - The CUHK Shenzhen Research Institute, Hi-Tech Park, 518057, Shenzhen, China
|
23
|
Zhao S, Hamada M. Multi-resBind: a residual network-based multi-label classifier for in vivo RNA binding prediction and preference visualization. BMC Bioinformatics 2021; 22:554. [PMID: 34781902 PMCID: PMC8594109 DOI: 10.1186/s12859-021-04430-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/24/2021] [Accepted: 10/06/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Protein-RNA interactions play key roles in many processes regulating gene expression. To understand the underlying binding preference, ultraviolet cross-linking and immunoprecipitation (CLIP)-based methods have been used to identify the binding sites for hundreds of RNA-binding proteins (RBPs) in vivo. Using these large-scale experimental data to infer RNA binding preference and predict missing binding sites has become a great challenge. Some existing deep-learning models have demonstrated high prediction accuracy for individual RBPs. However, it remains difficult to avoid significant bias due to the experimental protocol. The DeepRiPe method was recently developed to solve this problem via introducing multi-task or multi-label learning into this field. However, this method has not reached an ideal level of prediction power due to the weak neural network architecture. RESULTS: Compared to the DeepRiPe approach, our Multi-resBind method demonstrated substantial improvements using the same large-scale PAR-CLIP dataset with respect to an increase in the area under the receiver operating characteristic curve and average precision. We conducted extensive experiments to evaluate the impact of various types of input data on the final prediction accuracy. The same approach was used to evaluate the effect of loss functions. Finally, a modified integrated gradient was employed to generate attribution maps. The patterns disentangled from relative contributions according to context offer biological insights into the underlying mechanism of protein-RNA interactions. CONCLUSIONS: Here, we propose Multi-resBind as a new multi-label deep-learning approach to infer protein-RNA binding preferences and predict novel interactions. The results clearly demonstrate that Multi-resBind is a promising tool to predict unknown binding sites in vivo and gain biological insights into why the neural network makes a given prediction.
Affiliation(s)
- Shitao Zhao
  - Waseda Research Institute for Science and Engineering, Waseda University, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555, Japan.
- Michiaki Hamada
  - Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555, Japan.
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555, Japan.
  - Graduate School of Medicine, Nippon Medical School, 1-1-5 Sendagi, Bunkyo-ku, Tokyo, 113-8602, Japan.
|
24
|
Busa VF, Favorov AV, Fertig EJ, Leung AK. Spatial correlation statistics enable transcriptome-wide characterization of RNA structure binding. Cell Reports Methods 2021; 1:100088. [PMID: 35474897 PMCID: PMC9017189 DOI: 10.1016/j.crmeth.2021.100088] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/15/2020] [Revised: 06/23/2021] [Accepted: 08/30/2021] [Indexed: 11/20/2022]
Abstract
Molecular interactions at identical transcriptomic locations or at proximal but non-overlapping sites can mediate RNA modification and regulation, necessitating tools to uncover these spatial relationships. We present nearBynding, a flexible algorithm and software pipeline that models spatial correlation between transcriptome-wide tracks from diverse data types. nearBynding can process and correlate interval as well as continuous data and incorporate experimentally derived or in silico predicted transcriptomic tracks. nearBynding offers visualization functions for its statistics to identify colocalizations and adjacent features. We demonstrate the application of nearBynding to correlate RNA-binding protein (RBP) binding preferences with other RBPs, RNA structure, or RNA modification. By cross-correlating RBP binding and RNA structure data, we demonstrate that nearBynding recapitulates known RBP binding to structural motifs and provides biological insights into RBP binding preference of G-quadruplexes. nearBynding is available as an R/Bioconductor package and can run on a personal computer, making correlation of transcriptomic features broadly accessible.
Affiliation(s)
- Veronica F. Busa
- McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Biochemistry and Molecular Biology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205, USA
- Alexander V. Favorov
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Laboratory of Systems Biology and Computational Genetics, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- Elana J. Fertig
- McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Biomedical Engineering, Johns Hopkins University Whiting School of Engineering, Baltimore, MD 21205, USA
- Department of Applied Mathematics and Statistics, Johns Hopkins University Whiting School of Engineering, Baltimore, MD 21205, USA
- Anthony K.L. Leung
- McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Biochemistry and Molecular Biology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205, USA
- Department of Molecular Biology and Genetics, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
|
25
|
Zhang XM, Liang L, Liu L, Tang MJ. Graph Neural Networks and Their Current Applications in Bioinformatics. Front Genet 2021; 12:690049. [PMID: 34394185 PMCID: PMC8360394 DOI: 10.3389/fgene.2021.690049] [Citation(s) in RCA: 69] [Impact Index Per Article: 17.3] [Received: 04/02/2021] [Accepted: 05/28/2021] [Indexed: 12/22/2022] Open
Abstract
Graph neural networks (GNNs), as a branch of deep learning in non-Euclidean space, perform particularly well in various tasks that process graph structure data. With the rapid accumulation of biological network data, GNNs have also become an important tool in bioinformatics. In this research, a systematic survey of GNNs and their advances in bioinformatics is presented from multiple perspectives. We first introduce some commonly used GNN models and their basic principles. Then, three representative tasks are proposed based on the three levels of structural information that can be learned by GNNs: node classification, link prediction, and graph generation. Meanwhile, according to the specific applications for various omics data, we categorize and discuss the related studies in three aspects: disease prediction, drug discovery, and biomedical imaging. Based on the analysis, we provide an outlook on the shortcomings of current studies and point out their developing prospect. Although GNNs have achieved excellent results in many biological tasks at present, they still face challenges in terms of low-quality data processing, methodology, and interpretability and have a long road ahead. We believe that GNNs are potentially an excellent method that solves various biological problems in bioinformatics research.
Affiliation(s)
- Xiao-Meng Zhang
- School of Information, Yunnan Normal University, Kunming, China
- Li Liang
- School of Information, Yunnan Normal University, Kunming, China
- Lin Liu
- School of Information, Yunnan Normal University, Kunming, China
- Key Laboratory of Educational Informatization for Nationalities Ministry of Education, Yunnan Normal University, Kunming, China
- Ming-Jing Tang
- Key Laboratory of Educational Informatization for Nationalities Ministry of Education, Yunnan Normal University, Kunming, China
- School of Life Sciences, Yunnan Normal University, Kunming, China
|
26
|
Yu H, Shen ZA, Zhou YK, Du PF. Recent advances in predicting protein-lncRNA interactions using machine learning methods. Curr Gene Ther 2021; 22:228-244. [PMID: 34254917 DOI: 10.2174/1566523221666210712190718] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 01/18/2021] [Revised: 05/01/2021] [Accepted: 05/31/2021] [Indexed: 11/22/2022]
Abstract
Long non-coding RNAs (LncRNAs) are a type of RNA with little or no protein-coding ability. Their length is more than 200 nucleotides. A large number of studies have indicated that lncRNAs play a significant role in various biological processes, including chromatin organizations, epigenetic programmings, transcriptional regulations, post-transcriptional processing, and circadian mechanism at the cellular level. Since lncRNAs perform vast functions through their interactions with proteins, identifying lncRNA-protein interaction is crucial to the understandings of the lncRNA molecular functions. However, due to the high cost and time-consuming disadvantage of experimental methods, a variety of computational methods have emerged. Recently, many effective and novel machine learning methods have been developed. In general, these methods fall into two categories: semi-supervised learning methods and supervised learning methods. The latter category can be further classified into the deep learning-based method, the ensemble learning-based method, and the hybrid method. In this paper, we focused on supervised learning methods. We summarized the state-of-the-art methods in predicting lncRNA-protein interactions. Furthermore, the performance and the characteristics of different methods have also been compared in this work. Considering the limits of the existing models, we analyzed the problems and discussed future research potentials.
Affiliation(s)
- Han Yu
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Zi-Ang Shen
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Yuan-Ke Zhou
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Pu-Feng Du
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
|