1
Qiao Y, Yang R, Liu Y, Chen J, Zhao L, Huo P, Wang Z, Bu D, Wu Y, Zhao Y. DeepFusion: A deep bimodal information fusion network for unraveling protein-RNA interactions using in vivo RNA structures. Comput Struct Biotechnol J 2024; 23:617-625. [PMID: 38274994 PMCID: PMC10808905 DOI: 10.1016/j.csbj.2023.12.040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 12/04/2023] [Accepted: 12/26/2023] [Indexed: 01/27/2024] Open
Abstract
RNA-binding proteins (RBPs) are key post-transcriptional regulators, and the malfunctions of RBP-RNA binding lead to diverse human diseases. However, prediction of RBP binding sites is largely based on RNA sequence features, whereas in vivo RNA structural features based on high-throughput sequencing are rarely incorporated. Here, we designed a deep bimodal information fusion network called DeepFusion for unraveling protein-RNA interactions by incorporating structural features derived from DMS-seq data. DeepFusion integrates two sub-models to extract local motif-like information and long-term context information. We show that DeepFusion performs best compared with other cutting-edge methods with only sequence inputs on two datasets. DeepFusion's performance is further improved with bimodal input after adding in vivo DMS-seq structural features. Furthermore, DeepFusion can be used for analyzing RNA degradation, demonstrating significantly different RBP-binding scores in genes with slow degradation rates versus those with rapid degradation rates. DeepFusion thus provides enhanced abilities for further analysis of functional RNAs. DeepFusion's code and data are available at http://bioinfo.org/deepfusion/.
Affiliation(s)
- Yixuan Qiao
  - Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Rui Yang
  - Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Yang Liu
  - Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Jiaxin Chen
  - Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Lianhe Zhao
  - Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Peipei Huo
  - Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Zhihao Wang
  - Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Dechao Bu
  - Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Yang Wu
  - Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Yi Zhao
  - Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
2
Rakowski A, Monti R, Huryn V, Lemanczyk M, Ohler U, Lippert C. Metadata-guided feature disentanglement for functional genomics. Bioinformatics 2024; 40:ii4-ii10. [PMID: 39230700 PMCID: PMC11373386 DOI: 10.1093/bioinformatics/btae403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024] Open
Abstract
With the development of high-throughput technologies, genomics datasets rapidly grow in size, including functional genomics data. This has allowed the training of large Deep Learning (DL) models to predict epigenetic readouts, such as protein binding or histone modifications, from genome sequences. However, large dataset sizes come at a price of data consistency, often aggregating results from a large number of studies, conducted under varying experimental conditions. While data from large-scale consortia are useful as they allow studying the effects of different biological conditions, they can also contain unwanted biases from confounding experimental factors. Here, we introduce Metadata-guided Feature Disentanglement (MFD), an approach that allows disentangling biologically relevant features from potential technical biases. MFD incorporates target metadata into model training, by conditioning weights of the model output layer on different experimental factors. It then separates the factors into disjoint groups and enforces independence of the corresponding feature subspaces with an adversarially learned penalty. We show that the metadata-driven disentanglement approach allows for better model introspection, by connecting latent features to experimental factors, without compromising, or even improving performance in downstream tasks, such as enhancer prediction, or genetic variant discovery. The code will be made available at https://github.com/HealthML/MFD.
Affiliation(s)
- Alexander Rakowski
  - Digital Health Machine Learning, Hasso Plattner Institute for Digital Engineering, Digital Engineering, University of Potsdam, Campus III Building G2, Rudolf-Breitscheid-Strasse 187, Potsdam, Brandenburg, 14482, Germany
- Remo Monti
  - Digital Health Machine Learning, Hasso Plattner Institute for Digital Engineering, Digital Engineering, University of Potsdam, Campus III Building G2, Rudolf-Breitscheid-Strasse 187, Potsdam, Brandenburg, 14482, Germany
  - Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Berlin Institute for Medical Systems Biology, Department of Biology, Humboldt Universität Berlin, Hannoversche Strasse 28, Building 101, Room 1.05, Berlin, 10115, Germany
- Viktoriia Huryn
  - Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Berlin Institute for Medical Systems Biology, Department of Biology, Humboldt Universität Berlin, Hannoversche Strasse 28, Building 101, Room 1.05, Berlin, 10115, Germany
- Marta Lemanczyk
  - Data Analytics and Computational Statistics, Hasso Plattner Institute for Digital Engineering, Digital Engineering, University of Potsdam, Potsdam, Brandenburg, 14482, Germany
- Uwe Ohler
  - Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Berlin Institute for Medical Systems Biology, Department of Biology, Humboldt Universität Berlin, Hannoversche Strasse 28, Building 101, Room 1.05, Berlin, 10115, Germany
- Christoph Lippert
  - Digital Health Machine Learning, Hasso Plattner Institute for Digital Engineering, Digital Engineering, University of Potsdam, Campus III Building G2, Rudolf-Breitscheid-Strasse 187, Potsdam, Brandenburg, 14482, Germany
  - Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, United States of America
3
Hervoso JL, Amoah K, Dodson J, Choudhury M, Bhattacharya A, Quinones-Valdez G, Pasaniuc B, Xiao X. Splicing-specific transcriptome-wide association uncovers genetic mechanisms for schizophrenia. Am J Hum Genet 2024; 111:1573-1587. [PMID: 38925119 PMCID: PMC11339621 DOI: 10.1016/j.ajhg.2024.06.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2023] [Revised: 05/28/2024] [Accepted: 06/03/2024] [Indexed: 06/28/2024] Open
Abstract
Recent studies have highlighted the essential role of RNA splicing, a key mechanism of alternative RNA processing, in establishing connections between genetic variations and disease. Genetic loci influencing RNA splicing variations show considerable influence on complex traits, possibly surpassing those affecting total gene expression. Dysregulated RNA splicing has emerged as a major potential contributor to neurological and psychiatric disorders, likely due to the exceptionally high prevalence of alternatively spliced genes in the human brain. Nevertheless, establishing direct associations between genetically altered splicing and complex traits has remained an enduring challenge. We introduce Spliced-Transcriptome-Wide Associations (SpliTWAS) to integrate alternative splicing information with genome-wide association studies to pinpoint genes linked to traits through exon splicing events. We applied SpliTWAS to two schizophrenia (SCZ) RNA-sequencing datasets, BrainGVEX and CommonMind, revealing 137 and 88 trait-associated exons (in 84 and 67 genes), respectively. Enriched biological functions in the associated gene sets converged on neuronal function and development, immune cell activation, and cellular transport, which are highly relevant to SCZ. SpliTWAS variants impacted RNA-binding protein binding sites, revealing potential disruption of RNA-protein interactions affecting splicing. We extended the probabilistic fine-mapping method FOCUS to the exon level, identifying 36 genes and 48 exons as putatively causal for SCZ. We highlight VPS45 and APOPT1, where splicing of specific exons was associated with disease risk, eluding detection by conventional gene expression analysis. Collectively, this study supports the substantial role of alternative splicing in shaping the genetic basis of SCZ, providing a valuable approach for future investigations in this area.
Affiliation(s)
- Jonatan L Hervoso
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Kofi Amoah
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Jack Dodson
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Mudra Choudhury
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Arjun Bhattacharya
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Giovanni Quinones-Valdez
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Bogdan Pasaniuc
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Xinshu Xiao
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, CA 90095, USA
4
Sokolova K, Chen KM, Hao Y, Zhou J, Troyanskaya OG. Deep Learning Sequence Models for Transcriptional Regulation. Annu Rev Genomics Hum Genet 2024; 25:105-122. [PMID: 38594933 DOI: 10.1146/annurev-genom-021623-024727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/11/2024]
Abstract
Deciphering the regulatory code of gene expression and interpreting the transcriptional effects of genome variation are critical challenges in human genetics. Modern experimental technologies have resulted in an abundance of data, enabling the development of sequence-based deep learning models that link patterns embedded in DNA to the biochemical and regulatory properties contributing to transcriptional regulation, including modeling epigenetic marks, 3D genome organization, and gene expression, with tissue and cell-type specificity. Such methods can predict the functional consequences of any noncoding variant in the human genome, even rare or never-before-observed variants, and systematically characterize their consequences beyond what is tractable from experiments or quantitative genetics studies alone. Recently, the development and application of interpretability approaches have led to the identification of key sequence patterns contributing to the predicted tasks, providing insights into the underlying biological mechanisms learned and revealing opportunities for improvement in future models.
Affiliation(s)
- Ksenia Sokolova
  - Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
- Kathleen M Chen
  - Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
- Yun Hao
  - Flatiron Institute, Simons Foundation, New York, NY, USA
- Jian Zhou
  - Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Olga G Troyanskaya
  - Princeton Precision Health, Princeton University, Princeton, New Jersey, USA
  - Flatiron Institute, Simons Foundation, New York, NY, USA
  - Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
5
Rennie S. Deep Learning for Elucidating Modifications to RNA-Status and Challenges Ahead. Genes (Basel) 2024; 15:629. [PMID: 38790258 PMCID: PMC11121098 DOI: 10.3390/genes15050629] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2024] [Revised: 05/11/2024] [Accepted: 05/11/2024] [Indexed: 05/26/2024] Open
Abstract
RNA-binding proteins and chemical modifications to RNA play vital roles in the co- and post-transcriptional regulation of genes. In order to fully decipher their biological roles, it is an essential task to catalogue their precise target locations along with their preferred contexts and sequence-based determinants. Recently, deep learning approaches have significantly advanced in this field. These methods can predict the presence or absence of modification at specific genomic regions based on diverse features, particularly sequence and secondary structure, allowing us to decipher the highly non-linear sequence patterns and structures that underlie site preferences. This article provides an overview of how deep learning is being applied to this area, with a particular focus on the problem of mRNA-RBP binding, while also considering other types of chemical modification to RNA. It discusses how different types of model can handle sequence-based and/or secondary-structure-based inputs, the process of model training, including choice of negative regions and separating sets for testing and training, and offers recommendations for developing biologically relevant models. Finally, it highlights four key areas that are crucial for advancing the field.
Affiliation(s)
- Sarah Rennie
  - Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200 Copenhagen, Denmark
6
Fu T, Amoah K, Chan TW, Bahn JH, Lee JH, Terrazas S, Chong R, Kosuri S, Xiao X. Massively parallel screen uncovers many rare 3' UTR variants regulating mRNA abundance of cancer driver genes. Nat Commun 2024; 15:3335. [PMID: 38637555 PMCID: PMC11026479 DOI: 10.1038/s41467-024-46795-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2023] [Accepted: 03/06/2024] [Indexed: 04/20/2024] Open
Abstract
Understanding the function of rare non-coding variants represents a significant challenge. Using MapUTR, a screening method, we studied the function of rare 3' UTR variants affecting mRNA abundance post-transcriptionally. Among 17,301 rare gnomAD variants, an average of 24.5% were functional, with 70% in cancer-related genes, many in critical cancer pathways. This observation motivated an interrogation of 11,929 somatic mutations, uncovering 3928 (33%) functional mutations in 155 cancer driver genes. Functional MapUTR variants were enriched in microRNA- or protein-binding sites and may underlie outlier gene expression in tumors. Further, we introduce untranslated tumor mutational burden (uTMB), a metric reflecting the amount of somatic functional MapUTR variants of a tumor and show its potential in predicting patient survival. Through prime editing, we characterized three variants in cancer-relevant genes (MFN2, FOSL2, and IRAK1), demonstrating their cancer-driving potential. Our study elucidates the function of tens of thousands of non-coding variants, nominates non-coding cancer driver mutations, and demonstrates their potential contributions to cancer.
Affiliation(s)
- Ting Fu
  - Molecular, Cellular and Integrative Physiology Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Kofi Amoah
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Tracey W Chan
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Jae Hoon Bahn
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Jae-Hyung Lee
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Department of Life and Nanopharmaceutical Sciences & Oral Microbiology, School of Dentistry, Kyung Hee University, Seoul, South Korea
- Sari Terrazas
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Molecular Biology Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Rockie Chong
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Sriram Kosuri
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Xinshu Xiao
  - Molecular, Cellular and Integrative Physiology Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Molecular Biology Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, Los Angeles, CA, 90095, USA
7
Wu H, Liu X, Fang Y, Yang Y, Huang Y, Pan X, Shen HB. Decoding protein binding landscape on circular RNAs with base-resolution transformer models. Comput Biol Med 2024; 171:108175. [PMID: 38402841 DOI: 10.1016/j.compbiomed.2024.108175] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2023] [Revised: 01/16/2024] [Accepted: 02/18/2024] [Indexed: 02/27/2024]
Abstract
Circular RNAs (circRNAs), a class of endogenous RNA with a covalent loop structure, can regulate gene expression by serving as sponges for microRNAs and RNA-binding proteins (RBPs). To date, most computational methods for predicting RBP binding sites on circRNAs focus on circRNA fragments instead of circRNAs. These methods detect whether a circRNA fragment contains binding sites, but cannot determine where the binding sites are or how many there are on the circRNA transcript. We report a hybrid deep learning-based tool, CircSite, to predict RBP binding sites at single-nucleotide resolution and detect key contributing nucleotides on circRNA transcripts. CircSite takes advantage of convolutional neural networks (CNNs) and Transformer for learning local and global representations of circRNAs binding to RBPs, respectively. We construct 37 datasets of circRNAs interacting with proteins for benchmarking and the experimental results show that CircSite offers accurate predictions of RBP binding nucleotides and detects key subsequences aligning well with known binding motifs. CircSite is an easy-to-use online webserver for predicting RBP binding sites on circRNA transcripts and is freely available at http://www.csbio.sjtu.edu.cn/bioinf/CircSite/.
Affiliation(s)
- Hehe Wu
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
- Xiaojian Liu
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
- Yi Fang
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
- Yang Yang
  - Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
- Yan Huang
  - State Key Laboratory of Infrared Physics, Shanghai Institute of Technical Physics Chinese Academy of Sciences, 500 Yutian Road, Shanghai, 200083, China
- Xiaoyong Pan
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
- Hong-Bin Shen
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
8
Toussaint PA, Leiser F, Thiebes S, Schlesner M, Brors B, Sunyaev A. Explainable artificial intelligence for omics data: a systematic mapping study. Brief Bioinform 2023; 25:bbad453. [PMID: 38113073 PMCID: PMC10729786 DOI: 10.1093/bib/bbad453] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2022] [Revised: 07/28/2023] [Accepted: 11/08/2023] [Indexed: 12/21/2023] Open
Abstract
Researchers increasingly turn to explainable artificial intelligence (XAI) to analyze omics data and gain insights into the underlying biological processes. Yet, given the interdisciplinary nature of the field, many findings have only been shared in their respective research community. An overview of XAI for omics data is needed to highlight promising approaches and help detect common issues. Toward this end, we conducted a systematic mapping study. To identify relevant literature, we queried Scopus, PubMed, Web of Science, BioRxiv, MedRxiv and arXiv. Based on keywording, we developed a coding scheme with 10 facets regarding the studies' AI methods, explainability methods and omics data. Our mapping study resulted in 405 included papers published between 2010 and 2023. The inspected papers analyze DNA-based (mostly genomic), transcriptomic, proteomic or metabolomic data by means of neural networks, tree-based methods, statistical methods and further AI methods. The preferred post-hoc explainability methods are feature relevance (n = 166) and visual explanation (n = 52), while papers using interpretable approaches often resort to the use of transparent models (n = 83) or architecture modifications (n = 72). With many research gaps still apparent for XAI for omics data, we deduced eight research directions and discuss their potential for the field. We also provide exemplary research questions for each direction. Many problems with the adoption of XAI for omics data in clinical practice are yet to be resolved. This systematic mapping study outlines extant research on the topic and provides research directions for researchers and practitioners.
Affiliation(s)
- Philipp A Toussaint
  - Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
  - HIDSS4Health – Helmholtz Information and Data Science School for Health, Karlsruhe, Heidelberg, Germany
- Florian Leiser
  - Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Scott Thiebes
  - Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Matthias Schlesner
  - Biomedical Informatics, Data Mining and Data Analytics, Faculty of Applied Computer Science and Medical Faculty, University of Augsburg, Augsburg, Germany
- Benedikt Brors
  - Division of Applied Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
  - Translational Oncology, National Center for Tumor Diseases, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Ali Sunyaev
  - Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
9
Vaculík O, Chalupová E, Grešová K, Majtner T, Alexiou P. Transfer Learning Allows Accurate RBP Target Site Prediction with Limited Sample Sizes. Biology (Basel) 2023; 12:1276. [PMID: 37886986 PMCID: PMC10604046 DOI: 10.3390/biology12101276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 09/19/2023] [Accepted: 09/21/2023] [Indexed: 10/28/2023]
Abstract
RNA-binding proteins are vital regulators in numerous biological processes. Their dysfunction can result in diverse diseases, such as cancer or neurodegenerative disorders, making the prediction of their binding sites of high importance. Deep learning (DL) has brought about a revolution in various biological domains, including the field of protein-RNA interactions. Nonetheless, several challenges persist, such as the limited availability of experimentally validated binding sites to train well-performing DL models for the majority of proteins. Here, we present a novel training approach based on transfer learning (TL) to address the issue of limited data. Employing a sophisticated and interpretable architecture, we compare the performance of our method trained using two distinct approaches: training from scratch (SCR) and utilizing TL. Additionally, we benchmark our results against the current state-of-the-art methods. Furthermore, we tackle the challenges associated with selecting appropriate input features and determining optimal interval sizes. Our results show that TL enhances model performance, particularly in datasets with minimal training data, where satisfactory results can be achieved with just a few hundred RNA binding sites. Moreover, we demonstrate that integrating both sequence and evolutionary conservation information leads to superior performance. Additionally, we showcase how incorporating an attention layer into the model facilitates the interpretation of predictions within a biologically relevant context.
Affiliation(s)
- Ondřej Vaculík
  - Central European Institute of Technology (CEITEC), Masaryk University, 625 00 Brno, Czech Republic
  - Faculty of Science, National Centre for Biomolecular Research, Masaryk University, 625 00 Brno, Czech Republic
- Eliška Chalupová
  - Faculty of Science, National Centre for Biomolecular Research, Masaryk University, 625 00 Brno, Czech Republic
- Katarína Grešová
  - Central European Institute of Technology (CEITEC), Masaryk University, 625 00 Brno, Czech Republic
  - Faculty of Science, National Centre for Biomolecular Research, Masaryk University, 625 00 Brno, Czech Republic
- Tomáš Majtner
  - Central European Institute of Technology (CEITEC), Masaryk University, 625 00 Brno, Czech Republic
  - Department of Molecular Sociology, Max Planck Institute of Biophysics, 60439 Frankfurt am Main, Germany
- Panagiotis Alexiou
  - Central European Institute of Technology (CEITEC), Masaryk University, 625 00 Brno, Czech Republic
  - Department of Applied Biomedical Science, Faculty of Health Sciences, University of Malta, MSD 2080 Msida, Malta
  - Centre for Molecular Medicine & Biobanking, University of Malta, MSD 2080 Msida, Malta
10
Horlacher M, Cantini G, Hesse J, Schinke P, Goedert N, Londhe S, Moyon L, Marsico A. A systematic benchmark of machine learning methods for protein-RNA interaction prediction. Brief Bioinform 2023; 24:bbad307. [PMID: 37635383 PMCID: PMC10516373 DOI: 10.1093/bib/bbad307] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/31/2023] [Revised: 06/15/2023] [Accepted: 07/18/2023] [Indexed: 08/29/2023] Open
Abstract
RNA-binding proteins (RBPs) are central actors of RNA post-transcriptional regulation. Experiments to profile binding sites of RBPs in vivo are limited to transcripts expressed in the experimental cell type, creating the need for computational methods to infer missing binding information. While numerous machine-learning-based methods have been developed for this task, their use of heterogeneous training and evaluation datasets across different sets of RBPs and CLIP-seq protocols makes a direct comparison of their performance difficult. Here, we compile a set of 37 machine learning (primarily deep learning) methods for in vivo RBP-RNA interaction prediction and systematically benchmark a subset of 11 representative methods across hundreds of CLIP-seq datasets and RBPs. Using homogenized sample pre-processing and two negative-class sample generation strategies, we evaluate methods in terms of predictive performance and assess the impact of neural network architectures and input modalities on model performance. We believe that this study will not only enable researchers to choose the optimal prediction method for their tasks at hand, but also aid method developers in developing novel, high-performing methods by introducing a standardized framework for their evaluation.
Collapse
Affiliation(s)
- Marc Horlacher
- Computational Health Center, Helmholtz Center Munich, Germany
- School of Computation, Information and Technology, Technical University Munich (TUM), Germany
| | - Giulia Cantini
- Computational Health Center, Helmholtz Center Munich, Germany
| | - Julian Hesse
- Computational Health Center, Helmholtz Center Munich, Germany
| | - Patrick Schinke
- Computational Health Center, Helmholtz Center Munich, Germany
| | - Nicolas Goedert
- Computational Health Center, Helmholtz Center Munich, Germany
| | | | - Lambert Moyon
- Computational Health Center, Helmholtz Center Munich, Germany
|
11
|
Wang Y, Wei Z, Su J, Coenen F, Meng J. RgnTX: Colocalization analysis of transcriptome elements in the presence of isoform heterogeneity and ambiguity. Comput Struct Biotechnol J 2023; 21:4110-4117. [PMID: 37671241 PMCID: PMC10475473 DOI: 10.1016/j.csbj.2023.08.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2023] [Revised: 08/13/2023] [Accepted: 08/23/2023] [Indexed: 09/07/2023] Open
Abstract
Colocalization analysis of genomic region sets has been widely adopted to unveil potential functional interactions between corresponding biological attributes, which often serves as the basis for further investigation. A number of methods have been developed for colocalization analysis of genomic elements. However, none of them explicitly considers transcriptome heterogeneity and isoform ambiguity, making them less appropriate for analyzing transcriptome elements. Here, we developed RgnTX, an R/Bioconductor tool for the colocalization analysis of transcriptome elements with permutation tests. Unlike existing approaches, RgnTX directly takes advantage of transcriptome annotation and offers high flexibility in the null model to simulate a realistic transcriptome-wide background, such as complex alternative splicing patterns. Importantly, it supports the testing of transcriptome elements without clear isoform association, which is often the real scenario due to technical limitations. The package offers a wide selection of pre-defined functions that users can readily apply to visualize permutation results, calculate shifted z-scores, and conduct multiple hypothesis testing under Benjamini-Hochberg correction. Moreover, with synthetic and real datasets, we show that RgnTX's novel testing modes return distinct and more significant results compared to existing genome-based methods. We believe RgnTX should be a useful tool for characterizing the randomness of the transcriptome and for conducting statistical association analysis for genomic region sets within the heterogeneous transcriptome. The package has now been accepted by Bioconductor and is freely available at: https://bioconductor.org/packages/RgnTX.
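The abstract above mentions multiple hypothesis testing under Benjamini-Hochberg correction. As a rough illustration of that correction only, the following Python sketch computes BH-adjusted p-values. It is not taken from RgnTX (an R package); the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Illustrative sketch only; not the RgnTX implementation.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values
    # never increase as the raw p-value decreases (monotonicity).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values from four permutation tests.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A test is then called significant when its adjusted p-value falls below the chosen false discovery rate.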
Affiliation(s)
- Yue Wang
- Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Department of Computer Science, University of Liverpool, L69 7ZB Liverpool, United Kingdom
| | - Zhen Wei
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L69 7ZB Liverpool, United Kingdom
| | - Jionglong Su
- School of AI and Advanced Computing, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
| | - Frans Coenen
- Department of Computer Science, University of Liverpool, L69 7ZB Liverpool, United Kingdom
| | - Jia Meng
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- AI University Research Centre, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L69 7ZB Liverpool, United Kingdom
|
12
|
Monti R, Ohler U. Toward Identification of Functional Sequences and Variants in Noncoding DNA. Annu Rev Biomed Data Sci 2023; 6:191-210. [PMID: 37262323 DOI: 10.1146/annurev-biodatasci-122120-110102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]
Abstract
Understanding the noncoding part of the genome, which encodes gene regulation, is necessary to identify genetic mechanisms of disease and translate findings from genome-wide association studies into actionable results for treatments and personalized care. Here we provide an overview of the computational analysis of noncoding regions, starting from gene-regulatory mechanisms and their representation in data. Deep learning methods, when applied to these data, highlight important regulatory sequence elements and predict the functional effects of genetic variants. These and other algorithms are used to predict damaging sequence variants. Finally, we introduce rare-variant association tests that incorporate functional annotations and predictions in order to increase interpretability and statistical power.
Affiliation(s)
- Remo Monti
- Max Delbrück Center for Molecular Medicine (MDC), Helmholtz Association of German Research Centers, Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany;
- Digital Health-Machine Learning, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
| | - Uwe Ohler
- Max Delbrück Center for Molecular Medicine (MDC), Helmholtz Association of German Research Centers, Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany;
|
13
|
Horlacher M, Wagner N, Moyon L, Kuret K, Goedert N, Salvatore M, Ule J, Gagneur J, Winther O, Marsico A. Towards in silico CLIP-seq: predicting protein-RNA interaction via sequence-to-signal learning. Genome Biol 2023; 24:180. [PMID: 37542318 PMCID: PMC10403857 DOI: 10.1186/s13059-023-03015-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/23/2022] [Accepted: 07/17/2023] [Indexed: 08/06/2023] Open
Abstract
We present RBPNet, a novel deep learning method, which predicts CLIP-seq crosslink count distribution from RNA sequence at single-nucleotide resolution. By training on up to a million regions, RBPNet achieves high generalization on eCLIP, iCLIP and miCLIP assays, outperforming state-of-the-art classifiers. RBPNet performs bias correction by modeling the raw signal as a mixture of the protein-specific and background signal. Through model interrogation via Integrated Gradients, RBPNet identifies predictive sub-sequences that correspond to known and novel binding motifs and enables variant-impact scoring via in silico mutagenesis. Together, RBPNet improves imputation of protein-RNA interactions, as well as mechanistic interpretation of predictions.
Affiliation(s)
- Marc Horlacher
- Computational Health Center, Helmholtz Center Munich, Munich, Germany.
- Department of Biology, University of Copenhagen, Copenhagen, Denmark.
- Department of Informatics, Technical University of Munich, Garching, Germany.
- Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany.
| | - Nils Wagner
- Department of Informatics, Technical University of Munich, Garching, Germany
- Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany
| | - Lambert Moyon
- Computational Health Center, Helmholtz Center Munich, Munich, Germany
| | - Klara Kuret
- National Institute of Chemistry, Ljubljana, Slovenia
- The Francis Crick Institute, London, UK
- Jozef Stefan International Postgraduate School, Jamova cesta 39, 1000, Ljubljana, Slovenia
| | - Nicolas Goedert
- Computational Health Center, Helmholtz Center Munich, Munich, Germany
| | - Marco Salvatore
- Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Jernej Ule
- National Institute of Chemistry, Ljubljana, Slovenia
- The Francis Crick Institute, London, UK
| | - Julien Gagneur
- Computational Health Center, Helmholtz Center Munich, Munich, Germany
- Department of Informatics, Technical University of Munich, Garching, Germany
- Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany
| | - Ole Winther
- Department of Biology, University of Copenhagen, Copenhagen, Denmark.
| | - Annalisa Marsico
- Computational Health Center, Helmholtz Center Munich, Munich, Germany.
- Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany.
|
14
|
Boyle EA, Her HL, Mueller JR, Naritomi JT, Nguyen GG, Yeo GW. Skipper analysis of eCLIP datasets enables sensitive detection of constrained translation factor binding sites. Cell Genomics 2023; 3:100317. [PMID: 37388912 PMCID: PMC10300551 DOI: 10.1016/j.xgen.2023.100317] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 10/20/2022] [Revised: 02/17/2023] [Accepted: 04/06/2023] [Indexed: 07/01/2023]
Abstract
Technology for crosslinking and immunoprecipitation (CLIP) followed by sequencing (CLIP-seq) has identified the transcriptomic targets of hundreds of RNA-binding proteins in cells. To increase the power of existing and future CLIP-seq datasets, we introduce Skipper, an end-to-end workflow that converts unprocessed reads into annotated binding sites using an improved statistical framework. Compared with existing methods, Skipper on average calls 210%-320% more transcriptomic binding sites and sometimes >1,000% more sites, providing deeper insight into post-transcriptional gene regulation. Skipper also calls binding to annotated repetitive elements and identifies bound elements for 99% of enhanced CLIP experiments. We perform nine translation factor enhanced CLIPs and apply Skipper to learn determinants of translation factor occupancy, including transcript region, sequence, and subcellular localization. Furthermore, we observe depletion of genetic variation in occupied sites and nominate transcripts subject to selective constraint because of translation factor occupancy. Skipper offers fast, easy, customizable, and state-of-the-art analysis of CLIP-seq data.
Affiliation(s)
- Evan A. Boyle
- Department of Cellular and Molecular Medicine, Institute for Genomic Medicine, UCSD Stem Cell Program, University of California San Diego, La Jolla, CA 92093, USA
| | - Hsuan-Lin Her
- Department of Cellular and Molecular Medicine, Institute for Genomic Medicine, UCSD Stem Cell Program, University of California San Diego, La Jolla, CA 92093, USA
| | - Jasmine R. Mueller
- Department of Cellular and Molecular Medicine, Institute for Genomic Medicine, UCSD Stem Cell Program, University of California San Diego, La Jolla, CA 92093, USA
| | - Jack T. Naritomi
- Department of Cellular and Molecular Medicine, Institute for Genomic Medicine, UCSD Stem Cell Program, University of California San Diego, La Jolla, CA 92093, USA
| | - Grady G. Nguyen
- Department of Cellular and Molecular Medicine, Institute for Genomic Medicine, UCSD Stem Cell Program, University of California San Diego, La Jolla, CA 92093, USA
| | - Gene W. Yeo
- Department of Cellular and Molecular Medicine, Institute for Genomic Medicine, UCSD Stem Cell Program, University of California San Diego, La Jolla, CA 92093, USA
|
15
|
Street L, Rothamel K, Brannan K, Jin W, Bokor B, Dong K, Rhine K, Madrigal A, Al-Azzam N, Kim JK, Ma Y, Abdou A, Wolin E, Doron-Mandel E, Ahdout J, Mujumdar M, Jovanovic M, Yeo GW. Large-scale map of RNA binding protein interactomes across the mRNA life-cycle. bioRxiv 2023:2023.06.08.544225. [PMID: 37333282 PMCID: PMC10274859 DOI: 10.1101/2023.06.08.544225] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/20/2023]
Abstract
Messenger RNAs (mRNAs) interact with RNA-binding proteins (RBPs) in diverse ribonucleoprotein complexes (RNPs) during distinct life-cycle stages for their processing and maturation. While substantial attention has focused on understanding RNA regulation by assigning proteins, particularly RBPs, to specific RNA substrates, there has been considerably less exploration leveraging protein-protein interaction (PPI) methodologies to identify and study the role of proteins in mRNA life-cycle stages. To address this gap, we generated an RNA-aware RBP-centric PPI map across the mRNA life-cycle by immunopurification (IP-MS) of ~100 endogenous RBPs across the life-cycle in the presence or absence of RNase, augmented by size exclusion chromatography (SEC-MS). Aside from confirming 8,700 known and discovering 20,359 novel interactions between 1,125 proteins, we determined that 73% of our IP interactions are regulated by the presence of RNA. Our PPI data enable us to link proteins to life-cycle stage functions, highlighting that nearly half of the proteins participate in at least two distinct stages. We show that one of the most highly interconnected proteins, ERH, engages in multiple RNA processes, including via interactions with nuclear speckles and the mRNA export machinery. We also demonstrate that the spliceosomal protein SNRNP200 participates in distinct stress granule-associated RNPs and occupies different RNA target regions in the cytoplasm during stress. Our comprehensive RBP-focused PPI network is a novel resource for identifying multi-stage RBPs and exploring RBP complexes in RNA maturation.
Affiliation(s)
- Lena Street
- These authors contributed equally
- Department of Biological Sciences, Columbia University, New York, NY, USA
| | - Katherine Rothamel
- These authors contributed equally
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
| | - Kristopher Brannan
- These authors contributed equally
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
- Center for RNA Therapeutics, Houston Methodist Research Institute, Houston, TX, USA
- Department of Cardiovascular Sciences, Houston Methodist Research Institute, Houston, TX, USA
| | - Wenhao Jin
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
| | - Benjamin Bokor
- Department of Biological Sciences, Columbia University, New York, NY, USA
| | - Kevin Dong
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
| | - Kevin Rhine
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
| | - Assael Madrigal
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
| | - Norah Al-Azzam
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
| | - Jenny Kim Kim
- Department of Biological Sciences, Columbia University, New York, NY, USA
| | - Yanzhe Ma
- Department of Biological Sciences, Columbia University, New York, NY, USA
| | - Ahmed Abdou
- Department of Biological Sciences, Columbia University, New York, NY, USA
| | - Erica Wolin
- Department of Biological Sciences, Columbia University, New York, NY, USA
| | - Ella Doron-Mandel
- Department of Biological Sciences, Columbia University, New York, NY, USA
| | - Joshua Ahdout
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
| | - Mayuresh Mujumdar
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
| | - Marko Jovanovic
- Department of Biological Sciences, Columbia University, New York, NY, USA
| | - Gene W Yeo
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
|
16
|
Wang Q, Xu T, Xu K, Lu Z, Ying J. Prediction of transport proteins from sequence information with the deep learning approach. Comput Biol Med 2023; 160:106974. [PMID: 37167658 DOI: 10.1016/j.compbiomed.2023.106974] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2022] [Revised: 04/17/2023] [Accepted: 04/22/2023] [Indexed: 05/13/2023]
Abstract
Transport proteins (TPs) are vital to the growth and life of all living things and are especially relevant to microbial pathogenesis and the drug resistance of tumor cells. Accurately identifying potential TPs remains an important challenge for the advancement of functional genomics. This study aimed to develop a tool for predicting TPs using a deep learning approach. Here, we proposed DeepTP, a convolutional neural network model that uses parallel subnetworks to extract features from protein sequences and fully connected layers for TP classification. To train and evaluate the model, datasets were collected from the UniProtKB/Swiss-Prot database. The test results revealed that the proposed model could successfully identify TPs, with an AUCROC, accuracy, F-value, and Matthews correlation coefficient of 0.9719, 0.9513, 0.8982, and 0.8679, respectively. In further comparisons, DeepTP achieved better performance than other commonly used methods. Analysis of the gradients of the prediction score with respect to the input suggested that DeepTP makes predictions by recognizing the functional domains of TPs. We anticipate that DeepTP will serve as a useful tool for predicting TPs in large-scale genome projects, which will facilitate the discovery of novel TPs.
Affiliation(s)
- Qian Wang
- Department of Clinical Laboratory, Wenzhou People's Hospital, The Third Affiliated Hospital of Shanghai University, The Third Clinical Institute Affiliated to Wenzhou Medical University, Wenzhou, China
| | - Teng Xu
- Institute of Translational Medicine, Baotou Central Hospital, Baotou, China
| | - Kai Xu
- Department of Clinical Laboratory, Wenzhou People's Hospital, The Third Affiliated Hospital of Shanghai University, The Third Clinical Institute Affiliated to Wenzhou Medical University, Wenzhou, China
| | - Zhongqiu Lu
- Wenzhou Key Laboratory of Emergency, Critical Care, and Disaster Medicine, Department of Emergency, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China.
| | - Jianchao Ying
- Central Laboratory, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China; Wenzhou Key Laboratory of Emergency, Critical Care, and Disaster Medicine, Department of Emergency, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China.
|
17
|
Wang X, Zhang M, Long C, Yao L, Zhu M. Self-Attention Based Neural Network for Predicting RNA-Protein Binding Sites. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1469-1479. [PMID: 36067103 DOI: 10.1109/tcbb.2022.3204661] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
Proteins that bind to ribonucleic acid (RNA) inside cells are called RNA-binding proteins (RBPs), which play a crucial role in gene regulation. Identifying RNA-protein binding sites helps to better understand the function of RBPs. Although many computational methods have been developed to predict RNA-protein binding sites, their prediction accuracy on small sample datasets needs improvement. To overcome this limitation, we propose a novel model called SA-Net, which utilizes k-mer embedding to encode RNA sequences and a self-attention-based neural network to extract sequence features. K-mer embedding helps the model discover significant subsequence fragments associated with binding sites. The self-attention mechanism captures contextual information from the entire input sequence globally, performing well in small sample sequence learning. Experimental results demonstrate that SA-Net attains state-of-the-art results on the RBP-24 dataset. We find that 4-mer embedding helps the model achieve optimal performance. We also show that the self-attention network outperforms the commonly used CNN and CNN-BLSTM models in sequence feature extraction.
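The abstract above encodes RNA sequences as k-mers before embedding them. The Python sketch below illustrates only the tokenization step that typically precedes a learned embedding layer: it splits a sequence into overlapping k-mers and maps each to an integer index. It is an assumption-laden illustration, not SA-Net's code; the function names, the stride of 1, and the handling of unknown symbols are all hypothetical.

```python
from itertools import product


def kmer_vocabulary(k, alphabet="ACGU"):
    """Map every possible k-mer over the alphabet to an integer index."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def kmer_tokens(seq, k=4, stride=1):
    """Split an RNA sequence into overlapping k-mers (T is read as U)."""
    seq = seq.upper().replace("T", "U")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def encode(seq, k=4):
    """Convert a sequence into a list of k-mer indices.

    k-mers containing symbols outside ACGU (for example N) are mapped
    to one extra "unknown" index equal to the vocabulary size.
    """
    vocab = kmer_vocabulary(k)
    unknown = len(vocab)
    return [vocab.get(token, unknown) for token in kmer_tokens(seq, k)]


if __name__ == "__main__":
    print(kmer_tokens("ACGUAC", k=4))
    print(encode("ACGUAC", k=4))
```

The resulting index list is what an embedding layer would look up; with k = 4 the vocabulary holds 256 k-mers plus the unknown index.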
|
18
|
Horlacher M, Oleshko S, Hu Y, Ghanbari M, Cantini G, Schinke P, Vergara EE, Bittner F, Mueller NS, Ohler U, Moyon L, Marsico A. A computational map of the human-SARS-CoV-2 protein-RNA interactome predicted at single-nucleotide resolution. NAR Genom Bioinform 2023; 5:lqad010. [PMID: 36814457 PMCID: PMC9940458 DOI: 10.1093/nargab/lqad010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2022] [Revised: 01/10/2023] [Accepted: 02/14/2023] [Indexed: 02/22/2023] Open
Abstract
RNA-binding proteins (RBPs) are critical host factors for viral infection; however, large-scale experimental investigation of the binding landscape of human RBPs to viral RNAs is costly and further complicated by sequence variation between viral strains. To fill this gap, we investigated the role of RBPs in the context of SARS-CoV-2 by constructing the first in silico map of human RBP-viral RNA interactions at nucleotide resolution using two deep learning methods (pysster and DeepRiPe) trained on data from CLIP-seq experiments on more than 100 human RBPs. We evaluated conservation of RBP binding between six other human pathogenic coronaviruses and identified sites of conserved and differential binding in the UTRs of SARS-CoV-1, SARS-CoV-2 and MERS. We scored the impact of mutations from 11 variants of concern on protein-RNA interaction, identifying a set of gain- and loss-of-binding events, and predicted the regulatory impact of putative future mutations. Lastly, we linked RBPs to functional, OMICs and COVID-19 patient data from other studies, and identified MBNL1, FTO and FXR2 RBPs as potential clinical biomarkers. Our results contribute towards a deeper understanding of how viruses hijack host cellular pathways and open new avenues for therapeutic intervention.
Affiliation(s)
- Marc Horlacher
- Computational Health Center, Helmholtz Center Munich, Munich, Germany
| | - Svitlana Oleshko
- Computational Health Center, Helmholtz Center Munich, Munich, Germany
| | - Yue Hu
- Computational Health Center, Helmholtz Center Munich, Munich, Germany; Informatics 12 Chair of Bioinformatics, Technical University Munich, Garching, Germany
| | - Mahsa Ghanbari
- Institutes of Biology and Computer Science, Humboldt University, Berlin, Germany; Max Delbruck Center, Computational Regulatory Genomics, Berlin, Germany
| | - Giulia Cantini
- Computational Health Center, Helmholtz Center Munich, Munich, Germany
| | - Patrick Schinke
- Computational Health Center, Helmholtz Center Munich, Munich, Germany
| | | | | | | | - Uwe Ohler
- Institutes of Biology and Computer Science, Humboldt University, Berlin, Germany; Max Delbruck Center, Computational Regulatory Genomics, Berlin, Germany
| | - Lambert Moyon
- To whom correspondence should be addressed. Tel: +49 89318749193;
| | - Annalisa Marsico
- Correspondence may also be addressed to Annalisa Marsico. Tel: +49 89318743073;
|
19
|
Koo PK, Ploenzke M, Anand P, Paul S, Majdandzic A. ResidualBind: Uncovering Sequence-Structure Preferences of RNA-Binding Proteins with Deep Neural Networks. Methods Mol Biol 2023; 2586:197-215. [PMID: 36705906 DOI: 10.1007/978-1-0716-2768-6_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023]
Abstract
Deep neural networks have demonstrated improved performance at predicting sequence specificities of DNA- and RNA-binding proteins. However, it remains unclear why they perform better than previous methods that rely on k-mers and position weight matrices. Here, we highlight a recent deep learning-based software package, called ResidualBind, that analyzes RNA-protein interactions using only RNA sequence as an input feature and performs global importance analysis for model interpretability. We discuss practical considerations for model interpretability to uncover learned sequence motifs and their secondary structure preferences.
Affiliation(s)
- Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
| | - Matt Ploenzke
- Department of Biostatistics, Harvard University, Cambridge, MA, USA
| | | | - Steffan Paul
- Bioinformatics Program, Harvard Medical School, Boston, MA, USA
| | - Antonio Majdandzic
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
|
20
|
Agarwal V, Kelley DR. The genetic and biochemical determinants of mRNA degradation rates in mammals. Genome Biol 2022; 23:245. [PMID: 36419176 PMCID: PMC9684954 DOI: 10.1186/s13059-022-02811-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 03/18/2022] [Accepted: 11/02/2022] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND: Degradation rate is a fundamental aspect of mRNA metabolism, and the factors governing it remain poorly characterized. Understanding the genetic and biochemical determinants of mRNA half-life would enable more precise identification of variants that perturb gene expression through post-transcriptional gene regulatory mechanisms.
RESULTS: We establish a compendium of 39 human and 27 mouse transcriptome-wide mRNA decay rate datasets. A meta-analysis of these data identified a prevalence of technical noise and measurement bias, induced partially by the underlying experimental strategy. Correcting for these biases allowed us to derive more precise, consensus measurements of half-life which exhibit enhanced consistency between species. We trained substantially improved statistical models based upon genetic and biochemical features to better predict half-life and characterize the factors molding it. Our state-of-the-art model, Saluki, is a hybrid convolutional and recurrent deep neural network which relies only upon an mRNA sequence annotated with coding frame and splice sites to predict half-life (r=0.77). The key novel principle learned by Saluki is that the spatial positioning of splice sites, codons, and RNA-binding motifs within an mRNA is strongly associated with mRNA half-life. Saluki predicts the impact of RNA sequences and genetic mutations therein on mRNA stability, in agreement with functional measurements derived from massively parallel reporter assays.
CONCLUSIONS: Our work produces a more robust ground truth for transcriptome-wide mRNA half-lives in mammalian cells. Using these revised measurements, we trained Saluki, a model that is over 50% more accurate in predicting half-life from sequence than existing models. Saluki succinctly captures many of the known determinants of mRNA half-life and can be rapidly deployed to predict the functional consequences of arbitrary mutations in the transcriptome.
Affiliation(s)
- Vikram Agarwal
- Calico Life Sciences LLC, South San Francisco, CA, 94080, USA.
- Present Address: mRNA Center of Excellence, Sanofi Pasteur Inc., Waltham, MA, 02451, USA.
| | - David R Kelley
- Calico Life Sciences LLC, South San Francisco, CA, 94080, USA.
|
21
|
Huang D, Chen K, Song B, Wei Z, Su J, Coenen F, de Magalhães JP, Rigden DJ, Meng J. Geographic encoding of transcripts enabled high-accuracy and isoform-aware deep learning of RNA methylation. Nucleic Acids Res 2022; 50:10290-10310. [PMID: 36155798 PMCID: PMC9561283 DOI: 10.1093/nar/gkac830] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 03/21/2022] [Revised: 08/26/2022] [Accepted: 09/15/2022] [Indexed: 12/25/2022] Open
Abstract
As the most pervasive epigenetic mark present on mRNA and lncRNA, N6-methyladenosine (m6A) RNA methylation regulates all stages of RNA life in various biological processes and disease mechanisms. Computational methods for deciphering RNA modification have achieved great success in recent years; nevertheless, their potential remains underexploited. One reason for this is that existing models usually consider only the sequence of transcripts, ignoring the various regions (or geography) of transcripts such as 3′UTR and intron, where the epigenetic mark forms and functions. Here, we developed three simple yet powerful encoding schemes for transcripts to capture the submolecular geographic information of RNA, which is largely independent from sequences. We show that m6A prediction models based on geographic information alone can achieve comparable performances to classic sequence-based methods. Importantly, geographic information substantially enhances the accuracy of sequence-based models, enables isoform- and tissue-specific prediction of m6A sites, and improves m6A signal detection from direct RNA sequencing data. The geographic encoding schemes we developed have exhibited strong interpretability, and are applicable to not only m6A but also N1-methyladenosine (m1A), and can serve as a general and effective complement to the widely used sequence encoding schemes in deep learning applications concerning RNA transcripts.
Affiliation(s)
- Daiyun Huang
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China; Department of Computer Sciences, University of Liverpool, Liverpool L69 7ZB, UK
| | - Kunqi Chen
- Key Laboratory of Gastrointestinal Cancer (Fujian Medical University), Ministry of Education, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350004, PR China
| | - Bowen Song
- Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China; Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
| | - Zhen Wei
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China; Institute of Life Course and Medical Sciences, University of Liverpool, Liverpool L69 7ZB, UK
| | - Jionglong Su
- Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China; School of AI and Advanced Computing, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China
| | - Frans Coenen
- Department of Computer Sciences, University of Liverpool, Liverpool L69 7ZB, UK
| | - João Pedro de Magalhães
- Institute of Life Course and Medical Sciences, University of Liverpool, Liverpool L69 7ZB, UK
| | - Daniel J Rigden
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
| | - Jia Meng
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China.,Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK.,AI University Research Centre, Xi'an Jiaotong-Liverpool University, Suzhou 215123, PR China
|
22
|
Cortés-López M, Schulz L, Enculescu M, Paret C, Spiekermann B, Quesnel-Vallières M, Torres-Diz M, Unic S, Busch A, Orekhova A, Kuban M, Mesitov M, Mulorz MM, Shraim R, Kielisch F, Faber J, Barash Y, Thomas-Tikhonenko A, Zarnack K, Legewie S, König J. High-throughput mutagenesis identifies mutations and RNA-binding proteins controlling CD19 splicing and CART-19 therapy resistance. Nat Commun 2022; 13:5570. [PMID: 36138008 PMCID: PMC9500061 DOI: 10.1038/s41467-022-31818-y] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 10/07/2021] [Accepted: 07/05/2022] [Indexed: 11/29/2022] Open
Abstract
Following CART-19 immunotherapy for B-cell acute lymphoblastic leukaemia (B-ALL), many patients relapse due to loss of the cognate CD19 epitope. Since epitope loss can be caused by aberrant CD19 exon 2 processing, we herein investigate the regulatory code that controls CD19 splicing. We combine high-throughput mutagenesis with mathematical modelling to quantitatively disentangle the effects of all mutations in the region comprising CD19 exons 1-3. Thereupon, we identify ~200 single point mutations that alter CD19 splicing and thus could predispose B-ALL patients to developing CART-19 resistance. Furthermore, we report almost 100 previously unknown splice isoforms that emerge from cryptic splice sites and likely encode non-functional CD19 proteins. We further identify cis-regulatory elements and trans-acting RNA-binding proteins that control CD19 splicing (e.g., PTBP1 and SF3B4) and validate that loss of these factors leads to pervasive CD19 mis-splicing. Our dataset represents a comprehensive resource for identifying predictive biomarkers for CART-19 therapy. Multiple alternative splicing events in CD19 mRNA have been associated with resistance/relapse to CD19 CAR-T therapy in patients with B cell malignancies. Here, by combining patient data and a high-throughput mutagenesis screen, the authors identify single point mutations and RNA-binding proteins that can control CD19 splicing and be associated with CD19 CAR-T therapy resistance.
Affiliation(s)
- Laura Schulz
- Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany
- Mihaela Enculescu
- Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany
- Claudia Paret
- Department of Pediatric Hematology/Oncology, Center for Pediatric and Adolescent Medicine, University Medical Center of the Johannes Gutenberg University Mainz, 55131, Mainz, Germany; University Cancer Center (UCT), University Medical Center of the Johannes Gutenberg University Mainz, 55131, Mainz, Germany; German Cancer Consortium (DKTK), site Frankfurt/Mainz, Germany, German Cancer Research Center (DKFZ), 69120, Heidelberg, Germany
- Bea Spiekermann
- Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany
- Mathieu Quesnel-Vallières
- Department of Genetics, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, 19104, USA; Department of Biochemistry and Biophysics, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, 19104, USA
- Manuel Torres-Diz
- Division of Cancer Pathobiology, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
- Sebastian Unic
- Department of Systems Biology, Institute for Biomedical Genetics (IBMG), University of Stuttgart, Allmandring 30E, 70569, Stuttgart, Germany
- Anke Busch
- Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany
- Anna Orekhova
- Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany
- Monika Kuban
- Department of Systems Biology, Institute for Biomedical Genetics (IBMG), University of Stuttgart, Allmandring 30E, 70569, Stuttgart, Germany
- Mikhail Mesitov
- Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany
- Miriam M Mulorz
- Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany
- Rawan Shraim
- Division of Cancer Pathobiology, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA; Department of Biomedical and Health Informatics, Children's Hospital of Philadelphia, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, 19104, USA
- Fridolin Kielisch
- Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany
- Jörg Faber
- Department of Pediatric Hematology/Oncology, Center for Pediatric and Adolescent Medicine, University Medical Center of the Johannes Gutenberg University Mainz, 55131, Mainz, Germany; University Cancer Center (UCT), University Medical Center of the Johannes Gutenberg University Mainz, 55131, Mainz, Germany; German Cancer Consortium (DKTK), site Frankfurt/Mainz, Germany, German Cancer Research Center (DKFZ), 69120, Heidelberg, Germany
- Yoseph Barash
- Department of Genetics, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, 19104, USA
- Andrei Thomas-Tikhonenko
- Division of Cancer Pathobiology, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA; Department of Pathology & Laboratory Medicine, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, 19104, USA
- Kathi Zarnack
- Buchmann Institute for Molecular Life Sciences (BMLS), Max-von-Laue-Str. 15, 60438, Frankfurt, Germany; Faculty Biological Sciences, Goethe University Frankfurt, Max-von-Laue-Str. 15, 60438, Frankfurt, Germany
- Stefan Legewie
- Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany; Department of Systems Biology, Institute for Biomedical Genetics (IBMG), University of Stuttgart, Allmandring 30E, 70569, Stuttgart, Germany; Stuttgart Research Center for Systems Biology (SRCSB), University of Stuttgart, Stuttgart, Germany
- Julian König
- Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany
|
23
|
Identifying interpretable gene-biomarker associations with functionally informed kernel-based tests in 190,000 exomes. Nat Commun 2022; 13:5332. [PMID: 36088354 PMCID: PMC9464252 DOI: 10.1038/s41467-022-32864-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2022] [Accepted: 08/22/2022] [Indexed: 12/05/2022] Open
Abstract
Here we present an exome-wide rare genetic variant association study for 30 blood biomarkers in 191,971 individuals in the UK Biobank. We compare gene-based association tests for separate functional variant categories to increase interpretability and identify 193 significant gene-biomarker associations. Genes associated with biomarkers were ~4.5-fold enriched for conferring Mendelian disorders. In addition to performing weighted gene-based variant collapsing tests, we design and apply variant-category-specific kernel-based tests that integrate quantitative functional variant effect predictions for missense variants, splicing and the binding of RNA-binding proteins. For these tests, we present a computationally efficient combination of the likelihood-ratio and score tests that found 36% more associations than the score test alone while also controlling the type-1 error. Kernel-based tests identified 13% more associations than their gene-based collapsing counterparts and had advantages in the presence of gain-of-function missense variants. We introduce local collapsing by amino acid position for missense variants and use it to interpret associations and identify potential novel gain-of-function variants in PIEZO1. Our results show the benefits of investigating different functional mechanisms when performing rare-variant association tests, and demonstrate pervasive rare-variant contribution to biomarker variability.
|
24
|
Ma H, Wen H, Xue Z, Li G, Zhang Z. RNANetMotif: Identifying sequence-structure RNA network motifs in RNA-protein binding sites. PLoS Comput Biol 2022; 18:e1010293. [PMID: 35819951 PMCID: PMC9275694 DOI: 10.1371/journal.pcbi.1010293] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/16/2021] [Accepted: 06/09/2022] [Indexed: 11/19/2022] Open
Abstract
RNA molecules can adopt stable secondary and tertiary structures, which are essential in mediating physical interactions with other partners such as RNA binding proteins (RBPs) and in carrying out their cellular functions. In vitro and in vivo experiments such as RNAcompete and eCLIP have revealed in vitro binding preferences of RBPs to RNA oligomers and in vivo binding sites in cells. Analysis of these binding data showed that the structural properties of the RNAs in these binding sites are important determinants of the binding events; however, it has been a challenge to incorporate the structure information into an interpretable model. Here we describe a new approach, RNANetMotif, which takes the predicted secondary structures of thousands of RNA sequences bound by an RBP as input and uses a graph theory approach to recognize enriched subgraphs. These enriched subgraphs are in essence shared sequence-structure elements that are important in RBP-RNA binding. To validate our approach, we performed RNA structure modeling via coarse-grained molecular dynamics folding simulations for four selected RBPs, and RNA-protein docking for LIN28B. The simulation results, e.g., solvent accessibility and energetics, further support the biological relevance of the discovered network subgraphs.
RNA binding proteins (RBPs) regulate every aspect of RNA biology, including splicing, translation, transportation, and degradation. High-throughput technologies such as eCLIP have identified thousands of binding sites for a given RBP throughout the genome. Earlier studies have shown that, in addition to nucleotide sequences, the structure and conformation of RNAs also play an important role in RBP-RNA interactions. Analogous to protein-protein interactions or protein-DNA interactions, it is likely that there exist intrinsic sequence-structure motifs common to these RNAs that underlie their binding specificity to specific RBPs. It is known that RNAs form energetically favorable secondary structures, which can be represented as graphs, with nucleotides being nodes and backbone covalent bonds and base-pairing hydrogen bonds representing edges. We hypothesize that these graphs can be mined by graph theory approaches to identify sequence-structure motifs as enriched subgraphs. In this article, we describe the details of this approach, termed RNANetMotif, and associated new concepts, namely EKS (Extended K-mer Subgraph) and the GraphK graph algorithm. To test the utility of our approach, we conducted 3D structure modeling of selected RNA sequences through molecular dynamics (MD) folding simulation and evaluated the significance of the discovered RNA motifs by comparing their spatial exposure with other regions on the RNA. We believe that this approach has the novelty of treating the RNA sequence as a graph and RBP binding sites as enriched subgraphs, which has broader applications beyond RBP-RNA interactions.
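The graph representation described in this abstract (nucleotides as nodes, backbone bonds and base pairs as edges) can be illustrated with a minimal sketch. It assumes the secondary structure is supplied in dot-bracket notation without pseudoknots; the function name `structure_to_graph` and the edge labels are hypothetical and are not taken from RNANetMotif itself.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int, str]


def structure_to_graph(sequence: str, dot_bracket: str) -> Tuple[List[str], Set[Edge]]:
    """Turn a sequence plus dot-bracket structure into nodes and labeled edges.

    Illustrative only: node i is the nucleotide at position i; edges are
    (i, j, label) with label "backbone" or "basepair".
    """
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")

    edges: Set[Edge] = set()

    # Backbone edges connect consecutive nucleotides.
    for i in range(len(sequence) - 1):
        edges.add((i, i + 1, "backbone"))

    # Base-pair edges: match each ")" with the most recent unmatched "(".
    stack: List[int] = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            edges.add((stack.pop(), i, "basepair"))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol: {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: extra '('")

    return list(sequence), edges


nodes, edges = structure_to_graph("GGGAAACCC", "(((...)))")
```

For the small hairpin above, the graph has nine nodes, eight backbone edges, and three base-pair edges. Enriched-subgraph mining would operate on many such graphs; that step is not sketched here.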
Affiliation(s)
- Hongli Ma
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
- Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- School of Mathematics, Shandong University, Jinan, China
- Han Wen
- Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Zhiyuan Xue
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Guojun Li
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
- School of Mathematics, Shandong University, Jinan, China
- School of Mathematical Science, Liaocheng University, Liaocheng, China
- Zhaolei Zhang
- Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
|
25
|
Barshai M, Aubert A, Orenstein Y. G4detector: Convolutional Neural Network to Predict DNA G-Quadruplexes. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1946-1955. [PMID: 33872156 DOI: 10.1109/tcbb.2021.3073595] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 06/12/2023]
Abstract
G-quadruplexes (G4s) are nucleic acid secondary structures that form within guanine-rich DNA or RNA sequences. G4 formation can affect chromatin architecture and gene regulation, and has been associated with genomic instability, genetic diseases, and cancer progression. The experimental data produced by the G4-seq experiment provides unprecedented details on G4 formation in the genome. Still, running the experimental protocol on a whole genome is an expensive and time-consuming process. Thus, it is highly desirable to have a computational method to predict G4 formation in new DNA sequences or whole genomes. Here, we present G4detector, a new method based on a convolutional neural network to predict G4s from DNA sequences. On top of the sequence information, we improved prediction accuracy by the addition of RNA secondary structure information. To train and test G4detector, we compiled novel high-throughput benchmarks over multiple species genomes measured by the G4-seq protocol. We show that G4detector outperforms extant methods for the same task on all benchmark datasets, can detect G4s genome-wide with high accuracy, and is able to extrapolate human-trained measurements to various non-human species. The code and benchmarks are publicly available on github.com/OrensteinLab/G4detector.
|
26
|
Du X, Zhao X, Zhang Y. DeepBtoD: Improved RNA-binding proteins prediction via integrated deep learning. J Bioinform Comput Biol 2022; 20:2250006. [PMID: 35451938 DOI: 10.1142/s0219720022500068] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/18/2022]
Abstract
RNA-binding proteins (RBPs) have crucial roles in various cellular processes such as alternative splicing and gene regulation. Therefore, the analysis and identification of RBPs is an essential issue. However, although many computational methods have been developed for predicting RBPs, few studies simultaneously consider local and global information from the perspective of the RNA sequence. Facing this challenge, we present a novel method called DeepBtoD, which predicts RBPs directly from RNA sequences. First, a [Formula: see text]-BtoD encoding is designed, which takes into account the composition of [Formula: see text]-nucleotides and their relative positions and forms a local module. Second, we designed a multi-scale convolutional module embedded with a self-attentive mechanism, the ms-focusCNN, which is used to learn more effective, diverse, and discriminative high-level features. Finally, global information is considered to supplement the local modules, with ensemble learning used to predict whether the target RNA binds to RBPs. Preliminary results on 24 independent test datasets show that our proposed method can classify RBPs with an area under the curve of 0.933. Remarkably, DeepBtoD shows competitive results against seven state-of-the-art methods, suggesting that RBPs can be recognized well by integrating local [Formula: see text]-BtoD and global information from RNA sequences alone. Hence, our integrative method may be useful to improve the power of RBP prediction, which might be particularly useful for modeling protein-nucleic acid interactions in systems biology studies. Our DeepBtoD server can be accessed at http://175.27.228.227/DeepBtoD/.
Affiliation(s)
- XiuQuan Du
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, Anhui, P. R. China; School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, P. R. China
- XiuJuan Zhao
- School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, P. R. China
- YanPing Zhang
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, Anhui, P. R. China
|
27
|
Chalupová E, Vaculík O, Poláček J, Jozefov F, Majtner T, Alexiou P. ENNGene: an Easy Neural Network model building tool for Genomics. BMC Genomics 2022; 23:248. [PMID: 35361122 PMCID: PMC8973509 DOI: 10.1186/s12864-022-08414-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/05/2021] [Accepted: 02/23/2022] [Indexed: 11/17/2022] Open
Abstract
Background: The recent big data revolution in Genomics, coupled with the emergence of Deep Learning as a set of powerful machine learning methods, has shifted the standard practices of machine learning for Genomics. Even though Deep Learning methods such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are becoming widespread in Genomics, developing and training such models is outside the ability of most researchers in the field. Results: Here we present ENNGene, an Easy Neural Network model building tool for Genomics. This tool simplifies training of custom CNN or hybrid CNN-RNN models on genomic data via an easy-to-use Graphical User Interface. ENNGene allows multiple input branches, including sequence, evolutionary conservation, and secondary structure, and performs all the necessary preprocessing steps, allowing simple input such as genomic coordinates. The network architecture is selected and fully customized by the user, from the number and types of the layers to each layer's precise set-up. ENNGene then deals with all steps of training and evaluation of the model, exporting valuable metrics such as multi-class ROC and precision-recall curve plots or TensorBoard log files. To facilitate interpretation of the predicted results, we deploy Integrated Gradients, providing the user with a graphical representation of an attribution level of each input position. To showcase the usage of ENNGene, we train multiple models on the RBP24 dataset, quickly reaching the state of the art while improving the performance on more than half of the proteins by including the evolutionary conservation score and tuning the network per protein. Conclusions: As the role of DL in big data analysis in the near future is indisputable, it is important to make it available for a broader range of researchers. We believe that an easy-to-use tool such as ENNGene can allow Genomics researchers without a background in Computational Sciences to harness the power of DL to gain better insights into and extract important information from the large amounts of data available in the field. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-022-08414-x.
Affiliation(s)
- Eliška Chalupová
- Faculty of Science, National Centre for Biomolecular Research, Masaryk University, Brno, Czechia; Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
- Ondřej Vaculík
- Faculty of Science, National Centre for Biomolecular Research, Masaryk University, Brno, Czechia; Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
- Jakub Poláček
- Faculty of Informatics, Masaryk University, Brno, Czechia
- Filip Jozefov
- Faculty of Informatics, Masaryk University, Brno, Czechia
- Tomáš Majtner
- Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
- Panagiotis Alexiou
- Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
|
28
|
Liu Y, Li R, Luo J, Zhang Z. Inferring RNA-binding protein target preferences using adversarial domain adaptation. PLoS Comput Biol 2022; 18:e1009863. [PMID: 35202389 PMCID: PMC8870515 DOI: 10.1371/journal.pcbi.1009863] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2021] [Accepted: 01/25/2022] [Indexed: 11/18/2022] Open
Abstract
Precise identification of target sites of RNA-binding proteins (RBPs) is important to understand their biochemical and cellular functions. A large amount of experimental data is generated by in vivo and in vitro approaches. The binding preferences determined from these platforms share similar patterns, but there are discernible differences between these datasets. Computational methods trained on one dataset do not always work well on another dataset. To address this problem, which resembles the classic "domain shift" in deep learning, we adopted the adversarial domain adaptation (ADDA) technique and developed a framework (RBP-ADDA) that can extract RBP binding preferences from an integration of in vivo and in vitro datasets. Compared with conventional methods, ADDA has the advantage of working with two input datasets, as it trains the initial neural network for each dataset individually, projects the two datasets onto a feature space, and uses an adversarial framework to derive a network with optimal discriminative predictive power. In the first step, for each RBP, we include only the in vitro data to pre-train a source network and a task predictor. Next, for the same RBP, we initialize the target network from the source network and use adversarial domain adaptation to update the target network using both in vitro and in vivo data. These two steps help leverage the in vitro data to improve the prediction on in vivo data, which is typically challenging because of its lower signal-to-noise ratio. Finally, to take further advantage of the fused source and target data, we fine-tune the task predictor using both datasets. We showed that RBP-ADDA achieved better performance in modeling in vivo RBP binding data than other existing methods, as judged by Pearson correlations. It also improved predictive performance on in vitro datasets. We further applied augmentation operations on RBPs with less in vivo data to expand the input data and showed that this can improve prediction performance. Lastly, we explored the predictive interpretability of RBP-ADDA, where we quantified the contribution of the input features by Integrated Gradients and identified nucleotide positions that are important for RBP recognition.
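Several abstracts in this list, including the one above, use Integrated Gradients to attribute a prediction to input positions. As a rough illustration of the general technique only, the sketch below approximates the path integral of gradients with a midpoint Riemann sum for a toy logistic model. The model, its weights, and the function names are assumptions made for this example and have nothing to do with RBP-ADDA's actual network.

```python
import math
from typing import List


def predict(x: List[float], weights: List[float], bias: float) -> float:
    """Toy differentiable model: logistic regression output in (0, 1)."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x: List[float], weights: List[float], bias: float) -> List[float]:
    """Analytic gradient of the toy model with respect to each input."""
    p = predict(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x: List[float], baseline: List[float],
                         weights: List[float], bias: float,
                         steps: int = 200) -> List[float]:
    """Average gradients along the straight path from baseline to x,
    then scale by (x - baseline)."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, weights, bias)):
            totals[i] += g
    return [(v - b) * t / steps for v, b, t in zip(x, baseline, totals)]


weights, bias = [1.0, -2.0, 0.5], 0.1
x, baseline = [1.0, 0.0, 1.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, weights, bias)
```

A useful sanity check is the completeness property: the attributions should sum approximately to `predict(x) - predict(baseline)`. Inputs that equal the baseline receive zero attribution.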
Affiliation(s)
- Ying Liu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Ruihui Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
- Jiawei Luo
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Zhaolei Zhang
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
|
29
|
Yaish O, Orenstein Y. Computational modeling of mRNA degradation dynamics using deep neural networks. Bioinformatics 2022; 38:1087-1101. [PMID: 34849591 DOI: 10.1093/bioinformatics/btab800] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/29/2021] [Revised: 11/12/2021] [Accepted: 11/22/2021] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION: Messenger RNA (mRNA) degradation plays critical roles in post-transcriptional gene regulation. A major component of mRNA degradation is determined by 3'-UTR elements. Hence, researchers are interested in studying mRNA dynamics as a function of 3'-UTR elements. A recent study measured the mRNA degradation dynamics of tens of thousands of 3'-UTR sequences using a massively parallel reporter assay. However, the computational approach used to model mRNA degradation was based on a simplifying assumption of a linear degradation rate. Consequently, the underlying mechanism of 3'-UTR elements is still not fully understood. RESULTS: Here, we developed deep neural networks to predict mRNA degradation dynamics and interpreted the networks to identify regulatory elements in the 3'-UTR and their positional effect. Given an input of a 110 nt-long 3'-UTR sequence and an initial mRNA level, the model predicts mRNA levels at eight consecutive time points. Our deep neural networks significantly improved prediction performance of mRNA degradation dynamics compared with extant methods for the task. Moreover, we demonstrated that models predicting the dynamics of two identical 3'-UTR sequences, differing by their poly(A) tail, performed better than single-task models. On the interpretability front, by using Integrated Gradients, our convolutional neural network (CNN) models identified known and novel cis-regulatory sequence elements of mRNA degradation. By applying a novel systematic evaluation of model interpretability, we demonstrated that the recurrent neural network models are inferior to the CNN models in terms of interpretability and that a random initialization ensemble improves both prediction and interpretability performance. Moreover, using a mutagenesis analysis, we newly discovered the positional effect of various 3'-UTR elements. AVAILABILITY AND IMPLEMENTATION: All the code developed through this study is available at github.com/OrensteinLab/DeepUTR/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ofir Yaish
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
- Yaron Orenstein
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
|
30
|
RBPSpot: Learning on appropriate contextual information for RBP binding sites discovery. iScience 2021; 24:103381. [PMID: 34841226 PMCID: PMC8605353 DOI: 10.1016/j.isci.2021.103381] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/07/2021] [Revised: 09/01/2021] [Accepted: 10/27/2021] [Indexed: 11/29/2022] Open
Abstract
Identifying the factors determining RBP-RNA interactions remains a big challenge. It involves sparse binding motifs and a suitable sequence context for binding. The present work describes an approach to detect RBP binding sites in RNAs using an ultra-fast inexact k-mer search for statistically significant seeds. The seeds work as anchors to evaluate the context and binding potential using flanking region information with a Deep Feed-forward Neural Network. The developed models also received support from MD-simulation studies. The implemented software, RBPSpot, scored consistently high on all the performance metrics, including an average accuracy of ∼90% across a large number of validated datasets. It outperformed the compared tools, including some with much more complex deep-learning models, during a comprehensive benchmarking process. RBPSpot can identify RBP binding sites in the human system and can also be used to develop new models, making it a valuable resource in the area of regulatory system studies.
- Efficient motif anchoring helps to get good quality contextual information on binding
- Realistic and high granularity datasets ensure better performance of the classifiers
- DNN models on the contextual features outperform more complex deep learning tools
- RBPSpot algorithm may be used to develop RBP binding models for other species also
|
31
|
Zhao S, Hamada M. Multi-resBind: a residual network-based multi-label classifier for in vivo RNA binding prediction and preference visualization. BMC Bioinformatics 2021; 22:554. [PMID: 34781902 PMCID: PMC8594109 DOI: 10.1186/s12859-021-04430-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/24/2021] [Accepted: 10/06/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Protein-RNA interactions play key roles in many processes regulating gene expression. To understand the underlying binding preference, ultraviolet cross-linking and immunoprecipitation (CLIP)-based methods have been used to identify the binding sites for hundreds of RNA-binding proteins (RBPs) in vivo. Using these large-scale experimental data to infer RNA binding preference and predict missing binding sites has become a great challenge. Some existing deep-learning models have demonstrated high prediction accuracy for individual RBPs. However, it remains difficult to avoid significant bias due to the experimental protocol. The DeepRiPe method was recently developed to solve this problem by introducing multi-task or multi-label learning into this field. However, this method has not reached an ideal level of prediction power due to its weak neural network architecture. RESULTS: Compared to the DeepRiPe approach, our Multi-resBind method demonstrated substantial improvements on the same large-scale PAR-CLIP dataset, increasing both the area under the receiver operating characteristic curve and the average precision. We conducted extensive experiments to evaluate the impact of various types of input data on the final prediction accuracy. The same approach was used to evaluate the effect of loss functions. Finally, a modified integrated gradient was employed to generate attribution maps. The patterns disentangled from relative contributions according to context offer biological insights into the underlying mechanism of protein-RNA interactions. CONCLUSIONS: Here, we propose Multi-resBind as a new multi-label deep-learning approach to infer protein-RNA binding preferences and predict novel interactions. The results clearly demonstrate that Multi-resBind is a promising tool to predict unknown binding sites in vivo and gain biological insights into why the neural network makes a given prediction.
Affiliation(s)
- Shitao Zhao
- Waseda Research Institute for Science and Engineering, Waseda University, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555, Japan.
| | - Michiaki Hamada
- Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555, Japan; Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555, Japan; Graduate School of Medicine, Nippon Medical School, 1-1-5 Sendagi, Bunkyo-ku, Tokyo, 113-8602, Japan.
|
32
|
Tayara H, Chong KT. Improved Predicting of The Sequence Specificities of RNA Binding Proteins by Deep Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2526-2534. [PMID: 32191896 DOI: 10.1109/tcbb.2020.2981335] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 06/10/2023]
Abstract
RNA-binding proteins (RBPs) have a significant role in various regulatory tasks. However, the mechanism by which RBPs identify the subsequence target RNAs is still not clear. In recent years, several machine and deep learning-based computational models have been proposed for understanding the binding preferences of RBPs. These methods require integrating multiple features, such as secondary structure, with raw RNA sequences, and their performance can be further improved. In this paper, we propose an efficient and simple convolutional neural network, RBPCNN, that relies on the combination of the raw RNA sequence and evolutionary information. We show that conservation scores (evolutionary information) for the RNA sequences can significantly improve the overall performance of the proposed predictor. In addition, the automatic extraction of the binding sequence motifs can enhance our understanding of the binding specificities of RBPs. The experimental results show that RBPCNN significantly outperforms the current state-of-the-art methods. More specifically, the average area under the receiver operating characteristic curve was improved by 2.67 percent and the mean average precision was improved by 8.03 percent. The datasets and results can be downloaded from https://home.jbnu.ac.kr/NSCL/RBPCNN.htm.
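As a rough illustration of combining raw sequence with evolutionary information, the sketch below builds a per-nucleotide feature matrix with four one-hot channels plus one conservation channel. This is only one plausible input layout, not the encoding documented for RBPCNN; the function name `encode_with_conservation` and the example scores are hypothetical.

```python
ALPHABET = "ACGU"  # channel order for the one-hot part


def encode_with_conservation(seq, conservation):
    """Return one row per nucleotide: [A, C, G, U, conservation score].

    DNA-style input is accepted by mapping T to U. Bases outside the
    alphabet (for example N) get an all-zero one-hot part.
    """
    if len(seq) != len(conservation):
        raise ValueError("sequence and conservation track differ in length")
    rows = []
    for base, score in zip(seq.upper().replace("T", "U"), conservation):
        one_hot = [1.0 if base == b else 0.0 for b in ALPHABET]
        rows.append(one_hot + [float(score)])
    return rows


# Hypothetical 4-nt window with made-up conservation scores.
features = encode_with_conservation("ACGU", [0.1, 0.2, 0.3, 0.4])
```

A convolutional network would then slide its filters along the rows, treating the five columns as input channels.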
|
33
|
Pezoulas VC, Hazapis O, Lagopati N, Exarchos TP, Goules AV, Tzioufas AG, Fotiadis DI, Stratis IG, Yannacopoulos AN, Gorgoulis VG. Machine Learning Approaches on High Throughput NGS Data to Unveil Mechanisms of Function in Biology and Disease. Cancer Genomics Proteomics 2021; 18:605-626. [PMID: 34479914 PMCID: PMC8441762 DOI: 10.21873/cgp.20284] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 06/25/2021] [Revised: 07/21/2021] [Accepted: 08/03/2021] [Indexed: 12/13/2022] Open
Abstract
In this review, the fundamentals of machine learning (ML) and data mining (DM) are summarized together with the techniques for distilling knowledge from state-of-the-art omics experiments. This includes an introduction to the basic mathematical principles of unsupervised/supervised learning methods, dimensionality reduction techniques, deep neural network architectures and the applications of these in bioinformatics. Several case studies under evaluation mainly involve next generation sequencing (NGS) experiments, like deciphering gene expression from total and single cell (scRNA-seq) analysis; for the latter, all recent artificial intelligence (AI) methods for the investigation of cell sub-types, biomarkers and imputation techniques are described. Other areas of interest where various ML schemes have been investigated are for providing information regarding transcription factors (TF) binding sites, chromatin organization patterns and RNA binding proteins (RBPs), while analyses on RNA sequence and structure as well as 3D protein structure predictions with the use of ML are described. Furthermore, we summarize the recent methods of using ML in clinical oncology, when taking into consideration the current omics data with pharmacogenomics to determine personalized treatments. With this review we wish to provide the scientific community with a thorough investigation of the main novel ML applications which take into consideration the latest achievements in genomics, thus unraveling the fundamental mechanisms of biology towards the understanding and cure of diseases.
Affiliation(s)
- Vasileios C Pezoulas
- Unit of Medical Technology and Intelligent Information Systems, University of Ioannina, Ioannina, Greece
| | - Orsalia Hazapis
- Molecular Carcinogenesis Group, Department of Histology and Embryology, School of Medicine, National and Kapodistrian University of Athens, Athens, Greece
| | - Nefeli Lagopati
- Molecular Carcinogenesis Group, Department of Histology and Embryology, School of Medicine, National and Kapodistrian University of Athens, Athens, Greece
- Biomedical Research Foundation of the Academy of Athens, Athens, Greece
| | - Themis P Exarchos
- Unit of Medical Technology and Intelligent Information Systems, Department of Materials Science and Engineering, University of Ioannina, Ioannina, Greece
- Department of Informatics, Ionian University, Corfu, Greece
| | - Andreas V Goules
- Department of Pathophysiology, School of Medicine, National and Kapodistrian University of Athens, Athens, Greece
| | - Athanasios G Tzioufas
- Department of Pathophysiology, School of Medicine, National and Kapodistrian University of Athens, Athens, Greece
| | - Dimitrios I Fotiadis
- Unit of Medical Technology and Intelligent Information Systems, University of Ioannina, Ioannina, Greece
| | - Ioannis G Stratis
- Department of Mathematics, National and Kapodistrian University of Athens, Athens, Greece
| | - Athanasios N Yannacopoulos
- Department of Statistics, and Stochastic Modelling and Applications Laboratory, Athens University of Economics and Business (AUEB), Athens, Greece
| | - Vassilis G Gorgoulis
- Molecular Carcinogenesis Group, Department of Histology and Embryology, School of Medicine, National and Kapodistrian University of Athens, Athens, Greece
- Biomedical Research Foundation of the Academy of Athens, Athens, Greece
- Division of Cancer Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, Manchester Cancer Research Centre, NIHR Manchester Biomedical Research Centre, University of Manchester, Manchester, U.K
- Center for New Biotechnologies and Precision Medicine, Medical School, National and Kapodistrian University of Athens, Athens, Greece
- Faculty of Health and Medical Sciences, University of Surrey, Surrey, U.K
|
34
|
RNA-Binding Motif Protein 11 (RBM11) Serves as a Prognostic Biomarker and Promotes Ovarian Cancer Progression. DISEASE MARKERS 2021; 2021:3037337. [PMID: 34434291 PMCID: PMC8382552 DOI: 10.1155/2021/3037337] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/09/2021] [Revised: 07/24/2021] [Accepted: 08/05/2021] [Indexed: 01/14/2023]
Abstract
Ovarian cancer is one of the most lethal gynecologic malignancies for women. Due to the lack of efficient targeted therapy, the overall survival rate for patients with advanced ovarian cancer is still low. Elucidating the molecular mechanisms dictating ovarian cancer progression is critically important to develop novel therapeutic agents. Here, we found that RNA-binding motif protein 11 (RBM11) was highly elevated in ovarian cancer tissues compared with normal ovary, while RBM11 depletion in ovarian cancer cells resulted in impaired cell growth and invasion. Moreover, knockdown of RBM11 also retarded tumor growth in the A2780 ovarian cancer xenograft model. Mechanistically, we found that RBM11 positively regulated Akt/mTOR signaling pathway activation in ovarian cancer cells. Thus, these results identify RBM11 as a novel oncogenic protein and prognostic biomarker for ovarian cancers.
|
35
|
Wu H, Pan X, Yang Y, Shen HB. Recognizing binding sites of poorly characterized RNA-binding proteins on circular RNAs using attention Siamese network. Brief Bioinform 2021; 22:6326526. [PMID: 34297803 DOI: 10.1093/bib/bbab279] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/03/2021] [Revised: 06/04/2021] [Accepted: 07/01/2021] [Indexed: 12/24/2022] Open
Abstract
Circular RNAs (circRNAs) interact with RNA-binding proteins (RBPs) to play crucial roles in gene regulation and disease development. Computational approaches have attracted much attention for quickly predicting high-potential RBP binding sites on circRNAs using statistical knowledge of sequence or structure binding preferences. Deep learning is one of the popular learning models in this area but usually requires a lot of labeled training data. It would perform unsatisfactorily for the less characterized RBPs with a limited number of known target circRNAs. How to improve the prediction performance for such poorly characterized RBPs with few labeled examples is a challenging task for deep learning-based models. In this study, we propose an RBP-specific method, iDeepC, for predicting RBP binding sites on circRNAs from sequences. It adopts a Siamese neural network consisting of a lightweight attention module and a metric module. We have found that the Siamese neural network effectively enhances the network capability of capturing mutual information between circRNAs with pairwise metric learning. To further deal with the small-sample size problem, we have performed pretraining using available labeled data from other RBPs and also demonstrate the efficacy of this transfer-learning pipeline. We comprehensively evaluated iDeepC on the benchmark datasets of RBP-binding circRNAs, and the results suggest that iDeepC achieves promising performance on the poorly characterized RBPs. The source code is available at https://github.com/hehew321/iDeepC.
Affiliation(s)
- Hehe Wu
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Yang Yang
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
|
36
|
Sohrabi-Jahromi S, Söding J. Thermodynamic modeling reveals widespread multivalent binding by RNA-binding proteins. Bioinformatics 2021; 37:i308-i316. [PMID: 34252974 PMCID: PMC8275352 DOI: 10.1093/bioinformatics/btab300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Understanding how proteins recognize their RNA targets is essential to elucidate regulatory processes in the cell. Many RNA-binding proteins (RBPs) form complexes or have multiple domains that allow them to bind to RNA in a multivalent, cooperative manner. They can thereby achieve higher specificity and affinity than proteins with a single RNA-binding domain. However, current approaches to de novo discovery of RNA binding motifs do not take multivalent binding into account. RESULTS We present Bipartite Motif Finder (BMF), which is based on a thermodynamic model of RBPs with two cooperatively binding RNA-binding domains. We show that bivalent binding is a common strategy among RBPs, yielding higher affinity and sequence specificity. We furthermore illustrate that the spatial geometry between the binding sites can be learned from bound RNA sequences. These discovered bipartite motifs are consistent with previously known motifs and binding behaviors. Our results demonstrate the importance of multivalent binding for RNA-binding proteins and highlight the value of bipartite motif models in representing the multivalency of protein-RNA interactions. AVAILABILITY AND IMPLEMENTATION BMF source code is available at https://github.com/soedinglab/bipartite_motif_finder under a GPL license. The BMF web server is accessible at https://bmf.soedinglab.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Salma Sohrabi-Jahromi
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen 37077, Germany
| | - Johannes Söding
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen 37077, Germany; Campus-Institut Data Science (CIDAS), Göttingen 37077, Germany
|
37
|
Boeckel JN, Möbius-Winkler M, Müller M, Rebs S, Eger N, Schoppe L, Tappu R, Kokot KE, Kneuer JM, Gaul S, Bordalo DM, Lai A, Haas J, Ghanbari M, Drewe-Boss P, Liss M, Katus HA, Ohler U, Gotthardt M, Laufs U, Streckfuss-Bömeke K, Meder B. SLM2 Is A Novel Cardiac Splicing Factor Involved in Heart Failure due to Dilated Cardiomyopathy. GENOMICS PROTEOMICS & BIOINFORMATICS 2021; 20:129-146. [PMID: 34273561 PMCID: PMC9510876 DOI: 10.1016/j.gpb.2021.01.006] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/09/2020] [Accepted: 02/01/2021] [Indexed: 01/09/2023]
Abstract
Alternative mRNA splicing is a fundamental process to increase the versatility of the genome. In humans, cardiac mRNA splicing is involved in the pathophysiology of heart failure. Mutations in the splicing factor RNA binding motif protein 20 (RBM20) cause severe forms of cardiomyopathy. To identify novel cardiomyopathy-associated splicing factors, RNA-seq and tissue-enrichment analyses were performed, which identified up-regulated expression of Sam68-Like mammalian protein 2 (SLM2) in the left ventricle of dilated cardiomyopathy (DCM) patients. In the human heart, SLM2 binds to important transcripts of sarcomere constituents, such as those encoding myosin light chain 2 (MYL2), troponin I3 (TNNI3), troponin T2 (TNNT2), tropomyosin 1/2 (TPM1/2), and titin (TTN). Mechanistically, SLM2 mediates intron retention, prevents exon exclusion, and thereby mediates alternative splicing of the mRNA regions encoding the variable proline-, glutamate-, valine-, and lysine-rich (PEVK) domain and another part of the I-band region of titin. In summary, SLM2 is a novel cardiac splicing regulator with essential functions for maintaining cardiomyocyte integrity by binding to and processing the mRNAs of essential cardiac constituents such as titin.
Affiliation(s)
- Jes-Niels Boeckel
- Department of Cardiology, Angiology and Pneumology, University Hospital Heidelberg, Heidelberg 69120, Germany; Klinik und Poliklinik für Kardiologie, Universitätskrankenhaus Leipzig, Leipzig 04103, Germany
| | | | - Marion Müller
- Department of Cardiology, Angiology and Pneumology, University Hospital Heidelberg, Heidelberg 69120, Germany; German Center for Cardiovascular Research (DZHK), Partner site Heidelberg, Heidelberg 69120, Germany; Clinic for General and Interventional Cardiology/ Angiology, Herz- und Diabeteszentrum NRW, Ruhr-Universität Bochum, Bad Oeynhausen 32545, Germany
| | - Sabine Rebs
- Department of Cardiology and Pneumology, University Hospital, Georg-August University Goettingen, Goettingen 37075, Germany; German Center for Cardiovascular Research (DZHK), Partner site Goettingen, Goettingen 37075, Germany
| | - Nicole Eger
- Department of Cardiology, Angiology and Pneumology, University Hospital Heidelberg, Heidelberg 69120, Germany
| | - Laura Schoppe
- Department of Cardiology, Angiology and Pneumology, University Hospital Heidelberg, Heidelberg 69120, Germany
| | - Rewati Tappu
- Department of Cardiology, Angiology and Pneumology, University Hospital Heidelberg, Heidelberg 69120, Germany
| | - Karoline E Kokot
- Klinik und Poliklinik für Kardiologie, Universitätskrankenhaus Leipzig, Leipzig 04103, Germany
| | - Jasmin M Kneuer
- Klinik und Poliklinik für Kardiologie, Universitätskrankenhaus Leipzig, Leipzig 04103, Germany
| | - Susanne Gaul
- Klinik und Poliklinik für Kardiologie, Universitätskrankenhaus Leipzig, Leipzig 04103, Germany
| | - Diana M Bordalo
- Department of Cardiology, Angiology and Pneumology, University Hospital Heidelberg, Heidelberg 69120, Germany; German Center for Cardiovascular Research (DZHK), Partner site Heidelberg, Heidelberg 69120, Germany
| | - Alan Lai
- Department of Cardiology, Angiology and Pneumology, University Hospital Heidelberg, Heidelberg 69120, Germany; German Center for Cardiovascular Research (DZHK), Partner site Heidelberg, Heidelberg 69120, Germany
| | - Jan Haas
- Department of Cardiology, Angiology and Pneumology, University Hospital Heidelberg, Heidelberg 69120, Germany; German Center for Cardiovascular Research (DZHK), Partner site Heidelberg, Heidelberg 69120, Germany
| | - Mahsa Ghanbari
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin 10115, Germany; Institute of Biology, Humboldt Universität zu Berlin, Berlin 10099, Germany
| | - Philipp Drewe-Boss
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin 10115, Germany; Institute of Biology, Humboldt Universität zu Berlin, Berlin 10099, Germany
| | - Martin Liss
- Neuromuscular and Cardiovascular Cell Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin 13092, Germany; German Center for Cardiovascular Research (DZHK), Partner site Berlin, Berlin 10117, Germany
| | - Hugo A Katus
- Department of Cardiology, Angiology and Pneumology, University Hospital Heidelberg, Heidelberg 69120, Germany; German Center for Cardiovascular Research (DZHK), Partner site Heidelberg, Heidelberg 69120, Germany
| | - Uwe Ohler
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin 10115, Germany; Institute of Biology, Humboldt Universität zu Berlin, Berlin 10099, Germany
| | - Michael Gotthardt
- Neuromuscular and Cardiovascular Cell Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin 13092, Germany; German Center for Cardiovascular Research (DZHK), Partner site Berlin, Berlin 10117, Germany
| | - Ulrich Laufs
- Klinik und Poliklinik für Kardiologie, Universitätskrankenhaus Leipzig, Leipzig 04103, Germany
| | - Katrin Streckfuss-Bömeke
- Department of Cardiology and Pneumology, University Hospital, Georg-August University Goettingen, Goettingen 37075, Germany; German Center for Cardiovascular Research (DZHK), Partner site Goettingen, Goettingen 37075, Germany
| | - Benjamin Meder
- Department of Cardiology, Angiology and Pneumology, University Hospital Heidelberg, Heidelberg 69120, Germany; German Center for Cardiovascular Research (DZHK), Partner site Heidelberg, Heidelberg 69120, Germany; Stanford Genome Technology Center, Department of Genetics, Stanford Medical School, Palo Alto, CA 94304, USA.
|
38
|
Song Z, Huang D, Song B, Chen K, Song Y, Liu G, Su J, Magalhães JPD, Rigden DJ, Meng J. Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications. Nat Commun 2021; 12:4011. [PMID: 34188054 PMCID: PMC8242015 DOI: 10.1038/s41467-021-24313-3] [Citation(s) in RCA: 51] [Impact Index Per Article: 17.0] [Received: 11/21/2020] [Accepted: 06/07/2021] [Indexed: 02/08/2023] Open
Abstract
Recent studies suggest that epi-transcriptome regulation via post-transcriptional RNA modifications is vital for all RNA types. Precise identification of RNA modification sites is essential for understanding the functions and regulatory mechanisms of RNAs. Here, we present MultiRM, a method for the integrated prediction and interpretation of post-transcriptional RNA modifications from RNA sequences. Built upon an attention-based multi-label deep learning framework, MultiRM not only simultaneously predicts the putative sites of twelve widely occurring transcriptome modifications (m6A, m1A, m5C, m5U, m6Am, m7G, Ψ, I, Am, Cm, Gm, and Um), but also returns the key sequence contents that contribute most to the positive predictions. Importantly, our model revealed a strong association among different types of RNA modifications from the perspective of their associated sequence contexts. Our work provides a solution for detecting multiple RNA modifications, enabling an integrated analysis of these RNA modifications, and gaining a better understanding of sequence-based RNA modification mechanisms. RNA modifications appear to play a role in determining RNA structure and function. Here, the authors develop a deep learning model that predicts the location of 12 RNA modifications using primary sequence, and show that several modifications are associated, which suggests dependencies between them.
Affiliation(s)
- Zitao Song
- Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, PR China
| | - Daiyun Huang
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, PR China; Department of Computer Sciences, University of Liverpool, Liverpool, United Kingdom
| | - Bowen Song
- Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, PR China; Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
| | - Kunqi Chen
- Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, PR China
| | - Yiyou Song
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, PR China
| | - Gang Liu
- Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, PR China
| | - Jionglong Su
- School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi'an Jiaotong-Liverpool University, Suzhou, PR China
| | | | - Daniel J Rigden
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
| | - Jia Meng
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, PR China; Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom; AI University Research Centre, Xi'an Jiaotong-Liverpool University, Suzhou, PR China
|
39
|
Guo X, Ohler U, Yildirim F. How to find genomic regions relevant for gene regulation. MED GENET-BERLIN 2021; 33:157-165. [PMID: 38836026 PMCID: PMC11007629 DOI: 10.1515/medgen-2021-2074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2021] [Accepted: 07/09/2021] [Indexed: 06/06/2024]
Abstract
Genetic variants associated with human diseases are often located outside the protein coding regions of the genome. Identification and functional characterization of the regulatory elements in the non-coding genome is therefore of crucial importance for understanding the consequences of genetic variation and the mechanisms of disease. The past decade has seen rapid progress in high-throughput analysis and mapping of chromatin accessibility, looping, structure, and occupancy by transcription factors, as well as epigenetic modifications, all of which contribute to the proper execution of regulatory functions in the non-coding genome. Here, we review the current technologies for the definition and functional validation of non-coding regulatory regions in the genome.
Affiliation(s)
- Xuanzong Guo
- Department of Psychiatry and Psychotherapy, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 10117 Berlin, Germany
| | - Uwe Ohler
- Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin Institute for Medical Systems Biology, 10115 Berlin, Germany
- Department of Biology, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Ferah Yildirim
- Department of Psychiatry and Psychotherapy, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 10117 Berlin, Germany
|
40
|
Back G, Walther D. Identification of cis-regulatory motifs in first introns and the prediction of intron-mediated enhancement of gene expression in Arabidopsis thaliana. BMC Genomics 2021; 22:390. [PMID: 34039279 PMCID: PMC8157754 DOI: 10.1186/s12864-021-07711-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/11/2021] [Accepted: 05/11/2021] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND Intron-mediated enhancement (IME) is the potential of introns to enhance the expression of their respective genes. This essential function of introns has been observed in a wide range of species, including fungi, plants, and animals. However, the mechanisms underlying the enhancement are as yet poorly understood. The goal of this study was to identify potential IME-related sequence motifs and genomic features in first introns of genes in Arabidopsis thaliana. RESULTS Based on the rationale that functional sequence motifs are evolutionarily conserved, we exploited the deep sequencing information available for Arabidopsis thaliana, covering more than one thousand Arabidopsis accessions, and identified 81 candidate hexamer motifs with increased conservation across all accessions that also exhibit positional occurrence preferences. Of those, 71 were found associated with increased correlation of gene expression of genes harboring them, suggesting a cis-regulatory role. Filtering further for effect on gene expression correlation yielded a set of 16 hexamer motifs, corresponding to five consensus motifs. While all five motifs represent new motif definitions, two are similar to the two previously reported IME-motifs, whereas three are altogether novel. Both consensus and hexamer motifs were found associated with higher expression of alleles harboring them as compared to alleles containing mutated motif variants as found in naturally occurring Arabidopsis accessions. To identify additional IME-related genomic features, Random Forest models were trained for the classification of gene expression level based on an array of sequence-related features. The results indicate that introns contain information with regard to gene expression level and suggest sequence-compositional features as most informative, while position-related features, thought to be of central importance before, were found with lower than expected relevance.
CONCLUSIONS Exploiting deep sequencing and broad gene expression information on a genome-wide scale, this study confirmed the regulatory role of first introns, characterized their intra-species conservation, and identified a set of novel sequence motifs located in first introns of genes in the genome of the plant Arabidopsis thaliana that may play a role in inducing high and correlated gene expression of the genes harboring them.
Affiliation(s)
- Georg Back
- Max Planck Institute of Molecular Plant Physiology, 14476, Potsdam, Germany
| | - Dirk Walther
- Max Planck Institute of Molecular Plant Physiology, 14476, Potsdam, Germany.
|
41
|
DeepTFactor: A deep learning-based tool for the prediction of transcription factors. Proc Natl Acad Sci U S A 2021; 118:2021171118. [PMID: 33372147 DOI: 10.1073/pnas.2021171118] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Indexed: 12/16/2022] Open
Abstract
A transcription factor (TF) is a sequence-specific DNA-binding protein that modulates the transcription of a set of particular genes, and thus regulates gene expression in the cell. TFs have commonly been predicted by analyzing sequence homology with the DNA-binding domains of TFs already characterized. Thus, TFs that do not show homologies with the reported ones are difficult to predict. Here we report the development of a deep learning-based tool, DeepTFactor, that predicts whether a protein in question is a TF. DeepTFactor uses a convolutional neural network to extract features of a protein. It showed high performance in predicting TFs of both eukaryotic and prokaryotic origins, resulting in F1 scores of 0.8154 and 0.8000, respectively. Analysis of the gradients of prediction score with respect to input suggested that DeepTFactor detects DNA-binding domains and other latent features for TF prediction. DeepTFactor predicted 332 candidate TFs in Escherichia coli K-12 MG1655. Among them, 84 candidate TFs belong to the y-ome, which is a collection of genes that lack experimental evidence of function. We experimentally validated the results of DeepTFactor prediction by further characterizing genome-wide binding sites of three predicted TFs, YqhC, YiaU, and YahB. Furthermore, we made available the list of 4,674,808 TFs predicted from 73,873,012 protein sequences in 48,346 genomes. DeepTFactor will serve as a useful tool for predicting TFs, which is necessary for understanding the regulatory systems of organisms of interest. We provide DeepTFactor as a stand-alone program, available at https://bitbucket.org/kaistsystemsbiology/deeptfactor.
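For readers unfamiliar with the F1 score reported in this abstract, it is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the counts used in the example are invented for illustration and are not taken from the DeepTFactor evaluation.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall).

    tp, fp, fn are counts of true positives, false positives and
    false negatives. Returns 0.0 when the score is undefined.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 8 TFs correctly predicted, 2 false alarms, 2 missed.
example_f1 = f1_score(tp=8, fp=2, fn=2)
```

Because true negatives do not enter the formula, F1 is often preferred over accuracy when positives (here, TFs among all proteins) are rare.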
|
42
|
Sun L, Xu K, Huang W, Yang YT, Li P, Tang L, Xiong T, Zhang QC. Predicting dynamic cellular protein-RNA interactions by deep learning using in vivo RNA structures. Cell Res 2021; 31:495-516. [PMID: 33623109 PMCID: PMC7900654 DOI: 10.1038/s41422-021-00476-y] [Citation(s) in RCA: 45] [Impact Index Per Article: 15.0] [Received: 12/23/2020] [Accepted: 01/19/2021] [Indexed: 01/31/2023] Open
Abstract
Interactions with RNA-binding proteins (RBPs) are integral to RNA function and cellular regulation, and dynamically reflect specific cellular conditions. However, presently available tools for predicting RBP-RNA interactions employ RNA sequence and/or predicted RNA structures, and therefore do not capture their condition-dependent nature. Here, after profiling transcriptome-wide in vivo RNA secondary structures in seven cell types, we developed PrismNet, a deep learning tool that integrates experimental in vivo RNA structure data and RBP binding data for matched cells to accurately predict dynamic RBP binding in various cellular conditions. PrismNet results for 168 RBPs support its utility for both understanding CLIP-seq results and largely extending such interaction data to accurately analyze additional cell types. Further, PrismNet employs an "attention" strategy to computationally identify exact RBP-binding nucleotides, and we discovered enrichment among dynamic RBP-binding sites for structure-changing variants (riboSNitches), which can link genetic diseases with dysregulated RBP bindings. Our rich profiling data and deep learning-based prediction tool provide access to a previously inaccessible layer of cell-type-specific RBP-RNA interactions, with clear utility for understanding and treating human diseases.
Affiliation(s)
- Lei Sun
- MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology and Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Center for Life Sciences, Beijing 100084, China
| | - Kui Xu
- MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology and Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Center for Life Sciences, Beijing 100084, China
| | - Wenze Huang
- MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology and Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Center for Life Sciences, Beijing 100084, China
| | - Yucheng T Yang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
| | - Pan Li
- MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology and Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Center for Life Sciences, Beijing 100084, China
| | - Lei Tang
- MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology and Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Center for Life Sciences, Beijing 100084, China
| | - Tuanlin Xiong
- MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology and Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Center for Life Sciences, Beijing 100084, China
| | - Qiangfeng Cliff Zhang
- MOE Key Laboratory of Bioinformatics, Beijing Advanced Innovation Center for Structural Biology and Frontier Research Center for Biological Structure, Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China.
- Tsinghua-Peking Center for Life Sciences, Beijing 100084, China.
|
43
|
Koo PK, Majdandzic A, Ploenzke M, Anand P, Paul SB. Global importance analysis: An interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput Biol 2021; 17:e1008925. [PMID: 33983921 PMCID: PMC8118286 DOI: 10.1371/journal.pcbi.1008925] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 08/18/2020] [Accepted: 03/30/2021] [Indexed: 12/15/2022]
Abstract
Deep neural networks have demonstrated improved performance at predicting the sequence specificities of DNA- and RNA-binding proteins compared to previous methods that rely on k-mers and position weight matrices. To gain insights into why a DNN makes a given prediction, model interpretability methods, such as attribution methods, can be employed to identify motif-like representations along a given sequence. Because explanations are given on an individual sequence basis and can vary substantially across sequences, deducing generalizable trends across the dataset and quantifying their effect size remains a challenge. Here we introduce global importance analysis (GIA), a model interpretability method that quantifies the population-level effect size that putative patterns have on model predictions. GIA provides an avenue to quantitatively test hypotheses of putative patterns and their interactions with other patterns, as well as map out specific functions the network has learned. As a case study, we demonstrate the utility of GIA on the computational task of predicting RNA-protein interactions from sequence. We first introduce a convolutional network, we call ResidualBind, and benchmark its performance against previous methods on RNAcompete data. Using GIA, we then demonstrate that in addition to sequence motifs, ResidualBind learns a model that considers the number of motifs, their spacing, and sequence context, such as RNA secondary structure and GC-bias.
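The population-level idea behind GIA can be illustrated with a small sketch. The abstract does not spell out the procedure, so the following is only one plausible reading: embed a putative pattern at a fixed position in many background sequences and compare the mean model output with and without it. The uniform background, the `toy_model` scorer, the motif `UGCAUG`, and every function name here are hypothetical stand-ins, not the authors' code or the ResidualBind network.

```python
import numpy as np

ALPHABET = np.array(list("ACGU"))

def sample_background(n_seqs, length, rng):
    """Draw random RNA sequences (assumption: uniform nucleotide background)."""
    idx = rng.integers(0, 4, size=(n_seqs, length))
    return ["".join(ALPHABET[row]) for row in idx]

def embed(seq, pattern, pos):
    """Overwrite seq with pattern starting at pos, keeping the length fixed."""
    return seq[:pos] + pattern + seq[pos + len(pattern):]

def toy_model(seq, motif="UGCAUG"):
    """Hypothetical stand-in for a trained network: counts motif occurrences."""
    k = len(motif)
    return float(sum(seq[i:i + k] == motif for i in range(len(seq) - k + 1)))

def global_importance(model, pattern, pos, n_seqs=500, length=40, seed=0):
    """Mean prediction with the pattern embedded minus mean prediction without."""
    rng = np.random.default_rng(seed)
    background = sample_background(n_seqs, length, rng)
    treated = [embed(s, pattern, pos) for s in background]
    return np.mean([model(s) for s in treated]) - np.mean([model(s) for s in background])

if __name__ == "__main__":
    print("motif  :", global_importance(toy_model, "UGCAUG", pos=10))
    print("control:", global_importance(toy_model, "AAAAAA", pos=10))
```

Because the effect is averaged over a sampled population rather than read from a single sequence, the same function can be called with two patterns at varying spacing to probe interactions, which is the kind of hypothesis test the abstract describes.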
Affiliation(s)
- Peter K. Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| | - Antonio Majdandzic
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| | - Matthew Ploenzke
- Department of Biostatistics, Harvard University, Cambridge, Massachusetts, United States of America
| | - Praveen Anand
- Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
| | - Steffan B. Paul
- Bioinformatics and Integrative Genomics Program, Harvard Medical School, Boston, Massachusetts, United States of America
|
44
|
Yan Z, Hamilton WL, Blanchette M. Graph neural representational learning of RNA secondary structures for predicting RNA-protein interactions. Bioinformatics 2021; 36:i276-i284. [PMID: 32657407 PMCID: PMC7355240 DOI: 10.1093/bioinformatics/btaa456] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Indexed: 12/14/2022]
Abstract
MOTIVATION RNA-protein interactions are key effectors of post-transcriptional regulation. Significant experimental and bioinformatics efforts have been expended on characterizing protein binding mechanisms on the molecular level, and on highlighting the sequence and structural traits of RNA that impact the binding specificity for different proteins. Yet our ability to predict these interactions in silico remains relatively poor. RESULTS In this study, we introduce RPI-Net, a graph neural network approach for RNA-protein interaction prediction. RPI-Net learns and exploits a graph representation of RNA molecules, yielding significant performance gains over existing state-of-the-art approaches. We also introduce an approach to rectify an important type of sequence bias caused by the RNase T1 enzyme used in many CLIP-Seq experiments, and we show that correcting this bias is essential in order to learn meaningful predictors and properly evaluate their accuracy. Finally, we provide new approaches to interpret the trained models and extract simple, biologically interpretable representations of the learned sequence and structural motifs. AVAILABILITY AND IMPLEMENTATION Source code can be accessed at https://www.github.com/HarveyYan/RNAonGraph. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zichao Yan
- School of Computer Science, McGill University, Montreal, QC H3A 2B2, Canada
- MILA, Quebec AI Institute, Montreal, QC H2S 3H1, Canada
| | - William L Hamilton
- School of Computer Science, McGill University, Montreal, QC H3A 2B2, Canada
- MILA, Quebec AI Institute, Montreal, QC H2S 3H1, Canada
| | - Mathieu Blanchette
- School of Computer Science, McGill University, Montreal, QC H3A 2B2, Canada
|
45
|
Hafner M, Katsantoni M, Köster T, Marks J, Mukherjee J, Staiger D, Ule J, Zavolan M. CLIP and complementary methods. Nat Rev Methods Primers 2021. [DOI: 10.1038/s43586-021-00018-1] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Indexed: 02/07/2023]
|
46
|
Yang S, Liu X, Ng RT. ProbeRating: a recommender system to infer binding profiles for nucleic acid-binding proteins. Bioinformatics 2021; 36:4797-4804. [PMID: 32573679 PMCID: PMC7750938 DOI: 10.1093/bioinformatics/btaa580] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/12/2019] [Revised: 05/18/2020] [Accepted: 06/18/2020] [Indexed: 12/15/2022]
Abstract
MOTIVATION The interaction between proteins and nucleic acids plays a crucial role in gene regulation and cell function. Determining the binding preferences of nucleic acid-binding proteins (NBPs), namely RNA-binding proteins (RBPs) and transcription factors (TFs), is key to deciphering the protein-nucleic acid interaction code. Today, available NBP binding data from in vivo or in vitro experiments are still limited, which leaves a large portion of NBPs uncovered. Unfortunately, existing computational methods that model the NBP binding preferences are mostly protein specific: they need experimental data for the specific protein of interest, and thus only focus on experimentally characterized NBPs. The binding preferences of experimentally unexplored NBPs remain largely unknown. RESULTS Here, we introduce ProbeRating, a nucleic acid recommender system that utilizes techniques from deep learning and word embeddings of natural language processing. ProbeRating is developed to predict binding profiles for unexplored or poorly studied NBPs by exploiting their homologous NBPs that currently have available binding data. Requiring only sequence information as input, ProbeRating adapts FastText from Facebook AI Research to extract biological features. It then builds a neural network-based recommender system. We evaluate the performance of ProbeRating on two different tasks: one for RBP and one for TF. As a result, ProbeRating outperforms previous methods on both tasks. The results show that ProbeRating can be a useful tool to study the binding mechanism for the many NBPs that lack direct experimental evidence. AVAILABILITY AND IMPLEMENTATION The source code is freely available at <https://github.com/syang11/ProbeRating>. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shu Yang
- Department of Computer Science, University of British Columbia, Vancouver, BC V6T1Z4, Canada
| | - Xiaoxi Liu
- RIKEN Center for Integrative Medical Sciences (IMS), Yokohama 230-0045, Japan
| | - Raymond T Ng
- Department of Computer Science, University of British Columbia, Vancouver, BC V6T1Z4, Canada
|
47
|
Miko H, Qiu Y, Gaertner B, Sander M, Ohler U. Inferring time series chromatin states for promoter-enhancer pairs based on Hi-C data. BMC Genomics 2021; 22:84. [PMID: 33509077 PMCID: PMC7841892 DOI: 10.1186/s12864-021-07373-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/10/2020] [Accepted: 01/07/2021] [Indexed: 11/12/2022]
Abstract
BACKGROUND Co-localized combinations of histone modifications ("chromatin states") have been shown to correlate with promoter and enhancer activity. Changes in chromatin states over multiple time points ("chromatin state trajectories") have previously been analyzed at promoters and enhancers separately. With the advent of time series Hi-C data, it is now possible to connect promoters and enhancers and to analyze chromatin state trajectories at promoter-enhancer pairs. RESULTS We present TimelessFlex, a framework for investigating chromatin state trajectories at promoters and enhancers and at promoter-enhancer pairs based on Hi-C information. TimelessFlex extends our previous approach Timeless, a Bayesian network for clustering multiple histone modification data sets at promoter and enhancer feature regions. We utilize time series ATAC-seq data measuring open chromatin to define promoters and enhancer candidates. We developed an expectation-maximization algorithm to assign promoters and enhancers to each other based on Hi-C interactions and jointly cluster their feature regions into paired chromatin state trajectories. We find jointly clustered promoter-enhancer pairs showing the same activation patterns on both sides but with a stronger trend at the enhancer side. While the promoter side remains accessible across the time series, the enhancer side becomes dynamically more open towards the gene activation time point. Promoter cluster patterns show strong correlations with gene expression signals, whereas Hi-C signals get only slightly stronger towards activation. The code of the framework is available at https://github.com/henriettemiko/TimelessFlex. CONCLUSIONS TimelessFlex clusters time series histone modifications at promoter-enhancer pairs based on Hi-C and it can identify distinct chromatin states at promoter and enhancer feature regions and their changes over time.
Affiliation(s)
- Henriette Miko
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine, 13125, Berlin, Germany
- Department of Computer Science, Humboldt-Universität zu Berlin, 10117, Berlin, Germany
| | - Yunjiang Qiu
- Ludwig Institute for Cancer Research, La Jolla, CA, 92093, USA
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA
| | - Bjoern Gaertner
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, 92093, USA
- Department of Pediatrics, Pediatric Diabetes Research Center, University of California San Diego, La Jolla, CA, 92093, USA
| | - Maike Sander
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, 92093, USA
- Department of Pediatrics, Pediatric Diabetes Research Center, University of California San Diego, La Jolla, CA, 92093, USA
| | - Uwe Ohler
- Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine, 13125, Berlin, Germany.
- Department of Computer Science, Humboldt-Universität zu Berlin, 10117, Berlin, Germany.
- Department of Biology, Humboldt-Universität zu Berlin, 10117, Berlin, Germany.
|
48
|
Asif M, Orenstein Y. DeepSELEX: inferring DNA-binding preferences from HT-SELEX data using multi-class CNNs. Bioinformatics 2020; 36:i634-i642. [PMID: 33381817 DOI: 10.1093/bioinformatics/btaa789] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 11/15/2022]
Abstract
MOTIVATION Transcription factor (TF) DNA-binding is a central mechanism in gene regulation. Biologists would like to know where and when these factors bind DNA. Hence, they require accurate DNA-binding models to enable binding prediction to any DNA sequence. Recent technological advancements measure the binding of a single TF to thousands of DNA sequences. One of the prevailing techniques, high-throughput SELEX, measures protein-DNA binding by high-throughput sequencing over several cycles of enrichment. Unfortunately, current computational methods to infer the binding preferences from high-throughput SELEX data do not exploit the richness of these data, and are under-using the most advanced computational technique, deep neural networks. RESULTS To better characterize the binding preferences of TFs from these experimental data, we developed DeepSELEX, a new algorithm to infer intrinsic DNA-binding preferences using deep neural networks. DeepSELEX takes advantage of the richness of high-throughput sequencing data and learns the DNA-binding preferences by observing the changes in DNA sequences through the experimental cycles. DeepSELEX outperforms extant methods for the task of DNA-binding inference from high-throughput SELEX data in binding prediction in vitro and is on par with the state of the art in in vivo binding prediction. Analysis of model parameters reveals it learns biologically relevant features that shed light on TFs' binding mechanism. AVAILABILITY AND IMPLEMENTATION DeepSELEX is available through github.com/OrensteinLab/DeepSELEX/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Maor Asif
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
| | - Yaron Orenstein
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
|
49
|
Long-read RNA sequencing of human and animal filarial parasites improves gene models and discovers operons. PLoS Negl Trop Dis 2020; 14:e0008869. [PMID: 33196647 PMCID: PMC7704054 DOI: 10.1371/journal.pntd.0008869] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/22/2020] [Revised: 11/30/2020] [Accepted: 10/09/2020] [Indexed: 01/01/2023]
Abstract
Filarial parasitic nematodes (Filarioidea) cause substantial disease burden to humans and animals around the world. Recently there has been a coordinated global effort to generate, annotate, and curate genomic data from nematode species of medical and veterinary importance. This has resulted in two chromosome-level assemblies (Brugia malayi and Onchocerca volvulus) and 11 additional draft genomes from Filarioidea. These reference assemblies facilitate comparative genomics to explore basic helminth biology and prioritize new drug and vaccine targets. While the continual improvement of genome contiguity and completeness advances these goals, experimental functional annotation of genes is often hindered by poor gene models. Short-read RNA sequencing data and expressed sequence tags, in cooperation with ab initio prediction algorithms, are employed for gene prediction, but these can result in missing clade-specific genes, fragmented models, imperfect mapping of gene ends, and lack of isoform resolution. Long-read RNA sequencing can overcome these drawbacks and greatly improve gene model quality. Here, we present Iso-Seq data for B. malayi and Dirofilaria immitis, etiological agents of lymphatic filariasis and canine heartworm disease, respectively. These data cover approximately half of the known coding genomes and substantially improve gene models by extending untranslated regions, cataloging novel splice junctions from novel isoforms, and correcting mispredicted junctions. Furthermore, we validated computationally predicted operons, manually curated new operons, and merged fragmented gene models. We carried out analyses of poly(A) tails in both species, leading to the identification of non-canonical poly(A) signals. Finally, we prioritized and assessed known and putative anthelmintic targets, correcting or validating gene models for molecular cloning and target-based anthelmintic screening efforts. Overall, these data significantly improve the catalog of gene models for two important parasites, and they demonstrate how long-read RNA sequencing should be prioritized for ongoing improvement of parasitic nematode genome assemblies.
Filarial parasitic nematodes are vector-borne parasites that infect humans and animals. Brugia malayi and Dirofilaria immitis are transmitted by mosquitoes and cause human lymphatic filariasis and canine heartworm disease, respectively. Recent years have seen a dramatic increase in genomic and transcriptomic data sets and the concomitant increase in innovative strategies for drug target identification, validation, and screening. However, while the completeness of genome assemblies of filarial parasitic nematodes has seen steady improvements, the reliability of gene models has not kept pace, hindering cloning efforts. Long-read RNA sequencing technologies are uniquely able to improve gene models, but have not been widely used for the causative agents of neglected tropical diseases. Here, we report the improvement of gene models in both B. malayi and D. immitis by long-read RNA sequencing. We identified novel operons, deprecated false positive operons, identified dozens of novel genes, and described the parameters of polyadenylation. We also focused on putative anthelmintic targets, identifying novel isoforms and correcting gene models. These data substantially increase the trustworthiness of gene models in these two species and demonstrate how long-read sequencing approaches should be prioritized in the continued improvement of genome assemblies and their gene annotations.
|
50
|
Song J, Tian S, Yu L, Xing Y, Yang Q, Duan X, Dai Q. AC-Caps: Attention Based Capsule Network for Predicting RBP Binding Sites of LncRNA. Interdiscip Sci 2020; 12:414-423. [PMID: 32572768 DOI: 10.1007/s12539-020-00379-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 01/11/2020] [Revised: 05/18/2020] [Accepted: 05/30/2020] [Indexed: 01/03/2023]
Abstract
Long non-coding RNA (lncRNA) is a class of non-coding RNA longer than 200 nucleotides that has no protein-encoding function. LncRNA plays a key role in many biological processes. Studying the RNA-binding protein (RBP) binding sites on the lncRNA chain helps to reveal epigenetic and post-transcriptional mechanisms, to explore the physiological and pathological processes of cancer, and to discover new therapeutic breakthroughs. To improve the recognition rate of RBP binding sites and reduce the experimental time and cost, many computational methods based on domain knowledge have emerged to predict RBP binding sites. However, these prediction methods treat nucleotides independently and do not take nucleotide statistics into account. In this paper, we use a high-order statistics-based encoding scheme, and the encoded lncRNA sequences are then fed into a hybrid deep learning architecture named AC-Caps. It consists of a joint processing layer (composed of an attention mechanism and a convolutional neural network) and a capsule network. The AC-Caps model was evaluated using 31 independent experimental data sets from 12 lncRNA-binding proteins. In experiments, our method achieves excellent performance, with an average area under the curve (AUC) of 0.967 and an average accuracy (ACC) of 92.5%, which are higher by 0.014 and 2.3% than HOCCNNLB, by 0.261 and 28.9% than iDeepS, and by 0.189 and 21.8% than DeepBind, respectively. The results show that the AC-Caps method can reliably process the large-scale RBP binding site data on the lncRNA chain, and the prediction performance is better than existing deep-learning models. The source code of AC-Caps and the datasets used in this paper are available at https://github.com/JinmiaoS/AC-Caps.
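The six reported margins are easier to read once they are paired with the baselines. The sketch below does only that arithmetic. It assumes the values are ordered as (AUC gain, ACC gain) for HOCCNNLB, iDeepS, and DeepBind in turn, and that the ACC gains are percentage points; neither assumption is confirmed by the abstract, and the implied baseline figures are derived here, not reported by the authors.

```python
# AC-Caps averages as stated in the abstract.
AC_CAPS = {"auc": 0.967, "acc": 92.5}

# Margins over each baseline as (AUC gain, ACC gain in percentage points).
# The pairing is an assumption based on the order of the listed values.
MARGINS = {
    "HOCCNNLB": (0.014, 2.3),
    "iDeepS": (0.261, 28.9),
    "DeepBind": (0.189, 21.8),
}

def implied_baselines(ours, margins):
    """Subtract each margin from the AC-Caps score to get the implied baseline."""
    return {
        name: {
            "auc": round(ours["auc"] - d_auc, 3),
            "acc": round(ours["acc"] - d_acc, 1),
        }
        for name, (d_auc, d_acc) in margins.items()
    }

if __name__ == "__main__":
    for name, scores in implied_baselines(AC_CAPS, MARGINS).items():
        print(f"{name}: AUC {scores['auc']}, ACC {scores['acc']}%")
```

Under this reading, HOCCNNLB is the only baseline within a few points of AC-Caps, while the gaps to iDeepS and DeepBind are much larger.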
Affiliation(s)
- Jinmiao Song
- School of Information Science and Engineering, Xinjiang University, Urumqi, 830008, China
- Dalian Key Lab of Digital Technology for National Culture, Dalian Minzu University, Dalian, 116600, China
| | - Shengwei Tian
- School of Software, Xinjiang University, Urumqi, 830046, China.
| | - Long Yu
- Network Center, Xinjiang University, Urumqi, 830046, China
| | - Yan Xing
- Imaging Center, Xinjiang Medical University Affiliated First Hospital, Urumqi, 830011, China.
| | - Qimeng Yang
- School of Information Science and Engineering, Xinjiang University, Urumqi, 830008, China
| | - Xiaodong Duan
- Dalian Key Lab of Digital Technology for National Culture, Dalian Minzu University, Dalian, 116600, China
| | - Qiguo Dai
- Dalian Key Lab of Digital Technology for National Culture, Dalian Minzu University, Dalian, 116600, China
|