1
Du C, Fan W, Zhou Y. Integrated Biochemical and Computational Methods for Deciphering RNA-Processing Codes. Wiley Interdisciplinary Reviews. RNA 2024; 15:e1875. [PMID: 39523464; DOI: 10.1002/wrna.1875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2024] [Revised: 09/23/2024] [Accepted: 10/21/2024] [Indexed: 11/16/2024]
Abstract
RNA processing involves steps such as capping, splicing, polyadenylation, modification, and nuclear export. These steps are essential for transforming genetic information in DNA into proteins and contribute to RNA diversity and complexity. Many biochemical methods have been developed to profile and quantify RNAs, as well as to identify the interactions between RNAs and RNA-binding proteins (RBPs), especially when coupled with high-throughput sequencing technologies. With the rapid accumulation of diverse data, it is crucial to develop computational methods to convert the big data into biological knowledge. In particular, machine learning and deep learning models are commonly utilized to learn the rules or codes governing the transformation from DNA sequences to intriguing RNAs based on manually designed or automatically extracted features. When precise enough, the RNA codes can be incredibly useful for predicting RNA products, decoding the molecular mechanisms, forecasting the impact of disease variants on RNA processing events, and identifying driver mutations. In this review, we systematically summarize the biochemical and computational methods for deciphering five important RNA codes related to alternative splicing, alternative polyadenylation, RNA localization, RNA modifications, and RBP binding. For each code, we review the main types of experimental methods used to generate training data, as well as the key features, strategic model structures, and advantages of representative tools. We also discuss the challenges encountered in developing predictive models using large language models and extensive domain knowledge. Additionally, we highlight useful resources and propose ways to improve computational tools for studying RNA codes.
Affiliation(s)
- Chen Du
  - College of Life Sciences, TaiKang Center for Life and Medical Sciences, RNA Institute, Wuhan University, Wuhan, China
- Weiliang Fan
  - College of Life Sciences, TaiKang Center for Life and Medical Sciences, RNA Institute, Wuhan University, Wuhan, China
- Yu Zhou
  - College of Life Sciences, TaiKang Center for Life and Medical Sciences, RNA Institute, Wuhan University, Wuhan, China
  - Frontier Science Center for Immunology and Metabolism, Wuhan University, Wuhan, China
  - State Key Laboratory of Virology, Wuhan University, Wuhan, China
2
Yang Y, Li G, Pang K, Cao W, Zhang Z, Li X. Deciphering 3'UTR Mediated Gene Regulation Using Interpretable Deep Representation Learning. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2024; 11:e2407013. [PMID: 39159140; PMCID: PMC11497048; DOI: 10.1002/advs.202407013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2024] [Revised: 07/23/2024] [Indexed: 08/21/2024]
Abstract
The 3' untranslated regions (3'UTRs) of messenger RNAs contain many important cis-regulatory elements that are under functional and evolutionary constraints. It is hypothesized that these constraints are similar to grammars and syntaxes in human languages and can be modeled by advanced natural language techniques such as Transformers, which have been very effective in modeling complex protein sequences and structures. Here 3UTRBERT is described, which implements an attention-based language model, i.e., Bidirectional Encoder Representations from Transformers (BERT). 3UTRBERT is pre-trained on aggregated 3'UTR sequences of human mRNAs in a task-agnostic manner; the pre-trained model is then fine-tuned for specific downstream tasks such as identifying RBP binding sites and m6A RNA modification sites and predicting RNA sub-cellular localizations. Benchmark results show that 3UTRBERT generally outperformed other contemporary methods in each of these tasks. More importantly, the self-attention mechanism within 3UTRBERT allows direct visualization of the semantic relationship between sequence elements and effectively identifies regions with important regulatory potential. It is expected that the 3UTRBERT model can serve as a foundational tool for analyzing various sequence labeling tasks within the 3'UTR field, thus enhancing the decipherability of post-transcriptional regulatory mechanisms.
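As an illustration of how a nucleotide sequence might be turned into "words" for a BERT-style language model, the sketch below splits a 3'UTR into overlapping k-mers. The abstract does not state the tokenization scheme, so the k-mer approach, the function name `kmer_tokens`, and the default `k = 3` are assumptions made here for illustration only.

```python
def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1).

    Hypothetical helper: the tokenization and the value of k are
    illustrative assumptions, not taken from the abstract.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = seq.upper().replace("T", "U")  # treat DNA-style input as RNA
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


tokens = kmer_tokens("AUGGCUA", k=3)
print(tokens)  # ['AUG', 'UGG', 'GGC', 'GCU', 'CUA']
```

Overlapping tokens keep local context around every position, at the cost of a token sequence almost as long as the input.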
Affiliation(s)
- Yuning Yang
  - School of Information Science and Technology, Northeast Normal University, Changchun, Jilin 130117, China
- Gen Li
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
- Kuan Pang
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
- Wuxinhao Cao
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
- Zhaolei Zhang
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 3E1, Canada
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Changchun, Jilin 130012, China
3
Zuo Y, Zhang B, He W, Bi Y, Liu X, Zeng X, Deng Z. MSlocPRED: deep transfer learning-based identification of multi-label mRNA subcellular localization. Brief Bioinform 2024; 25:bbae504. [PMID: 39401145; PMCID: PMC11472759; DOI: 10.1093/bib/bbae504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2024] [Revised: 08/19/2024] [Accepted: 09/30/2024] [Indexed: 10/17/2024]
Abstract
Subcellular localization of messenger ribonucleic acid (mRNA) is a universal mechanism for precise and efficient control of the translation process. Although many computational methods have been developed for predicting mRNA subcellular localization, very few are designed to handle multiple localization annotations, and their generalization performance could be improved. In this study, the prediction model MSlocPRED was constructed to identify multi-label mRNA subcellular localization. First, the preprocessed Dataset 1 and Dataset 2 are transformed into images. The proposed MDNDO-SMDU resampling technique is then used to balance the number of samples in each category of the training dataset. Finally, deep transfer learning is used to construct the predictive model MSlocPRED, which identifies subcellular localization for 16 classes (Dataset 1) and 18 classes (Dataset 2). Comparative tests of different resampling techniques show that the technique proposed in this study is more effective as a preprocessing step for subcellular localization prediction. Prediction results on datasets constructed with different NC end lengths (the NC ends are the 5' and 3' untranslated regions that flank the protein-coding sequence and influence mRNA function without encoding proteins themselves) show that, for both Dataset 1 and Dataset 2, performance is best when the NC end length is 35 nucleotides. Both independent testing and five-fold cross-validation comparisons with established prediction tools show that MSlocPRED is significantly better than those tools at identifying multi-label mRNA subcellular localization. Additionally, SHapley Additive exPlanations was used to explain how the MSlocPRED model arrives at its predictions.
The predictive model and associated datasets are available on GitHub: https://github.com/ZBYnb1/MSlocPRED/tree/main.
Affiliation(s)
- Yun Zuo
  - School of Artificial Intelligence and Computer Science, Jiangnan University, No. 1800 Lihu Avenue, Binhu District, Wuxi 214000, China
- Bangyi Zhang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, No. 1800 Lihu Avenue, Binhu District, Wuxi 214000, China
- Wenying He
  - School of Artificial Intelligence, Hebei University of Technology, 5340 Xiping Road, Beichen District, Tianjin 300130, China
- Yue Bi
  - Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Wellington Rd, Clayton VIC 3800, Australia
- Xiangrong Liu
  - Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, 422 Siming South Road, Siming District, Xiamen City, Fujian 361005, China
- Xiangxiang Zeng
  - School of Information Science and Engineering, Hunan University, Yuelu District, Changsha 410012, China
- Zhaohong Deng
  - School of Artificial Intelligence and Computer Science, Jiangnan University, No. 1800 Lihu Avenue, Binhu District, Wuxi 214000, China
4
Wang X, Yang L, Wang R. DRpred: A Novel Deep Learning-Based Predictor for Multi-Label mRNA Subcellular Localization Prediction by Incorporating Bayesian Inferred Prior Label Relationships. Biomolecules 2024; 14:1067. [PMID: 39334834; PMCID: PMC11430783; DOI: 10.3390/biom14091067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2024] [Revised: 08/23/2024] [Accepted: 08/26/2024] [Indexed: 09/30/2024]
Abstract
The subcellular localization of messenger RNA (mRNA) not only helps us to understand the localization regulation of gene expression but also helps to understand the relationship between RNA localization pattern and human disease mechanism, which has profound biological and medical significance. Several predictors have been proposed for predicting the subcellular localization of mRNA. However, there is still considerable room for improvement in their predictive performance, especially regarding multi-label prediction. This study proposes a novel multi-label predictor, DRpred, for mRNA subcellular localization prediction. This predictor first utilizes Bayesian networks to capture the dependencies among labels. Subsequently, it combines these dependencies with features extracted from mRNA sequences using Word2vec, forming the input for the predictor. Finally, it employs a neural network combining BiLSTM and an attention mechanism to capture the internal relationships of the input features for mRNA subcellular localization. The experimental validation on an independent test set demonstrated that DRpred obtained a competitive predictive performance in multi-label prediction and outperformed state-of-the-art predictors in predicting single subcellular localizations, obtaining accuracies of 82.14%, 93.02%, 80.37%, 94.00%, 90.58%, 84.53%, 82.01%, 79.71%, and 85.67% for the chromatin, cytoplasm, cytosol, exosome, membrane, nucleolus, nucleoplasm, nucleus, and ribosome, respectively. It is anticipated to offer profound insights for biological and medical research.
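To make the idea of "dependencies among labels" concrete, the sketch below estimates pairwise conditional probabilities P(b | a) from a toy set of multi-label annotations. This is a much simpler stand-in than the Bayesian network the abstract describes, and the function name `conditional_label_probs` and the toy data are invented here for illustration.

```python
from itertools import permutations


def conditional_label_probs(label_sets, labels):
    """Estimate P(b present | a present) for every ordered label pair.

    A plain co-occurrence estimate, used here only as a simplified
    stand-in for a learned Bayesian network over labels.
    """
    probs = {}
    for a, b in permutations(labels, 2):
        with_a = [s for s in label_sets if a in s]
        if with_a:  # skip labels that never occur
            probs[(a, b)] = sum(b in s for s in with_a) / len(with_a)
    return probs


# Hypothetical annotations: each set lists the compartments of one mRNA.
toy = [
    {"nucleus", "cytosol"},
    {"nucleus"},
    {"cytosol", "ribosome"},
    {"nucleus", "cytosol"},
]
probs = conditional_label_probs(toy, ["nucleus", "cytosol", "ribosome"])
print(round(probs[("nucleus", "cytosol")], 3))  # 0.667
```

Such conditional estimates could, for instance, be appended to sequence-derived features as a prior, although how DRpred combines the two is not specified beyond the abstract's description.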
Affiliation(s)
- Xiao Wang
  - School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450000, China
  - Henan Provincial Key Laboratory of Data Intelligence for Food Safety, Zhengzhou University of Light Industry, Zhengzhou 450000, China
- Lixiang Yang
  - School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450000, China
- Rong Wang
  - School of Electronic Information, Zhengzhou University of Light Industry, Zhengzhou 450000, China
5
Li F, Bi Y, Guo X, Tan X, Wang C, Pan S. Advancing mRNA subcellular localization prediction with graph neural network and RNA structure. Bioinformatics 2024; 40:btae504. [PMID: 39133151; PMCID: PMC11361792; DOI: 10.1093/bioinformatics/btae504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Revised: 08/06/2024] [Accepted: 08/09/2024] [Indexed: 08/13/2024]
Abstract
MOTIVATION The asymmetrical distribution of expressed mRNAs tightly controls the precise synthesis of proteins within human cells. This non-uniform distribution, a cornerstone of developmental biology, plays a pivotal role in numerous cellular processes. To advance our comprehension of gene regulatory networks, it is essential to develop computational tools for accurately identifying the subcellular localizations of mRNAs. However, existing approaches give only limited consideration to multi-localization phenomena, and none considers the influence of RNA secondary structure. RESULTS In this study, we propose Allocator, a multi-view parallel deep learning framework that integrates RNA sequence-level and structure-level information to enhance the prediction of mRNA multi-localization. Allocator is equipped with four efficient feature extractors, each designed to handle a different input. Two are tailored for sequence-based inputs, incorporating multilayer perceptron and multi-head self-attention mechanisms. The other two are specialized in processing structure-based inputs, employing graph neural networks. Benchmarking results underscore Allocator's superiority over state-of-the-art methods, showcasing its strength in revealing intricate localization associations. AVAILABILITY AND IMPLEMENTATION The webserver of Allocator is available at http://Allocator.unimelb-biotools.cloud.edu.au; the source code and datasets are available on GitHub (https://github.com/lifuyi774/Allocator) and Zenodo (https://doi.org/10.5281/zenodo.13235798).
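A graph neural network needs the secondary structure expressed as nodes and edges. The sketch below shows one common way to do that, assuming the structure is given in dot-bracket notation: each nucleotide becomes a node, consecutive nucleotides are joined by backbone edges, and matched brackets become base-pair edges. The input format and the function name `dot_bracket_to_edges` are assumptions for illustration; the abstract does not specify how Allocator builds its graphs.

```python
def dot_bracket_to_edges(structure: str):
    """Return (backbone_edges, pair_edges) for a dot-bracket string.

    Nodes are 0-based nucleotide positions. Only '(', ')' and '.' are
    supported in this sketch (no pseudoknot brackets).
    """
    backbone = [(i, i + 1) for i in range(len(structure) - 1)]
    pairs, stack = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)  # remember the opening position
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))  # pair with most recent '('
        elif ch != ".":
            raise ValueError(f"unsupported character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return backbone, sorted(pairs)


backbone, pairs = dot_bracket_to_edges("((..))")
print(pairs)  # [(0, 5), (1, 4)]
```

The two edge lists can be kept separate, as here, so that a downstream model may treat backbone and base-pair edges as different edge types.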
Affiliation(s)
- Fuyi Li
  - College of Information Engineering, Northwest A&F University, Yangling 712100, China
  - South Australian immunoGENomics Cancer Institute (SAiGENCI), The University of Adelaide, Adelaide, SA 5005, Australia
- Yue Bi
  - Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Xudong Guo
  - College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Xiaolan Tan
  - Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Cong Wang
  - College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Shirui Pan
  - Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
  - School of Information and Communication Technology, Griffith University, Gold Coast, QLD 4222, Australia
6
Chen Y, Du Z, Ren X, Pan C, Zhu Y, Li Z, Meng T, Yao X. mRNA-CLA: An interpretable deep learning approach for predicting mRNA subcellular localization. Methods 2024; 227:17-26. [PMID: 38705502; DOI: 10.1016/j.ymeth.2024.04.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Revised: 03/30/2024] [Accepted: 04/28/2024] [Indexed: 05/07/2024]
Abstract
Messenger RNA (mRNA) is vital for post-transcriptional gene regulation, acting as the direct template for protein synthesis. However, the methods available for predicting mRNA subcellular localization need to be improved and enhanced. Notably, few existing algorithms can annotate mRNA sequences with multiple localizations. In this work, we propose the mRNA-CLA, an innovative multi-label subcellular localization prediction framework for mRNA, leveraging a deep learning approach with a multi-head self-attention mechanism. The framework employs a multi-scale convolutional layer to extract sequence features across different regions and uses a self-attention mechanism explicitly designed for each sequence. Paired with Position Weight Matrices (PWMs) derived from the convolutional neural network layers, our model offers interpretability in the analysis. In particular, we perform a base-level analysis of mRNA sequences from diverse subcellular localizations to determine the nucleotide specificity corresponding to each site. Our evaluations demonstrate that the mRNA-CLA model substantially outperforms existing methods and tools.
Affiliation(s)
- Yifan Chen
  - Institute of Artificial Intelligence Application, College of Computer and Information Engineering, Central South University of Forestry and Technology, Changsha, Hunan 410004, China
- Zhenya Du
  - Guangzhou Xinhua University, 510520, Guangzhou, China
- Xuanbai Ren
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, China
- Chu Pan
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, China
- Yangbin Zhu
  - Manufacturing and Electronic Engineering, Wenzhou University of Technology, 325027, Wenzhou, China
- Zhen Li
  - Institute of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Tao Meng
  - Institute of Artificial Intelligence Application, College of Computer and Information Engineering, Central South University of Forestry and Technology, Changsha, Hunan 410004, China
- Xiaojun Yao
  - Faculty of Applied Sciences, Macao Polytechnic University, 999078, Macao
7
Wang X, Yang L, Wang R. mRCat: A Novel CatBoost Predictor for the Binary Classification of mRNA Subcellular Localization by Fusing Large Language Model Representation and Sequence Features. Biomolecules 2024; 14:767. [PMID: 39062481; PMCID: PMC11274395; DOI: 10.3390/biom14070767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2024] [Revised: 06/23/2024] [Accepted: 06/25/2024] [Indexed: 07/28/2024]
Abstract
The subcellular localization of messenger RNAs (mRNAs) is a pivotal aspect of biomolecules, tightly linked to gene regulation and protein synthesis, and offers innovative insights into disease diagnosis and drug development in the field of biomedicine. Several computational methods have been proposed to predict the subcellular localization of mRNAs within cells. However, there remains a deficiency in the accuracy of these predictions. In this study, we propose an mRCat predictor based on the gradient boosting tree algorithm specifically to predict whether mRNAs are localized in the nucleus or in the cytoplasm. This predictor firstly uses large language models to thoroughly explore hidden information within sequences and then integrates traditional sequence features to collectively characterize mRNA gene sequences. Finally, it employs CatBoost as the base classifier for predicting the subcellular localization of mRNAs. The experimental validation on an independent test set demonstrates that mRCat obtained accuracy of 0.761, F1 score of 0.710, MCC of 0.511, and AUROC of 0.751. The results indicate that our method has higher accuracy and robustness compared to other state-of-the-art methods. It is anticipated to offer deep insights for biomolecular research.
Affiliation(s)
- Xiao Wang
  - School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450002, China
  - Henan Provincial Key Laboratory of Data Intelligence for Food Safety, Zhengzhou University of Light Industry, Zhengzhou 450002, China
- Lixiang Yang
  - School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450002, China
- Rong Wang
  - School of Electronic Information, Zhengzhou University of Light Industry, Zhengzhou 450002, China
8
Yan Y, Li W, Wang S, Huang T. Seq-RBPPred: Predicting RNA-Binding Proteins from Sequence. ACS Omega 2024; 9:12734-12742. [PMID: 38524500; PMCID: PMC10955590; DOI: 10.1021/acsomega.3c08381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 12/18/2023] [Accepted: 12/28/2023] [Indexed: 03/26/2024]
Abstract
RNA-binding proteins (RBPs) can interact with RNAs to regulate RNA translation, modification, splicing, and other important biological processes. The accurate identification of RBPs is of paramount importance for gaining insights into the intricate mechanisms underlying organismal life activities. Traditional experimental methods for identifying RBPs require a lot of time and money, so it is important to develop computational methods to predict RBPs. However, the existing approaches for RBP prediction still require further improvement because RBPs remain unidentified in many species. In this study, we present Seq-RBPPred (predicting RBPs from sequence), a novel method that utilizes a comprehensive feature representation encompassing both biophysical properties and hidden-state features derived from protein sequences. Comprehensive performance evaluations demonstrate the superiority of Seq-RBPPred compared with state-of-the-art methods: on the testing set it achieves an overall accuracy of 0.922, a sensitivity of 0.926, a specificity of 0.903, and a Matthews correlation coefficient (MCC) of 0.757. The data and code of Seq-RBPPred are available at https://github.com/yaoyao-11/Seq-RBPPred.
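For readers less familiar with the four reported metrics, the sketch below computes accuracy, sensitivity, specificity, and MCC from the cells of a binary confusion matrix using their standard definitions. The function name `binary_metrics` and the example counts are hypothetical; they are not the confusion matrix behind the figures quoted above.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes at least one positive and one negative example, so that
    sensitivity and specificity are defined.
    """
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
        # MCC is conventionally set to 0 when the denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts chosen only to exercise the formulas.
m = binary_metrics(tp=40, fp=5, tn=45, fn=10)
print({name: round(value, 3) for name, value in m.items()})
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it more informative when the classes are imbalanced.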
Affiliation(s)
- Yuyao Yan
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Wenran Li
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Sijia Wang
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Tao Huang
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
9
Lim D, Baek C, Blanchette M. Graphylo: A deep learning approach for predicting regulatory DNA and RNA sites from whole-genome multiple alignments. iScience 2024; 27:109002. [PMID: 38362268; PMCID: PMC10867641; DOI: 10.1016/j.isci.2024.109002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2023] [Revised: 12/17/2023] [Accepted: 01/19/2024] [Indexed: 02/17/2024]
Abstract
This study focuses on enhancing the prediction of regulatory functional sites in DNA and RNA sequences, a crucial aspect of gene regulation. Current methods, such as motif overrepresentation and machine learning, often lack specificity. To address this issue, the study leverages evolutionary information and introduces Graphylo, a deep-learning approach for predicting transcription factor binding sites in the human genome. Graphylo combines Convolutional Neural Networks for DNA sequences with Graph Convolutional Networks on phylogenetic trees, using information from placental mammals' genomes and evolutionary history. The research demonstrates that Graphylo consistently outperforms both single-species deep learning techniques and methods that incorporate inter-species conservation scores on a wide range of datasets. It achieves this by utilizing a species-based attention model for evolutionary insights and an integrated gradient approach for nucleotide-level model interpretability. This innovative approach offers a promising avenue for improving the accuracy of regulatory site prediction in genomics.
10
Musleh S, Arif M, Alajez NM, Alam T. Unified mRNA Subcellular Localization Predictor based on machine learning techniques. BMC Genomics 2024; 25:151. [PMID: 38326777; PMCID: PMC10848524; DOI: 10.1186/s12864-024-10077-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Accepted: 02/01/2024] [Indexed: 02/09/2024]
Abstract
BACKGROUND mRNA subcellular localization has a substantial impact on the regulation of gene expression, cellular migration, and adaptation. However, the methods employed for experimental determination of this localization are arduous, time-intensive, and costly. METHODS In this research article, we tackle the essential challenge of predicting the subcellular location of messenger RNAs (mRNAs) through the Unified mRNA Subcellular Localization Predictor (UMSLP), a machine learning (ML) based approach. We embrace an in silico strategy that incorporates four distinct feature sets: k-mer, pseudo k-tuple nucleotide composition, nucleotide physicochemical attributes, and the 3D sequence depiction achieved via Z-curve transformation, for predicting subcellular localization in a benchmark dataset across five distinct subcellular locales: nucleus, cytoplasm, extracellular region (ExR), mitochondria, and endoplasmic reticulum (ER). RESULTS The proposed ML model UMSLP attains cutting-edge outcomes in predicting mRNA subcellular localization. On an independent testing dataset, UMSLP achieved over 87% precision, 94% specificity, and 94% accuracy. Compared to other existing tools, UMSLP outperformed mRNALocator, mRNALoc, and SubLocEP by 11%, 21%, and 32%, respectively, in average prediction accuracy across all five locales. SHapley Additive exPlanations analysis highlights the dominance of k-mer features in predicting cytoplasm, nucleus, ER, and ExR localizations, while Z-curve based features play pivotal roles in mitochondria subcellular localization detection. AVAILABILITY We have shared datasets, code, and a Docker API for users on GitHub at: https://github.com/smusleh/UMSLP.
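Two of the four feature sets named above are easy to show in a few lines. The sketch below computes normalized k-mer frequencies and the cumulative Z-curve coordinates of an RNA sequence, using the standard Z-curve definition (x: purine vs. pyrimidine, y: amino vs. keto, z: weak vs. strong hydrogen bonding). The function names, the default `k = 2`, and the handling of ambiguous bases are assumptions made here; UMSLP's exact feature construction is not given in the abstract.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq: str, k: int = 2) -> dict[str, float]:
    """Relative frequency of every possible k-mer over the ACGU alphabet."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # k-mers containing ambiguous bases (e.g. N) are left out of the total.
    total = sum(counts[m] for m in kmers) or 1
    return {m: counts[m] / total for m in kmers}


def z_curve(seq: str) -> list[tuple[int, int, int]]:
    """Cumulative Z-curve coordinates, one (x, y, z) point per nucleotide."""
    x = y = z = 0
    coords = []
    for base in seq.upper().replace("T", "U"):
        if base not in "ACGU":
            continue  # skip ambiguous bases in this sketch
        x += 1 if base in "AG" else -1  # purine vs. pyrimidine
        y += 1 if base in "AC" else -1  # amino vs. keto
        z += 1 if base in "AU" else -1  # weak vs. strong H-bonding
        coords.append((x, y, z))
    return coords


print(z_curve("ACGU"))  # [(1, 1, 1), (0, 2, 0), (1, 1, -1), (0, 0, 0)]
print(kmer_frequencies("ACGU", k=1)["A"])  # 0.25
```

In practice the variable-length Z-curve trajectory would still need to be summarized into a fixed-length vector before being passed to a classifier; that step is left out here.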
Affiliation(s)
- Saleh Musleh
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Muhammad Arif
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Nehad M Alajez
  - Translational Cancer and Immunity Center (TCIC), Qatar Biomedical Research Institute (QBRI), Hamad Bin Khalifa University, Doha, Qatar
  - College of Health and Life Sciences, Hamad Bin Khalifa University, Doha, Qatar
- Tanvir Alam
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
11
Choudhury S, Bajiya N, Patiyal S, Raghava GPS. MRSLpred-a hybrid approach for predicting multi-label subcellular localization of mRNA at the genome scale. Frontiers in Bioinformatics 2024; 4:1341479. [PMID: 38379813; PMCID: PMC10877048; DOI: 10.3389/fbinf.2024.1341479] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Accepted: 01/15/2024] [Indexed: 02/22/2024]
Abstract
In the past, several methods have been developed for predicting the single-label subcellular localization of messenger RNA (mRNA). However, only a limited number of methods are designed to predict the multi-label subcellular localization of mRNA. Furthermore, the existing methods are slow and cannot be applied at a transcriptome scale. In this study, a fast and reliable method has been developed for predicting the multi-label subcellular localization of mRNA that can be applied at a genome scale. Machine learning-based methods have been developed using mRNA sequence composition, where the XGBoost-based classifier achieved an average area under the receiver operating characteristic curve (AUROC) of 0.709 (0.668-0.732). In addition to alignment-free methods, we developed alignment-based methods using motif search techniques. Finally, a hybrid technique that combines the XGBoost model and the motif-based approach has been developed, achieving an average AUROC of 0.742 (0.708-0.816). Our method, MRSLpred, outperforms the existing state-of-the-art classifier in terms of performance and computational efficiency. A publicly accessible webserver and a standalone tool have been developed for researchers (webserver: https://webs.iiitd.edu.in/raghava/mrslpred/).
Affiliation(s)
- Gajendra P. S. Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
12
Wang J, Horlacher M, Cheng L, Winther O. DeepLocRNA: an interpretable deep learning model for predicting RNA subcellular localization with domain-specific transfer-learning. Bioinformatics 2024; 40:btae065. [PMID: 38317052; PMCID: PMC10879750; DOI: 10.1093/bioinformatics/btae065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2023] [Revised: 01/22/2024] [Accepted: 02/01/2024] [Indexed: 02/07/2024]
Abstract
MOTIVATION Accurate prediction of RNA subcellular localization plays an important role in understanding cellular processes and functions. Although post-transcriptional processes are governed by trans-acting RNA binding proteins (RBPs) through interaction with cis-regulatory RNA motifs, current methods do not incorporate RBP-binding information. RESULTS In this article, we propose DeepLocRNA, an interpretable deep-learning model that leverages a pre-trained multi-task RBP-binding prediction model to predict the subcellular localization of RNA molecules via fine-tuning. We constructed DeepLocRNA using a comprehensive dataset with variant RNA types and evaluated it on the held-out dataset. Our model achieved state-of-the-art performance in predicting RNA subcellular localization in mRNA and miRNA. It has also demonstrated great generalization capabilities, performing well on both human and mouse RNA. Additionally, a motif analysis was performed to enhance the interpretability of the model, highlighting signal factors that contributed to the predictions. The proposed model provides general and powerful prediction abilities for different RNA types and species, offering valuable insights into the localization patterns of RNA molecules and contributing to our understanding of cellular processes at the molecular level. A user-friendly web server is available at: https://biolib.com/KU/DeepLocRNA/.
Affiliation(s)
- Jun Wang
  - Bioinformatics Centre, Department of Biology, University of Copenhagen, København Ø 2100, Denmark
- Marc Horlacher
  - Computational Health Center, Helmholtz Center Munich, Neuherberg 85764, Germany
- Lixin Cheng
  - Shenzhen People’s Hospital, First Affiliated Hospital of Southern University of Science and Technology, Second Clinical Medicine College of Jinan University, Shenzhen 518020, China
- Ole Winther
  - Bioinformatics Centre, Department of Biology, University of Copenhagen, København Ø 2100, Denmark
  - Center for Genomic Medicine, Rigshospitalet (Copenhagen University Hospital), Copenhagen 2100, Denmark
  - Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby 2800, Denmark
13
|
Ntini E, Budach S, Vang Ørom UA, Marsico A. Genome-wide measurement of RNA dissociation from chromatin classifies transcripts by their dynamics and reveals rapid dissociation of enhancer lncRNAs. Cell Syst 2023; 14:906-922.e6. [PMID: 37857083 DOI: 10.1016/j.cels.2023.09.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2022] [Revised: 05/24/2023] [Accepted: 09/20/2023] [Indexed: 10/21/2023]
Abstract
Long non-coding RNAs (lncRNAs) are involved in gene expression regulation in cis. Although enriched in the cell chromatin fraction, to what degree this defines their regulatory potential remains unclear. Furthermore, the factors underlying lncRNA chromatin tethering, as well as the molecular basis of efficient lncRNA chromatin dissociation and its impact on enhancer activity and target gene expression, remain to be resolved. Here, we developed chrTT-seq, which combines the pulse-chase metabolic labeling of nascent RNA with chromatin fractionation and transient transcriptome sequencing to follow nascent RNA transcripts from their transcription on chromatin to release and allows the quantification of dissociation dynamics. By incorporating genomic, transcriptomic, and epigenetic metrics, as well as RNA-binding protein propensities, in machine learning models, we identify features that define transcript groups of different chromatin dissociation dynamics. Notably, lncRNAs transcribed from enhancers display reduced chromatin retention, suggesting that, in addition to splicing, their chromatin dissociation may shape enhancer activity.
Affiliation(s)
- Evgenia Ntini
  - Max-Planck Institute for Molecular Genetics, 14195 Berlin, Germany; Freie Universität Berlin, 14195 Berlin, Germany; Institute of Molecular Biology and Biotechnology, IMBB-FORTH, 70013 Heraklio, Greece.
- Stefan Budach
  - Max-Planck Institute for Molecular Genetics, 14195 Berlin, Germany; Freie Universität Berlin, 14195 Berlin, Germany
- Ulf A Vang Ørom
  - Aarhus University, Department of Molecular Biology and Genetics, 8000 Aarhus, Denmark
- Annalisa Marsico
  - Max-Planck Institute for Molecular Genetics, 14195 Berlin, Germany; Freie Universität Berlin, 14195 Berlin, Germany; Computational Health Center, Helmholtz Center Munich, Munich, Germany.
|
14
|
Wang J, Horlacher M, Cheng L, Winther O. RNA trafficking and subcellular localization-a review of mechanisms, experimental and predictive methodologies. Brief Bioinform 2023; 24:bbad249. [PMID: 37466130 PMCID: PMC10516376 DOI: 10.1093/bib/bbad249] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 03/17/2023] [Revised: 05/30/2023] [Accepted: 06/16/2023] [Indexed: 07/20/2023] Open
Abstract
RNA localization is essential for regulating spatial translation, where RNAs are trafficked to their target locations via various biological mechanisms. In this review, we discuss RNA localization in the context of molecular mechanisms, experimental techniques and machine learning-based prediction tools. Three main types of molecular mechanisms that control the localization of RNA to distinct cellular compartments are reviewed, including directed transport, protection from mRNA degradation, as well as diffusion and local entrapment. Advances in experimental methods, both image and sequence based, provide substantial data resources, which allow for the design of powerful machine learning models to predict RNA localizations. We review the publicly available predictive tools to serve as a guide for users and inspire developers to build more effective prediction models. Finally, we provide an overview of multimodal learning, which may provide a new avenue for the prediction of RNA localization.
Affiliation(s)
- Jun Wang
  - Bioinformatics Centre, Department of Biology, University of Copenhagen, København Ø 2100, Denmark
- Marc Horlacher
  - Computational Health Center, Helmholtz Center, Munich, Germany
- Lixin Cheng
  - Shenzhen People’s Hospital, First Affiliated Hospital of Southern University of Science and Technology, Second Clinical Medicine College of Jinan University, Shenzhen 518020, China
- Ole Winther
  - Bioinformatics Centre, Department of Biology, University of Copenhagen, København Ø 2100, Denmark
  - Center for Genomic Medicine, Rigshospitalet (Copenhagen University Hospital), Copenhagen 2100, Denmark
  - Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby 2800, Denmark
|
15
|
Babaiha NS, Aghdam R, Ghiam S, Eslahchi C. NN-RNALoc: Neural network-based model for prediction of mRNA sub-cellular localization using distance-based sub-sequence profiles. PLoS One 2023; 18:e0258793. [PMID: 37708177 PMCID: PMC10501558 DOI: 10.1371/journal.pone.0258793] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2021] [Accepted: 05/12/2023] [Indexed: 09/16/2023] Open
Abstract
The localization of messenger RNAs (mRNAs) is a frequently observed phenomenon and a crucial aspect of gene expression regulation. It is also a mechanism for targeting proteins to a specific cellular region. Moreover, prior studies have shown the significance of intracellular RNA positioning during embryonic and neural dendrite formation. Incorrect RNA localization, which can be caused by a variety of factors, such as mutations in trans-regulatory elements, has been linked to the development of certain neuromuscular diseases and cancer. In this study, we introduce NN-RNALoc, a neural network-based method for predicting the cellular location of mRNA using novel features extracted from mRNA sequence data and protein interaction patterns. Specifically, we developed a distance-based subsequence profile for RNA sequence representation that is more memory- and time-efficient than the well-known k-mer sequence representation. Combining protein-protein interaction data, which is essential for numerous biological processes, with our novel distance-based subsequence profiles of mRNA sequences produces more accurate features. On two benchmark datasets, CeFra-Seq and RNALocate, the performance of NN-RNALoc is compared to powerful predictive models proposed in previous works (mRNALoc, RNATracker, mLoc-mRNA, DM3Loc, iLoc-mRNA, and EL-RMLocNet) and to a baseline neural network (DNN5-mer). Compared to the previous methods, NN-RNALoc significantly reduces computation time and also outperforms them in terms of accuracy. This study's source code and datasets are freely accessible at https://github.com/NeginBabaiha/NN-RNALoc.
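For orientation, the k-mer representation that the abstract uses as its point of comparison can be sketched as below. This is only a minimal illustration of the standard k-mer frequency vector, whose length grows as 4^k; it is not the distance-based subsequence profile of NN-RNALoc, whose definition is not given in the abstract. The function name `kmer_profile` is a hypothetical helper chosen for this sketch.

```python
from itertools import product


def kmer_profile(seq: str, k: int = 3) -> list[float]:
    """Return normalized k-mer frequencies in lexicographic A, C, G, U order.

    Hypothetical helper: the vector has 4**k entries, which is why the
    memory cost of k-mer features grows quickly with k.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


profile = kmer_profile("AUGGCUAUGG", k=2)
print(len(profile))  # 16 features for k = 2
```

With k = 5, as in the DNN5-mer baseline named above, the same function would return 1024 features per sequence.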
Affiliation(s)
- Negin Sadat Babaiha
  - Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran
  - Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, Germany
- Rosa Aghdam
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
  - Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, United States of America
- Shokoofeh Ghiam
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- Changiz Eslahchi
  - Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
|
16
|
Choi SR, Lee M. Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review. BIOLOGY 2023; 12:1033. [PMID: 37508462 PMCID: PMC10376273 DOI: 10.3390/biology12071033] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 06/20/2023] [Revised: 07/18/2023] [Accepted: 07/21/2023] [Indexed: 07/30/2023]
Abstract
The emergence and rapid development of deep learning, specifically transformer-based architectures and attention mechanisms, have had transformative implications across several domains, including bioinformatics and genome data analysis. The analogous nature of genome sequences to language texts has enabled the application of techniques that have exhibited success in fields ranging from natural language processing to genomic data. This review provides a comprehensive analysis of the most recent advancements in the application of transformer architectures and attention mechanisms to genome and transcriptome data. The focus of this review is on the critical evaluation of these techniques, discussing their advantages and limitations in the context of genome data analysis. With the swift pace of development in deep learning methodologies, it becomes vital to continually assess and reflect on the current standing and future direction of the research. Therefore, this review aims to serve as a timely resource for both seasoned researchers and newcomers, offering a panoramic view of the recent advancements and elucidating the state-of-the-art applications in the field. Furthermore, this review paper serves to highlight potential areas of future investigation by critically evaluating studies from 2019 to 2023, thereby acting as a stepping-stone for further research endeavors.
Affiliation(s)
- Minhyeok Lee
  - School of Electrical and Electronics Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
|
17
|
Li J, Zou Q, Yuan L. A review from biological mapping to computation-based subcellular localization. MOLECULAR THERAPY. NUCLEIC ACIDS 2023; 32:507-521. [PMID: 37215152 PMCID: PMC10192651 DOI: 10.1016/j.omtn.2023.04.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2023]
Abstract
Subcellular localization is crucial to the study of viruses and diseases. Specifically, research on protein subcellular localization can help identify clues between viruses and host cells that can aid in the design of targeted drugs. Research on RNA subcellular localization is significant for human diseases (such as Alzheimer's disease and colon cancer). To date, the available reviews of protein subcellular localization are outdated, and reviews of RNA subcellular localization are not comprehensive. Therefore, we collated the most up-to-date literature on protein and RNA subcellular localization to help researchers understand changes in the field. Extensive and complete methods for constructing subcellular localization models have also been summarized, which can help readers understand the changes in the application of biotechnology and computer science in subcellular localization research and explore how to use biological data to construct improved subcellular localization models. This paper is the first review to cover both protein subcellular localization and RNA subcellular localization. We urge researchers from biology and computational biology to jointly pay attention to transformation patterns, interrelationships, differences, and causality of protein subcellular localization and RNA subcellular localization.
Affiliation(s)
- Jing Li
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, 1 Chengdian Road, Quzhou, Zhejiang 324000, China
  - School of Biomedical Sciences, University of Hong Kong, Hong Kong, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, 1 Chengdian Road, Quzhou, Zhejiang 324000, China
- Lei Yuan
  - Department of Hepatobiliary Surgery, Quzhou People's Hospital, 100 Minjiang Main Road, Quzhou, Zhejiang 324000, China
|
18
|
Musleh S, Islam MT, Qureshi R, Alajez N, Alam T. MSLP: mRNA subcellular localization predictor based on machine learning techniques. BMC Bioinformatics 2023; 24:109. [PMID: 36949389 PMCID: PMC10035125 DOI: 10.1186/s12859-023-05232-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/04/2022] [Accepted: 03/15/2023] [Indexed: 03/24/2023] Open
Abstract
BACKGROUND: Subcellular localization of messenger RNAs (mRNAs) plays a pivotal role in the regulation of gene expression, cell migration, and cellular adaptation. Experimental techniques for pinpointing the subcellular localization of mRNAs are laborious, time-consuming, and expensive. Therefore, in silico approaches for this purpose are attracting great attention in the RNA community. METHODS: In this article, we propose MSLP, a machine learning-based method to predict the subcellular localization of mRNA. We propose a novel combination of four types of features, representing k-mers, pseudo k-tuple nucleotide composition (PseKNC), physicochemical properties of nucleotides, and a 3D representation of sequences based on the Z-curve transformation, to feed into machine learning algorithms that predict the subcellular localization of mRNAs. RESULTS: Using the combination of the above-mentioned features, ensemble-based models achieved state-of-the-art results in mRNA subcellular localization prediction tasks on multiple benchmark datasets. We evaluated the performance of our method in ten subcellular locations: cytoplasm, nucleus, endoplasmic reticulum (ER), extracellular region (ExR), mitochondria, cytosol, pseudopodium, posterior, exosome, and ribosome. An ablation study showed that k-mer and PseKNC features were more dominant than other features for predicting cytoplasm, nucleus, and ER localizations. On the other hand, physicochemical properties and Z-curve-based features contributed the most to ExR and mitochondria detection. SHAP-based analysis revealed the relative importance of features to provide better insights into the proposed approach. AVAILABILITY: We have implemented a Docker container and API for end users to run their sequences on our model. Datasets, the API code, and the Docker container are shared with the community on GitHub at: https://github.com/smusleh/MSLP.
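Of the four feature families listed in the abstract, the Z-curve transformation is the least self-explanatory, so a short sketch may help. It maps a sequence to a 3D curve from cumulative base counts: x tracks purines versus pyrimidines, y tracks amino versus keto bases, and z tracks weak versus strong hydrogen bonding. The code below computes only the raw curve using that standard definition; how MSLP condenses the curve into fixed-length features is not stated in the abstract, and the function name `z_curve` is a hypothetical choice for this illustration.

```python
def z_curve(seq: str) -> list[tuple[int, int, int]]:
    """Return the Z-curve points (x_n, y_n, z_n) for each position of seq.

    x_n = (A_n + G_n) - (C_n + T_n)   purine vs. pyrimidine
    y_n = (A_n + C_n) - (G_n + T_n)   amino vs. keto
    z_n = (A_n + T_n) - (C_n + G_n)   weak vs. strong hydrogen bonding
    where A_n, C_n, G_n, T_n are cumulative counts up to position n.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA input alike
    a = c = g = t = 0
    points = []
    for base in seq:
        if base == "A":
            a += 1
        elif base == "C":
            c += 1
        elif base == "G":
            g += 1
        elif base == "T":
            t += 1
        else:
            continue  # ambiguous bases are skipped in this sketch
        points.append(((a + g) - (c + t), (a + c) - (g + t), (a + t) - (c + g)))
    return points


print(z_curve("AUGC"))  # [(1, 1, 1), (0, 0, 2), (1, -1, 1), (0, 0, 0)]
```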
Affiliation(s)
- Saleh Musleh
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Rizwan Qureshi
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Nihad Alajez
  - Translational Cancer and Immunity Center (TCIC), Qatar Biomedical Research Institute (QBRI), Hamad Bin Khalifa University, Doha, Qatar
  - College of Health and Life Sciences, Hamad Bin Khalifa University, Doha, Qatar
- Tanvir Alam
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar.
|
19
|
DeepmRNALoc: A Novel Predictor of Eukaryotic mRNA Subcellular Localization Based on Deep Learning. Molecules 2023; 28:2284. [PMID: 36903531 PMCID: PMC10005629 DOI: 10.3390/molecules28052284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2022] [Revised: 02/02/2023] [Accepted: 02/10/2023] [Indexed: 03/06/2023] Open
Abstract
The subcellular localization of messenger RNA (mRNA) precisely controls where protein products are synthesized and where they function. However, obtaining an mRNA's subcellular localization through wet-lab experiments is time-consuming and expensive, and many existing mRNA subcellular localization prediction algorithms need to be improved. In this study, a deep neural network-based eukaryotic mRNA subcellular location prediction method, DeepmRNALoc, was proposed, utilizing a two-stage feature extraction strategy that featured bimodal information splitting and fusing for the first stage and a VGGNet-like CNN module for the second stage. The five-fold cross-validation accuracies of DeepmRNALoc in the cytoplasm, endoplasmic reticulum, extracellular region, mitochondria, and nucleus were 0.895, 0.594, 0.308, 0.944, and 0.865, respectively, demonstrating that it outperforms existing models and techniques.
|
20
|
Li Z, Gao E, Zhou J, Han W, Xu X, Gao X. Applications of deep learning in understanding gene regulation. CELL REPORTS METHODS 2023; 3:100384. [PMID: 36814848 PMCID: PMC9939384 DOI: 10.1016/j.crmeth.2022.100384] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Indexed: 05/22/2023]
Abstract
Gene regulation is a central topic in cell biology. Advances in omics technologies and the accumulation of omics data have provided better opportunities for gene regulation studies than ever before. For this reason deep learning, as a data-driven predictive modeling approach, has been successfully applied to this field during the past decade. In this article, we aim to give a brief yet comprehensive overview of representative deep-learning methods for gene regulation. Specifically, we discuss and compare the design principles and datasets used by each method, creating a reference for researchers who wish to replicate or improve existing methods. We also discuss the common problems of existing approaches and prospectively introduce the emerging deep-learning paradigms that will potentially alleviate them. We hope that this article will provide a rich and up-to-date resource and shed light on future research directions in this area.
Affiliation(s)
- Zhongxiao Li
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Elva Gao
  - The KAUST School, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Juexiao Zhou
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Wenkai Han
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Xiaopeng Xu
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Xin Gao
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
  - KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
|
21
|
Bi Y, Li F, Guo X, Wang Z, Pan T, Guo Y, Webb GI, Yao J, Jia C, Song J. Clarion is a multi-label problem transformation method for identifying mRNA subcellular localizations. Brief Bioinform 2022; 23:bbac467. [PMID: 36341591 PMCID: PMC10148739 DOI: 10.1093/bib/bbac467] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2022] [Revised: 09/09/2022] [Accepted: 09/29/2022] [Indexed: 11/09/2022] Open
Abstract
Subcellular localization of messenger RNAs (mRNAs) plays a key role in the spatial regulation of gene activity. The functions of mRNAs have been shown to be closely linked with their localizations. As such, understanding the subcellular localizations of mRNAs can help elucidate gene regulatory networks. Although several computational methods have been developed to predict mRNA localizations within cells, there is still much room for improvement in predictive performance, especially for multiple-location prediction. In this study, we propose a novel multi-label multi-class predictor, termed Clarion, for mRNA subcellular localization prediction. Clarion was developed based on a manually curated benchmark dataset and leverages the weighted series method for multi-label transformation. Extensive benchmarking tests demonstrated that Clarion achieved competitive predictive performance and that the weighted series method plays a crucial role in securing the superior performance of Clarion. In addition, the independent test results indicate that Clarion outperformed the state-of-the-art methods and can secure accuracies of 81.47, 91.29, 79.77, 92.10, 89.15, 83.74, 80.74, 79.23 and 84.74% for chromatin, cytoplasm, cytosol, exosome, membrane, nucleolus, nucleoplasm, nucleus and ribosome, respectively. The webserver and local stand-alone tool of Clarion are freely available at http://monash.bioweb.cloud.edu.au/Clarion/.
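Because an mRNA can reside in several compartments at once, results such as those above are reported as one accuracy per compartment. The sketch below shows one common way to set this up: each annotation becomes a binary indicator vector, and accuracy is computed label by label. It is an illustration of multi-label evaluation only, not of Clarion's weighted series method, whose details are not given in the abstract. The three-compartment label list, the toy annotations, and the helper names are all made up for this example.

```python
LABELS = ["cytoplasm", "nucleus", "ribosome"]  # toy subset, chosen for illustration


def to_indicator(locations: set[str]) -> list[int]:
    """Encode a set of compartments as a 0/1 vector aligned with LABELS."""
    return [int(label in locations) for label in LABELS]


def per_label_accuracy(y_true: list[list[int]], y_pred: list[list[int]]) -> dict[str, float]:
    """Fraction of samples whose presence/absence call is correct, per label."""
    n = len(y_true)
    return {
        label: sum(t[j] == p[j] for t, p in zip(y_true, y_pred)) / n
        for j, label in enumerate(LABELS)
    }


# Made-up annotations and predictions for four hypothetical transcripts.
truth = [to_indicator(s) for s in
         [{"cytoplasm"}, {"nucleus", "ribosome"}, {"cytoplasm", "nucleus"}, {"ribosome"}]]
pred = [to_indicator(s) for s in
        [{"cytoplasm"}, {"nucleus"}, {"cytoplasm", "nucleus"}, {"cytoplasm", "ribosome"}]]

scores = per_label_accuracy(truth, pred)
print(scores)  # {'cytoplasm': 0.75, 'nucleus': 1.0, 'ribosome': 0.75}
```

Per-label accuracy counts correct absences as well as correct presences, so it can look high for rare compartments even when few true localizations are recovered.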
Affiliation(s)
- Yue Bi
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
- Fuyi Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, China
  - Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia
- Xudong Guo
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, China
- Zhikang Wang
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Tong Pan
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Yuming Guo
  - Department of Epidemiology and Preventive Medicine, School of Public Health and Preventive Medicine, Monash University, Melbourne, Victoria 3004, Australia
- Geoffrey I Webb
  - Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
- Cangzhi Jia
  - School of Science, Dalian Maritime University, Dalian 116026, China
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
|
22
|
Wei A, Wang L. Prediction of Synaptically Localized RNAs in Human Neurons Using Developmental Brain Gene Expression Data. Genes (Basel) 2022; 13:1488. [PMID: 36011399 PMCID: PMC9408096 DOI: 10.3390/genes13081488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 08/16/2022] [Accepted: 08/19/2022] [Indexed: 11/16/2022] Open
Abstract
In the nervous system, synapses are special and pervasive structures between axonal and dendritic terminals, which facilitate electrical and chemical communications among neurons. Extensive studies have been conducted in mice and rats to explore the RNA pool at synapses and investigate RNA transport, local protein synthesis, and synaptic plasticity. However, owing to the experimental difficulties of studying human synaptic transcriptomes, the full pool of human synaptic RNAs remains largely unclear. We developed a new machine learning method, called PredSynRNA, to predict the synaptic localization of human RNAs. Training instances of dendritically localized RNAs were compiled from previous rodent studies, overcoming the shortage of empirical instances of human synaptic RNAs. Using RNA sequence and gene expression data as features, various models with different learning algorithms were constructed and evaluated. Strikingly, the models using the developmental brain gene expression features achieved superior performance for predicting synaptically localized RNAs. We examined the relevant expression features learned by PredSynRNA and used an independent test dataset to further validate the model performance. PredSynRNA models were then applied to the prediction and prioritization of candidate RNAs localized to human synapses, providing valuable targets for experimental investigations into neuronal mechanisms and brain disorders.
Affiliation(s)
- Anqi Wei
  - Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, USA
  - Center for Human Genetics, Clemson University, Greenwood, SC 29646, USA
- Liangjiang Wang
  - Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, USA
  - Center for Human Genetics, Clemson University, Greenwood, SC 29646, USA
23
|
Asim MN, Ibrahim MA, Malik MI, Zehe C, Cloarec O, Trygg J, Dengel A, Ahmed S. EL-RMLocNet: An explainable LSTM network for RNA-associated multi-compartment localization prediction. Comput Struct Biotechnol J 2022; 20:3986-4002. [PMID: 35983235 PMCID: PMC9356161 DOI: 10.1016/j.csbj.2022.07.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2022] [Revised: 07/16/2022] [Accepted: 07/16/2022] [Indexed: 11/23/2022] Open
Abstract
Subcellular localization of ribonucleic acid (RNA) molecules provides significant insights into the functionality of RNAs and helps to explore their association with various diseases. The predominantly developed single-compartment localization predictors (SCLPs) fail to clarify RNA associations with diverse biochemical and pathological processes, which mainly arise through RNA co-localization in multiple compartments. The few available multi-compartment localization predictors (MCLPs) produce decent performance only for a target RNA class of a particular sub-type. Further, existing computational approaches have limited practical significance and potential to optimize therapeutics because of poor model explainability. This paper presents an explainable Long Short-Term Memory (LSTM) network, "EL-RMLocNet", whose predictive performance and interpretability are optimized using a novel GeneticSeq2Vec statistical representation learning scheme and an attention mechanism for accurate multi-compartment localization prediction of different RNAs solely from raw RNA sequences. GeneticSeq2Vec generates optimized statistical vectors of raw RNA sequences by capturing short- and long-range relations of nucleotide k-mers. Using the sequence vectors generated by the GeneticSeq2Vec scheme, LSTM layers extract the most informative features, which an attention layer then weights according to their discriminative potential for accurate multi-compartment localization prediction. Through reverse engineering, weights of the statistical feature space are mapped to nucleotide k-mer patterns to make multi-compartment localization predictions transparent and explainable for different RNA classes and species. Empirical evaluation indicates that EL-RMLocNet outperforms the state-of-the-art predictor for subcellular localization prediction of 4 different RNA classes by an average accuracy of 8% for Homo sapiens and 6% for Mus musculus. EL-RMLocNet is freely available as a web server at https://sds_genetic_analysis.opendfki.de/subcellular_loc/.
Affiliation(s)
- Muhammad Nabeel Asim
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
- Muhammad Ali Ibrahim
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
- Muhammad Imran Malik
  - School of Computer Science & Electrical Engineering, National University of Sciences and Technology, 44000, Islamabad, Pakistan
- Christoph Zehe
  - Sartorius Corporate Research, Sartorius Stedim Cellca GmbH, 89081 Ulm, Germany
- Olivier Cloarec
  - Sartorius Corporate Research, Sartorius Stedim Cellca GmbH, 89081 Ulm, Germany
- Johan Trygg
  - Computational Life Science Cluster (CLiC), Umeå University, 90187 Umea, Sweden
  - Sartorius Corporate Research, Sartorius Stedim Data Analytics, 90333 Umea, Sweden
- Andreas Dengel
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
- Sheraz Ahmed
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
|
24
|
Ahsan F, Yan Z, Precup D, Blanchette M. PhyloPGM: boosting regulatory function prediction accuracy using evolutionary information. Bioinformatics 2022; 38:i299-i306. [PMID: 35758792 PMCID: PMC9235490 DOI: 10.1093/bioinformatics/btac259] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022] Open
Abstract
Motivation: The computational prediction of regulatory function associated with a genomic sequence is of utmost importance in -omics studies, as it facilitates our understanding of the mechanisms underpinning the vast gene regulatory network. Prominent examples in this area include the binding prediction of transcription factors in DNA regulatory regions and the prediction of RNA–protein interactions in the context of post-transcriptional gene expression. However, existing computational methods have suffered from high false-positive rates and have seldom used any evolutionary information, despite the vast amount of available orthologous data across multitudes of extant and ancestral genomes, which readily present an opportunity to improve the accuracy of existing computational methods. Results: In this study, we present a novel probabilistic approach called PhyloPGM that leverages previously trained TFBS or RNA–RBP binding predictors by aggregating their predictions from various orthologous regions, in order to boost the overall prediction accuracy on human sequences. Throughout our experiments, PhyloPGM has shown significant improvement over baselines such as the sequence-based RNA–RBP binding predictor RNATracker and the sequence-based TFBS predictor FactorNet. PhyloPGM is simple in principle and easy to implement, yet yields impressive results. Availability and implementation: The PhyloPGM package is available at https://github.com/BlanchetteLab/PhyloPGM. Supplementary information: Supplementary data are available at Bioinformatics online.
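The idea of aggregating a trained predictor's outputs across orthologous regions can be illustrated with a deliberately simplified stand-in. The sketch below pools per-species scores as a weighted mean of log-odds and maps the result back to a probability. This is not PhyloPGM's probabilistic model, which the abstract does not specify; the species names, scores, weights, and function names are all hypothetical and serve only to show how orthologous predictions might temper or reinforce the human score.

```python
import math


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of p, clipped away from 0 and 1 to stay finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))


def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-species log-odds, returned as a probability.

    Simplified stand-in for phylogeny-aware aggregation (assumption):
    every species in `scores` must have a positive weight in `weights`.
    """
    total_w = sum(weights[s] for s in scores)
    pooled = sum(weights[s] * logit(p) for s, p in scores.items()) / total_w
    return sigmoid(pooled)


# Hypothetical base-predictor outputs for one human site and two orthologs.
scores = {"human": 0.62, "mouse": 0.80, "dog": 0.75}
weights = {"human": 2.0, "mouse": 1.0, "dog": 1.0}  # made-up weights
print(round(aggregate(scores, weights), 3))
```

Pooling in log-odds space rather than averaging probabilities keeps confident orthologous predictions from being flattened, which is one reason this form is a common choice for combining classifier outputs.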
Affiliation(s)
- Faizy Ahsan
  - School of Computer Science, McGill University, Montreal H3A 0G4, Canada
- Zichao Yan
  - School of Computer Science, McGill University, Montreal H3A 0G4, Canada
- Doina Precup
  - School of Computer Science, McGill University, Montreal H3A 0G4, Canada
|
25
|
Le P, Ahmed N, Yeo GW. Illuminating RNA biology through imaging. Nat Cell Biol 2022; 24:815-824. [PMID: 35697782 PMCID: PMC11132331 DOI: 10.1038/s41556-022-00933-9] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 11/28/2021] [Accepted: 05/06/2022] [Indexed: 12/14/2022]
Abstract
RNA processing plays a central role in accurately transmitting genetic information into functional RNA and protein regulators. To fully appreciate the RNA life-cycle, tools to observe RNA with high spatial and temporal resolution are critical. Here we review recent advances in RNA imaging and highlight how they will propel the field of RNA biology. We discuss current trends in RNA imaging and their potential to elucidate unanswered questions in RNA biology.
Affiliation(s)
- Phuong Le
  - Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
  - Stem Cell Program, University of California San Diego, La Jolla, CA, USA
  - Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
- Noorsher Ahmed
  - Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
  - Stem Cell Program, University of California San Diego, La Jolla, CA, USA
  - Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
  - Biomedical Sciences Graduate Program, University of California San Diego, La Jolla, CA, USA
- Gene W Yeo
  - Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
  - Stem Cell Program, University of California San Diego, La Jolla, CA, USA
  - Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
  - Biomedical Sciences Graduate Program, University of California San Diego, La Jolla, CA, USA
|
26
|
Yamada K, Hamada M. Prediction of RNA-protein interactions using a nucleotide language model. Bioinformatics Advances 2022; 2:vbac023. [PMID: 36699410 PMCID: PMC9710633 DOI: 10.1093/bioadv/vbac023] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 07/26/2021] [Revised: 02/28/2022] [Accepted: 04/05/2022] [Indexed: 01/28/2023]
Abstract
Motivation: The accumulation of sequencing data has enabled researchers to predict the interactions between RNA sequences and RNA-binding proteins (RBPs) using novel machine learning techniques. However, existing models are often difficult to interpret and require information in addition to the sequences. Bidirectional encoder representations from transformers (BERT) is a language-based deep learning model that is highly interpretable. Therefore, a model based on the BERT architecture can potentially overcome such limitations. Results: Here, we propose BERT-RBP, a model that predicts RNA-RBP interactions by adapting the BERT architecture pretrained on a human reference genome. Our model outperformed state-of-the-art prediction models on the eCLIP-seq data of 154 RBPs. Detailed analysis further revealed that BERT-RBP could recognize both the transcript region type and the RNA secondary structure based only on sequence information. Overall, the results provide insights into the fine-tuning mechanism of BERT in biological contexts and provide evidence of the applicability of the model to other RNA-related problems. Availability and implementation: Python source code is freely available at https://github.com/kkyamada/bert-rbp. The datasets underlying this article were derived from sources in the public domain: RBPsuite (http://www.csbio.sjtu.edu.cn/bioinf/RBPsuite/) and Ensembl Biomart (http://asia.ensembl.org/biomart/martview/). Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Keisuke Yamada
  - Department of Electrical Engineering and Bioscience, School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Michiaki Hamada
  - Department of Electrical Engineering and Bioscience, School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Okubo, Shinjuku, Tokyo 169-8555, Japan
|
27
|
Liu J, Xu J, Luo B, Tang J, Hou Z, Zhu Z, Zhu L, Yao G, Li C. Immune Landscape and an RBM38-Associated Immune Prognostic Model with Laboratory Verification in Malignant Melanoma. Cancers (Basel) 2022; 14:cancers14061590. [PMID: 35326741 PMCID: PMC8946480 DOI: 10.3390/cancers14061590] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/01/2022] [Revised: 03/16/2022] [Accepted: 03/17/2022] [Indexed: 02/04/2023]
Abstract
Simple Summary: The primary treatment of malignant melanoma is a classical regimen of surgery combined with chemotherapy, targeted drugs, and immunotherapy. The purpose of this study was to explore the immune response mechanism of the RNA-binding protein RBM38 in the development of melanoma, and to screen for effective immunodiagnostic models and targeted therapies. We found that RBM38, as an oncogene, promotes the proliferation, invasion, and migration of melanoma cells and is associated with immune infiltration and pathways. Our investigation demonstrated the prognostic significance of the RBM38-associated immune signature. In addition, this model may provide a potential strategy for improving the survival and immunotherapy of melanoma patients.
Background: Current studies have revealed that the RNA-binding protein RBM38 is closely related to tumor development, while its role in malignant melanoma remains unclear. Therefore, this research aimed to investigate the function of RBM38 in melanoma and the prognosis of the disease. Methods: Functional experiments (CCK-8 assay, cell colony formation, transwell cell migration/invasion experiment, wound healing assay, nude mouse tumor formation, and immunohistochemical analysis) were applied to evaluate the role of RBM38 in malignant melanoma. Immune-associated differentially expressed genes (DEGs) in RBM38-related immune pathways were comprehensively analyzed based on RNA sequencing results. Results: We found that high expression of RBM38 promoted melanoma cell proliferation, invasion, and migration, and that RBM38 was associated with immune infiltration. Then, a five-gene (A2M, NAMPT, LIF, EBI3, and ERAP1) model of RBM38-associated immune DEGs was constructed and validated. Our signature showed superior prognostic capacity compared with other melanoma prognostic signatures. Moreover, the risk score of our signature was connected with the infiltration of immune cells, immune-regulatory proteins, and immunophenoscore in melanoma. Conclusions: We constructed an immune prognosis model using RBM38-related immune DEGs that may help evaluate melanoma patient prognosis and immunotherapy modalities.
Affiliation(s)
- Jinfang Liu
  - Department of Plastic and Burns Surgery, The First Affiliated Hospital of Nanjing Medical University, 300 GuangZhou Rd, Nanjing 210029, China
- Jun Xu
  - Department of Oncology, The Third Affiliated Hospital of Soochow University, Soochow 213000, China
- Binlin Luo
  - Department of Plastic and Burns Surgery, The First Affiliated Hospital of Nanjing Medical University, 300 GuangZhou Rd, Nanjing 210029, China
- Jian Tang
  - Department of Plastic and Burns Surgery, The First Affiliated Hospital of Nanjing Medical University, 300 GuangZhou Rd, Nanjing 210029, China
- Zuoqiong Hou
  - Department of Plastic and Burns Surgery, The First Affiliated Hospital of Nanjing Medical University, 300 GuangZhou Rd, Nanjing 210029, China
- Zhechen Zhu
  - Department of Plastic and Burns Surgery, The First Affiliated Hospital of Nanjing Medical University, 300 GuangZhou Rd, Nanjing 210029, China
- Lingjun Zhu
  - Department of Oncology, The First Affiliated Hospital of Nanjing Medical University, Nanjing 210029, China
- Gang Yao
  - Department of Plastic and Burns Surgery, The First Affiliated Hospital of Nanjing Medical University, 300 GuangZhou Rd, Nanjing 210029, China
  - Corresponding author
- Chujun Li
  - Department of Plastic and Burns Surgery, The First Affiliated Hospital of Nanjing Medical University, 300 GuangZhou Rd, Nanjing 210029, China
  - Corresponding author
|
28
|
Zhang D, Wang S. A protein succinylation sites prediction method based on the hybrid architecture of LSTM network and CNN. J Bioinform Comput Biol 2022; 20:2250003. [PMID: 35191361 DOI: 10.1142/s0219720022500032] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/18/2022]
Abstract
Protein succinylation participates in the regulation of a variety of cellular processes. Identifying modified substrates and their precise sites is the basis for understanding the molecular mechanism and regulation of succinylation. In this work, we selected five superior feature encodings (CKSAAP, ACF, BLOSUM62, AAindex, and one-hot) according to their performance on succinylation site prediction. Then, an LSTM network and a CNN were used to construct four models: LSTM-CNN, CNN-LSTM, LSTM, and CNN. Each of the five selected features was input into each of these four models for training, in order to compare the models. Based on the performance of each model, the best one was chosen to construct a hybrid model, DeepSucc, composed of five sub-modules for integrating heterogeneous information. Under 10-fold cross-validation, DeepSucc achieves 86.26% accuracy, 84.94% specificity, 87.57% sensitivity, 0.9406 AUC, and 0.7254 MCC. When compared with other prediction tools on an independent test set, DeepSucc outperformed them in sensitivity and MCC. The datasets and source code can be accessed at https://github.com/1835174863zd/DeepSucc.
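The threshold-based metrics reported in the abstract above (accuracy, specificity, sensitivity, and the Matthews correlation coefficient, MCC) all follow from the four confusion-matrix counts. The sketch below shows the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the DeepSucc paper or its repository.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""

    def safe_div(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a class is absent.
        return num / den if den else 0.0

    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    sensitivity = safe_div(tp, tp + fn)  # true positive rate (recall)
    specificity = safe_div(tn, tn + fp)  # true negative rate
    # MCC is conventionally reported as 0 when any marginal total is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, denom)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to illustrate the calculation.
metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, which is why it is often reported alongside sensitivity and specificity when the positive and negative classes are imbalanced.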
Affiliation(s)
- Die Zhang
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, P. R. China
- Shunfang Wang
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, P. R. China
|
29
|
Abstract
Most of the transcribed human genome codes for noncoding RNAs (ncRNAs), and long noncoding RNAs (lncRNAs) make up the lion's share of the human ncRNA space. Despite growing interest in lncRNAs, their catalog remains incomplete because of their sheer number, their tissue specialization, and their often lower abundance, and multiple efforts to improve it are ongoing. Consequently, estimates of the number of human lncRNA genes range from fewer than 10,000 to more than 200,000. A key open challenge for lncRNA research, now that so many lncRNA species have been identified, is the characterization of lncRNA function and the interpretation of the roles of genetic and epigenetic alterations at their loci. After all, the most important human genes to catalog and study are those that contribute to important cellular functions: genes that affect development or cell differentiation and whose dysregulation may play a role in the genesis and progression of human diseases. Multiple efforts have used screens based on RNA-mediated interference (RNAi), antisense oligonucleotides (ASOs), and CRISPR to identify the consequences of lncRNA dysregulation and predict lncRNA function in select contexts, but these approaches have unresolved scalability and accuracy challenges. Instead, as was the case for better-studied ncRNAs in the past, researchers often focus on characterizing lncRNA interactions and investigating their effects on genes and pathways with known functions. Here, we focus most of our review on computational methods to identify lncRNA interactions and to predict the effects of their alterations and dysregulation on human disease pathways.
|
30
|
Khanal J, Tayara H, Zou Q, To Chong K. DeepCap-Kcr: accurate identification and investigation of protein lysine crotonylation sites based on capsule network. Brief Bioinform 2021; 23:6457166. [PMID: 34882222 DOI: 10.1093/bib/bbab492] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 09/03/2021] [Revised: 10/13/2021] [Accepted: 10/25/2021] [Indexed: 12/22/2022]
Abstract
Lysine crotonylation (Kcr) is a posttranslational modification widely detected in histone and nonhistone proteins. It plays a vital role in human disease progression and in various cellular processes, including the cell cycle, cell organization, and chromatin remodeling, and is a key mechanism for increasing proteomic diversity. Thus, accurate information on such sites is beneficial for both drug development and basic research. Existing computational methods can be improved to more effectively identify Kcr sites in proteins. In this study, we proposed a deep learning model, DeepCap-Kcr, a capsule network (CapsNet) based on a convolutional neural network (CNN) and long short-term memory (LSTM), for robust prediction of Kcr sites on histone and nonhistone proteins (mammals). The proposed model outperformed the existing CNN architecture Deep-Kcr and other well-established tools in most cases and provided promising outcomes for practical use; in particular, it automatically learned the internal hierarchical representation and the important features at multiple levels of abstraction from a small number of samples. The trained model generalized well to another species (papaya). Moreover, we showed that the features and properties generated by the internal capsule layer can be used to explore the internal data distribution related to biological significance (as a motif detector). The source code and data are freely available at https://github.com/Jhabindra-bioinfo/DeepCap-Kcr.
Affiliation(s)
- Jhabindra Khanal
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
  - Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
|
31
|
Savulescu AF, Bouilhol E, Beaume N, Nikolski M. Prediction of RNA subcellular localization: Learning from heterogeneous data sources. iScience 2021; 24:103298. [PMID: 34765919 PMCID: PMC8571491 DOI: 10.1016/j.isci.2021.103298] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 12/18/2022]
Abstract
RNA subcellular localization has recently emerged as a widespread phenomenon, which may apply to the majority of RNAs. The two main sources of data for characterizing RNA localization are sequence features and microscopy images, such as those obtained from single-molecule fluorescent in situ hybridization-based techniques. Although such imaging data are ideal for characterizing RNA distribution, these techniques remain costly, time-consuming, and technically challenging. Given these limitations, imaging data exist for only a limited number of RNAs. We argue that the field of RNA localization would greatly benefit from complementary techniques able to characterize the location of RNA. Here we discuss the importance of RNA localization and the current methodology in the field, followed by an introduction to predicting the location of molecules. We then suggest a machine learning approach based on the integration of imaging localization data and sequence-based data to assist in characterizing RNA localization at the transcriptome level.
Affiliation(s)
- Anca Flavia Savulescu
  - Division of Chemical, Systems & Synthetic Biology, Institute for Infectious Disease & Molecular Medicine, Faculty of Health Sciences, University of Cape Town, 7925 Cape Town, South Africa
- Emmanuel Bouilhol
  - Université de Bordeaux, Bordeaux Bioinformatics Center, Bordeaux, France
  - Université de Bordeaux, CNRS, IBGC, UMR 5095, Bordeaux, France
- Nicolas Beaume
  - Division of Medical Virology, Faculty of Health Sciences, University of Cape Town, 7925 Cape Town, South Africa
- Macha Nikolski
  - Université de Bordeaux, Bordeaux Bioinformatics Center, Bordeaux, France
  - Université de Bordeaux, CNRS, IBGC, UMR 5095, Bordeaux, France
|
32
|
Liao Z, Pan G, Sun C, Tang J. Predicting subcellular location of protein with evolution information and sequence-based deep learning. BMC Bioinformatics 2021; 22:515. [PMID: 34686152 PMCID: PMC8539821 DOI: 10.1186/s12859-021-04404-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/20/2021] [Accepted: 09/24/2021] [Indexed: 12/31/2022]
Abstract
Background: Protein subcellular localization prediction plays an important role in biology research. Since traditional methods are laborious and time-consuming, many machine learning-based prediction methods have been proposed. However, most of the proposed methods ignore the evolution information of proteins. In order to improve the prediction accuracy, we present a deep learning-based method to predict protein subcellular locations. Results: Our method utilizes not only the amino acid sequences but also the evolution matrices of proteins. It uses a bidirectional long short-term memory network that processes the entire protein sequence and a convolutional neural network that extracts features from protein sequences. The position-specific scoring matrix is used as a supplement to protein sequences. Our method was trained and tested on two benchmark datasets. The experimental results show that our method yields accurate results on the two datasets, with an average precision of 0.7901, a ranking loss of 0.0758, and a coverage of 1.2848. Conclusion: The experimental results show that our method outperforms five currently available methods. Based on these experiments, our method is an acceptable alternative for predicting protein subcellular location.
Affiliation(s)
- Zhijun Liao
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, 1 Xuefu North Road, University Town, Fuzhou, 350122 FJ China
  - Department of Computer Science and Engineering, University of South Carolina, 550 Assembly St, Columbia, SC 29208 USA
- Gaofeng Pan
  - Department of Computer Science and Engineering, University of South Carolina, 550 Assembly St, Columbia, SC 29208 USA
- Chao Sun
  - Department of Computer Science and Engineering, University of South Carolina, 550 Assembly St, Columbia, SC 29208 USA
- Jijun Tang
  - Department of Computer Science and Engineering, University of South Carolina, 550 Assembly St, Columbia, SC 29208 USA
  - College of Electrical and Power Engineering, Taiyuan University of Technology, No. 79 Yinze West Street, Taiyuan, 030024 SX China
|
33
|
Zeng M, Wu Y, Lu C, Zhang F, Wu FX, Li M. DeepLncLoc: a deep learning framework for long non-coding RNA subcellular localization prediction based on subsequence embedding. Brief Bioinform 2021; 23:6366323. [PMID: 34498677 DOI: 10.1093/bib/bbab360] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 06/04/2021] [Revised: 08/04/2021] [Accepted: 08/16/2021] [Indexed: 11/14/2022]
Abstract
Long non-coding RNAs (lncRNAs) are a class of RNA molecules with more than 200 nucleotides. A growing amount of evidence reveals that the subcellular localization of lncRNAs can provide valuable insights into their biological functions. Existing computational methods for predicting lncRNA subcellular localization use k-mer features to encode lncRNA sequences. However, the sequence order information is lost when only k-mer features are used. We proposed a deep learning framework, DeepLncLoc, to predict lncRNA subcellular localization. In DeepLncLoc, we introduced a new subsequence embedding method that keeps the order information of lncRNA sequences. The subsequence embedding method first divides a sequence into consecutive subsequences, then extracts the patterns of each subsequence, and finally combines these patterns to obtain a complete representation of the lncRNA sequence. After that, a text convolutional neural network is employed to learn high-level features and perform the prediction task. Compared with traditional machine learning models, popular representation methods, and existing predictors, DeepLncLoc achieved better performance, which shows that DeepLncLoc can effectively predict lncRNA subcellular localization. Our study not only presented a novel computational model for predicting lncRNA subcellular localization but also introduced a new subsequence embedding method that is expected to be applicable to other sequence-based prediction tasks. The DeepLncLoc web server is freely accessible at http://bioinformatics.csu.edu.cn/DeepLncLoc/, and source code and datasets can be downloaded from https://github.com/CSUBioGroup/DeepLncLoc.
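The point about order information in the abstract above can be illustrated with a toy example. The sketch below is not the DeepLncLoc implementation: plain k-mer count vectors stand in for the learned "patterns" of each subsequence, and the function names (`kmer_counts`, `split_consecutive`, `subsequence_features`) and parameters are hypothetical. It only shows that splitting a sequence into consecutive chunks and describing each chunk separately keeps positional information that a single global k-mer profile discards.

```python
from itertools import product

ALPHABET = "ACGU"


def kmer_counts(seq: str, k: int) -> list:
    """Count every k-mer over ACGU in seq, in lexicographic k-mer order."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip windows containing non-ACGU symbols
            counts[j] += 1
    return counts


def split_consecutive(seq: str, n_parts: int) -> list:
    """Split seq into n_parts consecutive, non-overlapping, near-equal chunks."""
    size, extra = divmod(len(seq), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(seq[start:end])
        start = end
    return chunks


def subsequence_features(seq: str, n_parts: int, k: int) -> list:
    """One k-mer count vector per consecutive chunk, kept in sequence order."""
    return [kmer_counts(chunk, k) for chunk in split_consecutive(seq, n_parts)]


# Two toy sequences with the same composition but opposite arrangement.
a, b = "AAAAUUUU", "UUUUAAAA"
print(kmer_counts(a, 1) == kmer_counts(b, 1))  # global 1-mer profiles match
print(subsequence_features(a, n_parts=2, k=1))
print(subsequence_features(b, n_parts=2, k=1))
```

The ordered list of per-chunk vectors is the kind of input a convolutional model can scan along the sequence axis, whereas a single global k-mer vector has no such axis.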
Affiliation(s)
- Min Zeng
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083, China
- Yifan Wu
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083, China
- Chengqian Lu
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083, China
- Fuhao Zhang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083, China
- Fang-Xiang Wu
  - Division of Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK, S7N 5A9, Canada
- Min Li
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan, 410083, China
|
34
|
Asim MN, Ibrahim MA, Imran Malik M, Dengel A, Ahmed S. Advances in Computational Methodologies for Classification and Sub-Cellular Locality Prediction of Non-Coding RNAs. Int J Mol Sci 2021; 22:8719. [PMID: 34445436 PMCID: PMC8395733 DOI: 10.3390/ijms22168719] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 07/17/2021] [Revised: 08/02/2021] [Accepted: 08/03/2021] [Indexed: 02/06/2023]
Abstract
Apart from protein-coding ribonucleic acids (RNAs), there exists a variety of non-coding RNAs (ncRNAs) that regulate complex cellular and molecular processes. High-throughput sequencing technologies and bioinformatics approaches have greatly advanced the exploration of ncRNAs, revealing their crucial roles in gene regulation, miRNA binding, protein interactions, and splicing. Furthermore, ncRNAs are involved in the development of complex diseases such as cancer. Categorization of ncRNAs is essential to understand the mechanisms of diseases and to develop effective treatments. Sub-cellular localization information demystifies the diverse functionalities of ncRNAs. To date, several computational methodologies have been proposed to precisely identify the class as well as the sub-cellular localization patterns of RNAs. This paper discusses different types of ncRNAs and reviews computational approaches proposed in the last 10 years to distinguish coding RNA from ncRNA, to identify sub-types of ncRNAs such as piwi-associated RNA, micro RNA, long ncRNA, and circular RNA, and to determine the sub-cellular localization of distinct ncRNAs and RNAs. Furthermore, it summarizes diverse ncRNA classification and sub-cellular localization datasets along with benchmark performance to aid the development and evaluation of novel computational methodologies. It identifies research gaps, heterogeneity, and challenges in the development of computational approaches for RNA sequence analysis. We consider that our expert analysis will help Artificial Intelligence researchers learn, in one place, about state-of-the-art performance, model selection for various tasks, the dominant sequence descriptors and neural architectures, and how to interpret inter-species and intra-species performance deviation.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Muhammad Ali Ibrahim
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Muhammad Imran Malik
  - National Center for Artificial Intelligence (NCAI), National University of Sciences and Technology, Islamabad 44000, Pakistan
  - School of Electrical Engineering & Computer Science, National University of Sciences and Technology, Islamabad 44000, Pakistan
- Andreas Dengel
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Sheraz Ahmed
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
  - DeepReader GmbH, Trippstadter Str. 122, 67663 Kaiserslautern, Germany
|
35
|
Zhou D, Peng S, Wei DQ, Zhong W, Dou Y, Xie X. LUNAR: Drug Screening for Novel Coronavirus Based on Representation Learning Graph Convolutional Network. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1290-1298. [PMID: 34081583 PMCID: PMC8769035 DOI: 10.1109/tcbb.2021.3085972] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/29/2020] [Revised: 04/23/2021] [Accepted: 05/30/2021] [Indexed: 06/12/2023]
Abstract
An outbreak of COVID-19 that began in late 2019 was caused by a novel coronavirus (SARS-CoV-2). It has become a global pandemic. As of June 9, 2020, it has infected nearly 7 million people and killed more than 400,000, but there is no specific drug. Therefore, there is an urgent need to find or develop more drugs to suppress the virus. Here, we propose a new nonlinear end-to-end model called LUNAR. It uses graph convolutional neural networks to automatically learn the neighborhood information of complex heterogeneous relational networks and combines this with an attention mechanism, which reflects the importance of the sum of different types of neighborhood information, to obtain the representation of each node. Finally, through a topology reconstruction process, the feature representations of drugs and targets are constrained to match the observed network as closely as possible. Through this reconstruction process, we obtain the strength of the relationship between different nodes and predict drug candidates that may be useful in treating COVID-19 based on the known targets of COVID-19. These candidate drugs can serve as a reference for experimental scientists and accelerate drug development. LUNAR integrates the various kinds of topological structure information in heterogeneous networks and combines attention mechanisms to reflect the importance of neighborhood information from different types of nodes, improving the interpretability of the model. Under 10-fold cross-validation, the area under the curve (AUC) of the model is 0.949 and the area under the precision-recall curve (AUPR) is 0.866. These two performance indexes show that the model has superior predictive performance. In addition, some of the drugs identified by our model have appeared in clinical studies, which further illustrates the effectiveness of the model.
Affiliation(s)
- Deshan Zhou
  - College of Computer Science, Hunan University, Changsha, Hunan 410082, China
- Shaoliang Peng
  - College of Computer Science and Electronic Engineering & National Supercomputing Centre in Changsha, Hunan University, Changsha, Hunan 410082, China
  - School of Computer Science, National University of Defense Technology, Changsha, Hunan 410082, China
- Dong-Qing Wei
  - State Key Laboratory of Microbial Metabolism, Shanghai-Islamabad-Belgrade Joint Innovation Center on Antibacterial Resistances, Joint International Research Laboratory of Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200030, China
  - Peng Cheng Laboratory, Shenzhen, Guangdong 518055, China
- Wu Zhong
  - National Engineering Research Center for the Emergency Drug, Beijing Institute of Pharmacology and Toxicology, Beijing 100850, China
- Yutao Dou
  - School of Computer Science, The University of Sydney, Sydney, NSW 2006, Australia
- Xiaolan Xie
  - School of Information Science and Engineering, Guilin University of Technology, Guilin City, Guangxi 541004, China
|
36
|
Meher PK, Rai A, Rao AR. mLoc-mRNA: predicting multiple sub-cellular localization of mRNAs using random forest algorithm coupled with feature selection via elastic net. BMC Bioinformatics 2021; 22:342. [PMID: 34167457 PMCID: PMC8223360 DOI: 10.1186/s12859-021-04264-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/18/2020] [Accepted: 06/11/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Localization of messenger RNAs (mRNAs) plays a crucial role in the growth and development of cells. In particular, it plays a major role in regulating spatio-temporal gene expression. In situ hybridization is a promising experimental technique for determining the localization of mRNAs, but it is costly and laborious. A single mRNA can also be present in more than one location, whereas the existing computational tools can predict only a single location for such mRNAs. Thus, a high-end computational tool is required for reliable and timely prediction of multiple subcellular locations of mRNAs. Hence, we develop the present computational model to predict the multiple localizations of mRNAs.
RESULTS: The mRNA sequences from 9 different localizations were considered. Each sequence was first transformed to a numeric feature vector of size 5460, based on the k-mer features of sizes 1-6. Out of 5460 k-mer features, 1812 important features were selected by the Elastic Net statistical model. The Random Forest supervised learning algorithm was then employed for predicting the localizations with the selected features. Five-fold cross-validation accuracies of 70.87, 68.32, 68.36, 68.79, 96.46, 73.44, 70.94, 97.42 and 71.77% were obtained for the cytoplasm, cytosol, endoplasmic reticulum, exosome, mitochondrion, nucleus, pseudopodium, posterior and ribosome, respectively. With an independent test set, accuracies of 65.33, 73.37, 75.86, 72.99, 94.26, 70.91, 65.53, 93.60 and 73.45% were obtained for the respective localizations. The developed approach also achieved higher accuracies than the existing localization prediction tools.
CONCLUSIONS: This study presents a novel computational tool for predicting the multiple localizations of mRNAs. Based on the proposed approach, an online prediction server "mLoc-mRNA" is accessible at http://cabgrid.res.in:8080/mlocmrna/. The developed approach is believed to supplement the existing tools and techniques for the localization prediction of mRNAs.
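The feature dimension quoted in this abstract follows from counting every k-mer over a four-letter alphabet for k = 1 to 6: 4 + 16 + 64 + 256 + 1024 + 4096 = 5460. The Python sketch below illustrates one way such an encoding could be built. The function name `kmer_feature_vector`, the use of normalized frequencies rather than raw counts, and the U-to-T conversion are assumptions made for illustration and are not taken from the mLoc-mRNA implementation; the Elastic Net selection and Random Forest steps are not reproduced.

```python
from itertools import product


def kmer_feature_vector(seq, k_max=6):
    """Encode a nucleotide sequence as concatenated k-mer frequencies, k = 1..k_max.

    Hypothetical helper: normalization by the number of windows is an assumption.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA and cDNA input alike
    features = []
    for k in range(1, k_max + 1):
        # All 4**k possible k-mers in a fixed lexicographic order.
        kmers = ["".join(p) for p in product("ACGT", repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        windows = max(len(seq) - k + 1, 0)
        for i in range(windows):
            sub = seq[i:i + k]
            if sub in counts:  # skip windows containing ambiguous bases such as N
                counts[sub] += 1
        features.extend(counts[m] / windows if windows else 0.0 for m in kmers)
    return features


if __name__ == "__main__":
    vec = kmer_feature_vector("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
    print(len(vec))  # number of features for k = 1..6
```

A feature-selection step and a classifier would then be applied to a matrix of such vectors, one row per transcript.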
Affiliation(s)
- Prabina Kumar Meher
  - ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
- Anil Rai
  - ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
|
37
|
Wang D, Zhang Z, Jiang Y, Mao Z, Wang D, Lin H, Xu D. DM3Loc: multi-label mRNA subcellular localization prediction and analysis based on multi-head self-attention mechanism. Nucleic Acids Res 2021; 49:e46. [PMID: 33503258 PMCID: PMC8096227 DOI: 10.1093/nar/gkab016] [Citation(s) in RCA: 84] [Impact Index Per Article: 28.0] [Received: 10/25/2020] [Revised: 12/09/2020] [Accepted: 01/06/2021] [Indexed: 12/30/2022]
Abstract
Subcellular localization of messenger RNAs (mRNAs), as a prevalent mechanism, gives precise and efficient control for the translation process. There is mounting evidence for the important roles of this process in a variety of cellular events. Computational methods for mRNA subcellular localization prediction provide a useful approach for studying mRNA functions. However, few computational methods have been designed for mRNA subcellular localization prediction, and their performance has room for improvement. In particular, there is still no tool available for predicting the localization of mRNAs that have multiple localization annotations. In this paper, we propose a multi-head self-attention method, DM3Loc, for multi-label mRNA subcellular localization prediction. Evaluation results show that DM3Loc outperforms existing methods and tools in general. Furthermore, DM3Loc has the interpretation ability to analyze RNA-binding protein motifs and key signals on mRNAs for subcellular localization. Our analyses found hundreds of instances of mRNA isoform-specific subcellular localizations and many significantly enriched gene functions for mRNAs in different subcellular localizations.
Affiliation(s)
- Duolin Wang
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO 65203, USA
- Zhaoyue Zhang
  - Center for Information Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yuexu Jiang
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO 65203, USA
- Ziting Mao
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO 65203, USA
- Dong Wang
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China
- Hao Lin
  - Center for Information Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Dong Xu
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO 65203, USA
|
38
|
Tang Q, Nie F, Kang J, Chen W. mRNALocater: Enhance the prediction accuracy of eukaryotic mRNA subcellular localization by using model fusion strategy. Mol Ther 2021; 29:2617-2623. [PMID: 33823302 DOI: 10.1016/j.ymthe.2021.04.004] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 12/13/2020] [Revised: 03/23/2021] [Accepted: 03/31/2021] [Indexed: 02/07/2023]
Abstract
The functions of mRNAs are closely correlated with their locations in cells. Knowledge about the subcellular locations of mRNA is helpful to understand their biological functions. In recent years, it has become a hot topic to develop effective computational models to predict eukaryotic mRNA subcellular localizations. However, existing state-of-the-art models still have certain deficiencies in terms of prediction accuracy and generalization ability. Therefore, it is urgent to develop novel methods to accurately predict mRNA subcellular localizations. In this study, a novel method called mRNALocater was proposed to detect the subcellular localization of eukaryotic mRNA by adopting the model fusion strategy. To fully extract information from mRNA sequences, the electron-ion interaction pseudopotential and pseudo k-tuple nucleotide composition were used to encode the sequences. Moreover, the correlation coefficient filtering algorithm and feature forward search technology were used to mine hidden feature information, which guarantees that mRNALocater can be more effectively applied to new sequences. The results based on the independent dataset tests demonstrate that mRNALocater yields promising performances for predicting eukaryotic mRNA subcellular localizations and is a powerful tool in practical applications. A freely available online web server for mRNALocater has been established at http://bio-bigdata.cn/mRNALocater.
Affiliation(s)
- Qiang Tang
  - State Key Laboratory of Southwestern Chinese Medicine Resources, Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
- Fulei Nie
  - School of Life Sciences, North China University of Science and Technology, Tangshan 063210, China; School of Public Health, North China University of Science and Technology, Tangshan 063210, China
- Juanjuan Kang
  - Affiliated Foshan Maternity & Child Healthcare Hospital, Southern Medical University (Foshan Maternity & Child Healthcare Hospital), Foshan 528000, China
- Wei Chen
  - State Key Laboratory of Southwestern Chinese Medicine Resources, Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China; School of Life Sciences, North China University of Science and Technology, Tangshan 063210, China; School of Public Health, North China University of Science and Technology, Tangshan 063210, China.
|
39
|
Li J, Zhang L, He S, Guo F, Zou Q. SubLocEP: a novel ensemble predictor of subcellular localization of eukaryotic mRNA based on machine learning. Brief Bioinform 2021; 22:6059770. [PMID: 33388743 DOI: 10.1093/bib/bbaa401] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 09/02/2020] [Revised: 11/28/2020] [Accepted: 12/08/2020] [Indexed: 01/23/2023]
Abstract
MOTIVATION: mRNA location corresponds to the location of protein translation and contributes to precise spatial and temporal management of the protein function. However, current assignment of subcellular localization of eukaryotic mRNA reveals important limitations: (1) turning multiple classifications into multiple dichotomies makes the training process tedious; (2) the majority of the models trained by classical algorithms are based on the extraction of single sequence information; (3) the existing state-of-the-art models have not reached an ideal level in terms of prediction and generalization ability. To achieve better assignment of subcellular localization of eukaryotic mRNA, a better and more comprehensive model must be developed.
RESULTS: In this paper, SubLocEP is proposed as a two-layer integrated prediction model for accurate prediction of the location of sequence samples. Unlike the existing models based on limited features, SubLocEP comprehensively considers additional feature attributes and is combined with LightGBM to generate single-feature classifiers. The initial integration model (single-layer model) is generated according to the categories of a feature. Subsequently, two single-layer integration models are weighted (sequence-based: physicochemical properties = 3:2) to produce the final two-layer model. The performance of SubLocEP on independent datasets is sufficient to indicate that SubLocEP is an accurate and stable prediction model with strong generalization ability. Additionally, an online tool has been developed that contains experimental data and can maximize the user convenience for estimation of subcellular localization of eukaryotic mRNA.
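The 3:2 weighting of the two single-layer models mentioned in this abstract can be pictured as a weighted average of per-class scores. The Python sketch below is only an illustration under that assumption: the function names `fuse_scores` and `predict_label`, the class labels, and the idea that the weights are applied to class probabilities are hypothetical and are not confirmed by the abstract.

```python
def fuse_scores(seq_scores, phys_scores, w_seq=3.0, w_phys=2.0):
    """Weighted average of two per-class score lists (default weights 3:2).

    seq_scores: scores from a sequence-based model (assumed input).
    phys_scores: scores from a physicochemical-property model (assumed input).
    """
    if len(seq_scores) != len(phys_scores):
        raise ValueError("both models must score the same set of classes")
    total = w_seq + w_phys
    return [(w_seq * s + w_phys * p) / total
            for s, p in zip(seq_scores, phys_scores)]


def predict_label(labels, fused):
    """Return the label with the highest fused score."""
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best]


if __name__ == "__main__":
    labels = ["cytoplasm", "nucleus"]  # illustrative class names
    fused = fuse_scores([0.2, 0.8], [0.9, 0.1])
    print(fused, predict_label(labels, fused))
```

With weights of 3 and 2, the sequence-based model contributes 60% of the final score, so it prevails whenever the two models disagree with equal confidence.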
Affiliation(s)
- Lichao Zhang
  - School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology
|
40
|
MirLocPredictor: A ConvNet-Based Multi-Label MicroRNA Subcellular Localization Predictor by Incorporating k-Mer Positional Information. Genes (Basel) 2020; 11:genes11121475. [PMID: 33316943 PMCID: PMC7763197 DOI: 10.3390/genes11121475] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 11/03/2020] [Revised: 11/23/2020] [Accepted: 11/25/2020] [Indexed: 02/06/2023]
Abstract
MicroRNAs (miRNAs) are small noncoding RNA sequences consisting of about 22 nucleotides that are involved in the regulation of almost 60% of mammalian genes. Presently, there are very limited approaches for the visualization of miRNA locations present inside cells to support the elucidation of pathways and mechanisms behind miRNA function, transport, and biogenesis. MIRLocator, a state-of-the-art tool for the prediction of subcellular localization of miRNAs, makes use of a sequence-to-sequence model along with pretrained k-mer embeddings. Existing pretrained k-mer embedding generation methodologies focus on the extraction of semantics of k-mers. However, in RNA sequences, positional information of nucleotides is more important because distinct positions of the four nucleotides define the function of an RNA molecule. Considering the importance of the nucleotide position, we propose a novel approach (kmerPR2vec) which is a fusion of positional information of k-mers with randomly initialized neural k-mer embeddings. In contrast to existing k-mer-based representations, the proposed kmerPR2vec representation is much richer in terms of semantic information and has more discriminative power. Using the novel kmerPR2vec representation, we further present an end-to-end system (MirLocPredictor) which couples the discriminative power of kmerPR2vec with Convolutional Neural Networks (CNNs) for miRNA subcellular location prediction. The effectiveness of the proposed kmerPR2vec approach is evaluated with deep learning-based topologies (i.e., Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)) and by using 9 different evaluation measures. Analysis of the results reveals that MirLocPredictor outperforms state-of-the-art methods by a significant margin of 18% and 19% in terms of precision and recall.
|
41
|
Aillaud M, Schulte LN. Emerging Roles of Long Noncoding RNAs in the Cytoplasmic Milieu. Noncoding RNA 2020; 6:ncrna6040044. [PMID: 33182489 PMCID: PMC7711603 DOI: 10.3390/ncrna6040044] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 09/30/2020] [Revised: 10/26/2020] [Accepted: 11/05/2020] [Indexed: 02/06/2023]
Abstract
While the important functions of long noncoding RNAs (lncRNAs) in nuclear organization are well documented, their orchestrating and architectural roles in the cytoplasmic environment have long been underestimated. However, recently developed fractionation and proximity labelling approaches have shown that a considerable proportion of cellular lncRNAs is exported into the cytoplasm and associates nonrandomly with proteins in the cytosol and organelles. The functions of these lncRNAs range from the control of translation and mitochondrial metabolism to the anchoring of cellular components on the cytoskeleton and regulation of protein degradation at the proteasome. In the present review, we provide an overview of the functions of lncRNAs in cytoplasmic structures and machineries and discuss their emerging roles in the coordination of the dense intracellular milieu. It is becoming apparent that further research into the functions of these lncRNAs will lead to an improved understanding of the spatiotemporal organization of cytoplasmic processes during homeostasis and disease.
Affiliation(s)
- Michelle Aillaud
  - Institute for Lung Research, Philipps University Marburg, 35043 Marburg, Germany
- Leon N Schulte
  - Institute for Lung Research, Philipps University Marburg, 35043 Marburg, Germany
  - German Center for Lung Research (DZL), 35392 Giessen, Germany
|
42
|
Garg A, Singhal N, Kumar R, Kumar M. mRNALoc: a novel machine-learning based in-silico tool to predict mRNA subcellular localization. Nucleic Acids Res 2020; 48:W239-W243. [PMID: 32421834 PMCID: PMC7319581 DOI: 10.1093/nar/gkaa385] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 03/03/2020] [Revised: 04/14/2020] [Accepted: 04/30/2020] [Indexed: 02/06/2023]
Abstract
Recent evidence suggests that the localization of mRNAs near the subcellular compartment of the translated proteins is a more robust cellular tool, which optimizes protein expression post-transcriptionally. Retention of mRNA in the nucleus can regulate the amount of protein translated from each mRNA, thus allowing a tight temporal regulation of translation or buffering of protein levels from bursty transcription. In addition, mRNA localization performs a variety of additional roles such as long-distance signaling, facilitating assembly of protein complexes, and coordination of developmental processes. Here, we describe a novel machine-learning based tool, mRNALoc, to predict five sub-cellular locations of eukaryotic mRNAs using cDNA/mRNA sequences. During five-fold cross-validation, the maximum overall accuracy was 65.19, 75.36, 67.10, 99.70 and 73.59% for the extracellular region, endoplasmic reticulum, cytoplasm, mitochondria, and nucleus, respectively. Assessment on independent datasets revealed prediction accuracies of 58.10, 69.23, 64.55, 96.88 and 69.35% for the extracellular region, endoplasmic reticulum, cytoplasm, mitochondria, and nucleus, respectively. The corresponding values of AUC were 0.76, 0.75, 0.70, 0.98 and 0.74 for the extracellular region, endoplasmic reticulum, cytoplasm, mitochondria, and nucleus, respectively. The mRNALoc standalone software and web-server are freely available for academic use under GNU GPL at http://proteininformatics.org/mkumar/mrnaloc.
Affiliation(s)
- Anjali Garg
  - Department of Biophysics, University of Delhi South Campus, New Delhi 110021, India
- Neelja Singhal
  - Department of Biophysics, University of Delhi South Campus, New Delhi 110021, India
- Ravindra Kumar
  - Department of Biophysics, University of Delhi South Campus, New Delhi 110021, India
- Manish Kumar
  - Department of Biophysics, University of Delhi South Campus, New Delhi 110021, India
|
43
|
Li J, Pu Y, Tang J, Zou Q, Guo F. DeepATT: a hybrid category attention neural network for identifying functional effects of DNA sequences. Brief Bioinform 2020; 22:5890498. [PMID: 32778871 DOI: 10.1093/bib/bbaa159] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 04/29/2020] [Revised: 06/05/2020] [Accepted: 06/19/2020] [Indexed: 12/23/2022]
Abstract
Quantifying DNA properties is a challenging task in the broad field of human genomics. Since the vast majority of non-coding DNA is still poorly understood in terms of function, this task is particularly important and could greatly benefit biological research. Various DNA sequences should have a great variety of representations, and specific functions may focus on corresponding features in the front part of the learning model. Currently, however, for multi-class prediction of non-coding DNA regulatory functions, the most powerful predictive models do not have appropriate feature extraction and selection approaches for specific functional effects, so it is difficult to gain better insight into their internal correlations. Hence, we design a category attention layer and a category dense layer in order to select efficient features and distinguish different DNA functions. In this study, we propose a hybrid deep neural network method, called DeepATT, for identifying 919 regulatory functions on nearly 5 million DNA sequences. Our model has four built-in neural network constructions: the convolution layer captures regulatory motifs, the recurrent layer captures a regulatory grammar, the category attention layer selects corresponding valid features for different functions, and the category dense layer classifies predictive labels with selected features of regulatory functions. Importantly, we compare our novel method, DeepATT, with the existing outstanding prediction tools DeepSEA and DanQ. DeepATT performs significantly better than other existing tools for identifying DNA functions, improving the area under the precision-recall curve by at least 1.6%. Furthermore, we can mine the important correlations among different DNA functions according to the category attention module. Moreover, our novel model can greatly reduce the number of parameters through the attention mechanism and local connections while maintaining accuracy.
Affiliation(s)
- Quan Zou
  - University of Electronic Science and Technology of China
|
44
|
Wu KE, Parker KR, Fazal FM, Chang HY, Zou J. RNA-GPS predicts high-resolution RNA subcellular localization and highlights the role of splicing. RNA (NEW YORK, N.Y.) 2020; 26:851-865. [PMID: 32220894 PMCID: PMC7297119 DOI: 10.1261/rna.074161.119] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 11/26/2019] [Accepted: 03/19/2020] [Indexed: 06/10/2023]
Abstract
Subcellular localization is essential to RNA biogenesis, processing, and function across the gene expression life cycle. However, the specific nucleotide sequence motifs that direct RNA localization are incompletely understood. Fortunately, new sequencing technologies have provided transcriptome-wide atlases of RNA localization, creating an opportunity to leverage computational modeling. Here we present RNA-GPS, a new machine learning model that uses nucleotide-level features to predict RNA localization across eight different subcellular locations, the first to provide such a wide range of predictions. RNA-GPS's design enables high-throughput sequence ablation and feature importance analyses to probe the sequence motifs that drive localization prediction. We find localization-informative motifs to be concentrated on 3'-UTRs and scattered along the coding sequence, and motifs related to splicing to be important drivers of predicted localization, even for cytotopic distinctions for membraneless bodies within the nucleus or for organelles within the cytoplasm. Overall, our results suggest transcript splicing is one of many elements influencing RNA subcellular localization.
Affiliation(s)
- Kevin E Wu
  - Department of Computer Science, Stanford University, Stanford, California 94305, USA
  - Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, California 94305, USA
  - Center for Personal and Dynamic Regulomes, Stanford University School of Medicine, Stanford, California 94305, USA
- Kevin R Parker
  - Center for Personal and Dynamic Regulomes, Stanford University School of Medicine, Stanford, California 94305, USA
- Furqan M Fazal
  - Center for Personal and Dynamic Regulomes, Stanford University School of Medicine, Stanford, California 94305, USA
- Howard Y Chang
  - Center for Personal and Dynamic Regulomes, Stanford University School of Medicine, Stanford, California 94305, USA
  - Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, California 94305, USA
- James Zou
  - Department of Computer Science, Stanford University, Stanford, California 94305, USA
  - Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, California 94305, USA
|
45
|
Fazal FM, Chang HY. Subcellular Spatial Transcriptomes: Emerging Frontier for Understanding Gene Regulation. COLD SPRING HARBOR SYMPOSIA ON QUANTITATIVE BIOLOGY 2020; 84:31-45. [PMID: 32482897 PMCID: PMC7426137 DOI: 10.1101/sqb.2019.84.040352] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/12/2022]
Abstract
RNAs are trafficked and localized with exquisite precision inside the cell. Studies of candidate messenger RNAs have shown the vital importance of RNA subcellular location in development and cellular function. New sequencing- and imaging-based methods are providing complementary insights into subcellular localization of RNAs transcriptome-wide. APEX-seq and ribosome profiling as well as proximity-labeling approaches have revealed thousands of transcript isoforms are localized to distinct cytotopic locations, including locations that defy biochemical fractionation and hence were missed by prior studies. Sequences in the 3' and 5' untranslated regions (UTRs) serve as "zip codes" to direct transcripts to particular locales, and it is clear that intronic and retrotransposable sequences within transcripts have been co-opted by cells to control localization. Molecular motors, nuclear-to-cytosol RNA export, liquid-liquid phase separation, RNA modifications, and RNA structure dynamically shape the subcellular transcriptome. Location-based RNA regulation continues to pose new mysteries for the field, yet promises to reveal insights into fundamental cell biology and disease mechanisms.
Affiliation(s)
- Furqan M Fazal
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, California 94305, USA
- Howard Y Chang
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, California 94305, USA
  - Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, California 94305, USA
|
46
|
Chaudhuri A, Das S, Das B. Localization elements and zip codes in the intracellular transport and localization of messenger RNAs in Saccharomyces cerevisiae. WILEY INTERDISCIPLINARY REVIEWS-RNA 2020; 11:e1591. [PMID: 32101377 DOI: 10.1002/wrna.1591] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/03/2019] [Revised: 02/05/2020] [Accepted: 02/07/2020] [Indexed: 12/13/2022]
Abstract
Intracellular trafficking and localization of mRNAs provide a mechanism of regulation of expression of genes with excellent spatial control. mRNA localization followed by localized translation appears to be a mechanism of targeted protein sorting to a specific cell-compartment, which is linked to the establishment of cell polarity, cell asymmetry, embryonic axis determination, and neuronal plasticity in metazoans. However, the complexity of the mechanism and the components of mRNA localization in higher organisms prompted the use of the unicellular organism Saccharomyces cerevisiae as a simplified model organism to study this vital process. Current knowledge indicates that a variety of mRNAs are asymmetrically and selectively localized to the tip of the bud of the daughter cells, to the vicinity of endoplasmic reticulum, mitochondria, and nucleus in this organism, which are connected to diverse cellular processes. Interestingly, specific cis-acting RNA localization elements (LEs) or RNA zip codes play a crucial role in the localization and trafficking of these localized mRNAs by providing critical binding sites for the specific RNA-binding proteins (RBPs). In this review, we present a comprehensive account of mRNA localization in S. cerevisiae, various types of localization elements influencing the mRNA localization, and the RBPs, which bind to these LEs to implement a number of vital physiological processes. Finally, we emphasize the significance of this process by highlighting their connection to several neuropathological disorders and cancers. This article is categorized under: RNA Export and Localization > RNA Localization.
Affiliation(s)
- Anusha Chaudhuri
  - Department of Life Science and Biotechnology, Jadavpur University, Kolkata, India
- Subhadeep Das
  - Department of Life Science and Biotechnology, Jadavpur University, Kolkata, India
- Biswadip Das
  - Department of Life Science and Biotechnology, Jadavpur University, Kolkata, India
|
47
|
Zhang ZY, Yang YH, Ding H, Wang D, Chen W, Lin H. Design powerful predictor for mRNA subcellular location prediction in Homo sapiens. Brief Bioinform 2020; 22:526-535. [PMID: 31994694 DOI: 10.1093/bib/bbz177] [Citation(s) in RCA: 87] [Impact Index Per Article: 21.8] [Received: 09/17/2019] [Revised: 11/05/2019] [Accepted: 11/21/2019] [Indexed: 12/14/2022]
Abstract
Messenger RNAs (mRNAs) carry the special responsibility of transmitting the genetic code from DNA to discrete locations in the cytoplasm. The locating process of mRNA might provide spatial and temporal regulation of mRNA and protein functions. In situ hybridization and quantitative transcriptomics analysis can provide detailed information about mRNA subcellular localization; however, they are time consuming and expensive. It is highly desirable to develop computational tools for timely and effective prediction of mRNA subcellular location. In this work, by using binomial distribution and one-way analysis of variance, the optimal nonamer composition was obtained to represent mRNA sequences. Subsequently, a predictor based on support vector machine was developed to identify the mRNA subcellular localization. In 5-fold cross-validation, results showed that the accuracy is 90.12% for Homo sapiens (H. sapiens). The predictor may provide a reference for the study of mRNA localization mechanisms and mRNA translocation strategies. An online web server was established based on our models, which is available at http://lin-group.cn/server/iLoc-mRNA/.
Affiliation(s)
- Zhao-Yue Zhang
  - Center for Informational Biology at University of Electronic Science and Technology of China
- Yu-He Yang
  - Center for Informational Biology at University of Electronic Science and Technology of China
- Hui Ding
  - Center for Informational Biology at University of Electronic Science and Technology of China
- Dong Wang
  - Department of Bioinformatics at Southern Medical University
- Wei Chen
  - Innovative Institute of Chinese Medicine and Pharmacy at Chengdu University of Traditional Chinese Medicine
- Hao Lin
  - Center for Informational Biology at University of Electronic Science and Technology of China
|
48
|
Protein sequence information extraction and subcellular localization prediction with gapped k-Mer method. BMC Bioinformatics 2019; 20:719. [PMID: 31888447 PMCID: PMC6936157 DOI: 10.1186/s12859-019-3232-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 12/19/2022]
Abstract
BACKGROUND: Subcellular localization prediction of proteins is an important component of bioinformatics, which has great importance for drug design and other applications. A multitude of computational tools for protein subcellular location have been developed in recent decades; however, existing methods differ in the protein sequence representation techniques and classification algorithms adopted.
RESULTS: In this paper, we first introduce two kinds of protein sequence encoding schemes: dipeptide information with space and Gapped k-mer information. Then, the Gapped k-mer calculation method, which is based on a quad-tree, is also introduced.
CONCLUSIONS: From the prediction results, this method not only reduces the dimension, but also improves the prediction precision of protein subcellular localization.
|
49
|
Bioinformatics Approaches to Gain Insights into cis-Regulatory Motifs Involved in mRNA Localization. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2019; 1203:165-194. [PMID: 31811635 DOI: 10.1007/978-3-030-31434-7_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/21/2022]
Abstract
Messenger RNA (mRNA) is a fundamental intermediate in the expression of proteins. As an integral part of this important process, protein production can be localized by the targeting of mRNA to a specific subcellular compartment. The subcellular destination of mRNA is suggested to be governed by a region of its primary sequence or secondary structure, which consequently dictates the recruitment of trans-acting factors, such as RNA-binding proteins or regulatory RNAs, to form a messenger ribonucleoprotein particle. This molecular ensemble is requisite for precise and spatiotemporal control of gene expression. In the context of RNA localization, the description of the binding preferences of an RNA-binding protein defines a motif, and one, or more, instance of a given motif is defined as a localization element (zip code). In this chapter, we first discuss the cis-regulatory motifs previously identified as mRNA localization elements. We then describe motif representation in terms of entropy and information content and offer an overview of motif databases and search algorithms. Finally, we provide an outline of the motif topology of asymmetrically localized mRNA molecules.
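The entropy and information-content view of a motif mentioned in this abstract can be illustrated for a single motif position. The Python sketch below assumes a four-letter alphabet with a uniform background and ignores small-sample corrections; the function names `column_entropy` and `information_content` are chosen here for illustration and do not come from the chapter.

```python
import math


def column_entropy(probs):
    """Shannon entropy (in bits) of one motif position.

    probs: base probabilities at that position, e.g. for A, C, G, U.
    Zero-probability bases contribute nothing to the sum.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_content(probs, alphabet_size=4):
    """Information content in bits: maximum entropy minus observed entropy.

    Assumes a uniform background over the alphabet (an illustrative choice).
    """
    return math.log2(alphabet_size) - column_entropy(probs)


if __name__ == "__main__":
    # A fully conserved position, a two-base position, and an uninformative one.
    for column in ([1.0, 0.0, 0.0, 0.0],
                   [0.5, 0.5, 0.0, 0.0],
                   [0.25, 0.25, 0.25, 0.25]):
        print(column, information_content(column))
```

Summing the per-position values over all columns of a position probability matrix gives the total information content of the motif, which is the quantity that sequence logos display as stack heights.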
|