1
Yuge CC, Hang ES, Mamtha MRN, Vishwakarma S, Wang S, Wang C, Le NQK. RNA-ModX: a multilabel prediction and interpretation framework for RNA modifications. Brief Bioinform 2024; 26:bbae688. [PMID: 39737566 DOI: 10.1093/bib/bbae688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2024] [Revised: 11/18/2024] [Accepted: 12/16/2024] [Indexed: 01/01/2025] Open
Abstract
Accurate prediction of RNA modifications has important implications for elucidating RNA function and mechanism, with potential applications in drug development. Here, RNA-ModX provides a predictive model for post-transcriptional RNA modifications, complemented by a user-friendly web application for future researchers. To achieve high accuracy, a range of machine learning models was systematically explored, including Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Transformer-based architectures. The model was tested on a dataset of RNA sequences, each 1001 nucleotides long, composed of the four fundamental nucleotides (A, C, G, U) and spanning 12 prevalent modification classes (m6A, m1A, m5C, m5U, m6Am, m7G, Ψ, I, Am, Cm, Gm, and Um). The LSTM model with 3-mer encoding achieved the highest accuracy. Local Interpretable Model-Agnostic Explanations (LIME) were used to interpret the results, improving the transparency of the model's predictions. Alongside model development, a web application was built with an interface for uploading RNA sequences; upon submission, the model runs in the backend and the predictions are presented to the user. This integration of predictive modeling with a user-centric interface is a step toward making RNA modification prediction accessible to the broader research community.
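The abstract names 3-mer encoding as the input representation for the best-performing LSTM but does not spell it out. The following is a minimal sketch of one common form of it, assuming overlapping 3-mers with stride 1 and an integer vocabulary in which index 0 is reserved for unknown tokens. The function names `build_kmer_vocab` and `encode_kmers` are hypothetical, and the scheme actually used by RNA-ModX may differ.

```python
from itertools import product


def build_kmer_vocab(k=3, alphabet="ACGU"):
    """Map every possible k-mer over the alphabet to an integer index.

    Index 0 is left free for k-mers that contain unexpected characters
    (an assumption made for this sketch).
    """
    return {"".join(p): i + 1 for i, p in enumerate(product(alphabet, repeat=k))}


def encode_kmers(seq, vocab, k=3):
    """Slide a window of width k over seq (stride 1) and look up each k-mer."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    return [vocab.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]


vocab = build_kmer_vocab()               # 4**3 = 64 possible 3-mers
tokens = encode_kmers("ACGUAC", vocab)   # ACG, CGU, GUA, UAC
print(tokens)
```

Under these assumptions, a 1001-nucleotide input yields 999 overlapping 3-mer tokens, which could then be fed to an embedding layer ahead of a recurrent model.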
Affiliation(s)
- Chelsea Chen Yuge
  - NUS-ISS, National University of Singapore, 25 Heng Mui Keng Terrace, 119615, Singapore, Singapore
- Ee Soon Hang
  - NUS-ISS, National University of Singapore, 25 Heng Mui Keng Terrace, 119615, Singapore, Singapore
- Shashikant Vishwakarma
  - NUS-ISS, National University of Singapore, 25 Heng Mui Keng Terrace, 119615, Singapore, Singapore
- Sijia Wang
  - NUS-ISS, National University of Singapore, 25 Heng Mui Keng Terrace, 119615, Singapore, Singapore
- Cheng Wang
  - Independent Researcher, Singapore, Singapore
- Nguyen Quoc Khanh Le
  - In-Service Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, 250 Wuxing Street, 110, Taipei, Taiwan
  - AIBioMed Research Group, Taipei Medical University, 250 Wuxing Street, 110, Taipei, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, 252 Wuxing Street, 110, Taipei, Taiwan
2
Yang Y, Li G, Pang K, Cao W, Zhang Z, Li X. Deciphering 3'UTR Mediated Gene Regulation Using Interpretable Deep Representation Learning. Adv Sci (Weinh) 2024; 11:e2407013. [PMID: 39159140 PMCID: PMC11497048 DOI: 10.1002/advs.202407013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2024] [Revised: 07/23/2024] [Indexed: 08/21/2024]
Abstract
The 3' untranslated regions (3'UTRs) of messenger RNAs contain many important cis-regulatory elements that are under functional and evolutionary constraints. It is hypothesized that these constraints are similar to the grammar and syntax of human languages and can be modeled by advanced natural language techniques such as Transformers, which have been very effective in modeling complex protein sequences and structures. Here 3UTRBERT is described, which implements an attention-based language model, i.e., Bidirectional Encoder Representations from Transformers (BERT). 3UTRBERT is pre-trained on aggregated 3'UTR sequences of human mRNAs in a task-agnostic manner; the pre-trained model is then fine-tuned for specific downstream tasks such as identifying RBP binding sites and m6A RNA modification sites, and predicting RNA sub-cellular localizations. Benchmark results show that 3UTRBERT generally outperformed other contemporary methods in each of these tasks. More importantly, the self-attention mechanism within 3UTRBERT allows direct visualization of the semantic relationship between sequence elements and effectively identifies regions with important regulatory potential. It is expected that the 3UTRBERT model can serve as a foundational tool for various sequence labeling tasks within the 3'UTR field, thus enhancing the decipherability of post-transcriptional regulatory mechanisms.
Affiliation(s)
- Yuning Yang
  - School of Information Science and Technology, Northeast Normal University, Changchun, Jilin 130117, China
- Gen Li
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
- Kuan Pang
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
- Wuxinhao Cao
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
- Zhaolei Zhang
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 3E1, Canada
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Changchun, Jilin 130012, China
3
Chen J, Zhang H, Zou Q, Liao B, Bi XA. Multi-kernel Learning Fusion Algorithm Based on RNN and GRU for ASD Diagnosis and Pathogenic Brain Region Extraction. Interdiscip Sci 2024; 16:755-768. [PMID: 38683281 DOI: 10.1007/s12539-024-00629-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Revised: 03/06/2024] [Accepted: 03/31/2024] [Indexed: 05/01/2024]
Abstract
Autism spectrum disorder (ASD) is a complex, severe disorder related to brain development. It impairs patients' language communication and social behaviors. In recent years, ASD research has focused on single-modal neuroimaging data, neglecting the complementarity between multi-modal data. This omission may lead to poor classification. Therefore, it is important to study multi-modal data of ASD to reveal its pathogenesis. Furthermore, recurrent neural networks (RNN) and gated recurrent units (GRU) are effective for sequence data processing. In this paper, we introduce a novel framework for a Multi-Kernel Learning Fusion algorithm based on RNN and GRU (MKLF-RAG). The framework utilizes RNN and GRU to provide feature selection for data of different modalities. These features are then fused by the MKLF algorithm to detect the pathological mechanisms of ASD and extract the most relevant Regions of Interest (ROIs) for the disease. The MKLF-RAG proposed in this paper has been tested in a variety of experiments with the Autism Brain Imaging Data Exchange (ABIDE) database. Experimental findings indicate that our framework notably enhances the classification accuracy for ASD. Compared with other methods, MKLF-RAG demonstrates superior efficacy across multiple evaluation metrics and could provide valuable insights into the early diagnosis of ASD.
Affiliation(s)
- Jie Chen
  - Key Laboratory of Data Science and Intelligence Education, Ministry of Education, Hainan Normal University, Haikou, 571126, China
  - College of Mathematics and Statistics, Hainan Normal University, Haikou, 571126, China
- Huilian Zhang
  - Key Laboratory of Data Science and Intelligence Education, Ministry of Education, Hainan Normal University, Haikou, 571126, China
  - College of Mathematics and Statistics, Hainan Normal University, Haikou, 571126, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Bo Liao
  - Key Laboratory of Data Science and Intelligence Education, Ministry of Education, Hainan Normal University, Haikou, 571126, China
  - College of Mathematics and Statistics, Hainan Normal University, Haikou, 571126, China
- Xia-An Bi
  - Key Laboratory of Data Science and Intelligence Education, Ministry of Education, Hainan Normal University, Haikou, 571126, China
  - College of Information Science and Engineering, Hunan Normal University, Changsha, 410081, China
  - College of Mathematics and Statistics, Hainan Normal University, Haikou, 571126, China
4
Zhao Y, Jin J, Gao W, Qiao J, Wei L. Moss-m7G: A Motif-Based Interpretable Deep Learning Method for RNA N7-Methlguanosine Site Prediction. J Chem Inf Model 2024; 64:6230-6240. [PMID: 39011571 DOI: 10.1021/acs.jcim.4c00802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
N7-methylguanosine (m7G) modification plays a crucial role in various biological processes and is closely associated with the development and progression of many cancers. Accurate identification of m7G modification sites is essential for understanding their regulatory mechanisms and advancing cancer therapy. Previous studies often suffered from insufficient research data, underutilization of motif information, and lack of interpretability. In this work, we designed a novel motif-based interpretable method for m7G modification site prediction, called Moss-m7G. This approach enables the analysis of RNA sequences from a motif-centric perspective. Our proposed word-detection module and motif-embedding module within Moss-m7G extract motif information from sequences, transforming the raw sequences from base level to motif level and generating embeddings for these motif sequences. Compared with base sequences, motif sequences contain richer contextual information, which is further analyzed and integrated through the Transformer model. To address the data insufficiency noted in prior research, we constructed a comprehensive m7G data set for training and testing. Our experimental results affirm the effectiveness and superiority of Moss-m7G in predicting m7G modification sites. Moreover, the introduction of the word-detection module enhances the interpretability of the model, providing insights into the predictive mechanisms.
Affiliation(s)
- Yanxi Zhao
  - School of Software, Shandong University, Jinan 250101, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
- Junru Jin
  - School of Software, Shandong University, Jinan 250101, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
- Wenjia Gao
  - School of Software, Shandong University, Jinan 250101, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
- Jianbo Qiao
  - School of Software, Shandong University, Jinan 250101, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
- Leyi Wei
  - School of Software, Shandong University, Jinan 250101, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
  - School of Informatics, Xiamen University, Xiamen 361104, China
5
Bortoletto E, Rosani U. Bioinformatics for Inosine: Tools and Approaches to Trace This Elusive RNA Modification. Genes (Basel) 2024; 15:996. [PMID: 39202357 PMCID: PMC11353476 DOI: 10.3390/genes15080996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2024] [Revised: 07/23/2024] [Accepted: 07/25/2024] [Indexed: 09/03/2024] Open
Abstract
Inosine is a nucleotide resulting from the deamination of adenosine in RNA. This chemical modification process, known as RNA editing, is typically mediated by a family of double-stranded RNA binding proteins named Adenosine Deaminase Acting on dsRNA (ADAR). While the presence of ADAR orthologs has been traced throughout the evolution of metazoans, the existence and extent of RNA editing have so far been characterized in a more limited number of animals. ADAR-mediated RNA editing plays a vital role in physiology, organismal development and disease, making an understanding of the evolutionary conservation of this phenomenon pivotal to a deep characterization of relevant biological processes. However, the lack of direct high-throughput methods to reveal RNA modifications at single-nucleotide resolution has limited extended investigation of RNA editing. Such methods have now been developed, and appropriate bioinformatic pipelines are required to fully exploit these data, which can complement existing approaches to detect ADAR editing. Here, we review the current literature on the "bioinformatics for inosine" subject and we discuss future research avenues in the field.
Affiliation(s)
- Umberto Rosani
  - Department of Biology, University of Padova, 35131 Padova, Italy
6
Yang S, Kim SH, Yang E, Kang M, Joo JY. Molecular insights into regulatory RNAs in the cellular machinery. Exp Mol Med 2024; 56:1235-1249. [PMID: 38871819 PMCID: PMC11263585 DOI: 10.1038/s12276-024-01239-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/18/2023] [Revised: 02/27/2024] [Accepted: 03/05/2024] [Indexed: 06/15/2024] Open
Abstract
It is apparent that various functional units within the cellular machinery are derived from RNAs. The evolution of sequencing techniques has resulted in significant insights into approaches for transcriptome studies. Organisms utilize RNA to govern cellular systems, and a heterogeneous class of RNAs is involved in regulatory functions. In particular, regulatory RNAs are increasingly recognized to participate in intricately functioning machinery across almost all levels of biological systems. These systems include those mediating chromatin arrangement, transcription, suborganelle stabilization, and posttranscriptional modifications. Any class of RNA exhibiting regulatory activity can be termed a class of regulatory RNA and is typically represented by noncoding RNAs, which constitute a substantial portion of the genome. These RNAs function based on the principle of structural changes through cis and/or trans regulation to facilitate mutual RNA‒RNA, RNA‒DNA, and RNA‒protein interactions. It has not been clearly elucidated whether regulatory RNAs identified through deep sequencing actually function in the anticipated mechanisms. This review addresses the dominant properties of regulatory RNAs at various layers of the cellular machinery and covers regulatory activities, structural dynamics, modifications, associated molecules, and further challenges related to therapeutics and deep learning.
Affiliation(s)
- Sumin Yang
  - Department of Pharmacy, College of Pharmacy, Hanyang University, Ansan, Gyeonggi-do, 15588, Republic of Korea
- Sung-Hyun Kim
  - Department of Pharmacy, College of Pharmacy, Hanyang University, Ansan, Gyeonggi-do, 15588, Republic of Korea
- Eunjeong Yang
  - Department of Pharmacy, College of Pharmacy, Hanyang University, Ansan, Gyeonggi-do, 15588, Republic of Korea
- Mingon Kang
  - Department of Computer Science, University of Nevada, Las Vegas, NV, 89154, USA
- Jae-Yeol Joo
  - Department of Pharmacy, College of Pharmacy, Hanyang University, Ansan, Gyeonggi-do, 15588, Republic of Korea
7
Li Y, Wei X, Yang Q, Xiong A, Li X, Zou Q, Cui F, Zhang Z. msBERT-Promoter: a multi-scale ensemble predictor based on BERT pre-trained model for the two-stage prediction of DNA promoters and their strengths. BMC Biol 2024; 22:126. [PMID: 38816885 PMCID: PMC11555825 DOI: 10.1186/s12915-024-01923-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2024] [Accepted: 05/21/2024] [Indexed: 06/01/2024] Open
Abstract
BACKGROUND: A promoter is a specific sequence in DNA that has transcriptional regulatory functions, playing a role in initiating gene expression. Identifying promoters and their strengths can provide valuable information related to human diseases. In recent years, computational methods have gained prominence as an effective means for identifying promoters, offering a more efficient alternative to labor-intensive biological approaches. RESULTS: In this study, a two-stage integrated predictor called "msBERT-Promoter" is proposed for identifying promoters and predicting their strengths. The model incorporates multi-scale sequence information through a tokenization strategy and fine-tunes the DNABERT model. Soft voting is then used to fuse the multi-scale information, effectively addressing the issue of insufficient DNA sequence information extraction in traditional models. To the best of our knowledge, this is the first time an integrated approach has been used in the DNABERT model for promoter identification and strength prediction. Our model achieves accuracy rates of 96.2% for promoter identification and 79.8% for promoter strength prediction, significantly outperforming existing methods. Furthermore, through attention mechanism analysis, we demonstrate that our model can effectively combine local and global sequence information, enhancing its interpretability. CONCLUSIONS: msBERT-Promoter provides an effective tool that successfully captures sequence-related attributes of DNA promoters and can accurately identify promoters and predict their strengths. This work paves a new path for the application of artificial intelligence in traditional biology.
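The abstract states that soft voting fuses the multi-scale information. As a rough illustration of soft voting in general (not the authors' implementation), the sketch below averages class-probability vectors from several models and picks the class with the highest mean. The function name `soft_vote` and the three probability vectors are hypothetical; equal weights per model are an assumption.

```python
def soft_vote(prob_lists):
    """Average per-class probabilities across models (equal weights assumed).

    prob_lists: one probability vector per model, all of the same length.
    Returns (mean_probabilities, index_of_winning_class).
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    mean = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    winner = max(range(n_classes), key=mean.__getitem__)
    return mean, winner


# Hypothetical [non-promoter, promoter] outputs from three models,
# e.g. each fine-tuned with a different tokenization scale.
probs = [[0.40, 0.60], [0.55, 0.45], [0.30, 0.70]]
mean, label = soft_vote(probs)
print(mean, label)
```

Unlike hard (majority) voting, this keeps each model's confidence in play, so one confident model can outweigh two marginal ones.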
Affiliation(s)
- Yazi Li
  - School of Mathematics and Statistics, Hainan University, Haikou, 570228, China
- Xiaoman Wei
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- Qinglin Yang
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- An Xiong
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- Xingfeng Li
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou, 570228, China
8
Qiu X, Wang H, Tan X, Fang Z. G-K BertDTA: A graph representation learning and semantic embedding-based framework for drug-target affinity prediction. Comput Biol Med 2024; 173:108376. [PMID: 38552281 DOI: 10.1016/j.compbiomed.2024.108376] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2023] [Revised: 03/21/2024] [Accepted: 03/24/2024] [Indexed: 04/17/2024]
Abstract
Developing new drugs is costly, time-consuming, and risky. Drug-target affinity (DTA), indicating the binding capability between drugs and target proteins, is a crucial indicator for drug development. Accurately predicting interaction strength between new drug-target pairs by analyzing previous experiments aids in screening potential drug molecules, repurposing them, and developing safe and effective medicines. Existing computational models for DTA prediction rely on strings or single-graph neural networks, lacking consideration of protein structure and molecular semantic information, leading to limited accuracy. Our experiments demonstrate that string-based methods may overlook protein conformations, causing a high root mean square error (RMSE) of 3.584 in affinity due to a lack of spatial context. Single-graph networks also underperform on topology features, with a 6% lower confidence interval (CI) for activity classification. The absence of semantic information also limits generalization across diverse compounds, resulting in an 18% increase in RMSE and a 5% increase in misclassifications in the quantification study, restricting potential drug discovery. To address these limitations, we propose G-K BertDTA, a novel framework for accurate DTA prediction incorporating protein features, molecular semantic features, and molecular structural information. In the proposed model, we represent drugs as graphs, with a graph isomorphism network (GIN) employed to learn the molecular topological information. For the extraction of protein structural features, we utilize a DenseNet architecture. A knowledge-based BERT semantic model is incorporated to obtain rich pre-trained semantic embeddings, thereby enhancing the feature information. We extensively evaluated our proposed approach on the publicly available benchmark datasets (i.e., KIBA and Davis), and experimental results demonstrate the promising performance of our method, which consistently outperforms previous state-of-the-art approaches.
Code is available at https://github.com/AmbitYuki/G-K-BertDTA.
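The abstract reports results as root mean square error (RMSE). For readers unfamiliar with the metric, here is a minimal sketch of its computation, RMSE = sqrt(mean((y - ŷ)²)). The affinity values are made up for illustration and are not taken from KIBA, Davis, or the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up binding affinities (illustrative only).
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]
print(rmse(measured, predicted))  # every error is 0.5, so RMSE is 0.5
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.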
Affiliation(s)
- Xihe Qiu
  - School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China
- Haoyu Wang
  - School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China
- Xiaoyu Tan
  - INF Technology (Shanghai) Co., Ltd., Shanghai, China
- Zhijun Fang
  - School of Computer Science and Technology, Donghua University, Shanghai, China
9
Wang D, Jin J, Li Z, Wang Y, Fan M, Liang S, Su R, Wei L. StructuralDPPIV: a novel deep learning model based on atom structure for predicting dipeptidyl peptidase-IV inhibitory peptides. Bioinformatics 2024; 40:btae057. [PMID: 38305458 PMCID: PMC10904144 DOI: 10.1093/bioinformatics/btae057] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/17/2023] [Revised: 12/07/2023] [Accepted: 01/30/2024] [Indexed: 02/03/2024] Open
Abstract
MOTIVATION: Diabetes is a chronic metabolic disorder that has been a major cause of blindness, kidney failure, heart attacks, stroke, and lower limb amputation across the world. To alleviate the impact of diabetes, researchers have developed the next generation of anti-diabetic drugs, known as dipeptidyl peptidase IV inhibitory peptides (DPP-IV-IPs). However, the discovery of these promising drugs has been restricted by the lack of effective peptide-mining tools. RESULTS: Here, we present StructuralDPPIV, a deep learning model designed for DPP-IV-IP identification, which takes advantage of both molecular graph features of amino acids and sequence information. Experimental results on the independent test dataset and two wet experiment datasets show that our model outperforms other state-of-the-art methods. Moreover, to better study what StructuralDPPIV learns, we used CAM technology and perturbation experiments to analyze our model, which yielded interpretable insights into the reasoning behind prediction results. AVAILABILITY AND IMPLEMENTATION: The project code is available at https://github.com/WeiLab-BioChem/Structural-DPP-IV.
Affiliation(s)
- Ding Wang
  - School of Software, Shandong University, Jinan 250101, China
- Junru Jin
  - School of Software, Shandong University, Jinan 250101, China
- Zhongshen Li
  - School of Software, Shandong University, Jinan 250101, China
- Yu Wang
  - School of Software, Shandong University, Jinan 250101, China
- Mushuang Fan
  - School of Software, Shandong University, Jinan 250101, China
- Sirui Liang
  - School of Software, Shandong University, Jinan 250101, China
- Ran Su
  - College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Leyi Wei
  - Faculty of Applied Sciences, Macao Polytechnic University, Macao 999078, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China