1
Tu G, Wang X, Xia R, Song B. m6A-TCPred: a web server to predict tissue-conserved human m6A sites using machine learning approach. BMC Bioinformatics 2024; 25:127. [PMID: 38528499 PMCID: PMC10962094 DOI: 10.1186/s12859-024-05738-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Accepted: 03/11/2024] [Indexed: 03/27/2024] Open
Abstract
BACKGROUND N6-methyladenosine (m6A) is the most prevalent post-transcriptional modification in eukaryotic cells. It plays a crucial role in regulating various biological processes, and dysregulation of m6A status is involved in multiple human diseases, including cancer. A number of prediction frameworks have been proposed for high-accuracy identification of putative m6A sites; however, none directly distinguishes tissue-conserved m6A-modified residues from non-conserved ones at base resolution. RESULTS We report here m6A-TCPred, a computational tool for predicting tissue-conserved m6A residues using m6A profiling data from 23 human tissues. By combining traditional sequence-based characteristics with additional genome-derived information, m6A-TCPred captured distinct patterns between potentially tissue-conserved m6A modifications and non-conserved ones, with an average AUROC of 0.871 on cross-validation and 0.879 on independent datasets. CONCLUSION Our results have been integrated into an online platform: a database holding 268,115 high-confidence m6A sites with their conservation information across 23 human tissues, and a web server to predict the conservation status of user-provided m6A collections. The web interface of m6A-TCPred is freely accessible at www.rnamd.org/m6ATCPred.
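The AUROC figures quoted in this abstract can be read as the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition in plain Python. The labels and scores are hypothetical illustration data, not values from m6A-TCPred, and the function name `auroc` is an assumption of this example.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    labels: iterable of 0/1 class labels (1 = positive site).
    scores: iterable of predicted scores, higher means more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores for three conserved (1) and three non-conserved (0) sites.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(auroc(labels, scores))  # 8 of the 9 positive/negative pairs are ordered correctly
```

The quadratic loop is fine for small examples; for large datasets a rank-based implementation gives the same value more efficiently.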
Affiliation(s)
- Gang Tu
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, 215123, China
- Xuan Wang
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, 215123, China.
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, L7 8TX, UK.
- Rong Xia
- Department of Financial and Actuarial Mathematics, Xi'an Jiaotong-Liverpool University, Suzhou, 215123, China
- Bowen Song
- Department of Public Health, School of Medicine, Nanjing University of Chinese Medicine, Nanjing, 210023, China
2
Liu L, Wei Y, Tan Z, Zhang Q, Sun J, Zhao Q. Predicting circRNA-RBP Binding Sites Using a Hybrid Deep Neural Network. Interdiscip Sci 2024:10.1007/s12539-024-00616-z. [PMID: 38381315 DOI: 10.1007/s12539-024-00616-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Revised: 01/26/2024] [Accepted: 01/29/2024] [Indexed: 02/22/2024]
Abstract
Circular RNAs (circRNAs) are non-coding RNAs generated by back-splicing. They are involved in biological processes and human diseases through interactions with specific RNA-binding proteins (RBPs). Because traditional biological experiments are costly, computational methods have been proposed to predict circRNA-RBP interactions. However, these methods rely on a single type of feature extraction. Therefore, we propose a novel model called circ-FHN, which utilizes only circRNA sequences to predict circRNA-RBP interactions. The circ-FHN approach involves feature coding and a hybrid deep learning model. Feature coding takes into account the physicochemical properties of circRNA sequences and employs four coding methods to extract sequence features. The hybrid deep structure comprises a convolutional neural network (CNN) and a bidirectional gated recurrent unit (BiGRU). The CNN learns high-level abstract features, while the BiGRU captures long-term dependencies in the sequence. To assess the effectiveness of circ-FHN, we compared it to other computational methods on 16 datasets and conducted ablation experiments. Additionally, we conducted motif analysis. The results demonstrate that circ-FHN surpasses the other methods. circ-FHN is freely available at https://github.com/zhaoqi106/circ-FHN .
Affiliation(s)
- Liwei Liu
- College of Science, Dalian Jiaotong University, Dalian, 116028, China
- Key Laboratory of Computational Science and Application of Hainan Province, Hainan Normal University, Haikou, 571158, China
- Yixin Wei
- College of Science, Dalian Jiaotong University, Dalian, 116028, China
- Zhebin Tan
- College of Software, Dalian Jiaotong University, Dalian, 116028, China
- Qi Zhang
- College of Science, Dalian Jiaotong University, Dalian, 116028, China
- Jianqiang Sun
- School of Information Science and Engineering, Linyi University, Linyi, 276000, China.
- Qi Zhao
- School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, 114051, China.
3
Jia J, Wei Z, Sun M. EMDL_m6Am: identifying N6,2'-O-dimethyladenosine sites based on stacking ensemble deep learning. BMC Bioinformatics 2023; 24:397. [PMID: 37880673 PMCID: PMC10598967 DOI: 10.1186/s12859-023-05543-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Accepted: 10/20/2023] [Indexed: 10/27/2023] Open
Abstract
BACKGROUND N6,2'-O-dimethyladenosine (m6Am) is an abundant RNA methylation modification on vertebrate mRNAs and is present in the transcription initiation region of mRNAs. It has recently been experimentally shown to be associated with several human disorders, including obesity and stomach cancer, among others. Correctly identifying m6Am sites is therefore an important step toward understanding their part in the regulation of RNA. RESULTS This study proposes a novel deep learning-based m6Am prediction model, EMDL_m6Am, which employs one-hot encoding to express the feature map of the RNA sequence and recognizes m6Am sites by integrating different CNN models via stacking, including DenseNet, an Inflated (dilated) Convolutional Network (DCNN) and a Deep Multiscale Residual Network (MSRN). The sensitivity (Sn), specificity (Sp), accuracy (ACC), Matthews correlation coefficient (MCC) and area under the curve (AUC) of the model on the training data set reach 86.62%, 88.94%, 87.78%, 0.7590 and 0.8778, respectively, and the corresponding results on the independent test set are 82.25%, 79.72%, 80.98%, 0.6199 and 0.8211. CONCLUSIONS The experimental results demonstrate that EMDL_m6Am improves the predictive performance for m6Am sites and could provide a valuable reference for follow-up studies. The source code and experimental data are available at: https://github.com/13133989982/EMDL-m6Am .
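One-hot encoding, the input representation named in this abstract, maps each nucleotide to a four-element indicator vector. The sketch below shows the general idea only. The channel order `ACGU`, the handling of unknown bases as all-zero rows, and the function name `one_hot` are assumptions of this example and are not taken from the EMDL_m6Am source code.

```python
ALPHABET = "ACGU"  # assumed channel order; the original model may use another


def one_hot(seq):
    """Encode an RNA sequence as a list of 4-element 0/1 rows.

    DNA-style 'T' is treated as 'U'. Any other symbol (for example 'N')
    becomes an all-zero row.
    """
    seq = seq.upper().replace("T", "U")
    return [[1 if base == a else 0 for a in ALPHABET] for base in seq]


# A short hypothetical fragment; real models typically encode a fixed-length
# window centred on the candidate site.
for row in one_hot("AUGN"):
    print(row)
```

The resulting L x 4 matrix (L being the window length) is what a convolutional network would consume as its feature map.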
Affiliation(s)
- Jianhua Jia
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China.
- Zhangying Wei
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China.
- Mingwei Sun
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
4
Wang Y, Tai S, Zhang S, Sheng N, Xie X. PromGER: Promoter Prediction Based on Graph Embedding and Ensemble Learning for Eukaryotic Sequence. Genes (Basel) 2023; 14:1441. [PMID: 37510345 PMCID: PMC10379012 DOI: 10.3390/genes14071441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Revised: 07/04/2023] [Accepted: 07/10/2023] [Indexed: 07/30/2023] Open
Abstract
Promoters are non-coding DNA regions around the transcription start site and are responsible for regulating the gene transcription process. Due to their key role in gene function and transcriptional activity, accurate prediction of promoter sequences and their core elements is a crucial research area in bioinformatics. At present, models based on machine learning and deep learning have been developed for promoter prediction. However, these models neither mine the deeper biological information of promoter sequences nor consider the complex relationships among promoter sequences. In this work, we propose a novel prediction model called PromGER to predict eukaryotic promoter sequences. Firstly, PromGER utilizes four types of feature-encoding methods to extract local information within promoter sequences. Secondly, according to the potential relationships among promoter sequences, the whole set of promoter sequences is constructed as a graph. Furthermore, graph-embedding methods at three different scales are applied to obtain the global feature information in the graph more comprehensively. Finally, combining local features with global features of sequences, PromGER analyzes and predicts promoter sequences through a tree-based ensemble-learning framework. Compared with seven existing methods, PromGER improved average specificity by 13%, accuracy by 10%, Matthews correlation coefficient by 16%, precision by 4%, F1 score by 6%, and AUC by 9%. In addition, this study interpreted PromGER using the t-distributed stochastic neighbor embedding (t-SNE) method and SHapley Additive exPlanations (SHAP) value analysis, which demonstrates the interpretability of the model.
Affiliation(s)
- Yan Wang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- School of Artificial Intelligence, Jilin University, Changchun 130012, China
- Shiwen Tai
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Shuangquan Zhang
- School of Cyber Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Nan Sheng
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Xuping Xie
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
5
Yao Y, Zhang S, Xue T. Integrating LASSO Feature Selection and Soft Voting Classifier to Identify Origins of Replication Sites. Curr Genomics 2022; 23:83-93. [PMID: 36778978 PMCID: PMC9878833 DOI: 10.2174/1389202923666220214122506] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/30/2021] [Revised: 12/11/2021] [Accepted: 01/18/2022] [Indexed: 11/22/2022] Open
Abstract
Background: DNA replication plays an indispensable role in the transmission of genetic information. It is considered to be the basis of biological inheritance and the most fundamental process in all biological life. Because DNA replication initiates at a special location, namely the origin of replication, accurate prediction of origins of replication sites (ORIs) is essential to gain insight into their relationship with gene expression. Objective: In this study, we have developed an efficient predictor called iORI-LAVT for ORI identification. Methods: This work extracts feature information from three aspects: mono-nucleotide encoding, k-mer, and ring-function-hydrogen-chemical properties. Subsequently, the least absolute shrinkage and selection operator (LASSO) is applied to select the optimal features. After comparing the results of different soft voting combinations, the soft voting classifier based on GaussianNB and Logistic Regression is employed as the final classifier. Results: Based on 10-fold cross-validation, the prediction accuracies on two benchmark datasets are 90.39% and 95.96%, respectively. On the independent dataset, our method achieves an accuracy of 91.3%. Conclusion: iORI-LAVT outperforms previous predictors and is a promising alternative for further research on identifying ORIs.
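Soft voting, the ensemble rule named in the Methods, averages the class probabilities produced by the individual classifiers and thresholds the average. The sketch below illustrates only that averaging step in plain Python. The probability lists stand in for the outputs of a GaussianNB and a Logistic Regression model and are hypothetical. The equal default weights and the 0.5 threshold are assumptions of this example rather than settings reported for iORI-LAVT.

```python
def soft_vote(prob_lists, weights=None, threshold=0.5):
    """Combine per-classifier positive-class probabilities by weighted averaging.

    prob_lists: one list per classifier, each holding P(positive) per sample.
    weights:    optional per-classifier weights (defaults to equal weights).
    Returns (averaged_probabilities, predicted_labels).
    """
    if weights is None:
        weights = [1.0] * len(prob_lists)
    total = sum(weights)
    n_samples = len(prob_lists[0])
    averaged = [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_samples)
    ]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical P(ORI) values from two base classifiers for three sequences.
nb_probs = [0.9, 0.4, 0.2]   # stand-in for GaussianNB output
lr_probs = [0.7, 0.7, 0.1]   # stand-in for Logistic Regression output
avg, labels = soft_vote([nb_probs, lr_probs])
print(avg, labels)
```

Note the second sample: one classifier votes negative (0.4) and the other positive (0.7), and the averaged probability decides the outcome, which is the difference from hard (majority) voting.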
Affiliation(s)
- Yingying Yao
- School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China
- Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China. Address correspondence to this author at the School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China; Tel/Fax: +86-29-88202860; E-mail:
- Tian Xue
- School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China
6
Zhang Y, Huang D, Wei Z, Chen K. Primary sequence-assisted prediction of m6A RNA methylation sites from Oxford nanopore direct RNA sequencing data. Methods 2022; 203:62-69. [DOI: 10.1016/j.ymeth.2022.04.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/10/2021] [Revised: 03/27/2022] [Accepted: 04/11/2022] [Indexed: 11/28/2022] Open
7
El Allali A, Elhamraoui Z, Daoud R. Machine learning applications in RNA modification sites prediction. Comput Struct Biotechnol J 2021; 19:5510-5524. [PMID: 34712397 PMCID: PMC8517552 DOI: 10.1016/j.csbj.2021.09.025] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 06/30/2021] [Revised: 09/24/2021] [Accepted: 09/25/2021] [Indexed: 12/15/2022] Open
Abstract
Ribonucleic acid (RNA) modifications are post-transcriptional chemical composition changes that have a fundamental role in regulating the main aspects of RNA function. Recently, large datasets have become available thanks to developments in deep sequencing and large-scale profiling. This availability of transcriptomic datasets has led to increased use of machine learning based approaches in epitranscriptomics, particularly in identifying RNA modifications. In this review, we comprehensively explore machine learning based approaches used for the prediction of 11 RNA modification types, namely, m1A, m6A, m5C, 5hmC, ψ, 2'-O-Me, ac4C, m7G, A-to-I, m2G, and D. This review covers the life cycle of machine learning methods to predict RNA modification sites including available benchmark datasets, feature extraction, and classification algorithms. We compare available methods in terms of datasets, target species, approach, and accuracy for each RNA modification type. Finally, we discuss the advantages and limitations of the reviewed approaches and suggest future perspectives.
Affiliation(s)
- A. El Allali
- African Genome Center, University Mohamed VI Polytechnic, Morocco
- Zahra Elhamraoui
- African Genome Center, University Mohamed VI Polytechnic, Morocco
- Rachid Daoud
- African Genome Center, University Mohamed VI Polytechnic, Morocco
8
Li HL, Pang YH, Liu B. BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models. Nucleic Acids Res 2021; 49:e129. [PMID: 34581805 PMCID: PMC8682797 DOI: 10.1093/nar/gkab829] [Citation(s) in RCA: 87] [Impact Index Per Article: 29.0] [Received: 07/18/2021] [Revised: 08/24/2021] [Accepted: 09/09/2021] [Indexed: 01/08/2023] Open
Abstract
In order to uncover the meanings of the ‘book of life’, 155 different biological language models (BLMs) for DNA, RNA and protein sequence analysis are discussed in this study, which are able to extract the linguistic properties of the ‘book of life’. We also extend the BLMs into a system called BioSeq-BLM for automatically representing and analyzing sequence data. Experimental results show that the predictors generated by BioSeq-BLM achieve comparable or even clearly better performance than the existing state-of-the-art predictors published in the literature, indicating that BioSeq-BLM will provide new approaches for biological sequence analysis based on natural language processing technologies and contribute to the development of this very important field. To help readers use BioSeq-BLM for their own experiments, the corresponding web server and stand-alone package have been established and released, and can be freely accessed at http://bliulab.net/BioSeq-BLM/.
Affiliation(s)
- Hong-Liang Li
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Yi-He Pang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
9
Wang Y, Guo R, Huang L, Yang S, Hu X, He K. m6AGE: A Predictor for N6-Methyladenosine Sites Identification Utilizing Sequence Characteristics and Graph Embedding-Based Geometrical Information. Front Genet 2021; 12:670852. [PMID: 34122525 PMCID: PMC8191635 DOI: 10.3389/fgene.2021.670852] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/22/2021] [Accepted: 04/29/2021] [Indexed: 11/30/2022] Open
Abstract
N6-methyladenosine (m6A) is one of the most prevalent RNA post-transcriptional modifications and is involved in various vital biological processes such as mRNA splicing, export, and stability. Identifying m6A sites contributes to understanding the functional mechanism and biological significance of m6A. The existing biological experimental methods for identifying m6A sites are time-consuming and costly. Thus, developing a high-confidence computational method is important for exploring the intrinsic characteristics of m6A. In this study, we propose a predictor called m6AGE which utilizes sequence-derived and graph embedding features. To the best of our knowledge, our predictor is the first to combine sequence-derived features and graph embeddings for m6A site prediction. Comparison results show that our proposed predictor achieved the best performance compared with other predictors on four public datasets across three species. On the A101 dataset, our predictor outperformed the compared predictors by 1.34% (accuracy), 0.0227 (Matthews correlation coefficient), 5.63% (specificity), and 0.0081 (AUC), which indicates that m6AGE is a useful tool for m6A site prediction. The source code of m6AGE is available at https://github.com/bokunoBike/m6AGE.
Affiliation(s)
- Yan Wang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, China
- School of Artificial Intelligence, Jilin University, Changchun, China
- Rui Guo
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, China
- Lan Huang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, China
- Sen Yang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, China
- Xuemei Hu
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, China
- Kai He
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, China
10
i6mA-VC: A Multi-Classifier Voting Method for the Computational Identification of DNA N6-methyladenine Sites. Interdiscip Sci 2021; 13:413-425. [PMID: 33834381 DOI: 10.1007/s12539-021-00429-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/19/2020] [Revised: 03/26/2021] [Accepted: 03/29/2021] [Indexed: 12/14/2022]
Abstract
DNA N6-methyladenine (6mA), an essential component of epigenetic modification, cannot be neglected in the genetic regulation mechanism. Efficient and accurate prediction of 6mA sites is beneficial to the development of biological genetics. Biochemical experimental methods are considered time-consuming and laborious. Most established machine learning methods rely on a single dataset. Although some of them have achieved cross-species prediction, their results are not satisfactory. Therefore, we designed a novel statistical model called i6mA-VC to improve the prediction accuracy for 6mA sites. On the one hand, k-mer and binary encoding are applied to extract features, and a gradient boosting decision tree (GBDT) embedded method is applied as the feature selection strategy. On the other hand, DNA sequences are represented as vectors through the ring-function-hydrogen-chemical properties (RFHCP) feature extraction method, with ExtraTree as the feature selection strategy. After fusing the two optimal feature sets, a voting classifier based on GBDT, light gradient boosting machine (LightGBM) and multilayer perceptron classifier (MLPC) is constructed for final classification and prediction. With five-fold cross-validation, the accuracies on the Rice and M. musculus datasets are 0.888 and 0.967, respectively. A cross-species dataset is used as the independent testing dataset, on which the accuracy reaches 0.848. These experiments indicate that the proposed predictor is convincing and applicable. The i6mA-VC predictor offers an effective way to recognize N6-methyladenine sites, and it will also help biological geneticists further study gene expression and DNA modification. In addition, an accessible web server for i6mA-VC is available from http://www.zhanglab.site/ .
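Two of the feature families mentioned here can be illustrated compactly: k-mer frequencies, and a per-nucleotide chemical-property code based on ring structure, functional group, and hydrogen bonding. The sketch below is a generic illustration in plain Python. The three-bit mapping is the commonly used nucleotide chemical property scheme and is an assumption of this example; the exact RFHCP construction in i6mA-VC may differ. The function names are hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGT"):
    """Normalised counts of every k-mer over the alphabet, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_windows = len(seq) - k + 1
    for i in range(n_windows):
        window = seq[i:i + k]
        if window in counts:          # windows with unknown symbols are skipped
            counts[window] += 1
    denom = max(n_windows, 1)
    return [counts[km] / denom for km in kmers]


# Assumed property bits: (ring: purine=1, functional group: amino=1,
# hydrogen bonding: weak=1).
CHEMICAL_PROPERTIES = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def chemical_property_encode(seq):
    """Concatenate the 3-bit property code of each nucleotide; unknowns map to zeros."""
    encoded = []
    for base in seq.upper():
        encoded.extend(CHEMICAL_PROPERTIES.get(base, (0, 0, 0)))
    return encoded


print(kmer_frequencies("AAAC", k=2)[:2])   # frequencies of 'AA' and 'AC'
print(chemical_property_encode("AG"))
```

In a pipeline like the one described, such vectors would be passed to a feature selection step before classification.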
11
Jiang J, Song B, Chen K, Lu Z, Rong R, Zhong Y, Meng J. m6AmPred: Identifying RNA N6,2'-O-dimethyladenosine (m6Am) sites based on sequence-derived information. Methods 2021; 203:328-334. [PMID: 33540081 DOI: 10.1016/j.ymeth.2021.01.007] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 12/28/2020] [Revised: 01/14/2021] [Accepted: 01/20/2021] [Indexed: 12/11/2022] Open
Abstract
N6,2'-O-dimethyladenosine (m6Am) is a reversible modification that occurs widely on various RNA molecules. The biological function of m6Am is not yet fully known, although recent studies have revealed its influence on cellular mRNA fate. Precise identification of m6Am sites on RNA is vital for understanding its biological functions. We present here m6AmPred, the first web server for in silico identification of m6Am sites from the primary sequences of RNA. Built upon the eXtreme Gradient Boosting with Dart algorithm (XgbDart) and the EIIP-PseEIIP encoding scheme, m6AmPred achieved promising prediction performance, with AUCs greater than 0.954 on both 10-fold cross-validation and independent testing datasets. To critically test and validate the performance of m6AmPred, the experimentally verified m6Am sites from two data sources were cross-validated. The m6AmPred web server is freely accessible at https://www.xjtlu.edu.cn/biologicalsciences/m6am, and it should be a useful tool for researchers who are interested in the N6,2'-O-dimethyladenosine RNA modification.
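The EIIP-PseEIIP scheme named above represents nucleotides by their electron-ion interaction pseudopotential. The sketch below illustrates the idea under stated assumptions: it uses the commonly cited EIIP values (A 0.1260, C 0.1340, G 0.0806, T/U 0.1335) and the usual PseEIIP definition, in which each of the 64 trinucleotide frequencies is weighted by the summed EIIP of its three bases. It is not taken from the m6AmPred implementation, and the function names are hypothetical.

```python
from itertools import product

# Commonly cited EIIP values; U is given the value usually listed for T.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}


def eiip_encode(seq):
    """Replace each nucleotide by its EIIP value (unknown symbols become 0.0)."""
    seq = seq.upper().replace("T", "U")
    return [EIIP.get(base, 0.0) for base in seq]


def pse_eiip(seq):
    """64-dimensional vector: trinucleotide frequency times summed EIIP of its bases."""
    seq = seq.upper().replace("T", "U")
    trimers = ["".join(p) for p in product("ACGU", repeat=3)]
    counts = dict.fromkeys(trimers, 0)
    n_windows = len(seq) - 2
    for i in range(n_windows):
        window = seq[i:i + 3]
        if window in counts:
            counts[window] += 1
    denom = max(n_windows, 1)
    return [
        (EIIP[t[0]] + EIIP[t[1]] + EIIP[t[2]]) * counts[t] / denom
        for t in trimers
    ]


print(eiip_encode("ACGU"))
print(len(pse_eiip("AUGGCAUCGA")))
```

The position-wise EIIP vector keeps the order of bases in the window, while the PseEIIP vector summarises composition, so the two are typically concatenated before being passed to a classifier.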
Affiliation(s)
- Jie Jiang
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China; Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L7 8TX Liverpool, United Kingdom
- Bowen Song
- Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China; Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L7 8TX Liverpool, United Kingdom.
- Kunqi Chen
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China; Institute of Ageing & Chronic Disease, University of Liverpool, L7 8TX Liverpool, United Kingdom
- Zhiliang Lu
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Rong Rong
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Yu Zhong
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Jia Meng
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China; AI University Research Centre, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China; Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L7 8TX Liverpool, United Kingdom
12
Chen X, Xiong Y, Liu Y, Chen Y, Bi S, Zhu X. m5CPred-SVM: a novel method for predicting m5C sites of RNA. BMC Bioinformatics 2020; 21:489. [PMID: 33126851 PMCID: PMC7602301 DOI: 10.1186/s12859-020-03828-4] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 06/30/2020] [Accepted: 10/21/2020] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND As one of the most common post-transcriptional modifications (PTCM) in RNA, 5-cytosine-methylation plays important roles in many biological functions such as RNA metabolism and cell fate decision. Through accurate identification of 5-methylcytosine (m5C) sites on RNA, researchers can better understand the exact role of 5-cytosine-methylation in these biological functions. In recent years, computational methods for predicting m5C sites have attracted considerable interest because of their efficiency and low cost. However, both the accuracy and efficiency of these methods are not yet satisfactory and need further improvement. RESULTS In this work, we have developed a new computational method, m5CPred-SVM, to identify m5C sites in three species: H. sapiens, M. musculus and A. thaliana. To build this model, we first collected benchmark datasets following three recently published methods. Then, six types of sequence-based features were generated from RNA segments, and the sequential forward feature selection strategy was used to obtain the optimal feature subset. After that, the performance of models based on different learning algorithms was compared, and the model based on the support vector machine provided the highest prediction accuracy. Finally, m5CPred-SVM was compared with several existing methods, and the results showed that it offered substantially higher prediction accuracy than previously published methods. It is expected that m5CPred-SVM can become a useful tool for accurate identification of m5C sites. CONCLUSION In this study, by introducing position-specific propensity related features, we built a new model, m5CPred-SVM, to predict RNA m5C sites of three different species. The results show that our model outperformed the existing state-of-the-art models. Our model is available for users through a web server at https://zhulab.ahu.edu.cn/m5CPred-SVM .
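Sequential forward feature selection, the strategy named in this abstract, starts from an empty set and greedily adds the feature that most improves a scoring function, stopping when no candidate helps. The sketch below shows that greedy loop in plain Python. The scoring function here is a hypothetical toy stand-in for what would normally be a cross-validated model score, and the function names are assumptions of this example.

```python
def sequential_forward_selection(n_features, score_fn, max_features=None):
    """Greedy forward selection over feature indices 0..n_features-1.

    score_fn takes a list of selected feature indices and returns a number
    (higher is better), for example a cross-validated accuracy.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(range(n_features))
    limit = max_features if max_features is not None else n_features
    while remaining and len(selected) < limit:
        # Score every candidate extension of the current subset.
        candidates = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best_score:
            break                      # no candidate improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Toy scoring function (hypothetical): each feature has a fixed gain, and
# every selected feature carries a small complexity penalty.
GAIN = {0: 0.30, 1: 0.50, 2: 0.05, 3: 0.00}


def toy_score(subset):
    return sum(GAIN[f] for f in subset) - 0.1 * len(subset)


print(sequential_forward_selection(4, toy_score))
```

With a real classifier, `score_fn` would retrain and evaluate the model on each candidate subset, which makes the procedure expensive but simple to reason about.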
Affiliation(s)
- Xiao Chen
- School of Sciences, Anhui Agricultural University, Hefei, 230036 Anhui China
- Yi Xiong
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240 China
- Yinbo Liu
- School of Sciences, Anhui Agricultural University, Hefei, 230036 Anhui China
- Yuqing Chen
- School of Sciences, Anhui Agricultural University, Hefei, 230036 Anhui China
- Shoudong Bi
- School of Sciences, Anhui Agricultural University, Hefei, 230036 Anhui China
- Xiaolei Zhu
- School of Sciences, Anhui Agricultural University, Hefei, 230036 Anhui China
13
Song B, Chen K, Tang Y, Ma J, Meng J, Wei Z. PSI-MOUSE: Predicting Mouse Pseudouridine Sites From Sequence and Genome-Derived Features. Evol Bioinform Online 2020; 16:1176934320925752. [PMID: 32565674 PMCID: PMC7285933 DOI: 10.1177/1176934320925752] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 01/22/2020] [Accepted: 03/30/2020] [Indexed: 12/04/2022] Open
Abstract
Pseudouridine (Ψ) is the first discovered and the most prevalent posttranscriptional modification, which has been widely studied during the past decades. Pseudouridine was observed in almost all kinds of RNAs and shown to have important biological functions. Currently, the time-consuming and high-cost procedures of experimental approaches limit its uses in real-life Ψ site detection. Alternatively, by taking advantage of the explosive growth of Ψ sequencing data, the computational methods may provide a more cost-effective avenue. To date, the existing mouse Ψ site predictors were all developed based on sequence-derived features, and their performance can be further improved by adding the domain knowledge derived feature. Therefore, it is highly desirable to propose a genomic feature-based computational method to increase the accuracy and efficiency of the identification of Ψ RNA modification in the mouse transcriptome. In our study, a predictive framework PSI-MOUSE was built. Besides the conventional sequence-based features, PSI-MOUSE first introduced 38 additional genomic features derived from the mouse genome, which achieved a satisfactory improvement in the prediction performance, compared with other existing models. Moreover, PSI-MOUSE also features in automatically annotating the putative Ψ sites with diverse types of posttranscriptional regulations (RNA-binding protein [RBP]-binding regions, miRNA-RNA interactions, and splicing sites), which can serve as a useful research tool for the study of Ψ RNA modification in the mouse genome. Finally, 3282 experimentally validated mouse Ψ sites were also collected in a database with customized query functions. For the convenience of academic users, a website was built to provide a user-friendly interface for the query and analysis on the database. The website is freely accessible at www.xjtlu.edu.cn/biologicalsciences/psimouse and http://psimouse.rnamd.com. 
We introduced the genome-derived features to mouse for the first time, and we achieved a good performance in mouse Ψ site prediction. Compared with the existing state-of-art methods, our newly developed approach PSI-MOUSE obtained a substantial improvement in prediction accuracy, marking the reliable contributions of genomic features for the prediction of RNA modifications in a species other than human.
Affiliation(s)
- Bowen Song
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, China
- Kunqi Chen
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, China
- Yujiao Tang
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, China
- Jialin Ma
- Cancer Genome Computational Analysis, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Jia Meng
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, China
- Zhen Wei
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, China
|
14
|
Song Y, Xu Q, Wei Z, Zhen D, Su J, Chen K, Meng J. Predict Epitranscriptome Targets and Regulatory Functions of N6-Methyladenosine (m6A) Writers and Erasers. Evol Bioinform Online 2019; 15:1176934319871290. [PMID: 31523126 PMCID: PMC6728658 DOI: 10.1177/1176934319871290] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/19/2019] [Accepted: 07/31/2019] [Indexed: 12/13/2022]
Abstract
Although many successful bioinformatics efforts have been reported in the epitranscriptomics field for N6-methyladenosine (m6A) site identification, none has focused on the substrate specificity of the different m6A-related enzymes, i.e., the methyltransferases (writers) and demethylases (erasers). In this work, to untangle the target specificity and regulatory functions of different RNA m6A writers (METTL3-METTL14 and METTL16) and erasers (ALKBH5 and FTO), we extracted 49 genomic features along with the conventional sequence features and used a random forest machine learning approach to predict their epitranscriptome substrates. Our method achieved reasonable performance on both writer target prediction (as high as 0.918) and eraser target prediction (as high as 0.888) in 5-fold cross-validation, and gene ontology analysis of their preferential targets further revealed the functional relevance of the different RNA methylation writers and erasers.
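The abstract does not name the metric behind the reported scores. Assuming it is an area under the ROC curve, the following sketch shows how such a score and a 5-fold split can be computed with the standard library alone. The function names are hypothetical, the classifier itself is left out, and nothing here reproduces the authors' pipeline.

```python
from typing import Iterator, List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_splits(n_samples: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k interleaved folds."""
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


# Toy check: three of the four positive/negative pairs are ranked correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In a full evaluation, a model would be trained on each `train` index set, scored on the matching `test` set, and the per-fold AUROC values averaged.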
Affiliation(s)
- Yiyou Song
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, China
- Qingru Xu
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, China
- Zhen Wei
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, China
- Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, China
- Di Zhen
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, China
- Jionglong Su
- Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, China
- Research Center for Precision Medicine, Xi’an Jiaotong-Liverpool University, Suzhou, China
- Kunqi Chen
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, China
- Institute of Ageing and Chronic Disease, University of Liverpool, Liverpool, UK
- Jia Meng
- Research Center for Precision Medicine, Xi’an Jiaotong-Liverpool University, Suzhou, China
- Institute of Integrative Biology, University of Liverpool, Liverpool, UK
|
15
|
Wei L, Su R, Luan S, Liao Z, Manavalan B, Zou Q, Shi X. Iterative feature representations improve N4-methylcytosine site prediction. Bioinformatics 2019; 35:4930-4937. [DOI: 10.1093/bioinformatics/btz408] [Citation(s) in RCA: 86] [Impact Index Per Article: 17.2] [Received: 01/29/2019] [Revised: 05/01/2019] [Accepted: 05/08/2019] [Indexed: 11/13/2022]
Abstract
Motivation
Accurate genome-wide identification of N4-methylcytosine (4mC) modifications can provide insights into their biological functions and mechanisms. Machine learning methods have recently become effective approaches for the computational identification of 4mC sites in genomes. Unfortunately, existing methods cannot achieve satisfactory performance, owing to the lack of effective DNA feature representations capable of capturing the characteristics of 4mC modifications.
Results
In this work, we developed a new predictor named 4mcPred-IFL to identify 4mC sites. To represent and capture discriminative features, we proposed an iterative feature representation algorithm that learns informative features from several sequential models in a supervised iterative mode. Our analysis showed that the feature representations learnt by our algorithm capture the discriminative distribution characteristics between 4mC sites and non-4mC sites, enlarging the decision margin between positives and negatives in feature space. Additionally, by evaluating and comparing our predictor with state-of-the-art predictors on benchmark datasets, we demonstrate that it identifies 4mC sites more accurately.
Availability and implementation
A user-friendly web server implementing the proposed 4mcPred-IFL is freely accessible at http://server.malab.cn/4mcPred-IFL.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Leyi Wei
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Ran Su
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Shasha Luan
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Zhijun Liao
- Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Fujian Medical University, Fujian, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Xiaolong Shi
- Institute of Computing Science & Technology, Guangzhou University, Guangzhou, China
|