1
|
Han K, Wang J, Chu Y, Liao Q, Ding Y, Zheng D, Wan J, Guo X, Zou Q. Deep learning based method for predicting DNA N6-methyladenosine sites. Methods 2024; 230:91-98. [PMID: 39097179 DOI: 10.1016/j.ymeth.2024.07.012] [Received: 06/23/2024] [Revised: 07/22/2024] [Accepted: 07/29/2024] [Indexed: 08/05/2024]
Abstract
DNA N6-methyladenine (6mA) plays an important role in many biological processes, and accurately identifying 6mA sites helps to understand its biological effects more comprehensively. Traditional experimental methods are very labor-intensive, and traditional machine learning methods appear insufficient as 6mA methylome databases grow progressively larger. We therefore propose a deep learning-based method, a multi-scale convolutional model based on global response normalization (CG6mA), to predict 6mA sites. We compared this method with other methods on three different kinds of benchmark datasets, and the results show that our model achieves better prediction performance.
Affiliation(s)
- Ke Han
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
| | - Jianchun Wang
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
| | - Ying Chu
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
| | - Qian Liao
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Dequan Zheng
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
| | - Jie Wan
- Laboratory for Space Environment and Physical Sciences, Harbin Institute of Technology, Harbin 150001, China
| | - Xiaoyi Guo
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China.
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China.
| |
|
2
|
Hou A, Luo H, Liu H, Luo L, Ding P. Multi-scale DNA language model improves 6 mA binding sites prediction. Comput Biol Chem 2024; 112:108129. [PMID: 39067351 DOI: 10.1016/j.compbiolchem.2024.108129] [Received: 03/04/2024] [Revised: 06/05/2024] [Accepted: 06/10/2024] [Indexed: 07/30/2024]
Abstract
DNA methylation at the N6 position of adenine (N6-methyladenine, 6mA), which refers to the attachment of a methyl group to the N6 site of adenine (A) in DNA, is an important epigenetic modification in prokaryotic and eukaryotic genomes. Accurately predicting 6mA binding sites can provide crucial insights into gene regulation, DNA repair, disease development, and related processes. Wet-lab experiments are commonly used to analyze 6mA binding sites, but they are costly and time-consuming. Various deep learning methods have therefore been widely used to predict 6mA binding sites recently. In this study, we develop a framework based on a multi-scale DNA language model, named "iDNA6mA-MDL". "iDNA6mA-MDL" integrates multiple k-mers and the nucleotide property and frequency method for feature embedding, which can capture a full range of DNA sequence context information. At the prediction stage, it also leverages DNABERT to compensate for the incomplete capture of global DNA information. Experiments show that our framework obtains an average AUC of 0.981 on a classic 6mA rice gene dataset, surpassing all existing advanced models under fivefold cross-validation. Moreover, "iDNA6mA-MDL" outperforms most popular state-of-the-art methods on 11 other 6mA datasets, demonstrating its effectiveness in 6mA binding site prediction.
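The "multiple k-mers" idea in the abstract above can be illustrated with a minimal sketch. It assumes overlapping k-mers with stride 1 and the k values 3 to 6; the function names `kmers` and `multi_scale_kmers` are hypothetical and are not taken from iDNA6mA-MDL, whose exact tokenization the abstract does not specify.

```python
from typing import Dict, List, Sequence


def kmers(sequence: str, k: int) -> List[str]:
    """Return the overlapping k-mers of a sequence (stride 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def multi_scale_kmers(sequence: str, ks: Sequence[int] = (3, 4, 5, 6)) -> Dict[int, List[str]]:
    """Tokenize the same sequence at several k values (assumed scales)."""
    return {k: kmers(sequence, k) for k in ks}


if __name__ == "__main__":
    print(multi_scale_kmers("ACGTAC", ks=(3, 4)))
```

Each k gives a different view of the same sequence context; how those views are embedded and combined is a modeling choice that this sketch leaves open.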
Affiliation(s)
- Anlin Hou
- School of Computer Science, University of South China, Hengyang 421001, China
| | - Hanyu Luo
- School of Computer Science, University of South China, Hengyang 421001, China
| | - Huan Liu
- School of Computer Science, University of South China, Hengyang 421001, China
| | - Lingyun Luo
- School of Computer Science, University of South China, Hengyang 421001, China.
| | - Pingjian Ding
- School of Computer Science, University of South China, Hengyang 421001, China
| |
|
3
|
Yin Z, Lyu J, Zhang G, Huang X, Ma Q, Jiang J. SoftVoting6mA: An improved ensemble-based method for predicting DNA N6-methyladenine sites in cross-species genomes. Math Biosci Eng 2024; 21:3798-3815. [PMID: 38549308 DOI: 10.3934/mbe.2024169] [Indexed: 04/02/2024]
Abstract
DNA N6-methyladenine (6mA) is an epigenetic modification that plays a pivotal role in biological processes encompassing gene expression, DNA replication, repair, and recombination. The precise identification of 6mA sites is therefore fundamental to better understanding its function, but remains challenging. We proposed an improved ensemble-based method for predicting DNA N6-methyladenine sites in cross-species genomes, called SoftVoting6mA. SoftVoting6mA selected four encodings (electron-ion-interaction pseudo potential, one-hot encoding, Kmer, and pseudo dinucleotide composition) from 15 types of encoding to represent DNA sequences by comparing their performance. Similarly, SoftVoting6mA combined four learning algorithms using the soft voting strategy. The 5-fold cross-validation and the independent tests showed that SoftVoting6mA reached state-of-the-art performance. To enhance accessibility, a user-friendly web server is provided at http://www.biolscience.cn/SoftVoting6mA/.
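The soft voting strategy mentioned in the abstract above averages the class probabilities produced by several classifiers and picks the class with the highest average. The sketch below is a generic illustration with made-up probabilities; the function name `soft_vote` and the optional weights are assumptions, not details of SoftVoting6mA.

```python
from typing import List, Optional, Sequence, Tuple


def soft_vote(
    probabilities: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> Tuple[int, List[float]]:
    """Average per-class probabilities from several classifiers.

    probabilities[m][c] is the probability model m assigns to class c.
    Returns (predicted class index, averaged class probabilities).
    """
    if not probabilities:
        raise ValueError("need at least one classifier output")
    if weights is None:
        weights = [1.0] * len(probabilities)  # plain (unweighted) soft voting
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=averaged.__getitem__)
    return label, averaged


if __name__ == "__main__":
    # Four hypothetical classifiers scoring one sequence as (non-6mA, 6mA).
    outputs = [[0.60, 0.40], [0.30, 0.70], [0.20, 0.80], [0.55, 0.45]]
    print(soft_vote(outputs))
```

In this made-up example a hard majority vote would tie at two votes each, while soft voting favors the 6mA class because the two confident classifiers outweigh the two marginal ones.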
Affiliation(s)
- Zhaoting Yin
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| | - Jianyi Lyu
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| | - Guiyang Zhang
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| | - Xiaohong Huang
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| | - Qinghua Ma
- College of Information Science and Engineering, Hohai University, Nanjing 210000, China
- Faculty of Information Technology, University of Jyvaskyla, Jyvaskyla, Finland
| | - Jinyun Jiang
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| |
|
4
|
Huang G, Huang X, Luo W. 6mA-StackingCV: an improved stacking ensemble model for predicting DNA N6-methyladenine site. BioData Min 2023; 16:34. [PMID: 38012796 PMCID: PMC10680251 DOI: 10.1186/s13040-023-00348-8] [Received: 08/13/2023] [Accepted: 11/04/2023] [Indexed: 11/29/2023]
Abstract
DNA N6-adenine methylation (N6-methyladenine, 6mA) plays a key regulatory role in cellular processes. Precisely recognizing 6mA sites is important for further exploring its biological functions. Although many computational methods for 6mA site prediction have been developed over the past decades, there is much room left for improvement. We presented a cross validation-based stacking ensemble model for 6mA site prediction, called 6mA-StackingCV. 6mA-StackingCV is a type of meta-learning algorithm that uses the output of cross validation as input to the final classifier. 6mA-StackingCV reached state-of-the-art performance in the Rosaceae independent test. Extensive tests demonstrated the stability and flexibility of 6mA-StackingCV. We implemented 6mA-StackingCV as a user-friendly web application, which allows one to selectively choose representations or learning algorithms. This application is freely available at http://www.biolscience.cn/6mA-stackingCV/ . The source code and experimental data are available at https://github.com/Xiaohong-source/6mA-stackingCV .
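The abstract above describes using the output of cross validation as input to the final classifier. A minimal sketch of that general idea follows: each sample is scored by a base model trained on the other folds, and those out-of-fold scores become the meta-features. The interleaved fold assignment, the toy centroid learner, and all names here are assumptions for illustration, not the 6mA-StackingCV implementation.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Model = Callable[[Vector], float]                     # feature vector -> score
Fitter = Callable[[List[Vector], List[int]], Model]   # training data -> model


def out_of_fold_scores(X: List[Vector], y: List[int], fit: Fitter, k: int = 5) -> List[float]:
    """Score every sample with a model that never saw it during training."""
    n = len(X)
    scores = [0.0] * n
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]  # assumed fold layout
        kept = [i for i in range(n) if i % k != fold]
        model = fit([X[i] for i in kept], [y[i] for i in kept])
        for i in held_out:
            scores[i] = model(X[i])
    return scores


def centroid_fitter(feature: int) -> Fitter:
    """Toy base learner: 1.0 if the chosen feature is nearer the positive-class mean."""
    def fit(X: List[Vector], y: List[int]) -> Model:
        pos = [x[feature] for x, label in zip(X, y) if label == 1]
        neg = [x[feature] for x, label in zip(X, y) if label == 0]
        pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
        return lambda x: 1.0 if abs(x[feature] - pos_mean) < abs(x[feature] - neg_mean) else 0.0
    return fit


def meta_features(X: List[Vector], y: List[int], fitters: Sequence[Fitter], k: int = 5) -> List[List[float]]:
    """One column of out-of-fold scores per base learner; rows align with X."""
    columns = [out_of_fold_scores(X, y, fit, k) for fit in fitters]
    return [list(row) for row in zip(*columns)]
```

The final classifier would then be trained on the rows returned by `meta_features` together with the original labels; because every score comes from a model that did not see the sample, the meta-features do not leak training labels.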
Affiliation(s)
- Guohua Huang
- School of Information Technology and Administration, Hunan University of Finance and Economics, Changsha, China.
- College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China.
| | - Xiaohong Huang
- College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
| | - Wei Luo
- College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
| |
|
5
|
Yu X, Hu J, Zhang Y. SNN6mA: Improved DNA N6-methyladenine site prediction using Siamese network-based feature embedding. Comput Biol Med 2023; 166:107533. [PMID: 37793205 DOI: 10.1016/j.compbiomed.2023.107533] [Received: 08/13/2023] [Revised: 09/01/2023] [Accepted: 09/27/2023] [Indexed: 10/06/2023]
Abstract
DNA N6-methyladenine (6mA) is one of the most common and abundant modifications and plays essential roles in various biological processes and cellular functions. The accurate identification of DNA 6mA sites is therefore of great importance for a better understanding of its regulatory mechanisms and biological functions. Although significant progress has been made, there is still room for further improvement in 6mA site prediction in DNA sequences. In this study, we report a smart but accurate 6mA predictor, termed SNN6mA, using a Siamese network. Specifically, DNA segments are first encoded into feature vectors using the one-hot encoding scheme; then, these original feature vectors are mapped to a low-dimensional embedding space derived from the Siamese network to capture more discriminative features; finally, the obtained low-dimensional features are fed to a fully connected neural network to perform the final prediction. Stringent benchmarking tests on datasets of two species demonstrated that the proposed SNN6mA is superior to state-of-the-art 6mA predictors. Detailed data analyses show that the major advantage of SNN6mA lies in its use of the Siamese network, which can map the original features into a low-dimensional embedding space with more discriminative capability. In summary, the proposed SNN6mA is the first attempt to use a Siamese network for 6mA site prediction and could be easily extended to predict other types of modifications. The codes and datasets used in the study are freely available at https://github.com/YuXuan-Glasgow/SNN6mA for academic use.
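The first step of the pipeline described above, one-hot encoding of DNA segments, can be sketched as follows. The column order A, C, G, T and the all-zero row for unknown bases such as N are assumptions made for this illustration; the abstract does not state the conventions SNN6mA uses.

```python
from typing import Dict, List, Tuple

# Assumed column order: A, C, G, T.
ONE_HOT: Dict[str, Tuple[int, int, int, int]] = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(segment: str) -> List[List[int]]:
    """Encode a DNA segment as a (length x 4) matrix of 0/1 values.

    Bases outside A/C/G/T (for example N) become an all-zero row,
    which is an assumption of this sketch.
    """
    return [list(ONE_HOT.get(base, (0, 0, 0, 0))) for base in segment.upper()]


if __name__ == "__main__":
    for row in one_hot_encode("ACGTN"):
        print(row)
```

A segment of length L therefore becomes an L x 4 matrix, which can be flattened or passed directly to a network as the original feature vector.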
Affiliation(s)
- Xuan Yu
- Glasgow College, University of Electronic Science and Technology of China, Chengdu, 611731, China
| | - Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
| | - Ying Zhang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| |
|
6
|
Sinha D, Dasmandal T, Paul K, Yeasin M, Bhattacharjee S, Murmu S, Mishra DC, Pal S, Rai A, Archak S. MethSemble-6mA: an ensemble-based 6mA prediction server and its application on promoter region of LBD gene family in Poaceae. Front Plant Sci 2023; 14:1256186. [PMID: 37877081 PMCID: PMC10591185 DOI: 10.3389/fpls.2023.1256186] [Received: 07/10/2023] [Accepted: 09/01/2023] [Indexed: 10/26/2023]
Abstract
The Lateral Organ Boundaries Domain (LBD)-containing genes are a set of plant-specific transcription factors that are crucial for controlling organ development and defense mechanisms as well as anthocyanin synthesis and nitrogen metabolism. It is imperative to understand how methylation regulates gene expression by predicting methylation sites in their promoters, particularly in major crop species. In this study, we developed a user-friendly prediction server for accurate prediction of 6mA sites by incorporating a robust feature set, viz., Binary Encoding of Mono-nucleotide DNA. Our model, MethSemble-6mA, outperformed other state-of-the-art tools in terms of accuracy (93.12%). Furthermore, we investigated the pattern of probable 6mA sites in the upstream promoter regions of the LBD-containing genes in Triticum aestivum and its allied species using the developed tool. On average, each selected species had four 6mA sites, and we found that with speciation and in the course of evolution in wheat, the frequency of methylation has decreased, while a few sites remain conserved. This points to gene birth and altered gene expression through methylation over time in a species and reflects functional conservation throughout evolution. Since DNA methylation is a vital event in almost all plant developmental processes (e.g., genomic imprinting and gametogenesis) along with other life processes, our findings on epigenetic regulation of LBD-containing genes have dynamic implications in basic and applied research. Additionally, MethSemble-6mA (http://cabgrid.res.in:5799/) will serve as a useful resource for plant breeders who are interested in pursuing epigenetic-based crop improvement research.
Affiliation(s)
- Dipro Sinha
- ICAR-Indian Agricultural Statistics Research Institute, Delhi, India
- Graduate School, ICAR-Indian Agricultural Research Institute, Delhi, India
| | - Tanwy Dasmandal
- ICAR-Indian Agricultural Statistics Research Institute, Delhi, India
- Graduate School, ICAR-Indian Agricultural Research Institute, Delhi, India
- ICAR-National Bureau of Fish Genetic Resources, Lucknow, India
| | - Krishnayan Paul
- Graduate School, ICAR-Indian Agricultural Research Institute, Delhi, India
- ICAR-National Institute for Plant Biotechnology, Delhi, India
| | - Md Yeasin
- ICAR-Indian Agricultural Statistics Research Institute, Delhi, India
| | - Sougata Bhattacharjee
- Graduate School, ICAR-Indian Agricultural Research Institute, Delhi, India
- ICAR-National Institute for Plant Biotechnology, Delhi, India
- ICAR-Indian Agricultural Research Institute, Hazaribagh, Jharkhand, India
| | - Sneha Murmu
- ICAR-Indian Agricultural Statistics Research Institute, Delhi, India
| | | | - Soumen Pal
- ICAR-Indian Agricultural Statistics Research Institute, Delhi, India
| | - Anil Rai
- Indian Council of Agricultural Research, Delhi, India
| | - Sunil Archak
- ICAR-National Bureau of Plant Genetic Resources, Delhi, India
| |
|
7
|
Hu J, Tang YX, Zhou Y, Li Z, Rao B, Zhang GJ. Improving DNA 6mA Site Prediction via Integrating Bidirectional Long Short-Term Memory, Convolutional Neural Network, and Self-Attention Mechanism. J Chem Inf Model 2023; 63:5689-5700. [PMID: 37603823 DOI: 10.1021/acs.jcim.3c00698] [Indexed: 08/23/2023]
Abstract
Identifying DNA N6-methyladenine (6mA) sites is important for understanding the function of DNA. Many deep learning-based methods have been developed to improve the performance of 6mA site prediction. In this study, to further improve the performance of 6mA site prediction, we propose a new meta method, called Co6mA, that integrates bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNNs), and self-attention mechanisms (SAM) by assembling two different deep learning-based models. The first model developed in this study is called CBi6mA, which is composed of CNN, BiLSTM, and fully connected modules. The second model is borrowed from LA6mA, an existing 6mA prediction method based on BiLSTM and SAM modules. Experimental results on two independent testing sets of different model organisms, i.e., Arabidopsis thaliana and Drosophila melanogaster, demonstrate that Co6mA achieves an average accuracy of 91.8%, covering 89% of all 6mA samples, and an average Matthews correlation coefficient of 0.839, which is higher than that of the second-best method, DeepM6A.
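The three quantities reported above (accuracy, the fraction of true 6mA samples recovered, and the Matthews correlation coefficient) are all functions of the binary confusion matrix. The sketch below uses the standard definitions; the counts in the usage example are made up for illustration and are not taken from the Co6mA paper.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true positive-class samples that were recovered."""
    return tp / (tp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when any marginal is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Hypothetical counts: 100 positive and 100 negative test samples.
    tp, fn, tn, fp = 89, 11, 95, 5
    print(accuracy(tp, tn, fp, fn), sensitivity(tp, fn), matthews_corrcoef(tp, tn, fp, fn))
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes are imbalanced.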
Affiliation(s)
- Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Yu-Xuan Tang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Yu Zhou
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Zhe Li
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Bing Rao
- School of Information and Electrical Engineering, Hangzhou City University, Hangzhou 310015, China
| | - Gui-Jun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| |
|
8
|
Yu X, Ren J, Cui Y, Zeng R, Long H, Ma C. DRSN4mCPred: accurately predicting sites of DNA N4-methylcytosine using deep residual shrinkage network for diagnosis and treatment of gastrointestinal cancer in the precision medicine era. Front Med (Lausanne) 2023; 10:1187430. [PMID: 37215722 PMCID: PMC10192687 DOI: 10.3389/fmed.2023.1187430] [Received: 03/16/2023] [Accepted: 04/05/2023] [Indexed: 05/24/2023]
Abstract
Introduction: DNA N4-methylcytosine (4mC) site levels were higher in patients with digestive system cancers, and the pathogenesis of digestive system cancers may also be related to changes in DNA 4mC levels. Identifying DNA 4mC sites is a very important step in studying biological function and in cancer prediction. Extracting accurate features from DNA sequences is the key to establishing an effective prediction model for DNA 4mC sites. This study sought to develop a new predictive model, DRSN4mCPred, which aimed to improve the performance of predicting DNA 4mC sites. Methods: The model adopted multi-scale channel attention to extract features and used attention feature fusion (AFF) to fuse features. To capture feature information more accurately and effectively, the model used a Deep Residual Shrinkage Network with Channel-Wise thresholds (DRSN-CW) to eliminate noise-related features and achieve a more precise feature representation, thereby distinguishing 4mC sites from non-4mC sites in DNA. Additionally, the predictive model incorporated an inverted residual block, a Multi-scale Channel Attention Module (MS-CAM), a Bi-directional Long Short Term Memory Network (Bi-LSTM), AFF, and DRSN-CW. Results and Discussion: The results indicated that the predictive model DRSN4mCPred performed extremely well in predicting DNA 4mC sites across different species. This paper may provide support for artificial intelligence-based diagnosis and treatment of gastrointestinal cancer in the precision medicine era.
Affiliation(s)
- Xia Yu
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Jia Ren
- Industrial Design School, Shandong University of ART and Design, Jinan, Shandong, China
| | - Yani Cui
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| | - Rao Zeng
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Haixia Long
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Cuihua Ma
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| |
|
9
|
Fan XQ, Lin B, Hu J, Guo ZY. I-DNAN6mA: Accurate Identification of DNA N6-Methyladenine Sites Using the Base-Pairing Map and Deep Learning. J Chem Inf Model 2023; 63:1076-1086. [PMID: 36722621 DOI: 10.1021/acs.jcim.2c01465] [Indexed: 02/02/2023]
Abstract
The recent discovery of numerous DNA N6-methyladenine (6mA) sites has transformed our perception of the roles of 6mA in living organisms. However, our ability to understand them is hampered by our inability to identify 6mA sites rapidly and cost-efficiently with existing experimental methods. Developing a novel method to quickly and accurately identify 6mA sites is critical for speeding up progress in detecting and understanding its function. In this study, we propose a novel computational method, called I-DNAN6mA, to identify 6mA sites and complement experimental methods by leveraging the base-pairing rules and a well-designed three-stage deep learning model with pairwise inputs. The performance of our proposed method is benchmarked and evaluated on four species, i.e., Arabidopsis thaliana, Drosophila melanogaster, Rice, and Rosaceae. The experimental results demonstrate that I-DNAN6mA achieves area under the receiver operating characteristic curve values of 0.967, 0.963, 0.947, 0.976, and 0.990, accuracies of 91.5, 92.7, 88.2, 93.8, and 96.2%, and Matthews correlation coefficient values of 0.855, 0.831, 0.763, 0.877, and 0.924 on five benchmark data sets, respectively, and outperforms several existing state-of-the-art methods. To our knowledge, I-DNAN6mA is the first approach to identify 6mA sites using a novel image-like representation of DNA sequences and a deep learning model with pairwise inputs. I-DNAN6mA is expected to be useful for locating functional regions of DNA.
Affiliation(s)
- Xue-Qiang Fan
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China
| | - Bing Lin
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China
| | - Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Zhong-Yi Guo
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China
| |
|
10
|
Su W, Xie XQ, Liu XW, Gao D, Ma CY, Zulfiqar H, Yang H, Lin H, Yu XL, Li YW. iRNA-ac4C: A novel computational method for effectively detecting N4-acetylcytidine sites in human mRNA. Int J Biol Macromol 2023; 227:1174-1181. [PMID: 36470433 DOI: 10.1016/j.ijbiomac.2022.11.299] [Received: 09/22/2022] [Revised: 11/10/2022] [Accepted: 11/25/2022] [Indexed: 12/07/2022]
Abstract
RNA N4-acetylcytidine (ac4C) is the acetylation of cytidine at the nitrogen-4 position; it is a highly conserved RNA modification and is involved in a variety of biological processes. Hence, accurate identification of genome-wide ac4C sites is vital for understanding the regulatory mechanisms of gene expression. In this work, a novel predictor, named iRNA-ac4C, was established to identify ac4C sites in human mRNA based on three feature extraction methods: nucleotide composition, nucleotide chemical property, and accumulated nucleotide frequency. Subsequently, minimum-Redundancy-Maximum-Relevance combined with incremental feature selection was used to select the optimal feature subset. Using the optimal feature subset, the best ac4C classification model was trained with a gradient boosting decision tree and 10-fold cross-validation. The results on the independent testing set indicated that our proposed method has encouraging generalization capability. For the convenience of other researchers, we established a user-friendly web server, which is freely available at http://lin-group.cn/server/iRNA-ac4C/. We hope that the tool can provide guidance for wet-lab researchers.
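Two of the feature extraction methods named above, nucleotide chemical property and accumulated nucleotide frequency, are often combined into a four-value code per position. The sketch below follows the convention commonly used in the literature (ring structure, functional group, hydrogen bonding, then the running frequency of the current base); whether iRNA-ac4C uses exactly this layout is an assumption, and the name `ncp_anf_encode` is hypothetical.

```python
from typing import Dict, List, Tuple

# Commonly used chemical-property triplets: (purine?, amino group?, weak H-bond?).
NCP: Dict[str, Tuple[int, int, int]] = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
    "T": (0, 0, 1),  # DNA counterpart of U, included for convenience
}


def ncp_anf_encode(sequence: str) -> List[List[float]]:
    """Encode each position as three chemical-property bits plus the
    accumulated frequency of that base in the prefix ending there."""
    counts: Dict[str, int] = {}
    rows: List[List[float]] = []
    for position, base in enumerate(sequence.upper(), start=1):
        counts[base] = counts.get(base, 0) + 1
        density = counts[base] / position  # occurrences so far / prefix length
        rows.append([*NCP.get(base, (0, 0, 0)), density])
    return rows


if __name__ == "__main__":
    for row in ncp_anf_encode("ACAU"):
        print(row)
```

The density term lets the encoding carry positional context: the same base receives a different fourth value depending on how often it has appeared earlier in the sequence.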
Affiliation(s)
- Wei Su
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Xue-Qin Xie
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Xiao-Wei Liu
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Dong Gao
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Cai-Yi Ma
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Hasan Zulfiqar
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Hui Yang
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Hao Lin
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.
| | - Xiao-Long Yu
- School of Materials Science and Engineering, Hainan University, Haikou 570228, China.
| | - Yan-Wen Li
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China; Key Laboratory of Intelligent Information Processing of Jilin Province, Northeast Normal University, Changchun 130117, China; Institute of Computational Biology, Northeast Normal University, Changchun 130117, China.
| |
|
11
|
Han K, Wang J, Wang Y, Zhang L, Yu M, Xie F, Zheng D, Xu Y, Ding Y, Wan J. A review of methods for predicting DNA N6-methyladenine sites. Brief Bioinform 2023; 24:6887111. [PMID: 36502371 DOI: 10.1093/bib/bbac514] [Received: 08/18/2022] [Revised: 10/07/2022] [Accepted: 10/27/2022] [Indexed: 12/14/2022]
Abstract
Deoxyribonucleic acid (DNA) N6-methyladenine (6mA) plays a vital role in various biological processes, and the accurate identification of 6mA sites can provide a more comprehensive understanding of its biological effects. There are several methods for 6mA site prediction. With the continuous development of technology, traditional techniques with high costs and low efficiency are gradually being replaced by computational methods. Widely used computational methods can be divided into two categories: traditional machine learning and deep learning methods. We first list some existing experimental methods for identifying 6mA sites, then analyze the general process from sequence input to results in computational methods and review existing model architectures. Finally, the results are summarized and compared to help subsequent researchers choose the most suitable method for their work.
Affiliation(s)
- Ke Han
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
- College of Pharmacy, Harbin University of Commerce, Harbin, 150076, China
| | - Jianchun Wang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yu Wang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Lei Zhang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Mengyao Yu
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Fang Xie
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Dequan Zheng
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yaoqun Xu
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
| | - Jie Wan
- Laboratory for Space Environment and Physical Sciences, Harbin Institute of Technology, Harbin, 150001, China
| |
|
12
|
Luo Z, Lou L, Qiu W, Xu Z, Xiao X. Predicting N6-Methyladenosine Sites in Multiple Tissues of Mammals through Ensemble Deep Learning. Int J Mol Sci 2022; 23:15490. [PMID: 36555143 PMCID: PMC9778682 DOI: 10.3390/ijms232415490] [Received: 11/03/2022] [Revised: 12/03/2022] [Accepted: 12/05/2022] [Indexed: 12/13/2022]
Abstract
N6-methyladenosine (m6A) is the most abundant modification of eukaryotic messenger RNA and plays an essential regulatory role in the control of cellular functions and gene expression. However, it remains an outstanding challenge to detect mRNA m6A transcriptome-wide at base resolution via experimental approaches, which are generally time-consuming and expensive. Developing computational methods is a good strategy for accurate in silico detection of m6A modification sites from the large amount of RNA sequence data. Unfortunately, the existing computational models usually predict m6A sites in a single species only, without considering the tissue level, and most of them are built on low-confidence data generated by m6A antibody immunoprecipitation (IP)-based sequencing, which restricts the reliability and generalizability of the proposed models. Here, we review recent advances in computational prediction of m6A sites and construct a new computational approach named im6APred using ensemble deep learning to accurately identify m6A sites based on high-confidence data in multiple tissues of mammals. Our model im6APred builds upon a comprehensive evaluation of multiple classification methods, including four traditional classification algorithms, three deep learning methods, and their ensembles. The optimal base-classifier combinations are then chosen by five-fold cross-validation to obtain an effective stacked model. im6APred achieves an area under the receiver operating characteristic curve (AUROC) in the range of 0.82-0.91 on independent tests, indicating that the model can learn general methylation rules on RNA bases and generalize to transcriptome-wide m6A identification. Moreover, AUROCs in the range of 0.77-0.96 were achieved using cross-species/tissues validation on the benchmark dataset, demonstrating differences in predictive performance at the tissue level and the need for constructing tissue-specific models for m6A site prediction.
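As a reading aid for the AUROC figures quoted in this abstract: AUROC can be interpreted as the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative site. The following is a minimal sketch of that pairwise definition in plain Python. The function name `auroc` and the toy labels and scores are assumptions made for illustration; none of this is taken from the im6APred code or data.

```python
def auroc(labels, scores):
    """Pairwise AUROC: fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical predictions for four candidate sites (1 = modified site).
    toy_labels = [1, 1, 0, 0]
    toy_scores = [0.9, 0.4, 0.5, 0.1]
    print(auroc(toy_labels, toy_scores))
```

The quadratic loop is chosen for readability; a rank-based formulation gives the same value faster on large test sets.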
Affiliation(s)
- Zhaochun Xu
- Computer Department, Jingdezhen Ceramic University, Jingdezhen 333403, China
- Xuan Xiao
- Computer Department, Jingdezhen Ceramic University, Jingdezhen 333403, China
13
Zhou J, Wang X, Wei Z, Meng J, Huang D. 4acCPred: Weakly supervised prediction of N4-acetyldeoxycytosine DNA modification from sequences. Mol Ther Nucleic Acids 2022; 30:337-345. [DOI: 10.1016/j.omtn.2022.10.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2022] [Accepted: 10/12/2022] [Indexed: 11/06/2022]
14
Rehman MU, Tayara H, Zou Q, Chong KT. i6mA-Caps: a CapsuleNet-based framework for identifying DNA N6-methyladenine sites. Bioinformatics 2022; 38:3885-3891. [PMID: 35771648 DOI: 10.1093/bioinformatics/btac434] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 02/14/2022] [Revised: 05/19/2022] [Accepted: 06/28/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION: DNA N6-methyladenine (6mA) has recently been demonstrated to have an essential function in epigenetic modification in eukaryotic species. 6mA has been linked to various biological processes. It is critical to create a new algorithm that can rapidly and reliably detect 6mA sites in genomes to investigate their biological roles. The identification of 6mA marks in the genome is the first and most important step in understanding the underlying molecular processes, as well as their regulatory functions. RESULTS: In this article, we proposed a novel computational tool called i6mA-Caps, a CapsuleNet-based framework for identifying DNA N6-methyladenine sites. The proposed framework uses a single encoding scheme for numerical representation of the DNA sequence. The numerical data is then used by a set of convolution layers to extract low-level features. These features are then used by the capsule network to extract intermediate-level and later high-level features to classify the 6mA sites. The proposed network is evaluated on three datasets belonging to three genomes: Rosaceae, Rice and Arabidopsis thaliana. The proposed method attained an accuracy of 96.71%, 94% and 86.83% on the independent Rosaceae, Rice and A. thaliana datasets, respectively. The proposed framework exhibited improved results when compared with the existing top-of-the-line methods. AVAILABILITY AND IMPLEMENTATION: A user-friendly web server is made available for biological experts and can be accessed at: http://nsclbio.jbnu.ac.kr/tools/i6mA-Caps/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mobeen Ur Rehman
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
15
ENet-6mA: Identification of 6mA Modification Sites in Plant Genomes Using ElasticNet and Neural Networks. Int J Mol Sci 2022; 23:8314. [PMID: 35955447 PMCID: PMC9369089 DOI: 10.3390/ijms23158314] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 07/22/2022] [Accepted: 07/24/2022] [Indexed: 02/01/2023]
Abstract
N6-methyladenine (6mA) has been recognized as a key epigenetic alteration that affects a variety of biological activities. Precise prediction of 6mA modification sites is essential for understanding the logical consistency of biological activity. There are various experimental methods for identifying 6mA modification sites, but in silico prediction has emerged as a potential option due to the very high cost and labor-intensive nature of experimental procedures. Taking this into consideration, developing an efficient and accurate model for identifying N6-methyladenine is one of the top objectives in the field of bioinformatics. Therefore, we have created an in silico model for the classification of 6mA modifications in plant genomes. ENet-6mA uses three encoding methods, including one-hot, nucleotide chemical properties (NCP), and electron–ion interaction potential (EIIP), which are concatenated and fed as input to ElasticNet for feature reduction, and then the optimized features are given directly to the neural network to get classified. We used a benchmark dataset of rice for five-fold cross-validation testing and three other datasets from plant genomes for cross-species testing purposes. The results show that the model can predict the N6-methyladenine sites very well, even cross-species. Additionally, we separated the datasets into different ratios and calculated the performance using the area under the precision–recall curve (AUPRC), achieving 0.81, 0.79, and 0.50 with 1:10 (positive:negative) samples for F. vesca, R. chinensis, and A. thaliana, respectively.
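The three encodings named in this abstract (one-hot, NCP, EIIP) can be illustrated with a short sketch. It assumes the NCP triples and EIIP values commonly used in the sequence-encoding literature and concatenates the three codes position by position. The lookup tables, the function name `encode_sequence`, and the concatenation order are assumptions for illustration, not the ENet-6mA implementation; the ElasticNet feature-reduction step is not shown.

```python
# Per-nucleotide lookup tables. The NCP and EIIP values below are the ones
# commonly used in the literature; they are assumed here, not taken from the
# ENet-6mA source code.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}
# NCP triple: (ring structure, functional group, hydrogen bonding)
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def encode_sequence(seq):
    """Return a flat feature list with 4 + 3 + 1 = 8 values per nucleotide."""
    features = []
    for base in seq.upper():
        if base not in ONE_HOT:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        features.extend(ONE_HOT[base])  # 4 one-hot values
        features.extend(NCP[base])      # 3 chemical-property flags
        features.append(EIIP[base])     # 1 EIIP value
    return features


if __name__ == "__main__":
    vec = encode_sequence("ACGT")
    print(len(vec))  # 8 features per base
```

Under these assumptions a 41-nt window would yield 328 input features before any feature selection.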
16
Hesami M, Alizadeh M, Jones AMP, Torkamaneh D. Machine learning: its challenges and opportunities in plant system biology. Appl Microbiol Biotechnol 2022; 106:3507-3530. [PMID: 35575915 DOI: 10.1007/s00253-022-11963-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 03/14/2022] [Revised: 03/14/2022] [Accepted: 05/07/2022] [Indexed: 12/25/2022]
Abstract
Sequencing technologies are evolving at a rapid pace, enabling the generation of massive amounts of data in multiple dimensions (e.g., genomics, epigenomics, transcriptomic, metabolomics, proteomics, and single-cell omics) in plants. To provide comprehensive insights into the complexity of plant biological systems, it is important to integrate different omics datasets. Although recent advances in computational analytical pipelines have enabled efficient and high-quality exploration and exploitation of single omics data, the integration of multidimensional, heterogenous, and large datasets (i.e., multi-omics) remains a challenge. In this regard, machine learning (ML) offers promising approaches to integrate large datasets and to recognize fine-grained patterns and relationships. Nevertheless, they require rigorous optimizations to process multi-omics-derived datasets. In this review, we discuss the main concepts of machine learning as well as the key challenges and solutions related to the big data derived from plant system biology. We also provide in-depth insight into the principles of data integration using ML, as well as challenges and opportunities in different contexts including multi-omics, single-cell omics, protein function, and protein-protein interaction. KEY POINTS: • The key challenges and solutions related to the big data derived from plant system biology have been highlighted. • Different methods of data integration have been discussed. • Challenges and opportunities of the application of machine learning in plant system biology have been highlighted and discussed.
Affiliation(s)
- Mohsen Hesami
- Department of Plant Agriculture, University of Guelph, Guelph, ON, N1G 2W1, Canada
- Milad Alizadeh
- Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, G1V 0A6, Canada; Institut de Biologie Intégrative Et Des Systèmes (IBIS), Université Laval, Québec City, QC, G1V 0A6, Canada.
17
Hasan MM, Tsukiyama S, Cho JY, Kurata H, Alam MA, Liu X, Manavalan B, Deng HW. Deepm5C: A deep learning-based hybrid framework for identifying human RNA N5-methylcytosine sites using a stacking strategy. Mol Ther 2022; 30:2856-2867. [PMID: 35526094 PMCID: PMC9372321 DOI: 10.1016/j.ymthe.2022.05.001] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Received: 10/09/2021] [Revised: 04/25/2022] [Accepted: 05/03/2022] [Indexed: 11/30/2022]
Abstract
As one of the most prevalent post-transcriptional epigenetic modifications, N5-methylcytosine (m5C) plays an essential role in various cellular processes and disease pathogenesis. Therefore, it is important to accurately identify m5C modifications in order to gain a deeper understanding of cellular processes and other possible functional mechanisms. Although a few computational methods have been proposed, their respective models have been developed using small training datasets. Hence, their practical application in genome-wide detection is quite limited. To overcome the existing limitations, we propose Deepm5C, a bioinformatics method to identify RNA m5C sites throughout the human genome. To develop Deepm5C, we constructed a novel benchmarking dataset and investigated a mixture of three conventional feature encoding algorithms and a feature derived from word embedding approaches. Afterwards, four variants of deep learning classifiers and four commonly used conventional classifiers were employed and trained with the four encodings, ultimately obtaining 32 baseline models. A stacking strategy is then applied by integrating the predicted outputs of the optimal baseline models and training a 1-D convolutional neural network on them. As a result, the Deepm5C predictor achieved excellent performance during cross-validation with a Matthews correlation coefficient and accuracy of 0.697 and 0.855, respectively. The corresponding metrics during the independent test were 0.691 and 0.852, respectively. Overall, Deepm5C achieved a more accurate and stable performance than the baseline models and significantly outperformed the existing predictors, demonstrating the effectiveness of our proposed hybrid framework. Furthermore, Deepm5C is expected to assist community-wide efforts in identifying putative m5Cs and formulating novel testable biological hypotheses.
Affiliation(s)
- Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA.
- Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Jae Youl Cho
- Molecular Immunology Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Korea
- Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Md Ashad Alam
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA
- Xiaowen Liu
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA
- Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Korea.
- Hong-Wen Deng
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA.
18
Yu B, Zhang Y, Wang X, Gao H, Sun J, Gao X. Identification of DNA modification sites based on elastic net and bidirectional gated recurrent unit with convolutional neural network. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2022.103566] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 02/06/2023]
19
Tang X, Zheng P, Li X, Wu H, Wei DQ, Liu Y, Huang G. Deep6mAPred: A CNN and Bi-LSTM-based deep learning method for predicting DNA N6-methyladenosine sites across plant species. Methods 2022; 204:142-150. [PMID: 35477057 DOI: 10.1016/j.ymeth.2022.04.011] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 03/18/2022] [Revised: 04/16/2022] [Accepted: 04/20/2022] [Indexed: 12/11/2022]
Abstract
DNA N6-methyladenine (6mA) is a key DNA modification, which plays versatile roles in the cellular processes, including regulation of gene expression, DNA repair, and DNA replication. DNA 6mA is closely associated with many diseases in the mammals and with growth as well as development of plants. Precisely detecting DNA 6mA sites is of great importance to exploration of 6mA functions. Although many computational methods have been presented for DNA 6mA prediction, there is still a wide gap in the practical application. We presented a convolution neural network (CNN) and bi-directional long-short term memory (Bi-LSTM)-based deep learning method (Deep6mAPred) for predicting DNA 6mA sites across plant species. The Deep6mAPred stacked the CNNs and the Bi-LSTMs in a paralleling manner instead of a series-connection manner. The Deep6mAPred also employed the attention mechanism for improving the representations of sequences. The Deep6mAPred reached an accuracy of 0.9556 over the independent rice dataset, far outperforming the state-of-the-art methods. The tests across plant species showed that the Deep6mAPred is of a remarkable advantage over the state of the art methods. We developed a user-friendly web application for DNA 6mA prediction, which is freely available at http://106.13.196.152:7001/ for all the scientific researchers. The Deep6mAPred would enrich tools to predict DNA 6mA sites and speed up the exploration of DNA modification.
Affiliation(s)
- Xingyu Tang
- School of Electrical Engineering, Shaoyang University, Shaoyang, Hunan 422000, China
- Peijie Zheng
- School of Electrical Engineering, Shaoyang University, Shaoyang, Hunan 422000, China
- Xueyong Li
- School of Electrical Engineering, Shaoyang University, Shaoyang, Hunan 422000, China
- Hongyan Wu
- The Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Dong-Qing Wei
- The Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China.
- Yuewu Liu
- College of Information and Intelligence, Hunan Agricultural University, Changsha, Hunan 410081, China
- Guohua Huang
- School of Electrical Engineering, Shaoyang University, Shaoyang, Hunan 422000, China.
20
Liu M, Sun ZL, Zeng Z, Lam KM. MGF6mARice: prediction of DNA N6-methyladenine sites in rice by exploiting molecular graph feature and residual block. Brief Bioinform 2022; 23:6553606. [PMID: 35325050 DOI: 10.1093/bib/bbac082] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 11/22/2021] [Revised: 02/13/2022] [Accepted: 02/16/2022] [Indexed: 11/12/2022]
Abstract
DNA N6-methyladenine (6mA) is produced by the N6 position of the adenine being methylated, which occurs at the molecular level, and is involved in numerous vital biological processes in the rice genome. Given the shortcomings of biological experiments, researchers have developed many computational methods to predict 6mA sites and achieved good performance. However, the existing methods do not consider the occurrence mechanism of 6mA to extract features from the molecular structure. In this paper, a novel deep learning method is proposed by devising DNA molecular graph feature and residual block structure for 6mA sites prediction in rice, named MGF6mARice. Firstly, the DNA sequence is changed into a simplified molecular input line entry system (SMILES) format, which reflects chemical molecular structure. Secondly, for the molecular structure data, we construct the DNA molecular graph feature based on the principle of graph convolutional network. Then, the residual block is designed to extract higher level, distinguishable features from molecular graph features. Finally, the prediction module is used to obtain the result of whether it is a 6mA site. By means of 10-fold cross-validation, MGF6mARice outperforms the state-of-the-art approaches. Multiple experiments have shown that the molecular graph feature and residual block can promote the performance of MGF6mARice in 6mA prediction. To the best of our knowledge, it is the first time to derive a feature of DNA sequence by considering the chemical molecular structure. We hope that MGF6mARice will be helpful for researchers to analyze 6mA sites in rice.
Affiliation(s)
- Mengya Liu
- School of Computer Science and Technology, Anhui University, Hefei, 230601, China
- Zhan-Li Sun
- School of Artificial Intelligence, Anhui University, Hefei, 230601, China
- Zhigang Zeng
- School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, 430074, China
- Kin-Man Lam
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China
21
Tsukiyama S, Hasan MM, Deng HW, Kurata H. BERT6mA: prediction of DNA N6-methyladenine site using deep learning-based approaches. Brief Bioinform 2022; 23:6539171. [PMID: 35225328 PMCID: PMC8921755 DOI: 10.1093/bib/bbac053] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 12/08/2021] [Revised: 01/28/2022] [Accepted: 01/31/2022] [Indexed: 01/29/2023]
Abstract
N6-methyladenine (6mA) is associated with important roles in DNA replication, DNA repair, transcription, and regulation of gene expression. Several experimental methods have been used to identify DNA modifications. However, these experimental methods are costly and time-consuming. To detect 6mA and complement these shortcomings of experimental methods, we proposed a novel deep learning approach called BERT6mA. To compare BERT6mA with other deep learning approaches, we used benchmark datasets covering 11 species. BERT6mA presented the highest AUCs in eight species in independent tests. Furthermore, BERT6mA showed performance higher than or comparable with the state-of-the-art models, although it performed poorly in a few species with a small sample size. To overcome this issue, pretraining and fine-tuning between two species were applied to BERT6mA. The pretrained and fine-tuned models on specific species presented higher performances than other models, even for the species with a small sample size. In addition to the prediction, we analyzed the attention weights generated by BERT6mA to reveal how the BERT6mA model extracts critical features responsible for the 6mA prediction. To facilitate biological sciences, the BERT6mA online web server and its source codes are freely accessible at https://github.com/kuratahiroyuki/BERT6mA.git, respectively.
Affiliation(s)
- Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Hong-Wen Deng
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Hiroyuki Kurata (corresponding author)
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan. Tel: 81-948-29-7828
22
Cai J, Xiao G, Su R. GC6mA-Pred: A deep learning approach to identify DNA N6-methyladenine sites in the rice genome. Methods 2022; 204:14-21. [PMID: 35149214 DOI: 10.1016/j.ymeth.2022.02.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/09/2022] [Revised: 01/31/2022] [Accepted: 02/05/2022] [Indexed: 12/11/2022]
Abstract
MOTIVATION DNA N6-methyladenine (6mA) is a pivotal DNA modification for various biological processes. More accurate prediction of 6mA methylation sites plays an irreplaceable part in grasping the internal rationale of related biological activities. However, the existing prediction methods only extract information from a single dimension, which has some limitations. Therefore, it is very necessary to obtain the information of 6mA sites from different dimensions, so as to establish a reliable prediction method. RESULTS In this study, a neural network based bioinformatics model named GC6mA-Pred is proposed to predict N6-methyladenine modifications in DNA sequences. GC6mA-Pred extracts significant information from both sequence level and graph level. In the sequence level, GC6mA-Pred uses a three-layer convolution neural network (CNN) model to represent the sequence. In the graph level, GC6mA-Pred employs graph neural network (GNN) method to integrate various information contained in the chemical molecular formula corresponding to DNA sequence. In our newly built dataset, GC6mA-Pred shows better performance than other existing models. The results of comparative experiments have illustrated that GC6mA-Pred is capable of producing a marked effect in accurately identifying DNA 6mA modifications.
Affiliation(s)
- Jianhua Cai
- Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China; College of Mathematics and Computer Science, Fuzhou University, Fuzhou, PR China
- Guobao Xiao
- Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China.
- Ran Su
- College of Intelligence and Computing, Tianjin University, Tianjin, China.
23
Teng Z, Zhao Z, Li Y, Tian Z, Guo M, Lu Q, Wang G. i6mA-Vote: Cross-Species Identification of DNA N6-Methyladenine Sites in Plant Genomes Based on Ensemble Learning With Voting. Front Plant Sci 2022; 13:845835. [PMID: 35237293 PMCID: PMC8882731 DOI: 10.3389/fpls.2022.845835] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 12/30/2021] [Accepted: 01/24/2022] [Indexed: 05/17/2023]
Abstract
DNA N6-Methyladenine (6mA) is a common epigenetic modification that plays significant roles in the growth and development of plants. It is crucial to identify 6mA sites for elucidating the functions of 6mA. In this article, a novel model named i6mA-vote is developed to predict 6mA sites in plants. Firstly, DNA sequences were coded into six feature vectors with diverse strategies based on density, physicochemical properties, and position of nucleotides, respectively. To find the best coding strategy, the feature vectors were compared on several machine learning classifiers. The results suggested that the position of nucleotides has a significant positive effect on 6mA site identification. Thus, the dinucleotide one-hot strategy, which describes the position characteristics of nucleotides well, was employed to extract DNA features in our method. Secondly, DNA sequences of Rosaceae were divided randomly into a training dataset and a test dataset. Finally, i6mA-vote was constructed by combining five different base-classifiers under a majority voting strategy and trained on the Rosaceae training dataset. The i6mA-vote was evaluated on the task of predicting 6mA sites from the genomes of Rosaceae, Rice, and Arabidopsis separately. In Rosaceae, i6mA-vote achieved 0.955 on accuracy (ACC), 0.909 on Matthews correlation coefficient (MCC), 0.955 on sensitivity (SN), and 0.954 on specificity (SP). Those indicators, in the order of ACC, MCC, SN, SP, were 0.882, 0.774, 0.961, and 0.803 on Rice, while they were 0.798, 0.617, 0.666, and 0.929 on Arabidopsis. According to the indicators, our method was effective and outperformed the other methods considered. The results also illustrated that i6mA-vote performs well not only in intraspecies but also in interspecies 6mA site prediction in plants. Moreover, the specificity is distinctly lower than the sensitivity in Rice, while the opposite holds in Arabidopsis. This may result from sequence similarity among Rosaceae, Rice and Arabidopsis.
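The majority-voting step and the four indicators reported in this abstract (ACC, MCC, SN, SP) can be sketched as follows. This is a minimal illustration in plain Python that assumes hard 0/1 votes from five hypothetical base classifiers; the function names and the toy votes are assumptions, not the i6mA-vote implementation or its data.

```python
import math


def majority_vote(votes):
    """Return 1 if more than half of the 0/1 votes are positive, else 0."""
    return 1 if 2 * sum(votes) > len(votes) else 0


def binary_metrics(y_true, y_pred):
    """Compute ACC, MCC, SN (sensitivity) and SP (specificity) from 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / len(y_true)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "SN": sn, "SP": sp}


if __name__ == "__main__":
    # Each row holds the votes of five hypothetical base classifiers for one site.
    votes_per_site = [
        [1, 1, 0, 0, 1],
        [0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 1, 1, 0, 0],
    ]
    y_true = [1, 0, 1, 1]
    y_pred = [majority_vote(v) for v in votes_per_site]
    print(y_pred, binary_metrics(y_true, y_pred))
```

An odd number of base classifiers, as in the five used by i6mA-vote, avoids ties under this rule.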
Affiliation(s)
- Zhixia Teng
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Zhengnan Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Yanjuan Li
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
- Zhen Tian
- College of Information Engineering, Zhengzhou University, Zhengzhou, China
- Maozu Guo
- College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
- Qianzi Lu (corresponding author)
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Guohua Wang (corresponding author)
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
24
Le NQK, Ho QT. Deep transformers and convolutional neural network in identifying DNA N6-methyladenine sites in cross-species genomes. Methods 2021; 204:199-206. [PMID: 34915158 DOI: 10.1016/j.ymeth.2021.12.004] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 11/13/2021] [Revised: 11/30/2021] [Accepted: 12/09/2021] [Indexed: 12/19/2022]
Abstract
As one of the most common post-transcriptional epigenetic modifications, N6-methyladenine (6mA) plays an essential role in various cellular processes and disease pathogenesis. Therefore, accurately identifying 6mA modifications is necessary for a deep understanding of cellular processes and other possible functional mechanisms. Although a few computational methods have been proposed, their respective models were developed with small training datasets. Hence, their practical application in genome-wide detection is quite limited. To overcome the existing limitations, we present a novel model based on transformer architecture and deep learning to identify DNA 6mA sites from cross-species genomes. The model was constructed on a benchmark dataset and explores a feature derived from pre-trained transformer word embedding approaches. Subsequently, a convolutional neural network was employed to learn the generated features and produce the prediction outcomes. As a result, our predictor achieved excellent performance in the independent test, with an accuracy and Matthews correlation coefficient (MCC) of 79.3% and 0.58, respectively. Overall, it achieved better accuracy than the baseline models and significantly outperformed the existing predictors, demonstrating the effectiveness of our proposed hybrid framework. Furthermore, our model is expected to assist biologists in accurately identifying 6mAs and formulating novel testable biological hypotheses. We also release source codes and datasets freely at https://github.com/khanhlee/bert-dna for front-end users.
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan; Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan.
- Quang-Thai Ho
- College of Information & Communication Technology, Can Tho University, Viet Nam; Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
25
Zhang Y, Liu Y, Xu J, Wang X, Peng X, Song J, Yu DJ. Leveraging the attention mechanism to improve the identification of DNA N6-methyladenine sites. Brief Bioinform 2021; 22:bbab351. [PMID: 34459479 PMCID: PMC8575024 DOI: 10.1093/bib/bbab351] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 06/07/2021] [Revised: 08/02/2021] [Accepted: 08/09/2021] [Indexed: 11/12/2022]
Abstract
DNA N6-methyladenine is an important type of DNA modification that plays important roles in multiple biological processes. Despite the recent progress in developing DNA 6mA site prediction methods, several challenges remain to be addressed. For example, although the hand-crafted features are interpretable, they contain redundant information that may bias the model training and have a negative impact on the trained model. Furthermore, although deep learning (DL)-based models can perform feature extraction and classification automatically, they lack the interpretability of the crucial features learned by those models. As such, considerable research efforts have been focused on achieving the trade-off between the interpretability and straightforwardness of DL neural networks. In this study, we develop two new DL-based models for improving the prediction of N6-methyladenine sites, termed LA6mA and AL6mA, which use bidirectional long short-term memory to respectively capture the long-range information and self-attention mechanism to extract the key position information from DNA sequences. The performance of the two proposed methods is benchmarked and evaluated on the two model organisms Arabidopsis thaliana and Drosophila melanogaster. On the two benchmark datasets, LA6mA achieves an area under the receiver operating characteristic curve (AUROC) value of 0.962 and 0.966, whereas AL6mA achieves an AUROC value of 0.945 and 0.941, respectively. Moreover, an in-depth analysis of the attention matrix is conducted to interpret the important information, which is hidden in the sequence and relevant for 6mA site prediction. The two novel pipelines developed for DNA 6mA site prediction in this work will facilitate a better understanding of the underlying principle of DL-based DNA methylation site prediction and its future applications.
Affiliation(s)
- Ying Zhang
- School of Computer Science and Engineering at Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Yan Liu
- School of Computer Science and Engineering at Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Jian Xu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Xiaoyu Wang
- Monash Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Xinxin Peng
- Monash Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
|
26
|
Wang Y, Zhang P, Guo W, Liu H, Li X, Zhang Q, Du Z, Hu G, Han X, Pu L, Tian J, Gu X. A deep learning approach to automate whole-genome prediction of diverse epigenomic modifications in plants. The New Phytologist 2021; 232:880-897. [PMID: 34287908 DOI: 10.1111/nph.17630] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 03/26/2021] [Accepted: 07/09/2021] [Indexed: 06/13/2023]
Abstract
Epigenetic modifications function in gene transcription, RNA metabolism, and other biological processes. However, multiple factors currently limit the scientific utility of epigenomic datasets generated for plants. Here, using deep-learning approaches, we developed a Smart Model for Epigenetics in Plants (SMEP) to predict six types of epigenomic modifications: DNA 5-methylcytosine (5mC) and N6-methyladenosine (6mA) methylation, RNA N6-methyladenosine (m6A) methylation, and three types of histone modification. Using the datasets from the japonica rice Nipponbare, SMEP achieved 95% prediction accuracy for 6mA, and also achieved around 80% for 5mC, m6A, and the three types of histone modification based on the 10-fold cross-validation. Additionally, >95% of the 6mA peaks detected after a heat-shock treatment were predicted. We also successfully applied the SMEP for examining epigenomic modifications in indica rice 93-11 and even the B73 maize line. Taken together, we show that the deep-learning-enabled SMEP can reliably mine epigenomic datasets from diverse plants to yield actionable insights about epigenomic sites. Thus, our work opens new avenues for the application of predictive tools to facilitate functional research, and will almost certainly increase the efficiency of genome engineering efforts.
Affiliation(s)
- Yifan Wang
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Pingxian Zhang
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Weijun Guo
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Hanqing Liu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Xiulan Li
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Qian Zhang
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Zhuoying Du
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Guihua Hu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Xiao Han
- College of Biological Science and Engineering, Fuzhou University, Fuzhou, 350108, China
| | - Li Pu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Jian Tian
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
| | - Xiaofeng Gu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
|
27
|
Ao C, Gao L, Yu L. Research progress in predicting DNA methylation modifications and the relation with human diseases. Curr Med Chem 2021; 29:822-836. [PMID: 34533438 DOI: 10.2174/0929867328666210917115733] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/24/2021] [Revised: 07/05/2021] [Accepted: 07/11/2021] [Indexed: 11/22/2022]
Abstract
DNA methylation is an important mode of regulation in epigenetic mechanisms, and it is one of the research foci in the field of epigenetics. DNA methylation modification affects a series of biological processes, such as eukaryotic cell growth, differentiation and transformation mechanisms, by regulating gene expression. In this review, we systematically summarized the DNA methylation databases, prediction tools for DNA methylation modification, machine learning algorithms for predicting DNA methylation modification, and the relationship between DNA methylation modification and diseases such as hypertension, Alzheimer's disease, diabetic nephropathy, and cancer. An in-depth understanding of DNA methylation mechanisms can promote accurate prediction of DNA methylation modifications and the treatment and diagnosis of related diseases.
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Lin Gao
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China
|
28
|
Liang Y, Zhang S, Qiao H, Yao Y. iPromoter-ET: Identifying promoters and their strength by extremely randomized trees-based feature selection. Anal Biochem 2021; 630:114335. [PMID: 34389299 DOI: 10.1016/j.ab.2021.114335] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/17/2021] [Revised: 07/24/2021] [Accepted: 08/09/2021] [Indexed: 10/20/2022]
Abstract
A promoter is a region of DNA that determines the transcription of a particular gene. There are several σ factors in the RNA polymerase, which have the function of identifying the promoter and facilitating the binding of the RNA polymerase to the promoter. Owing to the importance of promoters in genome research, it is an urgent task to develop a computational tool for effectively identifying promoters and their strength, given the avalanche of DNA sequences discovered in the post-genomic age. In this paper, we develop a model named iPromoter-ET using the k-mer nucleotide composition, binary encoding and dinucleotide property matrix-based distance transformation for feature extraction, and extremely randomized trees (extra trees) for feature selection. Its 1st layer is used to identify whether a DNA sequence is a promoter or not, while its 2nd layer identifies promoter samples as strong or weak promoters. A support vector machine and five-fold cross-validation are used to perform identification and assess performance, respectively. The results indicate that our model remarkably outperforms the existing models in both the 1st and 2nd layers for accuracy and stability. We anticipate that our proposed model will become a very effective intelligent tool, or at the least, a complementary tool to the existing modes of identifying promoters and their strength. Moreover, the datasets and codes for iPromoter-ET are freely available at https://github.com/shengli0201/iPromoter-ET.
Affiliation(s)
- Yunyun Liang
- School of Science, Xi'an Polytechnic University, Xi'an, 710048, PR China.
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Huijuan Qiao
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Yingying Yao
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
|
29
|
He S, Kong L, Chen J. iDNA6mA-Rice-DL: A local web server for identifying DNA N6-methyladenine sites in rice genome by deep learning method. J Bioinform Comput Biol 2021; 19:2150019. [PMID: 34291710 DOI: 10.1142/s0219720021500190] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022]
Abstract
Accurate detection of N6-methyladenine (6mA) sites by biochemical experiments will help to reveal their biological functions, still, these wet experiments are laborious and expensive. Therefore, it is necessary to introduce a powerful computational model to identify the 6mA sites on a genomic scale, especially for plant genomes. In view of this, we proposed a model called iDNA6mA-Rice-DL for the effective identification of 6mA sites in rice genome, which is an intelligent computing model based on deep learning method. Traditional machine learning methods assume the preparation of the features for analysis. However, our proposed model automatically encodes and extracts key DNA features through an embedded layer and several groups of dense layers. We use an independent dataset to evaluate the generalization ability of our model. An area under the receiver operating characteristic curve (auROC) of 0.98 with an accuracy of 95.96% was obtained. The experiment results demonstrate that our model had good performance in predicting 6mA sites in the rice genome. A user-friendly local web server has been established. The Docker image of the local web server can be freely downloaded at https://hub.docker.com/r/his1server/idna6ma-rice-dl.
Affiliation(s)
- Shiqian He
- School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao 066000, P. R. China
| | - Liang Kong
- School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao 066000, P. R. China
| | - Jing Chen
- School of Information Science and Engineering, Yanshan University, Qinhuangdao 066000, P. R. China
|
30
|
Rahman CR, Amin R, Shatabda S, Toaha MSI. A convolution based computational approach towards DNA N6-methyladenine site identification and motif extraction in rice genome. Sci Rep 2021; 11:10357. [PMID: 33990665 PMCID: PMC8121938 DOI: 10.1038/s41598-021-89850-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/07/2020] [Accepted: 05/04/2021] [Indexed: 12/23/2022] Open
Abstract
DNA N6-methylation (6mA) in Adenine nucleotide is a post replication modification responsible for many biological functions. Automated and accurate computational methods can help to identify 6mA sites in long genomes saving significant time and money. Our study develops a convolutional neural network (CNN) based tool i6mA-CNN capable of identifying 6mA sites in the rice genome. Our model coordinates among multiple types of features such as PseAAC (Pseudo Amino Acid Composition) inspired customized feature vector, multiple one hot representations and dinucleotide physicochemical properties. It achieves auROC (area under Receiver Operating Characteristic curve) score of 0.98 with an overall accuracy of 93.97% using fivefold cross validation on benchmark dataset. Finally, we evaluate our model on three other plant genome 6mA site identification test datasets. Results suggest that our proposed tool is able to generalize its ability of 6mA site identification on plant genomes irrespective of plant species. An algorithm for potential motif extraction and a feature importance analysis procedure are two by products of this research. Web tool for this research can be found at: https://cutt.ly/dgp3QTR.
Affiliation(s)
| | - Ruhul Amin
- United International University, Dhaka, Bangladesh
|
31
|
Chachar S, Liu J, Zhang P, Riaz A, Guan C, Liu S. Harnessing Current Knowledge of DNA N6-Methyladenosine From Model Plants for Non-model Crops. Front Genet 2021; 12:668317. [PMID: 33995495 PMCID: PMC8118384 DOI: 10.3389/fgene.2021.668317] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 02/18/2021] [Accepted: 04/06/2021] [Indexed: 12/12/2022] Open
Abstract
Epigenetic modifications alter gene activity and function by changing the chromosomal architecture through DNA methylation/demethylation or histone modifications, without causing any change in the DNA sequence. In plants, DNA cytosine methylation (5mC) is vital for various pathways such as gene regulation, transposon suppression, DNA repair, replication, transcription, and recombination. Thanks to recent advances in high throughput sequencing (HTS) technologies for epigenomic “Big Data” generation, accumulated studies have revealed the occurrence of another novel DNA methylation mark, N6-methyladenosine (6mA), which is abundant on gene bodies and mainly activates gene expression in model plants such as the eudicot Arabidopsis (Arabidopsis thaliana) and the monocot rice (Oryza sativa). However, in non-model crops, the occurrence and importance of 6mA remain largely unknown, with only limited reports in a few species, such as Rosaceae (wild strawberry) and soybean (Glycine max). Given the aforementioned vital roles of 6mA in plants, we summarize the latest advances in DNA 6mA modification and investigate the historical, known and vital functions of 6mA in plants. We also consider advanced artificial-intelligence biotechnologies that improve extraction and prediction of 6mA concepts. In this Review, we discuss the potential challenges that may hinder exploitation of 6mA, and give future goals of 6mA from model plants to non-model crops.
Affiliation(s)
- Sadaruddin Chachar
- State Key Laboratory of Crop Stress Biology for Arid Areas, College of Horticulture, Northwest A&F University, Yangling, China.,Department of Biotechnology, Faculty of Crop Production, Sindh Agriculture University, Tandojam, Pakistan
| | - Jingrong Liu
- College of Mathematics and Statistics, Northwest Normal University, Lanzhou, China
| | - Pingxian Zhang
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Adeel Riaz
- Department of Biochemistry, Faculty of Life Sciences, University of Okara, Okara, Pakistan
| | - Changfei Guan
- State Key Laboratory of Crop Stress Biology for Arid Areas, College of Horticulture, Northwest A&F University, Yangling, China
| | - Shuyuan Liu
- State Key Laboratory of Crop Stress Biology for Arid Areas, College of Horticulture, Northwest A&F University, Yangling, China
|
32
|
i6mA-VC: A Multi-Classifier Voting Method for the Computational Identification of DNA N6-methyladenine Sites. Interdiscip Sci 2021; 13:413-425. [PMID: 33834381 DOI: 10.1007/s12539-021-00429-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 12/19/2020] [Revised: 03/26/2021] [Accepted: 03/29/2021] [Indexed: 12/14/2022]
Abstract
DNA N6-methyladenine (6mA), as an essential component of epigenetic modification, cannot be neglected in genetic regulation mechanism. The efficient and accurate prediction of 6mA sites is beneficial to the development of biological genetics. Biochemical experimental methods are considered to be time-consuming and laborious. Most of the established machine learning methods have a single dataset. Although some of them have achieved cross-species prediction, their results are not satisfactory. Therefore, we designed a novel statistical model called i6mA-VC to improve the accuracy for 6mA sites. On the one hand, kmer and binary encoding are applied to extract features, and then gradient boosting decision tree (GBDT) embedded method is applied as the feature selection strategy. On the other hand, DNA sequences are represented by vectors through the feature extraction method of ring-function-hydrogen-chemical properties (RFHCP) and the feature selection strategy of ExtraTree. After fusing the two optimal features, a voting classifier based on gradient boosting decision tree (GBDT), light gradient boosting machine (LightGBM) and multilayer perceptron classifier (MLPC) is constructed for final classification and prediction. The accuracy of Rice dataset and M. musculus dataset with five-fold cross-validation are 0.888 and 0.967, respectively. The cross-species dataset is selected as independent testing dataset, and the accuracy reaches 0.848. Through rigorous experiments, it is demonstrated that the proposed predictor is convincing and applicable. The development of i6mA-VC predictor will become an effective way for the recognition of N6-methyladenine sites, and it will also be beneficial for biological geneticists to further study gene expression and DNA modification. In addition, an accessible web-server for i6mA-VC is available from http://www.zhanglab.site/.
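The "kmer" features mentioned in this abstract are commonly computed as normalized counts of every length-k word in the sequence. The following is a minimal sketch of that general idea, not the i6mA-VC implementation: the function name `kmer_frequencies`, the default `k=2`, and the decision to skip windows containing non-ACGT characters are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Return the frequency of each of the 4**k k-mers, in lexicographic ACGT order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:          # skip windows with N or other symbols
            counts[window] += 1
            total += 1
    # Normalize so the vector sums to 1 (all zeros if no valid window exists).
    return [counts[m] / total if total else 0.0 for m in kmers]


# Hypothetical example sequence; real benchmark samples are fixed-length windows.
features = kmer_frequencies("ACGTAC", k=2)
```

A vector like `features` could then be concatenated with other encodings before feature selection, which is the role the abstract assigns to the GBDT and ExtraTree steps.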
|
33
|
Yao Y, Zhang S, Liang Y. iORI-ENST: identifying origin of replication sites based on elastic net and stacking learning. SAR and QSAR in Environmental Research 2021; 32:317-331. [PMID: 33730950 DOI: 10.1080/1062936x.2021.1895884] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/21/2020] [Accepted: 02/23/2021] [Indexed: 06/12/2023]
Abstract
DNA replication is not only the basis of biological inheritance but also the most fundamental process in all living organisms. It plays a crucial role in the cell-division cycle and gene expression regulation. Hence, the accurate identification of the origin of replication sites (ORIs) has a great meaning for further understanding the regulatory mechanism of gene expression and treating genic diseases. In this paper, a novel, feasible and powerful model, namely, iORI-ENST is designed for identifying ORIs. Firstly, we extract the different features by incorporating mono-nucleotide binary encoding and dinucleotide-based spatial autocorrelation. Subsequently, elastic net is utilized as the feature selection method to select the optimal feature set. And then stacking learning is employed to predict ORIs and non-ORIs, which contains random forest, adaboost, gradient boosting decision tree, extra trees and support vector machine. Finally, the ORI sites are identified on the benchmark datasets S1 and S2 with their accuracies of 91.41% and 95.07%, respectively. Meanwhile, an independent dataset S3 is employed to verify the validation and transferability of our model and its accuracy reaches 91.10%. Comparing with state-of-the-art methods, our model achieves more remarkable performance. The results show our model is a feasible, effective and powerful tool for identifying ORIs. The source code and datasets are available at https://github.com/YingyingYao/iORI-ENST.
Affiliation(s)
- Y Yao
- School of Mathematics and Statistics, Xidian University, Xi'an, P. R. China
| | - S Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, P. R. China
| | - Y Liang
- School of Science, Xi'an Polytechnic University, Xi'an, P. R. China
|
34
|
Huang Q, Zhou W, Guo F, Xu L, Zhang L. 6mA-Pred: identifying DNA N6-methyladenine sites based on deep learning. PeerJ 2021; 9:e10813. [PMID: 33604189 PMCID: PMC7866889 DOI: 10.7717/peerj.10813] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/28/2020] [Accepted: 12/30/2020] [Indexed: 01/03/2023] Open
Abstract
With the accumulation of data on 6mA modification sites, an increasing number of scholars have begun to focus on the identification of 6mA sites. Despite the recognized importance of 6mA sites, methods for their identification remain lacking, with most existing methods being aimed at their identification in individual species. In the present study, we aimed to develop an identification method suitable for multiple species. Based on previous research, we propose a method for 6mA site recognition. Our experiments prove that the proposed 6mA-Pred method is effective for identifying 6mA sites in genes from taxa such as rice, Mus musculus, and human. A series of experimental results show that 6mA-Pred is an excellent method. We provide the source code used in the study, which can be obtained from http://39.100.246.211:5004/6mA_Pred/.
Affiliation(s)
- Qianfei Huang
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Wenyang Zhou
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
|
35
|
Abbas Z, Tayara H, Chong KT. 4mCPred-CNN-Prediction of DNA N4-Methylcytosine in the Mouse Genome Using a Convolutional Neural Network. Genes (Basel) 2021; 12:296. [PMID: 33672576 PMCID: PMC7924022 DOI: 10.3390/genes12020296] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 01/24/2021] [Revised: 02/16/2021] [Accepted: 02/17/2021] [Indexed: 02/07/2023] Open
Abstract
Among DNA modifications, N4-methylcytosine (4mC) is one of the most significant ones, and it is linked to the development of cell proliferation and gene expression. To understand its different biological functions, the accurate detection of 4mC sites is required. Although we have several techniques for the prediction of 4mC sites in different genomes based on both machine learning (ML) and convolutional neural networks (CNNs), there is no CNN-based tool for the identification of 4mC sites in the mouse genome. In this article, a CNN-based model named 4mCPred-CNN was developed to classify 4mC locations in the mouse genome. Until now, we had only two ML-based models for this purpose; they utilized several feature encoding schemes, and thus still left considerable room to improve the prediction accuracy. Utilizing only a single feature encoding scheme, one-hot encoding, we outperformed both of the previous ML-based techniques. In a ten-fold validation test, the proposed model, 4mCPred-CNN, achieved an accuracy of 85.71% and a Matthews correlation coefficient (MCC) of 0.717. On an independent dataset, the achieved accuracy was 87.50% with an MCC value of 0.750. The attained results show that the proposed model can be of great use for researchers in the fields of biology and bioinformatics.
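One-hot encoding, the single feature scheme named in this abstract, maps each nucleotide to a four-element indicator vector so that a sequence of length L becomes an L × 4 matrix suitable as CNN input. The sketch below shows the usual form of that mapping; the A/C/G/T channel order, the all-zero row for ambiguous bases, and the name `one_hot_encode` are assumptions, since the abstract does not specify them.

```python
# Channel order A, C, G, T is an assumption; any fixed order works.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(seq):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    # Unknown symbols (for example N) become an all-zero row.
    return [list(ONE_HOT.get(base, (0, 0, 0, 0))) for base in seq.upper()]


# Hypothetical 5-nt fragment, for illustration only.
encoded = one_hot_encode("ACGTN")
```

Because every row carries at most one non-zero entry, the encoding adds no hand-crafted assumptions about the sequence, which matches the abstract's point that a single simple scheme was enough.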
Affiliation(s)
- Zeeshan Abbas
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea;
- Institute of Avionics and Aeronautics (IAA), Air University, Islamabad 44000, Pakistan
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Korea
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea;
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
|
36
|
Li Z, Jiang H, Kong L, Chen Y, Lang K, Fan X, Zhang L, Pian C. Deep6mA: A deep learning framework for exploring similar patterns in DNA N6-methyladenine sites across different species. PLoS Comput Biol 2021; 17:e1008767. [PMID: 33600435 PMCID: PMC7924747 DOI: 10.1371/journal.pcbi.1008767] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 03/11/2020] [Revised: 03/02/2021] [Accepted: 02/03/2021] [Indexed: 12/25/2022] Open
Abstract
N6-methyladenine (6mA) is an important DNA modification form associated with a wide range of biological processes. Accurately identifying 6mA sites on a genomic scale is crucial for understanding 6mA’s biological functions. However, the existing experimental techniques for detecting 6mA sites are cost-ineffective, which implies a great need for new computational methods for this problem. In this paper, we developed, without requiring any prior knowledge of 6mA or manually crafted sequence features, a deep learning framework named Deep6mA to identify DNA 6mA sites, and its performance is superior to other DNA 6mA prediction tools. Specifically, the 5-fold cross-validation on a benchmark dataset of rice gives the sensitivity and specificity of Deep6mA as 92.96% and 95.06%, respectively, and the overall prediction accuracy is 94%. Importantly, we find that the sequences with 6mA sites share similar patterns across different species. The model trained with rice data predicts well the 6mA sites of three other species, Arabidopsis thaliana, Fragaria vesca and Rosa chinensis, with a prediction accuracy over 90%. In addition, we find that (1) 6mA tends to occur at GAGG motifs, which means the sequence near the 6mA site may be conserved; (2) 6mA is enriched in the TATA box of the promoter, which may be the main source of its regulating downstream gene expression.
DNA N6 methyladenine (6mA) is a newly recognized methylation modification in eukaryotes. It exists widely and conservatively in organisms, and its modification level changes dynamically in the whole life cycle. This study proposes an algorithm based on a deep learning framework including LSTM and CNN to predict 6mA sites. The results showed that our method could accurately predict the 6mA sites in different species, which means DNA sub-sequences containing 6mA sites among species have certain conservation. Importantly, we found that 6mA methylation in most different species is more likely to occur on the GAGG motif. In addition, we also found that 6mA is rich in the promoter’s TATA box, which may be a mechanism of regulating downstream gene expression.
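Sensitivity, specificity, accuracy and the Matthews correlation coefficient, which recur throughout the abstracts in this list, all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions. The function name `binary_metrics` and the counts in the example are hypothetical and are not taken from the Deep6mA paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # recall on negatives
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts: 100 true 6mA sites and 100 non-6mA sites.
metrics = binary_metrics(tp=90, fn=10, tn=80, fp=20)
```

When the positive and negative classes are the same size, accuracy equals the mean of sensitivity and specificity. Under that assumption of balanced classes, the reported 92.96% and 95.06% average to 94.01%, in line with the 94% overall accuracy quoted above.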
Affiliation(s)
- Zutan Li
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, China
| | - Hangjin Jiang
- Center for Data Science, Zhejiang University, Hangzhou, China
| | - Lingpeng Kong
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, China
| | - Yuanyuan Chen
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, China
| | - Kun Lang
- College of information science & Technology, Nanjing Agricultural University, Nanjing, China
| | - Xiaodan Fan
- Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
| | - Liangyun Zhang
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, China
- * E-mail: (LYZ); (CP)
| | - Cong Pian
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, China
- * E-mail: (LYZ); (CP)
|
37
|
Hasan MM, Shoombuatong W, Kurata H, Manavalan B. Critical evaluation of web-based DNA N6-methyladenine site prediction tools. Brief Funct Genomics 2021; 20:258-272. [PMID: 33491072 DOI: 10.1093/bfgp/elaa028] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 10/31/2020] [Revised: 12/11/2020] [Accepted: 12/15/2020] [Indexed: 12/13/2022] Open
Abstract
Methylation of DNA N6-methyladenosine (6mA) is a type of epigenetic modification that plays pivotal roles in various biological processes. The accurate genome-wide identification of 6mA is a challenging task that leads to understanding the biological functions. For the last 5 years, a number of bioinformatics approaches and tools for 6mA site prediction have been established, and some of them are easily accessible as web applications. Nevertheless, the accurate genome-wide identification of 6mA remains challenging. Especially in practical applications, these tools have implemented diverse encoding schemes, machine learning algorithms and feature selection methods, whereas few systematic performance comparisons of 6mA site predictors have been reported. In this review, 11 publicly available 6mA predictors were evaluated with seven different species-specific datasets (Arabidopsis thaliana, Tolypocladium, Diospyros lotus, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans and Escherichia coli). Of those, a few species are close homologs, and the remaining datasets are distant sequences. Our independent validation tests demonstrated that the Meta-i6mA and MM-6mAPred models for A. thaliana, Tolypocladium, S. cerevisiae and D. melanogaster achieved excellent overall performance when compared with their counterparts. However, none of the existing methods were suitable for E. coli, C. elegans and D. lotus. The feasibility of the existing predictors is also discussed for the seven species. Our evaluation provides useful guidelines for the development of 6mA site predictors and helps biologists select suitable prediction tools.
Affiliation(s)
| | - Watshara Shoombuatong
- Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University
| | - Hiroyuki Kurata
- Department of Bioscience and Bioinformatics in the Kyushu Institute of Technology, Japan
|
38
|
Gouda G, Gupta MK, Donde R, Sabarinathan S, Vadde R, Behera L, Mohapatra T. Computational Epigenetics in Rice Research. Applications of Bioinformatics in Rice Research 2021:113-140. [DOI: 10.1007/978-981-16-3997-5_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/06/2023]
|
39
|
Ao C, Yu L, Zou Q. Prediction of bio-sequence modifications and the associations with diseases. Brief Funct Genomics 2020; 20:1-18. [PMID: 33313647 DOI: 10.1093/bfgp/elaa023] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Received: 10/25/2020] [Revised: 11/09/2020] [Accepted: 11/10/2020] [Indexed: 12/22/2022] Open
Abstract
Modifications of protein, RNA and DNA play an important role in many biological processes and are related to some diseases. Therefore, accurate identification and comprehensive understanding of protein, RNA and DNA modification sites can promote research on disease treatment and prevention. With the development of sequencing technology, the number of known sequences has continued to increase. In the past decade, many computational tools that can be used to predict protein, RNA and DNA modification sites have been developed. In this review, we comprehensively summarized the modification site predictors for three different biological sequences and the association with diseases. The relevant web server is accessible at http://lab.malab.cn/~acy/PTM_data/; some sample data on protein, RNA and DNA modification can be downloaded from that website.
|
40
|
Dao FY, Lv H, Zhang D, Zhang ZM, Liu L, Lin H. DeepYY1: a deep learning approach to identify YY1-mediated chromatin loops. Brief Bioinform 2020; 22:6024741. [PMID: 33279983 DOI: 10.1093/bib/bbaa356] [Citation(s) in RCA: 59] [Impact Index Per Article: 14.8] [Received: 09/21/2020] [Revised: 10/19/2020] [Accepted: 11/04/2020] [Indexed: 12/29/2022]
Abstract
The protein Yin Yang 1 (YY1) can form dimers that facilitate the interaction between active enhancers and promoter-proximal elements. YY1-mediated enhancer-promoter interaction is a general feature of mammalian gene control. Recently, some computational methods have been developed to characterize the interactions between DNA elements by elucidating important features of chromatin folding; however, no computational methods have been developed for identifying YY1-mediated chromatin loops. In this study, we developed a deep learning algorithm named DeepYY1, based on word2vec, to determine whether a pair of YY1 motifs would form a loop. The proposed models showed high prediction performance (AUCs ≥ 0.93) on both training and testing datasets in different cell types, demonstrating that DeepYY1 performs excellently in identifying YY1-mediated chromatin loops. Our study also suggested that sequences play an important role in the formation of YY1-mediated chromatin loops. Furthermore, we briefly discussed the distribution of replication origin sites in the loops. Finally, a user-friendly web server was established, and it can be freely accessed at http://lin-group.cn/server/DeepYY1.
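A word2vec-based sequence model usually needs the DNA sequence split into "words" first. The sketch below shows one common way to do that, overlapping k-mers. The function name `kmer_tokens` and the default `k=3` are assumptions for illustration; the abstract does not state how DeepYY1 tokenizes sequences.

```python
def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers ("words").

    Hypothetical helper: k=3 is an illustrative default, not a value
    taken from the DeepYY1 abstract.
    """
    seq = seq.upper()
    # Slide a window of width k one base at a time.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# A list of such token lists is the usual "corpus" handed to a
# word-embedding trainer.
tokens = kmer_tokens("acgtac")
```

Each sequence then becomes a sentence of k-mer words, so a standard embedding trainer can learn a vector per k-mer.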
Affiliation(s)
- Fu-Ying Dao
- Center for Informational Biology at the University of Electronic Science and Technology of China
| | - Hao Lv
- Center for Informational Biology at the University of Electronic Science and Technology of China
| | - Dan Zhang
- Center for Informational Biology at the University of Electronic Science and Technology of China
| | - Zi-Mei Zhang
- Center for Informational Biology at the University of Electronic Science and Technology of China
| | - Li Liu
- Laboratory of Theoretical Biophysics at the Inner Mongolia University
| | - Hao Lin
- Center for Informational Biology at the University of Electronic Science and Technology of China
|
41
|
Xu H, Hu R, Jia P, Zhao Z. 6mA-Finder: a novel online tool for predicting DNA N6-methyladenine sites in genomes. Bioinformatics 2020; 36:3257-3259. [PMID: 32091591 DOI: 10.1093/bioinformatics/btaa113] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 11/23/2019] [Revised: 01/29/2020] [Accepted: 02/14/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION: DNA N6-methyladenine (6mA) has recently been recognized as an essential epigenetic modification that plays roles in a variety of cellular processes. Abnormal DNA 6mA modification status has been reported in cancer and other diseases. The annotation of 6mA marks in the genome is the first crucial step in exploring the underlying molecular mechanisms, including its regulatory roles. RESULTS: We present a novel online DNA 6mA site prediction tool, 6mA-Finder, which incorporates seven sequence-derived information features and three physicochemical-based features through a recursive feature elimination strategy. Our multiple cross-validations indicate the promising accuracy and robustness of our model. 6mA-Finder outperforms its peer tools in general and species-specific 6mA site prediction, suggesting it can provide a useful resource for further experimental investigation of DNA 6mA modification. AVAILABILITY AND IMPLEMENTATION: https://bioinfo.uth.edu/6mA_Finder. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Haodong Xu
- School of Biomedical Informatics, Center for Precision Health
| | - Ruifeng Hu
- School of Biomedical Informatics, Center for Precision Health
| | - Peilin Jia
- School of Biomedical Informatics, Center for Precision Health
| | - Zhongming Zhao
- School of Biomedical Informatics, Center for Precision Health.,MD Anderson Cancer Center, UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA.,Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
|
42
|
Hasan MM, Basith S, Khatun MS, Lee G, Manavalan B, Kurata H. Meta-i6mA: an interspecies predictor for identifying DNA N6-methyladenine sites of plant genomes by exploiting informative features in an integrative machine-learning framework. Brief Bioinform 2020; 22:5903398. [PMID: 32910169 DOI: 10.1093/bib/bbaa202] [Citation(s) in RCA: 69] [Impact Index Per Article: 17.3] [Received: 06/11/2020] [Revised: 08/06/2020] [Accepted: 08/06/2020] [Indexed: 12/13/2022]
Abstract
DNA N6-methyladenine (6mA) is an important epigenetic modification that is responsible for various cellular processes. The accurate identification of 6mA sites is one of the challenging tasks in genome analysis and leads to an understanding of their biological functions. To date, several species-specific machine learning (ML)-based models have been proposed, but the majority of them were not tested on other species. Hence, their practical application to other plant species is quite limited. In this study, we explored 10 different feature encoding schemes, with the goal of capturing key characteristics around 6mA sites. We selected five feature encoding schemes, based on physicochemical and position-specific information, that possess high discriminative capability. The resultant feature sets were input to six commonly used ML methods (random forest, support vector machine, extremely randomized tree, logistic regression, naïve Bayes and AdaBoost). The Rosaceae genome was employed to train the above classifiers, which generated 30 baseline models. To integrate their individual strengths, Meta-i6mA was proposed, which combines the baseline models using the meta-predictor approach. In an extensive independent test, Meta-i6mA showed high Matthews correlation coefficient values of 0.918, 0.827 and 0.635 on Rosaceae, rice and Arabidopsis thaliana, respectively, and outperformed the existing predictors. We anticipate that Meta-i6mA can be applied across different plant species. Furthermore, we developed a user-friendly online web server, which is available at http://kurata14.bio.kyutech.ac.jp/Meta-i6mA/.
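The count of 30 baseline models follows from pairing each of the five selected encodings with each of the six classifiers. The sketch below enumerates that grid and shows how per-model scores could be ordered into one input vector for a meta-predictor. The encoding names are placeholders (the abstract does not list them), and `baseline_grid` and `meta_features` are hypothetical helpers, not the authors' code.

```python
from itertools import product

# Placeholder names: the abstract does not name the five selected encodings.
ENCODINGS = [f"encoding_{i}" for i in range(1, 6)]
# Classifier names as listed in the abstract.
CLASSIFIERS = [
    "random forest",
    "support vector machine",
    "extremely randomized tree",
    "logistic regression",
    "naive Bayes",
    "AdaBoost",
]


def baseline_grid(encodings, classifiers):
    """Return every (encoding, classifier) pair, one per baseline model."""
    return list(product(encodings, classifiers))


def meta_features(baseline_scores, grid):
    """Arrange baseline probabilities in grid order as meta-predictor input.

    baseline_scores maps (encoding, classifier) -> predicted probability.
    """
    return [baseline_scores[pair] for pair in grid]


grid = baseline_grid(ENCODINGS, CLASSIFIERS)  # 5 x 6 = 30 pairs
```

In a stacking setup of this kind, the vector returned by `meta_features` for each sample would be the training input of the second-level model.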
Affiliation(s)
| | - Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Republic of Korea
| | - Mst Shamima Khatun
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Japan
| | - Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Republic of Korea
| | | | - Hiroyuki Kurata
- Department of Bioscience and Bioinformatics in the Kyushu Institute of Technology, Japan
|
43
|
Wahab A, Mahmoudi O, Kim J, Chong KT. DNC4mC-Deep: Identification and Analysis of DNA N4-Methylcytosine Sites Based on Different Encoding Schemes By Using Deep Learning. Cells 2020; 9:E1756. [PMID: 32707969 PMCID: PMC7465362 DOI: 10.3390/cells9081756] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 06/26/2020] [Revised: 07/17/2020] [Accepted: 07/17/2020] [Indexed: 11/24/2022]
Abstract
N4-methylcytosine (4mC), one kind of DNA modification, has a critical role that alters genetic performance such as protein interactions, conformation and stability of DNA, as well as the regulation of gene expression, cell development and genomic imprinting. Several 4mC site identifiers have been proposed for various species. Herein, we propose a computational model, DNC4mC-Deep, which combines six encoding techniques with a deep learning model to predict 4mC sites in the genomes of F. vesca and R. chinensis and in a cross-species dataset. A 10-fold cross-validation test demonstrated its superior performance. DNC4mC-Deep obtained MCCs of 0.829 and 0.929 on the F. vesca and R. chinensis training datasets, respectively, and 0.814 on the cross-species dataset. This means the proposed method outperforms the state-of-the-art predictors by at least 0.284 and 0.265 on the F. vesca and R. chinensis training datasets, respectively. Furthermore, DNC4mC-Deep achieved MCCs of 0.635 and 0.565 on the F. vesca and R. chinensis independent datasets, respectively, and 0.562 on the cross-species dataset, which shows that it achieves the best performance in predicting 4mC sites compared with the state-of-the-art predictor.
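For readers unfamiliar with the 10-fold cross-validation test mentioned in this abstract, the following is a minimal sketch of how sample indices can be split into ten train/test partitions. The function `k_fold_indices` and its `seed` parameter are illustrative assumptions; the abstract does not describe the authors' splitting procedure (for example, whether folds were stratified).

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Hypothetical helper: a plain shuffled split without stratification.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
```

Each sample appears in exactly one test fold, so averaging a metric such as MCC over the ten folds uses every sample once for evaluation.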
Affiliation(s)
- Abdul Wahab
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea; (A.W.); (O.M.)
| | - Omid Mahmoudi
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea; (A.W.); (O.M.)
| | - Jeehong Kim
- Department of New & Renewable Energy, VISION College of Jeonju, Jeonju 55069, Korea
| | - Kil To Chong
- Department of Electronics Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Advance Electronics & Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
|
44
|
Hasan MM, Manavalan B, Shoombuatong W, Khatun MS, Kurata H. i6mA-Fuse: improved and robust prediction of DNA 6 mA sites in the Rosaceae genome by fusing multiple feature representation. PLANT MOLECULAR BIOLOGY 2020; 103:225-234. [PMID: 32140819 DOI: 10.1007/s11103-020-00988-y] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 02/06/2020] [Accepted: 02/29/2020] [Indexed: 05/28/2023]
Abstract
DNA N6-methyladenine (6mA) is one of the most vital epigenetic modifications and is involved in controlling various gene expression levels. With the avalanche of DNA sequences generated in numerous databases, the accurate identification of 6mA plays an essential role in understanding molecular mechanisms. Because experimental approaches are time-consuming and costly, it is desirable to develop a computational model for rapidly and accurately identifying 6mA. To the best of our knowledge, we are the first to propose a computational model, named i6mA-Fuse, to predict 6mA sites from the Rosaceae genomes, especially in Rosa chinensis and Fragaria vesca. We implemented five encoding schemes, i.e., mononucleotide binary, dinucleotide binary, k-space spectral nucleotide, k-mer, and electron-ion interaction pseudo potential compositions, to build five single-encoding random forest (RF) models. i6mA-Fuse uses a linear regression model to combine the predicted probability scores of the five single-encoding RF models. The resultant species-specific i6mA-Fuse achieved remarkably high performance, with AUCs of 0.982 and 0.978 and MCCs of 0.869 and 0.858 on the independent datasets of Rosa chinensis and Fragaria vesca, respectively. In the F. vesca-specific i6mA-Fuse, the mononucleotide binary encoding (MBE) and electron-ion interaction pseudo potential (EIIP) contributed 75% and 25% of the total prediction; in the R. chinensis-specific i6mA-Fuse, Kmer, MBE and EIIP contributed 15%, 65% and 20% of the total prediction. To assist high-throughput prediction for DNA 6mA identification, i6mA-Fuse is publicly accessible at https://kurata14.bio.kyutech.ac.jp/i6mA-Fuse/.
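Two ideas in this abstract are easy to illustrate: the mononucleotide binary encoding (MBE) and the linear combination of per-encoding probability scores. The sketch below is an assumption-laden illustration, not the authors' implementation: the A/C/G/T bit order is one common convention, the function names are hypothetical, and reading the reported 75%/25% contributions as linear weights is an interpretation of the abstract.

```python
# Assumed bit order (A, C, G, T); the abstract does not specify it.
MBE_TABLE = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def mononucleotide_binary(seq):
    """Encode each base as a 4-bit indicator vector and concatenate them."""
    vec = []
    for base in seq.upper():
        # Unknown symbols (e.g. N) become all zeros in this sketch.
        vec.extend(MBE_TABLE.get(base, (0, 0, 0, 0)))
    return vec


def fuse_scores(scores, weights):
    """Weighted sum of the single-encoding model probabilities."""
    return sum(weights[name] * scores[name] for name in weights)


# Contributions reported for the F. vesca model, read here as weights.
F_VESCA_WEIGHTS = {"MBE": 0.75, "EIIP": 0.25}
fused = fuse_scores({"MBE": 0.8, "EIIP": 0.4}, F_VESCA_WEIGHTS)
```

With these illustrative inputs the fused score leans toward the MBE model's prediction, which matches the larger contribution the abstract reports for that encoding.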
Affiliation(s)
- Md Mehedi Hasan
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
- Japan Society for the Promotion of Science, 5-3-1 Kojimachi, Chiyoda-ku, Tokyo, 102-0083, Japan
| | | | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
| | - Mst Shamima Khatun
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan
| | - Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan.
- Biomedical Informatics R&D Center, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan.
|
45
|
Lv H, Dao FY, Zhang D, Guan ZX, Yang H, Su W, Liu ML, Ding H, Chen W, Lin H. iDNA-MS: An Integrated Computational Tool for Detecting DNA Modification Sites in Multiple Genomes. iScience 2020; 23:100991. [PMID: 32240948 PMCID: PMC7115099 DOI: 10.1016/j.isci.2020.100991] [Citation(s) in RCA: 72] [Impact Index Per Article: 18.0] [Received: 02/06/2020] [Revised: 02/23/2020] [Accepted: 03/11/2020] [Indexed: 12/11/2022]
Abstract
5hmC, 6mA, and 4mC are three common DNA modifications and are involved in various biological processes. Accurate genome-wide identification of these sites is invaluable for better understanding their biological functions. Owing to the labor-intensive and expensive nature of experimental methods, it is urgent to develop computational methods for the genome-wide detection of these sites. Keeping this in mind, the current study was devoted to constructing a computational method to identify 5hmC, 6mA, and 4mC. We initially used K-tuple nucleotide component, nucleotide chemical property and nucleotide frequency, and mono-nucleotide binary encoding schemes to formulate samples. Subsequently, random forest was utilized to identify 5hmC, 6mA, and 4mC sites. Cross-validated results showed that the proposed method achieves excellent generalization ability in the identification of the three modification sites. Based on the proposed model, a web server called iDNA-MS was established and is freely accessible at http://lin-group.cn/server/iDNA-MS.
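The "nucleotide chemical property and nucleotide frequency" encoding is commonly built by giving each base three chemical-property bits plus a running density value. The sketch below follows that widely used convention (ring structure, functional group, hydrogen-bond strength); whether iDNA-MS uses exactly these bit assignments is an assumption, and `ncp_nd_encode` is a hypothetical name.

```python
# Assumed property bits: (purine?, amino group?, weak hydrogen bonding?).
NCP_TABLE = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def ncp_nd_encode(seq):
    """Return one 4-tuple per position: three property bits plus density.

    Density at position i (1-based) is the number of times the current
    base has occurred in seq[:i], divided by i.
    """
    counts = {}
    features = []
    for position, base in enumerate(seq.upper(), start=1):
        counts[base] = counts.get(base, 0) + 1
        density = counts[base] / position
        # Unknown symbols get zero property bits in this sketch.
        features.append(NCP_TABLE.get(base, (0, 0, 0)) + (density,))
    return features


encoded = ncp_nd_encode("AAC")
```

The density term adds positional frequency information that the three property bits alone cannot carry, since identical bases otherwise encode identically at every position.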
Affiliation(s)
- Hao Lv
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Fu-Ying Dao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Dan Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zheng-Xing Guan
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hui Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Su
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Meng-Lu Liu
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Chen
- Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan 063000, China; Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
|
46
|
Lv Z, Zhang J, Ding H, Zou Q. RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites. Front Bioeng Biotechnol 2020; 8:134. [PMID: 32175316 PMCID: PMC7054385 DOI: 10.3389/fbioe.2020.00134] [Citation(s) in RCA: 62] [Impact Index Per Article: 15.5] [Received: 01/17/2020] [Accepted: 02/10/2020] [Indexed: 12/21/2022]
Abstract
One of the ubiquitous chemical modifications in RNA, pseudouridine modification is crucial for various cellular biological and physiological processes. To gain more insight into the functional mechanisms involved, it is of fundamental importance to precisely identify pseudouridine sites in RNA. Several useful machine learning approaches have become available recently, with the increasing progress of next-generation sequencing technology; however, existing methods cannot predict sites with high accuracy. Thus, a more accurate predictor is required. In this study, a random forest-based predictor named RF-PseU is proposed for prediction of pseudouridylation sites. To optimize feature representation and obtain a better model, the light gradient boosting machine algorithm and incremental feature selection strategy were used to select the optimum feature space vector for training the random forest model RF-PseU. Compared with previous state-of-the-art predictors, the results on the same benchmark data sets of three species demonstrate that RF-PseU performs better overall. The integrated average leave-one-out cross-validation and independent testing accuracy scores were 71.4% and 74.7%, respectively, representing increments of 3.63% and 4.77% versus the best existing predictor. Moreover, the final RF-PseU model for prediction was built on leave-one-out cross-validation and provides a reliable and robust tool for identifying pseudouridine sites. A web server with a user-friendly interface is accessible at http://148.70.81.170:10228/rfpseu.
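An incremental feature selection strategy of the kind named in this abstract can be sketched as follows: features are first ranked (here the ranking is simply given as input; in the abstract it comes from a light gradient boosting machine), then added one at a time while a score is tracked. The function name and the `evaluate` callback are hypothetical; in practice the callback would be a cross-validated accuracy of a random forest trained on the candidate subset.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Grow the feature set in rank order and keep the best-scoring prefix.

    ranked_features: feature names sorted from most to least important.
    evaluate: callable taking a list of features and returning a score
              (higher is better). Supplied by the caller.
    """
    best_score = float("-inf")
    best_subset = []
    subset = []
    for feature in ranked_features:
        subset.append(feature)
        score = evaluate(subset)
        if score > best_score:
            # Copy the list so later appends do not alter the stored best.
            best_score, best_subset = score, list(subset)
    return best_subset, best_score


# Toy scorer: two informative features, small penalty per feature used.
useful = {"f1", "f2"}
toy_score = lambda feats: len(useful & set(feats)) - 0.1 * len(feats)
subset, score = incremental_feature_selection(["f1", "f2", "f3", "f4"], toy_score)
```

In the toy run the search stops improving once the uninformative features start being added, so the two-feature prefix is returned.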
Affiliation(s)
- Zhibin Lv
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Jun Zhang
- Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Hui Ding
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
47
|
Karanthamalai J, Chodon A, Chauhan S, Pandi G. DNA N6-Methyladenine Modification in Plant Genomes-A Glimpse into Emerging Epigenetic Code. PLANTS (BASEL, SWITZERLAND) 2020; 9:E247. [PMID: 32075056 PMCID: PMC7076483 DOI: 10.3390/plants9020247] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 01/20/2020] [Revised: 02/09/2020] [Accepted: 02/11/2020] [Indexed: 02/08/2023]
Abstract
N6-methyladenine (6mA) is a DNA base modification at the 6th nitrogen position; recently, it has resurfaced as a potential reversible epigenetic mark in eukaryotes. Despite its existence, 6mA was long considered to be absent because of its undetectable level. However, with new advancements in detection methods, considerable 6mA distribution has been identified across the plant genome. Unlike 5-methylcytosine (5mC) in the gene promoter, 6mA does not have a definitive role in repression but has been shown to have divergent regulatory effects on gene expression. Though less is known about 6mA, the available evidence suggests that it functions in plant development, tissue differentiation, and the regulation of gene expression. The current review article emphasizes the research advances in DNA 6mA modifications, identification, available databases, analysis tools and its significance in plant development, cellular functions and future perspectives of research.
Affiliation(s)
| | | | | | - Gopal Pandi
- Department of Plant Biotechnology, School of Biotechnology, Madurai Kamaraj University, Madurai625021, Tamil Nadu, India; (J.K.); (A.C.); (S.C.)
|
48
|
Huang Q, Zhang J, Wei L, Guo F, Zou Q. 6mA-RicePred: A Method for Identifying DNA N6-Methyladenine Sites in the Rice Genome Based on Feature Fusion. FRONTIERS IN PLANT SCIENCE 2020; 11:4. [PMID: 32076430 PMCID: PMC7006724 DOI: 10.3389/fpls.2020.00004] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 12/04/2019] [Accepted: 01/06/2020] [Indexed: 06/01/2023]
Abstract
MOTIVATION: The biological function of N6-methyladenine DNA (6mA) in plants is largely unknown. Rice is one of the most important crops worldwide and is a model species for molecular and genetic studies. There are few methods for 6mA site recognition in the rice genome, and an effective computational method is needed. RESULTS: In this paper, we propose a new computational method called 6mA-Pred to identify 6mA sites in the rice genome. 6mA-Pred employs a feature fusion method to combine advantageous features from other methods and thus obtain a new feature to identify 6mA sites. This method achieved an accuracy of 87.27% in the identification of 6mA sites with 10-fold cross-validation and achieved an accuracy of 85.6% in independent test sets.
Affiliation(s)
- Qianfei Huang
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Jun Zhang
- Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Leyi Wei
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
|
49
|
i6mA-DNCP: Computational Identification of DNA N6-Methyladenine Sites in the Rice Genome Using Optimized Dinucleotide-Based Features. Genes (Basel) 2019; 10:genes10100828. [PMID: 31635172 PMCID: PMC6826501 DOI: 10.3390/genes10100828] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 08/27/2019] [Revised: 10/16/2019] [Accepted: 10/18/2019] [Indexed: 12/22/2022]
Abstract
DNA N6-methyladenine (6mA) plays an important role in regulating the gene expression of eukaryotes. Accurate identification of 6mA sites may assist in understanding genomic 6mA distributions and biological functions. Various experimental methods have been applied to detect 6mA sites in a genome-wide scope, but they are too time-consuming and expensive. Developing computational methods to rapidly identify 6mA sites is needed. In this paper, a new machine learning-based method, i6mA-DNCP, was proposed for identifying 6mA sites in the rice genome. Dinucleotide composition and dinucleotide-based DNA properties were first employed to represent DNA sequences. After a specially designed DNA property selection process, a bagging classifier was used to build the prediction model. The jackknife test on a benchmark dataset demonstrated that i6mA-DNCP could obtain 84.43% sensitivity, 88.86% specificity, 86.65% accuracy, a 0.734 Matthew's correlation coefficient (MCC), and a 0.926 area under the receiver operating characteristic curve (AUC). Moreover, three independent datasets were established to assess the generalization ability of our method. Extensive experiments validated the effectiveness of i6mA-DNCP.
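The sensitivity, specificity, accuracy and MCC figures quoted in this and neighbouring abstracts all derive from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are illustrative and are not taken from the paper.

```python
import math


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute Sn, Sp, Acc and MCC from confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)   # true positive rate
    specificity = _ratio(tn, tn + fp)   # true negative rate
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, denominator)
    return {"Sn": sensitivity, "Sp": specificity, "Acc": accuracy, "MCC": mcc}


# Illustrative counts only, not data from the i6mA-DNCP paper.
metrics = binary_metrics(tp=84, fp=11, tn=89, fn=16)
```

MCC uses all four counts at once, which is why it is often reported alongside accuracy when positive and negative sets may be unbalanced.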
|
50
|
Yu H, Dai Z. SNNRice6mA: A Deep Learning Method for Predicting DNA N6-Methyladenine Sites in Rice Genome. Front Genet 2019; 10:1071. [PMID: 31681441 PMCID: PMC6797597 DOI: 10.3389/fgene.2019.01071] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 07/15/2019] [Accepted: 10/04/2019] [Indexed: 01/08/2023]
Abstract
DNA N6-methyladenine (6mA) is an important epigenetic modification that is involved in many biological regulation processes. An accurate and reliable method for 6mA identification can help us gain better insight into the regulatory mechanism of the modification. Although many experimental techniques have been proposed to identify 6mA sites genome-wide, these techniques are time-consuming and laborious. Recently, several machine learning methods have been developed to identify 6mA sites genome-wide. However, there is room for improvement in their performance for predicting 6mA sites in the rice genome. In this paper, we developed a simple and lightweight deep learning model to identify DNA 6mA sites in the rice genome. Our model needs no prior knowledge of 6mA or manually crafted sequence features. We built our model based on two rice 6mA benchmark datasets. Our method achieved average prediction accuracies of approximately 93% and 92% on the two datasets we used. We compared our method with existing 6mA prediction tools. The comparison results show that our model outperforms the state-of-the-art methods.
Affiliation(s)
- Haitao Yu
- School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China
| | - Zhiming Dai
- School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China.,Guangdong Province Key Laboratory of Big Data Analysis and Processing, Sun Yat-Sen University, Guangzhou, China
|