1
|
Yu X, Yani C, Wang Z, Long H, Zeng R, Liu X, Anas B, Ren J. iDNA-ITLM: An interpretable and transferable learning model for identifying DNA methylation. PLoS One 2024; 19:e0301791. [PMID: 39480834 PMCID: PMC11527195 DOI: 10.1371/journal.pone.0301791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Accepted: 03/20/2024] [Indexed: 11/02/2024]
Abstract
In this study, from the perspective of image processing, we propose the iDNA-ITLM model for identifying DNA methylation sites. The model uses a novel data enhancement strategy: a short DNA sequence is repeatedly self-replicated into a longer DNA sequence, which is then embedded into a high-dimensional matrix to enlarge the receptive field. Our model consistently outperforms the current state-of-the-art sequence-based DNA methylation site recognition methods when evaluated on 17 benchmark datasets that cover multiple species and include three DNA methylation modifications (4mC, 5hmC, and 6mA). The experimental results demonstrate the robustness and superior performance of our model across these datasets. In addition, our model can be transferred to RNA methylation sequences and produces good results without any change to its hyperparameters. The proposed iDNA-ITLM model can be considered a universal predictor across DNA and RNA methylation species.
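The self-replication step described in this abstract can be illustrated with a minimal Python sketch. The target length, the function names `self_replicate` and `embed`, and the one-hot embedding are assumptions made for illustration only; the abstract does not say how iDNA-ITLM builds its matrix.

```python
# Hypothetical illustration: lengthen a short DNA sequence by repeating it,
# then embed each base as a vector so the result forms a 2-D matrix.

ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def self_replicate(seq: str, target_len: int) -> str:
    """Repeat seq end to end and truncate to target_len characters."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    repeats = -(-target_len // len(seq))  # ceiling division
    return (seq * repeats)[:target_len]


def embed(seq: str) -> list:
    """Map each base to a one-hot row (an assumed stand-in embedding)."""
    return [ONE_HOT[base] for base in seq]


long_seq = self_replicate("ACGT", 10)
matrix = embed(long_seq)  # 10 rows, 4 columns
```

A learned embedding would normally replace the one-hot table; the sketch only shows how replication turns a short input into a larger matrix.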
Affiliation(s)
- Xia Yu
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Cui Yani
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| | - Zhichao Wang
- Unit 32033, The People’s Liberation Army, Beijing, China
| | - Haixia Long
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Rao Zeng
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Xiling Liu
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Bilal Anas
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Jia Ren
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| |
|
2
|
Yu B, Zhang H, Pian C, Chen Y. MMG4: Recognition of G4-Forming Sequences Based on Markov Model. J Comput Biol 2024. [PMID: 39419074 DOI: 10.1089/cmb.2024.0523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2024]
Abstract
G-quadruplexes (G4s) are special nucleic acid structures with various important biological functions. Existing tools and technologies for recognizing G4-forming sequences are limited to time-consuming and costly methods such as circular dichroism and nuclear magnetic resonance. Developing a fast and accurate model for recognizing G4-forming sequences has far-reaching significance. In this study, MMG4, a novel model that recognizes G4-forming sequences based on a Markov model (MM), was developed, and recognition accuracy was found to be high in the central region of the sequence and low in the two end regions. It was further found that differences in base transfer probabilities, ratio distribution, and G4-motif structural content between regions may cause this phenomenon. The study also explored the impact of sequence length on recognition accuracy and found the optimal recognition interval to be [910-1049], with the highest recognition accuracy reaching 85.95%. By extracting sequence features, the study constructed three types of machine learning models: random forest (RF), support vector machine, and back-propagation neural network. The recognition performance of MM was significantly better than that of the three machine learning models, showing that the MM-based recognition method can effectively capture the correlation information between adjacent nucleotides of G4. Combining MM with the three machine learning models improved the predictive performance of MMG4. Among them, the RF model combined with MM performed best, achieving an area under the receiver operating characteristic curve value of 0.93 and an area under the precision-recall curve value of 0.9. Finally, the study validated the robustness and generalization ability of the model on an independent test dataset.
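The idea of recognizing sequences through transition probabilities between adjacent nucleotides can be sketched as a first-order Markov classifier. This is a generic illustration, not the MMG4 implementation: the model order, the add-one smoothing, the toy training sequences, and all function names are assumptions.

```python
import math

BASES = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities with additive smoothing."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    probs = {}
    for a in BASES:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in BASES}
    return probs


def log_likelihood(seq, probs):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(probs[a][b]) for a, b in zip(seq, seq[1:]))


def llr(seq, pos_model, neg_model):
    """Log-likelihood ratio: positive values favor the positive class."""
    return log_likelihood(seq, pos_model) - log_likelihood(seq, neg_model)


# Toy data, invented for the example only.
pos_model = train_markov(["GGGTGGGAGGGTGGG", "GGGAGGGCGGGTGGG"])
neg_model = train_markov(["ATATCATACTATCAT", "TACATTACATATCTA"])
score = llr("GGGTGGGTGGG", pos_model, neg_model)
```

A sequence is called G4-forming when `score` exceeds a chosen threshold (zero in the simplest case); the score could also serve as an input feature to a downstream classifier such as a random forest.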
Affiliation(s)
- Boyuan Yu
- College of Science, Nanjing Agricultural University, Nanjing, China
| | - Hao Zhang
- College of Science, Nanjing Agricultural University, Nanjing, China
| | - Cong Pian
- College of Science, Nanjing Agricultural University, Nanjing, China
| | - Yuanyuan Chen
- College of Science, Nanjing Agricultural University, Nanjing, China
| |
|
3
|
Teragawa S, Wang L, Liu Y. DeepPGD: A Deep Learning Model for DNA Methylation Prediction Using Temporal Convolution, BiLSTM, and Attention Mechanism. Int J Mol Sci 2024; 25:8146. [PMID: 39125714 PMCID: PMC11311892 DOI: 10.3390/ijms25158146] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2024] [Revised: 06/07/2024] [Accepted: 06/25/2024] [Indexed: 08/12/2024]
Abstract
As part of the field of DNA methylation identification, this study tackles the challenge of enhancing recognition performance by introducing a specialized deep learning framework called DeepPGD. DNA methylation, a crucial biological modification, plays a vital role in gene expression analyses, cellular differentiation, and the study of disease progression. However, accurately and efficiently identifying DNA methylation sites remains a pivotal concern in the field of bioinformatics. The issue addressed in this paper is the presence of methylation in DNA, which is a binary classification problem. To address this, our research aimed to develop a deep learning algorithm capable of more precisely identifying these sites. The DeepPGD framework combined a dual residual structure involving Temporal convolutional networks (TCNs) and bidirectional long short-term memory (BiLSTM) networks to effectively extract intricate DNA structural and sequence features. Additionally, to meet the practical requirements of DNA methylation identification, extensive experiments were conducted across a variety of biological species. The experimental results highlighted DeepPGD's exceptional performance across multiple evaluation metrics, including accuracy, Matthews' correlation coefficient (MCC), and the area under the curve (AUC). In comparison to other algorithms in the same domain, DeepPGD demonstrated superior classification and predictive capabilities across various biological species datasets. This significant advancement in algorithmic prowess not only offers substantial technical support, but also holds potential for research and practical implementation within the DNA methylation identification domain. Moreover, the DeepPGD framework shows potential for application in genomics research, biomedicine, and disease diagnostics, among other fields.
Affiliation(s)
- Shoryu Teragawa
- School of Software, Dalian University of Technology, Dalian 116024, China
| | - Lei Wang
- School of Software, Dalian University of Technology, Dalian 116024, China
| | - Yi Liu
- School of Engineering, University of Southern Queensland, 487-535 West Street, Toowoomba, QLD 4350, Australia
| |
|
4
|
Yu X, Ren J, Long H, Zeng R, Zhang G, Bilal A, Cui Y. iDNA-OpenPrompt: OpenPrompt learning model for identifying DNA methylation. Front Genet 2024; 15:1377285. [PMID: 38689652 PMCID: PMC11058834 DOI: 10.3389/fgene.2024.1377285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2024] [Accepted: 03/07/2024] [Indexed: 05/02/2024]
Abstract
Introduction: DNA methylation is a critical epigenetic modification involving the addition of a methyl group to the DNA molecule, playing a key role in regulating gene expression without changing the DNA sequence. The main difficulty in identifying DNA methylation sites lies in the subtle and complex nature of methylation patterns, which may vary across different tissues, developmental stages, and environmental conditions. Traditional methods for methylation site identification, such as bisulfite sequencing, are typically labor-intensive, costly, and require large amounts of DNA, hindering high-throughput analysis. Moreover, these methods may not always provide the resolution needed to detect methylation at specific sites, especially in genomic regions that are rich in repetitive sequences or have low levels of methylation. Furthermore, current deep learning approaches generally lack sufficient accuracy. Methods: This study introduces the iDNA-OpenPrompt model, leveraging the novel OpenPrompt learning framework. The model combines a prompt template, prompt verbalizer, and Pre-trained Language Model (PLM) to construct the prompt-learning framework for DNA methylation sequences. Moreover, a DNA vocabulary library, BERT tokenizer, and specific label words are also introduced into the model to enable accurate identification of DNA methylation sites. Results and Discussion: An extensive analysis is conducted to evaluate the predictive performance, reliability, and consistency of the iDNA-OpenPrompt model. The experimental outcomes, covering 17 benchmark datasets that include various species and three DNA methylation modifications (4mC, 5hmC, 6mA), consistently indicate that our model surpasses existing approaches in performance and robustness.
Affiliation(s)
- Xia Yu
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Jia Ren
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| | - Haixia Long
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Rao Zeng
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Guoqiang Zhang
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Anas Bilal
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Yani Cui
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| |
|
5
|
Xiang X, Zhou J, Deng Y, Yang X. Identifying the generator matrix of a stationary Markov chain using partially observable data. Chaos 2024; 34:023132. [PMID: 38386908 DOI: 10.1063/5.0156458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Accepted: 01/23/2024] [Indexed: 02/24/2024]
Abstract
Given that most states in real-world systems are inaccessible, it is critical to study the inverse problem of an irreversibly stationary Markov chain regarding how a generator matrix can be identified using minimal observations. The hitting-time distribution of an irreversibly stationary Markov chain is first generalized from a reversible case. The hitting-time distribution is then decoded via the taboo rate, and the results show remarkably that under mild conditions, the generator matrix of a reversible Markov chain or a specific case of irreversibly stationary ones can be identified by utilizing observations from all leaves and two adjacent states in each cycle. Several algorithms are proposed for calculating the generator matrix accurately, and numerical examples are presented to confirm their validity and efficiency. An application to neurophysiology is provided to demonstrate the applicability of such statistics to real-world data. This means that partially observable data can be used to identify the generator matrix of a stationary Markov chain.
Affiliation(s)
- Xuyan Xiang
- School of Mathematics and Physics Science, Hunan University of Arts and Science, Changde 415000, China
- College of Mathematics and Statistics, Hunan Normal University, Changsha 410081, China
| | - Jieming Zhou
- College of Mathematics and Statistics, Hunan Normal University, Changsha 410081, China
| | - Yingchun Deng
- College of Mathematics and Statistics, Hunan Normal University, Changsha 410081, China
| | - Xiangqun Yang
- College of Mathematics and Statistics, Hunan Normal University, Changsha 410081, China
| |
|
6
|
Zhuo L, Wang R, Fu X, Yao X. StableDNAm: towards a stable and efficient model for predicting DNA methylation based on adaptive feature correction learning. BMC Genomics 2023; 24:742. [PMID: 38053026 DOI: 10.1186/s12864-023-09802-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 11/11/2023] [Indexed: 12/07/2023]
Abstract
Background: DNA methylation, instrumental in numerous life processes, underscores the paramount importance of its accurate prediction. Recent studies suggest that deep learning, due to its capacity to extract profound insights, provides a more precise DNA methylation prediction. However, issues related to the stability and generalization performance of these models persist. Results: In this study, we introduce an efficient and stable DNA methylation prediction model. This model incorporates a feature fusion approach, adaptive feature correction technology, and a contrastive learning strategy. The proposed model presents several advantages. First, DNA sequences are encoded at four levels to comprehensively capture intricate information across multi-scale and low-span features. Second, we design a sequence-specific feature correction module that adaptively adjusts the weights of sequence features. This improvement enhances the model's stability and scalability, or its generality. Third, our contrastive learning strategy mitigates the instability issues resulting from sparse data. To validate our model, we conducted multiple sets of experiments on commonly used datasets, demonstrating the model's robustness and stability. Simultaneously, we amalgamate various datasets into a single, unified dataset. The experimental outcomes from this combined dataset substantiate the model's robust adaptability. Conclusions: Our research findings affirm that the StableDNAm model is a general, stable, and effective instrument for DNA methylation prediction. It holds substantial promise for providing invaluable assistance in future methylation-related research and analyses.
Affiliation(s)
- Linlin Zhuo
- College of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou, 325000, China
| | - Rui Wang
- College of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou, 325000, China
| | - Xiangzheng Fu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410000, China.
| | - Xiaojun Yao
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, China.
| |
|
7
|
Huang G, Huang X, Luo W. 6mA-StackingCV: an improved stacking ensemble model for predicting DNA N6-methyladenine site. BioData Min 2023; 16:34. [PMID: 38012796 PMCID: PMC10680251 DOI: 10.1186/s13040-023-00348-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2023] [Accepted: 11/04/2023] [Indexed: 11/29/2023]
Abstract
DNA N6-adenine methylation (N6-methyladenine, 6mA) plays a key regulatory role in cellular processes. Precisely recognizing 6mA sites is important for further exploring its biological functions. Although many computational methods for 6mA site prediction have been developed over the past decades, there is still considerable room for improvement. We present a cross validation-based stacking ensemble model for 6mA site prediction, called 6mA-StackingCV. The 6mA-StackingCV is a type of meta-learning algorithm, which uses the output of cross validation as input to the final classifier. The 6mA-StackingCV reached state-of-the-art performance in the Rosaceae independent test. Extensive tests demonstrated the stability and the flexibility of the 6mA-StackingCV. We implemented the 6mA-StackingCV as a user-friendly web application, which allows one to restrictively choose representations or learning algorithms. This application is freely available at http://www.biolscience.cn/6mA-stackingCV/ . The source code and experimental data are available at https://github.com/Xiaohong-source/6mA-stackingCV .
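The statement that cross-validation output feeds the final classifier can be sketched generically: each base learner predicts every sample while that sample is held out, and those out-of-fold predictions become the meta-features. This is a plain-Python illustration of stacking in general, not the 6mA-StackingCV code; the threshold base learner, the toy data, and all names are hypothetical.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold(X, y, fit, k=3):
    """One prediction per sample, each from a model that never saw it."""
    oof = [None] * len(X)
    for test_idx in kfold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            oof[i] = model(X[i])
    return oof


def fit_mean_threshold(feature):
    """Hypothetical base learner: cut one feature midway between class means."""
    def fit(X, y):
        pos = [x[feature] for x, t in zip(X, y) if t == 1]
        neg = [x[feature] for x, t in zip(X, y) if t == 0]
        cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return lambda x: 1 if x[feature] > cut else 0
    return fit


# Toy data, invented for the example; classes alternate so every fold sees both.
X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9], [0.15, 0.3], [0.85, 0.7]]
y = [0, 1, 0, 1, 0, 1]

# One column of meta-features per base learner; a final classifier trains on these.
meta_features = list(zip(out_of_fold(X, y, fit_mean_threshold(0)),
                         out_of_fold(X, y, fit_mean_threshold(1))))
```

Using held-out predictions rather than in-sample ones keeps the final classifier from learning how well the base learners memorized their own training data.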
Affiliation(s)
- Guohua Huang
- School of Information Technology and Administration, Hunan University of Finance and Economics, Changsha, China.
- College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China.
| | - Xiaohong Huang
- College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
| | - Wei Luo
- College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
| |
|
8
|
Fan XQ, Lin B, Hu J, Guo ZY. I-DNAN6mA: Accurate Identification of DNA N6-Methyladenine Sites Using the Base-Pairing Map and Deep Learning. J Chem Inf Model 2023; 63:1076-1086. [PMID: 36722621 DOI: 10.1021/acs.jcim.2c01465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/02/2023]
Abstract
The recent discovery of numerous DNA N6-methyladenine (6mA) sites has transformed our perception of the roles of 6mA in living organisms. However, our ability to understand them is hampered by our inability to identify 6mA sites rapidly and cost-efficiently by existing experimental methods. Developing a novel method to quickly and accurately identify 6mA sites is critical for speeding up the progress of its function detection and understanding. In this study, we propose a novel computational method, called I-DNAN6mA, to identify 6mA sites and complement experimental methods well, by leveraging the base-pairing rules and a well-designed three-stage deep learning model with pairwise inputs. The performance of our proposed method is benchmarked and evaluated on four species, i.e., Arabidopsis thaliana, Drosophila melanogaster, Rice, and Rosaceae. The experimental results demonstrate that I-DNAN6mA achieves area under the receiver operating characteristic curve values of 0.967, 0.963, 0.947, 0.976, and 0.990, accuracies of 91.5, 92.7, 88.2, 93.8, and 96.2%, and Matthews correlation coefficient values of 0.855, 0.831, 0.763, 0.877, and 0.924 on five benchmark data sets, respectively, and outperforms several existing state-of-the-art methods. To our knowledge, I-DNAN6mA is the first approach to identify 6mA sites using a novel image-like representation of DNA sequences and a deep learning model with pairwise inputs. I-DNAN6mA is expected to be useful for locating functional regions of DNA.
Affiliation(s)
- Xue-Qiang Fan
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China
| | - Bing Lin
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China
| | - Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Zhong-Yi Guo
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China
| |
|
9
|
Jin J, Yu Y, Wang R, Zeng X, Pang C, Jiang Y, Li Z, Dai Y, Su R, Zou Q, Nakai K, Wei L. iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations. Genome Biol 2022; 23:219. [PMID: 36253864 PMCID: PMC9575223 DOI: 10.1186/s13059-022-02780-1] [Citation(s) in RCA: 44] [Impact Index Per Article: 22.0] [Received: 02/22/2022] [Accepted: 10/03/2022] [Indexed: 11/29/2022]
Abstract
In this study, we propose iDNA-ABF, a multi-scale deep biological language learning model that enables the interpretable prediction of DNA methylations based on genomic sequences only. Benchmarking comparisons show that our iDNA-ABF outperforms state-of-the-art methods for different methylation predictions. Importantly, we show the power of deep language learning in capturing both sequential and functional semantics information from background genomes. Moreover, by integrating the interpretable analysis mechanism, we well explain what the model learns, helping us build the mapping from the discovery of important sequential determinants to the in-depth analysis of their biological functions.
Affiliation(s)
- Junru Jin
- School of Software, Shandong University, Jinan, 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Yingying Yu
- School of Software, Shandong University, Jinan, 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Ruheng Wang
- School of Software, Shandong University, Jinan, 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Xin Zeng
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, 108-8639, Japan
- Department of Computational Biology and Medical Sciences, The University of Tokyo, Kashiwa, 277-8563, Japan
| | - Chao Pang
- School of Software, Shandong University, Jinan, 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Yi Jiang
- School of Software, Shandong University, Jinan, 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Zhongshen Li
- School of Software, Shandong University, Jinan, 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Yutong Dai
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, 108-8639, Japan
- Department of Computational Biology and Medical Sciences, The University of Tokyo, Kashiwa, 277-8563, Japan
| | - Ran Su
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China
| | - Kenta Nakai
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, 108-8639, Japan.
- Department of Computational Biology and Medical Sciences, The University of Tokyo, Kashiwa, 277-8563, Japan.
| | - Leyi Wei
- School of Software, Shandong University, Jinan, 250101, China.
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China.
| |
|
10
|
ENet-6mA: Identification of 6mA Modification Sites in Plant Genomes Using ElasticNet and Neural Networks. Int J Mol Sci 2022; 23:8314. [PMID: 35955447 PMCID: PMC9369089 DOI: 10.3390/ijms23158314] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/15/2022] [Revised: 07/22/2022] [Accepted: 07/24/2022] [Indexed: 02/01/2023]
Abstract
N6-methyladenine (6mA) has been recognized as a key epigenetic alteration that affects a variety of biological activities. Precise prediction of 6mA modification sites is essential for understanding the logical consistency of biological activity. There are various experimental methods for identifying 6mA modification sites, but in silico prediction has emerged as a potential option due to the very high cost and labor-intensive nature of experimental procedures. Taking this into consideration, developing an efficient and accurate model for identifying N6-methyladenine is one of the top objectives in the field of bioinformatics. Therefore, we have created an in silico model for the classification of 6mA modifications in plant genomes. ENet-6mA uses three encoding methods, including one-hot, nucleotide chemical properties (NCP), and electron–ion interaction potential (EIIP), which are concatenated and fed as input to ElasticNet for feature reduction, and then the optimized features are given directly to the neural network to get classified. We used a benchmark dataset of rice for five-fold cross-validation testing and three other datasets from plant genomes for cross-species testing purposes. The results show that the model can predict the N6-methyladenine sites very well, even cross-species. Additionally, we separated the datasets into different ratios and calculated the performance using the area under the precision–recall curve (AUPRC), achieving 0.81, 0.79, and 0.50 with 1:10 (positive:negative) samples for F. vesca, R. chinensis, and A. thaliana, respectively.
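The three encodings named in this abstract can be concatenated per nucleotide as in the sketch below. The NCP triples and EIIP values are those commonly quoted in the literature; whether ENet-6mA uses exactly these numbers, this column order, or a flattened layout is an assumption, and the ElasticNet feature-reduction step is not shown.

```python
# Per-nucleotide encodings; values are the commonly cited ones (assumed here).
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

# Nucleotide chemical properties: (ring structure, functional group, hydrogen bond).
NCP = {"A": [1, 1, 1], "C": [0, 1, 0], "G": [1, 0, 0], "T": [0, 0, 1]}

# Electron-ion interaction potential of each nucleotide.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def encode(seq: str) -> list:
    """Concatenate one-hot (4), NCP (3), and EIIP (1) values for every base."""
    features = []
    for base in seq.upper():
        features.extend(ONE_HOT[base])
        features.extend(NCP[base])
        features.append(EIIP[base])
    return features


vector = encode("ACGT")  # 4 bases x 8 values = 32 features
```

The flattened vector grows by eight values per base, which is why a feature-reduction step is useful before the classifier.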
|
11
|
Hesami M, Alizadeh M, Jones AMP, Torkamaneh D. Machine learning: its challenges and opportunities in plant system biology. Appl Microbiol Biotechnol 2022; 106:3507-3530. [PMID: 35575915 DOI: 10.1007/s00253-022-11963-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 03/14/2022] [Revised: 03/14/2022] [Accepted: 05/07/2022] [Indexed: 12/25/2022]
Abstract
Sequencing technologies are evolving at a rapid pace, enabling the generation of massive amounts of data in multiple dimensions (e.g., genomics, epigenomics, transcriptomics, metabolomics, proteomics, and single-cell omics) in plants. To provide comprehensive insights into the complexity of plant biological systems, it is important to integrate different omics datasets. Although recent advances in computational analytical pipelines have enabled efficient and high-quality exploration and exploitation of single omics data, the integration of multidimensional, heterogeneous, and large datasets (i.e., multi-omics) remains a challenge. In this regard, machine learning (ML) offers promising approaches to integrate large datasets and to recognize fine-grained patterns and relationships. Nevertheless, these approaches require rigorous optimization to process multi-omics-derived datasets. In this review, we discuss the main concepts of machine learning as well as the key challenges and solutions related to the big data derived from plant system biology. We also provide in-depth insight into the principles of data integration using ML, as well as challenges and opportunities in different contexts including multi-omics, single-cell omics, protein function, and protein-protein interaction. KEY POINTS: • The key challenges and solutions related to the big data derived from plant system biology have been highlighted. • Different methods of data integration have been discussed. • Challenges and opportunities of the application of machine learning in plant system biology have been highlighted and discussed.
Affiliation(s)
- Mohsen Hesami
- Department of Plant Agriculture, University of Guelph, Guelph, ON, N1G 2W1, Canada
| | - Milad Alizadeh
- Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | | | - Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, G1V 0A6, Canada
- Institut de Biologie Intégrative Et Des Systèmes (IBIS), Université Laval, Québec City, QC, G1V 0A6, Canada
| |
|
12
|
Teng Z, Zhao Z, Li Y, Tian Z, Guo M, Lu Q, Wang G. i6mA-Vote: Cross-Species Identification of DNA N6-Methyladenine Sites in Plant Genomes Based on Ensemble Learning With Voting. Front Plant Sci 2022; 13:845835. [PMID: 35237293 PMCID: PMC8882731 DOI: 10.3389/fpls.2022.845835] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 12/30/2021] [Accepted: 01/24/2022] [Indexed: 05/17/2023]
Abstract
DNA N6-Methyladenine (6mA) is a common epigenetic modification that plays significant roles in the growth and development of plants. Identifying 6mA sites is crucial for elucidating the functions of 6mA. In this article, a novel model named i6mA-vote is developed to predict 6mA sites in plants. Firstly, DNA sequences were coded into six feature vectors with diverse strategies based on the density, physicochemical properties, and position of nucleotides, respectively. To find the best coding strategy, the feature vectors were compared on several machine learning classifiers. The results suggested that the position of nucleotides has a significant positive effect on 6mA site identification. Thus, the dinucleotide one-hot strategy, which describes the position characteristics of nucleotides well, was employed to extract DNA features in our method. Secondly, DNA sequences of Rosaceae were randomly divided into a training dataset and a test dataset. Finally, i6mA-vote was constructed by combining five different base classifiers under a majority voting strategy and trained on the Rosaceae training dataset. The i6mA-vote was evaluated on the task of predicting 6mA sites from the genomes of Rosaceae, Rice, and Arabidopsis separately. In Rosaceae, i6mA-vote achieved 0.955 accuracy (ACC), 0.909 Matthews correlation coefficient (MCC), 0.955 sensitivity (SN), and 0.954 specificity (SP). These indicators, in the order ACC, MCC, SN, SP, were 0.882, 0.774, 0.961, and 0.803 on Rice and 0.798, 0.617, 0.666, and 0.929 on Arabidopsis. According to these indicators, our method was effective and better than the other methods compared. The results also showed that i6mA-vote performs well in 6mA site prediction not only within a species but also across plant species. Moreover, the specificity is distinctly lower than the sensitivity in Rice, while the opposite holds in Arabidopsis. This may result from sequence similarity among Rosaceae, Rice, and Arabidopsis.
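The four indicators and the majority vote mentioned in this abstract follow standard definitions, sketched below. The confusion counts are hypothetical: they assume a balanced test set of 1000 positives and 1000 negatives, chosen so that SN and SP match the reported Rosaceae values; the actual test-set sizes are not given in the abstract.

```python
import math


def metrics(tp, fn, tn, fp):
    """Return (ACC, MCC, SN, SP) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)  # sensitivity: recall on positives
    sp = tn / (tn + fp)  # specificity: recall on negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc, sn, sp


def majority_vote(votes):
    """Label 1 if more than half of the binary base-classifier votes are 1."""
    return 1 if sum(votes) * 2 > len(votes) else 0


# Hypothetical balanced counts consistent with SN = 0.955 and SP = 0.954.
acc, mcc, sn, sp = metrics(tp=955, fn=45, tn=954, fp=46)
label = majority_vote([1, 0, 1, 1, 0])  # five base classifiers, three vote 1
```

Under the balanced-set assumption the computed MCC comes out at about 0.909, in line with the reported figure; with an odd number of base classifiers, as in the five used here, a binary vote can never tie.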
Affiliation(s)
- Zhixia Teng
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zhengnan Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yanjuan Li
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
| | - Zhen Tian
- College of Information Engineering, Zhengzhou University, Zhengzhou, China
| | - Maozu Guo
- College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
| | - Qianzi Lu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- *Correspondence: Qianzi Lu,
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Guohua Wang,
|
13
|
Zhang Y, Liu Y, Xu J, Wang X, Peng X, Song J, Yu DJ. Leveraging the attention mechanism to improve the identification of DNA N6-methyladenine sites. Brief Bioinform 2021; 22:bbab351. [PMID: 34459479 PMCID: PMC8575024 DOI: 10.1093/bib/bbab351] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 06/07/2021] [Revised: 08/02/2021] [Accepted: 08/09/2021] [Indexed: 11/12/2022] Open
Abstract
DNA N6-methyladenine is an important type of DNA modification that plays important roles in multiple biological processes. Despite the recent progress in developing DNA 6mA site prediction methods, several challenges remain to be addressed. For example, although the hand-crafted features are interpretable, they contain redundant information that may bias the model training and have a negative impact on the trained model. Furthermore, although deep learning (DL)-based models can perform feature extraction and classification automatically, they lack the interpretability of the crucial features learned by those models. As such, considerable research efforts have been focused on achieving the trade-off between the interpretability and straightforwardness of DL neural networks. In this study, we develop two new DL-based models for improving the prediction of N6-methyladenine sites, termed LA6mA and AL6mA, which use bidirectional long short-term memory to respectively capture the long-range information and self-attention mechanism to extract the key position information from DNA sequences. The performance of the two proposed methods is benchmarked and evaluated on the two model organisms Arabidopsis thaliana and Drosophila melanogaster. On the two benchmark datasets, LA6mA achieves an area under the receiver operating characteristic curve (AUROC) value of 0.962 and 0.966, whereas AL6mA achieves an AUROC value of 0.945 and 0.941, respectively. Moreover, an in-depth analysis of the attention matrix is conducted to interpret the important information, which is hidden in the sequence and relevant for 6mA site prediction. The two novel pipelines developed for DNA 6mA site prediction in this work will facilitate a better understanding of the underlying principle of DL-based DNA methylation site prediction and its future applications.
Affiliation(s)
- Ying Zhang
- School of Computer Science and Engineering at Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Yan Liu
- School of Computer Science and Engineering at Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Jian Xu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Xiaoyu Wang
- Monash Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Xinxin Peng
- Monash Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
|
14
|
Rahman CR, Amin R, Shatabda S, Toaha MSI. A convolution based computational approach towards DNA N6-methyladenine site identification and motif extraction in rice genome. Sci Rep 2021; 11:10357. [PMID: 33990665 PMCID: PMC8121938 DOI: 10.1038/s41598-021-89850-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/07/2020] [Accepted: 05/04/2021] [Indexed: 12/23/2022] Open
Abstract
DNA N6-methylation (6mA) in the adenine nucleotide is a post-replication modification responsible for many biological functions. Automated and accurate computational methods can help to identify 6mA sites in long genomes, saving significant time and money. Our study develops a convolutional neural network (CNN) based tool, i6mA-CNN, capable of identifying 6mA sites in the rice genome. Our model coordinates among multiple types of features such as a PseAAC (Pseudo Amino Acid Composition)-inspired customized feature vector, multiple one-hot representations and dinucleotide physicochemical properties. It achieves an auROC (area under Receiver Operating Characteristic curve) score of 0.98 with an overall accuracy of 93.97% using fivefold cross-validation on the benchmark dataset. Finally, we evaluate our model on three other plant genome 6mA site identification test datasets. Results suggest that our proposed tool is able to generalize its ability of 6mA site identification on plant genomes irrespective of plant species. An algorithm for potential motif extraction and a feature importance analysis procedure are two by-products of this research. The web tool for this research can be found at: https://cutt.ly/dgp3QTR.
Affiliation(s)
| | - Ruhul Amin
- United International University, Dhaka, Bangladesh
|
15
|
Chachar S, Liu J, Zhang P, Riaz A, Guan C, Liu S. Harnessing Current Knowledge of DNA N6-Methyladenosine From Model Plants for Non-model Crops. Front Genet 2021; 12:668317. [PMID: 33995495 PMCID: PMC8118384 DOI: 10.3389/fgene.2021.668317] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 02/18/2021] [Accepted: 04/06/2021] [Indexed: 12/12/2022] Open
Abstract
Epigenetic modifications alter gene activity and function by changing the chromosomal architecture through DNA methylation/demethylation or histone modifications, without changing the DNA sequence. In plants, DNA cytosine methylation (5mC) is vital for various pathways such as gene regulation, transposon suppression, DNA repair, replication, transcription, and recombination. Thanks to recent advances in high throughput sequencing (HTS) technologies for epigenomic “Big Data” generation, accumulated studies have revealed the occurrence of another novel DNA methylation mark, N6-methyladenosine (6mA), which is highly present on gene bodies and mainly activates gene expression in model plants such as the eudicot Arabidopsis (Arabidopsis thaliana) and the monocot rice (Oryza sativa). However, in non-model crops, the occurrence and importance of 6mA remain largely unknown, with only limited reports in a few species, such as Rosaceae (wild strawberry) and soybean (Glycine max). Given the aforementioned vital roles of 6mA in plants, we summarize the latest advances in DNA 6mA modification and investigate the historical, known and vital functions of 6mA in plants. We also consider advanced artificial-intelligence biotechnologies that improve extraction and prediction of 6mA concepts. In this Review, we discuss the potential challenges that may hinder exploitation of 6mA, and give future goals of 6mA from model plants to non-model crops.
Affiliation(s)
- Sadaruddin Chachar
- State Key Laboratory of Crop Stress Biology for Arid Areas, College of Horticulture, Northwest A&F University, Yangling, China.,Department of Biotechnology, Faculty of Crop Production, Sindh Agriculture University, Tandojam, Pakistan
| | - Jingrong Liu
- College of Mathematics and Statistics, Northwest Normal University, Lanzhou, China
| | - Pingxian Zhang
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Adeel Riaz
- Department of Biochemistry, Faculty of Life Sciences, University of Okara, Okara, Pakistan
| | - Changfei Guan
- State Key Laboratory of Crop Stress Biology for Arid Areas, College of Horticulture, Northwest A&F University, Yangling, China
| | - Shuyuan Liu
- State Key Laboratory of Crop Stress Biology for Arid Areas, College of Horticulture, Northwest A&F University, Yangling, China
|
16
|
i6mA-VC: A Multi-Classifier Voting Method for the Computational Identification of DNA N6-methyladenine Sites. Interdiscip Sci 2021; 13:413-425. [PMID: 33834381 DOI: 10.1007/s12539-021-00429-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 12/19/2020] [Revised: 03/26/2021] [Accepted: 03/29/2021] [Indexed: 12/14/2022]
Abstract
DNA N6-methyladenine (6mA), as an essential component of epigenetic modification, cannot be neglected in genetic regulation mechanisms. The efficient and accurate prediction of 6mA sites is beneficial to the development of biological genetics. Biochemical experimental methods are considered to be time-consuming and laborious. Most of the established machine learning methods rely on a single dataset. Although some of them have achieved cross-species prediction, their results are not satisfactory. Therefore, we designed a novel statistical model called i6mA-VC to improve the accuracy of 6mA site prediction. On the one hand, kmer and binary encoding are applied to extract features, and then the gradient boosting decision tree (GBDT) embedded method is applied as the feature selection strategy. On the other hand, DNA sequences are represented by vectors through the feature extraction method of ring-function-hydrogen-chemical properties (RFHCP) and the feature selection strategy of ExtraTree. After fusing the two optimal features, a voting classifier based on gradient boosting decision tree (GBDT), light gradient boosting machine (LightGBM) and multilayer perceptron classifier (MLPC) is constructed for final classification and prediction. The accuracies on the Rice dataset and the M. musculus dataset with five-fold cross-validation are 0.888 and 0.967, respectively. The cross-species dataset is selected as the independent testing dataset, and the accuracy reaches 0.848. Through rigorous experiments, it is demonstrated that the proposed predictor is convincing and applicable. The development of the i6mA-VC predictor will become an effective way for the recognition of N6-methyladenine sites, and it will also be beneficial for biological geneticists to further study gene expression and DNA modification. In addition, an accessible web-server for i6mA-VC is available from http://www.zhanglab.site/.
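The voting step described above can be sketched in a few lines of Python. The abstract does not say whether i6mA-VC uses hard (label) or soft (probability) voting, so this sketch assumes hard majority voting; the function name `majority_vote` and the tie-breaking rule are hypothetical choices for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists by hard majority voting.

    `predictions` holds one list of predicted labels per base classifier,
    all of the same length. Ties are broken in favour of the smallest
    label, which is an arbitrary convention for this sketch.
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same samples")

    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted
```

With three base classifiers, as in the abstract, a sample is labelled positive when at least two of them predict the positive class.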
|
17
|
Pian C, Yang Z, Yang Y, Zhang L, Chen Y. Identifying RNA N6-Methyladenine Sites in Three Species Based on a Markov Model. Front Genet 2021; 12:650803. [PMID: 33815484 PMCID: PMC8017269 DOI: 10.3389/fgene.2021.650803] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/08/2021] [Accepted: 03/03/2021] [Indexed: 11/17/2022] Open
Abstract
N6-methyladenosine (m6A), the most common posttranscriptional modification in eukaryotic mRNAs, plays an important role in mRNA splicing, editing, stability, degradation, etc. Since the methylation state is dynamic, methylation sequencing needs to be carried out over different time periods, which brings some difficulties to identify the RNA methyladenine sites. Thus, it is necessary to develop a fast and accurate method to identify the RNA N6-methyladenosine sites in the transcriptome. In this study, we use first-order and second-order Markov models to identify RNA N6-methyladenine sites in three species (Saccharomyces cerevisiae, mouse, and Homo sapiens). These two methods can fully consider the correlation between adjacent nucleotides. The results show that the performance of our method is better than that of other existing methods. Furthermore, the codons encoded by three nucleotides have biases in mRNA, and a second-order Markov model can capture this kind of information exactly. This may be the main reason why the performance of the second-order Markov model is better than that of the first-order Markov model in the m6A prediction problem. In addition, we provide a corresponding web tool called MM-m6APred.
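One common way to turn Markov models into a sequence classifier is to train one chain on positive sequences and one on negative sequences, then score a query by the log-likelihood ratio. The sketch below follows that generic recipe; it is not taken from the MM-m6APred paper, and the names `train_markov` and `log_odds`, the add-one smoothing, and the omission of the initial-state term are all assumptions.

```python
import math
from collections import defaultdict

ALPHABET = "ACGU"


def train_markov(seqs, order=1, pseudocount=1.0):
    """Estimate smoothed transition log-probabilities of an order-k chain.

    Returns a function log_prob(context, next_base). Contexts never seen
    in training fall back to the uniform distribution via the pseudocount.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def log_prob(context, next_base):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return math.log((row.get(next_base, 0.0) + pseudocount) / total)

    return log_prob


def log_odds(seq, pos_model, neg_model, order=1):
    """Sum of per-transition log-likelihood ratios; > 0 favours the positive class.

    The probability of the first `order` bases is ignored in this sketch.
    """
    return sum(
        pos_model(seq[i - order:i], seq[i]) - neg_model(seq[i - order:i], seq[i])
        for i in range(order, len(seq))
    )
```

Setting `order=2` conditions each base on the two preceding ones, which is the setting the abstract associates with capturing codon-level bias.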
Affiliation(s)
- Cong Pian
- College of Science, Nanjing Agricultural University, Nanjing, China
| | - Zhixin Yang
- College of Science, Nanjing Agricultural University, Nanjing, China
| | - Yuqian Yang
- College of Science, Nanjing Agricultural University, Nanjing, China
| | - Liangyun Zhang
- College of Science, Nanjing Agricultural University, Nanjing, China
| | - Yuanyuan Chen
- College of Science, Nanjing Agricultural University, Nanjing, China
|
18
|
Li Z, Jiang H, Kong L, Chen Y, Lang K, Fan X, Zhang L, Pian C. Deep6mA: A deep learning framework for exploring similar patterns in DNA N6-methyladenine sites across different species. PLoS Comput Biol 2021; 17:e1008767. [PMID: 33600435 PMCID: PMC7924747 DOI: 10.1371/journal.pcbi.1008767] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 03/11/2020] [Revised: 03/02/2021] [Accepted: 02/03/2021] [Indexed: 12/25/2022] Open
Abstract
N6-methyladenine (6mA) is an important DNA modification form associated with a wide range of biological processes. Accurately identifying 6mA sites on a genomic scale is crucial for understanding 6mA’s biological functions. However, the existing experimental techniques for detecting 6mA sites are cost-ineffective, which implies a great need to develop new computational methods for this problem. In this paper, we developed, without requiring any prior knowledge of 6mA and manually crafted sequence features, a deep learning framework named Deep6mA to identify DNA 6mA sites, and its performance is superior to other DNA 6mA prediction tools. Specifically, the 5-fold cross-validation on a benchmark dataset of rice gives the sensitivity and specificity of Deep6mA as 92.96% and 95.06%, respectively, and the overall prediction accuracy is 94%. Importantly, we find that the sequences with 6mA sites share similar patterns across different species. The model trained with rice data predicts well the 6mA sites of three other species: Arabidopsis thaliana, Fragaria vesca and Rosa chinensis, with a prediction accuracy over 90%. In addition, we find that (1) 6mA tends to occur at GAGG motifs, which means the sequence near the 6mA site may be conserved; (2) 6mA is enriched in the TATA box of the promoter, which may be the main source of its regulating downstream gene expression. DNA N6-methyladenine (6mA) is a newly recognized methylation modification in eukaryotes. It exists widely and conservatively in organisms, and its modification level changes dynamically in the whole life cycle. This study proposes an algorithm based on a deep learning framework including LSTM and CNN to predict 6mA sites. The results showed that our method could accurately predict the 6mA sites in different species, which means DNA sub-sequences containing 6mA sites among species have certain conservation. Importantly, we found that 6mA methylation in most different species is more likely to occur on the GAGG motif. In addition, we also found that 6mA is rich in the promoter’s TATA box, which may be a mechanism of regulating downstream gene expression.
Affiliation(s)
- Zutan Li
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, China
| | - Hangjin Jiang
- Center for Data Science, Zhejiang University, Hangzhou, China
| | - Lingpeng Kong
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, China
| | - Yuanyuan Chen
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, China
| | - Kun Lang
- College of Information Science & Technology, Nanjing Agricultural University, Nanjing, China
| | - Xiaodan Fan
- Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
| | - Liangyun Zhang
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, China
- * E-mail: (LYZ); (CP)
| | - Cong Pian
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, China
- * E-mail: (LYZ); (CP)
|
19
|
Lv Z, Ding H, Wang L, Zou Q. A Convolutional Neural Network Using Dinucleotide One-hot Encoder for identifying DNA N6-Methyladenine Sites in the Rice Genome. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.09.056] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Indexed: 12/13/2022]
|
20
|
Ahmed S, Hossain Z, Uddin M, Taherzadeh G, Sharma A, Shatabda S, Dehzangi A. Accurate prediction of RNA 5-hydroxymethylcytosine modification by utilizing novel position-specific gapped k-mer descriptors. Comput Struct Biotechnol J 2020; 18:3528-3538. [PMID: 33304452 PMCID: PMC7701324 DOI: 10.1016/j.csbj.2020.10.032] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/24/2020] [Revised: 10/30/2020] [Accepted: 10/30/2020] [Indexed: 12/13/2022] Open
Abstract
RNA modification is an essential step towards generation of new RNA structures. Such modification is potentially able to modify RNA function or its stability. Among different modifications, 5-Hydroxymethylcytosine (5hmC) modification of RNA exhibits significant potential for a series of biological processes. Understanding the distribution of 5hmC in RNA is essential to determine its biological functionality. Although conventional sequencing techniques allow broad identification of 5hmC, they are both time-consuming and resource-intensive. In this study, we propose a new computational tool called iRNA5hmC-PS to tackle this problem. To build iRNA5hmC-PS we extract a set of novel sequence-based features called Position-Specific Gapped k-mer (PSG k-mer) to obtain maximum sequential information. Our feature analysis shows that our proposed PSG k-mer features contain vital information for the identification of 5hmC sites. We also use a group-wise feature importance calculation strategy to select a small subset of features containing maximum discriminative information. Our experimental results demonstrate that iRNA5hmC-PS is able to enhance the prediction performance dramatically. iRNA5hmC-PS achieves 78.3% prediction performance, which is 12.8% better than that reported in previous studies. iRNA5hmC-PS is publicly available as an online tool at http://103.109.52.8:81/iRNA5hmC-PS. Its benchmark dataset, source codes, and documentation are available at https://github.com/zahid6454/iRNA5hmC-PS.
Affiliation(s)
- Sajid Ahmed
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Zahid Hossain
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Mahtab Uddin
- Department of Natural Science, United International University, Dhaka, Bangladesh
| | - Ghazaleh Taherzadeh
- Institute for Bioscience and Biotechnology Research, University of Maryland, College Park, MD 20742, USA
| | - Alok Sharma
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD 4111, Australia.,Department of Medical Science Mathematics, Tokyo Medical and Dental University (TMDU), Tokyo, Japan.,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.,School of Engineering and Physics, University of the South Pacific, Suva, Fiji
| | - Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Abdollah Dehzangi
- Department of Computer Science, Rutgers University, Camden, NJ 08102, USA.,Center for Computational and Integrative Biology, Rutgers University, Camden, NJ 08102, USA
|
21
|
Xu H, Hu R, Jia P, Zhao Z. 6mA-Finder: a novel online tool for predicting DNA N6-methyladenine sites in genomes. Bioinformatics 2020; 36:3257-3259. [PMID: 32091591 DOI: 10.1093/bioinformatics/btaa113] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 11/23/2019] [Revised: 01/29/2020] [Accepted: 02/14/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION DNA N6-methyladenine (6mA) has recently been found to be an essential epigenetic modification, playing roles in a variety of cellular processes. The abnormal status of DNA 6mA modification has been reported in cancer and other diseases. The annotation of 6mA marks in the genome is the first crucial step to explore the underlying molecular mechanisms including its regulatory roles. RESULTS We present a novel online DNA 6mA site tool, 6mA-Finder, which incorporates seven types of sequence-derived information and three physicochemical-based features through a recursive feature elimination strategy. Our multiple cross-validations indicate the promising accuracy and robustness of our model. 6mA-Finder outperforms its peer tools in general and species-specific 6mA site prediction, suggesting it can provide a useful resource for further experimental investigation of DNA 6mA modification. AVAILABILITY AND IMPLEMENTATION https://bioinfo.uth.edu/6mA_Finder. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Haodong Xu
- School of Biomedical Informatics, Center for Precision Health
| | - Ruifeng Hu
- School of Biomedical Informatics, Center for Precision Health
| | - Peilin Jia
- School of Biomedical Informatics, Center for Precision Health
| | - Zhongming Zhao
- School of Biomedical Informatics, Center for Precision Health.,MD Anderson Cancer Center, UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA.,Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
|
22
|
SSH: A Tool for Predicting Hydrophobic Interaction of Monoclonal Antibodies Using Sequences. Biomed Res Int 2020; 2020:3508107. [PMID: 32596302 PMCID: PMC7288208 DOI: 10.1155/2020/3508107] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/29/2020] [Revised: 04/28/2020] [Accepted: 05/13/2020] [Indexed: 12/31/2022]
Abstract
Therapeutic antibodies are one of the most important parts of the pharmaceutical industry. They are widely used in treating various diseases such as autoimmune diseases, cancer, inflammation, and infectious diseases. Their development, however, is often brought to a standstill, or takes longer and becomes more expensive, because of hydrophobicity problems. Hydrophobic interactions can cause problems with half-life, drug administration, and immunogenicity at all stages of antibody drug development. Some of the most widely accepted and used technologies for determining the hydrophobic interactions of antibodies include standup monolayer adsorption chromatography (SMAC), salt-gradient affinity-capture self-interaction nanoparticle spectroscopy (SGAC-SINS), and hydrophobic interaction chromatography (HIC). However, measuring SMAC, SGAC-SINS, and HIC for hundreds of antibody drug candidates is time-consuming and costly. To save time and money, a predictor called SSH is developed. Based on the antibody's sequence only, it can predict the hydrophobic interactions of monoclonal antibodies (mAbs). Using leave-one-out cross-validation, SSH achieved 91.226% accuracy, 96.396% sensitivity or recall, 84.196% specificity, 87.754% precision, 0.828 Matthews correlation coefficient (MCC), 0.919 F-score, and 0.961 area under the receiver operating characteristic (ROC) curve (AUC).
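The threshold-based metrics listed in the abstract (accuracy, sensitivity, specificity, precision, MCC, and F-score) all follow from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions, and the example counts in the usage note are invented rather than taken from the paper.

```python
import math


def _safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = _safe_div(tp, tp + fn)   # also called recall
    specificity = _safe_div(tn, tn + fp)
    precision = _safe_div(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": _safe_div(2 * precision * sensitivity, precision + sensitivity),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_den),
    }
```

For example, `binary_metrics(tp=8, fp=2, tn=7, fn=3)` gives an accuracy of 0.75 and a precision of 0.8. AUC is not included because it needs the ranked prediction scores, not just the counts.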
|
23
|
Liu Y, Chen D, Su R, Chen W, Wei L. iRNA5hmC: The First Predictor to Identify RNA 5-Hydroxymethylcytosine Modifications Using Machine Learning. Front Bioeng Biotechnol 2020; 8:227. [PMID: 32296686 PMCID: PMC7137033 DOI: 10.3389/fbioe.2020.00227] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 01/15/2020] [Accepted: 03/05/2020] [Indexed: 01/27/2023] Open
Abstract
RNA 5-hydroxymethylcytosine (5hmC) modification plays an important role in a series of biological processes. Characterization of its distributions in transcriptome is fundamentally important to reveal the biological functions of 5hmC. Sequencing-based technologies allow the high-throughput identification of 5hmC; however, they are labor-intensive, time-consuming, as well as expensive. Thus, there is an urgent need to develop more effective and efficient computational methods, at least complementary to the high-throughput technologies. In this study, we developed iRNA5hmC, a computational predictive protocol to identify RNA 5hmC sites using machine learning. In this predictor, we introduced a sequence-based feature algorithm consisting of two feature representations, (1) k-mer spectrum and (2) positional nucleotide binary vector, to capture the sequential characteristics of 5hmC sites. Afterward, we utilized a two-stage feature space optimization strategy to improve the feature representation ability, and trained a predictive model using support vector machine (SVM). Our feature analysis results showed that feature optimization can help to capture the most discriminative features. As compared to well-known existing feature descriptors, our proposed representations can more accurately separate true 5hmC from non-5hmC sites. To the best of our knowledge, iRNA5hmC is the first RNA 5hmC predictor that enables to make predictions based on RNA primary sequences only, without any need of prior experimental knowledge. Importantly, we have established an easy-to-use webserver which is currently available at http://server.malab.cn/iRNA5hmC. We expect it has potential to be a useful tool for the prediction of 5hmC sites.
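The two feature representations named in the abstract, the k-mer spectrum and the positional nucleotide binary vector, can be sketched as follows. The abstract does not give the exact normalisation or the value of k, so the function names `kmer_spectrum` and `positional_binary`, the frequency normalisation, and the default `k=2` are assumptions for illustration.

```python
from itertools import product


def kmer_spectrum(seq, k=2, alphabet="ACGU"):
    """Frequency of every possible k-mer over the overlapping windows of seq.

    Returns a vector of length len(alphabet) ** k. Windows containing
    symbols outside the alphabet are skipped but still counted in the
    denominator.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_windows = len(seq) - k + 1
    for i in range(n_windows):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    total = max(n_windows, 1)
    return [counts[kmer] / total for kmer in kmers]


def positional_binary(seq, alphabet="ACGU"):
    """One-hot encode each position: 4 bits per nucleotide, in sequence order."""
    return [1 if base == symbol else 0 for base in seq for symbol in alphabet]


def combined_features(seq, k=2):
    """Concatenate both representations into one feature vector."""
    return kmer_spectrum(seq, k) + positional_binary(seq)
```

The k-mer spectrum summarises composition regardless of position, while the binary vector records which base sits at each position, so the two are complementary inputs for a classifier such as the SVM mentioned above.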
Affiliation(s)
- Yuan Liu
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Dasheng Chen
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Ran Su
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Wei Chen
- Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan, China.,Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan, China.,Joint SDU-NTU Centre for Artificial Intelligence Research, Shandong University, Jinan, China
|
24
|
Wang HT, Xiao FH, Li GH, Kong QP. Identification of DNA N6-methyladenine sites by integration of sequence features. Epigenetics Chromatin 2020; 13:8. [PMID: 32093759 PMCID: PMC7038560 DOI: 10.1186/s13072-020-00330-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/14/2019] [Accepted: 02/03/2020] [Indexed: 02/21/2023] Open
Abstract
Background An increasing number of nucleic acid modifications have been profiled with the development of sequencing technologies. DNA N6-methyladenine (6mA), which is a prevalent epigenetic modification, plays important roles in a series of biological processes. So far, identification of DNA 6mA relies primarily on time-consuming and expensive experimental approaches. However, in silico methods can be implemented to conduct preliminary screening to save experimental resources and time, especially given the rapid accumulation of sequencing data. Results In this study, we constructed a 6mA predictor, p6mA, from a series of sequence-based features, including physicochemical properties, position-specific triple-nucleotide propensity (PSTNP), and electron–ion interaction pseudopotential (EIIP). We performed maximum relevance maximum distance (MRMD) analysis to select key features and used the Extreme Gradient Boosting (XGBoost) algorithm to build our predictor. Results demonstrated that p6mA outperformed other existing predictors using different datasets. Conclusions p6mA can predict the methylation status of DNA adenines, using only sequence files. It may be used as a tool to help the study of 6mA distribution pattern. Users can download it from https://github.com/Konglab404/p6mA.
Affiliation(s)
- Hao-Tian Wang
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China.,Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China.,Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China.,Kunming College of Life Science, University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Fu-Hui Xiao
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China.,Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China.,Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China
| | - Gong-Hua Li
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China.,Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China.,Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China
| | - Qing-Peng Kong
- State Key Laboratory of Genetic Resources and Evolution/Key Laboratory of Healthy Aging Research of Yunnan Province, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, China. .,Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China. .,Kunming Key Laboratory of Healthy Aging Study, Kunming, 650223, China. .,KIZ/CUHK Joint Laboratory of Bioresources and Molecular Research in Common Diseases, Kunming, 650223, China.
|
25
|
Karanthamalai J, Chodon A, Chauhan S, Pandi G. DNA N6-Methyladenine Modification in Plant Genomes-A Glimpse into Emerging Epigenetic Code. Plants (Basel) 2020; 9:E247. [PMID: 32075056 PMCID: PMC7076483 DOI: 10.3390/plants9020247] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 01/20/2020] [Revised: 02/09/2020] [Accepted: 02/11/2020] [Indexed: 02/08/2023]
Abstract
N6-methyladenine (6mA) is a DNA base modification at the 6th nitrogen position; recently, it has resurfaced as a potential reversible epigenetic mark in eukaryotes. Although present, 6mA was long considered to be absent because of its undetectable level. However, with new advancements in methods, considerable 6mA distribution has been identified across the plant genome. Unlike 5-methylcytosine (5mC) in the gene promoter, 6mA does not have a definitive role in repression but appears to have divergent regulatory effects on gene expression. Though little is known about 6mA, the available evidence suggests that it functions in plant development, tissue differentiation, and the regulation of gene expression. The current review article emphasizes the research advances in DNA 6mA modifications, identification, available databases, analysis tools and its significance in plant development, cellular functions and future perspectives of research.
Affiliation(s)
| | | | | | - Gopal Pandi
- Department of Plant Biotechnology, School of Biotechnology, Madurai Kamaraj University, Madurai 625021, Tamil Nadu, India; (J.K.); (A.C.); (S.C.)
| |
|
26
|
Huang Q, Zhang J, Wei L, Guo F, Zou Q. 6mA-RicePred: A Method for Identifying DNA N6-Methyladenine Sites in the Rice Genome Based on Feature Fusion. Front Plant Sci 2020; 11:4. [PMID: 32076430 PMCID: PMC7006724 DOI: 10.3389/fpls.2020.00004] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 12/04/2019] [Accepted: 01/06/2020] [Indexed: 06/01/2023]
Abstract
Motivation: The biological function of DNA N6-methyladenine (6mA) in plants is largely unknown. Rice is one of the most important crops worldwide and is a model species for molecular and genetic studies. Few methods exist for 6mA site recognition in the rice genome, and an effective computational method is needed. Results: In this paper, we propose a new computational method called 6mA-Pred to identify 6mA sites in the rice genome. 6mA-Pred employs a feature fusion method that combines advantageous features from other methods to obtain a new feature for identifying 6mA sites. The method achieved an accuracy of 87.27% in identifying 6mA sites under 10-fold cross-validation and an accuracy of 85.6% on independent test sets.
Affiliation(s)
- Qianfei Huang
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Jun Zhang
- Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Leyi Wei
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| |
|
27
|
Yu H, Dai Z. SNNRice6mA: A Deep Learning Method for Predicting DNA N6-Methyladenine Sites in Rice Genome. Front Genet 2019; 10:1071. [PMID: 31681441 PMCID: PMC6797597 DOI: 10.3389/fgene.2019.01071] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 07/15/2019] [Accepted: 10/04/2019] [Indexed: 01/08/2023] Open
Abstract
DNA N6-methyladenine (6mA) is an important epigenetic modification that is involved in many biological regulatory processes. An accurate and reliable method for 6mA identification can provide better insight into the regulatory mechanism of the modification. Although many experimental techniques have been proposed to identify 6mA sites genome-wide, these techniques are time-consuming and laborious. Recently, several machine learning methods have been developed to identify 6mA sites genome-wide. However, there is room to improve their performance in predicting 6mA sites in the rice genome. In this paper, we developed a simple and lightweight deep learning model to identify DNA 6mA sites in the rice genome. Our model needs no prior knowledge of 6mA and no manually crafted sequence features. We built the model on two rice 6mA benchmark datasets, on which it achieved average prediction accuracies of ∼93% and ∼92%, respectively. We compared our method with existing 6mA prediction tools, and the results show that our model outperforms the state-of-the-art methods.
Affiliation(s)
- Haitao Yu
- School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China
| | - Zhiming Dai
- School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China; Guangdong Province Key Laboratory of Big Data Analysis and Processing, Sun Yat-Sen University, Guangzhou, China
| |
|