1
|
Nerella S, Bandyopadhyay S, Zhang J, Contreras M, Siegel S, Bumin A, Silva B, Sena J, Shickel B, Bihorac A, Khezeli K, Rashidi P. Transformers and large language models in healthcare: A review. Artif Intell Med 2024; 154:102900. [PMID: 38878555 DOI: 10.1016/j.artmed.2024.102900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Revised: 05/28/2024] [Accepted: 05/30/2024] [Indexed: 08/09/2024]
Abstract
With Artificial Intelligence (AI) increasingly permeating various aspects of society, including healthcare, the adoption of the Transformer neural network architecture is rapidly changing many applications. The Transformer is a type of deep learning architecture initially developed to solve general-purpose Natural Language Processing (NLP) tasks and has subsequently been adapted in many fields, including healthcare. In this survey paper, we provide an overview of how this architecture has been adopted to analyze various forms of healthcare data, including clinical NLP, medical imaging, structured Electronic Health Records (EHR), social media, bio-physiological signals, and biomolecular sequences. Furthermore, we have also included articles that used the transformer architecture for generating surgical instructions and predicting adverse outcomes after surgeries under the umbrella of critical care. Under diverse settings, these models have been used for clinical diagnosis, report generation, data reconstruction, and drug/protein synthesis. Finally, we also discuss the benefits and limitations of using transformers in healthcare and examine issues such as computational cost, model interpretability, fairness, alignment with human values, ethical implications, and environmental impact.
Affiliation(s)
- Subhash Nerella
- Department of Biomedical Engineering, University of Florida, Gainesville, United States
| | | | - Jiaqing Zhang
- Department of Electrical and Computer Engineering, University of Florida, Gainesville, United States
| | - Miguel Contreras
- Department of Biomedical Engineering, University of Florida, Gainesville, United States
| | - Scott Siegel
- Department of Biomedical Engineering, University of Florida, Gainesville, United States
| | - Aysegul Bumin
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, United States
| | - Brandon Silva
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, United States
| | - Jessica Sena
- Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| | - Benjamin Shickel
- Department of Medicine, University of Florida, Gainesville, United States
| | - Azra Bihorac
- Department of Medicine, University of Florida, Gainesville, United States
| | - Kia Khezeli
- Department of Biomedical Engineering, University of Florida, Gainesville, United States
| | - Parisa Rashidi
- Department of Biomedical Engineering, University of Florida, Gainesville, United States.
| |
|
2
|
Xin R, Zhang F, Zheng J, Zhang Y, Yu C, Feng X. SDBA: Score Domain-Based Attention for DNA N4-Methylcytosine Site Prediction from Multiperspectives. J Chem Inf Model 2024; 64:2839-2853. [PMID: 37646411 DOI: 10.1021/acs.jcim.3c00688] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 09/01/2023]
Abstract
In tasks related to DNA sequence classification, choosing the appropriate encoding methods is challenging. Some of the methods encode sequences based on prior knowledge that limits the ability of the model to obtain multiperspective information from the sequences. We introduced a new trainable ensemble method based on the attention mechanism SDBA, which stands for Score Domain-Based Attention. Unlike other methods, we fed the task-independent encoding results into the models and dynamically ensembled features from different perspectives using the SDBA mechanism. This approach allows the model to acquire and weight sequence features voluntarily. SDBA is conceptually general and empirically powerful. It has achieved new state-of-the-art results on the benchmark data sets associated with DNA N4-methylcytosine site prediction.
Affiliation(s)
- Ruihao Xin
- College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 130000, P.R. China
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, P.R. China
| | - Fan Zhang
- College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 130000, P.R. China
| | - Jiaxin Zheng
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, P.R. China
| | - Yangyi Zhang
- University of Melbourne Centre for Cancer Research, Victorian Comprehensive Cancer Centre, University of Melbourne, Parkville, Victoria 3050, Australia
| | - Cuinan Yu
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, P.R. China
| | - Xin Feng
- School of Science, Jilin Institute of Chemical Technology, Jilin 130000, P.R. China
- State Key Laboratory of Inorganic Synthesis and Preparative Chemistry, College of Chemistry, Jilin University, Changchun 130012, P.R. China
| |
|
3
|
Yin Z, Lyu J, Zhang G, Huang X, Ma Q, Jiang J. SoftVoting6mA: An improved ensemble-based method for predicting DNA N6-methyladenine sites in cross-species genomes. Mathematical Biosciences and Engineering: MBE 2024; 21:3798-3815. [PMID: 38549308 DOI: 10.3934/mbe.2024169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2024]
Abstract
The DNA N6-methyladenine (6mA) is an epigenetic modification, which plays a pivotal role in biological processes encompassing gene expression, DNA replication, repair, and recombination. Therefore, the precise identification of 6mA sites is fundamental for better understanding its function, but challenging. We proposed an improved ensemble-based method for predicting DNA N6-methyladenine sites in cross-species genomes called SoftVoting6mA. The SoftVoting6mA selected four (electron-ion-interaction pseudo potential, One-hot encoding, Kmer, and pseudo dinucleotide composition) codes from 15 types of encoding to represent DNA sequences by comparing their performances. Similarly, the SoftVoting6mA combined four learning algorithms using the soft voting strategy. The 5-fold cross-validation and the independent tests showed that SoftVoting6mA reached the state-of-the-art performance. To enhance accessibility, a user-friendly web server is provided at http://www.biolscience.cn/SoftVoting6mA/.
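The soft voting strategy mentioned in this abstract can be illustrated with a minimal sketch. This is not the SoftVoting6mA implementation: the function name `soft_vote`, the 0.5 threshold, and the example probabilities are assumptions made for illustration. The sketch only shows the general idea of averaging the positive-class probabilities of several classifiers and thresholding the mean.

```python
from typing import List, Sequence


def soft_vote(prob_sets: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Average per-model positive-class probabilities and threshold the mean.

    prob_sets[m][i] is the probability that model m assigns to sequence i
    being a methylation site (hypothetical inputs, not data from the paper).
    """
    if not prob_sets:
        raise ValueError("need at least one model's probabilities")
    n = len(prob_sets[0])
    if any(len(p) != n for p in prob_sets):
        raise ValueError("all models must score the same sequences")
    labels = []
    for i in range(n):
        # Unweighted mean across models; weighted variants are also common.
        mean_p = sum(p[i] for p in prob_sets) / len(prob_sets)
        labels.append(1 if mean_p >= threshold else 0)
    return labels


# Hypothetical probabilities from four classifiers for three candidate sites.
model_probs = [
    [0.9, 0.4, 0.2],
    [0.8, 0.6, 0.1],
    [0.7, 0.3, 0.3],
    [0.6, 0.5, 0.2],
]
predictions = soft_vote(model_probs)
```

Unlike hard (majority) voting, this keeps each model's confidence: the second site receives one vote above 0.5 and one exactly at 0.5, yet its mean probability of about 0.45 falls below the threshold.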
Affiliation(s)
- Zhaoting Yin
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| | - Jianyi Lyu
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| | - Guiyang Zhang
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| | - Xiaohong Huang
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| | - Qinghua Ma
- College of Information Science and Engineering, Hohai University, Nanjing 210000, China
- Faculty of Information Technology, University of Jyvaskyla, Jyvaskyla, Finland
| | - Jinyun Jiang
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| |
|
4
|
Jia J, Deng Y, Yi M, Zhu Y. 4mCPred-GSIMP: Predicting DNA N4-methylcytosine sites in the mouse genome with multi-Scale adaptive features extraction and fusion. Mathematical Biosciences and Engineering: MBE 2024; 21:253-271. [PMID: 38303422 DOI: 10.3934/mbe.2024012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2024]
Abstract
The epigenetic modification of DNA N4-methylcytosine (4mC) is vital for controlling DNA replication and expression. It is crucial to pinpoint 4mC's location to comprehend its role in physiological and pathological processes. However, accurate 4mC detection is difficult to achieve due to technical constraints. In this paper, we propose a deep learning-based approach 4mCPred-GSIMP for predicting 4mC sites in the mouse genome. The approach encodes DNA sequences using four feature encoding methods and combines multi-scale convolution and improved selective kernel convolution to adaptively extract and fuse features from different scales, thereby improving feature representation and optimization effect. In addition, we also use convolutional residual connections, global response normalization and pointwise convolution techniques to optimize the model. On the independent test dataset, 4mCPred-GSIMP shows high sensitivity, specificity, accuracy, Matthews correlation coefficient and area under the curve, which are 0.7812, 0.9312, 0.8562, 0.7207 and 0.9233, respectively. Various experiments demonstrate that 4mCPred-GSIMP outperforms existing prediction tools.
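For readers unfamiliar with the metrics listed in this abstract, the following sketch shows how sensitivity, specificity, accuracy, and the Matthews correlation coefficient follow from a binary confusion matrix. The function name `binary_metrics` and the counts in the example are hypothetical; the counts were chosen only because they are arithmetically consistent with the rounded values quoted above, and they are not taken from the paper.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute standard binary-classification metrics from confusion counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for a balanced test set of 160 positives and 160 negatives.
metrics = binary_metrics(tp=125, fp=11, tn=149, fn=35)
```

On a balanced test set, accuracy is the mean of sensitivity and specificity, which is why the reported accuracy (0.8562) sits midway between 0.7812 and 0.9312.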
Affiliation(s)
- Jianhua Jia
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333403, China
| | - Yu Deng
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333403, China
| | - Mengyue Yi
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333403, China
| | - Yuhui Zhu
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333403, China
| |
|
5
|
Boulet M, Gilbert G, Renaud Y, Schmidt-Dengler M, Plantié E, Bertrand R, Nan X, Jurkowski T, Helm M, Vandel L, Waltzer L. Adenine methylation is very scarce in the Drosophila genome and not erased by the ten-eleven translocation dioxygenase. eLife 2023; 12:RP91655. [PMID: 38126351 PMCID: PMC10735219 DOI: 10.7554/elife.91655] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2023] Open
Abstract
N6-methyladenine (6mA) DNA modification has recently been described in metazoans, including in Drosophila, for which the erasure of this epigenetic mark has been ascribed to the ten-eleven translocation (TET) enzyme. Here, we re-evaluated 6mA presence and TET impact on the Drosophila genome. Using axenic or conventional breeding conditions, we found traces of 6mA by LC-MS/MS and no significant increase in 6mA levels in the absence of TET, suggesting that this modification is present at very low levels in the Drosophila genome but not regulated by TET. Consistent with this latter hypothesis, further molecular and genetic analyses showed that TET does not demethylate 6mA but acts essentially in an enzymatic-independent manner. Our results call for further caution concerning the role and regulation of 6mA DNA modification in metazoans and underline the importance of TET non-enzymatic activity for fly development.
Affiliation(s)
- Manon Boulet
- Université Clermont Auvergne, CNRS, INSERM, iGReD, Clermont-Ferrand, France
| | - Guerric Gilbert
- Université Clermont Auvergne, CNRS, INSERM, iGReD, Clermont-Ferrand, France
| | - Yoan Renaud
- Université Clermont Auvergne, CNRS, INSERM, iGReD, Clermont-Ferrand, France
| | - Martina Schmidt-Dengler
- Institute of Pharmaceutical and Biomedical Sciences, Johannes Gutenberg-Universität, Mainz, Germany
| | - Emilie Plantié
- Université Clermont Auvergne, CNRS, INSERM, iGReD, Clermont-Ferrand, France
| | - Romane Bertrand
- Université Clermont Auvergne, CNRS, INSERM, iGReD, Clermont-Ferrand, France
| | - Xinsheng Nan
- School of Biosciences, Cardiff University, Cardiff, United Kingdom
| | | | - Mark Helm
- Institute of Pharmaceutical and Biomedical Sciences, Johannes Gutenberg-Universität, Mainz, Germany
| | - Laurence Vandel
- Université Clermont Auvergne, CNRS, INSERM, iGReD, Clermont-Ferrand, France
| | - Lucas Waltzer
- Université Clermont Auvergne, CNRS, INSERM, iGReD, Clermont-Ferrand, France
| |
|
6
|
Zhou J, Horton JR, Kaur G, Chen Q, Li X, Mendoza F, Wu T, Blumenthal RM, Zhang X, Cheng X. Biochemical and structural characterization of the first-discovered metazoan DNA cytosine-N4 methyltransferase from the bdelloid rotifer Adineta vaga. J Biol Chem 2023; 299:105017. [PMID: 37414145 PMCID: PMC10406627 DOI: 10.1016/j.jbc.2023.105017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Revised: 06/27/2023] [Accepted: 06/29/2023] [Indexed: 07/08/2023] Open
Abstract
Much is known about the generation, removal, and roles of 5-methylcytosine (5mC) in eukaryote DNA, and there is a growing body of evidence regarding N6-methyladenine, but very little is known about N4-methylcytosine (4mC) in the DNA of eukaryotes. The gene for the first metazoan DNA methyltransferase generating 4mC (N4CMT) was reported and characterized recently by others, in tiny freshwater invertebrates called bdelloid rotifers. Bdelloid rotifers are ancient, apparently asexual animals, and lack canonical 5mC DNA methyltransferases. Here, we characterize the kinetic properties and structural features of the catalytic domain of the N4CMT protein from the bdelloid rotifer Adineta vaga. We find that N4CMT generates high-level methylation at preferred sites, (a/c)CG(t/c/a), and low-level methylation at disfavored sites, exemplified by ACGG. Like the mammalian de novo 5mC DNA methyltransferase 3A/3B (DNMT3A/3B), N4CMT methylates CpG dinucleotides on both DNA strands, generating hemimethylated intermediates and eventually fully methylated CpG sites, particularly in the context of favored symmetric sites. In addition, like DNMT3A/3B, N4CMT methylates non-CpG sites, mainly CpA/TpG, though at a lower rate. Both N4CMT and DNMT3A/3B even prefer similar CpG-flanking sequences. Structurally, the catalytic domain of N4CMT closely resembles the Caulobacter crescentus cell cycle-regulated DNA methyltransferase. The symmetric methylation of CpG, and similarity to a cell cycle-regulated DNA methyltransferase, together suggest that N4CMT might also carry out DNA synthesis-dependent methylation following DNA replication.
Affiliation(s)
- Jujun Zhou
- Department of Epigenetics and Molecular Carcinogenesis, University of Texas MD Anderson Cancer Center, Houston, Texas, USA
| | - John R Horton
- Department of Epigenetics and Molecular Carcinogenesis, University of Texas MD Anderson Cancer Center, Houston, Texas, USA
| | - Gundeep Kaur
- Department of Epigenetics and Molecular Carcinogenesis, University of Texas MD Anderson Cancer Center, Houston, Texas, USA
| | - Qin Chen
- Department of Epigenetics and Molecular Carcinogenesis, University of Texas MD Anderson Cancer Center, Houston, Texas, USA
| | - Xuwen Li
- Department of Molecular & Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | - Fabian Mendoza
- Department of Epigenetics and Molecular Carcinogenesis, University of Texas MD Anderson Cancer Center, Houston, Texas, USA
| | - Tao Wu
- Department of Molecular & Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | - Robert M Blumenthal
- Department of Medical Microbiology and Immunology, Program in Bioinformatics, The University of Toledo College of Medicine and Life Sciences, Toledo, Ohio, USA.
| | - Xing Zhang
- Department of Epigenetics and Molecular Carcinogenesis, University of Texas MD Anderson Cancer Center, Houston, Texas, USA.
| | - Xiaodong Cheng
- Department of Epigenetics and Molecular Carcinogenesis, University of Texas MD Anderson Cancer Center, Houston, Texas, USA.
| |
|
7
|
Arkhipova IR, Yushenova IA, Rodriguez F. Shaping eukaryotic epigenetic systems by horizontal gene transfer. Bioessays 2023; 45:e2200232. [PMID: 37339822 PMCID: PMC10287040 DOI: 10.1002/bies.202200232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Revised: 05/07/2023] [Accepted: 05/08/2023] [Indexed: 06/22/2023]
Abstract
DNA methylation constitutes one of the pillars of epigenetics, relying on covalent bonds for addition and/or removal of chemically distinct marks within the major groove of the double helix. DNA methyltransferases, enzymes which introduce methyl marks, initially evolved in prokaryotes as components of restriction-modification systems protecting host genomes from bacteriophages and other invading foreign DNA. In early eukaryotic evolution, DNA methyltransferases were horizontally transferred from bacteria into eukaryotes several times and independently co-opted into epigenetic regulatory systems, primarily via establishing connections with the chromatin environment. While C5-methylcytosine is the cornerstone of plant and animal epigenetics and has been investigated in much detail, the epigenetic role of other methylated bases is less clear. The recent addition of N4-methylcytosine of bacterial origin as a metazoan DNA modification highlights the prerequisites for foreign gene co-option into the host regulatory networks, and challenges the existing paradigms concerning the origin and evolution of eukaryotic regulatory systems.
Affiliation(s)
- Irina R Arkhipova
- Marine Biological Laboratory, Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Woods Hole, Massachusetts, USA
| | - Irina A Yushenova
- Marine Biological Laboratory, Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Woods Hole, Massachusetts, USA
| | - Fernando Rodriguez
- Marine Biological Laboratory, Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Woods Hole, Massachusetts, USA
| |
|
8
|
Mao F, Xie H, Shi Y, Jiang S, Wang S, Wu Y. The Global Changes of N6-methyldeoxyadenosine in Response to Low Temperature in Arabidopsis thaliana and Rice. Plants (Basel, Switzerland) 2023; 12:2373. [PMID: 37375998 DOI: 10.3390/plants12122373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Revised: 06/12/2023] [Accepted: 06/14/2023] [Indexed: 06/29/2023]
Abstract
N6-methyldeoxyadenosine (6mA) is a recently discovered DNA modification involved in regulating plant adaptation to abiotic stresses. However, the mechanisms and changes of 6mA under cold stress in plants are not yet fully understood. Here, we conducted a genome-wide analysis of 6mA and observed that 6mA peaks were predominantly present within the gene body regions under both normal and cold conditions. In addition, the global level of 6mA increased both in Arabidopsis and rice after the cold treatment. The genes that exhibited an up-methylation showed enrichment in various biological processes, whereas there was no significant enrichment observed among the down-methylated genes. The association analysis revealed a positive correlation between the 6mA level and the gene expression level. Joint analysis of the 6mA methylome and transcriptome of Arabidopsis and rice unraveled that fluctuations in 6mA levels caused by cold exposure were not correlated to changes in transcript levels. Furthermore, we discovered that orthologous genes modified by 6mA showed high expression levels; however, only a minor amount of differentially 6mA-methylated orthologous genes were shared between Arabidopsis and rice under low-temperature conditions. In conclusion, our study provides information on the role of 6mA in response to cold stress and reveals its potential for regulating the expression of stress-related genes.
Affiliation(s)
- Fei Mao
- National Key Laboratory of Crop Genetics & Germplasm Enhancement and Utilization, Bioinformatics Center, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing 210095, China
| | - Hairong Xie
- National Key Laboratory of Crop Genetics & Germplasm Enhancement and Utilization, Bioinformatics Center, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing 210095, China
| | - Yucheng Shi
- National Key Laboratory of Crop Genetics & Germplasm Enhancement and Utilization, Bioinformatics Center, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing 210095, China
| | - Shasha Jiang
- National Key Laboratory of Crop Genetics & Germplasm Enhancement and Utilization, Bioinformatics Center, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing 210095, China
| | - Shuai Wang
- National Key Laboratory of Crop Genetics & Germplasm Enhancement and Utilization, Bioinformatics Center, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing 210095, China
| | - Yufeng Wu
- National Key Laboratory of Crop Genetics & Germplasm Enhancement and Utilization, Bioinformatics Center, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing 210095, China
| |
|
9
|
Nguyen-Vo TH, Trinh QH, Nguyen L, Nguyen-Hoang PU, Rahardja S, Nguyen BP. i4mC-GRU: Identifying DNA N4-Methylcytosine sites in mouse genomes using bidirectional gated recurrent unit and sequence-embedded features. Comput Struct Biotechnol J 2023; 21:3045-3053. [PMID: 37273848 PMCID: PMC10238585 DOI: 10.1016/j.csbj.2023.05.014] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 08/03/2022] [Revised: 05/12/2023] [Accepted: 05/12/2023] [Indexed: 06/06/2023] Open
Abstract
N4-methylcytosine (4mC) is one of the most common DNA methylation modifications found in both prokaryotic and eukaryotic genomes. Since the 4mC has various essential biological roles, determining its location helps reveal unexplored physiological and pathological pathways. In this study, we propose an effective computational method called i4mC-GRU using a gated recurrent unit and duplet sequence-embedded features to predict potential 4mC sites in mouse (Mus musculus) genomes. To fairly assess the performance of the model, we compared our method with several state-of-the-art methods using two different benchmark datasets. Our results showed that i4mC-GRU achieved area under the receiver operating characteristic curve values of 0.97 and 0.89 and area under the precision-recall curve values of 0.98 and 0.90 on the first and second benchmark datasets, respectively. Briefly, our method outperformed existing methods in predicting 4mC sites in mouse genomes. Also, we deployed i4mC-GRU as an online web server, supporting users in genomics studies.
Affiliation(s)
- Thanh-Hoang Nguyen-Vo
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
- School of Innovation, Design and Technology, Wellington Institute of Technology, Wellington 5012, New Zealand
| | - Quang H. Trinh
- School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi 100000, Vietnam
| | - Loc Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
| | - Phuong-Uyen Nguyen-Hoang
- Computational Biology Center, International University - VNU HCMC, Ho Chi Minh City 700000, Vietnam
| | - Susanto Rahardja
- School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
- Infocomm Technology Cluster, Singapore Institute of Technology, Singapore 138683, Singapore
| | - Binh P. Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
| |
|
10
|
Yu X, Ren J, Cui Y, Zeng R, Long H, Ma C. DRSN4mCPred: accurately predicting sites of DNA N4-methylcytosine using deep residual shrinkage network for diagnosis and treatment of gastrointestinal cancer in the precision medicine era. Front Med (Lausanne) 2023; 10:1187430. [PMID: 37215722 PMCID: PMC10192687 DOI: 10.3389/fmed.2023.1187430] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/16/2023] [Accepted: 04/05/2023] [Indexed: 05/24/2023] Open
Abstract
Introduction: DNA N4-methylcytosine (4mC) site levels were higher in those suffering from digestive system cancers, and the pathogenesis of digestive system cancers may also be related to changes in DNA 4mC levels. Identifying DNA 4mC sites is a very important step in the analysis of biological function and in cancer prediction. Extracting accurate features from DNA sequences is the key to establishing an effective prediction model for DNA 4mC sites. This study sought to develop a new predictive model, DRSN4mCPred, which aimed to improve the performance of DNA 4mC site prediction. Methods: The model adopted multi-scale channel attention to extract features and used attention feature fusion (AFF) to fuse features. To capture feature information more accurately and effectively, the model utilized a Deep Residual Shrinkage Network with Channel-Wise thresholds (DRSN-CW) to eliminate noise-related features and achieve a more precise feature representation, thereby distinguishing 4mC sites from non-4mC sites in DNA. Overall, the predictive model incorporated an inverted residual block, a Multi-scale Channel Attention Module (MS-CAM), a Bi-directional Long Short-Term Memory network (Bi-LSTM), AFF, and DRSN-CW. Results and Discussion: The results indicated that the predictive model DRSN4mCPred performed extremely well in predicting DNA 4mC sites across different species. This paper will potentially provide support for artificial intelligence-based diagnosis and treatment of gastrointestinal cancer in the precision medicine era.
Affiliation(s)
- Xia Yu
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Jia Ren
- Industrial Design School, Shandong University of ART and Design, Jinan, Shandong, China
| | - Yani Cui
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| | - Rao Zeng
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Haixia Long
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Cuihua Ma
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| |
|
11
|
Yang S, Yang Z, Yang J. 4mCBERT: A computing tool for the identification of DNA N4-methylcytosine sites by sequence- and chemical-derived information based on ensemble learning strategies. Int J Biol Macromol 2023; 231:123180. [PMID: 36646347 DOI: 10.1016/j.ijbiomac.2023.123180] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 09/19/2022] [Revised: 11/26/2022] [Accepted: 12/30/2022] [Indexed: 01/15/2023]
Abstract
N4-methylcytosine (4mC) is an important DNA chemical modification, a methylation modification discovered in recent years that plays critical roles in gene expression regulation, defense against invading genetic elements, genomic imprinting, and so on. Identifying 4mC sites from DNA sequence segments contributes to discovering more novel modification patterns. In this paper, we present a model called 4mCBERT that encodes DNA sequence segments using sequence characteristics (one-hot, electron-ion interaction pseudopotential, nucleotide chemical property, and word2vec) and chemical information (physicochemical properties (PCP) and chemical bidirectional encoder representations from transformers (chemical BERT)), and employs an ensemble learning framework to develop a prediction model. PCP and chemical BERT features are constructed and applied to 4mC site prediction for the first time and contribute positively to identifying 4mC. In terms of the Matthews correlation coefficient, 4mCBERT significantly outperformed other state-of-the-art models on six independent benchmark datasets, A. thaliana, C. elegans, D. melanogaster, E. coli, G. pickeringii, and G. subterraneus, by 4.32 % to 24.39 %, 2.52 % to 31.65 %, 2 % to 16.49 %, 6.63 % to 35.15 %, 8.59 % to 61.85 %, and 8.45 % to 34.45 %, respectively. Moreover, 4mCBERT is designed to allow users to predict 4mC sites and retrain 4mC prediction models. In brief, 4mCBERT shows higher performance on six benchmark datasets by incorporating sequence- and chemical-driven information and is available at http://cczubio.top/4mCBERT and https://github.com/abcair/4mCBERT.
Affiliation(s)
- Sen Yang
- School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou 213164, China; The Affiliated Changzhou No 2 People's Hospital of Nanjing Medical University, Changzhou 213164, China.
| | - Zexi Yang
- School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou 213164, China
| | - Jun Yang
- School of Educational Sciences, Yili Normal University, Yining 835000, China
| |
|
12
|
Wang M, Li Q, Liu L. Factors and Methods for the Detection of Gene Expression Regulation. Biomolecules 2023; 13:biom13020304. [PMID: 36830673 PMCID: PMC9953580 DOI: 10.3390/biom13020304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2023] [Revised: 02/01/2023] [Accepted: 02/03/2023] [Indexed: 02/10/2023] Open
Abstract
Gene-expression regulation involves multiple processes and a range of regulatory factors. In this review, we describe the key factors that regulate gene expression, including transcription factors (TFs), chromatin accessibility, histone modifications, DNA methylation, and RNA modifications. In addition, we also describe methods that can be used to detect these regulatory factors.
|
13
|
A deep multiple kernel learning-based higher-order fuzzy inference system for identifying DNA N4-methylcytosine sites. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.01.149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/12/2023]
|
14
|
Han K, Wang J, Wang Y, Zhang L, Yu M, Xie F, Zheng D, Xu Y, Ding Y, Wan J. A review of methods for predicting DNA N6-methyladenine sites. Brief Bioinform 2023; 24:6887111. [PMID: 36502371 DOI: 10.1093/bib/bbac514] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/18/2022] [Revised: 10/07/2022] [Accepted: 10/27/2022] [Indexed: 12/14/2022] Open
Abstract
Deoxyribonucleic acid (DNA) N6-methyladenine (6mA) plays a vital role in various biological processes, and accurate identification of 6mA sites can provide a more comprehensive understanding of its biological effects. There are several methods for 6mA site prediction. With the continuous development of technology, traditional techniques with high costs and low efficiencies are gradually being replaced by computational methods. Widely used computational methods can be divided into two categories: traditional machine learning and deep learning methods. We first list some existing experimental methods for identifying 6mA sites, then analyze the general process from sequence input to results in computational methods and review existing model architectures. Finally, the results are summarized and compared to help subsequent researchers choose the most suitable method for their work.
Affiliation(s)
- Ke Han
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China.,College of Pharmacy, Harbin University of Commerce, Harbin, 150076, China
| | - Jianchun Wang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yu Wang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Lei Zhang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Mengyao Yu
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Fang Xie
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Dequan Zheng
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yaoqun Xu
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
| | - Jie Wan
- Laboratory for Space Environment and Physical Sciences, Harbin Institute of Technology, Harbin, 150001, China
|
15
|
MultiScale-CNN-4mCPred: a multi-scale CNN and adaptive embedding-based method for mouse genome DNA N4-methylcytosine prediction. BMC Bioinformatics 2023; 24:21. [PMID: 36653789 PMCID: PMC9847203 DOI: 10.1186/s12859-023-05135-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 09/15/2022] [Accepted: 01/04/2023] [Indexed: 01/19/2023] Open
Abstract
N4-methylcytosine (4mC) is an important epigenetic mechanism, which regulates many cellular processes such as cell differentiation and gene expression. The knowledge about the 4mC sites is a key foundation to exploring its roles. Due to the limitation of techniques, precise detection of 4mC is still a challenging task. In this paper, we presented a multi-scale convolution neural network (CNN) and adaptive embedding-based computational method for predicting 4mC sites in mouse genome, which was referred to as MultiScale-CNN-4mCPred. The MultiScale-CNN-4mCPred used adaptive embedding to encode nucleotides, and then utilized multi-scale CNNs as well as long short-term memory to extract more in-depth local properties and contextual semantics in the sequences. The MultiScale-CNN-4mCPred is an end-to-end learning method, which requires no sophisticated feature design. The MultiScale-CNN-4mCPred reached an accuracy of 81.66% in the 10-fold cross-validation, and an accuracy of 84.69% in the independent test, outperforming state-of-the-art methods. We implemented the proposed method into a user-friendly web application which is freely available at: http://www.biolscience.cn/MultiScale-CNN-4mCPred/ .
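The 10-fold cross-validation mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the fold assignment scheme, the seed, and the `train_and_score` callback are assumptions introduced only for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train_and_score, k=10, seed=0):
    """Return the mean score over k held-out folds.

    train_and_score(train_x, train_y, test_x, test_y) is a hypothetical
    callback that fits a model on the training part and returns its
    accuracy on the held-out part.
    """
    folds = k_fold_indices(len(samples), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(samples)) if i not in held]
        scores.append(train_and_score(
            [samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in held_out], [labels[i] for i in held_out],
        ))
    return sum(scores) / len(scores)
```

Each sample is held out exactly once, so the reported cross-validation accuracy is an average over ten models, each evaluated on data it did not see during training.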
|
16
|
Ding Y, He W, Tang J, Zou Q, Guo F. Laplacian Regularized Sparse Representation Based Classifier for Identifying DNA N4-Methylcytosine Sites via L2,1/2-Matrix Norm. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:500-511. [PMID: 34882559 DOI: 10.1109/tcbb.2021.3133309] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Indexed: 05/04/2023]
Abstract
N4-methylcytosine (4mC) is one of the important epigenetic modifications in DNA sequences. Detecting 4mC sites is time-consuming. Computational methods based on machine learning have provided effective help for identifying 4mC. To further improve prediction performance, we propose a Laplacian Regularized Sparse Representation based Classifier with L2,1/2-matrix norm (LapRSRC). We also utilize the kernel trick to derive the kernel LapRSRC for nonlinear modeling. Matrix factorization is employed to solve the sparse representation coefficients of all test samples in the training set, and an efficient iterative algorithm is proposed to solve the objective function. We evaluate our model on six benchmark datasets of 4mC and eight UCI datasets. The results show that the performance of our method is better than or comparable to that of the compared methods.
|
17
|
Zeng W, Gautam A, Huson DH. MuLan-Methyl-multiple transformer-based language models for accurate DNA methylation prediction. Gigascience 2023; 12:giad054. [PMID: 37489753 PMCID: PMC10367125 DOI: 10.1093/gigascience/giad054] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/19/2023] [Revised: 05/09/2023] [Accepted: 07/18/2023] [Indexed: 07/26/2023] Open
Abstract
Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the "pretrain and fine-tune" paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach.
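The abstract states that the 5 models collectively predict the methylation status but does not say how their outputs are combined. One common option is soft voting, sketched below; the averaging rule, the 0.5 threshold, and the example probabilities are assumptions, not details taken from MuLan-Methyl.

```python
def soft_vote(probabilities, threshold=0.5):
    """Average per-model positive-class probabilities and threshold the mean.

    Returns (mean_probability, predicted_label).
    """
    if not probabilities:
        raise ValueError("need at least one model output")
    mean = sum(probabilities) / len(probabilities)
    return mean, mean >= threshold


# Hypothetical outputs of five fine-tuned models for one DNA fragment.
model_outputs = [0.9, 0.7, 0.4, 0.8, 0.6]
mean_probability, is_methylated = soft_vote(model_outputs)
```

Averaging lets one dissenting model (0.4 here) be outvoted when the others agree, which is one way joint use of several models can improve on any single one.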
Affiliation(s)
- Wenhuan Zeng
- Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
| | - Anupam Gautam
- Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
- International Max Planck Research School “From Molecules to Organisms”, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
- Cluster of Excellence: EXC 2124: Controlling Microbes to Fight Infection, University of Tübingen, 72076 Tübingen, Germany
| | - Daniel H Huson
- Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
- International Max Planck Research School “From Molecules to Organisms”, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
- Cluster of Excellence: EXC 2124: Controlling Microbes to Fight Infection, University of Tübingen, 72076 Tübingen, Germany
|
18
|
PSP-PJMI: An innovative feature representation algorithm for identifying DNA N4-methylcytosine sites. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.05.060] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/25/2022]
|
19
|
Abbas Z, Tayara H, Chong KT. ZayyuNet - A Unified Deep Learning Model for the Identification of Epigenetic Modifications Using Raw Genomic Sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:2533-2544. [PMID: 34038365 DOI: 10.1109/tcbb.2021.3083789] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 06/12/2023]
Abstract
Epigenetic modifications have a vital role in gene expression and are linked to cellular processes such as differentiation, development, and tumorigenesis. Thus, the availability of reliable and accurate methods for identifying and defining these changes facilitates greater insights into the regulatory mechanisms that rely on epigenetic modifications. The current experimental methods provide a genome-wide identification of epigenetic modifications; however, they are expensive and time-consuming. To date, several machine learning methods have been proposed for identifying modifications such as DNA N6-methyladenine (6mA), RNA N6-methyladenosine (m6A), DNA N4-methylcytosine (4mC), and RNA pseudouridine (Ψ). However, these methods are task-specific computational tools and require different encoding representations of DNA/RNA sequences. In this study, we propose a unified deep learning model, called ZayyuNet, for the identification of various epigenetic modifications. The proposed model is based on an architecture called SpinalNet, which is inspired by the human somatosensory system and can efficiently receive large inputs and achieve better performance. The proposed model has been evaluated on various epigenetic modifications such as 6mA, m6A, 4mC, and Ψ, and the results outperform current state-of-the-art models. A user-friendly web server has been built and made freely available at http://nsclbio.jbnu.ac.kr/tools/ZayyuNet/.
|
20
|
Liang Y, Wu Y, Zhang Z, Liu N, Peng J, Tang J. Hyb4mC: a hybrid DNA2vec-based model for DNA N4-methylcytosine sites prediction. BMC Bioinformatics 2022; 23:258. [PMID: 35768759 PMCID: PMC9241225 DOI: 10.1186/s12859-022-04789-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 12/10/2021] [Accepted: 06/10/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: DNA N4-methylcytosine is part of the restriction-modification system, which works by regulating some biological processes, for example, the initiation of DNA replication, mismatch repair and inactivation of transposons. However, using experimental methods to detect 4mC sites is time-consuming and expensive. Besides, considering the huge differences in the number of 4mC samples among different species, it is challenging to achieve a robust multi-species 4mC site prediction performance. Hence, it is of great significance to develop effective computational tools to identify 4mC sites. RESULTS: This work proposes a flexible deep learning-based framework to predict 4mC sites, called Hyb4mC. Hyb4mC adopts the DNA2vec method for sequence embedding, which captures more efficient and comprehensive information compared with the sequence-based feature method. Then, two different subnets are used for further analysis: Hyb_Caps and Hyb_Conv. Hyb_Caps is composed of a capsule neural network and can generalize from fewer samples. Hyb_Conv combines the attention mechanism with a text convolutional neural network for further feature learning. CONCLUSIONS: Extensive benchmark tests have shown that Hyb4mC can significantly enhance the performance of predicting 4mC sites compared with the recently proposed methods.
Affiliation(s)
- Ying Liang
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China.
| | - Yanan Wu
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| | - Zequn Zhang
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| | - Niannian Liu
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| | - Jun Peng
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| | - Jianjun Tang
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
|
21
|
|
22
|
Li X, Guo S, Cui Y, Zhang Z, Luo X, Angelova MT, Landweber LF, Wang Y, Wu TP. NT-seq: a chemical-based sequencing method for genomic methylome profiling. Genome Biol 2022; 23:122. [PMID: 35637459 PMCID: PMC9150344 DOI: 10.1186/s13059-022-02689-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/15/2021] [Accepted: 05/16/2022] [Indexed: 12/15/2022] Open
Abstract
DNA methylation plays vital roles in both prokaryotes and eukaryotes. There are three forms of DNA methylation in prokaryotes: N6-methyladenine (6mA), N4-methylcytosine (4mC), and 5-methylcytosine (5mC). Although many sequencing methods have been developed to sequence specific types of methylation, few technologies can be used for efficiently mapping multiple types of methylation. Here, we present NT-seq for mapping all three types of methylation simultaneously. NT-seq reliably detects all known methylation motifs in two bacterial genomes and can be used for identifying de novo methylation motifs. NT-seq provides a simple and efficient solution for detecting multiple types of DNA methylation.
Affiliation(s)
- Xuwen Li
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - Shiyuan Guo
- Genetics, Genomics, and Bioinformatics Graduate Program, University of California Riverside, Riverside, CA, USA
| | - Yan Cui
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - Zijian Zhang
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - Xinlong Luo
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - Margarita T Angelova
- Departments of Biochemistry and Molecular Biophysics and Biological Sciences, Columbia University, New York, NY, USA
| | - Laura F Landweber
- Departments of Biochemistry and Molecular Biophysics and Biological Sciences, Columbia University, New York, NY, USA
| | - Yinsheng Wang
- Genetics, Genomics, and Bioinformatics Graduate Program, University of California Riverside, Riverside, CA, USA.,Department of Chemistry, University of California Riverside, Riverside, CA, USA
| | - Tao P Wu
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA. .,Huffington Center on Aging, Baylor College of Medicine, Houston, TX, USA. .,Dan L Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, TX, USA.
|
23
|
Zhang S, Yao Y, Wang J, Liang Y. Identification of DNA N4-methylcytosine sites based on multi-source features and gradient boosting decision tree. Anal Biochem 2022; 652:114746. [DOI: 10.1016/j.ab.2022.114746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2022] [Revised: 05/13/2022] [Accepted: 05/18/2022] [Indexed: 11/16/2022]
|
24
|
Yu B, Zhang Y, Wang X, Gao H, Sun J, Gao X. Identification of DNA modification sites based on elastic net and bidirectional gated recurrent unit with convolutional neural network. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2022.103566] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 02/06/2023]
|
25
|
Yu L, Zhang Y, Xue L, Liu F, Chen Q, Luo J, Jing R. Systematic Analysis and Accurate Identification of DNA N4-Methylcytosine Sites by Deep Learning. Front Microbiol 2022; 13:843425. [PMID: 35401453 PMCID: PMC8989013 DOI: 10.3389/fmicb.2022.843425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2021] [Accepted: 02/21/2022] [Indexed: 11/13/2022] Open
Abstract
DNA N4-methylcytosine (4mC) is a pivotal epigenetic modification that plays an essential role in DNA replication, repair, expression and differentiation. To gain insight into the biological functions of 4mC, it is critical to identify its modification sites in the genome. In recent years, deep learning has become increasingly popular and is frequently employed for 4mC site identification. However, a systematic analysis of how to build predictive models using deep learning techniques is still lacking. In this work, we first summarized all existing deep learning-based predictors and systematically analyzed their models, features and datasets. Then, using a typical standard dataset with three species (A. thaliana, C. elegans, and D. melanogaster), we assessed the contribution of different model architectures, encoding methods and the attention mechanism in establishing a deep learning-based model for 4mC site prediction. After a series of optimizations, a convolutional-recurrent neural network architecture using one-hot encoding and the attention mechanism achieved the best overall prediction performance. Extensive comparison experiments were conducted based on the same dataset. This work will be helpful for researchers who would like to build 4mC prediction models using deep learning in the future.
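To make the role of the attention mechanism concrete, here is a minimal sketch of attention pooling over per-position feature vectors, such as the outputs of a recurrent layer. It is a generic dot-product formulation under stated assumptions, not the architecture evaluated in the paper; the `query` vector stands in for a learned parameter.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Collapse a sequence of vectors into one vector.

    hidden_states: list of equal-length vectors, one per sequence position.
    query: vector of the same length (a learned parameter in a real model).
    Returns (pooled_vector, attention_weights).
    """
    # Score each position by its dot product with the query.
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    pooled = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return pooled, weights
```

The returned weights indicate which sequence positions contributed most to the pooled representation, which is why attention is often inspected for interpretability.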
Affiliation(s)
- Lezheng Yu
- School of Chemistry and Materials Science, Guizhou Education University, Guiyang, China
| | - Yonglin Zhang
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China
| | - Li Xue
- School of Public Health, Southwest Medical University, Luzhou, China
| | - Fengjuan Liu
- School of Geography and Resources, Guizhou Education University, Guiyang, China
| | - Qi Chen
- Department of Endocrinology and Metabolism, The Affiliated Hospital of Southwest Medical University, Luzhou, China
| | - Jiesi Luo
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China.,Department of Pharmacy, The Affiliated Hospital of Southwest Medical University, Luzhou, China
| | - Runyu Jing
- School of Cyber Science and Engineering, Sichuan University, Chengdu, China
|
26
|
Boulias K, Greer EL. Means, mechanisms and consequences of adenine methylation in DNA. Nat Rev Genet 2022; 23:411-428. [PMID: 35256817 PMCID: PMC9354840 DOI: 10.1038/s41576-022-00456-x] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Accepted: 01/31/2022] [Indexed: 12/29/2022]
Abstract
N6-methyl-2'-deoxyadenosine (6mA or m6dA) has been reported in the DNA of prokaryotes and eukaryotes ranging from unicellular protozoa and algae to multicellular plants and mammals. It has been proposed to modulate DNA structure and transcription, transmit information across generations and have a role in disease, among other functions. However, its existence in more recently evolved eukaryotes remains a topic of debate. Recent technological advancements have facilitated the identification and quantification of 6mA even when the modification is exceptionally rare, but each approach has limitations. Critical assessment of existing data, rigorous design of future studies and further development of methods will be required to confirm the presence and biological functions of 6mA in multicellular eukaryotes.
|
27
|
Tsukiyama S, Hasan MM, Deng HW, Kurata H. BERT6mA: prediction of DNA N6-methyladenine site using deep learning-based approaches. Brief Bioinform 2022; 23:6539171. [PMID: 35225328 PMCID: PMC8921755 DOI: 10.1093/bib/bbac053] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 12/08/2021] [Revised: 01/28/2022] [Accepted: 01/31/2022] [Indexed: 01/29/2023] Open
Abstract
N6-methyladenine (6mA) is associated with important roles in DNA replication, DNA repair, transcription, and regulation of gene expression. Several experimental methods have been used to identify DNA modifications. However, these experimental methods are costly and time-consuming. To detect 6mA and compensate for these shortcomings of experimental methods, we proposed a novel deep learning approach called BERT6mA. To compare BERT6mA with other deep learning approaches, we used benchmark datasets including 11 species. BERT6mA presented the highest AUCs in eight species in independent tests. Furthermore, BERT6mA showed higher or comparable performance relative to the state-of-the-art models, while it performed poorly in a few species with a small sample size. To overcome this issue, pretraining and fine-tuning between two species were applied to BERT6mA. The models pretrained and fine-tuned on specific species presented higher performance than other models, even for the species with a small sample size. In addition to the prediction, we analyzed the attention weights generated by BERT6mA to reveal how the model extracts critical features responsible for the 6mA prediction. To facilitate biological sciences, the BERT6mA online web server and its source codes are freely accessible at https://github.com/kuratahiroyuki/BERT6mA.git, respectively.
Affiliation(s)
- Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Hong-Wen Deng
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Hiroyuki Kurata
- Corresponding author: Hiroyuki Kurata, Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan. Tel: 81-948-29-7828; E-mail:
|
28
|
Exploration of the Potential Transcriptional Regulatory Mechanisms of DNA Methyltransferases and MBD Genes in Petunia Anther Development and Multi-Stress Responses. Genes (Basel) 2022; 13:genes13020314. [PMID: 35205359 PMCID: PMC8872020 DOI: 10.3390/genes13020314] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/16/2021] [Revised: 01/28/2022] [Accepted: 02/01/2022] [Indexed: 02/01/2023] Open
Abstract
Cytosine-5 DNA methyltransferases (C5-MTases) and methyl-CpG-binding-domain (MBD) genes can be co-expressed. They directly control target gene expression by enhancing their DNA methylation levels in humans; however, the presence of this kind of cooperative relationship in plants has not been determined. A popular garden plant worldwide, petunia (Petunia hybrida) is also a model plant in molecular biology. In this study, 9 PhC5-MTase and 11 PhMBD proteins were identified in petunia, and they were categorized into four and six subgroups, respectively, on the basis of phylogenetic analyses. An expression correlation analysis was performed to explore the co-expression relationships between PhC5-MTases and PhMBDs using RNA-seq data, and 11 PhC5-MTase/PhMBD pairs preferentially expressed in anthers were identified as having the most significant correlations (Pearson’s correlation coefficients > 0.9). Remarkably, the stability levels of the PhC5-MTase and PhMBD pairs significantly decreased in different tissues and organs compared with that in anthers, and most of the selected PhC5-MTases and PhMBDs responded to the abiotic and hormonal stresses. However, highly correlated expression relationships between most pairs were not observed under different stress conditions, indicating that anther developmental processes are preferentially influenced by the co-expression of PhC5-MTases and PhMBDs. Interestingly, the nuclear localization genes PhDRM2 and PhMBD2 still had higher correlations under GA treatment conditions, implying that they play important roles in the GA-mediated development of petunia. Collectively, our study suggests a regulatory role for DNA methylation by C5-MTase and MBD genes in petunia anther maturation processes and multi-stress responses, and it provides a framework for the functional characterization of C5-MTases and MBDs in the future.
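The expression correlation analysis described here (gene pairs with Pearson's correlation coefficient > 0.9) can be sketched as follows. The gene names and expression values in the usage note are hypothetical placeholders, and the dictionary layout is an assumption made for illustration.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (dev_x * dev_y)


def correlated_pairs(expr_a, expr_b, cutoff=0.9):
    """Return (gene_a, gene_b, r) for every cross-family pair with r > cutoff.

    expr_a and expr_b map gene names to expression values measured
    across the same ordered set of samples.
    """
    pairs = []
    for gene_a, values_a in expr_a.items():
        for gene_b, values_b in expr_b.items():
            r = pearson(values_a, values_b)
            if r > cutoff:
                pairs.append((gene_a, gene_b, r))
    return pairs
```

For example, `correlated_pairs({"MTase_A": [1, 2, 3, 4]}, {"MBD_X": [2, 4, 6, 8.5], "MBD_Y": [4, 1, 3, 2]})` keeps only the pair with `MBD_X`, because the `MBD_Y` profile is negatively correlated with `MTase_A`.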
|
29
|
Kong Y, Cao L, Deikus G, Fan Y, Mead EA, Lai W, Zhang Y, Yong R, Sebra R, Wang H, Zhang XS, Fang G. Critical assessment of DNA adenine methylation in eukaryotes using quantitative deconvolution. Science 2022; 375:515-522. [PMID: 35113693 PMCID: PMC9382770 DOI: 10.1126/science.abe7489] [Citation(s) in RCA: 53] [Impact Index Per Article: 26.5] [Indexed: 12/15/2022]
Abstract
The discovery of N6-methyldeoxyadenine (6mA) across eukaryotes led to a search for additional epigenetic mechanisms. However, some studies have highlighted confounding factors that challenge the prevalence of 6mA in eukaryotes. We developed a metagenomic method to quantitatively deconvolve 6mA events from a genomic DNA sample into species of interest, genomic regions, and sources of contamination. Applying this method, we observed high-resolution 6mA deposition in two protozoa. We found that commensal or soil bacteria explained the vast majority of 6mA in insect and plant samples. We found no evidence of high abundance of 6mA in Drosophila, Arabidopsis, or humans. Plasmids used for genetic manipulation, even those from Dam methyltransferase mutant Escherichia coli, could carry abundant 6mA, confounding the evaluation of candidate 6mA methyltransferases and demethylases. On the basis of this work, we advocate for a reassessment of 6mA in eukaryotes.
Affiliation(s)
- Yimeng Kong
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai; New York, NY 10029, USA
| | - Lei Cao
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai; New York, NY 10029, USA
| | - Gintaras Deikus
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai; New York, NY 10029, USA
| | - Yu Fan
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai; New York, NY 10029, USA
| | - Edward A. Mead
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai; New York, NY 10029, USA
| | - Weiyi Lai
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences; Beijing 100085, China
| | - Yizhou Zhang
- Department of Neurosurgery and Oncological Sciences, Icahn School of Medicine at Mount Sinai, New York; NY 10029, USA
| | - Raymund Yong
- Department of Neurosurgery and Oncological Sciences, Icahn School of Medicine at Mount Sinai, New York; NY 10029, USA
| | - Robert Sebra
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai; New York, NY 10029, USA
- Black Family Stem Cell Institute, Icahn School of Medicine at Mount Sinai; New York, NY 10029, USA
- Sema4, a Mount Sinai venture; Stamford, CT, 06902, USA
| | - Hailin Wang
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences; Beijing 100085, China
| | - Xue-Song Zhang
- Center for Advanced Biotechnology and Medicine, Rutgers University; New Brunswick, NJ, 08854, USA
| | - Gang Fang
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai; New York, NY 10029, USA
|
30
|
Zulfiqar H, Huang QL, Lv H, Sun ZJ, Dao FY, Lin H. Deep-4mCGP: A Deep Learning Approach to Predict 4mC Sites in Geobacter pickeringii by Using Correlation-Based Feature Selection Technique. Int J Mol Sci 2022; 23:1251. [PMID: 35163174 PMCID: PMC8836036 DOI: 10.3390/ijms23031251] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 12/24/2021] [Revised: 01/19/2022] [Accepted: 01/20/2022] [Indexed: 12/15/2022] Open
Abstract
4mC is a type of DNA alteration that has the ability to synchronize multiple biological movements, for example, DNA replication, gene expressions, and transcriptional regulations. Accurate prediction of 4mC sites can provide exact information to their hereditary functions. The purpose of this study was to establish a robust deep learning model to recognize 4mC sites in Geobacter pickeringii. In the anticipated model, two kinds of feature descriptors, namely, binary and k-mer composition were used to encode the DNA sequences of Geobacter pickeringii. The obtained features from their fusion were optimized by using correlation and gradient-boosting decision tree (GBDT)-based algorithm with incremental feature selection (IFS) method. Then, these optimized features were inserted into 1D convolutional neural network (CNN) to classify 4mC sites from non-4mC sites in Geobacter pickeringii. The performance of the anticipated model on independent data exhibited an accuracy of 0.868, which was 4.2% higher than the existing model.
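The two feature descriptors named in this abstract, binary encoding and k-mer composition, can be sketched as below. This is an illustrative reading of those terms, not the authors' implementation; the choice of k, the A/C/G/T ordering, and fusion by simple concatenation are assumptions.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def binary_encode(seq):
    """Binary (one-hot) encoding: four 0/1 values per nucleotide, flattened."""
    encoded = []
    for base in seq.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        encoded.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return encoded


def kmer_composition(seq, k=2):
    """Relative frequency of each of the 4**k k-mers, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    if total < 1:
        raise ValueError("sequence is shorter than k")
    # Slide a window of width k over the sequence, one position at a time.
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer not in counts:
            raise ValueError(f"unsupported k-mer: {kmer!r}")
        counts[kmer] += 1
    return [counts[m] / total for m in kmers]


def fused_features(seq, k=2):
    """Concatenate both descriptors into a single feature vector."""
    return binary_encode(seq) + kmer_composition(seq, k)
```

The binary part preserves position-specific information, while the k-mer part summarizes local composition regardless of position; a feature selection step such as the one described in the abstract would then prune the concatenated vector.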
Affiliation(s)
| | | | | | | | | | - Hao Lin
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; (H.Z.); (Q.-L.H.); (H.L.); (Z.-J.S.); (F.-Y.D.)
| |
|
31
|
Chaudhary M. Novel methylation mark and essential hypertension. Journal of Genetic Engineering and Biotechnology 2022; 20:11. [PMID: 35061109 PMCID: PMC8777530 DOI: 10.1186/s43141-022-00301-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 09/13/2021] [Accepted: 01/14/2022] [Indexed: 12/11/2022]
Abstract
Background: Essential hypertension (EH) is an important risk factor for various cardiovascular, cerebral and renal disorders. It is a multifactorial trait that arises through a complex interplay between genetic, epigenetic, and environmental factors. Despite technological advances and the deciphering of multiple signalling pathways involved in blood pressure regulation, it remains a huge global concern. Main body: Genome-wide association studies (GWAS) have revealed EH-associated genetic variants, but these alone cannot explain the variability in blood pressure, indicating the involvement of additional factors. The study of the etiopathogenesis of hypertension has now advanced to the level of epigenomics, where aberrant DNA methylation is the best-defined epigenetic mechanism involved in gene regulation. Although the role of DNA methylation in cancer and other contexts has been studied in depth, its study in relation to hypertension is in its infancy. Generally, 5-methylcytosine (5mC) levels are targeted at both the individual-gene and global level to find associations with the disease. Recently, however, advanced sequencing techniques revealed another methylation mark in humans, N6-methyladenine (6mA), which was earlier considered to be absent in eukaryotes. The relation of aberrant 6mA levels to cancer and stem cell fate has drawn attention to studying 6mA levels in hypertension too. Conclusion: Recent studies of hypertension have suggested 6mA levels as a novel marker and its demethylase, ALKBH1, as a probable therapeutic target for preventing hypertension through epigenetic programming. This review compiles different methylation studies and suggests targeting both 5mC and 6mA levels to cover the role of methylation in hypertension more broadly.
|
32
|
O’Brown ZK, Greer EL. N6-methyladenine: A Rare and Dynamic DNA Mark. Advances in Experimental Medicine and Biology 2022; 1389:177-210. [DOI: 10.1007/978-3-031-11454-0_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
|
33
|
Mouse4mC-BGRU: deep learning for predicting DNA N4-methylcytosine sites in mouse genome. Methods 2022; 204:258-262. [DOI: 10.1016/j.ymeth.2022.01.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/14/2021] [Revised: 01/14/2022] [Accepted: 01/24/2022] [Indexed: 12/12/2022]
|
34
|
Clauwaert J, Waegeman W. Novel Transformer Networks for Improved Sequence Labeling in genomics. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:97-106. [PMID: 33125335 DOI: 10.1109/tcbb.2020.3035021] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 06/11/2023]
Abstract
In genomics, a wide range of machine learning methodologies have been investigated to annotate biological sequences for positions of interest such as transcription start sites, translation initiation sites, methylation sites, splice sites and promoter start sites. In recent years, this area has been dominated by convolutional neural networks, which typically outperform previously-designed methods as a result of automated scanning for influential sequence motifs. However, those architectures do not allow for the efficient processing of the full genomic sequence. As an improvement, we introduce transformer architectures for whole genome sequence labeling tasks. We show that these architectures, recently introduced for natural language processing, are better suited for processing and annotating long DNA sequences. We apply existing networks and introduce an optimized method for the calculation of attention from input nucleotides. To demonstrate this, we evaluate our architecture on several sequence labeling tasks, and find it to achieve state-of-the-art performances when comparing it to specialized models for the annotation of transcription start sites, translation initiation sites and 4mC methylation in E. coli.
|
35
|
Teng Z, Zhao Z, Li Y, Tian Z, Guo M, Lu Q, Wang G. i6mA-Vote: Cross-Species Identification of DNA N6-Methyladenine Sites in Plant Genomes Based on Ensemble Learning With Voting. Frontiers in Plant Science 2022; 13:845835. [PMID: 35237293 PMCID: PMC8882731 DOI: 10.3389/fpls.2022.845835] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 12/30/2021] [Accepted: 01/24/2022] [Indexed: 05/17/2023]
Abstract
DNA N6-Methyladenine (6mA) is a common epigenetic modification that plays significant roles in the growth and development of plants. Identifying 6mA sites is crucial for elucidating the functions of 6mA. In this article, a novel model named i6mA-vote is developed to predict 6mA sites in plants. Firstly, DNA sequences were coded into six feature vectors with diverse strategies based on the density, physicochemical properties, and position of nucleotides, respectively. To find the best coding strategy, the feature vectors were compared on several machine learning classifiers. The results suggested that the position of nucleotides has a significant positive effect on 6mA site identification. Thus, the dinucleotide one-hot strategy, which describes the position characteristics of nucleotides well, was employed to extract DNA features in our method. Secondly, DNA sequences of Rosaceae were randomly divided into a training dataset and a test dataset. Finally, i6mA-vote was constructed by combining five different base classifiers under a majority voting strategy and trained on the Rosaceae training dataset. The i6mA-vote was evaluated on the task of predicting 6mA sites from the genomes of Rosaceae, Rice, and Arabidopsis separately. In Rosaceae, the performance of i6mA-vote was 0.955 for accuracy (ACC), 0.909 for Matthews correlation coefficient (MCC), 0.955 for sensitivity (SN), and 0.954 for specificity (SP). Those indicators, in the order ACC, MCC, SN, SP, were 0.882, 0.774, 0.961, and 0.803 on Rice, while they were 0.798, 0.617, 0.666, and 0.929 on Arabidopsis. According to these indicators, our method was effective and better than the other methods considered. The results also illustrated that i6mA-vote performs well in 6mA site prediction not only within a species but also across plant species. Moreover, the specificity is distinctly lower than the sensitivity in Rice, while the opposite holds in Arabidopsis. This may result from sequence similarity among Rosaceae, Rice and Arabidopsis.
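The majority voting step and the four reported indicators (ACC, MCC, SN, SP) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the five base classifiers are represented simply by their 0/1 prediction lists, and nothing here reproduces the authors' implementation.

```python
import math


def majority_vote(predictions):
    """Combine per-classifier 0/1 label lists by strict majority."""
    n_models = len(predictions)
    return [1 if 2 * sum(col) > n_models else 0 for col in zip(*predictions)]


def binary_metrics(y_true, y_pred):
    """Accuracy, Matthews correlation coefficient, sensitivity, specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / len(pairs),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
        "SN": tp / (tp + fn) if tp + fn else 0.0,
        "SP": tn / (tn + fp) if tn + fp else 0.0,
    }


if __name__ == "__main__":
    # Toy predictions from five hypothetical base classifiers.
    votes = [
        [1, 0, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 0],
        [1, 0, 1, 1],
        [1, 0, 0, 0],
    ]
    combined = majority_vote(votes)
    print(combined, binary_metrics([1, 0, 1, 1], combined))
```

With an odd number of base classifiers, as in the five-classifier ensemble described above, a strict majority always exists, so no tie-breaking rule is needed.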
Affiliation(s)
- Zhixia Teng
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zhengnan Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yanjuan Li
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
| | - Zhen Tian
- College of Information Engineering, Zhengzhou University, Zhengzhou, China
| | - Maozu Guo
- College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
| | - Qianzi Lu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- *Correspondence: Qianzi Lu,
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Guohua Wang,
| |
|
36
|
Rehman MU, Tayara H, Chong KT. DCNN-4mC: Densely connected neural network based N4-methylcytosine site prediction in multiple species. Comput Struct Biotechnol J 2021; 19:6009-6019. [PMID: 34849205 PMCID: PMC8605313 DOI: 10.1016/j.csbj.2021.10.034] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 09/28/2021] [Revised: 10/27/2021] [Accepted: 10/28/2021] [Indexed: 01/17/2023]
Abstract
DNA N4-methylcytosine (4mC) is a significant genetic modification that plays a dominant role in controlling different biological functions, i.e., DNA replication, DNA repair, gene regulation and gene expression levels. Identifying 4mC sites is important for gaining insight into different organic mechanisms. However, obtaining modification predictions from experimental methods is challenging because the techniques are expensive and time-consuming. Computational tools are therefore a good option for modification identification. Various computational tools have been proposed in the literature, but their generalization and prediction performance require improvement. For this reason, we propose a neural-network-based tool named DCNN-4mC for identifying 4mC sites. The proposed model involves a set of neural network layers with a skip connection that shares the shallow features with the dense layers. The skip connection makes it possible to gather crucial information regarding 4mC sites. In the literature, different models are applied to different species; hence, in many cases, several datasets are available for a single species. In this research, we combined all available datasets to create a single benchmark dataset for every species. To the best of our knowledge, no model in the literature has been applied to more than six different species. To ensure the generalizability of DCNN-4mC, we used 12 different species for performance evaluation. The DCNN-4mC tool attained 2% to 14% higher accuracy than state-of-the-art tools on all available datasets of different species. Furthermore, independent test datasets were also used, and DCNN-4mC yielded high overall performance on them as well.
Affiliation(s)
- Mobeen Ur Rehman
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Department of Avionics Engineering, Air University, Islamabad 44000, Pakistan
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Corresponding author at: School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea (Hilal Tayara); Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea. (Kil To Chong)
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Advances Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
- Corresponding author at: School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea (Hilal Tayara); Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea. (Kil To Chong)
| |
|
37
|
Nguyen TTD, Trinh VN, Le NQK, Ou YY. Using k-mer embeddings learned from a Skip-gram based neural network for building a cross-species DNA N6-methyladenine site prediction model. Plant Molecular Biology 2021; 107:533-542. [PMID: 34843033 DOI: 10.1007/s11103-021-01204-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2021] [Accepted: 09/25/2021] [Indexed: 06/13/2023]
Abstract
This study used k-mer embeddings as effective features to identify DNA N6-Methyladenine sites in plant genomes and obtained improved performance without substantial effort in feature extraction, combination and selection. Identification of DNA N6-methyladenine sites has been a very active topic in computational biology owing to the lack of suitable methods to identify them accurately, especially in plants. Previous substantial results were obtained with great effort put into extracting, heuristically searching, or fusing diverse types of features, not to mention a feature selection step. In this study, we regarded DNA sequences as textual information and employed natural language processing techniques to decipher hidden biological meanings from those sequences. In other words, we considered DNA, the human life book, as a book corpus for training DNA language models. K-mer embeddings were then generated from these language models to be used in machine learning prediction models. Skip-gram neural networks were the base of the language models, and ensemble tree-based algorithms were the machine learning algorithms for the prediction models. We trained the prediction model on the Rosaceae genome dataset and performed a comprehensive test on 3 plant genome datasets. Our proposed method shows promising performance with AUC performance approaching an ideal value on Rosaceae dataset (0.99), a high score on Rice dataset (0.95) and improved performance on Rice dataset while enjoying an elegant, yet efficient feature extraction process.
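Treating DNA as text, as described above, starts by splitting a sequence into overlapping k-mer "words" and forming (center, context) pairs, which is the training input of a Skip-gram model. The sketch below shows only that preprocessing step; the function names, `k=3`, and the window size are assumptions, and the embedding training itself is omitted.

```python
def kmer_tokens(seq, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer 'words'."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def skipgram_pairs(tokens, window=2):
    """Build (center, context) training pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    tokens = kmer_tokens("ACGTAC", k=3)
    for center, context in skipgram_pairs(tokens, window=1):
        print(center, "->", context)
```

A Skip-gram network trained on such pairs learns one vector per k-mer; those vectors can then serve as features for a downstream classifier.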
Affiliation(s)
| | - Van Ngu Trinh
- Soonchunhyang Institute of Medi-Bio Science, Soonchunhyang University, Cheonan, 31151, South Korea
| | - Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City, 106, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei City, 106, Taiwan
| | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, 32003, Taiwan.
| |
|
38
|
Vigneau J, Borg M. The epigenetic origin of life history transitions in plants and algae. Plant Reproduction 2021; 34:267-285. [PMID: 34236522 PMCID: PMC8566409 DOI: 10.1007/s00497-021-00422-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/13/2021] [Accepted: 06/14/2021] [Indexed: 05/17/2023]
Abstract
Plants and algae have a complex life history that transitions between distinct life forms called the sporophyte and the gametophyte. This phenomenon, called the alternation of generations, has fascinated botanists and phycologists for over 170 years. Despite the mesmerizing array of life histories described in plants and algae, we are only now beginning to learn about the molecular mechanisms controlling them and how they evolved. Epigenetic silencing plays an essential role in regulating gene expression during multicellular development in eukaryotes, raising questions about its impact on the life history strategy of plants and algae. Here, we trace the origin and function of epigenetic mechanisms across the plant kingdom, from unicellular green algae through to angiosperms, and attempt to reconstruct the evolutionary steps that influenced life history transitions during plant evolution. Central to this evolutionary scenario is the adaptation of epigenetic silencing from a mechanism of genome defense to the repression and control of alternating generations. We extend our discussion beyond the green lineage and highlight the peculiar case of the brown algae. Unlike their unicellular diatom relatives, brown algae lack epigenetic silencing pathways common to animals and plants yet display complex life histories, hinting at the emergence of novel life history controls during stramenopile evolution.
Affiliation(s)
- Jérômine Vigneau
- Department of Algal Development and Evolution, Max Planck Institute for Developmental Biology, Tübingen, Germany
| | - Michael Borg
- Department of Algal Development and Evolution, Max Planck Institute for Developmental Biology, Tübingen, Germany.
| |
|
39
|
Zhang Y, Liu Y, Xu J, Wang X, Peng X, Song J, Yu DJ. Leveraging the attention mechanism to improve the identification of DNA N6-methyladenine sites. Brief Bioinform 2021; 22:bbab351. [PMID: 34459479 PMCID: PMC8575024 DOI: 10.1093/bib/bbab351] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 06/07/2021] [Revised: 08/02/2021] [Accepted: 08/09/2021] [Indexed: 11/12/2022]
Abstract
DNA N6-methyladenine is an important type of DNA modification that plays important roles in multiple biological processes. Despite the recent progress in developing DNA 6mA site prediction methods, several challenges remain to be addressed. For example, although the hand-crafted features are interpretable, they contain redundant information that may bias the model training and have a negative impact on the trained model. Furthermore, although deep learning (DL)-based models can perform feature extraction and classification automatically, they lack the interpretability of the crucial features learned by those models. As such, considerable research efforts have been focused on achieving the trade-off between the interpretability and straightforwardness of DL neural networks. In this study, we develop two new DL-based models for improving the prediction of N6-methyladenine sites, termed LA6mA and AL6mA, which use bidirectional long short-term memory to respectively capture the long-range information and self-attention mechanism to extract the key position information from DNA sequences. The performance of the two proposed methods is benchmarked and evaluated on the two model organisms Arabidopsis thaliana and Drosophila melanogaster. On the two benchmark datasets, LA6mA achieves an area under the receiver operating characteristic curve (AUROC) value of 0.962 and 0.966, whereas AL6mA achieves an AUROC value of 0.945 and 0.941, respectively. Moreover, an in-depth analysis of the attention matrix is conducted to interpret the important information, which is hidden in the sequence and relevant for 6mA site prediction. The two novel pipelines developed for DNA 6mA site prediction in this work will facilitate a better understanding of the underlying principle of DL-based DNA methylation site prediction and its future applications.
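The idea of using self-attention to highlight key sequence positions can be illustrated with a generic scaled dot-product self-attention over one-hot nucleotide vectors. This is a textbook formulation with no learned projections (queries, keys and values are all the raw inputs), chosen as an assumption for illustration; it is not necessarily the attention variant used in LA6mA or AL6mA.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X.

    X is a list of equal-length position vectors. Returns the attention
    weight matrix and the attended output vectors.
    """
    d = len(X[0])
    weights = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights.append(softmax(scores))
    output = [
        [sum(w * v[c] for w, v in zip(row, X)) for c in range(d)]
        for row in weights
    ]
    return weights, output


if __name__ == "__main__":
    one_hot = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    weights, _ = self_attention([one_hot[b] for b in "ACA"])
    for row in weights:
        print([round(w, 3) for w in row])
```

Each row of the weight matrix shows how strongly one position attends to every other position, which is the kind of matrix an interpretability analysis would inspect.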
Affiliation(s)
- Ying Zhang
- School of Computer Science and Engineering at Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Yan Liu
- School of Computer Science and Engineering at Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Jian Xu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Xiaoyu Wang
- Monash Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Xinxin Peng
- Monash Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and the Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| |
|
40
|
Ao C, Gao L, Yu L. Research progress in predicting DNA methylation modifications and the relation with human diseases. Curr Med Chem 2021; 29:822-836. [PMID: 34533438 DOI: 10.2174/0929867328666210917115733] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/24/2021] [Revised: 07/05/2021] [Accepted: 07/11/2021] [Indexed: 11/22/2022]
Abstract
DNA methylation is an important mode of regulation in epigenetic mechanisms, and it is one of the research foci in the field of epigenetics. DNA methylation modification affects a series of biological processes, such as eukaryotic cell growth, differentiation and transformation mechanisms, by regulating gene expression. In this review, we systematically summarized the DNA methylation databases, prediction tools for DNA methylation modification, machine learning algorithms for predicting DNA methylation modification, and the relationship between DNA methylation modification and diseases such as hypertension, Alzheimer's disease, diabetic nephropathy, and cancer. An in-depth understanding of DNA methylation mechanisms can promote accurate prediction of DNA methylation modifications and the treatment and diagnosis of related diseases.
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Lin Gao
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| |
|
41
|
Zulfiqar H, Sun ZJ, Huang QL, Yuan SS, Lv H, Dao FY, Lin H, Li YW. Deep-4mCW2V: A sequence-based predictor to identify N4-methylcytosine sites in Escherichia coli. Methods 2021; 203:558-563. [PMID: 34352373 DOI: 10.1016/j.ymeth.2021.07.011] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 07/01/2021] [Revised: 07/22/2021] [Accepted: 07/29/2021] [Indexed: 10/20/2022]
Abstract
N4-methylcytosine (4mC) is a type of DNA modification that can regulate several biological processes such as transcription regulation, replication and gene expression. Precisely recognizing 4mC sites in genomic sequences can provide specific knowledge about their genetic roles. This study aimed to develop a deep learning-based model to predict 4mC sites in Escherichia coli. In the model, DNA sequences were encoded by the word embedding technique 'word2vec'. The obtained features were fed into a 1-D convolutional neural network (CNN) to discriminate 4mC sites from non-4mC sites in the Escherichia coli genome. Evaluation on an independent dataset showed that our model yielded an overall accuracy of 0.861, which was about 4.3% higher than the existing model. For the convenience of other researchers, the data and source code of the model can be freely downloaded from https://github.com/linDing-groups/Deep-4mCW2V.
Affiliation(s)
- Hasan Zulfiqar
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zi-Jie Sun
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Qin-Lai Huang
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Shi-Shi Yuan
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hao Lv
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Fu-Ying Dao
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hao Lin
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Yan-Wen Li
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China; Key Laboratory of Intelligent Information Processing of Jilin Province, Northeast Normal University, Changchun 130117, China; Institute of Computational Biology, Northeast Normal University, Changchun 130117, China.
| |
|
42
|
i4mC-Deep: An Intelligent Predictor of N4-Methylcytosine Sites Using a Deep Learning Approach with Chemical Properties. Genes (Basel) 2021; 12:genes12081117. [PMID: 34440291 PMCID: PMC8393747 DOI: 10.3390/genes12081117] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 06/07/2021] [Revised: 07/15/2021] [Accepted: 07/16/2021] [Indexed: 01/26/2023]
Abstract
DNA is subject to epigenetic modification by the molecule N4-methylcytosine (4mC). N4-methylcytosine plays a crucial role in DNA repair and replication, protects host DNA from degradation, and regulates DNA expression. However, although current experimental techniques can identify 4mC sites, such techniques are expensive and laborious. Therefore, computational tools that can predict 4mC sites would be very useful for understanding the biological mechanism of this vital type of DNA modification. Conventional machine-learning-based methods rely on hand-crafted features, but the new method saves time and computational cost by making use of learned features instead. In this study, we propose i4mC-Deep, an intelligent predictor based on a convolutional neural network (CNN) that predicts 4mC modification sites in DNA samples. The CNN is capable of automatically extracting important features from input samples during training. Nucleotide chemical properties and nucleotide density, which together represent a DNA sequence, act as CNN input data. The proposed method outperforms several state-of-the-art predictors. When i4mC-Deep was used to analyze G. subterraneus DNA, the accuracy of the results was improved by 3.9% and MCC increased by 10.5% compared to a conventional predictor.
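The input representation described above, nucleotide chemical properties combined with nucleotide density, can be sketched as follows. The three-bit property table is the mapping commonly used in the literature (ring structure, functional group, hydrogen bonding) and is assumed here rather than taken from the abstract; the function name is hypothetical.

```python
# Commonly used chemical-property codes (assumed, not stated in the abstract):
# (ring structure, functional group, hydrogen bonding)
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def ncp_nd_encode(seq):
    """Encode each position as three property bits plus a density value.

    The density at position i is the fraction of the prefix seq[:i]
    (1-based, inclusive) that equals the nucleotide at position i.
    """
    seen = {}
    rows = []
    for i, base in enumerate(seq.upper(), start=1):
        seen[base] = seen.get(base, 0) + 1
        density = seen[base] / i
        rows.append(NCP.get(base, (0, 0, 0)) + (density,))
    return rows


if __name__ == "__main__":
    for row in ncp_nd_encode("AACGT"):
        print(row)
```

The result is a length-by-4 matrix per sequence, a shape that can be passed directly to a 1D convolutional layer.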
|
43
|
i4mC-EL: Identifying DNA N4-Methylcytosine Sites in the Mouse Genome Using Ensemble Learning. BioMed Research International 2021; 2021:5515342. [PMID: 34159192 PMCID: PMC8187051 DOI: 10.1155/2021/5515342] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/09/2021] [Accepted: 05/21/2021] [Indexed: 12/03/2022]
Abstract
As one of the important epigenetic modifications, DNA N4-methylcytosine (4mC) plays a crucial role in controlling gene replication, expression, cell cycle, DNA replication, and differentiation. The accurate identification of 4mC sites is necessary to understand biological functions. In this paper, we use ensemble learning to develop a model named i4mC-EL to identify 4mC sites in the mouse genome. Firstly, a multifeature encoding scheme consisting of Kmer and EIIP was adopted to describe the DNA sequences. Secondly, on the basis of the multifeature encoding scheme, we developed a stacked ensemble model in which four machine learning algorithms, namely BayesNet, NaiveBayes, LibSVM, and Voted Perceptron, were used as base classifiers whose intermediate results serve as input to the metaclassifier, Logistic. The experimental results on the independent test dataset demonstrate that the overall prediction accuracy of i4mC-EL is 82.19%, which is better than that of the existing methods. The user-friendly website implementing i4mC-EL can be accessed freely at the following.
|
44
|
Fernandes SB, Grova N, Roth S, Duca RC, Godderis L, Guebels P, Mériaux SB, Lumley AI, Bouillaud-Kremarik P, Ernens I, Devaux Y, Schroeder H, Turner JD. N6-Methyladenine in Eukaryotic DNA: Tissue Distribution, Early Embryo Development, and Neuronal Toxicity. Front Genet 2021; 12:657171. [PMID: 34108991 PMCID: PMC8181416 DOI: 10.3389/fgene.2021.657171] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 01/22/2021] [Accepted: 04/12/2021] [Indexed: 01/09/2023]
Abstract
DNA methylation is one of the most important epigenetic modifications and is closely related to several biological processes such as regulation of gene transcription and the development of non-malignant diseases. The prevailing dogma states that DNA methylation in eukaryotes occurs essentially through 5-methylcytosine (5mC), but recently adenine methylation was also found to be present in eukaryotes. In mouse embryonic stem cells, 6-methyladenine (6mA) was associated with the repression and silencing of genes, particularly in the X-chromosome, known to play an important role in cell fate determination. Here, we have demonstrated that 6mA is a ubiquitous eukaryotic epigenetic modification that is put in place during epigenetically sensitive periods such as embryogenesis and fetal development. In somatic cells there is clear tissue specificity in 6mA levels, with the highest 6mA levels being observed in the brain. In zebrafish, during the first 120 h of embryo development, from a single pluripotent cell to an almost fully formed individual, 6mA levels steadily increase. An identical pattern was observed over embryonic days 7–21 in the mouse. Furthermore, exposure to a neurotoxic environmental pollutant during the same early life period may lead to a decrease in the levels of this modification in female rats. The identification of the periods during which 6mA epigenetic marks are put in place increases our understanding of this mammalian epigenetic modification, and raises the possibility that it may be associated with developmental processes.
Affiliation(s)
- Sara B Fernandes
- Immune Endocrine Epigenetics Research Group, Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg; Faculty of Science, Technology and Medicine, University of Luxembourg, Belval, Luxembourg
- Nathalie Grova
- Immune Endocrine Epigenetics Research Group, Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg; Calbinotox, EA7488, Faculty of Science and Technology, University of Lorraine, Vandoeuvre-lès-Nancy, France
- Sarah Roth
- Immune Endocrine Epigenetics Research Group, Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg
- Radu Corneliu Duca
- Unit Environmental Hygiene and Human Biological Monitoring, Department of Health Protection, National Health Laboratory (LNS), Dudelange, Luxembourg; Centre for Environment and Health, Department of Public Health and Primary Care, KU Leuven, Leuven, Belgium
- Lode Godderis
- Centre for Environment and Health, Department of Public Health and Primary Care, KU Leuven, Leuven, Belgium; IDEWE, External Service for Prevention and Protection at Work, Heverlee, Belgium
- Pauline Guebels
- Immune Endocrine Epigenetics Research Group, Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg
- Sophie B Mériaux
- Immune Endocrine Epigenetics Research Group, Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg
- Andrew I Lumley
- Cardiovascular Research Unit, Department of Public Health, Luxembourg Institute of Health, Strassen, Luxembourg
- Isabelle Ernens
- Cardiovascular Research Unit, Department of Public Health, Luxembourg Institute of Health, Strassen, Luxembourg
- Yvan Devaux
- Cardiovascular Research Unit, Department of Public Health, Luxembourg Institute of Health, Strassen, Luxembourg
- Henri Schroeder
- Calbinotox, EA7488, Faculty of Science and Technology, University of Lorraine, Vandoeuvre-lès-Nancy, France
- Jonathan D Turner
- Immune Endocrine Epigenetics Research Group, Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg
|
45
|
Zulfiqar H, Khan RS, Hassan F, Hippe K, Hunt C, Ding H, Song XM, Cao R. Computational identification of N4-methylcytosine sites in the mouse genome with machine-learning method. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:3348-3363. [PMID: 34198389 DOI: 10.3934/mbe.2021167] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 04/24/2023]
Abstract
N4-methylcytosine (4mC) is a DNA modification that can regulate multiple biological processes. Correctly identifying 4mC sites in genomic sequences can provide precise knowledge about their genetic roles. This study aimed to develop an ensemble model to predict 4mC sites in the mouse genome. In the proposed model, DNA sequences were encoded by k-mer, enhanced nucleic acid composition and composition of k-spaced nucleic acid pairs. Subsequently, these features were optimized by using minimum redundancy maximum relevance (mRMR) with incremental feature selection (IFS) and five-fold cross-validation. The resulting optimal features were input into a random forest classifier to discriminate 4mC from non-4mC sites in mouse. On the independent dataset, our model yielded an overall accuracy of 85.41%, which was approximately 3.8%-6.3% higher than the two existing models, i4mC-Mouse and 4mCpred-EL, respectively. The data and source code of the model can be freely downloaded from https://github.com/linDing-groups/model_4mc.
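One of the encodings named in this abstract, the composition of k-spaced nucleic acid pairs, can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the authors' code: the function name `cksnap`, the `max_gap` default, and the choice to normalise by the number of unambiguous pairs are all hypothetical choices made here for illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def cksnap(sequence, max_gap=2):
    """Return pair frequencies for gaps 0..max_gap (16 values per gap).

    For a gap g, a pair is formed by the bases at positions i and i + g + 1.
    Pairs containing a symbol outside A/C/G/T are skipped (an assumption).
    """
    seq = sequence.upper()
    pairs = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        # Normalise counts to frequencies; an empty window yields zeros.
        features.extend(counts[p] / total if total else 0.0 for p in pairs)
    return features


if __name__ == "__main__":
    vector = cksnap("ACGTACGT", max_gap=2)
    print(len(vector))  # 16 pair frequencies for each of the 3 gaps
```

The resulting fixed-length vector could then be concatenated with other encodings before feature selection and classification; those later steps are not shown here.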
Affiliation(s)
- Hasan Zulfiqar
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Rida Sarwar Khan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Farwa Hassan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Kyle Hippe
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
- Cassandra Hunt
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
- Hui Ding
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Xiao-Ming Song
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- School of Life Sciences, North China University of Science and Technology, Tangshan, Hebei 063210, China
- Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
|
46
|
Yang X, Ye X, Li X, Wei L. iDNA-MT: Identification DNA Modification Sites in Multiple Species by Using Multi-Task Learning Based a Neural Network Tool. Front Genet 2021; 12:663572. [PMID: 33868390 PMCID: PMC8044371 DOI: 10.3389/fgene.2021.663572] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/03/2021] [Accepted: 03/02/2021] [Indexed: 02/04/2023] Open
Abstract
Motivation: DNA N4-methylcytosine (4mC) and N6-methyladenine (6mA) are two important DNA modifications and play crucial roles in a variety of biological processes. Accurate identification of the modifications is essential to better understand their biological functions and mechanisms. However, existing methods to identify 4mC or 6mA sites are all single-task methods, meaning that each can identify only one modification in one species. Therefore, it is desirable to develop a novel computational method to identify the modification sites in multiple species simultaneously. Results: In this study, we proposed a computational method, called iDNA-MT, to identify 4mC sites and 6mA sites in multiple species, respectively. The proposed iDNA-MT mainly employed multi-task learning coupled with bidirectional gated recurrent units (BGRU) to capture the information shared among different species directly from DNA primary sequences. Experimental comparative results on two benchmark datasets, each containing different species, show that for identifying either 4mC or 6mA sites in multiple species, the proposed iDNA-MT outperforms other state-of-the-art single-task methods. The promising results have demonstrated that iDNA-MT has great potential to be a powerful and practically useful tool to accurately identify DNA modifications.
Affiliation(s)
- Xiao Yang
- School of Software, Shandong University, Jinan, China
- Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
- Xuehong Li
- Department of Rehabilitation, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
- Lesong Wei
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
|
47
|
Khanal J, Tayara H, Zou Q, Chong KT. Identifying DNA N4-methylcytosine sites in the rosaceae genome with a deep learning model relying on distributed feature representation. Comput Struct Biotechnol J 2021; 19:1612-1619. [PMID: 33868598 PMCID: PMC8042287 DOI: 10.1016/j.csbj.2021.03.015] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/09/2021] [Revised: 03/12/2021] [Accepted: 03/13/2021] [Indexed: 12/11/2022] Open
Abstract
DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly focused on finding hand-crafted features. This area of research, therefore, would benefit from the development of a computational approach that relies on automatic feature selection to identify relevant sites. We here report 4mC-w2vec, a computational method that learned automatic feature discrimination in the Rosaceae genomes, especially in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca), based on distributed feature representation and through the word embedding technique ‘word2vec’. While a few bioinformatics tools are currently employed to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system processed 4mC and non-4mC sites through a word embedding process, including sub-word information of its biological words through k-mer, which then served as features that were fed into a two-layer convolutional neural network (CNN) to classify whether the sample sequences contained 4mC or non-4mC sites. Our tool demonstrated performance superior to current tools that use the same genomic datasets. Additionally, 4mC-w2vec is effective for balanced and imbalanced class datasets alike, and the online web-server is currently available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/.
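The idea of treating k-mers as "biological words" before an embedding step can be sketched as follows. This is a hedged illustration only: the helper names `kmer_words` and `build_vocab`, the value k=3, and the stride of 1 are assumptions introduced here, and the word2vec training and CNN stages described in the abstract are not reproduced.

```python
def kmer_words(sequence, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer 'words'."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(sentences):
    """Map every distinct word to an integer index (sorted for stability)."""
    words = sorted({word for sentence in sentences for word in sentence})
    return {word: index for index, word in enumerate(words)}


if __name__ == "__main__":
    # Hypothetical toy sequences; real inputs would be fixed-length windows
    # around candidate sites.
    sequences = ["ACGTAC", "GTACGT"]
    sentences = [kmer_words(s, k=3) for s in sequences]
    vocab = build_vocab(sentences)
    encoded = [[vocab[w] for w in sentence] for sentence in sentences]
    print(sentences)
    print(encoded)
```

Each sequence becomes a "sentence" of k-mer words, which is the form an embedding model would consume; the integer indices stand in for the learned vectors.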
Affiliation(s)
- Jhabindra Khanal
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
|
48
|
Abbas Z, Tayara H, Chong KT. 4mCPred-CNN-Prediction of DNA N4-Methylcytosine in the Mouse Genome Using a Convolutional Neural Network. Genes (Basel) 2021; 12:296. [PMID: 33672576 PMCID: PMC7924022 DOI: 10.3390/genes12020296] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 01/24/2021] [Revised: 02/16/2021] [Accepted: 02/17/2021] [Indexed: 02/07/2023] Open
Abstract
Among DNA modifications, N4-methylcytosine (4mC) is one of the most significant, and it is linked to the development of cell proliferation and gene expression. To understand its different biological functions, accurate detection of 4mC sites is required. Although several techniques exist for the prediction of 4mC sites in different genomes based on both machine learning (ML) and convolutional neural networks (CNNs), there is no CNN-based tool for the identification of 4mC sites in the mouse genome. In this article, a CNN-based model named 4mCPred-CNN was developed to classify 4mC locations in the mouse genome. Until now, only two ML-based models were available for this purpose; they utilized several feature encoding schemes and still left considerable room to improve the prediction accuracy. Utilizing only a single feature encoding scheme (one-hot encoding), we outperformed both of the previous ML-based techniques. In a ten-fold validation test, the proposed model, 4mCPred-CNN, achieved an accuracy of 85.71% and a Matthews correlation coefficient (MCC) of 0.717. On an independent dataset, the achieved accuracy was 87.50% with an MCC value of 0.750. These results indicate that the proposed model can be of great use for researchers in the fields of biology and bioinformatics.
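The two technical ingredients mentioned in this abstract, one-hot encoding of a DNA sequence and the Matthews correlation coefficient, can be sketched in a few lines of Python. This is an illustrative sketch rather than the published implementation: the function names, the all-zero row for ambiguous bases, and the confusion-matrix counts in the example are assumptions and are not taken from the paper.

```python
import math

# One row per base; symbols outside A/C/G/T map to an all-zero row (assumption).
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows (length x 4 matrix)."""
    return [list(ONE_HOT.get(base, (0, 0, 0, 0))) for base in sequence.upper()]


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 if the denominator is 0."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    print(one_hot_encode("ACGT"))
    # Illustrative counts only: a balanced test set of 160 sites.
    print(matthews_corrcoef(tp=70, tn=70, fp=10, fn=10))
```

For a balanced test set with equal sensitivity and specificity, the MCC reduces to 2 × accuracy − 1, so an accuracy of 87.50% would correspond to an MCC of 0.750, matching the pair of values reported for the independent dataset; whether the paper's actual confusion matrix has this shape is not stated in the abstract.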
Affiliation(s)
- Zeeshan Abbas
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Institute of Avionics and Aeronautics (IAA), Air University, Islamabad 44000, Pakistan
- Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Korea
- Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
|
49
|
Li W, Zhang T, Sun M, Shi Y, Zhang XJ, Xu GL, Ding J. Molecular mechanism for vitamin C-derived C5-glyceryl-methylcytosine DNA modification catalyzed by algal TET homologue CMD1. Nat Commun 2021; 12:744. [PMID: 33531488 PMCID: PMC7854593 DOI: 10.1038/s41467-021-21061-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 09/14/2020] [Accepted: 01/11/2021] [Indexed: 01/07/2023] Open
Abstract
C5-glyceryl-methylcytosine (5gmC) is a novel DNA modification catalyzed by algal TET homologue CMD1 using vitamin C (VC) as co-substrate. Here, we report the structures of CMD1 in apo form and in complexes with VC and/or dsDNA. CMD1 exhibits comparable binding affinities for DNAs of different lengths, structures, and 5mC levels, and displays a moderate substrate preference for 5mCpG-containing DNA. CMD1 adopts the typical DSBH fold of Fe2+/2-OG-dependent dioxygenases. The lactone form of VC binds to the active site and mono-coordinates the Fe2+ in a manner different from 2-OG. The dsDNA binds to a positively charged cleft of CMD1, and the 5mC/C is inserted into the active site and recognized by CMD1 in a manner similar to that of the TET proteins. The functions of key residues are validated by mutagenesis and activity assay. Our structural and biochemical data together reveal the molecular mechanism for the VC-derived 5gmC DNA modification by CMD1.
Affiliation(s)
- Wenjing Li
- State Key Laboratory of Molecular Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
- Tianlong Zhang
- State Key Laboratory of Molecular Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
- Mingliang Sun
- State Key Laboratory of Molecular Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
- Yu Shi
- State Key Laboratory of Molecular Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China; School of Life Science and Technology, ShanghaiTech University, Shanghai, China
- Xiao-Jie Zhang
- State Key Laboratory of Molecular Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
- Guo-Liang Xu
- State Key Laboratory of Molecular Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
- Jianping Ding
- State Key Laboratory of Molecular Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China; School of Life Science and Technology, ShanghaiTech University, Shanghai, China; School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, China
|
50
|
Hasan MM, Shoombuatong W, Kurata H, Manavalan B. Critical evaluation of web-based DNA N6-methyladenine site prediction tools. Brief Funct Genomics 2021; 20:258-272. [PMID: 33491072 DOI: 10.1093/bfgp/elaa028] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 10/31/2020] [Revised: 12/11/2020] [Accepted: 12/15/2020] [Indexed: 12/13/2022] Open
Abstract
DNA N6-methyladenine (6mA) is a type of epigenetic modification that plays pivotal roles in various biological processes. Accurate genome-wide identification of 6mA is a challenging task that is key to understanding its biological functions. Over the last 5 years, a number of bioinformatics approaches and tools for 6mA site prediction have been established, and some of them are easily accessible as web applications. Nevertheless, accurate genome-wide identification of 6mA remains challenging. In particular, these tools have implemented diverse encoding schemes, machine learning algorithms and feature selection methods, whereas few systematic performance comparisons of 6mA site predictors have been reported. In this review, 11 publicly available 6mA predictors were evaluated with seven different species-specific datasets (Arabidopsis thaliana, Tolypocladium, Diospyros lotus, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans and Escherichia coli). Of those, few species are close homologs, and the remaining datasets are distant sequences. Our independent validation tests demonstrated that the Meta-i6mA and MM-6mAPred models for A. thaliana, Tolypocladium, S. cerevisiae and D. melanogaster achieved excellent overall performance when compared with their counterparts. However, none of the existing methods were suitable for E. coli, C. elegans and D. lotus. The feasibility of the existing predictors is also discussed for the seven species. Our evaluation provides useful guidelines for the development of 6mA site predictors and helps biologists select suitable prediction tools.
Affiliation(s)
- Watshara Shoombuatong
- Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University
- Hiroyuki Kurata
- Department of Bioscience and Bioinformatics in the Kyushu Institute of Technology, Japan
|