1
|
Kang Y, Wang H, Qin Y, Liu G, Yu Y, Zhang Y. PSATF-6mA: an integrated learning fusion feature-encoded DNA-6 mA methylcytosine modification site recognition model based on attentional mechanisms. Front Genet 2024; 15:1498884. [PMID: 39600317 PMCID: PMC11588721 DOI: 10.3389/fgene.2024.1498884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2024] [Accepted: 10/30/2024] [Indexed: 11/29/2024] Open
Abstract
DNA methylation is of crucial importance for biological genetic expression, such as biological cell differentiation and cellular tumours. The identification of DNA-6mA sites using traditional biological experimental methods requires more cumbersome steps and a large amount of time. The advent of neural network technology has facilitated the identification of 6 mA sites on cross-species DNA with enhanced efficacy. Nevertheless, the majority of contemporary neural network models for identifying 6 mA sites prioritize the design of the identification model, with comparatively limited research conducted on the statistically significant DNA sequence itself. Consequently, this paper will focus on the statistical strategy of DNA double-stranded features, utilising the multi-head self-attention mechanism in neural networks applied to DNA position probabilistic relationships. Furthermore, a new recognition model, PSATF-6 mA, will be constructed by continually adjusting the attentional tendency of feature fusion through an integrated learning framework. The experimental results, obtained through cross-validation with cross-species data, demonstrate that the PSATF-6 mA model outperforms the baseline model. The in-Matthews correlation coefficient (MCC) for the cross-species dataset of rice and m. musus genomes can reach a score of 0.982. The present model is expected to assist biologists in more accurately identifying 6 mA locus and in formulating new testable biological hypotheses.
Collapse
Affiliation(s)
- Yanmei Kang
- School of Cyber Science and Engineering, University of International Relations, Beijing, China
| | - Hongyuan Wang
- School of Cyber Science and Engineering, University of International Relations, Beijing, China
| | - Yubo Qin
- School of Cyber Science and Engineering, University of International Relations, Beijing, China
| | - Guanlin Liu
- School of Cyber Science and Engineering, University of International Relations, Beijing, China
| | - Yi Yu
- College of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China
| | - Yongjian Zhang
- School of Cyber Science and Engineering, University of International Relations, Beijing, China
| |
Collapse
|
2
|
Han K, Wang J, Chu Y, Liao Q, Ding Y, Zheng D, Wan J, Guo X, Zou Q. Deep learning based method for predicting DNA N6-methyladenosine sites. Methods 2024; 230:91-98. [PMID: 39097179 DOI: 10.1016/j.ymeth.2024.07.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2024] [Revised: 07/22/2024] [Accepted: 07/29/2024] [Indexed: 08/05/2024] Open
Abstract
DNA N6 methyladenine (6mA) plays an important role in many biological processes, and accurately identifying its sites helps one to understand its biological effects more comprehensively. Previous traditional experimental methods are very labor-intensive and traditional machine learning methods also seem to be somewhat insufficient as the database of 6mA methylation groups becomes progressively larger, so we propose a deep learning-based method called multi-scale convolutional model based on global response normalization (CG6mA) to solve the prediction problem of 6mA site. This method is tested with other methods on three different kinds of benchmark datasets, and the results show that our model can get more excellent prediction results.
Collapse
Affiliation(s)
- Ke Han
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
| | - Jianchun Wang
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
| | - Ying Chu
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
| | - Qian Liao
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Dequan Zheng
- School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
| | - Jie Wan
- Laboratory for Space Environment and Physical Sciences, Harbin Institute of Technology, Harbin 150001, China
| | - Xiaoyi Guo
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China.
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China.
| |
Collapse
|
3
|
Zheng Y, Li H, Lin S. m7GRegpred: substrate prediction of N7-methylguanosine (m7G) writers and readers based on sequencing features. Front Genet 2024; 15:1469011. [PMID: 39262420 PMCID: PMC11387174 DOI: 10.3389/fgene.2024.1469011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2024] [Accepted: 08/19/2024] [Indexed: 09/13/2024] Open
Abstract
N7-Methylguanosine (m7G) is important RNA modification at internal and the cap structure of five terminal end of message RNA. It is essential for RNA stability of RNA, the efficiency of translation, and various intracellular RNA processing pathways. Given the significance of the m7G modification, numerous studies have been conducted to predict m7G sites. To further elucidate the regulatory mechanisms surrounding m7G, we introduce a novel bioinformatics framework, m7GRegpred, designed to forecast the targets of the m7G methyltransferases METTL1 and WDR4, and m7G readers QKI5, QKI6, and QKI7 for the first time. We integrated different features to build predictors, with AUROC scores of 0.856, 0.857, 0.780, 0.776, 0.818 for METTL1, WDR4, QKI5, QKI6, and QKI7, respectively. In addition, the effect of window lengths and algorism were systemically evaluated in this work. The finial model was summarized in a user-friendly webserver: http://modinfor.com/m7GRegpred/. Our research indicates that the substrates of m7G regulators can be identified and may potentially advance the study of m7G regulators under unique conditions.
Collapse
Affiliation(s)
- Yu Zheng
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, Fujian, China
- School of Medical Technology and Engineering, Fujian Medical University, Fuzhou, China
| | - Haipeng Li
- Graduate School of Fujian Medical University, Fuzhou, Fujian, China
- Department of Operating Room, Second Affiliated Hospital of Fujian Medical University, Quanzhou, China
| | - Shaofeng Lin
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, Fujian, China
- School of Medical Technology and Engineering, Fujian Medical University, Fuzhou, China
| |
Collapse
|
4
|
Naseem A, Khan YD. An intelligent model for prediction of abiotic stress-responsive microRNAs in plants using statistical moments based features and ensemble approaches. Methods 2024; 228:65-79. [PMID: 38768931 DOI: 10.1016/j.ymeth.2024.05.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2024] [Revised: 04/30/2024] [Accepted: 05/10/2024] [Indexed: 05/22/2024] Open
Abstract
This study proposed an intelligent model for predicting abiotic stress-responsive microRNAs in plants. MicroRNAs (miRNAs) are short RNA molecules regulates the stress in genes. Experimental methods are costly and time-consuming, as compare to in-silico prediction. Addressing this gap, the study seeks to develop an efficient computational model for plant stress response prediction. The two benchmark datasets for MiRNA and Pre-MiRNA dataset have been acquired in this study. Four ensemble approaches such as bagging, boosting, stacking, and blending have been employed. Classifiers such as Random Forest (RF), Extra Trees (ET), Ada Boost (ADB), Light Gradient Boosting Machine (LGBM), and Support Vector Machine (SVM). Stacking and Blending employed all stated classifiers as base learners and Logistic Regression (LR) as Meta Classifier. There have been a total of four types of testing used, including independent set, self-consistency, cross-validation with 5 and 10 folds, and jackknife. This study has utilized evaluation metrics such as accuracy score, specificity, sensitivity, Mathew's correlation coefficient (MCC), and AUC. Our proposed methodology has outperformed existing state of the art study in both datasets based on independent set testing. The SVM-based approach has exhibited accuracy score of 0.659 for the MiRNA dataset, which is better than the previous study. The ET classifier has surpassed the accuracy of Pre-MiRNA dataset as compared to the existing benchmark study, achieving an impressive score of 0.67. The proposed method can be used in future research to predict abiotic stresses in plants.
Collapse
Affiliation(s)
- Ansar Naseem
- Department of Artificial Intelligence, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan.
| |
Collapse
|
5
|
Alromema N, Suleman MT, Malebary SJ, Ahmed A, Ali Mohammed Al-Rami Al-Ghamdi B, Khan YD. Identification of 6-methyladenosine sites using novel feature encoding methods and ensemble models. Sci Rep 2024; 14:8180. [PMID: 38589431 PMCID: PMC11001897 DOI: 10.1038/s41598-024-58353-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2024] [Accepted: 03/28/2024] [Indexed: 04/10/2024] Open
Abstract
N6-methyladenosine (6 mA) is the most common internal modification in eukaryotic mRNA. Mass spectrometry and site-directed mutagenesis, two of the most common conventional approaches, have been shown to be laborious and challenging. In recent years, there has been a rising interest in analyzing RNA sequences to systematically investigate mutated locations. Using novel methods for feature development, the current work aimed to identify 6 mA locations in RNA sequences. Following the generation of these novel features, they were used to train an ensemble of models using methods such as stacking, boosting, and bagging. The trained ensemble models were assessed using an independent test set and k-fold cross validation. When compared to baseline predictors, the suggested model performed better and showed improved ratings across the board for key measures of accuracy.
Collapse
Affiliation(s)
- Nashwan Alromema
- Department of Computer Science, Faculty of Computing and Information Technology-Rabigh, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Muhammad Taseer Suleman
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, 54770, Pakistan.
- Department of Criminology and Forensic Sciences, Lahore Garrison University, Lahore, Pakistan.
| | - Sharaf J Malebary
- Department of Information Technology, Faculty of Computing and Information Technology-Rabigh, King Abdulaziz University, P.O. Box 344, 21911, Rabigh, Saudi Arabia
| | - Amir Ahmed
- Department of Information Systems and Security, College of Information Technology, United Arab Emirates University, Alain, United Arab Emirates
| | | | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, 54770, Pakistan
| |
Collapse
|
6
|
Liang P, Li H, Long C, Liu M, Zhou J, Zuo Y. Chromatin region binning of gene expression for improving embryo cell subtype identification. Comput Biol Med 2024; 170:108049. [PMID: 38290319 DOI: 10.1016/j.compbiomed.2024.108049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2023] [Revised: 01/01/2024] [Accepted: 01/26/2024] [Indexed: 02/01/2024]
Abstract
Mammalian embryonic development is a complex process, characterized by intricate spatiotemporal dynamics and distinct chromatin preferences. However, the quick diversification in early embryogenesis leads to significant cellular diversity and the sparsity of scRNA-seq data, posing challenges in accurately determining cell fate decisions. In this study, we introduce a chromatin region binning method using scChrBin, designed to identify chromatin regions that elucidate the dynamics of embryonic development and lineage differentiation. This method transforms scRNA-seq data into a chromatin-based matrix, leveraging genomic annotations. Our results showed that the scChrBin method achieves high accuracy, with 98.0% and 89.2% on two single-cell embryonic datasets, demonstrating its effectiveness in analyzing complex developmental processes. We also systematically and comprehensively analysis of these key chromatin binning regions and their associated genes, focusing on their roles in lineage and stage development. The perspective of chromatin region binning method enables a comprehensive analysis of transcriptome data at the chromatin level, allowing us to unveil the dynamic expression of chromatin regions across temporal and spatial development. The tool is available as an application at https://github.com/liameihao/scChrBin.
Collapse
Affiliation(s)
- Pengfei Liang
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Hanshuang Li
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Chunshen Long
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Mingzhu Liu
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Jian Zhou
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Yongchun Zuo
- The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China.
| |
Collapse
|
7
|
Yin Z, Lyu J, Zhang G, Huang X, Ma Q, Jiang J. SoftVoting6mA: An improved ensemble-based method for predicting DNA N6-methyladenine sites in cross-species genomes. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:3798-3815. [PMID: 38549308 DOI: 10.3934/mbe.2024169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/02/2024]
Abstract
The DNA N6-methyladenine (6mA) is an epigenetic modification, which plays a pivotal role in biological processes encompassing gene expression, DNA replication, repair, and recombination. Therefore, the precise identification of 6mA sites is fundamental for better understanding its function, but challenging. We proposed an improved ensemble-based method for predicting DNA N6-methyladenine sites in cross-species genomes called SoftVoting6mA. The SoftVoting6mA selected four (electron-ion-interaction pseudo potential, One-hot encoding, Kmer, and pseudo dinucleotide composition) codes from 15 types of encoding to represent DNA sequences by comparing their performances. Similarly, the SoftVoting6mA combined four learning algorithms using the soft voting strategy. The 5-fold cross-validation and the independent tests showed that SoftVoting6mA reached the state-of-the-art performance. To enhance accessibility, a user-friendly web server is provided at http://www.biolscience.cn/SoftVoting6mA/.
Collapse
Affiliation(s)
- Zhaoting Yin
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| | - Jianyi Lyu
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| | - Guiyang Zhang
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| | - Xiaohong Huang
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| | - Qinghua Ma
- College of Information Science and Engineering, Hohai University, Nanjing 210000, China
- Faculty of Information Technology, University of Jyvaskyla, Jyvaskyla, Finland
| | - Jinyun Jiang
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
| |
Collapse
|
8
|
Bai J, Yang H, Wu C. MLACNN: an attention mechanism-based CNN architecture for predicting genome-wide DNA methylation. Theory Biosci 2023; 142:359-370. [PMID: 37648910 PMCID: PMC10564812 DOI: 10.1007/s12064-023-00402-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2022] [Accepted: 07/31/2023] [Indexed: 09/01/2023]
Abstract
Methylation is an important epigenetic regulation of methylation genes that plays a crucial role in regulating biological processes. While traditional methods for detecting methylation in biological experiments are constantly improving, the development of artificial intelligence has led to the emergence of deep learning and machine learning methods as a new trend. However, traditional machine learning-based methods rely heavily on manual feature extraction, and most deep learning methods for studying methylation extract fewer features due to their simple network structures. To address this, we propose a bottomneck network based on an attention mechanism and use new methods to ensure that the deep network can learn more effective features while minimizing overfitting. This approach enables the model to learn more features from nucleotide sequences and make better predictions of methylation. The model uses three coding methods to encode the original DNA sequence and then applies feature fusion based on attention mechanisms to obtain the best fusion method. Our results demonstrate that MLACNN outperforms previous methods and achieves more satisfactory performance.
Collapse
Affiliation(s)
- JianGuo Bai
- Shandong Jiaotong University, Jinan City, Shandong Province China
| | - Hai Yang
- Shandong Jiaotong University, Jinan City, Shandong Province China
| | - ChangDe Wu
- Shandong Jiaotong University, Jinan City, Shandong Province China
| |
Collapse
|
9
|
Hu W, Guan L, Li M. Prediction of DNA Methylation based on Multi-dimensional feature encoding and double convolutional fully connected convolutional neural network. PLoS Comput Biol 2023; 19:e1011370. [PMID: 37639434 PMCID: PMC10461834 DOI: 10.1371/journal.pcbi.1011370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2023] [Accepted: 07/18/2023] [Indexed: 08/31/2023] Open
Abstract
DNA methylation takes on critical significance to the regulation of gene expression by affecting the stability of DNA and changing the structure of chromosomes. DNA methylation modification sites should be identified, which lays a solid basis for gaining more insights into their biological functions. Existing machine learning-based methods of predicting DNA methylation have not fully exploited the hidden multidimensional information in DNA gene sequences, such that the prediction accuracy of models is significantly limited. Besides, most models have been built in terms of a single methylation type. To address the above-mentioned issues, a deep learning-based method was proposed in this study for DNA methylation site prediction, termed the MEDCNN model. The MEDCNN model is capable of extracting feature information from gene sequences in three dimensions (i.e., positional information, biological information, and chemical information). Moreover, the proposed method employs a convolutional neural network model with double convolutional layers and double fully connected layers while iteratively updating the gradient descent algorithm using the cross-entropy loss function to increase the prediction accuracy of the model. Besides, the MEDCNN model can predict different types of DNA methylation sites. As indicated by the experimental results,the deep learning method based on coding from multiple dimensions outperformed single coding methods, and the MEDCNN model was highly applicable and outperformed existing models in predicting DNA methylation between different species. As revealed by the above-described findings, the MEDCNN model can be effective in predicting DNA methylation sites.
Collapse
Affiliation(s)
- Wenxing Hu
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, China
| | - Lixin Guan
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, China
| | - Mengshan Li
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, China
| |
Collapse
|
10
|
Alotaibi FM, Khan YD. A Framework for Prediction of Oncogenomic Progression Aiding Personalized Treatment of Gastric Cancer. Diagnostics (Basel) 2023; 13:2291. [PMID: 37443684 DOI: 10.3390/diagnostics13132291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2023] [Revised: 06/05/2023] [Accepted: 06/13/2023] [Indexed: 07/15/2023] Open
Abstract
Mutations in genes can alter their DNA patterns, and by recognizing these mutations, many carcinomas can be diagnosed in the progression stages. The human body contains many hidden and enigmatic features that humankind has not yet fully understood. A total of 7539 neoplasm cases were reported from 1 January 2021 to 31 December 2021. Of these, 3156 were seen in males (41.9%) and 4383 (58.1%) in female patients. Several machine learning and deep learning frameworks are already implemented to detect mutations, but these techniques lack generalized datasets and need to be optimized for better results. Deep learning-based neural networks provide the computational power to calculate the complex structures of gastric carcinoma-driven gene mutations. This study proposes deep learning approaches such as long and short-term memory, gated recurrent units and bi-LSTM to help in identifying the progression of gastric carcinoma in an optimized manner. This study includes 61 carcinogenic driver genes whose mutations can cause gastric cancer. The mutation information was downloaded from intOGen.org and normal gene sequences were downloaded from asia.ensembl.org, as explained in the data collection section. The proposed deep learning models are validated using the self-consistency test (SCT), 10-fold cross-validation test (FCVT), and independent set test (IST); the IST prediction metrics of accuracy, sensitivity, specificity, MCC and AUC of LSTM, Bi-LSTM, and GRU are 97.18%, 98.35%, 96.01%, 0.94, 0.98; 99.46%, 98.93%, 100%, 0.989, 1.00; 99.46%, 98.93%, 100%, 0.989 and 1.00, respectively.
Collapse
Affiliation(s)
- Fahad M Alotaibi
- Department of Information System, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, University of Management and Technology, Lahore 54770, Pakistan
| |
Collapse
|
11
|
Shah AA, Alturise F, Alkhalifah T, Faisal A, Khan YD. EDLM: Ensemble Deep Learning Model to Detect Mutation for the Early Detection of Cholangiocarcinoma. Genes (Basel) 2023; 14:1104. [PMID: 37239464 PMCID: PMC10217880 DOI: 10.3390/genes14051104] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2023] [Revised: 05/11/2023] [Accepted: 05/14/2023] [Indexed: 05/28/2023] Open
Abstract
The most common cause of mortality and disability globally right now is cholangiocarcinoma, one of the worst forms of cancer that may affect people. When cholangiocarcinoma develops, the DNA of the bile duct cells is altered. Cholangiocarcinoma claims the lives of about 7000 individuals annually. Women pass away less often than men. Asians have the greatest fatality rate. Following Whites (20%) and Asians (22%), African Americans (45%) saw the greatest increase in cholangiocarcinoma mortality between 2021 and 2022. For instance, 60-70% of cholangiocarcinoma patients have local infiltration or distant metastases, which makes them unable to receive a curative surgical procedure. Across the board, the median survival time is less than a year. Many researchers work hard to detect cholangiocarcinoma, but this is after the appearance of symptoms, which is late detection. If cholangiocarcinoma progression is detected at an earlier stage, then it will help doctors and patients in treatment. Therefore, an ensemble deep learning model (EDLM), which consists of three deep learning algorithms-long short-term model (LSTM), gated recurrent units (GRUs), and bi-directional LSTM (BLSTM)-is developed for the early identification of cholangiocarcinoma. Several tests are presented, such as a 10-fold cross-validation test (10-FCVT), an independent set test (IST), and a self-consistency test (SCT). Several statistical techniques are used to evaluate the proposed model, such as accuracy (Acc), sensitivity (Sn), specificity (Sp), and Matthew's correlation coefficient (MCC). There are 672 mutations in 45 distinct cholangiocarcinoma genes among the 516 human samples included in the proposed study. The IST has the highest Acc at 98%, outperforming all other validation approaches.
Collapse
Affiliation(s)
- Asghar Ali Shah
- Department of Computer Science, Bahria University, Islamabad 04408, Pakistan;
| | - Fahad Alturise
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 51921, Qassim, Saudi Arabia
| | - Tamim Alkhalifah
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 51921, Qassim, Saudi Arabia
| | - Amna Faisal
- Department of Computer Science, Bahria University, Lahore 54782, Pakistan;
| | - Yaser Daanial Khan
- Department of Computer Science, University of Management and Technology, Lahore 54782, Pakistan;
| |
Collapse
|
12
|
Nguyen-Vo TH, Trinh QH, Nguyen L, Nguyen-Hoang PU, Rahardja S, Nguyen BP. i4mC-GRU: Identifying DNA N 4-Methylcytosine sites in mouse genomes using bidirectional gated recurrent unit and sequence-embedded features. Comput Struct Biotechnol J 2023; 21:3045-3053. [PMID: 37273848 PMCID: PMC10238585 DOI: 10.1016/j.csbj.2023.05.014] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2022] [Revised: 05/12/2023] [Accepted: 05/12/2023] [Indexed: 06/06/2023] Open
Abstract
N4-methylcytosine (4mC) is one of the most common DNA methylation modifications found in both prokaryotic and eukaryotic genomes. Since the 4mC has various essential biological roles, determining its location helps reveal unexplored physiological and pathological pathways. In this study, we propose an effective computational method called i4mC-GRU using a gated recurrent unit and duplet sequence-embedded features to predict potential 4mC sites in mouse (Mus musculus) genomes. To fairly assess the performance of the model, we compared our method with several state-of-the-art methods using two different benchmark datasets. Our results showed that i4mC-GRU achieved area under the receiver operating characteristic curve values of 0.97 and 0.89 and area under the precision-recall curve values of 0.98 and 0.90 on the first and second benchmark datasets, respectively. Briefly, our method outperformed existing methods in predicting 4mC sites in mouse genomes. Also, we deployed i4mC-GRU as an online web server, supporting users in genomics studies.
Collapse
Affiliation(s)
- Thanh-Hoang Nguyen-Vo
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
- School of Innovation, Design and Technology, Wellington Institute of Technology, Wellington 5012, New Zealand
| | - Quang H. Trinh
- School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi 100000, Vietnam
| | - Loc Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
| | - Phuong-Uyen Nguyen-Hoang
- Computational Biology Center, International University - VNU HCMC, Ho Chi Minh City 700000, Vietnam
| | - Susanto Rahardja
- School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
- Infocomm Technology Cluster, Singapore Institute of Technology, Singapore 138683, Singapore
| | - Binh P. Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
| |
Collapse
|
13
|
Yu X, Ren J, Cui Y, Zeng R, Long H, Ma C. DRSN4mCPred: accurately predicting sites of DNA N4-methylcytosine using deep residual shrinkage network for diagnosis and treatment of gastrointestinal cancer in the precision medicine era. Front Med (Lausanne) 2023; 10:1187430. [PMID: 37215722 PMCID: PMC10192687 DOI: 10.3389/fmed.2023.1187430] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2023] [Accepted: 04/05/2023] [Indexed: 05/24/2023] Open
Abstract
Introduction The DNA N4-methylcytosine (4mC) site levels of those suffering from digestive system cancers were higher, and the pathogenesis of digestive system cancers may also be related to the changes in DNA 4mC levels. Identifying DNA 4mC sites is a very important step in studying the analysis of biological function and cancer prediction. Extracting accurate features from DNA sequences is the key to establishing a prediction model of effective DNA 4mC sites. This study sought to develop a new predictive model, DRSN4mCPred, which aimed to improve the performance of the predicting DNA 4mC sites. Methods The model adopted multi-scale channel attention to extract features and used attention feature fusion (AFF) to fuse features. In order to capture features information more accurately and effectively, this model utilized Deep Residual Shrinkage Network with Channel-Wise thresholds (DRSN-CW) to eliminate noise-related features and achieve a more precise feature representation, thereby, distinguishing the sites in DNA with 4mC and non-4mC. Additionally, the predictive model incorporated an inverted residual block, a Multi-scale Channel Attention Module (MS-CAM), a Bi-directional Long Short Term Memory Network (Bi-LSTM), AFF, and DRSN-CW. Results and Discussion The results indicated the predictive model DRSN4mCPred had extremely good performance in predicting the DNA 4mC sites across different species. This paper will potentially provide support for the diagnosis and treatment of gastrointestinal cancer based on artificial intelligence in the precise medical era.
Collapse
Affiliation(s)
- Xia Yu
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Jia Ren
- Industrial Design School, Shandong University of ART and Design, Jinan, Shandong, China
| | - Yani Cui
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| | - Rao Zeng
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Haixia Long
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Cuihua Ma
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| |
Collapse
|
14
|
Fan Y, Sun G, Pan X. ELMo4m6A: A Contextual Language Embedding-Based Predictor for Detecting RNA N6-Methyladenosine Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:944-954. [PMID: 35536814 DOI: 10.1109/tcbb.2022.3173323] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
N6-methyladenosine (m6A) is a universal post-transcriptional modification of RNAs, and it is widely involved in various biological processes. Identifying m6A modification sites accurately is indispensable to further investigate m6A-mediated biological functions. How to better represent RNA sequences is crucial for building effective computational methods for detecting m6A modification sites. However, traditional encoding methods require complex biological prior knowledge and are time-consuming. Furthermore, most of the existing m6A sites prediction methods are limited to single species, and few methods are able to predict m6A sites across different species and tissues. Thus, it is necessary to design a more efficient computational method to predict m6A sites across multiple species and tissues. In this paper, we proposed ELMo4m6A, a contextual language embedding-based method for predicting m6A sites from RNA sequences without any prior knowledge. ELMo4m6A first learns embeddings of RNA sequences using a language model ELMo, then uses a hybrid convolutional neural network (CNN) and long short-term memory (LSTM) to identify m6A sites. The results of 5-fold cross-validation and independent testing demonstrate that ELMo4m6A is superior to state-of-the-art methods. Moreover, we applied integrated gradients to find potential sequence patterns contributing to m6A sites.
Collapse
|
15
|
Abadi SAR, Mohammadi A, Koohi S. An automated ultra-fast, memory-efficient, and accurate method for viral genome classification. J Biomed Inform 2023; 139:104316. [PMID: 36781036 DOI: 10.1016/j.jbi.2023.104316] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2022] [Revised: 01/30/2023] [Accepted: 02/08/2023] [Indexed: 02/13/2023]
Abstract
The classification of different organisms into subtypes is one of the most important tools of organism studies, and among them, the classification of viruses itself has been the focus of many studies due to their use in virology and epidemiology. Many methods have been proposed to classify viruses, some of which are designed for a specific family of organisms and some of which are more general. But still, especially for certain categories such as Influenza and HIV, classification is facing performance challenges as well as processing and memory bottlenecks. In this way, we designed an automated classifier, called PC-mer, that is based on k-mer and physicochemical characteristics of nucleotides, which reduces the number of features about 2 k times compared to the alternative methods based on k-mer, and compared to integer and one-hot encoding methods, it is possible to keep the number of features constant despite the growth of the sequence length. In this way, it also increases the training speed by an average of 17.93 times. This improvement in processing complexity is provided while PC-mer can also improve the classifying performance for a variety of virus families.
Collapse
Affiliation(s)
| | - Amirhossein Mohammadi
- No 717, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | - Somayyeh Koohi
- No 717, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.
| |
Collapse
|
16
|
Cai J, Wang T, Deng X, Tang L, Liu L. GM-lncLoc: LncRNAs subcellular localization prediction based on graph neural network with meta-learning. BMC Genomics 2023; 24:52. [PMID: 36709266 PMCID: PMC9883864 DOI: 10.1186/s12864-022-09034-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2022] [Accepted: 11/21/2022] [Indexed: 01/29/2023] Open
Abstract
In recent years, a large number of studies have shown that the subcellular localization of long non-coding RNAs (lncRNAs) can bring crucial information to the recognition of lncRNAs function. Therefore, it is of great significance to establish a computational method to accurately predict the subcellular localization of lncRNA. Previous prediction models are based on low-level sequences information and are troubled by the few samples problem. In this study, we propose a new prediction model, GM-lncLoc, which is based on the initial information extracted from the lncRNA sequence, and also combines the graph structure information to extract high level features of lncRNA. In addition, the training mode of meta-learning is introduced to obtain meta-parameters by training a series of tasks. With the meta-parameters, the final parameters of other similar tasks can be learned quickly, so as to solve the problem of few samples in lncRNA subcellular localization. Compared with the previous methods, GM-lncLoc achieved the best results with an accuracy of 93.4 and 94.2% in the benchmark datasets of 5 and 4 subcellular compartments, respectively. Furthermore, the prediction performance of GM-lncLoc was also better on the independent dataset. It shows the effectiveness and great potential of our proposed method for lncRNA subcellular localization prediction. The datasets and source code are freely available at https://github.com/JunzheCai/GM-lncLoc .
Collapse
Affiliation(s)
- Junzhe Cai
- grid.410739.80000 0001 0723 6903School of Information, Yunnan Normal University, Kunming, Yunnan China
| | - Ting Wang
- grid.410739.80000 0001 0723 6903School of Information, Yunnan Normal University, Kunming, Yunnan China
| | - Xi Deng
- grid.410739.80000 0001 0723 6903School of Information, Yunnan Normal University, Kunming, Yunnan China
| | - Lin Tang
- grid.410739.80000 0001 0723 6903Key Laboratory of Educational Information for Nationalities Ministry of Education, Yunnan Normal University, Kunming, Yunnan China
| | - Lin Liu
- grid.410739.80000 0001 0723 6903School of Information, Yunnan Normal University, Kunming, Yunnan China
| |
Collapse
|
17
|
Han K, Wang J, Wang Y, Zhang L, Yu M, Xie F, Zheng D, Xu Y, Ding Y, Wan J. A review of methods for predicting DNA N6-methyladenine sites. Brief Bioinform 2023; 24:6887111. [PMID: 36502371 DOI: 10.1093/bib/bbac514] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2022] [Revised: 10/07/2022] [Accepted: 10/27/2022] [Indexed: 12/14/2022] Open
Abstract
Deoxyribonucleic acid(DNA) N6-methyladenine plays a vital role in various biological processes, and the accurate identification of its site can provide a more comprehensive understanding of its biological effects. There are several methods for 6mA site prediction. With the continuous development of technology, traditional techniques with the high costs and low efficiencies are gradually being replaced by computer methods. Computer methods that are widely used can be divided into two categories: traditional machine learning and deep learning methods. We first list some existing experimental methods for predicting the 6mA site, then analyze the general process from sequence input to results in computer methods and review existing model architectures. Finally, the results were summarized and compared to facilitate subsequent researchers in choosing the most suitable method for their work.
Collapse
Affiliation(s)
- Ke Han
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China.,College of Pharmacy, Harbin University of Commerce, Harbin, 150076, China
| | - Jianchun Wang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yu Wang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Lei Zhang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Mengyao Yu
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Fang Xie
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Dequan Zheng
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yaoqun Xu
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
| | - Jie Wan
- Laboratory for Space Environment and Physical Sciences, Harbin Institute of Technology, Harbin, 150001, China
| |
Collapse
|
18
|
Nabeel Asim M, Ali Ibrahim M, Fazeel A, Dengel A, Ahmed S. DNA-MP: a generalized DNA modifications predictor for multiple species based on powerful sequence encoding method. Brief Bioinform 2023; 24:6931721. [PMID: 36528802 DOI: 10.1093/bib/bbac546] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2022] [Revised: 11/06/2022] [Accepted: 11/12/2022] [Indexed: 12/23/2022] Open
Abstract
Accurate prediction of deoxyribonucleic acid (DNA) modifications is essential to explore and discern the process of cell differentiation, gene expression and epigenetic regulation. Several computational approaches have been proposed for particular type-specific DNA modification prediction. Two recent generalized computational predictors are capable of detecting three different types of DNA modifications; however, type-specific and generalized modifications predictors produce limited performance across multiple species mainly due to the use of ineffective sequence encoding methods. The paper in hand presents a generalized computational approach "DNA-MP" that is competent to more precisely predict three different DNA modifications across multiple species. Proposed DNA-MP approach makes use of a powerful encoding method "position specific nucleotides occurrence based 117 on modification and non-modification class densities normalized difference" (POCD-ND) to generate the statistical representations of DNA sequences and a deep forest classifier for modifications prediction. POCD-ND encoder generates statistical representations by extracting position specific distributional information of nucleotides in the DNA sequences. We perform a comprehensive intrinsic and extrinsic evaluation of the proposed encoder and compare its performance with 32 most widely used encoding methods on $17$ benchmark DNA modifications prediction datasets of $12$ different species using $10$ different machine learning classifiers. Overall, with all classifiers, the proposed POCD-ND encoder outperforms existing $32$ different encoders. Furthermore, combinedly over 5-fold cross validation benchmark datasets and independent test sets, proposed DNA-MP predictor outperforms state-of-the-art type-specific and generalized modifications predictors by an average accuracy of 7% across 4mc datasets, 1.35% across 5hmc datasets and 10% for 6ma datasets. To facilitate the scientific community, the DNA-MP web application is available at https://sds_genetic_analysis.opendfki.de/DNA_Modifications/.
Collapse
Affiliation(s)
- Muhammad Nabeel Asim
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany.,German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| | - Muhammad Ali Ibrahim
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany.,German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| | - Ahtisham Fazeel
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany.,German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| | - Andreas Dengel
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany.,German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| |
Collapse
|
19
|
Shah AA, Alturise F, Alkhalifah T, Khan YD. Deep Learning Approaches for Detection of Breast Adenocarcinoma Causing Carcinogenic Mutations. Int J Mol Sci 2022; 23:ijms231911539. [PMID: 36232840 PMCID: PMC9570286 DOI: 10.3390/ijms231911539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2022] [Revised: 09/19/2022] [Accepted: 09/23/2022] [Indexed: 11/16/2022] Open
Abstract
Genes are composed of DNA and each gene has a specific sequence. Recombination or replication within the gene base ends in a permanent change in the nucleotide collection in a DNA called mutation and some mutations can lead to cancer. Breast adenocarcinoma starts in secretary cells. Breast adenocarcinoma is the most common of all cancers that occur in women. According to a survey within the United States of America, there are more than 282,000 breast adenocarcinoma patients registered each 12 months, and most of them are women. Recognition of cancer in its early stages saves many lives. A proposed framework is developed for the early detection of breast adenocarcinoma using an ensemble learning technique with multiple deep learning algorithms, specifically: Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and Bi-directional LSTM. There are 99 types of driver genes involved in breast adenocarcinoma. This study uses a dataset of 4127 samples including men and women taken from more than 12 cohorts of cancer detection institutes. The dataset encompasses a total of 6170 mutations that occur in 99 genes. On these gene sequences, different algorithms are applied for feature extraction. Three types of testing techniques including independent set testing, self-consistency testing, and a 10-fold cross-validation test is applied to validate and test the learning approaches. Subsequently, multiple deep learning approaches such as LSTM, GRU, and bi-directional LSTM algorithms are applied. Several evaluation metrics are enumerated for the validation of results including accuracy, sensitivity, specificity, Mathew’s correlation coefficient, area under the curve, training loss, precision, recall, F1 score, and Cohen’s kappa while the values obtained are 99.57, 99.50, 99.63, 0.99, 1.0, 0.2027, 99.57, 99.57, 99.57, and 99.14 respectively.
Collapse
Affiliation(s)
- Asghar Ali Shah
- Department of Computer Science, University of Management and Technology, Lahore 54770, Pakistan
| | - Fahad Alturise
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 58892, Qassim, Saudi Arabia
| | - Tamim Alkhalifah
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 58892, Qassim, Saudi Arabia
- Correspondence:
| | - Yaser Daanial Khan
- Department of Computer Science, University of Management and Technology, Lahore 54770, Pakistan
| |
Collapse
|
20
|
Chen M, Zhang X, Ju Y, Liu Q, Ding Y. iPseU-TWSVM: Identification of RNA pseudouridine sites based on TWSVM. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:13829-13850. [PMID: 36654069 DOI: 10.3934/mbe.2022644] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/17/2023]
Abstract
Biological sequence analysis is an important basic research work in the field of bioinformatics. With the explosive growth of data, machine learning methods play an increasingly important role in biological sequence analysis. By constructing a classifier for prediction, the input sequence feature vector is predicted and evaluated, and the knowledge of gene structure, function and evolution is obtained from a large amount of sequence information, which lays a foundation for researchers to carry out in-depth research. At present, many machine learning methods have been applied to biological sequence analysis such as RNA gene recognition and protein secondary structure prediction. As a biological sequence, RNA plays an important biological role in the encoding, decoding, regulation and expression of genes. The analysis of RNA data is currently carried out from the aspects of structure and function, including secondary structure prediction, non-coding RNA identification and functional site prediction. Pseudouridine (У) is the most widespread and rich RNA modification and has been discovered in a variety of RNAs. It is highly essential for the study of related functional mechanisms and disease diagnosis to accurately identify У sites in RNA sequences. At present, several computational approaches have been suggested as an alternative to experimental methods to detect У sites, but there is still potential for improvement in their performance. In this study, we present a model based on twin support vector machine (TWSVM) for У site identification. The model combines a variety of feature representation techniques and uses the max-relevance and min-redundancy methods to obtain the optimum feature subset for training. The independent testing accuracy is improved by 3.4% in comparison to current advanced У site predictors. The outcomes demonstrate that our model has better generalization performance and improves the accuracy of У site identification. iPseU-TWSVM can be a helpful tool to identify У sites.
Collapse
Affiliation(s)
- Mingshuai Chen
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
| | - Xin Zhang
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China
| | - Qing Liu
- Department of Anesthesiology, Hospital (T.C.M) Affiliated to Southwest Medical University, Luzhou, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
| |
Collapse
|
21
|
Butt AH, Alkhalifah T, Alturise F, Khan YD. A machine learning technique for identifying DNA enhancer regions utilizing CIS-regulatory element patterns. Sci Rep 2022; 12:15183. [PMID: 36071071 PMCID: PMC9452539 DOI: 10.1038/s41598-022-19099-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2022] [Accepted: 08/24/2022] [Indexed: 11/26/2022] Open
Abstract
Enhancers regulate gene expression, by playing a crucial role in the synthesis of RNAs and proteins. They do not directly encode proteins or RNA molecules. In order to control gene expression, it is important to predict enhancers and their potency. Given their distance from the target gene, lack of common motifs, and tissue/cell specificity, enhancer regions are thought to be difficult to predict in DNA sequences. Recently, a number of bioinformatics tools were created to distinguish enhancers from other regulatory components and to pinpoint their advantages. However, because the quality of its prediction method needs to be improved, its practical application value must also be improved. Based on nucleotide composition and statistical moment-based features, the current study suggests a novel method for identifying enhancers and non-enhancers and evaluating their strength. The proposed study outperformed state-of-the-art techniques using fivefold and tenfold cross-validation in terms of accuracy. The accuracy from the current study results in 86.5% and 72.3% in enhancer site and its strength prediction respectively. The results of the suggested methodology point to the potential for more efficient and successful outcomes when statistical moment-based features are used. The current study's source code is available to the research community at https://github.com/csbioinfopk/enpred.
Collapse
Affiliation(s)
- Ahmad Hassan Butt
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| | - Tamim Alkhalifah
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass, Saudi Arabia.
| | - Fahad Alturise
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| |
Collapse
|
22
|
Rehman MU, Tayara H, Zou Q, Chong KT. i6mA-Caps: a CapsuleNet-based framework for identifying DNA N6-methyladenine sites. Bioinformatics 2022; 38:3885-3891. [PMID: 35771648 DOI: 10.1093/bioinformatics/btac434] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2022] [Revised: 05/19/2022] [Accepted: 06/28/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION DNA N6-methyladenine (6mA) has been demonstrated to have an essential function in epigenetic modification in eukaryotic species in recent research. 6mA has been linked to various biological processes. It's critical to create a new algorithm that can rapidly and reliably detect 6mA sites in genomes to investigate their biological roles. The identification of 6mA marks in the genome is the first and most important step in understanding the underlying molecular processes, as well as their regulatory functions. RESULTS In this article, we proposed a novel computational tool called i6mA-Caps which CapsuleNet based a framework for identifying the DNA N6-methyladenine sites. The proposed framework uses a single encoding scheme for numerical representation of the DNA sequence. The numerical data is then used by the set of convolution layers to extract low-level features. These features are then used by the capsule network to extract intermediate-level and later high-level features to classify the 6mA sites. The proposed network is evaluated on three datasets belonging to three genomes which are Rosaceae, Rice and Arabidopsis thaliana. Proposed method has attained an accuracy of 96.71%, 94% and 86.83% for independent Rosaceae dataset, Rice dataset and A.thaliana dataset respectively. The proposed framework has exhibited improved results when compared with the existing top-of-the-line methods. AVAILABILITY AND IMPLEMENTATION A user-friendly web-server is made available for the biological experts which can be accessed at: http://nsclbio.jbnu.ac.kr/tools/i6mA-Caps/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Mobeen Ur Rehman
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea.,Advances Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
| |
Collapse
|
23
|
Asim MN, Ibrahim MA, Imran Malik M, Dengel A, Ahmed S. Circ-LocNet: A Computational Framework for Circular RNA Sub-Cellular Localization Prediction. Int J Mol Sci 2022; 23:ijms23158221. [PMID: 35897818 PMCID: PMC9329987 DOI: 10.3390/ijms23158221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2022] [Revised: 07/15/2022] [Accepted: 07/20/2022] [Indexed: 02/04/2023] Open
Abstract
Circular ribonucleic acids (circRNAs) are novel non-coding RNAs that emanate from alternative splicing of precursor mRNA in reversed order across exons. Despite the abundant presence of circRNAs in human genes and their involvement in diverse physiological processes, the functionality of most circRNAs remains a mystery. Like other non-coding RNAs, sub-cellular localization knowledge of circRNAs has the aptitude to demystify the influence of circRNAs on protein synthesis, degradation, destination, their association with different diseases, and potential for drug development. To date, wet experimental approaches are being used to detect sub-cellular locations of circular RNAs. These approaches help to elucidate the role of circRNAs as protein scaffolds, RNA-binding protein (RBP) sponges, micro-RNA (miRNA) sponges, parental gene expression modifiers, alternative splicing regulators, and transcription regulators. To complement wet-lab experiments, considering the progress made by machine learning approaches for the determination of sub-cellular localization of other non-coding RNAs, the paper in hand develops a computational framework, Circ-LocNet, to precisely detect circRNA sub-cellular localization. Circ-LocNet performs comprehensive extrinsic evaluation of 7 residue frequency-based, residue order and frequency-based, and physio-chemical property-based sequence descriptors using the five most widely used machine learning classifiers. Further, it explores the performance impact of K-order sequence descriptor fusion where it ensembles similar as well dissimilar genres of statistical representation learning approaches to reap the combined benefits. Considering the diversity of statistical representation learning schemes, it assesses the performance of second-order, third-order, and going all the way up to seventh-order sequence descriptor fusion. A comprehensive empirical evaluation of Circ-LocNet over a newly developed benchmark dataset using different settings reveals that standalone residue frequency-based sequence descriptors and tree-based classifiers are more suitable to predict sub-cellular localization of circular RNAs. Further, K-order heterogeneous sequence descriptors fusion in combination with tree-based classifiers most accurately predict sub-cellular localization of circular RNAs. We anticipate this study will act as a rich baseline and push the development of robust computational methodologies for the accurate sub-cellular localization determination of novel circRNAs.
Collapse
Affiliation(s)
- Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
- Correspondence:
| | - Muhammad Ali Ibrahim
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
| | - Muhammad Imran Malik
- School of Computer Science & Electrical Engineering, National University of Sciences and Technology, Islamabad 44000, Pakistan;
| | - Andreas Dengel
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- DeepReader GmbH, Trippstadter Str. 122, 67663 Kaiserslautern, Germany
| |
Collapse
|
24
|
An Effective Deep Learning-Based Architecture for Prediction of N7-Methylguanosine Sites in Health Systems. ELECTRONICS 2022. [DOI: 10.3390/electronics11121917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
N7-methylguanosine (m7G) is one of the most important epigenetic modifications found in rRNA, mRNA, and tRNA, and performs a promising role in gene expression regulation. Owing to its significance, well-equipped traditional laboratory-based techniques have been performed for the identification of N7-methylguanosine (m7G). Consequently, these approaches were found to be time-consuming and cost-ineffective. To move on from these traditional approaches to predict N7-methylguanosine sites with high precision, the concept of artificial intelligence has been adopted. In this study, an intelligent computational model called N7-methylguanosine-Long short-term memory (m7G-LSTM) is introduced for the prediction of N7-methylguanosine sites. One-hot encoding and word2vec feature schemes are used to express the biological sequences while the LSTM and CNN algorithms have been employed for classification. The proposed “m7G-LSTM” model obtained an accuracy value of 95.95%, a specificity value of 95.94%, a sensitivity value of 95.97%, and Matthew’s correlation coefficient (MCC) value of 0.919. The proposed predictive m7G-LSTM model has significantly achieved better outcomes than previous models in terms of all evaluation parameters. The proposed m7G-LSTM computational system aims to support the drug industry and help researchers in the fields of bioinformatics to enhance innovation for the prediction of the behavior of N7-methylguanosine sites.
Collapse
|
25
|
Yao Y, Zhang S, Xue T. Integrating LASSO Feature Selection and Soft Voting Classifier to Identify Origins of Replication Sites. Curr Genomics 2022; 23:83-93. [PMID: 36778978 PMCID: PMC9878833 DOI: 10.2174/1389202923666220214122506] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2021] [Revised: 12/11/2021] [Accepted: 01/18/2022] [Indexed: 11/22/2022] Open
Abstract
Background: DNA replication plays an indispensable role in the transmission of genetic information. It is considered to be the basis of biological inheritance and the most fundamental process in all biological life. Considering that DNA replication initiates with a special location, namely the origin of replication, a better and accurate prediction of the origins of replication sites (ORIs) is essential to gain insight into the relationship with gene expression. Objective: In this study, we have developed an efficient predictor called iORI-LAVT for ORIs identification. Methods: This work focuses on extracting feature information from three aspects, including mono-nucleotide encoding, k-mer and ring-function-hydrogen-chemical properties. Subsequently, least absolute shrinkage and selection operator (LASSO) as a feature selection is applied to select the optimal features. Comparing the different combined soft voting classifiers results, the soft voting classifier based on GaussianNB and Logistic Regression is employed as the final classifier. Results: Based on 10-fold cross-validation test, the prediction accuracies of two benchmark datasets are 90.39% and 95.96%, respectively. As for the independent dataset, our method achieves high accuracy of 91.3%. Conclusion: Compared with previous predictors, iORI-LAVT outperforms the existing methods. It is believed that iORI-LAVT predictor is a promising alternative for further research on identifying ORIs.
Collapse
Affiliation(s)
- Yingying Yao
- School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China,Address correspondence to this author at the School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China; Tel/Fax: +86-29- 88202860; E-mail:
| | - Tian Xue
- School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China
| |
Collapse
|
26
|
Hasan MM, Tsukiyama S, Cho JY, Kurata H, Alam MA, Liu X, Manavalan B, Deng HW. Deepm5C: A deep learning-based hybrid framework for identifying human RNA N5-methylcytosine sites using a stacking strategy. Mol Ther 2022; 30:2856-2867. [PMID: 35526094 PMCID: PMC9372321 DOI: 10.1016/j.ymthe.2022.05.001] [Citation(s) in RCA: 46] [Impact Index Per Article: 15.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2021] [Revised: 04/25/2022] [Accepted: 05/03/2022] [Indexed: 11/30/2022] Open
Abstract
As one of the most prevalent post-transcriptional epigenetic modifications, N5-methylcytosine (m5C), plays an essential role in various cellular processes and disease pathogenesis. Therefore, it is important accurately identify m5C modifications in order to gain a deeper understanding of cellular processes and other possible functional mechanisms. Although a few computational methods have been proposed, their respective models have been developed using small training datasets. Hence, their practical application is quite limited in genome-wide detection. To overcome the existing limitations, we propose Deepm5C, a bioinformatics method to identify RNA m5C sites in the throughout human genome. To develop Deepm5C, we constructed a novel benchmarking dataset and investigated a mixture of three conventional feature encoding algorithms and a feature derived from word embedding approaches. Afterwards, four variants of deep learning classifiers and four commonly used conventional classifiers were employed and trained with the four encodings, ultimately obtaining 32 baseline models. A stacking strategy is effectively utilized by integrating the predicted output of the optimal baseline models and trained with a 1-D convolutional neural network. As a result, the Deepm5C predictor achieved excellent performance during cross-validation with a Matthews correlation coefficient and accuracy of 0.697 and 0.855, respectively. The corresponding metrics during the independent test were 0.691 and 0.852, respectively. Overall, Deepm5C achieved a more accurate and stable performance than the baseline models and significantly outperformed the existing predictors, demonstrating the effectiveness of our proposed hybrid framework. Furthermore, Deepm5C is expected to assist community-wide efforts in identifying putative m5Cs and formulate the novel testable biological hypothesis.
Collapse
Affiliation(s)
- Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA.
| | - Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Jae Youl Cho
- Molecular Immunology Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Korea
| | - Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Md Ashad Alam
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA
| | - Xiaowen Liu
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA
| | - Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Korea.
| | - Hong-Wen Deng
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA.
| |
Collapse
|
27
|
Naseer S, Hussain W, Khan YD, Rasool N. iPhosS(Deep)-PseAAC: Identification of Phosphoserine Sites in Proteins Using Deep Learning on General Pseudo Amino Acid Compositions. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1703-1714. [PMID: 33242308 DOI: 10.1109/tcbb.2020.3040747] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Among all the PTMs, the protein phosphorylation is pivotal for various pathological and physiological processes. About 30 percent of eukaryotic proteins undergo the phosphorylation modification, leading to various changes in conformation, function, stability, localization, and so forth. In eukaryotic proteins, phosphorylation occurs on serine (S), Threonine (T) and Tyrosine (Y) residues. Among these all, serine phosphorylation has its own importance as it is associated with various importance biological processes, including energy metabolism, signal transduction pathways, cell cycling, and apoptosis. Thus, its identification is important, however, the in vitro, ex vivo and in vivo identification can be laborious, time-taking and costly. There is a dire need of an efficient and accurate computational model to help researchers and biologists identifying these sites, in an easy manner. Herein, we propose a novel predictor for identification of Phosphoserine sites (PhosS) in proteins, by integrating the Chou's Pseudo Amino Acid Composition (PseAAC) with deep features. We used well-known DNNs for both the tasks of learning a feature representation of peptide sequences and performing classifications. Among different DNNs, the best score is shown by Covolutional Neural Network based model which renders CNN based prediction model the best for Phosphoserine prediction. Based on these results, it is concluded that the proposed model can help to identify PhosS sites in a very efficient and accurate manner which can help scientists understand the mechanism of this modification in proteins.
Collapse
|
28
|
Yu B, Zhang Y, Wang X, Gao H, Sun J, Gao X. Identification of DNA modification sites based on elastic net and bidirectional gated recurrent unit with convolutional neural network. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2022.103566] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
|
29
|
Tang X, Zheng P, Li X, Wu H, Wei DQ, Liu Y, Huang G. Deep6mAPred: A CNN and Bi-LSTM-based deep learning method for predicting DNA N6-methyladenosine sites across plant species. Methods 2022; 204:142-150. [PMID: 35477057 DOI: 10.1016/j.ymeth.2022.04.011] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2022] [Revised: 04/16/2022] [Accepted: 04/20/2022] [Indexed: 12/11/2022] Open
Abstract
DNA N6-methyladenine (6mA) is a key DNA modification, which plays versatile roles in the cellular processes, including regulation of gene expression, DNA repair, and DNA replication. DNA 6mA is closely associated with many diseases in the mammals and with growth as well as development of plants. Precisely detecting DNA 6mA sites is of great importance to exploration of 6mA functions. Although many computational methods have been presented for DNA 6mA prediction, there is still a wide gap in the practical application. We presented a convolution neural network (CNN) and bi-directional long-short term memory (Bi-LSTM)-based deep learning method (Deep6mAPred) for predicting DNA 6mA sites across plant species. The Deep6mAPred stacked the CNNs and the Bi-LSTMs in a paralleling manner instead of a series-connection manner. The Deep6mAPred also employed the attention mechanism for improving the representations of sequences. The Deep6mAPred reached an accuracy of 0.9556 over the independent rice dataset, far outperforming the state-of-the-art methods. The tests across plant species showed that the Deep6mAPred is of a remarkable advantage over the state of the art methods. We developed a user-friendly web application for DNA 6mA prediction, which is freely available at http://106.13.196.152:7001/ for all the scientific researchers. The Deep6mAPred would enrich tools to predict DNA 6mA sites and speed up the exploration of DNA modification.
Collapse
Affiliation(s)
- Xingyu Tang
- School of Electrical Engineering, Shaoyang University, Shaoyang, Hunan 422000, China
| | - Peijie Zheng
- School of Electrical Engineering, Shaoyang University, Shaoyang, Hunan 422000, China
| | - Xueyong Li
- School of Electrical Engineering, Shaoyang University, Shaoyang, Hunan 422000, China
| | - Hongyan Wu
- The Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
| | - Dong-Qing Wei
- The Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China.
| | - Yuewu Liu
- College of Information and Intelligence, Hunan Agricultural University, Changsha, Hunan 410081, China
| | - Guohua Huang
- School of Electrical Engineering, Shaoyang University, Shaoyang, Hunan 422000, China.
| |
Collapse
|
30
|
He Z, Xu J, Shi H, Wu S. m5CRegpred: Epitranscriptome Target Prediction of 5-Methylcytosine (m5C) Regulators Based on Sequencing Features. Genes (Basel) 2022; 13:genes13040677. [PMID: 35456483 PMCID: PMC9025882 DOI: 10.3390/genes13040677] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2022] [Revised: 04/02/2022] [Accepted: 04/05/2022] [Indexed: 02/04/2023] Open
Abstract
5-methylcytosine (m5C) is a common post-transcriptional modification observed in a variety of RNAs. m5C has been demonstrated to be important in a variety of biological processes, including RNA structural stability and metabolism. Driven by the importance of m5C modification, many projects focused on the m5C sites prediction were reported before. To better understand the upstream and downstream regulation of m5C, we present a bioinformatics framework, m5CRegpred, to predict the substrate of m5C writer NSUN2 and m5C readers YBX1 and ALYREF for the first time. After features comparison, window lengths selection and algorism comparison on the mature mRNA model, our model achieved AUROC scores 0.869, 0.724 and 0.889 for NSUN2, YBX1 and ALYREF, respectively in an independent test. Our work suggests the substrate of m5C regulators can be distinguished and may help the research of m5C regulators in a special condition, such as substrates prediction of hyper- or hypo-expressed m5C regulators in human disease.
Collapse
Affiliation(s)
- Zhizhou He
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350004, China; (Z.H.); (J.X.)
- Department of Molecular, Cell, and Developmental Biology, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
| | - Jing Xu
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350004, China; (Z.H.); (J.X.)
| | - Haoran Shi
- Research Center for BioSystems, Land Use, and Nutrition (IFZ), Institute of Applied Microbiology, Justus-Liebig-University Giessen, Heinrich-Buff-Ring 26-32, 35392 Giessen, Germany
- Correspondence: (H.S.); (S.W.)
| | - Shuxiang Wu
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350004, China; (Z.H.); (J.X.)
- Fujian Key Laboratory of Tumor Microbiology, Department of Medical Microbiology, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350004, China
- Correspondence: (H.S.); (S.W.)
| |
Collapse
|
31
|
Zhang H, Zou Q, Ju Y, Song C, Chen D. Distance-based support vector machine to predict DNA N6-methyladenine modification. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220404145517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
DNA N6-methyladenine plays an important role in the restriction-modification system to isolate invasion from adventive DNA. The shortcomings of the high time-consumption and high costs of experimental methods have been exposed, and some computational methods have emerged. The support vector machine theory has received extensive attention in the bioinformatics field due to its solid theoretical foundation and many good characteristics.
Objective:
General machine learning methods include an important step of extracting features. The research has omitted this step and replaced with easy-to-obtain sequence distances matrix to obtain better results
Method:
First sequence alignment technology was used to achieve the similarity matrix. Then a novel transformation turned the similarity matrix into a distance matrix. Next, the similarity-distance matrix is made positive semi-definite so that it can be used in the kernel matrix. Finally, the LIBSVM software was applied to solve the support vector machine.
Results:
The five-fold cross-validation of this model on rice and mouse data has achieved excellent accuracy rates of 92.04% and 96.51%, respectively. This shows that the DB-SVM method has obvious advantages compared with traditional machine learning methods. Meanwhile this model achieved 0.943,0.982 and 0.818 accuracy,0.944, 0.982, and 0.838 Matthews correlation coefficient and 0.942, 0.982 and 0.840 F1 scores for the rice, M. musculus and cross-species genome datasets, respectively.
Conclusion:
These outcomes show that this model outperforms the iIM-CNN and csDMA in the prediction of DNA 6mA modification, which are the lastest research on DNA 6mA.
Collapse
Affiliation(s)
- Haoyu Zhang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610051, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610051, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Chenggang Song
- Beidahuang Industry Group General Hospital, Harbin 150001, China
| | - Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou 324000, China
| |
Collapse
|
32
|
Zhang Y, Huang D, Wei Z, Chen K. Primary sequence-assisted prediction of m6A RNA methylation sites from Oxford nanopore direct RNA sequencing data. Methods 2022; 203:62-69. [PMID: 35429629 DOI: 10.1016/j.ymeth.2022.04.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2021] [Revised: 03/27/2022] [Accepted: 04/11/2022] [Indexed: 11/28/2022] Open
|
33
|
Qiao H, Zhang S, Xue T, Wang J, Wang B. iPro-GAN: A novel model based on generative adversarial learning for identifying promoters and their strength. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2022; 215:106625. [PMID: 35038653 DOI: 10.1016/j.cmpb.2022.106625] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/11/2021] [Revised: 12/13/2021] [Accepted: 01/06/2022] [Indexed: 06/14/2023]
Abstract
BACKGROUND AND OBJECTIVE Promoter is a component of the gene, which can specifically bind with RNA polymerase and determine where transcription starts, and also determine the transcription efficiency of the gene. Promoters can be divided into strong promoters and weak promoters because their structures and the interaction time interval are quite different. The functional variation of the promoter can lead to a variety of diseases. Therefore, identifying promoters and their strength is necessary and has important biological significance. A novel and promising model based on deep learning is proposed to achieve it. METHODS In this work, we build a power model named iPro-GAN for identification of promoters and their strength. First, we collect benchmark datasets and independent datasets for training and testing. Then, Moran-based spatial auto-cross correlation method is used as feature extraction method. Finally, deep convolution generative adversarial network with 10-fold cross validation is applied for classifying. The first layer of the model is used to identify the promoter and the second layer is used to determine its type. RESULTS On the benchmark data set, the accuracy of the first layer predictor is 93.15%, and the accuracy of the second layer predictor is 92.30%. On the independent data set, the accuracy of the first layer predictor is 86.77%, and the accuracy of the second layer predictor is 91.66%. In particular, breakthrough progress has been made in the identification of promoters' strength. CONCLUSIONS These results are far higher than the existing best predictor, which indicate that our model is serviceable and practicable to identify promoters and their strength. Furthermore, the datasets and source codes are available from this link: https://github.com/Bovbene/iPro-GAN.
Collapse
Affiliation(s)
- Huijuan Qiao
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China.
| | - Tian Xue
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Jinyue Wang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Bowei Wang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| |
Collapse
|
34
|
Tsukiyama S, Hasan MM, Deng HW, Kurata H. BERT6mA: prediction of DNA N6-methyladenine site using deep learning-based approaches. Brief Bioinform 2022; 23:6539171. [PMID: 35225328 PMCID: PMC8921755 DOI: 10.1093/bib/bbac053] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2021] [Revised: 01/28/2022] [Accepted: 01/31/2022] [Indexed: 01/29/2023] Open
Abstract
N6-methyladenine (6mA) is associated with important roles in DNA replication, DNA repair, transcription, regulation of gene expression. Several experimental methods were used to identify DNA modifications. However, these experimental methods are costly and time-consuming. To detect the 6mA and complement these shortcomings of experimental methods, we proposed a novel, deep leaning approach called BERT6mA. To compare the BERT6mA with other deep learning approaches, we used the benchmark datasets including 11 species. The BERT6mA presented the highest AUCs in eight species in independent tests. Furthermore, BERT6mA showed higher and comparable performance with the state-of-the-art models while the BERT6mA showed poor performances in a few species with a small sample size. To overcome this issue, pretraining and fine-tuning between two species were applied to the BERT6mA. The pretrained and fine-tuned models on specific species presented higher performances than other models even for the species with a small sample size. In addition to the prediction, we analyzed the attention weights generated by BERT6mA to reveal how the BERT6mA model extracts critical features responsible for the 6mA prediction. To facilitate biological sciences, the BERT6mA online web server and its source codes are freely accessible at https://github.com/kuratahiroyuki/BERT6mA.git, respectively.
Collapse
Affiliation(s)
- Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Hong-Wen Deng
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Hiroyuki Kurata
- Corresponding author: Hiroyuki Kurata, Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan. Tel: 81-948-29-7828; E-mail:
| |
Collapse
|
35
|
Tahir M, Khan F, Hayat M, Alshehri MD. An effective machine learning-based model for the prediction of protein–protein interaction sites in health systems. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07024-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
|
36
|
Zhang Z, Wang L. Using Chou's 5-steps rule to identify N 6-methyladenine sites by ensemble learning combined with multiple feature extraction methods. J Biomol Struct Dyn 2022; 40:796-806. [PMID: 32948102 DOI: 10.1080/07391102.2020.1821778] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
Abstract
N6-methyladenine (m6A), a type of modification mostly affecting the downstream biological functions and determining the levels of gene expression, is mediated by the methylation of adenine in nucleic acids. It is also a key factor for influencing biological processes and has attracted attention as a target for treating diseases. Here, an ensemble predictor named as TL-Methy, was developed to identify m6A sites across the genome. TL-Methy is a 2-level machine learning method developed by combining the support vector machine model and multiple features extraction methods, including nucleic acid composition, di-nucleotide composition, tri-nucleotide composition, position-specific trinucleotide propensity, Bi-profile Bayes, binary encoding, and accumulated nucleotide frequency. For Homo sapiens, TL-Methy method reached the accuracy of 91.68% on jackknife test and of 92.23% on 10-fold cross validation test; For Mus musculus, TL-Methy method achieved the accuracy of 93.66% on jackknife test and of 97.07% on 10-fold cross validation test; For Saccharomyces cerevisiae, TL-Methy method obtained the accuracy of 81.57% on jackknife test and of 82.54% on 10-fold cross validation test; For rice genome, TL-Methy method achieved the accuracy of 91.87% on jackknife test and of 93.04% on 10-fold cross validation test. The results via these two test approaches demonstrated the robustness and practicality of our TL-Methy model. The TL-Methy model may be as a potential method for m6A site identification.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Zhongwang Zhang
- College of Science, Dalian Maritime University, Dalian, P.R. China
| | - Lidong Wang
- College of Science, Dalian Maritime University, Dalian, P.R. China
| |
Collapse
|
37
|
Clauwaert J, Waegeman W. Novel Transformer Networks for Improved Sequence Labeling in genomics. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:97-106. [PMID: 33125335 DOI: 10.1109/tcbb.2020.3035021] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
In genomics, a wide range of machine learning methodologies have been investigated to annotate biological sequences for positions of interest such as transcription start sites, translation initiation sites, methylation sites, splice sites and promoter start sites. In recent years, this area has been dominated by convolutional neural networks, which typically outperform previously-designed methods as a result of automated scanning for influential sequence motifs. However, those architectures do not allow for the efficient processing of the full genomic sequence. As an improvement, we introduce transformer architectures for whole genome sequence labeling tasks. We show that these architectures, recently introduced for natural language processing, are better suited for processing and annotating long DNA sequences. We apply existing networks and introduce an optimized method for the calculation of attention from input nucleotides. To demonstrate this, we evaluate our architecture on several sequence labeling tasks, and find it to achieve state-of-the-art performances when comparing it to specialized models for the annotation of transcription start sites, translation initiation sites and 4mC methylation in E. coli.
Collapse
|
38
|
Teng Z, Zhao Z, Li Y, Tian Z, Guo M, Lu Q, Wang G. i6mA-Vote: Cross-Species Identification of DNA N6-Methyladenine Sites in Plant Genomes Based on Ensemble Learning With Voting. FRONTIERS IN PLANT SCIENCE 2022; 13:845835. [PMID: 35237293 PMCID: PMC8882731 DOI: 10.3389/fpls.2022.845835] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/30/2021] [Accepted: 01/24/2022] [Indexed: 05/17/2023]
Abstract
DNA N6-Methyladenine (6mA) is a common epigenetic modification, which plays some significant roles in the growth and development of plants. It is crucial to identify 6mA sites for elucidating the functions of 6mA. In this article, a novel model named i6mA-vote is developed to predict 6mA sites of plants. Firstly, DNA sequences were coded into six feature vectors with diverse strategies based on density, physicochemical properties, and position of nucleotides, respectively. To find the best coding strategy, the feature vectors were compared on several machine learning classifiers. The results suggested that the position of nucleotides has a significant positive effect on 6mA sites identification. Thus, the dinucleotide one-hot strategy which can describe position characteristics of nucleotides well was employed to extract DNA features in our method. Secondly, DNA sequences of Rosaceae were divided into a training dataset and a test dataset randomly. Finally, i6mA-vote was constructed by combining five different base-classifiers under a majority voting strategy and trained on the Rosaceae training dataset. The i6mA-vote was evaluated on the task of predicting 6mA sites from the genome of the Rosaceae, Rice, and Arabidopsis separately. In Rosaceae, the performances of i6mA-vote were 0.955 on accuracy (ACC), 0.909 on Matthew correlation coefficients (MCC), 0.955 on sensitivity (SN), and 0.954 on specificity (SP). Those indicators, in the order of ACC, MCC, SN, SP, were 0.882, 0.774, 0.961, and 0.803 on Rice while they were 0.798, 0.617, 0.666, and 0.929 on Arabidopsis. According to the indicators, our method was effectiveness and better than other concerned methods. The results also illustrated that i6mA-vote does not only well in 6mA sites prediction of intraspecies but also interspecies plants. Moreover, it can be seen that the specificity is distinctly lower than the sensitivity in Rice while it is just the opposite in Arabidopsis. It may be resulted from sequence similarity among Rosaceae, Rice and Arabidopsis.
Collapse
Affiliation(s)
- Zhixia Teng
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zhengnan Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yanjuan Li
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
| | - Zhen Tian
- College of Information Engineering, Zhengzhou University, Zhengzhou, China
| | - Maozu Guo
- College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
| | - Qianzi Lu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- *Correspondence: Qianzi Lu,
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Guohua Wang,
| |
Collapse
|
39
|
Shafiq M, Ali O, Han SB, Kim DH. Mechanobiological Strategies to Enhance Stem Cell Functionality for Regenerative Medicine and Tissue Engineering. Front Cell Dev Biol 2021; 9:747398. [PMID: 34926444 PMCID: PMC8678455 DOI: 10.3389/fcell.2021.747398] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2021] [Accepted: 11/10/2021] [Indexed: 12/18/2022] Open
Abstract
Stem cells have been extensively used in regenerative medicine and tissue engineering; however, they often lose their functionality because of the inflammatory microenvironment. This leads to their poor survival, retention, and engraftment at transplantation sites. Considering the rapid loss of transplanted cells due to poor cell-cell and cell-extracellular matrix (ECM) interactions during transplantation, it has been reasoned that stem cells mainly mediate reparative responses via paracrine mechanisms, including the secretion of extracellular vesicles (EVs). Ameliorating poor cell-cell and cell-ECM interactions may obviate the limitations associated with the poor retention and engraftment of transplanted cells and enable them to mediate tissue repair through the sustained and localized presentation of secreted bioactive cues. Biomaterial-mediated strategies may be leveraged to confer stem cells enhanced immunomodulatory properties, as well as better engraftment and retention at the target site. In these approaches, biomaterials have been exploited to spatiotemporally present bioactive cues to stem cell-laden platforms (e.g., aggregates, microtissues, and tissue-engineered constructs). An array of biomaterials, such as nanoparticles, hydrogels, and scaffolds, has been exploited to facilitate stem cells function at the target site. Additionally, biomaterials can be harnessed to suppress the inflammatory microenvironment to induce enhanced tissue repair. In this review, we summarize biomaterial-based platforms that impact stem cell function for better tissue repair that may have broader implications for the treatment of various diseases as well as tissue regeneration.
Collapse
Affiliation(s)
- Muhammad Shafiq
- Department of Biotechnology, Faculty of Life Sciences, University of Central Punjab, Lahore, Pakistan
| | - Onaza Ali
- School of Chemistry and Chemical Engineering, Tiangong University, Tianjin, China
| | - Seong-Beom Han
- KU-KIST Graduate School of Converging Science and Technology, Korea University, Seoul, South Korea
| | - Dong-Hwee Kim
- KU-KIST Graduate School of Converging Science and Technology, Korea University, Seoul, South Korea.,Department of Integrative Energy Engineering, College of Engineering, Korea University, Seoul, South Korea
| |
Collapse
|
40
|
Le NQK, Ho QT. Deep transformers and convolutional neural network in identifying DNA N6-methyladenine sites in cross-species genomes. Methods 2021; 204:199-206. [PMID: 34915158 DOI: 10.1016/j.ymeth.2021.12.004] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2021] [Revised: 11/30/2021] [Accepted: 12/09/2021] [Indexed: 12/19/2022] Open
Abstract
As one of the most common post-transcriptional epigenetic modifications, N6-methyladenine (6 mA), plays an essential role in various cellular processes and disease pathogenesis. Therefore, accurately identifying 6 mA modifications is necessary for a deep understanding of cellular processes and other possible functional mechanisms. Although a few computational methods have been proposed, their respective models were developed with small training datasets. Hence, their practical application is quite limited in genome-wide detection. To overcome the existing limitations, we present a novel model based on transformer architecture and deep learning to identify DNA 6 mA sites from the cross-species genome. The model is constructed on a benchmark dataset and explored a feature derived from pre-trained transformer word embedding approaches. Subsequently, a convolutional neural network was employed to learn the generated features and generate the prediction outcomes. As a result, our predictor achieved excellent performance during independent test with the accuracy and Matthews correlation coefficient (MCC) of 79.3% and 0.58, respectively. Overall, its performance achieved better accuracy than the baseline models and significantly outperformed the existing predictors, demonstrating the effectiveness of our proposed hybrid framework. Furthermore, our model is expected to assist biologists in accurately identifying 6mAs and formulate the novel testable biological hypothesis. We also release source codes and datasets freely at https://github.com/khanhlee/bert-dna for front-end users.
Collapse
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan; Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan.
| | - Quang-Thai Ho
- College of Information & Communication Technology, Can Tho University, Viet Nam; Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
| |
Collapse
|
41
|
Nguyen TTD, Trinh VN, Le NQK, Ou YY. Using k-mer embeddings learned from a Skip-gram based neural network for building a cross-species DNA N6-methyladenine site prediction model. PLANT MOLECULAR BIOLOGY 2021; 107:533-542. [PMID: 34843033 DOI: 10.1007/s11103-021-01204-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/23/2021] [Accepted: 09/25/2021] [Indexed: 06/13/2023]
Abstract
This study used k-mer embeddings as effective feature to identify DNA N6-Methyladenine sites in plant genomes and obtained improved performance without substantial effort in feature extraction, combination and selection. Identification of DNA N6-methyladenine sites has been a very active topic of computational biology due to the unavailability of suitable methods to identify them accurately, especially in plants. Substantial results were obtained with a great effort put in extracting, heuristic searching, or fusing a diverse types of features, not to mention a feature selection step. In this study, we regarded DNA sequences as textual information and employed natural language processing techniques to decipher hidden biological meanings from those sequences. In other words, we considered DNA, the human life book, as a book corpus for training DNA language models. K-mer embeddings then were generated from these language models to be used in machine learning prediction models. Skip-gram neural networks were the base of the language models and ensemble tree-based algorithms were the machine learning algorithms for prediction models. We trained the prediction model on Rosaceae genome dataset and performed a comprehensive test on 3 plant genome datasets. Our proposed method shows promising performance with AUC performance approaching an ideal value on Rosaceae dataset (0.99), a high score on Rice dataset (0.95) and improved performance on Rice dataset while enjoying an elegant, yet efficient feature extraction process.
Collapse
Affiliation(s)
| | - Van Ngu Trinh
- Soonchunhyang Institute of Medi-Bio Science, Soonchunhyang University, Cheonan, 31151, South Korea
| | - Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City, 106, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei City, 106, Taiwan
| | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, 32003, Taiwan.
| |
Collapse
|
42
|
Ao C, Gao L, Yu L. Research progress in predicting DNA methylation modifications and the relation with human diseases. Curr Med Chem 2021; 29:822-836. [PMID: 34533438 DOI: 10.2174/0929867328666210917115733] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2021] [Revised: 07/05/2021] [Accepted: 07/11/2021] [Indexed: 11/22/2022]
Abstract
DNA methylation is an important mode of regulation in epigenetic mechanisms, and it is one of the research foci in the field of epigenetics. DNA methylation modification affects a series of biological processes, such as eukaryotic cell growth, differentiation and transformation mechanisms, by regulating gene expression. In this review, we systematically summarized the DNA methylation databases, prediction tools for DNA methylation modification, machine learning algorithms for predicting DNA methylation modification, and the relationship between DNA methylation modification and diseases such as hypertension, Alzheimer's disease, diabetic nephropathy, and cancer. An in-depth understanding of DNA methylation mechanisms can promote accurate prediction of DNA methylation modifications and the treatment and diagnosis of related diseases.
Collapse
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Lin Gao
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| |
Collapse
|
43
|
Shen LC, Liu Y, Song J, Yu DJ. SAResNet: self-attention residual network for predicting DNA-protein binding. Brief Bioinform 2021; 22:bbab101. [PMID: 33837387 PMCID: PMC8579196 DOI: 10.1093/bib/bbab101] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2020] [Revised: 03/03/2021] [Accepted: 03/08/2021] [Indexed: 11/12/2022] Open
Abstract
Knowledge of the specificity of DNA-protein binding is crucial for understanding the mechanisms of gene expression, regulation and gene therapy. In recent years, deep-learning-based methods for predicting DNA-protein binding from sequence data have achieved significant success. Nevertheless, the current state-of-the-art computational methods have some drawbacks associated with the use of limited datasets with insufficient experimental data. To address this, we propose a novel transfer learning-based method, termed SAResNet, which combines the self-attention mechanism and residual network structure. More specifically, the attention-driven module captures the position information of the sequence, while the residual network structure guarantees that the high-level features of the binding site can be extracted. Meanwhile, the pre-training strategy used by SAResNet improves the learning ability of the network and accelerates the convergence speed of the network during transfer learning. The performance of SAResNet is extensively tested on 690 datasets from the ChIP-seq experiments with an average AUC of 92.0%, which is 4.4% higher than that of the best state-of-the-art method currently available. When tested on smaller datasets, the predictive performance is more clearly improved. Overall, we demonstrate that the superior performance of DNA-protein binding prediction on DNA sequences can be achieved by combining the attention mechanism and residual structure, and a novel pipeline is accordingly developed. The proposed methodology is generally applicable and can be used to address any other sequence classification problems.
Collapse
Affiliation(s)
- Long-Chen Shen
- School of Computer Science and Engineering, Nanjing University of Science and Technology, China
| | - Yan Liu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, China
| |
Collapse
|
44
|
Akmal MA, Hussain W, Rasool N, Khan YD, Khan SA, Chou KC. Using CHOU'S 5-Steps Rule to Predict O-Linked Serine Glycosylation Sites by Blending Position Relative Features and Statistical Moment. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2045-2056. [PMID: 31985438 DOI: 10.1109/tcbb.2020.2968441] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Glycosylation of proteins in eukaryote cells is an important and complicated post-translation modification due to its pivotal role and association with crucial physiological functions within most of the proteins. Identification of glycosylation sites in a polypeptide chain is not an easy task due to multiple impediments. Analytical identification of these sites is expensive and laborious. There is a dire need to develop a reliable computational method for precise determination of such sites which can help researchers to save time and effort. Herein, we propose a novel predictor namely iGlycoS-PseAAC by integrating the Chou's Pseudo Amino Acid Composition (PseAAC) and relative/absolute position-based features. The self-consistency results show that the accuracy revealed by the model using the benchmark dataset for prediction of O-linked glycosylation having serine sites is 98.8 percent. The overall accuracy of predictor achieved through 10-fold cross validation by combining the positive and negative results is 97.2 percent. The overall accuracy achieved through Jackknife test is 96.195 percent by aggregating of all the prediction results. Thus the proposed predictor can help in predicting the O-linked glycosylated serine sites in an efficient and accurate way. The overall results show that the accuracy of the iGlycoS-PseAAC is higher than the existing tools.
Collapse
|
45
|
He S, Kong L, Chen J. iDNA6mA-Rice-DL: A local web server for identifying DNA N6-methyladenine sites in rice genome by deep learning method. J Bioinform Comput Biol 2021; 19:2150019. [PMID: 34291710 DOI: 10.1142/s0219720021500190] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Accurate detection of N6-methyladenine (6mA) sites by biochemical experiments will help to reveal their biological functions, still, these wet experiments are laborious and expensive. Therefore, it is necessary to introduce a powerful computational model to identify the 6mA sites on a genomic scale, especially for plant genomes. In view of this, we proposed a model called iDNA6mA-Rice-DL for the effective identification of 6mA sites in rice genome, which is an intelligent computing model based on deep learning method. Traditional machine learning methods assume the preparation of the features for analysis. However, our proposed model automatically encodes and extracts key DNA features through an embedded layer and several groups of dense layers. We use an independent dataset to evaluate the generalization ability of our model. An area under the receiver operating characteristic curve (auROC) of 0.98 with an accuracy of 95.96% was obtained. The experiment results demonstrate that our model had good performance in predicting 6mA sites in the rice genome. A user-friendly local web server has been established. The Docker image of the local web server can be freely downloaded at https://hub.docker.com/r/his1server/idna6ma-rice-dl.
Collapse
Affiliation(s)
- Shiqian He
- School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao 066000, P. R. China
| | - Liang Kong
- School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao 066000, P. R. China
| | - Jing Chen
- School of Information Science and Engineering, Yanshan University, Qinhuangdao 066000, P. R. China
| |
Collapse
|
46
|
Feng P, Feng L, Tang C. Comparison and Analysis of Computational Methods for Identifying N6-Methyladenosine Sites in Saccharomyces cerevisiae. Curr Pharm Des 2021; 27:1219-1229. [PMID: 33167827 DOI: 10.2174/1381612826666201109110703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2020] [Accepted: 07/20/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND N6-methyladenosine (m6A) plays critical roles in a broad range of biological processes. Knowledge about the precise location of m6A site in the transcriptome is vital for deciphering its biological functions. Although experimental techniques have made substantial contributions to identify m6A, they are still labor intensive and time consuming. As complement to experimental methods, in the past few years, a series of computational approaches have been proposed to identify m6A sites. METHODS In order to facilitate researchers to select appropriate methods for identifying m6A sites, it is necessary to conduct a comprehensive review and comparison of existing methods. RESULTS Since research works on m6A in Saccharomyces cerevisiae are relatively clear, in this review, we summarized recent progress of computational prediction of m6A sites in S. cerevisiae and assessed the performance of existing computational methods. Finally, future directions of computationally identifying m6A sites are presented. CONCLUSION Taken together, we anticipate that this review will serve as an important guide for computational analysis of m6A modifications.
Collapse
Affiliation(s)
- Pengmian Feng
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
| | - Lijing Feng
- School of Sciences, North China University of Science and Technology, Tangshan 063000, China
| | - Chaohui Tang
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
| |
Collapse
|
47
|
UMAP-DBP: An Improved DNA-Binding Proteins Prediction Method Based on Uniform Manifold Approximation and Projection. Protein J 2021; 40:562-575. [PMID: 34176069 DOI: 10.1007/s10930-021-10011-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 06/10/2021] [Indexed: 10/21/2022]
Abstract
DNA-binding proteins play a vital role in cellular processes. It is an extremely urgent to develop a high-throughput method for efficiently identifying DNA-binding proteins. According to the current research situation, some methods in machine learning and deep learning show excellent computational speed and accuracy, which are worthy of application. In this work, a novel predictor was proposed to predict DNA binding proteins called UMAP-DBP. Firstly, the feature extraction of primary protein sequence was realized based on physicochemical distance transformation, Profile-based auto-cross covariance and General series correlation pseudo amino acid composition. Secondly, uniform manifold approximation and projection (UMAP) and feature importance score methods were used for feature selection; there is a progressive relationship between them. Finally, the Adaboost operation engine with jackknife test were adopted for predicting DNA-binding proteins. For the jackknife test on the BP1075 and BP594, we obtained an overall accuracy of 82.97% and 82.14%, Cohen's kappa (CK) of 0.66 and 0.64, respectively. The results illustrate that a feasible method has been developed for predicting DNA-binding proteins by UMAP and Adaboost. This is the first study in which UMAP has been successfully applied to identify DNA-binding proteins. All the datasets and codes are accessible at https://github.com/Wang-Jinyue/UMAP-DBP .
Collapse
|
48
|
Meher PK, Rai A, Rao AR. mLoc-mRNA: predicting multiple sub-cellular localization of mRNAs using random forest algorithm coupled with feature selection via elastic net. BMC Bioinformatics 2021; 22:342. [PMID: 34167457 PMCID: PMC8223360 DOI: 10.1186/s12859-021-04264-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2020] [Accepted: 06/11/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Localization of messenger RNAs (mRNAs) plays a crucial role in the growth and development of cells. Particularly, it plays a major role in regulating spatio-temporal gene expression. The in situ hybridization is a promising experimental technique used to determine the localization of mRNAs but it is costly and laborious. It is also a known fact that a single mRNA can be present in more than one location, whereas the existing computational tools are capable of predicting only a single location for such mRNAs. Thus, the development of high-end computational tool is required for reliable and timely prediction of multiple subcellular locations of mRNAs. Hence, we develop the present computational model to predict the multiple localizations of mRNAs. RESULTS The mRNA sequences from 9 different localizations were considered. Each sequence was first transformed to a numeric feature vector of size 5460, based on the k-mer features of sizes 1-6. Out of 5460 k-mer features, 1812 important features were selected by the Elastic Net statistical model. The Random Forest supervised learning algorithm was then employed for predicting the localizations with the selected features. Five-fold cross-validation accuracies of 70.87, 68.32, 68.36, 68.79, 96.46, 73.44, 70.94, 97.42 and 71.77% were obtained for the cytoplasm, cytosol, endoplasmic reticulum, exosome, mitochondrion, nucleus, pseudopodium, posterior and ribosome respectively. With an independent test set, accuracies of 65.33, 73.37, 75.86, 72.99, 94.26, 70.91, 65.53, 93.60 and 73.45% were obtained for the respective localizations. The developed approach also achieved higher accuracies than the existing localization prediction tools. CONCLUSIONS This study presents a novel computational tool for predicting the multiple localization of mRNAs. Based on the proposed approach, an online prediction server "mLoc-mRNA" is accessible at http://cabgrid.res.in:8080/mlocmrna/ . The developed approach is believed to supplement the existing tools and techniques for the localization prediction of mRNAs.
Collapse
Affiliation(s)
- Prabina Kumar Meher
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
| | - Anil Rai
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
| | | |
Collapse
|
49
|
Malebary SJ, Khan YD. Evaluating machine learning methodologies for identification of cancer driver genes. Sci Rep 2021; 11:12281. [PMID: 34112883 PMCID: PMC8192921 DOI: 10.1038/s41598-021-91656-8] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2021] [Accepted: 05/19/2021] [Indexed: 02/06/2023] Open
Abstract
Cancer is driven by distinctive sorts of changes and basic variations in genes. Recognizing cancer driver genes is basic for accurate oncological analysis. Numerous methodologies to distinguish and identify drivers presently exist, but efficient tools to combine and optimize them on huge datasets are few. Most strategies for prioritizing transformations depend basically on frequency-based criteria. Strategies are required to dependably prioritize organically dynamic driver changes over inert passengers in high-throughput sequencing cancer information sets. This study proposes a model namely PCDG-Pred which works as a utility capable of distinguishing cancer driver and passenger attributes of genes based on sequencing data. Keeping in view the significance of the cancer driver genes an efficient method is proposed to identify the cancer driver genes. Further, various validation techniques are applied at different levels to establish the effectiveness of the model and to obtain metrics like accuracy, Mathew's correlation coefficient, sensitivity, and specificity. The results of the study strongly indicate that the proposed strategy provides a fundamental functional advantage over other existing strategies for cancer driver genes identification. Subsequently, careful experiments exhibit that the accuracy metrics obtained for self-consistency, independent set, and cross-validation tests are 91.08%., 87.26%, and 92.48% respectively.
Collapse
Affiliation(s)
- Sharaf J Malebary
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P.O. Box 344, Rabigh, 21911, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan.
| |
Collapse
|
50
|
Song B, Li Z, Lin X, Wang J, Wang T, Fu X. Pretraining model for biological sequence data. Brief Funct Genomics 2021; 20:181-195. [PMID: 34050350 PMCID: PMC8194843 DOI: 10.1093/bfgp/elab025] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2021] [Revised: 04/13/2021] [Accepted: 04/21/2021] [Indexed: 12/26/2022] Open
Abstract
With the development of high-throughput sequencing technology, biological sequence data reflecting life information becomes increasingly accessible. Particularly on the background of the COVID-19 pandemic, biological sequence data play an important role in detecting diseases, analyzing the mechanism and discovering specific drugs. In recent years, pretraining models that have emerged in natural language processing have attracted widespread attention in many research fields not only to decrease training cost but also to improve performance on downstream tasks. Pretraining models are used for embedding biological sequence and extracting feature from large biological sequence corpus to comprehensively understand the biological sequence data. In this survey, we provide a broad review on pretraining models for biological sequence data. Moreover, we first introduce biological sequences and corresponding datasets, including brief description and accessible link. Subsequently, we systematically summarize popular pretraining models for biological sequences based on four categories: CNN, word2vec, LSTM and Transformer. Then, we present some applications with proposed pretraining models on downstream tasks to explain the role of pretraining models. Next, we provide a novel pretraining scheme for protein sequences and a multitask benchmark for protein pretraining models. Finally, we discuss the challenges and future directions in pretraining models for biological sequences.
Collapse
Affiliation(s)
| | | | | | | | | | - Xiangzheng Fu
- Corresponding author: Xiangzheng Fu, College of Information Science and Engineering, Hunan University, Changsha, Hunan, China. Tel: 86-0731-88821907; E-mail:
| |
Collapse
|