1
|
Zhuo L, Wang R, Fu X, Yao X. StableDNAm: towards a stable and efficient model for predicting DNA methylation based on adaptive feature correction learning. BMC Genomics 2023; 24:742. [PMID: 38053026 DOI: 10.1186/s12864-023-09802-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Accepted: 11/11/2023] [Indexed: 12/07/2023] Open
Abstract
BACKGROUND DNA methylation, instrumental in numerous life processes, underscores the paramount importance of its accurate prediction. Recent studies suggest that deep learning, due to its capacity to extract profound insights, provides a more precise DNA methylation prediction. However, issues related to the stability and generalization performance of these models persist. RESULTS In this study, we introduce an efficient and stable DNA methylation prediction model. This model incorporates a feature fusion approach, adaptive feature correction technology, and a contrastive learning strategy. The proposed model presents several advantages. First, DNA sequences are encoded at four levels to comprehensively capture intricate information across multi-scale and low-span features. Second, we design a sequence-specific feature correction module that adaptively adjusts the weights of sequence features. This improvement enhances the model's stability and scalability, or its generality. Third, our contrastive learning strategy mitigates the instability issues resulting from sparse data. To validate our model, we conducted multiple sets of experiments on commonly used datasets, demonstrating the model's robustness and stability. Simultaneously, we amalgamate various datasets into a single, unified dataset. The experimental outcomes from this combined dataset substantiate the model's robust adaptability. CONCLUSIONS Our research findings affirm that the StableDNAm model is a general, stable, and effective instrument for DNA methylation prediction. It holds substantial promise for providing invaluable assistance in future methylation-related research and analyses.
Collapse
Affiliation(s)
- Linlin Zhuo
- College of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou, 325000, China
| | - Rui Wang
- College of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou, 325000, China
| | - Xiangzheng Fu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410000, China.
| | - Xiaojun Yao
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, China.
| |
Collapse
|
2
|
Nguyen-Vo TH, Trinh QH, Nguyen L, Nguyen-Hoang PU, Rahardja S, Nguyen BP. i4mC-GRU: Identifying DNA N 4-Methylcytosine sites in mouse genomes using bidirectional gated recurrent unit and sequence-embedded features. Comput Struct Biotechnol J 2023; 21:3045-3053. [PMID: 37273848 PMCID: PMC10238585 DOI: 10.1016/j.csbj.2023.05.014] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2022] [Revised: 05/12/2023] [Accepted: 05/12/2023] [Indexed: 06/06/2023] Open
Abstract
N4-methylcytosine (4mC) is one of the most common DNA methylation modifications found in both prokaryotic and eukaryotic genomes. Since the 4mC has various essential biological roles, determining its location helps reveal unexplored physiological and pathological pathways. In this study, we propose an effective computational method called i4mC-GRU using a gated recurrent unit and duplet sequence-embedded features to predict potential 4mC sites in mouse (Mus musculus) genomes. To fairly assess the performance of the model, we compared our method with several state-of-the-art methods using two different benchmark datasets. Our results showed that i4mC-GRU achieved area under the receiver operating characteristic curve values of 0.97 and 0.89 and area under the precision-recall curve values of 0.98 and 0.90 on the first and second benchmark datasets, respectively. Briefly, our method outperformed existing methods in predicting 4mC sites in mouse genomes. Also, we deployed i4mC-GRU as an online web server, supporting users in genomics studies.
Collapse
Affiliation(s)
- Thanh-Hoang Nguyen-Vo
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
- School of Innovation, Design and Technology, Wellington Institute of Technology, Wellington 5012, New Zealand
| | - Quang H. Trinh
- School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi 100000, Vietnam
| | - Loc Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
| | - Phuong-Uyen Nguyen-Hoang
- Computational Biology Center, International University - VNU HCMC, Ho Chi Minh City 700000, Vietnam
| | - Susanto Rahardja
- School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
- Infocomm Technology Cluster, Singapore Institute of Technology, Singapore 138683, Singapore
| | - Binh P. Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
| |
Collapse
|
3
|
Yu X, Ren J, Cui Y, Zeng R, Long H, Ma C. DRSN4mCPred: accurately predicting sites of DNA N4-methylcytosine using deep residual shrinkage network for diagnosis and treatment of gastrointestinal cancer in the precision medicine era. Front Med (Lausanne) 2023; 10:1187430. [PMID: 37215722 PMCID: PMC10192687 DOI: 10.3389/fmed.2023.1187430] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2023] [Accepted: 04/05/2023] [Indexed: 05/24/2023] Open
Abstract
Introduction The DNA N4-methylcytosine (4mC) site levels of those suffering from digestive system cancers were higher, and the pathogenesis of digestive system cancers may also be related to the changes in DNA 4mC levels. Identifying DNA 4mC sites is a very important step in studying the analysis of biological function and cancer prediction. Extracting accurate features from DNA sequences is the key to establishing a prediction model of effective DNA 4mC sites. This study sought to develop a new predictive model, DRSN4mCPred, which aimed to improve the performance of the predicting DNA 4mC sites. Methods The model adopted multi-scale channel attention to extract features and used attention feature fusion (AFF) to fuse features. In order to capture features information more accurately and effectively, this model utilized Deep Residual Shrinkage Network with Channel-Wise thresholds (DRSN-CW) to eliminate noise-related features and achieve a more precise feature representation, thereby, distinguishing the sites in DNA with 4mC and non-4mC. Additionally, the predictive model incorporated an inverted residual block, a Multi-scale Channel Attention Module (MS-CAM), a Bi-directional Long Short Term Memory Network (Bi-LSTM), AFF, and DRSN-CW. Results and Discussion The results indicated the predictive model DRSN4mCPred had extremely good performance in predicting the DNA 4mC sites across different species. This paper will potentially provide support for the diagnosis and treatment of gastrointestinal cancer based on artificial intelligence in the precise medical era.
Collapse
Affiliation(s)
- Xia Yu
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Jia Ren
- Industrial Design School, Shandong University of ART and Design, Jinan, Shandong, China
| | - Yani Cui
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| | - Rao Zeng
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Haixia Long
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Cuihua Ma
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| |
Collapse
|
4
|
Ding Y, He W, Tang J, Zou Q, Guo F. Laplacian Regularized Sparse Representation Based Classifier for Identifying DNA N4-Methylcytosine Sites via L 2,1/2-Matrix Norm. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:500-511. [PMID: 34882559 DOI: 10.1109/tcbb.2021.3133309] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
N4-methylcytosine (4mC) is one of important epigenetic modifications in DNA sequences. Detecting 4mC sites is time-consuming. The computational method based on machine learning has provided effective help for identifying 4mC. To further improve the performance of prediction, we propose a Laplacian Regularized Sparse Representation based Classifier with L2,1/2-matrix norm (LapRSRC). We also utilize kernel trick to derive the kernel LapRSRC for nonlinear modeling. Matrix factorization technology is employed to solve the sparse representation coefficients of all test samples in the training set. And an efficient iterative algorithm is proposed to solve the objective function. We implement our model on six benchmark datasets of 4mC and eight UCI datasets to evaluate performance. The results show that the performance of our method is better or comparable.
Collapse
|
5
|
Jin J, Yu Y, Wang R, Zeng X, Pang C, Jiang Y, Li Z, Dai Y, Su R, Zou Q, Nakai K, Wei L. iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations. Genome Biol 2022; 23:219. [PMID: 36253864 PMCID: PMC9575223 DOI: 10.1186/s13059-022-02780-1] [Citation(s) in RCA: 44] [Impact Index Per Article: 22.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2022] [Accepted: 10/03/2022] [Indexed: 11/29/2022] Open
Abstract
In this study, we propose iDNA-ABF, a multi-scale deep biological language learning model that enables the interpretable prediction of DNA methylations based on genomic sequences only. Benchmarking comparisons show that our iDNA-ABF outperforms state-of-the-art methods for different methylation predictions. Importantly, we show the power of deep language learning in capturing both sequential and functional semantics information from background genomes. Moreover, by integrating the interpretable analysis mechanism, we well explain what the model learns, helping us build the mapping from the discovery of important sequential determinants to the in-depth analysis of their biological functions.
Collapse
Affiliation(s)
- Junru Jin
- School of Software, Shandong University, Jinan, 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Yingying Yu
- School of Software, Shandong University, Jinan, 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Ruheng Wang
- School of Software, Shandong University, Jinan, 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Xin Zeng
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, 108-8639, Japan
- Department of Computational Biology and Medical Sciences, The University of Tokyo, Kashiwa, 277-8563, Japan
| | - Chao Pang
- School of Software, Shandong University, Jinan, 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Yi Jiang
- School of Software, Shandong University, Jinan, 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Zhongshen Li
- School of Software, Shandong University, Jinan, 250101, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Yutong Dai
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, 108-8639, Japan
- Department of Computational Biology and Medical Sciences, The University of Tokyo, Kashiwa, 277-8563, Japan
| | - Ran Su
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China
| | - Kenta Nakai
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, 108-8639, Japan.
- Department of Computational Biology and Medical Sciences, The University of Tokyo, Kashiwa, 277-8563, Japan.
| | - Leyi Wei
- School of Software, Shandong University, Jinan, 250101, China.
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China.
| |
Collapse
|
6
|
FRTpred: A novel approach for accurate prediction of protein folding rate and type. Comput Biol Med 2022; 149:105911. [DOI: 10.1016/j.compbiomed.2022.105911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2022] [Revised: 07/08/2022] [Accepted: 07/23/2022] [Indexed: 11/20/2022]
|
7
|
PSP-PJMI: An innovative feature representation algorithm for identifying DNA N4-methylcytosine sites. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.05.060] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
|
8
|
Liang Y, Wu Y, Zhang Z, Liu N, Peng J, Tang J. Hyb4mC: a hybrid DNA2vec-based model for DNA N4-methylcytosine sites prediction. BMC Bioinformatics 2022; 23:258. [PMID: 35768759 PMCID: PMC9241225 DOI: 10.1186/s12859-022-04789-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2021] [Accepted: 06/10/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND DNA N4-methylcytosine is part of the restrictive modification system, which works by regulating some biological processes, for example, the initiation of DNA replication, mismatch repair and inactivation of transposon. However, using experimental methods to detect 4mC sites is time-consuming and expensive. Besides, considering the huge differences in the number of 4mC samples among different species, it is challenging to achieve a robust multi-species 4mC site prediction performance. Hence, it is of great significance to develop effective computational tools to identify 4mC sites. RESULTS This work proposes a flexible deep learning-based framework to predict 4mC sites, called Hyb4mC. Hyb4mC adopts the DNA2vec method for sequence embedding, which captures more efficient and comprehensive information compared with the sequence-based feature method. Then, two different subnets are used for further analysis: Hyb_Caps and Hyb_Conv. Hyb_Caps is composed of a capsule neural network and can generalize from fewer samples. Hyb_Conv combines the attention mechanism with a text convolutional neural network for further feature learning. CONCLUSIONS Extensive benchmark tests have shown that Hyb4mC can significantly enhance the performance of predicting 4mC sites compared with the recently proposed methods.
Collapse
Affiliation(s)
- Ying Liang
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China.
| | - Yanan Wu
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| | - Zequn Zhang
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| | - Niannian Liu
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| | - Jun Peng
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| | - Jianjun Tang
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| |
Collapse
|
9
|
|
10
|
Yu L, Zhang Y, Xue L, Liu F, Chen Q, Luo J, Jing R. Systematic Analysis and Accurate Identification of DNA N4-Methylcytosine Sites by Deep Learning. Front Microbiol 2022; 13:843425. [PMID: 35401453 PMCID: PMC8989013 DOI: 10.3389/fmicb.2022.843425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2021] [Accepted: 02/21/2022] [Indexed: 11/13/2022] Open
Abstract
DNA N4-methylcytosine (4mC) is a pivotal epigenetic modification that plays an essential role in DNA replication, repair, expression and differentiation. To gain insight into the biological functions of 4mC, it is critical to identify their modification sites in the genomics. Recently, deep learning has become increasingly popular in recent years and frequently employed for the 4mC site identification. However, a systematic analysis of how to build predictive models using deep learning techniques is still lacking. In this work, we first summarized all existing deep learning-based predictors and systematically analyzed their models, features and datasets, etc. Then, using a typical standard dataset with three species (A. thaliana, C. elegans, and D. melanogaster), we assessed the contribution of different model architectures, encoding methods and the attention mechanism in establishing a deep learning-based model for the 4mC site prediction. After a series of optimizations, convolutional-recurrent neural network architecture using the one-hot encoding and attention mechanism achieved the best overall prediction performance. Extensive comparison experiments were conducted based on the same dataset. This work will be helpful for researchers who would like to build the 4mC prediction models using deep learning in the future.
Collapse
Affiliation(s)
- Lezheng Yu
- School of Chemistry and Materials Science, Guizhou Education University, Guiyang, China
| | - Yonglin Zhang
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China
| | - Li Xue
- School of Public Health, Southwest Medical University, Luzhou, China
| | - Fengjuan Liu
- School of Geography and Resources, Guizhou Education University, Guiyang, China
| | - Qi Chen
- Department of Endocrinology and Metabolism, The Affiliated Hospital of Southwest Medical University, Luzhou, China
| | - Jiesi Luo
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China.,Department of Pharmacy, The Affiliated Hospital of Southwest Medical University, Luzhou, China
| | - Runyu Jing
- School of Cyber Science and Engineering, Sichuan University, Chengdu, China
| |
Collapse
|
11
|
Zulfiqar H, Huang QL, Lv H, Sun ZJ, Dao FY, Lin H. Deep-4mCGP: A Deep Learning Approach to Predict 4mC Sites in Geobacter pickeringii by Using Correlation-Based Feature Selection Technique. Int J Mol Sci 2022; 23:1251. [PMID: 35163174 PMCID: PMC8836036 DOI: 10.3390/ijms23031251] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2021] [Revised: 01/19/2022] [Accepted: 01/20/2022] [Indexed: 12/15/2022] Open
Abstract
4mC is a type of DNA alteration that has the ability to synchronize multiple biological movements, for example, DNA replication, gene expressions, and transcriptional regulations. Accurate prediction of 4mC sites can provide exact information to their hereditary functions. The purpose of this study was to establish a robust deep learning model to recognize 4mC sites in Geobacter pickeringii. In the anticipated model, two kinds of feature descriptors, namely, binary and k-mer composition were used to encode the DNA sequences of Geobacter pickeringii. The obtained features from their fusion were optimized by using correlation and gradient-boosting decision tree (GBDT)-based algorithm with incremental feature selection (IFS) method. Then, these optimized features were inserted into 1D convolutional neural network (CNN) to classify 4mC sites from non-4mC sites in Geobacter pickeringii. The performance of the anticipated model on independent data exhibited an accuracy of 0.868, which was 4.2% higher than the existing model.
Collapse
Affiliation(s)
| | | | | | | | | | - Hao Lin
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; (H.Z.); (Q.-L.H.); (H.L.); (Z.-J.S.); (F.-Y.D.)
| |
Collapse
|
12
|
Li F, Guo X, Xiang D, Pitt ME, Bainomugisa A, Coin LJ. Computational analysis and prediction of PE_PGRS proteins using machine learning. Comput Struct Biotechnol J 2022; 20:662-674. [PMID: 35140886 PMCID: PMC8804200 DOI: 10.1016/j.csbj.2022.01.019] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2021] [Revised: 01/09/2022] [Accepted: 01/18/2022] [Indexed: 12/18/2022] Open
Abstract
Mycobacterium tuberculosis genome comprises approximately 10% of two families of poorly characterised genes due to their high GC content and highly repetitive nature. The largest sub-group, the proline-glutamic acid polymorphic guanine-cytosine-rich sequence (PE_PGRS) family, is thought to be involved in host response and disease pathogenicity. Due to their high genetic variability and complexity of analysis, they are typically disregarded for further research in genomic studies. There are currently limited online resources and homology computational tools that can identify and analyse PE_PGRS proteins. In addition, they are computational-intensive and time-consuming, and lack sensitivity. Therefore, computational methods that can rapidly and accurately identify PE_PGRS proteins are valuable to facilitate the functional elucidation of the PE_PGRS family proteins. In this study, we developed the first machine learning-based bioinformatics approach, termed PEPPER, to allow users to identify PE_PGRS proteins rapidly and accurately. PEPPER was built upon a comprehensive evaluation of 13 popular machine learning algorithms with various sequence and physicochemical features. Empirical studies demonstrated that PEPPER achieved significantly better performance than alignment-based approaches, BLASTP and PHMMER, in both prediction accuracy and speed. PEPPER is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE_PGRS proteins.
Collapse
Affiliation(s)
- Fuyi Li
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
| | - Xudong Guo
- School of Information Engineering, Ningxia University, Yinchuan, Ningxia 750021, China
| | - Dongxu Xiang
- Faculty of Engineering and Information Technology, The University of Melbourne, VIC 3000, Australia
| | - Miranda E. Pitt
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
| | | | - Lachlan J.M. Coin
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
| |
Collapse
|
13
|
Li F, Dong S, Leier A, Han M, Guo X, Xu J, Wang X, Pan S, Jia C, Zhang Y, Webb GI, Coin LJM, Li C, Song J. Positive-unlabeled learning in bioinformatics and computational biology: a brief review. Brief Bioinform 2021; 23:6415313. [PMID: 34729589 DOI: 10.1093/bib/bbab461] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2021] [Revised: 09/27/2021] [Accepted: 10/07/2021] [Indexed: 12/14/2022] Open
Abstract
Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be potentially mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive or negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications to address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and further developing next-generation PU learning frameworks for critical biological applications.
Collapse
Affiliation(s)
- Fuyi Li
- Monash University, Australia
| | | | - André Leier
- Department of Genetics, UAB School of Medicine, USA
| | - Meiya Han
- Department of Biochemistry and Molecular Biology, Monash University, Australia
| | | | - Jing Xu
- Computer Science and Technology from Nankai University, China
| | - Xiaoyu Wang
- Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia
| | - Shirui Pan
- University of Technology Sydney (UTS), Ultimo, NSW, Australia
| | - Cangzhi Jia
- College of Science, Dalian Maritime University, Australia
| | - Yang Zhang
- Northwestern Polytechnical University, China
| | - Geoffrey I Webb
- Faculty of Information Technology at Monash University, Australia
| | - Lachlan J M Coin
- Department of Clinical Pathology, University of Melbourne, Australia
| | - Chen Li
- Biomedicine Discovery Institute and Department of Biochemistry of Molecular Biology, Monash University, Australia
| | - Jiangning Song
- Monash Biomedicine Discovery Institute, Monash University, Melbourne, Australia
| |
Collapse
|
14
|
Yu Y, He W, Jin J, Cui L, Zeng R, Wei L. iDNA-ABT : advanced deep learning model for detecting DNA methylation with adaptive features and transductive information maximization. Bioinformatics 2021; 37:4603-4610. [PMID: 34601568 DOI: 10.1093/bioinformatics/btab677] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2021] [Revised: 09/07/2021] [Accepted: 09/29/2021] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION DNA methylation plays an important role in epigenetic modification, the occurrence, and the development of diseases. Therefore, the identification of DNA methylation sites is critical for better understanding and revealing their functional mechanisms. To date, several machine learning and deep learning methods have been developed for the prediction of different methylation types. However, they still highly rely on manual features, which can largely limit the high-latent information extraction. Moreover, most of them are designed for one specific methylation type, and therefore cannot predict multiple methylation sites in multiple species simultaneously. In this study, we propose iDNA-ABT, an advanced deep learning model that utilizes adaptive embedding based on bidirectional transformers for language understanding together with a novel transductive information maximization (TIM) loss. RESULTS Benchmark results show that our proposed iDNA-ABT can automatically and adaptively learn the distinguishing features of biological sequences from multiple species, and thus perform significantly better than the state-of-the-art methods in predicting three different DNA methylation. In addition, TIM loss is proven to be effective in dichotomous tasks via the comparison experiment. Furthermore, we verify that our features have strong adaptability and robustness to different species through comparison of adaptive embedding and six handcrafted feature encodings. Importantly, our model shows great generalization ability in different species, demonstrating that our model can adaptively capture the cross-species differences and improve the predictive performance. For the convenient use of our method, we further established an online webserver as the implementation of the proposed iDNA-ABT. AVAILABILITY our proposed iDNA-ABT, which is now freely accessible via http://server.wei-group.net/iDNA_ABT and our source codes are available in the GitHub repository (https://github.com/YUYING07/iDNA_ABT). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yingying Yu
- School of Software, Shandong University, Jinan, China.,Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Wenjia He
- School of Software, Shandong University, Jinan, China.,Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Junru Jin
- School of Software, Shandong University, Jinan, China.,Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Lizhen Cui
- School of Software, Shandong University, Jinan, China.,Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Rao Zeng
- Department of Software Engineering, Xiamen University, Xiamen, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan, China.,Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| |
Collapse
|
15
|
Ao C, Gao L, Yu L. Research progress in predicting DNA methylation modifications and the relation with human diseases. Curr Med Chem 2021; 29:822-836. [PMID: 34533438 DOI: 10.2174/0929867328666210917115733] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2021] [Revised: 07/05/2021] [Accepted: 07/11/2021] [Indexed: 11/22/2022]
Abstract
DNA methylation is an important mode of regulation in epigenetic mechanisms, and it is one of the research foci in the field of epigenetics. DNA methylation modification affects a series of biological processes, such as eukaryotic cell growth, differentiation and transformation mechanisms, by regulating gene expression. In this review, we systematically summarized the DNA methylation databases, prediction tools for DNA methylation modification, machine learning algorithms for predicting DNA methylation modification, and the relationship between DNA methylation modification and diseases such as hypertension, Alzheimer's disease, diabetic nephropathy, and cancer. An in-depth understanding of DNA methylation mechanisms can promote accurate prediction of DNA methylation modifications and the treatment and diagnosis of related diseases.
Collapse
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Lin Gao
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| |
Collapse
|
16
|
Zulfiqar H, Sun ZJ, Huang QL, Yuan SS, Lv H, Dao FY, Lin H, Li YW. Deep-4mCW2V: A sequence-based predictor to identify N4-methylcytosine sites in Escherichia coli. Methods 2021; 203:558-563. [PMID: 34352373 DOI: 10.1016/j.ymeth.2021.07.011] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2021] [Revised: 07/22/2021] [Accepted: 07/29/2021] [Indexed: 10/20/2022] Open
Abstract
N4-methylcytosine (4mC) is a type of DNA modification which could regulate several biological progressions such as transcription regulation, replication and gene expressions. Precisely recognizing 4mC sites in genomic sequences can provide specific knowledge about their genetic roles. This study aimed to develop a deep learning-based model to predict 4mC sites in the Escherichia coli. In the model, DNA sequences were encoded by word embedding technique 'word2vec'. The obtained features were inputted into 1-D convolutional neural network (CNN) to discriminate 4mC sites from non-4mC sites in Escherichia coli genome. The examination on independent dataset showed that our model could yield the overall accuracy of 0.861, which was about 4.3% higher than the existing model. To provide convenience to scholars, we provided the data and source code of the model which can be freely download from https://github.com/linDing-groups/Deep-4mCW2V.
Collapse
Affiliation(s)
- Hasan Zulfiqar
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zi-Jie Sun
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Qin-Lai Huang
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Shi-Shi Yuan
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hao Lv
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Fu-Ying Dao
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hao Lin
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Yan-Wen Li
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China; Key Laboratory of Intelligent Information Processing of Jilin Province, Northeast Normal University, Changchun 130117, China; Institute of Computational Biology, Northeast Normal University, Changchun 130117, China.
| |
Collapse
|
17
|
Hunt C, Montgomery S, Berkenpas JW, Sigafoos N, Oakley JC, Espinosa J, Justice N, Kishaba K, Hippe K, Si D, Hou J, Ding H, Cao R. Recent Progress of Machine Learning in Gene Therapy. Curr Gene Ther 2021; 22:132-143. [PMID: 34161210 DOI: 10.2174/1566523221666210622164133] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2021] [Revised: 03/15/2021] [Accepted: 04/02/2021] [Indexed: 11/22/2022]
Abstract
With new developments in biomedical technology, it is now a viable therapeutic treatment to alter genes with techniques like CRISPR. At the same time, it is increasingly cheaper to do whole genome sequencing, resulting in rapid advancement in gene therapy and editing in precision medicine. Thus, understanding the current industry and academic applications of gene therapy provides an important backdrop to future scientific developments. Additionally, machine learning and artificial intelligence techniques allow for the reduction of time and money spent in the development of new gene therapy products and techniques. In this paper, we survey the current progress of gene therapy treatments for several diseases and explore machine learning applications in gene therapy. We also discuss the ethical implications of gene therapy and the use of machine learning in precision medicine. Machine learning and gene therapy are both topics gaining popularity in various publications, and we conclude that there is still room for continued research and application of machine learning techniques in the gene therapy field.
Collapse
Affiliation(s)
- Cassandra Hunt
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, United States
| | - Sandra Montgomery
- Department of Physics, Pacific Lutheran University, Tacoma, WA, United States
| | | | - Noel Sigafoos
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, United States
| | - John Christian Oakley
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, United States
| | - Jacob Espinosa
- Department of Mathematics, Pacific Lutheran University, Tacoma, WA, United States
| | - Nicola Justice
- Department of Mathematics, Pacific Lutheran University, Tacoma, WA, United States
| | - Kiyomi Kishaba
- Department of Humanities, Pacific Lutheran University, Tacoma, WA, United States
| | - Kyle Hippe
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, United States
| | - Dong Si
- Division of Computing Software Systems, University of Washington-Bothell, Bothell, WA, United States
| | - Jie Hou
- Department of Computer Science, Saint Louis University, St. Louis, MO, United States
| | - Hui Ding
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, United States
| |
Collapse
|
18
|
Zeng R, Cheng S, Liao M. 4mCPred-MTL: Accurate Identification of DNA 4mC Sites in Multiple Species Using Multi-Task Deep Learning Based on Multi-Head Attention Mechanism. Front Cell Dev Biol 2021; 9:664669. [PMID: 34041243 PMCID: PMC8141656 DOI: 10.3389/fcell.2021.664669] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2021] [Accepted: 03/17/2021] [Indexed: 01/10/2023] Open
Abstract
DNA methylation is one of the most extensive epigenetic modifications. DNA 4mC modification plays a key role in regulating chromatin structure and gene expression. In this study, we proposed a generic 4mC computational predictor, namely, 4mCPred-MTL using multi-task learning coupled with Transformer to predict 4mC sites in multiple species. In this predictor, we utilize a multi-task learning framework, in which each task is to train species-specific data based on Transformer. Extensive experimental results show that our multi-task predictive model can significantly improve the performance of the model based on single task and outperform existing methods on benchmarking comparison. Moreover, we found that our model can sufficiently capture better characteristics of 4mC sites as compared to existing commonly used feature descriptors, demonstrating the strong feature learning ability of our model. Therefore, based on the above results, it can be expected that our 4mCPred-MTL can be a useful tool for research communities of interest.
Collapse
Affiliation(s)
- Rao Zeng
- Department of Software Engineering, School of Informatics, Xiamen University, Xiamen, China
| | - Song Cheng
- Department of Thoracic Surgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Minghong Liao
- Department of Software Engineering, School of Informatics, Xiamen University, Xiamen, China
| |
Collapse
|
19
|
Zulfiqar H, Khan RS, Hassan F, Hippe K, Hunt C, Ding H, Song XM, Cao R. Computational identification of N4-methylcytosine sites in the mouse genome with machine-learning method. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:3348-3363. [PMID: 34198389 DOI: 10.3934/mbe.2021167] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/24/2023]
Abstract
N4-methylcytosine (4mC) is a kind of DNA modification which could regulate multiple biological processes. Correctly identifying 4mC sites in genomic sequences can provide precise knowledge about their genetic roles. This study aimed to develop an ensemble model to predict 4mC sites in the mouse genome. In the proposed model, DNA sequences were encoded by k-mer, enhanced nucleic acid composition and composition of k-spaced nucleic acid pairs. Subsequently, these features were optimized by using minimum redundancy maximum relevance (mRMR) with incremental feature selection (IFS) and five-fold cross-validation. The obtained optimal features were inputted into random forest classifier for discriminating 4mC from non-4mC sites in mouse. On the independent dataset, our model could yield the overall accuracy of 85.41%, which was approximately 3.8% -6.3% higher than the two existing models, i4mC-Mouse and 4mCpred-EL respectively. The data and source code of the model can be freely download from https://github.com/linDing-groups/model_4mc.
Collapse
Affiliation(s)
- Hasan Zulfiqar
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Rida Sarwar Khan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Farwa Hassan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Kyle Hippe
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
| | - Cassandra Hunt
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
| | - Hui Ding
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Xiao-Ming Song
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- School of Life Sciences, North China University of Science and Technology, Tangshan, Hebei 063210, China
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
| |
Collapse
|
20
|
Tang Q, Nie F, Kang J, Chen W. mRNALocater: Enhance the prediction accuracy of eukaryotic mRNA subcellular localization by using model fusion strategy. Mol Ther 2021; 29:2617-2623. [PMID: 33823302 DOI: 10.1016/j.ymthe.2021.04.004] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2020] [Revised: 03/23/2021] [Accepted: 03/31/2021] [Indexed: 02/07/2023] Open
Abstract
The functions of mRNAs are closely correlated with their locations in cells. Knowledge about the subcellular locations of mRNA is helpful to understand their biological functions. In recent years, it has become a hot topic to develop effective computational models to predict eukaryotic mRNA subcellular localizations. However, existing state-of-the-art models still have certain deficiencies in terms of prediction accuracy and generalization ability. Therefore, it is urgent to develop novel methods to accurately predict mRNA subcellular localizations. In this study, a novel method called mRNALocater was proposed to detect the subcellular localization of eukaryotic mRNA by adopting the model fusion strategy. To fully extract information from mRNA sequences, the electron-ion interaction pseudopotential and pseudo k-tuple nucleotide composition were used to encode the sequences. Moreover, the correlation coefficient filtering algorithm and feature forward search technology were used to mine hidden feature information, which guarantees that mRNALocater can be more effectively applied to new sequences. The results based on the independent dataset tests demonstrate that mRNALocater yields promising performances for predicting eukaryotic mRNA subcellular localizations and is a powerful tool in practical applications. A freely available online web server for mRNALocater has been established at http://bio-bigdata.cn/mRNALocater.
Collapse
Affiliation(s)
- Qiang Tang
- State Key Laboratory of Southwestern Chinese Medicine Resources, Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Fulei Nie
- School of Life Sciences, North China University of Science and Technology, Tangshan 063210, China; School of Public Health, North China University of Science and Technology, Tangshan 063210, China
| | - Juanjuan Kang
- Affiliated Foshan Maternity & Child Healthcare Hospital, Southern Medical University (Foshan Maternity & Child Healthcare Hospital), Foshan 528000, China
| | - Wei Chen
- State Key Laboratory of Southwestern Chinese Medicine Resources, Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China; School of Life Sciences, North China University of Science and Technology, Tangshan 063210, China; School of Public Health, North China University of Science and Technology, Tangshan 063210, China.
| |
Collapse
|
21
|
Khanal J, Tayara H, Zou Q, Chong KT. Identifying DNA N4-methylcytosine sites in the rosaceae genome with a deep learning model relying on distributed feature representation. Comput Struct Biotechnol J 2021; 19:1612-1619. [PMID: 33868598 PMCID: PMC8042287 DOI: 10.1016/j.csbj.2021.03.015] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2021] [Revised: 03/12/2021] [Accepted: 03/13/2021] [Indexed: 12/11/2022] Open
Abstract
DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly focused on finding hand-crafted features. This area of research, therefore, would benefit from the development of a computational approach that relies on automatic feature selection to identify relevant sites. We here report 4mC-w2vec, a computational method that learned automatic feature discrimination in the Rosaceae genomes, especially in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca), based on distributed feature representation and through the word embedding technique ‘word2vec’. While a few bioinformatics tools are currently employed to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system processed 4mC and non-4mC sites through a word embedding process, including sub-word information of its biological words through k-mer, which then served as features that were fed into a double layer of convolutional neural network (CNN) to classify whether the sample sequences contained 4mCs or non-4mCs sites. Our tool demonstrated performance superior to current tools that use the same genomic datasets. Additionally, 4mC-w2vec is effective for balanced and imbalanced class datasets alike, and the online web-server is currently available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/.
Collapse
Affiliation(s)
- Jhabindra Khanal
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
| | - Hilal Tayara
- School of international Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea.,Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
| |
Collapse
|
22
|
Wang H, Liang P, Zheng L, Long C, Li H, Zuo Y. eHSCPr discriminating the cell identity involved in endothelial to hematopoietic transition. Bioinformatics 2021; 37:2157-2164. [PMID: 33532815 DOI: 10.1093/bioinformatics/btab071] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2020] [Revised: 01/15/2021] [Accepted: 01/28/2021] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION Hematopoietic stem cells (HSCs) give rise to all blood cells and play a vital role throughout the whole lifespan through their pluripotency and self-renewal properties. Accurately identifying the stages of early HSCs is extremely important, as it may open up new prospects for extracorporeal blood research. Existing experimental techniques for identifying the early stages of HSCs development are time-consuming and expensive. Machine learning has shown its excellence in massive single-cell data processing and it is desirable to develop related computational models as good complements to experimental techniques. RESULTS In this study, we presented a novel predictor called eHSCPr specifically for predicting the early stages of HSCs development. To reveal the distinct genes at each developmental stage of HSCs, we compared F-score with three state-of-art differential gene selection methods (limma, DESeq2, edgeR) and evaluated their performance. F-score captured the more critical surface markers of endothelial cells and hematopoietic cells, and the area under receiver operating characteristic curve (ROC) value was 0.987. Based on SVM, the 10-fold cross-validation accuracy of eHSCpr in the independent dataset and the training dataset reached 94.84% and 94.19%, respectively. Importantly, we performed transcription analysis on the F-score gene set, which indeed further enriched the signal markers of HSCs development stages. eHSCPr can be a powerful tool for predicting early stages of HSCs development, facilitating hypothesis-driven experimental design and providing crucial clues for the in vitro blood regeneration studies. AVAILABILITY http://bioinfor.imu.edu.cn/ehscpr. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Hao Wang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Pengfei Liang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Lei Zheng
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - ChunShen Long
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - HanShuang Li
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| | - Yongchun Zuo
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
| |
Collapse
|
23
|
Ao C, Yu L, Zou Q. Prediction of bio-sequence modifications and the associations with diseases. Brief Funct Genomics 2020; 20:1-18. [PMID: 33313647 DOI: 10.1093/bfgp/elaa023] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2020] [Revised: 11/09/2020] [Accepted: 11/10/2020] [Indexed: 12/22/2022] Open
Abstract
Modifications of protein, RNA and DNA play an important role in many biological processes and are related to some diseases. Therefore, accurate identification and comprehensive understanding of protein, RNA and DNA modification sites can promote research on disease treatment and prevention. With the development of sequencing technology, the number of known sequences has continued to increase. In the past decade, many computational tools that can be used to predict protein, RNA and DNA modification sites have been developed. In this review, we comprehensively summarized the modification site predictors for three different biological sequences and the association with diseases. The relevant web server is accessible at http://lab.malab.cn/∼acy/PTM_data/ some sample data on protein, RNA and DNA modification can be downloaded from that website.
Collapse
|
24
|
Andrews TS, Kiselev VY, McCarthy D, Hemberg M. Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data. Nat Protoc 2020; 16:1-9. [PMID: 33288955 DOI: 10.1038/s41596-020-00409-w] [Citation(s) in RCA: 151] [Impact Index Per Article: 37.8] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2020] [Accepted: 09/08/2020] [Indexed: 01/01/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is a popular and powerful technology that allows you to profile the whole transcriptome of a large number of individual cells. However, the analysis of the large volumes of data generated from these experiments requires specialized statistical and computational methods. Here we present an overview of the computational workflow involved in processing scRNA-seq data. We discuss some of the most common tasks and the tools available for addressing central biological questions. In this article and our companion website ( https://scrnaseq-course.cog.sanger.ac.uk/website/index.html ), we provide guidelines regarding best practices for performing computational analyses. This tutorial provides a hands-on guide for experimentalists interested in analyzing their data as well as an overview for bioinformaticians seeking to develop new computational methods.
Collapse
Affiliation(s)
| | | | - Davis McCarthy
- Bioinformatics and Cellular Genomics, St Vincent's Institute of Medical Research, Fitzroy, Victoria, Australia.,Melbourne Integrative Genomics, Faculty of Science, University of Melbourne, Melbourne, Victoria, Australia
| | | |
Collapse
|
25
|
Identifying Heat Shock Protein Families from Imbalanced Data by Using Combined Features. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2020; 2020:8894478. [PMID: 33029195 PMCID: PMC7530508 DOI: 10.1155/2020/8894478] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/28/2020] [Revised: 09/08/2020] [Accepted: 09/14/2020] [Indexed: 11/29/2022]
Abstract
Heat shock proteins (HSPs) are ubiquitous in living organisms. HSPs are an essential component for cell growth and survival; the main function of HSPs is controlling the folding and unfolding process of proteins. According to molecular function and mass, HSPs are categorized into six different families: HSP20 (small HSPS), HSP40 (J-proteins), HSP60, HSP70, HSP90, and HSP100. In this paper, improved methods for HSP prediction are proposed—the split amino acid composition (SAAC), the dipeptide composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were selected to predict the HSPs with a support vector machine (SVM). In order to overcome the imbalance data classification problems, the syntactic minority oversampling technique (SMOTE) was used to balance the dataset. The overall accuracy was 99.72% with a balanced dataset in the jackknife test by using the optimized combination feature SAAC+DC+CTF+PseACS, which was 4.81% higher than the imbalanced dataset with the same combination feature. The Sn, Sp, Acc, and MCC of HSP families in our predictive model were higher than those in existing methods. This improved method may be helpful for protein function prediction.
Collapse
|
26
|
Manavalan B, Hasan MM, Basith S, Gosu V, Shin TH, Lee G. Empirical Comparison and Analysis of Web-Based DNA N 4-Methylcytosine Site Prediction Tools. MOLECULAR THERAPY. NUCLEIC ACIDS 2020; 22:406-420. [PMID: 33230445 PMCID: PMC7533314 DOI: 10.1016/j.omtn.2020.09.010] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/16/2020] [Accepted: 09/11/2020] [Indexed: 12/12/2022]
Abstract
DNA N4-methylcytosine (4mC) is a crucial epigenetic modification involved in various biological processes. Accurate genome-wide identification of these sites is critical for improving our understanding of their biological functions and mechanisms. As experimental methods for 4mC identification are tedious, expensive, and labor-intensive, several machine learning-based approaches have been developed for genome-wide detection of such sites in multiple species. However, the predictions projected by these tools are difficult to quantify and compare. To date, no systematic performance comparison of 4mC tools has been reported. The aim of this study was to compare and critically evaluate 12 publicly available 4mC site prediction tools according to species specificity, based on a huge independent validation dataset. The tools 4mCCNN (Escherichia coli), DNA4mC-LIP (Arabidopsis thaliana), iDNA-MS (Fragaria vesca), DNA4mC-LIP and 4mCCNN (Drosophila melanogaster), and four tools for Caenorhabditis elegans achieved excellent overall performance compared with their counterparts. However, none of the existing methods was suitable for Geoalkalibacter subterraneus, Geobacter pickeringii, and Mus musculus, thereby limiting their practical applicability. Model transferability to five species and non-transferability to three species are also discussed. The presented evaluation will assist researchers in selecting appropriate prediction tools that best suit their purpose and provide useful guidelines for the development of improved 4mC predictors in the future.
Collapse
Affiliation(s)
- Balachandran Manavalan
- Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea
| | - Md Mehedi Hasan
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, Fukuoka 820-8502, Japan.,Japan Society for the Promotion of Science, Chiyoda-ku, Tokyo 102-0083, Japan
| | - Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea
| | - Vijayakumar Gosu
- Department of Animal Biotechnology, Jeonbuk National University, Jeonju 54896, Republic of Korea
| | - Tae-Hwan Shin
- Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea
| | - Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea.,Department of Molecular Science and Technology, Ajou University, Suwon 16499, Republic of Korea
| |
Collapse
|
27
|
Tang Q, Nie F, Kang J, Chen W. ncPro-ML: An integrated computational tool for identifying non-coding RNA promoters in multiple species. Comput Struct Biotechnol J 2020; 18:2445-2452. [PMID: 33005306 PMCID: PMC7509369 DOI: 10.1016/j.csbj.2020.09.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2020] [Revised: 08/30/2020] [Accepted: 09/01/2020] [Indexed: 02/07/2023] Open
Abstract
A computational method for identifying non-coding promoters was proposed for the first time. A high-quality dataset was built to train and test the models for identifying non-coding promoters. A user-friendly web server was developed to recognize non-coding promoters.
The promoter is located near the transcription start sites and regulates transcription initiation of the gene. Accurate identification of promoters is essential for understanding the mechanism of gene regulation. Since experimental methods are costly and ineffective, developing efficient and accurate computational tools to identify promoters are necessary. Although a series of methods have been proposed for identifying promoters, none of them is able to identify the promoters of non-coding RNA (ncRNA). In the present work, a new method called ncPro-ML was proposed to identify the promoter of ncRNA in Homo sapiens and Mus musculus, in which different kinds of sequence encoding schemes were used to convert DNA sequences into feature vectors. To test the length effect, for each species, datasets including sequences with different lengths were built. The results demonstrated that ncPro-ML achieved the best performance based on the dataset with the sequence length of 221 nucleotides for human and mouse. The performances of ncPro-ML were also satisfying from both independent dataset test and cross-species test. The results indicate that the proposed predictor can server as a powerful tool for the discovery of ncRNA promoters. In addition, a web-server for ncPro-ML was developed, which can be freely accessed at http://www.bio-bigdata.cn/ncPro-ML/.
Collapse
Affiliation(s)
- Qiang Tang
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Fulei Nie
- Center for Genomics and Computational Biology, Scholl of Life Sciences, North China University of Science and Technology, Tangshan 063210, China
- School of Public Health, North China University of Science and Technology, Tangshan 063210, China
| | - Juanjuan Kang
- Affiliated Foshan Maternity & Child Healthcare Hospital, Southern Medical University (Foshan Maternity & Child Healthcare Hospital), Foshan 528000, China
| | - Wei Chen
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
- Center for Genomics and Computational Biology, Scholl of Life Sciences, North China University of Science and Technology, Tangshan 063210, China
- School of Public Health, North China University of Science and Technology, Tangshan 063210, China
- Corresponding author: Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China.
| |
Collapse
|