51. Le NQK, Ho QT, Nguyen TTD, Ou YY. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform 2021; 22:6128847. [PMID: 33539511 DOI: 10.1093/bib/bbab005] [Citation(s) in RCA: 82] [Impact Index Per Article: 27.3] [Received: 12/18/2020] [Revised: 01/01/2021] [Accepted: 01/03/2021] [Indexed: 01/11/2023]
Abstract
Recently, language representation models have drawn considerable attention in natural language processing because of their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple yet powerful language model that achieved state-of-the-art performance. BERT uses contextualized word embeddings to capture the semantics of words and the context in which they appear. In this study, we present a novel technique that incorporates a BERT-based multilingual model into bioinformatics to represent the information in DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, a well-known and challenging problem in this field. Our BERT-based features improved sensitivity, specificity, accuracy and Matthews correlation coefficient by more than 5-10% compared with the current state-of-the-art features in bioinformatics. Moreover, additional experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) has the potential to learn BERT features better than traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
Affiliation(s)
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
- Quang-Thai Ho
  - College of Information and Communication Technology, Can Tho University, Vietnam
- Yu-Yen Ou
  - Department of Computer Science and Engineering, Yuan Ze University, Taiwan
52. Wu F, Yang R, Zhang C, Zhang L. A deep learning framework combined with word embedding to identify DNA replication origins. Sci Rep 2021; 11:844. [PMID: 33436981 PMCID: PMC7804333 DOI: 10.1038/s41598-020-80670-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/23/2020] [Accepted: 12/24/2020] [Indexed: 01/29/2023]
Abstract
DNA replication influences the inheritance of genetic information in the DNA life cycle. Because the distribution of replication origins (ORIs) is the major determinant of precise regulation of the replication process, correct identification of ORIs is important for understanding DNA replication mechanisms and the regulatory mechanisms of gene expression. In eukaryotes in particular, multiple ORIs exist in each gene sequence so that replication completes in a reasonable period of time. To simplify the identification of eukaryotic ORIs, most existing methods rely on traditional machine learning algorithms and target gene sequences of a fixed length. Consequently, the identification results are unsatisfactory and leave considerable room for improvement. To overcome the limitations of previous studies, this paper develops sequence segmentation methods and employs the word embedding technique 'Word2vec' to convert gene sequences into word vectors, thereby capturing the inner correlations of gene sequences of different lengths. A deep learning framework for the ORI identification task is then constructed from a convolutional neural network with an embedding layer. Analysis of the dimensionality-reduced similarity diagram indicates that Word2vec can effectively transform the inner relationships among words into numerical features. For the four species in this study, the best models achieve overall accuracies of 0.975, 0.765, 0.885 and 0.967, Matthews correlation coefficients of 0.940, 0.530, 0.771 and 0.934, and AUCs of 0.975, 0.800, 0.888 and 0.981, respectively, which indicates that the proposed predictor is stable and classifies both ORIs and non-ORIs with high confidence. Compared with state-of-the-art methods, the proposed predictor achieves ORI identification with significant improvement. It is therefore reasonable to anticipate that the proposed method will become a useful high-throughput tool for genome analysis.
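The segmentation step that precedes word embedding can be pictured with a small sketch. The abstract does not state how sequences are segmented, so the k-mer length, the stride, and the function names below (`segment_kmers`, `build_vocabulary`, `encode`) are illustrative assumptions rather than the authors' method. The sketch only shows how variable-length sequences might become fixed-length lists of word indices suitable for an embedding layer.

```python
def segment_kmers(sequence, k=3, stride=1):
    """Split a DNA sequence into k-mer "words".

    k and stride are illustrative choices, not values from the paper.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocabulary(sentences):
    """Map each distinct word to an integer id; id 0 is reserved for padding."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode(sentence, vocab, length):
    """Convert words to ids, truncating or zero-padding to a fixed length.

    Words missing from the vocabulary also map to 0 in this simple sketch.
    """
    ids = [vocab.get(word, 0) for word in sentence][:length]
    return ids + [0] * (length - len(ids))


if __name__ == "__main__":
    words = segment_kmers("ATGCA", k=3)
    vocab = build_vocabulary([words])
    print(words)
    print(encode(words, vocab, length=5))
```

The resulting index lists could be fed to a Word2vec trainer or to the embedding layer of a CNN; that part is left out here because the abstract gives no configuration details.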
Affiliation(s)
- Feng Wu
  - School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
- Runtao Yang
  - School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
- Chengjin Zhang
  - School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
- Lina Zhang
  - School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
53. Shujaat M, Wahab A, Tayara H, Chong KT. pcPromoter-CNN: A CNN-Based Prediction and Classification of Promoters. Genes (Basel) 2020; 11:genes11121529. [PMID: 33371507 PMCID: PMC7767505 DOI: 10.3390/genes11121529] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 11/19/2020] [Revised: 12/11/2020] [Accepted: 12/18/2020] [Indexed: 01/13/2023]
Abstract
A promoter is a small region within the DNA structure that has an important role in initiating transcription of a specific gene in the genome. Different types of promoters are recognized by their different functions. Because of the importance of promoter functions, computational tools for the prediction and classification of promoters are highly desired. Promoters resemble each other; therefore, their precise classification is an important challenge. In this study, we propose a convolutional neural network (CNN)-based tool, pcPromoter-CNN, for the prediction of promoters and their classification into the subclasses σ70, σ54, σ38, σ32, σ28 and σ24. This CNN-based tool uses a one-hot encoding scheme for promoter classification. The tool's architecture was trained and tested on a benchmark dataset. To evaluate its classification performance, we used four evaluation metrics. The model exhibited notable improvement over existing state-of-the-art tools.
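For readers unfamiliar with one-hot encoding of DNA, a minimal sketch follows. The abstract does not describe the exact scheme, so the A, C, G, T column order, the all-zero row for ambiguous bases, and the function name `one_hot_encode` are assumptions made for illustration only.

```python
ALPHABET = "ACGT"  # assumed column order; the paper may use a different one


def one_hot_encode(sequence):
    """Return one row per base with a single 1 in that base's column.

    Characters outside ALPHABET (for example N) become an all-zero row,
    which is one common convention and an assumption here.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    matrix = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    for row in one_hot_encode("ACGTN"):
        print(row)
```

A sequence of length L becomes an L x 4 matrix, which is the kind of input a 1D or 2D convolutional layer can consume directly.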
Affiliation(s)
- Muhammad Shujaat
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
  - Department of Computer Sciences, Bahria University, Lahore 54000, Pakistan
- Abdul Wahab
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Korea
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
  - Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
- Correspondence: H.T. and K.T.C.
54. Le NQK, Do DT, Hung TNK, Lam LHT, Huynh TT, Nguyen NTK. A Computational Framework Based on Ensemble Deep Neural Networks for Essential Genes Identification. Int J Mol Sci 2020; 21:E9070. [PMID: 33260643 PMCID: PMC7730808 DOI: 10.3390/ijms21239070] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 10/04/2020] [Revised: 11/25/2020] [Accepted: 11/26/2020] [Indexed: 01/13/2023]
Abstract
Essential genes contain key information about genomes and could be central to a comprehensive understanding of life and evolution. Because of their importance, the study of essential genes is considered a crucial problem in computational biology. Computational methods for identifying essential genes have become increasingly popular as a way to reduce the cost and time of traditional experiments. A few models have addressed this problem, but performance is still not satisfactory because of high-dimensional features and the use of traditional machine learning algorithms. Thus, there is a need for a novel model that improves predictive performance on this problem using DNA sequence features. This study took advantage of a natural language processing (NLP) model to learn biological sequences by treating them as natural language words. To learn the NLP features, a supervised ensemble deep neural network was then employed. Our proposed method could identify essential genes with sensitivity, specificity, accuracy, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC) values of 60.2%, 84.6%, 76.3%, 0.449, and 0.814, respectively. It outperformed the single models without ensembling, as well as the state-of-the-art predictors on the same benchmark dataset. This indicates the effectiveness of the proposed method for determining essential genes in particular and for other sequence-based problems in general.
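The threshold-based metrics named in this abstract follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are made up for illustration and are not data from the paper. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, accuracy and MCC from confusion counts.

    Ratios with a zero denominator are reported as 0.0 in this sketch.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)  # true positive rate
    specificity = ratio(tn, tn + fp)  # true negative rate
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_denominator)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the formulas.
    print(binary_metrics(tp=6, fp=2, tn=8, fn=4))
```

MCC takes all four cells into account, which is why it is often reported alongside accuracy when the positive and negative classes are imbalanced.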
Affiliation(s)
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
- Duyen Thi Do
  - Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei 106, Taiwan
- Truong Nguyen Khanh Hung
  - International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
  - Department of Orthopedic and Trauma, Cho Ray Hospital, Ho Chi Minh 70000, Vietnam
- Luu Ho Thanh Lam
  - International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
  - Intensive Care Unit, Children’s Hospital 2, Ho Chi Minh 70000, Vietnam
- Tuan-Tu Huynh
  - Department of Electrical Engineering, Yuan Ze University, Taoyuan 320, Taiwan
  - Department of Electrical Electronic and Mechanical Engineering, Lac Hong University, Dong Nai 76120, Vietnam
- Ngan Thi Kim Nguyen
  - School of Nutrition and Health Sciences, Taipei Medical University, Taipei 110, Taiwan
55. Kwiecien K, Brzoza P, Bak M, Majewski P, Skulimowska I, Bednarczyk K, Cichy J, Kwitniewski M. The methylation status of the chemerin promoter region located from -252 to +258 bp regulates constitutive but not acute-phase cytokine-inducible chemerin expression levels. Sci Rep 2020; 10:13702. [PMID: 32792625 PMCID: PMC7426834 DOI: 10.1038/s41598-020-70625-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/25/2019] [Accepted: 07/29/2020] [Indexed: 12/05/2022]
Abstract
Chemerin is a chemoattractant protein with adipokine properties encoded by the retinoic acid receptor responder 2 (RARRES2) gene. It has gained more attention in the past few years because of its multilevel impact on metabolism and immune responses. However, the mechanisms controlling the constitutive and regulated expression of RARRES2 in a variety of cell types remain obscure. To our knowledge, this report is the first to show that DNA methylation plays an important role in the cell-specific expression of RARRES2 in adipocytes, hepatocytes, and B lymphocytes. Using luciferase reporter assays, we determined the proximal fragment of the RARRES2 gene promoter, located from -252 to +258 bp, to be a key regulator of transcription. Moreover, we showed that chemerin expression is regulated in murine adipocytes by the acute-phase cytokines interleukin 1β (IL-1β) and oncostatin M (OSM). In contrast with adipocytes, these cytokines elicited little, if any, response in mouse hepatocytes, suggesting that the effects of IL-1β and OSM on chemerin expression are specific to fat tissue. Together, our findings highlight previously uncharacterized mediators and mechanisms that control chemerin expression.
Affiliation(s)
- Kamila Kwiecien
  - Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland
- Piotr Brzoza
  - Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland
- Maciej Bak
  - Swiss Institute of Bioinformatics, Biozentrum, University of Basel, 4056, Basel, Switzerland
- Pawel Majewski
  - Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland
- Izabella Skulimowska
  - Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland
- Kamil Bednarczyk
  - Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland
- Joanna Cichy
  - Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland
- Mateusz Kwitniewski
  - Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, 30-387, Krakow, Poland
56. Do DT, Le TQT, Le NQK. Using deep neural networks and biological subwords to detect protein S-sulfenylation sites. Brief Bioinform 2020; 22:5866114. [PMID: 32613242 DOI: 10.1093/bib/bbaa128] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 03/12/2020] [Revised: 05/11/2020] [Accepted: 05/26/2020] [Indexed: 12/11/2022]
Abstract
Protein S-sulfenylation is a crucial post-translational modification (PTM) in which the hydroxyl group covalently binds to the thiol of cysteine. Recent studies have shown that this modification plays an important role in signal transduction, transcriptional regulation and apoptosis. To date, the dynamics of sulfenic acids in proteins remain unclear because of their fleeting nature. Identifying S-sulfenylation sites could therefore be the key to deciphering their structures and functions, which are important in cell biology and disease. However, because effective methods are lacking, scientists in this field tend to be limited to a handful of wet-lab techniques that are time-consuming and not cost-effective. This motivated us to develop an in silico model for detecting S-sulfenylation sites from protein sequence information alone. In this study, protein sequences served as natural language sentences comprising biological subwords. A deep neural network was then employed to perform classification. On the independent dataset, sensitivity, specificity, accuracy, Matthews correlation coefficient and area under the curve reached 85.71%, 69.47%, 77.09%, 0.5554 and 0.833, respectively. Our results suggest that the proposed method (fastSulf-DNN) achieved excellent performance in predicting S-sulfenylation sites compared with other well-known tools on a benchmark dataset.
Affiliation(s)
- Duyen Thi Do
  - Faculty of Applied Sciences, Ton Duc Thang University
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University