1
Peng B, Sun G, Fan Y. iProL: identifying DNA promoters from sequence information based on Longformer pre-trained model. BMC Bioinformatics 2024; 25:224. [PMID: 38918692 PMCID: PMC11201334 DOI: 10.1186/s12859-024-05849-9] [Received: 04/29/2024] [Accepted: 06/19/2024] [Indexed: 06/27/2024]
Abstract
Promoters are essential DNA sequence elements, usually located in the immediate vicinity of gene transcription start sites, and they play a critical role in the regulation of gene transcription. Their importance in molecular biology and genetics has attracted considerable research interest, and there is now a consensus on the need for computational methods that identify promoters efficiently. Still, existing methods recognize positive and negative samples unevenly, and their overall recognition performance can be further improved. We studied E. coli promoters and propose a more advanced prediction model, iProL, based on the Longformer pre-trained model from the field of natural language processing. iProL does not rely on prior biological knowledge; it simply treats promoter DNA sequences as plain text to identify promoters. It also combines one-dimensional convolutional neural networks and bidirectional long short-term memory to extract both local and global features. Experimental results show that iProL performs better, and in a more balanced way, than currently published methods. Additionally, we constructed a novel independent test set following the previous specification and compared iProL with three existing methods on this test set.
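As a rough illustration of what treating a DNA sequence as plain text can look like, the sketch below splits a sequence into overlapping k-mers and joins them into a space-separated "sentence". The abstract does not state how iProL tokenizes its input, so the k-mer scheme, the function names `kmer_tokens` and `as_sentence`, and the default `k=3` are assumptions made here for illustration only.

```python
def kmer_tokens(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA string into overlapping k-mers with stride 1.

    Hypothetical helper: the tokenization used by iProL is not described
    in the abstract, so this is only one plausible text representation.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields an empty token list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def as_sentence(sequence: str, k: int = 3) -> str:
    """Join the k-mers with spaces so a text model can read them as words."""
    return " ".join(kmer_tokens(sequence, k))


if __name__ == "__main__":
    print(as_sentence("TTGACAATTA", k=3))
    # → TTG TGA GAC ACA CAA AAT ATT TTA
```

A sentence built this way could then be passed to any text tokenizer; whether that matches the preprocessing in the paper would need to be checked against the full text.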
Affiliation(s)
- Binchao Peng
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
- Guicong Sun
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
- Yongxian Fan
  - School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
2
Paul S, Olymon K, Martinez GS, Sarkar S, Yella VR, Kumar A. MLDSPP: Bacterial Promoter Prediction Tool Using DNA Structural Properties with Machine Learning and Explainable AI. J Chem Inf Model 2024; 64:2705-2719. [PMID: 38258978 DOI: 10.1021/acs.jcim.3c02017] [Indexed: 01/24/2024]
Abstract
Bacterial promoters play a crucial role in gene expression by serving as docking sites for the transcription initiation machinery. However, accurately identifying promoter regions in bacterial genomes remains a challenge due to their diverse architecture and variations. In this study, we propose MLDSPP (Machine Learning and Duplex Stability based Promoter prediction in Prokaryotes), a machine learning-based promoter prediction tool, to comprehensively screen bacterial promoter regions in 12 diverse genomes. We leveraged biologically relevant and informative DNA structural properties, such as DNA duplex stability and base stacking, and state-of-the-art machine learning (ML) strategies to gain insights into promoter characteristics. We evaluated several machine learning models, including Support Vector Machines, Random Forests, and XGBoost, and assessed their performance using accuracy, precision, recall, specificity, F1 score, and MCC metrics. Our findings reveal that XGBoost outperformed other models and current state-of-the-art promoter prediction tools, namely Sigma70pred and iPromoter2L, achieving F1-scores >95% in most systems. Significantly, the use of one-hot encoding for representing nucleotide sequences complements these structural features, enhancing our XGBoost model's predictive capabilities. To address the challenge of model interpretability, we incorporated explainable AI techniques using Shapley values. This enhancement allows for a better understanding and interpretation of the predictions of our model. In conclusion, our study presents MLDSPP as a novel, generic tool for predicting promoter regions in bacteria, utilizing original downstream sequences as nonpromoter controls. This tool has the potential to significantly advance the field of bacterial genomics and contribute to our understanding of gene regulation in diverse bacterial systems.
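The abstract names two generic building blocks that are easy to show in code: one-hot encoding of nucleotides and the evaluation metrics (accuracy, precision, recall, specificity, F1 score, MCC). The following is a minimal sketch using the standard textbook definitions; it is not the MLDSPP implementation, and the function names `one_hot` and `binary_metrics`, the A/C/G/T column order, and the all-zero encoding of ambiguous bases are assumptions introduced here.

```python
import math

ALPHABET = "ACGT"  # assumed column order for the indicator vectors


def one_hot(sequence: str) -> list[list[int]]:
    """Encode each nucleotide as a 4-element indicator vector.

    Symbols outside ACGT (for example N) become all zeros; that handling
    is a choice made for this sketch, not something the abstract specifies.
    """
    return [[1 if base == ref else 0 for ref in ALPHABET]
            for base in sequence.upper()]


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute standard binary-classification metrics from confusion counts."""
    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of dividing by zero for degenerate cases.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # also called sensitivity
    specificity = ratio(tn, tn + fp)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    print(one_hot("AC"))
    print(binary_metrics(tp=8, fp=2, tn=8, fn=2))
```

Reporting specificity and MCC alongside F1 is useful for promoter prediction because a classifier can reach a high F1 on promoters while still misclassifying many non-promoter controls.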
Affiliation(s)
- Subhojit Paul
  - Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
- Kaushika Olymon
  - Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
- Gustavo Sganzerla Martinez
  - Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada
  - Pediatrics, Izaak Walton Killam (IWK) Health Center, Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada
- Sharmilee Sarkar
  - Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
- Venkata Rajesh Yella
  - Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Guntur 522302, Andhra Pradesh, India
- Aditya Kumar
  - Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
3
Zhang Z, Huo J, Velo J, Zhou H, Flaherty A, Saier MH. Comprehensive Characterization of fucAO Operon Activation in Escherichia coli. Int J Mol Sci 2024; 25:3946. [PMID: 38612757 PMCID: PMC11011485 DOI: 10.3390/ijms25073946] [Received: 02/14/2024] [Revised: 03/26/2024] [Accepted: 03/29/2024] [Indexed: 04/14/2024]
Abstract
Wildtype Escherichia coli cells cannot grow on L-1,2-propanediol, as the fucAO operon within the fucose (fuc) regulon is thought to be silent in the absence of L-fucose. Little information is available concerning the transcriptional regulation of this operon. Here, we first confirm that fucAO operon expression is highly inducible by fucose and is primarily attributable to the upstream operon promoter, while the fucO promoter within the 3'-end of fucA is weak and uninducible. Using 5'RACE, we identify the actual transcriptional start site (TSS) of the main fucAO operon promoter, refuting the originally proposed TSS. Several lines of evidence are provided showing that the fucAO locus is within a transcriptionally repressed region on the chromosome. Operon activation is dependent on FucR and Crp but not SrsR. Two Crp-cAMP binding sites previously found in the regulatory region are validated, where the upstream site plays a more critical role than the downstream site in operon activation. Furthermore, two FucR binding sites are identified, where the downstream site near the first Crp site is more important than the upstream site. Operon transcription relies on Crp-cAMP to a greater degree than on FucR. Our data strongly suggest that FucR mainly functions to facilitate the binding of Crp to its upstream site, which in turn activates the fucAO promoter by efficiently recruiting RNA polymerase.
Affiliation(s)
- Zhongge Zhang
  - Department of Molecular Biology, School of Biological Sciences, University of California at San Diego, 9500 Gilman Dr, La Jolla, CA 92093-0116, USA; (J.H.); (J.V.); (A.F.)
- Milton H. Saier
  - Department of Molecular Biology, School of Biological Sciences, University of California at San Diego, 9500 Gilman Dr, La Jolla, CA 92093-0116, USA; (J.H.); (J.V.); (A.F.)
4
Skutel M, Andriianov A, Zavialova M, Kirsanova M, Shodunke O, Zorin E, Golovshchinskii A, Severinov K, Isaev A. T5-like phage BF23 evades host-mediated DNA restriction and methylation. MICROLIFE 2023; 4:uqad044. [PMID: 38025991 PMCID: PMC10644984 DOI: 10.1093/femsml/uqad044] [Received: 07/20/2023] [Revised: 10/15/2023] [Accepted: 10/25/2023] [Indexed: 12/01/2023]
Abstract
Bacteriophage BF23 is a close relative of phage T5, a prototypical Tequintavirus that infects Escherichia coli. BF23 was isolated in the middle of the 20th century and was extensively studied as a model system. Like T5, BF23 carries long ∼9.7 kb terminal repeats, injects its genome into the infected cell in a two-stage process, and carries multiple specific nicks in its double-stranded genomic DNA. The two phages rely on different host secondary receptors: FhuA (T5) and BtuB (BF23). Only short fragments of the BF23 genome, including the region encoding receptor-interacting proteins, have been determined. Here, we report the full genomic sequence of BF23 and describe the protein content of its virion. T5-like phages represent a unique group that resists restriction by most nuclease-based host immunity systems. We show that BF23, like other Tequintavirus phages, resists Type I/II/III restriction-modification host immunity systems if their recognition sites are located outside the terminal repeats. We also demonstrate that BF23 avoids host-mediated methylation. We propose that inhibition of methylation is a common feature of phages of the Tequintavirus and Epseptimavirus genera that is not, however, associated with their antirestriction activity.
Affiliation(s)
- Mikhail Skutel
  - Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30/1, 143028, Moscow, Russia
- Aleksandr Andriianov
  - Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30/1, 143028, Moscow, Russia
- Maria Zavialova
  - Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30/1, 143028, Moscow, Russia
  - Institute of Biomedical Chemistry (IBMC), Pogodinskaya 10/8, 119435, Moscow, Russia
- Maria Kirsanova
  - Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30/1, 143028, Moscow, Russia
- Oluwasefunmi Shodunke
  - Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30/1, 143028, Moscow, Russia
  - Moscow Institute of Physics and Technology, Institutskiy Pereulok 9, 141701, Dolgoprudny, Russia
- Evgenii Zorin
  - Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30/1, 143028, Moscow, Russia
- Konstantin Severinov
  - Waksman Institute of Microbiology, 190 Frelinghuysen Rd, NJ 08854, Piscataway, United States
- Artem Isaev
  - Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30/1, 143028, Moscow, Russia
5
Lin Y, Sun M, Zhang J, Li M, Yang K, Wu C, Zulfiqar H, Lai H. Computational identification of promoters in Klebsiella aerogenes by using support vector machine. Front Microbiol 2023; 14:1200678. [PMID: 37250059 PMCID: PMC10215528 DOI: 10.3389/fmicb.2023.1200678] [Received: 04/05/2023] [Accepted: 04/18/2023] [Indexed: 05/31/2023]
Abstract
Promoters are the basic functional cis-elements to which RNA polymerase binds to initiate gene transcription. A comprehensive understanding of gene expression and regulation depends on the precise identification of promoters, as they are the most important component of gene expression. This study aimed to develop a machine learning-based model to predict promoters in Klebsiella aerogenes (K. aerogenes). In the prediction model, the promoter sequences in the K. aerogenes genome were encoded by pseudo k-tuple nucleotide composition (PseKNC) and a position-correlation scoring function (PCSF). Numerical features were obtained and then optimized using mRMR in combination with a support vector machine (SVM) and 5-fold cross-validation (CV). Subsequently, these optimized features were fed into an SVM-based classifier to discriminate promoter sequences from non-promoter sequences in K. aerogenes. Results of 10-fold CV showed that the model achieved an overall accuracy of 96.0% and an area under the ROC curve (AUC) of 0.990. We hope that this model will aid the study of promoters and gene regulation in K. aerogenes.
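To give a sense of the kind of numerical feature that a composition-based encoding produces, the sketch below computes plain k-tuple nucleotide frequencies for a sequence. This covers only the ordinary composition part that PseKNC builds on: the pseudo (sequence-order correlation) components, the PCSF features, and the mRMR selection step described in the abstract are not implemented here. The function name `ktuple_composition` and the default `k=2` are assumptions made for illustration.

```python
from itertools import product


def ktuple_composition(sequence: str, k: int = 2) -> dict[str, float]:
    """Return the normalized frequency of every possible k-tuple over ACGT.

    Simplified illustration only: this is the plain k-tuple composition,
    not the full pseudo k-tuple nucleotide composition (PseKNC).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # Enumerate all 4**k tuples so the feature vector has a fixed length.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # windows with ambiguous bases are skipped
            counts[window] += 1
            total += 1
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


if __name__ == "__main__":
    features = ktuple_composition("ACGT", k=2)
    print(len(features))                      # 16 dinucleotide features
    print(features["AC"], features["AA"])     # AC occurs once in three windows
```

A fixed-length vector of this kind is what would typically be handed to a feature selector and then to a classifier such as an SVM.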
Affiliation(s)
- Yan Lin
  - Key Laboratory for Animal Disease-Resistance Nutrition of the Ministry of Agriculture, Animal Nutrition Institute, Sichuan Agricultural University, Chengdu, China
- Meili Sun
  - Beidahuang Industry Group General Hospital, Harbin, China
- Junjie Zhang
  - Key Laboratory for Animal Disease-Resistance Nutrition of the Ministry of Agriculture, Animal Nutrition Institute, Sichuan Agricultural University, Chengdu, China
- Mingyan Li
  - Chifeng Product Quality Inspection and Testing Centre, Chifeng, China
- Keli Yang
  - Nonlinear Research Institute, Baoji University of Arts and Sciences, Baoji, China
- Chengyan Wu
  - Baotou Teacher’s College, Inner Mongolia University of Science and Technology, Baotou, China
- Hasan Zulfiqar
  - Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China
- Hongyan Lai
  - Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China