1
Paul S, Olymon K, Martinez GS, Sarkar S, Yella VR, Kumar A. MLDSPP: Bacterial Promoter Prediction Tool Using DNA Structural Properties with Machine Learning and Explainable AI. J Chem Inf Model 2024; 64:2705-2719. [PMID: 38258978 DOI: 10.1021/acs.jcim.3c02017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
Bacterial promoters play a crucial role in gene expression by serving as docking sites for the transcription initiation machinery. However, accurately identifying promoter regions in bacterial genomes remains a challenge due to their diverse architecture and variations. In this study, we propose MLDSPP (Machine Learning and Duplex Stability based Promoter prediction in Prokaryotes), a machine learning-based promoter prediction tool, to comprehensively screen bacterial promoter regions in 12 diverse genomes. We leveraged biologically relevant and informative DNA structural properties, such as DNA duplex stability and base stacking, and state-of-the-art machine learning (ML) strategies to gain insights into promoter characteristics. We evaluated several machine learning models, including Support Vector Machines, Random Forests, and XGBoost, and assessed their performance using accuracy, precision, recall, specificity, F1 score, and MCC metrics. Our findings reveal that XGBoost outperformed other models and current state-of-the-art promoter prediction tools, namely Sigma70pred and iPromoter2L, achieving F1-scores >95% in most systems. Significantly, the use of one-hot encoding for representing nucleotide sequences complements these structural features, enhancing our XGBoost model's predictive capabilities. To address the challenge of model interpretability, we incorporated explainable AI techniques using Shapley values. This enhancement allows for a better understanding and interpretation of the predictions of our model. In conclusion, our study presents MLDSPP as a novel, generic tool for predicting promoter regions in bacteria, utilizing original downstream sequences as nonpromoter controls. This tool has the potential to significantly advance the field of bacterial genomics and contribute to our understanding of gene regulation in diverse bacterial systems.
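The abstract above names one-hot sequence encoding and six evaluation metrics without defining them. The following is a minimal sketch of both, not the MLDSPP implementation: the function names, the A/C/G/T column order, and the handling of ambiguous bases are assumptions made for illustration.

```python
import math

BASES = "ACGT"  # assumed column order for the encoding


def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 rows (A, C, G, T).

    Any other symbol (for example N) becomes an all-zero row; that choice
    is an assumption of this sketch.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts."""

    def safe_div(num, den):
        # Return 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # also called sensitivity
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }
```

MCC is worth reporting alongside F1 because it also accounts for true negatives, which matters when promoter and non-promoter sets differ in size.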
Affiliation(s)
- Subhojit Paul
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
- Kaushika Olymon
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
- Gustavo Sganzerla Martinez
- Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada
- Pediatrics, Izaak Walton Killam (IWK) Health Center, Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada
- Sharmilee Sarkar
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
- Venkata Rajesh Yella
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Guntur 522302, Andhra Pradesh, India
- Aditya Kumar
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
2
Bétermier M, Klobutcher LA, Orias E. Programmed chromosome fragmentation in ciliated protozoa: multiple means to chromosome ends. Microbiol Mol Biol Rev 2023; 87:e0018422. [PMID: 38009915 PMCID: PMC10732028 DOI: 10.1128/mmbr.00184-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023] Open
Abstract
SUMMARY: Ciliated protozoa undergo large-scale developmental rearrangement of their somatic genomes when forming a new transcriptionally active macronucleus during conjugation. This process includes the fragmentation of chromosomes derived from the germline, coupled with the efficient healing of the broken ends by de novo telomere addition. Here, we review what is known of developmental chromosome fragmentation in ciliates that have been well-studied at the molecular level (Tetrahymena, Paramecium, Euplotes, Stylonychia, and Oxytricha). These organisms differ substantially in the fidelity and precision of their fragmentation systems, as well as in the presence or absence of well-defined sequence elements that direct excision, suggesting that chromosome fragmentation systems have evolved multiple times and/or have been significantly altered during ciliate evolution. We propose a two-stage model for the evolution of the current ciliate systems, with both stages involving repetitive or transposable elements in the genome. The ancestral form of chromosome fragmentation is proposed to have been derived from the ciliate small RNA/chromatin modification process that removes transposons and other repetitive elements from the macronuclear genome during development. The evolution of this ancestral system is suggested to have potentiated its replacement in some ciliate lineages by subsequent fragmentation systems derived from mobile genetic elements.
Affiliation(s)
- Mireille Bétermier
- Department of Genome Biology, Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell, Gif-sur-Yvette, France
- Lawrence A. Klobutcher
- Department of Molecular Biology and Biophysics, UCONN Health (University of Connecticut), Farmington, Connecticut, USA
- Eduardo Orias
- Department of Molecular, Cellular, and Developmental Biology, University of California, Santa Barbara, California, USA
3
Kari H, Bandi SMS, Kumar A, Yella VR. DeePromClass: Delineator for Eukaryotic Core Promoters Employing Deep Neural Networks. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:802-807. [PMID: 35353704 DOI: 10.1109/tcbb.2022.3163418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Computational promoter identification in eukaryotes is a classical biological problem that should be refurbished with the availability of an avalanche of experimental data and emerging deep learning technologies. The current knowledge indicates that eukaryotic core promoters display multifarious signals such as TATA-Box, Inr element, TCT, and Pause-button, and structural motifs such as G-quadruplexes. In the present study, we combined the power of deep learning with a plethora of promoter motifs to delineate promoters and non-promoters gleaned from the statistical properties of DNA sequence arrangement. To this end, we implemented convolutional neural network (CNN) and long short-term memory (LSTM) recurrent neural network architectures for five model systems, with [-100 to +50] segments relative to the transcription start site being the core promoter. Unlike previous state-of-the-art tools, which furnish a binary decision of promoter or non-promoter, we classify a chunk of 151mer sequence into a promoter along with the consensus signal type or a non-promoter. The combined CNN-LSTM model, which we call "DeePromClass", achieved testing accuracies of 90.6%, 93.6%, 91.8%, 86.5%, and 84.0% for S. cerevisiae, C. elegans, D. melanogaster, Mus musculus, and Homo sapiens, respectively. In total, our tool provides an insightful update on next-generation promoter prediction tools for promoter biologists.
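The [-100 to +50] window in the abstract above yields a 151-mer only if the transcription start site (TSS) itself counts as position 0. A minimal sketch of that extraction follows, assuming a 0-based TSS index and a hypothetical helper name `core_promoter_window`; the abstract does not describe the authors' own preprocessing.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def core_promoter_window(genome, tss, strand="+", upstream=100, downstream=50):
    """Extract the [-upstream, +downstream] window around a 0-based TSS index.

    The TSS is treated as position 0, so the defaults give
    100 + 1 + 50 = 151 bases. Returns None when the window would run off
    either end of the sequence.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, "upstream" lies at higher genome coordinates.
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    if start < 0 or end >= len(genome):
        return None
    segment = genome[start:end + 1]
    return segment if strand == "+" else reverse_complement(segment)
```

In either orientation the TSS base lands at index 100 of the returned window, so downstream encoders see a consistent layout.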
4
Vanaja A, Yella VR. Delineation of the DNA Structural Features of Eukaryotic Core Promoter Classes. ACS Omega 2022; 7:5657-5669. [PMID: 35224327 PMCID: PMC8867553 DOI: 10.1021/acsomega.1c04603] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/23/2021] [Accepted: 01/27/2022] [Indexed: 05/02/2023]
Abstract
Eukaryotic transcription is orchestrated from a DNA region termed the core promoter. Multifarious and punctilious core promoter signals, viz., TATA-box, Inr, BREs, and Pause Button, are associated with a subset of genes and regulate their spatiotemporal expression. However, the core promoter architecture linked with these signals has not been investigated exhaustively for several species. In this study, we attempted to envisage the adaptive binding landscape of the transcription initiation machinery as a function of DNA structure. To this end, we deployed a set of k-mer based DNA structural estimates and regular expression models derived from experiments, molecular dynamics simulations, and theoretical frameworks, and high-throughput promoter data sets retrieved from the eukaryotic promoter database. We categorized protein-coding gene core promoters based on characteristic motifs at precise locations and analyzed the B-DNA structural properties and non-B-DNA structural motifs for 15 different eukaryotic genomes. We observed that Inr, BREd, and no-motif classes display common patterns of DNA sequence and structural environment. TATA-containing, BREu, and Pause Button classes show a deviant behavior, with the TATA class displaying varied axial and twisting flexibility while BREu and Pause Button leaned toward G-quadruplex motif enrichment. Intriguingly, DNA meltability and shape signals are conserved irrespective of the presence or absence of distinct core promoter motifs in the majority of species. Altogether, here we delineated the conserved DNA structural signals associated with several promoter classes that may contribute to the chromatin configuration, orchestration of transcription machinery, and DNA duplex melting during the transcription process.
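The abstract above mentions regular expression models for non-B-DNA motifs but does not list the expressions. As an illustration only, the sketch below uses the widely cited canonical G-quadruplex pattern (four runs of three or more guanines separated by loops of 1 to 7 bases); whether the authors used this exact pattern is an assumption, and `find_g4_motifs` is a hypothetical name.

```python
import re

# Canonical putative G-quadruplex pattern: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+.
# This is a commonly used form, assumed here for illustration.
G4_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)


def find_g4_motifs(seq):
    """Return (start, end, motif) for non-overlapping matches on the given strand.

    Coordinates are 0-based and end-exclusive. Motifs on the opposite strand
    (C-rich on this one) would need a second scan of the reverse complement.
    """
    seq = seq.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]
```

For example, `find_g4_motifs("TTGGGAGGGAGGGAGGGTT")` reports one motif spanning positions 2 to 17.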
Affiliation(s)
- Akkinepally Vanaja
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur 522502, Andhra Pradesh, India
- KL College of Pharmacy, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur 522502, Andhra Pradesh, India
- Venkata Rajesh Yella
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur 522502, Andhra Pradesh, India
- Tel: +91-863-2399999, Extn-1021. Website: https://www.kluniversity.in/bt/faculty-list.aspx
5
Sarkar S, Dey U, Khohliwe TB, Yella VR, Kumar A. Analysis of nucleoid-associated protein-binding regions reveals DNA structural features influencing genome organization in Mycobacterium tuberculosis. FEBS Lett 2021; 595:2504-2521. [PMID: 34387867 DOI: 10.1002/1873-3468.14178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2021] [Revised: 08/01/2021] [Accepted: 08/11/2021] [Indexed: 11/10/2022]
Abstract
Nucleoid-associated proteins (NAPs) maintain bacterial nucleoid configuration through their architectural properties of DNA bending, wrapping, and bridging. However, the contribution of DNA structural alterations to DNA-NAP recognition at the genomic scale remains unresolved. The present work dissects the DNA sequence, shape, and altered structural preferences at a genomic scale for six NAPs in Mycobacterium tuberculosis. The results suggest that narrower minor groove width (MGW) and higher DNA rigidity mark the binding sites of EspR and Lsr2, while mIHF, MtHU, and NapM have heterogeneous DNA structural predilections. In contrast, WhiB4-DNA-binding sites were characterized by wider MGW and highly deformable, less curved DNA. This work provides systematic insight into NAP-mediated genome organization as a function of DNA structural features.
Affiliation(s)
- Sharmilee Sarkar
- Department of Molecular Biology and Biotechnology, Tezpur University, India
- Upalabdha Dey
- Department of Molecular Biology and Biotechnology, Tezpur University, India
- Venkata Rajesh Yella
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Guntur, India
- Aditya Kumar
- Department of Molecular Biology and Biotechnology, Tezpur University, India
6
Symphony of the DNA flexibility and sequence environment orchestrates p53 binding to its responsive elements. Gene 2021; 803:145892. [PMID: 34375633 DOI: 10.1016/j.gene.2021.145892] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/14/2021] [Revised: 07/26/2021] [Accepted: 08/05/2021] [Indexed: 11/23/2022]
Abstract
The p53 tumor suppressor protein maintains genome fidelity and integrity by modulating several cellular activities. It regulates these events by interacting with a heterogeneous set of response elements (REs) of regulatory genes in the background of chromatin configuration. At the p53-RE interface, both the base readout and torsional flexibility of DNA account for high-affinity binding. However, DNA structure is an entanglement of a multitude of physicochemical features, so both local and global structure should be considered when dealing with DNA-protein interactions. The goal of the current work is to conceptualize and abstract basic principles of p53-RE binding affinity as a function of structural alterations in DNA such as bending, twisting, and stretching flexibility and shape. For this purpose, we have exploited high-throughput in vitro relative affinity information of responsive elements and genome binding events of p53 from HT-SELEX and ChIP-Seq experiments, respectively. Our results confirm the role of torsional flexibility in p53 binding, and further, we reveal that DNA axial bending, stretching stiffness, propeller twist, and wedge angles are intimately linked to p53 binding affinity when compared to homeodomain, bZIP, and bHLH proteins. Besides, a similar DNA structural environment is observed in the distal sequences encompassing the actual binding sites of p53 cistrome genes. Additionally, we revealed that p53 cistrome target genes have unique promoter architecture, and the DNA flexibility of genomic sequences around REs in cancer and normal cell types displays major differences. Altogether, our work provides a keynote on DNA structural features of REs that shape the in vitro and in vivo high-affinity binding of the p53 transcription factor.
7
Dey U, Sarkar S, Teronpi V, Yella VR, Kumar A. G-quadruplex motifs are functionally conserved in cis-regulatory regions of pathogenic bacteria: An in-silico evaluation. Biochimie 2021; 184:40-51. [PMID: 33548392 DOI: 10.1016/j.biochi.2021.01.017] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 09/14/2020] [Revised: 01/28/2021] [Accepted: 01/29/2021] [Indexed: 02/06/2023]
Abstract
The role of G-quadruplexes in the cellular physiology of human pathogenesis is an intriguing area of research. Nonetheless, their functional roles and evolutionary conservation have not been compared comprehensively in pathogenic forms of various bacterial genera and species. In the current in silico study, we addressed the role of G-quadruplex-forming sequences (G4 motifs) in the context of cis-regulation, expression variation, regulatory networks, gene orthology and ontology. Genome-wide screening across seven pathogenic genomes using the G4Hunter tool revealed the significant prevalence of G4 motifs in cis-regulatory regions compared to the intragenic regions. Significant conservation of G4 motifs was observed in the regulatory region of 300 orthologous genes. Further analysis of published ChIP-Seq data (Minch et al., 2015) of 91 DNA-binding proteins of the M. tuberculosis genome revealed significant links between G4 motifs and target sites of transcriptional regulators. Interestingly, the transcription factors entangled with virulence, specifically CsoR, Rv0081, DevR/DosR, and the TetR family, are found to have G4 motifs in their target regulatory regions. Overall, the current study applies positional-functional relationship computation to delve into the cis-regulation of G-quadruplex structures in the context of gene orthology in pathogenic bacteria.
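The abstract above relies on G4Hunter for genome-wide screening but gives no parameters. The sketch below is a simplified reading of the published G4Hunter scoring idea, not the tool's own code: each G in a run scores the run length capped at 4, each C scores the negative equivalent, A and T score 0, and scores are averaged over a sliding window. The function names and the default window of 25 are assumptions.

```python
from itertools import groupby


def g4hunter_base_scores(seq):
    """Assign a per-base score: +min(run, 4) for G runs, -min(run, 4) for C runs."""
    scores = []
    for base, run in groupby(seq.upper()):
        n = len(list(run))
        if base == "G":
            value = min(n, 4)
        elif base == "C":
            value = -min(n, 4)
        else:
            value = 0  # A, T, and any other symbol contribute nothing
        scores.extend([value] * n)
    return scores


def windowed_scores(seq, window=25):
    """Mean per-base score for every full window, using a running sum."""
    base = g4hunter_base_scores(seq)
    if window <= 0 or len(base) < window:
        return []
    total = sum(base[:window])
    out = [total / window]
    for i in range(window, len(base)):
        total += base[i] - base[i - window]
        out.append(total / window)
    return out
```

Windows whose absolute mean exceeds a chosen threshold would be flagged as candidate G4 regions; strongly negative means point to a G-rich motif on the opposite strand. The threshold itself is a user choice and is not stated in the abstract.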
Affiliation(s)
- Upalabdha Dey
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur, 784028, Assam, India
- Sharmilee Sarkar
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur, 784028, Assam, India
- Valentina Teronpi
- Department of Zoology, Pandit Deendayal Upadhyaya Adarsha Mahavidyalaya, Behali, Biswanath, 784184, Assam, India
- Venkata Rajesh Yella
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Guntur, 522502, Andhra Pradesh, India
- Aditya Kumar
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur, 784028, Assam, India
8
Wu F, Yang R, Zhang C, Zhang L. A deep learning framework combined with word embedding to identify DNA replication origins. Sci Rep 2021; 11:844. [PMID: 33436981 PMCID: PMC7804333 DOI: 10.1038/s41598-020-80670-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/23/2020] [Accepted: 12/24/2020] [Indexed: 01/29/2023] Open
Abstract
DNA replication influences the inheritance of genetic information in the DNA life cycle. As the distribution of replication origins (ORIs) is the major determinant in precisely regulating the replication process, the correct identification of ORIs is significant in giving an insightful understanding of DNA replication mechanisms and the regulatory mechanisms of genetic expression. For eukaryotes in particular, multiple ORIs exist in each of their gene sequences to complete replication in a reasonable period of time. To simplify the identification of eukaryotic ORIs, most existing methods are built on traditional machine learning algorithms and target gene sequences of a fixed length. Consequently, the identification results are not satisfying, i.e. there is still great room for improvement. To break through the limitations of previous studies, this paper develops sequence segmentation methods and employs the word embedding technique 'Word2vec' to convert gene sequences into word vectors, thereby grasping the inner correlations of gene sequences with different lengths. Then, a deep learning framework to perform the ORI identification task is constructed from a convolutional neural network with an embedding layer. On the basis of the analysis of the similarity reduction dimensionality diagram, Word2vec can effectively transform the inner relationship among words into numerical features. For the four species in this study, the best models are obtained with overall accuracies of 0.975, 0.765, 0.885, 0.967, Matthews correlation coefficients of 0.940, 0.530, 0.771, 0.934, and AUCs of 0.975, 0.800, 0.888, 0.981, which indicate that the proposed predictor has a stable ability and provides a high confidence coefficient to classify both ORIs and non-ORIs. Compared with state-of-the-art methods, the proposed predictor can achieve ORI identification with significant improvement. It is therefore reasonable to anticipate that the proposed method will make a useful high-throughput tool for genome analysis.
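The sequence segmentation step that precedes Word2vec in the abstract above can be pictured as cutting each sequence into k-mer "words". The sketch below assumes one plausible scheme (fixed k with a configurable stride); the abstract does not specify the paper's segmentation methods, and `segment_kmers` is a hypothetical name.

```python
def segment_kmers(seq, k=3, stride=1):
    """Split a sequence into k-mer tokens.

    stride=1 gives overlapping words; stride=k gives non-overlapping words.
    Sequences shorter than k yield an empty list. Because the number of
    tokens scales with sequence length, inputs of different lengths are
    handled naturally.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]
```

Each token list then plays the role of a "sentence" for a word-embedding model, which would map every distinct k-mer to a numeric vector.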
Affiliation(s)
- Feng Wu
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
- Runtao Yang
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
- Chengjin Zhang
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China
- Lina Zhang
- School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, Weihai, 264200, China