1
Wassenaar TM, Harville T, Chastain J, Wanchai V, Ussery DW. DNA structural features and variability of complete MHC locus sequences. Frontiers in Bioinformatics 2024; 4:1392613. [PMID: 39022183 PMCID: PMC11251971 DOI: 10.3389/fbinf.2024.1392613] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2024] [Accepted: 06/07/2024] [Indexed: 07/20/2024] Open
Abstract
The major histocompatibility (MHC) locus, also known as the Human Leukocyte Antigen (HLA) genes, is located on the short arm of chromosome 6, and contains three regions (Class I, Class II and Class III). This 5 Mbp locus is one of the most variable regions of the human genome, yet it also encodes a set of highly conserved and important proteins related to immunological response. Genetic variations in this region are responsible for more diseases than in the entire rest of the human genome. However, information on local structural features of the DNA is largely ignored. With recent advances in long-read sequencing technology, it is now becoming possible to sequence the entire 5 Mbp MHC locus, producing complete diploid haplotypes of the whole region. Here, we describe structural maps based on the complete sequences from six different homozygous HLA cell lines. We find long-range structural variability in the different sequences for DNA stacking energy, position preference and curvature, variation in repeats, as well as more local changes in regions forming open chromatin structures, likely to influence gene expression levels. These structural maps can be useful in visualizing large scale structural variation across HLA types, in particular when this can be complemented with epigenetic signals.
Affiliation(s)
- Terry Harville
- Department of Pathology and Laboratory Services, and Department of Internal Medicine, Division of Hematology, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Jonathan Chastain
- Department of Pediatrics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Visanu Wanchai
- Myeloma Center, Winthrop P. Rockefeller Institute, Department of Internal Medicine, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- David W. Ussery
- Department of BioMedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
2
Peng B, Sun G, Fan Y. iProL: identifying DNA promoters from sequence information based on Longformer pre-trained model. BMC Bioinformatics 2024; 25:224. [PMID: 38918692 PMCID: PMC11201334 DOI: 10.1186/s12859-024-05849-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Accepted: 06/19/2024] [Indexed: 06/27/2024] Open
Abstract
Promoters are essential elements of DNA sequence, usually located in the immediate region of the gene transcription start sites, and play a critical role in the regulation of gene transcription. Their importance in molecular biology and genetics has attracted considerable research interest, and there is broad agreement on the need for a computational method that identifies promoters efficiently. Still, existing methods suffer from imbalanced recognition capabilities for positive and negative samples, and their recognition performance can be further improved. We conducted research on E. coli promoters and proposed a more advanced prediction model, iProL, based on the Longformer pre-trained model in the field of natural language processing. iProL does not rely on prior biological knowledge but simply uses promoter DNA sequences as plain text to identify promoters. It also combines one-dimensional convolutional neural networks and bidirectional long short-term memory to extract both local and global features. Experimental results show that iProL has a more balanced and superior performance than currently published methods. Additionally, we constructed a novel independent test set following the previous specification and compared iProL with three existing methods on this independent test set.
Affiliation(s)
- Binchao Peng
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
- Guicong Sun
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
- Yongxian Fan
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
3
Paul S, Olymon K, Martinez GS, Sarkar S, Yella VR, Kumar A. MLDSPP: Bacterial Promoter Prediction Tool Using DNA Structural Properties with Machine Learning and Explainable AI. J Chem Inf Model 2024; 64:2705-2719. [PMID: 38258978 DOI: 10.1021/acs.jcim.3c02017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
Bacterial promoters play a crucial role in gene expression by serving as docking sites for the transcription initiation machinery. However, accurately identifying promoter regions in bacterial genomes remains a challenge due to their diverse architecture and variations. In this study, we propose MLDSPP (Machine Learning and Duplex Stability based Promoter prediction in Prokaryotes), a machine learning-based promoter prediction tool, to comprehensively screen bacterial promoter regions in 12 diverse genomes. We leveraged biologically relevant and informative DNA structural properties, such as DNA duplex stability and base stacking, and state-of-the-art machine learning (ML) strategies to gain insights into promoter characteristics. We evaluated several machine learning models, including Support Vector Machines, Random Forests, and XGBoost, and assessed their performance using accuracy, precision, recall, specificity, F1 score, and MCC metrics. Our findings reveal that XGBoost outperformed other models and current state-of-the-art promoter prediction tools, namely Sigma70pred and iPromoter2L, achieving F1-scores >95% in most systems. Significantly, the use of one-hot encoding for representing nucleotide sequences complements these structural features, enhancing our XGBoost model's predictive capabilities. To address the challenge of model interpretability, we incorporated explainable AI techniques using Shapley values. This enhancement allows for a better understanding and interpretation of the predictions of our model. In conclusion, our study presents MLDSPP as a novel, generic tool for predicting promoter regions in bacteria, utilizing original downstream sequences as nonpromoter controls. This tool has the potential to significantly advance the field of bacterial genomics and contribute to our understanding of gene regulation in diverse bacterial systems.
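The evaluation metrics named in this abstract can all be derived from the four confusion-matrix counts. The following is a minimal Python sketch, not the authors' code; the function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives (promoter = positive class here).
    """
    def safe_div(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)          # also called sensitivity
    specificity = safe_div(tn, tn + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    # Matthews correlation coefficient (MCC)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_denominator)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    for name, value in binary_metrics(tp=40, fp=10, tn=40, fn=10).items():
        print(f"{name}: {value:.3f}")
```

Reporting MCC alongside F1 is a common choice when promoter and non-promoter classes differ in size, because MCC uses all four counts.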
Affiliation(s)
- Subhojit Paul
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
- Kaushika Olymon
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
- Gustavo Sganzerla Martinez
- Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada
- Pediatrics, Izaak Walton Killam (IWK) Health Center, Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada
- Sharmilee Sarkar
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
- Venkata Rajesh Yella
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Guntur 522302, Andhra Pradesh, India
- Aditya Kumar
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
4
Martinez GS, Perez-Rueda E, Kumar A, Dutt M, Maya CR, Ledesma-Dominguez L, Casa PL, Kumar A, de Avila e Silva S, Kelvin DJ. CDBProm: the Comprehensive Directory of Bacterial Promoters. NAR Genom Bioinform 2024; 6:lqae018. [PMID: 38385146 PMCID: PMC10880602 DOI: 10.1093/nargab/lqae018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 01/12/2024] [Accepted: 01/29/2024] [Indexed: 02/23/2024] Open
Abstract
The decreasing cost of whole genome sequencing has produced high volumes of genomic information that require annotation. The experimental identification of promoter sequences, pivotal for regulating gene expression, is a laborious and cost-prohibitive task. To expedite this, we introduce the Comprehensive Directory of Bacterial Promoters (CDBProm), a directory of in-silico predicted bacterial promoter sequences. We first identified that an Extreme Gradient Boosting (XGBoost) algorithm would distinguish promoters from random downstream regions with an accuracy of 87%. To capture distinctive promoter signals, we generated a second XGBoost classifier trained on the instances misclassified in our first classifier. The predictor of CDBProm is then fed with over 55 million upstream regions from more than 6000 bacterial genomes. Upon finding potential promoter sequences in upstream regions, each promoter is mapped to the genomic data of the organism, linking the predicted promoter with its coding DNA sequence, and identifying the function of the gene regulated by the promoter. The collection of bacterial promoters available in CDBProm enables the quantitative analysis of a plethora of bacterial promoters. Our collection with over 24 million promoters is publicly available at https://aw.iimas.unam.mx/cdbprom/.
Affiliation(s)
- Gustavo Sganzerla Martinez
- Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada
- Pediatrics, Izaak Walton Killam (IWK) Health Center. Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada
- BioForge Canada Limited, Halifax, Nova Scotia B3N 3B9, Canada
- Ernesto Perez-Rueda
- Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica del Estado de Yucatán, Mérida 97302, Yucatán, Mexico
- Anuj Kumar
- Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada
- Pediatrics, Izaak Walton Killam (IWK) Health Center. Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada
- BioForge Canada Limited, Halifax, Nova Scotia B3N 3B9, Canada
- Mansi Dutt
- Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada
- Pediatrics, Izaak Walton Killam (IWK) Health Center. Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada
- BioForge Canada Limited, Halifax, Nova Scotia B3N 3B9, Canada
- Cinthia Rodríguez Maya
- Facultad de Ciencias e Ingeniería, Universidad Nacional Autonoma de Mexico, Mexico City 04510, Mexico
- Leonardo Ledesma-Dominguez
- Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas, Universidad Nacional Autonoma de Mexico, Mexico City 04510, Mexico
- Pedro Lenz Casa
- Biotechnology Institute, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul 95070-560, Brazil
- Aditya Kumar
- Molecular Biology and Biotechnology, Tezpur University, Tezpur, Assam 784028, India
- Scheila de Avila e Silva
- Biotechnology Institute, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul 95070-560, Brazil
- David J Kelvin
- Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada
- Pediatrics, Izaak Walton Killam (IWK) Health Center. Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada
- BioForge Canada Limited, Halifax, Nova Scotia B3N 3B9, Canada
5
Uemura K, Ohyama T. Physical Peculiarity of Two Sites in Human Promoters: Universality and Diverse Usage in Gene Function. Int J Mol Sci 2024; 25:1487. [PMID: 38338773 PMCID: PMC10855393 DOI: 10.3390/ijms25031487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 01/15/2024] [Accepted: 01/18/2024] [Indexed: 02/12/2024] Open
Abstract
Since the discovery of physical peculiarities around transcription start sites (TSSs) and a site corresponding to the TATA box, research has revealed only the average features of these sites. Unsettled enigmas include the individual genes with these features and whether they relate to gene function. Herein, using 10 physical properties of DNA, including duplex DNA free energy, base stacking energy, protein-induced deformability, and stabilizing energy of Z-DNA, we clarified for the first time that approximately 97% of the promoters of 21,056 human protein-coding genes have distinctive physical properties around the TSS and/or position -27; of these, nearly 65% exhibited such properties at both sites. Furthermore, about 55% of the 21,056 genes had a minimum value of regional duplex DNA free energy within TSS-centered ±300 bp regions. Notably, distinctive physical properties within the promoters and free energies of the surrounding regions separated human protein-coding genes into five groups; each contained specific gene ontology (GO) terms. The group represented by immune response genes differed distinctly from the other four regarding the parameter of the free energies of the surrounding regions. A vital suggestion from this study is that physical-feature-based analyses of genomes may reveal new aspects of the organization and regulation of genes.
Affiliation(s)
- Kohei Uemura
- Major in Integrative Bioscience and Biomedical Engineering, Graduate School of Science and Engineering, Waseda University, 2-2 Wakamatsu-cho, Shinjuku-ku, Tokyo 162-8480, Japan
- Takashi Ohyama
- Major in Integrative Bioscience and Biomedical Engineering, Graduate School of Science and Engineering, Waseda University, 2-2 Wakamatsu-cho, Shinjuku-ku, Tokyo 162-8480, Japan
- Department of Biology, Faculty of Education and Integrated Arts and Sciences, Waseda University, 2-2 Wakamatsu-cho, Shinjuku-ku, Tokyo 162-8480, Japan
6
Liu J, Yang M, Yu Y, Xu H, Li K, Zhou X. Large language models in bioinformatics: applications and perspectives. arXiv 2024:arXiv:2401.04155v1. [PMID: 38259343 PMCID: PMC10802675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
Large language models (LLMs) are a class of artificial intelligence models based on deep learning, which have great performance in various tasks, especially in natural language processing (NLP). Large language models typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled input using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we will present a summary of the prominent large language models used in natural language processing, such as BERT and GPT, and focus on exploring the applications of large language models at different omics levels in bioinformatics, mainly including applications of large language models in genomics, transcriptomics, proteomics, drug discovery and single cell analysis. Finally, this review summarizes the potential and prospects of large language models in solving bioinformatic problems.
Affiliation(s)
- Jiajia Liu
- Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
- Mengyuan Yang
- School of Life Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China
- Yankai Yu
- School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
- Haixia Xu
- The Center of Gerontology and Geriatrics, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Kang Li
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Xiaobo Zhou
- Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
- McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- School of Dentistry, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
7
Yella VR, Vanaja A. Computational analysis on the dissemination of non-B DNA structural motifs in promoter regions of 1180 cellular genomes. Biochimie 2023; 214:101-111. [PMID: 37311475 DOI: 10.1016/j.biochi.2023.06.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2022] [Revised: 05/05/2023] [Accepted: 06/05/2023] [Indexed: 06/15/2023]
Abstract
The promoter regions of gene regulation are under evolutionary constraints and earlier studies uncovered that they are characterized by enrichment of functional non-B DNA structural signatures like curved DNA, cruciform DNA, G-quadruplex, triple-helical DNA, slipped DNA structures, and Z-DNA. However, these studies are restricted to a few model organisms, single non-B DNA motif types, or whole genomic sequences, and their comparative accumulation in promoter regions of different domains of life has not been reported comprehensively. In this study, for the first time, we investigated the preponderance of non-B DNA-prone motifs in promoter regions in 1180 genomes belonging to 28 taxonomic groups using the non-B DNA Motif Search Tool (nBMST). The trends suggest that they are predominant in promoters compared to the upstream and downstream regions of all three domains of life and variably linked to taxonomic groups. Cruciform DNA motif is the most abundant form of non-B DNA, spanning from archaea to lower eukaryotes. Curved DNA motifs are prominent in host-associated bacteria, and suppressed in mammals. Triplex-DNA and slipped DNA structure repeats are discretely dispersed in all lineages. G-quadruplex motifs are significantly enriched in mammals. We also observed that the unique enrichment of non-B DNA in promoters is strongly linked to genome GC, size, evolutionary time divergence, and ecological adaptations. Overall, our work systematically reports the unique non-B DNA structural landscape of cellular organisms from the perspective of the cis-regulatory code of genomes.
Affiliation(s)
- Venkata Rajesh Yella
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Guntur, 522302, Andhra Pradesh, India
- Akkinepally Vanaja
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Guntur, 522302, Andhra Pradesh, India; KL College of Pharmacy, Koneru Lakshmaiah Education Foundation, Guntur, 522302, Andhra Pradesh, India
8
Aditama R, Tanjung ZA, Aprilyanto V, Sudania WM, Utomo C, Liwang T. Identification of oil palm cis-regulatory elements based on DNA free energy and single nucleotide polymorphism density. Comput Biol Chem 2023; 106:107931. [PMID: 37481844 DOI: 10.1016/j.compbiolchem.2023.107931] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2022] [Revised: 06/29/2023] [Accepted: 07/17/2023] [Indexed: 07/25/2023]
Abstract
Transcription control through cis-regulatory elements (CREs) is one of the important regulators of gene expression. This study aimed to identify the location of CREs in oil palm (Elaeis guineensis Jacq.) using the combination of DNA free energy and single nucleotide polymorphism (SNP) density approaches. Promoter region sequences were extracted from the oil palm genome, spanning from 1500 nucleotides (nt) upstream to 1000 nt downstream of every annotated transcription start site (TSS). Free energy profiles of each promoter region were calculated using PromPredict software. Raw reads from the deep sequencing of 59 oil palm origins were used to calculate the SNP density of each promoter region. The results showed that the average free energy (AFE) of the region upstream of the TSS is about 1.5 kcal/mol higher than that of the downstream region. Using the DNA free energy method, 16,281 CRE regions were predicted. Most of the predicted CREs were located between 1 and 500 nt upstream of the TSS. An anti-correlation pattern between free energy and SNP density was observed in the predicted CRE regions. This anti-correlated pattern was also observed in an experimentally determined promoter of the oil palm metallothionein gene, EgMSP1. Considering the increasing use of promoter information in plant biotechnology, easy and accurate promoter prediction using the combination of free energy and SNP density could be recommended.
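As an illustration of the kind of free energy profile described in this abstract, the sketch below averages nearest-neighbor duplex free energies over a sliding window. It is not PromPredict's implementation: the dinucleotide values are the unified nearest-neighbor ΔG°37 parameters (kcal/mol) as commonly tabulated from SantaLucia (1998) and should be checked against the primary source, and the function names, the 15 nt window, and the toy sequences are assumptions made here.

```python
# Assumed nearest-neighbor free energies (kcal/mol at 37 °C) per dinucleotide
# step, read 5'->3' on one strand; reverse-complement steps share a value.
NN_DG = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}


def step_energies(seq):
    """Return the free energy of every overlapping dinucleotide step."""
    seq = seq.upper()
    try:
        return [NN_DG[seq[i:i + 2]] for i in range(len(seq) - 1)]
    except KeyError as exc:
        raise ValueError(f"unsupported dinucleotide: {exc.args[0]}") from None


def mean_energy(seq):
    """Average free energy per dinucleotide step over the whole sequence."""
    steps = step_energies(seq)
    if not steps:
        raise ValueError("sequence must contain at least two nucleotides")
    return sum(steps) / len(steps)


def free_energy_profile(seq, window=15):
    """Sliding-window average free energy; one value per window position."""
    steps = step_energies(seq)
    n_steps = window - 1  # a window of N nucleotides spans N-1 steps
    if n_steps < 1 or len(steps) < n_steps:
        return []
    return [
        sum(steps[i:i + n_steps]) / n_steps
        for i in range(len(steps) - n_steps + 1)
    ]


# Toy sequences (hypothetical): an AT-rich "upstream" and a GC-rich "downstream".
upstream = "TATAATATTTAAATATTA"
downstream = "GCGCCGGCGCGGCCGCGC"

if __name__ == "__main__":
    print("upstream mean:", round(mean_energy(upstream), 2))
    print("downstream mean:", round(mean_energy(downstream), 2))
    print("profile length:", len(free_energy_profile(upstream + downstream)))
```

In this sketch a higher (less negative) average corresponds to a less stable duplex, which is the sense in which the abstract reports a higher average free energy upstream of the TSS.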
Affiliation(s)
- Redi Aditama
- Biotechnology Department, Plant Production and Biotechnology Division, PT SMART Tbk., Bogor 16810, Indonesia
- Zulfikar Achmad Tanjung
- Biotechnology Department, Plant Production and Biotechnology Division, PT SMART Tbk., Bogor 16810, Indonesia
- Victor Aprilyanto
- Biotechnology Department, Plant Production and Biotechnology Division, PT SMART Tbk., Bogor 16810, Indonesia
- Widyartini Made Sudania
- Biotechnology Department, Plant Production and Biotechnology Division, PT SMART Tbk., Bogor 16810, Indonesia
- Condro Utomo
- Biotechnology Department, Plant Production and Biotechnology Division, PT SMART Tbk., Bogor 16810, Indonesia
- Tony Liwang
- Biotechnology Department, Plant Production and Biotechnology Division, PT SMART Tbk., Bogor 16810, Indonesia
9
Guan T, Long L, Liu Y, Tian L, Peng Z, He Z. Complete Genome Sequencing and Bacteriocin Functional Characterization of Pediococcus ethanolidurans CP201 from Daqu. Appl Biochem Biotechnol 2023; 195:4728-4743. [PMID: 37285000 DOI: 10.1007/s12010-023-04575-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Accepted: 05/24/2023] [Indexed: 06/08/2023]
Abstract
This study aims to sequence the whole genome of Pediococcus ethanolidurans CP201 isolated from Daqu and determine the preservative ability of its bacteriocins on chicken breast. The whole genome sequence information of P. ethanolidurans CP201 was analyzed, and its gene structure and function were explored. It was found that gene1164 had annotations in the NR, Pfam, and Swiss-Prot databases, and was related to bacteriocins. The exogenous expression of the bacteriocin gene Pediocin PE-201 was analyzed based on the pET-21b vector and the host BL21, and the corresponding bacteriocin was successfully expressed under the induction of IPTG. After purification by Ni-NTA column, enterokinase treatment, membrane dialysis concentration treatment, and SDS-PAGE, the molecular weight was about 6.5 kDa and the purity was above 90%. When different concentrations of bacteriocin were applied to chicken breast with different levels of contamination, 25 mg/L bacteriocin achieved complete control in the pathogenic bacteria, ordinary contamination level (OC), and high contamination level (MC) groups. In conclusion, the bacteriocin produced by the newly isolated CP201 can be applied to the preservation of meat products to prevent the risk of food-borne diseases.
Affiliation(s)
- Tongwei Guan
- College of Food & Bioengineering, Xihua University, Chengdu, 610039, China
- Sichuan Provincial Key Laboratory of Food Microbiology, Chengdu, 610039, China
- Liuzhu Long
- College of Food & Bioengineering, Xihua University, Chengdu, 610039, China
- Ying Liu
- College of Food & Bioengineering, Xihua University, Chengdu, 610039, China
- Lei Tian
- College of Food & Bioengineering, Xihua University, Chengdu, 610039, China
- Zhong Peng
- College of Food & Bioengineering, Xihua University, Chengdu, 610039, China
- Zongjun He
- Sichuan Tujiu Liquor Co., Ltd., Nanchong, 637000, China
10
Wang Y, Tai S, Zhang S, Sheng N, Xie X. PromGER: Promoter Prediction Based on Graph Embedding and Ensemble Learning for Eukaryotic Sequence. Genes (Basel) 2023; 14:1441. [PMID: 37510345 PMCID: PMC10379012 DOI: 10.3390/genes14071441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Revised: 07/04/2023] [Accepted: 07/10/2023] [Indexed: 07/30/2023] Open
Abstract
Promoters are non-coding DNA regions around the transcription start site and are responsible for regulating the gene transcription process. Due to their key role in gene function and transcriptional activity, the accurate prediction of promoter sequences and their core elements is a crucial research area in bioinformatics. At present, models based on machine learning and deep learning have been developed for promoter prediction. However, these models cannot mine the deeper biological information of promoter sequences or consider the complex relationships among promoter sequences. In this work, we propose a novel prediction model called PromGER to predict eukaryotic promoter sequences. For a promoter sequence, firstly, PromGER utilizes four types of feature-encoding methods to extract local information within promoter sequences. Secondly, according to the potential relationships among promoter sequences, the whole promoter sequences are constructed as a graph. Furthermore, three different scales of graph-embedding methods are applied to obtain the global feature information in the graph more comprehensively. Finally, combining local features with global features of sequences, PromGER analyzes and predicts promoter sequences through a tree-based ensemble-learning framework. Compared with seven existing methods, PromGER improved the average specificity by 13%, accuracy by 10%, Matthews correlation coefficient by 16%, precision by 4%, F1 score by 6%, and AUC by 9%. Specifically, this study interpreted PromGER using the t-distributed stochastic neighbor embedding (t-SNE) method and SHapley Additive exPlanations (SHAP) value analysis, which demonstrates the interpretability of the model.
Affiliation(s)
- Yan Wang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- School of Artificial Intelligence, Jilin University, Changchun 130012, China
- Shiwen Tai
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Shuangquan Zhang
- School of Cyber Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Nan Sheng
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Xuping Xie
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
11
Shao C, Sun S, Liu K, Wang J, Li S, Liu Q, Deagle BE, Seim I, Biscontin A, Wang Q, Liu X, Kawaguchi S, Liu Y, Jarman S, Wang Y, Wang HY, Huang G, Hu J, Feng B, De Pittà C, Liu S, Wang R, Ma K, Ying Y, Sales G, Sun T, Wang X, Zhang Y, Zhao Y, Pan S, Hao X, Wang Y, Xu J, Yue B, Sun Y, Zhang H, Xu M, Liu Y, Jia X, Zhu J, Liu S, Ruan J, Zhang G, Yang H, Xu X, Wang J, Zhao X, Meyer B, Fan G. The enormous repetitive Antarctic krill genome reveals environmental adaptations and population insights. Cell 2023; 186:1279-1294.e19. [PMID: 36868220 DOI: 10.1016/j.cell.2023.02.005] [Citation(s) in RCA: 27] [Impact Index Per Article: 27.0] [Received: 07/10/2022] [Revised: 12/11/2022] [Accepted: 02/02/2023] [Indexed: 03/05/2023]
Abstract
Antarctic krill (Euphausia superba) is Earth's most abundant wild animal, and its enormous biomass is vital to the Southern Ocean ecosystem. Here, we report a 48.01-Gb chromosome-level Antarctic krill genome, whose large genome size appears to have resulted from inter-genic transposable element expansions. Our assembly reveals the molecular architecture of the Antarctic krill circadian clock and uncovers expanded gene families associated with molting and energy metabolism, providing insights into adaptations to the cold and highly seasonal Antarctic environment. Population-level genome re-sequencing from four geographical sites around the Antarctic continent reveals no clear population structure but highlights natural selection associated with environmental variables. An apparent drastic reduction in krill population size 10 mya and a subsequent rebound 100 thousand years ago coincides with climate change events. Our findings uncover the genomic basis of Antarctic krill adaptations to the Southern Ocean and provide valuable resources for future Antarctic research.
Affiliation(s)
- Changwei Shao
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, Shandong 266237, China
- Shuai Sun
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China; BGI-Shenzhen, Shenzhen, Guangdong 518083, China; College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Kaiqiang Liu
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, Shandong 266237, China
- Jiahao Wang
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China
- Shuo Li
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, Shandong 266237, China
- Qun Liu
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China; Department of Biology, University of Copenhagen, 2100 Copenhagen, Denmark
- Bruce E Deagle
- Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australian National Fish Collection, National Research Collections Australia, Hobart, TAS 7000, Australia; Australian Antarctic Division, Channel Highway, Kingston, TAS 7050, Australia
| | - Inge Seim
- Integrative Biology Laboratory, College of Life Sciences, Nanjing Normal University, Nanjing, Jiangsu 210023, China
| | | | - Qian Wang
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, Shandong 266237, China
| | - Xin Liu
- BGI-Shenzhen, Shenzhen, Guangdong 518083, China; BGI-Beijing, Beijing 102601, China; State Key Laboratory of Agricultural Genomics, BGI-Shenzhen, Shenzhen 518083, China; State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Murdoch University, Murdoch, WA 6150, Australia
| | - So Kawaguchi
- Australian Antarctic Division, Channel Highway, Kingston, TAS 7050, Australia
| | - Yalin Liu
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China
| | - Simon Jarman
- School of Molecular and Life Sciences, Curtin University, Perth, WA 6009, Australia
| | - Yue Wang
- BGI-Shenzhen, Shenzhen, Guangdong 518083, China; State Key Laboratory of Quality Research in Chinese Medicine and Institute of Chinese Medical Sciences, University of Macau, Macao 999078, China
| | - Hong-Yan Wang
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, Shandong 266237, China
| | | | - Jiang Hu
- Nextomics Biosciences Institute, Wuhan, Hubei 430073, China
| | - Bo Feng
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, Shandong 266237, China
| | | | - Shanshan Liu
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China
| | - Rui Wang
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, Shandong 266237, China
| | - Kailong Ma
- BGI-Shenzhen, Shenzhen, Guangdong 518083, China; China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
| | - Yiping Ying
- Key Lab of Sustainable Development of Polar Fisheries, Ministry of Agriculture and Rural Affairs, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China
| | - Gabrielle Sales
- Department of Biology, University of Padova, Padova 35121, Italy
| | - Tao Sun
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China
| | - Xinliang Wang
- Key Lab of Sustainable Development of Polar Fisheries, Ministry of Agriculture and Rural Affairs, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China
| | - Yaolei Zhang
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China; BGI-Shenzhen, Shenzhen, Guangdong 518083, China
| | - Yunxia Zhao
- Key Lab of Sustainable Development of Polar Fisheries, Ministry of Agriculture and Rural Affairs, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China
| | - Shanshan Pan
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China
| | - Xiancai Hao
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, Shandong 266237, China
| | - Yang Wang
- BGI-Shenzhen, Shenzhen, Guangdong 518083, China
| | - Jiakun Xu
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Key Lab of Sustainable Development of Polar Fisheries, Ministry of Agriculture and Rural Affairs, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China
| | - Bowen Yue
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, Shandong 266237, China
| | - Yanxu Sun
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, Shandong 266237, China
| | - He Zhang
- BGI-Shenzhen, Shenzhen, Guangdong 518083, China
| | - Mengyang Xu
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China; BGI-Shenzhen, Shenzhen, Guangdong 518083, China
| | - Yuyan Liu
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, Shandong 266237, China
| | - Xiaodong Jia
- Joint Laboratory for Translational Medicine Research, Liaocheng People's Hospital, Liaocheng, Shandong 252000, China
| | - Jiancheng Zhu
- Key Lab of Sustainable Development of Polar Fisheries, Ministry of Agriculture and Rural Affairs, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China
| | - Shufang Liu
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, Shandong 266237, China
| | - Jue Ruan
- Agricultural Genomics Institute, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China
| | - Guojie Zhang
- BGI-Shenzhen, Shenzhen, Guangdong 518083, China; Villum Centre for Biodiversity Genomics, Section for Ecology and Evolution, Department of Biology, University of Copenhagen, 2200 Copenhagen, Denmark
| | - Huanming Yang
- BGI-Shenzhen, Shenzhen, Guangdong 518083, China; James D. Watson Institute of Genome Science, Hangzhou 310058, China
| | - Xun Xu
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China; BGI-Shenzhen, Shenzhen, Guangdong 518083, China
| | - Jun Wang
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China
| | - Xianyong Zhao
- National Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China; Key Lab of Sustainable Development of Polar Fisheries, Ministry of Agriculture and Rural Affairs, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, Shandong 266071, China
| | - Bettina Meyer
- Section Polar Biological Oceanography, Alfred Wegener Institute Helmholtz Centre for Polar and Marine Research, Bremerhaven, Germany; Institute for Chemistry and Biology of the Marine Environment, Carl von Ossietzky University of Oldenburg, 26111 Oldenburg, Germany; Helmholtz Institute for Functional Marine Biodiversity (HIFMB), University of Oldenburg, 26129 Oldenburg, Germany.
| | - Guangyi Fan
- BGI-Qingdao, BGI-Shenzhen, Qingdao, Shandong 266555, China; BGI-Shenzhen, Shenzhen, Guangdong 518083, China; Agricultural Genomics Institute, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China; Lars Bolund Institute of Regenerative Medicine, Qingdao-Europe Advanced Institute for Life Sciences, BGI-Qingdao, BGI-Shenzhen 518120, China.
|
12
|
Transcriptome-based Mining of the Constitutive Promoters for Tuning Gene Expression in Aspergillus oryzae. J Microbiol 2023; 61:199-210. [PMID: 36745334 DOI: 10.1007/s12275-023-00020-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/03/2022] [Revised: 12/09/2022] [Accepted: 12/12/2022] [Indexed: 02/07/2023]
Abstract
Transcriptional regulation has been adopted for developing metabolic engineering tools. The regulatory promoter is a crucial genetic element for strain optimization. In this study, a gene set of Aspergillus oryzae with highly constitutive expression across different growth stages was identified through transcriptome data analysis. The candidate promoters were functionally characterized in A. oryzae by transcriptional control of β-glucuronidase (GUS) as a reporter. The results showed that the glyceraldehyde-3-phosphate dehydrogenase promoter (PgpdA1) of A. oryzae, which has a unique structure, displayed the most robust strength in constitutively controlling expression compared to PgpdA2 and the other putative promoters tested. In addition, the ubiquitin promoter (Pubi) of A. oryzae exhibited a moderate expression strength. The deletion analysis revealed that the 5' untranslated regions of gpdA1 and ubi, with lengths of 1028 and 811 nucleotides, respectively, counted from the putative translation start site (ATG), could efficiently drive GUS expression. Interestingly, both promoters could function on various carbon sources for cell growth. Glucose was the best fermentable carbon source for achieving high constitutive expression during cell growth, and high concentrations (6-8% glucose, w/v) did not repress their functions. It was also demonstrated that the secondary metabolite gene coding for indigoidine could be expressed under the control of the PgpdA1 or Pubi promoter. These strong and moderate promoters of A. oryzae provide beneficial options in tuning transcriptional expression for leveraging metabolic control towards the targeted products.
|
13
|
Explainable artificial intelligence as a reliable annotator of archaeal promoter regions. Sci Rep 2023; 13:1763. [PMID: 36720898 PMCID: PMC9889792 DOI: 10.1038/s41598-023-28571-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/08/2022] [Accepted: 01/20/2023] [Indexed: 02/02/2023] Open
Abstract
Archaea are a vast and unexplored cellular domain that thrive in a high diversity of environments, having central roles in processes mediating global carbon and nutrient fluxes. For these organisms to balance their metabolism, the appropriate regulation of their gene expression is essential. A key moment in regulating genes responsible for the life maintenance of archaea is when transcription factor proteins bind to the promoter element. This DNA segment is conserved, which enables its exploration by machine learning techniques. Here, we trained and tested a support vector machine with 3935 known archaeal promoter sequences. All promoter sequences were coded into DNA Duplex Stability. Afterwards, we performed a model interpretation task to map the decision pattern of the classification procedure. We also used a dataset of known promoter sequences for validation. Our results showed that an AT-rich region around position -27 upstream (relative to the TSS) is the most conserved in the analyzed organisms. In addition, we were able to identify the BRE element (-33), the PPE (at -10) and a position at +3, which provides a more understandable picture of how promoters are organized in all the archaeal organisms. Finally, we used the interpreted model to identify potential promoter sequences of 135 unannotated organisms, delivering regulatory region annotation of archaea on a scale never accomplished before (https://pcyt.unam.mx/gene-regulation/). We consider that this approach will be useful to understand how gene regulation is achieved in other organisms apart from the already established transcription factor binding sites.
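The step "coded into DNA Duplex Stability" in the abstract above can be pictured with a minimal sketch. It assumes a nearest-neighbour model in which each overlapping dinucleotide is replaced by a free-energy value; the table below uses the unified parameters of SantaLucia (1998), which is an assumption made for illustration, since the abstract does not say which parameter set the authors used. The function name `duplex_stability_profile` is hypothetical.

```python
# ASSUMPTION: nearest-neighbour free energies (kcal/mol at 37 °C) taken from
# the unified SantaLucia (1998) table; the cited study may use other values.
NN_FREE_ENERGY = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}


def duplex_stability_profile(seq):
    """Convert a DNA sequence into one free-energy value per dinucleotide step.

    A sequence of length n yields n - 1 values. Characters other than
    A, C, G, T raise a KeyError in this sketch.
    """
    seq = seq.upper()
    return [NN_FREE_ENERGY[seq[i:i + 2]] for i in range(len(seq) - 1)]


def mean_stability(seq):
    """Average free energy per step; less negative means easier to melt."""
    profile = duplex_stability_profile(seq)
    return sum(profile) / len(profile)
```

Under these assumed parameters, an AT-rich stretch such as `TATA` gives a less negative mean than a GC-rich stretch such as `GCGC`, which is the kind of numerical signal a classifier could pick up around an AT-rich promoter element.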
|
14
|
Kari H, Bandi SMS, Kumar A, Yella VR. DeePromClass: Delineator for Eukaryotic Core Promoters Employing Deep Neural Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:802-807. [PMID: 35353704 DOI: 10.1109/tcbb.2022.3163418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Computational promoter identification in eukaryotes is a classical biological problem that should be refurbished with the availability of an avalanche of experimental data and emerging deep learning technologies. The current knowledge indicates that eukaryotic core promoters display multifarious signals such as TATA-Box, Inr element, TCT, and Pause-button, and structural motifs such as G-quadruplexes. In the present study, we combined the power of deep learning with a plethora of promoter motifs to delineate promoters and non-promoters gleaned from the statistical properties of DNA sequence arrangement. To this end, we implemented a convolutional neural network (CNN) and long short-term memory (LSTM) recurrent neural network architecture for five model systems, with [-100 to +50] segments relative to the transcription start site being the core promoter. Unlike previous state-of-the-art tools, which furnish a binary decision of promoter or non-promoter, we classify a 151-mer sequence into a promoter along with the consensus signal type or a non-promoter. The combined CNN-LSTM model, which we call "DeePromClass", achieved testing accuracy of 90.6%, 93.6%, 91.8%, 86.5%, and 84.0% for S. cerevisiae, C. elegans, D. melanogaster, Mus musculus, and Homo sapiens, respectively. In total, our tool provides an insightful update on next-generation promoter prediction tools for promoter biologists.
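As a rough illustration of the input described above, the sketch below cuts the [-100 to +50] window around a transcription start site (151 nucleotides including the TSS itself) and one-hot encodes it, which is a common way to feed sequence to a CNN or LSTM. The abstract does not state the encoding actually used, so the one-hot scheme, the plus-strand-only handling, and the function names `core_promoter_window` and `one_hot` are all assumptions made for illustration.

```python
# Column order of the one-hot vectors (an arbitrary, assumed convention).
BASES = "ACGT"


def core_promoter_window(genome, tss, upstream=100, downstream=50):
    """Return the plus-strand segment from TSS-upstream to TSS+downstream.

    `tss` is a 0-based index into `genome`. With the defaults the result
    has 100 + 1 + 50 = 151 nucleotides.
    """
    start = tss - upstream
    end = tss + downstream + 1  # +1 so that the TSS position itself is included
    if start < 0 or end > len(genome):
        raise ValueError("window extends beyond the sequence")
    return genome[start:end]


def one_hot(seq):
    """Encode each base as a 4-element 0/1 list; unknown bases become all zeros."""
    encoded = []
    for base in seq.upper():
        encoded.append([1 if base == b else 0 for b in BASES])
    return encoded
```

A call such as `one_hot(core_promoter_window(chromosome, tss))` would yield a 151 x 4 matrix for one candidate site; minus-strand genes would need reverse complementing first, which this sketch leaves out.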
|
15
|
Ferreira TMM, Ferreira Filho JA, Leão AP, de Sousa CAF, Souza MTJ. Structural and functional analysis of stress-inducible genes and their promoters selected from young oil palm (Elaeis guineensis) under salt stress. BMC Genomics 2022; 23:735. [PMCID: PMC9620643 DOI: 10.1186/s12864-022-08926-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2022] [Accepted: 10/04/2022] [Indexed: 11/10/2022] Open
Abstract
Background: Soil salinity is a problem in more than 100 countries across all continents. It is one of the abiotic stresses that threaten agriculture the most, negatively affecting crops and reducing productivity. Transcriptomics is a technology applied to characterize the transcriptome in a cell, tissue, or organism at a given time via RNA-Seq, also known as full-transcriptome shotgun sequencing. This technology allows the identification of most genes expressed at a particular stage, and different isoforms are separated and transcript expression levels measured. Once determined by this technology, the expression profile of a gene must undergo validation by another, such as quantitative real-time PCR (qRT-PCR). This study aimed to select, annotate, and validate stress-inducible genes, and their promoters, differentially expressed in the leaves of oil palm (Elaeis guineensis) plants under saline stress. Results: The transcriptome analysis led to the selection of 14 genes that underwent structural and functional annotation, besides having their expression validated using the qRT-PCR technique. When compared, the RNA-Seq and qRT-PCR profiles of those genes resulted in some inconsistencies. The structural and functional annotation analysis of proteins coded by the selected genes showed that some of them are orthologs of genes reported as conferring resistance to salinity in other species. There were those coding for proteins related to the transport of salt into and out of cells, transcriptional regulatory activity, and opening and closing of stomata. The annotation analysis performed on the promoter sequence revealed 22 distinct types of cis-acting elements, and 14 of them are known to be involved in abiotic stress. Conclusion: This study has helped validate the process of an accurate selection of genes responsive to salt stress with a specific and predefined expression profile and their promoter sequence. Its results can also be used in molecular-genetics-assisted breeding programs. In addition, using the identified genes is a window of opportunity for strategies trying to relieve the damage arising from salt stress in many economically important glycophyte crops.
Affiliation(s)
- Thalita Massaro Malheiros Ferreira
- Graduate Program of Plant Biotechnology, Federal University of Lavras, 37200-000 Lavras, MG CP 3037, Brazil
| | - Jaire Alves Ferreira Filho
- Brazilian Agricultural Research Corporation, Embrapa Agroenergy, 70770-901 Brasília, DF Brazil
| | - André Pereira Leão
- Brazilian Agricultural Research Corporation, Embrapa Agroenergy, 70770-901 Brasília, DF Brazil
| | | | - Manoel Teixeira Jr. Souza
- Graduate Program of Plant Biotechnology, Federal University of Lavras, 37200-000 Lavras, MG CP 3037, Brazil; Brazilian Agricultural Research Corporation, Embrapa Agroenergy, 70770-901 Brasília, DF Brazil
|
16
|
DeeProPre: A promoter predictor based on deep learning. Comput Biol Chem 2022; 101:107770. [PMID: 36116322 DOI: 10.1016/j.compbiolchem.2022.107770] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/09/2022] [Revised: 08/06/2022] [Accepted: 09/11/2022] [Indexed: 11/21/2022]
Abstract
The promoter is a DNA sequence recognized, bound and transcribed by RNA polymerase. It is usually located upstream, at the 5' end, of the transcription start site (TSS). Studies have shown that the structure of the promoter affects its affinity for RNA polymerase, thus affecting the level of gene expression. Therefore, the correct identification of core promoters and common structural genes is of great significance in the field of biomedicine. At present, many methods have been proposed to improve the accuracy of promoter recognition, but their performance still needs to be further improved. In this study, a deep learning algorithm (DeeProPre) based on bidirectional long short-term memory (BiLSTM) and convolutional neural network (CNN) was proposed. Firstly, the supervised embedding layer was applied to map the sequence to a high-dimensional space. Secondly, two 1D convolutional layers, a BiLSTM and an attention mechanism layer were used for extracting features. Finally, a fully connected layer activated by the sigmoid function was used to obtain the probability of classification into target categories. This model can identify the promoter region of eukaryotes with high accuracy, providing an analytical basis for further understanding of promoter physiological functions and studies of gene transcription mechanisms. The source code of DeeProPre is freely available at https://github.com/zzwwmmm/DeeProPre/tree/master.
|
17
|
Chang CH. Correlated Expression of the Opsin Retrogene LWS-R and its Host Gene in Two Poeciliid Fishes. Zool Stud 2022; 61:e16. [PMID: 36330033 PMCID: PMC9579955 DOI: 10.6620/zs.2022.61-16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2021] [Accepted: 02/22/2022] [Indexed: 06/16/2023]
Abstract
The important role of retrogenes in genome evolution and species differentiation is becoming increasingly accepted. One synapomorphy among cyprinodontoid fish is a retrotransposed version of a long-wavelength sensitive (LWS) opsin gene, LWS-R, within an intron of the gephyrin (GPHN) gene. These two genes display opposing orientations. It had been speculated that LWS-R hijacks the cis-regulatory elements of GPHN for transcription, but whether their expression is correlated had remained unclear. Here, in silico predictions identified putative promoters upstream of the translation start site of LWS-R, indicating that its transcription is driven by its own promoter rather than by the GPHN promoter. However, consistent expression ratios of LWS-R:GPHN in the eyeball and brain of fishes indicate that the respective gene transcriptions are correlated. Co-expression is potentially modulated by histone exchange during GPHN transcription. Two isoforms were detected in this study, i.e., intron-free and intron-retaining. Intron-free LWS-R was only expressed in the eyeball of fishes, whereas intron-retaining LWS-R occurred in both eyeball and brain. Expression of vision-associated LWS-R beyond the eyeball supports that it is co-expressed with more ubiquitous GPHN.
Affiliation(s)
- Chia-Hao Chang
- Department of Science Education, National Taipei University of Education, No.134, Sec.2, Heping E. Rd., Da'an District, Taipei City 10671, Taiwan
|
18
|
Machine learning and statistics shape a novel path in archaeal promoter annotation. BMC Bioinformatics 2022; 23:171. [PMID: 35538405 PMCID: PMC9087966 DOI: 10.1186/s12859-022-04714-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/16/2021] [Accepted: 05/05/2022] [Indexed: 11/29/2022] Open
Abstract
Background: Archaea are a vast and unexplored domain. Bioinformatic techniques might light the path to higher-quality genome annotation in varied organisms. Promoter sequences of archaea are acted upon by a plethora of proteins. The conservation found at the structural level of the binding sites of proteins such as TBP, TFB, and TFE aids RNAP-DNA stabilization and makes the archaeal promoter amenable to exploration by statistical and machine learning techniques. Results and discussions: In this study, experimentally verified promoter sequences of the organisms Haloferax volcanii, Sulfolobus solfataricus, and Thermococcus kodakarensis were converted into DNA duplex stability attributes (i.e. numerical variables) and were classified through Artificial Neural Networks and an in-house statistical method of classification, being tested with three forms of controls. The recognition of these promoters enabled their use to validate unannotated promoter sequences in other organisms. As a result, the binding site of basal transcription factors was located through a DNA duplex stability codification. Additionally, the classification presented satisfactory results (above 90%) among varied levels of control. Concluding remarks: The classification models were employed to perform genomic annotation of the archaea Aciduliprofundum boonei and Thermofilum pendens, from which potential promoters have been identified and uploaded into public repositories. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04714-x.
|
19
|
IMGT® Biocuration and Analysis of the Rhesus Monkey IG Loci. Vaccines (Basel) 2022; 10:vaccines10030394. [PMID: 35335026 PMCID: PMC8950363 DOI: 10.3390/vaccines10030394] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2021] [Revised: 02/25/2022] [Accepted: 02/25/2022] [Indexed: 11/29/2022] Open
Abstract
The adaptive and innate immune systems are the two main biological processes that protect an organism from pathogens. The adaptive immune system is characterized by the specificity and extreme diversity of its antigen receptors. These antigen receptors are the immunoglobulins (IG) or antibodies of the B cells and the T cell receptors (TR) of the T cells. The IG are proteins that have a dual role in immunity: they recognize antigens and trigger elimination mechanisms to rid the body of foreign cells. The synthesis of the immunoglobulin heavy and light chains requires gene rearrangements at the DNA level in the IGH, IGK, and IGL loci. The rhesus monkey (Macaca mulatta) is one of the most widely used nonhuman primate species in biomedical research. In this manuscript, we provide a thorough analysis of the three IG loci of the Mmul_10 assembly of rhesus monkey, integrating previously existing IMGT data. Detailed characterization of IG genes includes their localization and position in the loci, the determination of the allele functionality, and the description of the regulatory elements of their promoters as well as the sequences of the conventional recombination signals (RS). This complete annotation of the genomic IG loci of the Mmul_10 assembly and the highly detailed IG gene characterization could be used as a model, in additional rhesus monkey assemblies, for the analysis of the IG allelic polymorphism and structural variation, which have been described in rhesus monkeys.
|
20
|
Vanaja A, Yella VR. Delineation of the DNA Structural Features of Eukaryotic Core Promoter Classes. ACS OMEGA 2022; 7:5657-5669. [PMID: 35224327 PMCID: PMC8867553 DOI: 10.1021/acsomega.1c04603] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/23/2021] [Accepted: 01/27/2022] [Indexed: 05/02/2023]
Abstract
Eukaryotic transcription is orchestrated from a stretch of DNA termed the core promoter. Multifarious and punctilious core promoter signals, viz., TATA-box, Inr, BREs, and Pause Button, are associated with a subset of genes and regulate their spatiotemporal expression. However, the core promoter architecture linked with these signals has not been investigated exhaustively for several species. In this study, we attempted to envisage the adaptive binding landscape of the transcription initiation machinery as a function of DNA structure. To this end, we deployed a set of k-mer based DNA structural estimates and regular expression models derived from experiments, molecular dynamic simulations, and theoretical frameworks, and high-throughput promoter data sets retrieved from the eukaryotic promoter database. We categorized protein-coding gene core promoters based on characteristic motifs at precise locations and analyzed the B-DNA structural properties and non-B-DNA structural motifs for 15 different eukaryotic genomes. We observed that Inr, BREd, and no-motif classes display common patterns of DNA sequence and structural environment. TATA-containing, BREu, and Pause Button classes show a deviant behavior, with the TATA class displaying varied axial and twisting flexibility while BREu and Pause Button leaned toward G-quadruplex motif enrichment. Intriguingly, DNA meltability and shape signals are conserved irrespective of the presence or absence of distinct core promoter motifs in the majority of species. Altogether, here we delineated the conserved DNA structural signals associated with several promoter classes that may contribute to the chromatin configuration, orchestration of transcription machinery, and DNA duplex melting during the transcription process.
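A "regular expression model" for a non-B-DNA motif, as mentioned in the abstract above, might look like the following minimal sketch for putative G-quadruplexes. It assumes the widely used pattern of four runs of at least three guanines separated by loops of 1 to 7 nucleotides; the abstract does not give the exact expressions the authors used, so both the pattern and the function name `find_g_quadruplexes` are assumptions made for illustration.

```python
import re

# ASSUMPTION: canonical G-quadruplex pattern (four G-runs of >= 3, loops of 1-7 nt).
# The C-rich pattern catches the same motif on the complementary strand.
G4_PLUS = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")
G4_MINUS = re.compile(r"C{3,}[ACGT]{1,7}C{3,}[ACGT]{1,7}C{3,}[ACGT]{1,7}C{3,}")


def find_g_quadruplexes(seq):
    """Return (start, end, strand) for each non-overlapping match.

    Coordinates are 0-based, end-exclusive, and always refer to the
    given (plus-strand) sequence.
    """
    seq = seq.upper()
    hits = [(m.start(), m.end(), "+") for m in G4_PLUS.finditer(seq)]
    hits += [(m.start(), m.end(), "-") for m in G4_MINUS.finditer(seq)]
    return sorted(hits)
```

Because `finditer` reports non-overlapping matches only, nested or overlapping candidate motifs are collapsed in this sketch; counting hits per promoter class would be one simple way to look for the kind of enrichment the abstract describes.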
Affiliation(s)
- Akkinepally Vanaja
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur 522502, Andhra Pradesh, India
- KL College of Pharmacy, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur 522502, Andhra Pradesh, India
| | - Venkata Rajesh Yella
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur 522502, Andhra Pradesh, India
|
21
|
Genome-Wide Prediction of Transcription Start Sites in Conifers. Int J Mol Sci 2022; 23:ijms23031735. [PMID: 35163661 PMCID: PMC8836283 DOI: 10.3390/ijms23031735] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/30/2021] [Revised: 01/30/2022] [Accepted: 02/01/2022] [Indexed: 02/04/2023] Open
Abstract
The identification of promoters is an essential step in the genome annotation process, providing a framework for gene regulatory networks and their role in transcription regulation. Despite considerable advances in the high-throughput determination of transcription start sites (TSSs) and transcription factor binding sites (TFBSs), experimental methods are still time-consuming and expensive. Instead, several computational approaches have been developed to provide fast and reliable means for predicting the location of TSSs and regulatory motifs on a genome-wide scale. Numerous studies have been carried out on the regulatory elements of mammalian genomes, but plant promoters, especially in gymnosperms, have been left out of the limelight and, therefore, have been poorly investigated. The aim of this study was to enhance and expand the existing genome annotations using computational approaches for genome-wide prediction of TSSs in the four conifer species: loblolly pine, white spruce, Norway spruce, and Siberian larch. Our pipeline will be useful for TSS predictions in other genomes, especially for draft assemblies, where reliable TSS predictions are not usually available. We also explored some of the features of the nucleotide composition of the predicted promoters and compared the GC properties of conifer genes with model monocot and dicot plants. Here, we demonstrate that even incomplete genome assemblies and partial annotations can be a reliable starting point for TSS annotation. The results of the TSS prediction in four conifer species have been deposited in the Persephone genome browser, which allows smooth visualization and is optimized for large data sets. This work provides the initial basis for future experimental validation and the study of the regulatory regions to understand gene regulation in gymnosperms.
|
22
|
Zhang M, Jia C, Li F, Li C, Zhu Y, Akutsu T, Webb GI, Zou Q, Coin LJM, Song J. Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction. Brief Bioinform 2022; 23:6502561. [PMID: 35021193 PMCID: PMC8921625 DOI: 10.1093/bib/bbab551] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/09/2021] [Revised: 11/12/2021] [Accepted: 11/30/2021] [Indexed: 01/13/2023] Open
Abstract
Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning-based approaches generally outperformed scoring function-based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future.
Affiliation(s)
| | - Cangzhi Jia
- Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China
| | | | | | | | | | - Geoffrey I Webb
- Department of Data Science and Artificial Intelligence, Monash University, Melbourne, VIC 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Quan Zou
| | - Lachlan J M Coin
| | - Jiangning Song
|
23
|
de Medeiros Oliveira M, Bonadio I, Lie de Melo A, Mendes Souza G, Durham AM. TSSFinder-fast and accurate ab initio prediction of the core promoter in eukaryotic genomes. Brief Bioinform 2021; 22:bbab198. [PMID: 34050351 PMCID: PMC8574697 DOI: 10.1093/bib/bbab198] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Revised: 02/14/2021] [Accepted: 02/23/2021] [Indexed: 12/02/2022] Open
Abstract
Promoter annotation is an important task in the analysis of a genome. One of the main challenges for this task is locating the border between the promoter region and the transcribing region of the gene, the transcription start site (TSS). The TSS is the reference point to delimit the DNA sequence responsible for the assembly of the transcribing complex. As the same gene can have more than one TSS, to delimit the promoter region it is important to locate the closest TSS to the site of the beginning of the translation. This paper presents TSSFinder, a new software for the prediction of the TSS signal of eukaryotic genes that is significantly more accurate than other available software. TSSFinder is currently the only application to offer pre-trained models for six different eukaryotic organisms: Arabidopsis thaliana, Drosophila melanogaster, Gallus gallus, Homo sapiens, Oryza sativa and Saccharomyces cerevisiae. Additionally, our software can be easily customized for specific organisms using only 125 DNA sequences with a validated TSS signal and corresponding genomic locations as a training set. TSSFinder is a valuable new tool for the annotation of genomes. TSSFinder source code and docker container can be downloaded from http://tssfinder.github.io. Alternatively, TSSFinder is also available as a web service at http://sucest-fun.org/wsapp/tssfinder/.
Affiliation(s)
| | - Igor Bonadio
- Data Science, Elo7 Research Lab, São Paulo, Brazil
|
24
|
Martinez GS, Sarkar S, Kumar A, Pérez‐Rueda E, de Avila e Silva S. Characterization of promoters in archaeal genomes based on DNA structural parameters. Microbiologyopen 2021; 10:e1230. [PMID: 34713600 PMCID: PMC8553660 DOI: 10.1002/mbo3.1230] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 07/15/2021] [Revised: 07/27/2021] [Accepted: 07/29/2021] [Indexed: 11/10/2022] Open
Abstract
The transcription machinery of archaea can be roughly classified as a simplified version of that of eukaryotic organisms. The basal transcription factor machinery binds to the TATA box found around 28 nucleotides upstream of the transcription start site; however, some transcription units lack a clear TATA box and still have TBP/TFB binding over them. This apparent absence of conserved sequences could be a consequence of sequence divergence associated with the upstream region, operon, and gene organization. Furthermore, earlier studies have found that a structural analysis gains more information compared with a simple sequence inspection. In this work, we evaluated and coded 3630 archaeal promoter sequences of three organisms, Haloferax volcanii, Thermococcus kodakarensis, and Sulfolobus solfataricus, into DNA duplex stability, enthalpy, curvature, and bendability parameters. We also split our dataset into conserved TATA and degenerated TATA promoters to identify differences between these two classes of promoters. The structural analysis reveals variations in archaeal promoter architecture, that is, a distinctive signal is observed in the TFB, TBP, and TFE binding sites independently of these being TATA-conserved or TATA-degenerated. In addition, the promoter encountering method was validated with upstream regions of 13 other archaea, suggesting that there might be promoter sequences among them. Therefore, we suggest a novel method for locating promoters within the genome of archaea based on DNA energetic/structural features.
Affiliation(s)
| | - Sharmilee Sarkar
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur, Assam, India
| | - Aditya Kumar
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur, Assam, India
| | - Ernesto Pérez‐Rueda
- Unidad Académica de Yucatán, Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Mérida, Yucatán, México
|
25
|
Umarov R, Li Y, Arakawa T, Takizawa S, Gao X, Arner E. ReFeaFi: Genome-wide prediction of regulatory elements driving transcription initiation. PLoS Comput Biol 2021; 17:e1009376. [PMID: 34491989 PMCID: PMC8448322 DOI: 10.1371/journal.pcbi.1009376] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/22/2021] [Revised: 09/17/2021] [Accepted: 08/23/2021] [Indexed: 11/19/2022] Open
Abstract
Regulatory elements control gene expression through transcription initiation (promoters) and by enhancing transcription at distant regions (enhancers). Accurate identification of regulatory elements is fundamental for annotating genomes and understanding gene expression patterns. While there are many attempts to develop computational promoter and enhancer identification methods, reliable tools to analyze long genomic sequences are still lacking. Prediction methods often perform poorly on the genome-wide scale because the number of negatives is much higher than that in the training sets. To address this issue, we propose a dynamic negative set updating scheme with a two-model approach, using one model for scanning the genome and the other one for testing candidate positions. The developed method achieves good genome-level performance and maintains robust performance when applied to other vertebrate species, without re-training. Moreover, the unannotated predicted regulatory regions made on the human genome are enriched for disease-associated variants, suggesting them to be potentially true regulatory elements rather than false positives. We validated high scoring "false positive" predictions using reporter assay and all tested candidates were successfully validated, demonstrating the ability of our method to discover novel human regulatory regions.
Affiliation(s)
- Ramzan Umarov
- Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
- * E-mail: (RU); (XG); (EA)
| | - Yu Li
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong, People’s Republic of China
| | - Takahiro Arakawa
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
| | - Satoshi Takizawa
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
| | - Xin Gao
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, Thuwal, Saudi Arabia
| | - Erik Arner
- Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
|
26
|
Symphony of the DNA flexibility and sequence environment orchestrates p53 binding to its responsive elements. Gene 2021; 803:145892. [PMID: 34375633 DOI: 10.1016/j.gene.2021.145892] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/14/2021] [Revised: 07/26/2021] [Accepted: 08/05/2021] [Indexed: 11/23/2022]
Abstract
The p53 tumor suppressor protein maintains the genome fidelity and integrity by modulating several cellular activities. It regulates these events by interacting with a heterogeneous set of response elements (REs) of regulatory genes in the background of chromatin configuration. At the p53-RE interface, both the base readout and torsional-flexibility of DNA account for high-affinity binding. However, DNA structure is an entanglement of a multitude of physicochemical features; both local and global structure should be considered when dealing with DNA-protein interactions. The goal of the current research work is to conceptualize and abstract basic principles of p53-RE binding affinity as a function of structural alterations in DNA such as bending, twisting, and stretching flexibility and shape. For this purpose, we have exploited high throughput in-vitro relative affinity information of responsive elements and genome binding events of p53 from HT-Selex and ChIP-Seq experiments respectively. Our results confirm the role of torsional flexibility in p53 binding, and further, we reveal that DNA axial bending, stretching stiffness, propeller twist, and wedge angles are intimately linked to p53 binding affinity when compared to homeodomain, bZIP, and bHLH proteins. Besides, a similar DNA structural environment is observed in the distal sequences encompassing the actual binding sites of p53 cistrome genes. Additionally, we revealed that p53 cistrome target genes have unique promoter architecture, and the DNA flexibility of genomic sequences around REs in cancer and normal cell types displays major differences. Altogether, our work provides a keynote on DNA structural features of REs that shape up the in-vitro and in-vivo high-affinity binding of the p53 transcription factor.
|
27
|
Martinez GS, de Ávila e Silva S, Kumar A, Pérez-Rueda E. DNA structural and physical properties reveal peculiarities in promoter sequences of the bacterium Escherichia coli K-12. SN APPLIED SCIENCES 2021. [DOI: 10.1007/s42452-021-04713-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022] Open
Abstract
The gene transcription of bacteria starts with a promoter sequence being recognized by a transcription factor found in the RNAP enzyme; this process is assisted through the conservation of nucleotides as well as other factors governing these intergenic regions. Faced with this, the coding of genetic information into physical aspects of the DNA such as enthalpy, stability, and base-pair stacking could suggest promoter activity as well as protrude differentiation of promoter and non-promoter data. In this work, a total of 3131 promoter sequences associated with six different sigma factors in the bacterium E. coli were converted into numeric attributes; a strong set of control sequences referring to a shuffled version of the original sequences as well as coding regions is provided. Then, the parameterized genetic information was normalized and exhaustively analyzed through statistical tests. The results suggest that strong signals in the promoter sequences match the binding site of transcription factor proteins, indicating that promoter activity is well represented by its conversion into physical attributes. Moreover, the features tested in this report conveyed significant variances between promoter and control data, enabling these features to be employed in bacterial promoter classification. The results produced here may aid in bacterial promoter recognition by providing a robust set of biological inferences.
|
28
|
Brázda V, Bartas M, Bowater RP. Evolution of Diverse Strategies for Promoter Regulation. Trends Genet 2021; 37:730-744. [PMID: 33931265 DOI: 10.1016/j.tig.2021.04.003] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 01/29/2021] [Revised: 03/31/2021] [Accepted: 04/01/2021] [Indexed: 12/15/2022]
Abstract
DNA is fundamentally important for all cellular organisms due to its role as a store of hereditary genetic information. The precise and accurate regulation of gene transcription depends primarily on promoters, which vary significantly within and between genomes. Some promoters are rich in specific types of bases, while others have more varied, complex sequence characteristics. However, it is not only base sequence but also epigenetic modifications and altered DNA structure that regulate promoter activity. Significantly, many promoters across all organisms contain sequences that can form intrastrand hairpins (cruciforms) or four-stranded structures (G-quadruplex or i-motif). In this review we integrate recent studies on promoter regulation that highlight the importance of DNA structure in the evolutionary adaptation of promoter sequences.
Affiliation(s)
- Václav Brázda
- Institute of Biophysics of the Czech Academy of Sciences, Královopolská 135, 612 65 Brno, Czech Republic
| | - Martin Bartas
- Department of Biology and Ecology/Institute of Environmental Technologies, Faculty of Science, University of Ostrava, 710 00 Ostrava, Czech Republic
| | - Richard P Bowater
- School of Biological Sciences, University of East Anglia, Norwich Research Park, Norwich NR4 7TJ, UK.
|
29
|
Dey U, Sarkar S, Teronpi V, Yella VR, Kumar A. G-quadruplex motifs are functionally conserved in cis-regulatory regions of pathogenic bacteria: An in-silico evaluation. Biochimie 2021; 184:40-51. [PMID: 33548392 DOI: 10.1016/j.biochi.2021.01.017] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 09/14/2020] [Revised: 01/28/2021] [Accepted: 01/29/2021] [Indexed: 02/06/2023]
Abstract
The role of G-quadruplexes in the cellular physiology of human pathogenesis is an intriguing area of research. Nonetheless, their functional roles and evolutionary conservation have not been compared comprehensively in pathogenic forms of various bacterial genera and species. In the current in silico study, we addressed the role of G-quadruplex-forming sequences (G4 motifs) in the context of cis-regulation, expression variation, regulatory networks, gene orthology and ontology. Genome-wide screening across seven pathogenic genomes using the G4Hunter tool revealed the significant prevalence of G4 motifs in cis-regulatory regions compared to the intragenic regions. Significant conservation of G4 motifs was observed in the regulatory region of 300 orthologous genes. Further analysis of published ChIP-Seq data (Minch et al., 2015) of 91 DNA-binding proteins of the M. tuberculosis genome revealed significant links between G4 motifs and target sites of transcriptional regulators. Interestingly, the transcription factors entangled with virulence, in specific, CsoR, Rv0081, DevR/DosR, and TetR family are found to have G4 motifs in their target regulatory regions. Overall the current study applies positional-functional relationship computation to delve into the cis-regulation of G-quadruplex structures in the context of gene orthology in pathogenic bacteria.
Affiliation(s)
- Upalabdha Dey
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur, 784028, Assam, India
| | - Sharmilee Sarkar
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur, 784028, Assam, India
| | - Valentina Teronpi
- Department of Zoology, Pandit Deendayal Upadhyaya Adarsha Mahavidyalaya, Behali, Biswanath, 784184, Assam, India
| | - Venkata Rajesh Yella
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Guntur, 522502, Andhra Pradesh, India.
| | - Aditya Kumar
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur, 784028, Assam, India.
|
30
|
Zhu Y, Li F, Xiang D, Akutsu T, Song J, Jia C. Computational identification of eukaryotic promoters based on cascaded deep capsule neural networks. Brief Bioinform 2020; 22:5998831. [PMID: 33227813 DOI: 10.1093/bib/bbaa299] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 08/24/2020] [Revised: 10/01/2020] [Accepted: 10/07/2020] [Indexed: 12/26/2022] Open
Abstract
A promoter is a region in the DNA sequence that defines where the transcription of a gene by RNA polymerase initiates, which is typically located proximal to the transcription start site (TSS). How to correctly identify the gene TSS and the core promoter is essential for our understanding of the transcriptional regulation of genes. As a complement to conventional experimental methods, computational techniques with easy-to-use platforms as essential bioinformatics tools can be effectively applied to annotate the functions and physiological roles of promoters. In this work, we propose a deep learning-based method termed Depicter (Deep learning for predicting promoter), for identifying three specific types of promoters, i.e. promoter sequences with the TATA-box (TATA model), promoter sequences without the TATA-box (non-TATA model), and indistinguishable promoters (TATA and non-TATA model). Depicter is developed based on an up-to-date, species-specific dataset which includes Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis thaliana promoters. A convolutional neural network coupled with capsule layers is proposed to train and optimize the prediction model of Depicter. Extensive benchmarking and independent tests demonstrate that Depicter achieves an improved predictive performance compared with several state-of-the-art methods. The webserver of Depicter is implemented and freely accessible at https://depicter.erc.monash.edu/.
Affiliation(s)
- Yan Zhu
- School of Science, Dalian Maritime University, China
| | - Fuyi Li
- Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Australia
| | | | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Cangzhi Jia
- College of Science, Dalian Maritime University
|
31
|
Yella VR, Vanaja A, Kulandaivelu U, Kumar A. Delving into Eukaryotic Origins of Replication Using DNA Structural Features. ACS OMEGA 2020; 5:13601-13611. [PMID: 32566825 PMCID: PMC7301376 DOI: 10.1021/acsomega.0c00441] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/31/2020] [Accepted: 05/15/2020] [Indexed: 05/18/2023]
Abstract
DNA replication in eukaryotes is an intricate process, which is precisely synchronized by a set of regulatory proteins, and the replication fork emanates from discrete sites on chromatin called origins of replication (Oris). These spots are considered as the gateway to chromosomal replication and are stereotyped by sequence motifs. The cognate sequences are noticeable in a small group of entire origin regions or totally absent across different metazoans. Alternatively, the use of DNA secondary structural features can provide additional information compared to the primary sequence. In this article, we report the trends in DNA sequence-based structural properties of origin sequences in nine eukaryotic systems representing different families of life. Biologically relevant DNA secondary structural properties, namely, stability, propeller twist, flexibility, and minor groove shape were studied in the sequences flanking replication start sites. Results indicate that Oris in yeasts show lower stability, more rigidity, and narrow minor groove preferences compared to genomic sequences surrounding them. Yeast Oris also show preference for A-tracts and the promoter element TATA box in the vicinity of replication start sites. On the contrary, Drosophila melanogaster, humans, and Arabidopsis thaliana do not have such features in their Oris, and instead, they show high preponderance of G-rich sequence motifs such as putative G-quadruplexes or i-motifs and CpG islands. Our extensive study applies the DNA structural feature computation to delve into origins of replication across organisms ranging from yeasts to mammals and including a plant. Insights from this study would be significant in understanding origin architecture and help in designing new algorithms for predicting DNA trans-acting factor recognition events.
Affiliation(s)
- Venkata Rajesh Yella
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Guntur 522502, Andhra Pradesh, India
| | - Akkinepally Vanaja
- Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Guntur 522502, Andhra Pradesh, India
- KL College of Pharmacy, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur 522502, Andhra Pradesh, India
| | - Umasankar Kulandaivelu
- KL College of Pharmacy, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur 522502, Andhra Pradesh, India
| | - Aditya Kumar
- Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur 784028, Assam, India
|
32
|
Romsdahl J, Blachowicz A, Chiang YM, Venkateswaran K, Wang CCC. Metabolomic Analysis of Aspergillus niger Isolated From the International Space Station Reveals Enhanced Production Levels of the Antioxidant Pyranonigrin A. Front Microbiol 2020; 11:931. [PMID: 32670208 PMCID: PMC7326050 DOI: 10.3389/fmicb.2020.00931] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 01/23/2020] [Accepted: 04/20/2020] [Indexed: 11/13/2022] Open
Abstract
Secondary metabolite (SM) production in Aspergillus niger JSC-093350089, isolated from the International Space Station (ISS), is reported, along with a comparison to the experimentally established strain ATCC 1015. The analysis revealed enhanced production levels of naphtho-γ-pyrones and therapeutically relevant SMs, including bicoumanigrin A, aurasperones A and B, and the antioxidant pyranonigrin A. Genetic variants that may be responsible for increased SM production levels in JSC-093350089 were identified. These findings include INDELs within the predicted promoter region of flbA, which encodes a developmental regulator that modulates pyranonigrin A production via regulation of Fum21. The pyranonigrin A biosynthetic gene cluster was confirmed in A. niger, which revealed the involvement of a previously undescribed gene, pyrE, in its biosynthesis. UVC sensitivity assays enabled characterization of pyranonigrin A as a UV resistance agent in the ISS isolate.
Affiliation(s)
- Jillian Romsdahl
- Department of Pharmacology and Pharmaceutical Sciences, School of Pharmacy, University of Southern California, Los Angeles, CA, United States
| | - Adriana Blachowicz
- Department of Pharmacology and Pharmaceutical Sciences, School of Pharmacy, University of Southern California, Los Angeles, CA, United States; Biotechnology and Planetary Protection Group, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, United States
| | - Yi-Ming Chiang
- Department of Pharmacology and Pharmaceutical Sciences, School of Pharmacy, University of Southern California, Los Angeles, CA, United States
| | - Kasthuri Venkateswaran
- Biotechnology and Planetary Protection Group, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, United States
| | - Clay C C Wang
- Department of Pharmacology and Pharmaceutical Sciences, School of Pharmacy, University of Southern California, Los Angeles, CA, United States; Department of Chemistry, Dornsife College of Letters, Arts, and Sciences, University of Southern California, Los Angeles, CA, United States
|