1
Dresch JM, Conrad RD, Klonaros D, Drewell RA. Investigating the sequence landscape in the Drosophila initiator core promoter element using an enhanced MARZ algorithm. PeerJ 2023; 11:e15597. [PMID: 37366427; PMCID: PMC10290830; DOI: 10.7717/peerj.15597] [Received: 02/27/2023] [Accepted: 05/29/2023] [Indexed: 06/28/2023]
Abstract
The core promoter elements are important DNA sequences for the regulation of RNA polymerase II transcription in eukaryotic cells. Despite the broad evolutionary conservation of these elements, there is extensive variation in the nucleotide composition of the actual sequences. In this study, we aim to improve our understanding of the complexity of this sequence variation in the TATA box and initiator core promoter elements in Drosophila melanogaster. Using computational approaches, including an enhanced version of our previously developed MARZ algorithm that utilizes gapped nucleotide matrices, several sequence landscape features are uncovered, including an interdependency between the nucleotides in positions 2 and 5 in the initiator. Incorporating this information in an expanded MARZ algorithm improves predictive performance for the identification of the initiator element. Overall, our results demonstrate the need to carefully consider detailed sequence composition features in core promoter elements in order to make more robust and accurate bioinformatic predictions.
2
Wang Z, He W, Tang J, Guo F. Identification of Highest-Affinity Binding Sites of Yeast Transcription Factor Families. J Chem Inf Model 2020; 60:1876-1883. [DOI: 10.1021/acs.jcim.9b01012] [Indexed: 12/19/2022]
Affiliation(s)
- Zongyu Wang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Wenying He
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Jijun Tang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
  - Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin 300072, P. R. China
  - Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina 29208, United States
- Fei Guo
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
3
Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, Bessy A, Chèneby J, Kulkarni SR, Tan G, Baranasic D, Arenillas DJ, Sandelin A, Vandepoele K, Lenhard B, Ballester B, Wasserman WW, Parcy F, Mathelier A. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res 2018; 46:D260-D266. [PMID: 29140473; PMCID: PMC5753243; DOI: 10.1093/nar/gkx1126] [Received: 09/25/2017] [Accepted: 10/27/2017] [Indexed: 12/31/2022]
Abstract
JASPAR (http://jaspar.genereg.net) is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) and TF flexible models (TFFMs) for TFs across multiple species in six taxonomic groups. In the 2018 release of JASPAR, the CORE collection has been expanded with 322 new PFMs (60 for vertebrates and 262 for plants) and 33 PFMs were updated (24 for vertebrates, 8 for plants and 1 for insects). These new profiles represent a 30% expansion compared to the 2016 release. In addition, we have introduced 316 TFFMs (95 for vertebrates, 218 for plants and 3 for insects). This release incorporates clusters of similar PFMs in each taxon and each TF class per taxon. The JASPAR 2018 CORE vertebrate collection of PFMs was used to predict TF-binding sites in the human genome. The predictions are made available to the scientific community through a UCSC Genome Browser track data hub. Finally, this update comes with a new web framework with an interactive and responsive user-interface, along with new features. All the underlying data can be retrieved programmatically using a RESTful API and through the JASPAR 2018 R/Bioconductor package.
Affiliation(s)
- Aziz Khan
  - Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Oriol Fornes
  - Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada
- Arnaud Stigliani
  - University of Grenoble Alpes, CNRS, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
- Marius Gheorghe
  - Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Jaime A Castro-Mondragon
  - Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Robin van der Lee
  - Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada
- Adrien Bessy
  - University of Grenoble Alpes, CNRS, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
- Jeanne Chèneby
  - INSERM, UMR1090 TAGC, Marseille, F-13288, France
  - Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France
- Shubhada R Kulkarni
  - Ghent University, Department of Plant Biotechnology and Bioinformatics, Technologiepark 927, 9052 Ghent, Belgium
  - VIB Center for Plant Systems Biology, Technologiepark 927, 9052 Ghent, Belgium
  - Bioinformatics Institute Ghent, Ghent University, Technologiepark 927, 9052 Ghent, Belgium
- Ge Tan
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London W12 0NN, UK
  - Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London W12 0NN, UK
- Damir Baranasic
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London W12 0NN, UK
  - Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London W12 0NN, UK
- David J Arenillas
  - Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada
- Albin Sandelin
  - The Bioinformatics Centre, Department of Biology and Biotech Research & Innovation Centre, University of Copenhagen, DK2200 Copenhagen N, Denmark
- Klaas Vandepoele
  - Ghent University, Department of Plant Biotechnology and Bioinformatics, Technologiepark 927, 9052 Ghent, Belgium
  - VIB Center for Plant Systems Biology, Technologiepark 927, 9052 Ghent, Belgium
  - Bioinformatics Institute Ghent, Ghent University, Technologiepark 927, 9052 Ghent, Belgium
- Boris Lenhard
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London W12 0NN, UK
  - Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London W12 0NN, UK
  - Sars International Centre for Marine Molecular Biology, University of Bergen, N-5008 Bergen, Norway
- Benoît Ballester
  - INSERM, UMR1090 TAGC, Marseille, F-13288, France
  - Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France
- Wyeth W Wasserman
  - Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada
- François Parcy
  - University of Grenoble Alpes, CNRS, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
- Anthony Mathelier
  - Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, 0310 Oslo, Norway
4
Elmas A, Wang X, Dresch JM. The folded k-spectrum kernel: A machine learning approach to detecting transcription factor binding sites with gapped nucleotide dependencies. PLoS One 2017; 12:e0185570. [PMID: 28982128; PMCID: PMC5628859; DOI: 10.1371/journal.pone.0185570] [Received: 07/11/2017] [Accepted: 09/14/2017] [Indexed: 12/22/2022]
Abstract
Understanding the molecular machinery involved in transcriptional regulation is central to improving our knowledge of an organism's development, disease, and evolution. The building blocks of this complex molecular machinery are an organism's genomic DNA sequence and transcription factor proteins. Despite the vast amount of sequence data now available for many model organisms, predicting where transcription factors bind, often referred to as 'motif detection', is still incredibly challenging. In this study, we develop a novel bioinformatic approach to binding site prediction. We do this by extending pre-existing SVM approaches in an unbiased way to include all possible gapped k-mers, representing different combinations of complex nucleotide dependencies within binding sites. We show the advantages of this new approach when compared to existing SVM approaches, through a rigorous set of cross-validation experiments. We also demonstrate the effectiveness of our new approach by reporting on its improved performance on a set of 127 genomic regions known to regulate gene expression along the anterio-posterior axis in early Drosophila embryos.
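To make the notion of gapped k-mers concrete, the following is a minimal sketch of a generic gapped k-mer feature count and the resulting dot-product similarity between two sequences. It is not the folded k-spectrum kernel from the paper; the function names, the `span` and `k` parameters, and the use of `-` as the gap symbol are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(sequence, span, k):
    """Count gapped k-mers: every window of length `span` is expanded into
    all patterns that keep k positions and mask the others with '-'.
    (Illustrative feature map, not the published kernel.)"""
    if not 1 <= k <= span:
        raise ValueError("k must satisfy 1 <= k <= span")
    counts = Counter()
    for start in range(len(sequence) - span + 1):
        window = sequence[start:start + span]
        for kept in combinations(range(span), k):
            pattern = "".join(
                window[i] if i in kept else "-" for i in range(span)
            )
            counts[pattern] += 1
    return counts


def gapped_kmer_similarity(seq_a, seq_b, span, k):
    """Dot product of the two gapped k-mer count vectors."""
    counts_a = gapped_kmer_counts(seq_a, span, k)
    counts_b = gapped_kmer_counts(seq_b, span, k)
    return sum(counts_a[p] * counts_b[p] for p in counts_a.keys() & counts_b.keys())


# Toy usage with invented sequences.
features = gapped_kmer_counts("ACGT", span=3, k=2)
same = gapped_kmer_similarity("ACGT", "ACGT", span=3, k=2)
close = gapped_kmer_similarity("ACGT", "ACGA", span=3, k=2)
```

A similarity of this kind could in principle be supplied to an SVM as a precomputed kernel; how the paper combines or folds the gapped features is not specified in the abstract and is not reproduced here.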
Affiliation(s)
- Abdulkadir Elmas
  - Department of Electrical Engineering, Columbia University, New York, NY, United States of America
- Xiaodong Wang
  - Department of Electrical Engineering, Columbia University, New York, NY, United States of America
- Jacqueline M. Dresch
  - Department of Mathematics and Computer Science, Clark University, Worcester, MA, United States of America
5
Dresch JM, Zellers RG, Bork DK, Drewell RA. Nucleotide Interdependency in Transcription Factor Binding Sites in the Drosophila Genome. Gene Regulation and Systems Biology 2016; 10:21-33. [PMID: 27330274; PMCID: PMC4907338; DOI: 10.4137/grsb.s38462] [Received: 02/05/2016] [Revised: 04/17/2016] [Accepted: 04/28/2016] [Indexed: 01/14/2023]
Abstract
A long-standing objective in modern biology is to characterize the molecular components that drive the development of an organism. At the heart of eukaryotic development lies gene regulation. On the molecular level, much of the research in this field has focused on the binding of transcription factors (TFs) to regulatory regions in the genome known as cis-regulatory modules (CRMs). However, relatively little is known about the sequence-specific binding preferences of many TFs, especially with respect to the possible interdependencies between the nucleotides that make up binding sites. A particular limitation of many existing algorithms that aim to predict binding site sequences is that they do not allow for dependencies between nonadjacent nucleotides. In this study, we use a recently developed computational algorithm, MARZ, to compare binding site sequences using 32 distinct models in a systematic and unbiased approach to explore nucleotide dependencies within binding sites for 15 distinct TFs known to be critical to Drosophila development. Our results indicate that many of these proteins have varying levels of nucleotide interdependencies within their DNA recognition sequences, and that, in some cases, models that account for these dependencies greatly outperform traditional models that are used to predict binding sites. We also directly compare the ability of different models to identify the known KRUPPEL TF binding sites in CRMs and demonstrate that a more complex model that accounts for nucleotide interdependencies performs better when compared with simple models. This ability to identify TFs with critical nucleotide interdependencies in their binding sites will lead to a deeper understanding of how these molecular characteristics contribute to the architecture of CRMs and the precise regulation of transcription during organismal development.
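One simple way to look for a dependency between two nonadjacent positions in a set of aligned binding sites is to compute the mutual information between the two alignment columns. The sketch below illustrates that general idea only; it is not the MARZ algorithm, and the function name and toy sites are invented for the example.

```python
import math
from collections import Counter


def mutual_information(sites, i, j):
    """Mutual information (in bits) between columns i and j of a list of
    equal-length aligned sites. Values near 0 suggest independence;
    larger values suggest the two positions co-vary."""
    n = len(sites)
    col_i = Counter(s[i] for s in sites)
    col_j = Counter(s[j] for s in sites)
    pairs = Counter((s[i], s[j]) for s in sites)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Toy alignment (invented): positions 0 and 2 always change together,
# while position 1 is constant.
toy_sites = ["AAT", "CAG", "AAT", "CAG"]
coupled = mutual_information(toy_sites, 0, 2)
uncoupled = mutual_information(toy_sites, 0, 1)
```

With small numbers of sites, raw mutual information is biased upward, so a real analysis would need some correction or a significance test; none is attempted here.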
Affiliation(s)
- Jacqueline M. Dresch
  - Department of Mathematics and Computer Science, Clark University, Worcester, MA, USA
- Rowan G. Zellers
  - Computer Science Department, Harvey Mudd College, Claremont, CA, USA
  - Mathematics Department, Harvey Mudd College, Claremont, CA, USA
- Daniel K. Bork
  - Computer Science Department, Harvey Mudd College, Claremont, CA, USA
  - Mathematics Department, Harvey Mudd College, Claremont, CA, USA
6
Pettie KP, Dresch JM, Drewell RA. Spatial distribution of predicted transcription factor binding sites in Drosophila ChIP peaks. Mech Dev 2016; 141:51-61. [PMID: 27264535; DOI: 10.1016/j.mod.2016.06.001] [Received: 11/19/2015] [Revised: 04/24/2016] [Accepted: 06/01/2016] [Indexed: 11/19/2022]
Abstract
In the development of the Drosophila embryo, gene expression is directed by the sequence-specific interactions of a large network of protein transcription factors (TFs) and DNA cis-regulatory binding sites. Once the identity of the typically 8-10 bp binding sites for any given TF has been determined by one of several experimental procedures, the sequences can be represented in a position weight matrix (PWM) and used to predict the location of additional TF binding sites elsewhere in the genome. Often, alignments of large (>200 bp) genomic fragments that have been experimentally determined to bind the TF of interest in Chromatin Immunoprecipitation (ChIP) studies are trimmed under the assumption that the majority of the binding sites are located near the center of all the aligned fragments. In this study, ChIP/chip datasets are analyzed using the corresponding PWMs for the well-studied TFs CAUDAL, HUNCHBACK, KNIRPS and KRUPPEL to determine the distribution of predicted binding sites. All four TFs are critical regulators of gene expression along the anterio-posterior axis in early Drosophila development. For all four TFs, the ChIP peaks contain multiple binding sites that are broadly distributed across the genomic region represented by the peak, regardless of the prediction stringency criteria used. This result suggests that ChIP peak trimming may exclude functional binding sites from subsequent analyses.
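The PWM-based prediction step described above can be sketched roughly as follows: build a log-odds matrix from aligned known sites, then slide it along a region and report windows that score at or above a threshold. This is a generic illustration under stated assumptions (uniform background of 0.25, a pseudocount of 1, invented toy sites and threshold, sequences containing only A, C, G, T), not the procedure or parameters used in the study.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return one dict per position mapping base -> log2-odds score."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


def scan(sequence, pwm, threshold):
    """Return (offset, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy data (invented): two site-like stretches placed away from the centre.
toy_sites = ["AACGGGTT", "AACGGGTA", "AAAGGGTT", "AACGGGTT"]
pwm = build_pwm(toy_sites)
region = "TTTTAACGGGTTCCCCCCCCAACGGGTAGG"
hits = scan(region, pwm, threshold=6.0)
hit_offsets = [offset for offset, _ in hits]
```

Recording the offset of each hit, rather than only counting hits, is what allows the positions of predicted sites to be compared against the centre of a peak. A fuller version would also scan the reverse complement strand.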
Affiliation(s)
- Kade P Pettie
  - Department of Biology, Amherst College, Amherst, MA 01002, United States
- Jacqueline M Dresch
  - Department of Mathematics and Computer Science, Clark University, 950 Main Street, Worcester, MA 01610, United States
- Robert A Drewell
  - Biology Department, Clark University, 950 Main Street, Worcester, MA 01610, United States