1
Garza AB, Garcia R, Solis LM, Halfon MS, Girgis HZ. EnhancerTracker: Comparing cell-type-specific enhancer activity of DNA sequence triplets via an ensemble of deep convolutional neural networks. bioRxiv 2023:2023.12.23.573198. [PMID: 38187673 PMCID: PMC10769370 DOI: 10.1101/2023.12.23.573198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
Motivation: Transcriptional enhancers, unlike promoters, are unrestrained by distance or strand orientation with respect to their target genes, making their computational identification a challenge. Further, there are insufficient numbers of confirmed enhancers for many cell types, preventing robust training of machine-learning-based models for enhancer prediction for such cell types. Results: We present EnhancerTracker, a novel tool that leverages an ensemble of deep separable convolutional neural networks to identify cell-type-specific enhancers while requiring only two confirmed enhancers. EnhancerTracker is trained, validated, and tested on 52,789 putative enhancers obtained from the FANTOM5 Project and control sequences derived from the human genome. Unlike available tools, which accept one sequence at a time, the input to our tool is three sequences; the first two are enhancers active in the same cell type. EnhancerTracker outputs 1 if the third sequence is an enhancer active in the same cell type(s) where the first two enhancers are active. It outputs 0 otherwise. On a held-out set (15%), EnhancerTracker achieved an accuracy of 64%, a specificity of 93%, a recall of 35%, a precision of 84%, and an F1 score of 49%. Availability and implementation: https://github.com/BioinformaticsToolsmith/EnhancerTracker. Contact: hani.girgis@tamuk.edu.
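The reported metrics can be cross-checked against one another. The short Python sketch below recomputes the F1 score from the stated precision and recall, and shows that the stated accuracy equals the mean of recall and specificity. That second identity holds only if the held-out set contains equal numbers of positive and negative triplets, which is an assumption made here for illustration and is not stated in the abstract. The function names are hypothetical and are not part of EnhancerTracker.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy(recall: float, specificity: float) -> float:
    """Accuracy when positives and negatives are equally frequent (assumption)."""
    return (recall + specificity) / 2


# Values as reported in the abstract.
precision, recall, specificity = 0.84, 0.35, 0.93

f1 = f1_score(precision, recall)
acc = balanced_accuracy(recall, specificity)

print(f"F1 = {f1:.2f}")
print(f"accuracy under the balanced-class assumption = {acc:.2f}")
```

Both values round to the figures given in the abstract (49% and 64%). Together they describe a classifier that rarely calls a false positive but misses roughly two thirds of true positives.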
2
Tomoyasu Y, Halfon MS. How to study enhancers in non-traditional insect models. J Exp Biol 2020; 223(Suppl_1):jeb212241. [PMID: 32034049 DOI: 10.1242/jeb.212241] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/12/2022]
Abstract
Transcriptional enhancers are central to the function and evolution of genes and gene regulation. At the organismal level, enhancers play a crucial role in coordinating tissue- and context-dependent gene expression. At the population level, changes in enhancers are thought to be a major driving force that facilitates evolution of diverse traits. An amazing array of diverse traits seen in insect morphology, physiology and behavior has been the subject of research for centuries. Although enhancer studies in insects outside of Drosophila have been limited, recent advances in functional genomic approaches have begun to make such studies possible in an increasing selection of insect species. Here, instead of comprehensively reviewing currently available technologies for enhancer studies in established model organisms such as Drosophila, we focus on a subset of computational and experimental approaches that are likely applicable to non-Drosophila insects, and discuss the pros and cons of each approach. We discuss the importance of validating enhancer function and evaluate several possible validation methods, such as reporter assays and genome editing. Key points and potential pitfalls when establishing a reporter assay system in non-traditional insect models are also discussed. We close with a discussion of how to advance enhancer studies in insects, both by improving computational approaches and by expanding the genetic toolbox in various insects. Through these discussions, this Review provides a conceptual framework for studying the function and evolution of enhancers in non-traditional insect models.
Affiliation(s)
- Marc S Halfon
  - Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
3
Castellanos M, Mothi N, Muñoz V. Eukaryotic transcription factors can track and control their target genes using DNA antennas. Nat Commun 2020; 11:540. [PMID: 31992709 PMCID: PMC6987225 DOI: 10.1038/s41467-019-14217-8] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 06/18/2018] [Accepted: 12/12/2019] [Indexed: 12/27/2022]
Abstract
Eukaryotic transcription factors (TF) function by binding to short 6-10 bp DNA recognition sites located near their target genes, which are scattered through vast genomes. Such process surmounts enormous specificity, efficiency and celerity challenges using a molecular mechanism that remains poorly understood. Combining biophysical experiments, theory and bioinformatics, we dissect the interplay between the DNA-binding domain of Engrailed, a Drosophila TF, and the regulatory regions of its target genes. We find that Engrailed binding affinity is strongly amplified by the DNA regions flanking the recognition site, which contain long tracts of degenerate recognition-site repeats. Such DNA organization operates as an antenna that attracts TF molecules in a promiscuous exchange among myriads of intermediate affinity binding sites. The antenna ensures a local TF supply, enables gene tracking and fine control of the target site's basal occupancy. This mechanism illuminates puzzling gene expression data and suggests novel engineering strategies to control gene expression.
Affiliation(s)
- Milagros Castellanos
  - Instituto Madrileño de Estudios Avanzados en Nanociencia (IMDEA Nanociencia), Faraday 9, Campus de Cantoblanco, Madrid, 28049, Spain
  - Centro Nacional de Biotecnología, Consejo Superior de Investigaciones Científicas (CSIC), Darwin 3, Campus de Cantoblanco, Madrid, 28049, Spain
- Nivin Mothi
  - Department of Bioengineering, School of Engineering, University of California, 95343, Merced, CA, USA
- Victor Muñoz
  - Instituto Madrileño de Estudios Avanzados en Nanociencia (IMDEA Nanociencia), Faraday 9, Campus de Cantoblanco, Madrid, 28049, Spain
  - Centro Nacional de Biotecnología, Consejo Superior de Investigaciones Científicas (CSIC), Darwin 3, Campus de Cantoblanco, Madrid, 28049, Spain
  - Department of Bioengineering, School of Engineering, University of California, 95343, Merced, CA, USA
4
Lai YT, Deem KD, Borràs-Castells F, Sambrani N, Rudolf H, Suryamohan K, El-Sherif E, Halfon MS, McKay DJ, Tomoyasu Y. Enhancer identification and activity evaluation in the red flour beetle, Tribolium castaneum. Development 2018; 145:dev160663. [PMID: 29540499 PMCID: PMC11736658 DOI: 10.1242/dev.160663] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 10/24/2017] [Accepted: 03/09/2018] [Indexed: 12/13/2022]
Abstract
Evolution of cis-regulatory elements (such as enhancers) plays an important role in the production of diverse morphology. However, a mechanistic understanding is often limited by the absence of methods for studying enhancers in species other than established model systems. Here, we sought to establish methods to identify and test enhancer activity in the red flour beetle, Tribolium castaneum. To identify possible enhancer regions, we first obtained genome-wide chromatin profiles from various tissues and stages of Tribolium using FAIRE (formaldehyde-assisted isolation of regulatory elements)-sequencing. Comparison of these profiles revealed a distinct set of open chromatin regions in each tissue and at each stage. In addition, comparison of the FAIRE data with sets of computationally predicted (i.e. supervised cis-regulatory module-predicted) enhancers revealed a very high overlap between the two datasets. Second, using nubbin in the wing and hunchback in the embryo as case studies, we established the first universal reporter assay system that works in various contexts in Tribolium, and in a cross-species context. Together, these advances will facilitate investigation of cis-evolution and morphological diversity in Tribolium and other insects.
Affiliation(s)
- Yi-Ting Lai
  - Department of Biology, Miami University, Oxford, OH 45056, USA
- Kevin D Deem
  - Department of Biology, Miami University, Oxford, OH 45056, USA
- Nagraj Sambrani
  - Department of Biology, Miami University, Oxford, OH 45056, USA
- Heike Rudolf
  - Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen 91058, Germany
- Kushal Suryamohan
  - Department of Biochemistry, State University of New York at Buffalo, Buffalo, NY 14214, USA
- Ezzat El-Sherif
  - Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen 91058, Germany
- Marc S Halfon
  - Department of Biochemistry, State University of New York at Buffalo, Buffalo, NY 14214, USA
- Daniel J McKay
  - Department of Biology, Department of Genetics, Integrative Program for Biological and Genome Sciences, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
5
Handling Permutation in Sequence Comparison: Genome-Wide Enhancer Prediction in Vertebrates by a Novel Non-Linear Alignment Scoring Principle. PLoS One 2015; 10:e0141487. [PMID: 26505748 PMCID: PMC4624239 DOI: 10.1371/journal.pone.0141487] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/13/2015] [Accepted: 10/08/2015] [Indexed: 01/01/2023]
Abstract
Enhancers have been described to evolve by permutation without changing function. This has posed the problem of how to predict enhancer elements that are hidden from alignment-based approaches due to the loss of co-linearity. Alignment-free algorithms have been proposed as one possible solution. However, this approach is hampered by several problems inherent to its underlying working principle. Here we present a new approach, which combines the power of alignment and alignment-free techniques into one algorithm. It allows the prediction of enhancers based on the query and target sequence only, no matter whether the regulatory logic is co-linear or reshuffled. To test our novel approach, we employ it for the prediction of enhancers across the evolutionary distance of ~450Myr between human and medaka. We demonstrate its efficacy by subsequent in vivo validation resulting in 82% (9/11) of the predicted medaka regions showing reporter activity. These include five candidates with partially co-linear and four with reshuffled motif patterns. Orthology in flanking genes and conservation of the detected co-linear motifs indicates that those candidates are likely functionally equivalent enhancers. In sum, our results demonstrate that the proposed principle successfully predicts mutated as well as permuted enhancer regions at an encouragingly high rate.
6
Suryamohan K, Halfon MS. Identifying transcriptional cis-regulatory modules in animal genomes. Wiley Interdiscip Rev Dev Biol 2015; 4:59-84. [PMID: 25704908 PMCID: PMC4339228 DOI: 10.1002/wdev.168] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.7] [Received: 09/24/2014] [Revised: 11/04/2014] [Accepted: 11/16/2014] [Indexed: 11/08/2022]
Abstract
Gene expression is regulated through the activity of transcription factors (TFs) and chromatin-modifying proteins acting on specific DNA sequences, referred to as cis-regulatory elements. These include promoters, located at the transcription initiation sites of genes, and a variety of distal cis-regulatory modules (CRMs), the most common of which are transcriptional enhancers. Because regulated gene expression is fundamental to cell differentiation and acquisition of new cell fates, identifying, characterizing, and understanding the mechanisms of action of CRMs is critical for understanding development. CRM discovery has historically been challenging, as CRMs can be located far from the genes they regulate, have few readily identifiable sequence characteristics, and for many years were not amenable to high-throughput discovery methods. However, the recent availability of complete genome sequences and the development of next-generation sequencing methods have led to an explosion of both computational and empirical methods for CRM discovery in model and nonmodel organisms alike. Experimentally, CRMs can be identified through chromatin immunoprecipitation directed against TFs or histone post-translational modifications, identification of nucleosome-depleted 'open' chromatin regions, or sequencing-based high-throughput functional screening. Computational methods include comparative genomics, clustering of known or predicted TF-binding sites, and supervised machine-learning approaches trained on known CRMs. All of these methods have proven effective for CRM discovery, but each has its own considerations and limitations, and each is subject to a greater or lesser number of false-positive identifications. Experimental confirmation of predictions is essential, although shortcomings in current methods suggest that additional means of validation need to be developed.
CONFLICT OF INTEREST The authors have declared no conflicts of interest for this article.
Affiliation(s)
- Kushal Suryamohan
  - Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
  - NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY 14203, USA
- Marc S. Halfon
  - Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
  - Department of Biological Sciences, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
  - Department of Biomedical Informatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
  - NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY 14203, USA
  - Molecular and Cellular Biology Department and Program in Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
7
Torlopp A, Khan MAF, Oliveira NMM, Lekk I, Soto-Jiménez LM, Sosinsky A, Stern CD. The transcription factor Pitx2 positions the embryonic axis and regulates twinning. eLife 2014; 3:e03743. [PMID: 25496870 PMCID: PMC4371885 DOI: 10.7554/elife.03743] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 06/22/2014] [Accepted: 11/14/2014] [Indexed: 12/29/2022]
Abstract
Embryonic polarity of invertebrates, amphibians and fish is specified largely by maternal determinants, which fix cell fates early in development. In contrast, amniote embryos remain plastic and can form multiple individuals until gastrulation. How is their polarity determined? In the chick embryo, the earliest known factor is cVg1 (homologous to mammalian growth differentiation factor 1, GDF1), a transforming growth factor beta (TGFβ) signal expressed posteriorly before gastrulation. A molecular screen to find upstream regulators of cVg1 in normal embryos and in embryos manipulated to form twins now uncovers the transcription factor Pitx2 as a candidate. We show that Pitx2 is essential for axis formation, and that it acts as a direct regulator of cVg1 expression by binding to enhancers within neighbouring genes. Pitx2, Vg1/GDF1 and Nodal are also key actors in left-right asymmetry, suggesting that the same ancient polarity determination mechanism has been co-opted to different functions during evolution.
Affiliation(s)
- Angela Torlopp
  - Department of Cell and Developmental Biology, University College London, London, United Kingdom
- Mohsin A F Khan
  - Department of Cell and Developmental Biology, University College London, London, United Kingdom
- Nidia M M Oliveira
  - Department of Cell and Developmental Biology, University College London, London, United Kingdom
- Ingrid Lekk
  - Department of Cell and Developmental Biology, University College London, London, United Kingdom
- Luz Mayela Soto-Jiménez
  - Department of Cell and Developmental Biology, University College London, London, United Kingdom
  - Programa de Ciencias Genómicas, Universidad Nacional Autónoma de México, Morelos, Mexico
- Alona Sosinsky
  - Institute of Structural and Molecular Biology, Birkbeck College, University of London, London, United Kingdom
- Claudio D Stern
  - Department of Cell and Developmental Biology, University College London, London, United Kingdom
8
Qiu X, Sun W, McDonnell CM, Li-Byarlay H, Steele LD, Wu J, Xie J, Muir WM, Pittendrigh BR. Genome-wide analysis of genes associated with moderate and high DDT resistance in Drosophila melanogaster. Pest Manag Sci 2013; 69:930-937. [PMID: 23371854 DOI: 10.1002/ps.3454] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 07/11/2012] [Revised: 10/03/2012] [Accepted: 11/07/2012] [Indexed: 06/01/2023]
Abstract
BACKGROUND: Moderate to high DDT resistance is generally associated with overexpression of multiple genes and therefore has been considered to be polygenic. However, very little information is available about the molecular mechanisms that insect populations employ when evolving increased levels of resistance. The presence of common regulatory motifs among resistance-associated genes may help to explain how and why certain suites of genes are preferentially represented in genomic-scale analyses. RESULTS: A set of commonly differentially expressed genes associated with DDT resistance in the fruit fly was identified on the basis of genome-wide microarray analysis followed by qRT-PCR verification. More genes were observed to be overtranscribed in the highly resistant strain (91-R) than in the moderately resistant strain (Wisconsin) and susceptible strain (Canton-S). Furthermore, possible transcription factor binding sites that occurred in coexpressed resistance-associated genes were discovered by computational motif discovery methods. CONCLUSION: A glucocorticoid receptor (GR)-like putative transcription factor binding motif (TFBM) was observed to be associated with genes commonly differentially transcribed in both the 91-R and Wisconsin lines of DDT-resistant Drosophila.
Affiliation(s)
- Xinghui Qiu
  - State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
9
Ma Q, Liu B, Zhou C, Yin Y, Li G, Xu Y. An integrated toolkit for accurate prediction and analysis of cis-regulatory motifs at a genome scale. Bioinformatics 2013; 29:2261-8. [PMID: 23846744 DOI: 10.1093/bioinformatics/btt397] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Indexed: 12/20/2022]
Abstract
MOTIVATION: We present an integrated toolkit, BoBro2.0, for prediction and analysis of cis-regulatory motifs. This toolkit can (i) reliably identify statistically significant cis-regulatory motifs at a genome scale; (ii) accurately scan for all motif instances of a query motif in specified genomic regions using a novel method for P-value estimation; (iii) provide highly reliable comparisons and clustering of identified motifs, which takes into consideration the weak signals from the flanking regions of the motifs; and (iv) analyze co-occurring motifs in the regulatory regions. RESULTS: We have carried out systematic comparisons between motif predictions using BoBro2.0 and the MEME package. The comparison results on Escherichia coli K12 genome and the human genome show that BoBro2.0 can identify the statistically significant motifs at a genome scale more efficiently, identify motif instances more accurately and get more reliable motif clusters than MEME. In addition, BoBro2.0 provides correlational analyses among the identified motifs to facilitate the inference of joint regulation relationships of transcription factors. AVAILABILITY: The source code of the program is freely available for noncommercial uses at http://code.google.com/p/bobro/. CONTACT: xyn@bmb.uga.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Qin Ma
  - Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
10
Khan MAF, Soto-Jimenez LM, Howe T, Streit A, Sosinsky A, Stern CD. Computational tools and resources for prediction and analysis of gene regulatory regions in the chick genome. Genesis 2013; 51:311-24. [PMID: 23355428 PMCID: PMC3664090 DOI: 10.1002/dvg.22375] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 08/31/2012] [Revised: 01/16/2013] [Accepted: 01/17/2013] [Indexed: 11/07/2022]
Abstract
The discovery of cis-regulatory elements is a challenging problem in bioinformatics, owing to distal locations and context-specific roles of these elements in controlling gene regulation. Here we review the current bioinformatics methodologies and resources available for systematic discovery of cis-acting regulatory elements and conserved transcription factor binding sites in the chick genome. In addition, we propose and make available a novel workflow using computational tools that integrate CTCF analysis to predict putative insulator elements, enhancer prediction, and TFBS analysis. To demonstrate the usefulness of this computational workflow, we then use it to analyze the locus of the gene Sox2, whose developmental expression is known to be controlled by a complex array of cis-acting regulatory elements. The workflow accurately predicts most of the experimentally verified elements along with some that have not yet been discovered. A web version of the CTCF tool, together with instructions for using the workflow, can be accessed from http://toolshed.g2.bx.psu.edu/view/mkhan1980/ctcf_analysis. For local installation of the tool, relevant Perl scripts and instructions are provided in the directory named "code" in the supplementary materials.
Affiliation(s)
- Mohsin A F Khan
  - Department of Cell & Developmental Biology, University College London, London, United Kingdom
11
Katara P, Grover A, Sharma V. Phylogenetic footprinting: a boost for microbial regulatory genomics. Protoplasma 2012; 249:901-907. [PMID: 22113593 DOI: 10.1007/s00709-011-0351-9] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 07/27/2011] [Accepted: 11/09/2011] [Indexed: 05/31/2023]
Abstract
Phylogenetic footprinting is a method for the discovery of regulatory elements in a set of homologous regulatory regions, usually collected from multiple species. It does so by identifying the best conserved motifs in those homologous regions. Two popular families of methods, alignment-based and motif-based, are generally employed for phylogenetic footprinting. However, no serious effort has been made to develop a tool exclusively for phylogenetic footprinting based on either family. Nevertheless, a number of software packages and tools exist that can be applied to phylogenetic footprinting with a variable degree of success. The output from these tools may be affected by a number of factors associated with the current state of knowledge, techniques and other available resources. Here we present a critical appraisal of various phylogenetic footprinting approaches with reference to prokaryotes, outlining the available resources and discussing the factors that affect footprinting, in order to clarify the proper use of this approach in prokaryotes.
Affiliation(s)
- Pramod Katara
  - Department of Bioscience and Biotechnology, Banasthali University, Banasthali, 304022, India
12
Cunial F, Apostolico A. Phylogeny Construction with Rigid Gapped Motifs. J Comput Biol 2012; 19:911-27. [DOI: 10.1089/cmb.2012.0060] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022]
Affiliation(s)
- Fabio Cunial
  - School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
- Alberto Apostolico
  - School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
13
High resolution mapping of Twist to DNA in Drosophila embryos: Efficient functional analysis and evolutionary conservation. Genome Res 2011; 21:566-77. [PMID: 21383317 DOI: 10.1101/gr.104018.109] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.0] [Indexed: 12/29/2022]
Abstract
Cis-regulatory modules (CRMs) function by binding sequence specific transcription factors, but the relationship between in vivo physical binding and the regulatory capacity of factor-bound DNA elements remains uncertain. We investigate this relationship for the well-studied Twist factor in Drosophila melanogaster embryos by analyzing genome-wide factor occupancy and testing the functional significance of Twist occupied regions and motifs within regions. Twist ChIP-seq data efficiently identified previously studied Twist-dependent CRMs and robustly predicted new CRM activity in transgenesis, with newly identified Twist-occupied regions supporting diverse spatiotemporal patterns (>74% positive, n = 31). Some, but not all, candidate CRMs require Twist for proper expression in the embryo. The Twist motifs most favored in genome ChIP data (in vivo) differed from those most favored by Systematic Evolution of Ligands by EXponential enrichment (SELEX) (in vitro). Furthermore, the majority of ChIP-seq signals could be parsimoniously explained by a CABVTG motif located within 50 bp of the ChIP summit and, of these, CACATG was most prevalent. Mutagenesis experiments demonstrated that different Twist E-box motif types are not fully interchangeable, suggesting that the ChIP-derived consensus (CABVTG) includes sites having distinct regulatory outputs. Further analysis of position, frequency of occurrence, and sequence conservation revealed significant enrichment and conservation of CABVTG E-box motifs near Twist ChIP-seq signal summits, preferential conservation of ±150 bp surrounding Twist occupied summits, and enrichment of GA- and CA-repeat sequences near Twist occupied summits. Our results show that high resolution in vivo occupancy data can be used to drive efficient discovery and dissection of global and local cis-regulatory logic.
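The CABVTG consensus is written in IUPAC nucleotide codes, where B stands for C, G or T and V stands for A, C or G. As a minimal sketch of how such a degenerate motif can be located near a ChIP summit, the Python snippet below translates an IUPAC motif into a regular expression and reports matches within a fixed window. The helper names, the example sequence and the summit coordinate are hypothetical and chosen for illustration; this is not the analysis pipeline used in the study.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(motif: str) -> str:
    """Translate an IUPAC motif such as CABVTG into a regex pattern."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(seq: str, motif: str) -> list[tuple[int, str]]:
    """Return (start, matched site) pairs, including overlapping matches."""
    pattern = re.compile(f"(?=({iupac_to_regex(motif)}))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]


def near_summit(hits, summit: int, window: int = 50):
    """Keep hits whose start lies within `window` bp of the summit position."""
    return [(pos, site) for pos, site in hits if abs(pos - summit) <= window]


# Hypothetical sequence and summit coordinate, for illustration only.
example = "TTCACATGAACAGCTGTT"
hits = find_motif(example, "CABVTG")
print(hits)
print(near_summit(hits, summit=5, window=4))
```

Because CABVTG is its own reverse complement as a degenerate pattern, scanning one strand is sufficient for this particular motif. Asymmetric motifs would also require a scan of the reverse complement.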
14
Holloway DM, Lopes FJP, da Fontoura Costa L, Travençolo BAN, Golyandina N, Usevich K, Spirov AV. Gene expression noise in spatial patterning: hunchback promoter structure affects noise amplitude and distribution in Drosophila segmentation. PLoS Comput Biol 2011; 7:e1001069. [PMID: 21304932 PMCID: PMC3033364 DOI: 10.1371/journal.pcbi.1001069] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.1] [Received: 07/04/2010] [Accepted: 12/28/2010] [Indexed: 01/08/2023]
Abstract
Positional information in developing embryos is specified by spatial gradients of transcriptional regulators. One of the classic systems for studying this is the activation of the hunchback (hb) gene in early fruit fly (Drosophila) segmentation by the maternally-derived gradient of the Bicoid (Bcd) protein. Gene regulation is subject to intrinsic noise which can produce variable expression. This variability must be constrained in the highly reproducible and coordinated events of development. We identify means by which noise is controlled during gene expression by characterizing the dependence of hb mRNA and protein output noise on hb promoter structure and transcriptional dynamics. We use a stochastic model of the hb promoter in which the number and strength of Bcd and Hb (self-regulatory) binding sites can be varied. Model parameters are fit to data from WT embryos, the self-regulation mutant hb(14F), and lacZ reporter constructs using different portions of the hb promoter. We have corroborated model noise predictions experimentally. The results indicate that WT (self-regulatory) Hb output noise is predominantly dependent on the transcription and translation dynamics of its own expression, rather than on Bcd fluctuations. The constructs and mutant, which lack self-regulation, indicate that the multiple Bcd binding sites in the hb promoter (and their strengths) also play a role in buffering noise. The model is robust to the variation in Bcd binding site number across a number of fly species. This study identifies particular ways in which promoter structure and regulatory dynamics reduce hb output noise. Insofar as many of these are common features of genes (e.g. multiple regulatory sites, cooperativity, self-feedback), the current results contribute to the general understanding of the reproducibility and determinacy of spatial patterning in early development.
Affiliation(s)
- David M Holloway
  - Mathematics Department, British Columbia Institute of Technology, Burnaby, British Columbia, Canada
15
Genome-wide identification of cis-regulatory motifs and modules underlying gene coregulation using statistics and phylogeny. Proc Natl Acad Sci U S A 2010; 107:14615-20. [PMID: 20671200 DOI: 10.1073/pnas.1002876107] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Indexed: 01/14/2023]
Abstract
Cell fate determination depends in part on the establishment of specific transcriptional programs of gene expression. These programs result from the interpretation of the genomic cis-regulatory information by sequence-specific factors. Decoding this information in sequenced genomes is an important issue. Here, we developed statistical analysis tools to computationally identify the cis-regulatory elements that control gene expression in a set of coregulated genes. Starting with a small number of validated and/or predicted cis-regulatory modules (CRMs) in a reference species as a training set, but with no a priori knowledge of the factors acting in trans, we computationally predicted transcription factor binding sites (TFBSs) and genomic CRMs underlying coregulation. This method was applied to the gene expression program active in Drosophila melanogaster sensory organ precursor cells (SOPs), a specific type of neural progenitor cells. Mutational analysis showed that four, including one newly characterized, out of the five top-ranked families of predicted TFBSs were required for SOP-specific gene expression. Additionally, 19 out of the 29 top-ranked predicted CRMs directed gene expression in neural progenitor cells, i.e., SOPs or larval brain neuroblasts, with a notable fraction active in SOPs (11/29). We further identified the lola gene as the target of two SOP-specific CRMs and found that the lola gene contributed to SOP specification. The statistics and phylogeny-based tools described here can be more generally applied to identify the cis-regulatory elements of specific gene regulatory networks in any family of related species with sequenced genomes.
|
16
|
Arunachalam M, Jayasurya K, Tomancak P, Ohler U. An alignment-free method to identify candidate orthologous enhancers in multiple Drosophila genomes. Bioinformatics 2010; 26:2109-15. [PMID: 20624780 DOI: 10.1093/bioinformatics/btq358] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Indexed: 12/17/2022]
Abstract
MOTIVATION Evolutionarily conserved non-coding genomic sequences represent a potentially rich source for the discovery of gene regulatory regions such as transcriptional enhancers. However, detecting orthologous enhancers using alignment-based methods in higher eukaryotic genomes is particularly challenging, as regulatory regions can undergo considerable sequence changes while maintaining their functionality. RESULTS We have developed an alignment-free method which identifies conserved enhancers in multiple diverged species. Our method is based on similarity metrics between two sequences based on the co-occurrence of sequence patterns regardless of their order and orientation, thus tolerating sequence changes observed in non-coding evolution. We show that our method is highly successful in detecting orthologous enhancers in distantly related species without requiring additional information such as knowledge about transcription factors involved, or predicted binding sites. By estimating the significance of similarity scores, we are able to discriminate experimentally validated functional enhancers from seemingly equally conserved candidates without function. We demonstrate the effectiveness of this approach on a wide range of enhancers in Drosophila, and also present encouraging results to detect conserved functional regions across large evolutionary distances. Our work provides encouraging steps on the way to ab initio unbiased enhancer prediction to complement ongoing experimental efforts. AVAILABILITY The software, data and the results used in this article are available at http://www.genome.duke.edu/labs/ohler/research/transcription/fly_enhancer/.
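As a rough illustration of what an order- and orientation-independent, alignment-free comparison can look like, the sketch below counts k-mers in canonical form (each k-mer merged with its reverse complement) and compares two sequences by cosine similarity. This is not the authors' similarity metric or significance estimate; the function names, the choice of k = 4, and the use of cosine similarity are assumptions made only for this example.

```python
from collections import Counter
from math import sqrt

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmer_counts(seq, k=4):
    """Count k-mers, merging each k-mer with its reverse complement.

    Hypothetical helper: merging strands makes the profile independent
    of the orientation in which a sequence is read.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers with ambiguous bases
            rc = kmer.translate(_COMPLEMENT)[::-1]
            counts[min(kmer, rc)] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two k-mer count profiles (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Because the profile ignores where each k-mer occurs, rearranged binding-site arrangements can still score as similar, and a sequence scores the same against another sequence or against that sequence's reverse complement.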
|
17
|
Evans KJ. Most transcription factor binding sites are in a few mosaic classes of the human genome. BMC Genomics 2010; 11:286. [PMID: 20459624 PMCID: PMC2881025 DOI: 10.1186/1471-2164-11-286] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/15/2010] [Accepted: 05/06/2010] [Indexed: 12/02/2022]
Abstract
Background Many algorithms for finding transcription factor binding sites have concentrated on the characterisation of the binding site itself, and these algorithms lead to a large number of false positive sites. The DNA sequence which does not bind has been modeled only to the extent necessary to complement this formulation. Results We find that the human genome may be described by 19 pairs of mosaic classes, each defined by its base frequencies (or, more precisely, by the frequencies of doublets), so that typically a run of 10 to 100 bases belongs to the same class. Most experimentally verified binding sites are in the same four pairs of classes. In our sample of seventeen transcription factors, taken from different families of transcription factors, the average proportion of sites in this subset of classes was 75%, with values for individual factors ranging from 48% to 98%. By contrast, these same classes contain only 26% of the bases of the genome and only 31% of occurrences of the motifs of these factors, that is, places where one might expect the factors to bind. These results are not a consequence of the class composition in promoter regions. Conclusions This method of analysis will help to find transcription factor binding sites and assist with the problem of false positives. These results also imply a profound difference between the mosaic classes.
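The mosaic classes in this abstract are characterised by doublet (dinucleotide) frequencies. A minimal sketch of computing such frequencies for a stretch of sequence is given below; it only illustrates the feature itself, not the paper's procedure for inferring classes or assigning runs of bases to them, and the function name `doublet_frequencies` is hypothetical.

```python
from collections import Counter
from itertools import product


def doublet_frequencies(seq):
    """Return the relative frequency of each of the 16 dinucleotides in seq.

    Overlapping doublets are counted; doublets containing characters
    other than A, C, G, T are ignored. All 16 keys are always present.
    """
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if set(p) <= set("ACGT")]
    counts = Counter(pairs)
    total = len(pairs)
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product("ACGT", repeat=2)
    }
```

For example, `doublet_frequencies("ACGT")` assigns one third each to `AC`, `CG`, and `GT`, and zero to the remaining thirteen doublets.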
Affiliation(s)
- Kenneth J Evans
- School of Crystallography, Birkbeck College, University of London, Malet Street, London, WC1E 7HX, UK.
|
18
|
Zhou X, Sumazin P, Rajbhandari P, Califano A. A systems biology approach to transcription factor binding site prediction. PLoS One 2010; 5:e9878. [PMID: 20360861 PMCID: PMC2845628 DOI: 10.1371/journal.pone.0009878] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 02/01/2010] [Accepted: 03/02/2010] [Indexed: 11/18/2022]
Abstract
BACKGROUND The elucidation of mammalian transcriptional regulatory networks holds great promise for both basic and translational research and remains one of the greatest challenges to systems biology. Recent reverse engineering methods deduce regulatory interactions from large-scale mRNA expression profiles and cross-species conserved regulatory regions in DNA. Technical challenges faced by these methods include distinguishing between direct and indirect interactions, associating transcription regulators with predicted transcription factor binding sites (TFBSs), identifying non-linearly conserved binding sites across species, and providing realistic accuracy estimates. METHODOLOGY/PRINCIPAL FINDINGS We address these challenges by closely integrating proven methods for regulatory network reverse engineering from mRNA expression data, linearly and non-linearly conserved regulatory region discovery, and TFBS evaluation and discovery. Using an extensive test set of high-likelihood interactions, which we collected in order to provide realistic prediction-accuracy estimates, we show that a careful integration of these methods leads to significant improvements in prediction accuracy. To verify our methods, we biochemically validated TFBS predictions made for both transcription factors (TFs) and co-factors; we validated binding site predictions made using a known E2F1 DNA-binding motif on E2F1 predicted promoter targets, known E2F1 and JUND motifs on JUND predicted promoter targets, and a de novo discovered motif for BCL6 on BCL6 predicted promoter targets. Finally, to demonstrate accuracy of prediction using an external dataset, we showed that sites matching predicted motifs for ZNF263 are significantly enriched in recent ZNF263 ChIP-seq data.
CONCLUSIONS/SIGNIFICANCE Using an integrative framework, we were able to address technical challenges faced by state of the art network reverse engineering methods, leading to significant improvement in direct-interaction detection and TFBS-discovery accuracy. We estimated the accuracy of our framework on a human B-cell specific test set, which may help guide future methodological development.
Affiliation(s)
- Xiang Zhou
- Department of Biomedical Informatics (DBMI), Columbia University, New York, New York, United States of America
- Pavel Sumazin
- Center for Computational Biology and Bioinformatics (C2B2), Columbia University, New York, New York, United States of America
- Presha Rajbhandari
- Center for Computational Biology and Bioinformatics (C2B2), Columbia University, New York, New York, United States of America
- Andrea Califano
- Department of Biomedical Informatics (DBMI), Columbia University, New York, New York, United States of America
- Center for Computational Biology and Bioinformatics (C2B2), Columbia University, New York, New York, United States of America
- Herbert Irving Comprehensive Cancer Center, Columbia University, New York, New York, United States of America
|
19
|
Kantorovitz MR, Kazemian M, Kinston S, Miranda-Saavedra D, Zhu Q, Robinson GE, Göttgens B, Halfon MS, Sinha S. Motif-blind, genome-wide discovery of cis-regulatory modules in Drosophila and mouse. Dev Cell 2009; 17:568-79. [PMID: 19853570 DOI: 10.1016/j.devcel.2009.09.002] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.1] [Received: 03/26/2009] [Revised: 07/02/2009] [Accepted: 09/09/2009] [Indexed: 12/24/2022]
Abstract
We present new approaches to cis-regulatory module (CRM) discovery in the common scenario where relevant transcription factors and/or motifs are unknown. Beginning with a small list of CRMs mediating a common gene expression pattern, we search genome-wide for CRMs with similar functionality, using new statistical scores and without requiring known motifs or accurate motif discovery. We cross-validate our predictions on 31 regulatory networks in Drosophila and through correlations with gene expression data. Five predicted modules tested using an in vivo reporter gene assay all show tissue-specific regulatory activity. We also demonstrate our methods' ability to predict mammalian tissue-specific enhancers. Finally, we predict human CRMs that regulate early blood and cardiovascular development. In vivo transgenic mouse analysis of two predicted CRMs demonstrates that both have appropriate enhancer activity. Overall, 7/7 predictions were validated successfully in vivo, demonstrating the effectiveness of our approach for insect and mammalian genomes.
Affiliation(s)
- Miriam R Kantorovitz
- Department of Mathematics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
|
20
|
Identifying cis-regulatory sequences by word profile similarity. PLoS One 2009; 4:e6901. [PMID: 19730735 PMCID: PMC2731932 DOI: 10.1371/journal.pone.0006901] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Received: 10/08/2008] [Accepted: 08/07/2009] [Indexed: 12/13/2022]
Abstract
Background Recognizing regulatory sequences in genomes is a continuing challenge, despite a wealth of available genomic data and a growing number of experimentally validated examples. Methodology/Principal Findings We discuss here a simple approach to search for regulatory sequences based on the compositional similarity of genomic regions and known cis-regulatory sequences. This method, which is not limited to searching for predefined motifs, recovers sequences known to be under similar regulatory control. The words shared by the recovered sequences often correspond to known binding sites. Furthermore, we show that although local word profile clustering is predictive for the regulatory sequences involved in blastoderm segmentation, local dissimilarity is a more universal feature of known regulatory sequences in Drosophila. Conclusions/Significance Our method leverages sequence motifs within a known regulatory sequence to identify co-regulated sequences without explicitly defining binding sites. We also show that regulatory sequences can be distinguished from surrounding sequences by local sequence dissimilarity, a novel feature in identifying regulatory sequences across a genome. Source code for WPH-finder is available for download at http://rana.lbl.gov/downloads/wph.tar.gz.
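The idea of searching a genomic region for windows whose word composition resembles a known regulatory sequence can be sketched as follows. This is not the WPH-finder implementation: the word length, the overlap score (the sum of the smaller of the two frequencies for each shared word), and the names `word_profile`, `profile_overlap`, and `scan` are all assumptions chosen for illustration.

```python
from collections import Counter


def word_profile(seq, k=3):
    """Relative frequencies of overlapping words of length k in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def profile_overlap(p, q):
    """Shared frequency mass of two profiles: 1.0 if identical, 0.0 if disjoint."""
    return sum(min(p[w], q[w]) for w in p.keys() & q.keys())


def scan(genome, reference, window, step=1, k=3):
    """Score every window of the genome against the reference word profile.

    Returns a list of (start, score) pairs, one per window.
    """
    ref = word_profile(reference, k)
    return [
        (start, profile_overlap(ref, word_profile(genome[start:start + window], k)))
        for start in range(0, len(genome) - window + 1, step)
    ]
```

A call such as `max(scan(genome, reference, window=500), key=lambda hit: hit[1])` would return the start position and score of the most similar window; in practice the scores would need to be compared against a background distribution before calling any window a candidate.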
|
21
|
Wilczynski B, Dojer N, Patelak M, Tiuryn J. Finding evolutionarily conserved cis-regulatory modules with a universal set of motifs. BMC Bioinformatics 2009; 10:82. [PMID: 19284541 PMCID: PMC2669485 DOI: 10.1186/1471-2105-10-82] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 08/26/2008] [Accepted: 03/10/2009] [Indexed: 12/29/2022]
Abstract
BACKGROUND Finding functional regulatory elements in DNA sequences is a very important problem in computational biology, and providing a reliable algorithm for this task would be a major step towards understanding regulatory mechanisms on a genome-wide scale. Major obstacles in this respect are that the amount of non-coding DNA is vast and that the methods for predicting functional transcription factor binding sites tend to produce results with a high percentage of false positives. This makes the problem of finding regions significantly enriched in binding sites difficult. RESULTS We develop a novel method for predicting regulatory regions in DNA sequences, which is designed to exploit the evolutionary conservation of regulatory elements between species without assuming that the order of motifs is preserved across species. We have implemented our method and tested its predictive abilities on various datasets from different organisms. CONCLUSION We show that our approach enables us to find a majority of the known CRMs using only sequence information from different species together with currently publicly available motif data. Also, our method is robust enough to perform well in predicting CRMs, despite differences in tissue specificity and even across species, provided that the evolutionary distances between compared species do not change substantially. The complexity of the proposed algorithm is polynomial, and the observed running times show that it may be readily applied.
|
22
|
Kuzin A, Kundu M, Ekatomatis A, Brody T, Odenwald WF. Conserved sequence block clustering and flanking inter-cluster flexibility delineate enhancers that regulate nerfin-1 expression during Drosophila CNS development. Gene Expr Patterns 2008; 9:65-72. [PMID: 19056518 DOI: 10.1016/j.gep.2008.10.005] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 08/03/2008] [Revised: 10/21/2008] [Accepted: 10/23/2008] [Indexed: 10/21/2022]
Abstract
We have identified clusters of conserved sequences constituting discrete modular enhancers within the Drosophila nerfin-1 locus. nerfin-1 encodes a Zn-finger transcription factor that directs pioneer interneuron axon guidance. nerfin-1 mRNA is detected in many early delaminating neuroblasts, ganglion mother cells and transiently in nascent neurons. The comparative genomics analysis program EvoPrinter revealed conserved sequence blocks both upstream and downstream of the transcribed region. By using the aligning regions of different drosophilids as the reference DNA, EvoPrinter detects sequence length flexibility between clusters of conserved sequences and thus facilitates differentiation between closely associated modular enhancers. Expression analysis of enhancer-reporter transgenes identified enhancers that drive expression in different regions of the developing embryonic and adult nervous system, including subsets of embryonic CNS neuroblasts, GMCs, neurons and PNS neurons. In summary, EvoPrinter facilitates the discovery and analysis of enhancers that control crucial aspects of nerfin-1 expression.
Affiliation(s)
- Alexander Kuzin
- Neural Cell-Fate Determinants Section, NINDS, NIH, Bethesda, MD, USA.
|
23
|
Xie D, Cai J, Chia NY, Ng HH, Zhong S. Cross-species de novo identification of cis-regulatory modules with GibbsModule: application to gene regulation in embryonic stem cells. Genome Res 2008; 18:1325-35. [PMID: 18490265 DOI: 10.1101/gr.072769.107] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Indexed: 12/26/2022]
Abstract
We introduce the GibbsModule algorithm for de novo detection of cis-regulatory motifs and modules in eukaryote genomes. GibbsModule models the coexpressed genes within one species as sharing a core cis-regulatory motif and each homologous gene group as sharing a homologous cis-regulatory module (CRM), characterized by a similar composition of motifs. Without using a predetermined alignment result, GibbsModule iteratively updates the core motif shared by coexpressed genes and traces the homologous CRMs that contain the core motif. GibbsModule achieved substantial improvements in both precision and recall as compared with peer algorithms on a number of synthetic and real data sets. Applying GibbsModule to analyze the binding regions of the Krüppel-like factor (KLF) transcription factor in embryonic stem cells (ESCs), we discovered a motif that differs from a previously published KLF motif identified by a SELEX experiment, but the new motif is consistent with mutagenesis analysis. The SOX2 motif was found to be a collaborating motif to the KLF motif in ESCs. We used quantitative chromatin immunoprecipitation (ChIP) analysis to test whether GibbsModule could distinguish functional and nonfunctional binding sites. All seven tested binding sites in GibbsModule-predicted CRMs had higher ChIP signals as compared with the other seven tested binding sites located outside of predicted CRMs. GibbsModule is available at (http://biocomp.bioen.uiuc.edu/GibbsModule).
Affiliation(s)
- Dan Xie
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
|
24
|
Ivan A, Halfon MS, Sinha S. Computational discovery of cis-regulatory modules in Drosophila without prior knowledge of motifs. Genome Biol 2008; 9:R22. [PMID: 18226245 PMCID: PMC2395258 DOI: 10.1186/gb-2008-9-1-r22] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.4] [Received: 09/12/2007] [Revised: 12/18/2007] [Accepted: 01/28/2008] [Indexed: 01/01/2023]
Abstract
Prediction of cis-regulatory modules ab initio, without any input of relevant motifs, is achieved with two novel methods. We consider the problem of predicting cis-regulatory modules without knowledge of motifs. We formulate this problem in a pragmatic setting, and create over 30 new data sets, using Drosophila modules, to use as a 'benchmark'. We propose two new methods for the problem, and evaluate these, as well as two existing methods, on our benchmark. We find that the challenge of predicting cis-regulatory modules ab initio, without any input of relevant motifs, is a realizable goal.
Affiliation(s)
- Andra Ivan
- Department of Computer Science and Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
|