1
Karollus A, Hingerl J, Gankin D, Grosshauser M, Klemon K, Gagneur J. Species-aware DNA language models capture regulatory elements and their evolution. Genome Biol 2024; 25:83. [PMID: 38566111 PMCID: PMC10985990 DOI: 10.1186/s13059-024-03221-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 03/20/2024] [Indexed: 04/04/2024]
Abstract
BACKGROUND: The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, new algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution. RESULTS: Here, we introduce species-aware DNA language models, which we trained on more than 800 species spanning over 500 million years of evolution. Investigating their ability to predict masked nucleotides from context, we show that DNA language models distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. Owing to their flexibility, DNA language models capture conserved regulatory elements over much further evolutionary distances than sequence alignment would allow. Remarkably, DNA language models reconstruct motif instances bound in vivo better than unbound ones and account for the evolution of motif sequences and their positional constraints, showing that these models capture functional high-order sequence and evolutionary context. We further show that species-aware training yields improved sequence representations for endogenous and MPRA-based gene expression prediction, as well as motif discovery. CONCLUSIONS: Collectively, these results demonstrate that species-aware DNA language models are a powerful, flexible, and scalable tool to integrate information from large compendia of highly diverged genomes.
Affiliation(s)
- Alexander Karollus
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Johannes Hingerl
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Dennis Gankin
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Martin Grosshauser
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Kristian Klemon
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Julien Gagneur
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Munich Center for Machine Learning, Munich, Germany
  - Institute of Human Genetics, School of Medicine and Health, Technical University of Munich, Munich, Germany
  - Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
  - Munich Data Science Institute, Technical University of Munich, Garching, Germany
2
Reb1, Cbf1, and Pho4 bias histone sliding and deposition away from their binding sites. Mol Cell Biol 2021; 42:e0047221. [PMID: 34898278 DOI: 10.1128/mcb.00472-21] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/20/2022]
Abstract
In transcriptionally active genes, nucleosome positions in promoters are regulated by nucleosome displacing factors (NDFs) and chromatin remodeling enzymes. Depletion of NDFs or the RSC chromatin remodeler shrinks or abolishes the nucleosome depleted regions (NDRs) in promoters, which can suppress gene activation and result in cryptic transcription. Despite their vital cellular functions, how the action of chromatin remodelers may be directly affected by site-specific binding factors like NDFs is poorly understood. Here we demonstrate that two NDFs, Reb1 and Cbf1, can direct both Chd1 and RSC chromatin remodeling enzymes in vitro, stimulating repositioning of the histone core away from their binding sites. Interestingly, although the Pho4 transcription factor had a much weaker effect on nucleosome positioning, both NDFs and Pho4 were able to similarly redirect positioning of hexasomes. In chaperone-mediated nucleosome assembly assays, Reb1 but not Pho4 showed an ability to block deposition of the histone H3/H4 tetramer, but Reb1 did not block addition of the H2A/H2B dimer to hexasomes. Our in vitro results show that NDFs bias the action of remodelers to increase the length of the free DNA in the vicinity of their binding sites. These results suggest that NDFs could directly affect NDR architecture through chromatin remodelers.
3
Lieberman-Lazarovich M, Yahav C, Israeli A, Efroni I. Deep Conservation of cis-Element Variants Regulating Plant Hormonal Responses. Plant Cell 2019; 31:2559-2572. [PMID: 31467248 PMCID: PMC6881130 DOI: 10.1105/tpc.19.00129] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 02/25/2019] [Accepted: 08/27/2019] [Indexed: 05/14/2023]
Abstract
Phytohormones regulate many aspects of plant life by activating transcription factors (TFs) that bind sequence-specific response elements (REs) in regulatory regions of target genes. Despite their short length, REs are degenerate, with a core of just 3 to 4 bp. This degeneracy is paradoxical, as it reduces specificity and REs are extremely common in the genome. To study whether RE degeneracy might serve a biological function, we developed an algorithm for the detection of regulatory sequence conservation and applied it to phytohormone REs in 45 angiosperms. Surprisingly, we found that specific RE variants are highly conserved in core hormone response genes. Experimental evidence showed that specific variants act to regulate the magnitude and spatial profile of hormonal response in Arabidopsis (Arabidopsis thaliana) and tomato (Solanum lycopersicum). Our results suggest that hormone-regulated TFs bind a spectrum of REs, each coding for a distinct transcriptional response profile. Our approach has implications for precise genome editing and for rational promoter design.
Affiliation(s)
- Michal Lieberman-Lazarovich
  - Institute of Plant Sciences and Genetics in Agriculture, The Robert H. Smith Faculty of Agriculture, The Hebrew University, Rehovot 7610001, Israel
- Chen Yahav
  - Institute of Plant Sciences and Genetics in Agriculture, The Robert H. Smith Faculty of Agriculture, The Hebrew University, Rehovot 7610001, Israel
- Alon Israeli
  - Institute of Plant Sciences and Genetics in Agriculture, The Robert H. Smith Faculty of Agriculture, The Hebrew University, Rehovot 7610001, Israel
- Idan Efroni
  - Institute of Plant Sciences and Genetics in Agriculture, The Robert H. Smith Faculty of Agriculture, The Hebrew University, Rehovot 7610001, Israel
4
Song J, Bjarnason J, Surette MG. The identification of functional motifs in temporal gene expression analysis. Evol Bioinform Online 2017. [DOI: 10.1177/117693430500100008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
The identification of transcription factor binding sites is essential to the understanding of the regulation of gene expression and the reconstruction of genetic regulatory networks. The in silico identification of cis-regulatory motifs is challenging due to sequence variability and lack of sufficient data to generate consensus motifs that are of quantitative or even qualitative predictive value. To determine functional motifs in gene expression, we propose a strategy that adopts the false discovery rate (FDR) and estimates motif effects to evaluate combinatorial analysis of motif candidates and temporal gene expression data. The method decreases the number of predicted motifs, which can then be confirmed by genetic analysis. To assess the method, we used simulated motif/expression data to evaluate parameters. We applied this approach to experimental data for a group of iron-responsive genes in Salmonella typhimurium 14028S. The method identified known and potentially new ferric-uptake regulator (Fur) binding sites. In addition, we identified uncharacterized functional motif candidates that correlated with specific patterns of expression. SAS code for the simulation and analysis of gene expression data is available from the first author upon request.
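As an illustration of how FDR control can shrink a list of motif candidates, the sketch below applies the Benjamini-Hochberg step-up procedure in Python. This is an assumption made for illustration: the abstract does not say which FDR procedure the authors used, their implementation is in SAS, and the motif names and p-values here are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha.

    Step-up rule: find the largest rank k (1-based, p-values sorted
    ascending) with p_(k) <= k / m * alpha and reject ranks 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])


# Hypothetical p-values for six motif candidates (not from the paper).
motif_p = {"m1": 0.001, "m2": 0.008, "m3": 0.039,
           "m4": 0.041, "m5": 0.20, "m6": 0.74}
names = list(motif_p)
kept = [names[i] for i in benjamini_hochberg(list(motif_p.values()))]
print(kept)
```

Only the candidates that survive the step-up threshold would be carried forward to genetic confirmation, which is the sense in which the method "decreases the number of predicted motifs".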
Affiliation(s)
- Jiuzhou Song
  - Department of Animal and Avian Sciences, and University of Maryland, Maryland 20742, USA
- Jaime Bjarnason
  - Department of Microbiology and Infectious Diseases, and Department of Biochemistry and Molecular Biology, Health Sciences Centre, University of Calgary, Calgary, AB, Canada, T2N 4N1
- Michael G. Surette
  - Department of Microbiology and Infectious Diseases, and Department of Biochemistry and Molecular Biology, Health Sciences Centre, University of Calgary, Calgary, AB, Canada, T2N 4N1
5
Ho MCW, Quintero-Cadena P, Sternberg PW. Genome-wide discovery of active regulatory elements and transcription factor footprints in Caenorhabditis elegans using DNase-seq. Genome Res 2017; 27:2108-2119. [PMID: 29074739 PMCID: PMC5741056 DOI: 10.1101/gr.223735.117] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 04/06/2017] [Accepted: 10/18/2017] [Indexed: 12/23/2022]
Abstract
Deep sequencing of size-selected DNase I–treated chromatin (DNase-seq) allows high-resolution measurement of chromatin accessibility to DNase I cleavage, permitting identification of de novo active cis-regulatory modules (CRMs) and individual transcription factor (TF) binding sites. We adapted DNase-seq to nuclei isolated from C. elegans embryos and L1 arrest larvae to generate high-resolution maps of TF binding. Over half of embryonic DNase I hypersensitive sites (DHSs) were annotated as noncoding, with 24% in intergenic, 12% in promoters, and 28% in introns, with similar statistics observed in L1 arrest larvae. Noncoding DHSs are highly conserved and enriched in marks of enhancer activity and transcription. We validated noncoding DHSs against known enhancers from myo-2, myo-3, hlh-1, elt-2, and lin-26/lir-1 and recapitulated 15 of 17 known enhancers. We then mined DNase-seq data to identify putative active CRMs and TF footprints. Using DNase-seq data improved predictions of tissue-specific expression compared with motifs alone. In a pilot functional test, 10 of 15 DHSs from pha-4, icl-1, and ceh-13 drove reporter gene expression in transgenic C. elegans. Overall, we provide experimental annotation of 26,644 putative CRMs in the embryo containing 55,890 TF footprints, as well as 15,841 putative CRMs in the L1 arrest larvae containing 32,685 TF footprints.
Affiliation(s)
- Margaret C W Ho
  - Division of Biology and Bioengineering, Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California 91125, USA
- Porfirio Quintero-Cadena
  - Division of Biology and Bioengineering, Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California 91125, USA
- Paul W Sternberg
  - Division of Biology and Bioengineering, Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California 91125, USA
6
Trescher S, Münchmeyer J, Leser U. Estimating genome-wide regulatory activity from multi-omics data sets using mathematical optimization. BMC Syst Biol 2017; 11:41. [PMID: 28347313 PMCID: PMC5369021 DOI: 10.1186/s12918-017-0419-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 07/16/2016] [Accepted: 03/08/2017] [Indexed: 12/28/2022]
Abstract
Background: Gene regulation is one of the most important cellular processes, indispensable for the adaptability of organisms and closely interlinked with several classes of pathogenesis and their progression. Elucidation of regulatory mechanisms can be approached by a multitude of experimental methods, yet integration of the resulting heterogeneous, large, and noisy data sets into comprehensive and tissue or disease-specific cellular models requires rigorous computational methods. Recently, several algorithms have been proposed which model genome-wide gene regulation as sets of (linear) equations over the activity and relationships of transcription factors, genes and other factors. Subsequent optimization finds those parameters that minimize the divergence of predicted and measured expression intensities. In various settings, these methods produced promising results in terms of estimating transcription factor activity and identifying key biomarkers for specific phenotypes. However, despite their common root in mathematical optimization, they vastly differ in the types of experimental data being integrated, the background knowledge necessary for their application, the granularity of their regulatory model, the concrete paradigm used for solving the optimization problem and the data sets used for evaluation. Results: Here, we review five recent methods of this class in detail and compare them with respect to several key properties. Furthermore, we quantitatively compare the results of four of the presented methods based on publicly available data sets. Conclusions: The results show that all methods seem to find biologically relevant information. However, we also observe that the mutual result overlaps are very low, which contradicts biological intuition. Our aim is to raise further awareness of the power of these methods, yet also to identify common shortcomings and necessary extensions enabling focused research on the critical points.
Electronic supplementary material The online version of this article (doi:10.1186/s12918-017-0419-z) contains supplementary material, which is available to authorized users.
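The general idea described in this abstract, modelling expression as linear equations over transcription factor activity and then minimizing the gap between predicted and measured expression, can be sketched with ordinary least squares. The code below is a minimal illustration under stated assumptions, not any of the five reviewed methods: the connectivity matrix `C`, the activity vector, and the function names are all hypothetical, and real methods add regularization, further data types, and different solvers.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: activities not identifiable")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def estimate_tf_activity(connectivity, expression):
    """Least-squares activities a minimizing ||connectivity @ a - expression||^2.

    connectivity[g][t] is the assumed effect of TF t on gene g;
    expression[g] is the measured expression of gene g in one sample.
    Solved through the normal equations (C^T C) a = C^T e.
    """
    n_genes, n_tfs = len(connectivity), len(connectivity[0])
    ctc = [[sum(connectivity[g][i] * connectivity[g][j] for g in range(n_genes))
            for j in range(n_tfs)] for i in range(n_tfs)]
    cte = [sum(connectivity[g][i] * expression[g] for g in range(n_genes))
           for i in range(n_tfs)]
    return solve_linear(ctc, cte)


# Hypothetical network: 4 genes regulated by 2 TFs (+1 activates, -1 represses).
C = [[1.0, 0.0],
     [1.0, 1.0],
     [0.0, 1.0],
     [1.0, -1.0]]
true_activity = [2.0, -0.5]
E = [sum(c * a for c, a in zip(row, true_activity)) for row in C]
activity = estimate_tf_activity(C, E)
print(activity)
```

With noise-free synthetic expression the fit recovers the activities used to generate the data; with measured data the residual quantifies how much expression the assumed network leaves unexplained.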
Affiliation(s)
- Saskia Trescher
  - Knowledge Management in Bioinformatics, Computer Science Department, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany
- Jannes Münchmeyer
  - Knowledge Management in Bioinformatics, Computer Science Department, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany
- Ulf Leser
  - Knowledge Management in Bioinformatics, Computer Science Department, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany
7
Gerovska D, Araúzo-Bravo MJ. Does mouse embryo primordial germ cell activation start before implantation as suggested by single-cell transcriptomics dynamics? Mol Hum Reprod 2016; 22:208-25. [PMID: 26740066 DOI: 10.1093/molehr/gav072] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 05/20/2015] [Accepted: 12/07/2015] [Indexed: 12/19/2022]
Abstract
STUDY HYPOTHESIS Does primordial germ cell (PGC) activation start before mouse embryo implantation, and does the possible regulation of the DNA (cytosine-5-)-methyltransferase 3-like (Dnmt3l) by transcription factor AP-2, gamma (TCFAP2C) have a role in this activation and in the primitive endoderm (PE)-epiblast (EPI) lineage specification? STUDY FINDING A burst of expression of PGC markers, such as Dppa3/Stella, Ifitm2/Fragilis, Fkbp6 and Prdm4, is observed from embryonic day (E) 3.25, and some of them, together with the late germ cell markers Zp3, Mcf2 and Morc1, become restricted to the EPI subpopulation at E4.5, while the dynamics analysis of the PE-EPI transitions in the single-cell data suggests that TCFAP2C transitorily represses Dnmt3l in EPI cells at E3.5 and such repression is withdrawn with reactivation of Dnmt3l expression in PE and EPI cells at E4.5. WHAT IS KNOWN ALREADY In the mouse preimplantation embryo, cells with the same phenotype take different fates based on the orchestration between topological clues (cell polarity, positional history and division orientation) and gene regulatory rules (at transcriptomics and epigenomics level), prompting the proposal of positional, stochastic and combined models explaining the specification mechanism. PGC specification starts at E6.0-6.5 post-implantation. In view of the important role of DNA methylation in developmental events, the cross-talk between some transcription factors and DNA methyltransferases is of particular relevance. TCFAP2C has a CpG DNA methylation motif that is not methylated in pluripotent cells and that could potentially bind on DNMT3L, the stimulatory DNA methyltransferase co-factor that assists in the process of de novo DNA methylation. Chromatin-immunoprecipitation analysis has demonstrated that Dnmt3l is indeed a target of TCFAP2C. 
STUDY DESIGN, SAMPLES/MATERIALS, METHODS We aimed to assess the timing of early preimplantation events and to understand better the segregation of the inner cell mass (ICM) into PE and EPI. We designed a single-cell transcriptomics dynamics computational study to identify markers of the PE-EPI bifurcation in ICM cells through searching for statistically significant (using the Student's t-test method) differentially expressed genes (DEGs) between PE and EPI cells from E3.5 to E4.5. The DEGs common for E3.5 and E4.5 were used as the markers defining the steady states. We collected microarray and next-generation sequencing transcriptomics data from public databases from bulk populations and single cells from mice at E3.25, E3.5 and E4.5. The results are based on three independent single-cell transcriptomics data sets, with a fold change of 3 and P-value <0.01 for the DEG selection. MAIN RESULTS AND THE ROLE OF CHANCE The dynamics analysis revealed new transitory E3.5 and steady PE and EPI markers. Among the transitory E3.5 PE markers (Dnmt3l, Dusp4, Cpne8, Akap13, Dcaf12l1, Aaed1, B4galt6, BC100530, Rnpc3, Tfpi, Lgalsl, Ckap4 and Fbxl20), several (Dusp4, Akap13, Cpne8, Dcaf12l1 and Tfpi) are related to the extracellular regulated kinase pathway. We also identified new transitory E3.5 EPI markers (Sgk1, Mal, Ubxn2a, Atg16l2, Gm13102, Tcfap2c, Hexb, Slc1a1, Svip, Liph and Mier3), six new stable PE markers (Sdc4, Cpn1, Dkk1, Havcr1, F2r/Par1 and Slc7a6os) as well as three new stable EPI markers (Zp3, Mcf2 and Hexb), which are known to be late stage germ cell markers. We found that mouse PGC marker activation starts at least at E3.25 preimplantation. The transcriptomics dynamics analyses support the regulation of Dnmt3l expression by TCFAP2C. LIMITATIONS, REASONS FOR CAUTION Since the regulation of Dnmt3l by TCFAP2C is based on computational prediction of DNA methylation motifs, ChIP-seq and transcriptomics data, functional studies are required to validate this result.
WIDER IMPLICATIONS OF THE FINDINGS We identified a collection of previously undescribed E3.5-specific PE and EPI markers, and new steady PE and EPI markers. Identification of these genes, many of which encode cell membrane proteins, will facilitate the isolation and characterization of early PE and EPI populations. Since it is so well established in the literature that mouse PGC specification is a post-implantation event, it was surprising for us to see activation of PGC markers as early as E3.25 preimplantation, and identify the newly found steady EPI markers as late germ cell markers. The discovery of such early activation of PGC markers has important implications in the derivation of germ cells from pluripotent cells (embryonic stem cells or induced pluripotent stem cells), since the initial stages of such derivation resemble early development. The early activation of PGC markers points out the difficulty of separating PGC cells from pluripotent populations. Collectively, our results suggest that the combining of the precision of single-cell omics data with dynamic analysis of time-series data can establish the timing of some developmental stages as earlier than previously thought. LARGE-SCALE DATA Not applicable. STUDY FUNDING AND COMPETING INTERESTS This work was supported by grants DFG15/14 and DFG15/020 from Diputación Foral de Gipuzkoa (Spain), and grant II14/00016 from I + D + I National Plan 2013-2016 (Spain) and FEDER funds. The authors declare no conflict of interest.
Affiliation(s)
- Daniela Gerovska
  - Group of Computational Biology and Systems Biomedicine, Biodonostia Health Research Institute, Calle Doctor Beguiristain s/n, 20014 San Sebastián - Donostia, Spain
- Marcos J Araúzo-Bravo
  - Group of Computational Biology and Systems Biomedicine, Biodonostia Health Research Institute, Calle Doctor Beguiristain s/n, 20014 San Sebastián - Donostia, Spain
  - IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
8
Hogan GJ, Brown PO, Herschlag D. Evolutionary Conservation and Diversification of Puf RNA Binding Proteins and Their mRNA Targets. PLoS Biol 2015; 13:e1002307. [PMID: 26587879 PMCID: PMC4654594 DOI: 10.1371/journal.pbio.1002307] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Received: 04/03/2015] [Accepted: 10/23/2015] [Indexed: 12/31/2022]
Abstract
Reprogramming of a gene’s expression pattern by acquisition and loss of sequences recognized by specific regulatory RNA binding proteins may be a major mechanism in the evolution of biological regulatory programs. We identified that RNA targets of Puf3 orthologs have been conserved over 100–500 million years of evolution in five eukaryotic lineages. Focusing on Puf proteins and their targets across 80 fungi, we constructed a parsimonious model for their evolutionary history. This model entails extensive and coordinated changes in the Puf targets as well as changes in the number of Puf genes and alterations of RNA binding specificity including that: 1) Binding of Puf3 to more than 200 RNAs whose protein products are predominantly involved in the production and organization of mitochondrial complexes predates the origin of budding yeasts and filamentous fungi and was maintained for 500 million years, throughout the evolution of budding yeast. 2) In filamentous fungi, remarkably, more than 150 of the ancestral Puf3 targets were gained by Puf4, with one lineage maintaining both Puf3 and Puf4 as regulators and a sister lineage losing Puf3 as a regulator of these RNAs. The decrease in gene expression of these mRNAs upon deletion of Puf4 in filamentous fungi (N. crassa) in contrast to the increase upon Puf3 deletion in budding yeast (S. cerevisiae) suggests that the output of the RNA regulatory network is different with Puf4 in filamentous fungi than with Puf3 in budding yeast. 3) The coregulated Puf4 target set in filamentous fungi expanded to include mitochondrial genes involved in the tricarboxylic acid (TCA) cycle and other nuclear-encoded RNAs with mitochondrial function not bound by Puf3 in budding yeast, observations that provide additional evidence for substantial rewiring of post-transcriptional regulation. 
4) Puf3 also expanded and diversified its targets in filamentous fungi, gaining interactions with the mRNAs encoding the mitochondrial electron transport chain (ETC) complex I as well as hundreds of other mRNAs with nonmitochondrial functions. The many concerted and conserved changes in the RNA targets of Puf proteins strongly support an extensive role of RNA binding proteins in coordinating gene expression, as originally proposed by Keene. Rewiring of Puf-coordinated mRNA targets and transcriptional control of the same genes occurred at different points in evolution, suggesting that there have been distinct adaptations via RNA binding proteins and transcription factors. The changes in Puf targets and in the Puf proteins indicate an integral involvement of RNA binding proteins and their RNA targets in the adaptation, reprogramming, and function of gene expression. A map of the evolutionary history of Puf proteins and their RNA targets shows that reprogramming of global gene expression programs via adaptive mutations that affect protein-RNA interactions is an important source of biological diversity. We set out to trace the evolutionary history of an RNA binding protein and how its interactions with targets change over evolution. Identifying this natural history is a step toward understanding the critical differences between organisms and how gene expression programs are rewired during evolution. Using bioinformatics and experimental approaches, we broadly surveyed the evolution of binding targets of a particular family of RNA binding proteins—the Puf proteins, whose protein sequences and target RNA sequences are relatively well-characterized—across 99 eukaryotic species. We found five groups of species in which targets have been conserved for at least 100 million years and then took advantage of genome sequences from a large number of fungal species to deeply investigate the conservation and changes in Puf proteins and their RNA targets. 
Our analyses identified multiple and extensive reconfigurations during the natural history of fungi and suggest that RNA binding proteins and their RNA targets are profoundly involved in evolutionary reprogramming of gene expression and help define distinct programs unique to each organism. Continuing to uncover the natural history of RNA binding proteins and their interactions will provide a unique window into the gene expression programs of present day species and point to new ways to engineer gene expression programs.
Affiliation(s)
- Gregory J. Hogan
  - Department of Biochemistry, Stanford University School of Medicine, Stanford, California, United States of America
  - Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, California, United States of America
- Patrick O. Brown
  - Department of Biochemistry, Stanford University School of Medicine, Stanford, California, United States of America
  - Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, California, United States of America
- Daniel Herschlag
  - Department of Biochemistry, Stanford University School of Medicine, Stanford, California, United States of America
  - Department of Chemistry, Stanford University, Stanford, California, United States of America
  - Department of Chemical Engineering, Stanford University, Stanford, California, United States of America
  - ChEM-H Institute, Stanford University, Stanford, California, United States of America
9
De Witte D, Van de Velde J, Decap D, Van Bel M, Audenaert P, Demeester P, Dhoedt B, Vandepoele K, Fostier J. BLSSpeller: exhaustive comparative discovery of conserved cis-regulatory elements. Bioinformatics 2015; 31:3758-66. [PMID: 26254488 PMCID: PMC4653392 DOI: 10.1093/bioinformatics/btv466] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 09/09/2014] [Accepted: 08/03/2015] [Indexed: 11/14/2022]
Abstract
MOTIVATION: The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected. RESULTS: We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O. sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z. mays. AVAILABILITY AND IMPLEMENTATION: BLSSpeller was written in Java. Source code and manual are available at http://bioinformatics.intec.ugent.be/blsspeller CONTACT: Klaas.Vandepoele@psb.vib-ugent.be or jan.fostier@intec.ugent.be. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
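The alignment-free screening step described above can be sketched as follows: match a word over the IUPAC alphabet against each orthologous promoter, then score conservation as the total branch length of the smallest subtree connecting the species that contain a match. This is an illustrative Python sketch, not the BLSSpeller implementation (which is written in Java on MapReduce). The abstract does not define the branch length score, so the definition used here is an assumption, and the tree, branch lengths, promoter sequences, and function names are all hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(word):
    """Compile an IUPAC word into a regular expression over A/C/G/T."""
    return re.compile("".join(f"[{IUPAC[ch]}]" for ch in word))


def branch_length_score(parent, length, species_with_motif):
    """Total branch length of the minimal subtree connecting the given leaves.

    parent maps each non-root node to its parent; length maps each
    non-root node to the length of the branch above it.
    """
    hits = set(species_with_motif)
    below = {}  # node -> number of motif-containing leaves at or below it
    for leaf in hits:
        node = leaf
        while node is not None:
            below[node] = below.get(node, 0) + 1
            node = parent.get(node)
    # The branch above a node is on a connecting path only if it separates
    # the hit species into two non-empty groups.
    return sum(length[n] for n, k in below.items() if 0 < k < len(hits))


# Hypothetical four-species tree: ((osa, bdi)anc1, (zma, sbi)anc2)root
parent = {"osa": "anc1", "bdi": "anc1", "zma": "anc2", "sbi": "anc2",
          "anc1": "root", "anc2": "root"}
length = {"osa": 0.2, "bdi": 0.3, "zma": 0.25, "sbi": 0.15,
          "anc1": 0.1, "anc2": 0.1}

# Hypothetical orthologous promoter fragments.
promoters = {"osa": "AATGACGTT", "bdi": "CCTGATATT",
             "zma": "GGTGAGGCC", "sbi": "TTTGACATT"}

pattern = iupac_to_regex("TGAYR")
conserved_in = sorted(sp for sp, seq in promoters.items() if pattern.search(seq))
bls = branch_length_score(parent, length, conserved_in)
print(conserved_in, bls / sum(length.values()))
```

Dividing by the total tree length, as in the last line, expresses the score as a fraction of the full phylogeny, so a motif found in every species scores 1 and a motif found in a single species scores 0.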
Affiliation(s)
- Dieter De Witte
  - Department of Information Technology (INTEC), Ghent University-iMinds, Ghent, Belgium
- Jan Van de Velde
  - Department of Plant Systems Biology, VIB and Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Dries Decap
  - Department of Information Technology (INTEC), Ghent University-iMinds, Ghent, Belgium
- Michiel Van Bel
  - Department of Plant Systems Biology, VIB and Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Pieter Audenaert
  - Department of Information Technology (INTEC), Ghent University-iMinds, Ghent, Belgium
- Piet Demeester
  - Department of Information Technology (INTEC), Ghent University-iMinds, Ghent, Belgium
- Bart Dhoedt
  - Department of Information Technology (INTEC), Ghent University-iMinds, Ghent, Belgium
- Klaas Vandepoele
  - Department of Plant Systems Biology, VIB and Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Jan Fostier
  - Department of Information Technology (INTEC), Ghent University-iMinds, Ghent, Belgium
10
Maier EJ, Haynes BC, Gish SR, Wang ZA, Skowyra ML, Marulli AL, Doering TL, Brent MR. Model-driven mapping of transcriptional networks reveals the circuitry and dynamics of virulence regulation. Genome Res 2015; 25:690-700. [PMID: 25644834 PMCID: PMC4417117 DOI: 10.1101/gr.184101.114] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 09/09/2014] [Accepted: 01/15/2015] [Indexed: 01/09/2023]
Abstract
Key steps in understanding a biological process include identifying genes that are involved and determining how they are regulated. We developed a novel method for identifying transcription factors (TFs) involved in a specific process and used it to map regulation of the key virulence factor of a deadly fungus—its capsule. The map, built from expression profiles of 41 TF mutants, includes 20 TFs not previously known to regulate virulence attributes. It also reveals a hierarchy comprising executive, midlevel, and “foreman” TFs. When grouped by temporal expression pattern, these TFs explain much of the transcriptional dynamics of capsule induction. Phenotypic analysis of TF deletion mutants revealed complex relationships among virulence factors and virulence in mice. These resources and analyses provide the first integrated, systems-level view of capsule regulation and biosynthesis. Our methods dramatically improve the efficiency with which transcriptional networks can be analyzed, making genomic approaches accessible to laboratories focused on specific physiological processes.
Affiliation(s)
- Ezekiel J Maier
- Center for Genome Sciences and Systems Biology, Washington University in St. Louis, St. Louis, Missouri 63108, USA; Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, Missouri 63130, USA
- Brian C Haynes
- Center for Genome Sciences and Systems Biology, Washington University in St. Louis, St. Louis, Missouri 63108, USA; Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, Missouri 63130, USA
- Stacey R Gish
- Department of Molecular Microbiology, Washington University in St. Louis School of Medicine, St. Louis, Missouri 63110, USA
- Zhuo A Wang
- Department of Molecular Microbiology, Washington University in St. Louis School of Medicine, St. Louis, Missouri 63110, USA
- Michael L Skowyra
- Department of Molecular Microbiology, Washington University in St. Louis School of Medicine, St. Louis, Missouri 63110, USA
- Alyssa L Marulli
- Department of Molecular Microbiology, Washington University in St. Louis School of Medicine, St. Louis, Missouri 63110, USA
- Tamara L Doering
- Department of Molecular Microbiology, Washington University in St. Louis School of Medicine, St. Louis, Missouri 63110, USA
- Michael R Brent
- Center for Genome Sciences and Systems Biology, Washington University in St. Louis, St. Louis, Missouri 63108, USA; Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, Missouri 63130, USA; Department of Genetics, Washington University in St. Louis School of Medicine, St. Louis, Missouri 63110, USA
11
|
Lorch Y, Maier-Davis B, Kornberg RD. Role of DNA sequence in chromatin remodeling and the formation of nucleosome-free regions. Genes Dev 2015; 28:2492-7. [PMID: 25403179 PMCID: PMC4233242 DOI: 10.1101/gad.250704.114] [Citation(s) in RCA: 83] [Impact Index Per Article: 9.2] [Indexed: 11/24/2022]
Abstract
AT-rich DNA is concentrated in the nucleosome-free regions (NFRs) associated with transcription start sites of most genes. We tested the hypothesis that AT-rich DNA engenders NFR formation by virtue of its rigidity and consequent exclusion of nucleosomes. We found that the AT-rich sequences present in many NFRs have little effect on the stability of nucleosomes. Rather, these sequences facilitate the removal of nucleosomes by the RSC chromatin remodeling complex. RSC activity is stimulated by AT-rich sequences in nucleosomes and inhibited by competition with AT-rich DNA. RSC may remove NFR nucleosomes without effect on adjacent ORF nucleosomes. Our findings suggest that many NFRs are formed and maintained by an active mechanism involving the ATP-dependent removal of nucleosomes rather than a passive mechanism due to the intrinsic instability of nucleosomes on AT-rich DNA sequences.
Affiliation(s)
- Yahli Lorch
- Department of Structural Biology, Stanford University School of Medicine, Stanford, California 94305, USA
- Barbara Maier-Davis
- Department of Structural Biology, Stanford University School of Medicine, Stanford, California 94305, USA
- Roger D Kornberg
- Department of Structural Biology, Stanford University School of Medicine, Stanford, California 94305, USA
12
|
Park CY, Krishnan A, Zhu Q, Wong AK, Lee YS, Troyanskaya OG. Tissue-aware data integration approach for the inference of pathway interactions in metazoan organisms. Bioinformatics 2014; 31:1093-101. [PMID: 25431329 DOI: 10.1093/bioinformatics/btu786] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.9] [Received: 05/07/2014] [Accepted: 11/20/2014] [Indexed: 11/12/2022]
Abstract
MOTIVATION Leveraging the large compendium of genomic data to predict biomedical pathways and specific mechanisms of protein interactions genome-wide in metazoan organisms has been challenging. In contrast to unicellular organisms, biological and technical variation originating from diverse tissues and cell-lineages is often the largest source of variation in metazoan data compendia. Therefore, a new computational strategy accounting for the tissue heterogeneity in the functional genomic data is needed to accurately translate the vast amount of human genomic data into specific interaction-level hypotheses. RESULTS We developed an integrated, scalable strategy for inferring multiple human gene interaction types that takes advantage of data from diverse tissue and cell-lineage origins. Our approach specifically predicts both the presence of a functional association and also the most likely interaction type among human genes or their protein products on a whole-genome scale. We demonstrate that directly incorporating tissue contextual information improves the accuracy of our predictions, and further, that such genome-wide results can be used to significantly refine regulatory interactions from primary experimental datasets (e.g. ChIP-Seq, mass spectrometry). AVAILABILITY AND IMPLEMENTATION An interactive website hosting all of our interaction predictions is publicly available at http://pathwaynet.princeton.edu. Software was implemented using the open-source Sleipnir library, which is available for download at https://bitbucket.org/libsleipnir/libsleipnir.bitbucket.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Christopher Y Park
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA and Simons Center for Data Analysis, Simons Foundation, New York, NY, 10010, USA
- Arjun Krishnan
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA and Simons Center for Data Analysis, Simons Foundation, New York, NY, 10010, USA
- Qian Zhu
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA and Simons Center for Data Analysis, Simons Foundation, New York, NY, 10010, USA
- Aaron K Wong
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA and Simons Center for Data Analysis, Simons Foundation, New York, NY, 10010, USA
- Young-Suk Lee
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA and Simons Center for Data Analysis, Simons Foundation, New York, NY, 10010, USA
- Olga G Troyanskaya
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA and Simons Center for Data Analysis, Simons Foundation, New York, NY, 10010, USA
13
|
iRegulon: from a gene list to a gene regulatory network using large motif and track collections. PLoS Comput Biol 2014; 10:e1003731. [PMID: 25058159 PMCID: PMC4109854 DOI: 10.1371/journal.pcbi.1003731] [Citation(s) in RCA: 613] [Impact Index Per Article: 61.3] [Received: 02/13/2014] [Accepted: 05/27/2014] [Indexed: 01/17/2023] Open
Abstract
Identifying master regulators of biological processes and mapping their downstream gene networks are key challenges in systems biology. We developed a computational method, called iRegulon, to reverse-engineer the transcriptional regulatory network underlying a co-expressed gene set using cis-regulatory sequence analysis. iRegulon implements a genome-wide ranking-and-recovery approach to detect enriched transcription factor motifs and their optimal sets of direct targets. We increase the accuracy of network inference by using very large motif collections of up to ten thousand position weight matrices collected from various species, and linking these to candidate human TFs via a motif2TF procedure. We validate iRegulon on gene sets derived from ENCODE ChIP-seq data with increasing levels of noise, and we compare iRegulon with existing motif discovery methods. Next, we use iRegulon on more challenging types of gene lists, including microRNA target sets, protein-protein interaction networks, and genetic perturbation data. In particular, we over-activate p53 in breast cancer cells, followed by RNA-seq and ChIP-seq, and could identify an extensive up-regulated network controlled directly by p53. Similarly we map a repressive network with no indication of direct p53 regulation but rather an indirect effect via E2F and NFY. Finally, we generalize our computational framework to include regulatory tracks such as ChIP-seq data and show how motif and track discovery can be combined to map functional regulatory interactions among co-expressed genes. iRegulon is available as a Cytoscape plugin from http://iregulon.aertslab.org.
Gene regulatory networks control developmental, homeostatic, and disease processes by governing precise levels and spatio-temporal patterns of gene expression. Determining their topology can provide mechanistic insight into these processes. Gene regulatory networks consist of interactions between transcription factors and their direct target genes. Each regulatory interaction represents the binding of the transcription factor to a specific DNA binding site near its target gene. Here we present a computational method, called iRegulon, to identify master regulators and direct target genes in a human gene signature, i.e. a set of co-expressed genes. iRegulon relies on the analysis of the regulatory sequences around each gene in the gene set to detect enriched TF motifs or ChIP-seq peaks, using databases of nearly 10,000 TF motifs and 1000 ChIP-seq data sets or “tracks”. Next, it associates enriched motifs and tracks with candidate transcription factors and determines the optimal subset of direct target genes. We validate iRegulon on ENCODE data, and use it in combination with RNA-seq and ChIP-seq data to map a p53 downstream network with new predicted co-factors and targets. iRegulon is available as a Cytoscape plugin, supporting human, mouse, and Drosophila genes, and provides access to hundreds of cancer-related TF-target subnetworks or “regulons”.
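The ranking-and-recovery idea mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the iRegulon implementation: the function names, the 3% cutoff, and the z-score normalisation across motifs are assumptions chosen for illustration. Each motif is assumed to come with a genome-wide gene ranking (best-scoring gene first); the sketch measures how quickly the input gene set is recovered near the top of that ranking.

```python
from statistics import mean, pstdev


def recovery_auc(ranking, gene_set, top_fraction=0.03):
    """Area under the recovery curve in the top part of a ranking, scaled to [0, 1].

    ranking      -- list of gene IDs, best motif score first (hypothetical input)
    gene_set     -- set of co-expressed gene IDs to recover
    top_fraction -- share of the ranking to inspect (assumed value, not from the paper)
    """
    cutoff = max(1, int(len(ranking) * top_fraction))
    hits = 0
    area = 0
    for gene in ranking[:cutoff]:
        if gene in gene_set:
            hits += 1
        area += hits  # height of the recovery curve at this rank
    # Best possible area: every gene of the set sits at the very top.
    best = sum(min(rank, len(gene_set)) for rank in range(1, cutoff + 1))
    return area / best if best else 0.0


def normalized_enrichment(rankings, gene_set, top_fraction=0.03):
    """Z-score each motif's AUC against the AUCs of all motifs."""
    aucs = {m: recovery_auc(r, gene_set, top_fraction) for m, r in rankings.items()}
    mu = mean(aucs.values())
    sigma = pstdev(aucs.values())
    return {m: (a - mu) / sigma if sigma else 0.0 for m, a in aucs.items()}
```

A motif whose ranking places the gene set near the top receives a high score relative to the other motifs; how such scores are thresholded and how direct targets are then selected is left out of this sketch.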
14
|
Barrière A, Ruvinsky I. Pervasive divergence of transcriptional gene regulation in Caenorhabditis nematodes. PLoS Genet 2014; 10:e1004435. [PMID: 24968346 PMCID: PMC4072541 DOI: 10.1371/journal.pgen.1004435] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 12/02/2013] [Accepted: 04/28/2014] [Indexed: 12/18/2022] Open
Abstract
Because there is considerable variation in gene expression even between closely related species, it is clear that gene regulatory mechanisms evolve relatively rapidly. Because primary sequence conservation is an unreliable proxy for functional conservation of cis-regulatory elements, their assessment must be carried out in vivo. We conducted a survey of cis-regulatory conservation between C. elegans and closely related species C. briggsae, C. remanei, C. brenneri, and C. japonica. We tested enhancers of eight genes from these species by introducing them into C. elegans and analyzing the expression patterns they drove. Our results support several notable conclusions. Most exogenous cis elements direct expression in the same cells as their C. elegans orthologs, confirming gross conservation of regulatory mechanisms. However, the majority of exogenous elements, when placed in C. elegans, also directed expression in cells outside endogenous patterns, suggesting functional divergence. Recurrent ectopic expression of different promoters in the same C. elegans cells may reflect biases in the directions in which expression patterns can evolve due to shared regulatory logic of coexpressed genes. The fact that, despite differences between individual genes, several patterns repeatedly emerged from our survey, encourages us to think that general rules governing regulatory evolution may exist and be discoverable.
Affiliation(s)
- Antoine Barrière
- Department of Ecology and Evolution and Institute for Genomics and Systems Biology, The University of Chicago, Chicago, Illinois, United States of America
- Ilya Ruvinsky
- Department of Ecology and Evolution and Institute for Genomics and Systems Biology, The University of Chicago, Chicago, Illinois, United States of America
- Department of Organismal Biology and Anatomy, The University of Chicago, Chicago, Illinois, United States of America
15
|
Glenwinkel L, Wu D, Minevich G, Hobert O. TargetOrtho: a phylogenetic footprinting tool to identify transcription factor targets. Genetics 2014; 197:61-76. [PMID: 24558259 PMCID: PMC4012501 DOI: 10.1534/genetics.113.160721] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 01/23/2014] [Accepted: 02/09/2014] [Indexed: 11/18/2022] Open
Abstract
The identification of the regulatory targets of transcription factors is central to our understanding of how transcription factors fulfill their many key roles in development and homeostasis. DNA-binding sites have been uncovered for many transcription factors through a number of experimental approaches, but it has proven difficult to use this binding site information to reliably predict transcription factor target genes in genomic sequence space. Using the nematode Caenorhabditis elegans and other related nematode species as a starting point, we describe here a bioinformatic pipeline that identifies potential transcription factor target genes from genomic sequences. Among the key features of this pipeline is the use of sequence conservation of transcription-factor-binding sites in related species. Rather than using aligned genomic DNA sequences from the genomes of multiple species as a starting point, TargetOrtho scans related genome sequences independently for matches to user-provided transcription-factor-binding motifs, assigns motif matches to adjacent genes, and then determines whether orthologous genes in different species also contain motif matches. We validate TargetOrtho by identifying previously characterized targets of three different types of transcription factors in C. elegans, and we use TargetOrtho to identify novel target genes of the Collier/Olf/EBF transcription factor UNC-3 in C. elegans ventral nerve cord motor neurons. We have also implemented the use of TargetOrtho in Drosophila melanogaster using conservation among five species in the D. melanogaster species subgroup for target gene discovery.
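The alignment-free logic described in this abstract (scan each genome independently, assign motif matches to genes, then check whether orthologs also carry a match) can be sketched in Python as follows. This is a simplified illustration, not TargetOrtho itself: it assumes upstream sequences are already assigned to genes, replaces the user-provided motif with a regular expression, scans the forward strand only, and all function and parameter names are hypothetical.

```python
import re


def genes_with_motif(upstream_seqs, motif_regex):
    """Return the genes whose upstream sequence contains at least one motif match.

    upstream_seqs -- dict mapping gene ID to its upstream DNA sequence
    motif_regex   -- regular expression standing in for a binding motif (assumption)
    """
    pattern = re.compile(motif_regex)
    return {gene for gene, seq in upstream_seqs.items() if pattern.search(seq)}


def conserved_targets(reference, others, orthologs, motif_regex, min_species=1):
    """Keep reference genes with a motif match that is also found near their orthologs.

    reference   -- dict gene -> upstream sequence for the reference species
    others      -- dict species -> (dict gene -> upstream sequence)
    orthologs   -- dict reference gene -> (dict species -> orthologous gene)
    min_species -- number of other species that must support the match
    """
    ref_hits = genes_with_motif(reference, motif_regex)
    # Each related genome is scanned independently; no alignment is involved.
    hits = {sp: genes_with_motif(seqs, motif_regex) for sp, seqs in others.items()}
    result = {}
    for gene in sorted(ref_hits):
        support = sorted(
            sp for sp, orth in orthologs.get(gene, {}).items()
            if orth in hits.get(sp, set())
        )
        if len(support) >= min_species:
            result[gene] = support
    return result
```

Raising `min_species` makes the conservation filter stricter; a fuller version would score matches with a position weight matrix and scan both strands.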
Affiliation(s)
- Lori Glenwinkel
- Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, New York 10032
- Gregory Minevich
- Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, New York 10032
- Oliver Hobert
- Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, New York 10032
16
|
Dissection of thousands of cell type-specific enhancers identifies dinucleotide repeat motifs as general enhancer features. Genome Res 2014; 24:1147-56. [PMID: 24714811 PMCID: PMC4079970 DOI: 10.1101/gr.169243.113] [Citation(s) in RCA: 99] [Impact Index Per Article: 9.9] [Indexed: 11/30/2022]
Abstract
Gene expression is determined by genomic elements called enhancers, which contain short motifs bound by different transcription factors (TFs). However, how enhancer sequences and TF motifs relate to enhancer activity is unknown, and general sequence requirements for enhancers or comprehensive sets of important enhancer sequence elements have remained elusive. Here, we computationally dissect thousands of functional enhancer sequences from three different Drosophila cell lines. We find that the enhancers display distinct cis-regulatory sequence signatures, which are predictive of the enhancers’ cell type-specific or broad activities. These signatures contain transcription factor motifs and a novel class of enhancer sequence elements, dinucleotide repeat motifs (DRMs). DRMs are highly enriched in enhancers, particularly in enhancers that are broadly active across different cell types. We experimentally validate the importance of the identified TF motifs and DRMs for enhancer function and show that they can be sufficient to create an active enhancer de novo from a nonfunctional sequence. The function of DRMs as a novel class of general enhancer features that are also enriched in human regulatory regions might explain their implication in several diseases and provides important insights into gene regulation.
17
|
Systematic identification of regulatory elements in conserved 3' UTRs of human transcripts. Cell Rep 2014; 7:281-92. [PMID: 24656821 DOI: 10.1016/j.celrep.2014.03.001] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Received: 12/24/2013] [Revised: 02/03/2014] [Accepted: 03/03/2014] [Indexed: 11/21/2022] Open
Abstract
Posttranscriptional regulatory programs governing diverse aspects of RNA biology remain largely uncharacterized. Understanding the functional roles of RNA cis-regulatory elements is essential for decoding complex programs that underlie the dynamic regulation of transcript stability, splicing, localization, and translation. Here, we describe a combined experimental/computational technology to reveal a catalog of functional regulatory elements embedded in 3' UTRs of human transcripts. We used a bidirectional reporter system coupled with flow cytometry and high-throughput sequencing to measure the effect of short, noncoding, vertebrate-conserved RNA sequences on transcript stability and translation. Information-theoretic motif analysis of the resulting sequence-to-gene-expression mapping revealed linear and structural RNA cis-regulatory elements that positively and negatively modulate the posttranscriptional fates of human transcripts. This combined experimental/computational strategy can be used to systematically characterize the vast landscape of posttranscriptional regulatory elements controlling physiological and pathological cellular state transitions.
18
|
Menoret D, Santolini M, Fernandes I, Spokony R, Zanet J, Gonzalez I, Latapie Y, Ferrer P, Rouault H, White KP, Besse P, Hakim V, Aerts S, Payre F, Plaza S. Genome-wide analyses of Shavenbaby target genes reveals distinct features of enhancer organization. Genome Biol 2013; 14:R86. [PMID: 23972280 PMCID: PMC4053989 DOI: 10.1186/gb-2013-14-8-r86] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Received: 03/07/2013] [Accepted: 08/23/2013] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND Developmental programs are implemented by regulatory interactions between Transcription Factors (TFs) and their target genes, which remain poorly understood. While recent studies have focused on regulatory cascades of TFs that govern early development, little is known about how the ultimate effectors of cell differentiation are selected and controlled. We addressed this question during late Drosophila embryogenesis, when the finely tuned expression of the TF Ovo/Shavenbaby (Svb) triggers the morphological differentiation of epidermal trichomes. RESULTS We defined a sizeable set of genes downstream of Svb and used in vivo assays to delineate 14 enhancers driving their specific expression in trichome cells. Coupling computational modeling to functional dissection, we investigated the regulatory logic of these enhancers. Extending the repertoire of epidermal effectors using genome-wide approaches showed that the regulatory models learned from this first sample are representative of the whole set of trichome enhancers. These enhancers harbor remarkable features with respect to their functional architectures, including a weak or non-existent clustering of Svb binding sites. The in vivo function of each site relies on its intimate context, notably the flanking nucleotides. Two additional cis-regulatory motifs, present in a broad diversity of composition and positioning among trichome enhancers, critically contribute to enhancer activity. CONCLUSIONS Our results show that Svb directly regulates a large set of terminal effectors of the remodeling of epidermal cells. Further, these data reveal that trichome formation is underpinned by unexpectedly diverse modes of regulation, providing fresh insights into the functional architecture of enhancers governing a terminal differentiation program.
19
|
Ghandi M, Mohammad-Noori M, Beer MA. Robust k-mer frequency estimation using gapped k-mers. J Math Biol 2013; 69:469-500. [PMID: 23861010 DOI: 10.1007/s00285-013-0705-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 12/04/2012] [Revised: 06/09/2013] [Indexed: 10/26/2022]
Abstract
Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.
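To make the distinction between ungapped and gapped k-mers concrete, the following Python sketch counts both in a sequence. It only illustrates the counting step; the minimum norm estimator derived in the paper is not reproduced here. The function names and the use of '-' as the gap symbol are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def count_kmers(seq, k):
    """Count every contiguous k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_gapped_kmers(seq, length, informative):
    """Count gapped k-mers over windows of `length` bases.

    In each window, `informative` positions keep their base and the
    remaining positions are replaced by '-' (treated as a wildcard gap).
    """
    masks = [set(keep) for keep in combinations(range(length), informative)]
    counts = Counter()
    for i in range(len(seq) - length + 1):
        window = seq[i:i + length]
        for keep in masks:
            gapped = "".join(b if j in keep else "-" for j, b in enumerate(window))
            counts[gapped] += 1
    return counts
```

Because each gapped k-mer pools all windows that agree on the informative positions, its count is at least as large as that of any full-length k-mer it matches, which is why gapped counts are less sparse for large k.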
Affiliation(s)
- Mahmoud Ghandi
- Department of Biomedical Engineering and McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, 21205, USA
20
|
Lusk RW, Eisen MB. Spatial promoter recognition signatures may enhance transcription factor specificity in yeast. PLoS One 2013; 8:e53778. [PMID: 23320104 PMCID: PMC3540036 DOI: 10.1371/journal.pone.0053778] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 03/14/2012] [Accepted: 12/04/2012] [Indexed: 11/26/2022] Open
Abstract
The short length and high degeneracy of sites recognized by DNA-binding transcription factors limit the amount of information they can carry, and individual sites are rarely sufficient to mediate the regulation of specific targets. Computational analysis of microbial genomes has suggested that many factors function optimally when in a particular orientation and position with respect to their target promoters. To investigate this further, we developed and trained spatial models of binding site positioning and applied them to the genome of the yeast Saccharomyces cerevisiae. We found evidence of non-random organization of sites within promoters, differences in binding site density, or both for thirty-eight transcription factors. We show that these signatures allow transcription factors with substantial differences in binding site specificity to share similar promoter specificities. We illustrate how spatial information dictating the positioning and density of binding sites can in principle increase the information available to the organism for differentiating a transcription factor’s true targets, and we indicate how this information could potentially be leveraged for the same purpose in bioinformatic analyses.
Affiliation(s)
- Richard W. Lusk
- Department of Ecology & Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America
- Michael B. Eisen
- Department of Molecular & Cell Biology, University of California, Berkeley, California, United States of America
- Howard Hughes Medical Institute, University of California, Berkeley, California, United States of America
21
|
Seidl MF, Wang RP, Van den Ackerveken G, Govers F, Snel B. Bioinformatic inference of specific and general transcription factor binding sites in the plant pathogen Phytophthora infestans. PLoS One 2012; 7:e51295. [PMID: 23251489 PMCID: PMC3520976 DOI: 10.1371/journal.pone.0051295] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 07/13/2012] [Accepted: 11/01/2012] [Indexed: 11/19/2022] Open
Abstract
Plant infection by oomycete pathogens is a complex process. It requires precise expression of a plethora of genes in the pathogen that contribute to a successful interaction with the host. Whereas much effort has been made to uncover the molecular systems underlying this infection process, mechanisms of transcriptional regulation of the genes involved remain largely unknown. We performed the first systematic de-novo DNA motif discovery analysis in Phytophthora. To this end, we utilized the genome sequence of the late blight pathogen Phytophthora infestans and two related Phytophthora species (P. ramorum and P. sojae), as well as genome-wide in planta gene expression data to systematically predict 19 conserved DNA motifs. This catalog describes common eukaryotic promoter elements whose functionality is supported by the presence of orthologs of known general transcription factors. Together with strong functional enrichment of the common promoter elements towards effector genes involved in pathogenicity, we obtained a new and expanded picture of the promoter structure in P. infestans. More intriguingly, we identified specific DNA motifs that are either highly abundant or whose presence is significantly correlated with gene expression levels during infection. Several of these motifs are observed upstream of genes encoding transporters, RXLR effectors, but also transcriptional regulators. Motifs that are observed upstream of known pathogenicity-related genes are potentially important binding sites for transcription factors. Our analyses add substantial knowledge to the as of yet virtually unexplored question regarding general and specific gene regulation in this important class of pathogens. We propose hypotheses on the effects of cis-regulatory motifs on the gene regulation of pathogenicity-related genes and pinpoint motifs that are prime targets for further experimental validation.
Affiliation(s)
- Michael F Seidl
- Theoretical Biology and Bioinformatics, Department of Biology, Utrecht University, Utrecht, The Netherlands.
22
|
Müller-Molina AJ, Schöler HR, Araúzo-Bravo MJ. Comprehensive human transcription factor binding site map for combinatory binding motifs discovery. PLoS One 2012; 7:e49086. [PMID: 23209563 PMCID: PMC3509107 DOI: 10.1371/journal.pone.0049086] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 07/03/2012] [Accepted: 10/08/2012] [Indexed: 11/18/2022] Open
Abstract
To know the map between transcription factors (TFs) and their binding sites is essential to reverse engineer the regulation process. Only about 10%-20% of the transcription factor binding motifs (TFBMs) have been reported. This lack of data hinders understanding gene regulation. To address this drawback, we propose a computational method that exploits never used TF properties to discover the missing TFBMs and their sites in all human gene promoters. The method starts by predicting a dictionary of regulatory "DNA words." From this dictionary, it distills 4098 novel predictions. To disclose the crosstalk between motifs, an additional algorithm extracts TF combinatorial binding patterns creating a collection of TF regulatory syntactic rules. Using these rules, we narrowed down a list of 504 novel motifs that appear frequently in syntax patterns. We tested the predictions against 509 known motifs confirming that our system can reliably predict ab initio motifs with an accuracy of 81%, far higher than previous approaches. We found that on average, 90% of the discovered combinatorial binding patterns target at least 10 genes, suggesting that to control in an independent manner smaller gene sets, supplementary regulatory mechanisms are required. Additionally, we discovered that the new TFBMs and their combinatorial patterns convey biological meaning, targeting TFs and genes related to developmental functions. Thus, among all the possible available targets in the genome, the TFs tend to regulate other TFs and genes involved in developmental functions. We provide a comprehensive resource for regulation analysis that includes a dictionary of "DNA words," newly predicted motifs and their corresponding combinatorial patterns. Combinatorial patterns are a useful filter to discover TFBMs that play a major role in orchestrating other factors and thus, are likely to lock/unlock cellular functional clusters.
Affiliation(s)
- Arnoldo J. Müller-Molina
  - Computational Biology and Bioinformatics Group, Max Planck Institute for Molecular Biomedicine, Münster, Germany
- Hans R. Schöler
  - Department of Cell and Developmental Biology, Max Planck Institute for Molecular Biomedicine, Münster, Germany
  - Medical Faculty, University of Münster, Münster, Germany
- Marcos J. Araúzo-Bravo
  - Computational Biology and Bioinformatics Group, Max Planck Institute for Molecular Biomedicine, Münster, Germany
|
23
|
Ding J, Li X, Hu H. Systematic prediction of cis-regulatory elements in the Chlamydomonas reinhardtii genome using comparative genomics. Plant Physiology 2012; 160:613-23. [PMID: 22915576 PMCID: PMC3461543 DOI: 10.1104/pp.112.200840] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
Chlamydomonas reinhardtii is one of the most important microalgae model organisms and has been widely studied toward the understanding of chloroplast functions and various cellular processes. Further exploitation of C. reinhardtii as a model system to elucidate various molecular mechanisms and pathways requires systematic study of gene regulation. However, there is a general lack of genome-scale gene regulation study, such as global cis-regulatory element (CRE) identification, in C. reinhardtii. Recently, large-scale genomic data in microalgae species have become available, which enable the development of efficient computational methods to systematically identify CREs and characterize their roles in microalgae gene regulation. Here, we performed in silico CRE identification at the whole genome level in C. reinhardtii using a comparative genomics-based method. We predicted a large number of CREs in C. reinhardtii that are consistent with experimentally verified CREs. We also discovered that a large percentage of these CREs form combinations and have the potential to work together for coordinated gene regulation in C. reinhardtii. Multiple lines of evidence from literature, gene transcriptional profiles, and gene annotation resources support our prediction. The predicted CREs will serve, to our knowledge, as the first large-scale collection of CREs in C. reinhardtii to facilitate further experimental study of microalgae gene regulation. The accompanying software tool and the predictions in C. reinhardtii are also made available through a Web-accessible database (http://hulab.ucf.edu/research/projects/Microalgae/sdcre/motifcomb.html).
|
24
|
Hansen L, Mariño-Ramírez L, Landsman D. Differences in local genomic context of bound and unbound motifs. Gene 2012; 506:125-34. [PMID: 22692006 PMCID: PMC3412921 DOI: 10.1016/j.gene.2012.06.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/27/2012] [Accepted: 06/04/2012] [Indexed: 11/25/2022]
Abstract
Understanding gene regulation is a major objective in molecular biology research. Frequently, transcription is driven by transcription factors (TFs) that bind to specific DNA sequences. These motifs are usually short and degenerate, making it highly likely that multiple copies occur throughout the genome by chance. Despite this, TFs only bind to a small subset of sites, thus prompting our investigation into the differences between motifs that are bound by TFs and those that remain unbound. Here we constructed vectors representing various chromatin- and sequence-based features for a published set of bound and unbound motifs representing nine TFs in the budding yeast Saccharomyces cerevisiae. Using a machine learning approach, we identified a set of features that can be used to discriminate between bound and unbound motifs. We also discovered that some TFs bind most or all of their strong motifs in intergenic regions. Our data demonstrate that local sequence context can be strikingly different around motifs that are bound compared to motifs that are unbound. We concluded that there are multiple combinations of genomic features that characterize bound or unbound motifs.
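As an illustration of the kind of sequence-based feature vector this abstract describes, the following minimal Python sketch computes GC content for a motif instance and its two flanks. The feature names, the flank width, and the function `context_features` are assumptions made for illustration; the abstract does not list the features the authors actually used, and chromatin-based features are omitted entirely.

```python
def gc_content(seq):
    """Fraction of G or C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def context_features(sequence, motif_start, motif_len, flank=10):
    """Build a small, hypothetical feature vector for one motif instance.

    sequence    -- DNA string containing the motif
    motif_start -- 0-based start of the motif in sequence
    motif_len   -- motif length in bases
    flank       -- number of bases taken on each side (assumed value)
    """
    motif_end = motif_start + motif_len
    left = sequence[max(0, motif_start - flank):motif_start]
    right = sequence[motif_end:motif_end + flank]
    return {
        "gc_left": gc_content(left),
        "gc_motif": gc_content(sequence[motif_start:motif_end]),
        "gc_right": gc_content(right),
    }


# Hypothetical example: the motif TGAC starts at index 6.
features = context_features("GGCCAATGACTTAA", motif_start=6, motif_len=4, flank=4)
```

Vectors like this, one per bound or unbound motif instance, could then be passed to any standard classifier; which classifier the authors used is not stated in the abstract.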
Affiliation(s)
- Loren Hansen
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8900 Rockville Pike, Bethesda, MD 20894
  - Bioinformatics Program, Boston University, Boston, MA 02215, USA
- Leonardo Mariño-Ramírez
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8900 Rockville Pike, Bethesda, MD 20894
  - PanAmerican Bioinformatics Institute, Santa Marta, Magdalena, Colombia
- David Landsman
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8900 Rockville Pike, Bethesda, MD 20894
|
25
|
Wang S, Yin Y, Ma Q, Tang X, Hao D, Xu Y. Genome-scale identification of cell-wall related genes in Arabidopsis based on co-expression network analysis. BMC Plant Biology 2012; 12:138. [PMID: 22877077 PMCID: PMC3463447 DOI: 10.1186/1471-2229-12-138] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Received: 05/14/2012] [Accepted: 07/30/2012] [Indexed: 05/21/2023]
Abstract
BACKGROUND Identification of novel genes relevant to plant cell-wall (PCW) synthesis represents a highly important and challenging problem. Although substantial efforts have been invested into studying this problem, the vast majority of the PCW related genes remain unknown. RESULTS Here we present a computational study focused on identification of novel PCW genes in Arabidopsis based on the co-expression analyses of transcriptomic data collected under 351 conditions, using a bi-clustering technique. Our analysis identified 217 highly co-expressed gene clusters (modules) under some experimental conditions, each containing at least one gene annotated as PCW related according to the Purdue Cell Wall Gene Families database. These co-expression modules cover 349 known/annotated PCW genes and 2,438 new candidates. For each candidate gene, we annotated the specific PCW synthesis stages in which it is involved and predicted the detailed function. In addition, for the co-expressed genes in each module, we predicted and analyzed their cis regulatory motifs in the promoters using our motif discovery pipeline, providing strong evidence that the genes in each co-expression module are transcriptionally co-regulated. From all the co-expression modules, we infer that 108 modules are related to four major PCW synthesis components, using three complementary methods. CONCLUSIONS We believe our approach and data presented here will be useful for further identification and characterization of PCW genes. All the predicted PCW genes, co-expression modules, motifs and their annotations are available at a web-based database: http://csbl.bmb.uga.edu/publications/materials/shanwang/CWRPdb/index.html.
Affiliation(s)
- Shan Wang
  - Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
  - Key Lab for Molecular Enzymology and Engineering of the Ministry of Education, Jilin University, Changchun, China
  - Biotechnology Research Centre, Jilin Academy of Agricultural Sciences (JAAS), Changchun, China
- Yanbin Yin
  - Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
  - BESC BioEnergy Science Center, University of Georgia, Athens, GA, USA
- Qin Ma
  - Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
  - BESC BioEnergy Science Center, University of Georgia, Athens, GA, USA
- Xiaojia Tang
  - Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
- Dongyun Hao
  - Key Lab for Molecular Enzymology and Engineering of the Ministry of Education, Jilin University, Changchun, China
  - Biotechnology Research Centre, Jilin Academy of Agricultural Sciences (JAAS), Changchun, China
- Ying Xu
  - Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
  - BESC BioEnergy Science Center, University of Georgia, Athens, GA, USA
  - College of Computer Science and Technology, Jilin University, Changchun, China
|
26
|
Herrmann C, Van de Sande B, Potier D, Aerts S. i-cisTarget: an integrative genomics method for the prediction of regulatory features and cis-regulatory modules. Nucleic Acids Res 2012; 40:e114. [PMID: 22718975 PMCID: PMC3424583 DOI: 10.1093/nar/gks543] [Citation(s) in RCA: 129] [Impact Index Per Article: 10.8] [Indexed: 01/08/2023] Open
Abstract
The field of regulatory genomics today is characterized by the generation of high-throughput data sets that capture genome-wide transcription factor (TF) binding, histone modifications, or DNase I hypersensitive regions across many cell types and conditions. In this context, a critical question is how to make optimal use of these publicly available datasets when studying transcriptional regulation. Here, we address this question in Drosophila melanogaster, for which a large number of high-throughput regulatory datasets are available. We developed i-cisTarget (where the 'i' stands for integrative), for the first time enabling the discovery of different types of enriched 'regulatory features' in a set of co-regulated sequences in one analysis, being either TF motifs or 'in vivo' chromatin features, or combinations thereof. We have validated our approach on 15 co-expressed gene sets, 21 ChIP data sets, 628 curated gene sets and multiple individual case studies, and show that meaningful regulatory features can be confidently discovered; that bona fide enhancers can be identified, both by in vivo events and by TF motifs; and that combinations of in vivo events and TF motifs further increase the performance of enhancer prediction.
Affiliation(s)
- Carl Herrmann
  - TAGC - Inserm U1090 and Aix-Marseille Université, Campus de Luminy, 13288 Marseille, France.
|
27
|
Petrov V, Vermeirssen V, De Clercq I, Van Breusegem F, Minkov I, Vandepoele K, Gechev TS. Identification of cis-regulatory elements specific for different types of reactive oxygen species in Arabidopsis thaliana. Gene 2012; 499:52-60. [DOI: 10.1016/j.gene.2012.02.035] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Received: 11/18/2011] [Revised: 02/09/2012] [Accepted: 02/19/2012] [Indexed: 10/28/2022]
|
28
|
Poultney CS, Greenfield A, Bonneau R. Integrated inference and analysis of regulatory networks from multi-level measurements. Methods Cell Biol 2012; 110:19-56. [PMID: 22482944 PMCID: PMC5615108 DOI: 10.1016/b978-0-12-388403-9.00002-3] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 12/13/2022]
Abstract
Regulatory and signaling networks coordinate the enormously complex interactions and processes that control cellular processes (such as metabolism and cell division), coordinate response to the environment, and carry out multiple cell decisions (such as development and quorum sensing). Regulatory network inference is the process of inferring these networks, traditionally from microarray data but increasingly incorporating other measurement types such as proteomics, ChIP-seq, metabolomics, and mass cytometry. We discuss existing techniques for network inference. We review in detail our pipeline, which consists of an initial biclustering step, designed to estimate co-regulated groups; a network inference step, designed to select and parameterize likely regulatory models for the control of the co-regulated groups from the biclustering step; and a visualization and analysis step, designed to find and communicate key features of the network. Learning biological networks from even the most complete data sets is challenging; we argue that integrating new data types into the inference pipeline produces networks of increased accuracy, validity, and biological relevance.
Affiliation(s)
- Christopher S Poultney
  - Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA
|
29
|
Potier D, Atak ZK, Sanchez MN, Herrmann C, Aerts S. Using cisTargetX to predict transcriptional targets and networks in Drosophila. Methods Mol Biol 2012; 786:291-314. [PMID: 21938634 DOI: 10.1007/978-1-61779-292-2_18] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 12/23/2022]
Abstract
Gene expression regulation is a fundamental biological process leading to complete organism development by controlling processes like cell type specification and differentiation. The accuracy of this process is governed by transcription factors (TFs) acting within a complex gene regulatory network. CisTargetX has been developed to enable a user to predict TFs, enhancers, and target genes involved in the regulation of co-expressed genes. It uses a strategy that incorporates the genome-wide prediction of clusters of transcription factor binding sites (TFBSs), starting from a large, unbiased collection of position weight matrices (PWMs), and uses comparative genomics criteria to filter potential TFBSs. We describe in this chapter, step-by-step, how to use cisTargetX starting from a set of genes or TF(s) to predict transcriptional targets with their putative binding sites and networks in Drosophila. Next, we illustrate this approach on a particular developmental system, namely, sensory organ development, and identify relevant TFs, DNA regions regulating gene expression, and TF/target gene interactions. CisTargetX is available at http://med.kuleuven.be/lcb/cisTargetX.
Affiliation(s)
- Delphine Potier
  - TAGC Inserm U928 and Université de la Mediterranée, Marseille, France
|
30
|
Aerts S. Computational strategies for the genome-wide identification of cis-regulatory elements and transcriptional targets. Curr Top Dev Biol 2012; 98:121-45. [PMID: 22305161 DOI: 10.1016/b978-0-12-386499-4.00005-7] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Indexed: 12/13/2022]
Abstract
Transcription factors (TFs) are key proteins that decode the information in our genome to express a precise and unique set of proteins and RNA molecules in each cell type in our body. These factors play a pivotal role in all biological processes, including the determination of a cell's fate during development and the maintenance of a cell's physiological function. To achieve this, a TF binds to specific DNA sequences in the noncoding part of the genome, recruits chromatin modifiers and cofactors, and directs the transcription initiation rate of its "target genes." Therefore, a key challenge in deciphering a transcriptional switch is to identify the direct target genes of the master regulators that control the switch, the cis-regulatory elements implementing (auto-)regulatory loops, and the target genes of all the TFs in the downstream regulatory network. A better knowledge of a TF's targetome during specification and differentiation of a particular cell type will generate mechanistic insight into its developmental program. Here, I review computational strategies and methods to predict transcriptional targets by genome-wide searches for TF binding sites using position weight matrices, motif clusters, phylogenetic footprinting, chromatin binding and accessibility data, enhancer classification, motif enrichment, and gene expression signatures.
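The position weight matrix (PWM) search mentioned in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular tool in the review: the count matrix `COUNTS`, the pseudocount, the uniform background, and the score threshold are all hypothetical values, and only the forward strand is scanned.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per position).
# The consensus of this toy matrix is TGAC.
COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]


def log_odds_pwm(counts, background=0.25, pseudocount=0.5):
    """Convert base counts to log2 odds against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for forward-strand windows scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous bases such as N
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


pwm = log_odds_pwm(COUNTS)
hits = scan("AATGACGGTGACA", pwm, threshold=5.0)
```

A genome-wide search would additionally scan the reverse complement and choose the threshold from a background score distribution; both are left out here to keep the sketch short.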
Affiliation(s)
- Stein Aerts
  - Laboratory of Computational Biology, Center for Human Genetics, Katholieke Universiteit Leuven, Leuven, Belgium
|
31
|
Daily K, Patel VR, Rigor P, Xie X, Baldi P. MotifMap: integrative genome-wide maps of regulatory motif sites for model species. BMC Bioinformatics 2011; 12:495. [PMID: 22208852 PMCID: PMC3293935 DOI: 10.1186/1471-2105-12-495] [Citation(s) in RCA: 135] [Impact Index Per Article: 10.4] [Received: 09/29/2011] [Accepted: 12/30/2011] [Indexed: 12/20/2022] Open
Abstract
Background A central challenge of biology is to map and understand gene regulation on a genome-wide scale. For any given genome, only a small fraction of the regulatory elements embedded in the DNA sequence have been characterized, and there is great interest in developing computational methods to systematically map all these elements and understand their relationships. Such computational efforts, however, are significantly hindered by the overwhelming size of non-coding regions and the statistical variability and complex spatial organizations of regulatory elements and interactions. Genome-wide catalogs of regulatory elements for all model species simply do not yet exist. Results The MotifMap system uses databases of transcription factor binding motifs, refined genome alignments, and a comparative genomic statistical approach to provide comprehensive maps of candidate regulatory elements encoded in the genomes of model species. The system is used to derive new genome-wide maps for yeast, fly, worm, mouse, and human. The human map contains 519,108 sites for 570 matrices with a False Discovery Rate of 0.1 or less. The new maps are assessed in several ways, for instance using high-throughput experimental ChIP-seq data and AUC statistics, providing strong evidence for their accuracy and coverage. The maps can be usefully integrated with many other kinds of omic data and are available at http://motifmap.igb.uci.edu/.
Conclusions MotifMap and its integration with other data provide a foundation for analyzing gene regulation on a genome-wide scale, and for automatically generating regulatory pathways and hypotheses. The power of this approach is demonstrated and discussed using the P53 apoptotic pathway and the Gli hedgehog pathways as examples.
Affiliation(s)
- Kenneth Daily
  - Department of Computer Science, University of California Irvine, Irvine, CA 92697, USA
|
32
|
Harris EY, Ponts N, Le Roch KG, Lonardi S. Chromatin-driven de novo discovery of DNA binding motifs in the human malaria parasite. BMC Genomics 2011; 12:601. [PMID: 22165844 PMCID: PMC3282892 DOI: 10.1186/1471-2164-12-601] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 06/30/2011] [Accepted: 12/13/2011] [Indexed: 11/10/2022] Open
Abstract
Background Despite extensive efforts to discover transcription factors and their binding sites in the human malaria parasite Plasmodium falciparum, only a few transcription factor binding motifs have been experimentally validated to date. As a consequence, gene regulation in P. falciparum is still poorly understood. There is now evidence that the chromatin architecture plays an important role in transcriptional control in malaria. Results We propose a methodology for discovering cis-regulatory elements that uses for the first time exclusively dynamic chromatin remodeling data. Our method employs nucleosome positioning data collected at seven time points during the erythrocytic cycle of P. falciparum to discover putative DNA binding motifs and their transcription factor binding sites along with their associated clusters of target genes. Our approach results in 129 putative binding motifs within the promoter region of known genes. About 75% of those are novel, the remaining being highly similar to experimentally validated binding motifs. About half of the binding motifs reported show statistically significant enrichment in functional gene sets and strong positional bias in the promoter region. Conclusion Experimental results establish the principle that dynamic chromatin remodeling data can be used in lieu of gene expression data to discover binding motifs and their transcription factor binding sites. Our approach can be applied using only dynamic nucleosome positioning data, independent from any knowledge of gene function or expression.
Affiliation(s)
- Elena Y Harris
  - Department of Cell Biology and Neuroscience, University of California, Riverside, CA 92521, USA
|
33
|
Eirín-López J, Ausió J. H2A.Z-Mediated Genome-Wide Chromatin Specialization. Curr Genomics 2011; 8:59-66. [PMID: 18645626 DOI: 10.2174/138920207780076965] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 11/21/2006] [Revised: 12/16/2006] [Accepted: 01/01/2007] [Indexed: 11/22/2022] Open
Abstract
The characterization of the involvement of different histone post-translational modifications (PTMs) and histone variants in chromatin structure has represented one of the most recurrent topics in molecular biology during the last decade (since 1996). The interest in this topic underscores the critical roles played by chromatin in such important processes as DNA packaging, DNA repair and recombination, and regulation of gene expression. The genomic information currently available has pushed the boundaries of this research a step further, from the study of local domains to the genome-wide characterization of the mechanisms governing chromatin dynamics. How the heterochromatin and euchromatin compartmentalization is established has been the subject of recent extensive research. Many PTMs, as well as histone variants, have been identified to play a role, including the replacement of histone H2A by the histone variant H2A.Z. Several studies have provided support to a role for H2A.Z (known as Htz1 in yeast) in transcriptional regulation, chromosome structure, DNA repair and heterochromatin formation. Although the mechanisms by which H2A.Z defines different structural regions in the chromatin have long remained elusive, various reports published last year have shed new light on this process. The present mini review focuses its attention on the genome-wide distribution of H2A.Z, with special attention to the mechanisms involved in its distribution and exchange as well as on the role of its N-terminal acetylation.
Affiliation(s)
- Jm Eirín-López
  - Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, Canada V8W 3P6
|
34
|
Xie Z, Hu S, Qian J, Blackshaw S, Zhu H. Systematic characterization of protein-DNA interactions. Cell Mol Life Sci 2011; 68:1657-68. [PMID: 21207099 PMCID: PMC11115113 DOI: 10.1007/s00018-010-0617-y] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Received: 09/09/2010] [Revised: 11/29/2010] [Accepted: 12/16/2010] [Indexed: 12/13/2022]
Abstract
Sequence-specific protein-DNA interactions (PDIs) are critical for regulating many cellular processes, including transcription, DNA replication, repair, and rearrangement. We review recent experimental advances in high-throughput technologies designed to characterize PDIs and discuss recent studies that use these tools, including ChIP-chip/seq, SELEX-based approaches, yeast one-hybrid, bacterial one-hybrid, protein binding microarray, and protein microarray. The results of these studies have challenged some long-standing concepts of PDI and provide valuable insights into the complex transcriptional regulatory networks.
Affiliation(s)
- Zhi Xie
  - Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, MD USA
  - Present Address: The Center for Human Immunology, National Institutes of Health, Bethesda, MD USA
- Shaohui Hu
  - The Center for High-Throughput Biology, Johns Hopkins University School of Medicine, Baltimore, MD USA
  - Department of Pharmacology and Molecular Sciences, Johns Hopkins University School of Medicine, Baltimore, MD USA
- Jiang Qian
  - Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, MD USA
  - The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD USA
- Seth Blackshaw
  - The Center for High-Throughput Biology, Johns Hopkins University School of Medicine, Baltimore, MD USA
  - The Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD USA
- Heng Zhu
  - The Center for High-Throughput Biology, Johns Hopkins University School of Medicine, Baltimore, MD USA
  - Department of Pharmacology and Molecular Sciences, Johns Hopkins University School of Medicine, Baltimore, MD USA
  - The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD USA
  - Room 333, BRB, 733 N. Broadway, 21205 Baltimore, MD USA
|
35
|
Molineris I, Grassi E, Ala U, Di Cunto F, Provero P. Evolution of promoter affinity for transcription factors in the human lineage. Mol Biol Evol 2011; 28:2173-83. [PMID: 21335606 DOI: 10.1093/molbev/msr027] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Indexed: 12/22/2022] Open
Abstract
Changes in gene regulation are believed to play an important role in the evolution of animals. It has been suggested that changes in cis-regulatory regions are responsible for many or most of the anatomical and behavioral differences between humans and apes. However, the study of the evolution of cis-regulatory regions is made problematic by the degeneracy of transcription factor (TF) binding sites and the shuffling of their positions. In this work, we use the predicted total affinity of a promoter for a large collection of TFs as the basis for studying the evolution of cis-regulatory regions in mammals. We introduce the human specificity of a promoter, measuring the divergence between the affinity profile of a human promoter and its orthologous promoters in other mammals. The promoters of genes involved in functional categories such as neural processes and signal transduction, among others, have higher human specificity compared with the rest of the genome. Clustering of the human-specific affinities (HSAs) of neural genes reveals patterns of promoter evolution associated with functional categories such as synaptic transmission and brain development and with diseases such as bipolar disorder and autism.
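To make the idea of a promoter's total affinity for a TF concrete, here is a rough Python sketch. It assumes one common simplification: the affinity of each window is the product of per-position frequency ratios against a uniform background, and the total affinity is the sum over all forward-strand windows. The abstract does not give the authors' exact formula, and the frequency matrix `FREQS` and the function `total_affinity` are hypothetical.

```python
# Hypothetical 2-bp frequency matrix (one dict per motif position).
FREQS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]


def total_affinity(sequence, freqs, background=0.25):
    """Sum, over all forward-strand windows, of the motif/background likelihood ratio."""
    width = len(freqs)
    total = 0.0
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # ignore windows containing ambiguous bases
        ratio = 1.0
        for column, base in zip(freqs, window):
            ratio *= column[base] / background
        total += ratio
    return total


# Toy comparison of two made-up promoter fragments.
affinity_a = total_affinity("ACAC", FREQS)
affinity_b = total_affinity("GTGT", FREQS)
```

Because every window contributes, the score needs no hit threshold and does not depend on where individual sites sit, which is what makes it tolerant of degenerate and shuffled binding sites. Comparing such scores between a human promoter and its orthologs is the kind of input a divergence measure like the human specificity described above would need.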
Affiliation(s)
- Ivan Molineris
  - Department of Genetics, Biology and Biochemistry, Molecular Biotechnology Center, University of Turin, Turin, Italy
|
36
|
Abstract
The ability to manipulate the genome of organisms at will is perhaps the single most useful ability for the study of biological systems. Techniques for the generation of transgenics in the nematode Caenorhabditis elegans became available in the late 1980s. Since then, improvements to the original approach have been made to address specific limitations with transgene expression, expand on the repertoire of the types of biological information that transgenes can provide, and begin to develop methods to target transgenes to defined chromosomal locations. Many recent, detailed protocols have been published, and hence in this chapter, we will review various approaches to making C. elegans transgenics, discuss their applications, and consider their relative advantages and disadvantages. Comments will also be made on anticipated future developments and on the application of these methods to other nematodes.
Affiliation(s)
- Vida Praitis
  - Biology Department, Grinnell College, Grinnell, Iowa, USA
|
37
|
Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y, Pritchard JK. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res 2010; 21:447-55. [PMID: 21106904 DOI: 10.1101/gr.112623.110] [Citation(s) in RCA: 390] [Impact Index Per Article: 27.9] [Indexed: 02/06/2023]
Abstract
Accurate functional annotation of regulatory elements is essential for understanding global gene regulation. Here, we report a genome-wide map of 827,000 transcription factor binding sites in human lymphoblastoid cell lines, which is comprised of sites corresponding to 239 position weight matrices of known transcription factor binding motifs, and 49 novel sequence motifs. To generate this map, we developed a probabilistic framework that integrates cell- or tissue-specific experimental data such as histone modifications and DNase I cleavage patterns with genomic information such as gene annotation and evolutionary conservation. Comparison to empirical ChIP-seq data suggests that our method is highly accurate yet has the advantage of targeting many factors in a single assay. We anticipate that this approach will be a valuable tool for genome-wide studies of gene regulation in a wide variety of cell types or tissues under diverse conditions.
Affiliation(s)
- Roger Pique-Regi
  - Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA.
|
38
|
Campbell TL, De Silva EK, Olszewski KL, Elemento O, Llinás M. Identification and genome-wide prediction of DNA binding specificities for the ApiAP2 family of regulators from the malaria parasite. PLoS Pathog 2010; 6:e1001165. [PMID: 21060817 PMCID: PMC2965767 DOI: 10.1371/journal.ppat.1001165] [Citation(s) in RCA: 182] [Impact Index Per Article: 13.0] [Received: 05/05/2010] [Accepted: 09/27/2010] [Indexed: 11/18/2022] Open
Abstract
The molecular mechanisms underlying transcriptional regulation in apicomplexan parasites remain poorly understood. Recently, the Apicomplexan AP2 (ApiAP2) family of DNA binding proteins was identified as a major class of transcriptional regulators that are found across all Apicomplexa. To gain insight into the regulatory role of these proteins in the malaria parasite, we have comprehensively surveyed the DNA-binding specificities of all 27 members of the ApiAP2 protein family from Plasmodium falciparum revealing unique binding preferences for the majority of these DNA binding proteins. In addition to high affinity primary motif interactions, we also observe interactions with secondary motifs. The ability of a number of ApiAP2 proteins to bind multiple, distinct motifs significantly increases the potential complexity of the transcriptional regulatory networks governed by the ApiAP2 family. Using these newly identified sequence motifs, we infer the trans-factors associated with previously reported plasmodial cis-elements and provide evidence that ApiAP2 proteins modulate key regulatory decisions at all stages of parasite development. Our results offer a detailed view of ApiAP2 DNA binding specificity and take the first step toward inferring comprehensive gene regulatory networks for P. falciparum.
Affiliation(s)
- Tracey L. Campbell
- Department of Molecular Biology & Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Erandi K. De Silva
- Department of Molecular Biology & Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Kellen L. Olszewski
- Department of Molecular Biology & Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Olivier Elemento
- Institute for Computational Medicine, Weill Cornell Medical College, New York, New York, United States of America
- Manuel Llinás
- Department of Molecular Biology & Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
39
Waltman P, Kacmarczyk T, Bate AR, Kearns DB, Reiss DJ, Eichenberger P, Bonneau R. Multi-species integrative biclustering. Genome Biol 2010; 11:R96. [PMID: 20920250 PMCID: PMC2965388 DOI: 10.1186/gb-2010-11-9-r96] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Received: 08/10/2010] [Revised: 09/19/2010] [Accepted: 09/29/2010] [Indexed: 12/22/2022] Open
Abstract
We describe an algorithm, multi-species cMonkey, for the simultaneous biclustering of heterogeneous multiple-species data collections and apply the algorithm to a group of bacteria containing Bacillus subtilis, Bacillus anthracis, and Listeria monocytogenes. The algorithm reveals evolutionary insights into the surprisingly high degree of conservation of regulatory modules across these three species and allows data and insights from well-studied organisms to complement the analysis of related but less well studied organisms.
Affiliation(s)
- Peter Waltman
- Computer Science Department, Warren Weaver Hall (Room 305), 251 Mercer Street, New York, NY 10012, USA.
40
Gordân R, Narlikar L, Hartemink AJ. Finding regulatory DNA motifs using alignment-free evolutionary conservation information. Nucleic Acids Res 2010; 38:e90. [PMID: 20047961 PMCID: PMC2847231 DOI: 10.1093/nar/gkp1166] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 08/16/2009] [Revised: 10/30/2009] [Accepted: 11/23/2009] [Indexed: 01/01/2023] Open
Abstract
As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. Most comparative methods for transcription factor (TF) binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not align them correctly. Here, we present a novel, alignment-free approach for using conservation information for TF binding site discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use this definition to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is simple and fast. It requires neither sequence alignments nor the phylogenetic relationships between the orthologous sequences, yet it is more effective on real biological data than methods that do.
Affiliation(s)
- Raluca Gordân
- Department of Computer Science, Duke University, Box 90129, Durham, NC 27708, USA
41
Kumar L, Breakspear A, Kistler C, Ma LJ, Xie X. Systematic discovery of regulatory motifs in Fusarium graminearum by comparing four Fusarium genomes. BMC Genomics 2010; 11:208. [PMID: 20346147 PMCID: PMC2853525 DOI: 10.1186/1471-2164-11-208] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 09/09/2009] [Accepted: 03/26/2010] [Indexed: 11/24/2022] Open
Abstract
Background Fusarium graminearum (Fg), a major fungal pathogen of cultivated cereals, is responsible for billions of dollars in agriculture losses. There is a growing interest in understanding the transcriptional regulation of this organism, especially the regulation of genes underlying its pathogenicity. The generation of whole genome sequence assemblies for Fg and three closely related Fusarium species provides a unique opportunity for such a study. Results Applying comparative genomics approaches, we developed a computational pipeline to systematically discover evolutionarily conserved regulatory motifs in the promoter, downstream and the intronic regions of Fg genes, based on the multiple alignments of sequenced Fusarium genomes. Using this method, we discovered 73 candidate regulatory motifs in the promoter regions. Nearly 30% of these motifs are highly enriched in promoter regions of Fg genes that are associated with a specific functional category. Through comparison to Saccharomyces cerevisiae (Sc) and Schizosaccharomyces pombe (Sp), we observed conservation of transcription factors (TFs), their binding sites and the target genes regulated by these TFs related to pathways known to respond to stress conditions or phosphate metabolism. In addition, this study revealed 69 and 39 conserved motifs in the downstream regions and the intronic regions, respectively, of Fg genes. The top intronic motif is the splice donor site. For the downstream regions, we noticed an intriguing absence of the mammalian and Sc poly-adenylation signals among the list of conserved motifs. Conclusion This study provides the first comprehensive list of candidate regulatory motifs in Fg, and underscores the power of comparative genomics in revealing functional elements among related genomes. 
The conservation of regulatory pathways among the Fusarium genomes and the two yeast species reveals their functional significance, and provides new insights in their evolutionary importance among Ascomycete fungi.
42
Georgiev S, Boyle AP, Jayasurya K, Ding X, Mukherjee S, Ohler U. Evidence-ranked motif identification. Genome Biol 2010; 11:R19. [PMID: 20156354 PMCID: PMC2872879 DOI: 10.1186/gb-2010-11-2-r19] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.4] [Received: 05/20/2009] [Revised: 09/30/2009] [Accepted: 02/15/2010] [Indexed: 11/13/2022] Open
Abstract
cERMIT is a computationally efficient motif discovery tool based on analyzing genome-wide quantitative regulatory evidence. Instead of pre-selecting promising candidate sequences, it utilizes information across all sequence regions to search for high-scoring motifs. We apply cERMIT on a range of direct binding and overexpression datasets; it substantially outperforms state-of-the-art approaches on curated ChIP-chip datasets, and easily scales to current mammalian ChIP-seq experiments with data on thousands of non-coding regions.
Affiliation(s)
- Stoyan Georgiev
- Program for Computational Biology and Bioinformatics, Duke University, 102 North Building, Durham, NC 27708, USA
- Institute for Genome Sciences and Policy, Duke University, 101 Science Drive, Durham, NC 27708, USA
- Alan P Boyle
- Program for Computational Biology and Bioinformatics, Duke University, 102 North Building, Durham, NC 27708, USA
- Institute for Genome Sciences and Policy, Duke University, 101 Science Drive, Durham, NC 27708, USA
- Karthik Jayasurya
- Program for Computational Biology and Bioinformatics, Duke University, 102 North Building, Durham, NC 27708, USA
- Institute for Genome Sciences and Policy, Duke University, 101 Science Drive, Durham, NC 27708, USA
- Xuan Ding
- Institute for Genome Sciences and Policy, Duke University, 101 Science Drive, Durham, NC 27708, USA
- Sayan Mukherjee
- Institute for Genome Sciences and Policy, Duke University, 101 Science Drive, Durham, NC 27708, USA
- Department of Computer Science, Duke University, 450 Research Drive, Durham, NC 27708, USA
- Department of Statistical Science, Duke University, 214 Old Chemistry Building, Durham, NC 27708, USA
- Mathematics Department, Duke University, 102 Science Drive, Durham, NC 27708, USA
- Uwe Ohler
- Institute for Genome Sciences and Policy, Duke University, 101 Science Drive, Durham, NC 27708, USA
- Department of Computer Science, Duke University, 450 Research Drive, Durham, NC 27708, USA
- Department of Biostatistics and Bioinformatics, Duke University, Duke University School of Medicine, 2424 Erwin Road, Durham NC 27710, USA
43
Goodarzi H, Elemento O, Tavazoie S. Revealing global regulatory perturbations across human cancers. Mol Cell 2010; 36:900-11. [PMID: 20005852 DOI: 10.1016/j.molcel.2009.11.016] [Citation(s) in RCA: 162] [Impact Index Per Article: 11.6] [Received: 12/19/2008] [Revised: 07/09/2009] [Accepted: 11/17/2009] [Indexed: 01/04/2023]
Abstract
The discovery of pathways and regulatory networks whose perturbation contributes to neoplastic transformation remains a fundamental challenge for cancer biology. We show that such pathway perturbations, and the cis-regulatory elements through which they operate, can be efficiently extracted from global gene expression profiles. Our approach utilizes information-theoretic analysis of expression levels, pathways, and genomic sequences. Analysis across a diverse set of human cancers reveals the majority of previously known cancer pathways. Through de novo motif discovery we associate these pathways with transcription-factor binding sites and miRNA targets, including those of E2F, NF-Y, p53, and let-7. Follow-up experiments confirmed that these predictions correspond to functional in vivo regulatory interactions. Strikingly, the majority of the perturbations, associated with putative cis-regulatory elements, fall outside of known cancer pathways. Our study provides a systems-level dissection of regulatory perturbations in cancer, an essential component of a rational strategy for therapeutic intervention and drug-target discovery.
Affiliation(s)
- Hani Goodarzi
- Department of Molecular Biology, Princeton University, Princeton, NJ 08544, USA
44
Hu S, Xie Z, Onishi A, Yu X, Jiang L, Lin J, Rho HS, Woodard C, Wang H, Jeong JS, Long S, He X, Wade H, Blackshaw S, Qian J, Zhu H. Profiling the human protein-DNA interactome reveals ERK2 as a transcriptional repressor of interferon signaling. Cell 2009; 139:610-22. [PMID: 19879846 DOI: 10.1016/j.cell.2009.08.037] [Citation(s) in RCA: 300] [Impact Index Per Article: 20.0] [Received: 12/29/2008] [Revised: 07/13/2009] [Accepted: 08/20/2009] [Indexed: 11/28/2022]
Abstract
Protein-DNA interactions (PDIs) mediate a broad range of functions essential for cellular differentiation, function, and survival. However, it is still a daunting task to comprehensively identify and profile sequence-specific PDIs in complex genomes. Here, we have used a combined bioinformatics and protein microarray-based strategy to systematically characterize the human protein-DNA interactome. We identified 17,718 PDIs between 460 DNA motifs predicted to regulate transcription and 4,191 human proteins of various functional classes. Among them, we recovered many known PDIs for transcription factors (TFs). We identified a large number of unanticipated PDIs for known TFs, as well as for previously uncharacterized TFs. We also found that over three hundred unconventional DNA-binding proteins (uDBPs)--which include RNA-binding proteins, mitochondrial proteins, and protein kinases--showed sequence-specific PDIs. One such uDBP, ERK2, acts as a transcriptional repressor for interferon gamma-induced genes, suggesting important biological roles for such proteins.
Affiliation(s)
- Shaohui Hu
- Department of Pharmacology and Molecular Sciences, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
45
van Hijum SAFT, Medema MH, Kuipers OP. Mechanisms and evolution of control logic in prokaryotic transcriptional regulation. Microbiol Mol Biol Rev 2009; 73:481-509, Table of Contents. [PMID: 19721087 PMCID: PMC2738135 DOI: 10.1128/mmbr.00037-08] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.4] [Indexed: 11/20/2022] Open
Abstract
A major part of organismal complexity and versatility of prokaryotes resides in their ability to fine-tune gene expression to adequately respond to internal and external stimuli. Evolution has been very innovative in creating intricate mechanisms by which different regulatory signals operate and interact at promoters to drive gene expression. The regulation of target gene expression by transcription factors (TFs) is governed by control logic brought about by the interaction of regulators with TF binding sites (TFBSs) in cis-regulatory regions. A factor that in large part determines the strength of the response of a target to a given TF is motif stringency, the extent to which the TFBS fits the optimal TFBS sequence for a given TF. Advances in high-throughput technologies and computational genomics allow reconstruction of transcriptional regulatory networks in silico. To optimize the prediction of transcriptional regulatory networks, i.e., to separate direct regulation from indirect regulation, a thorough understanding of the control logic underlying the regulation of gene expression is required. This review summarizes the state of the art of the elements that determine the functionality of TFBSs by focusing on the molecular biological mechanisms and evolutionary origins of cis-regulatory regions.
Affiliation(s)
- Sacha A F T van Hijum
- Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Kerklaan 30, 9751 NN Haren, The Netherlands.
46
Unravelling cis-regulatory elements in the genome of the smallest photosynthetic eukaryote: phylogenetic footprinting in Ostreococcus. J Mol Evol 2009; 69:249-59. [PMID: 19693423 DOI: 10.1007/s00239-009-9271-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 05/08/2009] [Revised: 07/17/2009] [Accepted: 07/27/2009] [Indexed: 10/20/2022]
Abstract
We used a phylogenetic footprinting approach, adapted to high levels of divergence, to estimate the level of constraint in intergenic regions of the extremely gene dense Ostreococcus algae genomes (Chlorophyta, Prasinophyceae). We first benchmarked our method against the Saccharomyces sensu stricto genome data and found that the proportion of conserved non-coding sites was consistent with those obtained with methods using calibration by the neutral substitution rate. We then applied our method to the complete genomes of Ostreococcus tauri and O. lucimarinus, which are the most divergent species from the same genus sequenced so far. We found that 77% of intergenic regions in Ostreococcus still contain some phylogenetic footprints, as compared to 88% for Saccharomyces, corresponding to an average rate of constraint on intergenic region of 17% and 30%, respectively. A comparison with some known functional cis-regulatory elements enabled us to investigate whether some transcriptional regulatory pathways were conserved throughout the green lineage. Strikingly, the size of the phylogenetic footprints depends on gene orientation of neighboring genes, and appears to be genus-specific. In Ostreococcus, 5' intergenic regions contain four times more conserved sites than 3' intergenic regions, whereas in yeast a higher frequency of constrained sites in intergenic regions between genes on the same DNA strand suggests a higher frequency of bidirectional regulatory elements. The phylogenetic footprinting approach can be used despite high levels of divergence in the ultrasmall Ostreococcus algae, to decipher structure of constrained regulatory motifs, and identify putative regulatory pathways conserved within the green lineage.
47
Wang X, Haberer G, Mayer KFX. Discovery of cis-elements between sorghum and rice using co-expression and evolutionary conservation. BMC Genomics 2009; 10:284. [PMID: 19558665 PMCID: PMC2714861 DOI: 10.1186/1471-2164-10-284] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Received: 12/17/2008] [Accepted: 06/26/2009] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND The spatiotemporal regulation of gene expression largely depends on the presence and absence of cis-regulatory sites in the promoter. In the economically highly important grass family, our knowledge of transcription factor binding sites and transcriptional networks is still very limited. With the completion of the sorghum genome and the available rice genome sequence, comparative promoter analyses now allow genome-scale detection of conserved cis-elements. RESULTS In this study, we identified thousands of phylogenetic footprints conserved between orthologous rice and sorghum upstream regions that are supported by co-expression information derived from three different rice expression data sets. In a complementary approach, cis-motifs were discovered by their highly conserved co-occurrence in syntenic promoter pairs. Sequence conservation and matches to known plant motifs support our findings. Expression similarities of gene pairs positively correlate with the number of motifs that are shared by gene pairs and corroborate the importance of similar promoter architectures for concerted regulation. This strongly suggests that these motifs function in the regulation of transcript levels in rice and, presumably also in sorghum. CONCLUSION Our work provides the first large-scale collection of cis-elements for rice and sorghum and can serve as a paradigm for cis-element analysis through comparative genomics in grasses in general.
Affiliation(s)
- Xi Wang
- MIPS/IBIS Institute of Bioinformatics and System Biology, Helmholtz Center Munich, Neuherberg, Germany.
48
Vandepoele K, Quimbaya M, Casneuf T, De Veylder L, Van de Peer Y. Unraveling transcriptional control in Arabidopsis using cis-regulatory elements and coexpression networks. Plant Physiol 2009; 150:535-46. [PMID: 19357200 PMCID: PMC2689962 DOI: 10.1104/pp.109.136028] [Citation(s) in RCA: 160] [Impact Index Per Article: 10.7] [Received: 01/22/2009] [Accepted: 04/02/2009] [Indexed: 05/17/2023]
Abstract
Analysis of gene expression data generated by high-throughput microarray transcript profiling experiments has demonstrated that genes with an overall similar expression pattern are often enriched for similar functions. This guilt-by-association principle can be applied to define modular gene programs, identify cis-regulatory elements, or predict gene functions for unknown genes based on their coexpression neighborhood. We evaluated the potential to use Gene Ontology (GO) enrichment of a gene's coexpression neighborhood as a tool to predict its function but found overall low sensitivity scores (13%-34%). This indicates that for many functional categories, coexpression alone performs poorly to infer known biological gene functions. However, integration of cis-regulatory elements shows that 46% of the gene coexpression neighborhoods are enriched for one or more motifs, providing a valuable complementary source to functionally annotate genes. Through the integration of coexpression data, GO annotations, and a set of known cis-regulatory elements combined with a novel set of evolutionarily conserved plant motifs, we could link many genes and motifs to specific biological functions. Application of our coexpression framework extended with cis-regulatory element analysis on transcriptome data from the cell cycle-related transcription factor OBP1 yielded several coexpressed modules associated with specific cis-regulatory elements. Moreover, our analysis strongly suggests a feed-forward regulatory interaction between OBP1 and the E2F pathway. The ATCOECIS resource (http://bioinformatics.psb.ugent.be/ATCOECIS/) makes it possible to query coexpression data and GO and cis-regulatory element annotations and to submit user-defined gene sets for motif analysis, providing an access point to unravel the regulatory code underlying transcriptional control in Arabidopsis (Arabidopsis thaliana).
Affiliation(s)
- Klaas Vandepoele
- Department of Plant Systems Biology, Flanders Institute for Biotechnology, B-9052 Ghent, Belgium
49
Lu L, Li J. A combinatorial approach to determine the context-dependent role in transcriptional and posttranscriptional regulation in Arabidopsis thaliana. BMC Syst Biol 2009; 3:43. [PMID: 19400940 PMCID: PMC2694151 DOI: 10.1186/1752-0509-3-43] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/05/2008] [Accepted: 04/28/2009] [Indexed: 12/23/2022]
Abstract
Background While progresses have been made in mapping transcriptional regulatory networks, posttranscriptional regulatory roles just begin to be uncovered, which has arrested much attention due to the discovery of miRNAs. Here we demonstrated a combinatorial approach to incorporate transcriptional and posttranscriptional regulatory sequences with gene expression profiles to determine their probabilistic dependencies. Results We applied the proposed method to microarray time course gene expression profiles and could correctly predict expression patterns for more than 50% of 1,132 genes, based on the sequence motifs adopted in the network models, which was statistically significant. Our study suggested that the contribution of miRNA regulation towards gene expression in plants may be more restricted than that of transcription factors; however, miRNAs might confer additional layers of robustness on gene regulation networks. The programs written in C++ and PERL implementing methods in this work are available for download from our supplemental data web page. Conclusion In this study we demonstrated a combinatorial approach to incorporate miRNA target motifs (miRNA-mediated posttranscriptional regulatory sites) and TFBSs (transcription factor binding sites) with gene expression profiles to reconstruct the regulatory networks. The proposed approach may facilitate the incorporation of diverse sources with limited prior knowledge.
Affiliation(s)
- Le Lu
- Division of Structural and Computational Biology, School of Biological Sciences, Nanyang Technological University, Singapore 637551, Singapore.
50
Freckleton G, Lippman SI, Broach JR, Tavazoie S. Microarray profiling of phage-display selections for rapid mapping of transcription factor-DNA interactions. PLoS Genet 2009; 5:e1000449. [PMID: 19360118 PMCID: PMC2659770 DOI: 10.1371/journal.pgen.1000449] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Received: 10/19/2008] [Accepted: 03/10/2009] [Indexed: 11/19/2022] Open
Abstract
Modern computational methods are revealing putative transcription-factor (TF) binding sites at an extraordinary rate. However, the major challenge in studying transcriptional networks is to map these regulatory element predictions to the protein transcription factors that bind them. We have developed a microarray-based profiling of phage-display selection (MaPS) strategy that allows rapid and global survey of an organism's proteome for sequence-specific interactions with such putative DNA regulatory elements. Application to a variety of known yeast TF binding sites successfully identified the cognate TF from the background of a complex whole-proteome library. These factors contain DNA-binding domains from diverse families, including Myb, TEA, MADS box, and C2H2 zinc-finger. Using MaPS, we identified Dot6 as a trans-active partner of the long-predicted orphan yeast element Polymerase A & C (PAC). MaPS technology should enable rapid and proteome-scale study of bi-molecular interactions within transcriptional networks. Specific interactions between protein transcription factors (TFs) and their DNA recognition sites are central to the regulation of gene expression. Inter-species conservation of these TF binding sites (TFBS), and their statistical enrichment in sets of co-expressed genes, facilitates their large-scale prediction through computational sequence analysis. A major challenge in characterizing these putative TFBS is the identification of the proteins that bind them. We have developed a new approach to this problem by expressing random genomically encoded protein fragments as fusions to the capsid of bacteriophage T7. We select this diverse phage-display “library” for binding surface-immobilized instances of the TFBS in the form of short double-stranded DNA. This in vitro selection strategy leads to the enrichment of phage whose capsid-fusion peptides interact with the specific DNA sequence. 
Because each phage carries the DNA encoding the peptide fusion, the identity of the enriched phage can be determined through population-level PCR amplification of DNA inserts and their hybridization to DNA microarrays. Here, we show that this technology efficiently reveals the identity of proteins that bind known and novel predicted regulatory elements. Its application to a predicted yeast element (PAC) reveals Dot6 as one of its interaction partners, both in vitro and within the yeast nucleus.
Affiliation(s)
- Gordon Freckleton
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- The Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Soyeon I. Lippman
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- The Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- James R. Broach
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- The Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Saeed Tavazoie
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- The Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America