1
de Boer CG, Taipale J. Hold out the genome: a roadmap to solving the cis-regulatory code. Nature 2024; 625:41-50. [PMID: 38093018 DOI: 10.1038/s41586-023-06661-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2023] [Accepted: 09/20/2023] [Indexed: 01/05/2024]
Abstract
Gene expression is regulated by transcription factors that work together to read cis-regulatory DNA sequences. The 'cis-regulatory code' - how cells interpret DNA sequences to determine when, where and how much genes should be expressed - has proven to be exceedingly complex. Recently, advances in the scale and resolution of functional genomics assays and machine learning have enabled substantial progress towards deciphering this code. However, the cis-regulatory code will probably never be solved if models are trained only on genomic sequences; regions of homology can easily lead to overestimation of predictive performance, and our genome is too short and has insufficient sequence diversity to learn all relevant parameters. Fortunately, randomly synthesized DNA sequences enable testing a far larger sequence space than exists in our genomes, and designed DNA sequences enable targeted queries to maximally improve the models. As the same biochemical principles are used to interpret DNA regardless of its source, models trained on these synthetic data can predict genomic activity, often better than genome-trained models. Here we provide an outlook on the field, and propose a roadmap towards solving the cis-regulatory code by a combination of machine learning and massively parallel assays using synthetic DNA.
Affiliation(s)
- Carl G de Boer
  - School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada.
- Jussi Taipale
  - Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland.
  - Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden.
  - Department of Biochemistry, University of Cambridge, Cambridge, UK.
2
Martinez-Corral R, Park M, Biette KM, Friedrich D, Scholes C, Khalil AS, Gunawardena J, DePace AH. Transcriptional kinetic synergy: A complex landscape revealed by integrating modeling and synthetic biology. Cell Syst 2023; 14:324-339.e7. [PMID: 37080164 PMCID: PMC10472254 DOI: 10.1016/j.cels.2023.02.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/30/2021] [Revised: 08/22/2022] [Accepted: 02/10/2023] [Indexed: 04/22/2023]
Abstract
Transcription factors (TFs) control gene expression, often acting synergistically. Classical thermodynamic models offer a biophysical explanation for synergy based on binding cooperativity and regulated recruitment of RNA polymerase. Because transcription requires polymerase to transition through multiple states, recent work suggests that "kinetic synergy" can arise through TFs acting on distinct steps of the transcription cycle. These types of synergy are not mutually exclusive and are difficult to disentangle conceptually and experimentally. Here, we model and build a synthetic circuit in which TFs bind to a single shared site on DNA, such that TFs cannot synergize by simultaneous binding. We model mRNA production as a function of both TF binding and regulation of the transcription cycle, revealing a complex landscape dependent on TF concentration, DNA binding affinity, and regulatory activity. We use synthetic TFs to confirm that the transcription cycle must be integrated with recruitment for a quantitative understanding of gene regulation.
Affiliation(s)
- Minhee Park
  - Biological Design Center, Boston University, Boston, MA 02215, USA; Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
- Kelly M Biette
  - Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
- Dhana Friedrich
  - Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
- Clarissa Scholes
  - Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
- Ahmad S Khalil
  - Biological Design Center, Boston University, Boston, MA 02215, USA; Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
- Jeremy Gunawardena
  - Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
- Angela H DePace
  - Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA.
3
DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat Genet 2022; 54:613-624. [PMID: 35551305 DOI: 10.1038/s41588-022-01048-5] [Citation(s) in RCA: 69] [Impact Index Per Article: 34.5] [Received: 09/20/2021] [Accepted: 03/08/2022] [Indexed: 02/06/2023]
Abstract
Enhancer sequences control gene expression and comprise binding sites (motifs) for different transcription factors (TFs). Despite extensive genetic and computational studies, the relationship between DNA sequence and regulatory activity is poorly understood, and de novo enhancer design has been challenging. Here, we built a deep-learning model, DeepSTARR, to quantitatively predict the activities of thousands of developmental and housekeeping enhancers directly from DNA sequence in Drosophila melanogaster S2 cells. The model learned relevant TF motifs and higher-order syntax rules, including functionally nonequivalent instances of the same TF motif that are determined by motif-flanking sequence and intermotif distances. We validated these rules experimentally and demonstrated that they can be generalized to humans by testing more than 40,000 wildtype and mutant Drosophila and human enhancers. Finally, we designed and functionally validated synthetic enhancers with desired activities de novo.
4
Bhogale S, Sinha S. Thermodynamics-based modeling reveals regulatory effects of indirect transcription factor-DNA binding. iScience 2022; 25:104152. [PMID: 35465052 PMCID: PMC9018382 DOI: 10.1016/j.isci.2022.104152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2021] [Revised: 12/28/2021] [Accepted: 03/21/2022] [Indexed: 11/30/2022] Open
Abstract
Transcription factors (TFs) influence gene expression by binding to DNA, yet experimental data suggests that they also frequently bind regulatory DNA indirectly by interacting with other DNA-bound proteins. Here, we used a data modeling approach to test if such indirect binding by TFs plays a significant role in gene regulation. We first incorporated regulatory function of indirectly bound TFs into a thermodynamics-based model for predicting enhancer-driven expression from its sequence. We then fit the new model to a rich data set comprising hundreds of enhancers and their regulatory activities during mesoderm specification in Drosophila embryogenesis and showed that the newly incorporated mechanism results in significantly better agreement with data. In the process, we derived the first sequence-level model of this extensively characterized regulatory program. We further showed that allowing indirect binding of a TF explains its localization at enhancers more accurately than with direct binding only. Our model also provided a simple explanation of how a TF may switch between activating and repressive roles depending on context.
- Inclusion of indirect DNA binding of transcription factor improves enhancer function prediction
- Context specific activating or repressive roles of TFs
- Indirect binding improves fits to experimental TF-DNA binding data
- Role of Tinman depends on its DNA-binding mode (direct or indirect)
5
Garbuzov FE, Gursky VV. Nonequilibrium model of short-range repression in gene transcription regulation. Phys Rev E 2021; 104:014407. [PMID: 34412298 DOI: 10.1103/physreve.104.014407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2020] [Accepted: 06/24/2021] [Indexed: 11/07/2022]
Abstract
Transcription factors are proteins that regulate gene activity by activating or repressing gene transcription. A special class of transcriptional repressors operates via a short-range mechanism, making local DNA regions inaccessible to binding by activators, and thus providing an indirect repressive action on the target gene. This mechanism is commonly modeled assuming that repressors interact with DNA under thermodynamic equilibrium and neglecting some configurations of the gene regulatory region. We elaborate on a more general nonequilibrium model of short-range repression using the graph formalism for transitions between gene states, and we apply analytical calculations to compare it with the equilibrium model in terms of the repression strength and expression noise. In contrast to the equilibrium approach, the new model allows us to separate two basic mechanisms of short-range repression. The first mechanism is associated with the recruiting of factors that mediate chromatin condensation, and the second one concerns the blocking of factors that mediate chromatin loosening. The nonequilibrium model demonstrates better performance on previously published gene expression data obtained for transcription factors controlling Drosophila development, and furthermore it predicts that the first repression mechanism is the most favorable in this system. The presented approach can be scaled to larger gene networks and can be used to infer specific modes and parameters of transcriptional regulation from gene expression data.
Affiliation(s)
- F E Garbuzov
  - Ioffe Institute, 26 Polytekhnicheskaya, St. Petersburg 194021, Russia
- V V Gursky
  - Ioffe Institute, 26 Polytekhnicheskaya, St. Petersburg 194021, Russia
6
Ullah F, Ben-Hur A. A self-attention model for inferring cooperativity between regulatory features. Nucleic Acids Res 2021; 49:e77. [PMID: 33950192 PMCID: PMC8287919 DOI: 10.1093/nar/gkab349] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 10/07/2020] [Revised: 04/15/2021] [Accepted: 04/20/2021] [Indexed: 11/14/2022] Open
Abstract
Deep learning has demonstrated its predictive power in modeling complex biological phenomena such as gene expression. The value of these models hinges not only on their accuracy, but also on the ability to extract biologically relevant information from the trained models. While there has been much recent work on developing feature attribution methods that discover the most important features for a given sequence, inferring cooperativity between regulatory elements, which is the hallmark of phenomena such as gene expression, remains an open problem. We present SATORI, a Self-ATtentiOn based model to detect Regulatory element Interactions. Our approach combines convolutional layers with a self-attention mechanism that helps us capture a global view of the landscape of interactions between regulatory elements in a sequence. A comprehensive evaluation demonstrates the ability of SATORI to identify numerous statistically significant TF-TF interactions, many of which have been previously reported. Our method is able to detect higher numbers of experimentally verified TF-TF interactions than existing methods, and has the advantage of not requiring a computationally expensive post-processing step. Finally, SATORI can be used for detection of any type of feature interaction in models that use a similar attention mechanism, and is not limited to the detection of TF-TF interactions.
Affiliation(s)
- Fahad Ullah
  - Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA
- Asa Ben-Hur
  - Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA
7
Jindal GA, Farley EK. Enhancer grammar in development, evolution, and disease: dependencies and interplay. Dev Cell 2021; 56:575-587. [PMID: 33689769 PMCID: PMC8462829 DOI: 10.1016/j.devcel.2021.02.016] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 10/13/2020] [Revised: 02/15/2021] [Accepted: 02/16/2021] [Indexed: 12/19/2022]
Abstract
Each language has standard books describing that language's grammatical rules. Biologists have searched for similar, albeit more complex, principles relating enhancer sequence to gene expression. Here, we review the literature on enhancer grammar. We introduce dependency grammar, a model where enhancers encode information based on dependencies between enhancer features shaped by mechanistic, evolutionary, and biological constraints. Classifying enhancers based on the types of dependencies may identify unifying principles relating enhancer sequence to gene expression. Such rules would allow us to read the instructions for development within genomes and pinpoint causal enhancer variants underlying disease and evolutionary changes.
Affiliation(s)
- Granton A Jindal
  - Division of Cardiology, Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA; Division of Biological Sciences, Section of Molecular Biology, University of California San Diego, La Jolla, CA 92093, USA
- Emma K Farley
  - Division of Cardiology, Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA; Division of Biological Sciences, Section of Molecular Biology, University of California San Diego, La Jolla, CA 92093, USA.
8
Avsec Ž, Weilert M, Shrikumar A, Krueger S, Alexandari A, Dalal K, Fropf R, McAnany C, Gagneur J, Kundaje A, Zeitlinger J. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet 2021; 53:354-366. [PMID: 33603233 PMCID: PMC8812996 DOI: 10.1038/s41588-021-00782-6] [Citation(s) in RCA: 225] [Impact Index Per Article: 75.0] [Received: 07/19/2020] [Accepted: 01/07/2021] [Indexed: 01/30/2023]
Abstract
The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using clustered regularly interspaced short palindromic repeat (CRISPR)-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data.
Affiliation(s)
- Žiga Avsec
  - Department of Informatics, Technical University of Munich, Garching, Germany; Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-Universität München, Munich, Germany; Currently at DeepMind, London, UK
- Melanie Weilert
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Avanti Shrikumar
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Sabrina Krueger
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Amr Alexandari
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Khyati Dalal
  - Stowers Institute for Medical Research, Kansas City, MO, USA; The University of Kansas Medical Center, Kansas City, KS, USA
- Robin Fropf
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Charles McAnany
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Julien Gagneur
  - Department of Informatics, Technical University of Munich, Garching, Germany
- Anshul Kundaje
  - Department of Computer Science, Stanford University, Stanford, CA, USA; Department of Genetics, Stanford University, Stanford, CA, USA
- Julia Zeitlinger
  - Stowers Institute for Medical Research, Kansas City, MO, USA; The University of Kansas Medical Center, Kansas City, KS, USA
9
Chen L, Capra JA. Learning and interpreting the gene regulatory grammar in a deep learning framework. PLoS Comput Biol 2020; 16:e1008334. [PMID: 33137083 PMCID: PMC7660921 DOI: 10.1371/journal.pcbi.1008334] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 12/16/2019] [Revised: 11/12/2020] [Accepted: 09/12/2020] [Indexed: 12/12/2022] Open
Abstract
Deep neural networks (DNNs) have achieved state-of-the-art performance in identifying gene regulatory sequences, but they have provided limited insight into the biology of regulatory elements due to the difficulty of interpreting the complex features they learn. Several models of how combinatorial binding of transcription factors, i.e. the regulatory grammar, drives enhancer activity have been proposed, ranging from the flexible TF billboard model to the stringent enhanceosome model. However, there is limited knowledge of the prevalence of these (or other) sequence architectures across enhancers. Here we perform several hypothesis-driven analyses to explore the ability of DNNs to learn the regulatory grammar of enhancers. We created synthetic datasets based on existing hypotheses about combinatorial transcription factor binding site (TFBS) patterns, including homotypic clusters, heterotypic clusters, and enhanceosomes, from real TF binding motifs from diverse TF families. We then trained deep residual neural networks (ResNets) to model the sequences under a range of scenarios that reflect real-world multi-label regulatory sequence prediction tasks. We developed a gradient-based unsupervised clustering method to extract the patterns learned by the ResNet models. We demonstrated that simulated regulatory grammars are best learned in the penultimate layer of the ResNets, and the proposed method can accurately retrieve the regulatory grammar even when there is heterogeneity in the enhancer categories and a large fraction of TFBS outside of the regulatory grammar. However, we also identify common scenarios where ResNets fail to learn simulated regulatory grammars. Finally, we applied the proposed method to mouse developmental enhancers and were able to identify the components of a known heterotypic TF cluster. 
Our results provide a framework for interpreting the regulatory rules learned by ResNets, and they demonstrate that the ability and efficiency of ResNets in learning the regulatory grammar depends on the nature of the prediction task.
Affiliation(s)
- Ling Chen
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States of America
- John A. Capra
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States of America
  - Vanderbilt Genetics Institute and Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States of America
  - Department of Computer Science, Vanderbilt University, Nashville, TN, United States of America
10
Abstract
Key discoveries in Drosophila have shaped our understanding of cellular "enhancers." With a special focus on the fly, this chapter surveys properties of these adaptable cis-regulatory elements, whose actions are critical for the complex spatial/temporal transcriptional regulation of gene expression in metazoa. The powerful combination of genetics, molecular biology, and genomics available in Drosophila has provided an arena in which the developmental role of enhancers can be explored. Enhancers are characterized by diverse low- or high-throughput assays, which are challenging to interpret, as not all of these methods of identifying enhancers produce concordant results. As a model metazoan, the fly offers important advantages to comprehensive analysis of the central functions that enhancers play in gene expression, and their critical role in mediating the production of phenotypes from genotype and environmental inputs. A major challenge moving forward will be obtaining a quantitative understanding of how these cis-regulatory elements operate in development and disease.
Affiliation(s)
- Stephen Small
  - Department of Biology, Developmental Systems Training Program, New York University, 10003
- David N Arnosti
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824
11
Peng PC, Khoueiry P, Girardot C, Reddington JP, Garfield DA, Furlong EEM, Sinha S. The Role of Chromatin Accessibility in cis-Regulatory Evolution. Genome Biol Evol 2020; 11:1813-1828. [PMID: 31114856 PMCID: PMC6601868 DOI: 10.1093/gbe/evz103] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Accepted: 05/13/2019] [Indexed: 02/07/2023] Open
Abstract
Transcription factor (TF) binding is determined by sequence as well as chromatin accessibility. Although the role of accessibility in shaping TF-binding landscapes is well recorded, its role in evolutionary divergence of TF binding, which in turn can alter cis-regulatory activities, is not well understood. In this work, we studied the evolution of genome-wide binding landscapes of five major TFs in the core network of mesoderm specification, between Drosophila melanogaster and Drosophila virilis, and examined its relationship to accessibility and sequence-level changes. We generated chromatin accessibility data from three important stages of embryogenesis in both Drosophila melanogaster and Drosophila virilis and recorded conservation and divergence patterns. We then used multivariable models to correlate accessibility and sequence changes to TF-binding divergence. We found that accessibility changes can in some cases, for example, for the master regulator Twist and for earlier developmental stages, more accurately predict binding change than is possible using TF-binding motif changes between orthologous enhancers. Accessibility changes also explain a significant portion of the codivergence of TF pairs. We noted that accessibility and motif changes offer complementary views of the evolution of TF binding and developed a combined model that captures the evolutionary data much more accurately than either view alone. Finally, we trained machine learning models to predict enhancer activity from TF binding and used these functional models to argue that motif and accessibility-based predictors of TF-binding change can substitute for experimentally measured binding change, for the purpose of predicting evolutionary changes in enhancer activity.
Affiliation(s)
- Pei-Chen Peng
  - Department of Computer Science, University of Illinois at Urbana-Champaign; Center for Bioinformatics and Functional Genomics, Department of Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, CA
- Pierre Khoueiry
  - European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany; American University of Beirut (AUB), Department of Biochemistry and Molecular Genetics, Beirut, Lebanon
- Charles Girardot
  - European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
- James P Reddington
  - European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
- David A Garfield
  - European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany; IRI-Life Sciences, Humboldt Universität zu Berlin, Berlin, Germany
- Eileen E M Furlong
  - European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
- Saurabh Sinha
  - Department of Computer Science, University of Illinois at Urbana-Champaign; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign
12
Xie X, Hanson C, Sinha S. Mechanistic interpretation of non-coding variants for discovering transcriptional regulators of drug response. BMC Biol 2019; 17:62. [PMID: 31362726 PMCID: PMC6664756 DOI: 10.1186/s12915-019-0679-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 01/14/2019] [Accepted: 07/09/2019] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Identification of functional non-coding variants and their mechanistic interpretation is a major challenge of modern genomics, especially for precision medicine. Transcription factor (TF) binding profiles and epigenomic landscapes in reference samples allow functional annotation of the genome, but do not provide ready answers regarding the effects of non-coding variants on phenotypes. A promising computational approach is to build models that predict TF-DNA binding from sequence, and use such models to score a variant's impact on TF binding strength. Here, we asked if this mechanistic approach to variant interpretation can be combined with information on genotype-phenotype associations to discover transcription factors regulating phenotypic variation among individuals.
RESULTS We developed a statistical approach that integrates phenotype, genotype, gene expression, TF ChIP-seq, and Hi-C chromatin interaction data to answer this question. Using drug sensitivity of lymphoblastoid cell lines as the phenotype of interest, we tested if non-coding variants statistically linked to the phenotype are enriched for strong predicted impact on DNA binding strength of a TF and thus identified TFs regulating individual differences in the phenotype. Our approach relies on a new method for predicting variant impact on TF-DNA binding that uses a combination of biophysical modeling and machine learning. We report statistical and literature-based support for many of the TFs discovered here as regulators of drug response variation. We show that the use of mechanistically driven variant impact predictors can identify TF-drug associations that would otherwise be missed. We examined in depth one reported association, that of the transcription factor ELF1 with the drug doxorubicin, and identified several genes that may mediate this regulatory relationship.
CONCLUSION Our work represents initial steps in utilizing predictions of variant impact on TF binding sites for discovery of regulatory mechanisms underlying phenotypic variation. Future advances on this topic will be greatly beneficial to the reconstruction of phenotype-associated gene regulatory networks.
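The general idea mentioned in the background, scoring a variant by how much it changes predicted TF binding, can be sketched with a plain position weight matrix (PWM). This is a generic illustration and not the biophysical and machine-learning method the abstract refers to; the count matrix, the sequences and the function names are hypothetical.

```python
import math

# Hypothetical 4-position count matrix for a toy motif (not a real TF model).
TOY_COUNTS = [
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 17, "C": 1, "G": 1, "T": 1},
]


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({base: math.log2(((column[base] + pseudocount) / total) / background)
                    for base in "ACGT"})
    return pwm


def best_window_score(sequence, pwm):
    """Highest PWM score over all windows of the sequence (forward strand only)."""
    width = len(pwm)
    return max(sum(pwm[i][sequence[start + i]] for i in range(width))
               for start in range(len(sequence) - width + 1))


def variant_delta(ref_sequence, position, alt_base, pwm):
    """Change in best binding score caused by a single-base substitution."""
    alt_sequence = ref_sequence[:position] + alt_base + ref_sequence[position + 1:]
    return best_window_score(alt_sequence, pwm) - best_window_score(ref_sequence, pwm)


pwm = log_odds_pwm(TOY_COUNTS)
delta = variant_delta("TTGGAATT", position=3, alt_base="C", pwm=pwm)
```

A negative `delta` means the substitution weakens the best predicted site. A fuller treatment would also scan the reverse strand and use a calibrated binding model rather than raw log odds.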
Affiliation(s)
- Xiaoman Xie
  - Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Casey Hanson
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Saurabh Sinha
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA; Institute of Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA.
13
Datta V, Hannenhalli S, Siddharthan R. ChIPulate: A comprehensive ChIP-seq simulation pipeline. PLoS Comput Biol 2019; 15:e1006921. [PMID: 30897079 PMCID: PMC6445533 DOI: 10.1371/journal.pcbi.1006921] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/12/2018] [Revised: 04/02/2019] [Accepted: 03/04/2019] [Indexed: 12/17/2022] Open
Abstract
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) is a high-throughput technique to identify genomic regions that are bound in vivo by a particular protein, e.g., a transcription factor (TF). Biological factors, such as chromatin state, indirect and cooperative binding, as well as experimental factors, such as antibody quality, cross-linking, and PCR biases, are known to affect the outcome of ChIP-seq experiments. However, the relative impact of these factors on inferences made from ChIP-seq data is not entirely clear. Here, via a detailed ChIP-seq simulation pipeline, ChIPulate, we assess the impact of various biological and experimental sources of variation on several outcomes of a ChIP-seq experiment, viz., the recoverability of the TF binding motif, accuracy of TF-DNA binding detection, the sensitivity of inferred TF-DNA binding strength, and number of replicates needed to confidently infer binding strength. We find that the TF motif can be recovered despite poor and non-uniform extraction and PCR amplification efficiencies. The recovery of the motif is, however, affected to a larger extent by the fraction of sites that are either cooperatively or indirectly bound. Importantly, our simulations reveal that the number of ChIP-seq replicates needed to accurately measure in vivo occupancy at high-affinity sites is larger than the recommended community standards. Our results establish statistical limits on the accuracy of inferences of protein-DNA binding from ChIP-seq and suggest that increasing the mean extraction efficiency, rather than amplification efficiency, would better improve sensitivity. The source code and instructions for running ChIPulate can be found at https://github.com/vishakad/chipulate.
DNA-binding proteins perform many key roles in biology, such as transcriptional regulation of gene expression and chromatin modification. ChIP-seq (Chromatin immunoprecipitation followed by high-throughput sequencing) is a widely used experimental technique to identify DNA-binding sites of specific proteins of interest, within cells, genome-wide. DNA fragments from genomic regions that are bound by a protein of interest, often a transcription factor (TF), are selectively extracted using specific antibodies, amplified using PCR, and sequenced. The sequences are mapped to the reference genome. Regions where many sequences map, called “peaks”, are used to infer the location of TF-bound loci (peaks), in vivo occupancy at those loci, and the sequence pattern (motif) to which the TF shows a binding affinity. But measurements of TF occupancy and motif inference are vulnerable to several biological and experimental sources of variation that are poorly understood and difficult to assess directly. Here, we simulate key steps of the ChIP-seq protocol with the aim of estimating the relative effects of various sources of variations on motif inference and binding affinity estimations. Besides providing specific insights and recommendations, we provide a general framework to simulate sequence reads in a ChIP-seq experiment, which should considerably aid in the development of software aimed at analyzing ChIP-seq data.
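The protocol steps named in this summary (extraction of bound fragments, PCR amplification, sequencing) can be mimicked with a very small stochastic simulation. The sketch below is a toy written for illustration and is not the ChIPulate implementation or its interface; every parameter name and default value is an assumption.

```python
import random


def simulate_chip_reads(occupancies, n_cells=1000, extraction_eff=0.5,
                        pcr_cycles=5, pcr_eff=0.8, depth=2000, seed=0):
    """Return simulated read counts per site for given per-site occupancies."""
    rng = random.Random(seed)

    # 1. Binding and extraction: a cell contributes a fragment for a site only
    #    if the site is bound in that cell and the fragment is pulled down.
    fragments = [
        sum(1 for _ in range(n_cells)
            if rng.random() < occ and rng.random() < extraction_eff)
        for occ in occupancies
    ]

    # 2. PCR: in each cycle, every existing copy is duplicated with
    #    probability pcr_eff.
    for _ in range(pcr_cycles):
        fragments = [n + sum(1 for _ in range(n) if rng.random() < pcr_eff)
                     for n in fragments]

    # 3. Sequencing: draw `depth` reads in proportion to amplified copy number.
    if sum(fragments) == 0:
        return [0] * len(occupancies)
    reads = [0] * len(occupancies)
    for site in rng.choices(range(len(fragments)), weights=fragments, k=depth):
        reads[site] += 1
    return reads


reads = simulate_chip_reads([0.9, 0.1, 0.0])
```

Rerunning with different `extraction_eff` and `pcr_eff` values is one way to see, qualitatively, how each step distorts the relationship between occupancy and read count. The toy omits background fragments, cooperative and indirect binding, and replicates, all of which the abstract identifies as relevant.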
Collapse
Affiliation(s)
- Vishaka Datta
- Simons Centre for the Study of Living Machines, National Centre for Biological Sciences, TIFR, Bengaluru, Karnataka, India
| | - Sridhar Hannenhalli
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
| | - Rahul Siddharthan
- The Institute of Mathematical Sciences/HBNI, Taramani, Chennai, India
| |
|
14
|
Asada R, Umeda M, Adachi A, Senmatsu S, Abe T, Iwasaki H, Ohta K, Hoffman CS, Hirota K. Recruitment and delivery of the fission yeast Rst2 transcription factor via a local genome structure counteracts repression by Tup1-family corepressors. Nucleic Acids Res 2017; 45:9361-9371. [PMID: 28934464 PMCID: PMC5766161 DOI: 10.1093/nar/gkx555] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 03/16/2017] [Accepted: 06/14/2017] [Indexed: 12/12/2022] Open
Abstract
Transcription factors (TFs) determine the transcription activity of target genes and play a central role in controlling the transcription in response to various environmental stresses. Three dimensional genome structures such as local loops play a fundamental role in the regulation of transcription, although the link between such structures and the regulation of TF binding to cis-regulatory elements remains to be elucidated. Here, we show that during transcriptional activation of the fission yeast fbp1 gene, binding of Rst2 (a critical C2H2 zinc-finger TF) is mediated by a local loop structure. During fbp1 activation, Rst2 is first recruited to upstream-activating sequence 1 (UAS1), then it subsequently binds to UAS2 (a critical cis-regulatory site located approximately 600 base pairs downstream of UAS1) through a loop structure that brings UAS1 and UAS2 into spatially close proximity. Tup11/12 (the Tup-family corepressors) suppress direct binding of Rst2 to UAS2, but this suppression is counteracted by the recruitment of Rst2 at UAS1 and following delivery to UAS2 through a loop structure. These data demonstrate a previously unappreciated mechanism for the recruitment and expansion of TF-DNA interactions within a promoter mediated by local three-dimensional genome structures and for timely TF-binding via counteractive regulation by the Tup-family corepressors.
Affiliation(s)
- Ryuta Asada
- Department of Chemistry, Graduate School of Science and Engineering, Tokyo Metropolitan University, Minamiosawa 1-1, Hachioji-shi, Tokyo 192-0397, Japan
| | - Miki Umeda
- Department of Chemistry, Graduate School of Science and Engineering, Tokyo Metropolitan University, Minamiosawa 1-1, Hachioji-shi, Tokyo 192-0397, Japan
| | - Akira Adachi
- Department of Chemistry, Graduate School of Science and Engineering, Tokyo Metropolitan University, Minamiosawa 1-1, Hachioji-shi, Tokyo 192-0397, Japan
| | - Satoshi Senmatsu
- Department of Chemistry, Graduate School of Science and Engineering, Tokyo Metropolitan University, Minamiosawa 1-1, Hachioji-shi, Tokyo 192-0397, Japan
| | - Takuya Abe
- Department of Chemistry, Graduate School of Science and Engineering, Tokyo Metropolitan University, Minamiosawa 1-1, Hachioji-shi, Tokyo 192-0397, Japan
| | - Hiroshi Iwasaki
- Cell Biology Unit, Institute of Innovative Research, Tokyo Institute of Technology M6-11, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8550, Japan
| | - Kunihiro Ohta
- Department of Life Sciences, The University of Tokyo, Meguro-ku, Tokyo 153-8902, Japan.,Universal Biology Institute, The University of Tokyo, Bunkyo-ku, Tokyo 113-0033, Japan
| | | | - Kouji Hirota
- Department of Chemistry, Graduate School of Science and Engineering, Tokyo Metropolitan University, Minamiosawa 1-1, Hachioji-shi, Tokyo 192-0397, Japan
| |
|
15
|
Khoueiry P, Girardot C, Ciglar L, Peng PC, Gustafson EH, Sinha S, Furlong EE. Uncoupling evolutionary changes in DNA sequence, transcription factor occupancy and enhancer activity. eLife 2017; 6. [PMID: 28792889 PMCID: PMC5550276 DOI: 10.7554/elife.28440] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 05/08/2017] [Accepted: 07/21/2017] [Indexed: 12/15/2022] Open
Abstract
Sequence variation within enhancers plays a major role in both evolution and disease, yet its functional impact on transcription factor (TF) occupancy and enhancer activity remains poorly understood. Here, we assayed the binding of five essential TFs over multiple stages of embryogenesis in two distant Drosophila species (with 1.4 substitutions per neutral site), identifying thousands of orthologous enhancers with conserved or diverged combinatorial occupancy. We used these binding signatures to dissect two properties of developmental enhancers: (1) potential TF cooperativity, using signatures of co-associations and co-divergence in TF occupancy. This revealed conserved combinatorial binding despite sequence divergence, suggesting protein-protein interactions sustain conserved collective occupancy. (2) Enhancer in-vivo activity, revealing orthologous enhancers with conserved activity despite divergence in TF occupancy. Taken together, we identify enhancers with diverged motifs yet conserved occupancy and others with diverged occupancy yet conserved activity, emphasising the need to functionally measure the effect of divergence on enhancer activity.
Affiliation(s)
- Pierre Khoueiry
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - Charles Girardot
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - Lucia Ciglar
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - Pei-Chen Peng
- Carl R. Woese Institute of Genomic Biology, University of Illinois, Champaign, United States
| | - E Hilary Gustafson
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - Saurabh Sinha
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany.,Carl R. Woese Institute of Genomic Biology, University of Illinois, Champaign, United States
| | - Eileen EM Furlong
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| |
|
16
|
Sex combs reduced (Scr) regulatory region of Drosophila revisited. Mol Genet Genomics 2017; 292:773-787. [DOI: 10.1007/s00438-017-1309-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2016] [Accepted: 03/08/2017] [Indexed: 10/19/2022]
|
17
|
Contextual Refinement of Regulatory Targets Reveals Effects on Breast Cancer Prognosis of the Regulome. PLoS Comput Biol 2017; 13:e1005340. [PMID: 28103241 PMCID: PMC5289608 DOI: 10.1371/journal.pcbi.1005340] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 06/02/2016] [Revised: 02/02/2017] [Accepted: 01/03/2017] [Indexed: 01/12/2023] Open
Abstract
Gene expression regulators, such as transcription factors (TFs) and microRNAs (miRNAs), have varying regulatory targets based on the tissue and physiological state (context) within which they are expressed. While the emergence of regulator-characterizing experiments has inferred the target genes of many regulators across many contexts, methods for transferring regulator target genes across contexts are lacking. Further, regulator target gene lists frequently are not curated or have permissive inclusion criteria, impairing their use. Here, we present a method called iterative Contextual Transcriptional Activity Inference of Regulators (icTAIR) to resolve these issues. icTAIR takes a regulator’s previously-identified target gene list and combines it with gene expression data from a context, quantifying that regulator’s activity for that context. It then calculates the correlation between each listed target gene’s expression and the quantitative score of regulatory activity, removes the uncorrelated genes from the list, and iterates the process until it derives a stable list of refined target genes. To validate and demonstrate icTAIR’s power, we use it to refine the MSigDB c3 database of TF, miRNA and unclassified motif target gene lists for breast cancer. We then use its output for survival analysis with clinicopathological multivariable adjustment in 7 independent breast cancer datasets covering 3,430 patients. We uncover many novel prognostic regulators that were obscured prior to refinement, in particular NFY, and offer a detailed look at the composition and relationships among the breast cancer prognostic regulome. We anticipate icTAIR will be of general use in contextually refining regulator target genes for discoveries across many contexts. The icTAIR algorithm can be downloaded from https://github.com/icTAIR. 
Gene expression regulators, such as transcription factors and microRNAs, are critical actors in cellular physiology and pathophysiology and act by modulating the expression levels of sets of target genes. Given their significance, numerous experiments have sought to characterize the specific target genes of specific regulators, which in turn has led to regulator target gene list databases. Unfortunately, these lists are plagued by poor curation and validation. Further, all lists suffer from the fundamental issue that regulator targets vary across tissue type and physiological state, or “context”, making them poor for conducting downstream, context-specific analyses. To address this issue, here we present a method called icTAIR that contextually-refines regulator target gene lists. To demonstrate its value, we use icTAIR to take the largest-available database of regulator target gene lists, refine it for the breast cancer context, and use both the pre-refined and refined lists for downstream survival analyses in over 3,400 tumors. We find that icTAIR improves the statistical power of the analyses by multiple orders of magnitude. This in turn lets us map the relational network of breast cancer regulators and identify regulators with prognostic effects even after clinicopathological adjustment. We anticipate icTAIR will be broadly useful in regulator studies.
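The iterative refinement loop described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the icTAIR implementation: the activity score (mean z-score of the current targets), the Pearson correlation cutoff `min_corr`, and all function names are assumptions chosen for the example, since the abstract does not specify them.

```python
from math import sqrt


def zscore(values):
    """Standardize a list of numbers (population SD); constant input maps to zeros."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        return [0.0] * n
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Pearson correlation as the mean product of z-scores (0.0 if either input is constant)."""
    zx, zy = zscore(x), zscore(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


def refine_targets(expression, targets, min_corr=0.5, max_iter=100):
    """Iteratively drop target genes whose expression does not track regulator activity.

    expression: dict mapping gene name -> list of expression values across samples.
    targets:    initial (unrefined) target gene list for one regulator.
    min_corr:   hypothetical cutoff below which a gene counts as uncorrelated.
    """
    current = [g for g in targets if g in expression]
    for _ in range(max_iter):
        if not current:
            break
        # Assumed activity score: per-sample mean of the targets' z-scores.
        z = [zscore(expression[g]) for g in current]
        activity = [sum(col) / len(col) for col in zip(*z)]
        kept = [g for g in current if pearson(expression[g], activity) >= min_corr]
        if kept == current:  # list is stable, stop iterating
            break
        current = kept
    return current
```

Recomputing the activity score after each pruning step is what makes the procedure iterative: removing an uncorrelated gene changes the score, which can in turn change which of the remaining genes pass the cutoff.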
|
18
|
Yang CC, Chen MH, Lin SY, Andrews EH, Cheng C, Liu CC, Chen JJW. Inferring condition-specific targets of human TF-TF complexes using ChIP-seq data. BMC Genomics 2017; 18:61. [PMID: 28068916 PMCID: PMC5223348 DOI: 10.1186/s12864-016-3450-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/06/2016] [Accepted: 12/21/2016] [Indexed: 01/18/2023] Open
Abstract
Background: Transcription factors (TFs) often interact with one another to form TF complexes that bind DNA and regulate gene expression. Many databases have been created to describe known TF complexes identified by either mammalian two-hybrid experiments or data mining. Lately, a wealth of ChIP-seq data on human TFs under different experimental conditions has become available, making it possible to investigate condition-specific (cell type and/or physiologic state) TF complexes and their target genes. Results: Here, we developed a systematic pipeline to infer Condition-Specific Targets of human TF-TF complexes (called the CST pipeline) by integrating ChIP-seq data and TF motifs. In total, we predicted 2,392 TF complexes and 13,504 high-confidence or 127,994 low-confidence regulatory interactions amongst TF complexes and their target genes. We validated our predictions by (i) comparing predicted TF complexes to external TF complex databases, (ii) validating selected target genes of TF complexes using ChIP-qPCR and RT-PCR experiments, and (iii) analysing target genes of select TF complexes using gene ontology enrichment to demonstrate the accuracy of our work. Finally, the predicted results above were integrated and employed to construct a CST database. Conclusions: We built up a methodology to construct the CST database, which contributes to the analysis of transcriptional regulation and the identification of novel TF-TF complex formation in a certain condition. This database also allows users to visualize condition-specific TF regulatory networks through a user-friendly web interface.
Affiliation(s)
- Chia-Chun Yang
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan.,Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan.,Institute of Biomedical Sciences, National Chung Hsing University, No. 250, Kuo-Kuang Rd., 40227, Taichung, Taiwan
| | - Min-Hsuan Chen
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
| | - Sheng-Yi Lin
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan.,Institute of Biomedical Sciences, National Chung Hsing University, No. 250, Kuo-Kuang Rd., 40227, Taichung, Taiwan
| | - Erik H Andrews
- Department of Genetics, Geisel School of Medicine at Dartmouth, 03755, Hanover, NH, USA
| | - Chao Cheng
- Department of Genetics, Geisel School of Medicine at Dartmouth, 03755, Hanover, NH, USA.,Institute for Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth, 03766, Lebanon, NH, USA.
| | - Chun-Chi Liu
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan.,Institute of Biomedical Sciences, National Chung Hsing University, No. 250, Kuo-Kuang Rd., 40227, Taichung, Taiwan.,Agricultural Biotechnology Centre, National Chung Hsing University, Taichung, Taiwan.
| | - Jeremy J W Chen
- Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan.,Institute of Biomedical Sciences, National Chung Hsing University, No. 250, Kuo-Kuang Rd., 40227, Taichung, Taiwan.,Agricultural Biotechnology Centre, National Chung Hsing University, Taichung, Taiwan.
| |
|
19
|
Crocker J, Tsai A, Stern DL. A Fully Synthetic Transcriptional Platform for a Multicellular Eukaryote. Cell Rep 2017; 18:287-296. [DOI: 10.1016/j.celrep.2016.12.025] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 11/09/2015] [Revised: 12/14/2015] [Accepted: 12/07/2016] [Indexed: 01/12/2023] Open
|
20
|
Peng PC, Sinha S. Quantitative modeling of gene expression using DNA shape features of binding sites. Nucleic Acids Res 2016; 44:e120. [PMID: 27257066 PMCID: PMC5291265 DOI: 10.1093/nar/gkw446] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 11/02/2015] [Revised: 05/06/2016] [Accepted: 05/09/2016] [Indexed: 12/11/2022] Open
Abstract
Prediction of gene expression levels driven by regulatory sequences is pivotal in genomic biology. A major focus in transcriptional regulation is sequence-to-expression modeling, which interprets the enhancer sequence based on transcription factor concentrations and DNA binding specificities and predicts precise gene expression levels in varying cellular contexts. Such models largely rely on the position weight matrix (PWM) model for DNA binding, and the effect of alternative models based on DNA shape remains unexplored. Here, we propose a statistical thermodynamics model of gene expression using DNA shape features of binding sites. We used rigorous methods to evaluate the fits of expression readouts of 37 enhancers regulating spatial gene expression patterns in the Drosophila embryo, and show that DNA shape-based models perform arguably better than PWM-based models. We also observed that DNA shape captures information complementary to the PWM, in a way that is useful for expression modeling. Furthermore, we tested if combining shape and PWM-based features provides better predictions than using either binding model alone. Our work demonstrates that the increasingly popular DNA-binding models based on local DNA shape can be useful in sequence-to-expression modeling. It also provides a framework for future studies to predict gene expression better than with PWM models alone.
Affiliation(s)
- Pei-Chen Peng
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Saurabh Sinha
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| |
|
21
|
Gomes ALC, Wang HH. The Role of Genome Accessibility in Transcription Factor Binding in Bacteria. PLoS Comput Biol 2016; 12:e1004891. [PMID: 27104615 PMCID: PMC4841574 DOI: 10.1371/journal.pcbi.1004891] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 07/23/2015] [Accepted: 03/31/2016] [Indexed: 02/01/2023] Open
Abstract
ChIP-seq enables genome-scale identification of regulatory regions that govern gene expression. However, the biological insights generated from ChIP-seq analysis have been limited to predictions of binding sites and cooperative interactions. Furthermore, ChIP-seq data often poorly correlate with in vitro measurements or predicted motifs, highlighting that binding affinity alone is insufficient to explain transcription factor (TF)-binding in vivo. One possibility is that binding sites are not equally accessible across the genome. A more comprehensive biophysical representation of TF-binding is required to improve our ability to understand, predict, and alter gene expression. Here, we show that genome accessibility is a key parameter that impacts TF-binding in bacteria. We developed a thermodynamic model that parameterizes ChIP-seq coverage in terms of genome accessibility and binding affinity. The role of genome accessibility is validated using a large-scale ChIP-seq dataset of the M. tuberculosis regulatory network. We find that accounting for genome accessibility led to a model that explains 63% of the ChIP-seq profile variance, while a model based on motif score alone explains only 35% of the variance. Moreover, our framework enables de novo ChIP-seq peak prediction and is useful for inferring TF-binding peaks in new experimental conditions by reducing the need for additional experiments. We observe that the genome is more accessible in intergenic regions, and that increased accessibility is positively correlated with gene expression and anti-correlated with distance to the origin of replication. Our biophysically motivated model provides a more comprehensive description of TF-binding in vivo from first principles towards a better representation of gene regulation in silico, with promising applications in systems biology.
Affiliation(s)
- Antonio L. C. Gomes
- Department of Systems Biology, Columbia University, New York, New York, United States of America
| | - Harris H. Wang
- Department of Systems Biology, Columbia University, New York, New York, United States of America
- Department of Pathology and Cell Biology, Columbia University, New York, New York, United States of America
| |
|
22
|
Bottani S, Veitia RA. Hill function-based models of transcriptional switches: impact of specific, nonspecific, functional and nonfunctional binding. Biol Rev Camb Philos Soc 2016; 92:953-963. [PMID: 27061969 DOI: 10.1111/brv.12262] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 07/23/2015] [Revised: 02/12/2016] [Accepted: 02/16/2016] [Indexed: 12/25/2022]
Abstract
We explore minimalist models of transcription in which we take into account that a cis-regulatory sequence is embedded in, and interacts with, a complex genome. The classical Hill equation is the simplest way to represent a transcriptional response. However, it may overlook the fact that a transcription factor (TF) establishes specific and nonspecific nonfunctional interactions with chromatin. Classical papers have shown that nonfunctional binding (not leading to transcription) may influence gene expression. We examine how the presence of additional binding sites for a TF, besides those on the gene(s) of interest, affects the shape and parameters of the transcriptional response. We consider two conditions: at equilibrium and at steady-state. In many cases the TF level is determined by the position of the cell within a spatial or temporal gradient. We show that such gradients can be adjusted by evolutionary selection to compensate for the alteration of the gene transcription response by the presence of nonfunctional binding sites. Finally, we analyse how the transcriptional response is affected by a decrease in TF concentration, as in cases of haploinsufficiency. We show that the nonlinearity of the transcriptional response as a function of [TF] exacerbates the effect of a decrease in the latter, at least for weakly expressed TFs. Although decades of work on TFs have led to the impression that almost everything is known about the control of gene expression, we show that even the simplest models of transcription control have not delivered all their secrets yet.
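To make the idea in this abstract concrete, the sketch below combines the classical Hill equation with a simple equilibrium titration of the TF by nonfunctional (decoy) sites. It is an illustrative assumption rather than the authors' formulation: it supposes a single class of identical decoy sites with dissociation constant `kd_decoy`, expresses all quantities in the same concentration units, and uses hypothetical function names. Free TF `F` follows from the conservation relation T = F + N·F/(Kd + F), which rearranges to the quadratic F² + (Kd + N − T)·F − Kd·T = 0.

```python
from math import sqrt


def hill(tf, k_half, n):
    """Classical Hill response: fraction of maximal transcription at TF level `tf`."""
    return tf ** n / (k_half ** n + tf ** n)


def free_tf(total_tf, n_decoys, kd_decoy):
    """Free TF at equilibrium when `n_decoys` nonfunctional sites sequester part of it.

    Positive root of F^2 + (Kd + N - T) * F - Kd * T = 0 (assumed decoy model).
    """
    b = kd_decoy + n_decoys - total_tf
    return (-b + sqrt(b * b + 4.0 * kd_decoy * total_tf)) / 2.0


def response_with_decoys(total_tf, n_decoys, kd_decoy, k_half, n):
    """Hill response evaluated on the free, rather than total, TF concentration."""
    return hill(free_tf(total_tf, n_decoys, kd_decoy), k_half, n)
```

With no decoys, `free_tf` returns the total TF level and the model reduces to the plain Hill equation. Under these assumptions, adding decoys lowers the free TF level and therefore the response at any given total TF, and for a Hill coefficient above 1 at low TF levels, halving the TF reduces the response by more than half.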
Affiliation(s)
- Samuel Bottani
- Matière et Systèmes Complexes CNRS UMR 7057, 75013 Paris, France.,Université Paris Diderot, Sorbonne Paris Cité, 75013 Paris, France
| | - Reiner A Veitia
- Université Paris Diderot, Sorbonne Paris Cité, 75013 Paris, France.,Institut Jacques Monod, CNRS UMR 7592, 75013 Paris, France
| |
|
23
|
Peng PC, Hassan Samee MA, Sinha S. Incorporating chromatin accessibility data into sequence-to-expression modeling. Biophys J 2015; 108:1257-1267. [PMID: 25762337 DOI: 10.1016/j.bpj.2014.12.037] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 08/18/2014] [Revised: 12/01/2014] [Accepted: 12/11/2014] [Indexed: 01/30/2023] Open
Abstract
Prediction of gene expression levels from regulatory sequences is one of the major challenges of genomic biology today. A particularly promising approach to this problem is that taken by thermodynamics-based models that interpret an enhancer sequence in a given cellular context specified by transcription factor concentration levels and predict precise expression levels driven by that enhancer. Such models have so far not accounted for the effect of chromatin accessibility on interactions between transcription factor and DNA and consequently on gene-expression levels. Here, we extend a thermodynamics-based model of gene expression, called GEMSTAT (Gene Expression Modeling Based on Statistical Thermodynamics), to incorporate chromatin accessibility data and quantify its effect on accuracy of expression prediction. In the new model, called GEMSTAT-A, accessibility at a binding site is assumed to affect the transcription factor's binding strength at the site, whereas all other aspects are identical to the GEMSTAT model. We show that this modification results in significantly better fits in a data set of over 30 enhancers regulating spatial expression patterns in the blastoderm-stage Drosophila embryo. It is important to note that the improved fits result not from an overall elevated accessibility in active enhancers but from the variation of accessibility levels within an enhancer. With whole-genome DNA accessibility measurements becoming increasingly popular, our work demonstrates how such data may be useful for sequence-to-expression models. It also calls for future advances in modeling accessibility levels from sequence and the transregulatory context, so as to predict accurately the effect of cis and trans perturbations on gene expression.
Affiliation(s)
- Pei-Chen Peng
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois
| | - Md Abul Hassan Samee
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois
| | - Saurabh Sinha
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois; Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois.
| |
|
24
|
Samee MAH, Lim B, Samper N, Lu H, Rushlow CA, Jiménez G, Shvartsman SY, Sinha S. A Systematic Ensemble Approach to Thermodynamic Modeling of Gene Expression from Sequence Data. Cell Syst 2015; 1:396-407. [PMID: 27136354 DOI: 10.1016/j.cels.2015.12.002] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Received: 08/10/2015] [Revised: 10/19/2015] [Accepted: 12/02/2015] [Indexed: 11/17/2022]
Abstract
To understand the relationship between an enhancer DNA sequence and quantitative gene expression, thermodynamics-driven mathematical models of transcription are often employed. These "sequence-to-expression" models can describe an incomplete or even incorrect set of regulatory relationships if the parameter space is not searched systematically. Here, we focus on an enhancer of the Drosophila gene ind and demonstrate how a systematic search of parameter space can reveal a more comprehensive picture of a gene's regulatory mechanisms, resolve outstanding ambiguities, and suggest testable hypotheses. We describe an approach that generates an ensemble of ind models; all of these models are technically acceptable solutions to the sequence-to-expression problem in light of wild-type data, and some represent mechanistically distinct hypotheses about the regulation of ind. This ensemble can be restricted to biologically plausible models using requirements gleaned from in vivo perturbation experiments. Biologically plausible models make unique predictions about how specific ind enhancer sequences affect ind expression; we validate these predictions in vivo through site mutagenesis in transgenic Drosophila embryos.
Affiliation(s)
- Md Abul Hassan Samee
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Bomyi Lim
- Department of Chemical and Biological Engineering and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Núria Samper
- Department of Developmental Biology, Instituto de Biología Molecular de Barcelona, Consejo Superior de Investigaciones Científicas (CSIC), Barcelona 08208, Spain
| | - Hang Lu
- School of Chemical and Biomolecular Engineering and Parker H. Petit Institute for Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, GA 30332, USA
| | | | - Gerardo Jiménez
- Department of Developmental Biology, Instituto de Biología Molecular de Barcelona, Consejo Superior de Investigaciones Científicas (CSIC), Barcelona 08208, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona 08010, Spain
| | - Stanislav Y Shvartsman
- Department of Chemical and Biological Engineering and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Saurabh Sinha
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
| |
|
25
|
Contribution of Sequence Motif, Chromatin State, and DNA Structure Features to Predictive Models of Transcription Factor Binding in Yeast. PLoS Comput Biol 2015; 11:e1004418. [PMID: 26291518 PMCID: PMC4546298 DOI: 10.1371/journal.pcbi.1004418] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 11/19/2014] [Accepted: 06/29/2015] [Indexed: 11/19/2022] Open
Abstract
Transcription factor (TF) binding is determined by the presence of specific sequence motifs (SM) and chromatin accessibility, where the latter is influenced by both chromatin state (CS) and DNA structure (DS) properties. Although SM, CS, and DS have been used to predict TF binding sites, a predictive model that jointly considers CS and DS has not been developed to predict either TF-specific binding or general binding properties of TFs. Using budding yeast as a model, we found that machine learning classifiers trained with either CS or DS features alone perform better in predicting TF-specific binding compared to SM-based classifiers. In addition, simultaneously considering CS and DS further improves the accuracy of the TF binding predictions, indicating the highly complementary nature of these two properties. The contributions of SM, CS, and DS features to binding site predictions differ greatly between TFs, allowing TF-specific predictions and potentially reflecting different TF binding mechanisms. In addition, a "TF-agnostic" predictive model based on three DNA “intrinsic properties” (in silico predicted nucleosome occupancy, major groove geometry, and dinucleotide free energy) that can be calculated from genomic sequences alone has performance that rivals the model incorporating experiment-derived data. This intrinsic property model allows prediction of binding regions not only across TFs, but also across DNA-binding domain families with distinct structural folds. Furthermore, these predicted binding regions can help identify TF binding sites that have a significant impact on target gene expression. Because the intrinsic property model allows prediction of binding regions across DNA-binding domain families, it is TF agnostic and likely describes general binding potential of TFs. Thus, our findings suggest that it is feasible to establish a TF agnostic model for identifying functional regulatory regions in potentially any sequenced genome.
Identification of transcription factor binding sites based on sequence motifs is typically accompanied by a high false positive rate. Increasing evidence suggests that there are many other factors besides DNA sequence that may affect the binding and interaction of TFs with DNA. Through the integration of sequence motif, chromatin state, and DNA structure properties, we show that TF binding can be better predicted. Moreover, considering chromatin state and DNA structure properties simultaneously yields a significant improvement. While the binding of some TFs can be readily predicted using either chromatin state information or DNA structure, other TFs need both. Thus, our findings provide insights on how different histone modifications and DNA structure properties may influence the binding of a particular TF and thus how TFs regulate gene expression. These features are referred to as sequence “intrinsic properties” because they can be predicted from sequences alone. These intrinsic properties can be used to build a TF binding prediction model that has a similar performance to considering all features. Moreover, the intrinsic property model allows TFBS predictions not only across TFs, but also across DNA-binding domain families that are present in most eukaryotes, suggesting that the model likely can be used across species.
|
26
|
Sebeson A, Xi L, Zhang Q, Sigmund A, Wang JP, Widom J, Wang X. Differential Nucleosome Occupancies across Oct4-Sox2 Binding Sites in Murine Embryonic Stem Cells. PLoS One 2015; 10:e0127214. [PMID: 25992972 PMCID: PMC4436218 DOI: 10.1371/journal.pone.0127214] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 08/04/2014] [Accepted: 04/13/2015] [Indexed: 12/03/2022] Open
Abstract
The binding sequence for any transcription factor can be found millions of times within a genome, yet only a small fraction of these sequences encode functional transcription factor binding sites. One of the reasons for this dichotomy is that many other factors, such as nucleosomes, compete for binding. To study how the competition between nucleosomes and transcription factors helps determine a functional transcription factor site from a predicted transcription factor site, we compared experimentally-generated in vitro nucleosome occupancy with in vivo nucleosome occupancy and transcription factor binding in murine embryonic stem cells. Using a solution hybridization enrichment technique, we generated a high-resolution nucleosome map from targeted regions of the genome containing predicted sites and functional sites of Oct4/Sox2 regulation. We found that at Pax6 and Nes, which are bivalently poised in stem cells, functional Oct4 and Sox2 sites show high amounts of in vivo nucleosome displacement compared to in vitro. Oct4 and Sox2, which are active, show no significant displacement of in vivo nucleosomes at functional sites, similar to nonfunctional Oct4/Sox2 binding. This study highlights a complex interplay between Oct4 and Sox2 transcription factors and nucleosomes among different target genes, which may result in distinct patterns of stem cell gene regulation.
Affiliation(s)
- Amy Sebeson
- Department of Molecular Biosciences, Northwestern University, Evanston, Illinois, United States of America
| | - Liqun Xi
- Department of Statistics, Northwestern University, Evanston, Illinois, United States of America
| | - Quanwei Zhang
- Department of Statistics, Northwestern University, Evanston, Illinois, United States of America
| | - Audrey Sigmund
- Department of Molecular Biosciences, Northwestern University, Evanston, Illinois, United States of America
| | - Ji-Ping Wang
- Department of Statistics, Northwestern University, Evanston, Illinois, United States of America
- * E-mail: (XW); (J-PW)
| | - Jonathan Widom
- Department of Molecular Biosciences, Northwestern University, Evanston, Illinois, United States of America
| | - Xiaozhong Wang
- Department of Molecular Biosciences, Northwestern University, Evanston, Illinois, United States of America
- * E-mail: (XW); (J-PW)
| |
|
27
|
Colombo N, Vlassis N. FastMotif: spectral sequence motif discovery. Bioinformatics 2015; 31:2623-31. [PMID: 25886979 DOI: 10.1093/bioinformatics/btv208] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 09/18/2014] [Accepted: 04/09/2015] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: Sequence discovery tools play a central role in several fields of computational biology. In the framework of Transcription Factor binding studies, most of the existing motif finding algorithms are computationally demanding, and they may not be able to support the increasingly large datasets produced by modern high-throughput sequencing technologies. RESULTS: We present FastMotif, a new motif discovery algorithm that is built on a recent machine learning technique referred to as Method of Moments. Based on spectral decompositions, our method is robust to model misspecifications and is not prone to locally optimal solutions. We obtain an algorithm that is extremely fast and designed for the analysis of big sequencing data. On HT-Selex data, FastMotif extracts motif profiles that match those computed by various state-of-the-art algorithms, but one order of magnitude faster. We provide a theoretical and numerical analysis of the algorithm's robustness and discuss its sensitivity with respect to the free parameters. AVAILABILITY AND IMPLEMENTATION: The Matlab code of FastMotif is available from http://lcsb-portal.uni.lu/bioinformatics. CONTACT: vlassis@adobe.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nicoló Colombo
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Luxembourg and
| | | |
|
28
|
Blatti C, Kazemian M, Wolfe S, Brodsky M, Sinha S. Integrating motif, DNA accessibility and gene expression data to build regulatory maps in an organism. Nucleic Acids Res 2015; 43:3998-4012. [PMID: 25791631 PMCID: PMC4417154 DOI: 10.1093/nar/gkv195] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Received: 02/20/2015] [Accepted: 02/24/2015] [Indexed: 11/17/2022] Open
Abstract
Characterization of cell-type-specific regulatory networks and elements is a major challenge in genomics, and emerging strategies frequently employ high-throughput genome-wide assays of transcription factor (TF) to DNA binding, histone modifications or chromatin state. However, these experiments remain too difficult/expensive for many laboratories to apply comprehensively to their system of interest. Here, we explore the potential of elucidating regulatory systems in varied cell types using computational techniques that rely only on gene expression data, low-resolution chromatin accessibility data, and TF–DNA binding specificities (‘motifs’). We show that static computational motif scans overlaid with chromatin accessibility data reasonably approximate experimentally measured TF–DNA binding. We demonstrate that predicted binding profiles and expression patterns of hundreds of TFs are sufficient to identify major regulators of ∼200 spatiotemporal expression domains in the Drosophila embryo. We are then able to learn reliable statistical models of enhancer activity for over 70 expression domains and apply those models to annotate domain-specific enhancers genome-wide. Throughout this work, we apply our motif- and accessibility-based approach to comprehensively characterize the regulatory network of fruitfly embryonic development and show that the accuracy of our computational method compares favorably to approaches that rely on data from many experimental assays.
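The idea of a static motif scan overlaid with accessibility data can be illustrated with a minimal sketch. This is not the authors' pipeline: the count matrix, the pseudocount, the uniform background, the score threshold and the function names are all hypothetical choices made for illustration. A window is reported only if its log-odds score passes the threshold and it lies entirely inside an accessible interval.

```python
import math

BASES = "ACGT"


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log2-odds weight matrix."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score_window(pwm, window):
    """Sum the per-position weights for one sequence window."""
    return sum(col[base] for col, base in zip(pwm, window))


def is_accessible(start, end, open_regions):
    """True if the half-open window [start, end) lies inside an open region."""
    return any(lo <= start and end <= hi for lo, hi in open_regions)


def predict_sites(sequence, pwm, open_regions, threshold):
    """Return (start, score) for motif matches that fall in accessible DNA."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = score_window(pwm, window)
        if score >= threshold and is_accessible(start, start + width, open_regions):
            hits.append((start, round(score, 2)))
    return hits


# Hypothetical 4-bp motif strongly preferring TGAC (assumed counts).
counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]
pwm = log_odds_pwm(counts)
sequence = "AATGACGGTTTGACAA"
open_regions = [(0, 8)]  # assumed accessible interval, half-open coordinates
hits = predict_sites(sequence, pwm, open_regions, threshold=5.0)
```

In this toy input the motif occurs twice, but only the first occurrence lies in the assumed accessible interval, so only that one is reported. Treating accessibility as a hard mask is one simple option; a weighted combination of motif score and accessibility signal would be another.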
Affiliation(s)
- Charles Blatti
- Department of Computer Science, University of Illinois, Urbana, IL 61801, USA
| | - Majid Kazemian
- National Heart Lung and Blood Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Scot Wolfe
- Program in Gene Function and Expression, University of Massachusetts Medical School, Worcester, MA 01655, USA Department of Molecular Medicine, University of Massachusetts Medical School, Worcester, MA 01655, USA
| | - Michael Brodsky
- Program in Gene Function and Expression, University of Massachusetts Medical School, Worcester, MA 01655, USA Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, MA 01655, USA
| | - Saurabh Sinha
- Department of Computer Science, University of Illinois, Urbana, IL 61801, USA Institute of Genomic Biology, University of Illinois, Urbana, IL 61801, USA
| |
|
29
|
Zabet NR, Adryan B. Estimating binding properties of transcription factors from genome-wide binding profiles. Nucleic Acids Res 2015; 43:84-94. [PMID: 25432957 PMCID: PMC4288167 DOI: 10.1093/nar/gku1269] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 04/22/2014] [Revised: 10/22/2014] [Accepted: 11/19/2014] [Indexed: 12/20/2022] Open
Abstract
The binding of transcription factors (TFs) is essential for gene expression. One important characteristic is the actual occupancy of a putative binding site in the genome. In this study, we propose an analytical model to predict genomic occupancy that incorporates the preferred target sequence of a TF in the form of a position weight matrix (PWM), DNA accessibility data (in the case of eukaryotes), the number of TF molecules expected to be bound specifically to the DNA and a parameter that modulates the specificity of the TF. Given actual occupancy data in the form of ChIP-seq profiles, we backwards inferred copy number and specificity for five Drosophila TFs during early embryonic development: Bicoid, Caudal, Giant, Hunchback and Kruppel. Our results suggest that these TFs display thousands of molecules that are specifically bound to the DNA and that whilst Bicoid and Caudal display a higher specificity, the other three TFs (Giant, Hunchback and Kruppel) display lower specificity in their binding (despite having PWMs with higher information content). This study gives further weight to earlier investigations into TF copy numbers that suggest a significant proportion of molecules are not bound specifically to the DNA.
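To make the four ingredients named in this abstract concrete (PWM score, DNA accessibility, number of specifically bound molecules, and a specificity parameter), here is a minimal sketch of a generic statistical-thermodynamics occupancy calculation. The functional form, the function name and the parameter names are assumptions chosen for illustration and should not be read as the equation published by the authors. In this sketch a larger `scaling` value flattens the differences between strong and weak sites, which corresponds to lower specificity.

```python
import math


def site_occupancy(pwm_scores, accessibility, n_bound, scaling):
    """Approximate per-site occupancy from PWM scores and accessibility.

    pwm_scores    -- PWM score of each candidate site
    accessibility -- accessibility of each site, between 0 and 1
    n_bound       -- assumed number of specifically bound TF molecules
    scaling       -- assumed specificity parameter; larger means less specific
    """
    # Statistical weight of each site: accessible, high-scoring sites dominate.
    weights = [a * math.exp(w / scaling)
               for w, a in zip(pwm_scores, accessibility)]
    total = sum(weights)
    if total == 0:
        return [0.0] * len(weights)  # nothing is accessible
    # Each site competes against the summed weight of all candidate sites.
    return [n_bound * wt / (n_bound * wt + total) for wt in weights]


# Three hypothetical sites: strong and open, weak and open, strong but closed.
specific = site_occupancy([2.0, 0.0, 2.0], [1.0, 1.0, 0.0], n_bound=10, scaling=1.0)
relaxed = site_occupancy([2.0, 0.0, 2.0], [1.0, 1.0, 0.0], n_bound=10, scaling=10.0)
```

Inferring copy number and specificity from ChIP-seq profiles, as the abstract describes, would then amount to searching over `n_bound` and `scaling` for the values whose predicted profile best matches the measured one; the fitting procedure is not sketched here.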
Affiliation(s)
- Nicolae Radu Zabet
- Cambridge Systems Biology Centre, University of Cambridge, Tennis Court Road, Cambridge CB2 1QR, UK Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK
| | - Boris Adryan
- Cambridge Systems Biology Centre, University of Cambridge, Tennis Court Road, Cambridge CB2 1QR, UK Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK
| |
|
30
|
Abstract
Understanding how sequence-specific protein-DNA interactions direct cellular function is of great interest to the research community. High-throughput methods have been developed to determine DNA-binding specificities; one such technique, the bacterial one-hybrid (B1H) system, confers advantages including ease of use, sensitivity and throughput. In this review, we describe the evolution of the B1H system as a tool capable of screening large DNA libraries to investigate protein-DNA interactions of interest. We discuss how DNA-binding specificities produced by the B1H system have been used to predict regulatory targets. Additionally, we examine how this approach has been applied to characterize two common DNA-binding domain families, homeodomains and Cys2His2 zinc fingers, both in organism-wide studies and with synthetic approaches. In the case of the former, the B1H system has produced large catalogs of protein specificity and nuanced information about previously recovered DNA targets, thereby improving our understanding of these proteins' functions in vivo and increasing our capacity to predict similar interactions in other species. In the latter, synthetic screens of the same DNA-binding domains have further refined our models of specificity, through analyzing comprehensive libraries to uncover all proteins able to bind a complete set of targets, and, for instance, exploring how context (in the form of domain position within the parent protein) may affect specificity. Finally, we recognize the limitations of the B1H system and discuss its potential for use in the production of designer proteins and in studies of protein-protein interactions.
|
31
|
Griffon A, Barbier Q, Dalino J, van Helden J, Spicuglia S, Ballester B. Integrative analysis of public ChIP-seq experiments reveals a complex multi-cell regulatory landscape. Nucleic Acids Res 2014; 43:e27. [PMID: 25477382 PMCID: PMC4344487 DOI: 10.1093/nar/gku1280] [Citation(s) in RCA: 106] [Impact Index Per Article: 10.6] [Indexed: 01/21/2023] Open
Abstract
The large collections of ChIP-seq data rapidly accumulating in public data warehouses provide genome-wide binding site maps for hundreds of transcription factors (TFs). However, the extent of the regulatory occupancy space in the human genome has not yet been fully assessed by integrating public ChIP-seq data sets and combining them with the ENCODE TF map. To enable genome-wide identification of regulatory elements we have collected, analysed and retained 395 available ChIP-seq data sets merged with ENCODE peaks covering a total of 237 TFs. This enhanced repertoire complements and refines current genome-wide occupancy maps by increasing the human genome regulatory search space by 14% compared to ENCODE alone, and also increases the complexity of the regulatory dictionary. As a direct application we used this unified binding repertoire to annotate variant enhancer loci (VELs) from the H3K4me1 mark in two cancer cell lines (MCF-7, CRC) and observed enrichments of specific TFs involved in key biological functions in cancer development and proliferation. Those enrichments of TFs within VELs provide a direct annotation of non-coding regions detected in cancer genomes. Finally, full access to this catalogue is available online together with the TF enrichment analysis tool (http://tagc.univ-mrs.fr/remap/).
Affiliation(s)
- Aurélien Griffon
- INSERM, UMR1090 TAGC, Marseille, F-13288, France Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France
| | - Quentin Barbier
- INSERM, UMR1090 TAGC, Marseille, F-13288, France Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France
| | - Jordi Dalino
- INSERM, UMR1090 TAGC, Marseille, F-13288, France Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France
| | - Jacques van Helden
- INSERM, UMR1090 TAGC, Marseille, F-13288, France Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France
| | - Salvatore Spicuglia
- INSERM, UMR1090 TAGC, Marseille, F-13288, France Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France
| | - Benoit Ballester
- INSERM, UMR1090 TAGC, Marseille, F-13288, France Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France
| |
|
32
|
Krebs W, Schmidt SV, Goren A, De Nardo D, Labzin L, Bovier A, Ulas T, Theis H, Kraut M, Latz E, Beyer M, Schultze JL. Optimization of transcription factor binding map accuracy utilizing knockout-mouse models. Nucleic Acids Res 2014; 42:13051-60. [PMID: 25378309 PMCID: PMC4245947 DOI: 10.1093/nar/gku1078] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 04/03/2014] [Revised: 09/26/2014] [Accepted: 10/16/2014] [Indexed: 12/20/2022] Open
Abstract
Genome-wide assessment of protein-DNA interaction by chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) is a key technology for studying transcription factor (TF) localization and regulation of gene expression. Signal-to-noise ratio and signal specificity in ChIP-seq studies depend on many variables, including antibody affinity and specificity. Thus far, efforts to improve antibody reagents for ChIP-seq experiments have focused mainly on generating higher quality antibodies. Here we introduce KOIN (knockout implemented normalization) as a novel strategy to increase signal specificity and reduce noise by using TF knockout mice as a critical control for ChIP-seq experiments. Additionally, KOIN can identify 'hyper ChIPable regions' as another source of false-positive signals. As the use of the KOIN algorithm reduces false-positive results and thereby prevents misinterpretation of ChIP-seq data, it should be considered as the gold standard for future ChIP-seq analyses, particularly when developing ChIP-assays with novel antibody reagents.
Affiliation(s)
- Wolfgang Krebs
- Genomics and Immunoregulation, LIMES-Institute, University of Bonn, 53115 Bonn, Germany
| | - Susanne V Schmidt
- Genomics and Immunoregulation, LIMES-Institute, University of Bonn, 53115 Bonn, Germany
| | - Alon Goren
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Dominic De Nardo
- Institute of Innate Immunity, University Hospitals, University of Bonn, 53127 Bonn, Germany
| | - Larisa Labzin
- Institute of Innate Immunity, University Hospitals, University of Bonn, 53127 Bonn, Germany
| | - Anton Bovier
- Institute for Applied Mathematics, University of Bonn, 53115 Bonn, Germany
| | - Thomas Ulas
- Genomics and Immunoregulation, LIMES-Institute, University of Bonn, 53115 Bonn, Germany
| | - Heidi Theis
- Genomics and Immunoregulation, LIMES-Institute, University of Bonn, 53115 Bonn, Germany
| | - Michael Kraut
- Genomics and Immunoregulation, LIMES-Institute, University of Bonn, 53115 Bonn, Germany
| | - Eicke Latz
- Institute of Innate Immunity, University Hospitals, University of Bonn, 53127 Bonn, Germany Division of Infectious Diseases and Immunology, UMass Medical School, Worcester, MA 01605, USA German Center of Neurodegenerative Diseases (DZNE), 53175 Bonn, Germany
| | - Marc Beyer
- Genomics and Immunoregulation, LIMES-Institute, University of Bonn, 53115 Bonn, Germany
| | - Joachim L Schultze
- Genomics and Immunoregulation, LIMES-Institute, University of Bonn, 53115 Bonn, Germany
| |
|
33
|
Siepel A, Arbiza L. Cis-regulatory elements and human evolution. Curr Opin Genet Dev 2014; 29:81-9. [PMID: 25218861 PMCID: PMC4258466 DOI: 10.1016/j.gde.2014.08.011] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 05/28/2014] [Revised: 08/17/2014] [Accepted: 08/23/2014] [Indexed: 11/20/2022]
Abstract
Modification of gene regulation has long been considered an important force in human evolution, particularly through changes to cis-regulatory elements (CREs) that function in transcriptional regulation. For decades, however, the study of cis-regulatory evolution was severely limited by the available data. New data sets describing the locations of CREs and genetic variation within and between species have now made it possible to study CRE evolution much more directly on a genome-wide scale. Here, we review recent research on the evolution of CREs in humans based on large-scale genomic data sets. We consider inferences based on primate divergence, human polymorphism, and combinations of divergence and polymorphism. We then consider 'new frontiers' in this field stemming from recent research on transcriptional regulation.
Affiliation(s)
- Adam Siepel
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853, USA.
| | - Leonardo Arbiza
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853, USA
| |
|
34
|
Slattery M, Zhou T, Yang L, Dantas Machado AC, Gordân R, Rohs R. Absence of a simple code: how transcription factors read the genome. Trends Biochem Sci 2014; 39:381-99. [PMID: 25129887 DOI: 10.1016/j.tibs.2014.07.002] [Citation(s) in RCA: 337] [Impact Index Per Article: 33.7] [Received: 06/03/2014] [Revised: 07/11/2014] [Accepted: 07/15/2014] [Indexed: 12/21/2022]
Abstract
Transcription factors (TFs) influence cell fate by interpreting the regulatory DNA within a genome. TFs recognize DNA in a specific manner; the mechanisms underlying this specificity have been identified for many TFs based on 3D structures of protein-DNA complexes. More recently, structural views have been complemented with data from high-throughput in vitro and in vivo explorations of the DNA-binding preferences of many TFs. Together, these approaches have greatly expanded our understanding of TF-DNA interactions. However, the mechanisms by which TFs select in vivo binding sites and alter gene expression remain unclear. Recent work has highlighted the many variables that influence TF-DNA binding, while demonstrating that a biophysical understanding of these many factors will be central to understanding TF function.
Affiliation(s)
- Matthew Slattery
- Department of Biomedical Sciences, University of Minnesota Medical School, Duluth, MN 55812, USA; Developmental Biology Center, University of Minnesota, Minneapolis, MN 55455, USA.
| | - Tianyin Zhou
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Lin Yang
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Ana Carolina Dantas Machado
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Raluca Gordân
- Center for Genomic and Computational Biology, Departments of Biostatistics and Bioinformatics, Computer Science, and Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, USA.
| | - Remo Rohs
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA.
| |
|
35
|
Ezer D, Zabet NR, Adryan B. Homotypic clusters of transcription factor binding sites: A model system for understanding the physical mechanics of gene expression. Comput Struct Biotechnol J 2014; 10:63-9. [PMID: 25349675 PMCID: PMC4204428 DOI: 10.1016/j.csbj.2014.07.005] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Indexed: 12/24/2022] Open
Abstract
The organization of binding sites in cis-regulatory elements (CREs) can influence gene expression through a combination of physical mechanisms, ranging from direct interactions between TF molecules to DNA looping and transient chromatin interactions. The study of simple and common building blocks in promoters and other CREs allows us to dissect how all of these mechanisms work together. Many adjacent TF binding sites for the same TF species form homotypic clusters, and these CRE architecture building blocks serve as a prime candidate for understanding interacting transcriptional mechanisms. Homotypic clusters are prevalent in both bacterial and eukaryotic genomes, and are present in both promoters as well as more distal enhancer/silencer elements. Here, we review previous theoretical and experimental studies that show how the complexity (number of binding sites) and spatial organization (distance between sites and overall distance from transcription start sites) of homotypic clusters influence gene expression. In particular, we describe how homotypic clusters modulate the temporal dynamics of TF binding, a mechanism that can affect gene expression, but which has not yet been sufficiently characterized. We propose further experiments on homotypic clusters that would be useful in developing mechanistic models of gene expression.
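The two descriptors highlighted in this abstract, complexity (number of sites) and spatial organization (spacing between sites and distance from the transcription start site), can be computed from a list of predicted site positions. The following is a minimal sketch under stated assumptions: the function name, the `max_gap` grouping rule and the `min_sites` cutoff are hypothetical and are not taken from the review.

```python
def homotypic_clusters(site_positions, max_gap, tss, min_sites=2):
    """Group same-TF site positions into clusters and summarise each one.

    Consecutive sites no more than max_gap apart join the same cluster;
    groups with fewer than min_sites sites are discarded.
    """
    groups = []
    current = []
    for pos in sorted(site_positions):
        if current and pos - current[-1] > max_gap:
            groups.append(current)  # gap too large: close the open group
            current = []
        current.append(pos)
    if current:
        groups.append(current)

    summaries = []
    for group in groups:
        if len(group) < min_sites:
            continue
        gaps = len(group) - 1
        summaries.append({
            "n_sites": len(group),  # complexity
            "mean_spacing": (group[-1] - group[0]) / gaps if gaps else 0.0,
            "distance_to_tss": min(abs(p - tss) for p in group),
        })
    return summaries


# Hypothetical site coordinates for one TF upstream of a TSS at position 1000.
clusters = homotypic_clusters([100, 120, 135, 400, 900, 915], max_gap=30, tss=1000)
```

With these toy coordinates the isolated site at 400 is dropped and two clusters remain, one distal with three sites and one proximal with two. A fixed gap threshold is only one way to define a cluster; a sliding-window count would be an equally reasonable alternative.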
Affiliation(s)
- Daphne Ezer
- Cambridge Systems Biology Centre, University of Cambridge, Tennis Court Road, Cambridge CB2 1QR, UK
| | - Nicolae Radu Zabet
- Cambridge Systems Biology Centre, University of Cambridge, Tennis Court Road, Cambridge CB2 1QR, UK
| | - Boris Adryan
- Cambridge Systems Biology Centre, University of Cambridge, Tennis Court Road, Cambridge CB2 1QR, UK
| |
|
36
|
Gupta A, Christensen RG, Bell HA, Goodwin M, Patel RY, Pandey M, Enuameh MS, Rayla AL, Zhu C, Thibodeau-Beganny S, Brodsky MH, Joung JK, Wolfe SA, Stormo GD. An improved predictive recognition model for Cys(2)-His(2) zinc finger proteins. Nucleic Acids Res 2014; 42:4800-12. [PMID: 24523353 PMCID: PMC4005693 DOI: 10.1093/nar/gku132] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.8] [Received: 12/05/2013] [Revised: 01/21/2014] [Accepted: 01/22/2014] [Indexed: 11/17/2022] Open
Abstract
Cys(2)-His(2) zinc finger proteins (ZFPs) are the largest family of transcription factors in higher metazoans. They also represent the most diverse family with regards to the composition of their recognition sequences. Although there are a number of ZFPs with characterized DNA-binding preferences, the specificity of the vast majority of ZFPs is unknown and cannot be directly inferred by homology due to the diversity of recognition residues present within individual fingers. Given the large number of unique zinc fingers and assemblies present across eukaryotes, a comprehensive predictive recognition model that could accurately estimate the DNA-binding specificity of any ZFP based on its amino acid sequence would have great utility. Toward this goal, we have used the DNA-binding specificities of 678 two-finger modules from both natural and artificial sources to construct a random forest-based predictive model for ZFP recognition. We find that our recognition model outperforms previously described determinant-based recognition models for ZFPs, and can successfully estimate the specificity of naturally occurring ZFPs with previously defined specificities.
Affiliation(s)
- Ankit Gupta
- Program in Gene Function and Expression, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Genetics, Washington University School of Medicine, St Louis, MO 63108, USA, Department of Biochemistry and Biology and Biotechnology, Worcester Polytechnic Institute, Worcester, MA 01609, USA, Molecular Pathology Unit, Center for Computational and Integrative Biology, and Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129, USA, Department of Molecular Medicine, University of Massachusetts Medical School, Worcester, MA 01605, USA and Department of Pathology, Harvard Medical School, Boston, MA 02115, USA
| | - Ryan G. Christensen
- Program in Gene Function and Expression, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Genetics, Washington University School of Medicine, St Louis, MO 63108, USA, Department of Biochemistry and Biology and Biotechnology, Worcester Polytechnic Institute, Worcester, MA 01609, USA, Molecular Pathology Unit, Center for Computational and Integrative Biology, and Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129, USA, Department of Molecular Medicine, University of Massachusetts Medical School, Worcester, MA 01605, USA and Department of Pathology, Harvard Medical School, Boston, MA 02115, USA
| | - Heather A. Bell
- Program in Gene Function and Expression, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Genetics, Washington University School of Medicine, St Louis, MO 63108, USA, Department of Biochemistry and Biology and Biotechnology, Worcester Polytechnic Institute, Worcester, MA 01609, USA, Molecular Pathology Unit, Center for Computational and Integrative Biology, and Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129, USA, Department of Molecular Medicine, University of Massachusetts Medical School, Worcester, MA 01605, USA and Department of Pathology, Harvard Medical School, Boston, MA 02115, USA
| | - Mathew Goodwin
- Program in Gene Function and Expression, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Genetics, Washington University School of Medicine, St Louis, MO 63108, USA, Department of Biochemistry and Biology and Biotechnology, Worcester Polytechnic Institute, Worcester, MA 01609, USA, Molecular Pathology Unit, Center for Computational and Integrative Biology, and Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129, USA, Department of Molecular Medicine, University of Massachusetts Medical School, Worcester, MA 01605, USA and Department of Pathology, Harvard Medical School, Boston, MA 02115, USA
| | - Ronak Y. Patel
- Program in Gene Function and Expression, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Genetics, Washington University School of Medicine, St Louis, MO 63108, USA, Department of Biochemistry and Biology and Biotechnology, Worcester Polytechnic Institute, Worcester, MA 01609, USA, Molecular Pathology Unit, Center for Computational and Integrative Biology, and Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129, USA, Department of Molecular Medicine, University of Massachusetts Medical School, Worcester, MA 01605, USA and Department of Pathology, Harvard Medical School, Boston, MA 02115, USA
| | - Manishi Pandey
- Program in Gene Function and Expression, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, MA 01605, USA, Department of Genetics, Washington University School of Medicine, St Louis, MO 63108, USA, Department of Biochemistry and Biology and Biotechnology, Worcester Polytechnic Institute, Worcester, MA 01609, USA, Molecular Pathology Unit, Center for Computational and Integrative Biology, and Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA 02129, USA, Department of Molecular Medicine, University of Massachusetts Medical School, Worcester, MA 01605, USA and Department of Pathology, Harvard Medical School, Boston, MA 02115, USA
- Metewo Selase Enuameh
- Amy L. Rayla
- Cong Zhu
- Stacey Thibodeau-Beganny
- Michael H. Brodsky
- J. Keith Joung
- Scot A. Wolfe
- Gary D. Stormo
37
Ezer D, Zabet NR, Adryan B. Physical constraints determine the logic of bacterial promoter architectures. Nucleic Acids Res 2014; 42:4196-207. [PMID: 24476912 PMCID: PMC3985651 DOI: 10.1093/nar/gku078] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Indexed: 12/19/2022]
Abstract
Site-specific transcription factors (TFs) bind to their target sites on the DNA, where they regulate the rate at which genes are transcribed. Bacterial TFs undergo facilitated diffusion (a combination of 3D diffusion around and 1D random walk on the DNA) when searching for their target sites. Using computer simulations of this search process, we show that the organization of the binding sites, in conjunction with TF copy number and binding site affinity, plays an important role in determining not only the steady state of promoter occupancy, but also the order in which TFs bind. These effects can be captured by facilitated diffusion-based models, but not by standard thermodynamics. We show that the spacing of binding sites encodes complex logic, which can be derived from combinations of three basic building blocks: switches, barriers and clusters, whose response alone and in higher orders of organization we characterize in detail. Effective promoter organizations are commonly found in the E. coli genome and are highly conserved between strains. This will allow studies of gene regulation at an unprecedented level of detail, where our framework can create testable hypotheses of promoter logic.
Affiliation(s)
- Daphne Ezer
- Cambridge Systems Biology Centre, University of Cambridge, Tennis Court Road, Cambridge CB2 1QR, UK and Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK
38
Duque T, Samee MAH, Kazemian M, Pham HN, Brodsky MH, Sinha S. Simulations of enhancer evolution provide mechanistic insights into gene regulation. Mol Biol Evol 2013; 31:184-200. [PMID: 24097306 PMCID: PMC3879441 DOI: 10.1093/molbev/mst170] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 01/14/2023]
Abstract
There is growing interest in models of regulatory sequence evolution. However, existing models specifically designed for regulatory sequences consider the independent evolution of individual transcription factor (TF)-binding sites, ignoring that the function and evolution of a binding site depend on its context, typically the cis-regulatory module (CRM) in which the site is located. Moreover, existing models do not account for the gene-specific roles of TF-binding sites, primarily because their roles often are not well understood. We introduce two models of regulatory sequence evolution that address some of the shortcomings of existing models and implement simulation frameworks based on them. One model simulates the evolution of an individual binding site in the context of a CRM, while the other evolves an entire CRM. Both models use a state-of-the-art sequence-to-expression model to predict the effects of mutations on the regulatory output of the CRM and determine the strength of selection. We use the new framework to simulate the evolution of TF-binding sites in 37 well-studied CRMs belonging to the anterior-posterior patterning system in Drosophila embryos. We show that these simulations provide accurate fits to evolutionary data from 12 Drosophila genomes, which includes statistics of binding site conservation on relatively short evolutionary scales and site loss across larger divergence times. The new framework allows us, for the first time, to test hypotheses regarding the underlying cis-regulatory code by directly comparing the evolutionary implications of the hypothesis with the observed evolutionary dynamics of binding sites. Using this capability, we find that explicitly modeling self-cooperative DNA binding by the TF Caudal (CAD) provides significantly better fits than an otherwise identical evolutionary simulation that lacks this mechanistic aspect. This hypothesis is further supported by a statistical analysis of the distribution of intersite spacing between adjacent CAD sites. Experimental tests confirm direct homodimeric interaction between CAD molecules as well as self-cooperative DNA binding by CAD. We note that computational modeling of the D. melanogaster CRMs alone did not yield significant evidence to support CAD self-cooperativity. We thus demonstrate how specific mechanistic details encoded in CRMs can be revealed by modeling their evolution and fitting such models to multispecies data.
Affiliation(s)
- Thyago Duque
- Department of Computer Science, University of Illinois at Urbana-Champaign