1
Andreani V, South EJ, Dunlop MJ. Generating information-dense promoter sequences with optimal string packing. PLoS Comput Biol 2024; 20:e1012276. [PMID: 39047028 PMCID: PMC11268586 DOI: 10.1371/journal.pcbi.1012276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2024] [Accepted: 06/25/2024] [Indexed: 07/27/2024] Open
Abstract
Dense arrangements of binding sites within nucleotide sequences can collectively influence downstream transcription rates or initiate biomolecular interactions. For example, natural promoter regions can harbor many overlapping transcription factor binding sites that influence the rate of transcription initiation. Despite the prevalence of overlapping binding sites in nature, rapid design of nucleotide sequences with many overlapping sites remains a challenge. Here, we show that this is an NP-hard problem, coined here as the nucleotide String Packing Problem (SPP). We then introduce a computational technique that efficiently assembles sets of DNA-protein binding sites into dense, contiguous stretches of double-stranded DNA. For the efficient design of nucleotide sequences spanning hundreds of base pairs, we reduce the SPP to an Orienteering Problem with integer distances, and then leverage modern integer linear programming solvers. Our method optimally packs sets of 20-100 binding sites into dense nucleotide arrays of 50-300 base pairs in 0.05-10 seconds. Unlike approximation algorithms or meta-heuristics, our approach finds provably optimal solutions. We demonstrate how our method can generate large sets of diverse sequences suitable for library generation, where the frequency of binding site usage across the returned sequences can be controlled by modulating the objective function. As an example, we then show how adding additional constraints, like the inclusion of sequence elements with fixed positions, allows for the design of bacterial promoters. The nucleotide string packing approach we present can accelerate the design of sequences with complex DNA-protein interactions. When used in combination with synthesis and high-throughput screening, this design strategy could help interrogate how complex binding site arrangements impact either gene expression or biomolecular mechanisms in varied cellular contexts.
Affiliation(s)
- Virgile Andreani
  - Biomedical Engineering Department, Boston University, Boston, Massachusetts, United States of America
  - Biological Design Center, Boston University, Boston, Massachusetts, United States of America
- Eric J. South
  - Biological Design Center, Boston University, Boston, Massachusetts, United States of America
  - Molecular Biology, Cell Biology & Biochemistry Program, Boston University, Boston, Massachusetts, United States of America
- Mary J. Dunlop
  - Biomedical Engineering Department, Boston University, Boston, Massachusetts, United States of America
  - Biological Design Center, Boston University, Boston, Massachusetts, United States of America
  - Molecular Biology, Cell Biology & Biochemistry Program, Boston University, Boston, Massachusetts, United States of America
2
Tabe-Bordbar S, Song YJ, Lunt BJ, Alavi Z, Prasanth KV, Sinha S. Mechanistic analysis of enhancer sequences in the estrogen receptor transcriptional program. Commun Biol 2024; 7:719. [PMID: 38862711 PMCID: PMC11167054 DOI: 10.1038/s42003-024-06400-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2022] [Accepted: 05/30/2024] [Indexed: 06/13/2024] Open
Abstract
Estrogen Receptor α (ERα) is a major lineage determining transcription factor (TF) in mammary gland development. Dysregulation of ERα-mediated transcriptional program results in cancer. Transcriptomic and epigenomic profiling of breast cancer cell lines has revealed large numbers of enhancers involved in this regulatory program, but how these enhancers encode function in their sequence remains poorly understood. A subset of ERα-bound enhancers are transcribed into short bidirectional RNA (enhancer RNA or eRNA), and this property is believed to be a reliable marker of active enhancers. We therefore analyze thousands of ERα-bound enhancers and build quantitative, mechanism-aware models to discriminate eRNAs from non-transcribing enhancers based on their sequence. Our thermodynamics-based models provide insights into the roles of specific TFs in ERα-mediated transcriptional program, many of which are supported by the literature. We use in silico perturbations to predict TF-enhancer regulatory relationships and integrate these findings with experimentally determined enhancer-promoter interactions to construct a gene regulatory network. We also demonstrate that the model can prioritize breast cancer-related sequence variants while providing mechanistic explanations for their function. Finally, we experimentally validate the model-proposed mechanisms underlying three such variants.
Affiliation(s)
- Shayan Tabe-Bordbar
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- You Jin Song
  - Department of Cell and Developmental Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Bryan J Lunt
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Zahra Alavi
  - Department of Physics, Loyola Marymount University, Los Angeles, CA, USA
- Kannanganattu V Prasanth
  - Department of Cell and Developmental Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Saurabh Sinha
  - Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA, USA.
3
Lipps G. Definition of the binding specificity of the T7 bacteriophage primase by analysis of a protein binding microarray using a thermodynamic model. Nucleic Acids Res 2024; 52:4818-4829. [PMID: 38597656 PMCID: PMC11109968 DOI: 10.1093/nar/gkae215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Revised: 01/26/2024] [Accepted: 03/13/2024] [Indexed: 04/11/2024] Open
Abstract
Protein binding microarrays (PBM), SELEX, RNAcompete and chromatin-immunoprecipitation have been intensively used to determine the specificity of nucleic acid binding proteins. While the specificity of proteins with pronounced sequence specificity is straightforward, the determination of the sequence specificity of proteins of modest sequence specificity is more difficult. In this work, an explorative data analysis workflow for nucleic acid binding data was developed that can be used by scientists that want to analyse their binding data. The workflow is based on a regressor realized in scikit-learn, the major machine learning module for the scripting language Python. The regressor is built on a thermodynamic model of nucleic acid binding and describes the sequence specificity with base- and position-specific energies. The regressor was used to determine the binding specificity of the T7 primase. For this, we reanalysed the binding data of the T7 primase obtained with a custom PBM. The binding specificity of the T7 primase agrees with the priming specificity (5'-GTC) and the template (5'-GGGTC) for the preferentially synthesized tetraribonucleotide primer (5'-pppACCC) but is more relaxed. The dominant contribution of two positions in the motif can be explained by the involvement of the initiating and elongating nucleotides for template binding.
Affiliation(s)
- Georg Lipps
  - Institute of Chemistry and Bioanalytics, University of Applied Sciences Northwestern Switzerland, 4132 Muttenz, Switzerland
4
Ishigami Y, Wong MS, Martí-Gómez C, Ayaz A, Kooshkbaghi M, Hanson SM, McCandlish DM, Krainer AR, Kinney JB. Specificity, synergy, and mechanisms of splice-modifying drugs. Nat Commun 2024; 15:1880. [PMID: 38424098 PMCID: PMC10904865 DOI: 10.1038/s41467-024-46090-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2023] [Accepted: 02/10/2024] [Indexed: 03/02/2024] Open
Abstract
Drugs that target pre-mRNA splicing hold great therapeutic potential, but the quantitative understanding of how these drugs work is limited. Here we introduce mechanistically interpretable quantitative models for the sequence-specific and concentration-dependent behavior of splice-modifying drugs. Using massively parallel splicing assays, RNA-seq experiments, and precision dose-response curves, we obtain quantitative models for two small-molecule drugs, risdiplam and branaplam, developed for treating spinal muscular atrophy. The results quantitatively characterize the specificities of risdiplam and branaplam for 5' splice site sequences, suggest that branaplam recognizes 5' splice sites via two distinct interaction modes, and contradict the prevailing two-site hypothesis for risdiplam activity at SMN2 exon 7. The results also show that anomalous single-drug cooperativity, as well as multi-drug synergy, are widespread among small-molecule drugs and antisense-oligonucleotide drugs that promote exon inclusion. Our quantitative models thus clarify the mechanisms of existing treatments and provide a basis for the rational development of new therapies.
Affiliation(s)
- Yuma Ishigami
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- Mandy S Wong
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
  - Beam Therapeutics, Cambridge, MA, 02142, USA
- Andalus Ayaz
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- Mahdi Kooshkbaghi
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
  - The Estée Lauder Companies, New York, NY, 10153, USA
- Adrian R Krainer
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA.
- Justin B Kinney
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA.
5
Liu S, Gomez-Alcala P, Leemans C, Glassford WJ, Mann RS, Bussemaker HJ. Predicting the DNA binding specificity of mutated transcription factors using family-level biophysically interpretable machine learning. bioRxiv 2024:2024.01.24.577115. [PMID: 38352411 PMCID: PMC10862739 DOI: 10.1101/2024.01.24.577115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
Sequence-specific interactions of transcription factors (TFs) with genomic DNA underlie many cellular processes. High-throughput in vitro binding assays coupled with computational analysis have made it possible to accurately define such sequence recognition in a biophysically interpretable yet mechanism-agnostic way for individual TFs. The fact that such sequence-to-affinity models are now available for hundreds of TFs provides new avenues for predicting how the DNA binding specificity of a TF changes when its protein sequence is mutated. To this end, we developed an analytical framework based on a tetrahedron embedding that can be applied at the level of a given structural TF family. Using bHLH as a test case, we demonstrate that we can systematically map dependencies between the protein sequence of a TF and base preference within the DNA binding site. We also develop a regression approach to predict the quantitative energetic impact of mutations in the DNA binding domain of a TF on its DNA binding specificity, and perform SELEX-seq assays on mutated TFs to experimentally validate our results. Our results point to the feasibility of predicting the functional impact of disease mutations and allelic variation in the cell-wide TF repertoire by leveraging high-quality functional information across sets of homologous wild-type proteins.
Affiliation(s)
- Shaoxun Liu
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Pilar Gomez-Alcala
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Christ Leemans
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- William J Glassford
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
- Richard S Mann
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
  - Department of Systems Biology, Columbia University, New York, NY, USA
- Harmen J Bussemaker
  - Department of Biological Sciences, Columbia University, New York, NY, USA
  - Department of Systems Biology, Columbia University, New York, NY, USA
6
Recio PS, Mitra NJ, Shively CA, Song D, Jaramillo G, Lewis KS, Chen X, Mitra R. Zinc cluster transcription factors frequently activate target genes using a non-canonical half-site binding mode. Nucleic Acids Res 2023; 51:5006-5021. [PMID: 37125648 PMCID: PMC10250231 DOI: 10.1093/nar/gkad320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Revised: 04/11/2023] [Accepted: 04/14/2023] [Indexed: 05/02/2023] Open
Abstract
Gene expression changes are orchestrated by transcription factors (TFs), which bind to DNA to regulate gene expression. It remains surprisingly difficult to predict basic features of the transcriptional process, including in vivo TF occupancy. Existing thermodynamic models of TF function are often not concordant with experimental measurements, suggesting undiscovered biology. Here, we analyzed one of the most well-studied TFs, the yeast zinc cluster Gal4, constructed a Shea-Ackers thermodynamic model to describe its binding, and compared the results of this model to experimentally measured Gal4p binding in vivo. We found that at many promoters, the model predicted no Gal4p binding, yet substantial binding was observed. These outlier promoters lacked canonical binding motifs, and subsequent investigation revealed Gal4p binds unexpectedly to DNA sequences with high densities of its half site (CGG). We confirmed this novel mode of binding through multiple experimental and computational paradigms; we also found most other zinc cluster TFs we tested frequently utilize this binding mode, at 27% of their targets on average. Together, these results demonstrate a novel mode of binding where zinc clusters, the largest class of TFs in yeast, bind DNA sequences with high densities of half sites.
Affiliation(s)
- Pamela S Recio
  - Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
  - The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- Nikhil J Mitra
  - Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
  - The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- Christian A Shively
  - Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
  - The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- David Song
  - Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
  - The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- Grace Jaramillo
  - Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
  - The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- Kristine Shady Lewis
  - Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
  - The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- Xuhua Chen
  - Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
  - The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
- Robi D Mitra
  - Department of Genetics, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
  - The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
  - McDonnell Genome Institute, Washington University School of Medicine in St. Louis, St. Louis, MO 63108, USA
7
Alexandari AM, Horton CA, Shrikumar A, Shah N, Li E, Weilert M, Pufall MA, Zeitlinger J, Fordyce PM, Kundaje A. De novo distillation of thermodynamic affinity from deep learning regulatory sequence models of in vivo protein-DNA binding. bioRxiv 2023:2023.05.11.540401. [PMID: 37214836 PMCID: PMC10197627 DOI: 10.1101/2023.05.11.540401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2023]
Abstract
Transcription factors (TF) are proteins that bind DNA in a sequence-specific manner to regulate gene transcription. Despite their unique intrinsic sequence preferences, in vivo genomic occupancy profiles of TFs differ across cellular contexts. Hence, deciphering the sequence determinants of TF binding, both intrinsic and context-specific, is essential to understand gene regulation and the impact of regulatory, non-coding genetic variation. Biophysical models trained on in vitro TF binding assays can estimate intrinsic affinity landscapes and predict occupancy based on TF concentration and affinity. However, these models cannot adequately explain context-specific, in vivo binding profiles. Conversely, deep learning models, trained on in vivo TF binding assays, effectively predict and explain genomic occupancy profiles as a function of complex regulatory sequence syntax, albeit without a clear biophysical interpretation. To reconcile these complementary models of in vitro and in vivo TF binding, we developed Affinity Distillation (AD), a method that extracts thermodynamic affinities de-novo from deep learning models of TF chromatin immunoprecipitation (ChIP) experiments by marginalizing away the influence of genomic sequence context. Applied to neural networks modeling diverse classes of yeast and mammalian TFs, AD predicts energetic impacts of sequence variation within and surrounding motifs on TF binding as measured by diverse in vitro assays with superior dynamic range and accuracy compared to motif-based methods. Furthermore, AD can accurately discern affinities of TF paralogs. Our results highlight thermodynamic affinity as a key determinant of in vivo binding, suggest that deep learning models of in vivo binding implicitly learn high-resolution affinity landscapes, and show that these affinities can be successfully distilled using AD. 
This new biophysical interpretation of deep learning models enables high-throughput in silico experiments to explore the influence of sequence context and variation on both intrinsic affinity and in vivo occupancy.
Affiliation(s)
- Amr M. Alexandari
  - Department of Computer Science, Stanford University, Stanford, CA 94305
- Avanti Shrikumar
  - Department of Earth System Science, Stanford University, Stanford, CA 94305
- Nilay Shah
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Eileen Li
  - Department of Genetics, Stanford University, Stanford, CA 94305
- Melanie Weilert
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Miles A. Pufall
  - Department of Biochemistry, Carver College of Medicine, University of Iowa, Iowa City, Iowa 52242, USA
- Julia Zeitlinger
  - Stowers Institute for Medical Research, Kansas City, MO, USA
  - The University of Kansas Medical Center, Kansas City, KS, USA
- Polly M. Fordyce
  - Department of Genetics, Stanford University, Stanford, CA 94305
  - Department of Bioengineering, Stanford University, Stanford, CA 94305
  - ChEM-H Institute, Stanford University, Stanford, CA 94305
  - Chan Zuckerberg Biohub, San Francisco, CA 94110
- Anshul Kundaje
  - Department of Computer Science, Stanford University, Stanford, CA 94305
  - Department of Genetics, Stanford University, Stanford, CA 94305
8
Ni P, Wilson D, Su Z. A map of cis-regulatory modules and constituent transcription factor binding sites in 80% of the mouse genome. BMC Genomics 2022; 23:714. [PMID: 36261804 PMCID: PMC9583556 DOI: 10.1186/s12864-022-08933-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2022] [Accepted: 10/11/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Mouse is probably the most important model organism to study mammal biology and human diseases. A better understanding of the mouse genome will help understand the human genome, biology and diseases. However, despite the recent progress, the characterization of the regulatory sequences in the mouse genome is still far from complete, limiting its use to understand the regulatory sequences in the human genome. RESULTS Here, by integrating binding peaks in ~9,000 transcription factor (TF) ChIP-seq datasets that cover 79.9% of the mouse mappable genome using an efficient pipeline, we were able to partition these binding peak-covered genome regions into a cis-regulatory module (CRM) candidate (CRMC) set and a non-CRMC set. The CRMCs contain 912,197 putative CRMs and 38,554,729 TF binding site (TFBS) islands, covering 55.5% and 24.4% of the mappable genome, respectively. The CRMCs tend to be under strong evolutionary constraints, indicating that they are likely cis-regulatory; while the non-CRMCs are largely selectively neutral, indicating that they are unlikely cis-regulatory. Based on evolutionary profiles of the genome positions, we further estimated that 63.8% and 27.4% of the mouse genome might code for CRMs and TFBSs, respectively. CONCLUSIONS Validation using experimental data suggests that at least most of the CRMCs are authentic. Thus, this unprecedentedly comprehensive map of CRMs and TFBSs can be a good resource to guide experimental studies of regulatory genomes in mice and humans.
Affiliation(s)
- Pengyu Ni
  - Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
- David Wilson
  - Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
- Zhengchang Su
  - Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA.
9
Huang YA, Pan GQ, Wang J, Li JQ, Chen J, Wu YH. Heterogeneous graph embedding model for predicting interactions between TF and target gene. Bioinformatics 2022; 38:2554-2560. [PMID: 35266510 DOI: 10.1093/bioinformatics/btac148] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/09/2021] [Revised: 02/13/2022] [Accepted: 03/09/2022] [Indexed: 11/15/2022] Open
Abstract
MOTIVATION Identifying the target genes of transcription factors (TFs) is of great significance for biomedical research. However, using biological experiments to identify TF-target gene interactions is still time consuming, expensive and limited to small scale. Existing computational methods for predicting the genes that a TF targets are mainly designed to predict binding sites rather than the direct interaction. To bridge this gap, we propose in this work a deep learning prediction model, named HGETGI, to identify new TF-target gene interactions. Specifically, the proposed HGETGI model learns the patterns of known interactions between TFs and target genes, complemented with their involvement in different human disease mechanisms. It performs prediction based on random walks for meta-path sampling and node embedding in a skip-gram manner. RESULTS We evaluated the prediction performance of the proposed method on a real dataset, and the experimental results show that it achieves an average area under the curve of 0.8519 ± 0.0731 in 5-fold cross validation. In addition, we conducted case studies on the prediction of two important TFs, NFKB1 and TP53. As a result, 33 and 32 of the top-40 ranked predictions for NFKB1 and TP53, respectively, were confirmed by looking them up in another public database (hTftarget). It is envisioned that the proposed HGETGI method is feasible and effective for predicting TF-target gene interactions on a large scale. AVAILABILITY AND IMPLEMENTATION The source code and dataset are available at https://github.com/PGTSING/HGETGI. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yu-An Huang
  - College of Computer Science and Software Engineering, Shenzhen University, 3688 Nanhai Avenue, Shenzhen, China
- Gui-Qing Pan
  - College of Computer Science and Software Engineering, Shenzhen University, 3688 Nanhai Avenue, Shenzhen, China
- Jia Wang
  - College of Computer Science and Software Engineering, Shenzhen University, 3688 Nanhai Avenue, Shenzhen, China
- Jian-Qiang Li
  - College of Computer Science and Software Engineering, Shenzhen University, 3688 Nanhai Avenue, Shenzhen, China
- Jie Chen
  - College of Computer Science and Software Engineering, Shenzhen University, 3688 Nanhai Avenue, Shenzhen, China
- Yang-Han Wu
  - College of Computer Science and Software Engineering, Shenzhen University, 3688 Nanhai Avenue, Shenzhen, China
10
Wang XF, Sun J, Wang XL, Tian JK, Tian ZW, Zhang JL, Jia R. MD investigation on the binding of microphthalmia-associated transcription factor with DNA. Journal of Saudi Chemical Society 2022. [DOI: 10.1016/j.jscs.2022.101420] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 02/01/2023]
11
Sun H, Chen W, Chen L, Zheng W. Exploring the molecular basis of UG-rich RNA recognition by the human splicing factor TDP-43 using molecular dynamics simulation and free energy calculation. J Comput Chem 2021; 42:1670-1680. [PMID: 34109652 DOI: 10.1002/jcc.26704] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 01/03/2021] [Revised: 04/15/2021] [Accepted: 05/23/2021] [Indexed: 11/12/2022]
Abstract
Transactivation response element RNA/DNA-binding protein 43 (TDP-43) is involved in the regulation of alternative splicing of human neurodegenerative disease-related genes through binding to long UG-rich RNA sequences. Mutations in TDP-43, most in the homeodomain, cause neurological disorders such as amyotrophic lateral sclerosis and frontotemporal lobar degeneration. Several mutants destabilize the structure and disrupt RNA-binding activity. The biological functions of these mutants have been characterized, but the structural basis behind the loss of RNA-binding activity is unclear. Focusing on the specific TDP-43-ssRNA complex (PDB code 4BS2), we applied molecular dynamics simulations and the molecular mechanics Poisson-Boltzmann surface area free energy calculation to characterize and explore the structural and dynamic effects between ssRNA and TDP-43. The energetic analysis indicated that the intermolecular van der Waals interaction and nonpolar solvation energy play an important role in the binding process of TDP-43 and ssRNA. Compared with the wild-type TDP-43, the reduction of the polar or non-polar interaction between all the mutants F149A, D105A/S254A, R171A/D174A, F147L/F149L/F229L/F231L and ssRNA is the main reason for the reduction of its binding free energy. Energy decomposition suggested that the extensive interactions between TDP-43 and the nitrogenous bases of ssRNA are responsible for the specific ssRNA recognition by TDP-43. These results elucidated the TDP-43-ssRNA interaction comprehensively and further extended our understanding of the previous experimental data. Uncovering the TDP-43-ssRNA recognition mechanism will provide useful insights and new opportunities for the development of anti-neurodegenerative drugs.
Affiliation(s)
- Han Sun
  - College of Chemistry and Chemical Engineering, Qiqihar University, Qiqihar, China
- Wei Chen
  - College of Chemistry and Chemical Engineering, Qiqihar University, Qiqihar, China
- Lin Chen
  - College of Chemistry and Chemical Engineering, Qiqihar University, Qiqihar, China
12
Zhang L, Karimzadeh M, Welch M, McIntosh C, Wang B. Analytics methods and tools for integration of biomedical data in medicine. Artif Intell Med 2021. [DOI: 10.1016/b978-0-12-821259-2.00007-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
13
Ireland WT, Beeler SM, Flores-Bautista E, McCarty NS, Röschinger T, Belliveau NM, Sweredoski MJ, Moradian A, Kinney JB, Phillips R. Deciphering the regulatory genome of Escherichia coli, one hundred promoters at a time. eLife 2020; 9:e55308. [PMID: 32955440 PMCID: PMC7567609 DOI: 10.7554/elife.55308] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 01/20/2020] [Accepted: 09/18/2020] [Indexed: 01/28/2023] Open
Abstract
Advances in DNA sequencing have revolutionized our ability to read genomes. However, even in the most well-studied of organisms, the bacterium Escherichia coli, for ≈65% of promoters we remain ignorant of their regulation. Until we crack this regulatory Rosetta Stone, efforts to read and write genomes will remain haphazard. We introduce a new method, Reg-Seq, that links massively parallel reporter assays with mass spectrometry to produce a base pair resolution dissection of more than 100 E. coli promoters in 12 growth conditions. We demonstrate that the method recapitulates known regulatory information. Then, we examine regulatory architectures for more than 80 promoters which previously had no known regulatory information. In many cases, we also identify which transcription factors mediate their regulation. This method clears a path for highly multiplexed investigations of the regulatory genome of model organisms, with the potential of moving to an array of microbes of ecological and medical relevance.
Affiliation(s)
- William T Ireland
  - Department of Physics, California Institute of Technology, Pasadena, United States
- Suzannah M Beeler
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, United States
- Emanuel Flores-Bautista
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, United States
- Nicholas S McCarty
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, United States
- Tom Röschinger
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, United States
- Nathan M Belliveau
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, United States
- Michael J Sweredoski
  - Proteome Exploration Laboratory, Division of Biology and Biological Engineering, Beckman Institute, California Institute of Technology, Pasadena, United States
- Annie Moradian
  - Proteome Exploration Laboratory, Division of Biology and Biological Engineering, Beckman Institute, California Institute of Technology, Pasadena, United States
- Justin B Kinney
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, United States
- Rob Phillips
  - Department of Physics, California Institute of Technology, Pasadena, United States
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, United States
14
Cencini M, Pigolotti S. Energetic funnel facilitates facilitated diffusion. Nucleic Acids Res 2019; 46:558-567. [PMID: 29216364 PMCID: PMC5778461 DOI: 10.1093/nar/gkx1220] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 07/14/2017] [Accepted: 11/24/2017] [Indexed: 01/25/2023] Open
Abstract
Transcription factors (TFs) are able to associate to their binding sites on DNA faster than the physical limit posed by diffusion. Such high association rates can be achieved by alternating between three-dimensional diffusion and one-dimensional sliding along the DNA chain, a mechanism dubbed facilitated diffusion. By studying a collection of TF binding sites of Escherichia coli from the RegulonDB database and of Bacillus subtilis from DBTBS, we reveal a funnel in the binding energy landscape around the target sequences. We show that such a funnel is linked to the presence of gradients of AT in the base composition of the DNA region around the binding sites. An extensive computational study of the stochastic sliding process along the energetic landscapes obtained from the database shows that the funnel can significantly enhance the probability of TFs to find their target sequences when sliding in their proximity. We demonstrate that this enhancement leads to a speed-up of the association process.
Affiliation(s)
- Massimo Cencini: Istituto dei Sistemi Complessi, Consiglio Nazionale delle Ricerche, via dei Taurini 19, 00185 Rome, Italy
- Simone Pigolotti: Biological Complexity Unit, Okinawa Institute of Science and Technology and Graduate University, Onna, Okinawa 904-0495, Japan; Max Planck Institute for the Physics of Complex Systems, Nöthnitzerstraße 38, 01187 Dresden, Germany; Departament de Fisica, Universitat Politecnica de Catalunya Edif. GAIA, Rambla Sant Nebridi 22, 08222 Terrassa, Barcelona, Spain
|
15
|
Kinney JB, McCandlish DM. Massively Parallel Assays and Quantitative Sequence-Function Relationships. Annu Rev Genomics Hum Genet 2019; 20:99-127. [PMID: 31091417 DOI: 10.1146/annurev-genom-083118-014845] [Citation(s) in RCA: 76] [Impact Index Per Article: 15.2] [Indexed: 11/09/2022]
Abstract
Over the last decade, a rich variety of massively parallel assays have revolutionized our understanding of how biological sequences encode quantitative molecular phenotypes. These assays include deep mutational scanning, high-throughput SELEX, and massively parallel reporter assays. Here, we review these experimental methods and how the data they produce can be used to quantitatively model sequence-function relationships. In doing so, we touch on a diverse range of topics, including the identification of clinically relevant genomic variants, the modeling of transcription factor binding to DNA, the functional and evolutionary landscapes of proteins, and cis-regulatory mechanisms in both transcription and mRNA splicing. We further describe a unified conceptual framework and a core set of mathematical modeling strategies that studies in these diverse areas can make use of. Finally, we highlight key aspects of experimental design and mathematical modeling that are important for the results of such studies to be interpretable and reproducible.
Affiliation(s)
- Justin B Kinney: Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
- David M McCandlish: Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
|
16
|
Djordjevic M, Rodic A, Graovac S. From biophysics to 'omics and systems biology. Eur Biophys J 2019; 48:413-424. [PMID: 30972433 DOI: 10.1007/s00249-019-01366-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/12/2018] [Revised: 02/12/2019] [Accepted: 04/03/2019] [Indexed: 01/03/2023]
Abstract
Recent decades brought a revolution to biology, driven mainly by exponentially increasing amounts of data coming from "'omics" sciences. To handle these data, bioinformatics often has to combine biologically heterogeneous signals, for which methods from statistics and engineering (e.g. machine learning) are often used. While such an approach is sometimes necessary, it effectively treats the underlying biological processes as a black box. Similarly, systems biology deals with inherently complex systems, characterized by a large number of degrees of freedom, and interactions that are highly non-linear. To deal with this complexity, the underlying physical interactions are often (over)simplified, such as in Boolean modelling of network dynamics. In this review, we argue for the utility of applying a biophysical approach in bioinformatics and systems biology, including discussion of two examples from our research which address sequence analysis and understanding intracellular gene expression dynamics.
Affiliation(s)
- Marko Djordjevic: Faculty of Biology, Institute of Physiology and Biochemistry, University of Belgrade, Belgrade, Serbia
- Andjela Rodic: Faculty of Biology, Institute of Physiology and Biochemistry, University of Belgrade, Belgrade, Serbia; Interdisciplinary PhD Program in Biophysics, University of Belgrade, Belgrade, Serbia
- Stefan Graovac: Faculty of Biology, Institute of Physiology and Biochemistry, University of Belgrade, Belgrade, Serbia; Interdisciplinary PhD Program in Biophysics, University of Belgrade, Belgrade, Serbia
|
17
|
Li H, Quang D, Guan Y. Anchor: trans-cell type prediction of transcription factor binding sites. Genome Res 2019; 29:281-292. [PMID: 30567711 PMCID: PMC6360811 DOI: 10.1101/gr.237156.118] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 03/15/2018] [Accepted: 12/13/2018] [Indexed: 12/16/2022]
Abstract
The ENCyclopedia of DNA Elements (ENCODE) consortium has generated transcription factor (TF) binding ChIP-seq data covering hundreds of TF proteins and cell types; however, due to limits on time and resources, only a small fraction of all possible TF-cell type pairs have been profiled. One solution is to build machine learning models trained on currently available epigenomic data sets that can be applied to the remaining missing pairs. A major challenge is that TF binding sites are cell-type-specific, which can be attributed to cellular contexts such as chromatin accessibility. Meanwhile, indirect TF-DNA binding and interactions between TFs complicate this regulatory process. Technical issues such as sequencing biases and batch effects render the prediction task even more challenging. Many pioneering efforts have been made to predict TF binding profiles based on DNA sequence and DNase-seq footprints, but to what extent a model can be generalized to completely untested cell conditions remains unknown. In this study, we describe our first-place solution to the 2017 ENCODE-DREAM in vivo TF binding site prediction challenge. By carefully addressing multisource biases and information imbalance across cell types, we created a pipeline that significantly outperforms the current state-of-the-art methods. The proposed method is sufficiently complex to model nonlinear interactions between TF binding motifs and chromatin accessibility information up to 1500 bp from the genomic region of interest.
Affiliation(s)
- Hongyang Li: Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Daniel Quang: Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Yuanfang Guan: Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
|
18
|
Keilwagen J, Posch S, Grau J. Accurate prediction of cell type-specific transcription factor binding. Genome Biol 2019; 20:9. [PMID: 30630522 PMCID: PMC6327544 DOI: 10.1186/s13059-018-1614-y] [Citation(s) in RCA: 56] [Impact Index Per Article: 11.2] [Received: 07/25/2018] [Accepted: 12/18/2018] [Indexed: 01/11/2023] Open
Abstract
Prediction of cell type-specific, in vivo transcription factor binding sites is one of the central challenges in regulatory genomics. Here, we present our approach that earned a shared first rank in the "ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge" in 2017. In post-challenge analyses, we benchmark the influence of different feature sets and find that chromatin accessibility and binding motifs are sufficient to yield state-of-the-art performance. Finally, we provide 682 lists of predicted peaks for a total of 31 transcription factors in 22 primary cell types and tissues and a user-friendly version of our approach, Catchitt, for download.
Affiliation(s)
- Jens Keilwagen: Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, Erwin-Baur-Straße 27, Quedlinburg, 06484 Germany
- Stefan Posch: Institute of Computer Science, Martin Luther University Halle–Wittenberg, Von-Seckendorff-Platz 1, Halle (Saale), 06120 Germany
- Jan Grau: Institute of Computer Science, Martin Luther University Halle–Wittenberg, Von-Seckendorff-Platz 1, Halle (Saale), 06120 Germany
|
19
|
Lee NK, Li X, Wang D. A comprehensive survey on genetic algorithms for DNA motif prediction. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.07.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 12/25/2022]
|
20
|
Zhang Q, Fan X, Wang Y, Sun MA, Shao J, Guo D. BPP: a sequence-based algorithm for branch point prediction. Bioinformatics 2018. [PMID: 28633445 DOI: 10.1093/bioinformatics/btx401] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Indexed: 01/05/2023] Open
Abstract
Motivation: Although high-throughput sequencing methods have been proposed to identify splicing branch points in the human genome, these methods can only detect a small fraction of the branch points subject to the sequencing depth, experimental cost and the expression level of the mRNA. An accurate computational model for branch point prediction is therefore an ongoing objective in human genome research. Results: We here propose a novel branch point prediction algorithm that utilizes information on the branch point sequence and the polypyrimidine tract. Using experimentally validated data, we demonstrate that our proposed method outperforms existing methods. Availability and implementation: https://github.com/zhqingit/BPP. Contact: djguo@cuhk.edu.hk. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Qing Zhang: School of Life Sciences and the State Key Laboratory of Agrobiotechnology
- Xiaodan Fan: Department of Statistics, The Chinese University of Hong Kong, Shatin, NT, Hong Kong SAR, China
- Yejun Wang: Department of Cell Biology and Genetics, Shenzhen University Health Science Center, Shenzhen 518060, China
- Ming-An Sun: School of Life Sciences and the State Key Laboratory of Agrobiotechnology
- Jianlin Shao: First Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
- Dianjing Guo: School of Life Sciences and the State Key Laboratory of Agrobiotechnology
|
21
|
Käppel S, Melzer R, Rümpler F, Gafert C, Theißen G. The floral homeotic protein SEPALLATA3 recognizes target DNA sequences by shape readout involving a conserved arginine residue in the MADS-domain. Plant J 2018; 95:341-357. [PMID: 29744943 DOI: 10.1111/tpj.13954] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 02/05/2018] [Revised: 04/17/2018] [Accepted: 04/23/2018] [Indexed: 05/05/2023]
Abstract
SEPALLATA3 of Arabidopsis thaliana is a MADS-domain transcription factor (TF) and a key regulator of flower development. MADS-domain proteins bind to sequences termed 'CArG-boxes' [consensus 5'-CC(A/T)6 GG-3']. Because only a fraction of the CArG-boxes in the Arabidopsis genome are bound by SEPALLATA3, more elaborate principles have to be discovered to better understand which features turn CArG-boxes into genuine recognition sites. Here, we investigate to what extent the shape of the DNA is involved in a 'shape readout' that contributes to the binding of SEPALLATA3. We determined in vitro binding affinities of SEPALLATA3 to DNA probes that all contain the CArG-box motif, but differ in their predicted DNA shape. We found that binding affinity correlates well with a narrow minor groove of the DNA. Substitution of canonical bases with non-standard bases supports the hypothesis of minor groove shape readout by SEPALLATA3. Analysis of mutant SEPALLATA3 proteins further revealed that a highly conserved arginine residue, which is expected to contact the DNA minor groove, contributes significantly to the shape readout. Our studies show that the specific recognition of cis-regulatory elements by a plant MADS-domain TF, and by inference probably also of other TFs of this type, heavily depends on shape readout mechanisms.
Affiliation(s)
- Sandra Käppel: Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
- Rainer Melzer: Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany; School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland
- Florian Rümpler: Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
- Christian Gafert: Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
- Günter Theißen: Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
|
22
|
Comprehensive, high-resolution binding energy landscapes reveal context dependencies of transcription factor binding. Proc Natl Acad Sci U S A 2018; 115:E3702-E3711. [PMID: 29588420 PMCID: PMC5910820 DOI: 10.1073/pnas.1715888115] [Citation(s) in RCA: 51] [Impact Index Per Article: 8.5] [Indexed: 11/18/2022] Open
Abstract
Transcription factors (TFs) are primary regulators of gene expression in cells, where they bind specific genomic target sites to control transcription. Quantitative measurements of TF-DNA binding energies can improve the accuracy of predictions of TF occupancy and downstream gene expression in vivo and shed light on how transcriptional networks are rewired throughout evolution. Here, we present a sequencing-based TF binding assay and analysis pipeline (BET-seq, for Binding Energy Topography by sequencing) capable of providing quantitative estimates of binding energies for more than one million DNA sequences in parallel at high energetic resolution. Using this platform, we measured the binding energies associated with all possible combinations of 10 nucleotides flanking the known consensus DNA target interacting with two model yeast TFs, Pho4 and Cbf1. A large fraction of these flanking mutations change overall binding energies by an amount equal to or greater than consensus site mutations, suggesting that current definitions of TF binding sites may be too restrictive. By systematically comparing estimates of binding energies output by deep neural networks (NNs) and biophysical models trained on these data, we establish that dinucleotide (DN) specificities are sufficient to explain essentially all variance in observed binding behavior, with Cbf1 binding exhibiting significantly more nonadditivity than Pho4. NN-derived binding energies agree with orthogonal biochemical measurements and reveal that dynamically occupied sites in vivo are both energetically and mutationally distant from the highest affinity sites.
|
23
|
Wei X, Zhang J. Why Phenotype Robustness Promotes Phenotype Evolvability. Genome Biol Evol 2017; 9:3509-3515. [PMID: 29228219 PMCID: PMC5751051 DOI: 10.1093/gbe/evx264] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Accepted: 12/07/2017] [Indexed: 12/14/2022] Open
Abstract
Robustness and evolvability are fundamental characteristics of life whose relationship has intrigued generations of biologists. Studies of several genotype–phenotype maps (GPMs) such as the map between short DNA sequences and their bindings to transcription factors showed that phenotype robustness (PR) promotes phenotype evolvability (PE), but the underlying reason is unclear. Here, we show mathematically that the expected PE is a monotonically increasing function of the expected PR in random GPMs. Population genetic simulations confirm that increasing PR raises the probability that a target phenotype appears in a population within a given time, under empirical as well as randomly rewired GPMs. These and other results demonstrate that the positive correlation between PR and PE is mathematical rather than biological. Hence, it is unsurprising to observe this correlation in every empirical GPM investigated, although the magnitude of the correlation may vary due to influences of various biological factors.
Affiliation(s)
- Xinzhu Wei: Department of Ecology and Evolutionary Biology, University of Michigan
- Jianzhi Zhang: Department of Ecology and Evolutionary Biology, University of Michigan
|
24
|
Djordjevic M, Djordjevic M, Zdobnov E. Scoring Targets of Transcription in Bacteria Rather than Focusing on Individual Binding Sites. Front Microbiol 2017; 8:2314. [PMID: 29213263 PMCID: PMC5702782 DOI: 10.3389/fmicb.2017.02314] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/31/2017] [Accepted: 11/09/2017] [Indexed: 11/13/2022] Open
Abstract
Reliable identification of targets of bacterial regulators is necessary to understand bacterial gene expression regulation. These targets are commonly predicted by searching for high-scoring binding sites in the upstream genomic regions, which typically leads to a large number of false positives. In contrast to the common approach, here we propose a novel concept, where overrepresentation of the scoring distribution that corresponds to the entire searched region is assessed, as opposed to predicting individual binding sites. We explore two implementations of this concept, based on Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) tests, which both provide straightforward P-value estimates for predicted targets. This approach is implemented for pleiotropic bacterial regulators, including σ70 (bacterial housekeeping σ factor) target predictions, which is a classical bioinformatics problem characterized by low specificity. We show that the KS-based approach is both faster and more accurate, departing from the current paradigm of AD being slower, but more accurate. Moreover, the KS approach leads to a significant increase in the search accuracy compared to the standard approach, while at the same time straightforwardly assigning well-established P-values to each potential target. Consequently, the new KS-based method proposed here, which assigns P-values to fixed-length upstream regions, provides a fast and accurate approach for predicting bacterial transcription targets.
Affiliation(s)
- Marko Djordjevic: Institute of Physiology and Biochemistry, Faculty of Biology, University of Belgrade, Belgrade, Serbia
- Evgeny Zdobnov: Swiss Institute of Bioinformatics and Department of Genetic Medicine and Development, University of Geneva, Geneva, Switzerland
|
25
|
Yesudhas D, Anwar MA, Panneerselvam S, Kim HK, Choi S. Evaluation of Sox2 binding affinities for distinct DNA patterns using steered molecular dynamics simulation. FEBS Open Bio 2017; 7:1750-1767. [PMID: 29123983 PMCID: PMC5666385 DOI: 10.1002/2211-5463.12316] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 07/12/2017] [Revised: 08/14/2017] [Accepted: 09/05/2017] [Indexed: 11/29/2022] Open
Abstract
Transcription factors (TFs) are gene expression regulators that bind to DNA in a sequence‐specific manner and determine the functional characteristics of the gene. It is worthwhile to study the unique characteristics of such specific TF‐binding patterns in DNA. Sox2 recognizes a 6‐ to 7‐base pair consensus DNA sequence; the central four bases of the binding site are highly conserved, whereas the two to three flanking bases are variable. Here, we attempted to analyze the binding affinity and specificity of the Sox2 protein for distinct DNA sequence patterns via steered molecular dynamics, in which a pulling force is employed to dissociate Sox2 from Sox2–DNA during simulation to study the behavior of a complex under nonequilibrium conditions. The simulation results revealed that the first two stacking bases of the binding pattern have an exclusive impact on the binding affinity, with the corresponding mutant complexes showing greater binding and longer dissociation time than the experimental complexes do. In contrast, mutation of the conserved bases tends to reduce the affinity, and mutation of the complete conserved region disrupts the binding. This might pave the way to identify the most likely binding pattern recognized by Sox2 based on the affinity of each configuration. The α2‐helix of Sox2 was found to be the key player in the Sox2–DNA association. The characterization of Sox2's binding patterns for the target genes in the genome helps in understanding its regulatory functions.
Affiliation(s)
- Dhanusha Yesudhas: Department of Molecular Science and Technology, Ajou University, Suwon, Korea
- Han-Kyul Kim: Department of Molecular Science and Technology, Ajou University, Suwon, Korea
- Sangdun Choi: Department of Molecular Science and Technology, Ajou University, Suwon, Korea
|
26
|
Gursky VV, Kozlov KN, Kulakovskiy IV, Zubair A, Marjoram P, Lawrie DS, Nuzhdin SV, Samsonova MG. Translating natural genetic variation to gene expression in a computational model of the Drosophila gap gene regulatory network. PLoS One 2017; 12:e0184657. [PMID: 28898266 PMCID: PMC5595321 DOI: 10.1371/journal.pone.0184657] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 07/05/2017] [Accepted: 08/28/2017] [Indexed: 11/18/2022] Open
Abstract
Annotating the genotype-phenotype relationship, and developing a proper quantitative description of the relationship, requires understanding the impact of natural genomic variation on gene expression. We apply a sequence-level model of gap gene expression in the early development of Drosophila to analyze single nucleotide polymorphisms (SNPs) in a panel of natural sequenced D. melanogaster lines. Using a thermodynamic modeling framework, we provide both analytical and computational descriptions of how single-nucleotide variants affect gene expression. The analysis reveals that the sequence variants increase (decrease) gene expression if located within binding sites of repressors (activators). We show that the sign of SNP influence (activation or repression) may change in time and space and elucidate the origin of this change in specific examples. The thermodynamic modeling approach predicts non-local and non-linear effects arising from SNPs, and combinations of SNPs, in individual fly genotypes. Simulation of individual fly genotypes using our model reveals that this non-linearity reduces to almost additive inputs from multiple SNPs. Further, we see signatures of the action of purifying selection in the gap gene regulatory regions. To infer the specific targets of purifying selection, we analyze the patterns of polymorphism in the data at two phenotypic levels: the strengths of binding and expression. We find that combinations of SNPs show evidence of being under selective pressure, while individual SNPs do not. The model predicts that SNPs appear to accumulate in the genotypes of the natural population in a way biased towards small increases in activating action on the expression pattern. Taken together, these results provide a systems-level view of how genetic variation translates to the level of gene regulatory networks via combinatorial SNP effects.
Affiliation(s)
- Vitaly V. Gursky: Theoretical Department, Ioffe Institute, Saint Petersburg, Russia; Systems Biology and Bioinformatics Laboratory, Peter the Great Saint Petersburg Polytechnic University, Saint Petersburg, Russia
- Konstantin N. Kozlov: Systems Biology and Bioinformatics Laboratory, Peter the Great Saint Petersburg Polytechnic University, Saint Petersburg, Russia
- Ivan V. Kulakovskiy: Engelhardt Institute of Molecular Biology, Moscow, Russia; Vavilov Institute of General Genetics, Moscow, Russia; Center for Data-Intensive Biomedicine and Biotechnology, Skolkovo Institute of Science and Technology, Moscow, Russia
- Asif Zubair: Molecular and Computational Biology, University of Southern California, Los Angeles, California, United States of America
- Paul Marjoram: Molecular and Computational Biology, University of Southern California, Los Angeles, California, United States of America
- David S. Lawrie: Molecular and Computational Biology, University of Southern California, Los Angeles, California, United States of America
- Sergey V. Nuzhdin: Molecular and Computational Biology, University of Southern California, Los Angeles, California, United States of America
- Maria G. Samsonova: Systems Biology and Bioinformatics Laboratory, Peter the Great Saint Petersburg Polytechnic University, Saint Petersburg, Russia
|
27
|
Inherent limitations of probabilistic models for protein-DNA binding specificity. PLoS Comput Biol 2017; 13:e1005638. [PMID: 28686588 PMCID: PMC5521849 DOI: 10.1371/journal.pcbi.1005638] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 04/12/2017] [Revised: 07/21/2017] [Accepted: 06/21/2017] [Indexed: 01/10/2023] Open
Abstract
The specificities of transcription factors are most commonly represented with probabilistic models. These models provide a probability for each base occurring at each position within the binding site and the positions are assumed to contribute independently. The model is simple and intuitive and is the basis for many motif discovery algorithms. However, the model also has inherent limitations that prevent it from accurately representing true binding probabilities, especially for the highest affinity sites under conditions of high protein concentration. The limitations are not due to the assumption of independence between positions but rather are caused by the non-linear relationship between binding affinity and binding probability and the fact that independent normalization at each position skews the site probabilities. Generally probabilistic models are reasonably good approximations, but new high-throughput methods allow for biophysical models with increased accuracy that should be used whenever possible.
Transcription factors (TFs), a class of DNA-binding proteins, play a central role in the regulation of gene expression. TFs control the rate of transcription by binding to the genome in a sequence-specific manner. Thus, one important aspect in the study of gene regulation mechanism is to model the binding specificities of TFs, namely the features of the DNA sequences that a TF prefers to bind. Multiple models have been proposed to characterize the binding specificities of TFs, among which the class of probabilistic models is the most popular. In this study, we point out several major limitations of the well-established probabilistic model by comparing it with the biophysical model. Through simulations we demonstrate that the probabilistic model is only an approximation of the biophysical model. The latter has most of the advantages of the former, and is a more accurate representation of binding specificities. We propose a shift from the probabilistic model to the biophysical model in future studies of protein-DNA interactions.
|
28
|
López Y, Vandenbon A, Nose A, Nakai K. Modeling the cis-regulatory modules of genes expressed in developmental stages of Drosophila melanogaster. PeerJ 2017; 5:e3389. [PMID: 28584716 PMCID: PMC5452948 DOI: 10.7717/peerj.3389] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/20/2016] [Accepted: 05/08/2017] [Indexed: 12/30/2022] Open
Abstract
Because transcription is the first step in the regulation of gene expression, understanding how transcription factors bind to their DNA binding motifs has become absolutely necessary. It has been shown that the promoters of genes with similar expression profiles share common structural patterns. This paper presents an extensive study of the regulatory regions of genes expressed in 24 developmental stages of Drosophila melanogaster. It proposes the use of a combination of structural features, such as positioning of individual motifs relative to the transcription start site, orientation, pairwise distance between motifs, and presence of motifs anywhere in the promoter for predicting gene expression from structural features of promoter sequences. RNA-sequencing data was utilized to create and validate the 24 models. When genes with high-scoring promoters were compared to those identified by RNA-seq samples, 19 (79.2%) statistically significant models, a number that exceeds previous studies, were obtained. Each model yielded a set of highly informative features, which were used to search for genes with similar biological functions.
Affiliation(s)
- Yosvany López
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan; Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
| | - Alexis Vandenbon
- Immunology Frontier Research Center, Osaka University, Osaka, Japan
| | - Akinao Nose
- Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
| | - Kenta Nakai
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
|
29
|
Li L, Wunderlich Z. An Enhancer's Length and Composition Are Shaped by Its Regulatory Task. Front Genet 2017; 8:63. [PMID: 28588608 PMCID: PMC5440464 DOI: 10.3389/fgene.2017.00063] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 03/17/2017] [Accepted: 05/08/2017] [Indexed: 12/02/2022]
Abstract
Enhancers drive the gene expression patterns required for virtually every process in metazoans. We propose that enhancer length and transcription factor (TF) binding site composition—the number and identity of TF binding sites—reflect the complexity of the enhancer's regulatory task. In development, we define regulatory task complexity as the number of fates specified in a set of cells at once. We hypothesize that enhancers with more complex regulatory tasks will be longer, with more, but less specific, TF binding sites. Larger numbers of binding sites can be arranged in more ways, allowing enhancers to drive many distinct expression patterns, and therefore cell fates, using a finite number of TF inputs. We compare ~100 enhancers patterning the more complex anterior-posterior (AP) axis and the simpler dorsal-ventral (DV) axis in Drosophila and find that the AP enhancers are longer, with more, but less specific, binding sites than the DV enhancers. Using a set of ~3,500 enhancers, we find enhancer length and TF binding site number again increase with increasing regulatory task complexity. Therefore, to be broadly applicable, computational tools to study enhancers must account for differences in regulatory task.
Affiliation(s)
- Lily Li
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, United States
| | - Zeba Wunderlich
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, United States
|
30
|
Guo C, McDowell IC, Nodzenski M, Scholtens DM, Allen AS, Lowe WL, Reddy TE. Transversions have larger regulatory effects than transitions. BMC Genomics 2017; 18:394. [PMID: 28525990 PMCID: PMC5438547 DOI: 10.1186/s12864-017-3785-4] [Citation(s) in RCA: 58] [Impact Index Per Article: 8.3] [Received: 08/26/2016] [Accepted: 05/10/2017] [Indexed: 12/30/2022]
Abstract
Background: Transversions (Tv’s) are more likely to alter the amino acid sequence of proteins than transitions (Ts’s), and local deviations in the Ts:Tv ratio are indicative of evolutionary selection on genes. Whether the two different types of mutations have different effects in non-protein-coding sequences remains unknown. Genetic variants primarily impact gene expression by disrupting the binding of transcription factors (TFs) and other DNA-binding proteins. Because Tv’s cause larger changes in the shape of a DNA backbone, we hypothesized that Tv’s would have larger impacts on TF binding and gene expression. Results: Here, we provide multiple lines of evidence demonstrating that Tv’s have larger impacts on regulatory DNA, including analyses of TF binding motifs and allele-specific TF binding. In these analyses, we observed a depletion of Tv’s within TF binding motifs and TF binding sites. Using massively parallel population-scale reporter assays, we also provided empirical evidence that Tv’s have larger effects than Ts’s on the activity of human gene regulatory elements. Conclusions: Tv’s are more likely to disrupt TF binding, resulting in larger changes in gene expression. Although the observed differences are small, these findings represent a novel, fundamental property of regulatory variation. Understanding the features of functional non-coding variation could be valuable for revealing the genetic underpinnings of complex traits and diseases in future studies. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-3785-4) contains supplementary material, which is available to authorized users.
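The transition/transversion distinction used throughout this abstract follows directly from base chemistry: a transition swaps a purine for a purine (A/G) or a pyrimidine for a pyrimidine (C/T), while a transversion crosses between the two classes. A minimal sketch of that classification follows; the function names and the example variant list are hypothetical and are not taken from the study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref!r} -> {alt!r}")
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def ts_tv_counts(variants):
    """Count transitions and transversions in an iterable of (ref, alt) pairs."""
    ts = tv = 0
    for ref, alt in variants:
        if classify_substitution(ref, alt) == "transition":
            ts += 1
        else:
            tv += 1
    return ts, tv


if __name__ == "__main__":
    # Hypothetical variants, for illustration only.
    example = [("A", "G"), ("C", "A"), ("T", "C")]
    print(ts_tv_counts(example))
```

A Ts:Tv ratio can be derived from the two counts; the sketch returns the counts separately so that a zero transversion count does not cause a division error.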
Affiliation(s)
- Cong Guo
- Center for Genomic and Computational Biology, Duke University Medical School, Durham, NC, 27710, USA; University Program in Genetics and Genomics, Duke University, Durham, NC, 27710, USA
| | - Ian C McDowell
- Center for Genomic and Computational Biology, Duke University Medical School, Durham, NC, 27710, USA; Program in Computational Biology and Bioinformatics, Duke University, Durham, NC, 27710, USA
| | - Michael Nodzenski
- Department of Preventive Medicine, Division of Biostatistics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
| | - Denise M Scholtens
- Department of Preventive Medicine, Division of Biostatistics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
| | - Andrew S Allen
- Center for Statistical Genetics and Genomics, Duke University, Durham, North Carolina, 27710, USA; Department of Biostatistics and Bioinformatics, Duke University Medical School, Durham, NC, 27710, USA
| | - William L Lowe
- Division of Endocrinology, Metabolism and Molecular Medicine, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
| | - Timothy E Reddy
- Center for Genomic and Computational Biology, Duke University Medical School, Durham, NC, 27710, USA; Department of Biostatistics and Bioinformatics, Duke University Medical School, Durham, NC, 27710, USA; Present Address: Biostatistics & Bioinformatics, 101 Science Dr., 2347 CIEMAS, Durham, NC, 27708, USA.
|
31
|
Goldshtein M, Lukatsky DB. Specificity-Determining DNA Triplet Code for Positioning of Human Preinitiation Complex. Biophys J 2017; 112:2047-2050. [PMID: 28479135 DOI: 10.1016/j.bpj.2017.04.023] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 09/22/2016] [Revised: 03/30/2017] [Accepted: 04/14/2017] [Indexed: 01/23/2023]
Abstract
The notion that transcription factors bind DNA only through specific, consensus binding sites has been recently questioned. No specific consensus motif for the positioning of the human preinitiation complex (PIC) has been identified. Here, we reveal that a nonconsensus, statistical DNA triplet code provides specificity for the positioning of the human PIC. In particular, we reveal a highly nonrandom, statistical pattern of repetitive nucleotide triplets that correlates with the genomewide binding preferences of PIC measured by ChIP-exo. We analyze the triplet enrichment and depletion near the transcription start site and identify triplets that have the strongest effect on PIC-DNA nonconsensus binding. Using a statistical mechanics random-binder model without fitting parameters, with the genomic DNA sequence as the only input, we further validate that the nonconsensus nucleotide triplet code constitutes a key signature providing PIC binding specificity in the human genome. Our results constitute a proof-of-concept for, to our knowledge, a new design principle for protein-DNA recognition in the human genome, which can lead to a better mechanistic understanding of transcriptional regulation.
Affiliation(s)
- Matan Goldshtein
- Avram and Stella Goldstein-Goren Department of Biotechnology Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
| | - David B Lukatsky
- Department of Chemistry, Ben-Gurion University of the Negev, Beer-Sheva, Israel.
|
32
|
Abstract
Protein-DNA binding plays a central role in gene regulation and by that in all processes in the living cell. Novel experimental and computational approaches facilitate better understanding of protein-DNA binding preferences via high-throughput measurement of protein binding to a large number of DNA sequences and inference of binding models from them. Here we review the state of the art in measuring protein-DNA binding in vitro, emphasizing the advantages and limitations of different technologies. In addition, we describe models for representing protein-DNA binding preferences and key computational approaches to learn those from high-throughput data. Using large experimental data sets, we test the performance of different models based on different measuring techniques. We conclude with pertinent open problems.
|
33
|
Abstract
Background: In bacteria, transcription initiation is carried out by different σ factors, most of which fall within the σ70 family. This family is diverse, ranging from the housekeeping Group I (RpoDs) to Group IV (ECF) σ factors, which transcribe smaller regulons under more stringent conditions. RpoDs employ a kinetic mix-and-match mechanism, where promoter elements complement each other's binding strengths in achieving sufficient transcription activity. On the other hand, it is assumed that ECF σs, which are the most distant from the housekeeping σ factors, cannot exhibit mix-and-matching. However, mix-and-matching for ECF σ factors has not been quantitatively checked before, and recent results show a much larger flexibility in promoter recognition by the members of this group. Results: To this end, we quantitatively investigate mix-and-matching in two canonical ECF σ family members (σE and σW), for which we use a biophysics-based model of transcription initiation. For σE, we perform a separate analysis for in-vitro active and in-vitro inactive promoters, which allows us to investigate how mix-and-matching depends on the external factors that may control transcription activity in the in-vitro inactive set. We show that the promoter elements of canonical ECF σs significantly complement each other's strengths, where such mix-and-matching is, in the in-vitro active set, even stronger than the correlations observed for the housekeeping σs. This complementation, however, significantly decreases for the in-vitro inactive set, which we propose is due to mix-and-matching with regulatory sequences outside of the canonical promoter elements. In line with this proposition, we show that a conserved spacer element, which appears in the in-vitro inactive promoter set, significantly increases the promoter element complementation. While RpoD promoter elements mix-and-match to achieve sufficient total transcription activity, for σE they complement each other to achieve sufficiently strong total binding affinity, which we relate to differences in physiological responses between the two groups of σ factors. Conclusion: Despite a common notion that smaller σ factor specificity leads to a larger mix-and-matching, we here obtain a larger promoter element complementation for σE compared to RpoDs. Finally, to explain this finding, we propose a simple model which relates the size of the σ factor regulon to the extent of mix-and-matching, based on an assumption of a selection pressure on promoters that are near the non-specific binding boundary to remain functional. Electronic supplementary material: The online version of this article (doi:10.1186/s12862-016-0865-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jelena Guzina
- Institute of Physiology and Biochemistry, Faculty of Biology, University of Belgrade, Studentski trg 16, 11000, Belgrade, Serbia; Multidisciplinary PhD program in Biophysics, University of Belgrade, Belgrade, Serbia
| | - Marko Djordjevic
- Institute of Physiology and Biochemistry, Faculty of Biology, University of Belgrade, Studentski trg 16, 11000, Belgrade, Serbia.
|
34
|
Austin RS, Hiu S, Waese J, Ierullo M, Pasha A, Wang TT, Fan J, Foong C, Breit R, Desveaux D, Moses A, Provart NJ. New BAR tools for mining expression data and exploring Cis-elements in Arabidopsis thaliana. The Plant Journal: for Cell and Molecular Biology 2016; 88:490-504. [PMID: 27401965 DOI: 10.1111/tpj.13261] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 04/14/2016] [Revised: 06/23/2016] [Accepted: 07/01/2016] [Indexed: 05/21/2023]
Abstract
Identifying sets of genes that are specifically expressed in certain tissues or in response to an environmental stimulus is useful for designing reporter constructs, generating gene expression markers, or for understanding gene regulatory networks. We have developed an easy-to-use online tool for defining a desired expression profile (a modification of our Expression Angler program), which can then be used to identify genes exhibiting patterns of expression that match this profile as closely as possible. Further, we have developed another online tool, Cistome, for predicting or exploring cis-elements in the promoters of sets of co-expressed genes identified by such a method, or by other methods. We present two use cases for these tools, which are freely available on the Bio-Analytic Resource at http://BAR.utoronto.ca.
Affiliation(s)
- Ryan S Austin
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Shu Hiu
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Jamie Waese
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Matthew Ierullo
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Asher Pasha
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Ting Ting Wang
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Jim Fan
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Curtis Foong
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Robert Breit
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Darrell Desveaux
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Alan Moses
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Nicholas J Provart
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
|
35
|
Chen D, Orenstein Y, Golodnitsky R, Pellach M, Avrahami D, Wachtel C, Ovadia-Shochat A, Shir-Shapira H, Kedmi A, Juven-Gershon T, Shamir R, Gerber D. SELMAP - SELEX affinity landscape MAPping of transcription factor binding sites using integrated microfluidics. Sci Rep 2016; 6:33351. [PMID: 27628341 PMCID: PMC5024299 DOI: 10.1038/srep33351] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 10/28/2015] [Accepted: 08/19/2016] [Indexed: 01/19/2023]
Abstract
Transcription factors (TFs) alter gene expression in response to changes in the environment through sequence-specific interactions with the DNA. These interactions are best portrayed as a landscape of TF binding affinities. Current methods to study sequence-specific binding preferences suffer from limited dynamic range, sequence bias, lack of specificity and limited throughput. We have developed a microfluidic-based device for SELEX Affinity Landscape MAPping (SELMAP) of TF binding, which allows high-throughput measurement of 16 proteins in parallel. We used it to measure the relative affinities of Pho4, AtERF2 and Btd full-length proteins to millions of different DNA binding sites, and detected both high- and low-affinity interactions in equilibrium conditions, generating a comprehensive landscape of the relative TF affinities to all possible DNA 6-mers, and even DNA 10-mers with increased sequencing depth. Low quantities of both the TFs and DNA oligomers were sufficient for obtaining high-quality results, significantly reducing experimental costs. SELMAP allows in-depth screening of hundreds of TFs, and provides a means for better understanding of the regulatory processes that govern gene expression.
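A landscape over "all possible DNA 6-mers" has 4^6 = 4096 entries. The sketch below shows one generic way such a table could be built from sequencing reads: a pseudocounted enrichment of each k-mer in bound reads relative to input reads, scaled so the top k-mer scores 1. This is an assumed, simplified scoring scheme, not the SELMAP analysis pipeline; the function names and example reads are hypothetical.

```python
from collections import Counter
from itertools import product


def all_kmers(k):
    """Enumerate every DNA k-mer in lexicographic order (4**k strings)."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def count_kmers(reads, k):
    """Count every overlapping k-mer occurrence across the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def relative_enrichment(selected_reads, input_reads, k, pseudocount=1.0):
    """Frequency ratio (selected / input) per k-mer, scaled to a maximum of 1.

    The pseudocount keeps unobserved k-mers finite; it is an assumption of
    this sketch rather than a parameter of any published method.
    """
    sel = count_kmers(selected_reads, k)
    inp = count_kmers(input_reads, k)
    sel_total = sum(sel.values()) + pseudocount * 4 ** k
    inp_total = sum(inp.values()) + pseudocount * 4 ** k
    ratio = {}
    for kmer in all_kmers(k):
        f_sel = (sel[kmer] + pseudocount) / sel_total
        f_inp = (inp[kmer] + pseudocount) / inp_total
        ratio[kmer] = f_sel / f_inp
    top = max(ratio.values())
    return {kmer: r / top for kmer, r in ratio.items()}


if __name__ == "__main__":
    # Hypothetical reads, for illustration only.
    landscape = relative_enrichment(["CACGTG"] * 5, ["ACGTAC"], k=6)
    print(len(landscape), max(landscape, key=landscape.get))
```

Moving from 6-mers to 10-mers raises the table size from 4096 to 4^10 = 1,048,576 entries, which is consistent with the abstract's remark that 10-mers need increased sequencing depth.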
Affiliation(s)
- Dana Chen
- Mina and Everard Goodman Faculty of Life Sciences, Bar Ilan University, Ramat-Gan, 5290002, Israel
| | - Yaron Orenstein
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, 69978, Israel
| | - Rada Golodnitsky
- Mina and Everard Goodman Faculty of Life Sciences, Bar Ilan University, Ramat-Gan, 5290002, Israel
| | - Michal Pellach
- Mina and Everard Goodman Faculty of Life Sciences, Bar Ilan University, Ramat-Gan, 5290002, Israel
| | - Dorit Avrahami
- Mina and Everard Goodman Faculty of Life Sciences, Bar Ilan University, Ramat-Gan, 5290002, Israel
| | - Chaim Wachtel
- Mina and Everard Goodman Faculty of Life Sciences, Bar Ilan University, Ramat-Gan, 5290002, Israel
| | - Avital Ovadia-Shochat
- Mina and Everard Goodman Faculty of Life Sciences, Bar Ilan University, Ramat-Gan, 5290002, Israel
| | - Hila Shir-Shapira
- Mina and Everard Goodman Faculty of Life Sciences, Bar Ilan University, Ramat-Gan, 5290002, Israel
| | - Adi Kedmi
- Mina and Everard Goodman Faculty of Life Sciences, Bar Ilan University, Ramat-Gan, 5290002, Israel
| | - Tamar Juven-Gershon
- Mina and Everard Goodman Faculty of Life Sciences, Bar Ilan University, Ramat-Gan, 5290002, Israel
| | - Ron Shamir
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, 69978, Israel
| | - Doron Gerber
- Mina and Everard Goodman Faculty of Life Sciences, Bar Ilan University, Ramat-Gan, 5290002, Israel
|
36
|
Westermark PO. Linking Core Promoter Classes to Circadian Transcription. PLoS Genet 2016; 12:e1006231. [PMID: 27504829 PMCID: PMC4978467 DOI: 10.1371/journal.pgen.1006231] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 01/27/2016] [Accepted: 07/08/2016] [Indexed: 01/09/2023]
Abstract
Circadian rhythms in transcription are generated by rhythmic abundances and DNA binding activities of transcription factors. Propagation of rhythms to transcriptional initiation involves the core promoter, its chromatin state, and the basal transcription machinery. Here, I characterize core promoters and chromatin states of genes transcribed in a circadian manner in mouse liver and in Drosophila. It is shown that the core promoter is a critical determinant of circadian mRNA expression in both species. A distinct core promoter class, strong circadian promoters (SCPs), is identified in mouse liver but not Drosophila. SCPs are defined by specific core promoter features, and are shown to drive circadian transcriptional activities with both high averages and high amplitudes. Data analysis and mathematical modeling further provided evidence for rhythmic regulation of both polymerase II recruitment and pause release at SCPs. The analysis provides a comprehensive and systematic view of core promoters and their link to circadian mRNA expression in mouse and Drosophila, and thus reveals a crucial role for the core promoter in regulated, dynamic transcription.
Affiliation(s)
- Pål O. Westermark
- Institute for Theoretical Biology, Charité – Universitätsmedizin Berlin, Berlin, Germany
|
37
|
Promoter Recognition by Extracytoplasmic Function σ Factors: Analyzing DNA and Protein Interaction Motifs. J Bacteriol 2016; 198:1927-1938. [PMID: 27137497 DOI: 10.1128/jb.00244-16] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 03/20/2016] [Accepted: 04/25/2016] [Indexed: 01/25/2023]
Abstract
Extracytoplasmic function (ECF) σ factors are the largest and the most diverse group of alternative σ factors, but their mechanisms of transcription are poorly studied. This subfamily is considered to exhibit a rigid promoter structure and an absence of mixing and matching; both -35 and -10 elements are considered necessary for initiating transcription. This paradigm, however, is based on very limited data, which bias the analysis of diverse ECF σ subgroups. Here we investigate DNA and protein recognition motifs involved in ECF σ factor transcription by a computational analysis of canonical ECF subfamily members, much less studied ECF σ subgroups, and the group outliers, obtained from recently sequenced bacteriophages. The analysis identifies an extended -10 element in promoters for phage ECF σ factors; a comparison with bacterial σ factors points to a putative 6-amino-acid motif just C-terminal of domain σ2, which is responsible for the interaction with the identified extension of the -10 element. Interestingly, a similar protein motif is found C-terminal of domain σ2 in canonical ECF σ factors, at a position where it is expected to interact with a conserved motif further upstream of the -10 element. Moreover, the phiEco32 ECF σ factor lacks a recognizable -35 element and σ4 domain, which we identify in a homologous phage, 7-11, indicating that the extended -10 element can compensate for the lack of -35 element interactions. Overall, the results reveal greater flexibility in promoter recognition by ECF σ factors than previously recognized and raise the possibility that mixing and matching also apply to this group, a notion that remains to be biochemically tested.
IMPORTANCE: ECF σ factors are the most numerous group of alternative σ factors but have been little studied. Their promoter recognition mechanisms are obscured by the large diversity within the ECF σ factor group and the limited similarity with the well-studied housekeeping σ factors. Here we extensively compare bacterial and bacteriophage ECF σ factors and their promoters in order to infer DNA and protein recognition motifs involved in transcription initiation. We predict a more flexible promoter structure than is recognized by the current paradigm, which assumes rigidness, and propose that ECF σ promoter elements may complement (mix and match with) each other's strengths. These results warrant the refocusing of research efforts from the well-studied housekeeping σ factors toward the physiologically highly important, but insufficiently understood, alternative σ factors.
|
38
|
Peng PC, Hassan Samee MA, Sinha S. Incorporating chromatin accessibility data into sequence-to-expression modeling. Biophys J 2015; 108:1257-67. [PMID: 25762337 DOI: 10.1016/j.bpj.2014.12.037] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 08/18/2014] [Revised: 12/01/2014] [Accepted: 12/11/2014] [Indexed: 01/30/2023]
Abstract
Prediction of gene expression levels from regulatory sequences is one of the major challenges of genomic biology today. A particularly promising approach to this problem is that taken by thermodynamics-based models that interpret an enhancer sequence in a given cellular context specified by transcription factor concentration levels and predict precise expression levels driven by that enhancer. Such models have so far not accounted for the effect of chromatin accessibility on interactions between transcription factor and DNA and consequently on gene-expression levels. Here, we extend a thermodynamics-based model of gene expression, called GEMSTAT (Gene Expression Modeling Based on Statistical Thermodynamics), to incorporate chromatin accessibility data and quantify its effect on accuracy of expression prediction. In the new model, called GEMSTAT-A, accessibility at a binding site is assumed to affect the transcription factor's binding strength at the site, whereas all other aspects are identical to the GEMSTAT model. We show that this modification results in significantly better fits in a data set of over 30 enhancers regulating spatial expression patterns in the blastoderm-stage Drosophila embryo. It is important to note that the improved fits result not from an overall elevated accessibility in active enhancers but from the variation of accessibility levels within an enhancer. With whole-genome DNA accessibility measurements becoming increasingly popular, our work demonstrates how such data may be useful for sequence-to-expression models. It also calls for future advances in modeling accessibility levels from sequence and the transregulatory context, so as to predict accurately the effect of cis and trans perturbations on gene expression.
Affiliation(s)
- Pei-Chen Peng
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois
| | - Md Abul Hassan Samee
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois
| | - Saurabh Sinha
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois; Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois.
|
39
|
Tapan S, Wang D. A Further Study on Mining DNA Motifs Using Fuzzy Self-Organizing Maps. IEEE Transactions on Neural Networks and Learning Systems 2016; 27:113-124. [PMID: 26068877 DOI: 10.1109/tnnls.2015.2435155] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/04/2023]
Abstract
Self-organizing map (SOM)-based motif mining, despite being a promising approach for problem solving, mostly fails to offer a consistent interpretation of clusters with respect to the mixed composition of signal and noise in the nodes. The main reason behind this shortcoming comes from the similarity metrics used in data assignment, specially designed with the biological interpretation for this domain, which are not meant to consider the inevitable noise mixture in the clusters. This limits the explicability of the majority of clusters that are supposedly noise dominated, degrading the overall system clarity in motif discovery. This paper aims to improve the explicability aspect of the learning process by introducing a composite similarity function (CSF) that is specially designed for the k-mer-to-cluster similarity measure with respect to the degree of motif properties and embedded noise in the cluster. Our proposed motif finding algorithm in this paper is built on our previous work, robust elicitation algorithms for discovering (READ) [1], and is termed READ Deoxyribonucleic acid motifs using CSFs (READ(csf)). It performs slightly better than READ and shows some remarkable improvements over the SOM-based SOMBRERO and SOMEA tools in terms of F-measure on the testing data sets. A real data set containing multiple motifs is used to explore the potential of READ(csf) for more challenging biological data mining tasks. Visual comparisons with the verified logos extracted from the JASPAR database demonstrate that our algorithm is promising for discovering multiple motifs simultaneously.
|
40
|
Riley TR, Lazarovici A, Mann RS, Bussemaker HJ. Building accurate sequence-to-affinity models from high-throughput in vitro protein-DNA binding data using FeatureREDUCE. eLife 2015; 4:e06397. [PMID: 26701911 PMCID: PMC4758951 DOI: 10.7554/elife.06397] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 01/08/2015] [Accepted: 12/20/2015] [Indexed: 01/26/2023]
Abstract
Transcription factors are crucial regulators of gene expression. Accurate quantitative definition of their intrinsic DNA binding preferences is critical to understanding their biological function. High-throughput in vitro technology has recently been used to deeply probe the DNA binding specificity of hundreds of eukaryotic transcription factors, yet algorithms for analyzing such data have not yet fully matured. Here, we present a general framework (FeatureREDUCE) for building sequence-to-affinity models based on a biophysically interpretable and extensible model of protein-DNA interaction that can account for dependencies between nucleotides within the binding interface or multiple modes of binding. When training on protein binding microarray (PBM) data, we use robust regression and modeling of technology-specific biases to infer specificity models of unprecedented accuracy and precision. We provide quantitative validation of our results by comparing to gold-standard data when available.
Affiliation(s)
- Todd R Riley
- Department of Biological Sciences, Columbia University, New York, United States
- Department of Systems Biology, Columbia University, New York, United States
- Department of Biology, University of Massachusetts Boston, Boston, United States
| | - Allan Lazarovici
- Department of Biological Sciences, Columbia University, New York, United States
- Department of Electrical Engineering, Columbia University, New York, United States
| | - Richard S Mann
- Department of Systems Biology, Columbia University, New York, United States
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, United States
| | - Harmen J Bussemaker
- Department of Biological Sciences, Columbia University, New York, United States
- Department of Systems Biology, Columbia University, New York, United States
|
41
|
Pulkkinen O, Metzler R. Variance-corrected Michaelis-Menten equation predicts transient rates of single-enzyme reactions and response times in bacterial gene-regulation. Sci Rep 2015; 5:17820. [PMID: 26635080 PMCID: PMC4669464 DOI: 10.1038/srep17820] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 09/07/2015] [Accepted: 11/06/2015] [Indexed: 01/07/2023]
Abstract
Many chemical reactions in biological cells occur at very low concentrations of constituent molecules. Thus, transcriptional gene-regulation is often controlled by poorly expressed transcription-factors, such as the E. coli lac repressor with a few tens of copies. Here we study the effects of inherent concentration fluctuations of substrate-molecules on the seminal Michaelis-Menten scheme of biochemical reactions. We present a universal correction to the Michaelis-Menten equation for the reaction-rates. The relevance and validity of this correction for enzymatic reactions and intracellular gene-regulation is demonstrated. Our analytical theory and simulation results confirm that the proposed variance-corrected Michaelis-Menten equation predicts the rate of reactions with remarkable accuracy even in the presence of large non-equilibrium concentration fluctuations. The major advantage of our approach is that it involves only the mean and variance of the substrate-molecule concentration. Our theory is therefore accessible to experiments and not specific to the exact source of the concentration fluctuations.
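The abstract states that the correction depends only on the mean and variance of the substrate concentration but does not give the formula. One standard way to obtain a correction of that kind is a second-order Taylor expansion of the Michaelis-Menten rate around the mean concentration. The sketch below implements that expansion as an assumption; it may differ from the published equation, and the function and parameter names are hypothetical.

```python
def michaelis_menten(s, vmax, km):
    """Classic Michaelis-Menten rate v = vmax * s / (km + s)."""
    return vmax * s / (km + s)


def variance_corrected_rate(mean_s, var_s, vmax, km):
    """Second-order (delta-method) estimate of the mean rate E[v(S)].

    With v''(s) = -2 * vmax * km / (km + s)**3, the expansion gives
    E[v(S)] ~ v(mean_s) + 0.5 * v''(mean_s) * var_s.
    This form is an assumption of the sketch, not a quoted result.
    """
    correction = vmax * km * var_s / (km + mean_s) ** 3
    return michaelis_menten(mean_s, vmax, km) - correction


if __name__ == "__main__":
    # Illustrative values in arbitrary units.
    print(michaelis_menten(1.0, vmax=1.0, km=1.0))
    print(variance_corrected_rate(1.0, 0.5, vmax=1.0, km=1.0))
```

Because the Michaelis-Menten curve is concave in the substrate concentration, fluctuations lower the mean rate under this approximation, and the correction vanishes when the variance is zero.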
Affiliation(s)
- Otto Pulkkinen
- Department of Physics, Tampere University of Technology, FI-33101 Tampere, Finland
- Ralf Metzler
- Department of Physics, Tampere University of Technology, FI-33101 Tampere, Finland
- Institute for Physics & Astronomy, University of Potsdam, D-14476 Potsdam-Golm, Germany
42
Tuğrul M, Paixão T, Barton NH, Tkačik G. Dynamics of Transcription Factor Binding Site Evolution. PLoS Genet 2015; 11:e1005639. [PMID: 26545200 PMCID: PMC4636380 DOI: 10.1371/journal.pgen.1005639] [Citation(s) in RCA: 68] [Impact Index Per Article: 7.6] [Received: 06/17/2015] [Accepted: 10/09/2015] [Indexed: 11/19/2022] Open
Abstract
Evolution of gene regulation is crucial for our understanding of the phenotypic differences between species, populations and individuals. Sequence-specific binding of transcription factors to the regulatory regions on the DNA is a key regulatory mechanism that determines gene expression and hence heritable phenotypic variation. We use a biophysical model for directional selection on gene expression to estimate the rates of gain and loss of transcription factor binding sites (TFBS) in finite populations under both point and insertion/deletion mutations. Our results show that these rates are typically slow for a single TFBS in an isolated DNA region, unless the selection is extremely strong. These rates decrease drastically with increasing TFBS length or increasingly specific protein-DNA interactions, making the evolution of sites longer than ∼ 10 bp unlikely on typical eukaryotic speciation timescales. Similarly, evolution converges to the stationary distribution of binding sequences very slowly, making the equilibrium assumption questionable. The availability of longer regulatory sequences in which multiple binding sites can evolve simultaneously, the presence of “pre-sites” or partially decayed old sites in the initial sequence, and biophysical cooperativity between transcription factors, can all facilitate gain of TFBS and reconcile theoretical calculations with timescales inferred from comparative genomics. Evolution has produced a remarkable diversity of living forms that manifests in qualitative differences as well as quantitative traits. An essential factor that underlies this variability is transcription factor binding sites, short pieces of DNA that control gene expression levels. Nevertheless, we lack a thorough theoretical understanding of the evolutionary times required for the appearance and disappearance of these sites. 
By combining a biophysically realistic model for how cells read out information in transcription factor binding sites with a model for DNA sequence evolution, we explore these timescales and ask what factors crucially affect them. We find that the emergence of binding sites from a random sequence is generically slow under point and insertion/deletion mutational mechanisms. Strong selection, sufficient genomic sequence in which the sites can evolve, the existence of partially decayed old binding sites in the sequence, as well as certain biophysical mechanisms such as cooperativity, can accelerate the binding site gain times and make them consistent with the timescales suggested by comparative analyses of genomic data.
Affiliation(s)
- Murat Tuğrul
- Institute of Science and Technology Austria, Klosterneuburg, Austria
- Tiago Paixão
- Institute of Science and Technology Austria, Klosterneuburg, Austria
- Gašper Tkačik
- Institute of Science and Technology Austria, Klosterneuburg, Austria
43
Chen L, Zheng QC, Zhang HX. Insights into the effects of mutations on Cren7-DNA binding using molecular dynamics simulations and free energy calculations. Phys Chem Chem Phys 2015; 17:5704-11. [PMID: 25622968 DOI: 10.1039/c4cp05413j] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 02/03/2023]
Abstract
A novel, highly conserved chromatin protein, Cren7, is involved in regulating essential cellular processes such as transcription, replication and repair. Although mutations in the DNA-binding loop of Cren7 destabilize the structure and reduce DNA-binding activity, the details are not very clear. Focusing on the specific Cren7-dsDNA complex (PDB code ), we applied molecular dynamics (MD) simulations and the molecular mechanics Poisson-Boltzmann surface area (MM-PBSA) free energy calculations to explore the structural and dynamic effects of W26A, L28A, and K53A mutations in comparison to the wild-type protein. The energetic analysis indicated that the intermolecular van der Waals interaction and nonpolar solvation energy play an important role in the binding process of Cren7 and dsDNA. Compared with the wild-type Cren7, all the studied mutants W26A, L28A, and K53A have obviously reduced binding free energies with dsDNA in the reduction of the polar and/or nonpolar interactions. These results further elucidated the previous experiments to understand the Cren7-DNA interaction comprehensively. Our work would also provide support for an understanding of the interactions of proteins with nucleic acids.
Affiliation(s)
- Lin Chen
- International Joint Research Laboratory of Nano-Micro Architecture Chemistry, State Key Laboratory of Theoretical and Computational Chemistry, Institute of Theoretical Chemistry, Jilin University, Changchun 130023, P. R. China.
44
Simple Biophysical Model Predicts Faster Accumulation of Hybrid Incompatibilities in Small Populations Under Stabilizing Selection. Genetics 2015; 201:1525-37. [PMID: 26434721 PMCID: PMC4676520 DOI: 10.1534/genetics.115.181685] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 08/05/2015] [Accepted: 09/23/2015] [Indexed: 01/07/2023] Open
Abstract
Speciation is fundamental to the process of generating the huge diversity of life on Earth. However, we are yet to have a clear understanding of its molecular-genetic basis. Here, we examine a computational model of reproductive isolation that explicitly incorporates a map from genotype to phenotype based on the biophysics of protein–DNA binding. In particular, we model the binding of a protein transcription factor to a DNA binding site and how their independent coevolution, in a stabilizing fitness landscape, of two allopatric lineages leads to incompatibilities. Complementing our previous coarse-grained theoretical results, our simulations give a new prediction for the monomorphic regime of evolution that smaller populations should develop incompatibilities more quickly. This arises as (1) smaller populations have a greater initial drift load, as there are more sequences that bind poorly than well, so fewer substitutions are needed to reach incompatible regions of phenotype space, and (2) slower divergence when the population size is larger than the inverse of discrete differences in fitness. Further, we find longer sequences develop incompatibilities more quickly at small population sizes, but more slowly at large population sizes. The biophysical model thus represents a robust mechanism of rapid reproductive isolation for small populations and large sequences that does not require peak shifts or positive selection. Finally, we show that the growth of DMIs with time is quadratic for small populations, agreeing with Orr’s model, but nonpower law for large populations, with a form consistent with our previous theoretical results.
45
Clifford J, Adami C. Discovery and information-theoretic characterization of transcription factor binding sites that act cooperatively. Phys Biol 2015; 12:056004. [PMID: 26331781 DOI: 10.1088/1478-3975/12/5/056004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/28/2023]
Abstract
Transcription factor binding to the surface of DNA regulatory regions is one of the primary causes of regulating gene expression levels. A probabilistic approach to model protein-DNA interactions at the sequence level is through position weight matrices (PWMs) that estimate the joint probability of a DNA binding site sequence by assuming positional independence within the DNA sequence. Here we construct conditional PWMs that depend on the motif signatures in the flanking DNA sequence, by conditioning known binding site loci on the presence or absence of additional binding sites in the flanking sequence of each site's locus. Pooling known sites with similar flanking sequence patterns allows for the estimation of the conditional distribution function over the binding site sequences. We apply our model to the Dorsal transcription factor binding sites active in patterning the Dorsal-Ventral axis of Drosophila development. We find that those binding sites that cooperate with nearby Twist sites on average contain about 0.5 bits of information about the presence of Twist transcription factor binding sites in the flanking sequence. We also find that Dorsal binding site detectors conditioned on flanking sequence information make better predictions about what is a Dorsal site relative to background DNA than detection without information about flanking sequence features.
Affiliation(s)
- Jacob Clifford
- Department of Physics and Astronomy, Michigan State University, East Lansing, MI, USA. BEACON Center for the Study of Evolution in Action, Michigan State University, East Lansing, MI, USA
46
A Biophysical Approach to Predicting Protein-DNA Binding Energetics. Genetics 2015; 200:1349-61. [PMID: 26081193 DOI: 10.1534/genetics.115.178384] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/16/2014] [Accepted: 06/10/2015] [Indexed: 11/18/2022] Open
Abstract
Sequence-specific interactions between proteins and DNA play a central role in DNA replication, repair, recombination, and control of gene expression. These interactions can be studied in vitro using microfluidics, protein-binding microarrays (PBMs), and other high-throughput techniques. Here we develop a biophysical approach to predicting protein-DNA binding specificities from high-throughput in vitro data. Our algorithm, called BindSter, can model alternative DNA-binding modes and multiple protein species competing for access to DNA, while rigorously taking into account all sterically allowed configurations of DNA-bound factors. BindSter can be used with a hierarchy of protein-DNA interaction models of increasing complexity, including contributions of mononucleotides, dinucleotides, and longer words to the total protein-DNA binding energy. We observe that the quality of BindSter predictions does not change significantly as some of the energy parameters vary over a sizable range. To take this degeneracy into account, we have developed a graphical representation of parameter uncertainties called IntervalLogo. We find that our simplest model, in which each nucleotide in the binding site is treated independently, performs better than previous biophysical approaches. The extensions of this model, in which contributions of longer words are also considered, result in further improvements, underscoring the importance of higher-order effects in protein-DNA energetics. In contrast, we find little evidence of multiple binding modes for the transcription factors (TFs) and experimental conditions in our data set. Furthermore, there is limited consistency in predictions for the same TF based on microfluidics and PBM data.
47
An adiabatic quantum algorithm and its application to DNA motif model discovery. Inf Sci (N Y) 2015. [DOI: 10.1016/j.ins.2014.10.057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
48
Guzina J, Djordjevic M. Inferring bacteriophage infection strategies from genome sequence: analysis of bacteriophage 7-11 and related phages. BMC Evol Biol 2015; 15 Suppl 1:S1. [PMID: 25708710 PMCID: PMC4331800 DOI: 10.1186/1471-2148-15-s1-s1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 01/04/2023] Open
Abstract
Background: Analyzing regulation of bacteriophage gene expression historically led to establishing major paradigms of molecular biology, and may provide important medical applications in the future. Temporal regulation of bacteriophage transcription is commonly analyzed through a labor-intensive combination of biochemical and bioinformatic approaches and macroarray measurements. We here investigate to what extent one can understand gene expression strategies of lytic phages by directly analyzing their genomes through bioinformatic methods. We address this question on a recently sequenced lytic bacteriophage 7-11 that infects the bacterium Salmonella enterica. Results: We identify novel promoters for the bacteriophage-encoded σ factor, and test the predictions through homology with another bacteriophage (phiEco32) that has been experimentally characterized in detail. Interestingly, the standard approach based on multiple local sequence alignment (MLSA) fails to correctly identify the promoters, but a simpler procedure that is based on pairwise alignment of intergenic regions identifies the desired motifs; we argue that such a search strategy is more effective for promoters of bacteriophage-encoded σ factors that are typically well conserved but appear in low copy numbers, which we also verify on two additional bacteriophage genomes. Identifying promoters for bacteriophage-encoded σ factors, together with a more straightforward identification of promoters for the bacterial-encoded σ factor, allows clustering the genes into putative early, middle and late classes, and consequently predicting the temporal regulation of bacteriophage gene expression, which we demonstrate on phage 7-11. Conclusions: While MLSA algorithms proved highly useful in computational analysis of transcription regulation, we here established that a simpler procedure is more successful for identifying promoters that are recognized by bacteriophage-encoded σ factor/RNA polymerase. We here used this approach for predicting sequence specificity of a novel (bacteriophage-encoded) σ factor, and consequently inferring phage 7-11 transcription strategy. Therefore, direct analysis of bacteriophage genome sequences is a plausible first-line approach for efficiently inferring phage transcription strategies, and may provide a wealth of information on transcription initiation by diverse σ factors/RNA polymerases.
49
Maaskola J, Rajewsky N. Binding site discovery from nucleic acid sequences by discriminative learning of hidden Markov models. Nucleic Acids Res 2014; 42:12995-3011. [PMID: 25389269 PMCID: PMC4245949 DOI: 10.1093/nar/gku1083] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Indexed: 11/13/2022] Open
Abstract
We present a discriminative learning method for pattern discovery of binding sites in nucleic acid sequences based on hidden Markov models. Sets of positive and negative example sequences are mined for sequence motifs whose occurrence frequency varies between the sets. The method offers several objective functions, but we concentrate on mutual information of condition and motif occurrence. We perform a systematic comparison of our method and numerous published motif-finding tools. Our method achieves the highest motif discovery performance, while being faster than most published methods. We present case studies of data from various technologies, including ChIP-Seq, RIP-Chip and PAR-CLIP, of embryonic stem cell transcription factors and of RNA-binding proteins, demonstrating practicality and utility of the method. For the alternative splicing factor RBM10, our analysis finds motifs known to be splicing-relevant. The motif discovery method is implemented in the free software package Discrover. It is applicable to genome- and transcriptome-scale data, makes use of available repeat experiments and aside from binary contrasts also more complex data configurations can be utilized.
Affiliation(s)
- Jonas Maaskola
- Laboratory for Systems Biology of Gene Regulatory Elements, Max-Delbrück-Center for Molecular Medicine, Robert-Rössle-Strasse 10, Berlin-Buch 13125, Germany
- Nikolaus Rajewsky
- Laboratory for Systems Biology of Gene Regulatory Elements, Max-Delbrück-Center for Molecular Medicine, Robert-Rössle-Strasse 10, Berlin-Buch 13125, Germany
50
High-resolution specificity from DNA sequencing highlights alternative modes of Lac repressor binding. Genetics 2014; 198:1329-43. [PMID: 25209146 DOI: 10.1534/genetics.114.170100] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Indexed: 12/29/2022] Open
Abstract
Knowing the specificity of transcription factors is critical to understanding regulatory networks in cells. The lac repressor-operator system has been studied for many years, but not with high-throughput methods capable of determining specificity comprehensively. Details of its binding interaction and its selection of an asymmetric binding site have been controversial. We employed a new method to accurately determine relative binding affinities to thousands of sequences simultaneously, requiring only sequencing of bound and unbound fractions. An analysis of 2560 different DNA sequence variants, including both base changes and variations in operator length, provides a detailed view of lac repressor sequence specificity. We find that the protein can bind with nearly equal affinities to operators of three different lengths, but the sequence preference changes depending on the length, demonstrating alternative modes of interaction between the protein and DNA. The wild-type operator has an odd length, causing the two monomers to bind in alternative modes, making the asymmetric operator the preferred binding site. We tested two other members of the LacI/GalR protein family and find that neither can bind with high affinity to sites with alternative lengths or shows evidence of alternative binding modes. A further comparison with known and predicted motifs suggests that the lac repressor may be unique in this ability and that this may contribute to its selection.