1
Banerjee S, Zhu H, Tang M, Feng WC, Wu X, Xie H. Identifying Transcriptional Regulatory Modules Among Different Chromatin States in Mouse Neural Stem Cells. Front Genet 2019; 9:731. [PMID: 30697231 PMCID: PMC6341026 DOI: 10.3389/fgene.2018.00731] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 08/27/2018] [Accepted: 12/22/2018] [Indexed: 12/19/2022] Open
Abstract
Gene expression regulation is a complex process involving the interplay between transcription factors and chromatin states. Significant progress has been made toward understanding the impact of chromatin states on gene expression. Nevertheless, the mechanism by which transcription factors bind combinatorially in different chromatin states to enable selective regulation of gene expression remains an interesting research area. We introduce a nonparametric Bayesian clustering method for inhomogeneous Poisson processes to detect heterogeneous binding patterns of multiple proteins, including transcription factors, that form regulatory modules in different chromatin states. We applied this approach to ChIP-seq data for mouse neural stem cells containing 21 proteins and observed different groups or modules of proteins clustered within different chromatin states. These chromatin-state-specific regulatory modules were found to have significant influence on gene expression. We also observed different motif preferences for certain TFs between different chromatin states. Our results reveal a degree of interdependency between chromatin states and combinatorial binding of proteins in the complex transcriptional regulatory process. The software package is available on GitHub at https://github.com/BSharmi/DPM-LGCP.
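To make the term "inhomogeneous Poisson process" concrete, the sketch below simulates binding-site positions whose rate varies along a region, using the standard thinning technique. It is not the clustering method of the paper and is not taken from the DPM-LGCP package; the function names, the Gaussian-shaped intensity, and all numeric values are assumptions chosen for illustration.

```python
import math
import random


def simulate_inhomogeneous_poisson(rate, rate_max, start, end, rng):
    """Draw event positions on [start, end) by thinning.

    Candidates come from a homogeneous process with rate `rate_max`;
    each candidate at x is kept with probability rate(x) / rate_max.
    Requires rate(x) <= rate_max everywhere on the interval.
    """
    events = []
    x = start
    while True:
        x += rng.expovariate(rate_max)  # gap to the next candidate
        if x >= end:
            return events
        if rng.random() < rate(x) / rate_max:
            events.append(x)


def peak_intensity(x, center=500.0, width=50.0, height=0.2, background=0.01):
    """Hypothetical intensity: flat background plus one Gaussian-shaped peak."""
    return background + height * math.exp(-0.5 * ((x - center) / width) ** 2)


rng = random.Random(0)
# 0.21 = background + height, the maximum of peak_intensity.
sites = simulate_inhomogeneous_poisson(peak_intensity, 0.21, 0.0, 1000.0, rng)
near_peak = sum(1 for s in sites if 400.0 <= s < 600.0)
print(len(sites), near_peak)
```

Most simulated positions fall near the peak at 500, which is the kind of position-dependent binding density that a clustering method over such processes would group by shape.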
Affiliation(s)
- Sharmi Banerjee: Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, United States; Biocomplexity Institute of Virginia Tech, Blacksburg, VA, United States
- Hongxiao Zhu: Department of Statistics, Virginia Tech, Blacksburg, VA, United States
- Man Tang: Department of Statistics, Virginia Tech, Blacksburg, VA, United States
- Wu-Chun Feng: Department of Computer Science, Virginia Tech, Blacksburg, VA, United States
- Xiaowei Wu: Department of Statistics, Virginia Tech, Blacksburg, VA, United States
- Hehuang Xie: Biocomplexity Institute of Virginia Tech, Blacksburg, VA, United States; Department of Biomedical Sciences and Pathobiology, Virginia-Maryland College of Veterinary Medicine, Blacksburg, VA, United States; Department of Biological Sciences, Virginia Tech, Blacksburg, VA, United States; School of Neuroscience, Virginia Tech, Blacksburg, VA, United States
2
Levitsky VG, Oshchepkov DY, Klimova NV, Ignatieva EV, Vasiliev GV, Merkulov VM, Merkulova TI. Hidden heterogeneity of transcription factor binding sites: A case study of SF-1. Comput Biol Chem 2016; 64:19-32. [PMID: 27235721 DOI: 10.1016/j.compbiolchem.2016.04.008] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 08/19/2015] [Revised: 04/19/2016] [Accepted: 04/19/2016] [Indexed: 01/15/2023]
Abstract
Steroidogenic factor 1 (SF-1) belongs to a small group of transcription factors that bind DNA only as a monomer. Three different approaches (Sitecon, SiteGA, and oPWM), constructed using the same training sample of experimentally confirmed SF-1 binding sites, have been used to recognize these sites. The appropriate prediction thresholds for the recognition models have been selected. Namely, thresholds concordant by false positive or false negative rates across the methods were used to optimize the discrimination of steroidogenic gene promoters from datasets of non-specific promoters. After experimental verification, the models were used to analyze the ChIP-seq data for SF-1. It has been shown that the sets of sites recognized by different models overlap only partially and that an integration of these models allows for identification of SF-1 sites in up to 80% of the ChIP-seq loci. The structures of the sites detected using the three recognition models in the ChIP-seq peaks falling within the [-5000, +5000] region relative to the transcription start sites (TSS) extracted from the FANTOM5 project have been analyzed. MATLIGN classified the frequency matrices for the sites predicted by oPWM, Sitecon, and SiteGA into two groups. The first group is described by oPWM/Sitecon and the second by SiteGA. Gene ontology (GO) analysis has been used to clarify the differences between the sets of genes carrying different variants of SF-1 binding sites. Although this analysis in general revealed a considerable overlap in GO terms for the genes carrying the binding sites predicted by oPWM, Sitecon, or SiteGA, only the last method showed a notable trend toward terms related to negative regulation and apoptosis. The results suggest that the SF-1 binding sites predicted by oPWM+Sitecon and by SiteGA differ both in their structure and in the functional annotation of their target genes.
Further application of the Homer software for de novo identification of enriched motifs in the SF-1 ChIP-seq dataset gave results similar to those of oPWM+Sitecon.
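The idea of thresholds that are concordant by false positive rate can be illustrated with a small sketch: each model scores the same negative set on its own scale, and each gets the threshold that lets through at most a chosen fraction of negatives. This is only one plausible reading of the procedure; the function names and all scores are invented, and none of this reproduces the Sitecon, SiteGA, or oPWM scoring.

```python
def threshold_at_fpr(negative_scores, target_fpr):
    """Return t such that the share of negatives scoring strictly above t
    is at most target_fpr."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # negatives allowed to pass
    if allowed >= len(ranked):
        return float("-inf")  # every negative may pass
    return ranked[allowed]


def false_positive_rate(negative_scores, threshold):
    """Fraction of negatives that a 'score > threshold' rule would call sites."""
    return sum(s > threshold for s in negative_scores) / len(negative_scores)


# Hypothetical scores of two models on the same ten non-specific promoters.
model_a_neg = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
model_b_neg = [5, 12, 7, 30, 22, 9, 15, 3, 18, 26]

t_a = threshold_at_fpr(model_a_neg, 0.2)
t_b = threshold_at_fpr(model_b_neg, 0.2)
print(t_a, false_positive_rate(model_a_neg, t_a))
print(t_b, false_positive_rate(model_b_neg, t_b))
```

The two thresholds are numerically unrelated, yet both models operate at the same false positive rate, so their predictions can be compared on equal footing.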
Affiliation(s)
- V G Levitsky: Federal State Research Center Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia; Novosibirsk State University, Novosibirsk, Russia
- D Yu Oshchepkov: Federal State Research Center Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia
- N V Klimova: Federal State Research Center Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia
- E V Ignatieva: Federal State Research Center Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia; Novosibirsk State University, Novosibirsk, Russia
- G V Vasiliev: Federal State Research Center Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia
- V M Merkulov: Federal State Research Center Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia
- T I Merkulova: Federal State Research Center Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia; Novosibirsk State University, Novosibirsk, Russia
3
Salas EN, Shu J, Cserhati MF, Weeks DP, Ladunga I. Pluralistic and stochastic gene regulation: examples, models and consistent theory. Nucleic Acids Res 2016; 44:4595-609. [PMID: 26823500 PMCID: PMC4889914 DOI: 10.1093/nar/gkw042] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/03/2015] [Accepted: 01/12/2016] [Indexed: 12/17/2022] Open
Abstract
We present a theory of pluralistic and stochastic gene regulation. To bridge the gap between empirical studies and mathematical models, we integrate pre-existing observations with our meta-analyses of the ENCODE ChIP-Seq experiments. Earlier evidence includes fluctuations in levels, location, activity, and binding of transcription factors, variable DNA motifs, and bursts in gene expression. Stochastic regulation is also indicated by frequently subdued effects of knockout mutants of regulators, their evolutionary losses/gains and massive rewiring of regulatory sites. We report widespread pluralistic regulation in ≈800 000 tightly co-expressed pairs of diverse human genes. Typically, half of ≈50 observed regulators bind to both genes reproducibly, twice as many as in independently expressed gene pairs. We also examine the largest set of co-expressed genes, which code for cytoplasmic ribosomal proteins. Numerous regulatory complexes are highly significantly enriched in ribosomal genes compared to highly expressed non-ribosomal genes. We could not find any DNA-associated, strict-sense master regulator. Despite major fluctuations in transcription factor binding, our machine learning model accurately predicted transcript levels using binding sites of 20+ regulators. Our pluralistic and stochastic theory is consistent with partially random binding patterns, redundancy, stochastic regulator binding, burst-like expression, degeneracy of binding motifs and massive regulatory rewiring during evolution.
Affiliation(s)
- Elisa N Salas: Department of Statistics, University of Nebraska, Lincoln, NE 68583-0963, USA; Department of Biochemistry, University of Nebraska, Lincoln, NE 68588-0665, USA
- Jiang Shu: Department of Statistics, University of Nebraska, Lincoln, NE 68583-0963, USA
- Matyas F Cserhati: Department of Statistics, University of Nebraska, Lincoln, NE 68583-0963, USA
- Donald P Weeks: Department of Biochemistry, University of Nebraska, Lincoln, NE 68588-0665, USA
- Istvan Ladunga: Department of Statistics, University of Nebraska, Lincoln, NE 68583-0963, USA; Department of Biochemistry, University of Nebraska, Lincoln, NE 68588-0665, USA
4
Cofunctional Subpathways Were Regulated by Transcription Factor with Common Motif, Common Family, or Common Tissue. BioMed Research International 2015; 2015:780357. [PMID: 26688819 PMCID: PMC4672121 DOI: 10.1155/2015/780357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2015] [Revised: 11/02/2015] [Accepted: 11/04/2015] [Indexed: 11/17/2022]
Abstract
Dissecting the characteristics of transcription factor (TF) regulatory subpathways helps in understanding the underlying regulatory function of TFs in complex biological systems. To gain insight into the influence of TFs on their regulatory subpathways, we constructed a global TF-subpathways network (TSN) to analyze systematically the regulatory effect of common-motif, common-family, or common-tissue TFs on subpathways. We performed cluster analysis to show that the common-motif, common-family, or common-tissue TFs that regulated the same pathway classes tended to cluster together and contribute to the same biological function that led to disease initiation and progression. We analyzed the Jaccard coefficient to show that the functional consistency of subpathways regulated by TF pairs with common motif, common family, or common tissue was significantly greater than that of random TF pairs at the subpathway level, pathway level, and pathway class level. For example, HNF4A (hepatocyte nuclear factor 4, alpha) and NR1I3 (nuclear receptor subfamily 1, group I, member 3) were a pair of TFs with common motif, common family, and common tissue. They were involved in drug metabolism pathways and were liver-specific factors required for physiological transcription. In short, we inferred that the cofunctional subpathways were regulated by common-motif, common-family, or common-tissue TFs.
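The Jaccard coefficient used above is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; the subpathway identifiers are placeholders and are not the actual targets of HNF4A or NR1I3.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical subpathway sets regulated by two TFs.
tf1_subpathways = {"path_1", "path_2", "path_3"}
tf2_subpathways = {"path_2", "path_3", "path_4"}

print(jaccard(tf1_subpathways, tf2_subpathways))  # 2 shared / 4 total = 0.5
```

Comparing this value for TF pairs that share a motif, family, or tissue against the values for randomly drawn TF pairs is the kind of contrast the abstract describes.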
5
Clifford J, Adami C. Discovery and information-theoretic characterization of transcription factor binding sites that act cooperatively. Phys Biol 2015; 12:056004. [PMID: 26331781 DOI: 10.1088/1478-3975/12/5/056004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/28/2023]
Abstract
Transcription factor binding to the surface of DNA regulatory regions is one of the primary mechanisms regulating gene expression levels. A probabilistic approach to model protein-DNA interactions at the sequence level is through position weight matrices (PWMs) that estimate the joint probability of a DNA binding site sequence by assuming positional independence within the DNA sequence. Here we construct conditional PWMs that depend on the motif signatures in the flanking DNA sequence, by conditioning known binding site loci on the presence or absence of additional binding sites in the flanking sequence of each site's locus. Pooling known sites with similar flanking sequence patterns allows for the estimation of the conditional distribution function over the binding site sequences. We apply our model to the Dorsal transcription factor binding sites active in patterning the Dorsal-Ventral axis of Drosophila development. We find that those binding sites that cooperate with nearby Twist sites on average contain about 0.5 bits of information about the presence of Twist transcription factor binding sites in the flanking sequence. We also find that Dorsal binding site detectors conditioned on flanking sequence information make better predictions about what is a Dorsal site relative to background DNA than detection without information about flanking sequence features.
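The positional-independence assumption behind a PWM means the probability of a site is the product of per-position base probabilities, so a log-odds score is a sum over positions. The sketch below builds an ordinary (unconditional) PWM from a few aligned sites and scores sequences against a uniform background. The toy sites, the pseudocount, and the function names are assumptions for illustration; these are not Dorsal sites and this is not the conditional model of the paper.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Per-position base probabilities from equal-length aligned sites."""
    pwm = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm


def log_odds(pwm, seq, background=0.25):
    """Sum of per-position log2 ratios; valid because positions are
    treated as independent."""
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, seq))


toy_sites = ["ACGT", "ACGA", "ACGT", "TCGT"]  # hypothetical aligned sites
pwm = build_pwm(toy_sites)
print(log_odds(pwm, "ACGT"), log_odds(pwm, "TTTT"))
```

A conditional PWM in the sense of the abstract would estimate a separate matrix from the subset of sites that share a flanking feature, for example the presence of a nearby partner site.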
Affiliation(s)
- Jacob Clifford: Department of Physics and Astronomy, Michigan State University, East Lansing, MI, USA; BEACON Center for the Study of Evolution in Action, Michigan State University, East Lansing, MI, USA
6
Jankowski A, Prabhakar S, Tiuryn J. TACO: a general-purpose tool for predicting cell-type-specific transcription factor dimers. BMC Genomics 2014; 15:208. [PMID: 24640962 PMCID: PMC4004051 DOI: 10.1186/1471-2164-15-208] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 12/10/2013] [Accepted: 03/07/2014] [Indexed: 12/22/2022] Open
Abstract
Background: Cooperative binding of transcription factor (TF) dimers to DNA is increasingly recognized as a major contributor to binding specificity. However, it is likely that the set of known TF dimers is highly incomplete, given that they were discovered using ad hoc approaches, or through computational analyses of limited datasets. Results: Here, we present TACO (Transcription factor Association from Complex Overrepresentation), a general-purpose standalone software tool that takes as input any genome-wide set of regulatory elements and predicts cell-type–specific TF dimers based on enrichment of motif complexes. TACO is the first tool that can accommodate motif complexes composed of overlapping motifs, a characteristic feature of many known TF dimers. Our method comprehensively outperforms existing tools when benchmarked on a reference set of 29 known dimers. We demonstrate the utility and consistency of TACO by applying it to 152 DNase-seq datasets and 94 ChIP-seq datasets. Conclusions: Based on these results, we uncover a general principle governing the structure of TF-TF-DNA ternary complexes, namely that the flexibility of the complex is correlated with, and most likely a consequence of, inter-motif spacing.
Affiliation(s)
- Shyam Prabhakar: Computational and Systems Biology, Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672, Singapore
7
Jiang P, Singh M. CCAT: Combinatorial Code Analysis Tool for transcriptional regulation. Nucleic Acids Res 2013; 42:2833-47. [PMID: 24366875 PMCID: PMC3950699 DOI: 10.1093/nar/gkt1302] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 01/08/2023] Open
Abstract
Combinatorial interplay among transcription factors (TFs) is an important mechanism by which transcriptional regulatory specificity is achieved. However, despite the increasing number of TFs for which either binding specificities or genome-wide occupancy data are known, knowledge about cooperativity between TFs remains limited. To address this, we developed a computational framework for predicting genome-wide co-binding between TFs (CCAT, Combinatorial Code Analysis Tool), and applied it to Drosophila melanogaster to uncover cooperativity among TFs during embryo development. Using publicly available TF binding specificity data and DNaseI chromatin accessibility data, we first predicted genome-wide binding sites for 324 TFs across five stages of D. melanogaster embryo development. We then applied CCAT in each of these developmental stages, and identified from 19 to 58 pairs of TFs in each stage whose predicted binding sites are significantly co-localized. We found that nearby binding sites for pairs of TFs predicted to cooperate were enriched in regions bound in relevant ChIP experiments, and were more evolutionarily conserved than other pairs. Further, we found that TFs tend to be co-localized with other TFs in a dynamic manner across developmental stages. All generated data as well as source code for our front-to-end pipeline are available at http://cat.princeton.edu.
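One generic way to ask whether the binding sites of two TFs are significantly co-localized is a hypergeometric tail test on the number of shared regions. The sketch below shows that test only as an illustration; the abstract does not state which statistic CCAT uses, and the function name and all counts here are assumptions.

```python
from math import comb


def hypergeom_sf(k, n_regions, n_a, n_b):
    """P(overlap >= k) when n_b regions are drawn at random from n_regions,
    of which n_a are bound by TF A."""
    upper = min(n_a, n_b)
    favorable = sum(
        comb(n_a, i) * comb(n_regions - n_a, n_b - i) for i in range(k, upper + 1)
    )
    return favorable / comb(n_regions, n_b)


# Hypothetical counts: 1000 accessible regions, TF A in 50, TF B in 40,
# both in 20. The expected overlap under independence is 50 * 40 / 1000 = 2.
p_value = hypergeom_sf(20, 1000, 50, 40)
print(p_value)
```

When many TF pairs are tested, as with hundreds of TFs across several stages, the resulting p-values would need a multiple-testing correction before pairs are called significant.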
Affiliation(s)
- Peng Jiang: Department of Computer Science, Princeton University, Princeton, 08540 NJ, USA and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, 08544 NJ, USA
8
Chan TM, Lo LY, Sze-To HY, Leung KS, Xiao X, Wong MH. Modeling associated protein-DNA pattern discovery with unified scores. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:696-707. [PMID: 24091402 DOI: 10.1109/tcbb.2013.60] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
Understanding protein-DNA interactions, specifically transcription factor (TF) and transcription factor binding site (TFBS) bindings, is crucial in deciphering gene regulation. The recent associated TF-TFBS pattern discovery combines one-sided motif discovery on both the TF and the TFBS sides. Using sequences only, it identifies the short protein-DNA binding cores available only in high-resolution 3D structures. The discovered patterns lead to promising subtype and disease analysis applications. While the related studies use either association rule mining or existing TFBS annotations, none has proposed any formal unified (both-sided) model to prioritize the top verifiable associated patterns. We propose the unified scores and develop an effective pipeline for associated TF-TFBS pattern discovery. Our stringent instance-level evaluations show that the patterns with the top unified scores match with the binding cores in 3D structures considerably better than the previous works, where up to 90 percent of the top 20 scored patterns are verified. We also introduce extended verification from literature surveys, where the high unified scores correspond to even higher verification percentage. The top scored patterns are confirmed to match the known WRKY binding cores with no available 3D structures and agree well with the top binding affinities of in vivo experiments.
9
Jankowski A, Szczurek E, Jauch R, Tiuryn J, Prabhakar S. Comprehensive prediction in 78 human cell lines reveals rigidity and compactness of transcription factor dimers. Genome Res 2013; 23:1307-18. [PMID: 23554463 PMCID: PMC3730104 DOI: 10.1101/gr.154922.113] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Indexed: 01/22/2023]
Abstract
The binding of transcription factors (TFs) to their specific motifs in genomic regulatory regions is commonly studied in isolation. However, in order to elucidate the mechanisms of transcriptional regulation, it is essential to determine which TFs bind DNA cooperatively as dimers and to infer the precise nature of these interactions. So far, only a small number of such dimeric complexes are known. Here, we present an algorithm for predicting cell-type–specific TF–TF dimerization on DNA on a large scale, using DNase I hypersensitivity data from 78 human cell lines. We represented the universe of possible TF complexes by their corresponding motif complexes, and analyzed their occurrence at cell-type–specific DNase I hypersensitive sites. Based on ∼1.4 billion tests for motif complex enrichment, we predicted 603 highly significant cell-type–specific TF dimers, the vast majority of which are novel. Our predictions included 76% (19/25) of the known dimeric complexes and showed significant overlap with an experimental database of protein–protein interactions. They were also independently supported by evolutionary conservation, as well as quantitative variation in DNase I digestion patterns. Notably, the known and predicted TF dimers were almost always highly compact and rigidly spaced, suggesting that TFs dimerize in close proximity to their partners, which results in strict constraints on the structure of the DNA-bound complex. Overall, our results indicate that chromatin openness profiles are highly predictive of cell-type–specific TF–TF interactions. Moreover, cooperative TF dimerization seems to be a widespread phenomenon, with multiple TF complexes predicted in most cell types.
Affiliation(s)
- Aleksander Jankowski: Computational and Systems Biology, Genome Institute of Singapore, Singapore 138672, Singapore
10
Narlikar L. MuMoD: a Bayesian approach to detect multiple modes of protein-DNA binding from genome-wide ChIP data. Nucleic Acids Res 2012; 41:21-32. [PMID: 23093591 PMCID: PMC3592440 DOI: 10.1093/nar/gks950] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 01/15/2023] Open
Abstract
High-throughput chromatin immunoprecipitation has become the method of choice for identifying genomic regions bound by a protein. Such regions are then investigated for overrepresented sequence motifs, the assumption being that they must correspond to the binding specificity of the profiled protein. However, this approach often fails: many bound regions do not contain the 'expected' motif. This is because binding DNA directly at its recognition site is not the only way the protein can cause the region to immunoprecipitate. Its binding specificity can change through association with different co-factors, it can bind DNA indirectly through intermediaries, or it can even enforce its function through long-range chromosomal interactions. Conventional motif discovery methods, though largely capable of identifying overrepresented motifs from bound regions, lack the ability to characterize such diverse modes of protein-DNA binding and binding specificities. We present a novel Bayesian method that identifies distinct protein-DNA binding mechanisms without relying on any motif database. The method successfully identifies co-factors of proteins that do not bind DNA directly, such as mediator and p300. It also predicts literature-supported enhancer-promoter interactions. Even for well-studied direct-binding proteins, this method provides compelling evidence for previously uncharacterized dependencies within positions of binding sites, long-range chromosomal interactions and dimerization.
Affiliation(s)
- Leelavati Narlikar: Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune 411008, India
11
Chan TM, Leung KS, Lee KH, Wong MH, Lau TCK, Tsui SKW. Subtypes of associated protein-DNA (Transcription Factor-Transcription Factor Binding Site) patterns. Nucleic Acids Res 2012; 40:9392-403. [PMID: 22904079 PMCID: PMC3479201 DOI: 10.1093/nar/gks749] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 12/03/2022] Open
Abstract
In protein–DNA interactions, particularly transcription factor (TF) and transcription factor binding site (TFBS) bindings, associated residue variations form patterns denoted as subtypes. Subtypes may lead to changed binding preferences, distinguish conserved from flexible binding residues and reveal novel binding mechanisms. However, subtypes must be studied in the context of core bindings. While solving 3D structures would require huge experimental efforts, recent sequence-based associated TF-TFBS pattern discovery has been shown to be promising, making a large-scale subtype study both possible and desirable. In this article, we investigate residue-varying subtypes based on associated TF-TFBS patterns. By re-categorizing the patterns with respect to varying TF amino acids, statistically significant (P values ≤ 0.005) subtypes leading to varying TFBS patterns are discovered without using TF family or domain annotations. The resultant subtypes have various biological meanings. The subtypes reflect familial and functional properties and exhibit changed binding preferences supported by 3D structures. Conserved residues critical for maintaining TF-TFBS bindings are revealed by analyzing the subtypes. In-depth analysis of the subtype pair PKVVIL-CACGTG versus PKVEIL-CAGCTG shows that the V/E variation is indicative in distinguishing Myc from MRF families. Discovered from sequences only, the TF-TFBS subtypes are informative and promising for more biological findings, complementing and extending recent one-sided subtype and familial studies with comprehensive evidence.
Affiliation(s)
- Tak-Ming Chan: Department of Computer Science & Engineering, The Chinese University of Hong Kong, Shatin, N T, Hong Kong