1
|
Ni P, Wilson D, Su Z. A map of cis-regulatory modules and constituent transcription factor binding sites in 80% of the mouse genome. BMC Genomics 2022; 23:714. [PMID: 36261804 PMCID: PMC9583556 DOI: 10.1186/s12864-022-08933-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2022] [Accepted: 10/11/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Mouse is probably the most important model organism to study mammal biology and human diseases. A better understanding of the mouse genome will help understand the human genome, biology and diseases. However, despite the recent progress, the characterization of the regulatory sequences in the mouse genome is still far from complete, limiting its use to understand the regulatory sequences in the human genome. RESULTS Here, by integrating binding peaks in ~ 9,000 transcription factor (TF) ChIP-seq datasets that cover 79.9% of the mouse mappable genome using an efficient pipeline, we were able to partition these binding peak-covered genome regions into a cis-regulatory module (CRM) candidate (CRMC) set and a non-CRMC set. The CRMCs contain 912,197 putative CRMs and 38,554,729 TF binding sites (TFBSs) islands, covering 55.5% and 24.4% of the mappable genome, respectively. The CRMCs tend to be under strong evolutionary constraints, indicating that they are likely cis-regulatory; while the non-CRMCs are largely selectively neutral, indicating that they are unlikely cis-regulatory. Based on evolutionary profiles of the genome positions, we further estimated that 63.8% and 27.4% of the mouse genome might code for CRMs and TFBSs, respectively. CONCLUSIONS Validation using experimental data suggests that at least most of the CRMCs are authentic. Thus, this unprecedentedly comprehensive map of CRMs and TFBSs can be a good resource to guide experimental studies of regulatory genomes in mice and humans.
Collapse
Affiliation(s)
- Pengyu Ni
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
| | - David Wilson
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA.
| |
Collapse
|
2
|
Ni P, Su Z. PCRMS: a database of predicted cis-regulatory modules and constituent transcription factor binding sites in genomes. Database (Oxford) 2022; 2022:6572594. [PMID: 35452518 PMCID: PMC9216522 DOI: 10.1093/database/baac024] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2021] [Revised: 02/20/2022] [Accepted: 04/12/2022] [Indexed: 01/13/2023]
Abstract
More accurate and more complete predictions of cis-regulatory modules (CRMs) and constituent transcription factor (TF) binding sites (TFBSs) in genomes can facilitate characterizing functions of regulatory sequences. Here, we developed a database predicted cis-regulatory modules (PCRMS) (https://cci-bioinfo.uncc.edu) that stores highly accurate and unprecedentedly complete maps of predicted CRMs and TFBSs in the human and mouse genomes. The web interface allows the user to browse CRMs and TFBSs in an organism, find the closest CRMs to a gene, search CRMs around a gene and find all TFBSs of a TF. PCRMS can be a useful resource for the research community to characterize regulatory genomes. Database URL: https://cci-bioinfo.uncc.edu/.
Collapse
Affiliation(s)
- Pengyu Ni
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
| |
Collapse
|
3
|
Ni P, Su Z. Accurate prediction of cis-regulatory modules reveals a prevalent regulatory genome of humans. NAR Genom Bioinform 2021; 3:lqab052. [PMID: 34159315 PMCID: PMC8210889 DOI: 10.1093/nargab/lqab052] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2021] [Revised: 05/01/2021] [Accepted: 06/14/2021] [Indexed: 02/07/2023] Open
Abstract
cis-regulatory modules(CRMs) formed by clusters of transcription factor (TF) binding sites (TFBSs) are as important as coding sequences in specifying phenotypes of humans. It is essential to categorize all CRMs and constituent TFBSs in the genome. In contrast to most existing methods that predict CRMs in specific cell types using epigenetic marks, we predict a largely cell type agonistic but more comprehensive map of CRMs and constituent TFBSs in the gnome by integrating all available TF ChIP-seq datasets. Our method is able to partition 77.47% of genome regions covered by available 6092 datasets into a CRM candidate (CRMC) set (56.84%) and a non-CRMC set (43.16%). Intriguingly, the predicted CRMCs are under strong evolutionary constraints, while the non-CRMCs are largely selectively neutral, strongly suggesting that the CRMCs are likely cis-regulatory, while the non-CRMCs are not. Our predicted CRMs are under stronger evolutionary constraints than three state-of-the-art predictions (GeneHancer, EnhancerAtlas and ENCODE phase 3) and substantially outperform them for recalling VISTA enhancers and non-coding ClinVar variants. We estimated that the human genome might encode about 1.47M CRMs and 68M TFBSs, comprising about 55% and 22% of the genome, respectively; for both of which, we predicted 80%. Therefore, the cis-regulatory genome appears to be more prevalent than originally thought.
Collapse
Affiliation(s)
- Pengyu Ni
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
| |
Collapse
|
4
|
Meckbach C, Wingender E, Gültas M. Removing Background Co-occurrences of Transcription Factor Binding Sites Greatly Improves the Prediction of Specific Transcription Factor Cooperations. Front Genet 2018; 9:189. [PMID: 29896218 PMCID: PMC5986914 DOI: 10.3389/fgene.2018.00189] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2016] [Accepted: 05/08/2018] [Indexed: 12/17/2022] Open
Abstract
Today, it is well-known that in eukaryotic cells the complex interplay of transcription factors (TFs) bound to the DNA of promoters and enhancers is the basis for precise and specific control of transcription. Computational methods have been developed for the identification of potentially cooperating TFs through the co-occurrence of their binding sites (TFBSs). One challenge of these methods is the differentiation of TFBS pairs that are specific for a given sequence set from those that are ubiquitously appearing, rendering the results highly dependent on the choice of a proper background set. Here, we present an extension of our previous PC-TraFF approach that estimates the background co-occurrence of any TF pair by preserving the (oligo-) nucleotide composition and, thus, the core of TFBSs in the sequences of interest. Applying our approach to a simulated data set with implanted TFBS pairs, we could successfully identify them as sequence-set specific under a variety of conditions. When we analyzed the gene expression data sets of five breast cancer associated subtypes, the number of overlapping pairs could be dramatically reduced in comparison to our previous approach. As a result, we could identify potentially cooperating transcriptional regulators that are characteristic for each of the five breast cancer subtypes. This indicates that our approach is able to discriminate specific potential TF cooperations against ubiquitously occurring combinations. The results obtained with our method may help to understand the genetic programs governing specific biological processes such as the development of different tumor types.
Collapse
Affiliation(s)
- Cornelia Meckbach
- Institute of Bioinformatics, University Medical Center Göttingen, Georg-August-University Göttingen, Göttingen, Germany
| | - Edgar Wingender
- Institute of Bioinformatics, University Medical Center Göttingen, Georg-August-University Göttingen, Göttingen, Germany
| | - Mehmet Gültas
- Institute of Bioinformatics, University Medical Center Göttingen, Georg-August-University Göttingen, Göttingen, Germany.,Department of Breeding Informatics, Georg-August University Göttingen, Göttingen, Germany.,Center for Integrated Breeding Research (CiBreed), Georg-August University Göttingen, Göttingen, Germany
| |
Collapse
|
5
|
A New Algorithm for Identifying Cis-Regulatory Modules Based on Hidden Markov Model. BIOMED RESEARCH INTERNATIONAL 2018; 2017:6274513. [PMID: 28497059 PMCID: PMC5405574 DOI: 10.1155/2017/6274513] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/23/2016] [Revised: 03/06/2017] [Accepted: 03/23/2017] [Indexed: 11/24/2022]
Abstract
The discovery of cis-regulatory modules (CRMs) is the key to understanding mechanisms of transcription regulation. Since CRMs have specific regulatory structures that are the basis for the regulation of gene expression, how to model the regulatory structure of CRMs has a considerable impact on the performance of CRM identification. The paper proposes a CRM discovery algorithm called ComSPS. ComSPS builds a regulatory structure model of CRMs based on HMM by exploring the rules of CRM transcriptional grammar that governs the internal motif site arrangement of CRMs. We test ComSPS on three benchmark datasets and compare it with five existing methods. Experimental results show that ComSPS performs better than them.
Collapse
|
6
|
Liu B, Yang J, Li Y, McDermaid A, Ma Q. An algorithmic perspective of de novo cis-regulatory motif finding based on ChIP-seq data. Brief Bioinform 2017; 19:1069-1081. [DOI: 10.1093/bib/bbx026] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2016] [Indexed: 01/06/2023] Open
Affiliation(s)
- Bingqiang Liu
- School of Mathematics, Shandong University, Jinan Shandong, P. R. China
| | - Jinyu Yang
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
| | - Yang Li
- School of Mathematics, Shandong University, Jinan Shandong, P. R. China
| | - Adam McDermaid
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
| | - Qin Ma
- Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA
| |
Collapse
|
7
|
Diamanti K, Umer HM, Kruczyk M, Dąbrowski MJ, Cavalli M, Wadelius C, Komorowski J. Maps of context-dependent putative regulatory regions and genomic signal interactions. Nucleic Acids Res 2016; 44:9110-9120. [PMID: 27625394 PMCID: PMC5100580 DOI: 10.1093/nar/gkw800] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2015] [Accepted: 08/31/2016] [Indexed: 12/24/2022] Open
Abstract
Gene transcription is regulated mainly by transcription factors (TFs). ENCODE and Roadmap Epigenomics provide global binding profiles of TFs, which can be used to identify regulatory regions. To this end we implemented a method to systematically construct cell-type and species-specific maps of regulatory regions and TF-TF interactions. We illustrated the approach by developing maps for five human cell-lines and two other species. We detected ∼144k putative regulatory regions among the human cell-lines, with the majority of them being ∼300 bp. We found ∼20k putative regulatory elements in the ENCODE heterochromatic domains suggesting a large regulatory potential in the regions presumed transcriptionally silent. Among the most significant TF interactions identified in the heterochromatic regions were CTCF and the cohesin complex, which is in agreement with previous reports. Finally, we investigated the enrichment of the obtained putative regulatory regions in the 3D chromatin domains. More than 90% of the regions were discovered in the 3D contacting domains. We found a significant enrichment of GWAS SNPs in the putative regulatory regions. These significant enrichments provide evidence that the regulatory regions play a crucial role in the genomic structural stability. Additionally, we generated maps of putative regulatory regions for prostate and colorectal cancer human cell-lines.
Collapse
Affiliation(s)
- Klev Diamanti
- Department of Cell and Molecular Biology, Uppsala University, Uppsala SE-751-24, Sweden
| | - Husen M Umer
- Department of Cell and Molecular Biology, Uppsala University, Uppsala SE-751-24, Sweden
| | - Marcin Kruczyk
- Department of Cell and Molecular Biology, Uppsala University, Uppsala SE-751-24, Sweden
| | - Michał J Dąbrowski
- Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala SE-751-08, Sweden
| | - Marco Cavalli
- Institute of Computer Science, Polish Academy of Sciences, Warsaw 012-48, Poland
| | - Claes Wadelius
- Institute of Computer Science, Polish Academy of Sciences, Warsaw 012-48, Poland
| | - Jan Komorowski
- Department of Cell and Molecular Biology, Uppsala University, Uppsala SE-751-24, Sweden .,Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala SE-751-08, Sweden
| |
Collapse
|
8
|
Yang W, Sinha S. A novel method for predicting activity of cis-regulatory modules, based on a diverse training set. Bioinformatics 2016; 33:1-7. [PMID: 27609510 DOI: 10.1093/bioinformatics/btw552] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2016] [Revised: 07/26/2016] [Accepted: 08/17/2016] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION With the rapid emergence of technologies for locating cis-regulatory modules (CRMs) genome-wide, the next pressing challenge is to assign precise functions to each CRM, i.e. to determine the spatiotemporal domains or cell-types where it drives expression. A popular approach to this task is to model the typical k-mer composition of a set of CRMs known to drive a common expression pattern, and assign that pattern to other CRMs exhibiting a similar k-mer composition. This approach does not rely on prior knowledge of transcription factors relevant to the CRM or their binding motifs, and is thus more widely applicable than motif-based methods for predicting CRM activity, but is also prone to false positive predictions. RESULTS We present a novel strategy to improve the above-mentioned approach: to predict if a CRM drives a specific gene expression pattern, assess not only how similar the CRM is to other CRMs with similar activity but also to CRMs with distinct activities. We use a state-of-the-art statistical method to quantify a CRM's sequence similarity to many different training sets of CRMs, and employ a classification algorithm to integrate these similarity scores into a single prediction of the CRM's activity. This strategy is shown to significantly improve CRM activity prediction over current approaches. AVAILABILITY AND IMPLEMENTATION Our implementation of the new method, called IMMBoost, is freely available as source code, at https://github.com/weiyangedward/IMMBoost CONTACT: sinhas@illinois.eduSupplementary information: Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Wei Yang
- Department of Computer Science, University of Illinois, Urbana-Champaign, Urbana, IL, USA
| | - Saurabh Sinha
- Department of Computer Science, University of Illinois, Urbana-Champaign, Urbana, IL, USA
| |
Collapse
|
9
|
Boeva V. Analysis of Genomic Sequence Motifs for Deciphering Transcription Factor Binding and Transcriptional Regulation in Eukaryotic Cells. Front Genet 2016; 7:24. [PMID: 26941778 PMCID: PMC4763482 DOI: 10.3389/fgene.2016.00024] [Citation(s) in RCA: 91] [Impact Index Per Article: 11.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2015] [Accepted: 02/05/2016] [Indexed: 12/27/2022] Open
Abstract
Eukaryotic genomes contain a variety of structured patterns: repetitive elements, binding sites of DNA and RNA associated proteins, splice sites, and so on. Often, these structured patterns can be formalized as motifs and described using a proper mathematical model such as position weight matrix and IUPAC consensus. Two key tasks are typically carried out for motifs in the context of the analysis of genomic sequences. These are: identification in a set of DNA regions of over-represented motifs from a particular motif database, and de novo discovery of over-represented motifs. Here we describe existing methodology to perform these two tasks for motifs characterizing transcription factor binding. When applied to the output of ChIP-seq and ChIP-exo experiments, or to promoter regions of co-modulated genes, motif analysis techniques allow for the prediction of transcription factor binding events and enable identification of transcriptional regulators and co-regulators. The usefulness of motif analysis is further exemplified in this review by how motif discovery improves peak calling in ChIP-seq and ChIP-exo experiments and, when coupled with information on gene expression, allows insights into physical mechanisms of transcriptional modulation.
Collapse
Affiliation(s)
- Valentina Boeva
- Centre de Recherche, Institut CurieParis, France; INSERM, U900Paris, France; Mines ParisTechFontainebleau, France; PSL Research UniversityParis, France; Department of Development, Reproduction and Cancer, Institut CochinParis, France; INSERM, U1016Paris, France; Centre National de la Recherche Scientifique UMR 8104Paris, France; Université Paris Descartes UMR-S1016Paris, France
| |
Collapse
|
10
|
Meckbach C, Tacke R, Hua X, Waack S, Wingender E, Gültas M. PC-TraFF: identification of potentially collaborating transcription factors using pointwise mutual information. BMC Bioinformatics 2015; 16:400. [PMID: 26627005 PMCID: PMC4667426 DOI: 10.1186/s12859-015-0827-2] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2015] [Accepted: 11/17/2015] [Indexed: 01/06/2023] Open
Abstract
Background Transcription factors (TFs) are important regulatory proteins that govern transcriptional regulation. Today, it is known that in higher organisms different TFs have to cooperate rather than acting individually in order to control complex genetic programs. The identification of these interactions is an important challenge for understanding the molecular mechanisms of regulating biological processes. In this study, we present a new method based on pointwise mutual information, PC-TraFF, which considers the genome as a document, the sequences as sentences, and TF binding sites (TFBSs) as words to identify interacting TFs in a set of sequences. Results To demonstrate the effectiveness of PC-TraFF, we performed a genome-wide analysis and a breast cancer-associated sequence set analysis for protein coding and miRNA genes. Our results show that in any of these sequence sets, PC-TraFF is able to identify important interacting TF pairs, for most of which we found support by previously published experimental results. Further, we made a pairwise comparison between PC-TraFF and three conventional methods. The outcome of this comparison study strongly suggests that all these methods focus on different important aspects of interaction between TFs and thus the pairwise overlap between any of them is only marginal. Conclusions In this study, adopting the idea from the field of linguistics in the field of bioinformatics, we develop a new information theoretic method, PC-TraFF, for the identification of potentially collaborating transcription factors based on the idiosyncrasy of their binding site distributions on the genome. The results of our study show that PC-TraFF can succesfully identify known interacting TF pairs and thus its currently biologically uncorfirmed predictions could provide new hypotheses for further experimental validation. Additionally, the comparison of the results of PC-TraFF with the results of previous methods demonstrates that different methods with their specific scopes can perfectly supplement each other. Overall, our analyses indicate that PC-TraFF is a time-efficient method where its algorithm has a tractable computational time and memory consumption. The PC-TraFF server is freely accessible at http://pctraff.bioinf.med.uni-goettingen.de/ Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0827-2) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Cornelia Meckbach
- Institute of Bioinformatics, University of Göttingen, Goldschmidtstr. 1, Göttingen, 37077, Germany.
| | - Rebecca Tacke
- Institute of Bioinformatics, University of Göttingen, Goldschmidtstr. 1, Göttingen, 37077, Germany.
| | - Xu Hua
- Institute of Bioinformatics, University of Göttingen, Goldschmidtstr. 1, Göttingen, 37077, Germany.
| | - Stephan Waack
- Institute of Computer Science, University of Göttingen, Goldschmidtstr. 7, Göttingen, 37077, Germany.
| | - Edgar Wingender
- Institute of Bioinformatics, University of Göttingen, Goldschmidtstr. 1, Göttingen, 37077, Germany.
| | - Mehmet Gültas
- Institute of Bioinformatics, University of Göttingen, Goldschmidtstr. 1, Göttingen, 37077, Germany.
| |
Collapse
|
11
|
Leoncini M, Montangero M, Pellegrini M, Tillan KP. CMStalker: A Combinatorial Tool for Composite Motif Discovery. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1123-1136. [PMID: 26451824 DOI: 10.1109/tcbb.2014.2359444] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Controlling the differential expression of many thousands different genes at any given time is a fundamental task of metazoan organisms and this complex orchestration is controlled by the so-called regulatory genome encoding complex regulatory networks: several Transcription Factors bind to precise DNA regions, so to perform in a cooperative manner a specific regulation task for nearby genes. The in silico prediction of these binding sites is still an open problem, notwithstanding continuous progress and activity in the last two decades. In this paper, we describe a new efficient combinatorial approach to the problem of detecting sets of cooperating binding sites in promoter sequences, given in input a database of Transcription Factor Binding Sites encoded as Position Weight Matrices. We present CMStalker, a software tool for composite motif discovery which embodies a new approach that combines a constraint satisfaction formulation with a parameter relaxation technique to explore efficiently the space of possible solutions. Extensive experiments with 12 data sets and 11 state-of-the-art tools are reported, showing an average value of the correlation coefficient of 0.54 (against a value 0.41 of the closest competitor). This improvements in output quality due to CMStalker is statistically significant.
Collapse
|
12
|
Niu M, Tabari ES, Su Z. De novo prediction of cis-regulatory elements and modules through integrative analysis of a large number of ChIP datasets. BMC Genomics 2014; 15:1047. [PMID: 25442502 PMCID: PMC4265420 DOI: 10.1186/1471-2164-15-1047] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2014] [Accepted: 11/19/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In eukaryotes, transcriptional regulation is usually mediated by interactions of multiple transcription factors (TFs) with their respective specific cis-regulatory elements (CREs) in the so-called cis-regulatory modules (CRMs) in DNA. Although the knowledge of CREs and CRMs in a genome is crucial to elucidate gene regulatory networks and understand many important biological phenomena, little is known about the CREs and CRMs in most eukaryotic genomes due to the difficulty to characterize them by either computational or traditional experimental methods. However, the exponentially increasing number of TF binding location data produced by the recent wide adaptation of chromatin immunoprecipitation coupled with microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) technologies has provided an unprecedented opportunity to identify CRMs and CREs in genomes. Nonetheless, how to effectively mine these large volumes of ChIP data to identify CREs and CRMs at nucleotide resolution is a highly challenging task. RESULTS We have developed a novel graph-theoretic based algorithm DePCRM for genome-wide de novo predictions of CREs and CRMs using a large number of ChIP datasets. DePCRM predicts CREs and CRMs by identifying overrepresented combinatorial CRE motif patterns in multiple ChIP datasets in an effective way. When applied to 168 ChIP datasets of 56 TFs from D. melanogaster, DePCRM identified 184 and 746 overrepresented CRE motifs and their combinatorial patterns, respectively, and predicted a total of 115,932 CRMs in the genome. The predictions recover 77.9% of known CRMs in the datasets and 89.3% of known CRMs containing at least one predicted CRE. We found that the putative CRMs as well as CREs as a whole in a CRM are more conserved than randomly selected sequences. CONCLUSION Our results suggest that the CRMs predicted by DePCRM are highly likely to be functional. Our algorithm is the first of its kind for de novo genome-wide prediction of CREs and CRMs using larger number of transcription factor ChIP datasets. The algorithm and predictions will hopefully facilitate the elucidation of gene regulatory networks in eukaryotes. All the predicted CREs, CRMs, and their target genes are available at http://bioinfo.uncc.edu/mniu/pcrms/www/.
Collapse
Affiliation(s)
| | | | - Zhengchang Su
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC 28223, USA.
| |
Collapse
|
13
|
Navarro C, Lopez FJ, Cano C, Garcia-Alcalde F, Blanco A. CisMiner: genome-wide in-silico cis-regulatory module prediction by fuzzy itemset mining. PLoS One 2014; 9:e108065. [PMID: 25268582 PMCID: PMC4182448 DOI: 10.1371/journal.pone.0108065] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2014] [Accepted: 08/25/2014] [Indexed: 01/18/2023] Open
Abstract
Eukaryotic gene control regions are known to be spread throughout non-coding DNA sequences which may appear distant from the gene promoter. Transcription factors are proteins that coordinately bind to these regions at transcription factor binding sites to regulate gene expression. Several tools allow to detect significant co-occurrences of closely located binding sites (cis-regulatory modules, CRMs). However, these tools present at least one of the following limitations: 1) scope limited to promoter or conserved regions of the genome; 2) do not allow to identify combinations involving more than two motifs; 3) require prior information about target motifs. In this work we present CisMiner, a novel methodology to detect putative CRMs by means of a fuzzy itemset mining approach able to operate at genome-wide scale. CisMiner allows to perform a blind search of CRMs without any prior information about target CRMs nor limitation in the number of motifs. CisMiner tackles the combinatorial complexity of genome-wide cis-regulatory module extraction using a natural representation of motif combinations as itemsets and applying the Top-Down Fuzzy Frequent- Pattern Tree algorithm to identify significant itemsets. Fuzzy technology allows CisMiner to better handle the imprecision and noise inherent to regulatory processes. Results obtained for a set of well-known binding sites in the S. cerevisiae genome show that our method yields highly reliable predictions. Furthermore, CisMiner was also applied to putative in-silico predicted transcription factor binding sites to identify significant combinations in S. cerevisiae and D. melanogaster, proving that our approach can be further applied genome-wide to more complex genomes. CisMiner is freely accesible at: http://genome2.ugr.es/cisminer. CisMiner can be queried for the results presented in this work and can also perform a customized cis-regulatory module prediction on a query set of transcription factor binding sites provided by the user.
Collapse
Affiliation(s)
- Carmen Navarro
- Department of Computer Science and AI, University of Granada, Granada, Spain
| | - Francisco J. Lopez
- Andalusian Human Genome Sequencing Centre (CASEGH), Medical Genome Project (MGP), Sevilla, Spain
| | - Carlos Cano
- Department of Computer Science and AI, University of Granada, Granada, Spain
| | | | - Armando Blanco
- Department of Computer Science and AI, University of Granada, Granada, Spain
| |
Collapse
|
14
|
Ding J, Hu H, Li X. SIOMICS: a novel approach for systematic identification of motifs in ChIP-seq data. Nucleic Acids Res 2013; 42:e35. [PMID: 24322294 PMCID: PMC3950686 DOI: 10.1093/nar/gkt1288] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022] Open
Abstract
The identification of transcription factor binding motifs is important for the study of gene transcriptional regulation. The chromatin immunoprecipitation (ChIP), followed by massive parallel sequencing (ChIP-seq) experiments, provides an unprecedented opportunity to discover binding motifs. Computational methods have been developed to identify motifs from ChIP-seq data, while at the same time encountering several problems. For example, existing methods are often not scalable to the large number of sequences obtained from ChIP-seq peak regions. Some methods heavily rely on well-annotated motifs even though the number of known motifs is limited. To simplify the problem, de novo motif discovery methods often neglect underrepresented motifs in ChIP-seq peak regions. To address these issues, we developed a novel approach called SIOMICS to de novo discover motifs from ChIP-seq data. Tested on 13 ChIP-seq data sets, SIOMICS identified motifs of many known and new cofactors. Tested on 13 simulated random data sets, SIOMICS discovered no motif in any data set. Compared with two recently developed methods for motif discovery, SIOMICS shows advantages in terms of speed, the number of known cofactor motifs predicted in experimental data sets and the number of false motifs predicted in random data sets. The SIOMICS software is freely available at http://eecs.ucf.edu/∼xiaoman/SIOMICS/SIOMICS.html.
Collapse
Affiliation(s)
- Jun Ding
- Department of Electric Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, USA and Burnett School of Biomedical Science, University of Central Florida, Orlando, FL 32816, USA
| | | | | |
Collapse
|
15
|
Teng L, He B, Gao P, Gao L, Tan K. Discover context-specific combinatorial transcription factor interactions by integrating diverse ChIP-Seq data sets. Nucleic Acids Res 2013; 42:e24. [PMID: 24217919 PMCID: PMC3936738 DOI: 10.1093/nar/gkt1105] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
Combinatorial interactions among transcription factors (TFs) are critical for integrating diverse intrinsic and extrinsic signals, fine-tuning regulatory output and increasing the robustness and plasticity of regulatory systems. Current knowledge about combinatorial regulation is rather limited due to the lack of suitable experimental technologies and bioinformatics tools. The rapid accumulation of ChIP-Seq data has provided genome-wide occupancy maps for a large number of TFs and chromatin modification marks for identifying enhancers without knowing individual TF binding sites. Integration of the two data types has not been researched extensively, resulting in underused data and missed opportunities. We describe a novel method for discovering frequent combinatorial occupancy patterns by multiple TFs at enhancers. Our method is based on probabilistic item set mining and takes into account uncertainty in both types of ChIP-Seq data. By joint analysis of 108 TFs in four human cell types, we found that cell–type-specific interactions among TFs are abundant and that the majority of enhancers have flexible architecture. We show that several families of transposable elements disproportionally overlap with enhancers with combinatorial patterns, suggesting that these transposable element families play an important role in the evolution of combinatorial regulation.
Collapse
Affiliation(s)
- Li Teng
- Department of Internal Medicine, University of Iowa, Iowa City, IA 52242, USA, Interdisciplinary Graduate Program in Genetics, University of Iowa, Iowa City, IA 52242, USA and Department of Biomedical Engineering, University of Iowa, Iowa City, IA 52242, USA
| | | | | | | | | |
Collapse
|
16
|
Diez D, Hutchins AP, Miranda-Saavedra D. Systematic identification of transcriptional regulatory modules from protein-protein interaction networks. Nucleic Acids Res 2013; 42:e6. [PMID: 24137002 PMCID: PMC3874207 DOI: 10.1093/nar/gkt913] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022] Open
Abstract
Transcription factors (TFs) combine with co-factors to form transcriptional regulatory modules (TRMs) that regulate gene expression programs with spatiotemporal specificity. Here we present a novel and generic method (rTRM) for the reconstruction of TRMs that integrates genomic information from TF binding, cell type-specific gene expression and protein–protein interactions. rTRM was applied to reconstruct the TRMs specific for embryonic stem cells (ESC) and hematopoietic stem cells (HSC), neural progenitor cells, trophoblast stem cells and distinct types of terminally differentiated CD4+ T cells. The ESC and HSC TRM predictions were highly precise, yielding 77 and 96 proteins, of which ∼75% have been independently shown to be involved in the regulation of these cell types. Furthermore, rTRM successfully identified a large number of bridging proteins with known roles in ESCs and HSCs, which could not have been identified using genomic approaches alone, as they lack the ability to bind specific DNA sequences. This highlights the advantage of rTRM over other methods that ignore PPI information, as proteins need to interact with other proteins to form complexes and perform specific functions. The prediction and experimental validation of the co-factors that endow master regulatory TFs with the capacity to select specific genomic sites, modulate the local epigenetic profile and integrate multiple signals will provide important mechanistic insights not only into how such TFs operate, but also into abnormal transcriptional states leading to disease.
Collapse
Affiliation(s)
- Diego Diez
- World Premier International (WPI) Immunology Frontier Research Center (IFReC), Osaka University, 3-1 Yamadaoka, Suita 565-0871, Osaka, Japan, South China Institute for Stem Cell Biology and Regenerative Medicine, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, 190 Kaiyuan Ave, Guangzhou 510663, China and Fibrosis Laboratories, Institute of Cellular Medicine, Newcastle University Medical School, Framlington Place, Newcastle upon Tyne NE2 4HH, United Kingdom
| | | | | |
Collapse
|
17
|
Rohr CO, Parra RG, Yankilevich P, Perez-Castro C. INSECT: IN-silico SEarch for Co-occurring Transcription factors. ACTA ACUST UNITED AC 2013; 29:2852-8. [PMID: 24008418 DOI: 10.1093/bioinformatics/btt506] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION Transcriptional regulation occurs through the concerted actions of multiple transcription factors (TFs) that bind cooperatively to cis-regulatory modules (CRMs) of genes. These CRMs usually contain a variable number of transcription factor-binding sites (TFBSs) involved in related cellular and physiological processes. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) has been effective in detecting TFBSs and nucleosome location to identify potential CRMs in genome-wide studies. Although several attempts were previously reported to predict the potential binding of TFs at TFBSs within CRMs by comparing different ChIP-seq data, these have been hampered by excessive background, usually emerging as a consequence of experimental conditions. To understand these complex regulatory circuits, it would be helpful to have reliable and updated user-friendly tools to assist in the identification of TFBSs and CRMs for gene(s) of interest. RESULTS Here we present INSECT (IN-silico SEarch for Co-occurring Transcription factors), a novel web server for identifying potential TFBSs and CRMs in gene sequences. By combining several strategies, INSECT provides flexible analysis of multiple co-occurring TFBSs, by applying differing search schemes and restriction parameters. availability and implementation: INSECT is freely available as a web server at http://bioinformatics.ibioba-mpsp-conicet.gov.ar/INSECT .
Collapse
Affiliation(s)
- Cristian O Rohr
- Instituto de Ecología, Genética y Evolución (IEGEBA)-CONICET, Protein Physiology Laboratory, Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, UBA-CONICET-IQUIBICEN, C1428EGA, Buenos Aires, Argentina and Instituto de Investigación en Biomedicina de Buenos Aires (IBioBA) -CONICET- Partner Institute of the Max Planck Society, C1425FQD, Buenos Aires, Argentina
| | | | | | | |
Collapse
|
18
|
Ha N, Polychronidou M, Lohmann I. COPS: detecting co-occurrence and spatial arrangement of transcription factor binding motifs in genome-wide datasets. PLoS One 2012; 7:e52055. [PMID: 23272209 PMCID: PMC3525548 DOI: 10.1371/journal.pone.0052055] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2012] [Accepted: 11/12/2012] [Indexed: 11/18/2022] Open
Abstract
In multi-cellular organisms, spatiotemporal activity of cis-regulatory DNA elements depends on their occupancy by different transcription factors (TFs). In recent years, genome-wide ChIP-on-Chip, ChIP-Seq and DamID assays have been extensively used to unravel the combinatorial interaction of TFs with cis-regulatory modules (CRMs) in the genome. Even though genome-wide binding profiles are increasingly becoming available for different TFs, single TF binding profiles are in most cases not sufficient for dissecting complex regulatory networks. Thus, potent computational tools detecting statistically significant and biologically relevant TF-motif co-occurrences in genome-wide datasets are essential for analyzing context-dependent transcriptional regulation. We have developed COPS (Co-Occurrence Pattern Search), a new bioinformatics tool based on a combination of association rules and Markov chain models, which detects co-occurring TF binding sites (BSs) on genomic regions of interest. COPS scans DNA sequences for frequent motif patterns using a Frequent-Pattern tree based data mining approach, which allows efficient performance of the software with respect to both data structure and implementation speed, in particular when mining large datasets. Since transcriptional gene regulation very often relies on the formation of regulatory protein complexes mediated by closely adjoining TF binding sites on CRMs, COPS additionally detects preferred short distance between co-occurring TF motifs. The performance of our software with respect to biological significance was evaluated using three published datasets containing genomic regions that are independently bound by several TFs involved in a defined biological process. In sum, COPS is a fast, efficient and user-friendly tool mining statistically and biologically significant TFBS co-occurrences and therefore allows the identification of TFs that combinatorially regulate gene expression.
Collapse
Affiliation(s)
- Nati Ha
- Centre for Organismal Studies (COS) Heidelberg, University of Heidelberg, Heidelberg and CellNetworks – Cluster of Excellence Germany, Heidelberg, Germany
| | - Maria Polychronidou
- Centre for Organismal Studies (COS) Heidelberg, University of Heidelberg, Heidelberg and CellNetworks – Cluster of Excellence Germany, Heidelberg, Germany
| | - Ingrid Lohmann
- Centre for Organismal Studies (COS) Heidelberg, University of Heidelberg, Heidelberg and CellNetworks – Cluster of Excellence Germany, Heidelberg, Germany
- * E-mail:
| |
Collapse
|