51
Müller-Molina AJ, Schöler HR, Araúzo-Bravo MJ. Comprehensive human transcription factor binding site map for combinatory binding motifs discovery. PLoS One 2012; 7:e49086. [PMID: 23209563 PMCID: PMC3509107 DOI: 10.1371/journal.pone.0049086] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 07/03/2012] [Accepted: 10/08/2012] [Indexed: 11/18/2022] Open
Abstract
To know the map between transcription factors (TFs) and their binding sites is essential to reverse engineer the regulation process. Only about 10%-20% of the transcription factor binding motifs (TFBMs) have been reported. This lack of data hinders understanding gene regulation. To address this drawback, we propose a computational method that exploits never used TF properties to discover the missing TFBMs and their sites in all human gene promoters. The method starts by predicting a dictionary of regulatory "DNA words." From this dictionary, it distills 4098 novel predictions. To disclose the crosstalk between motifs, an additional algorithm extracts TF combinatorial binding patterns creating a collection of TF regulatory syntactic rules. Using these rules, we narrowed down a list of 504 novel motifs that appear frequently in syntax patterns. We tested the predictions against 509 known motifs confirming that our system can reliably predict ab initio motifs with an accuracy of 81%, far higher than previous approaches. We found that on average, 90% of the discovered combinatorial binding patterns target at least 10 genes, suggesting that to control in an independent manner smaller gene sets, supplementary regulatory mechanisms are required. Additionally, we discovered that the new TFBMs and their combinatorial patterns convey biological meaning, targeting TFs and genes related to developmental functions. Thus, among all the possible available targets in the genome, the TFs tend to regulate other TFs and genes involved in developmental functions. We provide a comprehensive resource for regulation analysis that includes a dictionary of "DNA words," newly predicted motifs and their corresponding combinatorial patterns. Combinatorial patterns are a useful filter to discover TFBMs that play a major role in orchestrating other factors and thus, are likely to lock/unlock cellular functional clusters.
Affiliation(s)
- Arnoldo J. Müller-Molina
- Computational Biology and Bioinformatics Group, Max Planck Institute for Molecular Biomedicine, Münster, Germany
- Hans R. Schöler
- Department of Cell and Developmental Biology, Max Planck Institute for Molecular Biomedicine, Münster, Germany
- Medical Faculty, University of Münster, Münster, Germany
- Marcos J. Araúzo-Bravo
- Computational Biology and Bioinformatics Group, Max Planck Institute for Molecular Biomedicine, Münster, Germany
52
Seitzer P, Wilbanks EG, Larsen DJ, Facciotti MT. A Monte Carlo-based framework enhances the discovery and interpretation of regulatory sequence motifs. BMC Bioinformatics 2012. [PMID: 23181585 PMCID: PMC3542263 DOI: 10.1186/1471-2105-13-317] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Discovery of functionally significant short, statistically overrepresented subsequence patterns (motifs) in a set of sequences is a challenging problem in bioinformatics. Oftentimes, not all sequences in the set contain a motif. These non-motif-containing sequences complicate the algorithmic discovery of motifs. Filtering the non-motif-containing sequences from the larger set of sequences while simultaneously determining the identity of the motif is, therefore, desirable and a non-trivial problem in motif discovery research. RESULTS We describe MotifCatcher, a framework that extends the sensitivity of existing motif-finding tools by employing random sampling to effectively remove non-motif-containing sequences from the motif search. We developed two implementations of our algorithm, each built around a commonly used motif-finding tool, and applied our algorithm to three diverse chromatin immunoprecipitation (ChIP) data sets. In each case, the motif finder with the MotifCatcher extension demonstrated improved sensitivity over the motif finder alone. Our approach organizes candidate functionally significant discovered motifs into a tree, which allowed us to make additional insights. In all cases, we were able to support our findings with experimental work from the literature. CONCLUSIONS Our framework demonstrates that additional processing at the sequence entry level can significantly improve the performance of existing motif-finding tools. For each biological data set tested, we were able to propose novel biological hypotheses supported by experimental work from the literature. Specifically, in Escherichia coli, we suggested binding site motifs for 6 non-traditional LexA protein binding sites; in Saccharomyces cerevisiae, we hypothesize 2 disparate mechanisms for novel binding sites of the Cse4p protein; and in Halobacterium sp. NRC-1, we discovered subtle differences in a general transcription factor (GTF) binding site motif across several data sets. We suggest that small differences in our discovered motif could confer specificity for one or more homologous GTF proteins. We offer a free implementation of the MotifCatcher software package at http://www.bme.ucdavis.edu/facciotti/resources_data/software/.
Affiliation(s)
- Phillip Seitzer
- Department of Biomedical Engineering, One Shields Ave, University of California, Davis, CA 95616, USA
53
Mehdi AM, Sehgal MSB, Kobe B, Bailey TL, Bodén M. DLocalMotif: a discriminative approach for discovering local motifs in protein sequences. Bioinformatics 2012; 29:39-46. [PMID: 23142965 DOI: 10.1093/bioinformatics/bts654] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]
Abstract
MOTIVATION Local motifs are patterns of DNA or protein sequences that occur within a sequence interval relative to a biologically defined anchor or landmark. Current protein motif discovery methods do not adequately consider such constraints to identify biologically significant motifs that are only weakly over-represented but spatially confined. Using negatives, i.e. sequences known to not contain a local motif, can further increase the specificity of their discovery. RESULTS This article introduces the method DLocalMotif that makes use of positional information and negative data for local motif discovery in protein sequences. DLocalMotif combines three scoring functions, measuring degrees of motif over-representation, entropy and spatial confinement, specifically designed to discriminatively exploit the availability of negative data. The method is shown to outperform current methods that use only a subset of these motif characteristics. We apply the method to several biological datasets. The analysis of peroxisomal targeting signals uncovers several novel motifs that occur immediately upstream of the dominant peroxisomal targeting signal-1 signal. The analysis of proline-tyrosine nuclear localization signals uncovers multiple novel motifs that overlap with C2H2 zinc finger domains. We also evaluate the method on classical nuclear localization signals and endoplasmic reticulum retention signals and find that DLocalMotif successfully recovers biologically relevant sequence properties. AVAILABILITY http://bioinf.scmb.uq.edu.au/dlocalmotif/
Affiliation(s)
- Ahmed M Mehdi
- Institute for Molecular Bioscience, The University of Queensland, Australia
54
Xia X. Position weight matrix, Gibbs sampler, and the associated significance tests in motif characterization and prediction. Scientifica 2012; 2012:917540. [PMID: 24278755 PMCID: PMC3820676 DOI: 10.6064/2012/917540] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 08/22/2012] [Accepted: 10/11/2012] [Indexed: 05/31/2023]
Abstract
Position weight matrix (PWM) is not only one of the most widely used bioinformatic methods, but also a key component in more advanced computational algorithms (e.g., Gibbs sampler) for characterizing and discovering motifs in nucleotide or amino acid sequences. However, few generally applicable statistical tests are available for evaluating the significance of site patterns, PWM, and PWM scores (PWMS) of putative motifs. Statistical significance tests of the PWM output, that is, site-specific frequencies, PWM itself, and PWMS, are in disparate sources and have never been collected in a single paper, with the consequence that many implementations of PWM do not include any significance test. Here I review PWM-based methods used in motif characterization and prediction (including a detailed illustration of the Gibbs sampler for de novo motif discovery), present statistical and probabilistic rationales behind statistical significance tests relevant to PWM, and illustrate their application with real data. The multiple comparison problem associated with the test of site-specific frequencies is best handled by false discovery rate methods. The test of PWM, due to the use of pseudocounts, is best done by resampling methods. The test of individual PWMS for each sequence segment should be based on the extreme value distribution.
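As a rough illustration of the PWM and PWM score (PWMS) that this abstract refers to, the following minimal Python sketch builds a log-odds matrix from aligned sites using pseudocounts and scores sequence segments with it. The function names, the uniform background, and the pseudocount of 1 are assumptions made for this example and are not taken from the paper; the significance tests the abstract mentions (false discovery rate, resampling, extreme value distribution) are not implemented here.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log2-odds PWM (list of per-position dicts) from aligned sites.

    Assumption: equal-length DNA sites and, by default, a uniform background.
    """
    if background is None:
        background = {b: 0.25 for b in ALPHABET}
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(width):
        counts = Counter(s[i] for s in sites)
        # Pseudocounts keep unseen nucleotides from producing log(0).
        pwm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background[b])
            for b in ALPHABET
        })
    return pwm

def score_segment(pwm, segment):
    """PWM score of one segment: sum of the per-position log-odds."""
    return sum(column[base] for column, base in zip(pwm, segment))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(
        (score_segment(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

Calling `best_hit(build_pwm(sites), sequence)` returns the top window's score and start position; judging whether that score is significant would require one of the tests the review discusses.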
Affiliation(s)
- Xuhua Xia
- Department of Biology, University of Ottawa, 30 Marie Curie, Ottawa, ON, Canada K1N 6N5
55
Anderson DM, George R, Noyes MB, Rowton M, Liu W, Jiang R, Wolfe SA, Wilson-Rawls J, Rawls A. Characterization of the DNA-binding properties of the Mohawk homeobox transcription factor. J Biol Chem 2012; 287:35351-35359. [PMID: 22923612 DOI: 10.1074/jbc.m112.399386] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 11/06/2022] Open
Abstract
The homeobox transcription factor Mohawk (Mkx) is a potent transcriptional repressor expressed in the embryonic precursors of skeletal muscle, cartilage, and bone. MKX has recently been shown to be a critical regulator of musculoskeletal tissue differentiation and gene expression; however, the genetic pathways through which MKX functions and its DNA-binding properties are currently unknown. Using a modified bacterial one-hybrid site selection assay, we determined the core DNA-recognition motif of the mouse monomeric Mkx homeodomain to be A-C-A. Using cell-based assays, we have identified a minimal Mkx-responsive element (MRE) located within the Mkx promoter, which is composed of a highly conserved inverted repeat of the core Mkx recognition motif. Using the minimal MRE sequence, we have further identified conserved MREs within the locus of Sox6, a transcription factor that represses slow fiber gene expression during skeletal muscle differentiation. Real-time PCR and immunostaining of in vitro differentiated muscle satellite cells isolated from Mkx-null mice revealed an increase in the expression of Sox6 and down-regulation of slow fiber structural genes. Together, these data identify the unique DNA-recognition properties of MKX and reveal a novel role for Mkx in promoting slow fiber type specification during skeletal muscle differentiation.
Affiliation(s)
- Douglas M Anderson
- School of Life Sciences, Biodesign Institute, Arizona State University, Tempe, Arizona 85287-4501; Molecular and Cellular Biology Graduate Program, Biodesign Institute, Arizona State University, Tempe, Arizona 85287-4501
- Rajani George
- School of Life Sciences, Biodesign Institute, Arizona State University, Tempe, Arizona 85287-4501; Molecular and Cellular Biology Graduate Program, Biodesign Institute, Arizona State University, Tempe, Arizona 85287-4501
- Marcus B Noyes
- Program in Gene Function and Expression, University of Massachusetts Medical School, Worcester, Massachusetts 01605; Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, Massachusetts 01605
- Megan Rowton
- School of Life Sciences, Biodesign Institute, Arizona State University, Tempe, Arizona 85287-4501; Molecular and Cellular Biology Graduate Program, Biodesign Institute, Arizona State University, Tempe, Arizona 85287-4501
- Wenjin Liu
- Department of Biomedical Genetics and Center for Oral Biology, University of Rochester School of Medicine and Dentistry, Rochester, New York 14642
- Rulang Jiang
- Department of Biomedical Genetics and Center for Oral Biology, University of Rochester School of Medicine and Dentistry, Rochester, New York 14642
- Scot A Wolfe
- Program in Gene Function and Expression, University of Massachusetts Medical School, Worcester, Massachusetts 01605; Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, Massachusetts 01605
- Jeanne Wilson-Rawls
- School of Life Sciences, Biodesign Institute, Arizona State University, Tempe, Arizona 85287-4501
- Alan Rawls
- School of Life Sciences, Biodesign Institute, Arizona State University, Tempe, Arizona 85287-4501; Center for Evolutionary Medicine and Informatics, Biodesign Institute, Arizona State University, Tempe, Arizona 85287-4501.
56
Wang S, Yin Y, Ma Q, Tang X, Hao D, Xu Y. Genome-scale identification of cell-wall related genes in Arabidopsis based on co-expression network analysis. BMC Plant Biology 2012; 12:138. [PMID: 22877077 PMCID: PMC3463447 DOI: 10.1186/1471-2229-12-138] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Received: 05/14/2012] [Accepted: 07/30/2012] [Indexed: 05/21/2023]
Abstract
BACKGROUND Identification of novel genes relevant to plant cell-wall (PCW) synthesis represents a highly important and challenging problem. Although substantial efforts have been invested into studying this problem, the vast majority of the PCW related genes remain unknown. RESULTS Here we present a computational study focused on identification of novel PCW genes in Arabidopsis based on the co-expression analyses of transcriptomic data collected under 351 conditions, using a bi-clustering technique. Our analysis identified 217 highly co-expressed gene clusters (modules) under some experimental conditions, each containing at least one gene annotated as PCW related according to the Purdue Cell Wall Gene Families database. These co-expression modules cover 349 known/annotated PCW genes and 2,438 new candidates. For each candidate gene, we annotated the specific PCW synthesis stages in which it is involved and predicted the detailed function. In addition, for the co-expressed genes in each module, we predicted and analyzed their cis regulatory motifs in the promoters using our motif discovery pipeline, providing strong evidence that the genes in each co-expression module are transcriptionally co-regulated. From all the co-expression modules, we infer that 108 modules are related to four major PCW synthesis components, using three complementary methods. CONCLUSIONS We believe our approach and data presented here will be useful for further identification and characterization of PCW genes. All the predicted PCW genes, co-expression modules, motifs and their annotations are available at a web-based database: http://csbl.bmb.uga.edu/publications/materials/shanwang/CWRPdb/index.html.
Affiliation(s)
- Shan Wang
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
- Key Lab for Molecular Enzymology and Engineering of the Ministry of Education, Jilin University, Changchun, China
- Biotechnology Research Centre, Jilin Academy of Agricultural Sciences (JAAS), Changchun, China
- Yanbin Yin
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
- BESC BioEnergy Science Center, University of Georgia, Athens, GA, USA
- Qin Ma
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
- BESC BioEnergy Science Center, University of Georgia, Athens, GA, USA
- Xiaojia Tang
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
- Dongyun Hao
- Key Lab for Molecular Enzymology and Engineering of the Ministry of Education, Jilin University, Changchun, China
- Biotechnology Research Centre, Jilin Academy of Agricultural Sciences (JAAS), Changchun, China
- Ying Xu
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, Athens, GA, USA
- BESC BioEnergy Science Center, University of Georgia, Athens, GA, USA
- College of Computer Science and Technology, Jilin University, Changchun, China
57
Mahdevar G, Sadeghi M, Nowzari-Dalini A. Transcription factor binding sites detection by using alignment-based approach. J Theor Biol 2012; 304:96-102. [PMID: 22504445 DOI: 10.1016/j.jtbi.2012.03.039] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/06/2011] [Revised: 03/27/2012] [Accepted: 03/29/2012] [Indexed: 11/25/2022]
Abstract
Gene expression is the main cause for the existence of various phenotypes. Through this procedure, the information stored in DNA gives rise to the phenotype. Essentially, gene expression is dependent upon the successful binding of transcription factors (TFs), a specific type of protein, to explicit positions upstream of the gene, the TF binding sites (TFBSs). Unfortunately, finding these TFBSs is costly and laborious; therefore, discovering TFBSs computationally is a significant problem that many researchers endeavor to solve. In this paper, a new TFBS discovery method is presented by considering known biological facts about TFBSs. The input to this method includes sequences with arbitrary lengths and the output comprises positions that tend to be TFBS. Through the application of previous methods along with a method that focuses on biological and simulated datasets, it is shown that this method achieves higher accuracy in discovering TFBSs.
Affiliation(s)
- Ghasem Mahdevar
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
58
Heyndrickx KS, Vandepoele K. Systematic identification of functional plant modules through the integration of complementary data sources. Plant Physiology 2012; 159:884-901. [PMID: 22589469 PMCID: PMC3387714 DOI: 10.1104/pp.112.196725] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Indexed: 05/02/2023]
Abstract
A major challenge is to unravel how genes interact and are regulated to exert specific biological functions. The integration of genome-wide functional genomics data, followed by the construction of gene networks, provides a powerful approach to identify functional gene modules. Large-scale expression data, functional gene annotations, experimental protein-protein interactions, and transcription factor-target interactions were integrated to delineate modules in Arabidopsis (Arabidopsis thaliana). The different experimental input data sets showed little overlap, demonstrating the advantage of combining multiple data types to study gene function and regulation. In the set of 1,563 modules covering 13,142 genes, most modules displayed strong coexpression, but functional and cis-regulatory coherence was less prevalent. Highly connected hub genes showed a significant enrichment toward embryo lethality and evidence for cross talk between different biological processes. Comparative analysis revealed that 58% of the modules showed conserved coexpression across multiple plants. Using module-based functional predictions, 5,562 genes were annotated, and an evaluation experiment disclosed that, based on 197 recently experimentally characterized genes, 38.1% of these functions could be inferred through the module context. Examples of confirmed genes of unknown function related to cell wall biogenesis, xylem and phloem pattern formation, cell cycle, hormone stimulus, and circadian rhythm highlight the potential to identify new gene functions. The module-based predictions offer new biological hypotheses for functionally unknown genes in Arabidopsis (1,701 genes) and six other plant species (43,621 genes). Furthermore, the inferred modules provide new insights into the conservation of coexpression and coregulation as well as a starting point for comparative functional annotation.
59
Zia A, Moses AM. Towards a theoretical understanding of false positives in DNA motif finding. BMC Bioinformatics 2012; 13:151. [PMID: 22738169 PMCID: PMC3436861 DOI: 10.1186/1471-2105-13-151] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 09/23/2011] [Accepted: 06/27/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Detection of false-positive motifs is one of the main causes of low performance in de novo DNA motif-finding methods. Despite the substantial algorithm development effort in this area, recent comprehensive benchmark studies revealed that the performance of DNA motif-finders leaves room for improvement in realistic scenarios. RESULTS Using large-deviations theory, we derive a remarkably simple relationship that describes the dependence of false positives on dataset size for the one-occurrence-per-sequence motif-finding problem. As expected, we predict that false positives can be reduced by decreasing the sequence length or by adding more sequences to the dataset. Interestingly, we find that the false-positive strength depends more strongly on the number of sequences in the dataset than it does on the sequence length, but that the dependence on the number of sequences diminishes, after which adding more sequences does not reduce the false-positive rate significantly. We test our theoretical predictions by applying four popular motif-finding algorithms that solve the one-occurrence-per-sequence problem (MEME, the Gibbs Sampler, Weeder, and GIMSAN) to simulated data that contain no motifs. We find that the dependence of false positives detected by these software tools on the motif-finding parameters is similar to that predicted by our formula. CONCLUSIONS We quantify the relationship between the sequence search space and motif-finding false positives. Based on the simple formula we derive, we provide a number of intuitive rules of thumb that may be used to enhance motif-finding results in practice. Our results provide a theoretical advance in an important problem in computational biology.
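The qualitative claim in this abstract, that chance "motifs" become rarer with shorter sequences or more sequences, can be explored with a small simulation on motif-free data. The Python sketch below is only a crude stand-in: it counts exact k-mers that happen to occur at least once in every random sequence, which is not the large-deviations formula from the paper and not how MEME, the Gibbs Sampler, Weeder, or GIMSAN score motifs. All function names and parameters are assumptions made for this example.

```python
import random

def random_sequences(n, length, seed=0):
    """Generate n uniform random DNA sequences (no planted motif)."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]

def kmers(seq, k):
    """Set of all k-mers occurring in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmers(seqs, k):
    """k-mers present at least once in every sequence.

    In motif-free data each such k-mer is a chance (false-positive) candidate
    under a one-occurrence-per-sequence model with exact matching.
    """
    shared = kmers(seqs[0], k)
    for s in seqs[1:]:
        shared &= kmers(s, k)
    return shared

if __name__ == "__main__":
    seqs = random_sequences(20, 200, seed=1)
    for n in (2, 5, 10, 20):
        print(n, "sequences:", len(shared_kmers(seqs[:n], 5)), "shared 5-mers")
    for length in (50, 100, 200):
        trimmed = [s[:length] for s in seqs[:3]]
        print("length", length, ":", len(shared_kmers(trimmed, 5)), "shared 5-mers")
```

Because the sequence subsets and prefixes are nested, the number of shared k-mers can only stay the same or fall as sequences are added or trimmed, which mirrors the direction of the effect described in the abstract without reproducing its quantitative form.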
Affiliation(s)
- Amin Zia
- Department of Cell & Systems Biology, University of Toronto, 25 Willcocks Street, Toronto, ON, M5S 3B2, Canada
- Alan M Moses
- Department of Cell & Systems Biology, University of Toronto, 25 Willcocks Street, Toronto, ON, M5S 3B2, Canada
60
The B-cell identity factor Pax5 regulates distinct transcriptional programmes in early and late B lymphopoiesis. EMBO J 2012; 31:3130-46. [PMID: 22669466 DOI: 10.1038/emboj.2012.155] [Citation(s) in RCA: 162] [Impact Index Per Article: 13.5] [Received: 12/11/2011] [Accepted: 04/05/2012] [Indexed: 12/15/2022] Open
Abstract
Pax5 controls the identity and development of B cells by repressing lineage-inappropriate genes and activating B-cell-specific genes. Here, we used genome-wide approaches to identify Pax5 target genes in pro-B and mature B cells. In these cell types, Pax5 bound to 40% of the cis-regulatory elements defined by mapping DNase I hypersensitive (DHS) sites, transcription start sites and histone modifications. Although Pax5 bound to 8000 target genes, it regulated only 4% of them in pro-B and mature B cells by inducing enhancers at activated genes and eliminating DHS sites at repressed genes. Pax5-regulated genes in pro-B cells account for 23% of all expression changes occurring between common lymphoid progenitors and committed pro-B cells, which identifies Pax5 as an important regulator of this developmental transition. Regulated Pax5 target genes minimally overlap in pro-B and mature B cells, which reflects massive expression changes between these cell types. Hence, Pax5 controls B-cell identity and function by regulating distinct target genes in early and late B lymphopoiesis.
61
Quimbaya M, Vandepoele K, Raspé E, Matthijs M, Dhondt S, Beemster GTS, Berx G, De Veylder L. Identification of putative cancer genes through data integration and comparative genomics between plants and humans. Cell Mol Life Sci 2012; 69:2041-55. [PMID: 22218400 PMCID: PMC11114995 DOI: 10.1007/s00018-011-0909-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 08/22/2011] [Revised: 12/11/2011] [Accepted: 12/13/2011] [Indexed: 11/27/2022]
Abstract
Coordination of cell division with growth and development is essential for the survival of organisms. Mistakes made during replication of genetic material can result in cell death, growth defects, or cancer. Because of the essential role of the molecular machinery that controls DNA replication and mitosis during development, its high degree of conservation among organisms is not surprising. Mammalian cell cycle genes have orthologues in plants, and vice versa. However, besides the many known and characterized proliferation genes, still undiscovered regulatory genes are expected to exist with conserved functions in plants and humans. Starting from genome-wide Arabidopsis thaliana microarray data, an integrative strategy based on coexpression, functional enrichment analysis, and cis-regulatory element annotation was combined with a comparative genomics approach between plants and humans to detect conserved cell cycle genes involved in DNA replication and/or DNA repair. With this systemic strategy, a set of 339 genes was identified as potentially conserved proliferation genes. Experimental analysis confirmed that 20 out of 40 selected genes had an impact on plant cell proliferation; likewise, an evolutionarily conserved role in cell division was corroborated for two human orthologues. Moreover, association analysis integrating Homo sapiens gene expression data with clinical information revealed that, for 45 genes, altered transcript levels and relapse risk clearly correlated. Our results illustrate how a systematic exploration of the A. thaliana genome can contribute to the experimental identification of new cell cycle regulators that might represent novel oncogenes or/and tumor suppressors.
Affiliation(s)
- Mauricio Quimbaya
- Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium
- Molecular and Cellular Oncology Unit, Department for Molecular Biomedical Research, VIB, Technologiepark 927, 9052 Gent, Belgium
- Department of Biomedical Molecular Biology, Ghent University, Technologiepark 927, 9052 Gent, Belgium
- Klaas Vandepoele
- Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium
- Eric Raspé
- Molecular and Cellular Oncology Unit, Department for Molecular Biomedical Research, VIB, Technologiepark 927, 9052 Gent, Belgium
- Department of Biomedical Molecular Biology, Ghent University, Technologiepark 927, 9052 Gent, Belgium
- Michiel Matthijs
- Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium
- Stijn Dhondt
- Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium
- Gerrit T. S. Beemster
- Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium
- Department of Biology, University of Antwerp, Groenenborgerlaan 171, 2020 Antwerpen, Belgium
- Geert Berx
- Molecular and Cellular Oncology Unit, Department for Molecular Biomedical Research, VIB, Technologiepark 927, 9052 Gent, Belgium
- Department of Biomedical Molecular Biology, Ghent University, Technologiepark 927, 9052 Gent, Belgium
- Lieven De Veylder
- Department of Plant Systems Biology, VIB, Technologiepark 927, 9052 Gent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium
62
Claeys M, Storms V, Sun H, Michoel T, Marchal K. MotifSuite: workflow for probabilistic motif detection and assessment. Bioinformatics 2012; 28:1931-2. [DOI: 10.1093/bioinformatics/bts293] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 11/15/2022] Open
63
Zambelli F, Pesole G, Pavesi G. Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief Bioinform 2012; 14:225-37. [PMID: 22517426 PMCID: PMC3603212 DOI: 10.1093/bib/bbs016] [Citation(s) in RCA: 93] [Impact Index Per Article: 7.8] [Indexed: 02/07/2023] Open
Abstract
Motif discovery has been one of the most widely studied problems in bioinformatics ever since genomic and protein sequences have been available. In particular, its application to the de novo prediction of putative over-represented transcription factor binding sites in nucleotide sequences has been, and still is, one of the most challenging flavors of the problem. Recently, novel experimental techniques like chromatin immunoprecipitation (ChIP) have been introduced, permitting the genome-wide identification of protein-DNA interactions. ChIP, applied to transcription factors and coupled with genome tiling arrays (ChIP on Chip) or next-generation sequencing technologies (ChIP-Seq) has opened new avenues in research, as well as posed new challenges to bioinformaticians developing algorithms and methods for motif discovery.
64
Regulation of ykrL (htpX) by Rok and YkrK, a novel type of regulator in Bacillus subtilis. J Bacteriol 2012; 194:2837-45. [PMID: 22447908 DOI: 10.1128/jb.00324-12] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/20/2022] Open
Abstract
Expression of ykrL of Bacillus subtilis, encoding a close homologue of the Escherichia coli membrane protein quality control protease HtpX, was shown to be upregulated under membrane protein overproduction stress. Using DNA affinity chromatography, two proteins were found to bind to the promoter region of ykrL: Rok, known as a repressor of competence and genes for extracytoplasmic functions, and YkrK, a novel type of regulator encoded by the gene adjacent to ykrL but divergently transcribed. Electrophoretic mobility shift assays showed Rok and YkrK binding to the ykrL promoter region as well as YkrK binding to the ykrK promoter region. Comparative bioinformatic analysis of the ykrL promoter regions in related Bacillus species revealed a consensus motif, which was demonstrated to be the binding site of YkrK. Deletion of rok and ykrK in a PykrL-gfp reporter strain showed that both proteins are repressors of ykrL expression. In addition, conditions which activated PykrL (membrane protein overproduction, dissipation of the membrane potential, and salt and phenol stress) point to the involvement of YkrL in membrane protein quality control.
|
65
|
Finding Transcription Factor Binding Motifs for Coregulated Genes by Combining Sequence Overrepresentation with Cross-Species Conservation. Journal of Probability and Statistics 2012. [DOI: 10.1155/2012/830575] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/17/2022] Open
Abstract
Novel computational methods for finding transcription factor binding motifs have long been sought due to the tedious work of experimentally identifying them. However, the current prevailing methods yield a large number of false positive predictions due to the short, variable nature of transcriptional factor binding sites (TFBSs). We propose here a method that combines sequence overrepresentation and cross-species sequence conservation to detect TFBSs in upstream regions of a given set of coregulated genes. We applied the method to 35 S. cerevisiae transcriptional factors with known DNA binding motifs (with the support of orthologous sequences from the genomes of S. mikatae, S. bayanus, and S. paradoxus), and the proposed method outperformed the single-genome-based motif finding methods MEME and AlignACE as well as the multiple-genome-based methods PHYME and Footprinter for the majority of these transcriptional factors. Compared with the prevailing motif finding software, our method has some advantages in finding transcriptional factor binding motifs for potential coregulated genes if the gene upstream sequences of multiple closely related species are available. Although we used yeast genomes to assess our method in this study, it might also be applied to other organisms if suitable related species are available and the upstream sequences of coregulated genes can be obtained for the multiple closely related species.
|
66
|
Aerts S. Computational strategies for the genome-wide identification of cis-regulatory elements and transcriptional targets. Curr Top Dev Biol 2012; 98:121-45. [PMID: 22305161 DOI: 10.1016/b978-0-12-386499-4.00005-7] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Indexed: 12/13/2022]
Abstract
Transcription factors (TFs) are key proteins that decode the information in our genome to express a precise and unique set of proteins and RNA molecules in each cell type in our body. These factors play a pivotal role in all biological processes, including the determination of a cell's fate during development and the maintenance of a cell's physiological function. To achieve this, a TF binds to specific DNA sequences in the noncoding part of the genome, recruits chromatin modifiers and cofactors, and directs the transcription initiation rate of its "target genes." Therefore, a key challenge in deciphering a transcriptional switch is to identify the direct target genes of the master regulators that control the switch, the cis-regulatory elements implementing (auto-)regulatory loops, and the target genes of all the TFs in the downstream regulatory network. A better knowledge of a TF's targetome during specification and differentiation of a particular cell type will generate mechanistic insight into its developmental program. Here, I review computational strategies and methods to predict transcriptional targets by genome-wide searches for TF binding sites using position weight matrices, motif clusters, phylogenetic footprinting, chromatin binding and accessibility data, enhancer classification, motif enrichment, and gene expression signatures.
Affiliation(s)
- Stein Aerts
- Laboratory of Computational Biology, Center for Human Genetics, Katholieke Universiteit Leuven, Leuven, Belgium
|
67
|
Kim T, Tyndel MS, Huang H, Sidhu SS, Bader GD, Gfeller D, Kim PM. MUSI: an integrated system for identifying multiple specificity from very large peptide or nucleic acid data sets. Nucleic Acids Res 2011; 40:e47. [PMID: 22210894 PMCID: PMC3315295 DOI: 10.1093/nar/gkr1294] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Indexed: 11/14/2022] Open
Abstract
Peptide recognition domains and transcription factors play crucial roles in cellular signaling. They bind linear stretches of amino acids or nucleotides, respectively, with high specificity. Experimental techniques that assess the binding specificity of these domains, such as microarrays or phage display, can retrieve thousands of distinct ligands, providing detailed insight into binding specificity. In particular, the advent of next-generation sequencing has recently increased the throughput of such methods by several orders of magnitude. These advances have helped reveal the presence of distinct binding specificity classes that co-exist within a set of ligands interacting with the same target. Here, we introduce a software system called MUSI that can rapidly analyze very large data sets of binding sequences to determine the relevant binding specificity patterns. Our pipeline provides two major advances. First, it can detect previously unrecognized multiple specificity patterns in any data set. Second, it offers integrated processing of very large data sets from next-generation sequencing machines. The results are visualized as multiple sequence logos describing the different binding preferences of the protein under investigation. We demonstrate the performance of MUSI by analyzing recent phage display data for human SH3 domains as well as microarray data for mouse transcription factors.
Affiliation(s)
- Taehyung Kim
- The Donnelly Centre, Banting and Best Department of Medical Research, University of Toronto, Toronto, ON, Canada M5S 3E1
|
68
|
Technau M, Knispel M, Roth S. Molecular mechanisms of EGF signaling-dependent regulation of pipe, a gene crucial for dorsoventral axis formation in Drosophila. Dev Genes Evol 2011; 222:1-17. [PMID: 22198544 PMCID: PMC3291829 DOI: 10.1007/s00427-011-0384-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 11/08/2011] [Accepted: 11/29/2011] [Indexed: 01/28/2023]
Abstract
During Drosophila oogenesis the expression of the sulfotransferase Pipe in ventral follicle cells is crucial for dorsoventral axis formation. Pipe modifies proteins that are incorporated in the ventral eggshell and activate Toll signaling which in turn initiates embryonic dorsoventral patterning. Ventral pipe expression is the result of an oocyte-derived EGF signal which down-regulates pipe in dorsal follicle cells. The analysis of mutant follicle cell clones reveals that none of the transcription factors known to act downstream of EGF signaling in Drosophila is required or sufficient for pipe regulation. However, the pipe cis-regulatory region harbors a 31-bp element which is essential for pipe repression, and ovarian extracts contain a protein that binds this element. Thus, EGF signaling does not act by down-regulating an activator of pipe as previously suggested but rather by activating a repressor. Surprisingly, this repressor acts independent of the common co-repressors Groucho or CtBP.
Affiliation(s)
- Martin Technau
- Institute for Developmental Biology, Biocenter, University of Cologne, Zuelpicher Straße 47b, 50674, Cologne, Germany
|
69
|
Gruel J, LeBorgne M, LeMeur N, Théret N. Simple Shared Motifs (SSM) in conserved region of promoters: a new approach to identify co-regulation patterns. BMC Bioinformatics 2011; 12:365. [PMID: 21910886 PMCID: PMC3215511 DOI: 10.1186/1471-2105-12-365] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/30/2010] [Accepted: 09/12/2011] [Indexed: 01/07/2023] Open
Abstract
Background Regulation of gene expression plays a pivotal role in cellular functions. However, understanding the dynamics of transcription remains a challenging task. A host of computational approaches have been developed to identify regulatory motifs, mainly based on the recognition of DNA sequences for transcription factor binding sites. Recent integration of additional data from genomic analyses or phylogenetic footprinting has significantly improved these methods. Results Here, we propose a different approach based on the compilation of Simple Shared Motifs (SSM), groups of sequences defined by their length and similarity and present in conserved sequences of gene promoters. We developed an original algorithm to search and count SSM in pairs of genes. An exceptional number of SSM is considered as a common regulatory pattern. The SSM approach is applied to a sample set of genes and validated using functional gene-set enrichment analyses. We demonstrate that the SSM approach selects genes that are over-represented in specific biological categories (Ontology and Pathways) and are enriched in co-expressed genes. Finally we show that genes co-expressed in the same tissue or involved in the same biological pathway have increased SSM values. Conclusions Using unbiased clustering of genes, Simple Shared Motifs analysis constitutes an original contribution to provide a clearer definition of expression networks.
Affiliation(s)
- Jérémy Gruel
- EA 4427 SeRAIC IFR140, Université de Rennes 1, 2 avenue du Pr. Léon Bernard, Rennes 35043, France.
|
70
|
Yan R, Boutros PC, Jurisica I. A tree-based approach for motif discovery and sequence classification. Bioinformatics 2011; 27:2054-61. [PMID: 21685048 DOI: 10.1093/bioinformatics/btr353] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Pattern discovery algorithms are widely used for the analysis of DNA and protein sequences. Most algorithms have been designed to find overrepresented motifs in sparse datasets of long sequences, and ignore most positional information. We introduce an algorithm optimized to exploit spatial information in sparse-but-populous datasets. RESULTS Our algorithm Tree-based Weighted-Position Pattern Discovery and Classification (T-WPPDC) supports both unsupervised pattern discovery and supervised sequence classification. It identifies positionally enriched patterns using the Kullback-Leibler distance between foreground and background sequences at each position. This spatial information is used to discover positionally important patterns. T-WPPDC then uses a scoring function to discriminate different biological classes. We validated T-WPPDC on an important biological problem: prediction of single nucleotide polymorphisms (SNPs) from flanking sequence. We evaluated 672 separate experiments on 120 datasets derived from multiple species. T-WPPDC outperformed other pattern discovery methods and was comparable to the supervised machine learning algorithms. The algorithm is computationally efficient and largely insensitive to dataset size. It allows arbitrary parameterization and is embarrassingly parallelizable. CONCLUSIONS T-WPPDC is a minimally parameterized algorithm for both pattern discovery and sequence classification that directly incorporates positional information. We use it to confirm the predictability of SNPs from flanking sequence, and show that positional information is a key to this biological problem. AVAILABILITY The algorithm, code and data are available at: http://www.cs.utoronto.ca/~juris/data/TWPPDC
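The positional scoring idea mentioned in this abstract (a Kullback-Leibler distance between foreground and background sequences at each position) can be illustrated with a minimal sketch. This is not the T-WPPDC implementation: the function names, the pseudocount, the log base, and the toy sequences are all assumptions made for illustration, and the tree construction and classification steps are not shown.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def column_freqs(seqs, pos, pseudocount=1.0):
    """Smoothed base frequencies at one alignment position (hypothetical helper)."""
    counts = Counter(s[pos] for s in seqs)
    total = len(seqs) + pseudocount * len(ALPHABET)
    return {b: (counts.get(b, 0) + pseudocount) / total for b in ALPHABET}


def positional_kl(foreground, background, pseudocount=1.0):
    """KL divergence (in bits) of foreground vs. background at each position.

    Assumes all sequences have equal length and use only A, C, G, T.
    """
    length = len(foreground[0])
    scores = []
    for pos in range(length):
        p = column_freqs(foreground, pos, pseudocount)
        q = column_freqs(background, pos, pseudocount)
        scores.append(sum(p[b] * math.log2(p[b] / q[b]) for b in ALPHABET))
    return scores


# Toy data: the foreground is conserved at positions 0-3 and variable at position 4.
foreground = ["ACGTA", "ACGTC", "ACGTG", "ACGTT"]
background = ["ACGTA", "CATGC", "GTACG", "TGCAT"]
scores = positional_kl(foreground, background)
```

Positions with a high score are those where the foreground composition departs most from the background, which is the sense in which such a measure can flag positionally enriched patterns.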
Affiliation(s)
- Rui Yan
- Department of Computer Science, University of Toronto, Toronto, Canada M5S 3G4.
|
71
|
Pilalis E, Chatziioannou AA, Grigoroudis AI, Panagiotidis CA, Kolisis FN, Kyriakidis DA. Escherichia coli genome-wide promoter analysis: identification of additional AtoC binding target elements. BMC Genomics 2011; 12:238. [PMID: 21569465 PMCID: PMC3118216 DOI: 10.1186/1471-2164-12-238] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/04/2010] [Accepted: 05/13/2011] [Indexed: 11/16/2022] Open
Abstract
Background Studies on bacterial signal transduction systems have revealed complex networks of functional interactions, where the response regulators play a pivotal role. The AtoSC system of E. coli activates the expression of atoDAEB operon genes, and the subsequent catabolism of short-chain fatty acids, upon acetoacetate induction. Transcriptome and phenotypic analyses suggested that atoSC is also involved in several other cellular activities, although we have recently reported a palindromic repeat within the atoDAEB promoter as the single, cis-regulatory binding site of the AtoC response regulator. In this work, we used a computational approach to explore the presence of yet unidentified AtoC binding sites within other parts of the E. coli genome. Results Through the implementation of a computational de novo motif detection workflow, a set of candidate motifs was generated, representing putative AtoC binding targets within the E. coli genome. In order to assess the biological relevance of the motifs and to select for experimental validation of those sequences related robustly with distinct cellular functions, we implemented a novel approach that applies Gene Ontology Term Analysis to the motif hits and selected those that were qualified through this procedure. The computational results were validated using Chromatin Immunoprecipitation assays to assess the in vivo binding of AtoC to the predicted sites. This process verified twenty-two additional AtoC binding sites, located not only within intergenic regions, but also within gene-encoding sequences. Conclusions This study, by tracing a number of putative AtoC binding sites, has indicated an AtoC-related cross-regulatory function. This highlights the significance of computational genome-wide approaches in elucidating complex patterns of bacterial cell regulation.
Affiliation(s)
- Eleftherios Pilalis
- Institute of Biological Research and Biotechnology, National Hellenic Research Foundation, Athens, Greece
|
72
|
Scherbak N, Ala-Häivälä A, Brosché M, Böwer N, Strid H, Gittins JR, Grahn E, Eriksson LA, Strid Å. The pea SAD short-chain dehydrogenase/reductase: quinone reduction, tissue distribution, and heterologous expression. Plant Physiology 2011; 155:1839-50. [PMID: 21343423 PMCID: PMC3091106 DOI: 10.1104/pp.111.173336] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/26/2011] [Accepted: 02/20/2011] [Indexed: 05/04/2023]
Abstract
The pea (Pisum sativum) tetrameric short-chain alcohol dehydrogenase-like protein (SAD) family consists of at least three highly similar members (SAD-A, -B, and -C). According to mRNA data, environmental stimuli induce SAD expression. The aim of this study was to characterize the SAD proteins by examining their catalytic function, distribution in pea, and induction in different tissues. In enzyme activity assays using a range of potential substrates, the SAD-C enzyme was shown to reduce one- or two-ring-membered quinones lacking long hydrophobic hydrocarbon tails. Immunological assays using a specific antiserum against the protein demonstrated that different tissues and cell types contain small amounts of SAD protein that was predominantly located within epidermal or subepidermal cells and around vascular tissue. Particularly high local concentrations were observed in the protoderm of the seed cotyledonary axis. Two bow-shaped rows of cells in the ovary and the placental surface facing the ovule also exhibited considerable SAD staining. Ultraviolet-B irradiation led to increased staining in epidermal and subepidermal cells of leaves and stems. The different localization patterns of SAD suggest functions both in development and in responses to environmental stimuli. Finally, the pea SAD-C promoter was shown to confer heterologous wound-induced expression in Arabidopsis (Arabidopsis thaliana), which confirmed that the inducibility of its expression is regulated at the transcriptional level.
Affiliation(s)
- Åke Strid
- Akademin för Naturvetenskap och Teknik och Centrum för Livsvetenskap (N.S., A.A.-H., N.B., E.G., L.A.E., Å.S.) and Hälsoakademin och Centrum för Livsvetenskap (H.S.), Örebro Universitet, S–70182 Örebro, Sweden; Biokemi och Biofysik, Institutionen för Kemi, Göteborg Universitet, S–405 30 Göteborg, Sweden (M.B., J.R.G.); Division of Plant Biology, Department of Biosciences, University of Helsinki, FIN–00014 Helsinki, Finland (M.B.); Institute of Technology, University of Tartu, Tartu 50411, Estonia (M.B.)
|
73
|
Reeves WM, Lynch TJ, Mobin R, Finkelstein RR. Direct targets of the transcription factors ABA-Insensitive(ABI)4 and ABI5 reveal synergistic action by ABI4 and several bZIP ABA response factors. Plant Molecular Biology 2011; 75:347-63. [PMID: 21243515 PMCID: PMC3044226 DOI: 10.1007/s11103-011-9733-9] [Citation(s) in RCA: 102] [Impact Index Per Article: 7.8] [Received: 08/24/2010] [Accepted: 01/03/2011] [Indexed: 05/19/2023]
Abstract
The plant hormone abscisic acid (ABA) is a key regulator of seed development. In addition to promoting seed maturation, ABA inhibits seed germination and seedling growth. Many components involved in ABA response have been identified, including the transcription factors ABA insensitive (ABI)4 and ABI5. The genes encoding these factors are expressed predominantly in developing and mature seeds, and are positive regulators of ABA mediated inhibition of seed germination and growth. The direct effects of ABI4 and ABI5 in ABA response remain largely undefined. To address this question, plants over-expressing ABI4 or ABI5 were used to allow identification of direct transcriptional targets. Ectopically expressed ABI4 and ABI5 conferred ABA-dependent induction of slightly over 100 genes in 11 day old plants. In addition to effector genes involved in seed maturation and reserve storage, several signaling proteins and transcription factors were identified as targets of ABI4 and/or ABI5. Although only 12% of the ABA- and ABI-dependent transcriptional targets were induced by both ABI factors in 11 day old plants, 40% of those normally expressed in seeds had reduced transcript levels in both abi4 and abi5 mutants. Surprisingly, many of the ABI4 transcriptional targets do not contain the previously characterized ABI4 binding motifs, the CE1 or S box, in their promoters, but some of these interact with ABI4 in electrophoretic mobility shift assays, suggesting that sequence recognition by ABI4 may be more flexible than known canonical sequences. Yeast one-hybrid assays demonstrated synergistic action of ABI4 with ABI5 or related bZIP factors in regulating these promoters, and mutant analyses showed that ABI4 and these bZIPs share some functions in plants.
Affiliation(s)
- Wendy M. Reeves
- Molecular, Cellular, and Developmental Biology Department, University of California at Santa Barbara, Santa Barbara, CA 93106 USA
- Tim J. Lynch
- Molecular, Cellular, and Developmental Biology Department, University of California at Santa Barbara, Santa Barbara, CA 93106 USA
- Raisa Mobin
- Molecular, Cellular, and Developmental Biology Department, University of California at Santa Barbara, Santa Barbara, CA 93106 USA
- Ruth R. Finkelstein
- Molecular, Cellular, and Developmental Biology Department, University of California at Santa Barbara, Santa Barbara, CA 93106 USA
|
74
|
9p21 DNA variants associated with coronary artery disease impair interferon-γ signalling response. Nature 2011; 470:264-8. [PMID: 21307941 PMCID: PMC3079517 DOI: 10.1038/nature09753] [Citation(s) in RCA: 476] [Impact Index Per Article: 36.6] [Received: 11/12/2009] [Accepted: 12/16/2010] [Indexed: 02/07/2023]
Abstract
Genome-wide association studies (GWAS) have identified SNPs in the 9p21 gene desert associated with coronary artery disease (CAD) [1–4] and type 2 diabetes (T2D) [5–7]. Despite evidence for a role of the associated interval in neighboring gene regulation [8–10], the biological underpinnings of these genetic associations to CAD or T2D have not yet been explained. Here we identify 33 enhancers in 9p21; the interval is the second densest gene desert for predicted enhancers and 6 times denser than the whole genome (p < 6.55 × 10⁻³³). The CAD risk alleles of SNPs rs10811656/rs10757278 are located in one of these enhancers and disrupt a binding site for STAT1. Lymphoblastoid cell lines (LCL) homozygous for the CAD risk haplotype exhibit no binding of STAT1, and in LCL homozygous for the CAD non-risk haplotype binding of STAT1 inhibits CDKN2BAS expression, which is reversed by siRNA knock-down of STAT1. Using a new, open-ended approach to detect long-distance interactions (3D-DSL), we find that in human vascular endothelium cells (HUVEC) the enhancer interval containing the CAD locus physically interacts with the CDKN2A/B locus, the MTAP gene and an interval downstream of IFNA21. In HUVEC, IFNγ activation strongly affects the structure of the chromatin and the transcriptional regulation in the 9p21 locus, including STAT1 binding, long-range enhancer interactions and altered expression of neighboring genes. Our findings establish a link between CAD genetic susceptibility and the response to inflammatory signaling in a vascular cell type and thus demonstrate the utility of GWAS findings to direct studies to novel genomic loci and biological processes important for disease etiology.
|
75
|
Yamamoto YY, Yoshioka Y, Hyakumachi M, Maruyama K, Yamaguchi-Shinozaki K, Tokizawa M, Koyama H. Prediction of transcriptional regulatory elements for plant hormone responses based on microarray data. BMC Plant Biology 2011; 11:39. [PMID: 21349196 PMCID: PMC3058078 DOI: 10.1186/1471-2229-11-39] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 11/30/2010] [Accepted: 02/24/2011] [Indexed: 05/20/2023]
Abstract
BACKGROUND Phytohormones organize plant development and environmental adaptation through cell-to-cell signal transduction, and their action involves transcriptional activation. Recent international efforts to establish and maintain public databases of Arabidopsis microarray data have enabled the use of these data in the analysis of various phytohormone responses, providing genome-wide identification of promoters targeted by phytohormones. RESULTS We used such microarray data to predict cis-regulatory elements with an octamer-based approach. Our test prediction for the drought-responsive RD29A promoter, made with the aid of microarray data for the response to drought, ABA and overexpression of DREB1A, a key regulator of cold and drought response, gave reasonable results that fit the experimentally identified regulatory elements. Following this success, we expanded the prediction to various phytohormone responses, including those for abscisic acid, auxin, cytokinin, ethylene, brassinosteroid, jasmonic acid, and salicylic acid, as well as for hydrogen peroxide, drought and DREB1A overexpression. In total, 622 promoters that are activated by phytohormones were subjected to the prediction. In addition, we have assigned putative functions to 53 octamers of the Regulatory Element Group (REG) that have been extracted as position-dependent cis-regulatory elements on the basis of their preferential appearance in the promoter region. CONCLUSIONS Our prediction of Arabidopsis cis-regulatory elements for phytohormone responses provides guidance for experimental analysis of promoters to reveal the basis of the transcriptional network of phytohormone responses.
Affiliation(s)
- Yoshiharu Y Yamamoto
- Faculty of Applied Biological Sciences, Gifu University, Yanagido 1-1, Gifu City, Gifu 501-1193, Japan
- Yohei Yoshioka
- Faculty of Applied Biological Sciences, Gifu University, Yanagido 1-1, Gifu City, Gifu 501-1193, Japan
- Mitsuro Hyakumachi
- Faculty of Applied Biological Sciences, Gifu University, Yanagido 1-1, Gifu City, Gifu 501-1193, Japan
- Kyonoshin Maruyama
- Japan International Research Center for Agricultural Sciences, Ohwashi 1-1, Tsukuba, Ibaraki 305-8686, Japan
- Kazuko Yamaguchi-Shinozaki
- Japan International Research Center for Agricultural Sciences, Ohwashi 1-1, Tsukuba, Ibaraki 305-8686, Japan
- Mutsutomo Tokizawa
- Faculty of Applied Biological Sciences, Gifu University, Yanagido 1-1, Gifu City, Gifu 501-1193, Japan
- Hiroyuki Koyama
- Faculty of Applied Biological Sciences, Gifu University, Yanagido 1-1, Gifu City, Gifu 501-1193, Japan
|
76
|
Hu L, Liang W, Yin C, Cui X, Zong J, Wang X, Hu J, Zhang D. Rice MADS3 regulates ROS homeostasis during late anther development. The Plant Cell 2011; 23:515-33. [PMID: 21297036 PMCID: PMC3077785 DOI: 10.1105/tpc.110.074369] [Citation(s) in RCA: 200] [Impact Index Per Article: 15.4] [Received: 01/30/2010] [Revised: 01/06/2011] [Accepted: 01/19/2011] [Indexed: 05/17/2023]
Abstract
The rice (Oryza sativa) floral homeotic C-class gene, MADS3, was previously shown to be required for stamen identity determination during early flower development. Here, we describe a role for MADS3 in regulating late anther development and pollen formation. Consistent with this role, MADS3 is highly expressed in the tapetum and microspores during late anther development, and a newly identified MADS3 mutant allele, mads3-4, displays defective anther walls, aborted microspores, and complete male sterility. During late anther development, mads3-4 exhibits oxidative stress-related phenotypes. Microarray analysis revealed expression level changes in many genes in mads3-4 anthers. Some of these genes encode proteins involved in reactive oxygen species (ROS) homeostasis; among them is MT-1-4b, which encodes a type 1 small Cys-rich and metal binding protein. In vivo and in vitro assays showed that MADS3 is associated with the promoter of MT-1-4b, and recombinant MT-1-4b has superoxide anion and hydroxyl radical scavenging activity. Reducing the expression of MT-1-4b causes decreased pollen fertility and an increased level of superoxide anion in transgenic plants. Our findings suggest that MADS3 is a key transcriptional regulator that functions in rice male reproductive development, at least in part, by modulating ROS levels through MT-1-4b.
Affiliation(s)
- Lifang Hu
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Bio-X Research Center, Key Laboratory of Genetics and Development and Neuropsychiatric Diseases, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, China
- Wanqi Liang
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Changsong Yin
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Xiao Cui
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Xing Wang
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Jianping Hu
- Department of Energy Plant Research Laboratory, Michigan State University, East Lansing, Michigan 48824
- Dabing Zhang
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Bio-X Research Center, Key Laboratory of Genetics and Development and Neuropsychiatric Diseases, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, China
- Address correspondence to
|
77
|
Henne KL, Wan XF, Wei W, Thompson DK. SO2426 is a positive regulator of siderophore expression in Shewanella oneidensis MR-1. BMC Microbiol 2011; 11:125. [PMID: 21624143 PMCID: PMC3127752 DOI: 10.1186/1471-2180-11-125] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2010] [Accepted: 05/31/2011] [Indexed: 11/14/2022] Open
Abstract
Background The Shewanella oneidensis MR-1 genome encodes a predicted orphan DNA-binding response regulator, SO2426. Previous studies with a SO2426-deficient MR-1 strain suggested a putative functional role for SO2426 in the regulation of iron acquisition genes, in particular, the siderophore (hydroxamate) biosynthesis operon so3030-3031-3032. To further investigate the functional role of SO2426 in iron homeostasis, we employed computational strategies to identify putative gene targets of SO2426 regulation and biochemical approaches to validate the participation of SO2426 in the control of siderophore biosynthesis in S. oneidensis MR-1. Results In silico prediction analyses revealed a single 14-bp consensus motif consisting of two tandem conserved pentamers (5'-CAAAA-3') in the upstream regulatory regions of 46 genes, which were shown previously to be significantly down-regulated in a so2426 deletion mutant. These genes included so3030 and so3032, members of an annotated siderophore biosynthetic operon in MR-1. Electrophoretic mobility shift assays demonstrated that the SO2426 protein binds to its motif in the operator region of so3030. A "short" form of SO2426, beginning with a methionine at position 11 (M11) of the originally annotated coding sequence for SO2426, was also functional in binding to its consensus motif, confirming previous 5' RACE results that suggested that amino acid M11 is the actual translation start codon for SO2426. Alignment of SO2426 orthologs from all sequenced Shewanella spp. showed a high degree of sequence conservation beginning at M11, in addition to conservation of a putative aspartyl phosphorylation residue and the helix-turn-helix (HTH) DNA-binding domain. Finally, the so2426 deletion mutant was unable to synthesize siderophores at wild-type rates upon exposure to the iron chelator 2,2'-dipyridyl. 
Conclusions Collectively, these data support the functional characterization of SO2426 as a positive regulator of siderophore-mediated iron acquisition and provide the first insight into a coordinate program of multiple regulatory schemes controlling iron homeostasis in S. oneidensis MR-1.
|
78
|
Fang X, Yu W, Li L, Shao J, Zhao N, Chen Q, Ye Z, Lin SC, Zheng S, Lin B. ChIP-seq and functional analysis of the SOX2 gene in colorectal cancers. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2010; 14:369-84. [PMID: 20726797 DOI: 10.1089/omi.2010.0053] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
SOX2 is an HMG box-containing transcription factor that has been implicated in various types of cancer, but its role in colorectal cancers (CRC) has not been studied. Here we show, using immunohistochemical staining and RT-PCR, that SOX2 is overexpressed in CRC tissues compared with normal adjacent tissues. We also observed increased SOX2 expression in the nucleus of colorectal cancer tissues (46%, 14/30 cases vs. 7%, 2/30 adjacent tissues). Furthermore, knockdown of SOX2 in SW620 colorectal cancer cells decreased their growth rates both in vitro and in vivo in xenograft models. ChIP-seq analysis of SOX2 revealed a consensus sequence of wwTGywTT. An integrated expression profiling and ChIP-seq analysis shows that SOX2 is involved in the BMP signaling pathway, steroid metabolic process, histone modifications, and many receptor-mediated signaling pathways such as IGF1R and ITPR2 (inositol 1,4,5-triphosphate receptor, type 2).
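The reported consensus wwTGywTT is written in IUPAC degenerate nucleotide codes (w = A or T, y = C or T). As a minimal sketch, and not the authors' ChIP-seq pipeline, such a consensus can be expanded into a regular expression and used to scan one strand of a sequence. The function names and the example sequence below are hypothetical, chosen only for illustration.

```python
import re

# Standard IUPAC degenerate nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus: str) -> str:
    """Expand an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[ch] for ch in consensus.upper())


def find_sites(consensus: str, sequence: str):
    """Return (start, matched_site) pairs on the given strand only.

    A lookahead is used so that overlapping matches are also reported.
    The reverse strand is not scanned in this sketch.
    """
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the paper.
    print(consensus_to_regex("wwTGywTT"))
    print(find_sites("wwTGywTT", "GGATTGTATTCC"))
```

A regex scan treats every allowed base as equally likely; a position weight matrix would be the usual choice when base preferences at each position matter.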
Affiliation(s)
- Xuefeng Fang
- Cancer Institute (Key Laboratory of Cancer Prevention and Intervention, China National Ministry of Education), The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, People's Republic of China
|
79
|
De Coninck BMA, Sels J, Venmans E, Thys W, Goderis IJWM, Carron D, Delauré SL, Cammue BPA, De Bolle MFC, Mathys J. Arabidopsis thaliana plant defensin AtPDF1.1 is involved in the plant response to biotic stress. THE NEW PHYTOLOGIST 2010; 187:1075-1088. [PMID: 20561213 DOI: 10.1111/j.1469-8137.2010.03326.x] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Previously, it was shown that the Arabidopsis thaliana plant defensins AtPDF1.1 (At1g75830) and AtPDF1.2a (At5g44420) exert in vitro antimicrobial properties and that their corresponding genes are expressed in seeds and induced in leaves upon pathogen attack, respectively. In this study, the expression profile of both AtPDF1.1 and AtPDF1.2a is analysed in wild-type plants upon different stress-related treatments, and the effect of modulating their expression in transgenic plants is examined in both host and nonhost resistance. AtPDF1.1, which was originally considered to be seed-specific, is demonstrated to be locally induced in leaves upon fungal attack and exhibits an expression profile distinct from that of AtPDF1.2a, a gene frequently used as a marker for the ethylene/jasmonate-mediated signaling pathway. Transgenic plants with modulated AtPDF1.1 or AtPDF1.2a gene expression show no altered phenotype upon Botrytis cinerea inoculation. However, constitutive overexpression of AtPDF1.1 in A. thaliana leads to a reduction in symptoms caused by the nonhost Cercospora beticola, which causes non-spreading spots on A. thaliana leaves. These results indicate that AtPDF1.1 and AtPDF1.2a clearly differ in their expression profile and functionality in planta. This emphasizes the additional level of complexity and fine-tuning within the highly redundant plant defensin genes in A. thaliana.
Affiliation(s)
- Wannes Thys
- Center of Microbial and Plant Genetics, Katholieke Universiteit Leuven, Kasteelpark Arenberg 20, B-3001 Heverlee, Belgium
- Inge J W M Goderis
- Center of Microbial and Plant Genetics, Katholieke Universiteit Leuven, Kasteelpark Arenberg 20, B-3001 Heverlee, Belgium
- Delphine Carron
- Center of Microbial and Plant Genetics, Katholieke Universiteit Leuven, Kasteelpark Arenberg 20, B-3001 Heverlee, Belgium
- Stijn L Delauré
- Center of Microbial and Plant Genetics, Katholieke Universiteit Leuven, Kasteelpark Arenberg 20, B-3001 Heverlee, Belgium
- Bruno P A Cammue
- Center of Microbial and Plant Genetics, Katholieke Universiteit Leuven, Kasteelpark Arenberg 20, B-3001 Heverlee, Belgium
|
80
|
Fang X, Yu W, Li L, Shao J, Zhao N, Chen Q, Ye Z, Lin SC, Zheng S, Lin B. ChIP-seq and Functional Analysis of the SOX2 Gene in Colorectal Cancers. OMICS : A JOURNAL OF INTEGRATIVE BIOLOGY 2010:121207092956007. [PMID: 20726776 DOI: 10.1089/omi.2010.0026] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
SOX2 is a high mobility group (HMG) box-containing transcription factor that has been implicated in various types of cancer, but its role in colorectal cancers (CRC) has not been studied. Here we show, using immunohistochemical staining and RT-PCR, that SOX2 is overexpressed in CRC tissues compared with normal adjacent tissues. We also observed increased SOX2 expression in the nucleus of colorectal cancer tissues (46%, 14/30 cases vs. 7%, 2/30 adjacent tissues). Furthermore, knockdown of SOX2 in SW620 colorectal cancer cells decreased their growth rates both in vitro and in vivo in xenograft models. ChIP-seq analysis of SOX2 revealed a consensus sequence of wwTGywTT. An integrated expression profiling and ChIP-seq analysis shows that SOX2 is involved in the BMP signaling pathway, steroid metabolic process, histone modifications, and many receptor-mediated signaling pathways such as IGF1R and ITPR2 (inositol 1,4,5-triphosphate receptor, type 2).
Affiliation(s)
- Xuefeng Fang
- Cancer Institute (Key Laboratory of Cancer Prevention and Intervention, China National Ministry of Education), The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, People's Republic of China
|
81
|
|
82
|
Tomljenovic-Berube AM, Mulder DT, Whiteside MD, Brinkman FSL, Coombes BK. Identification of the regulatory logic controlling Salmonella pathoadaptation by the SsrA-SsrB two-component system. PLoS Genet 2010; 6:e1000875. [PMID: 20300643 PMCID: PMC2837388 DOI: 10.1371/journal.pgen.1000875] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2009] [Accepted: 02/06/2010] [Indexed: 11/19/2022] Open
Abstract
Sequence data from the past decade has laid bare the significance of horizontal gene transfer in creating genetic diversity in the bacterial world. Regulatory evolution, in which non-coding DNA is mutated to create new regulatory nodes, also contributes to this diversity to allow niche adaptation and the evolution of pathogenesis. To survive in the host environment, Salmonella enterica uses a type III secretion system and effector proteins, which are activated by the SsrA-SsrB two-component system in response to the host environment. To better understand the phenomenon of regulatory evolution in S. enterica, we defined the SsrB regulon and asked how this transcription factor interacts with the cis-regulatory region of target genes. Using ChIP-on-chip, cDNA hybridization, and comparative genomics analyses, we describe the SsrB-dependent regulon of ancestral and horizontally acquired genes. Further, we used a genetic screen and computational analyses integrating experimental data from S. enterica and sequence data from an orthologous regulatory system in the insect endosymbiont, Sodalis glossinidius, to identify the conserved yet flexible palindrome sequence that defines DNA recognition by SsrB. Mutational analysis of a representative promoter validated this palindrome as the minimal architecture needed for regulatory input by SsrB. These data provide a high-resolution map of a regulatory network and the underlying logic enabling pathogen adaptation to a host. All organisms have a means to control gene expression ensuring correct spatiotemporal deployment of gene products. In bacteria, gene control presents a challenge because one species can reside in multiple niches, requiring them to coordinate gene expression with environmental sensing. Also, widespread acquisition of DNA by horizontal gene transfer demands a mechanism to integrate new genes into existing regulatory circuitry. 
The environmental awareness issue can be controlled using two-component regulatory systems that connect environmental cues to transcription factor activation, whereas the integration problem can be resolved using DNA regulatory evolution to create new regulatory connections between genes. The evolutionary significance of regulatory evolution for host adaptation is not fully known. We studied the convergence of environmental sensing and genetic networks by examining how the Salmonella enterica SsrA-SsrB two-component system, activated in response to host cues, has integrated ancestral and acquired genes into a common regulon. We identified a palindrome as the major element apportioning SsrB on the chromosome. SsrB binding sites have been selected to co-regulate a gene program involved in pathogenic adaptation of Salmonella to its host. In addition, our results indicate that promoter architecture emerging from SsrB-dependent regulatory evolution may support both mutualistic and parasitic bacteria-host relationships.
Affiliation(s)
- Ana M. Tomljenovic-Berube
- Michael G. DeGroote Institute for Infectious Disease Research and the Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Canada
- David T. Mulder
- Michael G. DeGroote Institute for Infectious Disease Research and the Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Canada
- Matthew D. Whiteside
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, Canada
- Fiona S. L. Brinkman
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, Canada
- Brian K. Coombes
- Michael G. DeGroote Institute for Infectious Disease Research and the Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Canada
|
83
|
Abstract
MOTIVATION Discovery of nucleotide motifs that are localized with respect to a certain biological landmark is important in several applications, such as in regulatory sequences flanking the transcription start site, in the neighborhood of known transcription factor binding sites, and in transcription factor binding regions discovered by massively parallel sequencing (ChIP-Seq). RESULTS We report an algorithm called LocalMotif to discover such localized motifs. The algorithm is based on a novel scoring function, called spatial confinement score, which can determine the exact interval of localization of a motif. This score is combined with other existing scoring measures including over-representation and relative entropy to determine the overall prominence of the motif. The approach successfully discovers biologically relevant motifs and their intervals of localization in scenarios where the motifs cannot be discovered by general motif finding tools. It is especially useful for discovering multiple co-localized motifs in a set of regulatory sequences, such as those identified by ChIP-Seq. AVAILABILITY AND IMPLEMENTATION The LocalMotif software is available at http://www.comp.nus.edu.sg/~bioinfo/LocalMotif.
Affiliation(s)
- Vipin Narang
- Department of Computer Science, National University of Singapore, Singapore
|
84
|
Sleumer MC, Mah AK, Baillie DL, Jones SJM. Conserved elements associated with ribosomal genes and their trans-splice acceptor sites in Caenorhabditis elegans. Nucleic Acids Res 2010; 38:2990-3004. [PMID: 20100800 PMCID: PMC2875031 DOI: 10.1093/nar/gkq003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open
Abstract
The recent publication of the Caenorhabditis elegans cisRED database has provided an extensive catalog of upstream elements that are conserved between nematode genomes. We have performed a secondary analysis to determine which subsequences of the cisRED motifs are found in multiple locations throughout the C. elegans genome. We used the word-counting motif discovery algorithm DME to form the motifs into groups based on sequence similarity. We then examined the genes associated with each motif group using DAVID and Ontologizer to determine which groups are associated with genes that also have significant functional associations in the Gene Ontology and other gene annotation sources. Of the 3265 motif groups formed, 612 (19%) had significant functional associations with respect to GO terms. Eight of the first 20 motif groups based on frequent dodecamers among the cisRED motif sequences were specifically associated with ribosomal protein genes; two of these were similar to mouse EBP-45, rat HNF3-family and Drosophila Zeste transcription factor binding sites. Additionally, seven motif groups were extensions of the canonical C. elegans trans-splice acceptor site. One motif group was tested for regulatory function in a series of green fluorescent protein expression experiments and was shown to be involved in pharyngeal expression.
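The word-counting step behind "frequent dodecamers" can be illustrated with a plain k-mer tally. This is only a sketch of the general idea of word counting; it is not the DME algorithm, and the function name and toy sequences are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k=12):
    """Count every overlapping k-mer (k=12 gives dodecamers) across sequences.

    Sequences shorter than k contribute nothing.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    # Toy input with k=3 so the counts are easy to check by hand.
    toy = ["ACGTACG", "TACGT"]
    print(count_kmers(toy, k=3).most_common(3))
```

Grouping the most frequent words by sequence similarity, as the abstract describes, would be a separate step on top of this tally.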
Affiliation(s)
- Monica C Sleumer
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 W 7th Ave Suite 100, Vancouver, BC, Canada
|
85
|
Reid JE, Evans KJ, Dyer N, Wernisch L, Ott S. Variable structure motifs for transcription factor binding sites. BMC Genomics 2010; 11:30. [PMID: 20074339 PMCID: PMC2824720 DOI: 10.1186/1471-2164-11-30] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2009] [Accepted: 01/14/2010] [Indexed: 02/06/2023] Open
Abstract
Background Classically, models of DNA-transcription factor binding sites (TFBSs) have been based on relatively few known instances and have treated them as sites of fixed length using position weight matrices (PWMs). Various extensions to this model have been proposed, most of which take account of dependencies between the bases in the binding sites. However, some transcription factors are known to exhibit some flexibility and bind to DNA in more than one possible physical configuration. In some cases this variation is known to affect the function of binding sites. With the increasing volume of ChIP-seq data available it is now possible to investigate models that incorporate this flexibility. Previous work on variable length models has been constrained by: a focus on specific zinc finger proteins in yeast using restrictive models; a reliance on hand-crafted models for just one transcription factor at a time; and a lack of evaluation on realistically sized data sets. Results We re-analysed binding sites from the TRANSFAC database and found motivating examples where our new variable length model provides a better fit. We analysed several ChIP-seq data sets with a novel motif search algorithm and compared the results to one of the best standard PWM finders and a recently developed alternative method for finding motifs of variable structure. All the methods performed comparably in held-out cross validation tests. Known motifs of variable structure were recovered for p53, Stat5a and Stat5b. In addition our method recovered a novel generalised version of an existing PWM for Sp1 that allows for variable length binding. This motif improved classification performance. Conclusions We have presented a new gapped PWM model for variable length DNA binding sites that is not too restrictive nor over-parameterised. Our comparison with existing tools shows that on average it does not have better predictive accuracy than existing methods. 
However, it does provide more interpretable models of motifs of variable structure that are suitable for follow-up structural studies. To our knowledge, we are the first to apply variable length motif models to eukaryotic ChIP-seq data sets and consequently the first to show their value in this domain. The results include a novel motif for the ubiquitous transcription factor Sp1.
Affiliation(s)
- John E Reid
- MRC Biostatistics Unit, Institute of Public Health, Forvie Site, Cambridge, CB2 0SR, UK.
|
86
|
Syed Z, Stultz C, Kellis M, Indyk P, Guttag J. Motif Discovery in Physiological Datasets: A Methodology for Inferring Predictive Elements. ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA 2010; 4:2. [PMID: 20730037 PMCID: PMC2923403 DOI: 10.1145/1644873.1644875] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/01/2008] [Accepted: 05/01/2009] [Indexed: 05/29/2023]
Abstract
In this article, we propose a methodology for identifying predictive physiological patterns in the absence of prior knowledge. We use the principle of conservation to identify activity that consistently precedes an outcome in patients, and describe a two-stage process that allows us to efficiently search for such patterns in large datasets. This involves first transforming continuous physiological signals from patients into symbolic sequences, and then searching for patterns in these reduced representations that are strongly associated with an outcome. Our strategy of identifying conserved activity that is unlikely to have occurred purely by chance in symbolic data is analogous to the discovery of regulatory motifs in genomic datasets. We build upon existing work in this area, generalizing the notion of a regulatory motif and enhancing current techniques to operate robustly on non-genomic data. We also address two significant considerations associated with motif discovery in general: computational efficiency and robustness in the presence of degeneracy and noise. To deal with these issues, we introduce the concept of active regions and new subset-based techniques such as a two-layer Gibbs sampling algorithm. These extensions allow for a framework for information inference, where precursors are identified as approximately conserved activity of arbitrary complexity preceding multiple occurrences of an event. We evaluated our solution on a population of patients who experienced sudden cardiac death and attempted to discover electrocardiographic activity that may be associated with the endpoint of death. To assess the predictive patterns discovered, we compared likelihood scores for motifs in the sudden death population against control populations of normal individuals and those with non-fatal supraventricular arrhythmias. 
Our results suggest that predictive motif discovery may be able to identify clinically relevant information even in the absence of significant prior knowledge.
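The first stage described above, turning a continuous signal into a symbolic sequence, can be sketched with simple threshold binning. This is a deliberately simplified stand-in and should not be read as the authors' symbolization procedure, which the abstract does not specify; the function name, thresholds and alphabet are all assumptions for illustration.

```python
from bisect import bisect_right


def symbolize(signal, thresholds, alphabet="abcd"):
    """Map each sample to a symbol according to the bin it falls in.

    `thresholds` must be sorted ascending; n thresholds define n + 1 bins,
    so the alphabet needs exactly n + 1 symbols. A sample equal to a
    threshold is placed in the higher bin.
    """
    if len(alphabet) != len(thresholds) + 1:
        raise ValueError("alphabet must have len(thresholds) + 1 symbols")
    return "".join(alphabet[bisect_right(thresholds, x)] for x in signal)


if __name__ == "__main__":
    # Hypothetical samples and cut points, not physiological data.
    print(symbolize([0.1, 0.5, 0.9, 0.5], [0.3, 0.7], "abc"))
```

Once a signal is a string over a small alphabet, sequence-motif tools (for example k-mer counting or Gibbs sampling) can be applied to it in the same way as to DNA.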
|
87
|
Kadupitige SR, Leung KC, Sellmeier J, Sivieng J, Catchpoole DR, Bain ME, Gaëta BA. MINER: exploratory analysis of gene interaction networks by machine learning from expression data. BMC Genomics 2009; 10 Suppl 3:S17. [PMID: 19958480 PMCID: PMC2788369 DOI: 10.1186/1471-2164-10-s3-s17] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022] Open
Abstract
Background The reconstruction of gene regulatory networks from high-throughput "omics" data has become a major goal in the modelling of living systems. Numerous approaches have been proposed, most of which attempt only "one-shot" reconstruction of the whole network with no intervention from the user, or offer only simple correlation analysis to infer gene dependencies. Results We have developed MINER (Microarray Interactive Network Exploration and Representation), an application that combines multivariate non-linear tree learning of individual gene regulatory dependencies, visualisation of these dependencies as both trees and networks, and representation of known biological relationships based on common Gene Ontology annotations. MINER allows biologists to explore the dependencies influencing the expression of individual genes in a gene expression data set in the form of decision, model or regression trees, using their domain knowledge to guide the exploration and formulate hypotheses. Multiple trees can then be summarised in the form of a gene network diagram. MINER is being adopted by several of our collaborators and has already led to the discovery of a new significant regulatory relationship with subsequent experimental validation. Conclusion Unlike most gene regulatory network inference methods, MINER allows the user to start from genes of interest and build the network gene-by-gene, incorporating domain expertise in the process. This approach has been used successfully with RNA microarray data but is applicable to other quantitative data produced by high-throughput technologies such as proteomics and "next generation" DNA sequencing.
Affiliation(s)
- Sidath Randeni Kadupitige
- School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW, 2052, Australia.
|
88
|
Li G, Liu B, Xu Y. Accurate recognition of cis-regulatory motifs with the correct lengths in prokaryotic genomes. Nucleic Acids Res 2009; 38:e12. [PMID: 19906734 PMCID: PMC2811016 DOI: 10.1093/nar/gkp907] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open
Abstract
We present a new computational method for solving a classical problem, the identification problem of cis-regulatory motifs in a given set of promoter sequences, based on one key new idea. Instead of scoring candidate motifs individually like in all the existing motif-finding programs, our method scores groups of candidate motifs with similar sequences, called motif closures, using a P-value, which has substantially improved the prediction reliability over the existing methods. Our new P-value scoring scheme is sequence length independent, hence allowing direct comparisons among predicted motifs with different lengths on the same footing. We have implemented this method as a Motif Recognition Computer (MREC) program, and have extensively tested MREC on both simulated and biological data from prokaryotic genomes. Our test results indicate that MREC can accurately pick out the actual motif with the correct length as the best scoring candidate for the vast majority of the cases in our test set. We compared our prediction results with two motif-finding programs Cosmo and MEME, and found that MREC outperforms both programs across all the test cases by a large margin. The MREC program is available at http://csbl.bmb.uga.edu/~bingqiang/MREC1/.
Affiliation(s)
- Guojun Li
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, GA 30602, USA
|
89
|
Ho ES, Jakubowski CD, Gunderson SI. iTriplet, a rule-based nucleic acid sequence motif finder. Algorithms Mol Biol 2009; 4:14. [PMID: 19874606 PMCID: PMC2784457 DOI: 10.1186/1748-7188-4-14] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2009] [Accepted: 10/29/2009] [Indexed: 12/29/2022] Open
Abstract
Background With the advent of high throughput sequencing techniques, large amounts of sequencing data are readily available for analysis. Natural biological signals are intrinsically highly variable making their complete identification a computationally challenging problem. Many attempts in using statistical or combinatorial approaches have been made with great success in the past. However, identifying highly degenerate and long (>20 nucleotides) motifs still remains an unmet challenge as high degeneracy will diminish statistical significance of biological signals and increasing motif size will cause combinatorial explosion. In this report, we present a novel rule-based method that is focused on finding degenerate and long motifs. Our proposed method, named iTriplet, avoids costly enumeration present in existing combinatorial methods and is amenable to parallel processing. Results We have conducted a comprehensive assessment on the performance and sensitivity-specificity of iTriplet in analyzing artificial and real biological sequences in various genomic regions. The results show that iTriplet is able to solve challenging cases. Furthermore we have confirmed the utility of iTriplet by showing it accurately predicts polyA-site-related motifs using a dual Luciferase reporter assay. Conclusion iTriplet is a novel rule-based combinatorial or enumerative motif finding method that is able to process highly degenerate and long motifs that have resisted analysis by other methods. In addition, iTriplet is distinguished from other methods of the same family by its parallelizability, which allows it to leverage the power of today's readily available high-performance computing systems.
|
90
|
Characterization of Citrus sinensis type 1 mitochondrial alternative oxidase and expression analysis in biotic stress. Biosci Rep 2009; 30:59-71, 1 p following 71. [DOI: 10.1042/bsr20080180] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/16/2023] Open
Abstract
The higher plant mitochondrial electron transport chain contains an alternative pathway that ends with the AOX (alternative oxidase). The AOX proteins are encoded by a small gene family composed of two discrete gene subfamilies. Aox1 is present in both monocot and eudicot plants, whereas Aox2 is only present in eudicot plants. We isolated a genomic clone from Citrus sinensis containing the Aox1a gene. The orange Aox1a consists of four exons interrupted by three introns and its promoter harbours diverse putative stress-specific regulatory motifs including pathogen response elements. The role of the Aox1a gene was evaluated during the compatible interaction between C. sinensis and Xanthomonas axonopodis pv. citri and no induction of the Aox1a at the transcriptional level was observed. On the other hand, Aox1a was studied in orange plants during non-host interactions with Pseudomonas syringae pv. tomato and Xanthomonas campestris pv. vesicatoria, which result in hypersensitive response. Both phytopathogens produced a strong induction of Aox1a, reaching a maximum at 8 h post-infiltration. Exogenous application of salicylic acid produced a slight increase in the steady-state level of Aox1a, whereas the application of fungi elicitors showed the highest induction. These results suggest that AOX1a plays a role during biotic stress in non-host plant pathogen interaction.
|
91
|
Discovering multiple realistic TFBS motifs based on a generalized model. BMC Bioinformatics 2009; 10:321. [PMID: 19811641 PMCID: PMC2770069 DOI: 10.1186/1471-2105-10-321] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2009] [Accepted: 10/07/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Identification of transcription factor binding sites (TFBSs) is a central problem in the bioinformatics of gene regulation. De novo motif discovery serves as a promising way to predict and better understand TFBSs for biological verification. Real TFBSs of a motif may vary in their widths and their degrees of conservation within a certain range. Deciding on a single motif width with existing models may be biased and misleading. Additionally, multiple, possibly overlapping, candidate motifs are desired and necessary for biological verification in practice. However, current techniques either prohibit overlapping TFBSs or lack explicit control over different motifs. RESULTS We propose a new generalized model to tackle the motif widths by considering and evaluating a width range of interest simultaneously, which should better address the width uncertainty. Moreover, a meta-convergence framework for genetic algorithms (GAs) is proposed to provide multiple overlapping optimal motifs simultaneously in an effective and flexible way. Users can easily specify the difference amongst expected motif kinds via a similarity test. Incorporating Genetic Algorithm with Local Filtering (GALF) for searching, the new GALF-G (G for generalized) algorithm is proposed based on the generalized model and meta-convergence framework. CONCLUSION GALF-G was tested extensively on over 970 synthetic, real and benchmark datasets, and is usually better than the state-of-the-art methods. The range model shows an increase in sensitivity compared with the single-width ones, while providing competitive precision on the E. coli benchmark. Effectiveness can be maintained even with a very small population, exhibiting very competitive efficiency. In discovering multiple overlapping motifs in a real liver-specific dataset, GALF-G outperforms MEME by up to 73% in overall F-scores. GALF-G also helps to discover an additional motif which has probably not been annotated in the dataset. AVAILABILITY http://www.cse.cuhk.edu.hk/%7Etmchan/GALFG/
|
92
|
Priest HD, Filichkin SA, Mockler TC. Cis-regulatory elements in plant cell signaling. CURRENT OPINION IN PLANT BIOLOGY 2009; 12:643-649. [PMID: 19717332 DOI: 10.1016/j.pbi.2009.07.016] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/28/2009] [Revised: 06/30/2009] [Accepted: 07/21/2009] [Indexed: 05/26/2023]
Abstract
Plant cell signaling pathways are in part dependent on transcriptional regulatory networks comprising circuits of transcription factors (TFs) and regulatory DNA elements that control the expression of target genes. Here, we describe experimental and bioinformatic approaches for identifying potential cis-regulatory elements. We also discuss recent integrative genomics studies aimed at elucidating the functions of cis-regulatory elements in aspects of plant biology, including the circadian clock, interactions with the environment, stress responses, and regulation of growth and development by phytohormones. Finally, we discuss emerging technologies and approaches that offer great potential for accelerating the discovery and functional characterization of cis-elements and interacting TFs--which will help realize the promise of systems biology.
Affiliation(s)
- Henry D Priest
- Department of Botany and Plant Pathology and Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR 97331, USA
|
93
|
Defrance M, van Helden J. info-gibbs: a motif discovery algorithm that directly optimizes information content during sampling. Bioinformatics 2009; 25:2715-22. [PMID: 19689955 DOI: 10.1093/bioinformatics/btp490] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Discovering cis-regulatory elements in genome sequence remains a challenging issue. Several methods rely on the optimization of some target scoring function. The information content (IC) or relative entropy of the motif has proven to be a good estimator of transcription factor DNA binding affinity. However, these information-based metrics are usually used as a posteriori statistics rather than during the motif search process itself. RESULTS: We introduce here info-gibbs, a Gibbs sampling algorithm that efficiently optimizes the IC or the log-likelihood ratio (LLR) of the motif while keeping computation time low. The method compares well with existing methods like MEME, BioProspector, Gibbs or GAME on both synthetic and biological datasets. Our study shows that motif discovery techniques can be enhanced by directly focusing the search on the motif IC or the motif LLR. AVAILABILITY: http://rsat.ulb.ac.be/rsat/info-gibbs
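To make the information content (IC) named in this abstract concrete, the following is a minimal sketch of how the relative entropy of a set of aligned sites can be computed. It is not the info-gibbs implementation: the function name `information_content`, the optional pseudocount, and the default uniform background are assumptions chosen for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def information_content(sites, background=None, pseudocount=0.0):
    """Relative entropy of aligned sites, in bits.

    Sums f(b) * log2(f(b) / p(b)) over every column and base, where f is
    the observed column frequency and p the background frequency.
    """
    if background is None:
        # Assumption: uniform background when none is supplied.
        background = {b: 0.25 for b in ALPHABET}
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same width")
    denom = len(sites) + pseudocount * len(ALPHABET)
    total = 0.0
    for col in range(width):
        counts = Counter(s[col] for s in sites)
        for b in ALPHABET:
            f = (counts.get(b, 0) + pseudocount) / denom
            if f > 0:  # the term for f == 0 is taken as 0
                total += f * math.log2(f / background[b])
    return total
```

With a uniform background, a fully conserved column contributes 2 bits and a column in which all four bases are equally frequent contributes 0 bits. A sampler that optimizes IC directly would evaluate a quantity of this kind for each candidate alignment during the search rather than only afterwards.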
Affiliation(s)
- Matthieu Defrance
- Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe), Université Libre de Bruxelles CP 263, Campus Plaine, Boulevard du Triomphe, B-1050 Bruxelles, Belgium.
|
94
|
de Almeida Engler J, De Veylder L, De Groodt R, Rombauts S, Boudolf V, De Meyer B, Hemerly A, Ferreira P, Beeckman T, Karimi M, Hilson P, Inzé D, Engler G. Systematic analysis of cell-cycle gene expression during Arabidopsis development. The Plant Journal: For Cell and Molecular Biology 2009; 59:645-60. [PMID: 19392699 DOI: 10.1111/j.1365-313x.2009.03893.x] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Indexed: 05/08/2023]
Abstract
The steady-state distribution of cell-cycle transcripts in Arabidopsis thaliana seedlings was studied in a broad in situ survey to provide a better understanding of the expression of cell-cycle genes during plant development. The 61 core cell-cycle genes analyzed were expressed at variable levels throughout the different plant tissues: 23 genes generally in dividing and young differentiating tissues, 34 genes mostly in both dividing and differentiated tissues and four gene transcripts primarily in differentiated tissues. Only 21 genes had a typical patchy expression pattern, indicating tight cell-cycle regulation. The increased expression of 27 cell-cycle genes in the root elongation zone hinted at their involvement in the switch from cell division to differentiation. The induction of 20 cell-cycle genes in differentiated cortical cells of etiolated hypocotyls pointed to their possible role in the process of endoreduplication. Of seven cyclin-dependent kinase inhibitor genes, five were upregulated in etiolated hypocotyls, suggesting a role in cell-cycle arrest. Nineteen genes were preferentially expressed in pericycle cells activated by auxin that give rise to lateral root primordia. Approximately 1800 images have been collected and can be queried via an online database. Our in situ analysis revealed that 70% of the cell-cycle genes, although expressed at different levels, show a large overlap in their localization. The lack of regulatory motifs in the upstream regions of the analyzed genes suggests the absence of a universal transcriptional control mechanism for all cell-cycle genes.
Affiliation(s)
- Janice de Almeida Engler
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
|
95
|
Fadda A, Fierro AC, Lemmens K, Monsieurs P, Engelen K, Marchal K. Inferring the transcriptional network of Bacillus subtilis. Molecular BioSystems 2009; 5:1840-52. [PMID: 20023724 DOI: 10.1039/b907310h] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.9] [Indexed: 11/21/2022]
Abstract
The adaptation of bacteria to the vigorous environmental changes they undergo is crucial to their survival. They achieve this adaptation partly via intricate regulation of the transcription of their genes. In this study, we infer the transcriptional network of the Gram-positive model organism, Bacillus subtilis. We use a data integration workflow, exploiting both motif and expression data, towards the generation of condition-dependent transcriptional modules. In building the motif data, we rely on both known and predicted information. Known motifs were derived from DBTBS, while predicted motifs were generated by a de novo motif detection method that utilizes comparative genomics. The expression data consists of a compendium of microarrays across different platforms. Our results indicate that a considerable part of the B. subtilis network is yet undiscovered; we could predict 417 new regulatory interactions for known regulators and 453 interactions for yet uncharacterized regulators. The regulators in our network showed a preference for regulating modules in certain environmental conditions. Also, substantial condition-dependent intra-operonic regulation seems to take place. Global regulators seem to require functional flexibility to attain their roles by acting as both activators and repressors.
Affiliation(s)
- Abeer Fadda
- Department of Microbial and Molecular Systems, KULeuven, Kasteelpark Arenberg 20, 3001 Heverlee, Belgium
|
96
|
Bi C. A Monte Carlo EM algorithm for de novo motif discovery in biomolecular sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2009; 6:370-386. [PMID: 19644166 DOI: 10.1109/tcbb.2008.103] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 05/28/2023]
Abstract
Motif discovery methods play pivotal roles in deciphering the genetic regulatory codes (i.e., motifs) in genomes as well as in locating conserved domains in protein sequences. The Expectation Maximization (EM) algorithm is one of the most popular methods used in de novo motif discovery. Based on the position weight matrix (PWM) updating technique, this paper presents a Monte Carlo version of the EM motif-finding algorithm that carries out stochastic sampling in local alignment space to overcome the conventional EM's main drawback of being trapped in a local optimum. The newly implemented algorithm is named the Monte Carlo EM Motif Discovery Algorithm (MCEMDA). MCEMDA starts from an initial model, and then it iteratively performs Monte Carlo simulation and parameter update until convergence. A log-likelihood profiling technique together with the top-k strategy is introduced to cope with the phase shifts and multiple modal issues in the motif discovery problem. A novel grouping motif alignment (GMA) algorithm is designed to select motifs by clustering a population of candidate local alignments and is successfully applied to subtle motif discovery. MCEMDA compares favorably to other popular PWM-based and word-enumerative motif algorithms tested using simulated (l, d)-motif cases and documented prokaryotic and eukaryotic DNA motif sequences. Finally, MCEMDA is applied to detect large blocks of conserved domains using protein benchmarks and exhibits excellent capacity when compared with other multiple sequence alignment methods.
Affiliation(s)
- Chengpeng Bi
- Bioinformatics and Intelligent Computing Laboratory, Division of Clinical Pharmacology, Children's Mercy Hospitals and Clinics, 2401 Gillham Road, Kansas City, MO 64108, USA.
|
97
|
Blanco F, Salinas P, Cecchini NM, Jordana X, Van Hummelen P, Alvarez ME, Holuigue L. Early genomic responses to salicylic acid in Arabidopsis. Plant Molecular Biology 2009; 70:79-102. [PMID: 19199050 DOI: 10.1007/s11103-009-9458-1] [Citation(s) in RCA: 83] [Impact Index Per Article: 5.5] [Received: 07/11/2008] [Accepted: 01/11/2009] [Indexed: 05/21/2023]
Abstract
Salicylic acid (SA) is a stress-induced hormone involved in the activation of defense genes. Here we analyzed the early genetic responses to SA of wild type and npr1-1 mutant Arabidopsis seedlings, using the Complete Arabidopsis Transcriptome MicroArray (CATMAv2) chip. We identified 217 genes rapidly induced by SA (early SAIGs): 193 by an NPR1-dependent and 24 by an NPR1-independent pathway. These two groups of genes also differed in their functional classification, expression profiles and over-representation of cis-elements, supporting differential pathways for their activation. Examination of the expression patterns for selected early SAIGs from both groups indicated that their activation by SA required the TGA2/5/6 subclass of transcription factors. These genes were also activated by Pseudomonas syringae pv. tomato AvrRpm1, suggesting that they might play a role in defense against bacteria. This study gives a global idea of the early response to SA in Arabidopsis seedlings, expanding our knowledge about SA function in plant defense.
Affiliation(s)
- Francisca Blanco
- Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, P.O. Box 114-D, Santiago, Chile
|
98
|
Erill I, O'Neill MC. A reexamination of information theory-based methods for DNA-binding site identification. BMC Bioinformatics 2009; 10:57. [PMID: 19210776 PMCID: PMC2680408 DOI: 10.1186/1471-2105-10-57] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Received: 10/14/2008] [Accepted: 02/11/2009] [Indexed: 11/10/2022] Open
Abstract
Background: Searching for transcription factor binding sites in genome sequences is still an open problem in bioinformatics. Despite substantial progress, search methods based on information theory remain a standard in the field, even though the full validity of their underlying assumptions has only been tested in artificial settings. Here we use newly available data on transcription factors from different bacterial genomes to make a more thorough assessment of information theory-based search methods. Results: Our results reveal that conventional benchmarking against artificial sequence data frequently leads to overestimation of search efficiency. In addition, we find that sequence information by itself is often inadequate and therefore must be complemented by other cues, such as curvature, in real genomes. Furthermore, results on skewed genomes show that methods integrating skew information, such as Relative Entropy, are not effective because their assumptions may not hold in real genomes. The evidence suggests that binding sites tend to evolve towards genomic skew, rather than against it, and to maintain their information content through increased conservation. Based on these results, we identify several misconceptions about information theory as applied to binding sites, such as negative entropy, and we propose a revised paradigm to explain the observed results. Conclusion: We conclude that, among information theory-based methods, the most unassuming search methods perform, on average, better than any other alternatives, since heuristic corrections to these methods are prone to fail when working on real data. A reexamination of information content in binding sites reveals that information content is a compound measure of search and binding affinity requirements, a fact that has important repercussions for our understanding of binding site evolution.
Affiliation(s)
- Ivan Erill
- Department of Biological Sciences, University of Maryland-Baltimore County, Baltimore, MD, USA.
|
99
|
Sun H, De Bie T, Storms V, Fu Q, Dhollander T, Lemmens K, Verstuyf A, De Moor B, Marchal K. ModuleDigger: an itemset mining framework for the detection of cis-regulatory modules. BMC Bioinformatics 2009; 10 Suppl 1:S30. [PMID: 19208131 PMCID: PMC2648767 DOI: 10.1186/1471-2105-10-s1-s30] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 11/15/2022] Open
Abstract
Background: The detection of cis-regulatory modules (CRMs) that mediate transcriptional responses in eukaryotes remains a key challenge in the postgenomic era. A CRM is characterized by a set of co-occurring transcription factor binding sites (TFBS). In silico methods have been developed to search for CRMs by determining the combination of TFBS that are statistically overrepresented in a certain gene set. Most of these methods solve this combinatorial problem by relying on computationally intensive optimization methods. As a result, their usage is limited to finding CRMs in small datasets (containing a few genes only) and using binding sites for a restricted number of transcription factors (TFs) out of which the optimal module will be selected. Results: We present an itemset-mining-based strategy for computationally detecting CRMs in a set of genes. We tested our method by applying it to a large benchmark data set, derived from a ChIP-Chip analysis, and compared its performance with other well-known cis-regulatory module detection tools. Conclusion: We show that by exploiting the computational efficiency of an itemset mining approach and combining it with a well-designed statistical scoring scheme, we were able to prioritize the biologically valid CRMs in a large set of coregulated genes using binding sites for a large number of potential TFs as input.
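The itemset mining idea in this abstract can be sketched as follows: treat each gene as a "transaction" whose items are the TFs with predicted binding sites in its regulatory region, then enumerate TF combinations shared by enough genes. This is a minimal level-wise (Apriori-style) enumeration written for illustration only. It is not ModuleDigger itself, it omits the statistical scoring scheme the abstract mentions, and the function name `frequent_tf_sets` and its parameters are assumptions.

```python
from itertools import combinations

def frequent_tf_sets(gene_to_tfs, min_support, max_size=3):
    """Map each frequent TF combination to the set of genes containing it.

    gene_to_tfs: dict of gene name -> set of TF names with sites in that gene.
    min_support: minimum number of genes that must contain the combination.
    """
    # Level 1: single TFs and the genes in which they occur.
    current = {}
    for gene, tfs in gene_to_tfs.items():
        for tf in tfs:
            current.setdefault(frozenset([tf]), set()).add(gene)
    current = {k: v for k, v in current.items() if len(v) >= min_support}
    result = dict(current)

    size = 1
    while current and size < max_size:
        size += 1
        candidates = {}
        # Join pairs of frequent sets that differ in exactly one TF.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != size or union in candidates:
                continue
            # A gene contains the union iff it contains both parts.
            genes = current[a] & current[b]
            if len(genes) >= min_support:
                candidates[union] = genes
        current = candidates
        result.update(current)
    return result

# Hypothetical toy input: four genes and three TFs.
example = {
    "g1": {"A", "B", "C"},
    "g2": {"A", "B"},
    "g3": {"A", "C"},
    "g4": {"B"},
}
modules = frequent_tf_sets(example, min_support=2)
```

Because any superset of an infrequent combination is itself infrequent, each level only extends combinations that already met the support threshold. This pruning is what lets itemset mining handle many genes and many candidate TFs at once.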
Affiliation(s)
- Hong Sun
- Department of Electrical Engineering, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium.
|
100
|
Tang MHE, Krogh A, Winther O. BayesMD: flexible biological modeling for motif discovery. J Comput Biol 2009; 15:1347-63. [PMID: 19040368 DOI: 10.1089/cmb.2007.0176] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022] Open
Abstract
We present BayesMD, a Bayesian Motif Discovery model with several new features. Three different types of biological a priori knowledge are built into the framework in a modular fashion. A mixture of Dirichlets is used as prior over nucleotide probabilities in binding sites. It is trained on transcription factor (TF) databases in order to extract the typical properties of TF binding sites. In a similar fashion we train organism-specific priors for the background sequences. Lastly, we use a prior over the position of binding sites. This prior represents information complementary to the motif and background priors coming from conservation, local sequence complexity, nucleosome occupancy, etc. and assumptions about the number of occurrences. The Bayesian inference is carried out using a combination of exact marginalization (multinomial parameters) and sampling (over the position of sites). Robust sampling results are achieved using the advanced sampling method parallel tempering. In a post-analysis step candidate motifs with high marginal probability are found by searching among those motifs that contain sites that occur frequently. Thereby, maximum a posteriori inference for the motifs is avoided and the marginal probabilities can be used directly to assess the significance of the findings. The framework is benchmarked against other methods on a number of real and artificial data sets. The accompanying prediction server, documentation, software, models and data are available from http://bayesmd.binf.ku.dk/.
Affiliation(s)
- Man-Hung Eric Tang
- Bioinformatics Centre, Department of Molecular Biology, University of Copenhagen, Copenhagen, Denmark
|