1
|
Xie D, Chen CC, He X, Cao X, Zhong S. Towards an evolutionary model of transcription networks. PLoS Comput Biol 2011; 7:e1002064. [PMID: 21695281 PMCID: PMC3111474 DOI: 10.1371/journal.pcbi.1002064] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2010] [Accepted: 04/08/2011] [Indexed: 11/18/2022] Open
Abstract
DNA evolution models made invaluable contributions to comparative genomics, although it seemed formidable to include non-genomic features into these models. In order to build an evolutionary model of transcription networks (TNs), we had to forfeit the substitution model used in DNA evolution and to start from modeling the evolution of the regulatory relationships. We present a quantitative evolutionary model of TNs, subjecting the phylogenetic distance and the evolutionary changes of cis-regulatory sequence, gene expression and network structure to one probabilistic framework. Using the genome sequences and gene expression data from multiple species, this model can predict regulatory relationships between a transcription factor (TF) and its target genes in all species, and thus identify TN re-wiring events. Applying this model to analyze the pre-implantation development of three mammalian species, we identified the conserved and re-wired components of the TNs downstream to a set of TFs including Oct4, Gata3/4/6, cMyc and nMyc. Evolutionary events on the DNA sequence that led to turnover of TF binding sites were identified, including a birth of an Oct4 binding site by a 2nt deletion. In contrast to recent reports of large interspecies differences of TF binding sites and gene expression patterns, the interspecies difference in TF-target relationship is much smaller. The data showed increasing conservation levels from genomic sequences to TF-DNA interaction, gene expression, TN, and finally to morphology, suggesting that evolutionary changes are larger at molecular levels and smaller at functional levels. The data also showed that evolutionarily older TFs are more likely to have conserved target genes, whereas younger TFs tend to have larger re-wiring rates. DNA evolution models made invaluable contributions to comparative genomic studies. Still lacking is an evolutionary model of transcription networks (TNs). To develop such a model, we had to forfeit the substitution model used in DNA evolution and to start from modeling the evolution of the regulatory relationships, and then subject the phylogenetic distance and the multi-species DNA sequence and gene expression data to one probabilistic framework. This model enabled us to infer the evolutionary changes of transcriptional regulatory relationships. Applying this model to analyze three yeast species, we found the anaerobic phenotype in two species was associated with the evolutionary loss of a larger cis-regulatory motif than previously thought. Analyzing three mammalian species, we found increasing conservation levels from genomic sequences to transcription factor-DNA interaction, gene expression, TN, and finally to morphology, suggesting that evolutionary changes are larger at molecular levels and smaller at functional levels. We also found that evolutionarily younger TFs are more likely to regulate different target genes in different species.
Collapse
Affiliation(s)
- Dan Xie
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
| | - Chieh-Chun Chen
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
| | - Xin He
- Department of Biochemistry and Biophysics, University of California, San Francisco, California, United States of America
| | - Xiaoyi Cao
- Center for Biophysics and Computational Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
| | - Sheng Zhong
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Center for Biophysics and Computational Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- * E-mail:
| |
Collapse
|
2
|
Chen G, Zhou Q. Heterogeneity in DNA multiple alignments: modeling, inference, and applications in motif finding. Biometrics 2011; 66:694-704. [PMID: 19995355 DOI: 10.1111/j.1541-0420.2009.01362.x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]
Abstract
Transcription factors bind sequence-specific sites in DNA to regulate gene transcription. Identifying transcription factor binding sites (TFBSs) is an important step for understanding gene regulation. Although sophisticated in modeling TFBSs and their combinatorial patterns, computational methods for TFBS detection and motif finding often make oversimplified homogeneous model assumptions for background sequences. Since nucleotide base composition varies across genomic regions, it is expected to be helpful for motif finding to incorporate the heterogeneity into background modeling. When sequences from multiple species are utilized, variation in evolutionary conservation violates the common assumption of an identical conservation level in multiple alignments. To handle both types of heterogeneity, we propose a generative model in which a segmented Markov chain is used to partition a multiple alignment into regions of homogeneous nucleotide base composition and a hidden Markov model (HMM) is employed to account for different conservation levels. Bayesian inference on the model is developed via Gibbs sampling with dynamic programming recursions. Simulation studies and empirical evidence from biological data sets reveal the dramatic effect of background modeling on motif finding, and demonstrate that the proposed approach is able to achieve substantial improvements over commonly used background models.
Collapse
Affiliation(s)
- Gong Chen
- Department of Statistics, University of California, Los Angeles, Los Angeles, California 90095, USA
| | | |
Collapse
|
3
|
Fuellen G, Struckmann S. Evolution of gene regulation of pluripotency--the case for wiki tracks at genome browsers. Biol Direct 2010; 5:67. [PMID: 21190561 PMCID: PMC3024949 DOI: 10.1186/1745-6150-5-67] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2010] [Accepted: 12/29/2010] [Indexed: 12/23/2022] Open
Abstract
Background Experimentally validated data on gene regulation are hard to obtain. In particular, information about transcription factor binding sites in regulatory regions are scattered around in the literature. This impedes their systematic in-context analysis, e.g. the inference of their conservation in evolutionary history. Results We demonstrate the power of integrative bioinformatics by including curated transcription factor binding site information into the UCSC genome browser, using wiki and custom tracks, which enable easy publication of annotation data. Data integration allows to investigate the evolution of gene regulation of the pluripotency-associated genes Oct4, Sox2 and Nanog. For the first time, experimentally validated transcription factor binding sites in the regulatory regions of all three genes were assembled together based on manual curation of data from 39 publications. Using the UCSC genome browser, these data were then visualized in the context of multi-species conservation based on genomic alignment. We confirm previous hypotheses regarding the evolutionary age of specific regulatory patterns, establishing their "deep homology". We also confirm some other principles of Carroll's "Genetic theory of Morphological Evolution", such as "mosaic pleiotropy", exemplified by the dual role of Sox2 reflected in its regulatory region. Conclusions We were able to elucidate some aspects of the evolution of gene regulation for three genes associated with pluripotency. Based on the expected return on investment for the community, we encourage other scientists to contribute experimental data on gene regulation (original work as well as data collected for reviews) to the UCSC system, to enable studies of the evolution of gene regulation on a large scale, and to report their findings. Reviewers This article was reviewed by Dr. Gustavo Glusman and Dr. Juan Caballero, Institute for Systems Biology, Seattle, USA (nominated by Dr. Doron Lancet, Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel), Dr. Niels Grabe, TIGA Center (BIOQUANT) and Medical Systems Biology Group, Institute of Medical Biometry and Informatics, University Hospital Heidelberg, Germany (nominated by Dr. Mikhail Gelfand, Department of Bioinformatics, Institute of Information Transfer Problems, Russian Academy of Science, Moscow, Russian Federation) and Dr. Franz-Josef Müller, Center for Regenerative Medicine, The Scripps Research Institute, La Jolla, CA, USA and University Hospital for Psychiatry and Psychotherapy (part of ZIP gGmbH), University of Kiel, Germany (nominated by Dr. Trey Ideker, University of California, San Diego, La Jolla CA, United States).
Collapse
Affiliation(s)
- Georg Fuellen
- Institute for Biostatistics and Informatics in Medicine and Ageing Research - IBIMA, University of Rostock, Medical Faculty, Ernst-Heydemann-Str. 8, 18057 Rostock, Germany.
| | | |
Collapse
|
4
|
Majoros WH, Ohler U. Modeling the evolution of regulatory elements by simultaneous detection and alignment with phylogenetic pair HMMs. PLoS Comput Biol 2010; 6:e1001037. [PMID: 21187896 PMCID: PMC3002982 DOI: 10.1371/journal.pcbi.1001037] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2010] [Accepted: 11/17/2010] [Indexed: 11/18/2022] Open
Abstract
The computational detection of regulatory elements in DNA is a difficult but important problem impacting our progress in understanding the complex nature of eukaryotic gene regulation. Attempts to utilize cross-species conservation for this task have been hampered both by evolutionary changes of functional sites and poor performance of general-purpose alignment programs when applied to non-coding sequence. We describe a new and flexible framework for modeling binding site evolution in multiple related genomes, based on phylogenetic pair hidden Markov models which explicitly model the gain and loss of binding sites along a phylogeny. We demonstrate the value of this framework for both the alignment of regulatory regions and the inference of precise binding-site locations within those regions. As the underlying formalism is a stochastic, generative model, it can also be used to simulate the evolution of regulatory elements. Our implementation is scalable in terms of numbers of species and sequence lengths and can produce alignments and binding-site predictions with accuracy rivaling or exceeding current systems that specialize in only alignment or only binding-site prediction. We demonstrate the validity and power of various model components on extensive simulations of realistic sequence data and apply a specific model to study Drosophila enhancers in as many as ten related genomes and in the presence of gain and loss of binding sites. Different models and modeling assumptions can be easily specified, thus providing an invaluable tool for the exploration of biological hypotheses that can drive improvements in our understanding of the mechanisms and evolution of gene regulation.
Collapse
Affiliation(s)
- William H Majoros
- Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina, United States of America.
| | | |
Collapse
|
5
|
Evolution of conserved non-coding sequences within the vertebrate Hox clusters through the two-round whole genome duplications revealed by phylogenetic footprinting analysis. J Mol Evol 2010; 71:427-36. [PMID: 20981416 DOI: 10.1007/s00239-010-9396-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2010] [Accepted: 09/17/2010] [Indexed: 02/01/2023]
Abstract
As a result of two-round whole genome duplications, four or more paralogous Hox clusters exist in vertebrate genomes. The paralogous genes in the Hox clusters show similar expression patterns, implying shared regulatory mechanisms for expression of these genes. Previous studies partly revealed the expression mechanisms of Hox genes. However, cis-regulatory elements that control these paralogous gene expression are still poorly understood. Toward solving this problem, the authors searched conserved non-coding sequences (CNSs), which are candidates of cis-regulatory elements. When comparing orthologous Hox clusters of 19 vertebrate species, 208 intergenic conserved regions were found. The authors then searched for CNSs that were conserved not only between orthologous clusters but also among the four paralogous Hox clusters. The authors found three regions that are conserved among all the four clusters and eight regions that are conserved between intergenic regions of two paralogous Hox clusters. In total, 28 CNSs were identified in the paralogous Hox clusters, and nine of them were newly found in this study. One of these novel regions bears a RARE motif. These CNSs are candidates for gene expression regulatory regions among paralogous Hox clusters. The authors also compared vertebrate CNSs with amphioxus CNSs within the Hox cluster, and found that two CNSs in the HoxA and HoxB clusters retain homology with amphioxus CNSs through the two-round whole genome duplications.
Collapse
|
6
|
Thermodynamics-based models of transcriptional regulation by enhancers: the roles of synergistic activation, cooperative binding and short-range repression. PLoS Comput Biol 2010; 6. [PMID: 20862354 PMCID: PMC2940721 DOI: 10.1371/journal.pcbi.1000935] [Citation(s) in RCA: 123] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2010] [Accepted: 08/17/2010] [Indexed: 01/08/2023] Open
Abstract
Quantitative models of cis-regulatory activity have the potential to improve our mechanistic understanding of transcriptional regulation. However, the few models available today have been based on simplistic assumptions about the sequences being modeled, or heuristic approximations of the underlying regulatory mechanisms. We have developed a thermodynamics-based model to predict gene expression driven by any DNA sequence, as a function of transcription factor concentrations and their DNA-binding specificities. It uses statistical thermodynamics theory to model not only protein-DNA interaction, but also the effect of DNA-bound activators and repressors on gene expression. In addition, the model incorporates mechanistic features such as synergistic effect of multiple activators, short range repression, and cooperativity in transcription factor-DNA binding, allowing us to systematically evaluate the significance of these features in the context of available expression data. Using this model on segmentation-related enhancers in Drosophila, we find that transcriptional synergy due to simultaneous action of multiple activators helps explain the data beyond what can be explained by cooperative DNA-binding alone. We find clear support for the phenomenon of short-range repression, where repressors do not directly interact with the basal transcriptional machinery. We also find that the binding sites contributing to an enhancer's function may not be conserved during evolution, and a noticeable fraction of these undergo lineage-specific changes. Our implementation of the model, called GEMSTAT, is the first publicly available program for simultaneously modeling the regulatory activities of a given set of sequences. The development of complex multicellular organisms requires genes to be expressed at specific stages and in specific tissues. Regulatory DNA sequences, often called cis-regulatory modules, drive the desired gene expression patterns by integrating information about the environment in the form of the activities of transcription factors. The rules by which regulatory sequences read this type of information, however, are unclear. In this work, we developed quantitative models based on physicochemical principles that directly map regulatory sequences to the expression profiles they generate. We evaluated these models on the segmentation network of the model organism Drosophila melanogaster. Our models incorporate mechanistic features that attempt to capture how activating and repressing transcription factors work in the segmentation system. By evaluating the importance of these features, we were able to gain insights on the quantitative regulatory rules. We found that two different mechanisms may contribute to cooperative gene activation and that repressors often have a short range of influence in DNA sequences. Combining the quantitative modeling with comparative sequence analysis, we also found that even functional sequences may be lost during evolution.
Collapse
|
7
|
Fuellen G. Evolution of gene regulation--on the road towards computational inferences. Brief Bioinform 2010; 12:122-31. [PMID: 20702596 DOI: 10.1093/bib/bbq060] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
If fragments of DNA are transcribed (expressed), they deserve to be called (parts of) a gene. Whether transcription takes place depends on the 'gene regulatory network'. This network is defined as the complex interplay of the sequence, biochemical modifications and structure of the chromosomal DNA with the regulatory proteins/RNA (transcription factors, co-factors, regulating RNA and the transcriptional apparatus itself). Gene regulatory networks play a role in various stages of development as well as in the maintenance of the organism; in this review we will concentrate on the former. Their evolutionary reconstruction is daunting (to say the least), and bioinformatics tools are in their infancy. However, gain of understanding offers a reward beyond itself, since evolutionary considerations can enable discoveries in the first place, e.g. the computational identification of conserved transcription factor binding sites. We discuss the evolution of gene regulation in the context of the 'Genetic Theory of Morphological Evolution' as described by Carroll, identifying those parts of the theory that are relevant for bioinformatics, and their implications. We discuss the important question of how bioinformatics analysis results on the evolution of gene regulation may be validated. Finally, we briefly exemplify use of the UCSC genome browser, exploiting its pre-computed alignments to describe the evolution of gene regulation.
Collapse
Affiliation(s)
- Georg Fuellen
- Institute for Biostatistics and Informatics in Medicine and Ageing Research-IBIMA, University of Rostock, Medical Faculty, Ernst-Heydemann-Str. 8, 18057 Rostock, Germany.
| |
Collapse
|
8
|
Abstract
We have produced an evolutionary model for promoters, analogous to the commonly used synonymous/nonsynonymous mutation models for protein-coding sequences. Although our model, called Sunflower, relies on some simple assumptions, it captures enough of the biology of transcription factor action to show clear correlation with other biological features. Sunflower predicts a binding profile of transcription factors to DNA sequences, in which different factors compete for the same potential binding sites. The parametrized model simultaneously estimates a continuous measurement of binding occupancy across the genomic sequence for each factor. We can then introduce a localized mutation, rerun the binding model, and record the difference in binding profiles. A single mutation can alter interactions both upstream and downstream of its position due to potential overlapping binding sites, and our statistic captures this domino effect. Over evolutionary time, we observe a clear excess of low-scoring mutations fixed in promoters, consistent with most changes being neutral. However, this is not consistent across all promoters, and some promoters show more rapid divergence. This divergence often occurs in the presence of relatively constant protein-coding divergence. Interestingly, different classes of promoters show different sensitivity to mutations, with phosphorylation-related genes having promoters inherently more sensitive to mutations than immune genes. Although there have previously been a number of models attempting to handle transcription factor binding, Sunflower provides a richer biological model, incorporating weak binding sites and the possibility of competition. The results show the first clear correlations between such a model and evolutionary processes.
Collapse
Affiliation(s)
- Michael M Hoffman
- EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, United Kingdom
| | | |
Collapse
|
9
|
The effect of orthology and coregulation on detecting regulatory motifs. PLoS One 2010; 5:e8938. [PMID: 20140085 PMCID: PMC2815771 DOI: 10.1371/journal.pone.0008938] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2009] [Accepted: 01/05/2010] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Computational de novo discovery of transcription factor binding sites is still a challenging problem. The growing number of sequenced genomes allows integrating orthology evidence with coregulation information when searching for motifs. Moreover, the more advanced motif detection algorithms explicitly model the phylogenetic relatedness between the orthologous input sequences and thus should be well adapted towards using orthologous information. In this study, we evaluated the conditions under which complementing coregulation with orthologous information improves motif detection for the class of probabilistic motif detection algorithms with an explicit evolutionary model. METHODOLOGY We designed datasets (real and synthetic) covering different degrees of coregulation and orthologous information to test how well Phylogibbs and Phylogenetic sampler, as representatives of the motif detection algorithms with evolutionary model performed as compared to MEME, a more classical motif detection algorithm that treats orthologs independently. RESULTS AND CONCLUSIONS Under certain conditions detecting motifs in the combined coregulation-orthology space is indeed more efficient than using each space separately, but this is not always the case. Moreover, the difference in success rate between the advanced algorithms and MEME is still marginal. The success rate of motif detection depends on the complex interplay between the added information and the specificities of the applied algorithms. Insights in this relation provide information useful to both developers and users. All benchmark datasets are available at http://homes.esat.kuleuven.be/~kmarchal/Supplementary_Storms_Valerie_PlosONE.
Collapse
|
10
|
van Hijum SAFT, Medema MH, Kuipers OP. Mechanisms and evolution of control logic in prokaryotic transcriptional regulation. Microbiol Mol Biol Rev 2009; 73:481-509, Table of Contents. [PMID: 19721087 PMCID: PMC2738135 DOI: 10.1128/mmbr.00037-08] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
A major part of organismal complexity and versatility of prokaryotes resides in their ability to fine-tune gene expression to adequately respond to internal and external stimuli. Evolution has been very innovative in creating intricate mechanisms by which different regulatory signals operate and interact at promoters to drive gene expression. The regulation of target gene expression by transcription factors (TFs) is governed by control logic brought about by the interaction of regulators with TF binding sites (TFBSs) in cis-regulatory regions. A factor that in large part determines the strength of the response of a target to a given TF is motif stringency, the extent to which the TFBS fits the optimal TFBS sequence for a given TF. Advances in high-throughput technologies and computational genomics allow reconstruction of transcriptional regulatory networks in silico. To optimize the prediction of transcriptional regulatory networks, i.e., to separate direct regulation from indirect regulation, a thorough understanding of the control logic underlying the regulation of gene expression is required. This review summarizes the state of the art of the elements that determine the functionality of TFBSs by focusing on the molecular biological mechanisms and evolutionary origins of cis-regulatory regions.
Collapse
Affiliation(s)
- Sacha A F T van Hijum
- Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Kerklaan 30, 9751 NN Haren, The Netherlands.
| | | | | |
Collapse
|
11
|
Abstract
MOTIVATION Identifying transcription factor binding sites (TFBSs) encoding complex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Due to degeneracy of nucleotide content among binding site instances or motifs, and intricate 'grammatical organization' of motifs within cis-regulatory modules (CRMs), extant pattern matching-based in silico motif search methods often suffer from impractically high false positive rates, especially in the context of analyzing large genomic datasets, and noisy position weight matrices which characterize binding sites. Here, we try to address this problem by using a framework to maximally utilize the information content of the genomic DNA in the region of query, taking cues from values of various biologically meaningful genetic and epigenetic factors in the query region such as clade-specific evolutionary parameters, presence/absence of nearby coding regions, etc. We present a new method for TFBS prediction in metazoan genomes that utilizes both the CRM architecture of sequences and a variety of features of individual motifs. Our proposed approach is based on a discriminative probabilistic model known as conditional random fields that explicitly optimizes the predictive probability of motif presence in large sequences, based on the joint effect of all such features. RESULTS This model overcomes weaknesses in earlier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in comparison with a wide spectrum of existing models, and outperform the state of the art by 22% in F1 score. AVAILABILITY AND IMPLEMENTATION The code is publicly available at http://www.sailing.cs.cmu.edu/discover.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Wenjie Fu
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | | | | |
Collapse
|
12
|
He X, Ling X, Sinha S. Alignment and prediction of cis-regulatory modules based on a probabilistic model of evolution. PLoS Comput Biol 2009; 5:e1000299. [PMID: 19293946 PMCID: PMC2657044 DOI: 10.1371/journal.pcbi.1000299] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2008] [Accepted: 01/22/2009] [Indexed: 11/30/2022] Open
Abstract
Cross-species comparison has emerged as a powerful paradigm for predicting cis-regulatory modules (CRMs) and understanding their evolution. The comparison requires reliable sequence alignment, which remains a challenging task for less conserved noncoding sequences. Furthermore, the existing models of DNA sequence evolution generally do not explicitly treat the special properties of CRM sequences. To address these limitations, we propose a model of CRM evolution that captures different modes of evolution of functional transcription factor binding sites (TFBSs) and the background sequences. A particularly novel aspect of our work is a probabilistic model of gains and losses of TFBSs, a process being recognized as an important part of regulatory sequence evolution. We present a computational framework that uses this model to solve the problems of CRM alignment and prediction. Our alignment method is similar to existing methods of statistical alignment but uses the conserved binding sites to improve alignment. Our CRM prediction method deals with the inherent uncertainties of binding site annotations and sequence alignment in a probabilistic framework. In simulated as well as real data, we demonstrate that our program is able to improve both alignment and prediction of CRM sequences over several state-of-the-art methods. Finally, we used alignments produced by our program to study binding site conservation in genome-wide binding data of key transcription factors in the Drosophila blastoderm, with two intriguing results: (i) the factor-bound sequences are under strong evolutionary constraints even if their neighboring genes are not expressed in the blastoderm and (ii) binding sites in distal bound sequences (relative to transcription start sites) tend to be more conserved than those in proximal regions. Our approach is implemented as software, EMMA (Evolutionary Model-based cis-regulatory Module Analysis), ready to be applied in a broad biological context. Comparison of noncoding DNA sequences across species has the potential to significantly improve our understanding of gene regulation and our ability to annotate regulatory regions of the genome. This potential is evident from recent publications analyzing 12 Drosophila genomes for regulatory annotation. However, because noncoding sequences are much less structured than coding sequences, their interspecies comparison presents technical challenges, such as ambiguity about how to align them and how to predict transcription factor binding sites, which are the fundamental units that make up regulatory sequences. This article describes how to build an integrated probabilistic framework that performs alignment and binding site prediction simultaneously, in the process improving the accuracy of both tasks. It defines a stochastic model for the evolution of entire “cis-regulatory modules,” with its highlight being a novel theoretical treatment of the commonly observed loss and gain of binding sites during evolution. This new evolutionary model forms the backbone of newly developed software for the prediction of new cis-regulatory modules, alignment of known modules to elucidate general principles of cis-regulatory evolution, or both. The new software is demonstrated to provide benefits in performance of these two crucial genomics tasks.
Collapse
Affiliation(s)
- Xin He
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, United States of America
| | - Xu Ling
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, United States of America
| | - Saurabh Sinha
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, United States of America
- * E-mail:
| |
Collapse
|