1. Sethuraman M, Dronadula N, Bi L, Wacker BK, Knight E, De Bleser P, Dichek DA. Novel expression cassettes for increasing apolipoprotein AI transgene expression in vascular endothelial cells. Sci Rep 2022; 12:21079. [PMID: 36473901 PMCID: PMC9726828 DOI: 10.1038/s41598-022-25333-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Accepted: 11/28/2022] [Indexed: 12/12/2022] [Open Access]
Abstract
Transduction of endothelial cells (EC) with a vector that expresses apolipoprotein A-I (APOAI) reduces atherosclerosis in arteries of fat-fed rabbits. However, the effects on atherosclerosis are partial and might be enhanced if APOAI expression could be increased. With a goal of developing an expression cassette that generates higher levels of APOAI mRNA in EC, we tested 4 strategies, largely in vitro: addition of 2 types of enhancers, addition of computationally identified EC-specific cis-regulatory modules (CRM), and insertion of the rabbit APOAI gene at the transcription start site (TSS) of sequences cloned from genes that are highly expressed in cultured EC. Addition of a shear stress-responsive enhancer did not increase APOAI expression. Addition of 2 copies of a Mef2c enhancer increased APOAI expression from a moderately active promoter/enhancer but decreased APOAI expression from a highly active promoter/enhancer. Of the 11 CRMs, 3 increased APOAI expression from a moderately active promoter (2-7-fold; P < 0.05); none increased expression from a highly active promoter/enhancer. Insertion of the APOAI gene into the TSS of highly expressed EC genes did not increase expression above levels obtained with a moderately active promoter/enhancer. New strategies are needed to further increase APOAI transgene expression in EC.
Affiliation(s)
- Meena Sethuraman
  - Department of Medicine, University of Washington, Seattle, WA, USA
- Lianxiang Bi
  - Department of Medicine, University of Washington, Seattle, WA, USA
- Bradley K Wacker
  - Department of Medicine, University of Washington, Seattle, WA, USA
- Ethan Knight
  - Department of Medicine, University of Washington, Seattle, WA, USA
- Pieter De Bleser
  - Department of Biomedical Molecular Biology, Ghent University, Ghent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research, Ghent, Belgium
- David A Dichek
  - Department of Medicine, University of Washington, Seattle, WA, USA
2. Huang T, Gu W, Liu E, Zhang L, Dong F, He X, Jiao W, Li C, Wang B, Xu G. Screening and Validation of p38 MAPK Involved in Ovarian Development of Brachymystax lenok. Front Vet Sci 2022; 9:752521. [PMID: 35252414 PMCID: PMC8889577 DOI: 10.3389/fvets.2022.752521] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2021] [Accepted: 01/13/2022] [Indexed: 11/17/2022] [Open Access]
Abstract
Brachymystax lenok (lenok) is a rare cold-water fish native to China that is of high meat quality. Its wild population has declined sharply in recent years, and therefore, exploring the molecular mechanisms underlying the development and reproduction of lenoks for the purposes of artificial breeding and genetic improvement is necessary. The lenok comparative transcriptome was analyzed by combining single molecule, real-time, and next generation sequencing (NGS) technology. Differentially expressed genes (DEGs) were identified in five tissues (head kidney, spleen, liver, muscle, and gonad) between immature [300 days post-hatching (dph)] and mature [three years post-hatching (ph)] lenoks. In total, 234,124 and 229,008 full-length non-chimeric reads were obtained from the immature and mature sequencing data, respectively. After NGS correction, 61,405 and 59,372 non-redundant transcripts were obtained for the expression level and pathway enrichment analyses, respectively. Compared with the mature group, 719 genes with significantly increased expression and 1,727 genes with significantly decreased expression in all five tissues were found in the immature group. Furthermore, DEGs and pathways involved in the endocrine system and gonadal development were identified, and p38 mitogen-activated protein kinases (MAPKs) were identified as potentially regulating gonadal development in lenok. Inhibiting the activity of p38 MAPKs resulted in abnormal levels of gonadotropin-releasing hormone, follicle-stimulating hormone, and estradiol, and affected follicular development. The full-length transcriptome data obtained in this study may provide a valuable reference for the study of gene function, gene expression, and evolutionary relationships in B. lenok and may illustrate the basic regulatory mechanism of ovarian development in teleosts.
Affiliation(s)
- Tianqing Huang
  - Key Laboratory of Freshwater Aquatic Biotechnology and Breeding, Ministry of Agriculture and Rural Affairs, Heilongjiang River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Harbin, China
- Wei Gu
  - Key Laboratory of Freshwater Aquatic Biotechnology and Breeding, Ministry of Agriculture and Rural Affairs, Heilongjiang River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Harbin, China
- Enhui Liu
  - Key Laboratory of Freshwater Aquatic Biotechnology and Breeding, Ministry of Agriculture and Rural Affairs, Heilongjiang River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Harbin, China
- Lanlan Zhang
  - Heilongjiang Province General Station of Aquatic Technology Promotion, Harbin, China
- Fulin Dong
  - Key Laboratory of Freshwater Aquatic Biotechnology and Breeding, Ministry of Agriculture and Rural Affairs, Heilongjiang River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Harbin, China
- Xianchen He
  - Heilongjiang Aquatic Animal Resource Conservation Center, Harbin, China
- Wenlong Jiao
  - Gansu Fisheries Research Institute, Lanzhou, China
- Chunyu Li
  - Xinjiang Tianyun Organic Agriculture Co., Yili Group, Hohhot, China
- Bingqian Wang
  - Key Laboratory of Freshwater Aquatic Biotechnology and Breeding, Ministry of Agriculture and Rural Affairs, Heilongjiang River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Harbin, China
  - *Correspondence: Bingqian Wang
- Gefeng Xu
  - Key Laboratory of Freshwater Aquatic Biotechnology and Breeding, Ministry of Agriculture and Rural Affairs, Heilongjiang River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Harbin, China
3. Benner P, Vingron M. Quantifying the tissue-specific regulatory information within enhancer DNA sequences. NAR Genom Bioinform 2021; 3:lqab095. [PMID: 34729474 PMCID: PMC8557370 DOI: 10.1093/nargab/lqab095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2021] [Revised: 09/23/2021] [Accepted: 09/28/2021] [Indexed: 12/04/2022] [Open Access]
Abstract
Recent efforts to measure epigenetic marks across a wide variety of different cell types and tissues provide insights into the cell type-specific regulatory landscape. We use these data to study whether there exists a correlate of epigenetic signals in the DNA sequence of enhancers and explore with computational methods to what degree such sequence patterns can be used to predict cell type-specific regulatory activity. By constructing classifiers that predict in which tissues enhancers are active, we are able to identify sequence features that might be recognized by the cell in order to regulate gene expression. While classification performances vary greatly between tissues, we show examples where our classifiers correctly predict tissue-specific regulation from sequence alone. We also show that many of the informative patterns indeed harbor transcription factor footprints.
Affiliation(s)
- Philipp Benner
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
- Martin Vingron
  - Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 73, 14195 Berlin, Germany
4. Lee JY, Nguyen B, Orosco C, Styczynski MP. SCOUR: a stepwise machine learning framework for predicting metabolite-dependent regulatory interactions. BMC Bioinformatics 2021; 22:365. [PMID: 34238207 PMCID: PMC8268592 DOI: 10.1186/s12859-021-04281-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/08/2021] [Accepted: 06/30/2021] [Indexed: 11/22/2022] [Open Access]
Abstract
BACKGROUND: The topology of metabolic networks is both well-studied and remarkably well-conserved across many species. The regulation of these networks, however, is much more poorly characterized, though it is known to be divergent across organisms, two characteristics that make it difficult to model metabolic networks accurately. While many computational methods have been built to unravel transcriptional regulation, there have been few approaches developed for systems-scale analysis and study of metabolic regulation. Here, we present a stepwise machine learning framework that applies established algorithms to identify regulatory interactions in metabolic systems based on metabolic data: stepwise classification of unknown regulation, or SCOUR.
RESULTS: We evaluated our framework on both noiseless and noisy data, using several models of varying sizes and topologies to show that our approach is generalizable. We found that, when testing on data under the most realistic conditions (low sampling frequency and high noise), SCOUR could identify reaction fluxes controlled only by the concentration of a single metabolite (its primary substrate) with high accuracy. The positive predictive value (PPV) for identifying reactions controlled by the concentration of two metabolites ranged from 32 to 88% for noiseless data, 9.2 to 49% for either low sampling frequency/low noise or high sampling frequency/high noise data, and 6.6 to 27% for low sampling frequency/high noise data, with results typically sufficiently high for lab validation to be a practical endeavor. While the PPVs for reactions controlled by three metabolites were lower, they were still in most cases significantly better than random classification.
CONCLUSIONS: SCOUR uses a novel approach to synthetically generate the training data needed to identify regulators of reaction fluxes in a given metabolic system, enabling metabolomics and fluxomics data to be leveraged for regulatory structure inference. By identifying and triaging the most likely candidate regulatory interactions, SCOUR can drastically reduce the amount of time needed to identify and experimentally validate metabolic regulatory interactions. As high-throughput experimental methods for testing these interactions are further developed, SCOUR will provide critical impact in the development of predictive metabolic models in new organisms and pathways.
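The positive predictive value reported in this abstract is the fraction of predicted regulatory interactions that are true. The sketch below only illustrates that definition; it is not SCOUR's code, and the function name and the (reaction, metabolite) pair representation are assumptions made for the example.

```python
def positive_predictive_value(predicted, actual):
    """Return PPV = TP / (TP + FP).

    `predicted` and `actual` are collections of hypothetical
    (reaction, metabolite) pairs; this representation is an assumption
    for illustration, not the data format used by SCOUR.
    """
    predicted = set(predicted)
    actual = set(actual)
    if not predicted:
        raise ValueError("PPV is undefined when nothing is predicted")
    true_positives = len(predicted & actual)
    # Every predicted pair is either a true or a false positive,
    # so TP + FP equals the number of predictions.
    return true_positives / len(predicted)
```

A PPV of 0.25, for instance, would mean that on average one in four predicted interactions is expected to survive experimental validation.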
Affiliation(s)
- Justin Y Lee
  - School of Chemical & Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Britney Nguyen
  - School of Chemical & Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Carlos Orosco
  - School of Chemical & Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Mark P Styczynski
  - School of Chemical & Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA, USA
5. Hosseini S, Schmitt AO, Tetens J, Brenig B, Simianer H, Sharifi AR, Gültas M. In Silico Prediction of Transcription Factor Collaborations Underlying Phenotypic Sexual Dimorphism in Zebrafish (Danio rerio). Genes (Basel) 2021; 12:873. [PMID: 34200177 PMCID: PMC8227731 DOI: 10.3390/genes12060873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2021] [Revised: 06/02/2021] [Accepted: 06/05/2021] [Indexed: 11/17/2022] [Open Access]
Abstract
The transcriptional regulation of gene expression in higher organisms is essential for different cellular and biological processes. These processes are controlled by transcription factors and their combinatorial interplay, which are crucial for complex genetic programs and transcriptional machinery. The regulation of sex-biased gene expression plays a major role in phenotypic sexual dimorphism in many species, causing dimorphic gene expression patterns between two different sexes. The role of transcription factor (TF) in gene regulatory mechanisms so far has not been studied for sex determination and sex-associated colour patterning in zebrafish with respect to phenotypic sexual dimorphism. To address this open biological issue, we applied bioinformatics approaches for identifying the predicted TF pairs based on their binding sites for sex and colour genes in zebrafish. In this study, we identified 25 (e.g., STAT6-GATA4; JUN-GATA4; SOX9-JUN) and 14 (e.g., IRF-STAT6; SOX9-JUN; STAT6-GATA4) potentially cooperating TFs based on their binding patterns in promoter regions for sex determination and colour pattern genes in zebrafish, respectively. The comparison between identified TFs for sex and colour genes revealed several predicted TF pairs (e.g., STAT6-GATA4; JUN-SOX9) are common for both phenotypes, which may play a pivotal role in phenotypic sexual dimorphism in zebrafish.
Affiliation(s)
- Shahrbanou Hosseini
  - Molecular Biology of Livestock and Molecular Diagnostics Group, Department of Animal Sciences, University of Göttingen, 37077 Göttingen, Germany
  - Functional Breeding Group, Department of Animal Sciences, University of Göttingen, 37077 Göttingen, Germany
  - Institute of Veterinary Medicine, University of Göttingen, 37077 Göttingen, Germany
  - Center for Integrated Breeding Research (CiBreed), University of Göttingen, 37075 Göttingen, Germany
- Armin Otto Schmitt
  - Center for Integrated Breeding Research (CiBreed), University of Göttingen, 37075 Göttingen, Germany
  - Breeding Informatics Group, Department of Animal Sciences, University of Göttingen, 37075 Göttingen, Germany
- Jens Tetens
  - Functional Breeding Group, Department of Animal Sciences, University of Göttingen, 37077 Göttingen, Germany
  - Center for Integrated Breeding Research (CiBreed), University of Göttingen, 37075 Göttingen, Germany
- Bertram Brenig
  - Molecular Biology of Livestock and Molecular Diagnostics Group, Department of Animal Sciences, University of Göttingen, 37077 Göttingen, Germany
  - Institute of Veterinary Medicine, University of Göttingen, 37077 Göttingen, Germany
  - Center for Integrated Breeding Research (CiBreed), University of Göttingen, 37075 Göttingen, Germany
- Henner Simianer
  - Center for Integrated Breeding Research (CiBreed), University of Göttingen, 37075 Göttingen, Germany
  - Animal Breeding and Genetics Group, Department of Animal Sciences, University of Göttingen, 37075 Göttingen, Germany
- Ahmad Reza Sharifi
  - Center for Integrated Breeding Research (CiBreed), University of Göttingen, 37075 Göttingen, Germany
  - Animal Breeding and Genetics Group, Department of Animal Sciences, University of Göttingen, 37075 Göttingen, Germany
- Mehmet Gültas
  - Center for Integrated Breeding Research (CiBreed), University of Göttingen, 37075 Göttingen, Germany
  - Breeding Informatics Group, Department of Animal Sciences, University of Göttingen, 37075 Göttingen, Germany
  - Faculty of Agriculture, South Westphalia University of Applied Sciences, 59494 Soest, Germany
6. Li H, Quang D, Guan Y. Anchor: trans-cell type prediction of transcription factor binding sites. Genome Res 2019; 29:281-292. [PMID: 30567711 PMCID: PMC6360811 DOI: 10.1101/gr.237156.118] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 03/15/2018] [Accepted: 12/13/2018] [Indexed: 12/16/2022]
Abstract
The ENCyclopedia of DNA Elements (ENCODE) consortium has generated transcription factor (TF) binding ChIP-seq data covering hundreds of TF proteins and cell types; however, due to limits on time and resources, only a small fraction of all possible TF-cell type pairs have been profiled. One solution is to build machine learning models trained on currently available epigenomic data sets that can be applied to the remaining missing pairs. A major challenge is that TF binding sites are cell-type-specific, which can be attributed to cellular contexts such as chromatin accessibility. Meanwhile, indirect TF-DNA binding and interactions between TFs complicate this regulatory process. Technical issues such as sequencing biases and batch effects render the prediction task even more challenging. Many pioneering efforts have been made to predict TF binding profiles based on DNA sequence and DNase-seq footprints, but to what extent a model can be generalized to completely untested cell conditions remains unknown. In this study, we describe our first-place solution to the 2017 ENCODE-DREAM in vivo TF binding site prediction challenge. By carefully addressing multisource biases and information imbalance across cell types, we created a pipeline that significantly outperforms the current state-of-the-art methods. The proposed method is sufficiently complex to model nonlinear interactions between TF binding motifs and chromatin accessibility information up to 1500 bp from the genomic region of interest.
Affiliation(s)
- Hongyang Li
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Daniel Quang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Yuanfang Guan
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
7. Subramanian S, Thomas T. Regular expression based pattern extraction from a cell-specific gene expression data. Informatics in Medicine Unlocked 2019. [DOI: 10.1016/j.imu.2019.100269] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/28/2022] [Open Access]
8. Lee NK, Li X, Wang D. A comprehensive survey on genetic algorithms for DNA motif prediction. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.07.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 12/25/2022]
9. Leveraging human genetic and adverse outcome pathway (AOP) data to inform susceptibility in human health risk assessment. Mamm Genome 2018; 29:190-204. [DOI: 10.1007/s00335-018-9738-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 10/04/2017] [Accepted: 01/31/2018] [Indexed: 12/19/2022]
10. Al-Ssulami AM, Azmi AM, Mathkour H. An efficient method for significant motifs discovery from multiple DNA sequences. J Bioinform Comput Biol 2017; 15:1750014. [PMID: 28571483 DOI: 10.1142/s0219720017500147] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]
Abstract
Identification of transcription factor binding sites or biological motifs is an important step in deciphering the mechanisms of gene regulation. It is a classic problem that has eluded a satisfactory and efficient solution. In this paper, we devise a three-phase algorithm to mine for biologically significant motifs. In the first phase, we generate all the possible string motifs. This phase is followed by a filtering process in which we discard all motifs that do not meet the constraints. In the final phase, motifs are scored and ranked using a combination of stochastic techniques and [Formula: see text]-value. We show that our method outperforms some very well-known motif discovery tools, e.g. MEME and Weeder, on well-established benchmark data suites. We also apply the algorithm to the non-coding regions of M. tuberculosis and report significant motifs of size 10 with excellent [Formula: see text]-values in a fraction of the time that MEME and MoSDi took. In fact, among the best 10 motifs ([Formula: see text]-value wise) in the non-coding regions of M. tuberculosis reported by the tools MEME, MoSDi and ours, five were discovered by our approach, including the third and the fourth best ones. All of this was achieved in 1/17 and 1/6 of the time taken by MEME and MoSDi, respectively.
Affiliation(s)
- Abdulrakeeb M Al-Ssulami
  - Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
- Aqil M Azmi
  - Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
- Hassan Mathkour
  - Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
11. Yu Q, Huo H, Feng D. PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets. BioMed Research International 2016; 2016:4986707. [PMID: 27843946 PMCID: PMC5098105 DOI: 10.1155/2016/4986707] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 06/22/2016] [Revised: 09/04/2016] [Accepted: 09/27/2016] [Indexed: 11/18/2022]
Abstract
Identifying conserved patterns in DNA sequences, namely, motif discovery, is an important and challenging computational task. With hundreds or more sequences contained, the high-throughput sequencing data set is helpful to improve the identification accuracy of motif discovery but requires an even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed by extracting and combining pairs of l-mers in the input with relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP, but also for other DNA data mining tasks with the same demand. Experimental results on the simulated data show that the proposed algorithm can find motifs successfully and runs faster than the state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.
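The core idea named in this abstract, collecting pairs of l-mers that lie within a small Hamming distance of each other, can be illustrated with a brute-force sketch. This is not PairMotifChIP's rapid extraction method, which the abstract does not detail; the function names and the choice to compare only l-mers from different sequences are assumptions made for the example.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def lmers(seq, l):
    """All (start, substring) windows of length l in seq."""
    return [(i, seq[i:i + l]) for i in range(len(seq) - l + 1)]


def close_lmer_pairs(sequences, l, max_dist):
    """Brute-force search for l-mer pairs from two different sequences
    whose Hamming distance is at most max_dist.

    Each hit is ((seq_index, start, lmer), (seq_index, start, lmer)).
    Runtime is quadratic in the total number of l-mers, so this is only
    meant for small illustrative inputs.
    """
    pairs = []
    for (si, s), (sj, t) in combinations(enumerate(sequences), 2):
        for i, a in lmers(s, l):
            for j, b in lmers(t, l):
                if hamming(a, b) <= max_dist:
                    pairs.append(((si, i, a), (sj, j, b)))
    return pairs
```

On large ChIP-seq inputs the quadratic scan is the bottleneck, which is presumably why the paper emphasises a faster way of extracting such pairs.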
Affiliation(s)
- Qiang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an 710071, China
- Hongwei Huo
  - School of Computer Science and Technology, Xidian University, Xi'an 710071, China
- Dazheng Feng
  - School of Electronic Engineering, Xidian University, Xi'an 710071, China
12. Zhang S, Chen Y. CLIMP: Clustering Motifs via Maximal Cliques with Parallel Computing Design. PLoS One 2016; 11:e0160435. [PMID: 27487245 PMCID: PMC4972426 DOI: 10.1371/journal.pone.0160435] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 02/21/2016] [Accepted: 07/19/2016] [Indexed: 11/19/2022] [Open Access]
Abstract
A set of conserved binding sites recognized by a transcription factor is called a motif, which can be found by many applications of comparative genomics for identifying over-represented segments. Moreover, when numerous putative motifs are predicted from a collection of genome-wide data, their similarity data can be represented as a large graph, where these motifs are connected to one another. However, an efficient clustering algorithm is desired for clustering the motifs that belong to the same groups and separating the motifs that belong to different groups, or even deleting an amount of spurious ones. In this work, a new motif clustering algorithm, CLIMP, is proposed by using maximal cliques and sped up by parallelizing its program. When a synthetic motif dataset from the database JASPAR, a set of putative motifs from a phylogenetic foot-printing dataset, and a set of putative motifs from a ChIP dataset are used to compare the performances of CLIMP and two other high-performance algorithms, the results demonstrate that CLIMP mostly outperforms the two algorithms on the three datasets for motif clustering, so that it can be a useful complement of the clustering procedures in some genome-wide motif prediction pipelines. CLIMP is available at http://sqzhang.cn/climp.html.
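To make the graph formulation in this abstract concrete, the sketch below builds a motif similarity graph from pairwise scores and enumerates its maximal cliques with the basic Bron-Kerbosch recursion. It is not CLIMP's implementation: the similarity threshold, the score dictionary, and the function names are assumptions for illustration, and the clustering and parallelization steps that CLIMP adds are not shown.

```python
def similarity_graph(scores, threshold):
    """Build an undirected adjacency map from pairwise similarity scores.

    `scores` maps (motif_a, motif_b) to a similarity value; motifs are
    connected when the value is at least `threshold` (hypothetical cutoff).
    """
    adj = {}
    for (a, b), s in scores.items():
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if s >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed vertices.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques
```

Each returned clique is a group of motifs that are all mutually similar, which is a natural seed for a motif cluster.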
Affiliation(s)
- Shaoqiang Zhang
  - College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
  - * E-mail: (SZ); (YC)
- Yong Chen
  - National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
  - Department of Biological Sciences, Center for Systems Biology, The University of Texas at Dallas, Richardson, Texas, United States of America
  - * E-mail: (SZ); (YC)
13. Acuña V, Aravena A, Guziolowski C, Eveillard D, Siegel A, Maass A. Deciphering transcriptional regulations coordinating the response to environmental changes. BMC Bioinformatics 2016; 17:35. [PMID: 26772805 PMCID: PMC4715341 DOI: 10.1186/s12859-016-0885-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/06/2015] [Accepted: 01/08/2016] [Indexed: 11/20/2022] [Open Access]
Abstract
Background: Gene co-expression evidenced as a response to environmental changes has shown that transcriptional activity is coordinated, which pinpoints the role of transcriptional regulatory networks (TRNs). Nevertheless, the prediction of TRNs based on the affinity of transcription factors (TFs) with binding sites (BSs) generally produces an over-estimation of the observable TF/BS relations within the network, and therefore many of the predicted relations are spurious.
Results: We present Lombarde, a bioinformatics method that extracts from a TRN determined from a set of predicted TF/BS affinities a subnetwork explaining a given set of observed co-expressions by choosing the TFs and BSs most likely to be involved in the co-regulation. Lombarde solves an optimization problem which selects confident paths within a given TRN that join a putative common regulator with two co-expressed genes via regulatory cascades. To evaluate the method, we used public data of Escherichia coli to produce a regulatory network that explained almost all observed co-expressions while using only 19% of the input TF/BS affinities but including about 66% of the independent experimentally validated regulations in the input data. When all known validated TF/BS affinities were integrated into the input data, the precision of Lombarde increased significantly. The topological characteristics of the subnetwork that was obtained were similar to the characteristics described for known validated TRNs.
Conclusions: Lombarde provides a useful modeling scheme for deciphering the regulatory mechanisms that underlie the phenotypic responses of an organism to environmental challenges. The method can become a reliable tool for further research on genome-scale transcriptional regulation studies.
Affiliation(s)
- Vicente Acuña
  - Center for Mathematical Modeling (UMI-CNRS 2807), Universidad de Chile, Santiago, Chile
  - Center for Genome Regulation, Universidad de Chile, Santiago, Chile
- Andrés Aravena
  - Department of Molecular Biology and Genetics, Istanbul University, Istanbul, Turkey
- Damien Eveillard
  - LINA (UMR CNRS 6241), Université de Nantes, École des Mines de Nantes, Nantes, France
- Anne Siegel
  - IRISA Project Dyliss (UMR CNRS 6074), Université de Rennes 1, Rennes, France
- Alejandro Maass
  - Center for Mathematical Modeling (UMI-CNRS 2807), Universidad de Chile, Santiago, Chile
  - Center for Genome Regulation, Universidad de Chile, Santiago, Chile
  - Department of Mathematical Engineering, Universidad de Chile, Santiago, Chile
14. Maynou J, Pairó E, Marco S, Perera A. Sequence information gain based motif analysis. BMC Bioinformatics 2015; 16:377. [PMID: 26553056 PMCID: PMC4640167 DOI: 10.1186/s12859-015-0811-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/22/2014] [Accepted: 10/30/2015] [Indexed: 11/23/2022] [Open Access]
Abstract
BACKGROUND: The detection of regulatory regions in candidate sequences is essential for the understanding of the regulation of a particular gene and the mechanisms involved. This paper proposes a novel methodology based on information theoretic metrics for finding regulatory sequences in promoter regions.
RESULTS: This methodology (SIGMA) has been tested on genomic sequence data for Homo sapiens and Mus musculus. SIGMA has been compared with different publicly available alternatives for motif detection, such as MEME/MAST, Biostrings (Bioconductor package), MotifRegressor, and previous work such as Qresiduals projections or information theoretic based detectors. Comparative results, in the form of Receiver Operating Characteristic curves, show how, in 70% of the studied Transcription Factor Binding Sites, the SIGMA detector has a better performance and behaves more robustly than the methods compared, while having a similar computational time. The performance of SIGMA can be explained by its parametric simplicity in the modelling of the non-linear co-variability in the binding motif positions.
CONCLUSIONS: Sequence Information Gain based Motif Analysis is a generalisation of a non-linear model of the cis-regulatory sequences detection based on Information Theory. This generalisation allows us to detect transcription factor binding sites with maximum performance disregarding the covariability observed in the positions of the training set of sequences. SIGMA is freely available to the public at http://b2slab.upc.edu.
Affiliation(s)
- Joan Maynou
  - Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, Pau Gargallo, 5, Barcelona, 08028, Spain
  - CIBER de Bioingeniería, Biomateriales y Biomedicina, Spain
- Erola Pairó
  - Institute for BioEngineering of Catalonia, balidiri Reixach 4-6, Barcelona, 08028, Spain
  - Electronics Department in the University of Barcelona (UB), Martí i Franquès, 1, Barcelona, 08028, Spain
- Santiago Marco
  - Institute for BioEngineering of Catalonia, balidiri Reixach 4-6, Barcelona, 08028, Spain
  - Electronics Department in the University of Barcelona (UB), Martí i Franquès, 1, Barcelona, 08028, Spain
- Alexandre Perera
  - Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, Pau Gargallo, 5, Barcelona, 08028, Spain
  - CIBER de Bioingeniería, Biomateriales y Biomedicina, Spain
15
|
Suryamohan K, Halfon MS. Identifying transcriptional cis-regulatory modules in animal genomes. WILEY INTERDISCIPLINARY REVIEWS. DEVELOPMENTAL BIOLOGY 2015; 4:59-84. [PMID: 25704908 PMCID: PMC4339228 DOI: 10.1002/wdev.168] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Received: 09/24/2014] [Revised: 11/04/2014] [Accepted: 11/16/2014] [Indexed: 11/08/2022]
Abstract
Gene expression is regulated through the activity of transcription factors (TFs) and chromatin-modifying proteins acting on specific DNA sequences, referred to as cis-regulatory elements. These include promoters, located at the transcription initiation sites of genes, and a variety of distal cis-regulatory modules (CRMs), the most common of which are transcriptional enhancers. Because regulated gene expression is fundamental to cell differentiation and acquisition of new cell fates, identifying, characterizing, and understanding the mechanisms of action of CRMs is critical for understanding development. CRM discovery has historically been challenging, as CRMs can be located far from the genes they regulate, have few readily identifiable sequence characteristics, and for many years were not amenable to high-throughput discovery methods. However, the recent availability of complete genome sequences and the development of next-generation sequencing methods have led to an explosion of both computational and empirical methods for CRM discovery in model and nonmodel organisms alike. Experimentally, CRMs can be identified through chromatin immunoprecipitation directed against TFs or histone post-translational modifications, identification of nucleosome-depleted 'open' chromatin regions, or sequencing-based high-throughput functional screening. Computational methods include comparative genomics, clustering of known or predicted TF-binding sites, and supervised machine-learning approaches trained on known CRMs. All of these methods have proven effective for CRM discovery, but each has its own considerations and limitations, and each is subject to a greater or lesser number of false-positive identifications. Experimental confirmation of predictions is essential, although shortcomings in current methods suggest that additional means of validation need to be developed.
CONFLICT OF INTEREST The authors have declared no conflicts of interest for this article.
Affiliation(s)
- Kushal Suryamohan
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY 14203, USA
- Marc S. Halfon
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biological Sciences, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biomedical Informatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY 14203, USA
- Molecular and Cellular Biology Department and Program in Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
16
Dai Z, Guo D, Dai X, Xiong Y. Genome-wide analysis of transcription factor binding sites and their characteristic DNA structures. BMC Genomics 2015; 16 Suppl 3:S8. [PMID: 25708259 PMCID: PMC4331811 DOI: 10.1186/1471-2164-16-s3-s8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
Background Transcription factors (TF) regulate gene expression by binding DNA regulatory regions. Transcription factor binding sites (TFBSs) are conserved not only in primary DNA sequences but also in DNA structures. However, the global relationship between TFs and their preferred DNA structures remains to be elucidated. Results In this paper, we have developed a computational method to generate a genome-wide landscape of TFs and their characteristic binding DNA structures in Saccharomyces cerevisiae. We revealed DNA structural features for different TFs. The structural conservation shows positional preference in TFBSs. Structural levels of DNA sequences are correlated with TF-DNA binding affinities. Conclusions We provided the genome-wide correspondences of TFs to DNA structures. Our findings will have implications in understanding TF regulatory mechanisms.
17
Mahdevar G, Nowzari-Dalini A, Sadeghi M. Inferring gene correlation networks from transcription factor binding sites. Genes Genet Syst 2014; 88:301-9. [PMID: 24694393 DOI: 10.1266/ggs.88.301] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022]
Abstract
Gene expression is a highly regulated biological process that is fundamental to the existence of phenotypes of any living organism. The regulatory relations are usually modeled as a network: every gene is modeled as a node, and relations are shown as edges between two related genes. This paper presents a novel method for inferring correlation networks (networks constructed by connecting co-expressed genes) by predicting co-expression levels from the genes' promoter sequences. According to the results, this method works well on biological data and its outcome is comparable to that of methods that use microarray data as input. The method is written in C++ and is available upon request from the corresponding author.
Affiliation(s)
- Ghasem Mahdevar
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran
18
Tanaka E, Bailey TL, Keich U. Improving MEME via a two-tiered significance analysis. Bioinformatics 2014; 30:1965-73. [PMID: 24665130 PMCID: PMC4080741 DOI: 10.1093/bioinformatics/btu163] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 10/02/2013] [Revised: 02/20/2014] [Accepted: 03/19/2014] [Indexed: 11/13/2022]
Abstract
MOTIVATION With over 9000 unique users recorded in the first half of 2013, MEME is one of the most popular motif-finding tools available. Reliable estimates of the statistical significance of motifs can greatly increase the usefulness of any motif finder. By analogy, it is difficult to imagine evaluating a BLAST result without its accompanying E-value. Currently MEME evaluates its EM-generated candidate motifs using an extension of BLAST's E-value to the motif-finding context. Although we previously indicated the drawbacks of MEME's current significance evaluation, we did not offer a practical substitute suited for its needs, especially because MEME also relies on the E-value internally to rank competing candidate motifs. RESULTS Here we offer a two-tiered significance analysis that can replace the E-value in selecting the best candidate motif and in evaluating its overall statistical significance. We show that our new approach could substantially improve MEME's motif-finding performance and would also provide the user with a reliable significance analysis. In addition, for large input sets, our new approach is in fact faster than the currently implemented E-value analysis.
Affiliation(s)
- Emi Tanaka
- School of Mathematics and Statistics, University of Sydney, Sydney 2006, School of Mathematics and Applied Statistics, University of Wollongong, Wollongong 2522, New South Wales and Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland 4072, Australia
- Timothy L Bailey
- School of Mathematics and Statistics, University of Sydney, Sydney 2006, School of Mathematics and Applied Statistics, University of Wollongong, Wollongong 2522, New South Wales and Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland 4072, Australia
- Uri Keich
- School of Mathematics and Statistics, University of Sydney, Sydney 2006, School of Mathematics and Applied Statistics, University of Wollongong, Wollongong 2522, New South Wales and Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland 4072, Australia
19
|
Azmi AM, Al-Ssulami A. Encoded expansion: an efficient algorithm to discover identical string motifs. PLoS One 2014; 9:e95148. [PMID: 24871320 PMCID: PMC4037181 DOI: 10.1371/journal.pone.0095148] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/14/2013] [Accepted: 03/24/2014] [Indexed: 11/19/2022]
Abstract
A major task in computational biology is the discovery of short recurring string patterns known as motifs. Most of the schemes to discover motifs are either stochastic or combinatorial in nature. Stochastic approaches do not guarantee finding the correct motifs, while the combinatorial schemes tend to have an exponential time complexity with respect to motif length. To alleviate the cost, the combinatorial approach exploits dynamic data structures such as trees or graphs. Recently (Karci (2009) Efficient automatic exact motif discovery algorithms for biological sequences, Expert Systems with Applications 36:7952-7963) devised a deterministic algorithm that finds all the identical copies of string motifs of all sizes [Formula: see text] in theoretical time complexity of [Formula: see text] and a space complexity of [Formula: see text] where [Formula: see text] is the length of the input sequence and [Formula: see text] is the length of the longest possible string motif. In this paper, we present a significant improvement on Karci's original algorithm. The algorithm that we propose reports all identical string motifs of sizes [Formula: see text] that occur at least [Formula: see text] times. Our algorithm starts with string motifs of size 2, and at each iteration it expands the candidate string motifs by one symbol throwing out those that occur less than [Formula: see text] times in the entire input sequence. We use a simple array and data encoding to achieve theoretical worst-case time complexity of [Formula: see text] and a space complexity of [Formula: see text] Encoding of the substrings can speed up the process of comparison between string motifs. Experimental results on random and real biological sequences confirm that our algorithm has indeed a linear time complexity and it is more scalable in terms of sequence length than the existing algorithms.
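The expansion step described in this abstract (start with motifs of size 2, grow each candidate by one symbol, discard those below the occurrence threshold) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses plain dictionaries instead of the array and substring encoding the abstract mentions, so it makes no claim to the stated time or space complexity. The names `frequent_identical_motifs`, `min_count`, and `max_len` are hypothetical, and counting overlapping occurrences is an assumption.

```python
from collections import Counter


def frequent_identical_motifs(sequence, min_count, max_len=None):
    """Return {motif: count} for substrings of length >= 2 occurring >= min_count times.

    Illustrative only: overlapping occurrences are counted (an assumption),
    and no encoding trick from the paper is reproduced here.
    """
    n = len(sequence)
    if max_len is None:
        max_len = n

    result = {}
    length = 2
    # Count every window of size 2, then keep only the frequent ones.
    counts = Counter(sequence[i:i + 2] for i in range(n - 1))
    survivors = {m: c for m, c in counts.items() if c >= min_count}

    while survivors and length <= max_len:
        result.update(survivors)
        length += 1
        counts = Counter()
        for i in range(n - length + 1):
            # Expand by one symbol only where the shorter prefix survived;
            # a longer motif cannot be frequent if its prefix is not.
            if sequence[i:i + length - 1] in survivors:
                counts[sequence[i:i + length]] += 1
        survivors = {m: c for m, c in counts.items() if c >= min_count}

    return result


motifs = frequent_identical_motifs("ACGTACGTAC", min_count=2)
```

Pruning on the surviving prefix is what keeps the candidate set small at each iteration; a motif whose prefix was discarded is never counted again.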
Affiliation(s)
- Aqil M. Azmi
- Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Abdulrakeeb Al-Ssulami
- Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh, Saudi Arabia
20
Identifying functional transcription factor binding sites in yeast by considering their positional preference in the promoters. PLoS One 2014; 8:e83791. [PMID: 24386279 PMCID: PMC3873331 DOI: 10.1371/journal.pone.0083791] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/30/2013] [Accepted: 11/08/2013] [Indexed: 11/25/2022]
Abstract
Transcription factor binding site (TFBS) identification plays an important role in deciphering gene regulatory codes. With comprehensive knowledge of TFBSs, one can understand molecular mechanisms of gene regulation. In the recent decades, various computational approaches have been proposed to predict TFBSs in the genome. The TFBS dataset of a TF generated by each algorithm is a ranked list of predicted TFBSs of that TF, where top ranked TFBSs are statistically significant ones. However, whether these statistically significant TFBSs are functional (i.e. biologically relevant) is still unknown. Here we develop a post-processor, called the functional propensity calculator (FPC), to assign a functional propensity to each TFBS in the existing computationally predicted TFBS datasets. It is known that functional TFBSs reveal strong positional preference towards the transcriptional start site (TSS). This motivates us to take TFBS position relative to the TSS as the key idea in building our FPC. Based on our calculated functional propensities, the TFBSs of a TF in the original TFBS dataset could be reordered, where top ranked TFBSs are now the ones with high functional propensities. To validate the biological significance of our results, we perform three published statistical tests to assess the enrichment of Gene Ontology (GO) terms, the enrichment of physical protein-protein interactions, and the tendency of being co-expressed. The top ranked TFBSs in our reordered TFBS dataset outperform the top ranked TFBSs in the original TFBS dataset, justifying the effectiveness of our post-processor in extracting functional TFBSs from the original TFBS dataset. More importantly, assigning functional propensities to putative TFBSs enables biologists to easily identify which TFBSs in the promoter of interest are likely to be biologically relevant and are good candidates to do further detailed experimental investigation. 
The FPC is implemented as a web tool at http://santiago.ee.ncku.edu.tw/FPC/.
21
Zhang S, Zhou X, Du C, Su Z. SPIC: a novel similarity metric for comparing transcription factor binding site motifs based on information contents. BMC SYSTEMS BIOLOGY 2013; 7 Suppl 2:S14. [PMID: 24564945 PMCID: PMC3866262 DOI: 10.1186/1752-0509-7-s2-s14] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022]
Abstract
BACKGROUND Discovering transcription factor binding sites (TFBS) is one of the primary challenges in deciphering the complex gene regulatory networks encoded in a genome. A set of short DNA sequences identified by a transcription factor (TF) is known as a motif, which can be expressed accurately in matrix form, such as a position-specific scoring matrix (PSSM) or a position frequency matrix (PFM). Very frequently, we need to query a motif in a database of motifs by seeking its similar motifs, merge similar TFBS motifs possibly identified by the same TF, separate irrelevant motifs, or filter out spurious motifs. Therefore, a novel metric is required to capture slight differences between irrelevant motifs and highlight the similarity between motifs of the same group in all these applications. Although several metrics for motif similarity have already been proposed, their performance is still far from satisfactory for these applications. METHODS This paper proposes a novel metric, named SPIC (Similarity with Position Information Contents), for measuring the similarity between a column of one motif and a column of another motif. When defining this similarity score, we consider the likelihood that the column of the first motif's PFM can be produced by the column of the second motif's PSSM, and multiply the likelihood by the information content of the column of the second motif's PSSM, and vice versa. We evaluated the performance of SPIC combined with a local or a global alignment method with an affine gap penalty function for computing the similarity between two motifs. We also compared SPIC with seven existing state-of-the-art metrics for their capability of clustering motifs from the same group and retrieving motifs from a database on three datasets.
RESULTS When used jointly with the Smith-Waterman local alignment method with an affine gap penalty function (gap open penalty is equal to 1, gap extension penalty is equal to 0.5), SPIC outperforms the seven existing state-of-the-art motif similarity metrics combined with their best alignments for matching motifs in database searches, and clustering the same TF's sub-motifs or distinguishing relevant ones from a miscellaneous group of motifs. CONCLUSIONS We have developed a novel motif similarity metric that can more accurately match motifs in database searches, and more effectively cluster similar motifs and differentiate irrelevant motifs than do the other seven metrics we are aware of.
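The column-level idea in the METHODS text (score how well one column explains the other, weight by the information content of the explaining column, then do the same in the other direction) can be sketched as follows. This is not the published SPIC formula, which the abstract does not give. The expected log-odds term, the uniform background of 0.25, the pseudocount of 0.5, and all function names are assumptions chosen only to make the symmetric weighting concrete.

```python
import math

BACKGROUND = 0.25  # assumed uniform frequency for A, C, G, T


def normalise(counts, pseudocount=0.5):
    """Turn one PFM column of raw counts into probabilities (pseudocount assumed)."""
    total = sum(counts) + 4 * pseudocount
    return [(c + pseudocount) / total for c in counts]


def information_content(probs):
    """Information content in bits of one column against the uniform background."""
    return sum(p * math.log2(p / BACKGROUND) for p in probs if p > 0)


def expected_log_odds(p_from, p_model):
    """Average log-odds of bases drawn from p_from when scored by p_model.

    Stand-in for the 'likelihood' term in the abstract; p_model must be
    strictly positive, which normalise() guarantees.
    """
    return sum(pf * math.log2(pm / BACKGROUND) for pf, pm in zip(p_from, p_model))


def column_similarity(counts_a, counts_b):
    """Symmetric, information-weighted similarity of two PFM columns (illustrative)."""
    pa, pb = normalise(counts_a), normalise(counts_b)
    return (expected_log_odds(pa, pb) * information_content(pb)
            + expected_log_odds(pb, pa) * information_content(pa))


same = column_similarity([10, 0, 0, 0], [10, 0, 0, 0])
different = column_similarity([10, 0, 0, 0], [0, 10, 0, 0])
```

Weighting by information content means that agreement or disagreement at a strongly conserved column moves the score more than at a near-uniform column. A full motif comparison would sum such column scores along an alignment, which this sketch leaves out.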
22
Carvalho L. Bayesian centroid estimation for motif discovery. PLoS One 2013; 8:e80511. [PMID: 24324603 PMCID: PMC3855595 DOI: 10.1371/journal.pone.0080511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2013] [Accepted: 10/03/2013] [Indexed: 11/29/2022]
Abstract
Biological sequences may contain patterns that signal important biomolecular functions; a classical example is regulation of gene expression by transcription factors that bind to specific patterns in genomic promoter regions. In motif discovery we are given a set of sequences that share a common motif and aim to identify not only the motif composition, but also the binding sites in each sequence of the set. We propose a new centroid estimator that arises from a refined and meaningful loss function for binding site inference. We discuss the main advantages of centroid estimation for motif discovery, including computational convenience, and how its principled derivation offers further insights about the posterior distribution of binding site configurations. We also illustrate, using simulated and real datasets, that the centroid estimator can differ from the traditional maximum a posteriori or maximum likelihood estimators.
Affiliation(s)
- Luis Carvalho
- Department of Mathematics and Statistics, Boston University, Boston, Massachusetts, United States of America
23
Weiss V, Medina-Rivera A, Huerta AM, Santos-Zavaleta A, Salgado H, Morett E, Collado-Vides J. Evidence classification of high-throughput protocols and confidence integration in RegulonDB. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2013; 2013:bas059. [PMID: 23327937 PMCID: PMC3548332 DOI: 10.1093/database/bas059] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Indexed: 12/31/2022]
Abstract
RegulonDB provides curated information on the transcriptional regulatory network of Escherichia coli and contains both experimental data and computationally predicted objects. To account for the heterogeneity of these data, we introduced, in version 6.0, a two-tier rating system for the strength of evidence, classifying evidence as either ‘weak’ or ‘strong’ (Gama-Castro,S., Jimenez-Jacinto,V., Peralta-Gil,M. et al. RegulonDB (Version 6.0): gene regulation model of Escherichia Coli K-12 beyond transcription, active (experimental) annotated promoters and textpresso navigation. Nucleic Acids Res., 2008;36:D120–D124.). We now add to our classification scheme the classification of high-throughput evidence, including chromatin immunoprecipitation (ChIP) and RNA-seq technologies. To integrate these data into RegulonDB, we present two strategies for the evaluation of confidence: statistical validation and independent cross-validation. Statistical validation involves verification of ChIP data for transcription factor-binding sites, using tools for motif discovery and quality assessment of the discovered matrices. Independent cross-validation combines independent evidence with the intention of mutually excluding false positives. Both statistical validation and cross-validation allow subsets of data that are supported by weak evidence to be upgraded to a higher confidence level. Likewise, cross-validation of strong confidence data extends our two-tier rating system to a three-tier system by introducing a third confidence score ‘confirmed’. Database URL: http://regulondb.ccg.unam.mx/
Affiliation(s)
- Verena Weiss
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, AP 565-A, Cuernavaca, Morelos 62100, Mexico.
24
Dean KM, Grayhack EJ. RNA-ID, a highly sensitive and robust method to identify cis-regulatory sequences using superfolder GFP and a fluorescence-based assay. RNA (NEW YORK, N.Y.) 2012; 18:2335-44. [PMID: 23097427 PMCID: PMC3504683 DOI: 10.1261/rna.035907.112] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 08/06/2012] [Accepted: 09/14/2012] [Indexed: 05/16/2023]
Abstract
We have developed a robust and sensitive method, called RNA-ID, to screen for cis-regulatory sequences in RNA using fluorescence-activated cell sorting (FACS) of yeast cells bearing a reporter in which expression of both superfolder green fluorescent protein (GFP) and yeast codon-optimized mCherry red fluorescent protein (RFP) is driven by the bidirectional GAL1,10 promoter. This method recapitulates previously reported progressive inhibition of translation mediated by increasing numbers of CGA codon pairs, and restoration of expression by introduction of a tRNA with an anticodon that base pairs exactly with the CGA codon. This method also reproduces effects of paromomycin and context on stop codon read-through. Five key features of this method contribute to its effectiveness as a selection for regulatory sequences: The system exhibits greater than a 250-fold dynamic range, a quantitative and dose-dependent response to known inhibitory sequences, exquisite resolution that allows nearly complete physical separation of distinct populations, and a reproducible signal between different cells transformed with the identical reporter, all of which are coupled with simple methods involving ligation-independent cloning, to create large libraries. Moreover, we provide evidence that there are sequences within a 9-nt library that cause reduced GFP fluorescence, suggesting that there are novel cis-regulatory sequences to be found even in this short sequence space. This method is widely applicable to the study of both RNA-mediated and codon-mediated effects on expression.
Affiliation(s)
- Kimberly M. Dean
- Department of Biochemistry and Biophysics, University of Rochester Medical School, Rochester, New York 14642, USA
- Elizabeth J. Grayhack
- Department of Biochemistry and Biophysics, University of Rochester Medical School, Rochester, New York 14642, USA
25
Mitchell JA, Clay I, Umlauf D, Chen CY, Moir CA, Eskiw CH, Schoenfelder S, Chakalova L, Nagano T, Fraser P. Nuclear RNA sequencing of the mouse erythroid cell transcriptome. PLoS One 2012; 7:e49274. [PMID: 23209567 PMCID: PMC3510205 DOI: 10.1371/journal.pone.0049274] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Received: 07/18/2012] [Accepted: 10/08/2012] [Indexed: 12/31/2022]
Abstract
In addition to protein coding genes a substantial proportion of mammalian genomes are transcribed. However, most transcriptome studies investigate steady-state mRNA levels, ignoring a considerable fraction of the transcribed genome. In addition, steady-state mRNA levels are influenced by both transcriptional and posttranscriptional mechanisms, and thus do not provide a clear picture of transcriptional output. Here, using deep sequencing of nuclear RNAs (nucRNA-Seq) in parallel with chromatin immunoprecipitation sequencing (ChIP-Seq) of active RNA polymerase II, we compared the nuclear transcriptome of mouse anemic spleen erythroid cells with polymerase occupancy on a genome-wide scale. We demonstrate that unspliced transcripts quantified by nucRNA-seq correlate with primary transcript frequencies measured by RNA FISH, but differ from steady-state mRNA levels measured by poly(A)-enriched RNA-seq. Highly expressed protein coding genes showed good correlation between RNAPII occupancy and transcriptional output; however, genome-wide we observed a poor correlation between transcriptional output and RNAPII association. This poor correlation is due to intergenic regions associated with RNAPII which correspond with transcription factor bound regulatory regions and a group of stable, nuclear-retained long non-coding transcripts. In conclusion, sequencing the nuclear transcriptome provides an opportunity to investigate the transcriptional landscape in a given cell type through quantification of unspliced primary transcripts and the identification of nuclear-retained long non-coding RNAs.
Affiliation(s)
- Jennifer A Mitchell
- Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, Canada.
26
Zheng G, Liu Q, Ding G, Wei C, Li Y. Towards biological characters of interactions between transcription factors and their DNA targets in mammals. BMC Genomics 2012; 13:388. [PMID: 22888987 PMCID: PMC3472306 DOI: 10.1186/1471-2164-13-388] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 02/29/2012] [Accepted: 06/29/2012] [Indexed: 01/07/2023]
Abstract
Background In the post-genomic era, the study of transcriptional regulation is pivotal to decoding genetic information. Transcription factors (TFs) are central proteins for transcriptional regulation, and interactions between TFs and their DNA targets (TFBSs) are important for the expression of downstream genes. However, the lack of knowledge about interactions between TFs and TFBSs still hinders investigation of the mechanism of transcription. Results To expand the knowledge about interactions between TFs and TFBSs, three biological features (sequence feature, structure feature, and evolution feature) were utilized to build TFBS identification models for studying binding preference between TFs and their DNA targets in mammals. Results show that each feature performs fairly well in capturing TFBSs, and that the hybrid model combining all three features is more robust for TFBS identification. Subsequently, correspondence between TFs and their TFBSs was investigated to explore interactions among them in mammals. Results indicate that TFs and TFBSs are reciprocal at the sequence, structure, and evolution levels. Conclusions Our work demonstrates that, to some extent, TFs and TFBSs have developed a coevolutionary relationship in order to keep their physical binding and maintain their regulatory functions. In summary, our work will help understand transcriptional regulation and interpret the binding mechanism between proteins and DNA.
Affiliation(s)
- Guangyong Zheng
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
27
Mahdevar G, Sadeghi M, Nowzari-Dalini A. Transcription factor binding sites detection by using alignment-based approach. J Theor Biol 2012; 304:96-102. [PMID: 22504445 DOI: 10.1016/j.jtbi.2012.03.039] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/06/2011] [Revised: 03/27/2012] [Accepted: 03/29/2012] [Indexed: 11/25/2022]
Abstract
Gene expression is the main cause of the existence of various phenotypes. Through this process, the information stored in DNA gives rise to the phenotype. Essentially, gene expression depends on the successful binding of transcription factors (TFs), a specific type of protein, to specific positions upstream of the gene, the TF binding sites (TFBSs). Unfortunately, finding these TFBSs experimentally is costly and laborious; therefore, discovering TFBSs computationally is a significant problem that many researchers endeavor to solve. In this paper, a new TFBS discovery method is presented that takes known biological facts about TFBSs into account. The input to this method consists of sequences of arbitrary length, and the output comprises positions that tend to be TFBSs. By applying previous methods alongside this method to biological and simulated datasets, it is shown that this method achieves higher accuracy in discovering TFBSs.
Affiliation(s)
- Ghasem Mahdevar
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
28
Tan M, Yu D, Jin Y, Dou L, Li B, Wang Y, Yue J, Liang L. An information transmission model for transcription factor binding at regulatory DNA sites. Theor Biol Med Model 2012; 9:19. [PMID: 22672438 PMCID: PMC3442977 DOI: 10.1186/1742-4682-9-19] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/08/2012] [Accepted: 05/17/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND Computational identification of transcription factor binding sites (TFBSs) is a rapid, cost-efficient way to locate unknown regulatory elements. With increased potential for high-throughput genome sequencing, the availability of accurate computational methods for TFBS prediction has never been as important as it currently is. To date, identifying TFBSs with high sensitivity and specificity is still an open challenge, necessitating the development of novel models for predicting transcription factor-binding regulatory DNA elements. RESULTS Based on the information theory, we propose a model for transcription factor binding of regulatory DNA sites. Our model incorporates position interdependencies in effective ways. The model computes the information transferred (TI) between the transcription factor and the TFBS during the binding process and uses TI as the criterion to determine whether the sequence motif is a possible TFBS. Based on this model, we developed a computational method to identify TFBSs. By theoretically proving and testing our model using both real and artificial data, we found that our model provides highly accurate predictive results. CONCLUSIONS In this study, we present a novel model for transcription factor binding regulatory DNA sites. The model can provide an increased ability to detect TFBSs.
Affiliation(s)
- Mingfeng Tan
- Beijing Institute of Biotechnology, Beijing 100071, China
|
29
|
Conserved Motifs and Prediction of Regulatory Modules in Caenorhabditis elegans. G3: Genes, Genomes, Genetics 2012; 2:469-81. [PMID: 22540038 PMCID: PMC3337475 DOI: 10.1534/g3.111.001081] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 09/08/2011] [Accepted: 02/06/2012] [Indexed: 01/30/2023]
Abstract
Transcriptional regulation, a primary mechanism for controlling the development of multicellular organisms, is carried out by transcription factors (TFs) that recognize and bind to their cognate binding sites. In Caenorhabditis elegans, our knowledge of which genes are regulated by which TFs, through binding to specific sites, is still very limited. To expand our knowledge about the C. elegans regulatory network, we performed a comprehensive analysis of the C. elegans, Caenorhabditis briggsae, and Caenorhabditis remanei genomes to identify regulatory elements that are conserved in all genomes. Our analysis identified 4959 elements that are significantly conserved across the genomes and that each occur multiple times within each genome, both hallmarks of functional regulatory sites. Our motifs show significant matches to known core promoter elements, TF binding sites, splice sites, and poly-A signals as well as many putative regulatory sites. Many of the motifs are significantly correlated with various types of experimental data, including gene expression patterns, tissue-specific expression patterns, and binding site location analysis as well as enrichment in specific functional classes of genes. Many can also be significantly associated with specific TFs. Combinations of motif occurrences allow us to predict the location of cis-regulatory modules and we show that many of them significantly overlap experimentally determined enhancers. We provide access to the predicted binding sites, their associated motifs, and the predicted cis-regulatory modules across the whole genome through a web-accessible database and as tracks for genome browsers.
|
30
|
Aerts S. Computational strategies for the genome-wide identification of cis-regulatory elements and transcriptional targets. Curr Top Dev Biol 2012; 98:121-45. [PMID: 22305161 DOI: 10.1016/b978-0-12-386499-4.00005-7] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Indexed: 12/13/2022]
Abstract
Transcription factors (TFs) are key proteins that decode the information in our genome to express a precise and unique set of proteins and RNA molecules in each cell type in our body. These factors play a pivotal role in all biological processes, including the determination of a cell's fate during development and the maintenance of a cell's physiological function. To achieve this, a TF binds to specific DNA sequences in the noncoding part of the genome, recruits chromatin modifiers and cofactors, and directs the transcription initiation rate of its "target genes." Therefore, a key challenge in deciphering a transcriptional switch is to identify the direct target genes of the master regulators that control the switch, the cis-regulatory elements implementing (auto-)regulatory loops, and the target genes of all the TFs in the downstream regulatory network. A better knowledge of a TF's targetome during specification and differentiation of a particular cell type will generate mechanistic insight into its developmental program. Here, I review computational strategies and methods to predict transcriptional targets by genome-wide searches for TF binding sites using position weight matrices, motif clusters, phylogenetic footprinting, chromatin binding and accessibility data, enhancer classification, motif enrichment, and gene expression signatures.
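The first strategy this abstract lists, scanning for TF binding sites with a position weight matrix, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and does not come from the review: the count matrix `COUNTS` is hypothetical, the background is taken as uniform, and the pseudocount of 1 is an arbitrary choice.

```python
import math

BASES = "ACGT"

# Hypothetical count matrix for a 4-bp motif (one dict per motif position).
# These numbers are invented for illustration only.
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 1, "C": 1, "G": 0, "T": 8},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2 odds against a uniform background (assumed)."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + len(BASES) * pseudocount
        matrix.append(
            {b: math.log2((column[b] + pseudocount) / total / background)
             for b in BASES}
        )
    return matrix


def scan(sequence, pwm):
    """Score every window of the sequence; returns (start, window, score)."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, window, score))
    return hits


def best_hit(sequence, pwm):
    """Return the highest-scoring window."""
    return max(scan(sequence, pwm), key=lambda hit: hit[2])
```

In practice a score threshold, rather than the single best window, would decide which windows are reported as candidate sites; choosing that threshold is one of the sources of false positives the surrounding abstracts discuss.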
Affiliation(s)
- Stein Aerts
- Laboratory of Computational Biology, Center for Human Genetics, Katholieke Universiteit Leuven, Leuven, Belgium
|
31
|
Assessing the effects of symmetry on motif discovery and modeling. PLoS One 2011; 6:e24908. [PMID: 21949783 PMCID: PMC3176789 DOI: 10.1371/journal.pone.0024908] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 06/20/2011] [Accepted: 08/19/2011] [Indexed: 11/23/2022] Open
Abstract
Background Identifying the DNA binding sites for transcription factors is a key task in modeling the gene regulatory network of a cell. Predicting DNA binding sites computationally suffers from high false positives and false negatives due to various contributing factors, including the inaccurate models for transcription factor specificity. One source of inaccuracy in the specificity models is the assumption of asymmetry for symmetric models. Methodology/Principal Findings Using simulation studies, so that the correct binding site model is known and various parameters of the process can be systematically controlled, we test different motif finding algorithms on both symmetric and asymmetric binding site data. We show that if the true binding site is asymmetric the results are unambiguous and the asymmetric model is clearly superior to the symmetric model. But if the true binding specificity is symmetric commonly used methods can infer, incorrectly, that the motif is asymmetric. The resulting inaccurate motifs lead to lower sensitivity and specificity than would the correct, symmetric models. We also show how the correct model can be obtained by the use of appropriate measures of statistical significance. Conclusions/Significance This study demonstrates that the most commonly used motif-finding approaches usually model symmetric motifs incorrectly, which leads to higher than necessary false prediction errors. It also demonstrates how alternative motif-finding methods can correct the problem, providing more accurate motif models and reducing the errors. Furthermore, it provides criteria for determining whether a symmetric or asymmetric model is the most appropriate for any experimental dataset.
|
32
|
Shi J, Yang W, Chen M, Du Y, Zhang J, Wang K. AMD, an automated motif discovery tool using stepwise refinement of gapped consensuses. PLoS One 2011; 6:e24576. [PMID: 21931761 PMCID: PMC3171486 DOI: 10.1371/journal.pone.0024576] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Received: 04/04/2011] [Accepted: 08/14/2011] [Indexed: 11/21/2022] Open
Abstract
Motif discovery is essential for deciphering regulatory codes from high throughput genomic data, such as those from ChIP-chip/seq experiments. However, there remains a lack of effective and efficient methods for the identification of long and gapped motifs in many relevant tools reported to date. We describe here an automated tool that allows for de novo discovery of transcription factor binding sites, regardless of whether the motifs are long or short, gapped or contiguous.
Affiliation(s)
- Jiantao Shi
- Key Laboratory of Stem Cell Biology, Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Graduate School of the Chinese Academy of Sciences, Shanghai, China
- Wentao Yang
- Shanghai Institute of Hematology and Sino-French Center for Life Science and Genomics, Rui-Jin Hospital affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Mingjie Chen
- Shanghai Institute of Hematology and Sino-French Center for Life Science and Genomics, Rui-Jin Hospital affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Yanzhi Du
- Key Laboratory of Stem Cell Biology, Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Ji Zhang
- Key Laboratory of Stem Cell Biology, Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Shanghai Institute of Hematology and Sino-French Center for Life Science and Genomics, Rui-Jin Hospital affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Kankan Wang
- Shanghai Institute of Hematology and Sino-French Center for Life Science and Genomics, Rui-Jin Hospital affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China
|
33
|
Zhang S, Li S, Niu M, Pham PT, Su Z. MotifClick: prediction of cis-regulatory binding sites via merging cliques. BMC Bioinformatics 2011; 12:238. [PMID: 21679436 PMCID: PMC3225181 DOI: 10.1186/1471-2105-12-238] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 12/23/2010] [Accepted: 06/16/2011] [Indexed: 11/21/2022] Open
Abstract
Background Although dozens of algorithms and tools have been developed to find a set of cis-regulatory binding sites called a motif in a set of intergenic sequences using various approaches, most of these tools focus on identifying binding sites that are significantly different from their background sequences. However, some motifs may have a similar nucleotide distribution to that of their background sequences. Therefore, such binding sites can be missed by these tools. Results Here, we present a graph-based polynomial-time algorithm, MotifClick, for the prediction of cis-regulatory binding sites, in particular, those that have a similar nucleotide distribution to that of their background sequences. To find binding sites with length k, we construct a graph using some 2(k-1)-mers in the input sequences as the vertices, and connect two vertices by an edge if the maximum number of matches of the local gapless alignments between the two 2(k-1)-mers is greater than a cutoff value. We identify a motif as a set of similar k-mers from a merged group of maximum cliques associated with some vertices. Conclusions When evaluated on both synthetic and real datasets of prokaryotes and eukaryotes, MotifClick outperforms existing leading motif-finding tools for prediction accuracy and balancing the prediction sensitivity and specificity in general. In particular, when the distribution of nucleotides of binding sites is similar to that of their background sequences, MotifClick is more likely to identify the binding sites than the other tools.
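The graph-construction step described in this abstract can be sketched roughly as follows. This is not the MotifClick implementation. It is one plausible reading of the abstract, with several assumptions: every 2(k-1)-mer in the input becomes a vertex (the abstract says only "some"), a "local gapless alignment" is taken to mean a pairing of one length-k window from each 2(k-1)-mer, and vertices from the same input sequence are never connected. The clique-finding and merging steps are not shown.

```python
from itertools import combinations


def windows(sequence, length):
    """All contiguous substrings of the given length."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def max_local_matches(a, b, k):
    """Best match count over all pairings of a k-window from a with one from b.

    Assumption: this stands in for the abstract's 'maximum number of matches
    of the local gapless alignments' between two 2(k-1)-mers.
    """
    best = 0
    for window_a in windows(a, k):
        for window_b in windows(b, k):
            matches = sum(x == y for x, y in zip(window_a, window_b))
            best = max(best, matches)
    return best


def build_graph(sequences, k, cutoff):
    """Return (vertices, edges): vertices are (sequence index, start, 2(k-1)-mer)."""
    span = 2 * (k - 1)
    vertices = [
        (seq_index, start, seq[start:start + span])
        for seq_index, seq in enumerate(sequences)
        for start in range(len(seq) - span + 1)
    ]
    edges = {index: set() for index in range(len(vertices))}
    for i, j in combinations(range(len(vertices)), 2):
        if vertices[i][0] == vertices[j][0]:
            continue  # simplification: no edges within one input sequence
        if max_local_matches(vertices[i][2], vertices[j][2], k) > cutoff:
            edges[i].add(j)
            edges[j].add(i)
    return vertices, edges
```

The all-pairs comparison here is quadratic in the number of vertices and is meant only to make the edge criterion concrete.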
Affiliation(s)
- Shaoqiang Zhang
- Department of Bioinformatics and Genomics, Center for Bioinformatics Research, the University of North Carolina at Charlotte, 28223, USA
|
34
|
Sakabe NJ, Nobrega MA. Genome-wide maps of transcription regulatory elements. Wiley Interdisciplinary Reviews: Systems Biology and Medicine 2010; 2:422-437. [PMID: 20836039 DOI: 10.1002/wsbm.70] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Indexed: 01/25/2023]
Abstract
Expression of eukaryotic genes with complex spatial-temporal regulation during development requires finer regulation than that of genes with simpler expression patterns. Given the high degree of conservation of the developmental gene set across distantly related phylogenetic taxa, it is argued that evolutionary variation has occurred by tweaking regulation of expression of developmental genes, rather than by changes in genes themselves. Complex regulation is often achieved through the coordinated action of transcription regulatory elements spread across the genome up to tens of kilobases from the promoters of their target genes. Disruption of regulatory elements has been implicated in several diseases, and studies showing associations between disease traits and nonprotein-coding variation hint at a role for regulatory elements in causing disease. Therefore, the identification and mapping of regulatory elements at genome scale is crucial for understanding how gene expression is regulated and how organisms evolve, and for identifying sequence variation that causes disease. Previously developed experimental techniques have been adapted to identify regulatory elements at genome scale and in high throughput, allowing a global view of their biological roles. We review methods such as chromatin immunoprecipitation, DNase I hypersensitivity, and computational approaches, and how they have been employed to generate maps of histone modifications, open chromatin, nucleosome positioning, and transcription factor binding regions in whole mammalian genomes. Given the importance of non-promoter elements in gene regulation and the recent explosion in the number of studies devoted to them, we focus on these elements and discuss the insights into gene regulation being obtained by these studies.
Affiliation(s)
- Noboru J Sakabe
- Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
- Marcelo A Nobrega
- Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
|
35
|
Abstract
Activation of nuclear factor (NF)-κB, one of the most investigated transcription factors, has been found to control multiple cellular processes in cancer including inflammation, transformation, proliferation, angiogenesis, invasion, metastasis, chemoresistance and radioresistance. NF-κB is constitutively active in most tumor cells, and its suppression inhibits the growth of tumor cells, leading to the concept of 'NF-κB addiction' in cancer cells. Why NF-κB is constitutively and persistently active in cancer cells is not fully understood, but multiple mechanisms have been delineated including agents that activate NF-κB (such as viruses, viral proteins, bacteria and cytokines), signaling intermediates (such as mutant receptors, overexpression of kinases, mutant oncoproteins, degradation of IκBα, histone deacetylase, overexpression of transglutaminase and iNOS) and cross talk between NF-κB and other transcription factors (such as STAT3, HIF-1α, AP1, SP, p53, PPARγ, β-catenin, AR, GR and ER). As NF-κB is 'pre-active' in cancer cells through unrelated mechanisms, classic inhibitors of NF-κB (for example, bortezomib) are unlikely to mediate their anticancer effects through suppression of NF-κB. This review discusses multiple mechanisms of NF-κB activation and their regulation by multitargeted agents in contrast to monotargeted agents, thus 'one size does not fit all' cancers.
|
36
|
Satija R, Hein J, Lunter GA. Genome-wide functional element detection using pairwise statistical alignment outperforms multiple genome footprinting techniques. Bioinformatics 2010; 26:2116-20. [PMID: 20610610 DOI: 10.1093/bioinformatics/btq360] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Comparative genomic sequence analysis is a powerful approach for identifying putative functional elements in silico. The availability of full-genome sequences from many vertebrate species has resulted in the development of popular tools (for example, the phastCons software package) that search large numbers of genomes to identify conserved elements. While phastCons can analyze many genomes simultaneously, it ignores potentially informative insertion and deletion events and relies on a fixed, precomputed multiple sequence alignment. RESULTS We have developed a new method, GRAPeFoot, which simultaneously aligns two full genomes and annotates a set of conserved regions exhibiting reduced rates of insertion, deletion and substitution mutations. We tested GRAPeFoot using the human and mouse genomes and compared its performance to a set of phastCons predictions hosted on the UCSC genome browser. Our results demonstrate that despite the use of only two genomes, GRAPeFoot identified constrained elements at rates comparable with phastCons, which analyzed data from 28 vertebrate genomes. This study demonstrates how integrated modelling of substitutions, indels and purifying selection allows a pairwise analysis to exhibit a sensitivity similar to a heuristic analysis of many genomes. AVAILABILITY The GRAPeFoot software and set of genome-wide functional element predictions are freely available to download online at http://www.stats.ox.ac.uk/~satija/GRAPeFoot/.
Affiliation(s)
- R Satija
- Department of Statistics, Oxford University, Oxford, UK.
|
37
|
Paquet Y, Anderson A. Sequence composition similarities with the 7SL RNA are highly predictive of functional genomic features. Nucleic Acids Res 2010; 38:4907-16. [PMID: 20392819 PMCID: PMC2926601 DOI: 10.1093/nar/gkq234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2022] Open
Abstract
Transposable elements derived from the 7SL RNA gene, such as Alu elements in primates, have had remarkable success in several mammalian lineages. The results presented here show a broad spectrum of functions for genomic segments that display sequence composition similarities with the 7SL RNA gene. Using thoroughly documented loci, we report that DNaseI-hypersensitive sites can be singled out in large genomic sequences by an assessment of sequence composition similarities with the 7SL RNA gene. We apply a root word frequency approach to illustrate a distinctive relationship between the sequence of the 7SL RNA gene and several classes of functional genomic features that are not presumed to be of transposable origin. Transposable elements that show noticeable similarities with the 7SL sequence include Alu sequences, as expected, but also long terminal repeats and the 5′-untranslated regions of long interspersed repetitive elements. In sequences masked for repeated elements, we find, when using the 7SL RNA gene as query sequence, distinctive similarities with promoters, exons and distal gene regulatory regions. The latter being the most notoriously difficult to detect, this approach may be useful for finding genomic segments that have regulatory functions and that may have escaped detection by existing methods.
Affiliation(s)
- Yanick Paquet
- Centre de recherche en cancérologie de l’Université Laval, L’Hôtel-Dieu de Québec, Centre hospitalier universitaire de Québec, Québec G1R 2J6 and Département de biologie, Université Laval, Québec G1K 7P4, Canada
- Alan Anderson
- Centre de recherche en cancérologie de l’Université Laval, L’Hôtel-Dieu de Québec, Centre hospitalier universitaire de Québec, Québec G1R 2J6 and Département de biologie, Université Laval, Québec G1K 7P4, Canada
- *To whom correspondence should be addressed. Tel: +418 691 5281; Fax: +418 691 5439.
|
38
|
Palumbo MJ, Newberg LA. Phyloscan: locating transcription-regulating binding sites in mixed aligned and unaligned sequence data. Nucleic Acids Res 2010; 38:W268-74. [PMID: 20435683 PMCID: PMC2896078 DOI: 10.1093/nar/gkq330] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 01/26/2023] Open
Abstract
The transcription of a gene from its DNA template into an mRNA molecule is the first, and most heavily regulated, step in gene expression. Especially in bacteria, regulation is typically achieved via the binding of a transcription factor (protein) or small RNA molecule to the chromosomal region upstream of a regulated gene. The protein or RNA molecule recognizes a short, approximately conserved sequence within a gene's promoter region and, by binding to it, either enhances or represses expression of the nearby gene. Since the sought-for motif (pattern) is short and accommodating to variation, computational approaches that scan for binding sites have trouble distinguishing functional sites from look-alikes. Many computational approaches are unable to find the majority of experimentally verified binding sites without also finding many false positives. Phyloscan overcomes this difficulty by exploiting two key features of functional binding sites: (i) these sites are typically more conserved evolutionarily than are non-functional DNA sequences; and (ii) these sites often occur two or more times in the promoter region of a regulated gene. The website is free and open to all users, and there is no login requirement. Address: (http://bayesweb.wadsworth.org/phyloscan/).
Affiliation(s)
- Michael J Palumbo
- Wadsworth Center, New York State Department of Health, Empire State Plaza, P.O. Box 509, Albany, NY 12201-0509, USA
|
39
|
Abstract
Expectation maximization and Gibbs' sampling are two statistical approaches used to identify transcription factor binding sites and the motif that represents them. Both take as input unaligned sequences and search for a statistically significant alignment of putative binding sites. Expectation maximization is deterministic so that starting with the same initial parameters will always converge to the same solution, making it wise to start it multiple times from different initial parameters. Gibbs' sampling is stochastic so that it may arrive at different solutions from the same initial parameters. In both cases multiple runs are advised because comparisons of the solutions after each run can indicate whether a global, optimum solution is likely to have been achieved.
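The stochastic side of this contrast can be made concrete with a minimal Gibbs site sampler in Python. This is a sketch under stated assumptions and not any published tool's implementation: it assumes exactly one site per sequence, a fixed motif width, a uniform background, a pseudocount of 1, and input sequences containing only A, C, G and T. The function names and parameters are chosen for this example.

```python
import math
import random

BASES = "ACGT"


def profile_from(sites, width, pseudocount=1.0):
    """Column-wise base frequencies estimated from the current sites."""
    profile = []
    for column in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[column]] += 1
        total = sum(counts.values())
        profile.append({base: counts[base] / total for base in BASES})
    return profile


def site_weight(site, profile, background=0.25):
    """Likelihood ratio of a candidate site under the profile vs. background."""
    return math.prod(profile[i][base] / background for i, base in enumerate(site))


def gibbs_sample(sequences, width, iterations=200, seed=None):
    """Return one sampled start position per sequence."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(seq) - width + 1) for seq in sequences]
    for _ in range(iterations):
        for held_out, seq in enumerate(sequences):
            # Build the profile from every sequence except the held-out one.
            others = [
                s[p:p + width]
                for i, (s, p) in enumerate(zip(sequences, positions))
                if i != held_out
            ]
            profile = profile_from(others, width)
            starts = list(range(len(seq) - width + 1))
            weights = [site_weight(seq[s:s + width], profile) for s in starts]
            # Sample (rather than maximize) the new position: this draw is
            # what makes the procedure stochastic.
            positions[held_out] = rng.choices(starts, weights=weights)[0]
    return positions
```

Calling `gibbs_sample` several times with different `seed` values and comparing the returned positions is a simple way to follow the advice about multiple runs: agreement across runs suggests, but does not prove, that a good solution has been reached.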
|
40
|
He X, Sinha S. Evolution of cis-regulatory sequences in Drosophila. Methods Mol Biol 2010; 674:283-296. [PMID: 20827599 DOI: 10.1007/978-1-60761-854-6_18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2023]
Abstract
Cross-species comparison is an emerging paradigm for identifying cis-regulatory sequences and understanding their function and evolution. In this chapter, we review probabilistic models of evolution of transcription factor binding sites, which provide the theoretical basis for a number of new bioinformatics tools for comparative sequence analysis. We illustrate how important functional and evolutionary insights on binding site gain and loss can be acquired through sequence comparison. This includes the observation that binding site turnover follows a molecular clock and that its rate correlates with the strength of binding sites and the presence of other sites in the neighborhood. We also comment on emerging trends that go beyond individual binding sites to a more holistic study of regulatory evolution. We point out common technical challenges, such as reliable sequence alignment and binding site prediction, when doing comparative regulatory sequence analysis and note some potential solutions thereof.
Affiliation(s)
- Xin He
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
|
41
|
Li G, Liu B, Xu Y. Accurate recognition of cis-regulatory motifs with the correct lengths in prokaryotic genomes. Nucleic Acids Res 2009; 38:e12. [PMID: 19906734 PMCID: PMC2811016 DOI: 10.1093/nar/gkp907] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 12/03/2022] Open
Abstract
We present a new computational method for solving a classical problem, the identification problem of cis-regulatory motifs in a given set of promoter sequences, based on one key new idea. Instead of scoring candidate motifs individually like in all the existing motif-finding programs, our method scores groups of candidate motifs with similar sequences, called motif closures, using a P-value, which has substantially improved the prediction reliability over the existing methods. Our new P-value scoring scheme is sequence length independent, hence allowing direct comparisons among predicted motifs with different lengths on the same footing. We have implemented this method as a Motif Recognition Computer (MREC) program, and have extensively tested MREC on both simulated and biological data from prokaryotic genomes. Our test results indicate that MREC can accurately pick out the actual motif with the correct length as the best scoring candidate for the vast majority of the cases in our test set. We compared our prediction results with two motif-finding programs Cosmo and MEME, and found that MREC outperforms both programs across all the test cases by a large margin. The MREC program is available at http://csbl.bmb.uga.edu/~bingqiang/MREC1/.
Affiliation(s)
- Guojun Li
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, GA 30602, USA
|
42
|
Fauteux F, Strömvik MV. Seed storage protein gene promoters contain conserved DNA motifs in Brassicaceae, Fabaceae and Poaceae. BMC Plant Biology 2009; 9:126. [PMID: 19843335 PMCID: PMC2770497 DOI: 10.1186/1471-2229-9-126] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Received: 03/17/2009] [Accepted: 10/20/2009] [Indexed: 05/22/2023]
Abstract
BACKGROUND Accurate computational identification of cis-regulatory motifs is difficult, particularly in eukaryotic promoters, which typically contain multiple short and degenerate DNA sequences bound by several interacting factors. Enrichment in combinations of rare motifs in the promoter sequence of functionally or evolutionarily related genes among several species is an indicator of conserved transcriptional regulatory mechanisms. This provides a basis for the computational identification of cis-regulatory motifs. RESULTS We have used a discriminative seeding DNA motif discovery algorithm for an in-depth analysis of 54 seed storage protein (SSP) gene promoters from three plant families, namely Brassicaceae (mustards), Fabaceae (legumes) and Poaceae (grasses) using backgrounds based on complete sets of promoters from a representative species in each family, namely Arabidopsis (Arabidopsis thaliana (L.) Heynh.), soybean (Glycine max (L.) Merr.) and rice (Oryza sativa L.) respectively. We have identified three conserved motifs (two RY-like and one ACGT-like) in Brassicaceae and Fabaceae SSP gene promoters that are similar to experimentally characterized seed-specific cis-regulatory elements. Fabaceae SSP gene promoter sequences are also enriched in a novel, seed-specific E2Fb-like motif. Conserved motifs identified in Poaceae SSP gene promoters include a GCN4-like motif, two prolamin-box-like motifs and an Skn-1-like motif. Evidence of the presence of a variant of the TATA-box is found in the SSP gene promoters from the three plant families. Motifs discovered in SSP gene promoters were used to score whole-genome sets of promoters from Arabidopsis, soybean and rice. The highest-scoring promoters are associated with genes coding for different subunits or precursors of seed storage proteins. CONCLUSION Seed storage protein gene promoter motifs are conserved in diverse species, and different plant families are characterized by a distinct combination of conserved motifs. 
The majority of discovered motifs match experimentally characterized cis-regulatory elements. These results provide a good starting point for further experimental analysis of plant seed-specific promoters and our methodology can be used to unravel more transcriptional regulatory mechanisms in plants and other eukaryotes.
Affiliation(s)
- François Fauteux
- Department of Plant Science, McGill University, Ste-Anne-de-Bellevue, Canada
- McGill Centre for Bioinformatics, McGill University, Montréal, Canada
- Martina V Strömvik
- Department of Plant Science, McGill University, Ste-Anne-de-Bellevue, Canada
- McGill Centre for Bioinformatics, McGill University, Montréal, Canada
|
43
|
Tomovic A, Stadler M, Oakeley EJ. Transcription factor site dependencies in human, mouse and rat genomes. BMC Bioinformatics 2009; 10:339. [PMID: 19835596 PMCID: PMC2770556 DOI: 10.1186/1471-2105-10-339] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 03/10/2009] [Accepted: 10/16/2009] [Indexed: 01/14/2023] Open
Abstract
BACKGROUND It is known that transcription factors frequently act together to regulate gene expression in eukaryotes. In this paper we describe a computational analysis of transcription factor site dependencies in human, mouse and rat genomes. RESULTS Our approach for quantifying tendencies of transcription factor binding sites to co-occur is based on a binding site scoring function which incorporates dependencies between positions, the use of information about the structural class of each transcription factor (major/minor groove binder), and also considered the possible implications of varying GC content of the sequences. Significant tendencies (dependencies) have been detected by non-parametric statistical methodology (permutation tests). Evaluation of obtained results has been performed in several ways: reports from literature (many of the significant dependencies between transcription factors have previously been confirmed experimentally); dependencies between transcription factors are not biased due to similarities in their DNA-binding sites; the number of dependent transcription factors that belong to the same functional and structural class is significantly higher than would be expected by chance; supporting evidence from GO clustering of targeting genes. Based on dependencies between two transcription factor binding sites (second-order dependencies), it is possible to construct higher-order dependencies (networks). Moreover results about transcription factor binding sites dependencies can be used for prediction of groups of dependent transcription factors on a given promoter sequence. Our results, as well as a scanning tool for predicting groups of dependent transcription factors binding sites are available on the Internet. CONCLUSION We show that the computational analysis of transcription factor site dependencies is a valuable complement to experimental approaches for discovering transcription regulatory interactions and networks. 
Scanning promoter sequences with dependent groups of transcription factor binding sites improves the quality of transcription factor predictions.
Affiliation(s)
- Andrija Tomovic
- Friedrich Miescher Institute for Biomedical Research, Novartis Research Foundation, Basel, Switzerland.
|
44
|
Roider HG, Lenhard B, Kanhere A, Haas SA, Vingron M. CpG-depleted promoters harbor tissue-specific transcription factor binding signals--implications for motif overrepresentation analyses. Nucleic Acids Res 2009; 37:6305-15. [PMID: 19736212 PMCID: PMC2770660 DOI: 10.1093/nar/gkp682] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.5] [Indexed: 12/26/2022] Open
Abstract
Motif overrepresentation analysis of proximal promoters is a common approach to characterize the regulatory properties of co-expressed sets of genes. Here we show that these approaches perform well on mammalian CpG-depleted promoter sets that regulate expression in terminally differentiated tissues such as liver and heart. In contrast, CpG-rich promoters show very little overrepresentation signal, even when associated with genes that display highly constrained spatiotemporal expression. For instance, while ∼50% of heart specific genes possess CpG-rich promoters we find that the frequently observed enrichment of MEF2-binding sites upstream of heart-specific genes is solely due to contributions from CpG-depleted promoters. Similar results are obtained for all sets of tissue-specific genes indicating that CpG-rich and CpG-depleted promoters differ fundamentally in their distribution of regulatory inputs around the transcription start site. In order not to dilute the respective transcription factor binding signals, the two promoter types should thus be treated as separate sets in any motif overrepresentation analysis.
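The recommendation to treat the two promoter types as separate sets implies a partitioning step before any overrepresentation test. The abstract does not say how promoters were classified, so the Python sketch below substitutes a commonly used measure, the observed/expected CpG ratio (number of CpG dinucleotides times sequence length, divided by the product of the C and G counts). The `threshold` of 0.5 and the function names are hypothetical choices for illustration, not values from the paper.

```python
def cpg_obs_exp(sequence):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = sequence.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG is possible without both C and G
    return seq.count("CG") * len(seq) / (c_count * g_count)


def split_by_cpg(promoters, threshold=0.5):
    """Partition {name: sequence} into CpG-rich and CpG-depleted dicts.

    The threshold is an assumed, illustrative cutoff.
    """
    rich, depleted = {}, {}
    for name, sequence in promoters.items():
        target = rich if cpg_obs_exp(sequence) >= threshold else depleted
        target[name] = sequence
    return rich, depleted
```

Each of the two returned sets would then be passed separately to whatever motif overrepresentation method is in use, each with a background drawn from promoters of the same type.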
Affiliation(s)
- Helge G Roider
- Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin.
|
45
|
van Hijum SAFT, Medema MH, Kuipers OP. Mechanisms and evolution of control logic in prokaryotic transcriptional regulation. Microbiol Mol Biol Rev 2009; 73:481-509, Table of Contents. [PMID: 19721087 PMCID: PMC2738135 DOI: 10.1128/mmbr.00037-08]
Abstract
A major part of organismal complexity and versatility of prokaryotes resides in their ability to fine-tune gene expression to adequately respond to internal and external stimuli. Evolution has been very innovative in creating intricate mechanisms by which different regulatory signals operate and interact at promoters to drive gene expression. The regulation of target gene expression by transcription factors (TFs) is governed by control logic brought about by the interaction of regulators with TF binding sites (TFBSs) in cis-regulatory regions. A factor that in large part determines the strength of the response of a target to a given TF is motif stringency, the extent to which the TFBS fits the optimal TFBS sequence for a given TF. Advances in high-throughput technologies and computational genomics allow reconstruction of transcriptional regulatory networks in silico. To optimize the prediction of transcriptional regulatory networks, i.e., to separate direct regulation from indirect regulation, a thorough understanding of the control logic underlying the regulation of gene expression is required. This review summarizes the state of the art of the elements that determine the functionality of TFBSs by focusing on the molecular biological mechanisms and evolutionary origins of cis-regulatory regions.
Affiliation(s)
- Sacha A F T van Hijum
- Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Kerklaan 30, 9751 NN Haren, The Netherlands.
|
46
|
Satija R, Novák Á, Miklós I, Lyngsø R, Hein J. BigFoot: Bayesian alignment and phylogenetic footprinting with MCMC. BMC Evol Biol 2009; 9:217. [PMID: 19715598 PMCID: PMC2744684 DOI: 10.1186/1471-2148-9-217] [Received: 10/15/2008] [Accepted: 08/28/2009]
Abstract
BACKGROUND We have previously combined statistical alignment and phylogenetic footprinting to detect conserved functional elements without assuming a fixed alignment. Considering a probability-weighted distribution of alignments removes sensitivity to alignment errors, properly accommodates regions of alignment uncertainty, and increases the accuracy of functional element prediction. Our method utilized standard dynamic programming hidden Markov model algorithms to analyze up to four sequences. RESULTS We present a novel approach, implemented in the software package BigFoot, for performing phylogenetic footprinting on greater numbers of sequences. We have developed a Markov chain Monte Carlo (MCMC) approach which samples both sequence alignments and locations of slowly evolving regions. We implement our method as an extension of the existing StatAlign software package and test it on well-annotated regions controlling the expression of the even-skipped gene in Drosophila and the alpha-globin gene in vertebrates. The results show how adding sequences to the analysis has the potential to improve the accuracy of functional predictions, and demonstrate how BigFoot outperforms existing alignment-based phylogenetic footprinting techniques. CONCLUSION BigFoot extends a combined alignment and phylogenetic footprinting approach to analyze larger amounts of sequence data using MCMC. Our approach is robust to alignment error and uncertainty and can be applied to a variety of biological datasets. The source code and documentation are publicly available for download from http://www.stats.ox.ac.uk/~satija/BigFoot/
Affiliation(s)
- Rahul Satija
- Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
- Ádám Novák
- Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
- István Miklós
- Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
- Alfréd Rényi Institute of Mathematics, Hungarian Academy of Sciences, Reáltanoda u. 13-15, 1053 Budapest, Hungary
- Rune Lyngsø
- Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
- Jotun Hein
- Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
|
47
|
Homsi DSF, Gupta V, Stormo GD. Modeling the quantitative specificity of DNA-binding proteins from example binding sites. PLoS One 2009; 4:e6736. [PMID: 19707584 PMCID: PMC2726951 DOI: 10.1371/journal.pone.0006736] [Received: 05/28/2009] [Accepted: 07/07/2009]
Abstract
BACKGROUND The binding of transcription factors to their respective DNA sites is a key component of every regulatory network. Predictions of transcription factor binding sites are usually based on models for transcription factor specificity. These models, in turn, are often based on examples of known binding sites. METHODOLOGY/PRINCIPAL FINDINGS Collections of binding sites are obtained in simulation experiments where the true model for the transcription factor is known and various sampling procedures are employed. We compare the accuracies of three different and commonly used methods for predicting the specificity of the transcription factor based on example binding sites. Different methods for constructing the models can lead to significant differences in the accuracy of the predictions and we show that commonly used methods can be positively misleading, even at large sample sizes and using noise-free data. Methods that minimize the number of predicted binding sequences are often significantly more accurate than the other methods tested. CONCLUSIONS/SIGNIFICANCE Different methods for generating motifs from example binding sites can have significantly different numbers of false positive and false negative predictions. For many different sampling procedures models based on quadratic programming are the most accurate.
Affiliation(s)
- Dana S. F. Homsi
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri, United States of America
- Vineet Gupta
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri, United States of America
- Gary D. Stormo
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri, United States of America
|
48
|
Hawkins J, Grant C, Noble WS, Bailey TL. Assessing phylogenetic motif models for predicting transcription factor binding sites. Bioinformatics 2009; 25:i339-47. [PMID: 19478008 PMCID: PMC2687955 DOI: 10.1093/bioinformatics/btp201]
Abstract
MOTIVATION A variety of algorithms have been developed to predict transcription factor binding sites (TFBSs) within the genome by exploiting the evolutionary information implicit in multiple alignments of the genomes of related species. One such approach uses an extension of the standard position-specific motif model that incorporates phylogenetic information via a phylogenetic tree and a model of evolution. However, these phylogenetic motif models (PMMs) have never been rigorously benchmarked in order to determine whether they lead to better prediction of TFBSs than obtained using simple position weight matrix scanning. RESULTS We evaluate three PMM-based prediction algorithms, each of which uses a different treatment of gapped alignments, and we compare their prediction accuracy with that of a non-phylogenetic motif scanning approach. Surprisingly, all of these algorithms appear to be inferior to simple motif scanning, when accuracy is measured using a gold standard of validated yeast TFBSs. However, the PMM scanners perform much better than simple motif scanning when we abandon the gold standard and consider the number of statistically significant sites predicted, using column-shuffled 'random' motifs to measure significance. These results suggest that the common practice of measuring the accuracy of binding site predictors using collections of known sites may be dangerously misleading since such collections may be missing 'weak' sites, which are exactly the type of sites needed to discriminate among predictors. We then extend our previous theoretical model of the statistical power of PMM-based prediction algorithms to allow for loss of binding sites during evolution, and show that it gives a more accurate upper bound on scanner accuracy. Finally, utilizing our theoretical model, we introduce a new method for predicting the number of real binding sites in a genome. 
The results suggest that the number of true sites for a yeast TF is in general several times greater than the number of known sites listed in the Saccharomyces cerevisiae Promoter Database (SCPD). Among the three scanning algorithms that we test, the MONKEY algorithm has the highest accuracy for predicting yeast TFBSs.
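For orientation, the "simple position weight matrix scanning" baseline and the column-shuffled control motifs described in this abstract can be sketched as follows. This is a minimal illustration, not the benchmarked software: the function names, the uniform background of 0.25, the forward-strand-only scan, and the requirement that every matrix entry be non-zero (for example after adding pseudocounts) are all assumptions.

```python
import math
import random

BASES = "ACGT"

def log_odds(pwm, background=0.25):
    """Convert per-column base probabilities into log2 odds scores.

    pwm: list of dicts mapping each base to a non-zero probability.
    """
    return [{b: math.log2(col[b] / background) for b in BASES} for col in pwm]

def scan(seq, scores):
    """Score every window of an uppercase ACGT sequence (forward strand only)."""
    width = len(scores)
    return [
        sum(scores[i][seq[pos + i]] for i in range(width))
        for pos in range(len(seq) - width + 1)
    ]

def shuffle_columns(pwm, rng):
    """Return a control motif with the same columns in random order."""
    columns = list(pwm)
    rng.shuffle(columns)
    return columns
```

Scanning the same sequences with many column-shuffled motifs gives an empirical score distribution against which hits from the real motif can be judged, which is the spirit of the significance measure the abstract describes.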
Affiliation(s)
- John Hawkins
- Institute for Molecular Bioscience, University of Queensland, Qld, Australia.
|
49
|
Wang X, Haberer G, Mayer KFX. Discovery of cis-elements between sorghum and rice using co-expression and evolutionary conservation. BMC Genomics 2009; 10:284. [PMID: 19558665 PMCID: PMC2714861 DOI: 10.1186/1471-2164-10-284] [Received: 12/17/2008] [Accepted: 06/26/2009]
Abstract
BACKGROUND The spatiotemporal regulation of gene expression largely depends on the presence and absence of cis-regulatory sites in the promoter. In the economically highly important grass family, our knowledge of transcription factor binding sites and transcriptional networks is still very limited. With the completion of the sorghum genome and the available rice genome sequence, comparative promoter analyses now allow genome-scale detection of conserved cis-elements. RESULTS In this study, we identified thousands of phylogenetic footprints conserved between orthologous rice and sorghum upstream regions that are supported by co-expression information derived from three different rice expression data sets. In a complementary approach, cis-motifs were discovered by their highly conserved co-occurrence in syntenic promoter pairs. Sequence conservation and matches to known plant motifs support our findings. Expression similarities of gene pairs positively correlate with the number of motifs that are shared by gene pairs and corroborate the importance of similar promoter architectures for concerted regulation. This strongly suggests that these motifs function in the regulation of transcript levels in rice and, presumably also in sorghum. CONCLUSION Our work provides the first large-scale collection of cis-elements for rice and sorghum and can serve as a paradigm for cis-element analysis through comparative genomics in grasses in general.
Affiliation(s)
- Xi Wang
- MIPS/IBIS Institute of Bioinformatics and System Biology, Helmholtz Center Munich, Neuherberg, Germany.
|
50
|
Berger MF, Bulyk ML. Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nat Protoc 2009; 4:393-411. [PMID: 19265799 DOI: 10.1038/nprot.2008.195]
Abstract
Protein-binding microarray (PBM) technology provides a rapid, high-throughput means of characterizing the in vitro DNA-binding specificities of transcription factors (TFs). Using high-density, custom-designed microarrays containing all 10-mer sequence variants, one can obtain comprehensive binding-site measurements for any TF, regardless of its structural class or species of origin. Here, we present a protocol for the examination and analysis of TF-binding specificities at high resolution using such 'all 10-mer' universal PBMs. This procedure involves double-stranding a commercially synthesized DNA oligonucleotide array, binding a TF directly to the double-stranded DNA microarray and labeling the protein-bound microarray with a fluorophore-conjugated antibody. We describe how to computationally extract the relative binding preferences of the examined TF for all possible contiguous and gapped 8-mers over the full range of affinities, from highest affinity sites to nonspecific sites. Multiple proteins can be tested in parallel in separate chambers on a single microarray, enabling the processing of a dozen or more TFs in a single day.
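The bookkeeping behind "all possible contiguous 8-mers" can be shown with a short Python sketch. It covers only the enumeration step and none of the protocol's signal processing or gapped 8-mer handling. The helper names are hypothetical, and collapsing each 8-mer with its reverse complement is an assumption based on binding to double-stranded DNA.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Reverse complement of an uppercase ACGT string."""
    return kmer.translate(COMPLEMENT)[::-1]

def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

def contiguous_kmers(seq, k=8):
    """All contiguous k-mers contained in a probe sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def count_canonical_kmers(k):
    """Number of distinct k-mers after merging reverse complements."""
    return len({canonical("".join(p)) for p in product("ACGT", repeat=k)})
```

Each 10-mer contains three contiguous 8-mers, so every 8-mer is sampled by many probes. Merging reverse complements reduces the 4^8 = 65,536 possible 8-mers to 32,896 distinct ones, because the 256 palindromic 8-mers are their own reverse complements.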
Affiliation(s)
- Michael F Berger
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts 02115, USA
|