1
Novakovsky G, Fornes O, Saraswat M, Mostafavi S, Wasserman WW. ExplaiNN: interpretable and transparent neural networks for genomics. Genome Biol 2023; 24:154. [PMID: 37370113 DOI: 10.1186/s13059-023-02985-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 11/20/2022] [Accepted: 06/12/2023] [Indexed: 06/29/2023] Open
Abstract
Deep learning models such as convolutional neural networks (CNNs) excel in genomic tasks but lack interpretability. We introduce ExplaiNN, which combines the expressiveness of CNNs with the interpretability of linear models. ExplaiNN can predict TF binding, chromatin accessibility, and de novo motifs, achieving performance comparable to state-of-the-art methods. Its predictions are transparent, providing global (cell state level) as well as local (individual sequence level) biological insights into the data. ExplaiNN can serve as a plug-and-play platform for pretrained models and annotated position weight matrices. ExplaiNN aims to accelerate the adoption of deep learning in genomic sequence analysis by domain experts.
Affiliation(s)
- Gherman Novakovsky
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
- Oriol Fornes
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
- Manu Saraswat
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
  - Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
  - European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Sara Mostafavi
  - Paul G. Allen School of Computer Science and Engineering, University of Washington (UW), Seattle, USA
- Wyeth W Wasserman
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
2
Wu H, Liu M, Zhang P, Zhang H. iEnhancer-SKNN: a stacking ensemble learning-based method for enhancer identification and classification using sequence information. Brief Funct Genomics 2023; 22:302-311. [PMID: 36715222 DOI: 10.1093/bfgp/elac057] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/10/2022] [Revised: 12/01/2022] [Accepted: 12/13/2022] [Indexed: 01/31/2023] Open
Abstract
Enhancers, a class of distal cis-regulatory elements located in the non-coding region of DNA, play a key role in gene regulation. It is difficult to identify enhancers from DNA sequence data because enhancers are freely distributed in the non-coding region, have no specific sequence features, and lie far from their target promoters. Therefore, this study presents a stacking ensemble learning method to accurately identify enhancers and classify them into strong and weak enhancers. Firstly, we obtain the fusion feature matrix by fusing the four features Kmer, PseDNC, PCPseDNC and Z-Curve9. Secondly, five K-Nearest Neighbor (KNN) models with different parameters are trained as the base models, and the Logistic Regression algorithm is used as the meta-model. Thirdly, the stacking ensemble learning strategy is used to construct a two-layer model from the base models and the meta-model, which is trained on the preprocessed feature sets. The proposed method, named iEnhancer-SKNN, is a two-layer prediction model, in which the first layer predicts whether the given DNA sequences are enhancers or non-enhancers, and the second layer distinguishes whether the predicted enhancers are strong or weak. The performance of iEnhancer-SKNN is evaluated on the independent testing dataset, and the results show that the proposed method performs better in predicting enhancers and their strength. In enhancer identification, iEnhancer-SKNN achieves an accuracy of 81.75%, an improvement of 1.35% to 8.75% compared with other predictors, and in enhancer classification, iEnhancer-SKNN achieves an accuracy of 80.50%, an improvement of 5.5% to 25.5% compared with other predictors. Moreover, we identify key transcription factor binding site motifs in the enhancer regions and further explore the biological functions of the enhancers and these key motifs.
Source code and data can be downloaded from https://github.com/HaoWuLab-Bioinformatics/iEnhancer-SKNN.
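The stacking strategy described in this abstract (several KNN base models feeding a logistic-regression meta-model) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (that is in the repository linked above): the toy 2-D points stand in for the fused Kmer/PseDNC/PCPseDNC/Z-Curve9 features, the neighbour counts in `ks` are hypothetical, and the leave-one-out construction of the meta-features is an assumption made here so that a training point never votes for itself.

```python
import math

def knn_proba(train_X, train_y, x, k):
    """Fraction of the k nearest training points (Euclidean distance) labelled 1."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return sum(train_y[i] for i in order[:k]) / k

def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Logistic regression fitted by plain stochastic gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        for z, t in zip(Z, y):
            s = sum(wi * zi for wi, zi in zip(w, z)) + b
            g = 1.0 / (1.0 + math.exp(-s)) - t  # gradient of the log-loss w.r.t. s
            w = [wi - lr * g * zi for wi, zi in zip(w, z)]
            b -= lr * g
    return w, b

def fit_stack(X, y, ks=(1, 3)):
    """Layer 1: one KNN per value in ks (hypothetical choices). Layer 2: logistic meta-model."""
    Z = []
    for i in range(len(X)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]  # leave point i out
        Z.append([knn_proba(rest_X, rest_y, X[i], k) for k in ks])
    w, b = fit_logistic(Z, y)
    return {"X": X, "y": y, "ks": ks, "w": w, "b": b}

def predict_stack(model, x):
    """Base-model probabilities become the input vector of the meta-model."""
    z = [knn_proba(model["X"], model["y"], x, k) for k in model["ks"]]
    s = sum(wi * zi for wi, zi in zip(model["w"], z)) + model["b"]
    return 1 if s > 0 else 0

# Toy data: label 1 plays the role of "enhancer", label 0 of "non-enhancer".
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.2], [0.1, 0.2],
     [0.9, 1.0], [1.0, 0.9], [0.8, 0.8], [0.9, 0.8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_stack(X, y)
```

A second model of the same shape, trained only on sequences the first one labels as enhancers, would correspond to the strong/weak layer the abstract mentions.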
Affiliation(s)
- Hao Wu
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
  - School of Software, Shandong University, Jinan, 250101, Shandong, China
- Mengdi Liu
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Pengyu Zhang
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Hongming Zhang
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
3
Liao M, Zhao JP, Tian J, Zheng CH. iEnhancer-DCLA: using the original sequence to identify enhancers and their strength based on a deep learning framework. BMC Bioinformatics 2022; 23:480. [PMCID: PMC9664816 DOI: 10.1186/s12859-022-05033-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 08/29/2022] [Accepted: 11/02/2022] [Indexed: 11/16/2022] Open
Abstract
Enhancers are small regions of DNA that bind to proteins, which enhance the transcription of genes. The enhancer may be located upstream or downstream of the gene. It is not necessarily close to the gene to be acted on, because the entanglement structure of chromatin allows the positions far apart in the sequence to have the opportunity to contact each other. Therefore, identifying enhancers and their strength is a complex and challenging task. In this article, a new prediction method based on deep learning is proposed to identify enhancers and enhancer strength, called iEnhancer-DCLA. Firstly, we use word2vec to convert k-mers into number vectors to construct an input matrix. Secondly, we use convolutional neural network and bidirectional long short-term memory network to extract sequence features, and finally use the attention mechanism to extract relatively important features. In the task of predicting enhancers and their strengths, this method has improved to a certain extent in most evaluation indexes. In summary, we believe that this method provides new ideas in the analysis of enhancers.
4
Nair SJ, Suter T, Wang S, Yang L, Yang F, Rosenfeld MG. Transcriptional enhancers at 40: evolution of a viral DNA element to nuclear architectural structures. Trends Genet 2022; 38:1019-1047. [PMID: 35811173 PMCID: PMC9474616 DOI: 10.1016/j.tig.2022.05.015] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 02/21/2022] [Revised: 05/05/2022] [Accepted: 05/31/2022] [Indexed: 02/08/2023]
Abstract
Gene regulation by transcriptional enhancers is the dominant mechanism driving cell type- and signal-specific transcriptional diversity in metazoans. However, over four decades since the original discovery, how enhancers operate in the nuclear space remains largely enigmatic. Recent multidisciplinary efforts combining real-time imaging, genome sequencing, and biophysical strategies provide insightful but conflicting models of enhancer-mediated gene control. Here, we review the discovery and progress in enhancer biology, emphasizing the recent findings that acutely activated enhancers assemble regulatory machinery as mesoscale architectural structures with distinct physical properties. These findings help formulate novel models that explain several mysterious features of the assembly of transcriptional enhancers and the mechanisms of spatial control of gene expression.
Affiliation(s)
- Sreejith J Nair
  - Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC 20057, USA
- Tom Suter
  - Howard Hughes Medical Institute, Department and School of Medicine, University of California, San Diego, La Jolla, CA 92093, USA
- Susan Wang
  - Howard Hughes Medical Institute, Department and School of Medicine, University of California, San Diego, La Jolla, CA 92093, USA
  - Cellular and Molecular Medicine Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
- Lu Yang
  - Howard Hughes Medical Institute, Department and School of Medicine, University of California, San Diego, La Jolla, CA 92093, USA
- Feng Yang
  - Howard Hughes Medical Institute, Department and School of Medicine, University of California, San Diego, La Jolla, CA 92093, USA
- Michael G Rosenfeld
  - Howard Hughes Medical Institute, Department and School of Medicine, University of California, San Diego, La Jolla, CA 92093, USA
5
Zhang WM, Cheng XZ, Fang D, Cao J. AT-HOOK MOTIF NUCLEAR LOCALIZED (AHL) proteins of ancient origin radiate new functions. Int J Biol Macromol 2022; 214:290-300. [PMID: 35716788 DOI: 10.1016/j.ijbiomac.2022.06.100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2022] [Revised: 04/11/2022] [Accepted: 06/12/2022] [Indexed: 11/05/2022]
Abstract
AHL (AT-HOOK MOTIF NUCLEAR LOCALIZED) protein is an important transcription factor in plants that regulates a wide range of biological processes. It is considered to have evolved from an independent PPC domain in prokaryotes to a complete protein in modern plants. The AT-hook motif and the PPC conserved domain are the main functional domains of AHL. Since the discovery of AHL, its evolution and function have been continuously studied. The AHL gene family has been identified in multiple species and the functions of several members of the gene family have been studied. Here, we summarize the evolution and structural characteristics of AHL genes, and emphasize their biological functions. This review will provide a basis for further functional study and crop breeding.
Affiliation(s)
- Wei-Meng Zhang
  - School of Life Sciences, Jiangsu University, Zhenjiang 212013, Jiangsu, China
- Xiu-Zhu Cheng
  - School of Life Sciences, Jiangsu University, Zhenjiang 212013, Jiangsu, China
- Da Fang
  - School of Life Sciences, Jiangsu University, Zhenjiang 212013, Jiangsu, China
- Jun Cao
  - School of Life Sciences, Jiangsu University, Zhenjiang 212013, Jiangsu, China
6
Qian Y, Zhang Y, Zhang J. Alignment-Free Sequence Comparison With Multiple k Values. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1841-1849. [PMID: 31765317 DOI: 10.1109/tcbb.2019.2955081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Alignment-free sequence comparison approaches have become increasingly popular in computational biology, because alignment-based approaches are inefficient to process large-scale datasets. Still, there is no way to determine the optimal value of the critical parameter k for alignment-free approaches in general. In this article, we tried to solve the problem by involving multiple k values simultaneously. The method counts the occurrence of each k-mer with different k values in a sequence. Two weighting schemes, based on maximizing deviation method and genetic algorithm, are then used on these counts. We applied the method to enhance the three common alignment-free approaches D2, D2S, and D2*, and evaluated its performance on similarity search and functionally related regulatory sequences recognition. The enhanced approaches achieve better performance than the original approaches in all cases, and much better performance than some other common measures, such as Pcc, Eu, Ma, Ch, Kld, and Cos.
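As a rough illustration of combining several k values, the sketch below computes the basic D2 statistic (the sum, over k-mers, of the product of their counts in the two sequences) for each k and takes a weighted sum. The weights here are supplied by hand and are purely hypothetical; the maximizing-deviation and genetic-algorithm weighting schemes named in the abstract are not reproduced, and the D2S and D2* variants are not covered.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (empty if seq is shorter than k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(a, b, k):
    """D2 statistic: sum over k-mers of count_in_a * count_in_b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for k-mers it has not seen, so unshared k-mers add nothing.
    return sum(n * cb[w] for w, n in ca.items())

def multi_k_d2(a, b, weights):
    """Weighted sum of D2 over several k values.

    `weights` maps k -> weight; how to choose the weights is left open here.
    """
    return sum(wt * d2(a, b, k) for k, wt in weights.items())

# Example call with equal, hand-picked weights for k = 1 and k = 2.
score = multi_k_d2("ACGT", "ACGA", {1: 0.5, 2: 0.5})
```

Raw D2 grows with sequence length and with smaller k, so in practice the per-k terms would usually be normalised before weighting; that step is omitted to keep the sketch short.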
7
Ullah F, Ben-Hur A. A self-attention model for inferring cooperativity between regulatory features. Nucleic Acids Res 2021; 49:e77. [PMID: 33950192 PMCID: PMC8287919 DOI: 10.1093/nar/gkab349] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 10/07/2020] [Revised: 04/15/2021] [Accepted: 04/20/2021] [Indexed: 11/14/2022] Open
Abstract
Deep learning has demonstrated its predictive power in modeling complex biological phenomena such as gene expression. The value of these models hinges not only on their accuracy, but also on the ability to extract biologically relevant information from the trained models. While there has been much recent work on developing feature attribution methods that discover the most important features for a given sequence, inferring cooperativity between regulatory elements, which is the hallmark of phenomena such as gene expression, remains an open problem. We present SATORI, a Self-ATtentiOn based model to detect Regulatory element Interactions. Our approach combines convolutional layers with a self-attention mechanism that helps us capture a global view of the landscape of interactions between regulatory elements in a sequence. A comprehensive evaluation demonstrates the ability of SATORI to identify numerous statistically significant TF-TF interactions, many of which have been previously reported. Our method is able to detect higher numbers of experimentally verified TF-TF interactions than existing methods, and has the advantage of not requiring a computationally expensive post-processing step. Finally, SATORI can be used for detection of any type of feature interaction in models that use a similar attention mechanism, and is not limited to the detection of TF-TF interactions.
Affiliation(s)
- Fahad Ullah
  - Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA
- Asa Ben-Hur
  - Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA
8
Chen G, Yin Y, Lin Z, Wen H, Chen J, Luo W. Transcriptome profile analysis reveals KLHL30 as an essential regulator for myoblast differentiation. Biochem Biophys Res Commun 2021; 559:84-91. [PMID: 33933993 DOI: 10.1016/j.bbrc.2021.04.086] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/20/2021] [Accepted: 04/20/2021] [Indexed: 11/29/2022]
Abstract
Skeletal muscle development is a sophisticated multistep process orchestrated by diverse myogenic transcription factors. Recent studies have suggested that Kelch-like genes play vital roles in muscle disease and myogenesis. However, it is still unclear how Kelch-like genes impact myoblast physiology. Here, through integrative analysis of the mRNA expression profile during chicken primary myoblast and C2C12 differentiation, many differentially expressed genes were found and suggested to be enriched in myoblast differentiation and muscle development. Interestingly, a little-known Kelch-like gene KLHL30 was screened as skeletal muscle-specific gene with essential roles in myogenic differentiation. Transcriptomic data and quantitative PCR analysis indicated that the expression of KLHL30 is upregulated under myoblast differentiation state. KLHL30 overexpression upregulated the protein expression of myogenic transcription factors (MYOD, MYOG, MEF2C) and induced myoblast differentiation and myotube formation, while knockdown of KLHL30 caused the opposite effect. Furthermore, KLHL30 was found to significantly decrease the numbers of cells in the S stage and thereby depress myoblast proliferation. Collectively, this study highlights that KLHL30 as a muscle-specific regulator plays essential roles in myoblast proliferation and differentiation.
Affiliation(s)
- Genghua Chen
  - Department of Animal Genetics, Breeding and Reproduction, College of Animal Science, South China Agricultural University, Guangzhou, 510642, Guangdong Province, China
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture and Rural Affair, South China Agricultural University, Guangzhou, 510642, China
- Yunqian Yin
  - Department of Animal Genetics, Breeding and Reproduction, College of Animal Science, South China Agricultural University, Guangzhou, 510642, Guangdong Province, China
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture and Rural Affair, South China Agricultural University, Guangzhou, 510642, China
- Zetong Lin
  - Department of Animal Genetics, Breeding and Reproduction, College of Animal Science, South China Agricultural University, Guangzhou, 510642, Guangdong Province, China
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture and Rural Affair, South China Agricultural University, Guangzhou, 510642, China
- Huaqiang Wen
  - Department of Animal Genetics, Breeding and Reproduction, College of Animal Science, South China Agricultural University, Guangzhou, 510642, Guangdong Province, China
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture and Rural Affair, South China Agricultural University, Guangzhou, 510642, China
- Jiahui Chen
  - Department of Animal Genetics, Breeding and Reproduction, College of Animal Science, South China Agricultural University, Guangzhou, 510642, Guangdong Province, China
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture and Rural Affair, South China Agricultural University, Guangzhou, 510642, China
- Wen Luo
  - Department of Animal Genetics, Breeding and Reproduction, College of Animal Science, South China Agricultural University, Guangzhou, 510642, Guangdong Province, China
  - Guangdong Provincial Key Laboratory of Agro-Animal Genomics and Molecular Breeding, and Key Laboratory of Chicken Genetics, Breeding and Reproduction, Ministry of Agriculture and Rural Affair, South China Agricultural University, Guangzhou, 510642, China
9
Tobias IC, Abatti LE, Moorthy SD, Mullany S, Taylor T, Khader N, Filice MA, Mitchell JA. Transcriptional enhancers: from prediction to functional assessment on a genome-wide scale. Genome 2020; 64:426-448. [PMID: 32961076 DOI: 10.1139/gen-2020-0104] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/20/2022]
Abstract
Enhancers are cis-regulatory sequences located distally to target genes. These sequences consolidate developmental and environmental cues to coordinate gene expression in a tissue-specific manner. Enhancer function and tissue specificity depend on the expressed set of transcription factors, which recognize binding sites and recruit cofactors that regulate local chromatin organization and gene transcription. Unlike other genomic elements, enhancers are challenging to identify because they function independently of orientation, are often distant from their promoters, have poorly defined boundaries, and display no reading frame. In addition, there are no defined genetic or epigenetic features that are unambiguously associated with enhancer activity. Over recent years there have been developments in both empirical assays and computational methods for enhancer prediction. We review genome-wide tools, CRISPR advancements, and high-throughput screening approaches that have improved our ability to both observe and manipulate enhancers in vitro at the level of primary genetic sequences, chromatin states, and spatial interactions. We also highlight contemporary animal models and their importance to enhancer validation. Together, these experimental systems and techniques complement one another and broaden our understanding of enhancer function in development, evolution, and disease.
Affiliation(s)
- Ian C Tobias
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Luis E Abatti
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Sakthi D Moorthy
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Shanelle Mullany
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Tiegh Taylor
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Nawrah Khader
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Mario A Filice
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Jennifer A Mitchell
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
10
A New Algorithm for Identifying Cis-Regulatory Modules Based on Hidden Markov Model. BIOMED RESEARCH INTERNATIONAL 2018; 2017:6274513. [PMID: 28497059 PMCID: PMC5405574 DOI: 10.1155/2017/6274513] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/23/2016] [Revised: 03/06/2017] [Accepted: 03/23/2017] [Indexed: 11/24/2022]
Abstract
The discovery of cis-regulatory modules (CRMs) is the key to understanding mechanisms of transcription regulation. Since CRMs have specific regulatory structures that are the basis for the regulation of gene expression, how to model the regulatory structure of CRMs has a considerable impact on the performance of CRM identification. The paper proposes a CRM discovery algorithm called ComSPS. ComSPS builds a regulatory structure model of CRMs based on HMM by exploring the rules of CRM transcriptional grammar that governs the internal motif site arrangement of CRMs. We test ComSPS on three benchmark datasets and compare it with five existing methods. Experimental results show that ComSPS performs better than them.
11
Herman-Izycka J, Wlasnowolski M, Wilczynski B. Taking promoters out of enhancers in sequence based predictions of tissue-specific mammalian enhancers. BMC Med Genomics 2017; 10:34. [PMID: 28589862 PMCID: PMC5461523 DOI: 10.1186/s12920-017-0264-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND: Many genetic diseases are caused by mutations in non-coding regions of the genome. These mutations are frequently found in enhancer sequences, causing disruption to the regulatory program of the cell. Enhancers are short regulatory sequences in the non-coding part of the genome that are essential for the proper regulation of transcription. While the experimental methods for identification of such sequences are improving every year, our understanding of the rules behind enhancer activity has not progressed much in the last decade. This is especially true in the case of tissue-specific enhancers, where there are clear problems in predicting the specificity of enhancer activity. RESULTS: We show a random-forest based machine learning approach capable of matching the performance of the current state-of-the-art methods for enhancer prediction. We then show that, like other published methods, it frequently cross-predicts enhancers as active in different tissues, making it less useful for predicting tissue-specific activity. We proceed to show that the problem is related to the fact that enhancer-predicting models exhibit a bias towards predicting gene promoters as active enhancers. Finally, we show that using a two-step classifier can lead to lower cross-prediction between tissues. CONCLUSIONS: We provide whole-genome predictions of human heart and brain enhancers obtained with a two-step classifier.
Affiliation(s)
- Julia Herman-Izycka
  - Institute of Informatics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland
- Michal Wlasnowolski
  - Institute of Informatics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland
- Bartek Wilczynski
  - Institute of Informatics, University of Warsaw, Banacha 2, Warsaw, 02-097, Poland
12
Li L, Wunderlich Z. An Enhancer's Length and Composition Are Shaped by Its Regulatory Task. Front Genet 2017; 8:63. [PMID: 28588608 PMCID: PMC5440464 DOI: 10.3389/fgene.2017.00063] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 03/17/2017] [Accepted: 05/08/2017] [Indexed: 12/02/2022] Open
Abstract
Enhancers drive the gene expression patterns required for virtually every process in metazoans. We propose that enhancer length and transcription factor (TF) binding site composition (the number and identity of TF binding sites) reflect the complexity of the enhancer's regulatory task. In development, we define regulatory task complexity as the number of fates specified in a set of cells at once. We hypothesize that enhancers with more complex regulatory tasks will be longer, with more, but less specific, TF binding sites. Larger numbers of binding sites can be arranged in more ways, allowing enhancers to drive many distinct expression patterns, and therefore cell fates, using a finite number of TF inputs. We compare ~100 enhancers patterning the more complex anterior-posterior (AP) axis and the simpler dorsal-ventral (DV) axis in Drosophila and find that the AP enhancers are longer, with more, but less specific, binding sites than the DV enhancers. Using a set of ~3,500 enhancers, we find enhancer length and TF binding site number again increase with increasing regulatory task complexity. Therefore, to be broadly applicable, computational tools to study enhancers must account for differences in regulatory task.
Affiliation(s)
- Lily Li
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, United States
- Zeba Wunderlich
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, United States
13
Wilczynski B, Tiuryn J. FastBill: An Improved Tool for Prediction of Cis-Regulatory Modules. J Comput Biol 2016; 24:193-199. [PMID: 27710048 DOI: 10.1089/cmb.2016.0108] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 02/02/2023] Open
Abstract
Here, we provide a new software tool, called FastBill, for prediction of evolutionarily conserved cis-regulatory modules. It improves on the previous version of our program, called Billboard, by improving the statistical significance calculation. It is also faster than the original Billboard, allowing for large-scale analyses, including multiple informant species. We illustrate the utility of FastBill by performing a large-scale computational experiment of enhancer prediction in the promoter area of more than 150 Drosophila melanogaster genes that possess annotated experimentally verified enhancers. FastBill is written in Python and is freely available for download as a standalone tool.
Affiliation(s)
- Bartek Wilczynski
  - Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland
- Jerzy Tiuryn
  - Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland
14
Guo H, Huo H, Yu Q. SMCis: An Effective Algorithm for Discovery of Cis-Regulatory Modules. PLoS One 2016; 11:e0162968. [PMID: 27637070 PMCID: PMC5026350 DOI: 10.1371/journal.pone.0162968] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/17/2016] [Accepted: 08/31/2016] [Indexed: 12/02/2022] Open
Abstract
The discovery of cis-regulatory modules (CRMs) is a challenging problem in computational biology. Limited by the difficulty of using an HMM to model dependent features in transcriptional regulatory sequences (TRSs), the probabilistic modeling methods based on HMMs cannot accurately represent the distance between regulatory elements in TRSs and are cumbersome to model the prevailing dependencies between motifs within CRMs. We propose a probabilistic modeling algorithm called SMCis, which builds a more powerful CRM discovery model based on a hidden semi-Markov model. Our model characterizes the regulatory structure of CRMs and effectively models dependencies between motifs at a higher level of abstraction based on segments rather than nucleotides. Experimental results on three benchmark datasets indicate that our method performs better than the compared algorithms.
Affiliation(s)
- Haitao Guo
  - School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China
- Hongwei Huo
  - School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China
- Qiang Yu
  - School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China
15
Santolini M, Sakakibara I, Gauthier M, Ribas-Aulinas F, Takahashi H, Sawasaki T, Mouly V, Concordet JP, Defossez PA, Hakim V, Maire P. MyoD reprogramming requires Six1 and Six4 homeoproteins: genome-wide cis-regulatory module analysis. Nucleic Acids Res 2016; 44:8621-8640. [PMID: 27302134 PMCID: PMC5062961 DOI: 10.1093/nar/gkw512] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 12/15/2015] [Accepted: 05/26/2016] [Indexed: 11/12/2022] Open
Abstract
Myogenic regulatory factors of the MyoD family have the ability to reprogram differentiated cells toward a myogenic fate. In this study, we demonstrate that Six1 or Six4 are required for the reprogramming by MyoD of mouse embryonic fibroblasts (MEFs). Using microarray experiments, we found 761 genes under the control of both Six and MyoD. Using MyoD ChIPseq data and a genome-wide search for Six1/4 MEF3 binding sites, we found significant co-localization of binding sites for MyoD and Six proteins on over a thousand mouse genomic DNA regions. The combination of both datasets yielded 82 genes which are synergistically activated by Six and MyoD, with 96 associated MyoD+MEF3 putative cis-regulatory modules (CRMs). Fourteen out of 19 of the CRMs that we tested demonstrated in Luciferase assays a synergistic action also observed for their cognate gene. We searched putative binding sites on these CRMs using available databases and de novo search of conserved motifs and demonstrated that the Six/MyoD synergistic activation takes place in a feedforward way. It involves the recruitment of these two families of transcription factors to their targets, together with partner transcription factors, encoded by genes that are themselves activated by Six and MyoD, including Mef2, Pbx-Meis and EBF.
Affiliation(s)
- Marc Santolini
- Institut Cochin, Université Paris-Descartes, Centre National de la Recherche Scientifique (CNRS), UMR 8104, Paris, France; Institut National de la Santé et de la Recherche Médicale (INSERM) U1016, Paris, France; Ecole Normale Supérieure, CNRS, Laboratoire de Physique Statistique, PSL Research University, Université Pierre-et-Marie Curie, Paris, France
- Iori Sakakibara
- Institut Cochin, Université Paris-Descartes, Centre National de la Recherche Scientifique (CNRS), UMR 8104, Paris, France; Institut National de la Santé et de la Recherche Médicale (INSERM) U1016, Paris, France; Division of Integrative Pathophysiology, Proteo-Science Center, Graduate School of Medicine, Ehime University, Ehime, Japan
- Morgane Gauthier
- Institut Cochin, Université Paris-Descartes, Centre National de la Recherche Scientifique (CNRS), UMR 8104, Paris, France; Institut National de la Santé et de la Recherche Médicale (INSERM) U1016, Paris, France
- Francesc Ribas-Aulinas
- Institut Cochin, Université Paris-Descartes, Centre National de la Recherche Scientifique (CNRS), UMR 8104, Paris, France; Institut National de la Santé et de la Recherche Médicale (INSERM) U1016, Paris, France
- Vincent Mouly
- Sorbonne Universités, UPMC Univ Paris 06, INSERM UMRS974, CNRS FRE3617, Center for Research in Myology, 75013 Paris, France
- Jean-Paul Concordet
- Institut Cochin, Université Paris-Descartes, Centre National de la Recherche Scientifique (CNRS), UMR 8104, Paris, France; Institut National de la Santé et de la Recherche Médicale (INSERM) U1016, Paris, France
- Vincent Hakim
- Ecole Normale Supérieure, CNRS, Laboratoire de Physique Statistique, PSL Research University, Université Pierre-et-Marie Curie, Paris, France
- Pascal Maire
- Institut Cochin, Université Paris-Descartes, Centre National de la Recherche Scientifique (CNRS), UMR 8104, Paris, France; Institut National de la Santé et de la Recherche Médicale (INSERM) U1016, Paris, France
16
Murakawa Y, Yoshihara M, Kawaji H, Nishikawa M, Zayed H, Suzuki H, FANTOM Consortium, Hayashizaki Y. Enhanced Identification of Transcriptional Enhancers Provides Mechanistic Insights into Diseases. Trends Genet 2016; 32:76-88. [DOI: 10.1016/j.tig.2015.11.004] [Citation(s) in RCA: 73] [Impact Index Per Article: 8.1] [Received: 07/31/2015] [Revised: 11/25/2015] [Accepted: 11/30/2015] [Indexed: 12/24/2022]
17
McGettigan PA, Browne JA, Carrington SD, Crowe MA, Fair T, Forde N, Loftus BJ, Lohan A, Lonergan P, Pluta K, Mamo S, Murphy A, Roche J, Walsh SW, Creevey CJ, Earley B, Keady S, Kenny DA, Matthews D, McCabe M, Morris D, O'Loughlin A, Waters S, Diskin MG, Evans ACO. Fertility and genomics: comparison of gene expression in contrasting reproductive tissues of female cattle. Reprod Fertil Dev 2016; 28:11-24. [DOI: 10.1071/rd15354] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 11/23/2022]
Abstract
To compare gene expression among bovine tissues, large bovine RNA-seq datasets were used, comprising 280 samples from 10 different bovine tissues (uterine endometrium, granulosa cells, theca cells, cervix, embryos, leucocytes, liver, hypothalamus, pituitary, muscle) and generating 260 Gbases of data. Twin approaches were used: an information–theoretic analysis of the existing annotated transcriptome to identify the most tissue-specific genes and a de-novo transcriptome annotation to evaluate general features of the transcription landscape. Expression was detected for 97% of the Ensembl transcriptome with at least one read in one sample and between 28% and 66% at a level of 10 tags per million (TPM) or greater in individual tissues. Over 95% of genes exhibited some level of tissue-specific gene expression. This was mostly due to different levels of expression in different tissues rather than exclusive expression in a single tissue. Less than 1% of annotated genes exhibited a highly restricted tissue-specific expression profile and approximately 2% exhibited classic housekeeping profiles. In conclusion, it is the combined effects of the variable expression of large numbers of genes (73%–93% of the genome) and the specific expression of a small number of genes (<1% of the transcriptome) that contribute to determining the outcome of the function of individual tissues.
18
Grant CE, Johnson J, Bailey TL, Noble WS. MCAST: scanning for cis-regulatory motif clusters. Bioinformatics 2015; 32:1217-9. [PMID: 26704599 DOI: 10.1093/bioinformatics/btv750] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 09/09/2015] [Accepted: 12/15/2015] [Indexed: 11/13/2022]
Abstract
Precise regulatory control of genes, particularly in eukaryotes, frequently requires the joint action of multiple sequence-specific transcription factors. A cis-regulatory module (CRM) is a genomic locus that is responsible for gene regulation and that contains multiple transcription factor binding sites in close proximity. Given a collection of known transcription factor binding motifs, many bioinformatics methods have been proposed over the past 15 years for identifying within a genomic sequence candidate CRMs consisting of clusters of those motifs. RESULTS: The MCAST algorithm uses a hidden Markov model with a P-value-based scoring scheme to identify candidate CRMs. Here, we introduce a new version of MCAST that offers improved graphical output, a dynamic background model, statistical confidence estimates based on false discovery rate estimation and, most significantly, the ability to predict CRMs while taking into account epigenomic data such as DNase I sensitivity or histone modification data. We demonstrate the validity of MCAST's statistical confidence estimates and the utility of epigenomic priors in identifying CRMs. AVAILABILITY AND IMPLEMENTATION: MCAST is part of the MEME Suite software toolkit. A web server and source code are available at http://meme-suite.org and http://alternate.meme-suite.org. CONTACT: t.bailey@imb.uq.edu.au or william-noble@uw.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Charles E Grant
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- James Johnson
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
- Timothy L Bailey
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
- William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle, WA, USA; Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA
19
Payne JL, Wagner A. Mechanisms of mutational robustness in transcriptional regulation. Front Genet 2015; 6:322. [PMID: 26579194 PMCID: PMC4621482 DOI: 10.3389/fgene.2015.00322] [Citation(s) in RCA: 56] [Impact Index Per Article: 5.6] [Received: 09/11/2015] [Accepted: 10/10/2015] [Indexed: 12/17/2022]
Abstract
Robustness is the invariance of a phenotype in the face of environmental or genetic change. The phenotypes produced by transcriptional regulatory circuits are gene expression patterns that are to some extent robust to mutations. Here we review several causes of this robustness. They include robustness of individual transcription factor binding sites, homotypic clusters of such sites, redundant enhancers, transcription factors, redundant transcription factors, and the wiring of transcriptional regulatory circuits. Such robustness can either be an adaptation by itself, a byproduct of other adaptations, or the result of biophysical principles and non-adaptive forces of genome evolution. The potential consequences of such robustness include complex regulatory network topologies that arise through neutral evolution, as well as cryptic variation, i.e., genotypic divergence without phenotypic divergence. On the longest evolutionary timescales, the robustness of transcriptional regulation has helped shape life as we know it, by facilitating evolutionary innovations that helped organisms such as flowering plants and vertebrates diversify.
Affiliation(s)
- Joshua L Payne
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Andreas Wagner
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland; The Santa Fe Institute, Santa Fe, NM, USA
20
Leoncini M, Montangero M, Pellegrini M, Tillan KP. CMStalker: A Combinatorial Tool for Composite Motif Discovery. IEEE/ACM Trans Comput Biol Bioinform 2015; 12:1123-1136. [PMID: 26451824 DOI: 10.1109/tcbb.2014.2359444] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
Controlling the differential expression of many thousands of different genes at any given time is a fundamental task of metazoan organisms, and this complex orchestration is controlled by the so-called regulatory genome encoding complex regulatory networks: several Transcription Factors bind to precise DNA regions so as to perform, in a cooperative manner, a specific regulation task for nearby genes. The in silico prediction of these binding sites is still an open problem, notwithstanding continuous progress and activity in the last two decades. In this paper, we describe a new efficient combinatorial approach to the problem of detecting sets of cooperating binding sites in promoter sequences, given as input a database of Transcription Factor Binding Sites encoded as Position Weight Matrices. We present CMStalker, a software tool for composite motif discovery which embodies a new approach that combines a constraint satisfaction formulation with a parameter relaxation technique to explore efficiently the space of possible solutions. Extensive experiments with 12 data sets and 11 state-of-the-art tools are reported, showing an average value of the correlation coefficient of 0.54 (against a value of 0.41 for the closest competitor). This improvement in output quality due to CMStalker is statistically significant.
21
Suryamohan K, Halfon MS. Identifying transcriptional cis-regulatory modules in animal genomes. Wiley Interdiscip Rev Dev Biol 2015; 4:59-84. [PMID: 25704908 PMCID: PMC4339228 DOI: 10.1002/wdev.168] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.7] [Received: 09/24/2014] [Revised: 11/04/2014] [Accepted: 11/16/2014] [Indexed: 11/08/2022]
Abstract
Gene expression is regulated through the activity of transcription factors (TFs) and chromatin-modifying proteins acting on specific DNA sequences, referred to as cis-regulatory elements. These include promoters, located at the transcription initiation sites of genes, and a variety of distal cis-regulatory modules (CRMs), the most common of which are transcriptional enhancers. Because regulated gene expression is fundamental to cell differentiation and acquisition of new cell fates, identifying, characterizing, and understanding the mechanisms of action of CRMs is critical for understanding development. CRM discovery has historically been challenging, as CRMs can be located far from the genes they regulate, have few readily identifiable sequence characteristics, and for many years were not amenable to high-throughput discovery methods. However, the recent availability of complete genome sequences and the development of next-generation sequencing methods have led to an explosion of both computational and empirical methods for CRM discovery in model and nonmodel organisms alike. Experimentally, CRMs can be identified through chromatin immunoprecipitation directed against TFs or histone post-translational modifications, identification of nucleosome-depleted 'open' chromatin regions, or sequencing-based high-throughput functional screening. Computational methods include comparative genomics, clustering of known or predicted TF-binding sites, and supervised machine-learning approaches trained on known CRMs. All of these methods have proven effective for CRM discovery, but each has its own considerations and limitations, and each is subject to a greater or lesser number of false-positive identifications. Experimental confirmation of predictions is essential, although shortcomings in current methods suggest that additional means of validation need to be developed.
CONFLICT OF INTEREST: The authors have declared no conflicts of interest for this article.
Affiliation(s)
- Kushal Suryamohan
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY 14203, USA
- Marc S. Halfon
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biological Sciences, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biomedical Informatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY 14203, USA
- Molecular and Cellular Biology Department and Program in Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
22
Starick SR, Ibn-Salem J, Jurk M, Hernandez C, Love MI, Chung HR, Vingron M, Thomas-Chollier M, Meijsing SH. ChIP-exo signal associated with DNA-binding motifs provides insight into the genomic binding of the glucocorticoid receptor and cooperating transcription factors. Genome Res 2015; 25:825-35. [PMID: 25720775 PMCID: PMC4448679 DOI: 10.1101/gr.185157.114] [Citation(s) in RCA: 102] [Impact Index Per Article: 10.2] [Received: 10/02/2014] [Accepted: 02/23/2015] [Indexed: 12/22/2022]
Abstract
The classical DNA recognition sequence of the glucocorticoid receptor (GR) appears to be present at only a fraction of bound genomic regions. To identify sequences responsible for recruitment of this transcription factor (TF) to individual loci, we turned to the high-resolution ChIP-exo approach. We exploited this signal by determining footprint profiles of TF binding at single-base-pair resolution using ExoProfiler, a computational pipeline based on DNA binding motifs. When applied to our GR and the few available public ChIP-exo data sets, we find that ChIP-exo footprints are protein- and recognition sequence-specific signatures of genomic TF association. Furthermore, we show that ChIP-exo captures information about TFs other than the one directly targeted by the antibody in the ChIP procedure. Consequently, the shape of the ChIP-exo footprint can be used to discriminate between direct and indirect (tethering to other DNA-bound proteins) DNA association of GR. Together, our findings indicate that the absence of classical recognition sequences can be explained by direct GR binding to a broader spectrum of sequences than previously known, either as a homodimer or as a heterodimer binding together with a member of the ETS or TEAD families of TFs, or alternatively by indirect recruitment via FOX or STAT proteins. ChIP-exo footprints also bring structural insights and locate DNA:protein cross-link points that are compatible with crystal structures of the studied TFs. Overall, our generically applicable footprint-based approach uncovers new structural and functional insights into the diverse ways of genomic cooperation and association of TFs.
Affiliation(s)
- Stephan R Starick
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Jonas Ibn-Salem
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany; Institut de Biologie de l'Ecole Normale Supérieure, Institut National de la Santé et de la Recherche Médicale, U1024, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 8197, F-75005 Paris, France
- Marcel Jurk
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Céline Hernandez
- Institut de Biologie de l'Ecole Normale Supérieure, Institut National de la Santé et de la Recherche Médicale, U1024, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 8197, F-75005 Paris, France
- Michael I Love
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Ho-Ryun Chung
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Martin Vingron
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Morgane Thomas-Chollier
- Institut de Biologie de l'Ecole Normale Supérieure, Institut National de la Santé et de la Recherche Médicale, U1024, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 8197, F-75005 Paris, France
- Sebastiaan H Meijsing
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
23
Taher L, Narlikar L, Ovcharenko I. Identification and computational analysis of gene regulatory elements. Cold Spring Harb Protoc 2015; 2015:pdb.top083642. [PMID: 25561628 PMCID: PMC5885252 DOI: 10.1101/pdb.top083642] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 06/04/2023]
Abstract
Over the last two decades, advances in experimental and computational technologies have greatly facilitated genomic research. Next-generation sequencing technologies have made de novo sequencing of large genomes affordable, and powerful computational approaches have enabled accurate annotations of genomic DNA sequences. Charting functional regions in genomes must account for not only the coding sequences, but also noncoding RNAs, repetitive elements, chromatin states, epigenetic modifications, and gene regulatory elements. A mix of comparative genomics, high-throughput biological experiments, and machine learning approaches has played a major role in this truly global effort. Here we describe some of these approaches and provide an account of our current understanding of the complex landscape of the human genome. We also present overviews of different publicly available, large-scale experimental data sets and computational tools, which we hope will prove beneficial for researchers working with large and complex genomes.
Affiliation(s)
- Leila Taher
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
- Institute for Biostatistics and Informatics in Medicine and Ageing Research, University of Rostock, 18051 Rostock, Germany
- Leelavati Narlikar
- Chemical Engineering and Process Development Division, National Chemical Laboratory, CSIR, Pune 411008, India
- Ivan Ovcharenko
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
24
A gene regulatory network controls the binary fate decision of rod and bipolar cells in the vertebrate retina. Dev Cell 2014; 30:513-27. [PMID: 25155555 PMCID: PMC4304698 DOI: 10.1016/j.devcel.2014.07.018] [Citation(s) in RCA: 134] [Impact Index Per Article: 12.2] [Received: 04/01/2014] [Revised: 06/16/2014] [Accepted: 07/21/2014] [Indexed: 12/12/2022]
Abstract
Gene regulatory networks (GRNs) regulate critical events during development. In complex tissues, such as the mammalian central nervous system (CNS), networks likely provide the complex regulatory interactions needed to direct the specification of the many CNS cell types. Here, we dissect a GRN that regulates a binary fate decision between two siblings in the murine retina, the rod photoreceptor and bipolar interneuron. The GRN centers on Blimp1, one of the transcription factors (TFs) that regulates the rod versus bipolar cell fate decision. We identified a cis-regulatory module (CRM), B108, that mimics Blimp1 expression. Deletion of genomic B108 by CRISPR/Cas9 in vivo using electroporation abolished the function of Blimp1. Otx2 and RORβ were found to regulate Blimp1 expression via B108, and Blimp1 and Otx2 were shown to form a negative feedback loop that regulates the level of Otx2, which regulates the production of the correct ratio of rods and bipolar cells.
25
Sohn I, Shim J, Hwang C, Kim S, Lee JW. Transcription factor-binding site identification and gene classification via fusion of the supervised-weighted discrete kernel clustering and support vector machine. J Appl Stat 2014. [DOI: 10.1080/02664763.2013.845143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
26
Abstract
Since the seminal discovery of the cell-fate regulator Myod, studies in skeletal myogenesis have inspired the search for cell-fate regulators of similar potential in other tissues and organs. It was perplexing that a similar transcription factor for other tissues was not found; however, it was later discovered that combinations of molecular regulators can divert somatic cell fates to other cell types. With the new era of reprogramming to induce pluripotent cells, the myogenesis paradigm can now be viewed under a different light. Here, we provide a short historical perspective and focus on how the regulation of skeletal myogenesis occurs distinctly in different scenarios and anatomical locations. In addition, some interesting features of this tissue underscore the importance of reconsidering the simple-minded view that a single stem cell population emerges after gastrulation to assure tissuegenesis. Notably, a self-renewing long-term Pax7+ myogenic stem cell population emerges during development only after a first wave of terminal differentiation occurs to establish a tissue anlagen in the mouse. How the future stem cell population is selected in this unusual scenario will be discussed. Recently, a wealth of information has emerged from epigenetic and genome-wide studies in myogenic cells. Although key transcription factors such as Pax3, Pax7, and Myod regulate only a small subset of genes, in some cases their genomic distribution and binding are considerably more promiscuous. This apparent nonspecificity can be reconciled in part by the permissivity of the cell for myogenic commitment, and also by new roles for some of these regulators as pioneer transcription factors acting on chromatin state.
Affiliation(s)
- Glenda Comai
- Stem Cells and Development, CNRS URA 2578, Department of Developmental & Stem Cell Biology, Institut Pasteur, Paris, France
- Shahragim Tajbakhsh
- Stem Cells and Development, CNRS URA 2578, Department of Developmental & Stem Cell Biology, Institut Pasteur, Paris, France.
27
Jiang P, Singh M. CCAT: Combinatorial Code Analysis Tool for transcriptional regulation. Nucleic Acids Res 2013; 42:2833-47. [PMID: 24366875 PMCID: PMC3950699 DOI: 10.1093/nar/gkt1302] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Indexed: 01/08/2023]
Abstract
Combinatorial interplay among transcription factors (TFs) is an important mechanism by which transcriptional regulatory specificity is achieved. However, despite the increasing number of TFs for which either binding specificities or genome-wide occupancy data are known, knowledge about cooperativity between TFs remains limited. To address this, we developed a computational framework for predicting genome-wide co-binding between TFs (CCAT, Combinatorial Code Analysis Tool), and applied it to Drosophila melanogaster to uncover cooperativity among TFs during embryo development. Using publicly available TF binding specificity data and DNaseI chromatin accessibility data, we first predicted genome-wide binding sites for 324 TFs across five stages of D. melanogaster embryo development. We then applied CCAT in each of these developmental stages, and identified from 19 to 58 pairs of TFs in each stage whose predicted binding sites are significantly co-localized. We found that nearby binding sites for pairs of TFs predicted to cooperate were enriched in regions bound in relevant ChIP experiments, and were more evolutionarily conserved than other pairs. Further, we found that TFs tend to be co-localized with other TFs in a dynamic manner across developmental stages. All generated data as well as source code for our front-to-end pipeline are available at http://cat.princeton.edu.
Affiliation(s)
- Peng Jiang
- Department of Computer Science, Princeton University, Princeton, 08540 NJ, USA and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, 08544 NJ, USA
28
Deyneko IV, Kel AE, Kel-Margoulis OV, Deineko EV, Wingender E, Weiss S. MatrixCatch--a novel tool for the recognition of composite regulatory elements in promoters. BMC Bioinformatics 2013; 14:241. [PMID: 23924163 PMCID: PMC3754795 DOI: 10.1186/1471-2105-14-241] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 05/07/2012] [Accepted: 08/05/2013] [Indexed: 01/28/2023]
Abstract
BACKGROUND: Accurate recognition of regulatory elements in promoters is an essential prerequisite for understanding the mechanisms of gene regulation at the level of transcription. Composite regulatory elements represent a particular type of such transcriptional regulatory elements consisting of pairs of individual DNA motifs. In contrast to the present approach, most available recognition techniques are based purely on statistical evaluation of the occurrence of single motifs. Such methods are limited in application, since the accuracy of recognition is greatly dependent on the size and quality of the sequence dataset. Methods that exploit available knowledge and have broad applicability are evidently needed. RESULTS: We developed a novel method to identify composite regulatory elements in promoters using a library of known examples. In-depth investigation of regularities encoded in known composite elements allowed us to introduce a new characteristic measure and to improve the specificity compared with other methods. Tests on an established benchmark and real genomic data show that our method outperforms other available methods based either on known examples or statistical evaluations. In addition to better recognition, this method has two practical advantages: first, the ability to detect a high number of different types of composite elements, and second, direct biological interpretation of the identified results. The program is available at http://gnaweb.helmholtz-hzi.de/cgi-bin/MCatch/MatrixCatch.pl and includes an option to extend the provided library with user-supplied data. CONCLUSIONS: The novel algorithm for the identification of composite regulatory elements presented in this paper was proved to be superior to existing methods. Its application to tissue-specific promoters identified several highly specific composite elements with relevance to their biological function. This approach together with other methods will further advance the understanding of transcriptional regulation of genes.
Affiliation(s)
- Igor V Deyneko
- Department of Molecular Immunology, Helmholtz Centre for Infection Research, Braunschweig, Germany.
29
Nandi S, Blais A, Ioshikhes I. Identification of cis-regulatory modules in promoters of human genes exploiting mutual positioning of transcription factors. Nucleic Acids Res 2013; 41:8822-41. [PMID: 23913413 PMCID: PMC3799424 DOI: 10.1093/nar/gkt578] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 12/29/2022]
Abstract
In higher organisms, gene regulation is controlled by the interplay of non-random combinations of multiple transcription factors (TFs). Although numerous attempts have been made to identify these combinations, important details, such as mutual positioning of the factors that have an important role in the TF interplay, are still missing. The goal of the present work is in silico mapping of some of such associating factors based on their mutual positioning, using computational screening. We have selected the process of myogenesis as a study case, and we focused on TF combinations involving master myogenic TF Myogenic differentiation (MyoD) with other factors situated at specific distances from it. The results of our work show that some muscle-specific factors occur together with MyoD within the range of ±100 bp in a large number of promoters. We confirm co-occurrence of the MyoD with muscle-specific factors as described in earlier studies. However, we have also found novel relationships of MyoD with other factors not specific for muscle. Additionally, we have observed that MyoD tends to associate with different factors in proximal and distal promoter areas. The major outcome of our study is establishing the genome-wide connection between biological interactions of TFs and close co-occurrence of their binding sites.
Affiliation(s)
- Soumyadeep Nandi
- Ottawa Institute of Systems Biology, University of Ottawa, Ottawa, Ontario K1H 8M5, Canada and Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, Ontario K1H 8M5, Canada
|
30
|
Malin J, Aniba MR, Hannenhalli S. Enhancer networks revealed by correlated DNAse hypersensitivity states of enhancers. Nucleic Acids Res 2013; 41:6828-38. [PMID: 23700312 PMCID: PMC3737527 DOI: 10.1093/nar/gkt374] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Received: 01/22/2013] [Revised: 03/23/2013] [Accepted: 04/15/2013] [Indexed: 12/14/2022] Open
Abstract
Mammalian gene expression is often regulated by distal enhancers. However, little is known about higher order functional organization of enhancers. Using ∼100 K P300-bound regions as candidate enhancers, we investigated their correlated activity across 72 cell types based on DNAse hypersensitivity. We found widespread correlated activity between enhancers, which decreases with increasing inter-enhancer genomic distance. We found that correlated enhancers tend to share common transcription factor (TF) binding motifs, and several chromatin modification enzymes preferentially interact with these TFs. Presence of shared motifs in enhancer pairs can predict correlated activity with 73% accuracy. Also, genes near correlated enhancers exhibit correlated expression and share common function. Correlated enhancers tend to be spatially proximal. Interestingly, weak enhancers tend to correlate with significantly greater numbers of other enhancers relative to strong enhancers. Furthermore, strong/weak enhancers preferentially correlate with strong/weak enhancers, respectively. We constructed enhancer networks based on shared motif and correlated activity and show significant functional enrichment in their putative target gene clusters. Overall, our analyses show extensive correlated activity among enhancers and reveal clusters of enhancers whose activities are coordinately regulated by multiple potential mechanisms involving shared TF binding, chromatin modifying enzymes and 3D chromatin structure, which ultimately co-regulate functionally linked genes.
Affiliation(s)
- Justin Malin
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20740, USA, Computational Biology, Bioinformatics, and Genomics Program, University of Maryland, College Park, MD, 20740, USA and Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD, 20740, USA
| | - Mohamed Radhouane Aniba
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20740, USA, Computational Biology, Bioinformatics, and Genomics Program, University of Maryland, College Park, MD, 20740, USA and Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD, 20740, USA
| | - Sridhar Hannenhalli
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20740, USA, Computational Biology, Bioinformatics, and Genomics Program, University of Maryland, College Park, MD, 20740, USA and Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD, 20740, USA
|
31
|
Loots GG, Bergmann A, Hum NR, Oldenburg CE, Wills AE, Hu N, Ovcharenko I, Harland RM. Interrogating transcriptional regulatory sequences in Tol2-mediated Xenopus transgenics. PLoS One 2013; 8:e68548. [PMID: 23874664 PMCID: PMC3713029 DOI: 10.1371/journal.pone.0068548] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/18/2013] [Accepted: 05/30/2013] [Indexed: 12/13/2022] Open
Abstract
Identifying gene regulatory elements and their target genes in vertebrates remains a significant challenge. It is now recognized that transcriptional regulatory sequences are critical in orchestrating dynamic controls of tissue-specific gene expression during vertebrate development and in adult tissues, and that these elements can be positioned at great distances in relation to the promoters of the genes they control. While significant progress has been made in mapping DNA binding regions by combining chromatin immunoprecipitation and next generation sequencing, functional validation remains a limiting step in improving our ability to correlate in silico predictions with biological function. We recently developed a computational method that synergistically combines genome-wide gene-expression profiling, vertebrate genome comparisons, and transcription factor binding-site analysis to predict tissue-specific enhancers in the human genome. We applied this method to 270 genes highly expressed in skeletal muscle and predicted 190 putative cis-regulatory modules. Furthermore, we optimized Tol2 transgenic constructs in Xenopus laevis to interrogate 20 of these elements for their ability to function as skeletal muscle-specific transcriptional enhancers during embryonic development. We found 45% of these elements expressed only in the fast muscle fibers that are oriented in highly organized chevrons in the Xenopus laevis tadpole. Transcription factor binding site analysis identified >2 Mef2/MyoD sites within ∼200 bp regions in 6 of the validated enhancers, and systematic mutagenesis of these sites revealed that they are critical for the enhancer function. The data described herein introduce a new reporter system suitable for interrogating tissue-specific cis-regulatory elements, which allows monitoring of enhancer activity in real time, throughout early stages of embryonic development, in Xenopus.
Affiliation(s)
- Gabriela G Loots
- Biology and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, California, United States of America.
|
32
|
Stanley D, Watson-Haigh NS, Cowled CJE, Moore RJ. Genetic architecture of gene expression in the chicken. BMC Genomics 2013; 14:13. [PMID: 23324119 PMCID: PMC3575264 DOI: 10.1186/1471-2164-14-13] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 07/20/2012] [Accepted: 12/26/2012] [Indexed: 12/05/2022] Open
Abstract
Background: The annotation of many genomes is limited, with a large proportion of identified genes lacking functional assignments. The construction of gene co-expression networks is a powerful approach that presents a way of integrating information from diverse gene expression datasets into a unified analysis which allows inferences to be drawn about the role of previously uncharacterised genes. Using this approach, we generated a condition-free gene co-expression network for the chicken using data from 1,043 publicly available Affymetrix GeneChip Chicken Genome Arrays. This data was generated from a diverse range of experiments, including different tissues and experimental conditions. Our aim was to identify gene co-expression modules and generate a tool to facilitate exploration of the functional chicken genome. Results: Fifteen modules, containing between 24 and 473 genes, were identified in the condition-free network. Most of the modules showed strong functional enrichment for particular Gene Ontology categories. However, a few showed no enrichment. Transcription factor binding site enrichment was also noted. Conclusions: We have demonstrated that this chicken gene co-expression network is a useful tool in gene function prediction and the identification of putative novel transcription factors and binding sites. This work highlights the relevance of this methodology for functional prediction in poorly annotated genomes such as the chicken.
Affiliation(s)
- Dragana Stanley
- CSIRO Animal, Food and Health Sciences, Australian Animal Health Laboratories, Geelong, VIC 3220, Australia.
|
33
|
Jiwaji M, Daly R, Gibriel A, Barkess G, McLean P, Yang J, Pansare K, Cumming S, McLauchlan A, Kamola PJ, Bhutta MS, West AG, West KL, Kolch W, Girolami MA, Pitt AR. Unique reporter-based sensor platforms to monitor signalling in cells. PLoS One 2012; 7:e50521. [PMID: 23209767 PMCID: PMC3510088 DOI: 10.1371/journal.pone.0050521] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 08/02/2012] [Accepted: 10/23/2012] [Indexed: 11/30/2022] Open
Abstract
Introduction: In recent years much progress has been made in the development of tools for systems biology to study the levels of mRNA and protein, and their interactions within cells. However, few multiplexed methodologies are available to study cell signalling directly at the transcription factor level. Methods: Here we describe a sensitive, plasmid-based RNA reporter methodology to study transcription factor activation in mammalian cells, and apply this technology to profiling 60 transcription factors in parallel. The methodology uses two robust and easily accessible detection platforms: quantitative real-time PCR for quantitative analysis and DNA microarrays for parallel, higher throughput analysis. Findings: We test the specificity of the detection platforms with ten inducers and independently validate the transcription factor activation. Conclusions: We report a methodology for the multiplexed study of transcription factor activation in mammalian cells that is direct and not theoretically limited by the number of available reporters.
Affiliation(s)
- Meesbah Jiwaji
- Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
- School of Life and Health Science, Aston University, Birmingham, United Kingdom
| | - Rónán Daly
- School of Computing Science, University of Glasgow, Glasgow, United Kingdom
| | - Abdullah Gibriel
- Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
| | - Gráinne Barkess
- Institute of Cancer Sciences, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
| | - Pauline McLean
- Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
| | - Jingli Yang
- Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
| | - Kshama Pansare
- Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
- School of Life and Health Science, Aston University, Birmingham, United Kingdom
| | - Sarah Cumming
- Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
| | - Alisha McLauchlan
- Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
| | - Piotr J. Kamola
- Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
| | - Musab S. Bhutta
- Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
| | - Adam G. West
- Institute of Cancer Sciences, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
| | - Katherine L. West
- Institute of Cancer Sciences, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
| | - Walter Kolch
- Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
- Systems Biology Ireland and the Conway Institute, University College Dublin, Dublin, Ireland
| | - Mark A. Girolami
- School of Computing Science, University of Glasgow, Glasgow, United Kingdom
- Department of Statistical Science, University College London, London, United Kingdom
| | - Andrew R. Pitt
- Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
- School of Life and Health Science, Aston University, Birmingham, United Kingdom
|
34
|
oPOSSUM-3: advanced analysis of regulatory motif over-representation across genes or ChIP-Seq datasets. G3-GENES GENOMES GENETICS 2012; 2:987-1002. [PMID: 22973536 PMCID: PMC3429929 DOI: 10.1534/g3.112.003202] [Citation(s) in RCA: 230] [Impact Index Per Article: 17.7] [Received: 05/27/2012] [Accepted: 06/11/2012] [Indexed: 01/12/2023]
Abstract
oPOSSUM-3 is a web-accessible software system for identification of over-represented transcription factor binding sites (TFBS) and TFBS families in either DNA sequences of co-expressed genes or sequences generated from high-throughput methods, such as ChIP-Seq. Validation of the system with known sets of co-regulated genes and published ChIP-Seq data demonstrates the capacity for oPOSSUM-3 to identify mediating transcription factors (TF) for co-regulated genes or co-recovered sequences. oPOSSUM-3 is available at http://opossum.cisreg.ca.
|
35
|
Abstract
Differential gene expression is the fundamental mechanism underlying animal development and cell differentiation. However, it is a challenge to identify comprehensively and accurately the DNA sequences that are required to regulate gene expression: namely, cis-regulatory modules (CRMs). Three major features, either singly or in combination, are used to predict CRMs: clusters of transcription factor binding site motifs, non-coding DNA that is under evolutionary constraint and biochemical marks associated with CRMs, such as histone modifications and protein occupancy. The validation rates for predictions indicate that identifying diagnostic biochemical marks is the most reliable method, and understanding is enhanced by the analysis of motifs and conservation patterns within those predicted CRMs.
|
36
|
Nikulova AA, Favorov AV, Sutormin RA, Makeev VJ, Mironov AA. CORECLUST: identification of the conserved CRM grammar together with prediction of gene regulation. Nucleic Acids Res 2012; 40:e93. [PMID: 22422836 PMCID: PMC3384346 DOI: 10.1093/nar/gks235] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/16/2022] Open
Abstract
Identification of transcriptional regulatory regions and tracing their internal organization are important for understanding the eukaryotic cell machinery. Cis-regulatory modules (CRMs) of higher eukaryotes are believed to possess a regulatory ‘grammar’, or preferred arrangement of binding sites, that is crucial for proper regulation and thus tends to be evolutionarily conserved. Here, we present a method CORECLUST (COnservative REgulatory CLUster STructure) that predicts CRMs based on a set of positional weight matrices. Given regulatory regions of orthologous and/or co-regulated genes, CORECLUST constructs a CRM model by revealing the conserved rules that describe the relative location of binding sites. The constructed model may be consequently used for the genome-wide prediction of similar CRMs, and thus detection of co-regulated genes, and for the investigation of the regulatory grammar of the system. Compared with related methods, CORECLUST shows better performance at identification of CRMs conferring muscle-specific gene expression in vertebrates and early-developmental CRMs in Drosophila.
Affiliation(s)
- Anna A Nikulova
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 1-73 Leninskie Gory, Moscow 119991, Russia.
|
37
|
Girgis HZ, Ovcharenko I. Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs. BMC Bioinformatics 2012; 13:25. [PMID: 22313678 PMCID: PMC3359238 DOI: 10.1186/1471-2105-13-25] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 09/16/2011] [Accepted: 02/07/2012] [Indexed: 12/26/2022] Open
Abstract
Background: Researchers seeking to unlock the genetic basis of human physiology and diseases have been studying gene transcription regulation. The temporal and spatial patterns of gene expression are controlled by mainly non-coding elements known as cis-regulatory modules (CRMs) and epigenetic factors. CRMs modulating related genes share the regulatory signature which consists of transcription factor (TF) binding sites (TFBSs). Identifying such CRMs is a challenging problem due to the prohibitive number of sequence sets that need to be analyzed. Results: We formulated the challenge as a supervised classification problem even though experimentally validated CRMs were not required. Our efforts resulted in a software system named CrmMiner. The system mines for CRMs in the vicinity of related genes. CrmMiner requires two sets of sequences: a mixed set and a control set. Sequences in the vicinity of the related genes comprise the mixed set, whereas the control set includes random genomic sequences. CrmMiner assumes that a large percentage of the mixed set is made of background sequences that do not include CRMs. The system identifies pairs of closely located motifs representing vertebrate TFBSs that are enriched in the training mixed set consisting of 50% of the gene loci. In addition, CrmMiner selects a group of the enriched pairs to represent the tissue-specific regulatory signature. The mixed and the control sets are searched for candidate sequences that include any of the selected pairs. Next, an optimal Bayesian classifier is used to distinguish candidates found in the mixed set from their control counterparts. Our study proposes 62 tissue-specific regulatory signatures and putative CRMs for different human tissues and cell types. These signatures consist of assortments of ubiquitously expressed TFs and tissue-specific TFs. Under controlled settings, CrmMiner identified known CRMs in noisy sets up to 1:25 signal-to-noise ratio. CrmMiner was 21-75% more precise than a related CRM predictor. The sensitivity of the system to locate known human heart enhancers reached up to 83%. CrmMiner precision reached 82% while mining for CRMs specific to the human CD4+ T cells. On several data sets, the system achieved 99% specificity. Conclusion: These results suggest that CrmMiner predictions are accurate and likely to be tissue-specific CRMs. We expect that the predicted tissue-specific CRMs and the regulatory signatures broaden our knowledge of gene transcription regulation.
Affiliation(s)
- Hani Z Girgis
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health 9600 Rockville Pike, Bethesda, MD 20896, USA
|
38
|
Abstract
miRNAs are small non-coding RNAs with average length of ~21 bp. miRNA formation seems to be dependent upon multiple factors besides Drosha and Dicer, in a tissue/stage-specific manner, with interplay of several specific binding factors. In the present study, we have investigated transcription factor binding sites in and around the genomic sequences of precursor miRNAs and RNA-binding protein (RBP) sites in miRNA precursor sequences, analysed and tested in a comprehensive manner. Here, we report that miRNA precursor regions are positionally enriched for binding of transcription factors as well as RBPs around the 3' end of the mature miRNA region in the 5' arm. The pattern and distribution of such regulatory sites appear to be a characteristic of precursor miRNA sequences when compared with non-miRNA sequences as negative dataset and tested statistically. When compared with 1 kb upstream regions, a sudden sharp peak for binding sites arises in the enriched zone near the mature miRNA region. An expression-data-based correlation analysis was performed between such miRNAs and their corresponding transcription factors and RBPs for this region. Some specific groups of binding factors and associated miRNAs were identified. We also identified some of the overrepresented transcription factors and associated miRNAs with high expression correlation values which could be useful in cancer-related studies. The highly correlated groups were found to host experimentally validated composite regulatory modules, in which Lmo2-GATA1 appeared as the predominant one. For many RBP-miRNA associations, coexpression similarity was also evident among the associated miRNAs common to given RBPs, supporting the Regulon model, suggesting a common role and common control of these miRNAs by the associated RBPs. Based on our findings, we propose that the observed characteristic distribution of regulatory sites in precursor miRNA sequence regions could be critical in miRNA transcription, processing, stability and formation and is important for therapeutic studies. Our findings also support the recently proposed theory of self-sufficient mode of transcription by miRNAs, which states that miRNA transcription can be carried out in host-independent mode too.
Affiliation(s)
- Ashwani Jha
- Studio of Computational Biology and Bioinformatics, Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology, Council of Scientific and Industrial Research, Palampur 176061, HP, India
|
39
|
Aerts S. Computational strategies for the genome-wide identification of cis-regulatory elements and transcriptional targets. Curr Top Dev Biol 2012; 98:121-45. [PMID: 22305161 DOI: 10.1016/b978-0-12-386499-4.00005-7] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Indexed: 12/13/2022]
Abstract
Transcription factors (TFs) are key proteins that decode the information in our genome to express a precise and unique set of proteins and RNA molecules in each cell type in our body. These factors play a pivotal role in all biological processes, including the determination of a cell's fate during development and the maintenance of a cell's physiological function. To achieve this, a TF binds to specific DNA sequences in the noncoding part of the genome, recruits chromatin modifiers and cofactors, and directs the transcription initiation rate of its "target genes." Therefore, a key challenge in deciphering a transcriptional switch is to identify the direct target genes of the master regulators that control the switch, the cis-regulatory elements implementing (auto-)regulatory loops, and the target genes of all the TFs in the downstream regulatory network. A better knowledge of a TF's targetome during specification and differentiation of a particular cell type will generate mechanistic insight into its developmental program. Here, I review computational strategies and methods to predict transcriptional targets by genome-wide searches for TF binding sites using position weight matrices, motif clusters, phylogenetic footprinting, chromatin binding and accessibility data, enhancer classification, motif enrichment, and gene expression signatures.
Affiliation(s)
- Stein Aerts
- Laboratory of Computational Biology, Center for Human Genetics, Katholieke Universiteit Leuven, Leuven, Belgium
|
40
|
Kwon AT, Chou AY, Arenillas DJ, Wasserman WW. Validation of skeletal muscle cis-regulatory module predictions reveals nucleotide composition bias in functional enhancers. PLoS Comput Biol 2011; 7:e1002256. [PMID: 22144875 PMCID: PMC3228787 DOI: 10.1371/journal.pcbi.1002256] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 01/31/2011] [Accepted: 09/16/2011] [Indexed: 11/19/2022] Open
Abstract
We performed a genome-wide scan for muscle-specific cis-regulatory modules (CRMs) using three computational prediction programs. Based on the predictions, 339 candidate CRMs were tested in cell culture with NIH3T3 fibroblasts and C2C12 myoblasts for capacity to direct selective reporter gene expression to differentiated C2C12 myotubes. A subset of 19 CRMs validated as functional in the assay. The rate of predictive success reveals striking limitations of computational regulatory sequence analysis methods for CRM discovery. Motif-based methods performed no better than predictions based only on sequence conservation. Analysis of the properties of the functional sequences relative to inactive sequences identifies nucleotide sequence composition as an important characteristic to incorporate in future methods for improved predictive specificity. Muscle-related TFBSs predicted within the functional sequences display greater sequence conservation than non-TFBS flanking regions. Comparison with recent MyoD and histone modification ChIP-Seq data supports the validity of the functional regions. For efficient identification of genomic sequences responsible for regulating gene expression, a number of computer programs have been developed for automatic annotation of these regulatory regions. We searched for potential regulatory regions responsible for controlling the expression of skeletal muscle-specific genes using these programs, and validated the predictions in a popular cell culture model for muscle. We were able to identify 19 previously uncharacterized regulatory regions for muscle genes. The accuracy of the predictions made by these programs leaves much to be desired, leading us to conclude that other signals in addition to the sequence information will be required to achieve sufficient predictive power for genome annotation. Genomic regions with confirmed regulatory function were compared against non-functional sequences, revealing sequence conservation, composition and chromatin modification properties as important signals in determining regulatory region functionality.
Affiliation(s)
- Andrew T. Kwon
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Genetics Graduate Program, and Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
| | - Alice Yi Chou
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Genetics Graduate Program, and Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
| | - David J. Arenillas
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Genetics Graduate Program, and Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
| | - Wyeth W. Wasserman
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Genetics Graduate Program, and Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
|
41
|
Starr MO, Ho MCW, Gunther EJM, Tu YK, Shur AS, Goetz SE, Borok MJ, Kang V, Drewell RA. Molecular dissection of cis-regulatory modules at the Drosophila bithorax complex reveals critical transcription factor signature motifs. Dev Biol 2011; 359:290-302. [PMID: 21821017 PMCID: PMC3202680 DOI: 10.1016/j.ydbio.2011.07.028] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Received: 02/02/2011] [Revised: 07/17/2011] [Accepted: 07/19/2011] [Indexed: 11/17/2022]
Abstract
At the Drosophila melanogaster bithorax complex (BX-C) over 330kb of intergenic DNA is responsible for directing the transcription of just three homeotic (Hox) genes during embryonic development. A number of distinct enhancer cis-regulatory modules (CRMs) are responsible for controlling the specific expression patterns of the Hox genes in the BX-C. While it has proven possible to identify orthologs of known BX-C CRMs in different Drosophila species using overall sequence conservation, this approach has not proven sufficiently effective for identifying novel CRMs or defining the key functional sequences within enhancer CRMs. Here we demonstrate that the specific spatial clustering of transcription factor (TF) binding sites is important for BX-C enhancer activity. A bioinformatic search for combinations of putative TF binding sites in the BX-C suggests that simple clustering of binding sites is frequently not indicative of enhancer activity. However, through molecular dissection and evolutionary comparison across the Drosophila genus we discovered that specific TF binding site clustering patterns are an important feature of three known BX-C enhancers. Sub-regions of the defined IAB5 and IAB7b enhancers were both found to contain an evolutionarily conserved signature motif of clustered TF binding sites which is critical for the functional activity of the enhancers. Together, these results indicate that the spatial organization of specific activator and repressor binding sites within BX-C enhancers is of greater importance than overall sequence conservation and is indicative of enhancer functional activity.
Affiliation(s)
| | | | | | - Yen-Kuei Tu
- Biology Department, Harvey Mudd College, 301 Platt Boulevard, Claremont, CA 91711, USA
| | - Andrey S. Shur
- Biology Department, Harvey Mudd College, 301 Platt Boulevard, Claremont, CA 91711, USA
| | - Sara E. Goetz
- Biology Department, Harvey Mudd College, 301 Platt Boulevard, Claremont, CA 91711, USA
| | - Matthew J. Borok
- Biology Department, Harvey Mudd College, 301 Platt Boulevard, Claremont, CA 91711, USA
| | - Victoria Kang
- Biology Department, Harvey Mudd College, 301 Platt Boulevard, Claremont, CA 91711, USA
| | - Robert A. Drewell
- Biology Department, Harvey Mudd College, 301 Platt Boulevard, Claremont, CA 91711, USA
|
42
|
Yan R, Boutros PC, Jurisica I. A tree-based approach for motif discovery and sequence classification. Bioinformatics 2011; 27:2054-61. [PMID: 21685048 DOI: 10.1093/bioinformatics/btr353] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Pattern discovery algorithms are widely used for the analysis of DNA and protein sequences. Most algorithms have been designed to find overrepresented motifs in sparse datasets of long sequences, and ignore most positional information. We introduce an algorithm optimized to exploit spatial information in sparse-but-populous datasets. RESULTS: Our algorithm Tree-based Weighted-Position Pattern Discovery and Classification (T-WPPDC) supports both unsupervised pattern discovery and supervised sequence classification. It identifies positionally enriched patterns using the Kullback-Leibler distance between foreground and background sequences at each position. This spatial information is used to discover positionally important patterns. T-WPPDC then uses a scoring function to discriminate different biological classes. We validated T-WPPDC on an important biological problem: prediction of single nucleotide polymorphisms (SNPs) from flanking sequence. We evaluated 672 separate experiments on 120 datasets derived from multiple species. T-WPPDC outperformed other pattern discovery methods and was comparable to the supervised machine learning algorithms. The algorithm is computationally efficient and largely insensitive to dataset size. It allows arbitrary parameterization and is embarrassingly parallelizable. CONCLUSIONS: T-WPPDC is a minimally parameterized algorithm for both pattern discovery and sequence classification that directly incorporates positional information. We use it to confirm the predictability of SNPs from flanking sequence, and show that positional information is a key to this biological problem. AVAILABILITY: The algorithm, code and data are available at: http://www.cs.utoronto.ca/~juris/data/TWPPDC
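The per-position Kullback-Leibler comparison mentioned in this abstract can be illustrated with a short sketch. It is not the T-WPPDC implementation: it computes the standard KL divergence between single-nucleotide frequencies in equal-length, aligned foreground and background sequences, and the function names, the pseudocount of 1, and the toy sequences are all assumptions made for illustration. It also assumes the sequences contain only A, C, G and T.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def column_frequencies(seqs, pos, pseudocount=1.0):
    """Smoothed nucleotide frequencies at one alignment column."""
    counts = Counter(s[pos] for s in seqs)
    total = len(seqs) + pseudocount * len(ALPHABET)
    return {b: (counts.get(b, 0) + pseudocount) / total for b in ALPHABET}

def positional_kl(foreground, background, pseudocount=1.0):
    """KL divergence D(foreground || background), in bits, at each position."""
    length = len(foreground[0])
    scores = []
    for pos in range(length):
        p = column_frequencies(foreground, pos, pseudocount)
        q = column_frequencies(background, pos, pseudocount)
        scores.append(sum(p[b] * math.log2(p[b] / q[b]) for b in ALPHABET))
    return scores

# Toy, made-up aligned sequences: positions 0 and 1 are fixed in the
# foreground but uniform in the background; positions 2 and 3 match.
fg = ["ACGT", "ACGA", "ACGC", "ACGG"]
bg = ["ACGT", "CAGA", "GTGC", "TGGG"]
kl = positional_kl(fg, bg)
```

Positions with a high score are those where the foreground composition departs most from the background; the pseudocount keeps the logarithm finite when a nucleotide is absent from a column.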
Affiliation(s)
- Rui Yan
- Department of Computer Science, University of Toronto, Toronto, Canada M5S 3G4.
43
Fulton DL, Denarier E, Friedman HC, Wasserman WW, Peterson AC. Towards resolving the transcription factor network controlling myelin gene expression. Nucleic Acids Res 2011; 39:7974-91. [PMID: 21729871 PMCID: PMC3185407 DOI: 10.1093/nar/gkr326] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Indexed: 12/26/2022]
Abstract
In the central nervous system (CNS), myelin is produced from spirally-wrapped oligodendrocyte plasma membrane and, as exemplified by the debilitating effects of inherited or acquired myelin abnormalities in diseases such as multiple sclerosis, it plays a critical role in nervous system function. Myelin sheath production coincides with rapid up-regulation of numerous genes. The complexity of their subsequent expression patterns, along with recently recognized heterogeneity within the oligodendrocyte lineage, suggest that the regulatory networks controlling such genes drive multiple context-specific transcriptional programs. Conferring this nuanced level of control likely involves a large repertoire of interacting transcription factors (TFs). Here, we combined novel strategies of computational sequence analyses with in vivo functional analysis to establish a TF network model of coordinate myelin-associated gene transcription. Notably, the network model captures regulatory DNA elements and TFs known to regulate oligodendrocyte myelin gene transcription and/or oligodendrocyte development, thereby validating our approach. Further, it links to numerous TFs with previously unsuspected roles in CNS myelination and suggests collaborative relationships amongst both known and novel TFs, thus providing deeper insight into the myelin gene transcriptional network.
Affiliation(s)
- Debra L Fulton
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, University of British Columbia, Vancouver, V5Z 4H4, Canada
44
Hemberg M, Kreiman G. Conservation of transcription factor binding events predicts gene expression across species. Nucleic Acids Res 2011; 39:7092-102. [PMID: 21622661 PMCID: PMC3167604 DOI: 10.1093/nar/gkr404] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Indexed: 12/26/2022]
Abstract
Recent technological advances have made it possible to determine the genome-wide binding sites of transcription factors (TFs). Comparisons across species have suggested a relatively low degree of evolutionary conservation of experimentally defined TF binding events (TFBEs). Using binding data for six different TFs in hepatocytes and embryonic stem cells from human and mouse, we demonstrate that evolutionary conservation of TFBEs within orthologous proximal promoters is closely linked to function, defined as expression of the target genes. We show that (i) there is a significantly higher degree of conservation of TFBEs when the target gene is expressed in both species; (ii) there is increased conservation of binding events for groups of TFs compared to individual TFs; and (iii) conserved TFBEs have a greater impact on the expression of their target genes than non-conserved ones. These results link conservation of structural elements (TFBEs) to conservation of function (gene expression) and suggest a higher degree of functional conservation than implied by previous studies.
Affiliation(s)
- Martin Hemberg
- Children's Hospital Boston, Program in Biophysics and Program in Neuroscience, Harvard Medical School, 300 Longwood Avenue, Boston, MA 02115, USA
45
Zhang Z, Zhang MQ. Histone modification profiles are predictive for tissue/cell-type specific expression of both protein-coding and microRNA genes. BMC Bioinformatics 2011; 12:155. [PMID: 21569556 PMCID: PMC3120700 DOI: 10.1186/1471-2105-12-155] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 12/09/2010] [Accepted: 05/14/2011] [Indexed: 02/04/2023]
Abstract
Background: Gene expression is regulated at both the DNA sequence level and through modification of chromatin. However, the effect of chromatin on tissue/cell-type specific gene regulation (TCSR) is largely unknown. In this paper, we present a method to elucidate the relationship between histone modification/variation (HMV) and TCSR. Results: A classifier for differentiating CD4+ T cell-specific genes from housekeeping genes using HMV data was built. We found HMV in both promoter and gene body regions to be predictive of genes which are targets of TCSR. For example, the histone modification types H3K4me3 and H3K27ac were identified as the most predictive for CpG-related promoters, whereas H3K4me3 and H3K79me3 were the most predictive for nonCpG-related promoters. However, genes targeted by TCSR can be predicted using other types of HMVs as well. Such redundancy implies that multiple types of underlying regulatory elements, such as enhancers or intragenic alternative promoters, which can regulate gene expression in a tissue/cell-type specific fashion, may be marked by the HMVs. Finally, we show that the predictive power of HMV for TCSR is not limited to protein-coding genes in CD4+ T cells, as we successfully predicted TCSR targeted genes in muscle cells, as well as microRNA genes with expression specific to CD4+ T cells, by the same classifier which was trained on HMV data of protein-coding genes in CD4+ T cells. Conclusion: We have begun to understand the HMV patterns that guide gene expression in both tissue/cell-type specific and ubiquitous manners.
Affiliation(s)
- Zhihua Zhang
- Department of Molecular Cell Biology, Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080, USA
46
Dojer N, Biecek P, Tiuryn J. Bi-billboard: symmetrization and careful choice of informant species results in higher accuracy of regulatory element prediction. J Comput Biol 2011; 18:809-19. [PMID: 21563976 DOI: 10.1089/cmb.2010.0299] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
The identification of cis-regulatory modules (CRM) is one of the most important problems towards the understanding of transcriptional regulation in higher eukaryotes. Computational methods for CRM detection are gaining importance due to the availability of genomic data on one side, and costs and difficulties of experimental methods on the other side. One of the proposed approaches, called Billboard, predicts CRMs based on the location of transcription factor binding sites in an analyzed sequence and a related one in so-called informant species. In the present article, we show how to combine information obtained in two symmetric runs (on the sequence of interest and on the related one) of the Billboard tool. In a series of experiments on data from various organisms, we show that the predictive power of our symmetric approach is significantly higher than the power of the one-way approach of Billboard. Moreover, we show that the evolutionary distance between organisms considerably influences the quality of prediction and we provide guidelines on the choice of an informant species.
Affiliation(s)
- Norbert Dojer
- Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland.
47
Brohée S, Janky R, Abdel-Sater F, Vanderstocken G, André B, van Helden J. Unraveling networks of co-regulated genes on the sole basis of genome sequences. Nucleic Acids Res 2011; 39:6340-58. [PMID: 21572103 PMCID: PMC3159452 DOI: 10.1093/nar/gkr264] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Indexed: 02/06/2023]
Abstract
With the growing number of available microbial genome sequences, regulatory signals can now be revealed as conserved motifs in promoters of orthologous genes (phylogenetic footprints). A next challenge is to unravel genome-scale regulatory networks. Using as sole input genome sequences, we predicted cis-regulatory elements for each gene of the yeast Saccharomyces cerevisiae by discovering over-represented motifs in the promoters of their orthologs in 19 Saccharomycetes species. We then linked all genes displaying similar motifs in their promoter regions and inferred a co-regulation network including 56,919 links between 3171 genes. Comparison with annotated regulons highlights the high predictive value of the method: a majority of the top-scoring predictions correspond to already known co-regulations. We also show that this inferred network is as accurate as a co-expression network built from hundreds of transcriptome microarray experiments. Furthermore, we experimentally validated 14 among 16 new functional links between orphan genes and known regulons. This approach can be readily applied to unravel gene regulatory networks from hundreds of microbial genomes for which no other information is available except the sequence. Long-term benefits can easily be perceived when considering the exponential increase of new genome sequences.
Affiliation(s)
- Sylvain Brohée
- Lab. Bioinformatique des Génomes et des Réseaux (BiGRe), Université Libre de Bruxelles (ULB), CP 263, Campus Plaine, Bld du Triomphe, 1050 Brussels, Belgium
48
Li XY, Thomas S, Sabo PJ, Eisen MB, Stamatoyannopoulos JA, Biggin MD. The role of chromatin accessibility in directing the widespread, overlapping patterns of Drosophila transcription factor binding. Genome Biol 2011; 12:R34. [PMID: 21473766 PMCID: PMC3218860 DOI: 10.1186/gb-2011-12-4-r34] [Citation(s) in RCA: 156] [Impact Index Per Article: 11.1] [Received: 03/19/2011] [Accepted: 04/07/2011] [Indexed: 12/11/2022]
Abstract
Background: In Drosophila embryos, many biochemically and functionally unrelated transcription factors bind quantitatively to highly overlapping sets of genomic regions, with much of the lowest levels of binding being incidental, non-functional interactions on DNA. The primary biochemical mechanisms that drive these genome-wide occupancy patterns have yet to be established. Results: Here we use data resulting from the DNaseI digestion of isolated embryo nuclei to provide a biophysical measure of the degree to which proteins can access different regions of the genome. We show that the in vivo binding patterns of 21 developmental regulators are quantitatively correlated with DNA accessibility in chromatin. Furthermore, we find that levels of factor occupancy in vivo correlate much more with the degree of chromatin accessibility than with occupancy predicted from in vitro affinity measurements using purified protein and naked DNA. Within accessible regions, however, the intrinsic affinity of the factor for DNA does play a role in determining net occupancy, with even weak affinity recognition sites contributing. Finally, we show that programmed changes in chromatin accessibility between different developmental stages correlate with quantitative alterations in factor binding. Conclusions: Based on these and other results, we propose a general mechanism to explain the widespread, overlapping DNA binding by animal transcription factors. In this view, transcription factors are expressed at sufficiently high concentrations in cells such that they can occupy their recognition sequences in highly accessible chromatin without the aid of physical cooperative interactions with other proteins, leading to highly overlapping, graded binding of unrelated factors.
Affiliation(s)
- Xiao-Yong Li
- Genomics Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road MS 84-171, Berkeley, CA 94720, USA
49
Kim TM, Park PJ. Advances in analysis of transcriptional regulatory networks. Wiley Interdiscip Rev Syst Biol Med 2011; 3:21-35. [PMID: 21069662 DOI: 10.1002/wsbm.105] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Indexed: 02/05/2023]
Abstract
A transcriptional regulatory network represents a molecular framework in which developmental or environmental cues are transformed into differential expression of genes. Transcriptional regulation is mediated by the combinatorial interplay between cis-regulatory DNA elements and trans-acting transcription factors, and is perhaps the most important mechanism for controlling gene expression. Recent innovations, most notably the method for detecting protein-DNA interactions genome-wide, can help provide a comprehensive catalog of cis-regulatory elements and their interaction with given trans-acting factors in a given condition. A transcriptional regulatory network that integrates such information can lead to a systems-level understanding of regulatory mechanisms. In this review, we will highlight the key aspects of current knowledge on eukaryotic transcriptional regulation, especially on known transcription factors and their interacting regulatory elements. Then we will review some recent technical advances for genome-wide mapping of DNA-protein interactions based on high-throughput sequencing. Finally, we will discuss the types of biological insights that can be obtained from a network-level understanding of transcription regulation as well as future challenges in the field.
Affiliation(s)
- Tae-Min Kim
- Center for Biomedical Informatics, Harvard Medical School, Boston, MA, USA
50
PCR DNA-array profiling of DNA-binding transcription factor activities in adult mouse tissues. Methods Mol Biol 2011; 687:319-31. [PMID: 20967619 DOI: 10.1007/978-1-60761-944-4_23] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 02/19/2023]
Abstract
Differential gene expression is tightly controlled by transcription factors (TFs), which bind close to target genes and interact together to activate and coregulate transcription. Bioinformatics analysis of published genome-wide gene expression data has allowed the development of comprehensive models of TFs likely to be active in particular tissues (signature TFs); however, the predicted activities of many of the TFs have not been experimentally confirmed. Here, we describe methods for the parallel analysis of the activities of more than 200 transcription factor proteins, using an advanced oligonucleotide array-based transcription factor assay (OATFA) platform, to assay TF activities in mice. The system uses a PCR-based system to translate cellular levels of target DNA-TF complex into a dye-tagged DNA signal, which is read by the developed microarray. The PCR step introduces semiquantitative amplification of the represented TF binding sequences. Experimental OATFA findings can identify many TF activities, which bioinformatics profiling does not predict. Newly identified TF activities can be confirmed by antibody-ELISA against active TFs. The PCR-based OATFA microarray analysis is a comprehensive method that can be used to reveal transcriptional systems and pathways which may function in different mammalian tissues and cells.