1
Moldovan M, Gelfand MS. Phospho-islands and the evolution of phosphorylated amino acids in mammals. PeerJ 2020; 8:e10436. [PMID: 33344082 PMCID: PMC7718798 DOI: 10.7717/peerj.10436] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/03/2020] [Accepted: 11/06/2020] [Indexed: 01/23/2023]
Abstract
Background: Protein phosphorylation is the best-studied post-translational modification and strongly influences protein function. Phosphorylated amino acids not only differ in physico-chemical properties from their non-phosphorylated counterparts, but also exhibit different evolutionary patterns, tending to mutate to and originate from negatively charged amino acids (NCAs). The distribution of phosphosites along protein sequences is non-uniform, as phosphosites tend to cluster, forming so-called phospho-islands. Methods: Here, we have developed a hidden Markov model-based procedure for the identification of phospho-islands and studied the properties of the obtained phosphorylation clusters. To check the robustness of the evolutionary analysis, we consider different models for the reconstruction of ancestral phosphorylation states. Results: Clustered phosphosites differ from individual phosphosites in several functional and evolutionary aspects, including underrepresentation of phosphotyrosines, higher conservation, and more frequent mutations to NCAs. The spectrum of tissues, the frequencies of specific phosphorylation contexts, and the mutational patterns observed near clustered sites also differ.
Affiliation(s)
- Mikhail S Gelfand
- Skolkovo Institute of Science and Technology, Moscow, Russia; A. A. Kharkevich Institute for Information Transmission Problems, Moscow, Russia
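The abstract above describes a hidden Markov model-based procedure for finding phospho-islands but gives no details of the model. As a rough illustration of the general idea only, the sketch below decodes a two-state HMM (background vs. island) over a 0/1 phosphosite track with the Viterbi algorithm. The function name `viterbi_islands` and all probabilities are hypothetical choices made for this example, not parameters from the paper.

```python
import math

def viterbi_islands(obs, p_site_bg=0.05, p_site_island=0.5, p_stay=0.9):
    """Label each position 0 (background) or 1 (island) with a two-state HMM.

    obs is a sequence of 0/1 flags (1 = phosphosite at that residue).
    All probabilities are illustrative assumptions, not fitted values.
    """
    log = math.log
    # Emission probabilities per state: P(observation | state).
    emit = [
        {0: 1 - p_site_bg, 1: p_site_bg},          # background state
        {0: 1 - p_site_island, 1: p_site_island},  # island state
    ]
    # Symmetric transition log-probabilities: stay vs. switch.
    trans = [[log(p_stay), log(1 - p_stay)],
             [log(1 - p_stay), log(p_stay)]]
    # Uniform initial distribution over the two states.
    score = [log(0.5) + log(emit[s][obs[0]]) for s in (0, 1)]
    back = []
    for o in obs[1:]:
        new_score, ptr = [], []
        for s in (0, 1):
            cands = [score[prev] + trans[prev][s] for prev in (0, 1)]
            best_prev = 0 if cands[0] >= cands[1] else 1
            new_score.append(cands[best_prev] + log(emit[s][o]))
            ptr.append(best_prev)
        score = new_score
        back.append(ptr)
    # Trace back the most probable state path from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

With these assumed settings, a dense run of sites such as `[1, 1, 0, 1, 1, 1, 0, 1]` flanked by long stretches of zeros is labeled as a single island, and the isolated zeros inside the run are absorbed into it because leaving and re-entering the island state costs more than emitting a non-site.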
2
Tobias IC, Abatti LE, Moorthy SD, Mullany S, Taylor T, Khader N, Filice MA, Mitchell JA. Transcriptional enhancers: from prediction to functional assessment on a genome-wide scale. Genome 2020; 64:426-448. [PMID: 32961076 DOI: 10.1139/gen-2020-0104] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/20/2022]
Abstract
Enhancers are cis-regulatory sequences located distally to target genes. These sequences consolidate developmental and environmental cues to coordinate gene expression in a tissue-specific manner. Enhancer function and tissue specificity depend on the expressed set of transcription factors, which recognize binding sites and recruit cofactors that regulate local chromatin organization and gene transcription. Unlike other genomic elements, enhancers are challenging to identify because they function independently of orientation, are often distant from their promoters, have poorly defined boundaries, and display no reading frame. In addition, there are no defined genetic or epigenetic features that are unambiguously associated with enhancer activity. Over recent years there have been developments in both empirical assays and computational methods for enhancer prediction. We review genome-wide tools, CRISPR advancements, and high-throughput screening approaches that have improved our ability to both observe and manipulate enhancers in vitro at the level of primary genetic sequences, chromatin states, and spatial interactions. We also highlight contemporary animal models and their importance to enhancer validation. Together, these experimental systems and techniques complement one another and broaden our understanding of enhancer function in development, evolution, and disease.
Affiliation(s)
- Ian C Tobias
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Luis E Abatti
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Sakthi D Moorthy
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Shanelle Mullany
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Tiegh Taylor
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Nawrah Khader
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Mario A Filice
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Jennifer A Mitchell
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
3
Darnell CL, Schmid AK. Systems biology approaches to defining transcription regulatory networks in halophilic archaea. Methods 2015; 86:102-14. [PMID: 25976837 DOI: 10.1016/j.ymeth.2015.04.034] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 03/09/2015] [Revised: 04/27/2015] [Accepted: 04/28/2015] [Indexed: 12/31/2022]
Abstract
To survive complex and changing environmental conditions, microorganisms use gene regulatory networks (GRNs) composed of interacting regulatory transcription factors (TFs) to control the timing and magnitude of gene expression. Genome-wide datasets, such as transcriptomics and protein-DNA interactions, and experiments such as high-throughput growth curves facilitate the construction of GRNs and provide insight into TF interactions occurring under stress. Systems biology approaches integrate these datasets into models of GRN architecture as well as statistical and/or dynamical models to understand the function of networks occurring in cells. Previously, these types of studies have focused on traditional model organisms (e.g. Escherichia coli, yeast). However, recent advances in archaeal genetics and other tools have enabled a systems approach to understanding GRNs in these relatively less studied archaeal model organisms. In this report, we outline a systems biology workflow for generating and integrating data focusing on the TF regulator. We discuss experimental design, outline the process of data collection, and provide the tools required to produce high-confidence regulons for the TFs of interest. We provide a case study as an example of this workflow, describing the construction of a GRN centered on multi-TF coordinate control of gene expression governing the oxidative stress response in the hypersaline-adapted archaeon Halobacterium salinarum.
Affiliation(s)
- Amy K Schmid
- Biology Department, Duke University, Durham, NC 27708, USA; Center for Systems Biology, Duke University, Durham, NC 27708, USA.
4
Tonner PD, Pittman AMC, Gulli JG, Sharma K, Schmid AK. A regulatory hierarchy controls the dynamic transcriptional response to extreme oxidative stress in archaea. PLoS Genet 2015; 11:e1004912. [PMID: 25569531 PMCID: PMC4287449 DOI: 10.1371/journal.pgen.1004912] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 09/16/2014] [Accepted: 11/20/2014] [Indexed: 12/21/2022]
Abstract
Networks of interacting transcription factors are central to the regulation of cellular responses to abiotic stress. Although the architecture of many such networks has been mapped, their dynamic function remains unclear. Here we address this challenge in archaea, microorganisms possessing transcription factors that resemble those of both eukaryotes and bacteria. Using genome-wide DNA binding location analysis integrated with gene expression and cell physiological data, we demonstrate that a bacterial-type transcription factor (TF), called RosR, and five TFIIB proteins, homologs of eukaryotic TFs, combinatorially regulate over 100 target genes important for the response to extremely high levels of peroxide. These genes include 20 other transcription factors and oxidative damage repair genes. RosR promoter occupancy is surprisingly dynamic, with the pattern of target gene expression during the transition from rapid growth to stress correlating strongly with the pattern of dynamic binding. We conclude that a hierarchical regulatory network orchestrated by TFs of hybrid lineage enables dynamic response and survival under extreme stress in archaea. This raises questions regarding the evolutionary trajectory of gene networks in response to stress.
Complex circuits of genes rather than a single gene underlie many important processes such as disease, development, and cellular damage repair. Although the wiring of many of these circuits has been mapped, how circuits operate in real time to carry out their functions is poorly understood. Here we address these questions by investigating the function of a gene circuit that responds to reactive oxygen species damage in archaea, microorganisms that represent the third domain of life. Members of this domain of life are excellent models for investigating the function and evolution of gene circuits. Components of archaeal regulatory machinery driving gene circuits resemble those of both bacteria and eukaryotes. Here we demonstrate that regulatory proteins of hybrid ancestry collaborate to control the expression of over 100 genes whose products repair cellular damage. Among these are other regulatory proteins, setting up a stepwise hierarchical circuit that controls damage repair. Regulation is dynamic, with gene targets showing immediate response to damage and restoring normal cellular functions soon thereafter. This study demonstrates how strong environmental forces such as stress may have shaped the wiring and dynamic function of gene circuits, raising important questions regarding how circuits originated over evolutionary time.
Affiliation(s)
- Peter D. Tonner
- Computational Biology and Bioinformatics Graduate Program, Duke University, Durham, North Carolina, United States of America
- Biology Department, Duke University, Durham, North Carolina, United States of America
- Jordan G. Gulli
- Biology Department, Duke University, Durham, North Carolina, United States of America
- Kriti Sharma
- Biology Department, Duke University, Durham, North Carolina, United States of America
- Amy K. Schmid
- Computational Biology and Bioinformatics Graduate Program, Duke University, Durham, North Carolina, United States of America
- Biology Department, Duke University, Durham, North Carolina, United States of America
- Center for Systems Biology, Duke University, Durham, North Carolina, United States of America
5
Plaisier CL, Lo FY, Ashworth J, Brooks AN, Beer KD, Kaur A, Pan M, Reiss DJ, Facciotti MT, Baliga NS. Evolution of context dependent regulation by expansion of feast/famine regulatory proteins. BMC Systems Biology 2014; 8:122. [PMID: 25394904 PMCID: PMC4236453 DOI: 10.1186/s12918-014-0122-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 07/02/2014] [Accepted: 10/16/2014] [Indexed: 11/25/2022]
Abstract
Background: Expansion of transcription factors is believed to have played a crucial role in the evolution of all organisms by enabling them to deal with dynamic environments and colonize new environments. We investigated how the expansion of the Feast/Famine Regulatory Protein (FFRP) or Lrp-like proteins into an eight-member family in Halobacterium salinarum NRC-1 has aided in niche-adaptation of this archaeon to a complex and dynamically changing hypersaline environment. Results: We mapped genome-wide binding locations for all eight FFRPs, investigated their preference for binding different effector molecules, and identified the contexts in which they act by analyzing transcriptional responses across 35 growth conditions that mimic different environmental and nutritional conditions this organism is likely to encounter in the wild. Integrative analysis of these data constructed an FFRP regulatory network with conditionally active states that reveal how interrelated variations in DNA-binding domains, effector-molecule preferences, and binding sites in target gene promoters have tuned the functions of each FFRP to the environments in which they act. We demonstrate how conditional regulation of similar genes by two FFRPs, AsnC (an activator) and VNG1237C (a repressor), has striking environment-specific fitness consequences for oxidative stress management and growth, respectively. Conclusions: This study provides a systems perspective into the evolutionary process by which gene duplication within a transcription factor family contributes to environment-specific adaptation of an organism.
Affiliation(s)
- Fang-Yin Lo
- Institute for Systems Biology, Seattle, WA, USA; Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA
- Aaron N Brooks
- Institute for Systems Biology, Seattle, WA, USA; Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA
- Karlyn D Beer
- Institute for Systems Biology, Seattle, WA, USA; Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA
- Min Pan
- Institute for Systems Biology, Seattle, WA, USA
- Marc T Facciotti
- Department of Biomedical Engineering, University of California, Davis, CA, USA; Genome Center, University of California, Davis, CA, USA
- Nitin S Baliga
- Institute for Systems Biology, Seattle, WA, USA; Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA; Department of Microbiology, University of Washington, Seattle, WA, USA; Department of Biology, University of Washington, Seattle, WA, USA
6
Ashworth J, Plaisier CL, Lo FY, Reiss DJ, Baliga NS. Inference of expanded Lrp-like feast/famine transcription factor targets in a non-model organism using protein structure-based prediction. PLoS One 2014; 9:e107863. [PMID: 25255272 PMCID: PMC4177876 DOI: 10.1371/journal.pone.0107863] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 04/15/2014] [Accepted: 08/16/2014] [Indexed: 11/18/2022]
Abstract
Widespread microbial genome sequencing presents an opportunity to understand the gene regulatory networks of non-model organisms. This requires knowledge of the binding sites for transcription factors whose DNA-binding properties are unknown or difficult to infer. We adapted a protein structure-based method to predict the specificities and putative regulons of homologous transcription factors across diverse species. As a proof-of-concept we predicted the specificities and transcriptional target genes of divergent archaeal feast/famine regulatory proteins, several of which are encoded in the genome of Halobacterium salinarum. This was validated by comparison to experimentally determined specificities for transcription factors in distantly related extremophiles, chromatin immunoprecipitation experiments, and cis-regulatory sequence conservation across eighteen related species of halobacteria. Through this analysis we were able to infer that Halobacterium salinarum employs a divergent local trans-regulatory strategy to regulate genes (carA and carB) involved in arginine and pyrimidine metabolism, whereas Escherichia coli employs an operon. The prediction of gene regulatory binding sites using structure-based methods is useful for the inference of gene regulatory relationships in new species that are otherwise difficult to infer.
Affiliation(s)
- Justin Ashworth
- Institute for Systems Biology, Seattle, Washington, United States of America
- Fang Yin Lo
- Institute for Systems Biology, Seattle, Washington, United States of America
- David J. Reiss
- Institute for Systems Biology, Seattle, Washington, United States of America
- Nitin S. Baliga
- Institute for Systems Biology, Seattle, Washington, United States of America
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
7
de Rooi JJ, Ruckebusch C, Eilers PHC. Sparse deconvolution in one and two dimensions: applications in endocrinology and single-molecule fluorescence imaging. Anal Chem 2014; 86:6291-8. [PMID: 24893114 DOI: 10.1021/ac500260h] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 01/20/2023]
Abstract
Deconvolution of noisy signals is an important task in analytical chemistry, examples being spectral deconvolution or deconvolution in microscopy. When the number of spectral peaks or single emitters in imaging is limited, the solution of the deconvolution is required to be sparse, and desirable results are obtained using penalized estimation techniques. We impose sparseness by using penalized regression with a penalty based on the L0-norm, as discussed in earlier work. Several extensions to this approach are presented. Results are demonstrated on pulse identification in endocrine data, where the aim is to model the secretion pattern as a sparse series of spikes. An application in single-molecule fluorescence imaging demonstrates the algorithm when applied to two-dimensional data.
Affiliation(s)
- Johan J de Rooi
- Department of Biostatistics, Erasmus Medical Center, Dr. Molewaterplein 50, 3015GE Rotterdam, The Netherlands
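The abstract above mentions penalized regression with an L0-norm-based penalty for sparse deconvolution. One common way to approximate such a penalty is iteratively reweighted ridge regression, where each coefficient's quadratic penalty is weighted by 1/(x_j^2 + eps^2). The one-dimensional sketch below illustrates that idea under stated assumptions: the function names, the kernel handling, and the values of `kappa`, `eps`, and `n_iter` are hypothetical choices for this example and are not taken from the paper.

```python
import math

def conv_matrix(n, kernel):
    """C[i][j] = contribution of a unit spike at j to the observation at i."""
    half = len(kernel) // 2
    return [[kernel[i - j + half] if 0 <= i - j + half < len(kernel) else 0.0
             for j in range(n)] for i in range(n)]

def convolve(x, kernel):
    """Blur a spike train x with an odd-length, centered kernel."""
    C = conv_matrix(len(x), kernel)
    return [sum(cij * xj for cij, xj in zip(row, x)) for row in C]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def sparse_deconvolve(y, kernel, kappa=0.01, eps=1e-3, n_iter=20):
    """Approximate L0-penalized deconvolution by reweighted ridge regression."""
    n = len(y)
    C = conv_matrix(n, kernel)
    CtC = [[sum(C[k][i] * C[k][j] for k in range(n)) for j in range(n)]
           for i in range(n)]
    Cty = [sum(C[k][i] * y[k] for k in range(n)) for i in range(n)]
    x = [1.0] * n  # neutral start: all weights begin close to 1
    for _ in range(n_iter):
        # Small coefficients get a large penalty weight, large ones a small weight.
        w = [1.0 / (xj * xj + eps * eps) for xj in x]
        A = [[CtC[i][j] + (kappa * w[i] if i == j else 0.0) for j in range(n)]
             for i in range(n)]
        x = solve(A, Cty)
    return x
```

As a usage example, blurring a single unit spike with a Gaussian kernel via `convolve` and passing the result to `sparse_deconvolve` should return a vector whose largest coefficient sits at the spike position. The pure-Python linear solver keeps the sketch dependency-free; a numerical library would be the practical choice for real signals.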
8
Rezaeian I, Rueda L. CMT: a constrained multi-level thresholding approach for ChIP-Seq data analysis. PLoS One 2014; 9:e93873. [PMID: 24736605 PMCID: PMC3988018 DOI: 10.1371/journal.pone.0093873] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/22/2013] [Accepted: 03/11/2014] [Indexed: 01/22/2023]
Abstract
Genome-wide profiling of DNA-binding proteins using ChIP-Seq has emerged as an alternative to ChIP-chip methods. ChIP-Seq technology offers many advantages over ChIP-chip arrays, including but not limited to less noise, higher resolution, and more coverage. Several algorithms have been developed to take advantage of these abilities and find enriched regions by analyzing ChIP-Seq data. However, the complexity of analyzing various patterns of ChIP-Seq signals still needs the development of new algorithms. Most current algorithms use various heuristics to detect regions accurately. However, despite how many formulations are available, it is still difficult to accurately determine individual peaks corresponding to each binding event. We developed Constrained Multi-level Thresholding (CMT), an algorithm used to detect enriched regions on ChIP-Seq data. CMT employs a constraint-based module that can target regions within a specific range. We show that CMT has higher accuracy in detecting enriched regions (peaks) by objectively assessing its performance relative to other previously proposed peak finders. This is shown by testing three algorithms on the well-known FoxA1 Data set, four transcription factors (with a total of six antibodies) for Drosophila melanogaster and the H3K4ac antibody dataset.
Affiliation(s)
- Iman Rezaeian
- School of Computer Science, University of Windsor, Windsor, Ontario, Canada
- Luis Rueda
- School of Computer Science, University of Windsor, Windsor, Ontario, Canada
9
Mendoza-Parra MA, Nowicka M, Van Gool W, Gronemeyer H. Characterising ChIP-seq binding patterns by model-based peak shape deconvolution. BMC Genomics 2013; 14:834. [PMID: 24279297 PMCID: PMC4046686 DOI: 10.1186/1471-2164-14-834] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 08/12/2013] [Accepted: 11/20/2013] [Indexed: 12/01/2022]
Abstract
Background: Chromatin immunoprecipitation combined with massive parallel sequencing (ChIP-seq) is widely used to study protein-chromatin interactions or chromatin modifications at the genome-wide level. Sequence reads that accumulate locally at the genome (peaks) reveal loci of selectively modified chromatin or specific sites of chromatin-binding factors. Computational approaches (peak callers) have been developed to identify the global pattern of these sites, most of which assess the deviation from background by applying distribution statistics. Results: We have implemented MeDiChISeq, a regression-based approach which, by following a learning process, defines a representative binding pattern from the investigated ChIP-seq dataset. Using this model, MeDiChISeq identifies significant genome-wide patterns of chromatin-bound factors or chromatin modification. MeDiChISeq has been validated for various publicly available ChIP-seq datasets and extensively compared with other peak callers. Conclusions: MeDiChISeq has a high resolution when identifying binding events, a high degree of peak-assessment reproducibility in biological replicates, a low level of false calls and a high true discovery rate when evaluated in the context of gold-standard benchmark datasets. Importantly, this approach can be applied not only to ‘sharp’ binding patterns, like those retrieved for transcription factors (TFs), but also to the broad binding patterns seen for several histone modifications. Notably, we show that at high sequencing depths, MeDiChISeq outperforms other algorithms due to its powerful peak shape recognition capacity, which facilitates discerning significant binding events from spurious background enrichment patterns that are enhanced with increased sequencing depths.
Affiliation(s)
- Marco-Antonio Mendoza-Parra
- Equipe Labellisée Ligue Contre le Cancer, Department of Functional Genomics and Cancer, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC)/CNRS/INSERM/Université de Strasbourg, BP 10142, Illkirch Cedex 67404, France.
10
Danziger SA, Ratushny AV, Smith JJ, Saleem RA, Wan Y, Arens CE, Armstrong AM, Sitko K, Chen WM, Chiang JH, Reiss DJ, Baliga NS, Aitchison JD. Molecular mechanisms of system responses to novel stimuli are predictable from public data. Nucleic Acids Res 2013; 42:1442-60. [PMID: 24185701 PMCID: PMC3919619 DOI: 10.1093/nar/gkt938] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 12/20/2022]
Abstract
Systems-scale models provide the foundation for an effective iterative cycle between hypothesis generation, experiment and model refinement. Such models also enable predictions facilitating the understanding of biological complexity and the control of biological systems. Here, we demonstrate the reconstruction of a globally predictive gene regulatory model from public data: a model that can drive rational experiment design and reveal new regulatory mechanisms underlying responses to novel environments. Specifically, using ~1500 publicly available genome-wide transcriptome data sets from Saccharomyces cerevisiae, we have reconstructed an environment and gene regulatory influence network that accurately predicts regulatory mechanisms and gene expression changes on exposure of cells to completely novel environments. Focusing on transcriptional networks that induce peroxisome biogenesis, the model-guided experiments allow us to expand a core regulatory network to include novel transcriptional influences and linkage across signaling and transcription. Thus, the approach and model provide a multi-scalar picture of gene dynamics and are powerful resources for exploiting extant data to rationally guide experimentation. The techniques outlined here are generally applicable to any biological system, which is especially important when experimental systems are challenging and samples are difficult and expensive to obtain, a common problem in laboratory animal and human studies.
Affiliation(s)
- Samuel A Danziger
- Seattle Biomedical Research Institute, Seattle, WA 98109-5219 USA, Institute for Systems Biology, Seattle, WA 98109-5240 USA, The Key Laboratory of Developmental Genes and Human Disease, Ministry of Education, Institute of Life Science, Southeast University, Nanjing 210096, China and Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 704, Taiwan
11
Mendoza-Parra MA, Van Gool W, Mohamed Saleem MA, Ceschin DG, Gronemeyer H. A quality control system for profiles obtained by ChIP sequencing. Nucleic Acids Res 2013; 41:e196. [PMID: 24038469 PMCID: PMC3834836 DOI: 10.1093/nar/gkt829] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Indexed: 11/13/2022]
Abstract
The absence of a quality control (QC) system is a major weakness for the comparative analysis of genome-wide profiles generated by next-generation sequencing (NGS). This concerns particularly genome binding/occupancy profiling assays like chromatin immunoprecipitation (ChIP-seq) but also related enrichment-based studies like methylated DNA immunoprecipitation/methylated DNA binding domain sequencing, global run on sequencing or RNA-seq. Importantly, QC assessment may significantly improve multidimensional comparisons that have great promise for extracting information from combinatorial analyses of the global profiles established for chromatin modifications, the bindings of epigenetic and chromatin-modifying enzymes/machineries, RNA polymerases and transcription factors and total, nascent or ribosome-bound RNAs. Here we present an approach that associates global and local QC indicators to ChIP-seq data sets as well as to a variety of enrichment-based studies by NGS. This QC system was used to certify >5600 publicly available data sets, hosted in a database for data mining and comparative QC analyses.
Affiliation(s)
- Marco-Antonio Mendoza-Parra
- Department of Cancer Biology, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC)/CNRS/INSERM/Université de Strasbourg, BP 10142, 67404 Illkirch Cedex, France
12
Guanghua X, Xinlei W, Quincey L, Nestler EJ, Xie Y. Detection of epigenetic changes using ANOVA with spatially varying coefficients. Stat Appl Genet Mol Biol 2013; 12:189-205. [PMID: 23502341 DOI: 10.1515/sagmb-2012-0057] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/12/2022]
Abstract
Identification of genome-wide epigenetic changes, the stable changes in gene function without a change in DNA sequence, under various conditions plays an important role in biomedical research. High-throughput epigenetic experiments are useful tools to measure genome-wide epigenetic changes, but the measured intensity levels from these high-resolution genome-wide epigenetic profiling data are often spatially correlated with high noise levels. In addition, it is challenging to detect genome-wide epigenetic changes across multiple conditions, so efficient statistical methodology development is needed for this purpose. In this study, we consider ANOVA models with spatially varying coefficients, combined with a hierarchical Bayesian approach, to explicitly model spatial correlation caused by location-dependent biological effects (i.e., epigenetic changes) and borrow strength among neighboring probes to compare epigenetic changes across multiple conditions. Through simulation studies and applications in drug addiction and depression datasets, we find that our approach compares favorably with competing methods; it is more efficient in estimation and more effective in detecting epigenetic changes. In addition, it can provide biologically meaningful results.
Affiliation(s)
- Xiao Guanghua
- Division of Biostatistics, Department of Clinical Sciences, The University of Texas Southwestern Medical Center at Dallas, TX 75390, USA
13
Wang X, Zang M, Xiao G. Epigenetic change detection and pattern recognition via Bayesian hierarchical hidden Markov models. Stat Med 2012; 32:2292-307. [PMID: 23097332 DOI: 10.1002/sim.5658] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 03/13/2012] [Revised: 07/16/2012] [Accepted: 09/19/2012] [Indexed: 11/09/2022]
Abstract
Epigenetics is the study of changes to the genome that can switch genes on or off and determine which proteins are transcribed without altering the DNA sequence. Recently, epigenetic changes have been linked to the development and progression of diseases such as psychiatric disorders. High-throughput epigenetic experiments have enabled researchers to measure genome-wide epigenetic profiles and yield data consisting of intensity ratios of immunoprecipitation versus reference samples. The intensity ratios can provide a view of genomic regions where protein binding occurs under one experimental condition and further allow us to detect epigenetic alterations through comparison between two different conditions. However, such experiments can be expensive, with only a few replicates available. Moreover, epigenetic data are often spatially correlated with high noise levels. In this paper, we develop a Bayesian hierarchical model, combined with hidden Markov processes with four states for modeling spatial dependence, to detect genomic sites with epigenetic changes from two-sample experiments with paired internal control. One attractive feature of the proposed method is that the four states of the hidden Markov process have well-defined biological meanings and allow us to directly call the change patterns based on the corresponding posterior probabilities. In contrast, none of the existing methods can offer this advantage. In addition, the proposed method offers great power in statistical inference by spatial smoothing (via hidden Markov modeling) and information pooling (via hierarchical modeling). Both simulation studies and real data analysis in a cocaine addiction study illustrate the reliability and success of this method.
Affiliation(s)
- Xinlei Wang
- Department of Statistical Science, Southern Methodist University, Dallas, TX 75275, USA
14
Wang X, Zhang X. Pinpointing transcription factor binding sites from ChIP-seq data with SeqSite. BMC Systems Biology 2011; 5 Suppl 2:S3. [PMID: 22784574 PMCID: PMC3287483 DOI: 10.1186/1752-0509-5-s2-s3] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/15/2023]
Abstract
Background: Chromatin immunoprecipitation combined with next-generation DNA sequencing technologies (ChIP-seq) has become a key approach for detecting genome-wide sets of genomic sites bound by proteins, such as transcription factors (TFs). Several methods and open-source tools have been developed to analyze ChIP-seq data. However, most of them are designed for detecting TF binding regions instead of accurately locating transcription factor binding sites (TFBSs). It is still challenging to pinpoint TFBSs directly from ChIP-seq data, especially in regions with closely spaced binding events. Results: With the aim of pinpointing TFBSs at a high resolution, we propose a novel method named SeqSite, implementing a two-step strategy: detecting tag-enriched regions first and pinpointing binding sites in the detected regions. The second step is done by modeling the tag density profile, locating TFBSs on each strand with a least-squares model fitting strategy, and merging the detections from the two strands. Experiments on simulation data show that SeqSite can locate most of the binding sites that are more than 40 bp apart. Applications on three human TF ChIP-seq datasets demonstrate the advantage of SeqSite for its higher resolution in pinpointing binding sites compared with existing methods. Conclusions: We have developed a computational tool named SeqSite, which can pinpoint both closely spaced and isolated binding sites, and consequently improves the resolution of TFBS detection from ChIP-seq data.
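Editorial illustration (not SeqSite's actual implementation): the first step named in the abstract, detecting tag-enriched regions, can be sketched generically as a sliding-window count with a fixed threshold. The function name, the window and step sizes, and the `min_tags` cutoff are hypothetical choices for this example. The second step (least-squares fitting of the tag density profile) is not attempted here.

```python
from bisect import bisect_left


def enriched_regions(tag_positions, window=100, step=50, min_tags=5):
    """Return merged half-open (start, end) regions holding >= min_tags tags.

    Assumes non-negative integer tag coordinates on a single chromosome.
    """
    tags = sorted(tag_positions)
    regions = []
    if not tags:
        return regions
    pos = 0
    while pos <= tags[-1]:
        # Count tags falling in the half-open window [pos, pos + window).
        count = bisect_left(tags, pos + window) - bisect_left(tags, pos)
        if count >= min_tags:
            if regions and pos <= regions[-1][1]:
                # Overlapping or adjacent window: extend the previous region.
                regions[-1] = (regions[-1][0], pos + window)
            else:
                regions.append((pos, pos + window))
        pos += step
    return regions
```

A real peak caller would typically compare counts against a control sample or a background model instead of a fixed cutoff; the fixed threshold here only keeps the sketch short.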
Affiliation(s)
- Xi Wang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST / Department of Automation, Tsinghua University, Beijing 100084, China
15
Mendoza-Parra MA, Sankar M, Walia M, Gronemeyer H. POLYPHEMUS: R package for comparative analysis of RNA polymerase II ChIP-seq profiles by non-linear normalization. Nucleic Acids Res 2011; 40:e30. [PMID: 22156059 PMCID: PMC3287170 DOI: 10.1093/nar/gkr1205] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022] [Open Access]
Abstract
Chromatin immunoprecipitation coupled with massive parallel sequencing (ChIP-seq) is increasingly used to map protein–chromatin interactions at global scale. The comparison of ChIP-seq profiles for RNA polymerase II (PolII) established in different biological contexts, such as specific developmental stages or specific time-points during cell differentiation, provides not only information about the presence/accumulation of PolII at transcription start sites (TSSs) but also about functional features of transcription, including PolII stalling, pausing and transcript elongation. However, annotation and normalization tools for comparative studies of multiple samples are currently missing. Here, we describe the R-package POLYPHEMUS, which integrates TSS annotation with PolII enrichment over TSSs and coding regions, and normalizes signal intensity profiles. POLYPHEMUS thereby facilitates the extraction of information about global PolII action to reveal changes in the functional state of genes. We validated POLYPHEMUS using a kinetic study on retinoic acid-induced differentiation and a publicly available data set from a comparative PolII ChIP-seq profiling in Caenorhabditis elegans. We demonstrate that POLYPHEMUS corrects the data sets by normalizing for technical variation between samples and reveal the potential of the algorithm in comparing multiple data sets to infer features of transcription regulation from dynamic PolII binding profiles.
Affiliation(s)
- Marco A Mendoza-Parra
- Department of Cancer Biology, Institut de Génétique et de Biologie Moléculaire et Cellulaire/CNRS/INSERM/Université de Strasbourg, BP 10142, 67404 Illkirch Cedex, France.
16
Turkarslan S, Reiss DJ, Gibbins G, Su WL, Pan M, Bare JC, Plaisier CL, Baliga NS. Niche adaptation by expansion and reprogramming of general transcription factors. Mol Syst Biol 2011; 7:554. [PMID: 22108796 PMCID: PMC3261711 DOI: 10.1038/msb.2011.87] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Received: 08/11/2011] [Accepted: 10/25/2011] [Indexed: 02/01/2023] [Open Access]
Abstract
Numerous lineage-specific expansions of the transcription factor B (TFB) family in archaea suggest an important role for expanded TFBs in encoding environment-specific gene regulatory programs. Given the characteristics of hypersaline lakes, the unusually large numbers of TFBs in halophilic archaea further suggest that they might be especially important in rapid adaptation to the challenges of a dynamically changing environment. Motivated by these observations, we have investigated the implications of TFB expansions by correlating sequence variations, regulation, and physical interactions of all seven TFBs in Halobacterium salinarum NRC-1 to their fitness landscapes, functional hierarchies, and genetic interactions across 2488 experiments covering combinatorial variations in salt, pH, temperature, and Cu stress. This systems analysis has revealed an elegant scheme in which completely novel fitness landscapes are generated by gene conversion events that introduce subtle changes to the regulation or physical interactions of duplicated TFBs. Based on these insights, we have introduced a synthetically redesigned TFB and altered the regulation of existing TFBs to illustrate how archaea can rapidly generate novel phenotypes by simply reprogramming their TFB regulatory network.
Affiliation(s)
- David J Reiss
- Baliga Lab, Institute for Systems Biology, Seattle, WA, USA
- Wan Lin Su
- Baliga Lab, Institute for Systems Biology, Seattle, WA, USA
- Min Pan
- Baliga Lab, Institute for Systems Biology, Seattle, WA, USA
- Nitin S Baliga
- Baliga Lab, Institute for Systems Biology, Seattle, WA, USA
- Department of Microbiology, University of Washington, Seattle, WA, USA
- Department of Biology, Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA
17
Mendoza-Parra MA, Walia M, Sankar M, Gronemeyer H. Dissecting the retinoid-induced differentiation of F9 embryonal stem cells by integrative genomics. Mol Syst Biol 2011; 7:538. [PMID: 21988834 PMCID: PMC3261707 DOI: 10.1038/msb.2011.73] [Citation(s) in RCA: 67] [Impact Index Per Article: 5.2] [Received: 03/03/2011] [Accepted: 08/20/2011] [Indexed: 01/11/2023] [Open Access]
Abstract
Retinoic acid (RA) triggers physiological processes by activating heterodimeric transcription factors (TFs) comprising retinoic acid receptor (RARα, β, γ) and retinoid X receptor (RXRα, β, γ). How a single signal induces highly complex temporally controlled networks that ultimately orchestrate physiological processes is unclear. Using an RA-inducible differentiation model, we defined the temporal changes in the genome-wide binding patterns of RARγ and RXRα and correlated them with transcription regulation. Unexpectedly, both receptors displayed a highly dynamic binding, with different RXRα heterodimers targeting identical loci. Comparison of RARγ and RXRα co-binding at RA-regulated genes identified putative RXRα-RARγ target genes that were validated with subtype-selective agonists. Gene-regulatory decisions during differentiation were inferred from TF-target gene information and temporal gene expression. This analysis revealed six distinct co-expression paths of which RXRα-RARγ is associated with transcription activation, while Sox2 and Egr1 were predicted to regulate repression. Finally, RXRα-RARγ regulatory networks were reconstructed through integration of functional co-citations. Our analysis provides a dynamic view of RA signalling during cell differentiation, reveals RAR heterodimer dynamics and promiscuity, and predicts decisions that diversify the RA signal into distinct gene-regulatory programs.
Affiliation(s)
- Marco A Mendoza-Parra
- Department of Cancer Biology, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC)/CNRS/INSERM/Université de Strasbourg, Illkirch Cedex, France
- Mannu Walia
- Department of Cancer Biology, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC)/CNRS/INSERM/Université de Strasbourg, Illkirch Cedex, France
- Martial Sankar
- Department of Cancer Biology, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC)/CNRS/INSERM/Université de Strasbourg, Illkirch Cedex, France
- Hinrich Gronemeyer
- Department of Cancer Biology, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC)/CNRS/INSERM/Université de Strasbourg, Illkirch Cedex, France
18
Kaur A, Van PT, Busch CR, Robinson CK, Pan M, Pang WL, Reiss DJ, DiRuggiero J, Baliga NS. Coordination of frontline defense mechanisms under severe oxidative stress. Mol Syst Biol 2010; 6:393. [PMID: 20664639 PMCID: PMC2925529 DOI: 10.1038/msb.2010.50] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Received: 10/06/2009] [Accepted: 05/31/2010] [Indexed: 01/15/2023] [Open Access]
Abstract
Inference of an environmental and gene regulatory influence network (EGRINOS) by integrating transcriptional responses to H2O2 and paraquat (PQ) has revealed a multi-tiered oxidative stress (OS)-management program to transcriptionally coordinate three peroxidase/catalase enzymes, two superoxide dismutases, production of rhodopsins, carotenoids and gas vesicles, metal trafficking, and various other aspects of metabolism. ChIP-chip, microarray, and survival assays have validated important architectural aspects of this network, identified novel defense mechanisms (including two evolutionarily distant peroxidase enzymes), and showed that general transcription factors of the transcription factor B family have an important function in coordinating the OS response (OSR) despite their inability to directly sense ROS. A comparison of transcriptional responses to sub-lethal doses of H2O2 and PQ with predictions of these responses made by an EGRIN model generated earlier from responses to other environmental factors has confirmed that a significant fraction of the OSR is made up of a generalized component that is also observed in response to other stressors. Analysis of active regulons within environment and gene regulatory influence network for OS (EGRINOS) across diverse environmental conditions has identified the specialized component of oxidative stress response (OSR) that is triggered by sub-lethal OS, but not by other stressors, including sub-inhibitory levels of redox-active metals, extreme changes in oxygen tension, and a sub-lethal dose of γ rays.
Reactive oxygen species (ROS), such as hydrogen peroxide (H2O2), superoxide (O2−), and hydroxyl (OH−) radicals, are normal by-products of aerobic metabolism. Evolutionarily conserved mechanisms including detoxification enzymes (peroxidase/catalase and superoxide dismutase (SOD)) and free radical scavengers manage this endogenous production of ROS. OS is a condition reached when certain environmental stresses or genetic defects cause the production of ROS to exceed the management capacity. The damage to diverse cellular components including DNA, proteins, lipids, and carbohydrates resulting from OS (Imlay, 2003; Apel and Hirt, 2004; Perrone et al, 2008) is recognized as an important player in many diseases and in the aging process (Finkel, 2005). We have applied a systems approach to characterize the OSR of an archaeal model organism, Halobacterium salinarum NRC-1. This haloarchaeon grows aerobically at 4.3 M salt concentration in which it routinely faces cycles of desiccation and rehydration, and increased ultraviolet radiation—both of which can increase the production of ROS (Farr and Kogoma, 1991; Oliver et al, 2001). We have reconstructed the physiological adjustments associated with management of excessive OS through the analysis of global transcriptional changes elicited by step exposure to growth sub-inhibitory and sub-lethal levels of H2O2 and PQ (a redox-cycling drug that produces O2−; Hassan and Fridovich, 1979) as well as during subsequent recovery from these stresses. We have integrated all of these data into a unified model for OSR to discover conditional functional links between protective mechanisms and normal aspects of metabolism. Subsequent phenotypic analysis of gene deletion strains has verified the conditional detoxification functions of three putative peroxidase/catalase enzymes, two SODs, and the protective function of rhodopsins under increased levels of H2O2 and PQ. 
Similarly, we have also validated ROS scavenging by carotenoids and flotation by gas vesicles as secondary mechanisms that may minimize OS. Given the ubiquitous nature of OS, it is not entirely surprising that most organisms have evolved similar multiple lines of defense—both passive and active. Although such mechanisms have been extensively characterized using other model organisms, our integrated systems approach has uncovered additional protective mechanisms in H. salinarum (e.g. two evolutionarily distant peroxidase/catalase enzymes) and revealed a structure and hierarchy to the OSR through conditional regulatory associations among various components of the response. We have validated some aspects of the architecture of the regulatory network for managing OS by confirming physical protein–DNA interactions of six transcription factors (TFs) with promoters of genes they were predicted to influence in EGRINOS. Furthermore, we have also shown the consequence of deleting two of these TFs on transcript levels of genes they control and survival rate under OS. It is notable that these TFs are not directly associated with sensing ROS, but, rather, they have a general function in coordinating the overall response. This insight would not have been possible without constructing EGRINOS through systems integration of diverse datasets. Although it has been known that OS is a component of diverse environmental stress conditions, we quantitatively show for the first time that much of the transcriptional responses induced by the two treatments could indeed have been predicted using a model constructed from the analysis of transcriptional responses to changes in other environmental factors (UV and γ-radiation, light, oxygen, and six metals). However, using specific examples we also reveal the specific components of the OSR that are triggered only under severe OS. 
Notably, this model of OSR gives a unified perspective of the interconnections among all of these generalized and OS-specific regulatory mechanisms. Complexity of cellular response to oxidative stress (OS) stems from its wide-ranging damage to nucleic acids, proteins, carbohydrates, and lipids. We have constructed a systems model of OS response (OSR) for Halobacterium salinarum NRC-1 in an attempt to understand the architecture of its regulatory network that coordinates this complex response. This has revealed a multi-tiered OS-management program to transcriptionally coordinate three peroxidase/catalase enzymes, two superoxide dismutases, production of rhodopsins, carotenoids and gas vesicles, metal trafficking, and various other aspects of metabolism. Through experimental validation of interactions within the OSR regulatory network, we show that despite their inability to directly sense reactive oxygen species, general transcription factors have an important function in coordinating this response. Remarkably, a significant fraction of this OSR was accurately recapitulated by a model that was earlier constructed from cellular responses to diverse environmental perturbations—this constitutes the general stress response component. Notwithstanding this observation, comparison of the two models has identified the coordination of frontline defense and repair systems by regulatory mechanisms that are triggered uniquely by severe OS and not by other environmental stressors, including sub-inhibitory levels of redox-active metals, extreme changes in oxygen tension, and a sub-lethal dose of γ rays.
Affiliation(s)
- Amardeep Kaur
- Institute for Systems Biology, Seattle, WA 98103, USA
19
Schmid AK, Pan M, Sharma K, Baliga NS. Two transcription factors are necessary for iron homeostasis in a salt-dwelling archaeon. Nucleic Acids Res 2010; 39:2519-33. [PMID: 21109526 PMCID: PMC3074139 DOI: 10.1093/nar/gkq1211] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.0] [Indexed: 12/17/2022] [Open Access]
Abstract
Because iron toxicity and deficiency are equally life threatening, maintaining intracellular iron levels within a narrow optimal range is critical for nearly all known organisms. However, regulatory mechanisms that establish homeostasis are not well understood in organisms that dwell in environments at the extremes of pH, temperature, and salinity. Under conditions of limited iron, the extremophile Halobacterium salinarum, a salt-loving archaeon, mounts a specific response to scavenge iron for growth. We have identified and characterized the role of two transcription factors (TFs), Idr1 and Idr2, in regulating this important response. An integrated systems analysis of TF knockout gene expression profiles and genome-wide binding locations in the presence and absence of iron has revealed that these TFs operate collaboratively to maintain iron homeostasis. In the presence of iron, Idr1 and Idr2 bind near each other at 24 loci in the genome, where they are both required to repress some genes. By contrast, Idr1 and Idr2 are both necessary to activate other genes in a putative feed-forward loop. Even at loci bound independently, the two TFs target different genes with similar functions in iron homeostasis. We discuss conserved and unique features of the Idr1-Idr2 system in the context of similar systems in organisms from other domains of life.
Affiliation(s)
- Amy K Schmid
- Duke University, Department of Biology and Institute for Genome Sciences and Policy, Center for Systems Biology, Durham, NC 27708, USA.
20
Bare JC, Koide T, Reiss DJ, Tenenbaum D, Baliga NS. Integration and visualization of systems biology data in context of the genome. BMC Bioinformatics 2010; 11:382. [PMID: 20642854 PMCID: PMC2912892 DOI: 10.1186/1471-2105-11-382] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Received: 02/17/2010] [Accepted: 07/19/2010] [Indexed: 01/05/2023] [Open Access]
Abstract
Background: High-density tiling arrays and new sequencing technologies are generating rapidly increasing volumes of transcriptome and protein-DNA interaction data. Visualization and exploration of these data are critical to understanding the regulatory logic encoded in the genome by which the cell dynamically affects its physiology and interacts with its environment. Results: The Gaggle Genome Browser is a cross-platform desktop program for interactively visualizing high-throughput data in the context of the genome. Important features include dynamic panning and zooming, keyword search and open interoperability through the Gaggle framework. Users may bookmark locations on the genome with descriptive annotations and share these bookmarks with other users. The program handles large sets of user-generated data using an in-process database and leverages the facilities of SQL and the R environment for importing and manipulating data. A key aspect of the Gaggle Genome Browser is interoperability. By connecting to the Gaggle framework, the genome browser joins a suite of interconnected bioinformatics tools for analysis and visualization with connectivity to major public repositories of sequences, interactions and pathways. To this flexible environment for exploring and combining data, the Gaggle Genome Browser adds the ability to visualize diverse types of data in relation to its coordinates on the genome. Conclusions: Genomic coordinates function as a common key by which disparate biological data types can be related to one another. In the Gaggle Genome Browser, heterogeneous data are joined by their location on the genome to create information-rich visualizations yielding insight into genome organization, transcription and its regulation and, ultimately, a better understanding of the mechanisms that enable the cell to dynamically respond to its environment.
Affiliation(s)
- J Christopher Bare
- Institute for Systems Biology, 1441 N 34th Street, Seattle, WA 98103, USA
21
Wilbanks EG, Facciotti MT. Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One 2010; 5:e11471. [PMID: 20628599 PMCID: PMC2900203 DOI: 10.1371/journal.pone.0011471] [Citation(s) in RCA: 193] [Impact Index Per Article: 13.8] [Received: 04/20/2010] [Accepted: 06/14/2010] [Indexed: 01/08/2023] [Open Access]
Abstract
Next-generation DNA sequencing coupled with chromatin immunoprecipitation (ChIP-seq) is revolutionizing our ability to interrogate whole genome protein-DNA interactions. Identification of protein binding sites from ChIP-seq data has required novel computational tools, distinct from those used for the analysis of ChIP-Chip experiments. The growing popularity of ChIP-seq spurred the development of many different analytical programs (at last count, we noted 31 open source methods), each with some purported advantage. Given that the literature is dense and empirical benchmarking challenging, selecting an appropriate method for ChIP-seq analysis has become a daunting task. Herein we compare the performance of eleven different peak calling programs on common empirical, transcription factor datasets and measure their sensitivity, accuracy and usability. Our analysis provides an unbiased critical assessment of available technologies, and should assist researchers in choosing a suitable tool for handling ChIP-seq data.
Affiliation(s)
- Elizabeth G Wilbanks
- Graduate Group in Microbiology, University of California Davis, Davis, California, United States of America
22
Aleksic J, Russell S. ChIPing away at the genome: the new frontier travel guide. Molecular BioSystems 2010; 5:1421-8. [PMID: 19617957 DOI: 10.1039/b906179g] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 12/20/2022]
Abstract
Chromatin immunoprecipitation (ChIP) is a powerful technique for obtaining in vivo data on protein-DNA binding, providing an invaluable tool for elucidating gene regulation at a molecular level. Combined with high-throughput methods such as microarrays (ChIP-array) and second generation sequencing (ChIP-seq), the technique is now commonly used for answering questions about protein binding on a genome-wide level. This review focuses on the use of microarrays and sequencing for ChIP studies, provides a critical comparison of the currently used platforms and an overview of the computational methods available, and offers recommendations for optimal use of the techniques in a research context.
Affiliation(s)
- Jelena Aleksic
- Department of Genetics and Cambridge Systems Biology Centre, University of Cambridge, Downing Street, Cambridge, UK
23
Abstract
MOTIVATION: Chromatin immunoprecipitation (ChIP) coupled with tiling microarray (chip) experiments have been used in a wide range of biological studies such as identification of transcription factor binding sites and investigation of DNA methylation and histone modification. Hidden Markov models are widely used to model the spatial dependency of ChIP-chip data. However, parameter estimation for these models is typically either heuristic or suboptimal, leading to inconsistencies in their applications. To overcome this limitation and to develop efficient software, we propose a hidden ferromagnetic Ising model for ChIP-chip data analysis. RESULTS: We have developed a simple but powerful Bayesian hierarchical model for ChIP-chip data via a hidden Ising model. A Metropolis-within-Gibbs sampling algorithm is used to simulate from the posterior distribution of the model parameters. The proposed model naturally incorporates the spatial dependency of the data, and can be used to analyze data with various genomic resolutions and sample sizes. We illustrate the method using three publicly available datasets and various simulated datasets, and compare it with three closely related methods, namely TileMap HMM, tileHMM and BAC. We find that our method performs as well as TileMap HMM and BAC for the high-resolution data from the Affymetrix platform, but significantly outperforms the other three methods for the low-resolution data from the Agilent platform. Compared with the BAC method, which also involves MCMC simulations, our method is computationally much more efficient. AVAILABILITY: Software called iChip is freely available at http://www.bioconductor.org/. CONTACT: moq@mskcc.org.
Affiliation(s)
- Qianxing Mo
- Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY 10065, USA.
24
Combinatorial binding predicts spatio-temporal cis-regulatory activity. Nature 2009; 462:65-70. [PMID: 19890324 DOI: 10.1038/nature08531] [Citation(s) in RCA: 298] [Impact Index Per Article: 19.9] [Received: 07/21/2009] [Accepted: 09/22/2009] [Indexed: 11/09/2022]
Abstract
Development requires the establishment of precise patterns of gene expression, which are primarily controlled by transcription factors binding to cis-regulatory modules. Although transcription factor occupancy can now be identified at genome-wide scales, decoding this regulatory landscape remains a daunting challenge. Here we used a novel approach to predict spatio-temporal cis-regulatory activity based only on in vivo transcription factor binding and enhancer activity data. We generated a high-resolution atlas of cis-regulatory modules describing their temporal and combinatorial occupancy during Drosophila mesoderm development. The binding profiles of cis-regulatory modules with characterized expression were used to train support vector machines to predict five spatio-temporal expression patterns. In vivo transgenic reporter assays demonstrate the high accuracy of these predictions and reveal an unanticipated plasticity in transcription factor binding leading to similar expression. This data-driven approach does not require previous knowledge of transcription factor sequence affinity, function or expression, making it widely applicable.
25
Fu AQ, Adryan B. Scoring overlapping and adjacent signals from genome-wide ChIP and DamID assays. Molecular BioSystems 2009; 5:1429-38. [PMID: 19763325 PMCID: PMC3475982 DOI: 10.1039/b906880e] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022]
Abstract
Much of the research utilising genome-wide ChIP and DamID assays aims to understand the combinatorial feature of transcription factor binding and the chromatin modification code. With these experimental methods becoming more affordable and widespread, the focus of research is shifting to making sense of the data. Amongst the many challenges arising from data analyses, we are concerned with identifying biologically meaningful co-occurrences of transcription factor binding or chromatin modifications, using genome-wide profiles generated from ChIP and DamID assays. Co-occurrences are reflected in overlapping and adjacent signals in multiple ChIP or DamID profiles. We review existing quantitative methods to score overlaps and to cluster binding events in ChIP and DamID profiles. For pairwise comparison, existing methods either are based on a single score at the genome level or take a genomic, region-specific view. To draw inference from many profiles simultaneously, methods exist to cluster regions by their regulatory importance or to infer cis-regulatory modules for a particular region. We provide a simple guide to some of the statistical tools used by these methods.
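Editorial illustration (not a method taken from the review): one simple example of a single genome-level overlap score of the kind the abstract mentions is the base-pair Jaccard index between two sets of called regions. The sketch below assumes half-open `(start, end)` intervals on one chromosome; all function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_length(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def intersection_length(a, b):
    """Number of bases covered by both interval sets."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def jaccard(a, b):
    """Shared bases divided by bases covered by either profile."""
    shared = intersection_length(a, b)
    union = covered_length(a) + covered_length(b) - shared
    return shared / union if union else 0.0
```

A score like this summarizes co-occurrence in one number but says nothing about where the overlaps fall, which is why the region-specific approaches described in the abstract are treated as a separate class.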
Affiliation(s)
- Audrey Qiuyan Fu
- Cambridge Systems Biology Centre, University of Cambridge, Tennis Court Road, Cambridge, UK.
26
Wu M, Liang F, Tian Y. Bayesian modeling of ChIP-chip data using latent variables. BMC Bioinformatics 2009; 10:352. [PMID: 19857265 PMCID: PMC2779819 DOI: 10.1186/1471-2105-10-352] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 12/30/2008] [Accepted: 10/26/2009] [Indexed: 11/10/2022] [Open Access]
Abstract
BACKGROUND: The ChIP-chip technology has been used in a wide range of biomedical studies, such as identification of human transcription factor binding sites, investigation of DNA methylation, and investigation of histone modifications in animals and plants. Various methods have been proposed in the literature for analyzing the ChIP-chip data, such as the sliding window methods, the hidden Markov model-based methods, and Bayesian methods. Although, due to the integrated consideration of uncertainty of the models and model parameters, Bayesian methods can potentially work better than the other two classes of methods, the existing Bayesian methods do not perform satisfactorily. They usually require multiple replicates or some extra experimental information to parametrize the model, and long CPU times because they involve MCMC simulations. RESULTS: In this paper, we propose a Bayesian latent model for the ChIP-chip data. The new model mainly differs from the existing Bayesian models, such as the joint deconvolution model, the hierarchical gamma mixture model, and the Bayesian hierarchical model, in two respects. Firstly, it works on the difference between the averaged treatment and control samples. This enables the use of a simple model for the data, which avoids the probe-specific effect and the sample (control/treatment) effect. As a consequence, this enables an efficient MCMC simulation of the posterior distribution of the model, and also makes the model more robust to outliers. Secondly, it models the neighboring dependence of probes by introducing a latent indicator vector. A truncated Poisson prior distribution is assumed for the latent indicator variable, with the rationale being justified at length. CONCLUSION: The Bayesian latent method is successfully applied to real and ten simulated datasets, with comparisons with some of the existing Bayesian methods, hidden Markov model methods, and sliding window methods. The numerical results indicate that the Bayesian latent method can outperform other methods, especially when the data contain outliers.
Affiliation(s)
- Mingqi Wu
- Department of Statistics, Texas A&M University, College Station, TX 77843, USA.
|
27
|
Kim Y, Bekiranov S, Lee JK, Park T. Double error shrinkage method for identifying protein binding sites observed by tiling arrays with limited replication. Bioinformatics 2009; 25:2486-91. [PMID: 19667080 DOI: 10.1093/bioinformatics/btp471] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
Abstract
MOTIVATION ChIP-chip has been widely used for various genome-wide biological investigations. Given the small number of replicates (typically two to three) per biological sample, methods of analysis that control the variance are desirable but in short supply. We propose a double error shrinkage (DES) method that uses moving average statistics based on local-pooled error estimates, which effectively control both heterogeneous error variances and correlation structures of an extremely large number of individual probes on tiling arrays. RESULTS Applying DES to a ChIP-chip tiling array study for discovering genome-wide protein-binding sites, we identified 8400 target regions that include highly likely TFIID binding sites. About 33% of these were well matched with the known transcription start sites in the DBTSS library, while many other newly identified sites have a high chance of being real binding sites, based on the high positive predictive value of DES. We also showed the superior performance of DES compared with other commonly used methods for detecting actual protein binding sites.
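The two ingredients named in this abstract, pooled error estimates and moving average statistics, can be illustrated with a small sketch. This is not the DES method itself: the pooling rule (averaging replicate variances over neighboring probes), the window size, and the names `window` and `smoothed_scores` are assumptions made only for illustration.

```python
from statistics import mean, variance

def window(values, i, half_width):
    """Values around index i, clipped at the ends of the list."""
    return values[max(0, i - half_width): i + half_width + 1]

def smoothed_scores(log_ratios, half_width=1):
    """Toy two-step smoothing of probe-level enrichment scores.

    log_ratios: one list of replicate log2(IP/control) values per probe,
    in genomic order; every probe needs at least two replicates.
    Assumes the pooled variances are non-zero.
    """
    n_reps = len(log_ratios[0])
    means = [mean(r) for r in log_ratios]
    variances = [variance(r) for r in log_ratios]
    # Step 1 (assumed pooling rule): stabilize each probe's variance
    # by averaging it with the variances of neighboring probes.
    pooled = [mean(window(variances, i, half_width))
              for i in range(len(variances))]
    stats = [m / (v / n_reps) ** 0.5 for m, v in zip(means, pooled)]
    # Step 2: moving average of the per-probe statistics.
    return [mean(window(stats, i, half_width)) for i in range(len(stats))]

# Hypothetical data: seven probes, two replicates, probes 2-4 enriched.
probes = [[0.1, -0.1], [0.0, 0.2], [2.0, 2.2], [2.1, 1.9],
          [1.8, 2.0], [0.1, -0.2], [-0.1, 0.1]]
scores = smoothed_scores(probes)
print(scores.index(max(scores)))  # index of the strongest smoothed signal
```

With only two or three replicates, per-probe variances are noisy, so borrowing strength from other probes before smoothing is the general idea the sketch is meant to convey.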
Affiliation(s)
- Youngchul Kim
- Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22908, USA
|
28
|
Hou L, Qian MP, Zhu YP, Deng MH. Advances on bioinformatic research in transcription factor binding sites. Yi Chuan = Hereditas 2009; 31:365-73. [DOI: 10.3724/sp.j.1005.2009.00365] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
|
29
|
Schmid AK, Reiss DJ, Pan M, Koide T, Baliga NS. A single transcription factor regulates evolutionarily diverse but functionally linked metabolic pathways in response to nutrient availability. Mol Syst Biol 2009; 5:282. [PMID: 19536205 PMCID: PMC2710871 DOI: 10.1038/msb.2009.40] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.0] [Received: 01/20/2009] [Accepted: 05/15/2009] [Indexed: 01/02/2023]
Abstract
During evolution, enzyme-coding genes are acquired and/or replaced through lateral gene transfer and compiled into metabolic pathways. Gene regulatory networks evolve to fine tune biochemical fluxes through such metabolic pathways, enabling organisms to acclimate to nutrient fluctuations in a competitive environment. Here, we demonstrate that a single TrmB family transcription factor in Halobacterium salinarum NRC-1 globally coordinates functionally linked enzymes of diverse phylogeny in response to changes in carbon source availability. Specifically, during nutritional limitation, TrmB binds a cis-regulatory element to activate or repress 113 promoters of genes encoding enzymes in diverse metabolic pathways. By this mechanism, TrmB coordinates the expression of glycolysis, TCA cycle, and amino-acid biosynthesis pathways with the biosynthesis of their cognate cofactors (e.g. purine and thiamine). Notably, the TrmB-regulated metabolic network includes enzyme-coding genes that are uniquely archaeal as well as those that are conserved across all three domains of life. Simultaneous analysis of metabolic and gene regulatory network architectures suggests an ongoing process of co-evolution in which TrmB integrates the expression of metabolic enzyme-coding genes of diverse origins.
Affiliation(s)
- Amy K Schmid
- Institute for Systems Biology, Seattle, WA 98103-8904, USA
|
30
|
Prevalence of transcription promoters within archaeal operons and coding sequences. Mol Syst Biol 2009; 5:285. [PMID: 19536208 PMCID: PMC2710873 DOI: 10.1038/msb.2009.42] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.4] [Received: 11/20/2008] [Accepted: 05/13/2009] [Indexed: 01/21/2023]
Abstract
Despite the knowledge of complex prokaryotic-transcription mechanisms, generalized rules, such as the simplified organization of genes into operons with well-defined promoters and terminators, have had a significant role in systems analysis of regulatory logic in both bacteria and archaea. Here, we have investigated the prevalence of alternate regulatory mechanisms through genome-wide characterization of transcript structures of approximately 64% of all genes, including putative non-coding RNAs in Halobacterium salinarum NRC-1. Our integrative analysis of transcriptome dynamics and protein-DNA interaction data sets showed widespread environment-dependent modulation of operon architectures, transcription initiation and termination inside coding sequences, and extensive overlap in 3' ends of transcripts for many convergently transcribed genes. A significant fraction of these alternate transcriptional events correlate to binding locations of 11 transcription factors and regulators (TFs) inside operons and annotated genes, events usually considered spurious or non-functional. Using experimental validation, we illustrate the prevalence of overlapping genomic signals in archaeal transcription, casting doubt on the general perception of rigid boundaries between coding sequences and regulatory elements.
|
31
|
The role of predictive modelling in rationally re-engineering biological systems. Nat Rev Microbiol 2009; 7:297-305. [PMID: 19252506 DOI: 10.1038/nrmicro2107] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Indexed: 12/17/2022]
Abstract
Technologies to synthesize and transplant a complete genome into a cell have opened limitless potential to redesign organisms for complex, specialized tasks. However, large-scale re-engineering of a biological circuit will require systems-level optimization that will come from a deep understanding of operational relationships among all the constituent parts of a cell. The integrated framework necessary for conducting such complex bioengineering requires the convergence of systems and synthetic biology. Here, we review the status of these rapidly developing interdisciplinary fields of biology and provide a perspective on plausible venues for their merger.
|
32
|
An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol 2008; 26:1293-300. [PMID: 18978777 PMCID: PMC2596672 DOI: 10.1038/nbt.1505] [Citation(s) in RCA: 607] [Impact Index Per Article: 37.9] [Received: 06/11/2008] [Accepted: 10/03/2008] [Indexed: 01/19/2023]
Abstract
CisGenome is a software system for analyzing genome-wide chromatin immunoprecipitation (ChIP) data. It is designed to meet all basic needs of ChIP data analyses, including visualization, data normalization, peak detection, false discovery rate (FDR) computation, gene-peak association, and sequence and motif analysis. In addition to implementing previously published ChIP-chip analysis methods, the software contains new statistical methods designed specifically for ChIP-seq data. CisGenome has a modular design so that it supports interactive analyses through a graphic user interface as well as customized batch-mode computation for advanced data mining. A built-in browser allows visualization of array images, signals, gene structure, conservation, and DNA sequence and motif information. We illustrate the use of these tools by a comparative analysis of ChIP-chip and ChIP-seq data for the transcription factor NRSF/REST, a study of ChIP-seq analysis without negative control sample, and an analysis of a novel motif in Nanog- and Sox2-binding regions.
|
33
|
Humburg P, Bulger D, Stone G. Parameter estimation for robust HMM analysis of ChIP-chip data. BMC Bioinformatics 2008; 9:343. [PMID: 18706106 PMCID: PMC2536674 DOI: 10.1186/1471-2105-9-343] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 05/30/2008] [Accepted: 08/18/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND Tiling arrays are an important tool for the study of transcriptional activity, protein-DNA interactions and chromatin structure on a genome-wide scale at high resolution. Although hidden Markov models have been used successfully to analyse tiling array data, parameter estimation for these models is typically ad hoc. Especially in the context of ChIP-chip experiments, no standard procedures exist to obtain parameter estimates from the data. Common methods for the calculation of maximum likelihood estimates such as the Baum-Welch algorithm or Viterbi training are rarely applied in the context of tiling array analysis. RESULTS Here we develop a hidden Markov model for the analysis of chromatin structure ChIP-chip tiling array data, using t emission distributions to increase robustness towards outliers. Maximum likelihood estimates are used for all model parameters. Two different approaches to parameter estimation are investigated and combined into an efficient procedure. CONCLUSION We illustrate an efficient parameter estimation procedure that can be used for HMM based methods in general and leads to a clear increase in performance when compared to the use of ad hoc estimates. The resulting hidden Markov model outperforms established methods like TileMap in the context of histone modification studies.
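To make the robustness argument concrete, the following minimal sketch decodes a two-state HMM (background vs. enriched) with Student-t emissions using the Viterbi algorithm. It is an illustration under stated assumptions, not the authors' implementation: all parameter values are fixed by hand rather than estimated by maximum likelihood, and the names `t_logpdf` and `viterbi` are hypothetical.

```python
import math

def t_logpdf(x, mu, sigma, nu):
    """Log-density of a location-scale Student-t distribution."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

def viterbi(obs, log_start, log_trans, emit_params):
    """Most likely hidden-state path for a sequence of probe values."""
    if not obs:
        return []
    n_states = len(log_start)
    # score[s]: best log-probability of any path ending in state s
    score = [log_start[s] + t_logpdf(obs[0], *emit_params[s])
             for s in range(n_states)]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for x in obs[1:]:
        prev, score, ptr = score, [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + log_trans[r][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s]
                         + t_logpdf(x, *emit_params[s]))
        back.append(ptr)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Hand-picked, purely illustrative parameters: (mu, sigma, nu) per state.
emit_params = [(0.0, 0.5, 4.0), (2.2, 0.5, 4.0)]
log_start = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
obs = [0.1, -0.2, 0.0, 2.1, 2.4, 9.0, 2.2, 0.1, -0.1]  # 9.0 is an outlier
print(viterbi(obs, log_start, log_trans, emit_params))
```

Because the t density has heavy tails, the outlier at 9.0 is penalized only mildly under either state, so the decoded path can stay in the enriched state instead of being forced to switch.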
Affiliation(s)
- Peter Humburg
- Department of Statistics, Macquarie University, North Ryde, NSW 2109, Australia.
|
34
|
Rubio ED, Reiss DJ, Welcsh PL, Disteche CM, Filippova GN, Baliga NS, Aebersold R, Ranish JA, Krumm A. CTCF physically links cohesin to chromatin. Proc Natl Acad Sci U S A 2008; 105:8309-14. [PMID: 18550811 PMCID: PMC2448833 DOI: 10.1073/pnas.0801273105] [Citation(s) in RCA: 386] [Impact Index Per Article: 24.1] [Received: 02/07/2008] [Indexed: 12/24/2022]
Abstract
Cohesin is required to prevent premature dissociation of sister chromatids after DNA replication. Although its role in chromatid cohesion is well established, the functional significance of cohesin's association with interphase chromatin is not clear. Using a quantitative proteomics approach, we show that the STAG1 (Scc3/SA1) subunit of cohesin interacts with the CCCTC-binding factor CTCF bound to the c-myc insulator element. Both allele-specific binding of CTCF and Scc3/SA1 at the imprinted IGF2/H19 gene locus and our analyses of human DM1 alleles containing base substitutions at CTCF-binding motifs indicate that cohesin recruitment to chromosomal sites depends on the presence of CTCF. A large-scale genomic survey using ChIP-Chip demonstrates that Scc3/SA1 binding strongly correlates with the CTCF-binding site distribution in chromosomal arms. However, some chromosomal sites interact exclusively with CTCF, whereas others interact with Scc3/SA1 only. Furthermore, immunofluorescence microscopy and ChIP-Chip experiments demonstrate that CTCF associates with both centromeres and chromosomal arms during metaphase. These results link cohesin to gene regulatory functions and suggest an essential role for CTCF during sister chromatid cohesion. These results have implications for the functional role of cohesin subunits in the pathogenesis of Cornelia de Lange and Roberts syndromes.
Affiliation(s)
- Piri L. Welcsh
- Department of Medicine, Division of Medical Genetics, and
- Christine M. Disteche
- Department of Medicine, Division of Medical Genetics, and
- Department of Pathology, University of Washington, Seattle, WA 98195
- Galina N. Filippova
- Human Biology Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109; and
- Ruedi Aebersold
- Institute for Systems Biology, Seattle, WA 98103
- Institute of Molecular Systems Biology, Swiss Federal Institute of Technology (ETH), and Faculty of Science, University of Zürich, CH-8006 Zürich, Switzerland
- Anton Krumm
- *Department of Radiation Oncology
- **Institute for Stem Cell and Regenerative Medicine, University of Washington School of Medicine, Seattle WA 98195
|