1
Morgan D, DeMeo DL, Glass K. Using methylation data to improve transcription factor binding prediction. Epigenetics 2024; 19:2309826. [PMID: 38300850; PMCID: PMC10841018; DOI: 10.1080/15592294.2024.2309826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Accepted: 01/01/2024] [Indexed: 02/03/2024] [Open access]
Abstract
Modelling the regulatory mechanisms that determine cell fate, response to external perturbation, and disease state depends on measuring many factors, a task made more difficult by the plasticity of the epigenome. Scanning the genome for the sequence patterns defined by Position Weight Matrices (PWM) can be used to estimate transcription factor (TF) binding locations. However, this approach does not incorporate information regarding the epigenetic context necessary for TF binding. CpG methylation is an epigenetic mark influenced by environmental factors that is commonly assayed in human cohort studies. We developed a framework to score inferred TF binding locations using methylation data. We intersected motif locations identified using PWMs with methylation information captured in both whole-genome bisulfite sequencing and Illumina EPIC array data for six cell lines, scored motif locations based on these data, and compared with experimental data characterizing TF binding (ChIP-seq). We found that for most TFs, binding prediction improves using methylation-based scoring compared to standard PWM-scores. We also illustrate that our approach can be generalized to infer TF binding when methylation information is only proximally available, i.e. measured for nearby CpGs that do not directly overlap with a motif location. Overall, our approach provides a framework for inferring context-specific TF binding using methylation data. Importantly, the availability of DNA methylation data in existing patient populations provides an opportunity to use our approach to understand the impact of methylation on gene regulatory processes in the context of human disease.
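The two steps named in this abstract, scanning for PWM matches and then scoring the motif locations with overlapping CpG methylation, can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the toy `PWM`, the uniform `BACKGROUND`, the `scan` and `methylation_score` helpers, and the scoring rule (one minus the mean beta value) are all assumptions made for the example, since the abstract does not state how the score is computed.

```python
import math

# Hypothetical 4-column PWM for the consensus ACGT (probability per base).
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def scan(sequence, pwm, threshold=0.0):
    """Return (start, log-odds score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(math.log2(col[base] / BACKGROUND)
                    for col, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


def methylation_score(start, width, cpg_beta):
    """Illustrative score (an assumption): 1 - mean beta of CpGs in the motif.

    cpg_beta maps the 0-based position of the C in a CpG to its beta value
    (0 = unmethylated, 1 = fully methylated). Returns None when no measured
    CpG overlaps the motif window.
    """
    betas = [b for pos, b in cpg_beta.items() if start <= pos < start + width]
    if not betas:
        return None
    return 1.0 - sum(betas) / len(betas)


hits = scan("TTACGTAA", PWM)
cpg_beta = {3: 0.2}  # made-up measurement for the CpG at position 3
scored = [(start, pwm_score, methylation_score(start, len(PWM), cpg_beta))
          for start, pwm_score in hits]
```

Returning `None` for motifs without an overlapping CpG keeps "no data" distinct from "fully methylated"; the abstract's proximal case (nearby CpGs that do not overlap the motif) would need a wider window than the one used here.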
Affiliation(s)
- Daniel Morgan
  - Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA
- Dawn L. DeMeo
  - Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA
- Kimberly Glass
  - Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA
  - Department of Biostatistics, Harvard Chan School of Public Health, Boston, MA, USA
2
Long T, Bhattacharyya T, Repele A, Naylor M, Nooti S, Krueger S, Manu. The contributions of DNA accessibility and transcription factor occupancy to enhancer activity during cellular differentiation. G3 (Bethesda, Md.) 2024; 14:jkad269. [PMID: 38124496; PMCID: PMC11090500; DOI: 10.1093/g3journal/jkad269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Accepted: 11/01/2023] [Indexed: 12/23/2023]
Abstract
During gene regulation, DNA accessibility is thought to limit the availability of transcription factor (TF) binding sites, while TFs can increase DNA accessibility to recruit additional factors that upregulate gene expression. Given this interplay, the causative regulatory events in the modulation of gene expression remain unknown for the vast majority of genes. We utilized deeply sequenced ATAC-Seq data and site-specific knock-in reporter genes to investigate the relationship between the binding-site resolution dynamics of DNA accessibility and the expression dynamics of the enhancers of Cebpa during macrophage-neutrophil differentiation. While the enhancers upregulate reporter expression during the earliest stages of differentiation, there is little corresponding increase in their total accessibility. Conversely, total accessibility peaks during the last stages of differentiation without any increase in enhancer activity. The accessibility of positions neighboring C/EBP-family TF binding sites, which indicates TF occupancy, does increase significantly during early differentiation, showing that the early upregulation of enhancer activity is driven by TF binding. These results imply that a generalized increase in DNA accessibility is not sufficient, and binding by enhancer-specific TFs is necessary, for the upregulation of gene expression. Additionally, high-coverage ATAC-Seq combined with time-series expression data can infer the sequence of regulatory events at binding-site resolution.
Affiliation(s)
- Trevor Long
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
- Tapas Bhattacharyya
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
- Andrea Repele
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
- Madison Naylor
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
- Sunil Nooti
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
- Shawn Krueger
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
- Manu
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
3
Wolpe JB, Martins AL, Guertin MJ. Correction of transposase sequence bias in ATAC-seq data with rule ensemble modeling. NAR Genom Bioinform 2023; 5:lqad054. [PMID: 37274120; PMCID: PMC10236359; DOI: 10.1093/nargab/lqad054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Revised: 04/02/2023] [Accepted: 05/19/2023] [Indexed: 06/06/2023] [Open access]
Abstract
Chromatin accessibility assays have revolutionized the field of transcription regulation by providing single-nucleotide resolution measurements of regulatory features such as promoters and transcription factor binding sites. ATAC-seq directly measures how well the Tn5 transposase accesses chromatinized DNA. Tn5 has a complex sequence bias that is not effectively scaled with traditional bias-correction methods. We model this complex bias using a rule ensemble machine learning approach that integrates information from many input k-mers proximal to the ATAC sequence reads. We effectively characterize and correct single-nucleotide sequence biases and regional sequence biases of the Tn5 enzyme. Correction of enzymatic sequence bias is an important step in interpreting chromatin accessibility assays that aim to infer transcription factor binding and regulatory activity of elements in the genome.
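For context, one simple form of the scaling-based correction that this abstract contrasts with its rule ensemble model is k-mer frequency scaling: each read is reweighted by how over-represented its cut-site k-mer is relative to the genome. The sketch below is a minimal illustration of that baseline idea under stated assumptions, not the authors' method; the helper names `kmer_counts` and `scale_factors`, the toy sequence, and the cut positions are all hypothetical.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every k-mer in the sequence (overlapping windows)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def scale_factors(genome, cut_sites, k):
    """Per-k-mer weight = background frequency / frequency at cut sites.

    Over-cut k-mers get a weight below 1 and under-cut k-mers a weight
    above 1. Only k-mers observed at a cut site receive a factor.
    """
    background = kmer_counts(genome, k)
    observed = Counter(genome[p:p + k] for p in cut_sites
                       if 0 <= p and p + k <= len(genome))
    total_bg = sum(background.values())
    total_obs = sum(observed.values())
    return {
        kmer: (background[kmer] / total_bg) / (n_obs / total_obs)
        for kmer, n_obs in observed.items()
    }


genome = "ACGTACGTAA"   # toy reference sequence
cut_sites = [0, 4, 1]   # made-up 0-based cut positions
factors = scale_factors(genome, cut_sites, k=2)
# A corrected signal would sum factors[kmer] instead of 1 for each read.
```

A single k-mer window like this treats every position independently of its wider context, which is the kind of limitation the abstract's "complex sequence bias" remark points at.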
Affiliation(s)
- Jacob B Wolpe
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA
- André L Martins
  - Center for Cell Analysis and Modeling, University of Connecticut, Farmington, CT, USA
  - Department of Genetics and Genome Sciences, University of Connecticut, Farmington, CT, USA
- Michael J Guertin
  - Center for Cell Analysis and Modeling, University of Connecticut, Farmington, CT, USA
  - Department of Genetics and Genome Sciences, University of Connecticut, Farmington, CT, USA
4
Long T, Bhattacharyya T, Repele A, Naylor M, Nooti S, Krueger S, Manu. The contributions of DNA accessibility and transcription factor occupancy to enhancer activity during cellular differentiation. bioRxiv 2023:2023.02.22.529579. [PMID: 37090616; PMCID: PMC10120690; DOI: 10.1101/2023.02.22.529579] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/25/2023]
Abstract
The upregulation of gene expression by enhancers depends upon the interplay between the binding of sequence-specific transcription factors (TFs) and DNA accessibility. DNA accessibility is thought to limit the ability of TFs to bind to their sites, while TFs can increase accessibility to recruit additional factors that upregulate gene expression. Given this interplay, the causative regulatory events underlying the modulation of gene expression during cellular differentiation remain unknown for the vast majority of genes. We investigated the binding-site resolution dynamics of DNA accessibility and the expression dynamics of the enhancers of an important neutrophil gene, Cebpa, during macrophage-neutrophil differentiation. Reporter genes were integrated in a site-specific manner in PUER cells, which are progenitors that can be differentiated into neutrophils or macrophages in vitro by activating the pan-leukocyte TF PU.1. Time series data show that two enhancers upregulate reporter expression during the first 48 hours of neutrophil differentiation. Surprisingly, there is little or no increase in the total accessibility, measured by ATAC-Seq, of the enhancers during the same time period. Conversely, total accessibility peaks 96 hours after PU.1 activation, consistent with its role as a pioneer, but the enhancers do not upregulate gene expression. Combining deeply sequenced ATAC-Seq data with a new bias-correction method allowed the profiling of accessibility at single-nucleotide resolution and revealed protected regions in the enhancers that match all previously characterized TF binding sites and ChIP-Seq data. Although the accessibility of most positions does not change during early differentiation, that of positions neighboring TF binding sites, an indicator of TF occupancy, did increase significantly.
The localized accessibility changes are limited to nucleotides neighboring C/EBP-family TF binding sites, showing that the upregulation of enhancer activity during early differentiation is driven by C/EBP-family TF binding. These results show that increasing the total accessibility of enhancers is not sufficient for upregulating their activity and other events such as TF binding are necessary for upregulation. Also, TF binding can cause upregulation without a perceptible increase in total accessibility. Finally, this study demonstrates the feasibility of comprehensively mapping individual TF binding sites as footprints using high coverage ATAC-Seq and inferring the sequence of events in gene regulation by combining with time-series gene expression data.
Affiliation(s)
- Trevor Long
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
- Tapas Bhattacharyya
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
- Andrea Repele
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
- Madison Naylor
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
- Sunil Nooti
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
- Shawn Krueger
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
- Manu
  - Department of Biology, University of North Dakota, Grand Forks, ND 58202-9019, USA
5
Singh P, Stevenson SR, Dickinson PJ, Reyna-Llorens I, Tripathi A, Reeves G, Schreier TB, Hibberd JM. C4 gene induction during de-etiolation evolved through changes in cis to allow integration with ancestral C3 gene regulatory networks. Science Advances 2023; 9:eade9756. [PMID: 36989352; PMCID: PMC10058240; DOI: 10.1126/sciadv.ade9756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2022] [Accepted: 03/01/2023] [Indexed: 06/19/2023]
Abstract
C4 photosynthesis has evolved by repurposing enzymes found in C3 plants. Compared with the ancestral C3 state, accumulation of C4 cycle proteins is enhanced. We used de-etiolation of C4 Gynandropsis gynandra and C3 Arabidopsis thaliana to understand this process. C4 gene expression and chloroplast biogenesis in G. gynandra were tightly coordinated. Although C3 and C4 photosynthesis genes showed similar induction patterns, in G. gynandra, C4 genes were more strongly induced than orthologs from A. thaliana. In vivo binding of TGA and homeodomain as well as light-responsive elements such as G- and I-box motifs were associated with the rapid increase in transcripts of C4 genes. Deletion analysis confirmed that regions containing G- and I-boxes were necessary for high expression. The data support a model in which accumulation of transcripts derived from C4 photosynthesis genes in C4 leaves is enhanced because modifications in cis allowed integration into ancestral transcriptional networks.
6
Madrigal P, Deng S, Feng Y, Militi S, Goh KJ, Nibhani R, Grandy R, Osnato A, Ortmann D, Brown S, Pauklin S. Epigenetic and transcriptional regulations prime cell fate before division during human pluripotent stem cell differentiation. Nat Commun 2023; 14:405. [PMID: 36697417; PMCID: PMC9876972; DOI: 10.1038/s41467-023-36116-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 03/23/2022] [Accepted: 01/17/2023] [Indexed: 01/26/2023] [Open access]
Abstract
Stem cells undergo cellular division during their differentiation to produce daughter cells with a new cellular identity. However, the epigenetic events and molecular mechanisms occurring between consecutive cell divisions have been insufficiently studied due to technical limitations. Here, using the FUCCI reporter we developed a cell-cycle synchronised human pluripotent stem cell (hPSC) differentiation system for uncovering epigenome and transcriptome dynamics during the first two divisions leading to definitive endoderm. We observed that transcription of key differentiation markers occurs before cell division, while chromatin accessibility analyses revealed the early inhibition of alternative cell fates. We found that Activator protein-1 members controlled by p38/MAPK signalling are necessary for inducing endoderm while blocking cell fate shifting toward mesoderm, and that enhancers are rapidly established and decommissioned between different cell divisions. Our study has practical biomedical utility for producing hPSC-derived patient-specific cell types since p38/MAPK induction increased the differentiation efficiency of insulin-producing pancreatic beta-cells.
Affiliation(s)
- Pedro Madrigal
  - Department of Surgery, University of Cambridge, Cambridge, CB2 0QQ, UK
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
  - Wellcome - MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge, CB2 0SZ, UK
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, UK
- Siwei Deng
  - Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Old Road, University of Oxford, Headington, Oxford, OX3 7LD, UK
- Yuliang Feng
  - Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Old Road, University of Oxford, Headington, Oxford, OX3 7LD, UK
- Stefania Militi
  - Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Old Road, University of Oxford, Headington, Oxford, OX3 7LD, UK
- Kim Jee Goh
  - Department of Surgery, University of Cambridge, Cambridge, CB2 0QQ, UK
  - The Francis Crick Institute, London, NW1 1AT, UK
- Reshma Nibhani
  - Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Old Road, University of Oxford, Headington, Oxford, OX3 7LD, UK
- Rodrigo Grandy
  - Department of Surgery, University of Cambridge, Cambridge, CB2 0QQ, UK
- Anna Osnato
  - Department of Surgery, University of Cambridge, Cambridge, CB2 0QQ, UK
- Daniel Ortmann
  - Department of Surgery, University of Cambridge, Cambridge, CB2 0QQ, UK
- Stephanie Brown
  - Department of Surgery, University of Cambridge, Cambridge, CB2 0QQ, UK
- Siim Pauklin
  - Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Old Road, University of Oxford, Headington, Oxford, OX3 7LD, UK
7
Intrinsic bias estimation for improved analysis of bulk and single-cell chromatin accessibility profiles using SELMA. Nat Commun 2022; 13:5533. [PMID: 36130957; PMCID: PMC9492688; DOI: 10.1038/s41467-022-33194-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/01/2021] [Accepted: 09/08/2022] [Indexed: 11/25/2022] [Open access]
Abstract
Genome-wide profiling of chromatin accessibility by DNase-seq or ATAC-seq has been widely used to identify regulatory DNA elements and transcription factor binding sites. However, enzymatic DNA cleavage exhibits intrinsic sequence biases that confound chromatin accessibility profiling data analysis. Existing computational tools are limited in their ability to account for such intrinsic biases and not designed for analyzing single-cell data. Here, we present Simplex Encoded Linear Model for Accessible Chromatin (SELMA), a computational method for systematic estimation of intrinsic cleavage biases from genomic chromatin accessibility profiling data. We demonstrate that SELMA yields accurate and robust bias estimation from both bulk and single-cell DNase-seq and ATAC-seq data. SELMA can utilize internal mitochondrial DNA data to improve bias estimation. We show that transcription factor binding inference from DNase footprints can be improved by incorporating estimated biases using SELMA. Furthermore, we show strong effects of intrinsic biases in single-cell ATAC-seq data, and develop the first single-cell ATAC-seq intrinsic bias correction model to improve cell clustering. SELMA can enhance the performance of existing bioinformatics tools and improve the analysis of both bulk and single-cell chromatin accessibility sequencing data.
8
Luo K, Zhong J, Safi A, Hong LK, Tewari AK, Song L, Reddy TE, Ma L, Crawford GE, Hartemink AJ. Profiling the quantitative occupancy of myriad transcription factors across conditions by modeling chromatin accessibility data. Genome Res 2022; 32:1183-1198. [PMID: 35609992; PMCID: PMC9248881; DOI: 10.1101/gr.272203.120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2020] [Accepted: 05/06/2022] [Indexed: 11/24/2022]
Abstract
Over a thousand different transcription factors (TFs) bind with varying occupancy across the human genome. Chromatin immunoprecipitation (ChIP) can assay occupancy genome-wide, but only one TF at a time, limiting our ability to comprehensively observe the TF occupancy landscape, let alone quantify how it changes across conditions. We developed TF occupancy profiler (TOP), a Bayesian hierarchical regression framework, to profile genome-wide quantitative occupancy of numerous TFs using data from a single chromatin accessibility experiment (DNase- or ATAC-seq). TOP is supervised, and its hierarchical structure allows it to predict the occupancy of any sequence-specific TF, even those never assayed with ChIP. We used TOP to profile the quantitative occupancy of hundreds of sequence-specific TFs at sites throughout the genome and examined how their occupancies changed in multiple contexts: in approximately 200 human cell types, through 12 h of exposure to different hormones, and across the genetic backgrounds of 70 individuals. TOP enables cost-effective exploration of quantitative changes in the landscape of TF binding.
Affiliation(s)
- Kaixuan Luo
  - Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
  - Department of Human Genetics, The University of Chicago, Chicago, Illinois 60637, USA
- Jianling Zhong
  - Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
- Alexias Safi
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
- Linda K Hong
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
- Alok K Tewari
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Lingyun Song
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
- Timothy E Reddy
  - Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Biostatistics and Bioinformatics, Durham, North Carolina 27710, USA
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, North Carolina 27710, USA
  - Department of Biomedical Engineering, Duke University, Durham, North Carolina 27708, USA
- Li Ma
  - Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
  - Department of Statistical Science, Duke University, Durham, North Carolina 27708, USA
- Gregory E Crawford
  - Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
- Alexander J Hartemink
  - Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
  - Department of Biology, Duke University, Durham, North Carolina 27708, USA
9
Avsec Ž, Weilert M, Shrikumar A, Krueger S, Alexandari A, Dalal K, Fropf R, McAnany C, Gagneur J, Kundaje A, Zeitlinger J. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet 2021; 53:354-366. [PMID: 33603233; PMCID: PMC8812996; DOI: 10.1038/s41588-021-00782-6] [Citation(s) in RCA: 233] [Impact Index Per Article: 77.7] [Received: 07/19/2020] [Accepted: 01/07/2021] [Indexed: 01/30/2023]
Abstract
The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using clustered regularly interspaced short palindromic repeat (CRISPR)-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data.
Affiliation(s)
- Žiga Avsec
  - Department of Informatics, Technical University of Munich, Garching, Germany
  - Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-Universität München, Munich, Germany
  - Currently at DeepMind, London, UK
- Melanie Weilert
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Avanti Shrikumar
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Sabrina Krueger
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Amr Alexandari
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Khyati Dalal
  - Stowers Institute for Medical Research, Kansas City, MO, USA
  - The University of Kansas Medical Center, Kansas City, KS, USA
- Robin Fropf
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Charles McAnany
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Julien Gagneur
  - Department of Informatics, Technical University of Munich, Garching, Germany
- Anshul Kundaje (corresponding author)
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Julia Zeitlinger (corresponding author)
  - Stowers Institute for Medical Research, Kansas City, MO, USA
  - The University of Kansas Medical Center, Kansas City, KS, USA
10
D'Oliveira Albanus R, Kyono Y, Hensley J, Varshney A, Orchard P, Kitzman JO, Parker SCJ. Chromatin information content landscapes inform transcription factor and DNA interactions. Nat Commun 2021; 12:1307. [PMID: 33637709; PMCID: PMC7910283; DOI: 10.1038/s41467-021-21534-4] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 11/26/2019] [Accepted: 01/29/2021] [Indexed: 01/31/2023] [Open access]
Abstract
Interactions between transcription factors and chromatin are fundamental to genome organization and regulation and, ultimately, cell state. Here, we use information theory to measure signatures of organized chromatin resulting from transcription factor-chromatin interactions encoded in the patterns of the accessible genome, which we term chromatin information enrichment (CIE). We calculate CIE for hundreds of transcription factor motifs across human samples and identify two classes: low and high CIE. The 10-20% of common and tissue-specific high CIE transcription factor motifs associate with higher protein-DNA residence time, including different binding site subclasses of the same transcription factor, increased nucleosome phasing, specific protein domains, and the genetic control of both chromatin accessibility and gene expression. These results show that variations in the information encoded in chromatin architecture reflect functional biological variation, with implications for cell state dynamics and memory.
Affiliation(s)
- Yasuhiro Kyono
  - Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, USA
  - Department of Human Genetics, University of Michigan, Ann Arbor, USA
  - Tempus Labs, Inc., Chicago, IL, USA
- John Hensley
  - Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, USA
- Arushi Varshney
  - Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, USA
- Peter Orchard
  - Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, USA
- Jacob O Kitzman
  - Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, USA
  - Department of Human Genetics, University of Michigan, Ann Arbor, USA
- Stephen C J Parker
  - Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, USA
  - Department of Human Genetics, University of Michigan, Ann Arbor, USA
11
Minnoye L, Marinov GK, Krausgruber T, Pan L, Marand AP, Secchia S, Greenleaf WJ, Furlong EEM, Zhao K, Schmitz RJ, Bock C, Aerts S. Chromatin accessibility profiling methods. Nature Reviews Methods Primers 2021; 1:10. [PMID: 38410680; PMCID: PMC10895463; DOI: 10.1038/s43586-020-00008-9] [Citation(s) in RCA: 66] [Impact Index Per Article: 22.0] [Accepted: 12/01/2020] [Indexed: 02/06/2023]
Abstract
Chromatin accessibility, or the physical access to chromatinized DNA, is a widely studied characteristic of the eukaryotic genome. As active regulatory DNA elements are generally 'accessible', the genome-wide profiling of chromatin accessibility can be used to identify candidate regulatory genomic regions in a tissue or cell type. Multiple biochemical methods have been developed to profile chromatin accessibility, both in bulk and at the single-cell level. Depending on the method, enzymatic cleavage, transposition or DNA methyltransferases are used, followed by high-throughput sequencing, providing a view of genome-wide chromatin accessibility. In this Primer, we discuss these biochemical methods, as well as bioinformatics tools for analysing and interpreting the generated data, and insights into the key regulators underlying developmental, evolutionary and disease processes. We outline standards for data quality, reproducibility and deposition used by the genomics community. Although chromatin accessibility profiling is invaluable to study gene regulation, alone it provides only a partial view of this complex process. Orthogonal assays facilitate the interpretation of accessible regions with respect to enhancer-promoter proximity, functional transcription factor binding and regulatory function. We envision that technological improvements including single-molecule, multi-omics and spatial methods will bring further insight into the secrets of genome regulation.
Affiliation(s)
- Liesbeth Minnoye
- Center for Brain & Disease Research, VIB-KU Leuven, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
- Thomas Krausgruber
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna, Austria
- Lixia Pan
- Laboratory of Epigenome Biology, Systems Biology Center, Division of Intramural Research, National Heart, Lung and Blood Institute, NIH, Bethesda, MD, USA
- Stefano Secchia
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Eileen E M Furlong
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Keji Zhao
- Laboratory of Epigenome Biology, Systems Biology Center, Division of Intramural Research, National Heart, Lung and Blood Institute, NIH, Bethesda, MD, USA
- Christoph Bock
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna, Austria
- Institute of Artificial Intelligence and Decision Support, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria
- Stein Aerts
- Center for Brain & Disease Research, VIB-KU Leuven, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
12
Abstract
The ATAC-seq assay has emerged as the most useful, versatile, and widely adaptable method for profiling accessible chromatin regions and tracking the activity of cis-regulatory elements (cREs) in eukaryotes. Thanks to its great utility, it is now being applied to map active chromatin in the context of a very wide diversity of biological systems and questions. In the course of these studies, considerable experience working with ATAC-seq data has accumulated and a standard set of computational tasks that need to be carried out for most ATAC-seq analyses has emerged. Here, we review and provide examples of such common analytical procedures (including data processing, quality control, peak calling, identifying differentially accessible open chromatin regions, and variable transcription factor (TF) motif accessibility) and discuss recommended optimal practices.
13
Funk CC, Casella AM, Jung S, Richards MA, Rodriguez A, Shannon P, Donovan-Maiye R, Heavner B, Chard K, Xiao Y, Glusman G, Ertekin-Taner N, Golde TE, Toga A, Hood L, Van Horn JD, Kesselman C, Foster I, Madduri R, Price ND, Ament SA. Atlas of Transcription Factor Binding Sites from ENCODE DNase Hypersensitivity Data across 27 Tissue Types. Cell Rep 2020; 32:108029. [PMID: 32814038 PMCID: PMC7462736 DOI: 10.1016/j.celrep.2020.108029] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 10/04/2018] [Revised: 05/07/2020] [Accepted: 07/22/2020] [Indexed: 12/27/2022] Open
Abstract
Characterizing the tissue-specific binding sites of transcription factors (TFs) is essential to reconstruct gene regulatory networks and predict functions for non-coding genetic variation. DNase-seq footprinting enables the prediction of genome-wide binding sites for hundreds of TFs simultaneously. Despite the public availability of high-quality DNase-seq data from hundreds of samples, a comprehensive, up-to-date resource for the locations of genomic footprints is lacking. Here, we develop a scalable footprinting workflow using two state-of-the-art algorithms: Wellington and HINT. We apply our workflow to detect footprints in 192 ENCODE DNase-seq experiments and predict the genomic occupancy of 1,515 human TFs in 27 human tissues. We validate that these footprints overlap true-positive TF binding sites from ChIP-seq. We demonstrate that the locations, depth, and tissue specificity of footprints predict effects of genetic variants on gene expression and capture a substantial proportion of genetic risk for complex traits.
Affiliation(s)
- Cory C Funk
- Institute for Systems Biology, Seattle, WA 98109, USA
- Alex M Casella
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA; Medical Scientist Training Program, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Segun Jung
- Globus, University of Chicago, Chicago, IL 60637, USA
- Paul Shannon
- Institute for Systems Biology, Seattle, WA 98109, USA
- Ben Heavner
- Institute for Systems Biology, Seattle, WA 98109, USA
- Kyle Chard
- Globus, University of Chicago, Chicago, IL 60637, USA
- Yukai Xiao
- Globus, University of Chicago, Chicago, IL 60637, USA
- Todd E Golde
- Mayo Clinic, Department of Neuroscience, Jacksonville, FL 32224, USA
- Arthur Toga
- Mark and Mary Stevens Neuroimaging and Informatics Institute, University of Southern California, Los Angeles, CA 90033, USA
- Leroy Hood
- Institute for Systems Biology, Seattle, WA 98109, USA
- John D Van Horn
- Department of Psychology, University of Southern California, Los Angeles, CA 90007, USA
- Carl Kesselman
- Information Sciences Institute, University of Southern California, Los Angeles, CA 90292, USA
- Ian Foster
- Globus, University of Chicago, Chicago, IL 60637, USA; Data Science and Learning Division, Argonne National Laboratory, Argonne, IL 60439, USA
- Ravi Madduri
- Globus, University of Chicago, Chicago, IL 60637, USA; Data Science and Learning Division, Argonne National Laboratory, Argonne, IL 60439, USA
- Seth A Ament
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA; Department of Psychiatry, University of Maryland School of Medicine, Baltimore, MD 21201, USA
14
Xu S, Feng W, Lu Z, Yu CY, Shao W, Nakshatri H, Reiter JL, Gao H, Chu X, Wang Y, Liu Y. regSNPs-ASB: A Computational Framework for Identifying Allele-Specific Transcription Factor Binding From ATAC-seq Data. Front Bioeng Biotechnol 2020; 8:886. [PMID: 32850739 PMCID: PMC7405637 DOI: 10.3389/fbioe.2020.00886] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/06/2020] [Accepted: 07/09/2020] [Indexed: 12/21/2022] Open
Abstract
Expression quantitative trait loci (eQTL) analysis is useful for identifying genetic variants correlated with gene expression; however, it cannot distinguish between causal and nearby non-functional variants. Because the majority of disease-associated SNPs are located in regulatory regions, they can impact allele-specific binding (ASB) of transcription factors and result in differential expression of the target gene alleles. In this study, our aim was to identify functional single-nucleotide polymorphisms (SNPs) that alter transcriptional regulation and thus potentially impact cellular function. Here, we present regSNPs-ASB, a generalized linear model-based approach to identify regulatory SNPs that are located in transcription factor binding sites. The input for this model includes ATAC-seq (assay for transposase-accessible chromatin with high-throughput sequencing) raw read counts from heterozygous loci, where differential transposase-cleavage patterns between two alleles indicate preferential transcription factor binding to one of the alleles. Using regSNPs-ASB, we identified 53 regulatory SNPs in human MCF-7 breast cancer cells and 125 regulatory SNPs in human mesenchymal stem cells (MSC). By integrating the regSNPs-ASB output with RNA-seq experimental data and publicly available chromatin interaction data from MCF-7 cells, we found that these 53 regulatory SNPs were associated with 74 potential target genes and that 32 (43%) of these genes showed significant allele-specific expression. By comparing all of the MCF-7 and MSC regulatory SNPs to the eQTLs in the Genotype-Tissue Expression (GTEx) Project database, we found that 30% (16/53) of the regulatory SNPs in MCF-7 and 43% (52/122) of the regulatory SNPs in MSC were also in eQTL regions. The enrichment of regulatory SNPs in eQTLs indicated that many of them are likely responsible for allelic differences in gene expression (chi-square test, p-value < 0.01). In summary, we conclude that regSNPs-ASB is a useful tool for identifying causal variants from ATAC-seq data. This new computational tool will enable efficient prioritization of genetic variants identified as eQTL for further studies to validate their causal regulatory function. Ultimately, identifying causal genetic variants will further our understanding of the underlying molecular mechanisms of disease and the eventual development of potential therapeutic targets.
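To make the allele-specific signal described in this abstract concrete, the sketch below tests whether read counts at one heterozygous site deviate from a 50:50 allelic ratio. It uses an exact two-sided binomial test as a deliberately simpler stand-in; it is not the generalized linear model used by regSNPs-ASB, and the function name and example counts are hypothetical.

```python
from math import comb


def allelic_imbalance_pvalue(ref_reads: int, alt_reads: int) -> float:
    """Exact two-sided binomial test against a 50:50 allelic ratio.

    Illustrative stand-in only; regSNPs-ASB itself fits a generalized linear model.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    # Probability of each possible reference-allele count under p = 0.5.
    probs = [comb(n, k) / 2 ** n for k in range(n + 1)]
    observed = probs[ref_reads]
    # Sum every outcome that is at least as unlikely as the observed one.
    p_value = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts at a single heterozygous SNP (not taken from the paper).
print(allelic_imbalance_pvalue(18, 2))   # strongly skewed towards the reference allele
print(allelic_imbalance_pvalue(10, 10))  # perfectly balanced
```

A test like this treats each site in isolation and ignores mapping bias and overdispersion, which is one reason a model-based approach may be preferred in practice.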
Affiliation(s)
- Siwen Xu
- Institute of Intelligent System and Bioinformatics, College of Automation, Harbin Engineering University, Harbin, China; Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, United States
- Weixing Feng
- Institute of Intelligent System and Bioinformatics, College of Automation, Harbin Engineering University, Harbin, China
- Zixiao Lu
- Regenstrief Institute, Indiana University School of Medicine, Indianapolis, IN, United States
- Christina Y Yu
- Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, United States; Department of Biomedical Informatics, The Ohio State University, Columbus, OH, United States
- Wei Shao
- Regenstrief Institute, Indiana University School of Medicine, Indianapolis, IN, United States
- Harikrishna Nakshatri
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, United States
- Jill L Reiter
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States
- Hongyu Gao
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States
- Xiaona Chu
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States
- Yue Wang
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States
- Yunlong Liu
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, United States; Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States
15
Liu Y, Fu L, Kaufmann K, Chen D, Chen M. A practical guide for DNase-seq data analysis: from data management to common applications. Brief Bioinform 2020; 20:1865-1877. [PMID: 30010713 DOI: 10.1093/bib/bby057] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/08/2018] [Revised: 06/06/2018] [Accepted: 06/10/2018] [Indexed: 01/01/2023] Open
Abstract
Deoxyribonuclease I (DNase I)-hypersensitive site sequencing (DNase-seq) has been widely used to determine chromatin accessibility and its underlying regulatory lexicon. However, exploring DNase-seq data requires sophisticated downstream bioinformatics analyses. In this study, we first review computational methods for all of the major steps in DNase-seq data analysis, including experimental design, quality control, read alignment, peak calling, annotation of cis-regulatory elements, genomic footprinting and visualization. The challenges associated with each step are highlighted. Next, we provide a practical guideline and a computational pipeline for DNase-seq data analysis by integrating some of these tools. We also discuss the competing techniques and the potential applications of this pipeline for the analysis of analogous experimental data. Finally, we discuss the integration of DNase-seq with other functional genomics techniques.
Affiliation(s)
- Yongjing Liu
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou 310058, China
- Liangyu Fu
- Department for Plant Cell and Molecular Biology, Institute for Biology, Humboldt-Universität zu Berlin, Berlin 10115, Germany
- Kerstin Kaufmann
- Department for Plant Cell and Molecular Biology, Institute for Biology, Humboldt-Universität zu Berlin, Berlin 10115, Germany
- Dijun Chen
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou 310058, China
- Ming Chen
- Department for Plant Cell and Molecular Biology, Institute for Biology, Humboldt-Universität zu Berlin, Berlin 10115, Germany
16
Ouyang N, Boyle AP. TRACE: transcription factor footprinting using chromatin accessibility data and DNA sequence. Genome Res 2020; 30:1040-1046. [PMID: 32660981 PMCID: PMC7397869 DOI: 10.1101/gr.258228.119] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/10/2019] [Accepted: 06/26/2020] [Indexed: 02/06/2023]
Abstract
Transcription is tightly regulated by cis-regulatory DNA elements where transcription factors (TFs) can bind. Thus, identification of TF binding sites (TFBSs) is key to understanding gene expression and whole regulatory networks within a cell. The standard approaches used for TFBS prediction, such as position weight matrices (PWMs) and chromatin immunoprecipitation followed by sequencing (ChIP-seq), are widely used but have their drawbacks, including high false-positive rates and limited antibody availability, respectively. Several computational footprinting algorithms have been developed to detect TFBSs by investigating chromatin accessibility patterns; however, these also have limitations. We have developed a footprinting method to predict TF footprints in active chromatin elements (TRACE) to improve the prediction of TFBS footprints. TRACE incorporates DNase-seq data and PWMs within a multivariate hidden Markov model (HMM) to detect footprint-like regions with matching motifs. TRACE is an unsupervised method that accurately annotates binding sites for specific TFs automatically with no requirement for pregenerated candidate binding sites or ChIP-seq training data. Compared with published footprinting algorithms, TRACE has the best overall performance with the distinct advantage of targeting multiple motifs in a single model.
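The PWM-based prediction that this abstract contrasts with footprinting can be sketched in a few lines. The example below scores every window of a sequence with log-odds against a uniform background and reports windows above a threshold. The matrix values, the 0.25 background, the thresholds and the function names are all illustrative assumptions, and the sketch is not the TRACE hidden Markov model.

```python
from math import log2

# Toy position probability matrix for a 4-bp motif (hypothetical values).
PPM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def pwm_score(window: str) -> float:
    """Sum of per-position log-odds scores for one window (A/C/G/T only)."""
    return sum(log2(col[base] / BACKGROUND) for col, base in zip(PPM, window))


def scan(sequence: str, threshold: float) -> list[tuple[int, float]]:
    """Return (start, score) for every window scoring at or above the threshold."""
    width = len(PPM)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = pwm_score(sequence[start:start + width])
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits


print(scan("TTACGTACGAAA", threshold=4.0))  # strict threshold
print(scan("TTACGTACGAAA", threshold=3.0))  # looser threshold admits a weaker match
```

Because the score depends on sequence alone, lowering the threshold admits more candidate sites regardless of whether the chromatin is accessible, which is consistent with the high false-positive rate noted above.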
Affiliation(s)
- Alan P Boyle
- Department of Computational Medicine and Bioinformatics; Department of Human Genetics, University of Michigan, Ann Arbor, Michigan 48109, USA
17
Srivastava D, Mahony S. Sequence and chromatin determinants of transcription factor binding and the establishment of cell type-specific binding patterns. BIOCHIMICA ET BIOPHYSICA ACTA. GENE REGULATORY MECHANISMS 2020; 1863:194443. [PMID: 31639474 PMCID: PMC7166147 DOI: 10.1016/j.bbagrm.2019.194443] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 06/30/2019] [Revised: 09/21/2019] [Accepted: 10/06/2019] [Indexed: 12/14/2022]
Abstract
Transcription factors (TFs) selectively bind distinct sets of sites in different cell types. Such cell type-specific binding specificity is expected to result from interplay between the TF's intrinsic sequence preferences, cooperative interactions with other regulatory proteins, and cell type-specific chromatin landscapes. Cell type-specific TF binding events are highly correlated with patterns of chromatin accessibility and active histone modifications in the same cell type. However, since concurrent chromatin may itself be a consequence of TF binding, chromatin landscapes measured prior to TF activation provide more useful insights into how cell type-specific TF binding events became established in the first place. Here, we review the various sequence and chromatin determinants of cell type-specific TF binding specificity. We identify the current challenges and opportunities associated with computational approaches to characterizing, imputing, and predicting cell type-specific TF binding patterns. We further focus on studies that characterize TF binding in dynamic regulatory settings, and we discuss how these studies are leading to a more complex and nuanced understanding of dynamic protein-DNA binding activities. We propose that TF binding activities at individual sites can be viewed along a two-dimensional continuum of local sequence and chromatin context. Under this view, cell type-specific TF binding activities may result from either strongly favorable sequence features or strongly favorable chromatin context.
Affiliation(s)
- Divyanshi Srivastava
- Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America
- Shaun Mahony
- Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America
18
Xu T, Zheng X, Li B, Jin P, Qin Z, Wu H. A comprehensive review of computational prediction of genome-wide features. Brief Bioinform 2020; 21:120-134. [PMID: 30462144 PMCID: PMC10233247 DOI: 10.1093/bib/bby110] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/10/2018] [Revised: 10/15/2018] [Accepted: 10/16/2018] [Indexed: 12/15/2022] Open
Abstract
There are significant correlations among different types of genetic, genomic and epigenomic features within the genome. These correlations make the in silico feature prediction possible through statistical or machine learning models. With the accumulation of a vast amount of high-throughput data, feature prediction has gained significant interest lately, and a plethora of papers have been published in the past few years. Here we provide a comprehensive review on these published works, categorized by the prediction targets, including protein binding site, enhancer, DNA methylation, chromatin structure and gene expression. We also provide discussions on some important points and possible future directions.
Affiliation(s)
- Tianlei Xu
- Department of Mathematics and Computer Science, Emory University, Atlanta, GA, USA
- Xiaoqi Zheng
- Department of Mathematics, Shanghai Normal University, Shanghai, China
- Ben Li
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- Peng Jin
- Department of Human Genetics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- Zhaohui Qin
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- Hao Wu
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
19
Behjati Ardakani F, Schmidt F, Schulz MH. Predicting transcription factor binding using ensemble random forest models. F1000Res 2019; 7:1603. [PMID: 31723409 PMCID: PMC6823902 DOI: 10.12688/f1000research.16200.2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Accepted: 08/15/2019] [Indexed: 12/03/2022] Open
Abstract
Background: Understanding the location and cell-type specific binding of Transcription Factors (TFs) is important in the study of gene regulation. Computational prediction of TF binding sites is challenging, because TFs often bind only to short DNA motifs and cell-type specific co-factors may work together with the same TF to determine binding. Here, we consider the problem of learning a general model for the prediction of TF binding using DNase1-seq data and TF motif description in the form of position specific energy matrices (PSEMs). Methods: We use TF ChIP-seq data as a gold-standard for model training and evaluation. Our contribution is a novel ensemble learning approach using random forest classifiers. In the context of the ENCODE-DREAM in vivo TF binding site prediction challenge we consider different learning setups. Results: Our results indicate that the ensemble learning approach is able to better generalize across tissues and cell-types compared to individual tissue-specific classifiers or a classifier built based upon data aggregated across tissues. Furthermore, we show that incorporating DNase1-seq peaks is essential to reduce the false positive rate of TF binding predictions compared to considering the raw DNase1 signal. Conclusions: Analysis of important features reveals that the models preferentially select motifs of other TFs that are close interaction partners in existing protein-protein interaction networks. Code generated in the scope of this project is available on GitHub: https://github.com/SchulzLab/TFAnalysis (DOI: 10.5281/zenodo.1409697).
Affiliation(s)
- Fatemeh Behjati Ardakani
- High throughput Genomics and Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland University, Saarbruecken, Saarland, 66123, Germany; Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbruecken, Saarland, 66123, Germany; Graduate School of Computer Science, Saarland University, Saarbruecken, Saarland, 66123, Germany
- Florian Schmidt
- High throughput Genomics and Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland University, Saarbruecken, Saarland, 66123, Germany; Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbruecken, Saarland, 66123, Germany; Graduate School of Computer Science, Saarland University, Saarbruecken, Saarland, 66123, Germany; Computational Systems Biology, Genome Institute of Singapore, Singapore, Singapore
- Marcel H Schulz
- High throughput Genomics and Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland University, Saarbruecken, Saarland, 66123, Germany; Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbruecken, Saarland, 66123, Germany; Institute for Cardiovascular Regeneration, Goethe University Frankfurt Am Main, Frankfurt Am Main, Hessen, 60590, Germany
20
Schmidt F, Schulz MH. On the problem of confounders in modeling gene expression. Bioinformatics 2019; 35:711-719. [PMID: 30084962 PMCID: PMC6530814 DOI: 10.1093/bioinformatics/bty674] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/18/2018] [Revised: 06/21/2018] [Accepted: 08/02/2018] [Indexed: 01/01/2023] Open
Abstract
Motivation: Modeling of Transcription Factor (TF) binding from both ChIP-seq and chromatin accessibility data has become prevalent in computational biology. Several models have been proposed to generate new hypotheses on transcriptional regulation. However, there is no distinct approach to derive TF binding scores from ChIP-seq and open chromatin experiments. Here, we review biases of various scoring approaches and their effects on the interpretation and reliability of predictive gene expression models. Results: We generated predictive models for gene expression using ChIP-seq and DNase1-seq data from DEEP and ENCODE. Via randomization experiments, we identified confounders in TF gene scores derived from both ChIP-seq and DNase1-seq data. We reviewed correction approaches for both data types, which reduced the influence of identified confounders without harm to model performance. Also, our analyses highlighted further quality control measures, in addition to model performance, that may help to assure model reliability and to avoid misinterpretation in future studies. Availability and implementation: The software used in this study is available online at https://github.com/SchulzLab/TEPIC. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Florian Schmidt
- High-throughput Genomics and Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, Saarbrücken, Germany; Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany; Graduate School for Computer Science, Saarland Informatics Campus, Saarbrücken, Germany
- Marcel H Schulz
- High-throughput Genomics and Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, Saarbrücken, Germany; Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
21
Ibn-Salem J, Andrade-Navarro MA. 7C: Computational Chromosome Conformation Capture by Correlation of ChIP-seq at CTCF motifs. BMC Genomics 2019; 20:777. [PMID: 31653198 PMCID: PMC6814980 DOI: 10.1186/s12864-019-6088-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/11/2019] [Accepted: 09/09/2019] [Indexed: 02/08/2023] Open
Abstract
Background: Knowledge of the three-dimensional structure of the genome is necessary to understand how gene expression is regulated. Recent experimental techniques such as Hi-C or ChIA-PET measure long-range chromatin interactions genome-wide but are experimentally elaborate and have limited resolution, and such data are only available for a limited number of cell types and tissues. Results: While ChIP-seq was not designed to detect chromatin interactions, the formaldehyde treatment in the ChIP-seq protocol cross-links proteins with each other and with DNA. Consequently, regions that are not directly bound by the targeted TF but interact with the binding site via chromatin looping are also co-immunoprecipitated and sequenced. This produces minor ChIP-seq signals at loop anchor regions close to the directly bound site. We use the position and shape of ChIP-seq signals around CTCF motif pairs to predict whether they interact or not. We implemented this approach in a prediction method, termed Computational Chromosome Conformation Capture by Correlation of ChIP-seq at CTCF motifs (7C). We applied 7C to all CTCF motif pairs within 1 Mb in the human genome and validated predicted interactions with high-resolution Hi-C and ChIA-PET. A single ChIP-seq experiment from known architectural proteins (CTCF, Rad21, Znf143) but also from other TFs (like TRIM22 or RUNX3) predicts loops accurately. Importantly, 7C predicts loops in cell types and for TF ChIP-seq datasets not used in training. Conclusion: 7C predicts chromatin loops which can help to associate TF binding sites to regulated genes. Furthermore, profiling of hundreds of ChIP-seq datasets results in novel candidate factors functionally involved in chromatin looping. Our method is available as an R/Bioconductor package: http://bioconductor.org/packages/sevenC.
Affiliation(s)
- Jonas Ibn-Salem
- Faculty of Biology, Johannes Gutenberg University of Mainz, 55128, Mainz, Germany
22
Burgess SJ, Reyna-Llorens I, Stevenson SR, Singh P, Jaeger K, Hibberd JM. Genome-Wide Transcription Factor Binding in Leaves from C3 and C4 Grasses. THE PLANT CELL 2019; 31:2297-2314. [PMID: 31427470 PMCID: PMC6790085 DOI: 10.1105/tpc.19.00078] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 02/06/2019] [Revised: 06/06/2019] [Accepted: 08/14/2019] [Indexed: 05/19/2023]
Abstract
The majority of plants use C3 photosynthesis, but over 60 independent lineages of angiosperms have evolved the C4 pathway. In most C4 species, photosynthesis gene expression is compartmented between mesophyll and bundle-sheath cells. We performed DNaseI sequencing to identify genome-wide profiles of transcription factor binding in leaves of the C4 grasses Zea mays, Sorghum bicolor, and Setaria italica as well as C3 Brachypodium distachyon. In C4 species, while bundle-sheath strands and whole leaves shared similarity in the broad regions of DNA accessible to transcription factors, the short sequences bound varied. Transcription factor binding was prevalent in gene bodies as well as promoters, and many of these sites could represent duons that influence gene regulation in addition to amino acid sequence. Although globally there was little correlation between any individual DNaseI footprint and cell-specific gene expression, within individual species transcription factor binding to the same motifs in multiple genes provided evidence for shared mechanisms governing C4 photosynthesis gene expression. Furthermore, interspecific comparisons identified a small number of highly conserved transcription factor binding sites associated with leaves from species that diverged around 60 million years ago. These data therefore provide insight into the architecture associated with C4 photosynthesis gene expression in particular and characteristics of transcription factor binding in cereal crops in general.
Affiliation(s)
- Steven J Burgess
- Department of Plant Sciences, University of Cambridge, Cambridge CB2 3EA, United Kingdom
- Ivan Reyna-Llorens
- Department of Plant Sciences, University of Cambridge, Cambridge CB2 3EA, United Kingdom
- Sean R Stevenson
- Department of Plant Sciences, University of Cambridge, Cambridge CB2 3EA, United Kingdom
- Pallavi Singh
- Department of Plant Sciences, University of Cambridge, Cambridge CB2 3EA, United Kingdom
- Katja Jaeger
- Sainsbury Laboratory, University of Cambridge, Cambridge CB2 1LR, United Kingdom
- Julian M Hibberd
- Department of Plant Sciences, University of Cambridge, Cambridge CB2 3EA, United Kingdom
23
Youn A, Marquez EJ, Lawlor N, Stitzel ML, Ucar D. BiFET: sequencing Bias-free transcription factor Footprint Enrichment Test. Nucleic Acids Res 2019; 47:e11. [PMID: 30428075 PMCID: PMC6344870 DOI: 10.1093/nar/gky1117] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 07/19/2018] [Accepted: 10/23/2018] [Indexed: 01/15/2023] Open
Abstract
Transcription factor (TF) footprinting uncovers putative protein–DNA binding via combined analyses of chromatin accessibility patterns and their underlying TF sequence motifs. TF footprints are frequently used to identify TFs that regulate activities of cell/condition-specific genomic regions (target loci) in comparison to control regions (background loci) using standard enrichment tests. However, there is a strong association between the chromatin accessibility level and the GC content of a locus and the number and types of TF footprints that can be detected at this site. Traditional enrichment tests (e.g. hypergeometric) do not account for this bias and inflate false positive associations. Therefore, we developed a novel post-processing method, Bias-free Footprint Enrichment Test (BiFET), that corrects for the biases arising from the differences in chromatin accessibility levels and GC contents between target and background loci in footprint enrichment analyses. We applied BiFET on TF footprint calls obtained from EndoC-βH1 ATAC-seq samples using three different algorithms (CENTIPEDE, HINT-BC and PIQ) and showed BiFET’s ability to increase power and reduce false positive rate when compared to hypergeometric test. Furthermore, we used BiFET to study TF footprints from human PBMC and pancreatic islet ATAC-seq samples to show its utility to identify putative TFs associated with cell-type-specific loci.
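For reference, the traditional hypergeometric enrichment test that this abstract uses as its baseline can be sketched as follows. This is the uncorrected test, not BiFET: it makes no adjustment for chromatin accessibility or GC content. The function name, argument names and example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(
    target_with_fp: int,
    target_total: int,
    background_with_fp: int,
    background_total: int,
) -> float:
    """One-sided hypergeometric p-value for footprint enrichment in target loci.

    Returns P(X >= target_with_fp) when target_total loci are drawn at random
    from the pooled target and background loci.
    """
    population = target_total + background_total
    with_fp = target_with_fp + background_with_fp
    denominator = comb(population, target_total)
    upper = min(with_fp, target_total)
    tail = sum(
        comb(with_fp, k) * comb(population - with_fp, target_total - k)
        for k in range(target_with_fp, upper + 1)
    )
    return tail / denominator


# Hypothetical counts: 8 of 10 target loci and 12 of 90 background loci carry a footprint.
print(hypergeom_enrichment_pvalue(8, 10, 12, 90))
```

The test assumes every locus is equally likely to carry a detectable footprint, which is the assumption the abstract argues is violated when target and background loci differ in accessibility or GC content.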
Affiliation(s)
- Ahrim Youn
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
| | - Eladio J Marquez
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
| | - Nathan Lawlor
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
| | - Michael L Stitzel
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA.,Institute for Systems Genomics, University of Connecticut Health Center, Farmington, CT 06030, USA.,Department of Genetics & Genome Sciences, University of Connecticut Health Center, Farmington, CT 06030, USA
| | - Duygu Ucar
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Institute for Systems Genomics, University of Connecticut Health Center, Farmington, CT 06030, USA
- Department of Genetics & Genome Sciences, University of Connecticut Health Center, Farmington, CT 06030, USA
| |
|
24
|
Li Z, Schulz MH, Look T, Begemann M, Zenke M, Costa IG. Identification of transcription factor binding sites using ATAC-seq. Genome Biol 2019; 20:45. [PMID: 30808370 PMCID: PMC6391789 DOI: 10.1186/s13059-019-1642-2] [Citation(s) in RCA: 233] [Impact Index Per Article: 46.6] [Received: 02/13/2018] [Accepted: 01/25/2019] [Indexed: 01/07/2023] Open
Abstract
Transposase-Accessible Chromatin followed by sequencing (ATAC-seq) is a simple protocol for detection of open chromatin. Computational footprinting, the search for regions with depletion of cleavage events due to transcription factor binding, is poorly understood for ATAC-seq. We propose the first footprinting method considering ATAC-seq protocol artifacts. HINT-ATAC uses a position dependency model to learn the cleavage preferences of the transposase. We observe strand-specific cleavage patterns around transcription factor binding sites, which are determined by local nucleosome architecture. By incorporating all these biases, HINT-ATAC is able to significantly outperform competing methods in the prediction of transcription factor binding sites with footprints.
Affiliation(s)
- Zhijian Li
- Institute for Computational Genomics, Joint Research Center for Computational Biomedicine, RWTH Aachen University Medical School, Aachen, 52074 Germany
- Department of Cell Biology, Institute of Biomedical Engineering, RWTH Aachen University Medical School, Aachen, 52074 Germany
| | - Marcel H. Schulz
- Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, Germany
- Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany
- Institute for Cardiovascular Regeneration, Goethe University, Frankfurt am Main, Germany
- German Centre for Cardiovascular Research (DZHK), Partner site RheinMain, Frankfurt am Main, Germany
| | - Thomas Look
- Department of Cell Biology, Institute of Biomedical Engineering, RWTH Aachen University Medical School, Aachen, 52074 Germany
- Helmholtz Institute for Biomedical Engineering, RWTH Aachen University, Aachen, Germany
| | - Matthias Begemann
- Institute of Human Genetics, RWTH Aachen University Medical School, Aachen, Germany
| | - Martin Zenke
- Department of Cell Biology, Institute of Biomedical Engineering, RWTH Aachen University Medical School, Aachen, 52074 Germany
- Helmholtz Institute for Biomedical Engineering, RWTH Aachen University, Aachen, Germany
| | - Ivan G. Costa
- Institute for Computational Genomics, Joint Research Center for Computational Biomedicine, RWTH Aachen University Medical School, Aachen, 52074 Germany
- Helmholtz Institute for Biomedical Engineering, RWTH Aachen University, Aachen, Germany
| |
|
25
|
Karabacak Calviello A, Hirsekorn A, Wurmus R, Yusuf D, Ohler U. Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling. Genome Biol 2019; 20:42. [PMID: 30791920 PMCID: PMC6385462 DOI: 10.1186/s13059-019-1654-y] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 04/18/2018] [Accepted: 02/13/2019] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND: DNase-seq and ATAC-seq are broadly used methods to assay open chromatin regions genome-wide. The single nucleotide resolution of DNase-seq has been further exploited to infer transcription factor binding sites (TFBSs) in regulatory regions through footprinting. Recent studies have demonstrated the sequence bias of DNase I and its adverse effects on footprinting efficiency. However, footprinting and the impact of sequence bias have not been extensively studied for ATAC-seq.
RESULTS: Here, we undertake a systematic comparison of the two methods and show that a modification to the ATAC-seq protocol increases its yield and its agreement with DNase-seq data from the same cell line. We demonstrate that the two methods have distinct sequence biases and correct for these protocol-specific biases when performing footprinting. Despite the differences in footprint shapes, the locations of the inferred footprints in ATAC-seq and DNase-seq are largely concordant. However, the protocol-specific sequence biases in conjunction with the sequence content of TFBSs impact the discrimination of footprint from the background, which leads to one method outperforming the other for some TFs. Finally, we address the depth required for reproducible identification of open chromatin regions and TF footprints.
CONCLUSIONS: We demonstrate that the impact of bias correction on footprinting performance is greater for DNase-seq than for ATAC-seq and that DNase-seq footprinting leads to better performance. It is possible to infer concordant footprints by using replicates, highlighting the importance of reproducibility assessment. The results presented here provide an overview of the advantages and limitations of footprinting analyses using ATAC-seq and DNase-seq.
Affiliation(s)
- Aslıhan Karabacak Calviello
- Max Delbrück Center for Molecular Medicine, Berlin Institute for Medical Systems Biology, Berlin, Germany
- Department of Biology, Humboldt University, Berlin, Germany
| | - Antje Hirsekorn
- Max Delbrück Center for Molecular Medicine, Berlin Institute for Medical Systems Biology, Berlin, Germany
| | - Ricardo Wurmus
- Max Delbrück Center for Molecular Medicine, Berlin Institute for Medical Systems Biology, Berlin, Germany
| | - Dilmurat Yusuf
- Max Delbrück Center for Molecular Medicine, Berlin Institute for Medical Systems Biology, Berlin, Germany
| | - Uwe Ohler
- Max Delbrück Center for Molecular Medicine, Berlin Institute for Medical Systems Biology, Berlin, Germany.
- Department of Biology, Humboldt University, Berlin, Germany.
- Department of Computer Science, Humboldt University, Berlin, Germany.
| |
|
26
|
Li H, Quang D, Guan Y. Anchor: trans-cell type prediction of transcription factor binding sites. Genome Res 2019; 29:281-292. [PMID: 30567711 PMCID: PMC6360811 DOI: 10.1101/gr.237156.118] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 03/15/2018] [Accepted: 12/13/2018] [Indexed: 12/16/2022]
Abstract
The ENCyclopedia of DNA Elements (ENCODE) consortium has generated transcription factor (TF) binding ChIP-seq data covering hundreds of TF proteins and cell types; however, due to limits on time and resources, only a small fraction of all possible TF-cell type pairs have been profiled. One solution is to build machine learning models trained on currently available epigenomic data sets that can be applied to the remaining missing pairs. A major challenge is that TF binding sites are cell-type-specific, which can be attributed to cellular contexts such as chromatin accessibility. Meanwhile, indirect TF-DNA binding and interactions between TFs complicate this regulatory process. Technical issues such as sequencing biases and batch effects render the prediction task even more challenging. Many pioneering efforts have been made to predict TF binding profiles based on DNA sequence and DNase-seq footprints, but to what extent a model can be generalized to completely untested cell conditions remains unknown. In this study, we describe our first-place solution to the 2017 ENCODE-DREAM in vivo TF binding site prediction challenge. By carefully addressing multisource biases and information imbalance across cell types, we created a pipeline that significantly outperforms the current state-of-the-art methods. The proposed method is sufficiently complex to model nonlinear interactions between TF binding motifs and chromatin accessibility information up to 1500 bp from the genomic region of interest.
Affiliation(s)
- Hongyang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
| | - Daniel Quang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
| |
|
27
|
Umeyama T, Ito T. DMS-Seq for In Vivo Genome-wide Mapping of Protein-DNA Interactions and Nucleosome Centers. Cell Rep 2018; 21:289-300. [PMID: 28978481 DOI: 10.1016/j.celrep.2017.09.035] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 04/24/2017] [Revised: 07/31/2017] [Accepted: 09/08/2017] [Indexed: 01/05/2023] Open
Abstract
Protein-DNA interactions provide the basis for chromatin structure and gene regulation. Comprehensive identification of protein-occupied sites is thus vital to an in-depth understanding of genome function. Dimethyl sulfate (DMS) is a chemical probe that has long been used to detect footprints of DNA-bound proteins in vitro and in vivo. Here, we describe a genomic footprinting method, dimethyl sulfate sequencing (DMS-seq), which exploits the cell-permeable nature of DMS to obviate the need for nuclear isolation. This feature makes DMS-seq simple in practice and removes the potential risk of protein re-localization during nuclear isolation. DMS-seq successfully detects transcription factors bound to cis-regulatory elements and non-canonical chromatin particles in nucleosome-free regions. Furthermore, an unexpected preference of DMS confers on DMS-seq a unique potential to directly detect nucleosome centers without using genetic manipulation. We expect that DMS-seq will serve as a characteristic method for genome-wide interrogation of in vivo protein-DNA interactions.
Affiliation(s)
- Taichi Umeyama
- Department of Biochemistry, Kyushu University Graduate School of Medical Sciences, Fukuoka 812-8582, Japan; Core Research for Evolutional Science and Technology (CREST), Japan Agency for Medical Research and Development (AMED), Tokyo 100-0004, Japan; Laboratory for Microbiome Sciences, RIKEN Center for Integrative Medical Sciences, Yokohama 230-0045, Japan
| | - Takashi Ito
- Department of Biochemistry, Kyushu University Graduate School of Medical Sciences, Fukuoka 812-8582, Japan; Core Research for Evolutional Science and Technology (CREST), Japan Agency for Medical Research and Development (AMED), Tokyo 100-0004, Japan.
| |
|
28
|
Baek S, Goldstein I, Hager GL. Bivariate Genomic Footprinting Detects Changes in Transcription Factor Activity. Cell Rep 2018; 19:1710-1722. [PMID: 28538187 DOI: 10.1016/j.celrep.2017.05.003] [Citation(s) in RCA: 67] [Impact Index Per Article: 11.2] [Received: 01/23/2017] [Revised: 04/04/2017] [Accepted: 04/26/2017] [Indexed: 02/06/2023] Open
Abstract
In response to activating signals, transcription factors (TFs) bind DNA and regulate gene expression. TF binding can be measured by protection of the bound sequence from DNase digestion (i.e., footprint). Here, we report that 80% of TF binding motifs do not show a measurable footprint, partly because of a variable cleavage pattern within the motif sequence. To more faithfully portray the effect of TFs on chromatin, we developed an algorithm that captures two TF-dependent effects on chromatin accessibility: footprinting and motif-flanking accessibility. The algorithm, termed bivariate genomic footprinting (BaGFoot), efficiently detects TF activity. BaGFoot is robust to different accessibility assays (DNase-seq, ATAC-seq), all examined peak-calling programs, and a variety of cut bias correction approaches. BaGFoot reliably predicts TF binding and provides valuable information regarding the TFs affecting chromatin accessibility in various biological systems and following various biological events, including in cases where an absolute footprint cannot be determined.
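The two TF-dependent effects named in the abstract, footprinting and motif-flanking accessibility, can be illustrated with a toy calculation on a per-base cut-count profile. The sketch below is a simplified assumption for illustration and does not reproduce BaGFoot's actual statistics; the function name, the flank width, and the pseudocount are hypothetical.

```python
from statistics import mean


def footprint_stats(cuts, motif_start, motif_end, flank=3, pseudocount=1.0):
    """Return (flanking_accessibility, footprint_depth) for one motif site.

    cuts is a list of per-base cut counts; the motif occupies
    cuts[motif_start:motif_end]. Flanking accessibility is the mean cut
    count in the two flanks; footprint depth is the ratio of flank to
    motif signal, so values above 1 indicate protection within the motif.
    """
    left = cuts[max(0, motif_start - flank):motif_start]
    right = cuts[motif_end:motif_end + flank]
    motif = cuts[motif_start:motif_end]
    if not motif or not (left or right):
        raise ValueError("motif or flanks are empty")
    flank_mean = mean(left + right)
    motif_mean = mean(motif)
    # The pseudocount (an assumption) avoids division by zero at fully protected motifs.
    depth = (flank_mean + pseudocount) / (motif_mean + pseudocount)
    return flank_mean, depth
```

Tracking both values per TF across two conditions is what makes the analysis bivariate: a factor can gain flanking accessibility without showing a deeper footprint, which is consistent with the abstract's observation that many motifs lack a measurable footprint.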
Affiliation(s)
- Songjoon Baek
- Lab of Receptor Biology and Gene Expression, The National Cancer Institute, NIH, Bethesda, MD 20892, USA
| | - Ido Goldstein
- Lab of Receptor Biology and Gene Expression, The National Cancer Institute, NIH, Bethesda, MD 20892, USA.
| | - Gordon L Hager
- Lab of Receptor Biology and Gene Expression, The National Cancer Institute, NIH, Bethesda, MD 20892, USA.
| |
|
29
|
Martins AL, Walavalkar NM, Anderson WD, Zang C, Guertin MJ. Universal correction of enzymatic sequence bias reveals molecular signatures of protein/DNA interactions. Nucleic Acids Res 2018; 46:e9. [PMID: 29126307 PMCID: PMC5778497 DOI: 10.1093/nar/gkx1053] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 02/10/2017] [Revised: 09/19/2017] [Accepted: 10/18/2017] [Indexed: 12/04/2022] Open
Abstract
Coupling molecular biology to high-throughput sequencing has revolutionized the study of biology. Molecular genomics techniques are continually refined to provide higher resolution mapping of nucleic acid interactions and structure. Sequence preferences of enzymes can interfere with the accurate interpretation of these data. We developed seqOutBias to characterize enzymatic sequence bias from experimental data and scale individual sequence reads to correct intrinsic enzymatic sequence biases. SeqOutBias efficiently corrects DNase-seq, TACh-seq, ATAC-seq, MNase-seq and PRO-seq data. We show that seqOutBias correction facilitates identification of true molecular signatures resulting from transcription factors and RNA polymerase interacting with DNA.
Affiliation(s)
- André L Martins
- Biochemistry and Molecular Genetics Department, University of Virginia, Charlottesville, Virginia, USA
| | - Ninad M Walavalkar
- Biochemistry and Molecular Genetics Department, University of Virginia, Charlottesville, Virginia, USA
| | - Warren D Anderson
- Biochemistry and Molecular Genetics Department, University of Virginia, Charlottesville, Virginia, USA
- Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, USA
| | - Chongzhi Zang
- Biochemistry and Molecular Genetics Department, University of Virginia, Charlottesville, Virginia, USA
- Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, USA
| | - Michael J Guertin
- Biochemistry and Molecular Genetics Department, University of Virginia, Charlottesville, Virginia, USA
- Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, USA
| |
|
30
|
Goldstein I, Hager GL. Dynamic enhancer function in the chromatin context. Wiley Interdisciplinary Reviews. Systems Biology and Medicine 2018; 10:e1390. [PMID: 28544514 PMCID: PMC6638546 DOI: 10.1002/wsbm.1390] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 02/10/2017] [Revised: 03/21/2017] [Accepted: 03/23/2017] [Indexed: 12/28/2022]
Abstract
Enhancers serve as critical regulatory elements in higher eukaryotic cells. The characterization of enhancer function has evolved primarily from genome-wide methodologies, including chromatin immunoprecipitation (ChIP-seq), DNase-I hypersensitivity (DNase-seq), digital genomic footprinting (DGF), and the chromosome conformation capture techniques (3C, 4C, and Hi-C). These population-based assays average signals across millions of cells and lead to enhancer models characterized by static and sequential binding. More recently, fluorescent microscopy techniques, including fluorescence recovery after photobleaching, fluorescence correlation spectroscopy, and single molecule tracking (SMT), reveal a highly dynamic binding behavior for these factors in live cells. Furthermore, a refined analysis of genomic footprinting suggests that many transcription factors leave minimal or no footprints in chromatin, even when present and active in a given cell type. In this study, we review the implications of these new approaches for an accurate understanding of enhancer function in real time. In vivo SMT, in particular, has recently evolved as a promising methodology to probe enhancer function in live cells. Integration of findings from the many approaches now employed in the study of enhancer function suggests a highly dynamic view for the action of enhancer activating factors, viewed on a time scale of milliseconds to seconds, rather than minutes to hours. WIREs Syst Biol Med 2018, 10:e1390. doi: 10.1002/wsbm.1390. This article is categorized under: Analytical and Computational Methods > Computational Methods; Laboratory Methods and Technologies > Genetic/Genomic Methods; Laboratory Methods and Technologies > Imaging.
Affiliation(s)
- Ido Goldstein
- Laboratory of Receptor Biology and Gene Expression, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| | - Gordon L. Hager
- Laboratory of Receptor Biology and Gene Expression, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| |
|
31
|
Schwessinger R, Suciu MC, McGowan SJ, Telenius J, Taylor S, Higgs DR, Hughes JR. Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints. Genome Res 2017; 27:1730-1742. [PMID: 28904015 PMCID: PMC5630036 DOI: 10.1101/gr.220202.117] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 01/04/2017] [Accepted: 08/07/2017] [Indexed: 12/22/2022]
Abstract
In the era of genome-wide association studies (GWAS) and personalized medicine, predicting the impact of single nucleotide polymorphisms (SNPs) in regulatory elements is an important goal. Current approaches to determine the potential of regulatory SNPs depend on inadequate knowledge of cell-specific DNA binding motifs. Here, we present Sasquatch, a new computational approach that uses DNase footprint data to estimate and visualize the effects of noncoding variants on transcription factor binding. Sasquatch performs a comprehensive k-mer-based analysis of DNase footprints to determine any k-mer's potential for protein binding in a specific cell type and how this may be changed by sequence variants. Therefore, Sasquatch uses an unbiased approach, independent of known transcription factor binding sites and motifs. Sasquatch only requires a single DNase-seq data set per cell type, from any genotype, and produces consistent predictions from data generated by different experimental procedures and at different sequence depths. Here we demonstrate the effectiveness of Sasquatch using previously validated functional SNPs and benchmark its performance against existing approaches. Sasquatch is available as a versatile webtool incorporating publicly available data, including the human ENCODE collection. Thus, Sasquatch provides a powerful tool and repository for prioritizing likely regulatory SNPs in the noncoding genome.
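The k-mer-based idea in this abstract, comparing the k-mers that overlap a variant under the reference and alternate alleles, can be sketched as below. This is an illustrative assumption, not Sasquatch's implementation: the function names are hypothetical, and `kmer_score` stands in for whatever per-k-mer footprint measure a real analysis would derive from DNase-seq data.

```python
def kmers_overlapping(seq, pos, k):
    """All k-mers of seq that cover the 0-based position pos."""
    start_min = max(0, pos - k + 1)
    start_max = min(pos, len(seq) - k)
    return [seq[s:s + k] for s in range(start_min, start_max + 1)]


def variant_kmer_delta(seq, pos, alt, k, kmer_score):
    """Sum of per-k-mer score differences (alternate minus reference).

    kmer_score is a hypothetical lookup from k-mer to a footprint-derived
    score; k-mers missing from the table contribute 0.
    """
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    ref_kmers = kmers_overlapping(seq, pos, k)
    alt_kmers = kmers_overlapping(alt_seq, pos, k)
    return sum(
        kmer_score.get(a, 0.0) - kmer_score.get(r, 0.0)
        for r, a in zip(ref_kmers, alt_kmers)
    )
```

Because the comparison is made on k-mers rather than on known motifs, a sketch of this kind needs no prior list of transcription factor binding sites, which matches the motif-independent approach the abstract describes.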
Affiliation(s)
- Ron Schwessinger
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| | - Maria C Suciu
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| | - Simon J McGowan
- Computational Biology Research Group, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| | - Jelena Telenius
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| | - Stephen Taylor
- Computational Biology Research Group, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| | - Doug R Higgs
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| | - Jim R Hughes
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| |
|
32
|
Correcting nucleotide-specific biases in high-throughput sequencing data. BMC Bioinformatics 2017; 18:357. [PMID: 28764645 PMCID: PMC5540620 DOI: 10.1186/s12859-017-1766-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 02/09/2017] [Accepted: 07/19/2017] [Indexed: 01/07/2023] Open
Abstract
Background: High-throughput sequence (HTS) data exhibit position-specific nucleotide biases that obscure the intended signal and reduce the effectiveness of these data for downstream analyses. These biases are particularly evident in HTS assays for identifying regulatory regions in DNA (DNase-seq, ChIP-seq, FAIRE-seq, ATAC-seq). Biases may result from many experiment-specific factors, including selectivity of DNA restriction enzymes and fragmentation method, as well as sequencing technology-specific factors, such as choice of adapters/primers and sample amplification methods.
Results: We present a novel method to detect and correct position-specific nucleotide biases in HTS short read data. Our method calculates read-specific weights based on aligned reads to correct the over- or underrepresentation of position-specific nucleotide subsequences, both within and adjacent to the aligned read, relative to a baseline calculated in assay-specific enriched regions. Using HTS data from a variety of ChIP-seq, DNase-seq, FAIRE-seq, and ATAC-seq experiments, we show that our weight-adjusted reads reduce the position-specific nucleotide imbalance across reads and improve the utility of these data for downstream analyses, including identification and characterization of open chromatin peaks and transcription-factor binding sites.
Conclusions: A general-purpose method to characterize and correct position-specific nucleotide sequence biases fills the need to recognize and deal with, in a systematic manner, binding-site preference for the growing number of HTS-based epigenetic assays. As the breadth and impact of these biases are better understood, the availability of a standard toolkit to correct them will be important.
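The weighting idea described in the Results can be sketched in a few lines: each read receives a weight equal to the baseline frequency of its k-mer divided by the observed frequency, so over-represented k-mers are down-weighted and under-represented ones are up-weighted. This is a simplified illustration under stated assumptions, not the published method; it uses a single k-mer per read (for example the k-mer at the read start), and the function name is hypothetical.

```python
from collections import Counter


def kmer_weights(observed_kmers, baseline_kmers):
    """Weight per k-mer = baseline frequency / observed frequency.

    observed_kmers : one k-mer per aligned read (e.g. at the read start)
    baseline_kmers : k-mers sampled from the regions chosen as baseline
    A k-mer absent from the baseline gets weight 0 in this sketch.
    """
    if not observed_kmers or not baseline_kmers:
        raise ValueError("both k-mer lists must be non-empty")
    obs = Counter(observed_kmers)
    base = Counter(baseline_kmers)
    n_obs = sum(obs.values())
    n_base = sum(base.values())
    return {
        kmer: (base[kmer] / n_base) / (count / n_obs)
        for kmer, count in obs.items()
    }
```

Summing these weights instead of raw read counts at each position yields a bias-adjusted signal. The full method in the abstract also considers nucleotides adjacent to the read and computes the baseline in assay-specific enriched regions, neither of which is modeled here.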
|
33
|
Liu S, Zibetti C, Wan J, Wang G, Blackshaw S, Qian J. Assessing the model transferability for prediction of transcription factor binding sites based on chromatin accessibility. BMC Bioinformatics 2017; 18:355. [PMID: 28750606 PMCID: PMC5530957 DOI: 10.1186/s12859-017-1769-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 05/03/2017] [Accepted: 07/19/2017] [Indexed: 12/04/2022] Open
Abstract
Background: Computational prediction of transcription factor (TF) binding sites in different cell types is challenging. Recent technology development allows us to determine the genome-wide chromatin accessibility in various cellular and developmental contexts. The chromatin accessibility profiles provide useful information in prediction of TF binding events in various physiological conditions. Furthermore, ChIP-Seq analysis was used to determine genome-wide binding sites for a range of different TFs in multiple cell types. Integration of these two types of genomic information can improve the prediction of TF binding events.
Results: We assessed to what extent a model built upon other TFs and/or other cell types could be used to predict the binding sites of TFs of interest. A random forest model was built using a set of cell type-independent features such as specific sequences recognized by the TFs and evolutionary conservation, as well as cell type-specific features derived from chromatin accessibility data. Our analysis suggested that the models learned from other TFs and/or cell lines performed almost as well as the model learned from the target TF in the cell type of interest. Interestingly, models based on multiple TFs performed better than single-TF models. Finally, we proposed a universal model, BPAC, which was generated using ChIP-Seq data from multiple TFs in various cell types.
Conclusion: Integrating chromatin accessibility information with sequence information improves prediction of TF binding. The prediction of TF binding is transferable across TFs and/or cell lines, suggesting that there is a set of universal “rules”. A computational tool was developed to predict TF binding sites based on the universal “rules”.
Affiliation(s)
- Sheng Liu
- Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
| | - Cristina Zibetti
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
| | - Jun Wan
- Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
| | - Guohua Wang
- Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
| | - Seth Blackshaw
- Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
- Department of Neurology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
- Centre for Human Systems Biology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
- Institute for Cell Engineering, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
| | - Jiang Qian
- Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA.
| |
|
34
|
Lu R, Mucaki EJ, Rogan PK. Discovery and validation of information theory-based transcription factor and cofactor binding site motifs. Nucleic Acids Res 2017; 45:e27. [PMID: 27899659 PMCID: PMC5389469 DOI: 10.1093/nar/gkw1036] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 04/07/2016] [Accepted: 10/19/2016] [Indexed: 02/06/2023] Open
Abstract
Data from ChIP-seq experiments can derive the genome-wide binding specificities of transcription factors (TFs) and other regulatory proteins. We analyzed 765 ENCODE ChIP-seq peak datasets of 207 human TFs with a novel motif discovery pipeline based on recursive, thresholded entropy minimization. This approach, while obviating the need to compensate for skewed nucleotide composition, distinguishes true binding motifs from noise, quantifies the strengths of individual binding sites based on computed affinity and detects adjacent cofactor binding sites that coordinate with the targets of primary, immunoprecipitated TFs. We obtained contiguous and bipartite information theory-based position weight matrices (iPWMs) for 93 sequence-specific TFs, discovered 23 cofactor motifs for 127 TFs and revealed six high-confidence novel motifs. The reliability and accuracy of these iPWMs were determined via four independent validation methods, including the detection of experimentally proven binding sites, explanation of effects of characterized SNPs, comparison with previously published motifs and statistical analyses. We also predict previously unreported TF coregulatory interactions (e.g. TF complexes). These iPWMs constitute a powerful tool for predicting the effects of sequence variants in known binding sites, performing mutation analysis on regulatory SNPs and predicting previously unrecognized binding sites and target genes.
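To make the notion of quantifying individual binding-site strength with an information theory-based weight matrix concrete, the sketch below computes the information content of a single site in bits from a set of aligned sites, using the standard per-position term 2 + log2 f(base, position) for DNA. It is a minimal illustration that assumes equiprobable background bases and omits the small-sample correction; it is not the entropy-minimization pipeline described in the abstract, and the function names are hypothetical.

```python
from math import log2

BASES = "ACGT"


def column_frequencies(sites):
    """Per-position base frequencies from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    freqs = []
    for i in range(length):
        column = [s[i] for s in sites]
        freqs.append({b: column.count(b) / len(sites) for b in BASES})
    return freqs


def site_information(site, freqs):
    """Information of one site in bits: sum of 2 + log2 f(base, position).

    Returns -inf if the site contains a base never seen at that position,
    since no pseudocount is applied in this sketch.
    """
    total = 0.0
    for base, col in zip(site, freqs):
        f = col[base]
        if f == 0.0:
            return float("-inf")
        total += 2.0 + log2(f)
    return total
```

Under this scoring, a site that matches a fully conserved four-base model scores 8 bits, and sequences with scores at or below zero are generally not considered binding sites. Comparing the score of a reference and a variant sequence is the usual way such matrices are used to assess regulatory SNPs.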
Affiliation(s)
- Ruipeng Lu
- Department of Computer Science, Western University, London, Ontario, N6A 5B7, Canada
| | - Eliseos J Mucaki
- Department of Biochemistry, Western University, London, Ontario, N6A 5C1, Canada
| | - Peter K Rogan
- Department of Computer Science, Western University, London, Ontario, N6A 5B7, Canada
- Department of Biochemistry, Western University, London, Ontario, N6A 5C1, Canada
- Department of Oncology, Western University, London, Ontario, N6A 4L6, Canada
- Cytognomix Inc., London, Ontario, N5X 3X5, Canada
| |
|
35
|
Chen X, Yu B, Carriero N, Silva C, Bonneau R. Mocap: large-scale inference of transcription factor binding sites from chromatin accessibility. Nucleic Acids Res 2017; 45:4315-4329. [PMID: 28334916 PMCID: PMC5416775 DOI: 10.1093/nar/gkx174] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 06/20/2016] [Revised: 02/28/2017] [Accepted: 03/06/2017] [Indexed: 12/21/2022] Open
Abstract
Differential binding of transcription factors (TFs) at cis-regulatory loci drives the differentiation and function of diverse cellular lineages. Understanding the regulatory interactions that underlie cell fate decisions requires characterizing TF binding sites (TFBS) across multiple cell types and conditions. Techniques such as ChIP-seq can reveal genome-wide patterns of TF binding, but typically require laborious and costly experiments for each TF-cell-type (TFCT) condition of interest. Chromosomal accessibility assays can connect accessible chromatin in one cell type to many TFs through sequence motif mapping. Such methods, however, rarely take into account that the genomic context preferred by each factor differs from TF to TF, and from cell type to cell type. To address the differences in TF behaviors, we developed Mocap, a method that integrates chromatin accessibility, motif scores, TF footprints, CpG/GC content, evolutionary conservation and other factors in an ensemble of TFCT-specific classifiers. We show that integration of genomic features, such as CpG islands, improves TFBS prediction in some TFCT. Further, we describe a method for mapping new TFCT, for which no ChIP-seq data exist, onto our ensemble of classifiers and show that our cross-sample TFBS prediction method outperforms several previously described methods.
Affiliation(s)
- Xi Chen
- Department of Biology, New York University, New York, NY 10003, USA
| | - Bowen Yu
- Department of Computer Science, New York University, New York, NY 10003, USA
| | - Nicholas Carriero
- Center for Computational Biology, Flatiron Foundation, Simons Foundation, New York, NY 10010, USA
| | - Claudio Silva
- Department of Computer Science, New York University, New York, NY 10003, USA
| | - Richard Bonneau
- Department of Biology, New York University, New York, NY 10003, USA
- Department of Computer Science, New York University, New York, NY 10003, USA
- Center for Computational Biology, Flatiron Foundation, Simons Foundation, New York, NY 10010, USA
| |
Collapse
|
36
|
Sobel JA, Krier I, Andersin T, Raghav S, Canella D, Gilardi F, Kalantzi AS, Rey G, Weger B, Gachon F, Dal Peraro M, Hernandez N, Schibler U, Deplancke B, Naef F. Transcriptional regulatory logic of the diurnal cycle in the mouse liver. PLoS Biol 2017; 15:e2001069. [PMID: 28414715 PMCID: PMC5393560 DOI: 10.1371/journal.pbio.2001069] [Citation(s) in RCA: 57] [Impact Index Per Article: 8.1] [Received: 09/09/2016] [Accepted: 03/10/2017] [Indexed: 12/11/2022] Open
Abstract
Many organisms exhibit temporal rhythms in gene expression that propel diurnal cycles in physiology. In the liver of mammals, these rhythms are controlled by transcription-translation feedback loops of the core circadian clock and by feeding-fasting cycles. To better understand the regulatory interplay between the circadian clock and feeding rhythms, we mapped DNase I hypersensitive sites (DHSs) in the mouse liver during a diurnal cycle. The intensity of DNase I cleavages cycled at a substantial fraction of all DHSs, suggesting that DHSs harbor regulatory elements that control rhythmic transcription. Using chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq), we found that hypersensitivity cycled in phase with RNA polymerase II (Pol II) loading and H3K27ac histone marks. We then combined the DHSs with temporal Pol II profiles in wild-type (WT) and Bmal1-/- livers to computationally identify transcription factors through which the core clock and feeding-fasting cycles control diurnal rhythms in transcription. While a similar number of mRNAs accumulated rhythmically in Bmal1-/- compared to WT livers, the amplitudes in Bmal1-/- were generally lower. The residual rhythms in Bmal1-/- reflected transcriptional regulators mediating feeding-fasting responses as well as responses to rhythmic systemic signals. Finally, the analysis of DNase I cuts at nucleotide resolution showed dynamically changing footprints consistent with dynamic binding of CLOCK:BMAL1 complexes. Structural modeling suggested that these footprints are driven by a transient heterotetramer binding configuration at peak activity. Together, our temporal DNase I mappings allowed us to decipher the global regulation of diurnal transcription rhythms in the mouse liver.
Affiliation(s)
- Jonathan Aryeh Sobel
  - The Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Irina Krier
  - The Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Teemu Andersin
  - Department of Molecular Biology, University of Geneva, Geneva, Switzerland
- Sunil Raghav
  - The Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Donatella Canella
  - Center for Integrative Genomics, Faculty of Biology and Medicine, University of Lausanne, Lausanne, Switzerland
- Federica Gilardi
  - Center for Integrative Genomics, Faculty of Biology and Medicine, University of Lausanne, Lausanne, Switzerland
- Alexandra Styliani Kalantzi
  - The Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Guillaume Rey
  - The Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Benjamin Weger
  - Department of Diabetes and Circadian Rhythms, Nestlé Institute of Health Sciences, Lausanne, Switzerland
- Frédéric Gachon
  - Department of Diabetes and Circadian Rhythms, Nestlé Institute of Health Sciences, Lausanne, Switzerland
  - School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Matteo Dal Peraro
  - The Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Nouria Hernandez
  - Center for Integrative Genomics, Faculty of Biology and Medicine, University of Lausanne, Lausanne, Switzerland
- Ueli Schibler
  - Department of Molecular Biology, University of Geneva, Geneva, Switzerland
- Bart Deplancke
  - The Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Felix Naef
  - The Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
|
37
|
Schmidt F, Gasparoni N, Gasparoni G, Gianmoena K, Cadenas C, Polansky JK, Ebert P, Nordström K, Barann M, Sinha A, Fröhler S, Xiong J, Dehghani Amirabad A, Behjati Ardakani F, Hutter B, Zipprich G, Felder B, Eils J, Brors B, Chen W, Hengstler JG, Hamann A, Lengauer T, Rosenstiel P, Walter J, Schulz MH. Combining transcription factor binding affinities with open-chromatin data for accurate gene expression prediction. Nucleic Acids Res 2017; 45:54-66. [PMID: 27899623 PMCID: PMC5224477 DOI: 10.1093/nar/gkw1061] [Citation(s) in RCA: 73] [Impact Index Per Article: 10.4] [Received: 05/27/2016] [Revised: 10/18/2016] [Accepted: 10/24/2016] [Indexed: 12/21/2022] Open
Abstract
The binding and contribution of transcription factors (TF) to cell specific gene expression is often deduced from open-chromatin measurements to avoid costly TF ChIP-seq assays. Thus, it is important to develop computational methods for accurate TF binding prediction in open-chromatin regions (OCRs). Here, we report a novel segmentation-based method, TEPIC, to predict TF binding by combining sets of OCRs with position weight matrices. TEPIC can be applied to various open-chromatin data, e.g. DNaseI-seq and NOMe-seq. Additionally, Histone-Marks (HMs) can be used to identify candidate TF binding sites. TEPIC computes TF affinities and uses open-chromatin/HM signal intensity as quantitative measures of TF binding strength. Using machine learning, we find low affinity binding sites to improve our ability to explain gene expression variability compared to the standard presence/absence classification of binding sites. Further, we show that both footprints and peaks capture essential TF binding events and lead to a good prediction performance. In our application, gene-based scores computed by TEPIC with one open-chromatin assay nearly reach the quality of several TF ChIP-seq data sets. Finally, these scores correctly predict known transcriptional regulators as illustrated by the application to novel DNaseI-seq and NOMe-seq data for primary human hepatocytes and CD4+ T-cells, respectively.
Affiliation(s)
- Florian Schmidt
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Nina Gasparoni
  - Department of Genetics, University of Saarland, Saarbrücken, 66123, Germany
- Gilles Gasparoni
  - Department of Genetics, University of Saarland, Saarbrücken, 66123, Germany
- Kathrin Gianmoena
  - Leibniz Research Centre for Working Environment and Human Factors IfADo, Dortmund, 44139, Germany
- Cristina Cadenas
  - Leibniz Research Centre for Working Environment and Human Factors IfADo, Dortmund, 44139, Germany
- Julia K Polansky
  - Experimental Rheumatology, German Rheumatism Research Centre, Berlin, 10117, Germany
- Peter Ebert
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
  - International Max Planck Research School for Computer Science, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Karl Nordström
  - Department of Genetics, University of Saarland, Saarbrücken, 66123, Germany
- Matthias Barann
  - Institute of Clinical Molecular Biology, Christian-Albrechts-University, Kiel, 24105, Germany
- Anupam Sinha
  - Institute of Clinical Molecular Biology, Christian-Albrechts-University, Kiel, 24105, Germany
- Sebastian Fröhler
  - Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Berlin, 13092, Germany
- Jieyi Xiong
  - Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Berlin, 13092, Germany
- Azim Dehghani Amirabad
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
  - International Max Planck Research School for Computer Science, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Fatemeh Behjati Ardakani
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Barbara Hutter
  - Applied Bioinformatics, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Gideon Zipprich
  - Data Management and Genomics IT, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Bärbel Felder
  - Data Management and Genomics IT, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Jürgen Eils
  - Data Management and Genomics IT, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Benedikt Brors
  - Applied Bioinformatics, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Wei Chen
  - Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Berlin, 13092, Germany
- Jan G Hengstler
  - Leibniz Research Centre for Working Environment and Human Factors IfADo, Dortmund, 44139, Germany
- Alf Hamann
  - International Max Planck Research School for Computer Science, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Thomas Lengauer
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Philip Rosenstiel
  - Institute of Clinical Molecular Biology, Christian-Albrechts-University, Kiel, 24105, Germany
- Jörn Walter
  - Department of Genetics, University of Saarland, Saarbrücken, 66123, Germany
- Marcel H Schulz
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
|
38
|
Genome-wide footprinting: ready for prime time? Nat Methods 2016; 13:222-228. [PMID: 26914206 DOI: 10.1038/nmeth.3766] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Received: 07/09/2015] [Accepted: 12/31/2015] [Indexed: 01/16/2023]
Abstract
High-throughput sequencing technologies have allowed many gene locus-level molecular biology assays to become genome-wide profiling methods. DNA-cleaving enzymes such as DNase I have been used to probe accessible chromatin. The accessible regions contain functional regulatory sites, including promoters, insulators and enhancers. Deep sequencing of DNase-seq libraries and computational analysis of the cut profiles have been used to infer protein occupancy in the genome at the nucleotide level, a method introduced as 'digital genomic footprinting'. The approach has been proposed as an attractive alternative to the analysis of transcription factors (TFs) by chromatin immunoprecipitation followed by sequencing (ChIP-seq), and in theory it should overcome antibody issues, poor resolution and batch effects. Recent reports point to limitations of the DNase-based genomic footprinting approach and call into question the scope of detectable protein occupancy, especially for TFs with short-lived chromatin binding. The genomics community is grappling with issues concerning the utility of genomic footprinting and is reassessing the proposed approaches in terms of robust deliverables. Here we summarize the consensus as well as different views emerging from recent reports, and we describe the remaining issues and hurdles for genomic footprinting.
|
39
|
Chaitankar V, Karakülah G, Ratnapriya R, Giuste FO, Brooks MJ, Swaroop A. Next generation sequencing technology and genomewide data analysis: Perspectives for retinal research. Prog Retin Eye Res 2016; 55:1-31. [PMID: 27297499 DOI: 10.1016/j.preteyeres.2016.06.001] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 03/15/2016] [Revised: 06/06/2016] [Accepted: 06/08/2016] [Indexed: 02/08/2023]
Abstract
The advent of high throughput next generation sequencing (NGS) has accelerated the pace of discovery of disease-associated genetic variants and genomewide profiling of expressed sequences and epigenetic marks, thereby permitting systems-based analyses of ocular development and disease. Rapid evolution of NGS and associated methodologies presents significant challenges in acquisition, management, and analysis of large data sets and for extracting biologically or clinically relevant information. Here we illustrate the basic design of commonly used NGS-based methods, specifically whole exome sequencing, transcriptome, and epigenome profiling, and provide recommendations for data analyses. We briefly discuss systems biology approaches for integrating multiple data sets to elucidate gene regulatory or disease networks. While we provide examples from the retina, the NGS guidelines reviewed here are applicable to other tissues/cell types as well.
Affiliation(s)
- Vijender Chaitankar
  - Neurobiology-Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, 6 Center Drive, Bethesda, MD, 20892-0610, USA
- Gökhan Karakülah
  - Neurobiology-Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, 6 Center Drive, Bethesda, MD, 20892-0610, USA
- Rinki Ratnapriya
  - Neurobiology-Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, 6 Center Drive, Bethesda, MD, 20892-0610, USA
- Felipe O Giuste
  - Neurobiology-Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, 6 Center Drive, Bethesda, MD, 20892-0610, USA
- Matthew J Brooks
  - Neurobiology-Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, 6 Center Drive, Bethesda, MD, 20892-0610, USA
- Anand Swaroop
  - Neurobiology-Neurodegeneration & Repair Laboratory, National Eye Institute, National Institutes of Health, 6 Center Drive, Bethesda, MD, 20892-0610, USA
|
40
|
Gusmao EG, Allhoff M, Zenke M, Costa IG. Analysis of computational footprinting methods for DNase sequencing experiments. Nat Methods 2016; 13:303-9. [PMID: 26901649 DOI: 10.1038/nmeth.3772] [Citation(s) in RCA: 98] [Impact Index Per Article: 12.3] [Received: 07/31/2015] [Accepted: 01/27/2016] [Indexed: 12/26/2022]
Abstract
DNase-seq allows nucleotide-level identification of transcription factor binding sites on the basis of a computational search of footprint-like DNase I cleavage patterns on the DNA. Frequently in high-throughput methods, experimental artifacts such as DNase I cleavage bias affect the computational analysis of DNase-seq experiments. Here we performed a comprehensive and systematic study on the performance of computational footprinting methods. We evaluated ten footprinting methods in a panel of DNase-seq experiments for their ability to recover cell-specific transcription factor binding sites. We show that three methods (HINT, DNase2TF and PIQ) consistently outperformed the other evaluated methods and that correcting the DNase-seq signal for experimental artifacts significantly improved the accuracy of computational footprints. We also propose a score that can be used to detect footprints arising from transcription factors with potentially short residence times.
Affiliation(s)
- Eduardo G Gusmao
  - IZKF Computational Biology Research Group, RWTH Aachen University Medical School, Aachen, Germany
  - Department of Cell Biology, Institute of Biomedical Engineering, RWTH Aachen University Medical School, Aachen, Germany
- Manuel Allhoff
  - IZKF Computational Biology Research Group, RWTH Aachen University Medical School, Aachen, Germany
  - Aachen Institute for Advanced Study in Computational Engineering Science (AICES), RWTH Aachen University, Aachen, Germany
- Martin Zenke
  - Department of Cell Biology, Institute of Biomedical Engineering, RWTH Aachen University Medical School, Aachen, Germany
- Ivan G Costa
  - IZKF Computational Biology Research Group, RWTH Aachen University Medical School, Aachen, Germany
  - Department of Cell Biology, Institute of Biomedical Engineering, RWTH Aachen University Medical School, Aachen, Germany
  - Aachen Institute for Advanced Study in Computational Engineering Science (AICES), RWTH Aachen University, Aachen, Germany
|
41
|
Lu L, Wang M, Mao Z, Kang TS, Chen XP, Lu JJ, Leung CH, Ma DL. A novel dinuclear iridium(III) complex as a G-quadruplex-selective probe for the luminescent switch-on detection of transcription factor HIF-1α. Sci Rep 2016; 6:22458. [PMID: 26932240 PMCID: PMC4773817 DOI: 10.1038/srep22458] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 12/29/2015] [Accepted: 02/15/2016] [Indexed: 12/18/2022] Open
Abstract
A novel dinuclear Ir(III) complex 5 was discovered to be specific to G-quadruplex DNA, and was utilized in a label-free G-quadruplex-based detection platform for transcription factor activity. The principle of this assay was demonstrated by using HIF-1α as a model protein. Moreover, this HIF-1α detection assay exhibited potential use for biological sample analysis.
Affiliation(s)
- Lihua Lu
  - Department of Chemistry, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
- Modi Wang
  - Department of Chemistry, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
- Zhifeng Mao
  - Department of Chemistry, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
- Tian-Shu Kang
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, Macao, China
- Xiu-Ping Chen
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, Macao, China
- Jin-Jian Lu
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, Macao, China
- Chung-Hang Leung
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, Macao, China
- Dik-Lung Ma
  - Department of Chemistry, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
|
42
|
Vierstra J, Stamatoyannopoulos JA. Genomic footprinting. Nat Methods 2016; 13:213-21. [DOI: 10.1038/nmeth.3768] [Citation(s) in RCA: 76] [Impact Index Per Article: 9.5] [Received: 09/14/2015] [Accepted: 01/13/2016] [Indexed: 01/08/2023]
|
43
|
Madrigal P. On Accounting for Sequence-Specific Bias in Genome-Wide Chromatin Accessibility Experiments: Recent Advances and Contradictions. Front Bioeng Biotechnol 2015; 3:144. [PMID: 26442258 PMCID: PMC4585268 DOI: 10.3389/fbioe.2015.00144] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 06/14/2015] [Accepted: 09/07/2015] [Indexed: 11/13/2022] Open
Affiliation(s)
- Pedro Madrigal
  - Wellcome Trust Sanger Institute, Cambridge, UK
  - Department of Surgery, University of Cambridge, Cambridge, UK
|
44
|
Abstract
Recent advances in experimental and computational methodologies are enabling ultra-high resolution genome-wide profiles of protein-DNA binding events. For example, the ChIP-exo protocol precisely characterizes protein-DNA cross-linking patterns by combining chromatin immunoprecipitation (ChIP) with 5' → 3' exonuclease digestion. Similarly, deeply sequenced chromatin accessibility assays (e.g. DNase-seq and ATAC-seq) enable the detection of protected footprints at protein-DNA binding sites. With these techniques and others, we have the potential to characterize the individual nucleotides that interact with transcription factors, nucleosomes, RNA polymerases and other regulatory proteins in a particular cellular context. In this review, we explain the experimental assays and computational analysis methods that enable high-resolution profiling of protein-DNA binding events. We discuss the challenges and opportunities associated with such approaches.
Affiliation(s)
- Shaun Mahony
  - Department of Biochemistry & Molecular Biology, Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA, USA
- B Franklin Pugh
  - Department of Biochemistry & Molecular Biology, Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA, USA
|
45
|
Wang C, Lv Y, Wang B, Yin C, Lin Y, Pan L. Survey of protein-DNA interactions in Aspergillus oryzae on a genomic scale. Nucleic Acids Res 2015; 43:4429-46. [PMID: 25883143 PMCID: PMC4482085 DOI: 10.1093/nar/gkv334] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 05/21/2014] [Accepted: 03/31/2015] [Indexed: 01/23/2023] Open
Abstract
The genome-scale delineation of in vivo protein–DNA interactions is key to understanding genome function. Only ∼5% of transcription factors (TFs) in the Aspergillus genus have been identified using traditional methods. Although the Aspergillus oryzae genome contains >600 TFs, knowledge of the in vivo genome-wide TF-binding sites (TFBSs) in aspergilli remains limited because of the lack of high-quality antibodies. We investigated the landscape of in vivo protein–DNA interactions across the A. oryzae genome through coupling the DNase I digestion of intact nuclei with massively parallel sequencing and the analysis of cleavage patterns in protein–DNA interactions at single-nucleotide resolution. The resulting map identified overrepresented de novo TF-binding motifs from genomic footprints, and provided the detailed chromatin remodeling patterns and the distribution of digital footprints near transcription start sites. The TFBSs of 19 known Aspergillus TFs were also identified based on DNase I digestion data surrounding potential binding sites in conjunction with TF binding specificity information. We observed that the cleavage patterns of TFBSs were dependent on the orientation of TF motifs and independent of strand orientation, consistent with the DNA shape features of binding motifs with flanking sequences.
Affiliation(s)
- Chao Wang
  - School of Bioscience and Bioengineering, South China University of Technology, Guangzhou, Guangdong, 510006, China
- Yangyong Lv
  - School of Bioscience and Bioengineering, South China University of Technology, Guangzhou, Guangdong, 510006, China
- Bin Wang
  - School of Bioscience and Bioengineering, South China University of Technology, Guangzhou, Guangdong, 510006, China
- Chao Yin
  - School of Bioscience and Bioengineering, South China University of Technology, Guangzhou, Guangdong, 510006, China
- Ying Lin
  - School of Bioscience and Bioengineering, South China University of Technology, Guangzhou, Guangdong, 510006, China
- Li Pan
  - School of Bioscience and Bioengineering, South China University of Technology, Guangzhou, Guangdong, 510006, China
|
46
|
Kähärä J, Lähdesmäki H. BinDNase: a discriminatory approach for transcription factor binding prediction using DNase I hypersensitivity data. Bioinformatics 2015; 31:2852-9. [DOI: 10.1093/bioinformatics/btv294] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.9] [Received: 10/10/2014] [Accepted: 05/04/2015] [Indexed: 01/09/2023] Open
|