1
Morgan D, DeMeo DL, Glass K. Using methylation data to improve transcription factor binding prediction. Epigenetics 2024; 19:2309826. [PMID: 38300850 PMCID: PMC10841018 DOI: 10.1080/15592294.2024.2309826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Accepted: 01/01/2024] [Indexed: 02/03/2024] Open
Abstract
Modelling the regulatory mechanisms that determine cell fate, response to external perturbation, and disease state depends on measuring many factors, a task made more difficult by the plasticity of the epigenome. Scanning the genome for the sequence patterns defined by Position Weight Matrices (PWM) can be used to estimate transcription factor (TF) binding locations. However, this approach does not incorporate information regarding the epigenetic context necessary for TF binding. CpG methylation is an epigenetic mark influenced by environmental factors that is commonly assayed in human cohort studies. We developed a framework to score inferred TF binding locations using methylation data. We intersected motif locations identified using PWMs with methylation information captured in both whole-genome bisulfite sequencing and Illumina EPIC array data for six cell lines, scored motif locations based on these data, and compared with experimental data characterizing TF binding (ChIP-seq). We found that for most TFs, binding prediction improves using methylation-based scoring compared to standard PWM-scores. We also illustrate that our approach can be generalized to infer TF binding when methylation information is only proximally available, i.e. measured for nearby CpGs that do not directly overlap with a motif location. Overall, our approach provides a framework for inferring context-specific TF binding using methylation data. Importantly, the availability of DNA methylation data in existing patient populations provides an opportunity to use our approach to understand the impact of methylation on gene regulatory processes in the context of human disease.
Affiliation(s)
- Daniel Morgan
  - Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA
- Dawn L. DeMeo
  - Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA
- Kimberly Glass
  - Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA
  - Department of Biostatistics, Harvard Chan School of Public Health, Boston, MA, USA
2
Multi-Cell-Type Openness-Weighted Association Studies for Trait-Associated Genomic Segments Prioritization. Genes (Basel) 2022; 13:1220. [PMID: 35886003 PMCID: PMC9323627 DOI: 10.3390/genes13071220] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2022] [Revised: 06/30/2022] [Accepted: 07/03/2022] [Indexed: 02/01/2023] Open
Abstract
Openness-weighted association study (OWAS) is a method that leverages the in silico prediction of chromatin accessibility to prioritize genome-wide association studies (GWAS) signals, and can provide novel insights into the roles of non-coding variants in complex diseases. A prerequisite to apply OWAS is to choose a trait-related cell type beforehand. However, for most complex traits, the trait-relevant cell types remain elusive. In addition, many complex traits involve multiple related cell types. To address these issues, we develop OWAS-joint, an efficient framework that aggregates predicted chromatin accessibility across multiple cell types, to prioritize disease-associated genomic segments. In simulation studies, we demonstrate that OWAS-joint achieves a greater statistical power compared to OWAS. Moreover, the heritability explained by OWAS-joint segments is higher than or comparable to OWAS segments. OWAS-joint segments also have high replication rates in independent replication cohorts. Applying the method to six complex human traits, we demonstrate the advantages of OWAS-joint over a single-cell-type OWAS approach. We highlight that OWAS-joint enhances the biological interpretation of disease mechanisms, especially for non-coding regions.
3
Awdeh A, Turcotte M, Perkins TJ. WACS: improving ChIP-seq peak calling by optimally weighting controls. BMC Bioinformatics 2021; 22:69. [PMID: 33588754 PMCID: PMC7885521 DOI: 10.1186/s12859-020-03927-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/04/2019] [Accepted: 12/09/2020] [Indexed: 01/21/2023] Open
Abstract
Background Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq), initially introduced more than a decade ago, is widely used by the scientific community to detect protein/DNA binding and histone modifications across the genome. Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. To alleviate bias, the incorporation of control datasets in ChIP-seq analysis is an essential step. The controls are used to account for the background signal, while the remainder of the ChIP-seq signal captures true binding or histone modification. However, a recurrent issue is different types of bias in different ChIP-seq experiments. Depending on which controls are used, different aspects of ChIP-seq bias are better or worse accounted for, and peak calling can produce different results for the same ChIP-seq experiment. Consequently, generating “smart” controls, which model the non-signal effect for a specific ChIP-seq experiment, could enhance contrast and increase the reliability and reproducibility of the results. Result We propose a peak calling algorithm, Weighted Analysis of ChIP-seq (WACS), which is an extension of the well-known peak caller MACS2. There are two main steps in WACS: First, weights are estimated for each control using non-negative least squares regression. The goal is to customize controls to model the noise distribution for each ChIP-seq experiment. This is then followed by peak calling. We demonstrate that WACS significantly outperforms MACS2 and AIControl, another recent algorithm for generating smart controls, in the detection of enriched regions along the genome, in terms of motif enrichment and reproducibility analyses. Conclusions This ultimately improves our understanding of ChIP-seq controls and their biases, and shows that WACS results in a better approximation of the noise distribution in controls.
Affiliation(s)
- Aseel Awdeh
  - School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, K1N6N5, Canada
  - Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, K1H8L6, Canada
- Marcel Turcotte
  - School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, K1N6N5, Canada
- Theodore J Perkins
  - School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, K1N6N5, Canada
  - Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, K1H8L6, Canada
  - Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, K1H8M5, Canada
4
Chitpin JG, Awdeh A, Perkins TJ. RECAP reveals the true statistical significance of ChIP-seq peak calls. Bioinformatics 2020; 35:3592-3598. [PMID: 30824903 PMCID: PMC6761936 DOI: 10.1093/bioinformatics/btz150] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 09/03/2018] [Revised: 01/18/2019] [Accepted: 02/27/2019] [Indexed: 12/29/2022] Open
Abstract
Motivation Chromatin Immunoprecipitation (ChIP)-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. A key step in ChIP-seq analysis is peak calling, where genomic regions enriched for ChIP versus control reads are identified. Many programs have been designed to solve this task, but nearly all fall into the statistical trap of using the data twice—once to determine candidate enriched regions, and again to assess enrichment by classical statistical hypothesis testing. This double use of the data invalidates the statistical significance assigned to enriched regions, thus the true significance or reliability of peak calls remains unknown. Results Using simulated and real ChIP-seq data, we show that three well-known peak callers, MACS, SICER and diffReps, output biased P-values and false discovery rate estimates that can be many orders of magnitude too optimistic. We propose a wrapper algorithm, RECAP, that uses resampling of ChIP-seq and control data to estimate a monotone transform correcting for biases built into peak calling algorithms. When applied to null hypothesis data, where there is no enrichment between ChIP-seq and control, P-values recalibrated by RECAP are approximately uniformly distributed. On data where there is genuine enrichment, RECAP P-values give a better estimate of the true statistical significance of candidate peaks and better false discovery rate estimates, which correlate better with empirical reproducibility. RECAP is a powerful new tool for assessing the true statistical significance of ChIP-seq peak calls. Availability and implementation The RECAP software is available through www.perkinslab.ca or on github at https://github.com/theodorejperkins/RECAP. Supplementary information Supplementary data are available at Bioinformatics online.
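The idea of a monotone recalibration of P-values can be illustrated with a simplified sketch. It assumes that a set of P-values produced under the null (for example, from resampled data with no true enrichment) is already in hand, and it maps each observed P-value to its empirical rank among them. The function name `recalibrate` and the add-one correction are illustrative assumptions; this is not the RECAP implementation, which is available at the repository named in the abstract.

```python
from bisect import bisect_right


def recalibrate(observed_p, null_p):
    """Map each observed P-value to its empirical rank among null P-values.

    The mapping is non-decreasing in the observed P-value, so it preserves
    the ranking of peaks while rescaling their significance. The add-one
    correction keeps recalibrated values strictly above zero.
    """
    null_sorted = sorted(null_p)
    n = len(null_sorted)
    recalibrated = []
    for p in observed_p:
        # Count null P-values that are at least as extreme (<= p).
        rank = bisect_right(null_sorted, p)
        recalibrated.append((rank + 1) / (n + 1))
    return recalibrated
```

With hypothetical null P-values `[0.001, 0.01, 0.1, 0.5]`, an observed P-value of 0.005 is exceeded in extremity by one null value and is recalibrated to 2/5, which shows how an optimistic nominal P-value can become much less significant once the null distribution of the peak caller's own output is taken into account.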
Affiliation(s)
- Justin G Chitpin
  - Translational and Molecular Medicine Program, University of Ottawa, Ottawa, ON K1H8M5, Canada
  - Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, ON K1H8L6, Canada
- Aseel Awdeh
  - Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, ON K1H8L6, Canada
  - School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N6N5, Canada
- Theodore J Perkins
  - Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, ON K1H8L6, Canada
  - School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N6N5, Canada
  - Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, ON K1H8M5, Canada
5
Schmidt F, Kern F, Schulz MH. Integrative prediction of gene expression with chromatin accessibility and conformation data. Epigenetics Chromatin 2020; 13:4. [PMID: 32029002 PMCID: PMC7003490 DOI: 10.1186/s13072-020-0327-0] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 07/13/2019] [Accepted: 01/06/2020] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Enhancers play a fundamental role in orchestrating cell state and development. Although several methods have been developed to identify enhancers, linking them to their target genes is still an open problem. Several theories have been proposed on the functional mechanisms of enhancers, which triggered the development of various methods to infer promoter-enhancer interactions (PEIs). The advancement of high-throughput techniques describing the three-dimensional organization of the chromatin, paved the way to pinpoint long-range PEIs. Here we investigated whether including PEIs in computational models for the prediction of gene expression improves performance and interpretability. RESULTS We have extended our [Formula: see text] framework to include DNA contacts deduced from chromatin conformation capture experiments and compared various methods to determine PEIs using predictive modelling of gene expression from chromatin accessibility data and predicted transcription factor (TF) motif data. We designed a novel machine learning approach that allows the prioritization of TFs binding to distal loop and promoter regions with respect to their importance for gene expression regulation. Our analysis revealed a set of core TFs that are part of enhancer-promoter loops involving YY1 in different cell lines. CONCLUSION We present a novel approach that can be used to prioritize TFs involved in distal and promoter-proximal regulatory events by integrating chromatin accessibility, conformation, and gene expression data. We show that the integration of chromatin conformation data can improve gene expression prediction and aids model interpretability.
Affiliation(s)
- Florian Schmidt
  - High-throughput Genomics & Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Computational Biology & Applied Algorithmics, Max-Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Center for Bioinformatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Genome Institute of Singapore, A*STAR, 60 Biopolis Street, Singapore, 138672 Singapore
- Fabian Kern
  - High-throughput Genomics & Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Center for Bioinformatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Chair for Clinical Bioinformatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Marcel H. Schulz
  - High-throughput Genomics & Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Computational Biology & Applied Algorithmics, Max-Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Center for Bioinformatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Institute of Cardiovascular Regeneration, Goethe-University, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany
  - German Center for Cardiovascular Research, Partner Site Rhein-Main, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany
6
Hiranuma N, Lundberg SM, Lee SI. AIControl: replacing matched control experiments with machine learning improves ChIP-seq peak identification. Nucleic Acids Res 2019; 47:e58. [PMID: 30869146 PMCID: PMC6547432 DOI: 10.1093/nar/gkz156] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/16/2018] [Revised: 02/15/2019] [Accepted: 02/28/2019] [Indexed: 01/24/2023] Open
Abstract
ChIP-seq is a technique to determine binding locations of transcription factors, which remains a central challenge in molecular biology. Current practice is to use a 'control' dataset to remove background signals from an immunoprecipitation (IP) 'target' dataset. We introduce the AIControl framework, which eliminates the need to obtain a control dataset and instead identifies binding peaks by estimating the distributions of background signals from many publicly available control ChIP-seq datasets. We thereby avoid the cost of running control experiments while simultaneously increasing the accuracy of binding location identification. Specifically, AIControl can (i) estimate background signals at fine resolution, (ii) systematically weigh the most appropriate control datasets in a data-driven way, (iii) capture sources of potential biases that may be missed by one control dataset and (iv) remove the need for costly and time-consuming control experiments. We applied AIControl to 410 IP datasets in the ENCODE ChIP-seq database, using 440 control datasets from 107 cell types to impute background signal. Without using matched control datasets, AIControl identified peaks that were more enriched for putative binding sites than those identified by other popular peak callers that used a matched control dataset. We also demonstrated that our framework identifies binding sites that recover documented protein interactions more accurately.
Affiliation(s)
- Naozumi Hiranuma
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, WA, USA, 98195-2350
- Scott M Lundberg
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, WA, USA, 98195-2350
- Su-In Lee
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, WA, USA, 98195-2350
7
Schmidt F, Schulz MH. On the problem of confounders in modeling gene expression. Bioinformatics 2019; 35:711-719. [PMID: 30084962 PMCID: PMC6530814 DOI: 10.1093/bioinformatics/bty674] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/18/2018] [Revised: 06/21/2018] [Accepted: 08/02/2018] [Indexed: 01/01/2023] Open
Abstract
Motivation Modeling of Transcription Factor (TF) binding from both ChIP-seq and chromatin accessibility data has become prevalent in computational biology. Several models have been proposed to generate new hypotheses on transcriptional regulation. However, there is no distinct approach to derive TF binding scores from ChIP-seq and open chromatin experiments. Here, we review biases of various scoring approaches and their effects on the interpretation and reliability of predictive gene expression models. Results We generated predictive models for gene expression using ChIP-seq and DNase1-seq data from DEEP and ENCODE. Via randomization experiments, we identified confounders in TF gene scores derived from both ChIP-seq and DNase1-seq data. We reviewed correction approaches for both data types, which reduced the influence of identified confounders without harm to model performance. Also, our analyses highlighted further quality control measures, in addition to model performance, that may help to assure model reliability and to avoid misinterpretation in future studies. Availability and implementation The software used in this study is available online at https://github.com/SchulzLab/TEPIC. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Florian Schmidt
  - High-throughput Genomics and Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, Saarbrücken, Germany
  - Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
  - Graduate School for Computer Science, Saarland Informatics Campus, Saarbrücken, Germany
- Marcel H Schulz
  - High-throughput Genomics and Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, Saarbrücken, Germany
  - Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
8
Soleimani VD, Nguyen D, Ramachandran P, Palidwor GA, Porter CJ, Yin H, Perkins TJ, Rudnicki MA. Cis-regulatory determinants of MyoD function. Nucleic Acids Res 2019; 46:7221-7235. [PMID: 30016497 PMCID: PMC6101602 DOI: 10.1093/nar/gky388] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/13/2016] [Accepted: 04/30/2018] [Indexed: 01/06/2023] Open
Abstract
Muscle-specific transcription factor MyoD orchestrates the myogenic gene expression program by binding to short DNA motifs called E-boxes within myogenic cis-regulatory elements (CREs). Genome-wide analyses of the MyoD cistrome by chromatin immunoprecipitation sequencing show that MyoD-bound CREs contain multiple E-boxes of various sequences. However, how E-box numbers, sequences and their spatial arrangement within CREs collectively regulate the binding affinity and transcriptional activity of MyoD remains largely unknown. Here, by an integrative analysis of the MyoD cistrome combined with genome-wide analysis of key regulatory histones and gene expression data, we show that the affinity landscape of MyoD is driven by multiple E-boxes, and that the overall binding affinity—and associated nucleosome positioning and epigenetic features of the CREs—crucially depend on the variant sequences and positioning of the E-boxes within the CREs. By comparative genomic analysis of single nucleotide polymorphisms (SNPs) across publicly available data from 17 strains of laboratory mice, we show that variant sequences within the MyoD-bound motifs, but not their genome-wide counterparts, are under selection. Finally, we show that the quantitative regulatory effect of MyoD binding on the nearby genes can, in part, be predicted by the motif composition of the CREs to which it binds. Taken together, our data suggest that motif numbers, sequences and their spatial arrangement within the myogenic CREs are important determinants of the cis-regulatory code of myogenic CREs.
Affiliation(s)
- Vahab D Soleimani
  - Department of Human Genetics, McGill University, Montréal, QC H3A 1B1, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montréal, QC H3T 1E2, Canada
- Duy Nguyen
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montréal, QC H3T 1E2, Canada
- Parameswaran Ramachandran
  - Sprott Centre for Stem Cell Research, Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, ON K1H 8L6, Canada
- Gareth A Palidwor
  - Sprott Centre for Stem Cell Research, Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, ON K1H 8L6, Canada
- Christopher J Porter
  - Sprott Centre for Stem Cell Research, Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, ON K1H 8L6, Canada
- Hang Yin
  - Center for Molecular Medicine, Department of Biochemistry and Molecular Biology, University of Georgia, GA 30602, USA
- Theodore J Perkins
  - Sprott Centre for Stem Cell Research, Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, ON K1H 8L6, Canada
- Michael A Rudnicki
  - Sprott Centre for Stem Cell Research, Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, ON K1H 8L6, Canada
  - Department of Medicine, University of Ottawa, Ottawa, ON K1H 8M5, Canada
9
Martin RC, Vining K, Dombrowski JE. Genome-wide (ChIP-seq) identification of target genes regulated by BdbZIP10 during paraquat-induced oxidative stress. BMC Plant Biology 2018; 18:58. [PMID: 29636001 PMCID: PMC5894230 DOI: 10.1186/s12870-018-1275-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 09/15/2017] [Accepted: 03/29/2018] [Indexed: 05/12/2023]
Abstract
BACKGROUND bZIP transcription factors play a significant role in many aspects of plant growth and development and also play critical regulatory roles during plant responses to various stresses. Overexpression of the Brachypodium bZIP10 (Bradi1g30140) transcription factor conferred enhanced oxidative stress tolerance and increased viability when plants or cells were exposed to the herbicide paraquat. To gain a better understanding of genes involved in bZIP10 conferred oxidative stress tolerance, chromatin immunoprecipitation followed by high throughput sequencing (ChIP-Seq) was performed on BdbZIP10 overexpressing plants in the presence of oxidative stress. RESULTS We identified a transcription factor binding motif, TGDCGACA, different from most known bZIP TF motifs but with strong homology to the Arabidopsis zinc deficiency response element. Analysis of the immunoprecipitated sequences revealed an enrichment of gene ontology groups with metal ion transmembrane transporter, transferase, catalytic and binding activities. Functional categories including kinases and phosphotransferases, cation/ion transmembrane transporters, transferases (phosphorus-containing and glycosyl groups), and some nucleoside/nucleotide binding activities were also enriched. CONCLUSIONS Brachypodium bZIP10 is involved in zinc homeostasis, as it relates to oxidative stress.
Affiliation(s)
- Ruth C. Martin
  - USDA ARS National Forage Seed and Cereal Research Unit, 3450 SW Campus Way, Corvallis, OR 97330 USA
- Kelly Vining
  - Department of Horticulture, 4123 Agricultural & Life Sciences, Oregon State University, Corvallis, OR 97330 USA
- James E. Dombrowski
  - USDA ARS National Forage Seed and Cereal Research Unit, 3450 SW Campus Way, Corvallis, OR 97330 USA
10
Batmanov K, Wang J. Predicting Variation of DNA Shape Preferences in Protein-DNA Interaction in Cancer Cells with a New Biophysical Model. Genes (Basel) 2017; 8:E233. [PMID: 28927002 PMCID: PMC5615366 DOI: 10.3390/genes8090233] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 07/31/2017] [Revised: 09/13/2017] [Accepted: 09/13/2017] [Indexed: 11/30/2022] Open
Abstract
DNA shape readout is an important mechanism of transcription factor target site recognition, in addition to the sequence readout. Several machine learning-based models of transcription factor-DNA interactions, considering DNA shape features, have been developed in recent years. Here, we present a new biophysical model of protein-DNA interactions by integrating the DNA shape properties. It is based on the neighbor dinucleotide dependency model BayesPI2, where new parameters are restricted to a subspace spanned by the dinucleotide form of DNA shape features. This allows a biophysical interpretation of the new parameters as a position-dependent preference towards specific DNA shape features. Using the new model, we explore the variation of DNA shape preferences in several transcription factors across various cancer cell lines and cellular conditions. The results reveal that there are DNA shape variations at FOXA1 (Forkhead Box Protein A1) binding sites in steroid-treated MCF7 cells. The new biophysical model is useful for elucidating the finer details of transcription factor-DNA interaction, as well as for predicting cancer mutation effects in the future.
Affiliation(s)
- Kirill Batmanov
  - Department of Pathology, Oslo University Hospital-Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway
- Junbai Wang
  - Department of Pathology, Oslo University Hospital-Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway
11
Correcting nucleotide-specific biases in high-throughput sequencing data. BMC Bioinformatics 2017; 18:357. [PMID: 28764645 PMCID: PMC5540620 DOI: 10.1186/s12859-017-1766-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 02/09/2017] [Accepted: 07/19/2017] [Indexed: 01/07/2023] Open
Abstract
Background High-throughput sequence (HTS) data exhibit position-specific nucleotide biases that obscure the intended signal and reduce the effectiveness of these data for downstream analyses. These biases are particularly evident in HTS assays for identifying regulatory regions in DNA (DNase-seq, ChIP-seq, FAIRE-seq, ATAC-seq). Biases may result from many experiment-specific factors, including selectivity of DNA restriction enzymes and fragmentation method, as well as sequencing technology-specific factors, such as choice of adapters/primers and sample amplification methods. Results We present a novel method to detect and correct position-specific nucleotide biases in HTS short read data. Our method calculates read-specific weights based on aligned reads to correct the over- or underrepresentation of position-specific nucleotide subsequences, both within and adjacent to the aligned read, relative to a baseline calculated in assay-specific enriched regions. Using HTS data from a variety of ChIP-seq, DNase-seq, FAIRE-seq, and ATAC-seq experiments, we show that our weight-adjusted reads reduce the position-specific nucleotide imbalance across reads and improve the utility of these data for downstream analyses, including identification and characterization of open chromatin peaks and transcription-factor binding sites. Conclusions A general-purpose method to characterize and correct position-specific nucleotide sequence biases fills the need to recognize and deal with, in a systematic manner, binding-site preference for the growing number of HTS-based epigenetic assays. As the breadth and impact of these biases are better understood, the availability of a standard toolkit to correct them will be important. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1766-x) contains supplementary material, which is available to authorized users.
12
Dissecting chromatin-mediated gene regulation and epigenetic memory through mathematical modelling. 2017. [DOI: 10.1016/j.coisb.2017.02.003] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Indexed: 01/13/2023]