1
|
Integrative annotation scores of variants for impact on RNA binding protein activities. Bioinformatics 2024; 40:btae181. [PMID: 38640488 DOI: 10.1093/bioinformatics/btae181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 02/09/2024] [Accepted: 04/17/2024] [Indexed: 04/21/2024]
Abstract
MOTIVATION The ENCODE project generated a large collection of eCLIP-seq RNA binding protein (RBP) profiling data with accompanying RNA-seq transcriptomes of shRNA knockdown of RBPs. These data could have utility in understanding the functional impact of genetic variants; however, their potential has not been fully exploited. We implement INCA (Integrative annotation scores of variants for impact on RBP activities) as a multi-step genetic variant scoring approach that leverages the ENCODE RBP data together with ClinVar and integrates multiple computational approaches to aggregate evidence. RESULTS INCA evaluates variant impacts on RBP activities by leveraging genotypic differences in cell lines used for eCLIP-seq. We show that INCA provides critical specificity, beyond generic scoring for RBP binding disruption, for candidate variants and their linkage-disequilibrium partners. As a result, it can, on average, augment scoring of 46.2% of the candidate variants beyond generic scoring for RBP binding disruption and aid in variant prioritization for follow-up analysis. AVAILABILITY AND IMPLEMENTATION INCA is implemented in R and is available at https://github.com/keleslab/INCA.
|
2
|
PUF partner interactions at a conserved interface shape the RNA-binding landscape and cell fate in Caenorhabditis elegans. Dev Cell 2024; 59:661-675.e7. [PMID: 38290520 DOI: 10.1016/j.devcel.2024.01.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Revised: 11/10/2023] [Accepted: 01/08/2024] [Indexed: 02/01/2024]
Abstract
Protein-RNA regulatory networks underpin much of biology. C. elegans FBF-2, a PUF-RNA-binding protein, binds over 1,000 RNAs to govern stem cells and differentiation. FBF-2 interacts with multiple protein partners via a key tyrosine, Y479. Here, we investigate the in vivo significance of partnerships using a Y479A mutant. Occupancy of the Y479A mutant protein increases or decreases at specific sites across the transcriptome, varying with RNAs. Germline development also changes in a specific fashion: Y479A abolishes one FBF-2 function-the sperm-to-oocyte cell fate switch. Y479A's effects on the regulation of one mRNA, gld-1, are critical to this fate change, though other network changes are also important. FBF-2 switches from repression to activation of gld-1 RNA, likely by distinct FBF-2 partnerships. The role of RNA-binding protein partnerships in governing RNA regulatory networks will likely extend broadly, as such partnerships pervade RNA controls in virtually all metazoan tissues and species.
|
3
|
Whole genome methylation sequencing in blood identifies extensive differential DNA methylation in late-onset dementia due to Alzheimer's disease. Alzheimers Dement 2024; 20:1050-1062. [PMID: 37856321 PMCID: PMC10916976 DOI: 10.1002/alz.13514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2023] [Revised: 08/17/2023] [Accepted: 09/25/2023] [Indexed: 10/21/2023]
Abstract
INTRODUCTION DNA microarray-based studies report differentially methylated positions (DMPs) in blood between late-onset dementia due to Alzheimer's disease (AD) and cognitively unimpaired individuals, but interrogate < 4% of the genome. METHODS We used whole genome methylation sequencing (WGMS) to quantify DNA methylation levels at 25,409,826 CpG loci in 281 blood samples from 108 AD and 173 cognitively unimpaired individuals. RESULTS WGMS identified 28,038 DMPs throughout the human methylome, including 2707 differentially methylated genes (e.g., SORCS3, GABA, and PICALM) encoding proteins in biological pathways relevant to AD such as synaptic membrane, cation channel complex, and glutamatergic synapse. One hundred seventy-three differentially methylated blood-specific enhancers interact with the promoters of 95 genes that are differentially expressed in blood from persons with and without AD. DISCUSSION WGMS identifies differentially methylated CpGs in known and newly detected genes and enhancers in blood from persons with and without AD. HIGHLIGHTS Whole genome DNA methylation levels were quantified in blood from persons with and without Alzheimer's disease (AD). Twenty-eight thousand thirty-eight differentially methylated positions (DMPs) were identified. Two thousand seven hundred seven genes comprise DMPs. Forty-eight of 75 independent genetic risk loci for AD have DMPs. One thousand five hundred sixty-eight blood-specific enhancers comprise DMPs, 173 of which interact with the promoters of 95 genes that are differentially expressed in blood from persons with and without AD.
|
4
|
MuDCoD: multi-subject community detection in personalized dynamic gene networks from single-cell RNA sequencing. Bioinformatics 2023; 39:btad592. [PMID: 37740957 PMCID: PMC10564618 DOI: 10.1093/bioinformatics/btad592] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2023] [Revised: 08/24/2023] [Accepted: 09/21/2023] [Indexed: 09/25/2023]
Abstract
MOTIVATION With the wide availability of single-cell RNA-seq (scRNA-seq) technology, population-scale scRNA-seq datasets across multiple individuals and time points are emerging. While the initial investigations of these datasets tend to focus on standard analysis of clustering and differential expression, leveraging the power of scRNA-seq data at the personalized dynamic gene co-expression network level has the potential to unlock subject and/or time-specific network-level variation, which is critical for understanding phenotypic differences. Community detection from co-expression networks of multiple time points or conditions has been well-studied; however, none of the existing settings included networks from multiple subjects and multiple time points simultaneously. To address this, we develop Multi-subject Dynamic Community Detection (MuDCoD) for multi-subject community detection in personalized dynamic gene networks from scRNA-seq. MuDCoD builds on the spectral clustering framework and promotes information sharing among the networks of the subjects as well as networks at different time points. It clusters genes in the personalized dynamic gene networks and reveals gene communities that are variable or shared not only across time but also among subjects. RESULTS Evaluation and benchmarking of MuDCoD against existing approaches reveal that MuDCoD effectively leverages apparent shared signals among networks of the subjects at individual time points, and performs robustly when there is no or little information sharing among the networks. Applications to population-scale scRNA-seq datasets of human-induced pluripotent stem cells during dopaminergic neuron differentiation and CD4+ T cell activation indicate that MuDCoD enables robust inference for identifying time-varying personalized gene modules. Our results illustrate how personalized dynamic community detection can aid in the exploration of subject-specific biological processes that vary across time. 
AVAILABILITY AND IMPLEMENTATION MuDCoD is publicly available at https://github.com/bo1929/MuDCoD as a Python package. Implementation includes simulation and real-data experiments together with extensive documentation.
|
5
|
Debiased personalized gene coexpression networks for population-scale scRNA-seq data. Genome Res 2023; 33:932-947. [PMID: 37295843 PMCID: PMC10519377 DOI: 10.1101/gr.277363.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2022] [Accepted: 06/07/2023] [Indexed: 06/12/2023]
Abstract
Population-scale single-cell RNA-seq (scRNA-seq) data sets create unique opportunities for quantifying expression variation across individuals at the gene coexpression network level. Estimation of coexpression networks is well established for bulk RNA-seq; however, single-cell measurements pose novel challenges owing to technical limitations and noise levels of this technology. Gene-gene correlation estimates from scRNA-seq tend to be severely biased toward zero for genes with low and sparse expression. Here, we present Dozer to debias gene-gene correlation estimates from scRNA-seq data sets and accurately quantify network-level variation across individuals. Dozer corrects correlation estimates in the general Poisson measurement model and provides a metric to quantify genes measured with high noise. Computational experiments establish that Dozer estimates are robust to mean expression levels of the genes and the sequencing depths of the data sets. Compared with alternatives, Dozer results in fewer false-positive edges in the coexpression networks, yields more accurate estimates of network centrality measures and modules, and improves the faithfulness of networks estimated from separate batches of the data sets. We showcase unique analyses enabled by Dozer in two population-scale scRNA-seq applications. Coexpression network-based centrality analysis of multiple differentiating human induced pluripotent stem cell (iPSC) lines yields biologically coherent gene groups that are associated with iPSC differentiation efficiency. Application with population-scale scRNA-seq of oligodendrocytes from postmortem human tissues of Alzheimer's disease and controls uniquely reveals coexpression modules of innate immune response with distinct coexpression levels between the diagnoses. Dozer represents an important advance in estimating personalized coexpression networks from scRNA-seq data.
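The attenuation toward zero described in this abstract can be illustrated with a small simulation. The sketch below is not Dozer's estimator. It is a generic method-of-moments correction under an assumed Poisson measurement model, in which counts are Poisson with mean equal to a per-cell size factor times a latent expression level. The function name `naive_and_corrected_correlation` and all simulation settings are hypothetical choices made for illustration.

```python
import numpy as np

def naive_and_corrected_correlation(counts, size_factors):
    """Correlation of two genes before and after removing Poisson noise variance.

    Assumes counts[i, j] ~ Poisson(size_factors[i] * latent[i, j]), with the
    latent expression independent of the size factors (an assumption of this sketch).
    """
    s = np.asarray(size_factors, dtype=float)
    y = np.asarray(counts, dtype=float) / s[:, None]  # depth-normalized counts
    cov = np.cov(y, rowvar=False)
    naive = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    # Under the assumed model, Var(y_j) = Var(latent_j) + mean_j * E[1 / s],
    # while the covariance between genes is unaffected by the Poisson noise.
    noise_var = y.mean(axis=0) * np.mean(1.0 / s)
    signal_var = np.clip(np.diag(cov) - noise_var, 1e-12, None)
    corrected = cov[0, 1] / np.sqrt(signal_var[0] * signal_var[1])
    return float(naive), float(np.clip(corrected, -1.0, 1.0))

# Simulate two lowly expressed, correlated genes across many cells.
rng = np.random.default_rng(0)
n_cells = 50_000
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=n_cells)
latent = np.exp(-1.0 + 0.7 * z)                      # latent expression levels
size_factors = rng.uniform(0.5, 1.5, size=n_cells)   # per-cell sequencing depth
counts = rng.poisson(size_factors[:, None] * latent)

true_corr = np.corrcoef(latent, rowvar=False)[0, 1]
naive, corrected = naive_and_corrected_correlation(counts, size_factors)
print(f"latent: {true_corr:.3f}  naive: {naive:.3f}  corrected: {corrected:.3f}")
```

In this setting the naive estimate falls well below the latent correlation because Poisson noise inflates each gene's variance but not the covariance between genes. Subtracting the estimated noise variance recovers most of the gap.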
|
6
|
Dozer: Debiased personalized gene co-expression networks for population-scale scRNA-seq data. bioRxiv 2023:2023.04.25.538290. [PMID: 37163070 PMCID: PMC10168282 DOI: 10.1101/2023.04.25.538290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
Population-scale single cell RNA-seq (scRNA-seq) datasets create unique opportunities for quantifying expression variation across individuals at the gene co-expression network level. Estimation of co-expression networks is well-established for bulk RNA-seq; however, single-cell measurements pose novel challenges due to technical limitations and noise levels of this technology. Gene-gene correlation estimates from scRNA-seq tend to be severely biased towards zero for genes with low and sparse expression. Here, we present Dozer to debias gene-gene correlation estimates from scRNA-seq datasets and accurately quantify network level variation across individuals. Dozer corrects correlation estimates in the general Poisson measurement model and provides a metric to quantify genes measured with high noise. Computational experiments establish that Dozer estimates are robust to mean expression levels of the genes and the sequencing depths of the datasets. Compared to alternatives, Dozer results in fewer false positive edges in the co-expression networks, yields more accurate estimates of network centrality measures and modules, and improves the faithfulness of networks estimated from separate batches of the datasets. We showcase unique analyses enabled by Dozer in two population-scale scRNA-seq applications. Co-expression network-based centrality analysis of multiple differentiating human induced pluripotent stem cell (iPSC) lines yields biologically coherent gene groups that are associated with iPSC differentiation efficiency. Application with population-scale scRNA-seq of oligodendrocytes from postmortem human tissues of Alzheimer disease and controls uniquely reveals co-expression modules of innate immune response with markedly different co-expression levels between the diagnoses. Dozer represents an important advance in estimating personalized co-expression networks from scRNA-seq data.
|
7
|
AdaLiftOver: high-resolution identification of orthologous regulatory elements with Adaptive liftOver. Bioinformatics 2023; 39:btad149. [PMID: 37004197 PMCID: PMC10085516 DOI: 10.1093/bioinformatics/btad149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2022] [Revised: 03/02/2023] [Accepted: 03/20/2023] [Indexed: 04/03/2023]
Abstract
MOTIVATION Elucidating functionally similar orthologous regulatory regions for human and model organism genomes is critical for exploiting model organism research and advancing our understanding of results from genome-wide association studies (GWAS). Sequence conservation is the de facto approach for finding orthologous non-coding regions between human and model organism genomes. However, existing methods for mapping non-coding genomic regions across species are challenged by multi-mapping, low precision, and low mapping rates. RESULTS We develop Adaptive liftOver (AdaLiftOver), a large-scale computational tool for identifying functionally similar orthologous non-coding regions across species. AdaLiftOver builds on the UCSC liftOver framework to extend the query regions and prioritizes the resulting candidate target regions based on the conservation of the epigenomic and the sequence grammar features. Evaluations of AdaLiftOver with multiple case studies, spanning both genomic intervals from epigenome datasets across a wide range of model organisms and GWAS SNPs, establish AdaLiftOver as a versatile method for deriving hard-to-obtain human epigenome datasets as well as reliably identifying orthologous loci for GWAS SNPs. AVAILABILITY AND IMPLEMENTATION The R package and the data for AdaLiftOver are available from https://github.com/keleslab/AdaLiftOver.
|
8
|
Joint tensor modeling of single cell 3D genome and epigenetic data with Muscle. bioRxiv 2023:2023.01.27.525871. [PMID: 36747701 PMCID: PMC9900892 DOI: 10.1101/2023.01.27.525871] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/30/2023]
Abstract
Emerging single cell technologies that simultaneously capture long-range interactions of genomic loci together with their DNA methylation levels are advancing our understanding of three-dimensional genome structure and its interplay with the epigenome at the single cell level. While methods to analyze data from single cell high throughput chromatin conformation capture (scHi-C) experiments are maturing, methods that can jointly analyze multiple single cell modalities with scHi-C data are lacking. Here, we introduce Muscle, a semi-nonnegative joint decomposition of Multiple single cell tensors, to jointly analyze 3D conformation and DNA methylation data at the single cell level. Muscle takes advantage of the inherent tensor structure of the scHi-C data, and integrates this modality with DNA methylation. We developed an alternating least squares algorithm for estimating Muscle parameters and established its optimality properties. Parameters estimated by Muscle directly align with the key components of the downstream analysis of scHi-C data in a cell type specific manner. Evaluations with data-driven experiments and simulations demonstrate the advantages of the joint modeling framework of Muscle over single modality modeling or a baseline multi modality modeling for cell type delineation and elucidating associations between modalities. Muscle is publicly available at https://github.com/keleslab/muscle.
|
9
|
Normalization and de-noising of single-cell Hi-C data with BandNorm and scVI-3D. Genome Biol 2022; 23:222. [PMID: 36253828 PMCID: PMC9575231 DOI: 10.1186/s13059-022-02774-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 06/16/2021] [Accepted: 09/19/2022] [Indexed: 11/10/2022]
Abstract
Single-cell high-throughput chromatin conformation capture methodologies (scHi-C) enable profiling of long-range genomic interactions. However, data from these technologies are prone to technical noise and biases that hinder downstream analysis. We develop a normalization approach, BandNorm, and a deep generative modeling framework, scVI-3D, to account for scHi-C specific biases. In benchmarking experiments, BandNorm yields leading performances in a time and memory efficient manner for cell-type separation, identification of interacting loci, and recovery of cell-type relationships, while scVI-3D exhibits advantages for rare cell types and under high sparsity scenarios. Application of BandNorm coupled with gene-associating domain analysis reveals scRNA-seq validated sub-cell type identification.
|
10
|
scGAD: single-cell gene associating domain scores for exploratory analysis of scHi-C data. Bioinformatics 2022; 38:3642-3644. [PMID: 35652733 PMCID: PMC9272792 DOI: 10.1093/bioinformatics/btac372] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/22/2021] [Revised: 03/30/2022] [Accepted: 05/26/2022] [Indexed: 11/12/2022]
Abstract
SUMMARY Quantitative tools are needed to leverage the unprecedented resolution of single-cell high-throughput chromatin conformation (scHi-C) data and integrate it with other single-cell data modalities. We present single-cell gene associating domain (scGAD) scores as a dimension reduction and exploratory analysis tool for scHi-C data. scGAD enables summarization at the gene unit while accounting for inherent gene-level genomic biases. Low-dimensional projections with scGAD capture clustering of cells based on their 3D structures. Significant chromatin interactions within and between cell types can be identified with scGAD. We further show that scGAD facilitates the integration of scHi-C data with other single-cell data modalities by enabling its projection onto reference low-dimensional embeddings. This multi-modal data integration provides an automated and refined cell-type annotation for scHi-C data. AVAILABILITY AND IMPLEMENTATION scGAD is part of the BandNorm R package at https://sshen82.github.io/BandNorm/articles/scGAD-tutorial.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
11
|
Regulatory Architecture of the RCA Gene Cluster Captures an Intragenic TAD Boundary, CTCF-Mediated Chromatin Looping and a Long-Range Intergenic Enhancer. Front Immunol 2022; 13:901747. [PMID: 35769482 PMCID: PMC9235356 DOI: 10.3389/fimmu.2022.901747] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/22/2022] [Accepted: 05/05/2022] [Indexed: 12/03/2022]
Abstract
The Regulators of Complement Activation (RCA) gene cluster comprises several tandemly arranged genes with shared functions within the immune system. RCA members, such as complement receptor 2 (CR2), are well-established susceptibility genes in complex autoimmune diseases. Altered expression of RCA genes has been demonstrated at both the functional and genetic level, but the mechanisms underlying their regulation are not fully characterised. We aimed to investigate the structural organisation of the RCA gene cluster to identify key regulatory elements that influence the expression of CR2 and other genes in this immunomodulatory region. Using 4C, we captured extensive CTCF-mediated chromatin looping across the RCA gene cluster in B cells and showed these were organised into two topologically associated domains (TADs). Interestingly, an inter-TAD boundary was located within the CR1 gene at a well-characterised segmental duplication. Additionally, we mapped numerous gene-gene and gene-enhancer interactions across the region, revealing extensive co-regulation. Importantly, we identified an intergenic enhancer and functionally demonstrated this element upregulates two RCA members (CR2 and CD55) in B cells. We have uncovered novel, long-range mechanisms whereby autoimmune disease susceptibility may be influenced by genetic variants, thus highlighting the important contribution of chromatin topology to gene regulation and complex genetic disease.
|
12
|
Gene by environment interaction mouse model reveals a functional role for 5-hydroxymethylcytosine in neurodevelopmental disorders. Genome Res 2022; 32:266-279. [PMID: 34949667 PMCID: PMC8805724 DOI: 10.1101/gr.276137.121] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/23/2021] [Accepted: 12/22/2021] [Indexed: 11/25/2022]
Abstract
Mouse knockouts of Cntnap2 show altered neurodevelopmental behavior, deficits in striatal GABAergic signaling, and a genome-wide disruption of an environmentally sensitive DNA methylation modification (5-hydroxymethylcytosine [5hmC]) in the orthologs of a significant number of genes implicated in human neurodevelopmental disorders. We tested adult Cntnap2 heterozygous mice (Cntnap2 +/-; lacking behavioral or neuropathological abnormalities) subjected to a prenatal stress and found that prenatally stressed Cntnap2 +/- female mice show repetitive behaviors and altered sociability, similar to the homozygote phenotype. Genomic profiling revealed disruptions in hippocampal and striatal 5hmC levels that are correlated to altered transcript levels of genes linked to these phenotypes (e.g., Reln, Dst, Trio, and Epha5). Chromatin immunoprecipitation coupled with high-throughput sequencing and hippocampal nuclear lysate pull-down data indicated that 5hmC abundance alters the binding of the transcription factor CLOCK near the promoters of these genes (e.g., Palld, Gigyf1, and Fry), providing a mechanistic role for 5hmC in gene regulation. Together, these data support gene-by-environment hypotheses for the origins of mental illness and provide a means to identify the elusive factors contributing to complex human diseases.
|
13
|
Identification of direct transcriptional targets of NFATC2 that promote β cell proliferation. J Clin Invest 2021; 131:e144833. [PMID: 34491912 PMCID: PMC8553569 DOI: 10.1172/jci144833] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 10/02/2020] [Accepted: 09/02/2021] [Indexed: 12/13/2022]
Abstract
The transcription factor NFATC2 induces β cell proliferation in mouse and human islets. However, the genomic targets that mediate these effects have not been identified. We expressed active forms of Nfatc2 and Nfatc1 in human islets. By integrating changes in gene expression with genomic binding sites for NFATC2, we identified approximately 2200 transcriptional targets of NFATC2. Genes induced by NFATC2 were enriched for transcripts that regulate the cell cycle and for DNA motifs associated with the transcription factor FOXP. Islets from an endocrine-specific Foxp1, Foxp2, and Foxp4 triple-knockout mouse were less responsive to NFATC2-induced β cell proliferation, suggesting the FOXP family works to regulate β cell proliferation in concert with NFATC2. NFATC2 induced β cell proliferation in both mouse and human islets, whereas NFATC1 did so only in human islets. Exploiting this species difference, we identified approximately 250 direct transcriptional targets of NFAT in human islets. This gene set enriches for cell cycle-associated transcripts and includes Nr4a1. Deletion of Nr4a1 reduced the capacity of NFATC2 to induce β cell proliferation, suggesting that much of the effect of NFATC2 occurs through its induction of Nr4a1. Integration of noncoding RNA expression, chromatin accessibility, and NFATC2 binding sites enabled us to identify NFATC2-dependent enhancer loci that mediate β cell proliferation.
|
14
|
MLG: multilayer graph clustering for multi-condition scRNA-seq data. Nucleic Acids Res 2021; 49:e127. [PMID: 34581807 PMCID: PMC8682753 DOI: 10.1093/nar/gkab823] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/10/2020] [Revised: 08/13/2021] [Accepted: 09/21/2021] [Indexed: 11/14/2022]
Abstract
Single-cell transcriptome sequencing (scRNA-seq) enabled investigations of cellular heterogeneity at exceedingly higher resolutions. Identification of novel cell types or transient developmental stages across multiple experimental conditions is one of its key applications. Linear and non-linear dimensionality reduction for data integration became a foundational tool in inference from scRNA-seq data. We present multilayer graph clustering (MLG) as an integrative approach for combining multiple dimensionality reduction of multi-condition scRNA-seq data. MLG generates a multilayer shared nearest neighbor cell graph with higher signal-to-noise ratio and outperforms current best practices in terms of clustering accuracy across large-scale benchmarking experiments. Application of MLG to a wide variety of datasets from multiple conditions highlights how MLG boosts signal-to-noise ratio for fine-grained sub-population identification. MLG is widely applicable to settings with single cell data integration via dimension reduction.
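To make the idea of a multilayer shared nearest neighbor (SNN) graph concrete, the sketch below builds one SNN layer per low-dimensional embedding and combines the layers. It is a simplified illustration, not the MLG implementation. The Jaccard weighting, the averaging of layers, and the names `snn_graph` and `multilayer_snn` are assumptions made for this example.

```python
import numpy as np

def knn_indices(embedding, k):
    """Indices of the k nearest neighbors of each cell (brute force, Euclidean)."""
    diff = embedding[:, None, :] - embedding[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def snn_graph(embedding, k):
    """SNN layer: Jaccard overlap between the k-neighbor sets of every cell pair."""
    n = embedding.shape[0]
    neighbors = knn_indices(embedding, k)
    member = np.zeros((n, n))
    member[np.repeat(np.arange(n), k), neighbors.ravel()] = 1.0
    shared = member @ member.T            # number of neighbors two cells share
    jaccard = shared / (2 * k - shared)   # |A and B| / |A or B| with |A| = |B| = k
    np.fill_diagonal(jaccard, 0.0)
    return jaccard

def multilayer_snn(embeddings, k=10):
    """Average the SNN layers built from several embeddings of the same cells."""
    return sum(snn_graph(e, k) for e in embeddings) / len(embeddings)

# Two toy "dimension reductions" of the same 60 cells drawn from two groups.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 30)
centers = np.array([[0.0] * 5, [5.0] * 5])
layer_a = centers[labels] + rng.normal(size=(60, 5))
layer_b = centers[labels] + rng.normal(size=(60, 5))

graph = multilayer_snn([layer_a, layer_b], k=10)
same_group = labels[:, None] == labels[None, :]
within = graph[same_group].mean()
between = graph[~same_group].mean()
print(f"mean edge weight within groups: {within:.3f}, between groups: {between:.3f}")
```

A community detection step on the combined graph would follow in a full pipeline. It is omitted here to keep the example dependency-free.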
|
15
|
INFIMA leverages multi-omics model organism data to identify effector genes of human GWAS variants. Genome Biol 2021; 22:241. [PMID: 34425882 PMCID: PMC8381555 DOI: 10.1186/s13059-021-02450-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2020] [Accepted: 08/02/2021] [Indexed: 11/24/2022]
Abstract
Genome-wide association studies reveal many non-coding variants associated with complex traits. However, model organism studies largely remain as an untapped resource for unveiling the effector genes of non-coding variants. We develop INFIMA, Integrative Fine-Mapping, to pinpoint causal SNPs for diversity outbred (DO) mice eQTL by integrating founder mice multi-omics data including ATAC-seq, RNA-seq, footprinting, and in silico mutation analysis. We demonstrate INFIMA's superior performance compared to alternatives with human and mouse chromatin conformation capture datasets. We apply INFIMA to identify novel effector genes for GWAS variants associated with diabetes. The results of the application are available at http://www.statlab.wisc.edu/shiny/INFIMA/ .
|
16
|
PRAM: a novel pooling approach for discovering intergenic transcripts from large-scale RNA sequencing experiments. Genome Res 2020; 30:1655-1666. [PMID: 32958497 PMCID: PMC7605252 DOI: 10.1101/gr.252445.119] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/12/2019] [Accepted: 08/27/2020] [Indexed: 11/25/2022]
Abstract
Publicly available RNA-seq data is routinely used for retrospective analysis to elucidate new biology. Novel transcript discovery enabled by joint analysis of large collections of RNA-seq data sets has emerged as one such analysis. Current methods for transcript discovery rely on a '2-Step' approach where the first step encompasses building transcripts from individual data sets, followed by the second step that merges predicted transcripts across data sets. To increase the power of transcript discovery from large collections of RNA-seq data sets, we developed a novel '1-Step' approach named Pooling RNA-seq and Assembling Models (PRAM) that builds transcript models from pooled RNA-seq data sets. We demonstrate in a computational benchmark that 1-Step outperforms 2-Step approaches in predicting overall transcript structures and individual splice junctions, while performing competitively in detecting exonic nucleotides. Applying PRAM to 30 human ENCODE RNA-seq data sets identified unannotated transcripts with epigenetic and RAMPAGE signatures similar to those of recently annotated transcripts. In a case study, we discovered and experimentally validated new transcripts through the application of PRAM to mouse hematopoietic RNA-seq data sets. We uncovered new transcripts that share a differential expression pattern with a neighboring gene Pik3cg implicated in human hematopoietic phenotypes, and we provided evidence for the conservation of this relationship in human. PRAM is implemented as an R/Bioconductor package.
|
17
|
FreeHi-C spike-in simulations for benchmarking differential chromatin interaction detection. Methods 2020; 189:3-11. [PMID: 32663510 DOI: 10.1016/j.ymeth.2020.07.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2020] [Revised: 05/23/2020] [Accepted: 07/03/2020] [Indexed: 11/16/2022]
Abstract
High-throughput genome-wide chromatin conformation capture assay (Hi-C) is routinely used to profile long-range genomic interactions and three-dimensional organization of genomes. A key application of Hi-C is the comparative analysis of genomic interactions across different time points, cellular conditions, or multiple stimuli. While operating characteristics of methods for Hi-C data processing such as normalization, pairwise interaction and higher-order organization detection have been relatively well studied, properties of methods for differential chromatin interaction detection are less investigated. We have recently developed FreeHi-C to enable data-driven non-parametric simulations from Hi-C experiments. Here, we extend FreeHi-C with a user/data-driven spike-in module to facilitate comparisons of differential chromatin interaction detection methods where the ground truth differential chromatin interactions are known under a wide variety of settings. We use FreeHi-C to benchmark four differential chromatin interaction detection methods, namely HiCcompare, multiHiCcompare, diffHic, and Selfish, using three comparative analysis settings with different sequencing depths and spike-in proportions. This comparison reveals distinct performances in terms of the standard metrics such as the false discovery rate control, detection power, significance order, precision-recall curve, and receiver operating characteristic curve as well as overall genomic properties of the types of differential chromatin interactions detectable by each method. Furthermore, it highlights the lack of power for all methods in small replication settings.
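When spike-in simulations supply the ground truth, two of the metrics named above, false discovery rate control and detection power, can be computed directly from a method's p-values. The sketch below shows one way to do so with the Benjamini-Hochberg adjustment. It is a generic illustration and not code from the FreeHi-C benchmark. The function names and the toy p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downward.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

def empirical_fdr_and_power(pvalues, is_true_differential, alpha=0.05):
    """Empirical FDR and power of the calls made at adjusted p-value <= alpha."""
    truth = np.asarray(is_true_differential, dtype=bool)
    called = benjamini_hochberg(pvalues) <= alpha
    n_called = int(called.sum())
    n_true = int(truth.sum())
    fdr = (called & ~truth).sum() / n_called if n_called else 0.0
    power = (called & truth).sum() / n_true if n_true else float("nan")
    return float(fdr), float(power)

# Toy example: six tested interactions, three of them spiked in as truly differential.
pvals = [0.001, 0.002, 0.003, 0.04, 0.5, 0.8]
truth = [True, True, False, True, False, False]
fdr, power = empirical_fdr_and_power(pvals, truth, alpha=0.05)
print(f"empirical FDR: {fdr:.3f}, power: {power:.3f}")
```

Here three interactions pass the 0.05 threshold, one of which is not a spike-in, so the empirical FDR is 1/3 and the power is 2/3.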
|
18
|
SURF: integrative analysis of a compendium of RNA-seq and CLIP-seq datasets highlights complex governing of alternative transcriptional regulation by RNA-binding proteins. Genome Biol 2020; 21:139. [PMID: 32532357 PMCID: PMC7291511 DOI: 10.1186/s13059-020-02039-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/02/2019] [Accepted: 05/08/2020] [Indexed: 01/10/2023]
Abstract
Advances in high-throughput profiling of RNA-binding proteins (RBPs) have resulted in CLIP-seq datasets coupled with transcriptome profiling by RNA-seq. However, analysis methods that integrate both types of data are lacking. We describe SURF, Statistical Utility for RBP Functions, for integrative analysis of large collections of CLIP-seq and RNA-seq data. We demonstrate SURF's ability to accurately detect differential alternative transcriptional regulation events and associate them with local protein-RNA interactions. We apply SURF to the ENCODE RBP compendium and carry out downstream analysis with additional reference datasets. The results of this application are browsable at http://www.statlab.wisc.edu/shiny/surf/.
|
19
|
FreeHi-C simulates high-fidelity Hi-C data for benchmarking and data augmentation. Nat Methods 2019; 17:37-40. [PMID: 31712779 PMCID: PMC8136837 DOI: 10.1038/s41592-019-0624-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 05/05/2019] [Accepted: 09/26/2019] [Indexed: 02/06/2023]
Abstract
Ability to simulate high-throughput chromatin conformation (Hi-C) data is foundational for benchmarking Hi-C data analysis methods. Here we present a non-parametric strategy named FreeHi-C to simulate Hi-C data from the interacting genome fragments. Data from FreeHi-C exhibit high fidelity to biological Hi-C data. FreeHi-C boosts the precision and power of differential chromatin interaction detection through data augmentation under preserved false discovery rate control.
|
20
|
atSNP Search: a web resource for statistically evaluating influence of human genetic variation on transcription factor binding. Bioinformatics 2019; 35:2657-2659. [PMID: 30534948 PMCID: PMC6662080 DOI: 10.1093/bioinformatics/bty1010] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 08/05/2018] [Revised: 11/13/2018] [Accepted: 12/06/2018] [Indexed: 11/13/2022]
Abstract
SUMMARY Understanding the regulatory roles of non-coding genetic variants has become a central goal for interpreting results of genome-wide association studies. The regulatory significance of the variants may be interrogated by assessing their influence on transcription factor binding. We have developed atSNP Search, a comprehensive web database for evaluating motif matches to the human genome with both reference and variant alleles and assessing the overall significance of the variant alterations on the motif matches. Convenient search features, comprehensive search outputs and a useful help menu are key components of atSNP Search. atSNP Search enables convenient interpretation of regulatory variants by statistical significance testing and composite logo plots, which are graphical representations of motif matches with the reference and variant alleles. Existing motif-based regulatory variant discovery tools only consider a limited pool of variants due to storage or other limitations. In contrast, atSNP Search users can test more than 37 billion variant-motif pairs with marginal significance in motif matches or match alteration. Computational evidence from atSNP Search, when combined with experimental validation, may help with the discovery of underlying disease mechanisms. AVAILABILITY AND IMPLEMENTATION atSNP Search is freely available at http://atsnp.biostat.wisc.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
21
|
iFunMed: Integrative functional mediation analysis of GWAS and eQTL studies. Genet Epidemiol 2019; 43:742-760. [PMID: 31328826 DOI: 10.1002/gepi.22217] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 02/21/2019] [Revised: 04/17/2019] [Accepted: 05/07/2019] [Indexed: 11/08/2022]
Abstract
Genome-wide association studies (GWAS) have successfully identified thousands of genetic variants contributing to disease and other phenotypes. However, significant obstacles hamper our ability to elucidate causal variants, identify genes affected by causal variants, and characterize the mechanisms by which genotypes influence phenotypes. The increasing availability of genome-wide functional annotation data is providing unique opportunities to incorporate prior information into the analysis of GWAS to better understand the impact of variants on disease etiology. Although there have been many advances in incorporating prior information into prioritization of trait-associated variants in GWAS, functional annotation data have played a secondary role in the joint analysis of GWAS and molecular (i.e., expression) quantitative trait loci (eQTL) data in assessing evidence for association. To address this, we develop a novel mediation framework, iFunMed, to integrate GWAS and eQTL data with the utilization of publicly available functional annotation data. iFunMed extends the scope of standard mediation analysis by incorporating information from multiple genetic variants at a time and leveraging variant-level summary statistics. Data-driven computational experiments convey how informative annotations improve single-nucleotide polymorphism (SNP) selection performance while emphasizing robustness of iFunMed to noninformative annotations. Application to Framingham Heart Study data indicates that iFunMed is able to boost detection of SNPs with mediation effects that can be attributed to regulatory mechanisms.
|
22
|
An empirical Bayes test for allelic-imbalance detection in ChIP-seq. Biostatistics 2018; 19:546-561. [PMID: 29126153 PMCID: PMC6454553 DOI: 10.1093/biostatistics/kxx060] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 04/10/2017] [Accepted: 10/01/2017] [Indexed: 11/12/2022]
Abstract
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) has enabled discovery of genomic regions enriched with biological signals such as transcription factor binding and histone modifications. Allelic-imbalance (ALI) detection is a complementary analysis of ChIP-seq data for associating biological signals with single nucleotide polymorphisms (SNPs). It has been successfully used in elucidating functional roles of non-coding SNPs. Commonly used statistical approaches for ALI detection are often based on binomial testing and mixture models, both of which rely on strong assumptions on the distribution of the unobserved allelic probability, and have significant practical shortcomings. We propose the Non-Parametric Binomial (NPBin) test for ALI detection and for modeling Binomial data in general. NPBin models the density of the unobserved allelic probability non-parametrically, and estimates its empirical null distribution via curve fitting. We demonstrate the advantages of NPBin in terms of interpretability of the estimated density and the accuracy in ALI detection using simulations and analysis of several ChIP-seq data sets. We also illustrate the generality of our modeling framework beyond ALI detection by an application to a baseball batting average prediction problem. This article has supplementary material available at Biostatistics online. The code and the sample input data have also been deposited on GitHub at https://github.com/QiZhangStat/ALIdetection.
|
23
|
atSNPInfrastructure, a case study for searching billions of records while providing significant cost savings over cloud providers. IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, WORKSHOPS AND PHD FORUM : [PROCEEDINGS]. IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, WORKSHOPS AND PHD FORUM 2018; 2018:497-506. [PMID: 30349760 PMCID: PMC6195815 DOI: 10.1109/ipdpsw.2018.00086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
We explore the feasibility of a database storage engine housing up to 307 billion genetic Single Nucleotide Polymorphisms (SNPs) for online access. We evaluate database storage engines and implement a solution that accounts for factors such as dataset size, information gain, cost and hardware constraints. Our solution provides a full-featured functional model for scalable storage and query-ability for researchers exploring the SNPs in the human genome. We address the scalability problem by building physical infrastructure and comparing final costs to a major cloud provider.
|
24
|
Annotation Regression for Genome-Wide Association Studies with an Application to Psychiatric Genomic Consortium Data. STATISTICS IN BIOSCIENCES 2017; 9:50-72. [PMID: 28781711 PMCID: PMC5542423 DOI: 10.1007/s12561-016-9154-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/21/2015] [Revised: 05/09/2016] [Accepted: 06/20/2016] [Indexed: 10/21/2022]
Abstract
Although genome-wide association studies (GWAS) have been successful at finding thousands of disease-associated genetic variants (GVs), identifying causal variants and elucidating the mechanisms by which genotypes influence phenotypes are critical open questions. A key challenge is that a large percentage of disease-associated GVs are potential regulatory variants located in noncoding regions, making them difficult to interpret. Recent research efforts focus on going beyond annotating GVs by integrating functional annotation data with GWAS to prioritize GVs. However, applicability of these approaches is challenged by high dimensionality and heterogeneity of functional annotation data. Furthermore, existing methods often assume global associations of GVs with annotation data. This strong assumption is susceptible to violations for GVs involved in many complex diseases. To address these issues, we develop a general regression framework, named Annotation Regression for GWAS (ARoG). ARoG is based on a finite mixture of linear regressions model where GWAS association measures are viewed as responses and functional annotations as predictors. This mixture framework addresses heterogeneity of effects of GVs by grouping them into clusters and high dimensionality of the functional annotations by enabling annotation selection within each cluster. ARoG further employs permutation testing to evaluate the significance of selected annotations. Computational experiments indicate that ARoG can discover distinct associations between disease risk and functional annotations. Application of ARoG to autism and schizophrenia data from Psychiatric Genomics Consortium led to identification of GVs that significantly affect interactions of several transcription factors with DNA as potential mechanisms contributing to these disorders.
|
25
|
A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets. J Comput Biol 2016; 24:472-485. [PMID: 27835030 DOI: 10.1089/cmb.2016.0138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Current analytic approaches for querying large collections of chromatin immunoprecipitation followed by sequencing (ChIP-seq) data from multiple cell types rely on individual analysis of each data set (i.e., peak calling) independently. This approach discards the fact that functional elements are frequently shared among related cell types and leads to overestimation of the extent of divergence between different ChIP-seq samples. Methods geared toward multisample investigations have limited applicability in settings that aim to integrate 100s to 1000s of ChIP-seq data sets for query loci (e.g., thousands of genomic loci with a specific binding site). Recently, Zuo et al. developed a hierarchical framework for state-space matrix inference and clustering, named MBASIC, to enable joint analysis of user-specified loci across multiple ChIP-seq data sets. Although this versatile framework both estimates the underlying state-space (e.g., bound vs. unbound) and groups loci with similar patterns together, its Expectation-Maximization-based estimation structure hinders its applicability with large numbers of loci and samples. We address this limitation by developing a MAP-based asymptotic derivations from Bayes (MAD-Bayes) framework for MBASIC. This results in a K-means-like optimization algorithm that converges rapidly and hence enables exploring multiple initialization schemes and flexibility in tuning. Comparison with MBASIC indicates that this speed comes at a relatively insignificant loss in estimation accuracy. Although MAD-Bayes MBASIC is specifically designed for the analysis of user-specified loci, it is able to capture overall patterns of histone marks from multiple ChIP-seq data sets similar to those identified by genome-wide segmentation methods such as ChromHMM and Spectacle.
|
26
|
Abstract
In recent years, a large number of genomic and epigenomic studies have been focusing on the integrative analysis of multiple experimental datasets measured over a large number of observational units. The objectives of such studies include not only inferring a hidden state of activity for each unit over individual experiments, but also detecting highly associated clusters of units based on their inferred states. Although there are a number of methods tailored for specific datasets, there is currently no state-of-the-art modeling framework for this general class of problems. In this paper, we develop the MBASIC (Matrix Based Analysis for State-space Inference and Clustering) framework. MBASIC consists of two parts: state-space mapping and state-space clustering. In state-space mapping, it maps observations onto a finite state-space, representing the activation states of units across conditions. In state-space clustering, MBASIC incorporates a finite mixture model to cluster the units based on their inferred state-space profiles across all conditions. Both the state-space mapping and clustering can be simultaneously estimated through an Expectation-Maximization algorithm. MBASIC flexibly adapts to a large number of parametric distributions for the observed data, as well as the heterogeneity in replicate experiments. It allows for imposing structural assumptions on each cluster, and enables model selection using information criterion. In our data-driven simulation studies, MBASIC showed significant accuracy in recovering both the underlying state-space variables and clustering structures. We applied MBASIC to two genome research problems using large numbers of datasets from the ENCODE project. The first application grouped genes based on transcription factor occupancy profiles of their promoter regions in two different cell types. The second application focused on identifying groups of loci that are similar to a GATA2 binding site that is functional at its endogenous locus by utilizing transcription factor occupancy data and illustrated applicability of MBASIC in a wide variety of problems. In both studies, MBASIC showed higher levels of raw data fidelity than analyzing these data with a two-step approach using ENCODE results on transcription factor occupancy data.
|
27
|
Sex-specific hippocampal 5-hydroxymethylcytosine is disrupted in response to acute stress. Neurobiol Dis 2016; 96:54-66. [PMID: 27576189 DOI: 10.1016/j.nbd.2016.08.014] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 07/08/2016] [Revised: 08/18/2016] [Accepted: 08/23/2016] [Indexed: 01/18/2023]
Abstract
Environmental stress is among the most important contributors to increased susceptibility to develop psychiatric disorders. While it is well known that acute environmental stress alters gene expression, the molecular mechanisms underlying these changes remain largely unknown. 5-hydroxymethylcytosine (5hmC) is a novel environmentally sensitive epigenetic modification that is highly enriched in neurons and is associated with active neuronal transcription. Recently, we reported a genome-wide disruption of hippocampal 5hmC in male mice following acute stress that was correlated to altered transcript levels of genes in known stress related pathways. Since sex-specific endocrine mechanisms respond to environmental stimulus by altering the neuronal epigenome, we examined the genome-wide profile of hippocampal 5hmC in female mice following exposure to acute stress and identified 363 differentially hydroxymethylated regions (DhMRs) linked to known (e.g., Nr3c1 and Ntrk2) and potentially novel genes associated with stress response and psychiatric disorders. Integration of hippocampal expression data from the same female mice found stress-related hydroxymethylation correlated to altered transcript levels. Finally, characterization of stress-induced sex-specific 5hmC profiles in the hippocampus revealed 778 sex-specific acute stress-induced DhMRs some of which were correlated to altered transcript levels that produce sex-specific isoforms in response to stress. Together, the alterations in 5hmC presented here provide a possible molecular mechanism for the adaptive sex-specific response to stress that may augment the design of novel therapeutic agents that will have optimal effectiveness in each sex.
|
28
|
Integrative analysis with ChIP-seq advances the limits of transcript quantification from RNA-seq. Genome Res 2016; 26:1124-33. [PMID: 27405803 PMCID: PMC4971760 DOI: 10.1101/gr.199174.115] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 09/04/2015] [Accepted: 06/13/2016] [Indexed: 11/24/2022]
Abstract
RNA-seq is currently the technology of choice for global measurement of transcript abundances in cells. Despite its successes, isoform-level quantification remains difficult because short RNA-seq reads are often compatible with multiple alternatively spliced isoforms. Existing methods rely heavily on uniquely mapping reads, which are not available for numerous isoforms that lack regions of unique sequence. To improve quantification accuracy in such difficult cases, we developed a novel computational method, prior-enhanced RSEM (pRSEM), which uses a complementary data type in addition to RNA-seq data. We found that ChIP-seq data of RNA polymerase II and histone modifications were particularly informative in this approach. In qRT-PCR validations, pRSEM was shown to be superior to competing methods in estimating relative isoform abundances within or across conditions. Data-driven simulations suggested that pRSEM has a greatly decreased false-positive rate at the expense of a small increase in false-negative rate. In aggregate, our study demonstrates that pRSEM transforms existing capacity to precisely estimate transcript abundances, especially at the isoform level.
|
29
|
Systematic evaluation of the impact of ChIP-seq read designs on genome coverage, peak identification, and allele-specific binding detection. BMC Bioinformatics 2016; 17:96. [PMID: 26908256 PMCID: PMC4765064 DOI: 10.1186/s12859-016-0957-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 10/09/2015] [Accepted: 02/19/2016] [Indexed: 01/22/2023]
Abstract
Background Chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments revolutionized genome-wide profiling of transcription factors and histone modifications. Although maturing sequencing technologies allow these experiments to be carried out with short (36–50 bps), long (75–100 bps), single-end, or paired-end reads, the impact of these read parameters on the downstream data analysis is not well understood. In this paper, we evaluate the effects of different read parameters on genome sequence alignment, coverage of different classes of genomic features, peak identification, and allele-specific binding detection. Results We generated 101 bps paired-end ChIP-seq data for many transcription factors from human GM12878 and MCF7 cell lines. Systematic evaluations using in silico variations of these data, as well as fully simulated data, revealed complex interplay between the sequencing parameters and analysis tools, and indicated clear advantages of paired-end designs in several aspects such as alignment accuracy, peak resolution, and most notably, allele-specific binding detection. Conclusions Our work elucidates the effect of design on the downstream analysis and provides insights to investigators in deciding sequencing parameters in ChIP-seq experiments. We present the first systematic evaluation of the impact of ChIP-seq designs on allele-specific binding detection and highlight the power of paired-end designs in such studies. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0957-1) contains supplementary material, which is available to authorized users.
|
30
|
Genome-wide alterations in hippocampal 5-hydroxymethylcytosine links plasticity genes to acute stress. Neurobiol Dis 2015; 86:99-108. [PMID: 26598390 DOI: 10.1016/j.nbd.2015.11.010] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 09/23/2015] [Accepted: 11/11/2015] [Indexed: 12/15/2022]
Abstract
Environmental stress is among the most important contributors to increased susceptibility to develop psychiatric disorders, including anxiety and post-traumatic stress disorder. While even acute stress alters gene expression, the molecular mechanisms underlying these changes remain largely unknown. 5-hydroxymethylcytosine (5hmC) is a novel environmentally sensitive DNA modification that is highly enriched in post-mitotic neurons and is associated with active transcription of neuronal genes. Recently, we found a hippocampal increase of 5hmC in the glucocorticoid receptor gene (Nr3c1) following acute stress, warranting a deeper investigation of stress-related 5hmC levels. Here we used an established chemical labeling and affinity purification method coupled with high-throughput sequencing technology to generate the first genome-wide profile of hippocampal 5hmC following exposure to acute restraint stress and a one-hour recovery. This approach found a genome-wide disruption in 5hmC associated with acute stress response, primarily in genic regions, and identified known and potentially novel stress-related targets that have a significant enrichment for neuronal ontological functions. Integration of these data with hippocampal gene expression data from these same mice found stress-related hydroxymethylation correlated to altered transcript levels and sequence motif predictions indicated that 5hmC may function by mediating transcription factor binding to these transcripts. Together, these data reveal an environmental impact on this newly discovered epigenetic mark in the brain and represent a critical step toward understanding stress-related epigenetic mechanisms that alter gene expression and can lead to the development of psychiatric disorders.
|
31
|
Perm-seq: Mapping Protein-DNA Interactions in Segmental Duplication and Highly Repetitive Regions of Genomes with Prior-Enhanced Read Mapping. PLoS Comput Biol 2015; 11:e1004491. [PMID: 26484757 PMCID: PMC4618727 DOI: 10.1371/journal.pcbi.1004491] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 11/13/2014] [Accepted: 08/06/2015] [Indexed: 11/19/2022]
Abstract
Segmental duplications and other highly repetitive regions of genomes contribute significantly to cells' regulatory programs. Advancements in next generation sequencing enabled genome-wide profiling of protein-DNA interactions by chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq). However, interactions in highly repetitive regions of genomes have proven difficult to map since short reads of 50-100 base pairs (bps) from these regions map to multiple locations in reference genomes. Standard analytical methods discard such multi-mapping reads and the few that can accommodate them are prone to large false positive and negative rates. We developed Perm-seq, a prior-enhanced read allocation method for ChIP-seq experiments, that can allocate multi-mapping reads in highly repetitive regions of the genomes with high accuracy. We comprehensively evaluated Perm-seq, and found that our prior-enhanced approach significantly improves multi-read allocation accuracy over approaches that do not utilize additional data types. The statistical formalism underlying our approach facilitates supervising of multi-read allocation with a variety of data sources including histone ChIP-seq. We applied Perm-seq to 64 ENCODE ChIP-seq datasets from GM12878 and K562 cells and identified many novel protein-DNA interactions in segmental duplication regions. Our analysis reveals that although the protein-DNA interaction sites are evolutionarily less conserved in repetitive regions, they share the overall sequence characteristics of the protein-DNA interactions in non-repetitive regions.
|
32
|
Genome-wide disruption of 5-hydroxymethylcytosine in a mouse model of autism. Hum Mol Genet 2015; 24:7121-31. [PMID: 26423458 DOI: 10.1093/hmg/ddv411] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 07/27/2015] [Accepted: 09/28/2015] [Indexed: 01/29/2023]
Abstract
The autism spectrum disorders (ASD) comprise a broad group of behaviorally related neurodevelopmental disorders affecting as many as 1 in 68 children. The hallmarks of ASD consist of impaired social and communication interactions, pronounced repetitive behaviors and restricted patterns of interests. Family, twin and epidemiological studies suggest a polygenetic and epistatic susceptibility model involving the interaction of many genes; however, the etiology of ASD is likely to be complex and include both epigenetic and environmental factors. 5-hydroxymethylcytosine (5hmC) is a novel environmentally sensitive DNA modification that is highly enriched in post-mitotic neurons and is associated with active transcription of neuronal genes. Here, we used an established chemical labeling and affinity purification method coupled with high-throughput sequencing technology to generate a genome-wide profile of striatal 5hmC in an autism mouse model (Cntnap2(-/-) mice) and found that at 9 weeks of age the Cntnap2(-/-) mice have a genome-wide disruption in 5hmC, primarily in genic regions and repetitive elements. Annotation of differentially hydroxymethylated regions (DhMRs) to genes revealed a significant overlap with known ASD genes (e.g. Nrxn1 and Reln) that carried an enrichment of neuronal ontological functions, including axonogenesis and neuron projection morphogenesis. Finally, sequence motif predictions identified associations with transcription factors that have a high correlation with important genes in neuronal developmental and functional pathways. Together, our data implicate a role for 5hmC-mediated epigenetic modulation in the pathogenesis of autism and represent a critical step toward understanding the genome-wide molecular consequence of the Cntnap2 mutation, which results in an autism-like phenotype.
|
33
|
Fungal Morphology, Iron Homeostasis, and Lipid Metabolism Regulated by a GATA Transcription Factor in Blastomyces dermatitidis. PLoS Pathog 2015; 11:e1004959. [PMID: 26114571 PMCID: PMC4482641 DOI: 10.1371/journal.ppat.1004959] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 07/08/2014] [Accepted: 05/16/2015] [Indexed: 11/19/2022]
Abstract
In response to temperature, Blastomyces dermatitidis converts between yeast and mold forms. Knowledge of the mechanism(s) underlying this response to temperature remains limited. In B. dermatitidis, we identified a GATA transcription factor, SREB, important for the transition to mold. Null mutants (SREBΔ) fail to fully complete the conversion to mold and cannot properly regulate siderophore biosynthesis. To capture the transcriptional response regulated by SREB early in the phase transition (0–48 hours), gene expression microarrays were used to compare SREBΔ to an isogenic wild type isolate. Analysis of the time course microarray data demonstrated SREB functioned as a transcriptional regulator at 37°C and 22°C. Bioinformatic and biochemical analyses indicated SREB was involved in diverse biological processes including iron homeostasis, biosynthesis of triacylglycerol and ergosterol, and lipid droplet formation. Integration of microarray data, bioinformatics, and chromatin immunoprecipitation identified a subset of genes directly bound and regulated by SREB in vivo in yeast (37°C) and during the phase transition to mold (22°C). This included genes involved with siderophore biosynthesis and uptake, iron homeostasis, and genes unrelated to iron assimilation. Functional analysis suggested that lipid droplets were actively metabolized during the phase transition and lipid metabolism may contribute to filamentous growth at 22°C. Chromatin immunoprecipitation, RNA interference, and overexpression analyses suggested that SREB was in a negative regulatory circuit with the bZIP transcription factor encoded by HAPX. Both SREB and HAPX affected morphogenesis at 22°C; however, large changes in transcript abundance by gene deletion for SREB or strong overexpression for HAPX were required to alter the phase transition.
Blastomyces dermatitidis belongs to a group of human pathogenic fungi that convert between two forms, mold and yeast, in response to temperature. Growth as yeast (37°C) in tissue facilitates immune evasion, whereas growth as mold (22°C) promotes environmental survival, sexual reproduction, and generation of transmissible spores. Despite the importance of dimorphism, how fungi regulate temperature adaptation is poorly understood. We identified SREB, a transcription factor that regulates disparate processes including dimorphism. SREB null mutants, which lack SREB, fail to fully complete the conversion to mold at 22°C. The goal of our research was to characterize how SREB regulates transcription during the switch to mold. Gene expression microarray along with chromatin binding and biochemical analyses indicated that SREB affected several processes including iron homeostasis, lipid biosynthesis, and lipid droplet formation. In vivo, SREB directly bound and regulated genes involved with iron uptake, lipid biosynthesis, and transcription. Functional analysis suggested that lipid metabolism may influence filamentous growth at 22°C. In addition, SREB interacted with another transcription factor, HAPX.
|
34
|
atSNP: transcription factor binding affinity testing for regulatory SNP detection. Bioinformatics 2015; 31:3353-5. [PMID: 26092860 DOI: 10.1093/bioinformatics/btv328] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Received: 02/23/2015] [Accepted: 05/19/2015] [Indexed: 01/10/2023] Open
Abstract
MOTIVATION Genome-wide association studies revealed that most disease-associated single nucleotide polymorphisms (SNPs) are located in regulatory regions within introns or in regions between genes. Regulatory SNPs (rSNPs) are SNPs that affect gene regulation by changing transcription factor (TF) binding affinities to genomic sequences. Identifying potential rSNPs is crucial for understanding disease mechanisms. In silico methods that evaluate the impact of SNPs on TF binding affinities are not scalable for large-scale analysis. RESULTS We describe Affinity Testing for regulatory SNPs (atSNP), a computationally efficient R package for identifying rSNPs in silico. atSNP implements an importance sampling algorithm coupled with a first-order Markov model for the background nucleotide sequences to test the significance of affinity scores and SNP-driven changes in these scores. Application of atSNP with >20 K SNPs indicates that atSNP is the only available tool for such a large-scale task. atSNP provides user-friendly output in the form of both tables and composite logo plots for visualizing SNP-motif interactions. Evaluations of atSNP with known rSNP-TF interactions indicate that atSNP is able to prioritize motifs for a given set of SNPs with high accuracy. AVAILABILITY AND IMPLEMENTATION https://github.com/keleslab/atSNP. CONTACT keles@stat.wisc.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
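The notion of a SNP-driven change in an affinity score can be illustrated with a minimal Python sketch. It is not the atSNP implementation: the position weight matrix, the uniform background, and the function names are hypothetical, and the importance sampling step that atSNP uses to attach p-values is omitted. The sketch only computes the best log-odds motif score over the windows of a short sequence for the reference and the alternate allele, then reports the difference.

```python
import math

# Hypothetical 4-position position weight matrix (PWM); the values are
# invented for illustration and do not come from any real motif library.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Assumed uniform background (atSNP itself uses a first-order Markov model).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds(window):
    """Log-odds score of one motif-length window against the background."""
    return sum(math.log(PWM[i][b] / BACKGROUND[b]) for i, b in enumerate(window))


def best_score(seq):
    """Maximum log-odds score over all motif-length windows of seq."""
    w = len(PWM)
    return max(log_odds(seq[i:i + w]) for i in range(len(seq) - w + 1))


def snp_score_change(ref_seq, snp_pos, alt_base):
    """Best score with the alternate allele minus best score with the reference."""
    alt_seq = ref_seq[:snp_pos] + alt_base + ref_seq[snp_pos + 1:]
    return best_score(alt_seq) - best_score(ref_seq)


# Sequence of length 2w - 1 with the SNP in the centre, so that every
# window overlaps the SNP position.
change = snp_score_change("CCAGCTC", 3, "T")
print(f"score change for G>T: {change:.3f}")
```

A negative change suggests that the alternate allele weakens the best match to the motif; assessing whether such a change is statistically significant is the part that requires the background model and sampling scheme described in the abstract.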
|
35
|
Abstract
As next generation sequencing technologies are becoming more economical, large-scale ChIP-seq studies are enabling the investigation of the roles of transcription factor binding and epigenome on phenotypic variation. Studying such variation requires individual level ChIP-seq experiments. Standard designs for ChIP-seq experiments employ a paired control per ChIP-seq sample. Genomic coverage for control experiments is often sacrificed to increase the resources for ChIP samples. However, the quality of ChIP-enriched regions identifiable from a ChIP-seq experiment depends on the quality and the coverage of the control experiments. Insufficient coverage leads to loss of power in detecting enrichment. We investigate the effect of in silico pooling of control samples within multiple biological replicates, multiple treatment conditions, and multiple cell lines and tissues across multiple datasets with varying levels of genomic coverage. Our computational studies suggest guidelines for performing in silico pooling of control experiments. Using vast amounts of ENCODE data, we show that pairwise correlations between control samples originating from multiple biological replicates, treatments, and cell lines/tissues can be grouped into two classes representing whether or not in silico pooling leads to power gain in detecting enrichment between the ChIP and the control samples. Our findings have important implications for multiplexing samples.
|
36
|
Abstract
MOTIVATION In chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) and other short-read sequencing experiments, a considerable fraction of the short reads align to multiple locations on the reference genome (multi-reads). Inferring the origin of multi-reads is critical for accurately mapping reads to repetitive regions. Current state-of-the-art multi-read allocation algorithms rely on the read counts in the local neighborhood of the alignment locations and ignore the variation in the copy numbers of these regions. Copy-number variation (CNV) can directly affect the read densities and, therefore, bias allocation of multi-reads. RESULTS We propose cnvCSEM (CNV-guided ChIP-Seq by expectation-maximization algorithm), a flexible framework that incorporates CNV in multi-read allocation. cnvCSEM eliminates the CNV bias in multi-read allocation by initializing the read allocation algorithm with CNV-aware initial values. Our data-driven simulations illustrate that cnvCSEM leads to higher read coverage with satisfactory accuracy and lower loss in read-depth recovery (estimation). We evaluate the biological relevance of the cnvCSEM-allocated reads and the resultant peaks with the analysis of several ENCODE ChIP-seq datasets. AVAILABILITY AND IMPLEMENTATION Available at http://www.stat.wisc.edu/~qizhang/ CONTACT qizhang@stat.wisc.edu or keles@stat.wisc.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
37
|
A statistical framework for power calculations in ChIP-seq experiments. Bioinformatics 2014; 30:753-60. [PMID: 23665773 PMCID: PMC3957067 DOI: 10.1093/bioinformatics/btt200] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 12/04/2012] [Revised: 04/19/2013] [Accepted: 04/22/2013] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION ChIP-seq technology enables investigators to study genome-wide binding of transcription factors and mapping of epigenomic marks. Although the availability of basic analysis tools for ChIP-seq data is rapidly increasing, there has not been much progress on the related design issues. A challenging question in designing a ChIP-seq experiment is how deeply the ChIP and the control samples should be sequenced. The answer depends on multiple factors, some of which can be set by the experimenter based on pilot/preliminary data. The sequencing depth of a ChIP-seq experiment is one of the key factors that determine whether all the underlying targets (e.g. binding locations or epigenomic profiles) can be identified with a targeted power. RESULTS We developed a statistical framework named CSSP (ChIP-seq Statistical Power) for power calculations in ChIP-seq experiments by considering a local Poisson model, which is commonly adopted by many peak callers. Evaluations with simulations and data-driven computational experiments demonstrate that this framework can reliably estimate the power of a ChIP-seq experiment at different sequencing depths based on pilot data. Furthermore, it provides an analytical approach for calculating the required depth for a targeted power while controlling the false discovery rate at a user-specified level. Hence, our results enable researchers to use their own or publicly available data for determining required sequencing depths of their ChIP-seq experiments and potentially make better use of the multiplexing functionality of the sequencers. Evaluation of power for multiple public ChIP-seq datasets indicates that, currently, typical ChIP-seq studies are powered well for detecting large fold changes of ChIP enrichment over the control sample, but they have considerably less power for detecting smaller fold changes. AVAILABILITY Available at www.stat.wisc.edu/~zuo/CSSP. 
CONTACT keles@stat.wisc.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
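How power depends on sequencing depth under a Poisson model can be sketched in a few lines of Python. This is a simplified illustration and not the CSSP method: it assumes a single bin with a known background rate per million reads, calls the bin enriched when its count exceeds a per-bin significance threshold, and ignores the false discovery rate control that CSSP provides. All function and parameter names are hypothetical.

```python
import math


def poisson_sf(k, mu):
    """P(X >= k) for X ~ Poisson(mu), by direct summation (small mu only)."""
    if k <= 0:
        return 1.0
    term = math.exp(-mu)          # P(X = 0)
    cdf = term
    for i in range(1, k):         # accumulate P(X = 1) .. P(X = k - 1)
        term *= mu / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def critical_value(mu0, alpha):
    """Smallest count c with P(X >= c) <= alpha under the background mean mu0."""
    c = 0
    while poisson_sf(c, mu0) > alpha:
        c += 1
    return c


def power(bg_rate_per_million, fold, depth_millions, alpha=1e-3):
    """Probability that a bin enriched by `fold` over background is called."""
    mu0 = bg_rate_per_million * depth_millions   # background mean at this depth
    c = critical_value(mu0, alpha)
    return poisson_sf(c, fold * mu0)


# Compare a small and a large fold change at two hypothetical depths.
for depth in (5, 20):
    print(depth, round(power(1.0, 2.0, depth), 3), round(power(1.0, 4.0, depth), 3))
```

In this toy setting, larger fold changes reach high power at shallower depths than smaller ones, which is the qualitative pattern the abstract reports for public datasets.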
|
38
|
Abstract
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is invaluable for identifying genome-wide binding of transcription factors and mapping of epigenomic profiles. We present a statistical protocol for analyzing ChIP-seq data. We describe guidelines for data preprocessing and quality control and provide detailed examples of identifying ChIP-enriched regions using the Bioconductor package "mosaics."
|
39
|
dPeak: high resolution identification of transcription factor binding sites from PET and SET ChIP-Seq data. PLoS Comput Biol 2013; 9:e1003246. [PMID: 24146601 PMCID: PMC3798280 DOI: 10.1371/journal.pcbi.1003246] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 01/02/2013] [Accepted: 08/14/2013] [Indexed: 11/21/2022] Open
Abstract
Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-Seq) has been successfully used for genome-wide profiling of transcription factor binding sites, histone modifications, and nucleosome occupancy in many model organisms and humans. Because the compact genomes of prokaryotes harbor many binding sites separated by only few base pairs, applications of ChIP-Seq in this domain have not reached their full potential. Applications in prokaryotic genomes are further hampered by the fact that well studied data analysis methods for ChIP-Seq do not result in a resolution required for deciphering the locations of nearby binding events. We generated single-end tag (SET) and paired-end tag (PET) ChIP-Seq data for factor in Escherichia coli (E. coli). Direct comparison of these datasets revealed that although PET assay enables higher resolution identification of binding events, standard ChIP-Seq analysis methods are not equipped to utilize PET-specific features of the data. To address this problem, we developed dPeak as a high resolution binding site identification (deconvolution) algorithm. dPeak implements a probabilistic model that accurately describes ChIP-Seq data generation process for both the SET and PET assays. For SET data, dPeak outperforms or performs comparably to the state-of-the-art high-resolution ChIP-Seq peak deconvolution algorithms such as PICS, GPS, and GEM. When coupled with PET data, dPeak significantly outperforms SET-based analysis with any of the current state-of-the-art methods. Experimental validations of a subset of dPeak predictions from PET ChIP-Seq data indicate that dPeak can estimate locations of binding events with as high as to resolution. Applications of dPeak to ChIP-Seq data in E. coli under aerobic and anaerobic conditions reveal closely located promoters that are differentially occupied and further illustrate the importance of high resolution analysis of ChIP-Seq data. 
Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-Seq) is widely used for studying in vivo protein-DNA interactions genome-wide. Current state-of-the-art ChIP-Seq protocols utilize single-end tag (SET) assay which only sequences ends of DNA fragments in the library. Although paired-end tag (PET) sequencing is routinely used in other applications of next generation sequencing, it has not been much adapted to ChIP-Seq. We illustrate both experimentally and computationally that PET sequencing significantly improves the resolution of ChIP-Seq experiments and enables ChIP-Seq applications in compact genomes like Escherichia coli (E. coli). To enable efficient identification using PET ChIP-Seq data, we develop dPeak as a high resolution binding site identification algorithm. dPeak implements probabilistic models for both SET and PET data and facilitates efficient analysis of both data types. Applications of dPeak to deeply sequenced E. coli PET and SET ChIP-Seq data establish significantly better resolution of PET compared to SET sequencing.
|
40
|
Genome-scale analysis of Escherichia coli FNR reveals complex features of transcription factor binding. PLoS Genet 2013; 9:e1003565. [PMID: 23818864 PMCID: PMC3688515 DOI: 10.1371/journal.pgen.1003565] [Citation(s) in RCA: 132] [Impact Index Per Article: 12.0] [Received: 02/20/2013] [Accepted: 04/29/2013] [Indexed: 01/05/2023] Open
Abstract
FNR is a well-studied global regulator of anaerobiosis, which is widely conserved across bacteria. Despite the importance of FNR and anaerobiosis in microbial lifestyles, the factors that influence its function on a genome-wide scale are poorly understood. Here, we report a functional genomic analysis of FNR action. We find that FNR occupancy at many target sites is strongly influenced by nucleoid-associated proteins (NAPs) that restrict access to many FNR binding sites. At a genome-wide level, only a subset of predicted FNR binding sites were bound under anaerobic fermentative conditions and many appeared to be masked by the NAPs H-NS, IHF and Fis. Similar assays in cells lacking H-NS and its paralog StpA showed increased FNR occupancy at sites bound by H-NS in WT strains, indicating that large regions of the genome are not readily accessible for FNR binding. Genome accessibility may also explain our finding that genome-wide FNR occupancy did not correlate with the match to consensus at binding sites, suggesting that significant variation in ChIP signal was attributable to cross-linking or immunoprecipitation efficiency rather than differences in binding affinities for FNR sites. Correlation of FNR ChIP-seq peaks with transcriptomic data showed that less than half of the FNR-regulated operons could be attributed to direct FNR binding. Conversely, FNR bound some promoters without regulating expression presumably requiring changes in activity of condition-specific transcription factors. Such combinatorial regulation may allow Escherichia coli to respond rapidly to environmental changes and confer an ecological advantage in the anaerobic but nutrient-fluctuating environment of the mammalian gut. Regulation of gene expression by transcription factors (TFs) is key to adaptation to environmental changes. 
Our comprehensive, genome-scale analysis of a prototypical global TF, the anaerobic regulator FNR from Escherichia coli, leads to several novel and unanticipated insights into the influences on FNR binding genome-wide and the complex structure of bacterial regulons. We found that binding of NAPs restricts FNR binding at a subset of sites, suggesting that the bacterial genome is not freely accessible for FNR binding. Our finding that less than half of the predicted FNR binding sites were occupied in vivo further challenges the utility of using bioinformatic searches alone to predict regulon structure, reinforcing the need for experimental determination of TF binding. By correlating the occupancy data with transcriptomic data, we confirm that FNR serves as a global signal of anaerobiosis, but expression of some operons in the FNR regulon requires other regulators sensitive to alternative environmental stimuli. Thus, FNR binding and regulation appear to depend on both the nucleoprotein structure of the chromosome and on combinatorial binding of FNR with other regulators. Both of these phenomena are typical of TF binding in eukaryotes; our results establish that they are also features of bacterial TF binding.
|
41
|
jMOSAiCS: joint analysis of multiple ChIP-seq datasets. Genome Biol 2013; 14:R38. [PMID: 23844871 PMCID: PMC4053760 DOI: 10.1186/gb-2013-14-4-r38] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Received: 10/02/2012] [Revised: 04/12/2013] [Accepted: 04/29/2013] [Indexed: 12/20/2022] Open
Abstract
The ChIP-seq technique enables genome-wide mapping of in vivo protein-DNA interactions and chromatin states. Current analytical approaches for ChIP-seq analysis are largely geared towards single-sample investigations, and have limited applicability in comparative settings that aim to identify combinatorial patterns of enrichment across multiple datasets. We describe a novel probabilistic method, jMOSAiCS, for jointly analyzing multiple ChIP-seq datasets. We demonstrate its usefulness with a wide range of data-driven computational experiments and with a case study of histone modifications on GATA1-occupied segments during erythroid differentiation. jMOSAiCS is open source software and can be downloaded from Bioconductor.
|
42
|
Abstract
Background ChIP-seq has become an important tool for identifying genome-wide protein-DNA interactions, including transcription factor binding and histone modifications. In ChIP-seq experiments, ChIP samples are usually coupled with their matching control samples. Proper normalization between the ChIP and control samples is an essential aspect of ChIP-seq data analysis. Results We have developed a novel method for estimating the normalization factor between the ChIP and the control samples. Our method, named as NCIS (Normalization of ChIP-seq) can accommodate both low and high sequencing depth datasets. We compare statistical properties of NCIS against existing methods in a set of diverse simulation settings, where NCIS enjoys the best estimation precision. In addition, we illustrate the impact of the normalization factor in FDR control and show that NCIS leads to more power among methods that control FDR at nominal levels. Conclusion Our results indicate that the proper normalization between the ChIP and control samples is an important step in ChIP-seq analysis in terms of power and error rate control. Our proposed method shows excellent statistical properties and is useful in the full range of ChIP-seq applications, especially with deeply sequenced data.
|
43
|
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-Seq) has revolutionized experiments for genome-wide profiling of DNA-binding proteins, histone modifications, and nucleosome occupancy. As the cost of sequencing is decreasing, many researchers are switching from microarray-based technologies (ChIP-chip) to ChIP-Seq for genome-wide study of transcriptional regulation. Despite its increasing and well-deserved popularity, there is little work that investigates and accounts for sources of biases in the ChIP-Seq technology. These biases typically arise from both the standard pre-processing protocol and the underlying DNA sequence of the generated data. We study data from a naked DNA sequencing experiment, which sequences non-cross-linked DNA after deproteinizing and shearing, to understand factors affecting background distribution of data generated in a ChIP-Seq experiment. We introduce a background model that accounts for apparent sources of biases such as mappability and GC content and develop a flexible mixture model named MOSAiCS for detecting peaks in both one- and two-sample analyses of ChIP-Seq data. We illustrate that our model fits observed ChIP-Seq data well and further demonstrate advantages of MOSAiCS over commonly used tools for ChIP-Seq data analysis with several case studies.
|
44
|
Discovering transcription factor binding sites in highly repetitive regions of genomes with multi-read analysis of ChIP-Seq data. PLoS Comput Biol 2011; 7:e1002111. [PMID: 21779159 PMCID: PMC3136429 DOI: 10.1371/journal.pcbi.1002111] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.0] [Received: 01/20/2011] [Accepted: 05/18/2011] [Indexed: 11/19/2022] Open
Abstract
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is rapidly replacing chromatin immunoprecipitation combined with genome-wide tiling array analysis (ChIP-chip) as the preferred approach for mapping transcription-factor binding sites and chromatin modifications. The state of the art for analyzing ChIP-seq data relies on using only reads that map uniquely to a relevant reference genome (uni-reads). This can lead to the omission of up to 30% of alignable reads. We describe a general approach for utilizing reads that map to multiple locations on the reference genome (multi-reads). Our approach is based on allocating multi-reads as fractional counts using a weighted alignment scheme. Using human STAT1 and mouse GATA1 ChIP-seq datasets, we illustrate that incorporation of multi-reads significantly increases sequencing depths, leads to detection of novel peaks that are not otherwise identifiable with uni-reads, and improves detection of peaks in mappable regions. We investigate various genome-wide characteristics of peaks detected only by utilization of multi-reads via computational experiments. Overall, peaks from multi-read analysis have similar characteristics to peaks that are identified by uni-reads except that the majority of them reside in segmental duplications. We further validate a number of GATA1 multi-read only peaks by independent quantitative real-time ChIP analysis and identify novel target genes of GATA1. These computational and experimental results establish that multi-reads can be of critical importance for studying transcription factor binding in highly repetitive regions of genomes with ChIP-seq experiments.
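The idea of allocating multi-reads as fractional counts can be sketched as follows. This is a single-pass heuristic written for illustration, not the weighted alignment scheme of the paper, whose details the abstract does not give. It assumes each multi-read is split across its candidate locations in proportion to the uni-read count already observed there, with a hypothetical pseudocount so that locations without uni-reads still receive some weight.

```python
from collections import defaultdict


def allocate_multireads(uni_counts, multi_reads, pseudocount=1.0):
    """Split each multi-read across its candidate locations.

    uni_counts:  dict mapping location -> number of uniquely mapped reads
    multi_reads: list of candidate-location lists, one list per multi-read
    Returns a dict mapping location -> uni-read count plus fractional counts.
    """
    totals = defaultdict(float)
    for loc, count in uni_counts.items():
        totals[loc] = float(count)

    for candidates in multi_reads:
        # Weight each candidate by its local uni-read support.
        weights = [uni_counts.get(loc, 0) + pseudocount for loc in candidates]
        norm = sum(weights)
        for loc, w in zip(candidates, weights):
            totals[loc] += w / norm   # fractions for one read sum to 1
    return dict(totals)


# Hypothetical example: one multi-read aligning to two locations.
uni = {"chr1:100": 9, "chr2:500": 0}
counts = allocate_multireads(uni, [["chr1:100", "chr2:500"]])
print(counts)
```

Because each read contributes fractions that sum to one, total depth is preserved; an iterative scheme could update the weights with the allocated counts and repeat.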
|
45
|
Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J R Stat Soc Series B Stat Methodol 2010; 72:3-25. [PMID: 20107611 PMCID: PMC2810828 DOI: 10.1111/j.1467-9868.2009.00723.x] [Citation(s) in RCA: 542] [Impact Index Per Article: 38.7] [Indexed: 11/28/2022]
Abstract
Partial least squares regression has been an alternative to ordinary least squares for handling multicollinearity in several areas of scientific research since the 1960s. It has recently gained much attention in the analysis of high dimensional genomic data. We show that known asymptotic consistency of the partial least squares estimator for a univariate response does not hold with the very large p and small n paradigm. We derive a similar result for a multivariate response regression with partial least squares. We then propose a sparse partial least squares formulation which aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors. We provide an efficient implementation of sparse partial least squares regression and compare it with well-known variable selection and dimension reduction approaches via simulation experiments. We illustrate the practical utility of sparse partial least squares regression in a joint analysis of gene expression and genomewide binding data.
|
46
|
Abstract
Microarray platforms are used increasingly to make comparative inferences through genome-wide surveys of gene expression. Although recent studies focus on describing the evidence for natural selection using estimates of the within- and between-taxa mutational variances, these methods do not explicitly or flexibly account for predicted nonindependence due to phylogenetic associations between measurements. In the interest of parsing the effects of selection, we introduce a mixture model for the comparative analysis of variation in gene expression across multiple taxa. This class of models isolates the phylogenetic signal from the nonphylogenetic and the heritable signal from the nonheritable while measuring the proper amount of correction. As a result, the mixture model resolves outstanding differences between existing models, relates different ways to estimate the across taxa variance, and induces a likelihood ratio test for selection. We investigate by simulation and application the feasibility and utility of estimation of the required parameters and the power of the proposed test. We illustrate analysis under this mixture model with a gene duplication family data set.
|
47
|
CSI-Tree: a regression tree approach for modeling binding properties of DNA-binding molecules based on cognate site identification (CSI) data. Nucleic Acids Res 2008; 36:3171-84. [PMID: 18411210 PMCID: PMC2425502 DOI: 10.1093/nar/gkn057] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022] Open
Abstract
The identification and characterization of binding sites of DNA-binding molecules, including transcription factors (TFs), is a critical problem at the interface of chemistry, biology and molecular medicine. The Cognate Site Identification (CSI) array is a high-throughput microarray platform for measuring comprehensive recognition profiles of DNA-binding molecules. This technique produces datasets that are useful not only for identifying binding sites of previously uncharacterized TFs but also for elucidating dependencies, both local and nonlocal, between the nucleotides at different positions of the recognition sites. We have developed a regression tree technique, CSI-Tree, for exploring the spectrum of binding sites of DNA-binding molecules. Our approach constructs regression trees utilizing the CSI data of unaligned sequences. The resulting model partitions the binding spectrum into homogeneous regions of position specific nucleotide effects. Each homogeneous partition is then summarized by a position weight matrix (PWM). Hence, the final outcome is a binding intensity rank-ordered collection of PWMs each of which spans a different region in the binding spectrum. Nodes of the regression tree depict the critical position/nucleotide combinations. We analyze the CSI data of the eukaryotic TF Nkx-2.5 and two engineered small molecule DNA ligands and obtain unique insights into their binding properties. The CSI tree for Nkx-2.5 reveals an interaction between two positions of the binding profile and elucidates how different nucleotide combinations at these two positions lead to different binding affinities. The CSI trees for the engineered DNA ligands exhibit a common preference for the dinucleotide AA in the first two positions, which is consistent with preference for a narrow and relatively flat minor groove. We carry out a reanalysis of these data with a mixture of PWMs approach. 
This approach is an advancement over the simple PWM model and accommodates position dependencies based on only sequence data. Our analysis indicates that the dependencies revealed by the CSI-Tree are challenging to discover without the actual binding intensities. Moreover, such a mixture model is highly sensitive to the number and length of the sequences analyzed. In contrast, CSI-Tree provides interpretable and concise summaries of the complete recognition profiles of DNA-binding molecules by utilizing binding affinities.
|
48
|
Mixture models with multiple levels, with application to the analysis of multifactor gene expression data. Biostatistics 2008; 9:540-54. [PMID: 18256042 DOI: 10.1093/biostatistics/kxm051] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Indexed: 11/13/2022] Open
Abstract
Model-based clustering is a popular tool for summarizing high-dimensional data. With the number of high-throughput large-scale gene expression studies still on the rise, the need for effective data-summarizing tools has never been greater. By grouping genes according to a common experimental expression profile, we may gain new insight into the biological pathways that steer biological processes of interest. Clustering of gene profiles can also assist in assigning functions to genes that have not yet been functionally annotated. In this paper, we propose 2 model selection procedures for model-based clustering. Model selection in model-based clustering has to date focused on the identification of data dimensions that are relevant for clustering. However, in more complex data structures, with multiple experimental factors, such an approach does not provide easily interpreted clustering outcomes. We propose a mixture model with multiple levels that provides sparse representations both "within" and "between" cluster profiles. We explore various flexible "within-cluster" parameterizations and discuss how efficient parameterizations can greatly enhance the objective interpretability of the generated clusters. Moreover, we allow for a sparse "between-cluster" representation with a different number of clusters at different levels of an experimental factor of interest. This enhances interpretability of clusters generated in multiple-factor contexts. Interpretable cluster profiles can assist in detecting biologically relevant groups of genes that may be missed with less efficient parameterizations. We use our multilevel mixture model to mine a proliferating cell line expression data set for annotational context and regulatory motifs. We also investigate the performance of the multilevel clustering approach on several simulated data sets.
|
49
|
CMARRT: a tool for the analysis of ChIP-chip data from tiling arrays by incorporating the correlation structure. Pacific Symposium on Biocomputing 2008:515-526. [PMID: 18229712 PMCID: PMC2862456] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2023]
Abstract
Whole genome tiling arrays at a user specified resolution are becoming a versatile tool in genomics. Chromatin immunoprecipitation on microarrays (ChIP-chip) is a powerful application of these arrays. Although there is an increasing number of methods for analyzing ChIP-chip data, perhaps the most simple and commonly used one, due to its computational efficiency, is testing with a moving average statistic. Current moving average methods assume exchangeability of the measurements within an array. They are not tailored to deal with the issues due to array designs such as overlapping probes that result in correlated measurements. We investigate the correlation structure of data from such arrays and propose an extension of the moving average testing via a robust and rapid method called CMARRT. We illustrate the pitfalls of ignoring the correlation structure in simulations and a case study. Our approach is implemented as an R package called CMARRT and can be used with any tiling array platform.
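Why ignoring correlation between neighbouring probes is a pitfall can be seen in a short Python sketch. It is not the CMARRT package: it uses the textbook variance of a mean of w correlated stationary values, sigma^2 / w^2 * (w + 2 * sum over k of (w - k) * rho_k), with sample autocorrelations plugged in, and the function names are hypothetical. It also assumes the probe measurements are not all identical, so the variance is nonzero.

```python
import math


def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var


def moving_average_z(x, w, use_correlation=True):
    """Standardized moving-average statistics over windows of w probes.

    With use_correlation=False the window mean is standardized as if the
    probes were independent; with True, the variance is inflated (or
    deflated) by the estimated autocorrelation at lags 1 .. w - 1.
    """
    n = len(x)
    mean = sum(x) / n
    sigma2 = sum((v - mean) ** 2 for v in x) / n
    factor = float(w)                       # independence: Var = sigma2 / w
    if use_correlation:
        factor += 2.0 * sum((w - k) * autocorr(x, k) for k in range(1, w))
    sd = math.sqrt(sigma2 * factor) / w
    return [(sum(x[i:i + w]) / w - mean) / sd for i in range(n - w + 1)]


# Smooth synthetic signal: neighbouring values are strongly correlated.
signal = [math.sin(i / 5) for i in range(100)]
z_indep = moving_average_z(signal, 3, use_correlation=False)
z_corr = moving_average_z(signal, 3, use_correlation=True)
print(max(abs(z) for z in z_indep), max(abs(z) for z in z_corr))
```

With positively correlated measurements, the independence assumption understates the variance of the window mean, so the unadjusted statistics are inflated relative to the adjusted ones.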
|
50
|
Abstract
Chromatin immunoprecipitation followed by DNA microarray analysis (ChIP-chip methodology) is an efficient way of mapping genome-wide protein-DNA interactions. Data from tiling arrays encompass DNA-protein interaction measurements on thousands or millions of short oligonucleotides (probes) tiling a whole chromosome or genome. We propose a new model-based method for analyzing ChIP-chip data. The proposed model is motivated by the widely used two-component multinomial mixture model of de novo motif finding. It utilizes a hierarchical gamma mixture model of binding intensities while incorporating inherent spatial structure of the data. In this model, genomic regions belong to either one of the following two general groups: regions with a local protein-DNA interaction (peak) and regions lacking this interaction. Individual probes within a genomic region are allowed to have different localization rates accommodating different binding affinities. A novel feature of this model is the incorporation of a distribution for the peak size derived from the experimental design and parameters. This leads to the relaxation of the fixed peak size assumption that is commonly employed when computing a test statistic for these types of spatial data. Simulation studies and a real data application demonstrate good operating characteristics of the method including high sensitivity with small sample sizes when compared to available alternative methods.
|