1
Zuo C, Chen K, Keleş S. A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets. J Comput Biol 2016; 24:472-485. [PMID: 27835030 DOI: 10.1089/cmb.2016.0138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Current analytic approaches for querying large collections of chromatin immunoprecipitation followed by sequencing (ChIP-seq) data from multiple cell types rely on analyzing each data set independently (i.e., peak calling). This approach ignores the fact that functional elements are frequently shared among related cell types and leads to overestimation of the extent of divergence between different ChIP-seq samples. Methods geared toward multisample investigations have limited applicability in settings that aim to integrate hundreds to thousands of ChIP-seq data sets for query loci (e.g., thousands of genomic loci with a specific binding site). Recently, Zuo et al. developed a hierarchical framework for state-space matrix inference and clustering, named MBASIC, to enable joint analysis of user-specified loci across multiple ChIP-seq data sets. Although this versatile framework both estimates the underlying state-space (e.g., bound vs. unbound) and groups loci with similar patterns together, its Expectation-Maximization-based estimation structure hinders its applicability to large numbers of loci and samples. We address this limitation by developing a MAP-based asymptotic derivations from Bayes (MAD-Bayes) framework for MBASIC. This results in a K-means-like optimization algorithm that converges rapidly and hence enables exploration of multiple initialization schemes and flexible tuning. Comparison with MBASIC indicates that this speed comes at a relatively insignificant loss in estimation accuracy. Although MAD-Bayes MBASIC is specifically designed for the analysis of user-specified loci, it is able to capture overall patterns of histone marks from multiple ChIP-seq data sets similar to those identified by genome-wide segmentation methods such as ChromHMM and Spectacle.
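To give a feel for what a "K-means-like" algorithm of this kind can look like, here is a minimal sketch of a DP-means-style procedure, the simplest member of the small-variance-asymptotics family. It is not the MAD-Bayes MBASIC algorithm from the paper: it assumes the bound/unbound states are already given as a 0/1 matrix, and the function name `dp_means` and the penalty parameter `lam` are hypothetical names chosen for illustration.

```python
import numpy as np

def dp_means(X, lam, n_iter=50):
    """K-means-like clustering in which opening a new cluster costs `lam`.

    X   : (n_loci, n_samples) array, e.g. 0/1 bound-vs-unbound states.
    lam : penalty on the number of clusters (hypothetical tuning knob).
    """
    X = np.asarray(X, dtype=float)
    centers = X.mean(axis=0, keepdims=True)   # start from a single cluster
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        old = labels.copy()
        for i, x in enumerate(X):
            d = ((centers - x) ** 2).sum(axis=1)   # squared distance to each center
            if d.min() > lam:
                # No center is close enough: open a new cluster at this locus.
                centers = np.vstack([centers, x])
                labels[i] = len(centers) - 1
            else:
                labels[i] = int(d.argmin())
        # Drop empty clusters, relabel to 0..K-1, and recompute the means.
        used = np.unique(labels)
        labels = np.searchsorted(used, labels)
        centers = np.vstack([X[labels == k].mean(axis=0) for k in range(len(used))])
        if np.array_equal(labels, old):
            break
    return labels, centers

# Toy usage: ten loci with two opposite binding patterns across six samples.
X = np.array([[1, 1, 1, 0, 0, 0]] * 5 + [[0, 0, 0, 1, 1, 1]] * 5)
labels, centers = dp_means(X, lam=1.0)
```

Each pass costs only a distance computation per locus and center, which is why rerunning with several initializations or several values of `lam` is cheap compared with a full EM fit.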
Affiliation(s)
- Chandler Zuo
  - Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin
- Kailei Chen
  - Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin
- Sündüz Keleş
  - Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin
2
Zuo C, Chen K, Hewitt KJ, Bresnick EH, Keleş S. A Hierarchical Framework for State-Space Matrix Inference and Clustering. Ann Appl Stat 2016; 10:1348-1372. [PMID: 29910842 DOI: 10.1214/16-aoas938] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/14/2023]
Abstract
In recent years, a large number of genomic and epigenomic studies have focused on the integrative analysis of multiple experimental datasets measured over a large number of observational units. The objectives of such studies include not only inferring a hidden state of activity for each unit over individual experiments, but also detecting highly associated clusters of units based on their inferred states. Although there are a number of methods tailored for specific datasets, there is currently no state-of-the-art modeling framework for this general class of problems. In this paper, we develop the MBASIC (Matrix Based Analysis for State-space Inference and Clustering) framework. MBASIC consists of two parts: state-space mapping and state-space clustering. In state-space mapping, MBASIC maps observations onto a finite state-space, representing the activation states of units across conditions. In state-space clustering, MBASIC incorporates a finite mixture model to cluster the units based on their inferred state-space profiles across all conditions. Both the state-space mapping and clustering can be simultaneously estimated through an Expectation-Maximization algorithm. MBASIC flexibly adapts to a large number of parametric distributions for the observed data, as well as the heterogeneity in replicate experiments. It allows for imposing structural assumptions on each cluster, and enables model selection using information criteria. In our data-driven simulation studies, MBASIC showed significant accuracy in recovering both the underlying state-space variables and clustering structures. We applied MBASIC to two genome research problems using large numbers of datasets from the ENCODE project. The first application grouped genes based on transcription factor occupancy profiles of their promoter regions in two different cell types. The second application focused on identifying groups of loci that are similar to a GATA2 binding site that is functional at its endogenous locus by utilizing transcription factor occupancy data, and illustrated the applicability of MBASIC in a wide variety of problems. In both studies, MBASIC showed higher levels of raw data fidelity than analyzing these data with a two-step approach using ENCODE results on transcription factor occupancy data.
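The clustering half of this description can be illustrated with a small sketch of EM for a finite mixture over binary state profiles. This is a deliberate simplification and not the MBASIC implementation: it assumes the states have already been inferred and are given as a 0/1 matrix, whereas MBASIC estimates the mapping and the clustering jointly. The function name `bernoulli_mixture_em` and its parameters are hypothetical.

```python
import numpy as np

def bernoulli_mixture_em(S, n_clusters, n_iter=100, seed=0):
    """EM for a mixture of independent Bernoullis over binary state profiles.

    S : (n_units, n_conditions) 0/1 matrix of already-inferred states
        (an assumption of this sketch; MBASIC infers them jointly).
    """
    S = np.asarray(S, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = S.shape
    pi = np.full(n_clusters, 1.0 / n_clusters)               # mixing weights
    theta = rng.uniform(0.25, 0.75, size=(n_clusters, m))    # P(state = 1 | cluster)
    for _ in range(n_iter):
        # E-step: responsibilities from log p(S_i | cluster k) + log pi_k.
        log_r = S @ np.log(theta).T + (1 - S) @ np.log(1 - theta).T + np.log(pi)
        log_r -= log_r.max(axis=1, keepdims=True)             # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights and per-cluster state probabilities.
        nk = r.sum(axis=0)
        pi = np.clip(nk / n, 1e-6, None)
        pi /= pi.sum()
        theta = (r.T @ S) / np.maximum(nk, 1e-12)[:, None]
        theta = np.clip(theta, 1e-3, 1 - 1e-3)                # keep logs finite
    return r.argmax(axis=1), pi, theta

# Toy usage: ten units with two opposite activation patterns over six conditions.
S = np.array([[1, 1, 1, 0, 0, 0]] * 5 + [[0, 0, 0, 1, 1, 1]] * 5)
labels, pi, theta = bernoulli_mixture_em(S, n_clusters=2)
```

Every iteration touches all units and all clusters with soft assignments, which is the cost that the K-means-like alternative in the first entry is meant to reduce.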
Affiliation(s)
- Chandler Zuo
  - Department of Statistics, University of Wisconsin, Madison, WI, U.S.A.; Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, U.S.A.
- Kailei Chen
  - Department of Statistics, University of Wisconsin, Madison, WI, U.S.A.; Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, U.S.A.
- Kyle J Hewitt
  - Department of Cell and Regenerative Biology, University of Wisconsin, Madison, WI, U.S.A.
- Emery H Bresnick
  - Department of Cell and Regenerative Biology, University of Wisconsin, Madison, WI, U.S.A.
- Sündüz Keleş
  - Department of Statistics, University of Wisconsin, Madison, WI, U.S.A.; Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, U.S.A.