1
Lin Y, Wu TY, Chen X, Wan S, Chao B, Xin J, Yang JYH, Wong WH, Wang YXR. Data integration and inference of gene regulation using single-cell temporal multimodal data with scTIE. Genome Res 2024; 34:119-133. [PMID: 38190633 PMCID: PMC10903952 DOI: 10.1101/gr.277960.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Accepted: 12/13/2023] [Indexed: 01/10/2024]
Abstract
Single-cell technologies offer unprecedented opportunities to dissect gene regulatory mechanisms in context-specific ways. Although there are computational methods for extracting gene regulatory relationships from scRNA-seq and scATAC-seq data, the data integration problem, essential for accurate cell type identification, has been mostly treated as a standalone challenge. Here we present scTIE, a unified method that integrates temporal multimodal data and infers regulatory relationships predictive of cellular state changes. scTIE uses an autoencoder to embed cells from all time points into a common space by using iterative optimal transport, followed by extracting interpretable information to predict cell trajectories. Using a variety of synthetic and real temporal multimodal data sets, we show scTIE achieves effective data integration while preserving more biological signals than existing methods, particularly in the presence of batch effects and noise. Furthermore, on the exemplar multiome data set we generated from differentiating mouse embryonic stem cells over time, we show scTIE captures regulatory elements highly predictive of cell transition probabilities, providing new potentials to understand the regulatory landscape driving developmental processes.
Affiliation(s)
- Yingxin Lin
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR 999077, China
- Tung-Yu Wu
  - Department of Statistics, Stanford University, Stanford, California 94305-4020, USA
- Xi Chen
  - Department of Statistics, Stanford University, Stanford, California 94305-4020, USA
- Sheng Wan
  - Institute of Electronics, National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan
- Brian Chao
  - Department of Electrical Engineering, Stanford University, Stanford, California 94305-9505, USA
- Jingxue Xin
  - Department of Statistics, Stanford University, Stanford, California 94305-4020, USA
- Jean Y H Yang
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR 999077, China
- Wing H Wong
  - Department of Statistics, Stanford University, Stanford, California 94305-4020, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, California 94305-5464, USA
  - Bio-X Program, Stanford University, Stanford, California 94305, USA
- Y X Rachel Wang
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
2
Fan X, Wang YXR, Sarkar P, Yue Y. A Unified Framework for Tuning Hyperparameters in Clustering Problems. Stat Sin 2024. [DOI: 10.5705/ss.202021.0427] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
3
Lin Y, Wu TY, Chen X, Wan S, Chao B, Xin J, Yang JY, Wong WH, Wang YXR. scTIE: data integration and inference of gene regulation using single-cell temporal multimodal data. bioRxiv 2023:2023.05.18.541381. [PMID: 37292801 PMCID: PMC10245711 DOI: 10.1101/2023.05.18.541381] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 06/10/2023]
Abstract
Single-cell technologies offer unprecedented opportunities to dissect gene regulatory mechanisms in context-specific ways. Although there are computational methods for extracting gene regulatory relationships from scRNA-seq and scATAC-seq data, the data integration problem, essential for accurate cell type identification, has been mostly treated as a standalone challenge. Here we present scTIE, a unified method that integrates temporal multimodal data and infers regulatory relationships predictive of cellular state changes. scTIE uses an autoencoder to embed cells from all time points into a common space using iterative optimal transport, followed by extracting interpretable information to predict cell trajectories. Using a variety of synthetic and real temporal multimodal datasets, we demonstrate scTIE achieves effective data integration while preserving more biological signals than existing methods, particularly in the presence of batch effects and noise. Furthermore, on the exemplar multiome dataset we generated from differentiating mouse embryonic stem cells over time, we demonstrate scTIE captures regulatory elements highly predictive of cell transition probabilities, providing new potentials to understand the regulatory landscape driving developmental processes.
Affiliation(s)
- Yingxin Lin
  - School of Mathematics and Statistics, The University of Sydney, NSW, Australia
  - Charles Perkins Centre, The University of Sydney, NSW, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
- Tung-Yu Wu
  - Department of Statistics, Stanford University, CA, USA
- Xi Chen
  - Department of Statistics, Stanford University, CA, USA
- Sheng Wan
  - Institute of Electronics, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Brian Chao
  - Department of Electrical Engineering, Stanford University, CA, USA
- Jingxue Xin
  - Department of Statistics, Stanford University, CA, USA
- Jean Y.H. Yang
  - School of Mathematics and Statistics, The University of Sydney, NSW, Australia
  - Charles Perkins Centre, The University of Sydney, NSW, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
- Wing H. Wong
  - Department of Statistics, Stanford University, CA, USA
  - Department of Biomedical Data Science, Stanford University, CA, USA
  - Bio-X Program, Stanford University, CA, USA
- Y. X. Rachel Wang
  - School of Mathematics and Statistics, The University of Sydney, NSW, Australia
4
Lin Y, Wu TY, Wan S, Yang JYH, Wong WH, Wang YXR. scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning. Nat Biotechnol 2022; 40:703-710. [DOI: 10.1038/s41587-021-01161-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 04/05/2021] [Accepted: 11/16/2021] [Indexed: 12/11/2022]
5
Abstract
The rise of network data in many different domains has offered researchers new insight into the problem of modeling complex systems and propelled the development of numerous innovative statistical methodologies and computational tools. In this paper, we primarily focus on two types of biological networks, gene networks and brain networks, where statistical network modeling has found both fruitful and challenging applications. Unlike other network examples such as social networks, where network edges can be directly observed, both gene and brain networks require careful estimation of edges using covariates as a first step. We provide a discussion of existing statistical and computational methods for edge estimation and subsequent statistical inference problems in these two types of biological networks.
Affiliation(s)
- Y X Rachel Wang
  - School of Mathematics and Statistics, University of Sydney, Australia
- Lexin Li
  - Department of Biostatistics and Epidemiology, School of Public Health, University of California, Berkeley
- Haiyan Huang
  - Department of Statistics, University of California, Berkeley
6
Wu TY, Wang YXR, Wong WH. Mini-Batch Metropolis–Hastings With Reversible SGLD Proposal. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1782222] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Affiliation(s)
- Tung-Yu Wu
  - Institute for Computational & Mathematical Engineering, Stanford University, Stanford, CA
- Y. X. Rachel Wang
  - School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, Australia
- Wing H. Wong
  - Department of Statistics, Stanford University, Stanford, CA
7
Ursu O, Boley N, Taranova M, Wang YXR, Yardimci GG, Stafford Noble W, Kundaje A. GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs. Bioinformatics 2018; 34:2701-2707. [PMID: 29554289 DOI: 10.1093/bioinformatics/bty164] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 10/18/2017] [Accepted: 03/15/2018] [Indexed: 02/04/2023]
Abstract
Motivation: The three-dimensional organization of chromatin plays a critical role in gene regulation and disease. High-throughput chromosome conformation capture experiments such as Hi-C are used to obtain genome-wide maps of three-dimensional chromatin contacts. However, robust estimation of data quality and systematic comparison of these contact maps is challenging due to the multi-scale, hierarchical structure of chromatin contacts and the resulting properties of experimental noise in the data. Measuring concordance of contact maps is important for assessing reproducibility of replicate experiments and for modeling variation between different cellular contexts. Results: We introduce a concordance measure called DIfferences between Smoothed COntact maps (GenomeDISCO) for assessing the similarity of a pair of contact maps obtained from chromosome conformation capture experiments. The key idea is to smooth contact maps using random walks on the contact map graph, before estimating concordance. We use simulated datasets to benchmark GenomeDISCO's sensitivity to different types of noise that affect chromatin contact maps. When applied to a large collection of Hi-C datasets, GenomeDISCO accurately distinguishes biological replicates from samples obtained from different cell types. GenomeDISCO also generalizes to other chromosome conformation capture assays, such as HiChIP. Availability and implementation: Software implementing GenomeDISCO is available at https://github.com/kundajelab/genomedisco. Supplementary information: Supplementary data are available at Bioinformatics online.
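The smoothing idea in this abstract can be illustrated with a minimal sketch. It is not the GenomeDISCO implementation: the function names, the row normalization, the default of three walk steps, and the final "1 minus mean L1 difference" score are all assumptions chosen for illustration. Each contact map is treated as a weighted graph, converted to a random-walk transition matrix, raised to a small power to smooth it, and then compared entrywise.

```python
import numpy as np

def smooth_by_random_walk(contacts, steps):
    """Smooth a contact map by taking `steps` random-walk steps on its graph.

    Illustrative assumption: rows are normalized to sum to 1 so the map acts
    as a transition matrix; rows with no contacts are left as zeros.
    """
    contacts = np.asarray(contacts, dtype=float)
    row_sums = contacts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for empty rows
    transition = contacts / row_sums
    return np.linalg.matrix_power(transition, steps)

def concordance_sketch(map_a, map_b, steps=3):
    """Hypothetical concordance score: 1 minus the mean per-row L1 difference
    between the two smoothed maps (1.0 means identical smoothed maps)."""
    smoothed_a = smooth_by_random_walk(map_a, steps)
    smoothed_b = smooth_by_random_walk(map_b, steps)
    n_rows = smoothed_a.shape[0]
    difference = np.abs(smoothed_a - smoothed_b).sum() / n_rows
    return 1.0 - difference
```

With this scoring choice, two identical maps score 1.0, and the score decreases as the smoothed maps diverge. The number of steps controls how much local noise is averaged out before comparison.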
Affiliation(s)
- Oana Ursu
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Nathan Boley
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Maryna Taranova
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Y X Rachel Wang
  - Department of Statistics, Stanford University, Stanford, CA, USA
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, WA, USA
  - Department of Computer Science and Engineering, University of Washington, WA, USA
- Anshul Kundaje
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - Department of Computer Science, Stanford University, Stanford, CA, USA
8
Abstract
Chromosome conformation capture experiments such as Hi-C are used to map the three-dimensional spatial organization of genomes. One specific feature of the 3D organization is known as topologically associating domains (TADs), which are densely interacting, contiguous chromatin regions playing important roles in regulating gene expression. A few algorithms have been proposed to detect TADs. In particular, the structure of Hi-C data naturally inspires application of community detection methods. However, one of the drawbacks of community detection is that most methods take exchangeability of the nodes in the network for granted, whereas the nodes in this case, that is, the positions on the chromosomes, are not exchangeable. We propose a network model for detecting TADs using Hi-C data that takes into account this nonexchangeability. In addition, our model explicitly makes use of cell-type specific CTCF binding sites as biological covariates and can be used to identify conserved TADs across multiple cell types. The model leads to a likelihood objective that can be efficiently optimized via relaxation. We also prove that when suitably initialized, this model finds the underlying TAD structure with high probability. Using simulated data, we show the advantages of our method and the caveats of popular community detection methods, such as spectral clustering, in this application. Applying our method to real Hi-C data, we demonstrate the domains identified have desirable epigenetic features and compare them across different cell types.
9
Wang YXR, Liu K, Theusch E, Rotter JI, Medina MW, Waterman MS, Huang H, Stegle O. Generalized correlation measure using count statistics for gene expression data with ordered samples. Bioinformatics 2018; 34:617-624. [PMID: 29040382 DOI: 10.1093/bioinformatics/btx641] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/13/2017] [Accepted: 10/11/2017] [Indexed: 12/22/2022]
Abstract
Motivation: Capturing association patterns in gene expression levels under different conditions or time points is important for inferring gene regulatory interactions. In practice, temporal changes in gene expression may result in complex association patterns that require more sophisticated detection methods than simple correlation measures. For instance, the effect of regulation may lead to time-lagged associations and interactions local to a subset of samples. Furthermore, expression profiles of interest may not be aligned or directly comparable (e.g. gene expression profiles from two species). Results: We propose a count statistic for measuring association between pairs of gene expression profiles consisting of ordered samples (e.g. time-course), where correlation may only exist locally in subsequences separated by a position shift. The statistic is simple and fast to compute, and we illustrate its use in two applications. In a cross-species comparison of developmental gene expression levels, we show our method not only measures association of gene expressions between the two species, but also provides alignment between different developmental stages. In the second application, we applied our statistic to expression profiles from two distinct phenotypic conditions, where the samples in each profile are ordered by the associated phenotypic values. The detected associations can be useful in building correspondence between gene association networks under different phenotypes. On the theoretical side, we provide asymptotic distributions of the statistic for different regions of the parameter space and test its power on simulated data. Availability and implementation: The code used to perform the analysis is available as part of the Supplementary Material. Contact: msw@usc.edu or hhuang@stat.berkeley.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Y X Rachel Wang
  - School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia
- Ke Liu
  - Department of Statistics, University of California, Berkeley, CA 94720, USA
- Elizabeth Theusch
  - Children's Hospital Oakland Research Institute, Oakland, CA 94609, USA
- Jerome I Rotter
  - The Institute for Translational Genomics and Population Sciences, Departments of Pediatrics and Medicine, LABioMed at Harbor-UCLA Medical Center, Torrance, CA 90502, USA
- Marisa W Medina
  - Children's Hospital Oakland Research Institute, Oakland, CA 94609, USA
- Michael S Waterman
  - Molecular and Computational Biology, University of Southern California, CA 90089, USA
- Haiyan Huang
  - Department of Statistics, University of California, Berkeley, CA 94720, USA
10
11
Wang YXR, Jiang K, Feldman LJ, Bickel PJ, Huang H. Inferring gene–gene interactions and functional modules using sparse canonical correlation analysis. Ann Appl Stat 2015. [DOI: 10.1214/14-aoas792] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Indexed: 11/19/2022]
12
Bhaskar A, Wang YXR, Song YS. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data. Genome Res 2015; 25:268-79. [PMID: 25564017 PMCID: PMC4315300 DOI: 10.1101/gr.178756.114] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.9] [Indexed: 01/23/2023]
Abstract
With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal that is difficult to pick up with small sample sizes. Lastly, we use our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing data set of tens of thousands of individuals assayed at a few hundred genic regions.
Affiliation(s)
- Anand Bhaskar
  - Simons Institute for the Theory of Computing, Berkeley, California 94720, USA
  - Computer Science Division, University of California, Berkeley, California 94720, USA
- Y X Rachel Wang
  - Department of Statistics, University of California, Berkeley, California 94720, USA
- Yun S Song
  - Simons Institute for the Theory of Computing, Berkeley, California 94720, USA
  - Computer Science Division, University of California, Berkeley, California 94720, USA
  - Department of Statistics, University of California, Berkeley, California 94720, USA
  - Department of Integrative Biology, University of California, Berkeley, California 94720, USA
13
Wang YXR, Huang H. Review on statistical methods for gene network reconstruction using expression data. J Theor Biol 2014; 362:53-61. [PMID: 24726980 DOI: 10.1016/j.jtbi.2014.03.040] [Citation(s) in RCA: 97] [Impact Index Per Article: 9.7] [Received: 01/26/2014] [Revised: 03/29/2014] [Accepted: 03/31/2014] [Indexed: 12/16/2022]
Abstract
Network modeling has proven to be a fundamental tool in analyzing the inner workings of a cell. It has revolutionized our understanding of biological processes and made significant contributions to the discovery of disease biomarkers. Much effort has been devoted to reconstructing various types of biochemical networks using functional genomic datasets generated by high-throughput technologies. This paper discusses statistical methods used to reconstruct gene regulatory networks using gene expression data. In particular, we highlight progress made and challenges yet to be met in the problems involved in estimating gene interactions, inferring causality and modeling temporal changes of regulation behaviors. As rapid advances in technologies have made available diverse, large-scale genomic data, we also survey methods of incorporating all these additional data to achieve better, more accurate inference of gene networks.
Affiliation(s)
- Y X Rachel Wang
  - Department of Statistics, University of California, Berkeley, CA 94720, USA
- Haiyan Huang
  - Department of Statistics, University of California, Berkeley, CA 94720, USA
14
Steinrücken M, Wang YXR, Song YS. An explicit transition density expansion for a multi-allelic Wright-Fisher diffusion with general diploid selection. Theor Popul Biol 2012; 83:1-14. [PMID: 23127866 DOI: 10.1016/j.tpb.2012.10.006] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Received: 07/21/2012] [Revised: 10/14/2012] [Accepted: 10/15/2012] [Indexed: 10/27/2022]
Abstract
Characterizing the time evolution of allele frequencies in a population is a fundamental problem in population genetics. In the Wright-Fisher diffusion, such dynamics are captured by the transition density function, which satisfies well-known partial differential equations. For a multi-allelic model with general diploid selection, various theoretical results exist on representations of the transition density, but finding an explicit formula has remained a difficult problem. In this paper, a technique recently developed for a diallelic model is extended to find an explicit transition density for an arbitrary number of alleles, under a general diploid selection model with recurrent parent-independent mutation. Specifically, the method finds the eigenvalues and eigenfunctions of the generator associated with the multi-allelic diffusion, thus yielding an accurate spectral representation of the transition density. Furthermore, this approach allows for efficient, accurate computation of various other quantities of interest, including the normalizing constant of the stationary distribution and the rate of convergence to this distribution.