1
|
Pessoa P, Schweiger M, Pressé S. Avoiding matrix exponentials for large transition rate matrices. J Chem Phys 2024; 160:094109. [PMID: 38436441 PMCID: PMC10919955 DOI: 10.1063/5.0190527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2023] [Accepted: 02/07/2024] [Indexed: 03/05/2024] Open
Abstract
Exact methods for the exponentiation of matrices of dimension N can be computationally expensive in terms of execution time (N3) and memory requirements (N2), not to mention numerical precision issues. A matrix often exponentiated in the natural sciences is the rate matrix. Here, we explore five methods to exponentiate rate matrices, some of which apply more broadly to other matrix types. Three of the methods leverage a mathematical analogy between computing matrix elements of a matrix exponential process and computing transition probabilities of a dynamical process (technically a Markov jump process, MJP, typically simulated using Gillespie). In doing so, we identify a novel MJP-based method relying on restricting the number of "trajectory" jumps that incurs improved computational scaling. We then discuss this method's downstream implications on mixing properties of Monte Carlo posterior samplers. We also benchmark two other methods of matrix exponentiation valid for any matrix (beyond rate matrices and, more generally, positive definite matrices) related to solving differential equations: Runge-Kutta integrators and Krylov subspace methods. Under conditions where both the largest matrix element and the number of non-vanishing elements scale linearly with N-reasonable conditions for rate matrices often exponentiated-computational time scaling with the most competitive methods (Krylov and one of the MJP-based methods) reduces to N2 with total memory requirements of N.
Collapse
Affiliation(s)
| | | | - Steve Pressé
- Author to whom correspondence should be addressed:
| |
Collapse
|
2
|
Gorin G, Vastola JJ, Pachter L. Studying stochastic systems biology of the cell with single-cell genomics data. Cell Syst 2023; 14:822-843.e22. [PMID: 37751736 PMCID: PMC10725240 DOI: 10.1016/j.cels.2023.08.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2023] [Revised: 08/16/2023] [Accepted: 08/25/2023] [Indexed: 09/28/2023]
Abstract
Recent experimental developments in genome-wide RNA quantification hold considerable promise for systems biology. However, rigorously probing the biology of living cells requires a unified mathematical framework that accounts for single-molecule biological stochasticity in the context of technical variation associated with genomics assays. We review models for a variety of RNA transcription processes, as well as the encapsulation and library construction steps of microfluidics-based single-cell RNA sequencing, and present a framework to integrate these phenomena by the manipulation of generating functions. Finally, we use simulated scenarios and biological data to illustrate the implications and applications of the approach.
Collapse
Affiliation(s)
- Gennady Gorin
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125, USA
| | - John J Vastola
- Department of Neurobiology, Harvard Medical School, Boston, MA 02115, USA
| | - Lior Pachter
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA; Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA.
| |
Collapse
|
3
|
Kilic Z, Schweiger M, Moyer C, Pressé S. Monte Carlo samplers for efficient network inference. PLoS Comput Biol 2023; 19:e1011256. [PMID: 37463156 PMCID: PMC10353823 DOI: 10.1371/journal.pcbi.1011256] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2023] [Accepted: 06/09/2023] [Indexed: 07/20/2023] Open
Abstract
Accessing information on an underlying network driving a biological process often involves interrupting the process and collecting snapshot data. When snapshot data are stochastic, the data's structure necessitates a probabilistic description to infer underlying reaction networks. As an example, we may imagine wanting to learn gene state networks from the type of data collected in single molecule RNA fluorescence in situ hybridization (RNA-FISH). In the networks we consider, nodes represent network states, and edges represent biochemical reaction rates linking states. Simultaneously estimating the number of nodes and constituent parameters from snapshot data remains a challenging task in part on account of data uncertainty and timescale separations between kinetic parameters mediating the network. While parametric Bayesian methods learn parameters given a network structure (with known node numbers) with rigorously propagated measurement uncertainty, learning the number of nodes and parameters with potentially large timescale separations remain open questions. Here, we propose a Bayesian nonparametric framework and describe a hybrid Bayesian Markov Chain Monte Carlo (MCMC) sampler directly addressing these challenges. In particular, in our hybrid method, Hamiltonian Monte Carlo (HMC) leverages local posterior geometries in inference to explore the parameter space; Adaptive Metropolis Hastings (AMH) learns correlations between plausible parameter sets to efficiently propose probable models; and Parallel Tempering takes into account multiple models simultaneously with tempered information content to augment sampling efficiency. We apply our method to synthetic data mimicking single molecule RNA-FISH, a popular snapshot method in probing transcriptional networks to illustrate the identified challenges inherent to learning dynamical models from these snapshots and how our method addresses them.
Collapse
Affiliation(s)
- Zeliha Kilic
- Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
| | - Max Schweiger
- Center for Biological Physics, ASU, Tempe, Arizona, United States of America
- Department of Physics ASU, Tempe, Arizona, United States of America
| | - Camille Moyer
- Center for Biological Physics, ASU, Tempe, Arizona, United States of America
- School of Mathematics and Statistical Sciences, ASU, Tempe, Arizona, United States of America
| | - Steve Pressé
- Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
- Center for Biological Physics, ASU, Tempe, Arizona, United States of America
- School of Molecular Sciences, ASU, Tempe, Arizona, United States of America
| |
Collapse
|
4
|
Gorin G, Vastola JJ, Pachter L. Studying stochastic systems biology of the cell with single-cell genomics data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.17.541250. [PMID: 37292934 PMCID: PMC10245677 DOI: 10.1101/2023.05.17.541250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Recent experimental developments in genome-wide RNA quantification hold considerable promise for systems biology. However, rigorously probing the biology of living cells requires a unified mathematical framework that accounts for single-molecule biological stochasticity in the context of technical variation associated with genomics assays. We review models for a variety of RNA transcription processes, as well as the encapsulation and library construction steps of microfluidics-based single-cell RNA sequencing, and present a framework to integrate these phenomena by the manipulation of generating functions. Finally, we use simulated scenarios and biological data to illustrate the implications and applications of the approach.
Collapse
Affiliation(s)
- Gennady Gorin
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, 91125
| | - John J. Vastola
- Department of Neurobiology, Harvard Medical School, Boston, MA, 02115
| | - Lior Pachter
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125
- Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, 91125
| |
Collapse
|
5
|
Kilic Z, Schweiger M, Moyer C, Shepherd D, Pressé S. Gene expression model inference from snapshot RNA data using Bayesian non-parametrics. NATURE COMPUTATIONAL SCIENCE 2023; 3:174-183. [PMID: 38125199 PMCID: PMC10732567 DOI: 10.1038/s43588-022-00392-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/16/2022] [Accepted: 12/15/2022] [Indexed: 12/23/2023]
Abstract
Gene expression models, which are key towards understanding cellular regulatory response, underlie observations of single-cell transcriptional dynamics. Although RNA expression data encode information on gene expression models, existing computational frameworks do not perform simultaneous Bayesian inference of gene expression models and parameters from such data. Rather, gene expression models-composed of gene states, their connectivities and associated parameters-are currently deduced by pre-specifying gene state numbers and connectivity before learning associated rate parameters. Here we propose a method to learn full distributions over gene states, state connectivities and associated rate parameters, simultaneously and self-consistently from single-molecule RNA counts. We propagate noise from fluctuating RNA counts over models by treating models themselves as random variables. We achieve this within a Bayesian non-parametric paradigm. We demonstrate our method on the Escherichia coli lacZ pathway and the Saccharomyces cerevisiae STL1 pathway, and verify its robustness on synthetic data.
Collapse
Affiliation(s)
- Zeliha Kilic
- Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, TN, USA
- These authors contributed equally: Zeliha Kilic, Max Schweiger
| | - Max Schweiger
- Center for Biological Physics, ASU, Tempe, AZ, USA
- Department of Physics, ASU, Tempe, AZ, USA
- These authors contributed equally: Zeliha Kilic, Max Schweiger
| | - Camille Moyer
- Center for Biological Physics, ASU, Tempe, AZ, USA
- School of Mathematics and Statistical Sciences, ASU, Tempe, AZ, USA
| | - Douglas Shepherd
- Center for Biological Physics, ASU, Tempe, AZ, USA
- Department of Physics, ASU, Tempe, AZ, USA
| | - Steve Pressé
- Center for Biological Physics, ASU, Tempe, AZ, USA
- Department of Physics, ASU, Tempe, AZ, USA
- School of Molecular Sciences, ASU, Tempe, AZ, USA
| |
Collapse
|
6
|
Coulier A, Singh P, Sturrock M, Hellander A. Systematic comparison of modeling fidelity levels and parameter inference settings applied to negative feedback gene regulation. PLoS Comput Biol 2022; 18:e1010683. [PMID: 36520957 PMCID: PMC9799300 DOI: 10.1371/journal.pcbi.1010683] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2022] [Revised: 12/29/2022] [Accepted: 10/25/2022] [Indexed: 12/23/2022] Open
Abstract
Quantitative stochastic models of gene regulatory networks are important tools for studying cellular regulation. Such models can be formulated at many different levels of fidelity. A practical challenge is to determine what model fidelity to use in order to get accurate and representative results. The choice is important, because models of successively higher fidelity come at a rapidly increasing computational cost. In some situations, the level of detail is clearly motivated by the question under study. In many situations however, many model options could qualitatively agree with available data, depending on the amount of data and the nature of the observations. Here, an important distinction is whether we are interested in inferring the true (but unknown) physical parameters of the model or if it is sufficient to be able to capture and explain available data. The situation becomes complicated from a computational perspective because inference needs to be approximate. Most often it is based on likelihood-free Approximate Bayesian Computation (ABC) and here determining which summary statistics to use, as well as how much data is needed to reach the desired level of accuracy, are difficult tasks. Ultimately, all of these aspects-the model fidelity, the available data, and the numerical choices for inference-interplay in a complex manner. In this paper we develop a computational pipeline designed to systematically evaluate inference accuracy for a wide range of true known parameters. We then use it to explore inference settings for negative feedback gene regulation. In particular, we compare a detailed spatial stochastic model, a coarse-grained compartment-based multiscale model, and the standard well-mixed model, across several data-scenarios and for multiple numerical options for parameter inference. Practically speaking, this pipeline can be used as a preliminary step to guide modelers prior to gathering experimental data. By training Gaussian processes to approximate the distance function values, we are able to substantially reduce the computational cost of running the pipeline.
Collapse
Affiliation(s)
- Adrien Coulier
- Department of Information Technology, Uppsala University, Uppsala, Sweden
| | - Prashant Singh
- Science for Life Laboratory, Department of Information Technology, Uppsala University, Uppsala, Sweden
| | - Marc Sturrock
- Department of Physiology, Royal College of Surgeons in Ireland, Dublin, Ireland
| | - Andreas Hellander
- Department of Information Technology, Uppsala University, Uppsala, Sweden
- * E-mail:
| |
Collapse
|
7
|
Catanach TA, Vo HD, Munsky B. BAYESIAN INFERENCE OF STOCHASTIC REACTION NETWORKS USING MULTIFIDELITY SEQUENTIAL TEMPERED MARKOV CHAIN MONTE CARLO. INTERNATIONAL JOURNAL FOR UNCERTAINTY QUANTIFICATION 2020; 10:515-542. [PMID: 34007522 PMCID: PMC8127724 DOI: 10.1615/int.j.uncertaintyquantification.2020033241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/03/2023]
Abstract
Stochastic reaction network models are often used to explain and predict the dynamics of gene regulation in single cells. These models usually involve several parameters, such as the kinetic rates of chemical reactions, that are not directly measurable and must be inferred from experimental data. Bayesian inference provides a rigorous probabilistic framework for identifying these parameters by finding a posterior parameter distribution that captures their uncertainty. Traditional computational methods for solving inference problems such as Markov Chain Monte Carlo methods based on classical Metropolis-Hastings algorithm involve numerous serial evaluations of the likelihood function, which in turn requires expensive forward solutions of the chemical master equation (CME). We propose an alternate approach based on a multifidelity extension of the Sequential Tempered Markov Chain Monte Carlo (ST-MCMC) sampler. This algorithm is built upon Sequential Monte Carlo and solves the Bayesian inference problem by decomposing it into a sequence of efficiently solved subproblems that gradually increase both model fidelity and the influence of the observed data. We reformulate the finite state projection (FSP) algorithm, a well-known method for solving the CME, to produce a hierarchy of surrogate master equations to be used in this multifidelity scheme. To determine the appropriate fidelity, we introduce a novel information-theoretic criteria that seeks to extract the most information about the ultimate Bayesian posterior from each model in the hierarchy without inducing significant bias. This novel sampling scheme is tested with high performance computing resources using biologically relevant problems.
Collapse
Affiliation(s)
- Thomas A. Catanach
- Sandia National Laboratories, Livermore, CA, 94550
- Address all correspondence to: Thomas A. Catanach,
| | - Huy D. Vo
- Dept. of Chemical and Biological Engr., Colorado State University, Fort Collins, CO, 80521
| | - Brian Munsky
- Dept. of Chemical and Biological Engr., Colorado State University, Fort Collins, CO, 80521
| |
Collapse
|