1. Hobolth A, Rivas-González I, Bladt M, Futschik A. Phase-type distributions in mathematical population genetics: An emerging framework. Theor Popul Biol 2024; 157:14-32. [PMID: 38460602 DOI: 10.1016/j.tpb.2024.03.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2023] [Revised: 02/29/2024] [Accepted: 03/04/2024] [Indexed: 03/11/2024]
Abstract
A phase-type distribution is the time to absorption in a continuous- or discrete-time Markov chain. Phase-type distributions can be used as a general framework to calculate key properties of the standard coalescent model and many of its extensions. Here, the 'phases' in the phase-type distribution correspond to states in the ancestral process. For example, the time to the most recent common ancestor and the total branch length are phase-type distributed. Furthermore, the site frequency spectrum follows a multivariate discrete phase-type distribution and the joint distribution of total branch lengths in the two-locus coalescent-with-recombination model is multivariate phase-type distributed. In general, phase-type distributions provide a powerful mathematical framework for coalescent theory because they are analytically tractable using matrix manipulations. The purpose of this review is to explain the phase-type theory and demonstrate how the theory can be applied to derive basic properties of coalescent models. These properties can then be used to obtain insight into the ancestral process, or they can be applied for statistical inference. In particular, we show the relation between classical first-step analysis of coalescent models and phase-type calculations. We also show how reward transformations in phase-type theory lead to easy calculation of covariances and correlation coefficients between e.g. tree height, tree length, external branch length, and internal branch length. Furthermore, we discuss how these quantities can be used for statistical inference based on estimating equations. Providing an alternative to previous work based on the Laplace transform, we derive likelihoods for small-size coalescent trees based on phase-type theory. Overall, our main aim is to demonstrate that phase-type distributions provide a convenient general set of tools to understand aspects of coalescent models that are otherwise difficult to derive. 
Throughout the review, we emphasize the versatility of the phase-type framework, which is also illustrated by our accompanying R-code. All our analyses and figures can be reproduced from code available on GitHub.
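As an illustration of the matrix calculations the abstract refers to, the following is a minimal sketch (not the authors' R code) of how the expected tree height and total branch length of the standard coalescent could be obtained from a phase-type representation. It assumes the block-counting process with coalescence rate k(k-1)/2 in the state with k lineages, time measured in coalescent units, and the standard phase-type identity E[reward] = α(-S)⁻¹r. The function names `kingman_subintensity`, `solve`, and `expected_reward` are hypothetical.

```python
from fractions import Fraction


def kingman_subintensity(n):
    """Sub-intensity matrix S of the block-counting process for sample size n.

    Transient states are ordered n, n-1, ..., 2 lineages; 1 lineage is absorbing.
    Assumption: coalescence rate k(k-1)/2 when k lineages remain.
    """
    size = n - 1
    S = [[Fraction(0)] * size for _ in range(size)]
    for i in range(size):
        k = n - i
        rate = Fraction(k * (k - 1), 2)
        S[i][i] = -rate
        if i + 1 < size:
            S[i][i + 1] = rate
    return S


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    m = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(m):
        pivot = next(r for r in range(col, m) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(m):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[r][m] for r in range(m)]


def expected_reward(n, rewards):
    """Expected accumulated reward until absorption: alpha (-S)^{-1} r.

    alpha puts all mass on the first state (n lineages), so the result
    is the first entry of (-S)^{-1} r.
    """
    S = kingman_subintensity(n)
    neg_S = [[-v for v in row] for row in S]
    return solve(neg_S, rewards)[0]


n = 5
# Reward 1 in every state gives the tree height (time to the MRCA).
tree_height = expected_reward(n, [Fraction(1)] * (n - 1))
# Reward k in the state with k lineages gives the total branch length.
tree_length = expected_reward(n, [Fraction(n - i) for i in range(n - 1)])
print(tree_height, tree_length)
```

Changing only the reward vector switches between tree height and tree length, which is the idea behind the reward transformations mentioned in the abstract.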
Affiliation(s)
- Asger Hobolth
- Department of Mathematics, Aarhus University, Denmark.
- Mogens Bladt
- Department of Mathematical Sciences, University of Copenhagen, Denmark.
- Andreas Futschik
- Institute of Applied Statistics, Johannes Kepler University, Austria.
2. Mikula LC, Vogl C. The expected sample allele frequencies from populations of changing size via orthogonal polynomials. Theor Popul Biol 2024; 157:55-85. [PMID: 38552964 DOI: 10.1016/j.tpb.2024.03.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Revised: 03/24/2024] [Accepted: 03/26/2024] [Indexed: 04/11/2024]
Abstract
In this article, discrete and stochastic changes in (effective) population size are incorporated into the spectral representation of a biallelic diffusion process for drift and small mutation rates. A forward algorithm inspired by Hidden-Markov-Model (HMM) literature is used to compute exact sample allele frequency spectra for three demographic scenarios: single changes in (effective) population size, boom-bust dynamics, and stochastic fluctuations in (effective) population size. An approach for fully agnostic demographic inference from these sample allele spectra is explored, and sufficient statistics for stepwise changes in population size are found. Further, convergence behaviours of the polymorphic sample spectra for population size changes on different time scales are examined and discussed within the context of inference of the effective population size. Joint visual assessment of the sample spectra and the temporal coefficients of the spectral decomposition of the forward diffusion process is found to be important in determining departure from equilibrium. Stochastic changes in (effective) population size are shown to shape sample spectra particularly strongly.
Affiliation(s)
- Lynette Caitlin Mikula
- Centre for Biological Diversity, School of Biology, University of St Andrews, St Andrews KY16 9TH, UK.
- Claus Vogl
- Department of Biomedical Sciences and Pathobiology, Vetmeduni Vienna, Veterinärplatz 1, A-1210 Wien, Austria; Vienna Graduate School of Population Genetics, Vetmeduni Vienna, Veterinärplatz 1, A-1210 Wien, Austria.
3. Rivas-González I, Schierup MH, Wakeley J, Hobolth A. TRAILS: Tree reconstruction of ancestry using incomplete lineage sorting. PLoS Genet 2024; 20:e1010836. [PMID: 38330138 PMCID: PMC10880969 DOI: 10.1371/journal.pgen.1010836] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Revised: 02/21/2024] [Accepted: 01/22/2024] [Indexed: 02/10/2024] [Open Access]
Abstract
Genome-wide genealogies of multiple species carry detailed information about demographic and selection processes on individual branches of the phylogeny. Here, we introduce TRAILS, a hidden Markov model that accurately infers time-resolved population genetics parameters, such as ancestral effective population sizes and speciation times, for ancestral branches using a multi-species alignment of three species and an outgroup. TRAILS leverages the information contained in incomplete lineage sorting fragments by modelling genealogies along the genome as rooted three-leaved trees, each with a topology and two coalescent events happening in discretized time intervals within the phylogeny. Posterior decoding of the hidden Markov model can be used to infer the ancestral recombination graph for the alignment and details on demographic changes within a branch. Since TRAILS performs posterior decoding at the base-pair level, genome-wide scans based on the posterior probabilities can be devised to detect deviations from neutrality. Using TRAILS on a human-chimp-gorilla-orangutan alignment, we recover speciation parameters and extract information about the topology and coalescent times at high resolution.
Affiliation(s)
- Mikkel H. Schierup
- Bioinformatics Research Center (BiRC), Aarhus University, Aarhus, Denmark
- John Wakeley
- Department of Organismic and Evolutionary Biology, Harvard University, Massachusetts, United States of America
- Asger Hobolth
- Department of Mathematics, Aarhus University, Aarhus, Denmark
4. Miró Pina V, Joly É, Siri-Jégousse A. Estimating the Lambda measure in multiple-merger coalescents. Theor Popul Biol 2023; 154:94-101. [PMID: 37742787 DOI: 10.1016/j.tpb.2023.09.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Revised: 09/13/2023] [Accepted: 09/15/2023] [Indexed: 09/26/2023]
Abstract
Multiple-merger coalescents, also known as Λ-coalescents, have been used to describe the genealogy of populations that have a skewed offspring distribution or that undergo strong selection. Inferring the characteristic measure Λ, which describes the rates of the multiple-merger events, is key to understand these processes. So far, most inference methods only work for some particular families of Λ-coalescents that are described by only one parameter, but not for more general models. This article is devoted to the construction of a non-parametric estimator of the density of Λ that is based on the observation at a single time of the so-called Site Frequency Spectrum (SFS), which describes the allelic frequencies in a present population sample. First, we produce estimates of the multiple-merger rates by solving a linear system, whose coefficients are obtained by appropriately subsampling the SFS. Then, we use a technique that aggregates the information extracted from the previous step through a kernel type of re-construction to give a non-parametric estimation of the measure Λ. We give a consistency result of this estimator under mild conditions on the behavior of Λ around 0. We also show some numerical examples of how our method performs.
Affiliation(s)
- Verónica Miró Pina
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Instituto de Investigaciones en Matemáticas Aplicadas y Sistemas, Universidad Nacional Autónoma de México, CDMX, Mexico
- Émilien Joly
- Centro de Investigación en Matemáticas, AC (CIMAT), Guanajuato, Mexico
- Arno Siri-Jégousse
- Instituto de Investigaciones en Matemáticas Aplicadas y Sistemas, Universidad Nacional Autónoma de México, CDMX, Mexico.
5. De Bin R, Stikbakke VG. A boosting first-hitting-time model for survival analysis in high-dimensional settings. Lifetime Data Analysis 2023; 29:420-440. [PMID: 35476164 PMCID: PMC10006065 DOI: 10.1007/s10985-022-09553-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/03/2021] [Accepted: 03/25/2022] [Indexed: 06/13/2023]
Abstract
In this paper we propose a boosting algorithm to extend the applicability of a first hitting time model to high-dimensional frameworks. Based on an underlying stochastic process, first hitting time models do not require the proportional hazards assumption, hardly verifiable in the high-dimensional context, and represent a valid parametric alternative to the Cox model for modelling time-to-event responses. First hitting time models also offer a natural way to integrate low-dimensional clinical and high-dimensional molecular information in a prediction model, that avoids complicated weighting schemes typical of current methods. The performance of our novel boosting algorithm is illustrated in three real data examples.
Affiliation(s)
- Riccardo De Bin
- Department of Mathematics, University of Oslo, Moltke Moes vei 35, 0851 Oslo, Norway
6. Bisschop G. Graph-based algorithms for Laplace transformed coalescence time distributions. PLoS Comput Biol 2022; 18:e1010532. [PMID: 36108047 PMCID: PMC9514611 DOI: 10.1371/journal.pcbi.1010532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2022] [Revised: 09/27/2022] [Accepted: 09/01/2022] [Indexed: 11/25/2022] [Open Access]
Abstract
Extracting information on the selective and demographic past of populations that is contained in samples of genome sequences requires a description of the distribution of the underlying genealogies. Using the Laplace transform, this distribution can be generated with a simple recursive procedure, regardless of model complexity. Assuming an infinite-sites mutation model, the probability of observing specific configurations of linked variants within small haplotype blocks can be recovered from the Laplace transform of the joint distribution of branch lengths. However, the repeated differentiation required to compute these probabilities has proven to be a serious computational bottleneck in earlier implementations. Here, I show that the state space diagram can be turned into a computational graph, allowing efficient evaluation of the Laplace transform by means of a graph traversal algorithm. This general algorithm can, for example, be applied to tabulate the likelihoods of mutational configurations in non-recombining blocks. This work provides a crucial speed up for existing composite likelihood approaches that rely on the joint distribution of branch lengths to fit isolation with migration models and estimate the parameters of selective sweeps. The associated software is available as an open-source Python library, agemo. For simple models of idealised populations, the process that generates the observed sequences can be mathematically described. For a given number of samples, we can enumerate all possible genealogies. We can even incorporate the impact of past events like population size reductions on the observed sequence variation. However, the number of possible genealogies will become very large, very fast. So, to extract information from the observed mutations, we need mathematical tools and efficient algorithms to use the information contained within the large collection of possible genealogies. 
The Laplace transform is one such mathematical tool that allows us to recursively generate the branch length distribution of all genealogies. Here I show how the transform can be represented as a graph. Using this nonlinear data structure, I define a general procedure to efficiently evaluate the associated mathematical expressions. And I further show how this can be used to speed up existing composite likelihood approaches to fit demographic models and estimate sweep parameters. The associated software, agemo, has a well-documented Python API and has been designed with extensibility in mind, making it an ideal back-end for many other inference approaches in population genetics.
Affiliation(s)
- Gertjan Bisschop
- University of Edinburgh, Institute of Evolution and Ecology, Edinburgh, United Kingdom
7. Legried B, Terhorst J. Rates of convergence in the two-island and isolation-with-migration models. Theor Popul Biol 2022; 147:16-27. [PMID: 36007782 DOI: 10.1016/j.tpb.2022.08.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2021] [Revised: 08/10/2022] [Accepted: 08/11/2022] [Indexed: 11/25/2022]
Abstract
A number of powerful demographic inference methods have been developed in recent years, with the goal of fitting rich evolutionary models to genetic data obtained from many populations. In this paper we investigate the statistical performance of these methods in the specific case where there is continuous migration between populations. Compared with earlier work, migration significantly complicates the theoretical analysis and requires new techniques. We employ the theories of phase-type distributions and concentration of measure in order to study the two-island and isolation-with-migration models, resulting in both upper and lower bounds on rates of convergence for parametric estimators in migration models. For the upper bounds, we consider inferring rates of coalescent and migration on the basis of directly observing pairwise coalescent times, and, more realistically, when (conditionally) Poisson-distributed mutations dropped on latent trees are observed. We complement these upper bounds with information-theoretic lower bounds which establish a limit, in terms of sample size, below which inference is effectively impossible.
Affiliation(s)
- Brandon Legried
- Department of Statistics, University of Michigan, United States of America
- Jonathan Terhorst
- Department of Statistics, University of Michigan, United States of America.
8. The shape of a seed bank tree. J Appl Probab 2022. [DOI: 10.1017/jpr.2021.79] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
We derive the asymptotic behavior of the total, active, and inactive branch lengths of the seed bank coalescent when the initial sample size grows to infinity. These random variables have important applications for populations evolving under some seed bank effects, such as plants and bacteria, and for some cases of structured populations like metapopulations. The proof relies on the analysis of the tree at a stopping time corresponding to the first time a deactivated lineage is reactivated. We also give conditional sampling formulas for the random partition, and we study the system at the time of the first reactivation of a lineage. All these results provide a good picture of the different regimes and behaviors of the block-counting process of the seed bank coalescent.
9. Biddanda A, Steinrücken M, Novembre J. Properties of Two-Locus Genealogies and Linkage Disequilibrium in Temporally Structured Samples. Genetics 2022; 221:6549526. [PMID: 35294015 PMCID: PMC9245597 DOI: 10.1093/genetics/iyac038] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/10/2022] [Accepted: 02/06/2022] [Indexed: 11/13/2022] [Open Access]
Abstract
Archaeogenetics has been revolutionary, revealing insights into demographic history and recent positive selection. However, most studies to date have ignored the non-random association of genetic variants at different loci (i.e., linkage disequilibrium, LD). This may be in part because basic properties of LD in samples from different times are still not well understood. Here, we derive several results for summary statistics of haplotypic variation under a model with time-stratified sampling: 1) The correlation between the number of pairwise differences observed between time-staggered samples (πΔt) in models with and without strict population continuity; 2) The product of the LD coefficient, D, between ancient and modern samples, which is a measure of haplotypic similarity between modern and ancient samples; and 3) The expected switch rate in the Li and Stephens haplotype copying model. The latter has implications for genotype imputation and phasing in ancient samples with modern reference panels. Overall, these results provide a characterization of how haplotype patterns are affected by sample age, recombination rates, and population sizes. We expect these results will help guide the interpretation and analysis of haplotype data from ancient and modern samples.
Affiliation(s)
- Arjun Biddanda
- Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
- Matthias Steinrücken
- Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA; Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637, USA
- John Novembre
- Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA; Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637, USA
10. Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, Tsambos G, Zhu S, Eldon B, Ellerman EC, Galloway JG, Gladstein AL, Gorjanc G, Guo B, Jeffery B, Kretzschumar WW, Lohse K, Matschiner M, Nelson D, Pope NS, Quinto-Cortés CD, Rodrigues MF, Saunack K, Sellinger T, Thornton K, van Kemenade H, Wohns AW, Wong Y, Gravel S, Kern AD, Koskela J, Ralph PL, Kelleher J. Efficient ancestry and mutation simulation with msprime 1.0. Genetics 2022; 220:iyab229. [PMID: 34897427 PMCID: PMC9176297 DOI: 10.1093/genetics/iyab229] [Citation(s) in RCA: 116] [Impact Index Per Article: 58.0] [Received: 09/22/2021] [Accepted: 12/03/2021] [Indexed: 11/13/2022] [Open Access]
Abstract
Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime's many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.
Affiliation(s)
- Franz Baumdicker
- Cluster of Excellence “Controlling Microbes to Fight Infections”, Mathematical and Computational Population Genetics, University of Tübingen, 72076 Tübingen, Germany
- Gertjan Bisschop
- Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3FL, UK
- Daniel Goldstein
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Graham Gower
- Lundbeck GeoGenetics Centre, Globe Institute, University of Copenhagen, 1350 Copenhagen K, Denmark
- Aaron P Ragsdale
- Department of Integrative Biology, University of Wisconsin–Madison, Madison, WI 53706, USA
- Georgia Tsambos
- Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Parkville, VIC 3010, Australia
- Sha Zhu
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Bjarki Eldon
- Leibniz Institute for Evolution and Biodiversity Science, Museum für Naturkunde, Berlin 10115, Germany
- Jared G Galloway
- Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
- Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA 98102, USA
- Ariella L Gladstein
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7264, USA
- Embark Veterinary, Inc., Boston, MA 02111, USA
- Gregor Gorjanc
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK
- Bing Guo
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Ben Jeffery
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Warren W Kretzschumar
- Center for Hematology and Regenerative Medicine, Karolinska Institute, 141 83 Huddinge, Sweden
- Konrad Lohse
- Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3FL, UK
- Dominic Nelson
- Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Nathaniel S Pope
- Department of Entomology, Pennsylvania State University, State College, PA 16802, USA
- Consuelo D Quinto-Cortés
- National Laboratory of Genomics for Biodiversity (LANGEBIO), Unit of Advanced Genomics, CINVESTAV, Irapuato, Mexico
- Murillo F Rodrigues
- Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
- Thibaut Sellinger
- Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, 85354 Freising, Germany
- Kevin Thornton
- Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA
- Anthony W Wohns
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Yan Wong
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Simon Gravel
- Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Andrew D Kern
- Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
- Jere Koskela
- Department of Statistics, University of Warwick, Coventry CV4 7AL, UK
- Peter L Ralph
- Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
- Department of Mathematics, University of Oregon, Eugene, OR 97403-5289, USA
- Jerome Kelleher
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
11. Multivariate phase-type theory for the site frequency spectrum. J Math Biol 2021; 83:63. [PMID: 34783900 DOI: 10.1007/s00285-021-01689-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2021] [Revised: 08/09/2021] [Accepted: 10/13/2021] [Indexed: 10/19/2022]
Abstract
Linear functions of the site frequency spectrum (SFS) play a major role for understanding and investigating genetic diversity. Estimators of the mutation rate (e.g. based on the total number of segregating sites or average of the pairwise differences) and tests for neutrality (e.g. Tajima's D) are perhaps the most well-known examples. The distribution of linear functions of the SFS is important for constructing confidence intervals for the estimators, and to determine significance thresholds for neutrality tests. These distributions are often approximated using simulation procedures. In this paper we use multivariate phase-type theory to specify, characterize and calculate the distribution of linear functions of the site frequency spectrum. In particular, we show that many of the classical estimators of the mutation rate are distributed according to a discrete phase-type distribution. Neutrality tests, however, are generally not discrete phase-type distributed. For neutrality tests we derive the probability generating function using continuous multivariate phase-type theory, and numerically invert the function to obtain the distribution. A main result is an analytically tractable formula for the probability generating function of the SFS. Software implementation of the phase-type methodology is available in the R package PhaseTypeR, and R code for the reproduction of our results is available as an accompanying vignette.
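To make "linear functions of the SFS" concrete, here is a small sketch that is independent of the PhaseTypeR package named in the abstract. It assumes the unfolded spectrum is stored as a list whose i-th entry (1-based) counts sites with i derived alleles, and it uses the textbook expectation E[ξ_i] = θ/i under the standard neutral coalescent. The function names are hypothetical.

```python
from math import comb


def expected_sfs(n, theta):
    """Expected unfolded SFS for sample size n, assuming E[xi_i] = theta / i."""
    return [theta / i for i in range(1, n)]


def watterson_theta(sfs):
    """Watterson's estimator: number of segregating sites divided by a_n."""
    n = len(sfs) + 1
    a_n = sum(1.0 / i for i in range(1, n))
    return sum(sfs) / a_n


def pairwise_theta(sfs):
    """Average number of pairwise differences, written as a weighted sum of the SFS."""
    n = len(sfs) + 1
    return sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / comb(n, 2)


def tajima_numerator(sfs):
    """Numerator of Tajima's D: the difference between the two estimators."""
    return pairwise_theta(sfs) - watterson_theta(sfs)


sfs = expected_sfs(10, 2.0)
print(watterson_theta(sfs), pairwise_theta(sfs), tajima_numerator(sfs))
```

Applied to the expected spectrum, both estimators return θ, so the Tajima numerator is zero up to rounding. The distribution of these quantities around their expectations is what the phase-type machinery described in the abstract addresses.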
12. Hurtado PJ, Richards C. Building mean field ODE models using the generalized linear chain trick & Markov chain theory. Journal of Biological Dynamics 2021; 15:S248-S272. [PMID: 33847236 DOI: 10.1080/17513758.2021.1912418] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/01/2020] [Accepted: 03/17/2021] [Indexed: 06/12/2023]
Abstract
The well known linear chain trick (LCT) allows modellers to derive mean field ODEs that assume gamma (Erlang) distributed passage times, by transitioning individuals sequentially through a chain of sub-states. The time spent in these sub-states is the sum of k exponentially distributed random variables, and is thus gamma distributed. The generalized linear chain trick (GLCT) extends this technique to the broader phase-type family of distributions, which includes exponential, Erlang, hypoexponential, and Coxian distributions. Phase-type distributions are the family of matrix exponential distributions on [0,∞) that represent the absorption time distributions for finite-state, continuous time Markov chains (CTMCs). Here we review CTMCs and phase-type distributions, then illustrate how to use the GLCT to efficiently build ODE models from underlying stochastic model assumptions. We introduce two novel model families by using the GLCT to generalize the Rosenzweig-MacArthur predator-prey model, and the SEIR model. We illustrate the kinds of complexity that can be captured by such models through multiple examples. We also show the benefits of using a GLCT-based model formulation to speed up the computation of numerical solutions to such models. These results highlight the intuitive nature, and utility, of using the GLCT to derive ODE models from first principles.
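The linear chain trick described in the abstract can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: a cohort starts in the first of k sub-states, each sub-state is left at the same rate, and the ODEs are integrated with a hand-written classical Runge-Kutta step. The fraction still inside the chain at time t should then match the Erlang(k, rate) survival function. All function names and the parameter values are hypothetical.

```python
import math


def lct_rhs(x, rate):
    """Right-hand side of the linear chain ODEs for sub-state occupancies x."""
    dx = [-rate * x[0]]
    for i in range(1, len(x)):
        dx.append(rate * x[i - 1] - rate * x[i])
    return dx


def rk4_step(x, rate, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = lct_rhs(x, rate)
    k2 = lct_rhs([xi + 0.5 * h * ki for xi, ki in zip(x, k1)], rate)
    k3 = lct_rhs([xi + 0.5 * h * ki for xi, ki in zip(x, k2)], rate)
    k4 = lct_rhs([xi + h * ki for xi, ki in zip(x, k3)], rate)
    return [
        xi + h / 6 * (a + 2 * b + 2 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    ]


def fraction_remaining(k, rate, t_end, steps=2000):
    """Fraction of the cohort still in one of the k sub-states at time t_end."""
    x = [1.0] + [0.0] * (k - 1)
    h = t_end / steps
    for _ in range(steps):
        x = rk4_step(x, rate, h)
    return sum(x)


def erlang_survival(k, rate, t):
    """P(T > t) when T is the sum of k independent Exp(rate) variables."""
    return math.exp(-rate * t) * sum(
        (rate * t) ** j / math.factorial(j) for j in range(k)
    )


print(fraction_remaining(3, 1.5, 2.0), erlang_survival(3, 1.5, 2.0))
```

The generalized version in the abstract replaces this single chain with an arbitrary phase-type transition structure; the same comparison would then use the corresponding matrix-exponential survival function.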
13. Zeng K, Charlesworth B, Hobolth A. Studying models of balancing selection using phase-type theory. Genetics 2021; 218:6237896. [PMID: 33871627 DOI: 10.1093/genetics/iyab055] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/01/2021] [Accepted: 03/25/2021] [Indexed: 11/15/2022] [Open Access]
Abstract
Balancing selection (BLS) is the evolutionary force that maintains high levels of genetic variability in many important genes. To further our understanding of its evolutionary significance, we analyze models with BLS acting on a biallelic locus: an equilibrium model with long-term BLS, a model with long-term BLS and recent changes in population size, and a model of recent BLS. Using phase-type theory, a mathematical tool for analyzing continuous time Markov chains with an absorbing state, we examine how BLS affects polymorphism patterns in linked neutral regions, as summarized by nucleotide diversity, the expected number of segregating sites, the site frequency spectrum, and the level of linkage disequilibrium (LD). Long-term BLS affects polymorphism patterns in a relatively small genomic neighborhood, and such selection targets are easier to detect when the equilibrium frequencies of the selected variants are close to 50%, or when there has been a population size reduction. For a new mutation subject to BLS, its initial increase in frequency in the population causes linked neutral regions to have reduced diversity, an excess of both high and low frequency derived variants, and elevated LD with the selected locus. These patterns are similar to those produced by selective sweeps, but the effects of recent BLS are weaker. Nonetheless, compared to selective sweeps, nonequilibrium polymorphism and LD patterns persist for a much longer period under recent BLS, which may increase the chance of detecting such selection targets. An R package for analyzing these models, among others (e.g., isolation with migration), is available.
Affiliation(s)
- Kai Zeng
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield S10 2TN, UK
- Brian Charlesworth
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3FL, UK
- Asger Hobolth
- Department of Mathematics, Aarhus University, Aarhus DK-8000, Denmark
14. Freund F, Siri-Jégousse A. The impact of genetic diversity statistics on model selection between coalescents. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2020.107055] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
15. Blath J, Buzzoni E, Koskela J, Wilke Berenguer M. Statistical tools for seed bank detection. Theor Popul Biol 2020; 132:1-15. [PMID: 31945384 DOI: 10.1016/j.tpb.2020.01.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/09/2019] [Revised: 12/24/2019] [Accepted: 01/02/2020] [Indexed: 10/25/2022]
Abstract
We derive statistical tools to analyze the patterns of genetic variability produced by models related to seed banks; in particular the Kingman coalescent, its time-changed counterpart describing so-called weak seed banks, the strong seed bank coalescent, and the two-island structured coalescent. As (strong) seed banks stratify a population, we expect them to produce a signal comparable to population structure. We present tractable formulas for Wright's FST and the expected site frequency spectrum for these models, and show that they can distinguish between some models for certain ranges of parameters. We then use pseudo-marginal MCMC to show that the full likelihood can reliably distinguish between all models in the presence of parameter uncertainty under moderate stratification, and point out statistical pitfalls arising from stratification that is either too strong or too weak. We further show that it is possible to infer parameters, and in particular determine whether mutation is taking place in the (strong) seed bank.
Affiliation(s)
- Jochen Blath
- Institut für Mathematik, Technische Universität Berlin, Straße des 17. Juni 136, 10623 Berlin, Germany.
- Eugenio Buzzoni
- Institut für Mathematik, Technische Universität Berlin, Straße des 17. Juni 136, 10623 Berlin, Germany.
- Jere Koskela
- Department of Statistics, University of Warwick, Coventry CV4 7AL, UK.
- Maite Wilke Berenguer
- Fakultät für Mathematik, Ruhr-Universität Bochum, Universitätsstraße 150, 44801 Bochum, Germany.