1
Stammnitz MR, Gori K, Murchison EP. No evidence that a transmissible cancer has shifted from emergence to endemism in Tasmanian devils. Royal Society Open Science 2024; 11:231875. [PMID: 38633353 PMCID: PMC11022658 DOI: 10.1098/rsos.231875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2023] [Revised: 03/01/2024] [Accepted: 03/04/2024] [Indexed: 04/19/2024]
Abstract
Tasmanian devils are endangered by a transmissible cancer known as Tasmanian devil facial tumour 1 (DFT1). A 2020 study by Patton et al. (Science 370, eabb9772 (doi:10.1126/science.abb9772)) used genome data from DFT1 tumours to produce a dated phylogenetic tree for this transmissible cancer lineage, and thence, using phylodynamics models, to estimate its epidemiological parameters and predict its future trajectory. It concluded that the effective reproduction number for DFT1 had declined to a value of one, and that the disease had shifted from emergence to endemism. We show that the study is based on erroneous mutation calls and flawed methodology, and that its conclusions cannot be substantiated.
Affiliation(s)
- Maximilian R. Stammnitz
- Transmissible Cancer Group, Department of Veterinary Medicine, University of Cambridge, Cambridge, UK
- Kevin Gori
- Transmissible Cancer Group, Department of Veterinary Medicine, University of Cambridge, Cambridge, UK
- Elizabeth P. Murchison
- Transmissible Cancer Group, Department of Veterinary Medicine, University of Cambridge, Cambridge, UK
2
Parag KV, Obolski U. Risk averse reproduction numbers improve resurgence detection. PLoS Comput Biol 2023; 19:e1011332. [PMID: 37471464 PMCID: PMC10393178 DOI: 10.1371/journal.pcbi.1011332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2022] [Accepted: 07/06/2023] [Indexed: 07/22/2023] Open
Abstract
The effective reproduction number R is a prominent statistic for inferring the transmissibility of infectious diseases and effectiveness of interventions. R purportedly provides an easy-to-interpret threshold for deducing whether an epidemic will grow (R>1) or decline (R<1). We posit that this interpretation can be misleading and statistically overconfident when applied to infections accumulated from groups featuring heterogeneous dynamics. These groups may be delineated by geography, infectiousness or sociodemographic factors. In these settings, R implicitly weights the dynamics of the groups by their number of circulating infections. We find that this weighting can cause delayed detection of outbreak resurgence and premature signalling of epidemic control because it underrepresents the risks from highly transmissible groups. Applying E-optimal experimental design theory, we develop a weighting algorithm to minimise these issues, yielding the risk averse reproduction number E. Using simulations, analytic approaches and real-world COVID-19 data stratified at the city and district level, we show that E meaningfully summarises transmission dynamics across groups, balancing bias from the averaging underlying R with variance from directly using local group estimates. An E>1 generates timely resurgence signals (upweighting risky groups), while an E<1 ensures local outbreaks are under control. We propose E as an alternative to R for informing policy and assessing transmissibility at large scales (e.g., state-wide or nationally), where R is commonly computed but well-mixed or homogeneity assumptions break down.
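The weighting issue described in this abstract can be illustrated with a minimal sketch. It assumes the pooled R is the average of group-level reproduction numbers weighted by each group's circulating infections; the function name and all numbers are hypothetical and are not taken from the paper, and the risk averse E itself is not implemented here.

```python
def pooled_reproduction_number(group_r, group_infections):
    """Average of group reproduction numbers weighted by circulating infections.

    Illustrative assumption: each group's weight is its share of all
    circulating infections across groups.
    """
    total = sum(group_infections)
    if total <= 0:
        raise ValueError("at least one group must have circulating infections")
    weighted = sum(r * w for r, w in zip(group_r, group_infections))
    return weighted / total


# Hypothetical groups: two large declining groups and one small resurging group.
group_r = [0.8, 0.8, 1.6]
group_infections = [500.0, 400.0, 100.0]

pooled = pooled_reproduction_number(group_r, group_infections)
print(f"pooled R = {pooled:.2f}, largest group R = {max(group_r):.2f}")
```

Under these made-up numbers the pooled value stays below one even though one group is above one, which is the kind of masking the abstract warns about.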
Affiliation(s)
- Kris V Parag
- MRC Centre for Global Infectious Disease Analysis, Imperial College London, London, United Kingdom
- Uri Obolski
- School of Public Health, Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Porter School of the Environment and Earth Sciences, Faculty of Exact Sciences, Tel Aviv University, Tel Aviv, Israel
3
Cappello L, Kim J, Palacios JA. adaPop: Bayesian inference of dependent population dynamics in coalescent models. PLoS Comput Biol 2023; 19:e1010897. [PMID: 36940209 PMCID: PMC10063170 DOI: 10.1371/journal.pcbi.1010897] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Revised: 03/30/2023] [Accepted: 01/25/2023] [Indexed: 03/21/2023] Open
Abstract
The coalescent is a powerful statistical framework that allows us to infer past population dynamics leveraging the ancestral relationships reconstructed from sampled molecular sequence data. In many biomedical applications, such as in the study of infectious diseases, cell development, and tumorigenesis, several distinct populations share evolutionary history and therefore become dependent. The inference of such dependence is a highly important yet challenging problem. With advances in sequencing technologies, we are well positioned to exploit the wealth of high-resolution biological data for tackling this problem. Here, we present adaPop, a probabilistic model to estimate past population dynamics of dependent populations and to quantify their degree of dependence. An essential feature of our approach is the ability to track the time-varying association between the populations while making minimal assumptions on their functional shapes via Markov random field priors. We provide nonparametric estimators, extensions of our base model that integrate multiple data sources, and fast scalable inference algorithms. We test our method using simulated data under various dependent population histories and demonstrate the utility of our model in shedding light on evolutionary histories of different variants of SARS-CoV-2.
Affiliation(s)
- Lorenzo Cappello
- Departments of Economics and Business, Universitat Pompeu Fabra, Barcelona, Spain
- Jaehee Kim
- Department of Computational Biology, Cornell University, Ithaca, New York, United States of America
- Julia A. Palacios
- Departments of Statistics and Biomedical Data Science, Stanford University, Stanford, California, United States of America
4
Upadhya G, Steinrücken M. Robust inference of population size histories from genomic sequencing data. PLoS Comput Biol 2022; 18:e1010419. [PMID: 36112715 PMCID: PMC9518926 DOI: 10.1371/journal.pcbi.1010419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2021] [Revised: 09/28/2022] [Accepted: 07/21/2022] [Indexed: 02/08/2023] Open
Abstract
Unraveling the complex demographic histories of natural populations is a central problem in population genetics. Understanding past demographic events is of general anthropological interest, but is also an important step in establishing accurate null models when identifying adaptive or disease-associated genetic variation. An important class of tools for inferring past population size changes from genomic sequence data are Coalescent Hidden Markov Models (CHMMs). These models make efficient use of the linkage information in population genomic datasets by using the local genealogies relating sampled individuals as latent states that evolve along the chromosome in an HMM framework. Extending these models to large sample sizes is challenging, since the number of possible latent states increases rapidly. Here, we present our method CHIMP (CHMM History-Inference Maximum-Likelihood Procedure), a novel CHMM method for inferring the size history of a population. It can be applied to large samples (hundreds of haplotypes) and only requires unphased genomes as input. The two implementations of CHIMP that we present here use either the height of the genealogical tree (TMRCA) or the total branch length, respectively, as the latent variable at each position in the genome. The requisite transition and emission probabilities are obtained by numerically solving certain systems of differential equations derived from the ancestral process with recombination. The parameters of the population size history are subsequently inferred using an Expectation-Maximization algorithm. In addition, we implement a composite likelihood scheme to allow the method to scale to large sample sizes. We demonstrate the efficiency and accuracy of our method in a variety of benchmark tests using simulated data and present comparisons to other state-of-the-art methods. 
Specifically, our implementation using TMRCA as the latent variable shows comparable performance and provides accurate estimates of effective population sizes in intermediate and ancient times. Our method is agnostic to the phasing of the data, which makes it a promising alternative in scenarios where high quality data is not available, and has potential applications for pseudo-haploid data.
Affiliation(s)
- Gautam Upadhya
- Department of Physics, University of Chicago, Chicago, Illinois, United States of America
- Matthias Steinrücken
- Department of Ecology and Evolution, University of Chicago, Chicago, Illinois, United States of America
- Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
5
Parag KV, Donnelly CA, Zarebski AE. Quantifying the information in noisy epidemic curves. Nature Computational Science 2022; 2:584-594. [PMID: 38177483 DOI: 10.1038/s43588-022-00313-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 03/01/2022] [Accepted: 08/08/2022] [Indexed: 01/06/2024]
Abstract
Reliably estimating the dynamics of transmissible diseases from noisy surveillance data is an enduring problem in modern epidemiology. Key parameters are often inferred from incident time series, with the aim of informing policy-makers on the growth rate of outbreaks or testing hypotheses about the effectiveness of public health interventions. However, the reliability of these inferences depends critically on reporting errors and latencies innate to the time series. Here, we develop an analytical framework to quantify the uncertainty induced by under-reporting and delays in reporting infections, as well as a metric for ranking surveillance data informativeness. We apply this metric to two primary data sources for inferring the instantaneous reproduction number: epidemic case and death curves. We find that the assumption of death curves as more reliable, commonly made for acute infectious diseases such as COVID-19 and influenza, is not obvious and possibly untrue in many settings. Our framework clarifies and quantifies how actionable information about pathogen transmissibility is lost due to surveillance limitations.
Affiliation(s)
- Kris V Parag
- NIHR Health Protection Research Unit in Behavioural Science and Evaluation, University of Bristol, Bristol, UK
- MRC Centre for Global Infectious Disease Analysis, Imperial College London, London, UK
- Christl A Donnelly
- MRC Centre for Global Infectious Disease Analysis, Imperial College London, London, UK
- Department of Statistics, University of Oxford, Oxford, UK
6
Bouckaert RR. An Efficient Coalescent Epoch Model for Bayesian Phylogenetic Inference. Syst Biol 2022; 71:1549-1560. [PMID: 35212733 PMCID: PMC9773037 DOI: 10.1093/sysbio/syac015] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 06/29/2021] [Revised: 01/24/2022] [Accepted: 02/22/2022] [Indexed: 12/25/2022] Open
Abstract
We present a two-headed approach called Bayesian Integrated Coalescent Epoch PlotS (BICEPS) for efficient inference of coalescent epoch models. Firstly, we integrate out population size parameters, and secondly, we introduce a set of more powerful Markov chain Monte Carlo (MCMC) proposals for flexing and stretching trees. Even though population sizes are integrated out and not explicitly sampled through MCMC, we are still able to generate samples from the population size posteriors. This allows demographic reconstruction through time and estimating the timing and magnitude of population bottlenecks and full population histories. Altogether, BICEPS can be considered a more muscular version of the popular Bayesian skyline model. We demonstrate its power and correctness by a well-calibrated simulation study. Furthermore, we demonstrate with an application to SARS-CoV-2 genomic data that some analyses that have trouble converging with the traditional Bayesian skyline prior and standard MCMC proposals can do well with the BICEPS approach. BICEPS is available as an open-source package for BEAST 2 under the GPL license and has a user-friendly graphical user interface. [Keywords: Bayesian phylogenetics; BEAST 2; BICEPS; coalescent model.]
Affiliation(s)
- Remco R Bouckaert
- Correspondence to be sent to: University of Auckland, Thomas Building, Room 407, 3 Symonds St, Auckland 1010, New Zealand
7
Cappello L, Palacios JA. Adaptive Preferential Sampling in Phylodynamics With an Application to SARS-CoV-2. J Comput Graph Stat 2021; 31:541-552. [PMID: 36035966 PMCID: PMC9409340 DOI: 10.1080/10618600.2021.1987256] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Longitudinal molecular data of rapidly evolving viruses and pathogens provide information about disease spread and complement traditional surveillance approaches based on case count data. The coalescent is used to model the genealogy that represents the sample ancestral relationships. The basic assumption is that coalescent events occur at a rate inversely proportional to the effective population size Ne(t), a time-varying measure of genetic diversity. When the sampling process (collection of samples over time) depends on Ne(t), the coalescent and the sampling processes can be jointly modeled to improve estimation of Ne(t). Failing to do so can lead to bias due to model misspecification. However, the way that the sampling process depends on the effective population size may vary over time. We introduce an approach where the sampling process is modeled as an inhomogeneous Poisson process with rate equal to the product of Ne(t) and a time-varying coefficient, making minimal assumptions on their functional shapes via Markov random field priors. We provide efficient algorithms for inference, show the model performance vis-a-vis alternative methods in a simulation study, and apply our model to SARS-CoV-2 sequences from Los Angeles and Santa Clara counties. The methodology is implemented and available in the R package adapref. Supplementary files for this article are available online.
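The two rates named in this abstract can be written down in a short sketch. It assumes the standard Kingman coalescent, in which k lineages coalesce at total rate k(k-1)/2 divided by Ne, a constant Ne within the simulated stretch, and time measured in units matching Ne; the function names and the coefficient `beta` are hypothetical and this is not the adapref implementation.

```python
import random


def coalescent_rate(k, ne):
    """Total coalescent rate for k lineages: inversely proportional to Ne."""
    if k < 2 or ne <= 0:
        raise ValueError("need at least two lineages and a positive Ne")
    return k * (k - 1) / 2.0 / ne


def sampling_rate(beta, ne):
    """Preferential sampling intensity: a coefficient times Ne at that time."""
    return beta * ne


def simulate_waiting_times(n, ne, rng):
    """Exponential waiting times between coalescent events, from n lineages to 1.

    Illustrative assumption: Ne stays constant over the whole genealogy.
    """
    return [rng.expovariate(coalescent_rate(k, ne)) for k in range(n, 1, -1)]


rng = random.Random(2021)
waits = simulate_waiting_times(6, 100.0, rng)
print(len(waits), "coalescent intervals, tree height", sum(waits))
```

A larger Ne lowers every coalescent rate and so stretches the genealogy, while under preferential sampling it also raises the rate at which sequences are collected.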
Affiliation(s)
- Julia A. Palacios
- Department of Statistics, Stanford University, Stanford, CA
- Department of Biomedical Data Science, Stanford Medicine, Stanford, CA
8
Louca S, McLaughlin A, MacPherson A, Joy JB, Pennell MW. Fundamental Identifiability Limits in Molecular Epidemiology. Mol Biol Evol 2021; 38:4010-4024. [PMID: 34009339 PMCID: PMC8382926 DOI: 10.1093/molbev/msab149] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Indexed: 12/23/2022] Open
Abstract
Viral phylogenies provide crucial information on the spread of infectious diseases, and many studies fit mathematical models to phylogenetic data to estimate epidemiological parameters such as the effective reproduction ratio (Re) over time. Such phylodynamic inferences often complement or even substitute for conventional surveillance data, particularly when sampling is poor or delayed. It remains generally unknown, however, how robust phylodynamic epidemiological inferences are, especially when there is uncertainty regarding pathogen prevalence and sampling intensity. Here, we use recently developed mathematical techniques to fully characterize the information that can possibly be extracted from serially collected viral phylogenetic data, in the context of the commonly used birth-death-sampling model. We show that for any candidate epidemiological scenario, there exists a myriad of alternative, markedly different, and yet plausible "congruent" scenarios that cannot be distinguished using phylogenetic data alone, no matter how large the data set. In the absence of strong constraints or rate priors across the entire study period, neither maximum-likelihood fitting nor Bayesian inference can reliably reconstruct the true epidemiological dynamics from phylogenetic data alone; rather, estimators can only converge to the "congruence class" of the true dynamics. We propose concrete and feasible strategies for making more robust epidemiological inferences from viral phylogenetic data.
Affiliation(s)
- Stilianos Louca
- Department of Biology, University of Oregon, Eugene, OR, USA
- Institute of Ecology and Evolution, University of Oregon, Eugene, OR, USA
- Angela McLaughlin
- British Columbia Centre for Excellence in HIV/AIDS, Vancouver, BC, Canada
- Bioinformatics, University of British Columbia, Vancouver, BC, Canada
- Ailene MacPherson
- Biodiversity Research Centre, University of British Columbia, Vancouver, BC, Canada
- Department of Zoology, University of British Columbia, Vancouver, BC, Canada
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, ON, Canada
- Jeffrey B Joy
- British Columbia Centre for Excellence in HIV/AIDS, Vancouver, BC, Canada
- Bioinformatics, University of British Columbia, Vancouver, BC, Canada
- Department of Medicine, University of British Columbia, Vancouver, BC, Canada
- Matthew W Pennell
- Biodiversity Research Centre, University of British Columbia, Vancouver, BC, Canada
- Department of Zoology, University of British Columbia, Vancouver, BC, Canada
9
Parag KV, Pybus OG, Wu CH. Are Skyline Plot-Based Demographic Estimates Overly Dependent on Smoothing Prior Assumptions? Syst Biol 2021; 71:121-138. [PMID: 33989428 PMCID: PMC8677568 DOI: 10.1093/sysbio/syab037] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 06/30/2020] [Revised: 05/07/2021] [Accepted: 05/08/2021] [Indexed: 11/13/2022] Open
Abstract
In Bayesian phylogenetics, the coalescent process provides an informative framework for inferring changes in the effective size of a population from a phylogeny (or tree) of sequences sampled from that population. Popular coalescent inference approaches such as the Bayesian Skyline Plot, Skyride, and Skygrid all model these population size changes with a discontinuous, piecewise-constant function but then apply a smoothing prior to ensure that their posterior population size estimates transition gradually with time. These prior distributions implicitly encode extra population size information that is not available from the observed coalescent data or tree. Here, we present a novel statistic, $\Omega$, to quantify and disaggregate the relative contributions of the coalescent data and prior assumptions to the resulting posterior estimate precision. Our statistic also measures the additional mutual information introduced by such priors. Using $\Omega$ we show that, because it is surprisingly easy to overparametrize piecewise-constant population models, common smoothing priors can lead to overconfident and potentially misleading inference, even under robust experimental designs. We propose $\Omega$ as a useful tool for detecting when effective population size estimates are overly reliant on prior assumptions and for improving quantification of the uncertainty in those estimates. [Keywords: coalescent processes; effective population size; information theory; phylodynamics; prior assumptions; skyline plots.]
Affiliation(s)
- Kris V Parag
- MRC Centre for Global Infectious Disease Analysis, Imperial College London, London W2 1PG, UK
- Department of Zoology, University of Oxford, Oxford OX1 3SY, UK
- Correspondence to be sent to: MRC Centre for Global Infectious Disease Analysis, Imperial College London, London W2 1PG, UK
- Oliver G Pybus
- Department of Zoology, University of Oxford, Oxford OX1 3SY, UK
- Chieh-Hsi Wu
- Mathematical Sciences, University of Southampton, Highfield, Southampton SO17 1BJ, UK
10
Parag KV, Donnelly CA. Adaptive Estimation for Epidemic Renewal and Phylogenetic Skyline Models. Syst Biol 2020; 69:1163-1179. [PMID: 32333789 PMCID: PMC7584150 DOI: 10.1093/sysbio/syaa035] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 11/15/2019] [Revised: 04/14/2020] [Accepted: 04/16/2020] [Indexed: 11/12/2022] Open
Abstract
Estimating temporal changes in a target population from phylogenetic or count data is an important problem in ecology and epidemiology. Reliable estimates can provide key insights into the climatic and biological drivers influencing the diversity or structure of that population and evidence hypotheses concerning its future growth or decline. In infectious disease applications, the individuals infected across an epidemic form the target population. The renewal model estimates the effective reproduction number, R, of the epidemic from counts of observed incident cases. The skyline model infers the effective population size, N, underlying a phylogeny of sequences sampled from that epidemic. Practically, R measures ongoing epidemic growth while N informs on historical caseload. While both models solve distinct problems, the reliability of their estimates depends on p-dimensional piecewise-constant functions. If p is misspecified, the model might underfit significant changes or overfit noise and promote a spurious understanding of the epidemic, which might misguide intervention policies or misinform forecasts. Surprisingly, no transparent yet principled approach for optimizing p exists. Usually, p is heuristically set, or obscurely controlled via complex algorithms. We present a computable and interpretable p-selection method based on the minimum description length (MDL) formalism of information theory. Unlike many standard model selection techniques, MDL accounts for the additional statistical complexity induced by how parameters interact. As a result, our method optimizes p so that R and N estimates properly and meaningfully adapt to available data. It also outperforms comparable Akaike and Bayesian information criteria on several classification problems, given minimal knowledge of the parameter space, and exposes statistical similarities among renewal, skyline, and other models in biology. 
Rigorous and interpretable model selection is necessary if trustworthy and justifiable conclusions are to be drawn from piecewise models. [Keywords: coalescent processes; epidemiology; information theory; model selection; phylodynamics; renewal models; skyline plots.]
Affiliation(s)
- Kris V Parag
- MRC Centre for Global Infectious Disease Analysis, Imperial College London, London, W2 1PG, UK
- Christl A Donnelly
- MRC Centre for Global Infectious Disease Analysis, Imperial College London, London, W2 1PG, UK
- Department of Statistics, University of Oxford, Oxford, OX1 3LB, UK
11
Parag KV, du Plessis L, Pybus OG. Jointly Inferring the Dynamics of Population Size and Sampling Intensity from Molecular Sequences. Mol Biol Evol 2020; 37:2414-2429. [PMID: 32003829 PMCID: PMC7403618 DOI: 10.1093/molbev/msaa016] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 01/24/2023] Open
Abstract
Estimating past population dynamics from molecular sequences that have been sampled longitudinally through time is an important problem in infectious disease epidemiology, molecular ecology, and macroevolution. Popular solutions, such as the skyline and skygrid methods, infer past effective population sizes from the coalescent event times of phylogenies reconstructed from sampled sequences but assume that sequence sampling times are uninformative about population size changes. Recent work has started to question this assumption by exploring how sampling time information can aid coalescent inference. Here, we develop, investigate, and implement a new skyline method, termed the epoch sampling skyline plot (ESP), to jointly estimate the dynamics of population size and sampling rate through time. The ESP is inspired by real-world data collection practices and comprises a flexible model in which the sequence sampling rate is proportional to the population size within an epoch but can change discontinuously between epochs. We show that the ESP is accurate under several realistic sampling protocols and we prove analytically that it can at least double the best precision achievable by standard approaches. We generalize the ESP to incorporate phylogenetic uncertainty in a new Bayesian package (BESP) in BEAST2. We re-examine two well-studied empirical data sets from virus epidemiology and molecular evolution and find that the BESP improves upon previous coalescent estimators and generates new, biologically useful insights into the sampling protocols underpinning these data sets. Sequence sampling times provide a rich source of information for coalescent inference that will become increasingly important as sequence collection intensifies and becomes more formalized.
Affiliation(s)
- Kris V Parag
- Department of Zoology, University of Oxford, Oxford, United Kingdom
- Department of Infectious Disease Epidemiology, MRC Centre for Global Infectious Disease Analysis, Imperial College London, London, United Kingdom
- Louis du Plessis
- Department of Zoology, University of Oxford, Oxford, United Kingdom
- Oliver G Pybus
- Department of Zoology, University of Oxford, Oxford, United Kingdom
12
Huang J, Flouri T, Yang Z. A Simulation Study to Examine the Information Content in Phylogenomic Data Sets under the Multispecies Coalescent Model. Mol Biol Evol 2020; 37:3211-3224. [DOI: 10.1093/molbev/msaa166] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 01/05/2023] Open
Abstract
We use computer simulation to examine the information content in multilocus data sets for inference under the multispecies coalescent model. Inference problems considered include estimation of evolutionary parameters (such as species divergence times, population sizes, and cross-species introgression probabilities), species tree estimation, and species delimitation based on Bayesian comparison of delimitation models. We found that the number of loci is the most influential factor for almost all inference problems examined. Although the number of sequences per species does not appear to be important to species tree estimation, it is very influential to species delimitation. Increasing the number of sites and the per-site mutation rate both increase the mutation rate for the whole locus and these have the same effect on estimation of parameters, but the sequence length has a greater effect than the per-site mutation rate for species tree estimation. We discuss the computational costs when the data size increases and provide guidelines concerning the subsampling of genomic data to enable the application of full-likelihood methods of inference.
Affiliation(s)
- Jun Huang
- Department of Genetics, Evolution and Environment, University College London, London, United Kingdom
- Department of Mathematics, Beijing Jiaotong University, Beijing, P.R. China
- Tomáš Flouri
- Department of Genetics, Evolution and Environment, University College London, London, United Kingdom
- Ziheng Yang
- Department of Genetics, Evolution and Environment, University College London, London, United Kingdom
13
Parag KV, Donnelly CA. Using information theory to optimise epidemic models for real-time prediction and estimation. PLoS Comput Biol 2020; 16:e1007990. [PMID: 32609732 PMCID: PMC7360089 DOI: 10.1371/journal.pcbi.1007990] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 01/03/2020] [Revised: 07/14/2020] [Accepted: 05/27/2020] [Indexed: 01/31/2023] Open
Abstract
The effective reproduction number, Rt, is a key time-varying prognostic for the growth rate of any infectious disease epidemic. Significant changes in Rt can forewarn about new transmissions within a population or predict the efficacy of interventions. Inferring Rt reliably and in real-time from observed time-series of infected (demographic) data is an important problem in population dynamics. The renewal or branching process model is a popular solution that has been applied to Ebola and Zika virus disease outbreaks, among others, and is currently being used to investigate the ongoing COVID-19 pandemic. This model estimates Rt using a heuristically chosen piecewise function. While this facilitates real-time detection of statistically significant Rt changes, inference is highly sensitive to the function choice. Improperly chosen piecewise models might ignore meaningful changes or over-interpret noise-induced ones, yet produce visually reasonable estimates. No principled piecewise selection scheme exists. We develop a practical yet rigorous scheme using the accumulated prediction error (APE) metric from information theory, which deems the model capable of describing the observed data using the fewest bits as most justified. We derive exact posterior prediction distributions for infected population size and integrate these within an APE framework to obtain an exact and reliable method for identifying the piecewise function best supported by available epidemic data. We find that this choice optimises short-term prediction accuracy and can rapidly detect salient fluctuations in Rt, and hence the infected population growth rate, in real-time over the course of an unfolding epidemic. Moreover, we emphasise the need for formal selection by exposing how common heuristic choices, which seem sensible, can be misleading. Our APE-based method is easily computed and broadly applicable to statistically similar models found in phylogenetics and macroevolution, for example. 
Our results explore the relationships among estimate precision, forecast reliability and model complexity.
Author summary
Understanding how the population of infected individuals (which may be humans, animals or plants) fluctuates in size over the course of an epidemic is an important problem in epidemiology and ecology. The effective reproduction number, R, provides an intuitive and useful way of describing these fluctuations by characterising the growth rate of the infected population. An R > 1 signifies a burgeoning epidemic whereas R < 1 indicates a declining one. Public health agencies often use R to inform or corroborate vaccination and quarantine policies. However, popular approaches to inferring R from epidemic data make heuristic choices, which may lead to visually reasonable estimates that are deceptive or unreliable. By adapting mathematical tools from information theory, we develop a general and principled scheme for estimating R in a data-justified way. Our method exposes the pitfalls of heuristic estimates and provides an easily computable correction that also maximises our ability to predict upcoming population fluctuations. Our work is widely applicable to similar inference problems found in evolution and genetics, demonstrably useful for reliably analysing emerging epidemics in real time and highlights how abstract mathematical concepts can inspire novel and practical biological solutions, showcasing the importance of multidisciplinary research.
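The sensitivity to the piecewise choice mentioned in this abstract can be made concrete with a minimal sketch of a window-based renewal estimate. It assumes R is held constant over the last k time steps and is estimated as observed cases divided by total infectiousness over that window; the function names, the window length `k` and the generation-time weights are hypothetical, and the APE selection scheme from the paper is not implemented here.

```python
def total_infectiousness(incidence, gen_dist, t):
    """Lambda_t: past incidence weighted by the generation-time distribution.

    gen_dist[u - 1] is the assumed probability that a transmission
    takes u time steps.
    """
    max_lag = min(t, len(gen_dist))
    return sum(gen_dist[u - 1] * incidence[t - u] for u in range(1, max_lag + 1))


def window_r_estimate(incidence, gen_dist, t, k):
    """Estimate R at time t, assuming it is constant over the last k steps."""
    start = max(1, t - k + 1)
    cases = sum(incidence[s] for s in range(start, t + 1))
    lam = sum(total_infectiousness(incidence, gen_dist, s) for s in range(start, t + 1))
    if lam <= 0:
        raise ValueError("no past infections to renew from in this window")
    return cases / lam


# Hypothetical daily case counts and generation-time weights.
incidence = [10, 12, 15, 18, 22, 27, 33, 30, 26, 21]
gen_dist = [0.5, 0.3, 0.2]

for k in (2, 5):
    estimate = window_r_estimate(incidence, gen_dist, t=9, k=k)
    print(f"window k={k}: R estimate = {estimate:.2f}")
```

Comparing the printed values for the two window lengths shows how the same data can give different estimates of R, which is why the abstract argues for a formal rule to choose the piecewise function.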
Affiliation(s)
- Kris V. Parag
- MRC Centre for Global Infectious Disease Analysis, Imperial College London, London, W2 1PG, United Kingdom
| | - Christl A. Donnelly
- MRC Centre for Global Infectious Disease Analysis, Imperial College London, London, W2 1PG, United Kingdom
- Department of Statistics, University of Oxford, Oxford, OX1 3LB, United Kingdom
| |
|
14
|
Sellinger TPP, Abu Awad D, Moest M, Tellier A. Inference of past demography, dormancy and self-fertilization rates from whole genome sequence data. PLoS Genet 2020; 16:e1008698. [PMID: 32251472 PMCID: PMC7173940 DOI: 10.1371/journal.pgen.1008698] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2019] [Revised: 04/21/2020] [Accepted: 02/24/2020] [Indexed: 02/04/2023] Open
Abstract
Several methods based on the sequentially Markovian coalescent (SMC) have been developed that make use of genome sequence data to uncover population demographic history, which is of interest in its own right and is a key requirement to generate a null model for selection tests. While these methods can be applied to all kinds of species, the underlying assumptions are sexual reproduction in each generation and non-overlapping generations. However, in many plants, invertebrates, fungi and other taxa, those assumptions are often violated due to different ecological and life-history traits, such as self-fertilization or long-term dormant structures (seed or egg-banking). We develop a novel SMC-based method to infer 1) the rates/parameters of dormancy and of self-fertilization, and 2) the populations' past demographic history. Using simulated data sets, we demonstrate the accuracy of our method for a wide range of demographic scenarios and for sequence lengths from one to 30 Mb using four sampled genomes. Finally, we apply our method to a Swedish and a German population of Arabidopsis thaliana, demonstrating a selfing rate of ca. 0.87 and the absence of any detectable seed-bank. In contrast, we show that the water flea Daphnia pulex exhibits a long-lived egg-bank of three to 18 generations. In conclusion, we here present a novel method to infer accurate demographies and life-history traits for species with selfing and/or seed/egg-banks. Finally, we provide recommendations for the use of SMC-based methods for non-model organisms, highlighting the importance of the per site and the effective ratios of recombination over mutation.
Affiliation(s)
| | - Diala Abu Awad
- Department of Population Genetics, Technische Universitaet Muenchen, Freising, Germany
| | - Markus Moest
- Department of Ecology, University of Innsbruck, Innsbruck, Austria
| | - Aurélien Tellier
- Department of Population Genetics, Technische Universitaet Muenchen, Freising, Germany
| |
|
15
|
Wang X, Maher KH, Zhang N, Que P, Zheng C, Liu S, Wang B, Huang Q, Chen D, Yang X, Zhang Z, Székely T, Urrutia AO, Liu Y. Demographic Histories and Genome-Wide Patterns of Divergence in Incipient Species of Shorebirds. Front Genet 2019; 10:919. [PMID: 31781152 PMCID: PMC6857203 DOI: 10.3389/fgene.2019.00919] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2019] [Accepted: 08/30/2019] [Indexed: 12/30/2022] Open
Abstract
Understanding how incipient species are maintained with gene flow is a fundamental question in evolutionary biology. Whole genome sequencing of multiple individuals holds great potential to illustrate patterns of genomic differentiation as well as the associated evolutionary histories. The Kentish (Charadrius alexandrinus) and white-faced (C. dealbatus) plovers, which differ in phenotype, ecology and behavior, are two incipient species with parapatric distributions in East Asia. Previous studies show evidence of genetic diversification with gene flow between the two plovers. Under this scenario, it is of great importance to explore the patterns of divergence at the genomic level and to determine whether specific regions are involved in reproductive isolation and local adaptation. Here we present the first population genomic analysis of the two incipient species based on the de novo Kentish plover reference genome and resequenced populations. We show that the two plover lineages are distinct in both nuclear and mitochondrial genomes. Using model-based coalescence analysis, we found that population sizes of the Kentish plover increased whereas those of the white-faced plover declined during the Last Glaciation Period. Moreover, the two plovers diverged allopatrically, with gene flow occurring after secondary contact. This has resulted in low levels of genome-wide differentiation, although we found evidence of a few highly differentiated genomic regions in both the autosomes and the Z-chromosome. This study illustrates that incipient shorebird species with gene flow after secondary contact can exhibit discrete divergence at specific genomic regions, and it provides a basis for further exploration of the genetic basis of relevant phenotypic traits.
Affiliation(s)
- Xuejing Wang
- State Key Laboratory of Biocontrol, Department of Ecology, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
| | - Kathryn H. Maher
- Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath, United Kingdom
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield, United Kingdom
| | - Nan Zhang
- State Key Laboratory of Biocontrol, Department of Ecology, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
| | - Pinjia Que
- Ministry of Education Key Laboratory for Biodiversity and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China
| | - Chenqing Zheng
- State Key Laboratory of Biocontrol, Department of Ecology, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
- Department of Bioinformatics, Shenzhen Realomics Biological Technology Ltd, Shenzhen, China
| | - Simin Liu
- State Key Laboratory of Biocontrol, Department of Ecology, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
| | - Biao Wang
- School of Biosciences, University of Melbourne, Parkville, VIC, Australia
| | - Qin Huang
- State Key Laboratory of Biocontrol, Department of Ecology, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
| | - De Chen
- Ministry of Education Key Laboratory for Biodiversity and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China
| | - Xu Yang
- Department of Bioinformatics, Shenzhen Realomics Biological Technology Ltd, Shenzhen, China
| | - Zhengwang Zhang
- Ministry of Education Key Laboratory for Biodiversity and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China
| | - Tamás Székely
- State Key Laboratory of Biocontrol, Department of Ecology, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
- Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath, United Kingdom
- Ministry of Education Key Laboratory for Biodiversity and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China
| | - Araxi O. Urrutia
- Milner Centre for Evolution, Department of Biology and Biochemistry, University of Bath, Bath, United Kingdom
- Instituto de Ecología, Universidad Nacional Autónoma de México, Ciudad de México, Mexico
| | - Yang Liu
- State Key Laboratory of Biocontrol, Department of Ecology, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
| |
|