1
Vaughn AH, Nielsen R. Fast and Accurate Estimation of Selection Coefficients and Allele Histories from Ancient and Modern DNA. Mol Biol Evol 2024; 41:msae156. [PMID: 39078618 PMCID: PMC11321360 DOI: 10.1093/molbev/msae156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2023] [Revised: 07/02/2024] [Accepted: 07/10/2024] [Indexed: 07/31/2024] Open
Abstract
We here present CLUES2, a full-likelihood method to infer natural selection from sequence data that is an extension of the method CLUES. We make several substantial improvements to the CLUES method that greatly increase both its applicability and its speed. We add the ability to use ancestral recombination graphs on ancient data as emissions to the underlying hidden Markov model, which enables CLUES2 to use both temporal and linkage information to make estimates of selection coefficients. We also fully implement the ability to estimate distinct selection coefficients in different epochs, which allows for the analysis of changes in selective pressures through time, as well as selection with dominance. In addition, we greatly increase the computational efficiency of CLUES2 over CLUES using several approximations to the forward-backward algorithms and develop a new way to reconstruct historic allele frequencies by integrating over the uncertainty in the estimation of the selection coefficients. We illustrate the accuracy of CLUES2 through extensive simulations and validate the importance sampling framework for integrating over the uncertainty in the inference of gene trees. We also show that CLUES2 is well-calibrated by showing that under the null hypothesis, the distribution of log-likelihood ratios follows a χ² distribution with the appropriate degrees of freedom. We run CLUES2 on a set of recently published ancient human data from Western Eurasia and test for evidence of changing selection coefficients through time. We find significant evidence of changing selective pressures in several genes correlated with the introduction of agriculture to Europe and the ensuing dietary and demographic shifts of that time. In particular, our analysis supports previous hypotheses of strong selection on lactase persistence during periods of ancient famines and attenuated selection in more modern periods.
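The calibration statement in this abstract (log-likelihood ratios following a χ² distribution under the null) can be illustrated with a generic likelihood-ratio test. The sketch below is not the CLUES2 implementation; the function name `lrt_p_value` and the log-likelihood values are hypothetical, and only the closed-form χ² tail probabilities for 1 and 2 degrees of freedom are covered.

```python
import math

def lrt_p_value(loglik_null, loglik_alt, df):
    """P-value of a likelihood-ratio test against a chi-square reference.

    Hypothetical helper for illustration; supports df = 1 or df = 2 only,
    where the chi-square survival function has a simple closed form.
    """
    # Test statistic: twice the log-likelihood gain of the alternative model.
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    if df == 1:
        # P(chi2_1 > stat) = erfc(sqrt(stat / 2))
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # P(chi2_2 > stat) = exp(-stat / 2)
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")

# Example with made-up log-likelihoods: one extra selection coefficient (df = 1).
p = lrt_p_value(loglik_null=-1250.0, loglik_alt=-1245.0, df=1)
print(f"p-value: {p:.4g}")
```

With one selection coefficient tested against neutrality the reference distribution would have one degree of freedom; a model with a separate coefficient in each of two epochs would use two.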
Affiliation(s)
- Andrew H Vaughn
  - Center for Computational Biology, University of California, Berkeley, CA 94720, USA
- Rasmus Nielsen
  - Departments of Integrative Biology and Statistics, University of California, Berkeley, CA 94720, USA
  - Center for GeoGenetics, University of Copenhagen, Copenhagen DK-1350, Denmark
2
Anderson NW, Kirk L, Schraiber JG, Ragsdale AP. A Path Integral Approach for Allele Frequency Dynamics Under Polygenic Selection. bioRxiv 2024:2024.06.14.599114. [PMID: 38915613 PMCID: PMC11195211 DOI: 10.1101/2024.06.14.599114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/26/2024]
Abstract
Many phenotypic traits have a polygenic genetic basis, making it challenging to learn their genetic architectures and predict individual phenotypes. One promising avenue to resolve the genetic basis of complex traits is through evolve-and-resequence experiments, in which laboratory populations are exposed to some selective pressure and trait-contributing loci are identified by extreme frequency changes over the course of the experiment. However, small laboratory populations will experience substantial random genetic drift, and it is difficult to determine whether selection played a role in a given allele frequency change. Predicting how much allele frequencies change under drift and selection had remained an open problem well into the 21st century, even for those contributing to simple, monogenic traits. Recently, there have been efforts to apply the path integral, a method borrowed from physics, to solve this problem. So far, this approach has been limited to genic selection, and is therefore inadequate to capture the complexity of quantitative, highly polygenic traits that are commonly studied. Here we extend one of these path integral methods, the perturbation approximation, to selection scenarios that are of interest to quantitative genetics. In particular, we derive analytic expressions for the transition probability (i.e., the probability that an allele will change in frequency from x to y in time t) of an allele contributing to a trait subject to stabilizing selection, as well as that of an allele contributing to a trait rapidly adapting to a new phenotypic optimum. We use these expressions to characterize the use of allele frequency change to test for selection, as well as explore optimal design choices for evolve-and-resequence experiments to uncover the genetic architecture of polygenic traits under selection.
Affiliation(s)
- Nathan W. Anderson
  - Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Lloyd Kirk
  - Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Joshua G. Schraiber
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, 90089, USA
- Aaron P. Ragsdale
  - Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI, 53706, USA
3
Yu Q, Ascensao JA, Okada T, Boyd O, Volz E, Hallatschek O. Lineage frequency time series reveal elevated levels of genetic drift in SARS-CoV-2 transmission in England. PLoS Pathog 2024; 20:e1012090. [PMID: 38620033 PMCID: PMC11045146 DOI: 10.1371/journal.ppat.1012090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Revised: 04/25/2024] [Accepted: 03/03/2024] [Indexed: 04/17/2024] Open
Abstract
Genetic drift in infectious disease transmission results from randomness of transmission and host recovery or death. The strength of genetic drift for SARS-CoV-2 transmission is expected to be high due to high levels of superspreading, and this is expected to substantially impact disease epidemiology and evolution. However, we don't yet have an understanding of how genetic drift changes over time or across locations. Furthermore, noise that results from data collection can potentially confound estimates of genetic drift. To address this challenge, we develop and validate a method to jointly infer genetic drift and measurement noise from time-series lineage frequency data. Our method is highly scalable to increasingly large genomic datasets, which overcomes a limitation in commonly used phylogenetic methods. We apply this method to over 490,000 SARS-CoV-2 genomic sequences from England collected between March 2020 and December 2021 by the COVID-19 Genomics UK (COG-UK) consortium and separately infer the strength of genetic drift for pre-B.1.177, B.1.177, Alpha, and Delta. We find that even after correcting for measurement noise, the strength of genetic drift is consistently, throughout time, higher than that expected from the observed number of COVID-19 positive individuals in England by 1 to 3 orders of magnitude, which cannot be explained by literature values of superspreading. Our estimates of genetic drift suggest low and time-varying establishment probabilities for new mutations, inform the parametrization of SARS-CoV-2 evolutionary models, and motivate future studies of the potential mechanisms for increased stochasticity in this system.
Affiliation(s)
- QinQin Yu
  - Department of Physics, University of California, Berkeley, California, United States of America
- Joao A. Ascensao
  - Department of Bioengineering, University of California, Berkeley, California, United States of America
- Takashi Okada
  - Department of Physics, University of California, Berkeley, California, United States of America
  - Department of Integrative Biology, University of California, Berkeley, California, United States of America
  - Institute for Life and Medical Sciences, Kyoto University, Kyoto, Japan
  - RIKEN iTHEMS, Wako, Saitama, Japan
- Olivia Boyd
  - MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, Imperial College London, London, United Kingdom
- Erik Volz
  - MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, Imperial College London, London, United Kingdom
- Oskar Hallatschek
  - Department of Physics, University of California, Berkeley, California, United States of America
  - Department of Integrative Biology, University of California, Berkeley, California, United States of America
  - Peter Debye Institute for Soft Matter Physics, Leipzig University, Leipzig, Germany
4
Spence JP, Zeng T, Mostafavi H, Pritchard JK. Scaling the discrete-time Wright-Fisher model to biobank-scale datasets. Genetics 2023; 225:iyad168. [PMID: 37724741 PMCID: PMC10627256 DOI: 10.1093/genetics/iyad168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 06/01/2023] [Accepted: 09/08/2023] [Indexed: 09/21/2023] Open
Abstract
The discrete-time Wright-Fisher (DTWF) model and its diffusion limit are central to population genetics. These models can describe the forward-in-time evolution of allele frequencies in a population resulting from genetic drift, mutation, and selection. Computing likelihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large samples or in the presence of strong selection. Existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here, we present a scalable algorithm that approximates the DTWF model with provably bounded error. Our approach relies on two key observations about the DTWF model. The first is that transition probabilities under the model are approximately sparse. The second is that transition distributions for similar starting allele frequencies are extremely close as distributions. Together, these observations enable approximate matrix-vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for Hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the tens of millions, paving the way for rigorous biobank-scale inference. Finally, we use our results to estimate the impact of larger samples on estimating selection coefficients for loss-of-function variants. We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects.
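The first observation in this abstract, that DTWF transition probabilities are approximately sparse, can be seen directly in a small exact computation. The sketch below is not the authors' algorithm: it builds one exact row of the transition matrix by brute force, assuming a haploid population of `n` copies and a simple genic selection update. The names `dtwf_transition_row` and `effective_support` and the tolerance are hypothetical choices for illustration.

```python
from math import comb

def dtwf_transition_row(i, n, s=0.0):
    """Exact DTWF transition probabilities from i copies to j = 0..n copies.

    Assumes n haploid gene copies and genic selection with coefficient s
    (an illustrative parametrization, not taken from the paper).
    """
    x = i / n
    # Expected allele frequency after selection, before binomial resampling.
    p = x * (1.0 + s) / (1.0 + s * x)
    return [comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(n + 1)]

def effective_support(row, tol=1e-8):
    """Number of entries in a row that exceed a small tolerance."""
    return sum(1 for v in row if v > tol)

n = 200
row = dtwf_transition_row(10, n)
print("row sum:", sum(row))
print("entries above 1e-8:", effective_support(row), "of", n + 1)
```

Most of the probability mass sits within a few binomial standard deviations of the starting count, so only a small fraction of each row is numerically relevant. This brute-force construction is only practical for small `n`; it is meant to show the structure, not to scale.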
Affiliation(s)
- Jeffrey P Spence
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Tony Zeng
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Jonathan K Pritchard
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
  - Department of Biology, Stanford University, Stanford, CA 94305, USA
5
Whitehouse LS, Schrider DR. Timesweeper: accurately identifying selective sweeps using population genomic time series. Genetics 2023; 224:iyad084. [PMID: 37157914 PMCID: PMC10324941 DOI: 10.1093/genetics/iyad084] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 07/25/2022] [Revised: 07/25/2022] [Accepted: 04/25/2023] [Indexed: 05/10/2023] Open
Abstract
Despite decades of research, identifying selective sweeps, the genomic footprints of positive selection, remains a core problem in population genetics. Of the myriad methods that have been developed to tackle this task, few are designed to leverage the potential of genomic time-series data. This is because in most population genetic studies of natural populations, only a single period of time can be sampled. Recent advancements in sequencing technology, including improvements in extracting and sequencing ancient DNA, have made repeated samplings of a population possible, allowing for more direct analysis of recent evolutionary dynamics. Serial sampling of organisms with shorter generation times has also become more feasible due to improvements in the cost and throughput of sequencing. With these advances in mind, here we present Timesweeper, a fast and accurate convolutional neural network-based tool for identifying selective sweeps in data consisting of multiple genomic samplings of a population over time. Timesweeper analyzes population genomic time-series data by first simulating training data under a demographic model appropriate for the data of interest, training a one-dimensional convolutional neural network on said simulations, and inferring which polymorphisms in this serialized data set were the direct target of a completed or ongoing selective sweep. We show that Timesweeper is accurate under multiple simulated demographic and sampling scenarios, identifies selected variants with high resolution, and estimates selection coefficients more accurately than existing methods. 
In sum, we show that more accurate inferences about natural selection are possible when genomic time-series data are available; such data will continue to proliferate in coming years due to both the sequencing of ancient samples and repeated samplings of extant populations with faster generation times, as well as experimentally evolved populations where time-series data are often generated. Methodological advances such as Timesweeper thus have the potential to help resolve the controversy over the role of positive selection in the genome. We provide Timesweeper as a Python package for use by the community.
Affiliation(s)
- Logan S Whitehouse
  - Department of Genetics, University of North Carolina, Chapel Hill, NC 27514, USA
- Daniel R Schrider
  - Department of Genetics, University of North Carolina, Chapel Hill, NC 27514, USA
6
Barata C, Borges R, Kosiol C. Bait-ER: A Bayesian method to detect targets of selection in Evolve-and-Resequence experiments. J Evol Biol 2023; 36:29-44. [PMID: 36544394 PMCID: PMC10108205 DOI: 10.1111/jeb.14134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2022] [Revised: 11/09/2022] [Accepted: 11/11/2022] [Indexed: 12/24/2022]
Abstract
For over a decade, experimental evolution has been combined with high-throughput sequencing techniques. In so-called Evolve-and-Resequence (E&R) experiments, populations are kept in the laboratory under controlled experimental conditions where their genomes are sampled and allele frequencies monitored. However, identifying signatures of adaptation in E&R datasets is far from trivial, and it is still necessary to develop more efficient and statistically sound methods for detecting selection in genome-wide data. Here, we present Bait-ER - a fully Bayesian approach based on the Moran model of allele evolution to estimate selection coefficients from E&R experiments. The model has overlapping generations, a feature that describes several experimental designs found in the literature. We tested our method under several different demographic and experimental conditions to assess its accuracy and precision, and it performs well in most scenarios. Nevertheless, some care must be taken when analysing trajectories where drift largely dominates and starting frequencies are low. We compare our method with other available software and report that ours has generally high accuracy even for trajectories whose complexity goes beyond a classical sweep model. Furthermore, our approach avoids the computational burden of simulating an empirical null distribution, outperforming available software in terms of computational time and facilitating its use on genome-wide data. We implemented and released our method in a new open-source software package that can be accessed at https://doi.org/10.5281/zenodo.7351736.
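As a rough illustration of the Moran model with selection that this abstract refers to, the sketch below computes the one-step transition probabilities of a textbook Moran process: one individual is chosen to reproduce with probability proportional to fitness, and one is chosen uniformly to die. This is an assumed, generic parametrization and not necessarily the one used in Bait-ER; the function name `moran_step_probs` is hypothetical.

```python
def moran_step_probs(i, n, s):
    """One-step probabilities (down, stay, up) for i copies of the selected allele.

    Textbook Moran process with relative fitness 1 + s for the focal allele;
    an illustrative assumption, not the Bait-ER implementation.
    """
    if not 0 <= i <= n:
        raise ValueError("i must lie between 0 and n")
    total_fitness = i * (1.0 + s) + (n - i)
    # Focal allele reproduces and a non-focal individual dies.
    p_up = (i * (1.0 + s) / total_fitness) * ((n - i) / n)
    # Non-focal allele reproduces and a focal individual dies.
    p_down = ((n - i) / total_fitness) * (i / n)
    return p_down, 1.0 - p_up - p_down, p_up

down, stay, up = moran_step_probs(10, 100, 0.05)
print(f"down={down:.5f} stay={stay:.5f} up={up:.5f} up/down={up / down:.3f}")
```

Under this parametrization the ratio of upward to downward step probabilities equals 1 + s at every interior frequency, which is one way to see how selection biases an otherwise symmetric drift process.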
Affiliation(s)
- Carolina Barata
  - Centre for Biological Diversity, University of St Andrews, St Andrews, UK
- Rui Borges
  - Institute of Population Genetics, Wien, Austria
- Carolin Kosiol
  - Centre for Biological Diversity, University of St Andrews, St Andrews, UK
  - Institute of Population Genetics, Wien, Austria
7
Sohail MS, Louie RHY, Hong Z, Barton JP, McKay MR. Inferring Epistasis from Genetic Time-series Data. Mol Biol Evol 2022; 39:6710201. [PMID: 36130322 PMCID: PMC9558069 DOI: 10.1093/molbev/msac199] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 02/01/2023] Open
Abstract
Epistasis refers to fitness or functional effects of mutations that depend on the sequence background in which these mutations arise. Epistasis is prevalent in nature, including populations of viruses, bacteria, and cancers, and can contribute to the evolution of drug resistance and immune escape. However, it is difficult to directly estimate epistatic effects from sampled observations of a population. At present, there are very few methods that can disentangle the effects of selection (including epistasis), mutation, recombination, genetic drift, and genetic linkage in evolving populations. Here we develop a method to infer epistasis, along with the fitness effects of individual mutations, from observed evolutionary histories. Simulations show that we can accurately infer pairwise epistatic interactions provided that there is sufficient genetic diversity in the data. Our method also allows us to identify which fitness parameters can be reliably inferred from a particular data set and which ones are unidentifiable. Our approach therefore allows for the inference of more complex models of selection from time-series genetic data, while also quantifying uncertainty in the inferred parameters.
Affiliation(s)
- Muhammad Saqib Sohail
  - Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong SAR, People’s Republic of China
- Raymond H Y Louie
  - The Kirby Institute, University of New South Wales, Sydney, New South Wales, Australia
- Zhenchen Hong
  - Department of Physics and Astronomy, University of California, Riverside, CA, USA
8
Friedlander E, Steinrücken M. A numerical framework for genetic hitchhiking in populations of variable size. Genetics 2022; 220:6526396. [PMID: 35143667 PMCID: PMC8893261 DOI: 10.1093/genetics/iyac012] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 03/31/2021] [Accepted: 12/27/2021] [Indexed: 11/13/2022] Open
Abstract
Natural selection on beneficial or deleterious alleles results in an increase or decrease, respectively, of their frequency within the population. Due to chromosomal linkage, the dynamics of the selected site affect the genetic variation at nearby neutral loci in a process commonly referred to as genetic hitchhiking. Changes in population size, however, can yield patterns in genomic data that mimic the effects of selection. Accurately modeling these dynamics is thus crucial to understanding how selection and past population size changes impact observed patterns of genetic variation. Here, we model the evolution of haplotype frequencies with the Wright-Fisher diffusion to study the impact of selection on linked neutral variation. Explicit solutions are not known for the dynamics of this diffusion when selection and recombination act simultaneously. Thus, we present a method for numerically evaluating the Wright-Fisher diffusion dynamics of 2 linked loci separated by a certain recombination distance when selection is acting. We can account for arbitrary population size histories explicitly using this approach. A key step in the method is to express the moments of the associated transition density, or sampling probabilities, as solutions to ordinary differential equations. Numerically solving these differential equations relies on a novel accurate and numerically efficient technique to estimate higher order moments from lower order moments. We demonstrate how this numerical framework can be used to quantify the reduction and recovery of genetic diversity around a selected locus over time and elucidate distortions in the site-frequency-spectra of neutral variation linked to loci under selection in various demographic settings. The method can be readily extended to more general modes of selection and applied in likelihood frameworks to detect loci under selection and infer the strength of the selective pressure.
Affiliation(s)
- Eric Friedlander
  - Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637, USA
  - Department of Mathematics, Saint Norbert College, Green Bay, WI 54115, USA
- Matthias Steinrücken
  - Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637, USA
  - Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
  - Corresponding author: Department of Ecology & Evolution, The University of Chicago, 1101 E. 57th Street, Chicago, IL 60637, USA.
9
Mathieson I, Terhorst J. Direct detection of natural selection in Bronze Age Britain. Genome Res 2022; 32:2057-2067. [PMID: 36316157 PMCID: PMC9808619 DOI: 10.1101/gr.276862.122] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 04/24/2022] [Accepted: 08/29/2022] [Indexed: 11/04/2022]
Abstract
We developed a novel method for efficiently estimating time-varying selection coefficients from genome-wide ancient DNA data. In simulations, our method accurately recovers selective trajectories and is robust to misspecification of population size. We applied it to a large data set of ancient and present-day human genomes from Britain and identified seven loci with genome-wide significant evidence of selection in the past 4500 yr. Almost all of them can be related to increased vitamin D or calcium levels, suggesting strong selective pressure on these or related phenotypes. However, the strength of selection on individual loci varied substantially over time, suggesting that cultural or environmental factors moderated the genetic response. Of 28 complex anthropometric and metabolic traits, skin pigmentation was the only one with significant evidence of polygenic selection, further underscoring the importance of phenotypes related to vitamin D. Our approach illustrates the power of ancient DNA to characterize selection in human populations and illuminates the recent evolutionary history of Britain.
Affiliation(s)
- Iain Mathieson
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
- Jonathan Terhorst
  - Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, USA
10
Exact simulation of coupled Wright–Fisher diffusions. Adv Appl Probab 2021. [DOI: 10.1017/apr.2021.9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Abstract
In this paper an exact rejection algorithm for simulating paths of the coupled Wright–Fisher diffusion is introduced. The coupled Wright–Fisher diffusion is a family of multivariate Wright–Fisher diffusions that have drifts depending on each other through a coupling term and that find applications in the study of networks of interacting genes. The proposed rejection algorithm uses independent neutral Wright–Fisher diffusions as candidate proposals, which are only needed at a finite number of points. Once a candidate is accepted, the remainder of the path can be recovered by sampling from neutral multivariate Wright–Fisher bridges, for which an exact sampling strategy is also provided. Finally, the algorithm’s complexity is derived and its performance demonstrated in a simulation study.
11
Lyu W, Dai X, Beaumont M, Yu F, He Z. Inferring the timing and strength of natural selection and gene migration in the evolution of chicken from ancient DNA data. Mol Ecol Resour 2021; 22:1362-1379. [PMID: 34783162 DOI: 10.1111/1755-0998.13553] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/22/2021] [Revised: 09/10/2021] [Accepted: 09/28/2021] [Indexed: 11/29/2022]
Abstract
With the rapid growth of the number of sequenced ancient genomes, there has been increasing interest in using this new information to study past and present adaptation. Such an additional temporal component has the promise of providing improved power for the estimation of natural selection. Over the last decade, statistical approaches for detection and quantification of natural selection from ancient DNA (aDNA) data have been developed. However, most of the existing methods do not allow us to estimate the timing of natural selection along with its strength, which is key to understanding the evolution and persistence of organismal diversity. Additionally, most methods ignore the fact that natural populations are almost always structured, which can result in overestimation of the effect of natural selection. To address these issues, we introduce a novel Bayesian framework for the inference of natural selection and gene migration from aDNA data with Markov chain Monte Carlo techniques, co-estimating both timing and strength of natural selection and gene migration. Such an advance enables us to infer drivers of natural selection and gene migration by correlating genetic evolution with potential causes such as the changes in the ecological context in which an organism has evolved. The performance of our procedure is evaluated through extensive simulations, with its utility shown with an application to ancient chicken samples.
Affiliation(s)
- Wenyang Lyu
  - School of Mathematics, University of Bristol, Bristol, BS8 1UG, United Kingdom
- Xiaoyang Dai
  - School of Biological Sciences, University of Bristol, Bristol, BS8 1TQ, United Kingdom
  - The Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, E1 2AT, United Kingdom
- Mark Beaumont
  - School of Biological Sciences, University of Bristol, Bristol, BS8 1TQ, United Kingdom
- Feng Yu
  - School of Mathematics, University of Bristol, Bristol, BS8 1UG, United Kingdom
- Zhangyi He
  - MRC Toxicology Unit, University of Cambridge, Cambridge, CB2 1QR, United Kingdom
  - Cancer Research UK Beatson Institute, Glasgow, G61 1BD, United Kingdom
12
Croze M, Kim Y. Inference of population genetic parameters from an irregular time series of seasonal influenza virus sequences. Genetics 2021; 217:6066165. [PMID: 33724414 DOI: 10.1093/genetics/iyaa039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/22/2020] [Accepted: 12/17/2020] [Indexed: 11/12/2022] Open
Abstract
Basic summary statistics that quantify the population genetic structure of influenza virus are important for understanding and inferring the evolutionary and epidemiological processes. However, the sampling dates of global virus sequences in the last several decades are scattered nonuniformly throughout the calendar. Such temporal structure of samples and the small effective size of the viral population hamper the use of conventional methods to calculate summary statistics. Here, we define statistics that overcome this problem by correcting for the sampling-time difference in quantifying a pairwise sequence difference. A simple linear regression method jointly estimates the mutation rate and the level of sequence polymorphism, thus providing an estimate of the effective population size. It also leads to the definition of Wright's FST for arbitrary time-series data. Furthermore, as an alternative to Tajima's D statistic or the site-frequency spectrum, a mismatch distribution corrected for sampling-time differences can be obtained and compared between actual and simulated data. Application of these methods to seasonal influenza A/H3N2 viruses sampled between 1980 and 2017 and sequences simulated under the model of recurrent positive selection with metapopulation dynamics allowed us to estimate the synonymous mutation rate and find parameter values for selection and demographic structure that fit the observation. We found that the mutation rates of HA and PB1 segments before 2007 were particularly high and that including recurrent positive selection in our model was essential for the genealogical structure of the HA segment. Methods developed here can be generally applied to population genetic inferences using serially sampled genetic data.
Affiliation(s)
- Myriam Croze
  - Division of EcoScience, Ewha Womans University, Seoul 03760, Korea
- Yuseob Kim
  - Division of EcoScience, Ewha Womans University, Seoul 03760, Korea
  - Department of Life Science, Ewha Womans University, Seoul 03760, Korea
13
|
Roodgar M, Good BH, Garud NR, Martis S, Avula M, Zhou W, Lancaster SM, Lee H, Babveyh A, Nesamoney S, Pollard KS, Snyder MP. Longitudinal linked-read sequencing reveals ecological and evolutionary responses of a human gut microbiome during antibiotic treatment. Genome Res 2021; 31:1433-1446. [PMID: 34301627 PMCID: PMC8327913 DOI: 10.1101/gr.265058.120] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Received: 04/24/2020] [Accepted: 06/25/2021] [Indexed: 01/01/2023]
Abstract
Gut microbial communities can respond to antibiotic perturbations by rapidly altering their taxonomic and functional composition. However, little is known about the strain-level processes that drive this collective response. Here, we characterize the gut microbiome of a single individual at high temporal and genetic resolution through a period of health, disease, antibiotic treatment, and recovery. We used deep, linked-read metagenomic sequencing to track the longitudinal trajectories of thousands of single nucleotide variants within 36 species, which allowed us to contrast these genetic dynamics with the ecological fluctuations at the species level. We found that antibiotics can drive rapid shifts in the genetic composition of individual species, often involving incomplete genome-wide sweeps of pre-existing variants. These genetic changes were frequently observed in species without obvious changes in species abundance, emphasizing the importance of monitoring diversity below the species level. We also found that many sweeping variants quickly reverted to their baseline levels once antibiotic treatment had concluded, demonstrating that the ecological resilience of the microbiota can sometimes extend all the way down to the genetic level. Our results provide new insights into the population genetic forces that shape individual microbiomes on therapeutically relevant timescales, with potential implications for personalized health and disease.
Affiliation(s)
- Morteza Roodgar
  - Department of Genetics, Stanford University, Stanford, California 94305, USA
  - Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Stanford, California 94305, USA
- Benjamin H Good
  - Department of Applied Physics, Stanford University, Stanford, California 94305, USA
- Nandita R Garud
  - Department of Ecology and Evolutionary Biology, University of California Los Angeles, Los Angeles, California 90095, USA
- Stephen Martis
  - Department of Physics, University of California, Berkeley, California 94720, USA
- Mohan Avula
  - Department of Genetics, Stanford University, Stanford, California 94305, USA
- Wenyu Zhou
  - Department of Genetics, Stanford University, Stanford, California 94305, USA
- Samuel M Lancaster
  - Department of Genetics, Stanford University, Stanford, California 94305, USA
- Hayan Lee
  - Department of Genetics, Stanford University, Stanford, California 94305, USA
- Afshin Babveyh
  - Department of Genetics, Stanford University, Stanford, California 94305, USA
- Sophia Nesamoney
  - Department of Genetics, Stanford University, Stanford, California 94305, USA
- Katherine S Pollard
  - Gladstone Institutes, San Francisco, California 94158, USA
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, California 94158, USA
  - Chan Zuckerberg Biohub, San Francisco, California 94158, USA
- Michael P Snyder
  - Department of Genetics, Stanford University, Stanford, California 94305, USA
|
14
|
He Z, Dai X, Beaumont M, Yu F. Detecting and Quantifying Natural Selection at Two Linked Loci from Time Series Data of Allele Frequencies with Forward-in-Time Simulations. Genetics 2020; 216:521-541. [PMID: 32826299 PMCID: PMC7536848 DOI: 10.1534/genetics.120.303463] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 11/13/2019] [Accepted: 08/15/2020] [Indexed: 12/16/2022]
Abstract
Recent advances in DNA sequencing techniques have made it possible to monitor genomes in great detail over time. This improvement provides an opportunity for us to study natural selection based on time serial samples of genomes while accounting for genetic recombination effect and local linkage information. Such time series genomic data allow for more accurate estimation of population genetic parameters and hypothesis testing on the recent action of natural selection. In this work, we develop a novel Bayesian statistical framework for inferring natural selection at a pair of linked loci by capitalising on the temporal aspect of DNA data with the additional flexibility of modeling the sampled chromosomes that contain unknown alleles. Our approach is built on a hidden Markov model where the underlying process is a two-locus Wright-Fisher diffusion with selection, which enables us to explicitly model genetic recombination and local linkage. The posterior probability distribution for selection coefficients is computed by applying the particle marginal Metropolis-Hastings algorithm, which allows us to efficiently calculate the likelihood. We evaluate the performance of our Bayesian inference procedure through extensive simulations, showing that our approach can deliver accurate estimates of selection coefficients, and the addition of genetic recombination and local linkage brings about significant improvement in the inference of natural selection. We also illustrate the utility of our method on real data with an application to ancient DNA data associated with white spotting patterns in horses.
Affiliation(s)
- Zhangyi He
  - School of Mathematics, University of Bristol, BS8 1UG, United Kingdom
- Xiaoyang Dai
  - School of Biological Sciences, University of Bristol, BS8 1TQ, United Kingdom
- Mark Beaumont
  - School of Biological Sciences, University of Bristol, BS8 1TQ, United Kingdom
- Feng Yu
  - School of Mathematics, University of Bristol, BS8 1UG, United Kingdom
|
15
|
He Z, Dai X, Beaumont M, Yu F. Estimation of Natural Selection and Allele Age from Time Series Allele Frequency Data Using a Novel Likelihood-Based Approach. Genetics 2020; 216:463-480. [PMID: 32769100 PMCID: PMC7536852 DOI: 10.1534/genetics.120.303400] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 11/12/2019] [Accepted: 07/29/2020] [Indexed: 11/18/2022]
Abstract
Temporally spaced genetic data allow for more accurate inference of population genetic parameters and hypothesis testing on the recent action of natural selection. In this work, we develop a novel likelihood-based method for jointly estimating selection coefficient and allele age from time series data of allele frequencies. Our approach is based on a hidden Markov model where the underlying process is a Wright-Fisher diffusion conditioned to survive until the time of the most recent sample. This formulation circumvents the assumption required in existing methods that the allele is created by mutation at a certain low frequency. We calculate the likelihood by numerically solving the resulting Kolmogorov backward equation backward in time while reweighting the solution with the emission probabilities of the observation at each sampling time point. This procedure reduces the two-dimensional numerical search for the maximum of the likelihood surface, for both the selection coefficient and the allele age, to a one-dimensional search over the selection coefficient only. We illustrate through extensive simulations that our method can produce accurate estimates of the selection coefficient and the allele age under both constant and nonconstant demographic histories. We apply our approach to reanalyze ancient DNA data associated with horse base coat colors. We find that ignoring demographic histories or grouping raw samples can significantly bias the inference results.
Affiliation(s)
- Zhangyi He
  - Department of Statistics, University of Oxford, OX1 3LB, United Kingdom
- Xiaoyang Dai
  - School of Biological Sciences, University of Bristol, BS8 1TQ, United Kingdom
- Mark Beaumont
  - School of Biological Sciences, University of Bristol, BS8 1TQ, United Kingdom
- Feng Yu
  - School of Mathematics, University of Bristol, BS8 1UG, United Kingdom
|
16
|
Stoltz M, Baeumer B, Bouckaert R, Fox C, Hiscott G, Bryant D. Bayesian Inference of Species Trees using Diffusion Models. Syst Biol 2020; 70:145-161. [PMID: 33005955 DOI: 10.1093/sysbio/syaa051] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 09/15/2019] [Revised: 06/19/2020] [Accepted: 06/23/2020] [Indexed: 11/13/2022]
Abstract
We describe a new and computationally efficient Bayesian methodology for inferring species trees and demographics from unlinked binary markers. Likelihood calculations are carried out using diffusion models of allele frequency dynamics combined with novel numerical algorithms. The diffusion approach allows for analysis of data sets containing hundreds or thousands of individuals. The method, which we call Snapper, has been implemented as part of the BEAST2 package. We conducted simulation experiments to assess numerical error, computational requirements, and accuracy recovering known model parameters. A reanalysis of soybean SNP data demonstrates that the models implemented in Snapp and Snapper can be difficult to distinguish in practice, a characteristic which we tested with further simulations. We demonstrate the scale of analysis possible using a SNP data set sampled from 399 fresh water turtles in 41 populations. [Bayesian inference; diffusion models; multi-species coalescent; SNP data; species trees; spectral methods.].
Affiliation(s)
- Marnus Stoltz
  - Department of Mathematics and Statistics, University of Otago, Dunedin 9054, New Zealand
- Boris Baeumer
  - Department of Mathematics and Statistics, University of Otago, Dunedin 9054, New Zealand
- Remco Bouckaert
  - Centre for Computational Evolution, University of Auckland, Auckland 1142, New Zealand
- Colin Fox
  - Department of Physics, University of Otago, Dunedin 9054, New Zealand
- Gordon Hiscott
  - Department of Mathematics and Statistics, University of Otago, Dunedin 9054, New Zealand
- David Bryant
  - Department of Mathematics and Statistics, University of Otago, Dunedin 9054, New Zealand
|
17
|
Dehasque M, Ávila‐Arcos MC, Díez‐del‐Molino D, Fumagalli M, Guschanski K, Lorenzen ED, Malaspinas A, Marques‐Bonet T, Martin MD, Murray GGR, Papadopulos AST, Therkildsen NO, Wegmann D, Dalén L, Foote AD. Inference of natural selection from ancient DNA. Evol Lett 2020; 4:94-108. [PMID: 32313686 PMCID: PMC7156104 DOI: 10.1002/evl3.165] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 10/21/2019] [Revised: 01/13/2020] [Accepted: 02/02/2020] [Indexed: 01/01/2023]
Abstract
Evolutionary processes, including selection, can be indirectly inferred based on patterns of genomic variation among contemporary populations or species. However, this often requires unrealistic assumptions of ancestral demography and selective regimes. Sequencing ancient DNA from temporally spaced samples can inform about past selection processes, as time series data allow direct quantification of population parameters collected before, during, and after genetic changes driven by selection. In this Comment and Opinion, we advocate for the inclusion of temporal sampling and the generation of paleogenomic datasets in evolutionary biology, and highlight some of the recent advances that have yet to be broadly applied by evolutionary biologists. In doing so, we consider the expected signatures of balancing, purifying, and positive selection in time series data, and detail how this can advance our understanding of the chronology and tempo of genomic change driven by selection. However, we also recognize the limitations of such data, which can suffer from postmortem damage, fragmentation, low coverage, and typically low sample size. We therefore highlight the many assumptions and considerations associated with analyzing paleogenomic data and the assumptions associated with analytical methods.
Affiliation(s)
- Marianne Dehasque
  - Centre for Palaeogenetics, 10691 Stockholm, Sweden
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, 10405 Stockholm, Sweden
  - Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
- María C. Ávila‐Arcos
  - International Laboratory for Human Genome Research (LIIGH), UNAM Juriquilla, Queretaro 76230, Mexico
- David Díez‐del‐Molino
  - Centre for Palaeogenetics, 10691 Stockholm, Sweden
  - Department of Zoology, Stockholm University, 10691 Stockholm, Sweden
- Matteo Fumagalli
  - Department of Life Sciences, Silwood Park Campus, Imperial College London, Ascot SL5 7PY, United Kingdom
- Katerina Guschanski
  - Animal Ecology, Department of Ecology and Genetics, Science for Life Laboratory, Uppsala University, 75236 Uppsala, Sweden
- Anna‐Sapfo Malaspinas
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Tomas Marques‐Bonet
  - Institut de Biologia Evolutiva (CSIC‐Universitat Pompeu Fabra), Parc de Recerca Biomèdica de Barcelona, Barcelona, Spain
  - National Centre for Genomic Analysis—Centre for Genomic Regulation, Barcelona Institute of Science and Technology, 08028 Barcelona, Spain
  - Institucio Catalana de Recerca i Estudis Avançats, 08010 Barcelona, Spain
  - Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, Cerdanyola del Vallès, Spain
- Michael D. Martin
  - Department of Natural History, NTNU University Museum, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
- Gemma G. R. Murray
  - Department of Veterinary Medicine, University of Cambridge, Cambridge CB2 1TN, United Kingdom
- Alexander S. T. Papadopulos
  - Molecular Ecology and Fisheries Genetics Laboratory, School of Biological Sciences, Bangor University, Bangor LL57 2UW, United Kingdom
- Daniel Wegmann
  - Department of Biology, Université de Fribourg, 1700 Fribourg, Switzerland
  - Swiss Institute of Bioinformatics, Fribourg, Switzerland
- Love Dalén
  - Centre for Palaeogenetics, 10691 Stockholm, Sweden
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, 10405 Stockholm, Sweden
- Andrew D. Foote
  - Molecular Ecology and Fisheries Genetics Laboratory, School of Biological Sciences, Bangor University, Bangor LL57 2UW, United Kingdom
|
18
|
Spitzer K, Pelizzola M, Futschik A. Modifying the Chi-square and the CMH test for population genetic inference: Adapting to overdispersion. Ann Appl Stat 2020. [DOI: 10.1214/19-aoas1301] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 11/19/2022]
|
19
|
Inference of Selection from Genetic Time Series Using Various Parametric Approximations to the Wright-Fisher Model. G3-GENES GENOMES GENETICS 2019; 9:4073-4086. [PMID: 31597676 PMCID: PMC6893182 DOI: 10.1534/g3.119.400778] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 01/21/2023]
Abstract
Detecting genomic regions under selection is an important objective of population genetics. Typical analyses for this goal are based on exploiting genetic diversity patterns in present time data but rapid advances in DNA sequencing have increased the availability of time series genomic data. A common approach to analyze such data is to model the temporal evolution of an allele frequency as a Markov chain. Based on this principle, several methods have been proposed to infer selection intensity. One of their differences lies in how they model the transition probabilities of the Markov chain. Using the Wright-Fisher model is a natural choice but its computational cost is prohibitive for large population sizes so approximations to this model based on parametric distributions have been proposed. Here, we compared the performance of some of these approximations with respect to their power to detect selection and their estimation of the selection coefficient. We developed a new generic Hidden Markov Model likelihood calculator and applied it to genetic time series simulated under various evolutionary scenarios. The Beta with spikes approximation, which combines discrete fixation probabilities with a continuous Beta distribution, was found to perform consistently better than the others. This distribution provides an almost perfect fit to the Wright-Fisher model in terms of selection inference, for a computational cost that does not increase with population size. We further evaluated this model for population sizes not accessible to the Wright-Fisher model and illustrated its performance on a dataset of two divergently selected chicken populations.
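The Markov chain view summarized in this abstract can be illustrated with a small sketch. The Python below is not the authors' likelihood calculator. It assumes a haploid population of size `n`, fitness `1 + s` for the focal allele (with `s > -1`), binomial sampling as the emission model, a uniform prior on the initial allele count, and one observation per consecutive generation. The function names `wf_transition_matrix` and `hmm_log_likelihood` are hypothetical.

```python
import math


def wf_transition_matrix(n, s):
    """Exact Wright-Fisher transition matrix (illustrative assumption:
    haploid population of size n, focal allele with fitness 1 + s).
    Row i holds P(j copies next generation | i copies now)."""
    matrix = []
    for i in range(n + 1):
        x = i / n
        # Expected frequency after selection, before binomial resampling.
        p = x * (1 + s) / (1 + s * x)
        row = [math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(n + 1)]
        matrix.append(row)
    return matrix


def hmm_log_likelihood(samples, n, s):
    """Forward algorithm over the hidden allele count.
    samples: list of (derived_count, sample_size), one per consecutive
    generation. Emissions are binomial given the hidden frequency."""
    trans = wf_transition_matrix(n, s)
    forward = [1.0 / (n + 1)] * (n + 1)  # uniform prior on the initial count
    log_lik = 0.0
    for t, (k, m) in enumerate(samples):
        if t > 0:
            # Propagate the hidden state one generation forward.
            forward = [
                sum(forward[i] * trans[i][j] for i in range(n + 1))
                for j in range(n + 1)
            ]
        # Weight each hidden state by the probability of the observed sample.
        forward = [
            f * math.comb(m, k) * (j / n) ** k * (1 - j / n) ** (m - k)
            for j, f in enumerate(forward)
        ]
        total = sum(forward)
        log_lik += math.log(total)
        forward = [f / total for f in forward]  # rescale to avoid underflow
    return log_lik
```

Building the matrix costs on the order of `n**2` binomial terms and each forward step costs on the order of `n**2` multiplications, which is the scaling with population size that motivates the parametric approximations compared in the abstract.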
|
20
|
Maximum Likelihood Estimation of Fitness Components in Experimental Evolution. Genetics 2019; 211:1005-1017. [PMID: 30679262 PMCID: PMC6404243 DOI: 10.1534/genetics.118.301893] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 12/20/2018] [Accepted: 01/15/2019] [Indexed: 12/30/2022]
Abstract
Estimating fitness differences between allelic variants is a central goal of experimental evolution. Current methods for inferring such differences from allele frequency time series typically assume that the effects of selection can be described by a fixed selection coefficient. However, fitness is an aggregate of several components including mating success, fecundity, and viability. Distinguishing between these components could be critical in many scenarios. Here, we develop a flexible maximum likelihood framework that can disentangle different components of fitness from genotype frequency data, and estimate them individually in males and females. As a proof-of-principle, we apply our method to experimentally evolved cage populations of Drosophila melanogaster, in which we tracked the relative frequencies of a loss-of-function and wild-type allele of yellow. This X-linked gene produces a recessive yellow phenotype when disrupted and is involved in male courtship ability. We find that the fitness costs of the yellow phenotype take the form of substantially reduced mating preference of wild-type females for yellow males, together with a modest reduction in the viability of yellow males and females. Our framework should be generally applicable to situations where it is important to quantify fitness components of specific genetic variants, including quantitative characterization of the population dynamics of CRISPR gene drives.
|
21
|
Inferring Demography and Selection in Organisms Characterized by Skewed Offspring Distributions. Genetics 2019; 211:1019-1028. [PMID: 30651284 DOI: 10.1534/genetics.118.301684] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 10/11/2018] [Accepted: 01/15/2019] [Indexed: 01/01/2023]
Abstract
The recent increase in time-series population genomic data from experimental, natural, and ancient populations has been accompanied by a promising growth in methodologies for inferring demographic and selective parameters from such data. However, these methods have largely presumed that the populations of interest are well-described by the Kingman coalescent. In reality, many groups of organisms, including viruses, marine organisms, and some plants, protists, and fungi, typified by high variance in progeny number, may be best characterized by multiple-merger coalescent models. Estimation of population genetic parameters under Wright-Fisher assumptions for these organisms may thus be prone to serious mis-inference. We propose a novel method for the joint inference of demography and selection under the Ψ-coalescent model, termed Multiple-Merger Coalescent Approximate Bayesian Computation, or MMC-ABC. We first demonstrate mis-inference under the Kingman, and then exhibit the superior performance of MMC-ABC under conditions of skewed offspring distributions. In order to highlight the utility of this approach, we reanalyzed previously published drug-selection lines of influenza A virus. We jointly inferred the extent of progeny-skew inherent to viral replication and identified putative drug-resistance mutations.
|
22
|
Zinger T, Gelbart M, Miller D, Pennings PS, Stern A. Inferring population genetics parameters of evolving viruses using time-series data. Virus Evol 2019; 5:vez011. [PMID: 31191979 PMCID: PMC6555871 DOI: 10.1093/ve/vez011] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 12/16/2022]
Abstract
With the advent of deep sequencing techniques, it is now possible to track the evolution of viruses with ever-increasing detail. Here, we present Flexible Inference from Time-Series (FITS), a computational tool that allows inference of one of three parameters: the fitness of a specific mutation, the mutation rate or the population size from genomic time-series sequencing data. FITS was designed first and foremost for analysis of either short-term Evolve & Resequence (E&R) experiments or rapidly recombining populations of viruses. We thoroughly explore the performance of FITS on simulated data and highlight its ability to infer the fitness/mutation rate/population size. We further show that FITS can infer meaningful information even when the input parameters are inexact. In particular, FITS is able to successfully categorize a mutation as advantageous or deleterious. We next apply FITS to empirical data from an E&R experiment on poliovirus where parameters were determined experimentally and demonstrate high accuracy in inference.
Affiliation(s)
- Tal Zinger
  - Department of Molecular Microbiology and Biotechnology, School of Molecular Cell Biology and Biotechnology, Haim Levanon Str., Tel-Aviv University, Tel-Aviv, Israel
- Maoz Gelbart
  - Department of Molecular Microbiology and Biotechnology, School of Molecular Cell Biology and Biotechnology, Haim Levanon Str., Tel-Aviv University, Tel-Aviv, Israel
- Danielle Miller
  - Department of Molecular Microbiology and Biotechnology, School of Molecular Cell Biology and Biotechnology, Haim Levanon Str., Tel-Aviv University, Tel-Aviv, Israel
- Pleuni S Pennings
  - Department of Biology, San Francisco State University, 1600 Holloway Ave, San Francisco, CA, USA
- Adi Stern
  - Department of Molecular Microbiology and Biotechnology, School of Molecular Cell Biology and Biotechnology, Haim Levanon Str., Tel-Aviv University, Tel-Aviv, Israel
|
23
|
Abstract
Allele frequency time series data constitute a powerful resource for unraveling mechanisms of adaptation, because the temporal dimension captures important information about evolutionary forces. In particular, Evolve and Resequence (E&R), the whole-genome sequencing of replicated experimentally evolving populations, is becoming increasingly popular. Based on computer simulations several studies proposed experimental parameters to optimize the identification of the selection targets. No such recommendations are available for the underlying parameters selection strength and dominance. Here, we introduce a highly accurate method to estimate selection parameters from replicated time series data, which is fast enough to be applied on a genome scale. Using this new method, we evaluate how experimental parameters can be optimized to obtain the most reliable estimates for selection parameters. We show that the effective population size (Ne) and the number of replicates have the largest impact. Because the number of time points and sequencing coverage had only a minor effect, we suggest that time series analysis is feasible without major increase in sequencing costs. We anticipate that time series analysis will become routine in E&R studies.
Affiliation(s)
- Thomas Taus
  - Institut für Populationsgenetik, Vetmeduni Vienna, Vienna, Austria
  - Vienna Graduate School of Population Genetics, Vienna, Austria
- Andreas Futschik
  - Department of Applied Statistics, Johannes Kepler Universität Linz, Linz, Austria
|
24
|
Inference from the stationary distribution of allele frequencies in a family of Wright-Fisher models with two levels of genetic variability. Theor Popul Biol 2018; 122:78-87. [PMID: 29574050 DOI: 10.1016/j.tpb.2018.03.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]
Abstract
The distribution of allele frequencies obtained from diffusion approximations to Wright-Fisher models is useful in developing intuition about the population level effects of evolutionary processes. The statistical properties of the stationary distributions of K-allele models have been extensively studied under neutrality or under selection. Here, we introduce a new family of Wright-Fisher models in which there are two hierarchical levels of genetic variability. The genotypes composed of alleles differing from each other at the selected level have fitness differences with respect to each other and evolve under selection. The genotypes composed of alleles differing from each other only at the neutral level have the same fitness and evolve under neutrality. We show that with an appropriate scaling of the mutation parameter with respect to the number of alleles at each level, the frequencies of alleles at the selected and the neutral level are conditionally independent of each other, conditional on knowing the number of alleles at all levels. This conditional independence allows us to simulate from the joint stationary distribution of the allele frequencies. We use these simulated frequencies to perform inference on parameters of the model with two levels of genetic variability using Approximate Bayesian Computation.
|
25
|
Tataru P, Simonsen M, Bataillon T, Hobolth A. Statistical Inference in the Wright-Fisher Model Using Allele Frequency Data. Syst Biol 2018; 66:e30-e46. [PMID: 28173553 PMCID: PMC5837693 DOI: 10.1093/sysbio/syw056] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 12/04/2015] [Revised: 05/31/2016] [Accepted: 06/06/2016] [Indexed: 11/14/2022]
Abstract
The Wright–Fisher model provides an elegant mathematical framework for understanding allele frequency data. In particular, the model can be used to infer the demographic history of species and identify loci under selection. A crucial quantity for inference under the Wright–Fisher model is the distribution of allele frequencies (DAF). Despite the apparent simplicity of the model, the calculation of the DAF is challenging. We review and discuss strategies for approximating the DAF, and how these are used in methods that perform inference from allele frequency data. Various evolutionary forces can be incorporated in the Wright–Fisher model, and we consider these in turn. We begin our review with the basic bi-allelic Wright–Fisher model where random genetic drift is the only evolutionary force. We then consider mutation, migration, and selection. In particular, we compare diffusion-based and moment-based methods in terms of accuracy, computational efficiency, and analytical tractability. We conclude with a brief overview of the multi-allelic process with a general mutation model.
Affiliation(s)
- Paula Tataru
  - Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
- Maria Simonsen
  - Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
- Thomas Bataillon
  - Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
- Asger Hobolth
  - Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
|
26
|
Inference in population genetics using forward and backward, discrete and continuous time processes. J Theor Biol 2018; 439:166-180. [DOI: 10.1016/j.jtbi.2017.12.008] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 06/01/2017] [Revised: 11/23/2017] [Accepted: 12/08/2017] [Indexed: 01/01/2023]
|
27
|
R Nené N, Mustonen V, J R Illingworth C. Evaluating genetic drift in time-series evolutionary analysis. J Theor Biol 2018; 437:51-57. [PMID: 28958783 PMCID: PMC5703635 DOI: 10.1016/j.jtbi.2017.09.021] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 07/12/2016] [Revised: 06/20/2017] [Accepted: 09/18/2017] [Indexed: 11/15/2022]
Abstract
The Wright-Fisher model is the most popular population model for describing the behaviour of evolutionary systems with a finite population size. Approximations have commonly been used but the model itself has rarely been tested against time-resolved genomic data. Here, we evaluate the extent to which it can be inferred as the correct model under a likelihood framework. Given genome-wide data from an evolutionary experiment, we validate the Wright-Fisher drift model as the better option for describing evolutionary trajectories in a finite population. This was found by evaluating its performance against a Gaussian model of allele frequency propagation. However, we note a range of circumstances under which standard Wright-Fisher drift cannot be correctly identified.
Affiliation(s)
- Nuno R Nené
  - Department of Genetics, University of Cambridge, Cambridge, UK
- Ville Mustonen
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
  - Department of Biosciences, Department of Computer Science, Institute of Biotechnology, University of Helsinki, Helsinki 00014, Finland
|
28
|
Rousseau E, Moury B, Mailleret L, Senoussi R, Palloix A, Simon V, Valière S, Grognard F, Fabre F. Estimating virus effective population size and selection without neutral markers. PLoS Pathog 2017; 13:e1006702. [PMID: 29155894 PMCID: PMC5720836 DOI: 10.1371/journal.ppat.1006702] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 06/23/2017] [Revised: 12/07/2017] [Accepted: 10/19/2017] [Indexed: 12/04/2022]
Abstract
By combining high-throughput sequencing (HTS) with experimental evolution, we can observe the within-host dynamics of pathogen variants of biomedical or ecological interest. We studied the evolutionary dynamics of five variants of Potato virus Y (PVY) in 15 doubled-haploid lines of pepper. All plants were inoculated with the same mixture of virus variants and variant frequencies were determined by HTS in eight plants of each pepper line at each of six sampling dates. We developed a method for estimating the intensities of selection and genetic drift in a multi-allelic Wright-Fisher model, applicable whether these forces are strong or weak, and in the absence of neutral markers. This method requires variant frequency determination at several time points, in independent hosts. The parameters are the selection coefficients for each PVY variant and four effective population sizes Ne at different time-points of the experiment. Numerical simulations of asexual haploid Wright-Fisher populations subjected to contrasting genetic drift (Ne ∈ [10, 2000]) and selection (|s| ∈ [0, 0.15]) regimes were used to validate the method proposed. The experiment in closely related pepper host genotypes revealed that viruses experienced a considerable diversity of selection and genetic drift regimes. The resulting variant dynamics were accurately described by Wright-Fisher models. The fitness ranks of the variants were almost identical between host genotypes. By contrast, the dynamics of Ne were highly variable, although a bottleneck was often identified during the systemic movement of the virus. We demonstrated that, for a fixed initial PVY population, virus effective population size is a heritable trait in plants. These findings pave the way for the breeding of plant varieties exposing viruses to stronger genetic drift, thereby slowing virus adaptation. 
A growing number of experimental evolution studies are using an “evolve-and-resequence” approach to observe the within-host dynamics of pathogen variants of biomedical or ecological interest. The resulting data are particularly appropriate for studying the effects of evolutionary forces, such as selection and genetic drift, on the emergence of new pathogen variants. However, it remains challenging to unravel the effects of selection and genetic drift in the absence of neutral markers, a situation frequently encountered for microbes, such as viruses, due to their small constrained genomes. Using such an approach on a plant virus, we observed that the same set of virus variants displayed highly diverse dynamics in closely related plant genotypes. We developed and validated a method that does not require neutral markers, for estimating selection coefficients and effective population sizes from these experimental evolution data. We found that the viruses experienced considerable diversity in genetic drift regimes, depending on host genotype. Importantly, genetic drift experienced by virus populations was shown to be a heritable plant trait. These findings pave the way for the breeding of plant varieties exposing viruses to strong genetic drift, thereby slowing virus adaptation.
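The kind of simulation used to validate the method (asexual haploid Wright-Fisher populations with several variants, selection, and a time-varying effective population size) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: variant `i` is given relative fitness `1 + sel[i]` (the paper's own parameterization may differ), each generation is a multinomial draw of `ne` individuals, and the function name `simulate_wf` and its arguments are hypothetical.

```python
import random


def simulate_wf(freqs, sel, ne_by_gen, seed=0):
    """Simulate variant frequencies in a haploid multi-allelic
    Wright-Fisher population (illustrative assumptions only).

    freqs     : initial frequency of each variant (sums to 1)
    sel       : selection coefficient of each variant (fitness 1 + s, s > -1)
    ne_by_gen : effective population size used in each generation
    Returns a list of frequency vectors, one per generation plus the start.
    """
    rng = random.Random(seed)
    trajectory = [list(freqs)]
    for ne in ne_by_gen:
        current = trajectory[-1]
        # Selection: weight each variant by frequency times fitness.
        weights = [f * (1 + s) for f, s in zip(current, sel)]
        # Drift: multinomial resampling of ne individuals.
        draws = rng.choices(range(len(current)), weights=weights, k=ne)
        counts = [0] * len(current)
        for variant in draws:
            counts[variant] += 1
        trajectory.append([c / ne for c in counts])
    return trajectory


# Example: three variants, with a bottleneck (Ne = 20) in the second generation.
trajectory = simulate_wf([0.5, 0.0, 0.5], [0.1, 0.2, 0.0], [100, 20, 100], seed=1)
```

Passing a small value in `ne_by_gen` mimics a bottleneck of the sort the abstract reports during systemic movement of the virus: frequencies then change in coarser steps, so drift can outweigh selection in that generation.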
Affiliation(s)
- Elsa Rousseau
  - Université Côte d’Azur, Inria, INRA, CNRS, UPMC Univ Paris 06, Biocore team, Sophia Antipolis, France
  - Université Côte d’Azur, INRA, CNRS, ISA, Sophia Antipolis, France
  - Pathologie Végétale, INRA, 84140 Montfavet, France
  - * E-mail: (ER); (FF)
- Benoît Moury
  - Pathologie Végétale, INRA, 84140 Montfavet, France
- Ludovic Mailleret
  - Université Côte d’Azur, Inria, INRA, CNRS, UPMC Univ Paris 06, Biocore team, Sophia Antipolis, France
  - Université Côte d’Azur, INRA, CNRS, ISA, Sophia Antipolis, France
- Vincent Simon
  - Pathologie Végétale, INRA, 84140 Montfavet, France
  - UMR BFP, INRA, Villenave d’Ornon, France
- Sophie Valière
  - GeT-PlaGe, INRA, Genotoul, Castanet-Tolosan, France
  - UAR DEPT GA, INRA, Castanet-Tolosan, France
- Frédéric Grognard
  - Université Côte d’Azur, Inria, INRA, CNRS, UPMC Univ Paris 06, Biocore team, Sophia Antipolis, France
- Frédéric Fabre
  - UMR SAVE, INRA, Villenave d’Ornon, France
  - * E-mail: (ER); (FF)
29
Villanueva‐Cañas JL, Rech GE, Cara MAR, González J. Beyond SNPs: how to detect selection on transposable element insertions. Methods Ecol Evol 2017. [DOI: 10.1111/2041-210x.12781] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Indexed: 11/26/2022]
Affiliation(s)
- Gabriel E. Rech
  - Institute of Evolutionary Biology (CSIC‐Universitat Pompeu Fabra), Barcelona, Spain
- Maria Angeles Rodriguez Cara
  - Ecoanthropology and Ethnobiology Laboratory, UMR 7206, CNRS/MNHN/Universite Paris 7, Museum National d'Histoire Naturelle, F‐75116 Paris, France
- Josefa González
  - Institute of Evolutionary Biology (CSIC‐Universitat Pompeu Fabra), Barcelona, Spain
30
Clear: Composition of Likelihoods for Evolve and Resequence Experiments. Genetics 2017; 206:1011-1023. [PMID: 28396506 PMCID: PMC5499160 DOI: 10.1534/genetics.116.197566] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 11/03/2016] [Accepted: 03/31/2017] [Indexed: 01/26/2023] Open
Abstract
The advent of next-generation sequencing technologies has made whole-genome and whole-population sampling possible, even for eukaryotes with large genomes. With this development, experimental evolution studies can be designed to observe molecular evolution "in action" via evolve-and-resequence (E&R) experiments. Among other applications, E&R studies can be used to locate the genes and variants responsible for genetic adaptation. Most existing literature on time-series data analysis assumes large population size, accurate allele frequency estimates, or wide time spans. These assumptions do not hold in many E&R studies. In this article, we propose a method, composition of likelihoods for evolve-and-resequence experiments (Clear), to identify signatures of selection in small population E&R experiments. Clear takes whole-genome sequences of pools of individuals as input, and properly addresses heterogeneous ascertainment bias resulting from uneven coverage. Clear also provides unbiased estimates of model parameters, including population size, selection strength, and dominance, while being computationally efficient. Extensive simulations show that Clear achieves higher power in detecting and localizing selection over a wide range of parameters, and is robust to variation of coverage. We applied the Clear statistic to multiple E&R experiments, including data from a study of adaptation of Drosophila melanogaster to alternating temperatures and a study of outcrossing yeast populations, and identified multiple regions under selection with genome-wide significance.
31
Jewett EM, Steinrücken M, Song YS. The Effects of Population Size Histories on Estimates of Selection Coefficients from Time-Series Genetic Data. Mol Biol Evol 2016; 33:3002-3027. [PMID: 27550904 PMCID: PMC5062326 DOI: 10.1093/molbev/msw173] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 11/14/2022] Open
Abstract
Many approaches have been developed for inferring selection coefficients from time series data while accounting for genetic drift. These approaches have been motivated by the intuition that properly accounting for the population size history can significantly improve estimates of selective strengths. However, the improvement in inference accuracy that can be attained by modeling drift has not been characterized. Here, by comparing maximum likelihood estimates of selection coefficients that account for the true population size history with estimates that ignore drift by assuming allele frequencies evolve deterministically in a population of infinite size, we address the following questions: how much can modeling the population size history improve estimates of selection coefficients? How much can mis-inferred population sizes hurt inferences of selection coefficients? We conduct our analysis under the discrete Wright–Fisher model by deriving the exact probability of an allele frequency trajectory in a population of time-varying size and we replicate our results under the diffusion model. For both models, we find that ignoring drift leads to estimates of selection coefficients that are nearly as accurate as estimates that account for the true population history, even when population sizes are small and drift is high. This result is of interest because inference methods that ignore drift are widely used in evolutionary studies and can be many orders of magnitude faster than methods that account for population sizes.
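The contrast drawn in this abstract, between a trajectory with drift and an estimate that ignores drift, can be illustrated with a minimal sketch. It assumes the simplest case (haploid genic selection, constant size) and uses a least-squares fit on the logit scale as a stand-in for the maximum likelihood estimators studied in the paper; the function names and the `n_copies` parameter are hypothetical and do not come from the authors' software. Under deterministic haploid selection the odds x/(1-x) are multiplied by (1+s) each generation, so the slope of logit(x_t) against t equals log(1+s).

```python
import math
import random


def deterministic_step(x, s):
    """Haploid genic selection in an infinite population (no drift)."""
    return x * (1.0 + s) / (1.0 + s * x)


def wright_fisher_step(x, s, n_copies, rng):
    """Selection followed by binomial sampling of n_copies gene copies (drift)."""
    p = deterministic_step(x, s)
    carriers = sum(1 for _ in range(n_copies) if rng.random() < p)
    return carriers / n_copies


def simulate(x0, s, generations, n_copies=None, seed=0):
    """Frequency trajectory; n_copies=None gives the deterministic limit."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(generations):
        if n_copies is None:
            xs.append(deterministic_step(xs[-1], s))
        else:
            xs.append(wright_fisher_step(xs[-1], s, n_copies, rng))
    return xs


def estimate_s_ignoring_drift(xs):
    """Least-squares slope of logit(x_t) on t; the slope equals log(1 + s)."""
    # Only polymorphic time points have a finite logit.
    pts = [(t, math.log(x / (1.0 - x))) for t, x in enumerate(xs) if 0.0 < x < 1.0]
    if len(pts) < 2:
        raise ValueError("need at least two polymorphic time points")
    t_bar = sum(t for t, _ in pts) / len(pts)
    y_bar = sum(y for _, y in pts) / len(pts)
    num = sum((t - t_bar) * (y - y_bar) for t, y in pts)
    den = sum((t - t_bar) ** 2 for t, _ in pts)
    return math.exp(num / den) - 1.0


if __name__ == "__main__":
    # Illustrative values only: s = 0.05, 500 gene copies, 50 generations.
    noisy = simulate(0.1, 0.05, 50, n_copies=500, seed=1)
    print(estimate_s_ignoring_drift(noisy))
```

Applied to a drift-free trajectory the estimator recovers s up to floating-point error; applied to a Wright-Fisher trajectory it returns the kind of drift-ignoring estimate whose accuracy the abstract discusses.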
Affiliation(s)
- Ethan M Jewett
  - Department of EECS, University of California, Berkeley, CA
  - Department of Statistics, University of California, Berkeley, CA
- Matthias Steinrücken
  - Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst, MA
- Yun S Song
  - Department of EECS, University of California, Berkeley, CA
  - Department of Statistics, University of California, Berkeley, CA
  - Department of Integrative Biology, University of California, Berkeley, CA
  - Department of Biology, University of Pennsylvania, Philadelphia, PA
  - Department of Mathematics, University of Pennsylvania, Philadelphia, PA
32
Ferrer-Admetlla A, Leuenberger C, Jensen JD, Wegmann D. An Approximate Markov Model for the Wright-Fisher Diffusion and Its Application to Time Series Data. Genetics 2016; 203:831-46. [PMID: 27038112 PMCID: PMC4896197 DOI: 10.1534/genetics.115.184598] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.1] [Received: 11/07/2015] [Accepted: 03/22/2016] [Indexed: 11/18/2022] Open
Abstract
The joint and accurate inference of selection and demography from genetic data is considered a particularly challenging question in population genetics, since both processes may lead to very similar patterns of genetic diversity. However, additional information for disentangling these effects may be obtained by observing changes in allele frequencies over multiple time points. Such data are common in experimental evolution studies, as well as in the comparison of ancient and contemporary samples. Leveraging this information, however, has been computationally challenging, particularly when considering multilocus data sets. To overcome these issues, we introduce a novel, discrete approximation for diffusion processes, termed mean transition time approximation, which preserves the long-term behavior of the underlying continuous diffusion process. We then derive this approximation for the particular case of inferring selection and demography from time series data under the classic Wright-Fisher model and demonstrate that our approximation is well suited to describe allele trajectories through time, even when only a few states are used. We then develop a Bayesian inference approach to jointly infer the population size and locus-specific selection coefficients with high accuracy and further extend this model to also infer the rates of sequencing errors and mutations. We finally apply our approach to recent experimental data on the evolution of drug resistance in influenza virus, identifying likely targets of selection and finding evidence for much larger viral population sizes than previously reported.
Affiliation(s)
- Anna Ferrer-Admetlla
  - Department of Biology, University of Fribourg, 1700 Fribourg, Switzerland
  - Department of Life Science, Ecole Polytechnique Federale de Lausanne, 1015 Switzerland
  - Swiss Institute of Bioinformatics, 1700 Fribourg, Switzerland
- Jeffrey D Jensen
  - Department of Life Science, Ecole Polytechnique Federale de Lausanne, 1015 Switzerland
  - Swiss Institute of Bioinformatics, 1700 Fribourg, Switzerland
- Daniel Wegmann
  - Department of Biology, University of Fribourg, 1700 Fribourg, Switzerland
  - Swiss Institute of Bioinformatics, 1700 Fribourg, Switzerland
33
Bayesian Inference of Natural Selection from Allele Frequency Time Series. Genetics 2016; 203:493-511. [PMID: 27010022 DOI: 10.1534/genetics.116.187278] [Citation(s) in RCA: 63] [Impact Index Per Article: 7.9] [Received: 01/21/2016] [Accepted: 03/11/2016] [Indexed: 12/21/2022] Open
Abstract
The advent of accessible ancient DNA technology now allows the direct ascertainment of allele frequencies in ancestral populations, thereby enabling the use of allele frequency time series to detect and estimate natural selection. Such direct observations of allele frequency dynamics are expected to be more powerful than inferences made using patterns of linked neutral variation obtained from modern individuals. We developed a Bayesian method to make use of allele frequency time series data and infer the parameters of general diploid selection, along with allele age, in nonequilibrium populations. We introduce a novel path augmentation approach, in which we use Markov chain Monte Carlo to integrate over the space of allele frequency trajectories consistent with the observed data. Using simulations, we show that this approach has good power to estimate selection coefficients and allele age. Moreover, when applying our approach to data on horse coat color, we find that ignoring a relevant demographic history can significantly bias the results of inference. Our approach is made available in a C++ software package.
34
Computation of the Likelihood of Joint Site Frequency Spectra Using Orthogonal Polynomials. Computation 2016. [DOI: 10.3390/computation4010006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/17/2022]
35
Ormond L, Foll M, Ewing GB, Pfeifer SP, Jensen JD. Inferring the age of a fixed beneficial allele. Mol Ecol 2016; 25:157-69. [PMID: 26576754 DOI: 10.1111/mec.13478] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 05/08/2015] [Revised: 10/14/2015] [Accepted: 11/09/2015] [Indexed: 12/28/2022]
Abstract
Estimating the age and strength of beneficial alleles is central to understanding how adaptation proceeds in response to changing environmental conditions. Several haplotype-based estimators exist for inferring the age of segregating beneficial mutations. Here, we develop an approximate Bayesian-based approach that rather estimates these parameters for fixed beneficial mutations in single populations. We integrate a range of existing diversity, site frequency spectrum, haplotype- and linkage disequilibrium-based summary statistics. We show that for strong selective sweeps on de novo mutations the method can estimate allele age and selection strength even in nonequilibrium demographic scenarios. We extend our approach to models of selection on standing variation, and co-infer the frequency at which selection began to act upon the mutation. Finally, we apply our method to estimate the age and selection strength of a previously identified mutation underpinning cryptic colour adaptation in a wild deer mouse population, and compare our findings with previously published estimates as well as with geological data pertaining to the presumed shift in selective pressure.
Affiliation(s)
- Louise Ormond
  - School of Life Sciences, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Matthieu Foll
  - School of Life Sciences, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
  - International Agency for Research on Cancer (IARC), Lyon, France
- Gregory B Ewing
  - School of Life Sciences, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Susanne P Pfeifer
  - School of Life Sciences, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Jeffrey D Jensen
  - School of Life Sciences, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
36
Malaspinas AS. Methods to characterize selective sweeps using time serial samples: an ancient DNA perspective. Mol Ecol 2015; 25:24-41. [DOI: 10.1111/mec.13492] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.9] [Received: 06/01/2015] [Revised: 11/08/2015] [Accepted: 11/10/2015] [Indexed: 01/20/2023]
Affiliation(s)
- Anna-Sapfo Malaspinas
  - Institute of Ecology and Evolution; University of Bern; Baltzerstrasse 6 CH-3012 Bern Switzerland
  - Centre for GeoGenetics; Natural History Museum of Denmark; University of Copenhagen; Øster Voldgade 5-7 1350 Copenhagen Denmark
37
38
Steinrücken M, Jewett EM, Song YS. SpectralTDF: transition densities of diffusion processes with time-varying selection parameters, mutation rates and effective population sizes. Bioinformatics 2015; 32:795-7. [PMID: 26556388 DOI: 10.1093/bioinformatics/btv627] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 04/30/2015] [Accepted: 10/22/2015] [Indexed: 11/13/2022] Open
Abstract
Motivation: In the Wright-Fisher diffusion, the transition density function describes the time evolution of the population-wide frequency of an allele. This function has several practical applications in population genetics and computing it for biologically realistic scenarios with selection and demography is an important problem.
Results: We develop an efficient method for finding a spectral representation of the transition density function for a general model where the effective population size, selection coefficients and mutation parameters vary over time in a piecewise constant manner.
Availability and implementation: The method, called SpectralTDF, is available at https://sourceforge.net/projects/spectraltdf/
Contact: yss@berkeley.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Matthias Steinrücken
  - Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst, MA 01003, USA
- Yun S Song
  - Department of Statistics, Department of EECS, Department of Integrative Biology, University of California, Berkeley, CA 94720, USA
  - Department of Mathematics and Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA
39
Inference Under a Wright-Fisher Model Using an Accurate Beta Approximation. Genetics 2015; 201:1133-41. [PMID: 26311474 DOI: 10.1534/genetics.115.179606] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 06/19/2015] [Accepted: 08/22/2015] [Indexed: 01/08/2023] Open
Abstract
The large amount and high quality of genomic data available today enable, in principle, accurate inference of evolutionary histories of observed populations. The Wright-Fisher model is one of the most widely used models for this purpose. It describes the stochastic behavior in time of allele frequencies and the influence of evolutionary pressures, such as mutation and selection. Despite its simple mathematical formulation, exact results for the distribution of allele frequency (DAF) as a function of time are not available in closed analytical form. Existing approximations build on the computationally intensive diffusion limit or rely on matching moments of the DAF. One of the moment-based approximations relies on the beta distribution, which can accurately describe the DAF when the allele frequency is not close to the boundaries (0 and 1). Nonetheless, under a Wright-Fisher model, the probability of being on the boundary can be positive, corresponding to the allele being either lost or fixed. Here we introduce the beta with spikes, an extension of the beta approximation that explicitly models the loss and fixation probabilities as two spikes at the boundaries. We show that the addition of spikes greatly improves the quality of the approximation. We additionally illustrate, using both simulated and real data, how the beta with spikes can be used for inference of divergence times between populations with comparable performance to an existing state-of-the-art method.
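The moment-matching idea behind the plain beta approximation mentioned in this abstract can be sketched for the simplest setting. The sketch assumes pure drift (no mutation or selection) in a population of `n_copies` gene copies, where the allele frequency after t generations has mean x0 and variance x0(1-x0)(1-(1-1/n_copies)^t). It does not implement the spikes at 0 and 1 that the paper adds, and all function names are hypothetical.

```python
import math


def wf_neutral_moments(x0, n_copies, generations):
    """Mean and variance of the allele frequency after pure Wright-Fisher drift."""
    mean = x0
    var = x0 * (1.0 - x0) * (1.0 - (1.0 - 1.0 / n_copies) ** generations)
    return mean, var


def beta_from_moments(mean, var):
    """Parameters (alpha, beta) of the beta distribution with this mean and variance."""
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("variance must lie in (0, mean * (1 - mean))")
    k = mean * (1.0 - mean) / var - 1.0  # k = alpha + beta
    return mean * k, (1.0 - mean) * k


def beta_pdf(x, a, b):
    """Beta density at 0 < x < 1, computed on the log scale for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))


if __name__ == "__main__":
    # Illustrative values only: start at 0.3, 100 gene copies, 20 generations.
    m, v = wf_neutral_moments(0.3, 100, 20)
    a, b = beta_from_moments(m, v)
    print(a, b, beta_pdf(0.3, a, b))
```

Because the beta density places no probability mass exactly at 0 or 1, this approximation cannot represent loss or fixation, which is the limitation the beta with spikes is designed to address.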
40
Transition Densities and Sample Frequency Spectra of Diffusion Processes with Selection and Variable Population Size. Genetics 2015; 200:601-17. [PMID: 25873633 DOI: 10.1534/genetics.115.175265] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 01/29/2015] [Accepted: 04/09/2015] [Indexed: 11/18/2022] Open
Abstract
Advances in empirical population genetics have made apparent the need for models that simultaneously account for selection and demography. To address this need, we here study the Wright-Fisher diffusion under selection and variable effective population size. In the case of genic selection and piecewise-constant effective population sizes, we obtain the transition density by extending a recently developed method for computing an accurate spectral representation for a constant population size. Utilizing this extension, we show how to compute the sample frequency spectrum in the presence of genic selection and an arbitrary number of instantaneous changes in the effective population size. We also develop an alternate, efficient algorithm for computing the sample frequency spectrum using a moment-based approach. We apply these methods to answer the following questions: If neutrality is incorrectly assumed when there is selection, what effects does it have on demographic parameter estimation? Can the impact of negative selection be observed in populations that undergo strong exponential growth?
41
Terhorst J, Schlötterer C, Song YS. Multi-locus analysis of genomic time series data from experimental evolution. PLoS Genet 2015; 11:e1005069. [PMID: 25849855 PMCID: PMC4388667 DOI: 10.1371/journal.pgen.1005069] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 06/25/2014] [Accepted: 02/11/2015] [Indexed: 11/19/2022] Open
Abstract
Genomic time series data generated by evolve-and-resequence (E&R) experiments offer a powerful window into the mechanisms that drive evolution. However, standard population genetic inference procedures do not account for sampling serially over time, and new methods are needed to make full use of modern experimental evolution data. To address this problem, we develop a Gaussian process approximation to the multi-locus Wright-Fisher process with selection over a time course of tens of generations. The mean and covariance structure of the Gaussian process are obtained by computing the corresponding moments in discrete-time Wright-Fisher models conditioned on the presence of a linked selected site. This enables our method to account for the effects of linkage and selection, both along the genome and across sampled time points, in an approximate but principled manner. We first use simulated data to demonstrate the power of our method to correctly detect, locate and estimate the fitness of a selected allele from among several linked sites. We study how this power changes for different values of selection strength, initial haplotypic diversity, population size, sampling frequency, experimental duration, number of replicates, and sequencing coverage depth. In addition to providing quantitative estimates of selection parameters from experimental evolution data, our model can be used by practitioners to design E&R experiments with requisite power. We also explore how our likelihood-based approach can be used to infer other model parameters, including effective population size and recombination rate. Then, we apply our method to analyze genome-wide data from a real E&R experiment designed to study the adaptation of D. melanogaster to a new laboratory environment with alternating cold and hot temperatures. 
A growing number of experimental biologists are generating “evolve-and-resequence” (E&R) data in which the genomes of an experimental population are repeatedly sequenced over time. The resulting time series data provide important new insights into the dynamics of evolution. This type of analysis has only recently been made possible by next-generation sequencing, and new statistical procedures are required to analyze this novel data source. We present such a procedure here, and apply it to both simulated and real E&R data.
Affiliation(s)
- Jonathan Terhorst
  - Department of Statistics, University of California, Berkeley, Berkeley, California, United States of America
- Yun S. Song
  - Department of Statistics, University of California, Berkeley, Berkeley, California, United States of America
  - Computer Science Division, University of California, Berkeley, Berkeley, California, United States of America
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, California, United States of America
  - * E-mail:
42
Steinrücken M, Bhaskar A, Song YS. A Novel Spectral Method for Inferring General Diploid Selection from Time Series Genetic Data. Ann Appl Stat 2014; 8:2203-2222. [PMID: 25598858 PMCID: PMC4295721 DOI: 10.1214/14-aoas764] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Indexed: 05/26/2024]
Abstract
The increased availability of time series genetic variation data from experimental evolution studies and ancient DNA samples has created new opportunities to identify genomic regions under selective pressure and to estimate their associated fitness parameters. However, it is a challenging problem to compute the likelihood of non-neutral models for the population allele frequency dynamics, given the observed temporal DNA data. Here, we develop a novel spectral algorithm to analytically and efficiently integrate over all possible frequency trajectories between consecutive time points. This advance circumvents the limitations of existing methods which require fine-tuning the discretization of the population allele frequency space when numerically approximating requisite integrals. Furthermore, our method is flexible enough to handle general diploid models of selection where the heterozygote and homozygote fitness parameters can take any values, while previous methods focused on only a few restricted models of selection. We demonstrate the utility of our method on simulated data and also apply it to analyze ancient DNA data from genetic loci associated with coat coloration in horses. In contrast to previous studies, our exploration of the full fitness parameter space reveals that a heterozygote-advantage form of balancing selection may have been acting on these loci.