1
|
Dutheil JY, Hobolth A. Ancestral Population Genomics. Methods Mol Biol 2019; 1910:555-589. [PMID: 31278677 DOI: 10.1007/978-1-4939-9074-0_18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Borrowing from both population genetics and phylogenetics, the field of population genomics emerged as full genomes of several closely related species became available. Provided that we can properly model sequence evolution within populations undergoing speciation events, this resource enables us to estimate key population genetics parameters such as ancestral population sizes and split times. Furthermore, we can enhance our understanding of the recombination process and investigate various selective forces. With the advent of resequencing technologies, genome-wide patterns of diversity in extant populations have now come to complement this picture, offering increasing power to study more recent genetic history. We discuss the basic models of genomes in populations, including speciation models for closely related species. A major point in our discussion is that only a few complete genomes contain much information about the whole population. The reason is that recombination unlinks genomic regions, and therefore a few genomes contain many segments with distinct histories. The challenge of population genomics is to decode this mosaic of histories in order to infer scenarios of demography and selection. We survey modeling strategies for understanding genetic variation in ancestral populations and species. The underlying models build on the coalescent with recombination process and introduce further assumptions to scale the analyses to genomic data sets.
Affiliation(s)
- Julien Y Dutheil
- Department of Evolutionary Genetics, Max Planck Institute of Evolutionary Biology, Plön, Germany.
- Asger Hobolth
- Bioinformatics Research Center (BiRC), Aarhus University, Aarhus, Denmark
|
2
|
Abstract
With the advent of sequencing techniques, population genomics underwent a major shift. The structure of data sets has evolved from a sample of a few loci in the genome, sequenced in dozens of individuals, to collections of complete genomes, comprising virtually all available loci. Initially sequenced in a few individuals, such genomic data sets are now reaching and even exceeding the size of traditional data sets in the number of haplotypes sequenced. Because the loci in a genome are not all independent, this evolution of data sets is mirrored by a methodological change. The evolutionary processes that generate the observed sequences are now modeled spatially along genomes, whereas they were previously described temporally (either in a forward or backward manner). Although the spatial process of sequence evolution is complex, approximations to the model feature Markovian properties, permitting efficient inference. In this chapter, we introduce these recent developments that enable the modeling of the evolutionary history of a sample of several individual genomes. Such models assume the occurrence of meiotic recombination, and therefore, to date, they are dedicated to the analysis of eukaryotic species.
|
3
|
Miroshnikov A, Steinrücken M. Computing the joint distribution of the total tree length across loci in populations with variable size. Theor Popul Biol 2017; 118:1-19. [PMID: 28943126 PMCID: PMC5705476 DOI: 10.1016/j.tpb.2017.09.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 11/23/2016] [Revised: 09/08/2017] [Accepted: 09/13/2017] [Indexed: 11/26/2022]
Abstract
In recent years, a number of methods have been developed to infer complex demographic histories, especially historical population size changes, from genomic sequence data. Coalescent Hidden Markov Models have proven to be particularly useful for this type of inference. Due to the Markovian structure of these models, an essential building block is the joint distribution of local genealogical trees, or statistics of these genealogies, at two neighboring loci in populations of variable size. Here, we present a novel method to compute the marginal and the joint distribution of the total length of the genealogical trees at two loci separated by at most one recombination event for samples of arbitrary size. To our knowledge, no method to compute these distributions has been presented in the literature to date. We show that they can be obtained from the solution of certain hyperbolic systems of partial differential equations. We present a numerical algorithm, based on the method of characteristics, that can be used to efficiently and accurately solve these systems and compute the marginal and the joint distributions. We demonstrate its utility to study the properties of the joint distribution. Our flexible method can be straightforwardly extended to handle an arbitrary fixed number of recombination events, to include the distributions of other statistics of the genealogies as well, and can also be applied in structured populations.
Affiliation(s)
- Alexey Miroshnikov
- University of California, Los Angeles, Department of Mathematics, United States; University of Massachusetts Amherst, Department of Biostatistics and Epidemiology, United States
- Matthias Steinrücken
- University of Chicago, Department of Ecology and Evolution, United States; University of Massachusetts Amherst, Department of Biostatistics and Epidemiology, United States.
|
4
|
A non-zero variance of Tajima's estimator for two sequences even for infinitely many unlinked loci. Theor Popul Biol 2017; 122:22-29. [PMID: 28341209 DOI: 10.1016/j.tpb.2017.03.002] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 08/16/2016] [Revised: 02/12/2017] [Accepted: 03/03/2017] [Indexed: 10/19/2022]
Abstract
The population-scaled mutation rate, $\theta$, is informative on the effective population size and is thus widely used in population genetics. We show that for two sequences and $n$ unlinked loci, the variance of Tajima's estimator ($\hat{\theta}$), which is the average number of pairwise differences, does not vanish even as $n \to \infty$. The non-zero variance of $\hat{\theta}$ results from a (weak) correlation between coalescence times even at unlinked loci, which, in turn, is due to the underlying fixed pedigree shared by gene genealogies at all loci. We derive the correlation coefficient under a diploid, discrete-time, Wright-Fisher model, and we also derive a simple, closed-form lower bound. We also obtain empirical estimates of the correlation of coalescence times under demographic models inspired by large-scale human genealogies. While the effect we describe is small ($\mathrm{Var}(\hat{\theta})/\theta^2 \approx O(N_e^{-1})$), it is important to recognize this feature of statistical population genetics, which runs counter to commonly held notions about unlinked loci.
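As an illustration of the estimator named in this abstract, the following minimal Python sketch computes the average number of pairwise differences for a set of aligned sequences. The function name `tajima_theta` and the toy sequences are hypothetical and chosen only for this example; the sketch returns a per-alignment value (not per site) and does not reproduce the variance analysis of the paper.

```python
from itertools import combinations


def tajima_theta(sequences):
    """Average number of pairwise differences among aligned sequences.

    Hypothetical helper for illustration: `sequences` is a list of
    equal-length strings, one per sampled haplotype.
    """
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned (equal length)")

    pairs = list(combinations(sequences, 2))
    # Count mismatching positions for every unordered pair of sequences.
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diffs / len(pairs)


# Toy alignment: pairwise differences are 1, 2 and 1, so the mean is 4/3.
print(tajima_theta(["ACGT", "ACGA", "TCGA"]))
```

For two sequences, as in the setting of the abstract, the value at one locus is simply the number of differing sites; averaging that count over many loci gives the multi-locus estimate whose variance the paper studies.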
|
5
|
Montemuiño C, Espinosa A, Moure JC, Vera G, Hernández P, Ramos-Onsins S. Approaching Long Genomic Regions and Large Recombination Rates with msParSm as an Alternative to MaCS. Evol Bioinform Online 2016; 12:223-228. [PMID: 27721650 PMCID: PMC5047705 DOI: 10.4137/ebo.s40268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2016] [Revised: 07/19/2016] [Accepted: 07/21/2016] [Indexed: 11/05/2022] Open
Abstract
The msParSm application is an evolution of msPar, the parallel version of the coalescent simulation program ms. It removes the limitation on simulating long stretches of DNA sequence with large recombination rates, without compromising the accuracy of the standard coalescent. This work introduces msParSm, describes its significant performance improvements over msPar and the details of its shared-memory parallelization, and shows that its execution times are similar to, if not better than, those of MaCS. Two case studies with different mutation rates were analyzed, one approximating the human average and the other approximating the Drosophila melanogaster average. Source code is available at https://github.com/cmontemuino/msparsm.
Affiliation(s)
- Carlos Montemuiño
- Computer Architecture and Operating Systems Department (CAOS), Universitat Autònoma de Barcelona, Bellaterra, Spain
- Antonio Espinosa
- Computer Architecture and Operating Systems Department (CAOS), Universitat Autònoma de Barcelona, Bellaterra, Spain
- Juan C Moure
- Computer Architecture and Operating Systems Department (CAOS), Universitat Autònoma de Barcelona, Bellaterra, Spain
- Gonzalo Vera
- Centre for Research in Agricultural Genomics (CRAG) Consortium CSIC-IRTA-UAB-UB Edifici CRAG, Campus UAB, Bellaterra, Spain
- Porfidio Hernández
- Computer Architecture and Operating Systems Department (CAOS), Universitat Autònoma de Barcelona, Bellaterra, Spain
- Sebastián Ramos-Onsins
- Centre for Research in Agricultural Genomics (CRAG) Consortium CSIC-IRTA-UAB-UB Edifici CRAG, Campus UAB, Bellaterra, Spain
|
6
|
Kelleher J, Etheridge AM, McVean G. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLoS Comput Biol 2016; 12:e1004842. [PMID: 27145223 PMCID: PMC4856371 DOI: 10.1371/journal.pcbi.1004842] [Citation(s) in RCA: 340] [Impact Index Per Article: 42.5] [Received: 12/10/2015] [Accepted: 03/02/2016] [Indexed: 01/23/2023] Open
Abstract
A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present day coalescent simulations do not scale well, or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods to store genealogies consume a great deal of space, are slow to parse and do not take advantage of shared structure in correlated trees. We solve these problems by introducing sparse trees and coalescence records as the key units of genealogical analysis. Using these tools, exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods. We can also analyse the results orders of magnitude more quickly than with existing methods.

Our understanding of the distribution of genetic variation in natural populations has been driven by mathematical models of the underlying biological and demographic processes. A key strength of such coalescent models is that they enable efficient simulation of data we might see under a variety of evolutionary scenarios. However, current methods are not well suited to simulating genome-scale data sets on hundreds of thousands of samples, which is essential if we are to understand the data generated by population-scale sequencing projects. Similarly, processing the results of large simulations also presents researchers with a major challenge, as it can take many days just to read the data files. In this paper we solve these problems by introducing a new way to represent information about the ancestral process. This new representation leads to huge gains in simulation speed and storage efficiency so that large simulations complete in minutes and the output files can be processed in seconds.
Affiliation(s)
- Jerome Kelleher
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
- Gilean McVean
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
- Department of Statistics, University of Oxford, Oxford, United Kingdom
- Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, United Kingdom
|
7
|
The SMC' is a highly accurate approximation to the ancestral recombination graph. Genetics 2015; 200:343-55. [PMID: 25786855 DOI: 10.1534/genetics.114.173898] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 12/17/2014] [Accepted: 03/12/2015] [Indexed: 11/18/2022] Open
Abstract
Two sequentially Markov coalescent models (SMC and SMC') are available as tractable approximations to the ancestral recombination graph (ARG). We present a Markov process describing coalescence at two fixed points along a pair of sequences evolving under the SMC'. Using our Markov process, we derive a number of new quantities related to the pairwise SMC', thereby analytically quantifying for the first time the similarity between the SMC' and the ARG. We use our process to show that the joint distribution of pairwise coalescence times at recombination sites under the SMC' is the same as it is marginally under the ARG, which demonstrates that the SMC' is, in a particular well-defined, intuitive sense, the most appropriate first-order sequentially Markov approximation to the ARG. Finally, we use these results to show that population size estimates under the pairwise SMC are asymptotically biased, while under the pairwise SMC' they are approximately asymptotically unbiased.
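The difference between the two pairwise models can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions, not the Markov process derived in the paper: it assumes a constant population size, time measured in coalescent units (pairwise coalescence rate 1), and a recombination event already placed at height `u` on one of the two lineages. The function name `next_coalescence_time` and its parameters are hypothetical.

```python
import random


def next_coalescence_time(t, u, rng, model="smc'"):
    """Pairwise coalescence time just to the right of a breakpoint.

    t: coalescence time to the left of the breakpoint (coalescent units)
    u: height of the recombination event on one lineage, with 0 < u < t
    rng: a random.Random instance
    model: "smc" or "smc'"
    """
    if not 0 < u < t:
        raise ValueError("require 0 < u < t")

    if model == "smc":
        # SMC: the old branch above u is discarded, so the floating lineage
        # can only rejoin the other lineage, at rate 1 from time u onward.
        return u + rng.expovariate(1.0)

    # SMC': between u and t the floating lineage may rejoin either the other
    # lineage or its own old branch, giving a total rate of 2.
    s = u + rng.expovariate(2.0)
    if s >= t:
        # No event below t: a single ancestral lineage remains above t,
        # so the remaining waiting time has rate 1.
        return t + rng.expovariate(1.0)
    # Event below t: with probability 1/2 it is the lineage's own old
    # branch, which leaves the coalescence time unchanged.
    return t if rng.random() < 0.5 else s


rng = random.Random(42)
draws = [next_coalescence_time(1.0, 0.2, rng, model="smc'") for _ in range(10)]
print(draws)
```

Under these assumptions, the SMC' step can return exactly `t` (a recombination event that does not change the local genealogy), whereas the SMC step always draws a fresh time above `u`.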
|
8
|
Staab PR, Zhu S, Metzler D, Lunter G. scrm: efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics 2015; 31:1680-2. [PMID: 25596205 PMCID: PMC4426833 DOI: 10.1093/bioinformatics/btu861] [Citation(s) in RCA: 90] [Impact Index Per Article: 10.0] [Received: 09/28/2014] [Accepted: 12/23/2014] [Indexed: 11/13/2022]
Abstract
Motivation: Coalescent-based simulation software for genomic sequences allows the efficient in silico generation of short- and medium-sized genetic sequences. However, the simulation of genome-size datasets as produced by next-generation sequencing is currently only possible using fairly crude approximations. Results: We present the sequential coalescent with recombination model (SCRM), a new method that efficiently and accurately approximates the coalescent with recombination, closing the gap between current approximations and the exact model. We present an efficient implementation and show that it can simulate genomic-scale datasets with an essentially correct linkage structure. Availability and implementation: The open source implementation scrm is freely available at https://scrm.github.io under the conditions of the GPLv3 license. Contact: staab@bio.lmu.de or gerton.lunter@well.ox.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Paul R Staab
- Department of Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany and Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
- Sha Zhu
- Department of Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany and Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
- Dirk Metzler
- Department of Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany and Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
- Gerton Lunter
- Department of Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany and Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
|
9
|
Yang T, Deng HW, Niu T. Critical assessment of coalescent simulators in modeling recombination hotspots in genomic sequences. BMC Bioinformatics 2014; 15:3. [PMID: 24387001 PMCID: PMC3890628 DOI: 10.1186/1471-2105-15-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 09/09/2013] [Accepted: 12/30/2013] [Indexed: 12/04/2022] Open
Abstract
Background: Coalescent simulation is pivotal for understanding population evolutionary models and demographic histories, as well as for developing novel analytical methods for genetic association studies for DNA sequence data. A plethora of coalescent simulators have been developed, but selecting the most appropriate program remains challenging. Results: We extensively compared the performance of five widely used coalescent simulators (Hudson’s ms, msHOT, MaCS, Simcoal2, and fastsimcoal) to provide a practical guide considering three crucial factors: 1) speed, 2) scalability and 3) recombination hotspot position and intensity accuracy. Although ms represents a popular standard coalescent simulator, it lacks the ability to simulate sequences with recombination hotspots. An extended program, msHOT, compensates for this deficiency of ms by incorporating recombination hotspots and gene conversion events at arbitrarily chosen locations and intensities, but remains limited in simulating long stretches of DNA sequences. Simcoal2, based on a discrete generation-by-generation approach, can simulate more complex demographic scenarios, but runs comparatively slowly. MaCS and fastsimcoal, both built on fast, modified sequential Markov coalescent algorithms that approximate the standard coalescent, are much more efficient whilst keeping salient features of msHOT and Simcoal2, respectively. Our simulations demonstrate that they have advantages over the other programs for a spectrum of evolutionary models. To validate recombination hotspots, the LDhat 2.2 rhomap package, sequenceLDhot and Haploview were compared for hotspot detection, and sequenceLDhot exhibited the best performance based on both real and simulated data. Conclusions: While ms remains an excellent choice for general coalescent simulations of DNA sequences, MaCS and fastsimcoal are much more scalable and flexible in simulating a variety of demographic events under different recombination hotspot models. Furthermore, sequenceLDhot appears to give the best performance in detecting and validating cross-over hotspots.
Affiliation(s)
- Tianhua Niu
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, Tulane University School of Public Health and Tropical Medicine, 1440 Canal Street, Suite 2001, New Orleans, LA 70112, USA.
|
10
|
Li H, Wiehe T. Coalescent tree imbalance and a simple test for selective sweeps based on microsatellite variation. PLoS Comput Biol 2013; 9:e1003060. [PMID: 23696722 PMCID: PMC3656098 DOI: 10.1371/journal.pcbi.1003060] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 09/05/2012] [Accepted: 03/28/2013] [Indexed: 12/04/2022] Open
Abstract
Selective sweeps are at the core of adaptive evolution. We study how the shape of coalescent trees is affected by recent selective sweeps. To do so we define a coarse-grained measure of tree topology. This measure has appealing analytical properties, its distribution is derived from a uniform, and it is easy to estimate from experimental data. We show how it can be cast into a test for recent selective sweeps using microsatellite markers and present an application to an experimental data set from Plasmodium falciparum.

It is one of the major interests in population genetics to contrast the properties and consequences of neutral and non-neutral modes of evolution. As is well-known, positive Darwinian selection and genetic hitchhiking drastically change the profile of genetic diversity compared to neutral expectations. The present-day observable genetic diversity in a sample of DNA sequences depends on events in their evolutionary history, and in particular on the shape of the underlying genealogical tree. In this paper we study how the shape of coalescent trees is affected by the presence of positively selected mutations. We define a measure of tree topology and study its properties under scenarios of neutrality and positive selection. We show that this measure can reliably be estimated from experimental data, and define an easy-to-compute statistical test of the neutral evolution hypothesis. We apply this test to data from a population of the malaria parasite Plasmodium falciparum and confirm the signature of recent positive selection in the vicinity of a drug resistance locus.
Affiliation(s)
- Haipeng Li
- Department of Computational Genomics, CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Institut für Genetik, Universität zu Köln, Köln, Germany
- Thomas Wiehe
- Institut für Genetik, Universität zu Köln, Köln, Germany
|
11
|
A sequential coalescent algorithm for chromosomal inversions. Heredity (Edinb) 2013; 111:200-9. [PMID: 23632894 DOI: 10.1038/hdy.2013.38] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 07/18/2012] [Revised: 02/04/2013] [Accepted: 03/25/2013] [Indexed: 01/06/2023] Open
Abstract
Chromosomal inversions are common in natural populations and are believed to be involved in many important evolutionary phenomena, including speciation, the evolution of sex chromosomes and local adaptation. While recent advances in sequencing and genotyping methods are leading to rapidly increasing amounts of genome-wide sequence data that reveal interesting patterns of genetic variation within inverted regions, efficient simulation methods to study these patterns are largely missing. In this work, we extend the sequential Markovian coalescent, an approximation to the coalescent with recombination, to include the effects of polymorphic inversions on patterns of recombination. Results show that our algorithm is fast, memory-efficient and accurate, making it feasible to simulate large inversions in large populations for the first time. The SMC algorithm enables studies of patterns of genetic variation (for example, linkage disequilibria) and tests of hypotheses (using simulation-based approaches) that were previously intractable.
|
12
|
Abstract
To model deviations from selectively neutral genetic variation caused by different forms of selection, it is necessary to first understand patterns of neutral variation. Best understood is neutral genetic variation at a single locus. But, as is well known, additional insights can be gained by investigating multiple loci. The resulting patterns reflect the degree of association (linkage) between loci and provide information about the underlying multilocus gene genealogies. The statistical properties of two-locus gene genealogies have been intensively studied for populations of constant size, as well as for simple demographic histories such as exponential population growth and single bottlenecks. By contrast, the combined effect of recombination and sustained demographic fluctuations is poorly understood. Addressing this issue, we study a two-locus Wright-Fisher model of a population subject to recurrent bottlenecks. We derive coalescent approximations for the covariance of the times to the most recent common ancestor at two loci in samples of two chromosomes. This covariance reflects the degree of association and thus linkage disequilibrium between these loci. We find, first, that an effective population-size approximation describes the numerically observed association between two loci provided that recombination occurs either much faster or much more slowly than the population-size fluctuations. Second, when recombination occurs frequently between but rarely within bottlenecks, we observe that the association of gene histories becomes independent of physical distance over a certain range of distances. Third, we show that in this case, a commonly used measure of linkage disequilibrium, $\sigma_d^2$ (closely related to $r^2$), fails to capture the long-range association between two loci. The reason is that constituent terms, each reflecting the long-range association, cancel. Fourth, we analyze a limiting case in which the long-range association can be described in terms of a Xi coalescent allowing for simultaneous multiple mergers of ancestral lines.
|
13
|
Excoffier L, Foll M. fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics 2011; 27:1332-4. [DOI: 10.1093/bioinformatics/btr124] [Citation(s) in RCA: 343] [Impact Index Per Article: 26.4] [Indexed: 11/14/2022] Open
|