1
|
Padilla-Iglesias C, Derkx I. Hunter-gatherer genetics research: Importance and avenues. EVOLUTIONARY HUMAN SCIENCES 2024; 6:e15. [PMID: 38516374 PMCID: PMC10955370 DOI: 10.1017/ehs.2024.7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2023] [Revised: 01/17/2024] [Accepted: 02/02/2024] [Indexed: 03/23/2024] Open
Abstract
Major developments in the field of genetics in the past few decades have revolutionised notions of what it means to be human. Although currently only a few populations around the world practise a hunting and gathering lifestyle, this mode of subsistence has characterised members of our species since its very origins and allowed us to migrate across the planet. Therefore, the geographical distribution of hunter-gatherer populations, dependence on local ecosystems and connections to past populations and neighbouring groups have provided unique insights into our evolutionary origins. However, given the vulnerable status of hunter-gatherers worldwide, the development of the field of anthropological genetics requires that we reevaluate how we conduct research with these communities. Here, we review how the inclusion of hunter-gatherer populations in genetics studies has advanced our understanding of human origins, ancient population migrations and interactions as well as phenotypic adaptations and adaptability to different environments, and the important scientific and medical applications of these advancements. At the same time, we highlight the necessity to address yet unresolved questions and identify areas in which the field may benefit from improvements.
Collapse
Affiliation(s)
| | - Inez Derkx
- Department of Evolutionary Anthropology, University of Zurich, Zurich, Switzerland
| |
Collapse
|
2
|
Fournier R, Tsangalidou Z, Reich D, Palamara PF. Haplotype-based inference of recent effective population size in modern and ancient DNA samples. Nat Commun 2023; 14:7945. [PMID: 38040695 PMCID: PMC10692198 DOI: 10.1038/s41467-023-43522-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2022] [Accepted: 11/10/2023] [Indexed: 12/03/2023] Open
Abstract
Individuals sharing recent ancestors are likely to co-inherit large identical-by-descent (IBD) genomic regions. The distribution of these IBD segments in a population may be used to reconstruct past demographic events such as effective population size variation, but accurate IBD detection is difficult in ancient DNA data and in underrepresented populations with limited reference data. In this work, we introduce an accurate method for inferring effective population size variation during the past ~2000 years in both modern and ancient DNA data, called HapNe. HapNe infers recent population size fluctuations using either IBD sharing (HapNe-IBD) or linkage disequilibrium (HapNe-LD), which does not require phasing and can be computed in low coverage data, including data sets with heterogeneous sampling times. HapNe shows improved accuracy in a range of simulated demographic scenarios compared to currently available methods for IBD-based and LD-based inference of recent effective population size, while requiring fewer computational resources. We apply HapNe to several modern populations from the 1,000 Genomes Project, the UK Biobank, the Allen Ancient DNA Resource, and recently published samples from Iron Age Britain, detecting multiple instances of recent effective population size variation across these groups.
Collapse
Affiliation(s)
| | | | - David Reich
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Department of Human Evolutionary Biology, Harvard University, Cambridge, MA, USA
- Howard Hughes Medical Institute, Harvard Medical School, Boston, MA, USA
| | - Pier Francesco Palamara
- Department of Statistics, University of Oxford, Oxford, UK.
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK.
| |
Collapse
|
3
|
Spence JP, Zeng T, Mostafavi H, Pritchard JK. Scaling the discrete-time Wright-Fisher model to biobank-scale datasets. Genetics 2023; 225:iyad168. [PMID: 37724741 PMCID: PMC10627256 DOI: 10.1093/genetics/iyad168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2023] [Revised: 06/01/2023] [Accepted: 09/08/2023] [Indexed: 09/21/2023] Open
Abstract
The discrete-time Wright-Fisher (DTWF) model and its diffusion limit are central to population genetics. These models can describe the forward-in-time evolution of allele frequencies in a population resulting from genetic drift, mutation, and selection. Computing likelihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large samples or in the presence of strong selection. Existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here, we present a scalable algorithm that approximates the DTWF model with provably bounded error. Our approach relies on two key observations about the DTWF model. The first is that transition probabilities under the model are approximately sparse. The second is that transition distributions for similar starting allele frequencies are extremely close as distributions. Together, these observations enable approximate matrix-vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for Hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the tens of millions, paving the way for rigorous biobank-scale inference. Finally, we use our results to estimate the impact of larger samples on estimating selection coefficients for loss-of-function variants. We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects.
Collapse
Affiliation(s)
- Jeffrey P Spence
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
| | - Tony Zeng
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
| | | | - Jonathan K Pritchard
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Department of Biology, Stanford University, Stanford, CA 94305, USA
| |
Collapse
|
4
|
Noskova E, Borovitskiy V. Bayesian optimization for demographic inference. G3 (BETHESDA, MD.) 2023; 13:jkad080. [PMID: 37070782 PMCID: PMC10320152 DOI: 10.1093/g3journal/jkad080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/12/2022] [Revised: 11/12/2022] [Accepted: 03/18/2023] [Indexed: 04/19/2023]
Abstract
Inference of demographic histories of species and populations is one of the central problems in population genetics. It is usually stated as an optimization problem: find a model's parameters that maximize a certain log-likelihood. This log-likelihood is often expensive to evaluate in terms of time and hardware resources, critically more so for larger population counts. Although genetic algorithm-based solution has proven efficient for demographic inference in the past, it struggles to deal with log-likelihoods in the setting of more than three populations. Different tools are therefore needed to handle such scenarios. We introduce a new optimization pipeline for demographic inference with time consuming log-likelihood evaluations. It is based on Bayesian optimization, a prominent technique for optimizing expensive black box functions. Comparing to the existing widely used genetic algorithm solution, we demonstrate new pipeline's superiority in the limited time budget setting with four and five populations, when using the log-likelihoods provided by the moments tool.
Collapse
Affiliation(s)
- Ekaterina Noskova
- Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia
| | | |
Collapse
|
5
|
Spence JP, Zeng T, Mostafavi H, Pritchard JK. Scaling the Discrete-time Wright Fisher model to biobank-scale datasets. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.19.541517. [PMID: 37293115 PMCID: PMC10245735 DOI: 10.1101/2023.05.19.541517] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
The Discrete-Time Wright Fisher (DTWF) model and its large population diffusion limit are central to population genetics. These models describe the forward-in-time evolution of the frequency of an allele in a population and can include the fundamental forces of genetic drift, mutation, and selection. Computing like-lihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large sample sizes or in the presence of strong selection. Unfortunately, existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here we present an algorithm that approximates the DTWF model with provably bounded error and runs in time linear in the size of the population. Our approach relies on two key observations about Binomial distributions. The first is that Binomial distributions are approximately sparse. The second is that Binomial distributions with similar success probabilities are extremely close as distributions, allowing us to approximate the DTWF Markov transition matrix as a very low rank matrix. Together, these observations enable matrix-vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for Hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the billions, paving the way for rigorous biobank-scale population genetic inference. Finally, we use our results to estimate how increasing sample sizes will improve the estimation of selection coefficients acting on loss-of-function variants. We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects.
Collapse
Affiliation(s)
| | - Tony Zeng
- Department of Genetics, Stanford University
| | | | - Jonathan K. Pritchard
- Department of Genetics, Stanford University
- Department of Biology, Stanford University
| |
Collapse
|
6
|
Terbot JW, Johri P, Liphardt SW, Soni V, Pfeifer SP, Cooper BS, Good JM, Jensen JD. Developing an appropriate evolutionary baseline model for the study of SARS-CoV-2 patient samples. PLoS Pathog 2023; 19:e1011265. [PMID: 37018331 PMCID: PMC10075409 DOI: 10.1371/journal.ppat.1011265] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/06/2023] Open
Abstract
Over the past 3 years, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has spread through human populations in several waves, resulting in a global health crisis. In response, genomic surveillance efforts have proliferated in the hopes of tracking and anticipating the evolution of this virus, resulting in millions of patient isolates now being available in public databases. Yet, while there is a tremendous focus on identifying newly emerging adaptive viral variants, this quantification is far from trivial. Specifically, multiple co-occurring and interacting evolutionary processes are constantly in operation and must be jointly considered and modeled in order to perform accurate inference. We here outline critical individual components of such an evolutionary baseline model-mutation rates, recombination rates, the distribution of fitness effects, infection dynamics, and compartmentalization-and describe the current state of knowledge pertaining to the related parameters of each in SARS-CoV-2. We close with a series of recommendations for future clinical sampling, model construction, and statistical analysis.
Collapse
Affiliation(s)
- John W Terbot
- University of Montana, Division of Biological Sciences, Missoula, Montana, United States of America
- Arizona State University, School of Life Sciences, Center for Evolution & Medicine, Tempe, Arizona, United States of America
| | - Parul Johri
- Arizona State University, School of Life Sciences, Center for Evolution & Medicine, Tempe, Arizona, United States of America
| | - Schuyler W Liphardt
- University of Montana, Division of Biological Sciences, Missoula, Montana, United States of America
| | - Vivak Soni
- Arizona State University, School of Life Sciences, Center for Evolution & Medicine, Tempe, Arizona, United States of America
| | - Susanne P Pfeifer
- Arizona State University, School of Life Sciences, Center for Evolution & Medicine, Tempe, Arizona, United States of America
| | - Brandon S Cooper
- University of Montana, Division of Biological Sciences, Missoula, Montana, United States of America
| | - Jeffrey M Good
- University of Montana, Division of Biological Sciences, Missoula, Montana, United States of America
| | - Jeffrey D Jensen
- Arizona State University, School of Life Sciences, Center for Evolution & Medicine, Tempe, Arizona, United States of America
| |
Collapse
|
7
|
Yan Z, Ogilvie HA, Nakhleh L. Comparing inference under the multispecies coalescent with and without recombination. Mol Phylogenet Evol 2023; 181:107724. [PMID: 36720421 DOI: 10.1016/j.ympev.2023.107724] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2022] [Revised: 01/17/2023] [Accepted: 01/24/2023] [Indexed: 01/29/2023]
Abstract
Accurate inference of population parameters plays a pivotal role in unravelling evolutionary histories. While recombination has been universally accepted as a fundamental process in the evolution of sexually reproducing organisms, it remains challenging to model it exactly. Thus, existing coalescent-based approaches make different assumptions or approximations to facilitate phylogenetic inference, which can potentially bring about biases in estimates of evolutionary parameters when recombination is present. In this article, we evaluate the performance of population parameter estimation using three methods-StarBEAST2, SNAPP, and diCal2-that represent three different types of inference. We performed whole-genome simulations in which recombination rates, mutation rates, and levels of incomplete lineage sorting were varied. We show that StarBEAST2 using short or medium-sized loci is robust to realistic rates of recombination, which is in agreement with previous studies. SNAPP, as expected, is generally unaffected by recombination events. Most surprisingly, diCal2, a method that is designed to explicitly account for recombination, performs considerably worse than other methods under comparison.
Collapse
Affiliation(s)
- Zhi Yan
- Department of Computer Science, Rice University, 6100 Main Street, Houston 77005, TX, USA.
| | - Huw A Ogilvie
- Department of Computer Science, Rice University, 6100 Main Street, Houston 77005, TX, USA.
| | - Luay Nakhleh
- Department of Computer Science, Rice University, 6100 Main Street, Houston 77005, TX, USA.
| |
Collapse
|
8
|
Shipilina D, Pal A, Stankowski S, Chan YF, Barton NH. On the origin and structure of haplotype blocks. Mol Ecol 2023; 32:1441-1457. [PMID: 36433653 PMCID: PMC10946714 DOI: 10.1111/mec.16793] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2022] [Revised: 11/16/2022] [Accepted: 11/18/2022] [Indexed: 11/27/2022]
Abstract
The term "haplotype block" is commonly used in the developing field of haplotype-based inference methods. We argue that the term should be defined based on the structure of the Ancestral Recombination Graph (ARG), which contains complete information on the ancestry of a sample. We use simulated examples to demonstrate key features of the relationship between haplotype blocks and ancestral structure, emphasizing the stochasticity of the processes that generate them. Even the simplest cases of neutrality or of a "hard" selective sweep produce a rich structure, often missed by commonly used statistics. We highlight a number of novel methods for inferring haplotype structure, based on the full ARG, or on a sequence of trees, and illustrate how they can be used to define haplotype blocks using an empirical data set. While the advent of new, computationally efficient methods makes it possible to apply these concepts broadly, they (and additional new methods) could benefit from adding features to explore haplotype blocks, as we define them. Understanding and applying the concept of the haplotype block will be essential to fully exploit long and linked-read sequencing technologies.
Collapse
Affiliation(s)
- Daria Shipilina
- Evolutionary Biology Program, Department of Ecology and Genetics (IEG), Uppsala University, Uppsala, Sweden
- Institute of Science and Technology Austria, Klosterneuburg, Austria
- Swedish Collegium for Advanced Study, Uppsala, Sweden
| | - Arka Pal
- Institute of Science and Technology Austria, Klosterneuburg, Austria
| | - Sean Stankowski
- Institute of Science and Technology Austria, Klosterneuburg, Austria
| | | | - Nicholas H Barton
- Institute of Science and Technology Austria, Klosterneuburg, Austria
| |
Collapse
|
9
|
Noskova E, Abramov N, Iliutkin S, Sidorin A, Dobrynin P, Ulyantsev VI. GADMA2: more efficient and flexible demographic inference from genetic data. Gigascience 2022; 12:giad059. [PMID: 37609916 PMCID: PMC10445054 DOI: 10.1093/gigascience/giad059] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2022] [Revised: 01/31/2023] [Accepted: 07/05/2023] [Indexed: 08/24/2023] Open
Abstract
BACKGROUND Inference of complex demographic histories is a source of information about events that happened in the past of studied populations. Existing methods for demographic inference typically require input from the researcher in the form of a parameterized model. With an increased variety of methods and tools, each with its own interface, the model specification becomes tedious and error-prone. Moreover, optimization algorithms used to find model parameters sometimes turn out to be inefficient, for instance, by being not properly tuned or highly dependent on a user-provided initialization. The open-source software GADMA addresses these problems, providing automatic demographic inference. It proposes a common interface for several likelihood engines and provides global parameters optimization based on a genetic algorithm. RESULTS Here, we introduce the new GADMA2 software and provide a detailed description of the added and expanded features. It has a renovated core code base, new likelihood engines, an updated optimization algorithm, and a flexible setup for automatic model construction. We provide a full overview of GADMA2 enhancements, compare the performance of supported likelihood engines on simulated data, and demonstrate an example of GADMA2 usage on 2 empirical datasets. CONCLUSIONS We demonstrate the better performance of a genetic algorithm in GADMA2 by comparing it to the initial version and other existing optimization approaches. Our experiments on simulated data indicate that GADMA2's likelihood engines are able to provide accurate estimations of demographic parameters even for misspecified models. We improve model parameters for 2 empirical datasets of inbred species.
Collapse
Affiliation(s)
- Ekaterina Noskova
- Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia
| | | | - Stanislav Iliutkin
- Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia
| | - Anton Sidorin
- Laboratory of Biochemical Genetics, St. Petersburg State University, St. Petersburg 199034, Russia
| | - Pavel Dobrynin
- Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia
- Human Genetics Laboratory, Vavilov Institute of General Genetics RAS, Moscow 119991, Russia
| | - Vladimir I Ulyantsev
- Computer Technologies Laboratory, ITMO University, St. Petersburg 197101, Russia
| |
Collapse
|
10
|
Wang Y, Zhao Z, Miao X, Wang Y, Qian X, Chen L, Wang C, Li S. eSMC: a statistical model to infer admixture events from individual genomics data. BMC Genomics 2022; 23:827. [PMCID: PMC9748406 DOI: 10.1186/s12864-022-09033-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2022] [Accepted: 11/21/2022] [Indexed: 12/15/2022] Open
Abstract
Abstract
Background
Inferring historical population admixture events yield essential insights in understanding a species demographic history. Methods are available to infer admixture events in demographic history with extant genetic data from multiple sources. Due to the deficiency in ancient population genetic data, there lacks a method for admixture inference from a single source. Pairwise Sequentially Markovian Coalescent (PSMC) estimates the historical effective population size from lineage genomes of a single individual, based on the distribution of the most recent common ancestor between the diploid’s alleles. However, PSMC does not infer the admixture event.
Results
Here, we proposed eSMC, an extended PSMC model for admixture inference from a single source. We evaluated our model’s performance on both in silico data and real data. We simulated population admixture events at an admixture time range from 5 kya to 100 kya (5 years/generation) with population admix ratio at 1:1, 2:1, 3:1, and 4:1, respectively. The root means the square error is $$\pm 7.61$$
±
7.61
kya for all experiments. Then we implemented our method to infer the historical admixture events in human, donkey and goat populations. The estimated admixture time for both Han and Tibetan individuals range from 60 kya to 80 kya (25 years/generation), while the estimated admixture time for the domesticated donkeys and the goats ranged from 40 kya to 60 kya (8 years/generation) and 40 kya to 100 kya (6 years/generation), respectively. The estimated admixture times were concordance to the time that domestication occurred in human history.
Conclusion
Our eSMC effectively infers the time of the most recent admixture event in history from a single individual’s genomics data. The source code of eSMC is hosted at https://github.com/zachary-zzc/eSMC.
Collapse
|
11
|
Population genomics reveals moderate genetic differentiation between populations of endangered Forest Musk Deer located in Shaanxi and Sichuan. BMC Genomics 2022; 23:668. [PMID: 36138352 PMCID: PMC9503231 DOI: 10.1186/s12864-022-08896-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2021] [Accepted: 09/12/2022] [Indexed: 11/30/2022] Open
Abstract
Background Many endangered species exist in small, genetically depauperate, or inbred populations, hence promoting genetic differentiation and reducing long-term population viability. Forest Musk Deer (Moschus berezovskii) has been subject to illegal hunting for hundreds of years due to the medical and commercial values of musk, resulting in a significant decline in population size. However, it is still unclear to what extent the genetic exchange and inbreeding levels are between geographically isolated populations. By using whole-genome data, we reconstructed the demographic history, evaluated genetic diversity, and characterized the population genetic structure of Forest Musk Deer from one wild population in Sichuan Province and two captive populations from two ex-situ centers in Shaanxi Province. Results SNP calling by GATK resulted in a total of 44,008,662 SNPs. Principal component analysis (PCA), phylogenetic tree (NJ tree), ancestral component analysis (ADMIXTURE) and the ABBA-BABA test separated Sichuan and Shaanxi Forest Musk Deer as two genetic clusters, but no obvious genetic differentiation was observed between the two captive populations. The average pairwise FST value between the populations in Sichuan and Shaanxi ranged from 0.05–0.07, suggesting a low to moderate genetic differentiation. The mean heterozygous SNPs rate was 0.14% (0.11%—0.15%) for Forest Musk Deer at the genomic scale, and varied significantly among three populations (Chi-square = 1.22, p < 0.05, Kruskal–Wallis Test), with the Sichuan population having the lowest (0.11%). The nucleotide diversity of three populations varied significantly (p < 0.05, Kruskal–Wallis Test), with the Sichuan population having the lowest genetic θπ (1.69 × 10–3). Conclusions Genetic diversity of Forest Musk Deer was moderate at the genomic scale compared with other endangered species. Genetic differentiation between populations in Sichuan and Shaanxi may not only result from historical biogeographical factors but also be associated with contemporary human disturbances. Our findings provide scientific aid for the conservation and management of Forest Musk Deer. They can extend the proposed measures at the genomic level to apply to other musk deer species worldwide. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08896-9.
Collapse
|
12
|
Upadhya G, Steinrücken M. Robust inference of population size histories from genomic sequencing data. PLoS Comput Biol 2022; 18:e1010419. [PMID: 36112715 PMCID: PMC9518926 DOI: 10.1371/journal.pcbi.1010419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2021] [Revised: 09/28/2022] [Accepted: 07/21/2022] [Indexed: 02/08/2023] Open
Abstract
Unraveling the complex demographic histories of natural populations is a central problem in population genetics. Understanding past demographic events is of general anthropological interest, but is also an important step in establishing accurate null models when identifying adaptive or disease-associated genetic variation. An important class of tools for inferring past population size changes from genomic sequence data are Coalescent Hidden Markov Models (CHMMs). These models make efficient use of the linkage information in population genomic datasets by using the local genealogies relating sampled individuals as latent states that evolve along the chromosome in an HMM framework. Extending these models to large sample sizes is challenging, since the number of possible latent states increases rapidly. Here, we present our method CHIMP (CHMM History-Inference Maximum-Likelihood Procedure), a novel CHMM method for inferring the size history of a population. It can be applied to large samples (hundreds of haplotypes) and only requires unphased genomes as input. The two implementations of CHIMP that we present here use either the height of the genealogical tree (TMRCA) or the total branch length, respectively, as the latent variable at each position in the genome. The requisite transition and emission probabilities are obtained by numerically solving certain systems of differential equations derived from the ancestral process with recombination. The parameters of the population size history are subsequently inferred using an Expectation-Maximization algorithm. In addition, we implement a composite likelihood scheme to allow the method to scale to large sample sizes. We demonstrate the efficiency and accuracy of our method in a variety of benchmark tests using simulated data and present comparisons to other state-of-the-art methods. Specifically, our implementation using TMRCA as the latent variable shows comparable performance and provides accurate estimates of effective population sizes in intermediate and ancient times. Our method is agnostic to the phasing of the data, which makes it a promising alternative in scenarios where high quality data is not available, and has potential applications for pseudo-haploid data.
Collapse
Affiliation(s)
- Gautam Upadhya
- Department of Physics, University of Chicago, Chicago, Illinois, United States of America
| | - Matthias Steinrücken
- Department of Ecology and Evolution, University of Chicago, Chicago, Illinois, United States of America
- Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- * E-mail:
| |
Collapse
|
13
|
Abstract
The polar bear (Ursus maritimus) has become a symbol of the threat to biodiversity from climate change. Understanding polar bear evolutionary history may provide insights into apex carnivore responses and prospects during periods of extreme environmental perturbations. In recent years, genomic studies have examined bear speciation and population history, including evidence for ancient admixture between polar bears and brown bears (Ursus arctos). Here, we extend our earlier studies of a 130,000- to 115,000-y-old polar bear from the Svalbard Archipelago using a 10× coverage genome sequence and 10 new genomes of polar and brown bears from contemporary zones of overlap in northern Alaska. We demonstrate a dramatic decline in effective population size for this ancient polar bear’s lineage, followed by a modest increase just before its demise. A slightly higher genetic diversity in the ancient polar bear suggests a severe genetic erosion over a prolonged bottleneck in modern polar bears. Statistical fitting of data to alternative admixture graph scenarios favors at least one ancient introgression event from brown bears into the ancestor of polar bears, possibly dating back over 150,000 y. Gene flow was likely bidirectional, but allelic transfer from brown into polar bear is the strongest detected signal, which contrasts with other published work. These findings may have implications for our understanding of climate change impacts: Polar bears, a specialist Arctic lineage, may not only have undergone severe genetic bottlenecks but also been the recipient of generalist, boreal genetic variants from brown bears during critical phases of Northern Hemisphere glacial oscillations.
Collapse
|
14
|
Johri P, Aquadro CF, Beaumont M, Charlesworth B, Excoffier L, Eyre-Walker A, Keightley PD, Lynch M, McVean G, Payseur BA, Pfeifer SP, Stephan W, Jensen JD. Recommendations for improving statistical inference in population genomics. PLoS Biol 2022; 20:e3001669. [PMID: 35639797 PMCID: PMC9154105 DOI: 10.1371/journal.pbio.3001669] [Citation(s) in RCA: 43] [Impact Index Per Article: 21.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023] Open
Abstract
The field of population genomics has grown rapidly in response to the recent advent of affordable, large-scale sequencing technologies. As opposed to the situation during the majority of the 20th century, in which the development of theoretical and statistical population genetic insights outpaced the generation of data to which they could be applied, genomic data are now being produced at a far greater rate than they can be meaningfully analyzed and interpreted. With this wealth of data has come a tendency to focus on fitting specific (and often rather idiosyncratic) models to data, at the expense of a careful exploration of the range of possible underlying evolutionary processes. For example, the approach of directly investigating models of adaptive evolution in each newly sequenced population or species often neglects the fact that a thorough characterization of ubiquitous nonadaptive processes is a prerequisite for accurate inference. We here describe the perils of these tendencies, present our consensus views on current best practices in population genomic data analysis, and highlight areas of statistical inference and theory that are in need of further attention. Thereby, we argue for the importance of defining a biologically relevant baseline model tuned to the details of each new analysis, of skepticism and scrutiny in interpreting model fitting results, and of carefully defining addressable hypotheses and underlying uncertainties.
Collapse
Affiliation(s)
- Parul Johri
- School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
| | - Charles F. Aquadro
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
| | - Mark Beaumont
- School of Biological Sciences, University of Bristol, Bristol, United Kingdom
| | - Brian Charlesworth
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, United Kingdom
| | - Laurent Excoffier
- Institute of Ecology and Evolution, University of Berne, Berne, Switzerland
| | - Adam Eyre-Walker
- School of Life Sciences, University of Sussex, Brighton, United Kingdom
| | - Peter D. Keightley
- Institute of Ecology and Evolution, School of Biological Sciences, University of Edinburgh, Edinburgh, United Kingdom
| | - Michael Lynch
- School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
| | - Gil McVean
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, United Kingdom
| | - Bret A. Payseur
- Laboratory of Genetics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
| | - Susanne P. Pfeifer
- School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
| | | | - Jeffrey D. Jensen
- School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
- * E-mail:
| |
Collapse
|
15
|
Y. C. Brandt D, Wei X, Deng Y, Vaughn AH, Nielsen R. Evaluation of methods for estimating coalescence times using ancestral recombination graphs. Genetics 2022; 221:iyac044. [PMID: 35333304 PMCID: PMC9071567 DOI: 10.1093/genetics/iyac044] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2021] [Accepted: 03/08/2022] [Indexed: 11/12/2022] Open
Abstract
The ancestral recombination graph is a structure that describes the joint genealogies of sampled DNA sequences along the genome. Recent computational methods have made impressive progress toward scalably estimating whole-genome genealogies. In addition to inferring the ancestral recombination graph, some of these methods can also provide ancestral recombination graphs sampled from a defined posterior distribution. Obtaining good samples of ancestral recombination graphs is crucial for quantifying statistical uncertainty and for estimating population genetic parameters such as effective population size, mutation rate, and allele age. Here, we use standard neutral coalescent simulations to benchmark the estimates of pairwise coalescence times from 3 popular ancestral recombination graph inference programs: ARGweaver, Relate, and tsinfer+tsdate. We compare (1) the true coalescence times to the inferred times at each locus; (2) the distribution of coalescence times across all loci to the expected exponential distribution; (3) whether the sampled coalescence times have the properties expected of a valid posterior distribution. We find that inferred coalescence times at each locus are most accurate in ARGweaver, and often more accurate in Relate than in tsinfer+tsdate. However, all 3 methods tend to overestimate small coalescence times and underestimate large ones. Lastly, the posterior distribution of ARGweaver is closer to the expected posterior distribution than Relate's, but this higher accuracy comes at a substantial trade-off in scalability. The best choice of method will depend on the number and length of input sequences and on the goal of downstream analyses, and we provide guidelines for the best practices.
Collapse
Affiliation(s)
- Débora Y. C. Brandt
- Department of Integrative Biology, University of California Berkeley, Berkeley, CA 94720, USA
| | - Xinzhu Wei
- Department of Computational Biology, Cornell University, Ithaca, NY 14850, USA
| | - Yun Deng
- Center for Computational Biology, University of California, Berkeley, CA 94720, USA
| | - Andrew H Vaughn
- Center for Computational Biology, University of California, Berkeley, CA 94720, USA
| | - Rasmus Nielsen
- Department of Integrative Biology, University of California Berkeley, Berkeley, CA 94720, USA
- Center for Computational Biology, University of California, Berkeley, CA 94720, USA
- Department of Statistics, University of California Berkeley, Berkeley, CA 94720, USA
- GLOBE Institute, University of Copenhagen, Copenhagen K 1350, Denmark
| |
Collapse
|
16
|
Chen C, Parejo M, Momeni J, Langa J, Nielsen RO, Shi W, Vingborg R, Kryger P, Bouga M, Estonba A, Meixner M. Population Structure and Diversity in European Honey Bees (Apis mellifera L.)—An Empirical Comparison of Pool and Individual Whole-Genome Sequencing. Genes (Basel) 2022; 13:genes13020182. [PMID: 35205227 PMCID: PMC8872436 DOI: 10.3390/genes13020182] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2021] [Revised: 12/27/2021] [Accepted: 12/30/2021] [Indexed: 01/27/2023] Open
Abstract
Background: Whole-genome sequencing has become routine for population genetic studies. Sequencing of individuals provides maximal data but is rather expensive and fewer samples can be studied. In contrast, sequencing a pool of samples (pool-seq) can provide sufficient data, while presenting less of an economic challenge. Few studies have compared the two approaches to infer population genetic structure and diversity in real datasets. Here, we apply individual sequencing (ind-seq) and pool-seq to the study of Western honey bees (Apis mellifera). Methods: We collected honey bee workers that belonged to 14 populations, including 13 subspecies, totaling 1347 colonies, who were individually (139 individuals) and pool-sequenced (14 pools). We compared allele frequencies, genetic diversity estimates, and population structure as inferred by the two approaches. Results: Pool-seq and ind-seq revealed near identical population structure and genetic diversities, albeit at different costs. While pool-seq provides genome-wide polymorphism data at considerably lower costs, ind-seq can provide additional information, including the identification of population substructures, hybridization, or individual outliers. Conclusions: If costs are not the limiting factor, we recommend using ind-seq, as population genetic structure can be inferred similarly well, with the advantage gained from individual genetic information. Not least, it also significantly reduces the effort required for the collection of numerous samples and their further processing in the laboratory.
Collapse
Affiliation(s)
- Chao Chen
- Institute of Apicultural Research, Chinese Academy of Agricultural Sciences, Beijing 100093, China;
- Key Laboratory of Pollinating Insect Biology, Ministry of Agriculture and Rural Affairs, Beijing 100093, China
- Correspondence: (C.C.); (M.P.)
| | - Melanie Parejo
- Department of Genetics, Physical Anthropology and Animal Physiology, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain; (J.L.); (A.E.)
- Swiss Bee Research Center, Agroscope, 3003 Bern, Switzerland
- Correspondence: (C.C.); (M.P.)
| | - Jamal Momeni
- Eurofins Genomics, 8200 Aarhus, Denmark; (J.M.); (R.O.N.); (R.V.)
| | - Jorge Langa
- Department of Genetics, Physical Anthropology and Animal Physiology, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain; (J.L.); (A.E.)
| | | | - Wei Shi
- Institute of Apicultural Research, Chinese Academy of Agricultural Sciences, Beijing 100093, China;
- Key Laboratory of Pollinating Insect Biology, Ministry of Agriculture and Rural Affairs, Beijing 100093, China
| | | | - Rikke Vingborg
- Eurofins Genomics, 8200 Aarhus, Denmark; (J.M.); (R.O.N.); (R.V.)
| | - Per Kryger
- Department of Agroecology, Aarhus University, 4200 Slagelse, Denmark;
| | - Maria Bouga
- Lab of Agricultural Zoology and Entomology, Agricultural University of Athens, 11855 Athens, Greece;
| | - Andone Estonba
- Department of Genetics, Physical Anthropology and Animal Physiology, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), 48940 Leioa, Spain; (J.L.); (A.E.)
| | | |
Collapse
|
17
|
Liu X, Ogilvie HA, Nakhleh L. Variational inference using approximate likelihood under the coalescent with recombination. Genome Res 2021; 31:2107-2119. [PMID: 34426513 PMCID: PMC8559707 DOI: 10.1101/gr.273631.120] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2020] [Accepted: 08/17/2021] [Indexed: 11/30/2022]
Abstract
Coalescent methods are proven and powerful tools for population genetics, phylogenetics, epidemiology, and other fields. A promising avenue for the analysis of large genomic alignments, which are increasingly common, is coalescent hidden Markov model (coalHMM) methods, but these methods have lacked general usability and flexibility. We introduce a novel method for automatically learning a coalHMM and inferring the posterior distributions of evolutionary parameters using black-box variational inference, with the transition rates between local genealogies derived empirically by simulation. This derivation enables our method to work directly with three or four taxa and through a divide-and-conquer approach with more taxa. Using a simulated data set resembling a human–chimp–gorilla scenario, we show that our method has comparable or better accuracy to previous coalHMM methods. Both species divergence times and population sizes were accurately inferred. The method also infers local genealogies, and we report on their accuracy. Furthermore, we discuss a potential direction for scaling the method to larger data sets through a divide-and-conquer approach. This accuracy means our method is useful now, and by deriving transition rates by simulation, it is flexible enough to enable future implementations of various population models.
Collapse
Affiliation(s)
- Xinhao Liu
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
| | - Huw A Ogilvie
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
| | - Luay Nakhleh
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
| |
Collapse
|
18
|
Nadachowska-Brzyska K, Dutoit L, Smeds L, Kardos M, Gustafsson L, Ellegren H. Genomic inference of contemporary effective population size in a large island population of collared flycatchers (Ficedula albicollis). Mol Ecol 2021; 30:3965-3973. [PMID: 34145933 DOI: 10.1111/mec.16025] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2020] [Revised: 05/28/2021] [Accepted: 06/01/2021] [Indexed: 12/30/2022]
Abstract
Due to its central importance to many aspects of evolutionary biology and population genetics, the long-term effective population size (Ne ) has been estimated for numerous species and populations. However, estimating contemporary Ne is difficult and in practice this parameter is often unknown. In principle, contemporary Ne can be estimated using either analyses of temporal changes in allele frequencies, or the extent of linkage disequilibrium (LD) between unlinked markers. We applied these approaches to estimate contemporary Ne of a relatively recently founded island population of collared flycatchers (Ficedula albicollis). We sequenced the genomes of 85 birds sampled in 1993 and 2015, and applied several temporal methods to estimate Ne at a few thousand (4000-7000). The approach based on LD provided higher estimates of Ne (20,000-32,000) and was associated with high variance, often resulting in infinite Ne . We conclude that whole-genome sequencing data offers new possibilities to estimate high (>1000) contemporary Ne , but also note that such estimates remain challenging, in particular for LD-based methods for contemporary Ne estimation.
Collapse
Affiliation(s)
- Krystyna Nadachowska-Brzyska
- Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden.,Institute of Environmental Sciences, Jagiellonian University, Kraków, Poland
| | - Ludovic Dutoit
- Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden.,Department of Zoology, University of Otago, Dunedin, New Zealand
| | - Linnéa Smeds
- Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
| | - Martin Kardos
- Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, Seattle, WA, USA
| | - Lars Gustafsson
- Department of Animal Ecology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
| | - Hans Ellegren
- Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
| |
Collapse
|
19
|
Excofffier L, Marchi N, Marques DA, Matthey-Doret R, Gouy A, Sousa VC. fastsimcoal2: demographic inference under complex evolutionary scenarios. Bioinformatics 2021; 37:4882-4885. [PMID: 34164653 PMCID: PMC8665742 DOI: 10.1093/bioinformatics/btab468] [Citation(s) in RCA: 97] [Impact Index Per Article: 32.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2021] [Revised: 06/11/2021] [Accepted: 06/22/2021] [Indexed: 01/25/2023] Open
Abstract
Motivation fastsimcoal2 extends fastsimcoal, a continuous time coalescent-based genetic simulation program, by enabling the estimation of demographic parameters under very complex scenarios from the site frequency spectrum under a maximum-likelihood framework. Results Other improvements include multi-threading, handling of population inbreeding, extended input file syntax facilitating the description of complex demographic scenarios, and more efficient simulations of sparsely structured populations and of large chromosomes. Availability and implementation fastsimcoal2 is freely available on http://cmpg.unibe.ch/software/fastsimcoal2/. It includes console versions for Linux, Windows and MacOS, additional scripts for the analysis and visualization of simulated and estimated scenarios, as well as a detailed documentation and ready-to-use examples.
Collapse
Affiliation(s)
- Laurent Excofffier
- Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland.,Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Nina Marchi
- Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland.,Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - David Alexander Marques
- Life Science Division, Natural History Museum Basel, 4051 Basel, Switzerland.,Aquatic Ecology and Evolution, Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland.,Department of Fish Ecology and Evolution, EAWAG swiss Federal institute of Aquatic Science and Technology, Center for Ecology, Evolution and Biogeochemistry, 6047 Kastanienbaum, Switzerland
| | - Remi Matthey-Doret
- Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland.,Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Alexandre Gouy
- Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland.,Gouy Data Consulting, 1026 Denges, Switzerland
| | - Vitor C Sousa
- Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, 3012 Bern, Switzerland.,cE3c - Centre for Ecology, Evolution and Environmental Changes, Faculdade de Ciências da Universidade de Lisboa, University of Lisbon, Campo Grande, 1749-016, Lisbon, Portugal
| |
Collapse
|
20
|
Sellinger TPP, Abu-Awad D, Tellier A. Limits and convergence properties of the sequentially Markovian coalescent. Mol Ecol Resour 2021; 21:2231-2248. [PMID: 33978324 DOI: 10.1111/1755-0998.13416] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2020] [Revised: 04/19/2021] [Accepted: 04/29/2021] [Indexed: 02/07/2023]
Abstract
Several methods based on the sequentially Markovian coalescent (SMC) make use of full genome sequence data from samples to infer population demographic history including past changes in population size, admixture, migration events and population structure. More recently, the original theoretical framework has been extended to allow the simultaneous estimation of population size changes along with other life history traits such as selfing or seed banking. The latter developments enhance the applicability of SMC methods to nonmodel species. Although convergence proofs have been given using simulated data in a few specific cases, an in-depth investigation of the limitations of SMC methods is lacking. In order to explore such limits, we first develop a tool inferring the best case convergence of SMC methods assuming the true underlying coalescent genealogies are known. This tool can be used to quantify the amount and type of information that can be confidently retrieved from given data sets prior to the analysis of the real data. Second, we assess the inference accuracy when the assumptions of SMC approaches are violated due to departures from the model, namely the presence of transposable elements, variable recombination and mutation rates along the sequence, and SNP calling errors. Third, we deliver a new interpretation of SMC methods by highlighting the importance of the transition matrix, which we argue can be used as a set of summary statistics in other statistical inference methods, uncoupling the SMC from hidden Markov models (HMMs). We finally offer recommendations to better apply SMC methods and build adequate data sets under budget constraints.
Collapse
Affiliation(s)
| | - Diala Abu-Awad
- Department of Life Science Systems, Technical University of Munich, Munchen, Germany
| | - Aurélien Tellier
- Department of Life Science Systems, Technical University of Munich, Munchen, Germany
| |
Collapse
|
21
|
Arredondo A, Mourato B, Nguyen K, Boitard S, Rodríguez W, Noûs C, Mazet O, Chikhi L. Inferring number of populations and changes in connectivity under the n-island model. Heredity (Edinb) 2021; 126:896-912. [PMID: 33846579 PMCID: PMC8178352 DOI: 10.1038/s41437-021-00426-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2020] [Revised: 03/11/2021] [Accepted: 03/12/2021] [Indexed: 12/11/2022] Open
Abstract
Inferring the demographic history of species is one of the greatest challenges in populations genetics. This history is often represented as a history of size changes, ignoring population structure. Alternatively, when structure is assumed, it is defined a priori as a population tree and not inferred. Here we propose a framework based on the IICR (Inverse Instantaneous Coalescence Rate). The IICR can be estimated for a single diploid individual using the PSMC method of Li and Durbin (2011). For an isolated panmictic population, the IICR matches the population size history, and this is how the PSMC outputs are generally interpreted. However, it is increasingly acknowledged that the IICR is a function of the demographic model and sampling scheme with limited connection to population size changes. Our method fits observed IICR curves of diploid individuals with IICR curves obtained under piecewise stationary symmetrical island models. In our models we assume a fixed number of time periods during which gene flow is constant, but gene flow is allowed to change between time periods. We infer the number of islands, their sizes, the periods at which connectivity changes and the corresponding rates of connectivity. Validation with simulated data showed that the method can accurately recover most of the scenario parameters. Our application to a set of five human PSMCs yielded demographic histories that are in agreement with previous studies using similar methods and with recent research suggesting ancient human structure. They are in contrast with the view of human evolution consisting of one ancestral population branching into three large continental and panmictic populations with varying degrees of connectivity and no population structure within each continent.
Collapse
Affiliation(s)
- Armando Arredondo
- Université de Toulouse, Institut National des Sciences Appliquées, Institut de Mathématiques de Toulouse, Toulouse, France. .,Institut de Mathématiques de Toulouse; UMR5219. Université de Toulouse, Toulouse, France.
| | - Beatriz Mourato
- Institut de Mathématiques de Toulouse; UMR5219. Université de Toulouse, Toulouse, France.,Instituto Gulbenkian de Ciência, Oeiras, Portugal
| | - Khoa Nguyen
- Université de Toulouse, Institut National des Sciences Appliquées, Institut de Mathématiques de Toulouse, Toulouse, France
| | - Simon Boitard
- CBGP, Université de Montpellier, CIRAD, INRAE, Institut Agro, IRD, Montpellier, France
| | - Willy Rodríguez
- Institut de Mathématiques de Toulouse; UMR5219. Université de Toulouse, Toulouse, France.,ENAC - Ecole Nationale de l'Aviation Civile, Université de Toulouse, Toulouse, France
| | | | - Olivier Mazet
- Université de Toulouse, Institut National des Sciences Appliquées, Institut de Mathématiques de Toulouse, Toulouse, France.,Institut de Mathématiques de Toulouse; UMR5219. Université de Toulouse, Toulouse, France
| | - Lounès Chikhi
- Instituto Gulbenkian de Ciência, Oeiras, Portugal. .,Laboratoire Évolution & Diversité Biologique (EDB UMR 5174), CNRS, IRD, UPS, Université de Toulouse Midi-Pyrénées, Toulouse, France.
| |
Collapse
|
22
|
Henderson D, Zhu S(J, Cole CB, Lunter G. Demographic inference from multiple whole genomes using a particle filter for continuous Markov jump processes. PLoS One 2021; 16:e0247647. [PMID: 33651801 PMCID: PMC7924771 DOI: 10.1371/journal.pone.0247647] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2020] [Accepted: 02/10/2021] [Indexed: 12/12/2022] Open
Abstract
Demographic events shape a population's genetic diversity, a process described by the coalescent-with-recombination model that relates demography and genetics by an unobserved sequence of genealogies along the genome. As the space of genealogies over genomes is large and complex, inference under this model is challenging. Formulating the coalescent-with-recombination model as a continuous-time and -space Markov jump process, we develop a particle filter for such processes, and use waypoints that under appropriate conditions allow the problem to be reduced to the discrete-time case. To improve inference, we generalise the Auxiliary Particle Filter for discrete-time models, and use Variational Bayes to model the uncertainty in parameter estimates for rare events, avoiding biases seen with Expectation Maximization. Using real and simulated genomes, we show that past population sizes can be accurately inferred over a larger range of epochs than was previously possible, opening the possibility of jointly analyzing multiple genomes under complex demographic models. Code is available at https://github.com/luntergroup/smcsmc.
Collapse
Affiliation(s)
| | - Sha (Joe) Zhu
- Wellcome Centre for Human Genetics, Oxford, United Kingdom
- Big Data Institute, Oxford, United Kingdom
| | - Christopher B. Cole
- MRC Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, Headington, Oxford, United Kingdom
| | - Gerton Lunter
- MRC Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, Headington, Oxford, United Kingdom
- Department of Epidemiology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
| |
Collapse
|
23
|
Johri P, Riall K, Becher H, Excoffier L, Charlesworth B, Jensen JD. The Impact of Purifying and Background Selection on the Inference of Population History: Problems and Prospects. Mol Biol Evol 2021; 38:2986-3003. [PMID: 33591322 PMCID: PMC8233493 DOI: 10.1093/molbev/msab050] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022] Open
Abstract
Current procedures for inferring population history generally assume complete neutrality—that is, they neglect both direct selection and the effects of selection on linked sites. We here examine how the presence of direct purifying selection and background selection may bias demographic inference by evaluating two commonly-used methods (MSMC and fastsimcoal2), specifically studying how the underlying shape of the distribution of fitness effects and the fraction of directly selected sites interact with demographic parameter estimation. The results show that, even after masking functional genomic regions, background selection may cause the mis-inference of population growth under models of both constant population size and decline. This effect is amplified as the strength of purifying selection and the density of directly selected sites increases, as indicated by the distortion of the site frequency spectrum and levels of nucleotide diversity at linked neutral sites. We also show how simulated changes in background selection effects caused by population size changes can be predicted analytically. We propose a potential method for correcting for the mis-inference of population growth caused by selection. By treating the distribution of fitness effect as a nuisance parameter and averaging across all potential realizations, we demonstrate that even directly selected sites can be used to infer demographic histories with reasonable accuracy.
Collapse
Affiliation(s)
- Parul Johri
- School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Kellen Riall
- School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - Hannes Becher
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, United Kingdom
| | - Laurent Excoffier
- Institute of Ecology and Evolution, University of Berne, Berne, Switzerland.,Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Brian Charlesworth
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, United Kingdom
| | - Jeffrey D Jensen
- School of Life Sciences, Arizona State University, Tempe, AZ, USA
| |
Collapse
|
24
|
Johri P, Riall K, Becher H, Excoffier L, Charlesworth B, Jensen JD. The impact of purifying and background selection on the inference of population history: problems and prospects. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2021. [PMID: 33501439 PMCID: PMC7836109 DOI: 10.1101/2020.04.28.066365] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
Current procedures for inferring population history generally assume complete neutrality - that is, they neglect both direct selection and the effects of selection on linked sites. We here examine how the presence of direct purifying selection and background selection may bias demographic inference by evaluating two commonly-used methods (MSMC and fastsimcoal2), specifically studying how the underlying shape of the distribution of fitness effects (DFE) and the fraction of directly selected sites interact with demographic parameter estimation. The results show that, even after masking functional genomic regions, background selection may cause the mis-inference of population growth under models of both constant population size and decline. This effect is amplified as the strength of purifying selection and the density of directly selected sites increases, as indicated by the distortion of the site frequency spectrum and levels of nucleotide diversity at linked neutral sites. We also show how simulated changes in background selection effects caused by population size changes can be predicted analytically. We propose a potential method for correcting for the mis-inference of population growth caused by selection. By treating the DFE as a nuisance parameter and averaging across all potential realizations, we demonstrate that even directly selected sites can be used to infer demographic histories with reasonable accuracy.
Collapse
Affiliation(s)
- Parul Johri
- School of Life Sciences, Arizona State University, Tempe, AZ 85287, USA
| | - Kellen Riall
- School of Life Sciences, Arizona State University, Tempe, AZ 85287, USA
| | - Hannes Becher
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, EH9 3FL, United Kingdom
| | - Laurent Excoffier
- Institute of Ecology and Evolution, University of Berne, Berne 3012, Switzerland.,Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
| | - Brian Charlesworth
- Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, EH9 3FL, United Kingdom
| | - Jeffrey D Jensen
- School of Life Sciences, Arizona State University, Tempe, AZ 85287, USA
| |
Collapse
|
25
|
Fraïsse C, Popovic I, Mazoyer C, Spataro B, Delmotte S, Romiguier J, Loire É, Simon A, Galtier N, Duret L, Bierne N, Vekemans X, Roux C. DILS: Demographic inferences with linked selection by using ABC. Mol Ecol Resour 2021; 21:2629-2644. [PMID: 33448666 DOI: 10.1111/1755-0998.13323] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2020] [Revised: 12/09/2020] [Accepted: 12/21/2020] [Indexed: 01/21/2023]
Abstract
We present DILS, a deployable statistical analysis platform for conducting demographic inferences with linked selection from population genomic data using an Approximate Bayesian Computation framework. DILS takes as input single-population or two-population data sets (multilocus fasta sequences) and performs three types of analyses in a hierarchical manner, identifying: (a) the best demographic model to study the importance of gene flow and population size change on the genetic patterns of polymorphism and divergence, (b) the best genomic model to determine whether the effective size Ne and migration rate N, m are heterogeneously distributed along the genome (implying linked selection) and (c) loci in genomic regions most associated with barriers to gene flow. Also available via a Web interface, an objective of DILS is to facilitate collaborative research in speciation genomics. Here, we show the performance and limitations of DILS by using simulations and finally apply the method to published data on a divergence continuum composed by 28 pairs of Mytilus mussel populations/species.
Collapse
Affiliation(s)
- Christelle Fraïsse
- Institute of Science and Technology Austria, Klosterneuœburg, Austria.,Univ. Lille, CNRS, UMR 8198 - Evo-Eco-Paleo, Lille, France
| | - Iva Popovic
- School of Biological Sciences, University of Queensland, St Lucia, Qld, Australia
| | | | - Bruno Spataro
- Laboratoire de Biologie et Biométrie Évolutive CNRS UMR 5558, Université Claude Bernard, Lyon, France
| | - Stéphane Delmotte
- Laboratoire de Biologie et Biométrie Évolutive CNRS UMR 5558, Université Claude Bernard, Lyon, France
| | | | - Étienne Loire
- Centre de Coopération Internationale en Recherche Agronomique pour le Développement (CIRAD), UMR, ASTRE, Montpellier, France
| | - Alexis Simon
- ISEM, Univ Montpellier, CNRS, EPHE, IRD, Montpellier, France
| | - Nicolas Galtier
- ISEM, Univ Montpellier, CNRS, EPHE, IRD, Montpellier, France
| | - Laurent Duret
- Laboratoire de Biologie et Biométrie Évolutive CNRS UMR 5558, Université Claude Bernard, Lyon, France
| | - Nicolas Bierne
- ISEM, Univ Montpellier, CNRS, EPHE, IRD, Montpellier, France
| | | | - Camille Roux
- Univ. Lille, CNRS, UMR 8198 - Evo-Eco-Paleo, Lille, France
| |
Collapse
|
26
|
Marchi N, Excoffier L. Gene flow as a simple cause for an excess of high-frequency-derived alleles. Evol Appl 2020; 13:2254-2263. [PMID: 33005222 PMCID: PMC7513730 DOI: 10.1111/eva.12998] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2020] [Revised: 04/30/2020] [Accepted: 05/04/2020] [Indexed: 01/19/2023] Open
Abstract
Most human populations exhibit an excess of high-frequency variants, leading to a U-shaped site-frequency spectrum (uSFS). This pattern has been generally interpreted as a signature of ongoing episodes of positive selection, or as evidence for a mis-assignment of ancestral/derived allelic states, but uSFS has also been observed in populations receiving gene flow from a ghost population, in structured populations, or after range expansions. In order to better explain the prevalence of high-frequency variants in humans and other populations, we describe here which patterns of gene flow and population demography can lead to uSFS by using extensive coalescent simulations. We find that uSFS can often be observed in a population if gene flow brings a few ancestral alleles from a well-differentiated population. Gene flow can either consist in single pulses of admixture or continuous immigration, but different demographic conditions are necessary to observe uSFS in these two scenarios. Indeed, an extremely low and recent gene flow is required in the case of single admixture events, while with continuous immigration, uSFS occurs only if gene flow started recently at a high rate or if it lasted for a long time at a low rate. Overall, we find that a neutral uSFS occurs under more restrictive conditions in populations having received single pulses of gene flow than in populations exposed to continuous gene flow. We also show that the uSFS observed in human populations from the 1000 Genomes Project can easily be explained by gene flow from surrounding populations without requiring past episodes of positive selection. These results imply that uSFS should be common in non-isolated populations, such as most wild or domesticated plants and animals.
Collapse
Affiliation(s)
- Nina Marchi
- CMPGInstitute of Ecology and EvolutionUniversity of BerneBerneSwitzerland
- Swiss Institute of BioinformaticsLausanneSwitzerland
| | - Laurent Excoffier
- CMPGInstitute of Ecology and EvolutionUniversity of BerneBerneSwitzerland
- Swiss Institute of BioinformaticsLausanneSwitzerland
| |
Collapse
|
27
|
Sanchez T, Cury J, Charpiat G, Jay F. Deep learning for population size history inference: Design, comparison and combination with approximate Bayesian computation. Mol Ecol Resour 2020; 21:2645-2660. [DOI: 10.1111/1755-0998.13224] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2020] [Revised: 06/19/2020] [Accepted: 07/02/2020] [Indexed: 12/28/2022]
Affiliation(s)
- Théophile Sanchez
- Laboratoire de Recherche en Informatique CNRS UMR 8623 Université Paris‐Saclay Orsay France
| | - Jean Cury
- Laboratoire de Recherche en Informatique CNRS UMR 8623 Université Paris‐Saclay Orsay France
| | - Guillaume Charpiat
- Laboratoire de Recherche en Informatique CNRS UMR 8623 Université Paris‐Saclay Orsay France
| | - Flora Jay
- Laboratoire de Recherche en Informatique CNRS UMR 8623 Université Paris‐Saclay Orsay France
| |
Collapse
|
28
|
Sankararaman S. Methods for detecting introgressed archaic sequences. Curr Opin Genet Dev 2020; 62:85-90. [PMID: 32717667 PMCID: PMC7484293 DOI: 10.1016/j.gde.2020.05.026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2020] [Revised: 05/12/2020] [Accepted: 05/22/2020] [Indexed: 11/16/2022]
Abstract
Analysis of genome sequences from archaic and modern humans have revealed multiple episodes of admixture between highly-diverged population groups. Statistical methods that attempt to localize DNA segments introduced by these events offer a powerful tool to investigate recent human evolution. We review recent advances in methods for detecting introgressed sequences.
Collapse
Affiliation(s)
- Sriram Sankararaman
- Department of Computer Science, University of California, Los Angeles, CA 90095, United States; Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA 90095, United States; Department of Computational Medicine, University of California, Los Angeles, CA 90095, United States.
| |
Collapse
|
29
|
Leitwein M, Duranton M, Rougemont Q, Gagnaire PA, Bernatchez L. Using Haplotype Information for Conservation Genomics. Trends Ecol Evol 2020; 35:245-258. [DOI: 10.1016/j.tree.2019.10.012] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2019] [Revised: 10/18/2019] [Accepted: 10/28/2019] [Indexed: 12/19/2022]
|
30
|
Salter-Townshend M, Myers S. Fine-Scale Inference of Ancestry Segments Without Prior Knowledge of Admixing Groups. Genetics 2019; 212:869-889. [PMID: 31123038 PMCID: PMC6614886 DOI: 10.1534/genetics.119.302139] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2019] [Accepted: 05/18/2019] [Indexed: 12/31/2022] Open
Abstract
We present an algorithm for inferring ancestry segments and characterizing admixture events, which involve an arbitrary number of genetically differentiated groups coming together. This allows inference of the demographic history of the species, properties of admixing groups, identification of signatures of natural selection, and may aid disease gene mapping. The algorithm employs nested hidden Markov models to obtain local ancestry estimation along the genome for each admixed individual. In a range of simulations, the accuracy of these estimates equals or exceeds leading existing methods. Moreover, and unlike these approaches, we do not require any prior knowledge of the relationship between subgroups of donor reference haplotypes and the unseen mixing ancestral populations. Our approach infers these in terms of conditional "copying probabilities." In application to the Human Genome Diversity Project, we corroborate many previously inferred admixture events (e.g., an ancient admixture event in the Kalash). We further identify novel events such as complex four-way admixture in San-Khomani individuals, and show that Eastern European populations possess [Formula: see text] ancestry from a group resembling modern-day central Asians. We also identify evidence of recent natural selection favoring sub-Saharan ancestry at the human leukocyte antigen (HLA) region, across North African individuals. We make available an R and C++ software library, which we term MOSAIC (which stands for MOSAIC Organizes Segments of Ancestry In Chromosomes).
Collapse
Affiliation(s)
| | - Simon Myers
- Dept. of Statistics, University of Oxford and Wellcome Trust Centre for Human Genetics, Oxford, UK
| |
Collapse
|