1
|
Fan WTL, Wakeley J. Latent mutations in the ancestries of alleles under selection. Theor Popul Biol 2024:S0040-5809(24)00041-8. [PMID: 38697365 DOI: 10.1016/j.tpb.2024.04.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Revised: 04/23/2024] [Accepted: 04/29/2024] [Indexed: 05/05/2024]
Abstract
We consider a single genetic locus with two alleles A1 and A2 in a large haploid population. The locus is subject to selection and two-way, or recurrent, mutation. Assuming the allele frequencies follow a Wright-Fisher diffusion and have reached stationarity, we describe the asymptotic behaviors of the conditional gene genealogy and the latent mutations of a sample with known allele counts, when the count n1 of allele A1 is fixed, and when either or both the sample size n and the selection strength |α| tend to infinity. Our study extends previous work under neutrality to the case of non-neutral rare alleles, asserting that when selection is not too strong relative to the sample size, even if it is strongly positive or strongly negative in the usual sense (α→-∞ or α→+∞), the number of latent mutations of the n1 copies of allele A1 follows the same distribution as the number of alleles in the Ewens sampling formula. On the other hand, very strong positive selection relative to the sample size leads to neutral gene genealogies with a single ancient latent mutation. We also demonstrate robustness of our asymptotic results against changing population sizes, when one of |α| or n is large.
Affiliation(s)
- Wai-Tong Louis Fan
- Department of Mathematics, Indiana University, 831 East 3rd St, Bloomington, 47405, IN, USA; Department of Organismic and Evolutionary Biology, Harvard University, 16 Divinity Ave, Cambridge, 02138, MA, USA.
- John Wakeley
- Department of Organismic and Evolutionary Biology, Harvard University, 16 Divinity Ave, Cambridge, 02138, MA, USA.
|
2
|
Schraiber JG, Edge MD, Pennell M. Unifying approaches from statistical genetics and phylogenetics for mapping phenotypes in structured populations. bioRxiv 2024:2024.02.10.579721. [PMID: 38496530 PMCID: PMC10942266 DOI: 10.1101/2024.02.10.579721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2024]
Abstract
In both statistical genetics and phylogenetics, a major goal is to identify correlations between genetic loci or other aspects of the phenotype or environment and a focal trait. In these two fields, there are sophisticated but disparate statistical traditions aimed at these tasks. The disconnect between their respective approaches is becoming untenable as questions in medicine, conservation biology, and evolutionary biology increasingly rely on integrating data from within and among species, and once-clear conceptual divisions are becoming increasingly blurred. To help bridge this divide, we derive a general model describing the covariance between the genetic contributions to the quantitative phenotypes of different individuals. Taking this approach shows that standard models in both statistical genetics (e.g., Genome-Wide Association Studies; GWAS) and phylogenetic comparative biology (e.g., phylogenetic regression) can be interpreted as special cases of this more general quantitative-genetic model. The fact that these models share the same core architecture means that we can build a unified understanding of the strengths and limitations of different methods for controlling for genetic structure when testing for associations. We develop intuition for why and when spurious correlations may occur using analytical theory and conduct population-genetic and phylogenetic simulations of quantitative traits. The structural similarity of problems in statistical genetics and phylogenetics enables us to take methodological advances from one field and apply them in the other. We demonstrate this by showing how a standard GWAS technique, namely including both the genetic relatedness matrix (GRM) and its leading eigenvectors (corresponding to the principal components of the genotype matrix) in a regression model, can mitigate spurious correlations in phylogenetic analyses. As a case study of this, we re-examine an analysis testing for co-evolution of expression levels between genes across a fungal phylogeny, and show that including covariance matrix eigenvectors as covariates decreases the false positive rate while simultaneously increasing the true positive rate. More generally, this work provides a foundation for more integrative approaches for understanding the genetic architecture of phenotypes and how evolutionary processes shape it.
|
3
|
Zurita AMI, Kyriazis CC, Lohmueller KE. The impact of non-neutral synonymous mutations when inferring selection on non-synonymous mutations. bioRxiv 2024:2024.02.07.579314. [PMID: 38370782 PMCID: PMC10871344 DOI: 10.1101/2024.02.07.579314] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
The distribution of fitness effects (DFE) describes the proportions of new mutations that have different effects on reproductive fitness. Accurate measurements of the DFE are important because the DFE is a fundamental parameter in evolutionary genetics and has implications for our understanding of other phenomena like complex disease or inbreeding depression. Current computational methods to infer the DFE for nonsynonymous mutations from natural variation first estimate demographic parameters from synonymous variants to control for the effects of demography and background selection. Then, conditional on these parameters, the DFE is inferred for nonsynonymous mutations. This approach relies on the assumption that synonymous variants are neutrally evolving. However, some evidence points toward synonymous mutations having measurable effects on fitness. To test whether selection on synonymous mutations affects inference of the DFE of nonsynonymous mutations, we simulated several possible models of selection on synonymous mutations using SLiM and attempted to recover the DFE of nonsynonymous mutations using Fit∂a∂i, a common method for DFE inference. Our results show that the presence of selection on synonymous variants leads to incorrect inferences of recent population growth. Furthermore, under certain parameter combinations, inferences of the DFE can have an inflated proportion of highly deleterious nonsynonymous mutations. However, this bias can be eliminated if the correct demographic parameters are used for DFE inference instead of the biased ones inferred from synonymous variants. Our work demonstrates how unmodeled selection on synonymous mutations may affect downstream inferences of the DFE.
Affiliation(s)
- Aina Martinez I Zurita
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, USA
- Christopher C Kyriazis
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, USA
- Kirk E Lohmueller
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, USA
- Interdepartmental Program in Bioinformatics, University of California, Los Angeles, USA
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, USA
|
4
|
Mah JC, Lohmueller KE, Garud N. Inference of the demographic histories and selective effects of human gut commensal microbiota over the course of human history. bioRxiv 2023:2023.11.09.566454. [PMID: 38014007 PMCID: PMC10680615 DOI: 10.1101/2023.11.09.566454] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/29/2023]
Abstract
Despite the importance of gut commensal microbiota to human health, there is little knowledge about their evolutionary histories, including their population demographic histories and their distributions of fitness effects (DFE) of new mutations. Here, we infer the demographic histories and DFEs of 27 of the most highly prevalent and abundant commensal gut microbial species in North Americans over timescales exceeding human generations using a collection of lineages inferred from a panel of healthy hosts. We find overall reductions in genetic variation among commensal gut microbes sampled from a Western population relative to an African rural population. Additionally, some species in North American microbiomes display contractions in population size and others expansions, potentially occurring at several key historical moments in human history. DFEs across species vary from highly to mildly deleterious, with accessory genes experiencing more drift compared to core genes. Within genera, DFEs tend to be more congruent, reflective of underlying phylogenetic relationships. Taken together, these findings suggest that human commensal gut microbes have distinct evolutionary histories, possibly reflecting the unique roles of individual members of the microbiome.
|
5
|
Antinucci M, Comas D, Calafell F. Population history modulates the fitness effects of Copy Number Variation in the Roma. Hum Genet 2023; 142:1327-1343. [PMID: 37311904 PMCID: PMC10449987 DOI: 10.1007/s00439-023-02579-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Accepted: 06/02/2023] [Indexed: 06/15/2023]
Abstract
We provide the first whole genome Copy Number Variant (CNV) study addressing Roma, along with reference populations from South Asia, the Middle East and Europe. Using CNV calling software for short-read sequence data, we identified 3171 deletions and 489 duplications. Taking into account the known population history of the Roma, as inferred from whole genome nucleotide variation, we could discern how this history has shaped CNV variation. As expected, patterns of deletion variation, but not duplication, in the Roma followed those obtained from single nucleotide polymorphisms (SNPs). Reduced effective population size resulting in slightly relaxed natural selection may explain our observation of an increase in intronic (but not exonic) deletions within Loss of Function (LoF)-intolerant genes. Over-representation analysis for LoF-intolerant gene sets hosting intronic deletions highlights a substantial accumulation of shared biological processes in Roma, intriguingly related to signaling, nervous system and development features, which may be related to the known profile of private disease in the population. Finally, we show the link between deletions and known trait-related SNPs reported in the genome-wide association study (GWAS) catalog, which exhibited even frequency distributions among the studied populations. This suggests that, in general human populations, the strong association between deletions and SNPs associated to biomedical conditions and traits could be widespread across continental populations, reflecting a common background of potentially disease/trait-related CNVs.
Affiliation(s)
- Marco Antinucci
- Institute of Evolutionary Biology (UPF-CSIC), Department of Medicine and Life Sciences, Universitat Pompeu Fabra, Barcelona, Spain
- David Comas
- Institute of Evolutionary Biology (UPF-CSIC), Department of Medicine and Life Sciences, Universitat Pompeu Fabra, Barcelona, Spain
- Francesc Calafell
- Institute of Evolutionary Biology (UPF-CSIC), Department of Medicine and Life Sciences, Universitat Pompeu Fabra, Barcelona, Spain.
|
6
|
Wakeley J, Fan WT(L), Koch E, Sunyaev S. Recurrent mutation in the ancestry of a rare variant. Genetics 2023; 224:iyad049. [PMID: 36967220 PMCID: PMC10324944 DOI: 10.1093/genetics/iyad049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Revised: 01/30/2023] [Accepted: 03/08/2023] [Indexed: 03/28/2023] Open
Abstract
Recurrent mutation produces multiple copies of the same allele which may be co-segregating in a population. Yet, most analyses of allele-frequency or site-frequency spectra assume that all observed copies of an allele trace back to a single mutation. We develop a sampling theory for the number of latent mutations in the ancestry of a rare variant, specifically a variant observed in relatively small count in a large sample. Our results follow from the statistical independence of low-count mutations, which we show to hold for the standard neutral coalescent or diffusion model of population genetics as well as for more general coalescent trees. For populations of constant size, these counts are distributed like the number of alleles in the Ewens sampling formula. We develop a Poisson sampling model for populations of varying size and illustrate it using new results for site-frequency spectra in an exponentially growing population. We apply our model to a large data set of human SNPs and use it to explain dramatic differences in site-frequency spectra across the range of mutation rates in the human genome.
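The constant-size statement above can be illustrated with a small calculation. Under the Ewens sampling formula, the number of alleles K in a sample of size n has P(K = k) = c(n, k) θ^k / (θ(θ+1)...(θ+n-1)), where c(n, k) is an unsigned Stirling number of the first kind. The sketch below evaluates this distribution; the function name and the use of `theta` as a generic scaled mutation parameter are assumptions made for illustration, and how `theta` and `n` map onto the quantities in the paper is not specified here.

```python
def ewens_num_alleles_pmf(n, theta):
    """Return [P(K=0), ..., P(K=n)] for the number of alleles K
    in a sample of size n under the Ewens sampling formula.

    theta is a generic scaled mutation parameter (illustrative assumption).
    """
    # Unsigned Stirling numbers of the first kind, built row by row with
    # c(m+1, k) = m * c(m, k) + c(m, k-1), starting from c(0, 0) = 1.
    c = [1]
    for m in range(n):
        new = [0] * (m + 2)
        for k in range(m + 1):
            new[k] += m * c[k]
            new[k + 1] += c[k]
        c = new

    # Rising factorial theta * (theta + 1) * ... * (theta + n - 1).
    rising = 1
    for i in range(n):
        rising *= theta + i

    return [c[k] * theta**k / rising for k in range(n + 1)]


pmf = ewens_num_alleles_pmf(2, 1.0)  # [0.0, 0.5, 0.5]
```

With n = 2 and θ = 1 the two sampled copies are equally likely to share one allele or carry two, which matches the familiar probability 1/(1+θ) that two copies are identical.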
Affiliation(s)
- John Wakeley
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
- Wai-Tong (Louis) Fan
- Department of Mathematics, Indiana University, Bloomington, IN 47405, USA
- Center of Mathematical Sciences and Applications, Harvard University, Cambridge, MA 02138, USA
- Evan Koch
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Division of Genetics, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
- Shamil Sunyaev
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Division of Genetics, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
|
7
|
Townsend C, Ferraro JV, Habecker H, Flinn MV. Human cooperation and evolutionary transitions in individuality. Philos Trans R Soc Lond B Biol Sci 2023; 378:20210414. [PMID: 36688393 PMCID: PMC9869453 DOI: 10.1098/rstb.2021.0414] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 01/24/2023] Open
Abstract
A major evolutionary transition in individuality involves the formation of a cooperative group and the transformation of that group into an evolutionary entity. Human cooperation shares principles with those of multicellular organisms that have undergone transitions in individuality: division of labour, communication, and fitness interdependence. After the split from the last common ancestor of hominoids, early hominins adapted to an increasingly terrestrial niche for several million years. We posit that new challenges in this niche set in motion a positive feedback loop in selection pressure for cooperation that ratcheted coevolutionary changes in sociality, communication, brains, cognition, kin relations and technology, eventually resulting in egalitarian societies with suppressed competition and rapid cumulative culture. The increasing pace of information innovation and transmission became a key aspect of the evolutionary niche that enabled humans to become formidable cooperators with explosive population growth, the ability to cooperate and compete in groups of millions, and emergent social norms, e.g. private property. Despite considerable fitness interdependence, the rise of private property, in concert with population explosion and socioeconomic inequality, subverts potential transition of human groups into evolutionary entities due to resurgence of latent competition and conflict. This article is part of the theme issue 'Human socio-cultural evolution in light of evolutionary transitions'.
Affiliation(s)
- Cathryn Townsend
- Department of Anthropology, Baylor University, Waco, TX 76798-7334, USA
- Joseph V. Ferraro
- Department of Anthropology, Baylor University, Waco, TX 76798-7334, USA
- Heather Habecker
- Department of Psychology and Neuroscience, Baylor University, Waco, TX 76798-7334, USA
- Mark V. Flinn
- Department of Anthropology, Baylor University, Waco, TX 76798-7334, USA
|
8
|
Wehbi SS, Zu Dohna H. A comparative analysis of L1 retrotransposition activities in human genomes suggests an ongoing increase in L1 number despite an evolutionary trend towards lower activity. Mob DNA 2021; 12:26. [PMID: 34782009 PMCID: PMC8594186 DOI: 10.1186/s13100-021-00255-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2021] [Accepted: 10/26/2021] [Indexed: 11/18/2022] Open
Abstract
Background: LINE-1 (Long Interspersed Nuclear Elements, L1) retrotransposons are the only autonomously active transposable elements in the human genome. The evolution of L1 retrotransposition rates and its implications for L1 dynamics are poorly understood. Retrotransposition rates are commonly measured in cell culture-based assays, but it is unclear how well these measurements provide insight into L1 population dynamics. This study applied comparative methods to estimate parameters for the evolution of retrotransposition rates, and infer L1 dynamics from these estimates. Results: Our results show that the rates at which new L1s emerge in the human population correlate positively to cell-culture based retrotransposition activities, that there is an evolutionary trend towards lower retrotransposition activity, and that this evolutionary trend is not sufficient to counter-balance the increase in active L1s resulting from continuing retrotransposition. Conclusions: Together, these findings support a model of the population-level L1 retrotransposition dynamics that is consistent with prior expectations and indicate the remaining gaps in the understanding of L1 dynamics in human genomes.
Affiliation(s)
- Sawsan Sami Wehbi
- Department of Biology, American University of Beirut, Beirut, Lebanon
- Heinrich Zu Dohna
- Department of Biology, American University of Beirut, Beirut, Lebanon.
|
9
|
Helmstetter AJ, Cable S, Rakotonasolo F, Rabarijaona R, Rakotoarinivo M, Eiserhardt WL, Baker WJ, Papadopulos AST. The demographic history of Madagascan micro-endemics: have rare species always been rare? Proc Biol Sci 2021; 288:20210957. [PMID: 34547905 PMCID: PMC8456134 DOI: 10.1098/rspb.2021.0957] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/23/2021] [Accepted: 08/25/2021] [Indexed: 01/25/2023] Open
Abstract
Extinction has increased as human activities impact ecosystems, yet relatively few species have conservation assessments. Novel approaches are needed to highlight threatened species that are currently data-deficient. Many Madagascan plant species have extremely narrow ranges, but this may not have always been the case; it is unclear how the island's diverse flora evolved. To assess this, we generated restriction-site associated DNA sequence data for 10 Madagascan plant species, estimated effective population size (Ne) for each species and compared this to census (Nc) sizes. In each case, Ne was an order of magnitude larger than Nc, signifying rapid, recent population decline. We then estimated species' demographic history, tracking changes in Ne over time. We show that it is possible to predict extinction risk, particularly in the most threatened species. Furthermore, simulations showed that our approach has the power to detect population decline during the Anthropocene. Our analyses reveal that Madagascar's micro-endemics were not always rare, having experienced a rapid decline in their recent history. This casts further uncertainty over the processes that generated Madagascar's exceptional biodiversity. Our approach targets data-deficient species in need of conservation assessment, particularly in regions where human modification of the environment has been rapid.
Affiliation(s)
- Andrew J. Helmstetter
- Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AE, UK
- Institut de Recherche pour le Développement (IRD), UMR-DIADE, 911 Avenue Agropolis, BP 64501, Montpellier 34394, France
- Stuart Cable
- Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AE, UK
- Kew Madagascar Conservation Centre, Lot II J 131 B Ambodivoanjo, Ivandry, Antananarivo 101, Madagascar
- Franck Rakotonasolo
- Kew Madagascar Conservation Centre, Lot II J 131 B Ambodivoanjo, Ivandry, Antananarivo 101, Madagascar
- Romer Rabarijaona
- Kew Madagascar Conservation Centre, Lot II J 131 B Ambodivoanjo, Ivandry, Antananarivo 101, Madagascar
- Mijoro Rakotoarinivo
- Mention Biologie et Ecologie Végétales, Faculté des Sciences, Université d'Antananarivo, Antananarivo BP 906101, Madagascar
- Wolf L. Eiserhardt
- Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AE, UK
- Department of Biology, Aarhus University, Aarhus, Denmark
- Alexander S. T. Papadopulos
- Royal Botanic Gardens, Kew, Richmond, Surrey TW9 3AE, UK
- Molecular Ecology and Evolution Bangor, Environment Centre Wales, School of Natural Sciences, Bangor University, Bangor LL57 2UW, UK
|
10
|
Yazar M, Özbek P. In Silico Tools and Approaches for the Prediction of Functional and Structural Effects of Single-Nucleotide Polymorphisms on Proteins: An Expert Review. OMICS: A Journal of Integrative Biology 2020; 25:23-37. [PMID: 33058752 DOI: 10.1089/omi.2020.0141] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 12/18/2022]
Abstract
Single-nucleotide polymorphisms (SNPs) are single-base variants that contribute to human biological variation and pathogenesis of many human diseases. Among all SNP types, nonsynonymous single-nucleotide polymorphisms (nsSNPs) can alter many structural, biochemical, and functional features of a protein such as folding characteristics, charge distribution, stability, dynamics, and interactions with other proteins/nucleotides. These modifications in the protein structure can lead nsSNPs to be closely associated with many multifactorial diseases such as cancer, diabetes, and neurodegenerative diseases. Predicting structural and functional effects of nsSNPs with experimental approaches can be time-consuming and costly; hence, computational prediction tools and algorithms are being widely and increasingly utilized in biology and medical research. This expert review examines the in silico tools and algorithms for the prediction of functional or structural effects of SNP variants, in addition to the description of the phenotypic effects of nsSNPs on protein structure, association between pathogenicity of variants, and functional or structural features of disease-associated variants. Finally, case studies investigating the functional and structural effects of nsSNPs on selected protein structures are highlighted. We conclude that creating a consistent workflow with a combination of in silico approaches or tools should be considered to increase the performance, accuracy, and precision of the biological and clinical predictions made in silico.
Affiliation(s)
- Metin Yazar
- Department of Bioengineering, Marmara University, Göztepe, İstanbul, Turkey; Department of Genetics and Bioengineering, Istanbul Okan University, Tuzla, Istanbul, Turkey
- Pemra Özbek
- Department of Bioengineering, Marmara University, Göztepe, İstanbul, Turkey
|
11
|
Valencia-Montoya WA, Elfekih S, North HL, Meier JI, Warren IA, Tay WT, Gordon KHJ, Specht A, Paula-Moraes SV, Rane R, Walsh TK, Jiggins CD. Adaptive Introgression across Semipermeable Species Boundaries between Local Helicoverpa zea and Invasive Helicoverpa armigera Moths. Mol Biol Evol 2020; 37:2568-2583. [PMID: 32348505 PMCID: PMC7475041 DOI: 10.1093/molbev/msaa108] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Indexed: 12/15/2022] Open
Abstract
Hybridization between invasive and native species has raised global concern, given the dramatic increase in species range shifts and pest outbreaks due to anthropogenic dispersal. Nevertheless, secondary contact between sister lineages of local and invasive species provides a natural laboratory to understand the factors that determine introgression and the maintenance or loss of species barriers. Here, we characterize the early evolutionary outcomes following secondary contact between invasive Helicoverpa armigera and native H. zea in Brazil. We carried out whole-genome resequencing of Helicoverpa moths from Brazil in two temporal samples: during the outbreak of H. armigera in 2013 and 2017. There is evidence for a burst of hybridization and widespread introgression from local H. zea into invasive H. armigera coinciding with H. armigera expansion in 2013. However, in H. armigera, the admixture proportion and the length of introgressed blocks were significantly reduced between 2013 and 2017, suggesting selection against admixture. In contrast to the genome-wide pattern, there was striking evidence for adaptive introgression of a single region from the invasive H. armigera into local H. zea, including an insecticide resistance allele that increased in frequency over time. In summary, despite extensive gene flow after secondary contact, the species boundaries are largely maintained except for the single introgressed region containing the insecticide-resistant locus. We document the worst-case scenario for an invasive species, in which there are now two pest species instead of one, and the native species has acquired resistance to pyrethroid insecticides through introgression.
Affiliation(s)
- Wendy A Valencia-Montoya
- Department of Zoology, University of Cambridge, Cambridge, United Kingdom
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA
- Samia Elfekih
- CSIRO Health and Biosecurity, Australian Animal Health Laboratory, Geelong, VIC, Australia
- Bio21 Institute, University of Melbourne, Parkville, VIC, Australia
- Henry L North
- Department of Zoology, University of Cambridge, Cambridge, United Kingdom
- Joana I Meier
- Department of Zoology, University of Cambridge, Cambridge, United Kingdom
- Ian A Warren
- Department of Zoology, University of Cambridge, Cambridge, United Kingdom
- Wee Tek Tay
- CSIRO Land and Water, Black Mountain Laboratories, Canberra, ACT, Australia
- Karl H J Gordon
- CSIRO Land and Water, Black Mountain Laboratories, Canberra, ACT, Australia
- Rahul Rane
- CSIRO Health and Biosecurity, Australian Animal Health Laboratory, Geelong, VIC, Australia
- Bio21 Institute, University of Melbourne, Parkville, VIC, Australia
- Tom K Walsh
- CSIRO Land and Water, Black Mountain Laboratories, Canberra, ACT, Australia
- Chris D Jiggins
- Department of Zoology, University of Cambridge, Cambridge, United Kingdom
|
12
|
The exhaustive genomic scan approach, with an application to rare-variant association analysis. Eur J Hum Genet 2020; 28:1283-1291. [PMID: 32415273 PMCID: PMC7608423 DOI: 10.1038/s41431-020-0639-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/02/2019] [Revised: 02/28/2020] [Accepted: 04/07/2020] [Indexed: 12/12/2022] Open
Abstract
Region-based genome-wide scans are usually performed by use of a priori chosen analysis regions. Such an approach will likely miss the region comprising the strongest signal and, thus, may result in increased type II error rates and decreased power. Here, we propose a genomic exhaustive scan approach that analyzes all possible subsequences and does not rely on a prior definition of the analysis regions. As a prime instance, we present a computationally ultraefficient implementation using the rare-variant collapsing test for phenotypic association, the genomic exhaustive collapsing scan (GECS). Our implementation allows for the identification of regions comprising the strongest signals in large, genome-wide rare-variant association studies while controlling the family-wise error rate via permutation. Application of GECS to two genomic data sets revealed several novel significantly associated regions for age-related macular degeneration and for schizophrenia. Our approach also offers a high potential to improve genome-wide scans for selection, methylation, and other analyses.
|
13
|
Chen H. A Computational Approach for Modeling the Allele Frequency Spectrum of Populations with Arbitrarily Varying Size. Genomics Proteomics Bioinformatics 2020; 17:635-644. [PMID: 32173599 PMCID: PMC7212486 DOI: 10.1016/j.gpb.2019.06.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/28/2018] [Revised: 06/04/2019] [Accepted: 08/02/2019] [Indexed: 11/25/2022]
Abstract
The allele frequency spectrum (AFS), or site frequency spectrum, is commonly used to summarize the genomic polymorphism pattern of a sample, which is informative for inferring population history and detecting natural selection. In 2013, Chen and Chen developed a method for analytically deriving the AFS for populations with temporally varying size through the coalescence time-scaling function. However, their approach is only applicable to population history scenarios in which the analytical form of the time-scaling function is tractable. In this paper, we propose a computational approach to extend the method to populations with arbitrary complex varying size by numerically approximating the time-scaling function. We demonstrate the performance of the approach by constructing the AFS for two population history scenarios: the logistic growth model and the Gompertz growth model, for which the AFS are unavailable with existing approaches. Software for implementing the algorithm can be downloaded at http://chenlab.big.ac.cn/software/.
Affiliation(s)
- Hua Chen
- CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China; CAS Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China; School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China.
|
14
|
Jay F, Boitard S, Austerlitz F. An ABC Method for Whole-Genome Sequence Data: Inferring Paleolithic and Neolithic Human Expansions. Mol Biol Evol 2020; 36:1565-1579. [PMID: 30785202 DOI: 10.1093/molbev/msz038] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Indexed: 12/24/2022] Open
Abstract
Species generally undergo a complex demographic history consisting, in particular, of multiple changes in population size. Genome-wide sequencing data are potentially highly informative for reconstructing this demographic history. A crucial point is to extract the relevant information from these very large data sets. Here, we design an approach for inferring past demographic events from a moderate number of fully sequenced genomes. Our new approach uses Approximate Bayesian Computation, a simulation-based statistical framework that allows 1) identifying the best demographic scenario among several competing scenarios and 2) estimating the best-fitting parameters under the chosen scenario. Approximate Bayesian Computation relies on the computation of summary statistics. Using a cross-validation approach, we show that statistics such as the lengths of haplotypes shared between individuals, or the decay of linkage disequilibrium with distance, can be combined with classical statistics (e.g., heterozygosity and Tajima's D) to accurately infer complex demographic scenarios including bottlenecks and expansion periods. We also demonstrate the importance of simultaneously estimating the genotyping error rate. Applying our method on genome-wide human-sequence databases, we finally show that a model consisting in a bottleneck followed by a Paleolithic and a Neolithic expansion is the most relevant for Eurasian populations.
Affiliation(s)
- Flora Jay
- Laboratoire EcoAnthropologie et Ethnobiologie, CNRS/MNHN/Université Paris Diderot, Paris, France; Laboratoire de Recherche en Informatique, CNRS/Université Paris-Sud/Université Paris-Saclay, Orsay, France
- Simon Boitard
- GenPhySE, Université de Toulouse, INRA, INPT, INP-ENVT, Castanet Tolosan, France
- Frédéric Austerlitz
- Laboratoire EcoAnthropologie et Ethnobiologie, CNRS/MNHN/Université Paris Diderot, Paris, France
15
|
Kamm J, Terhorst J, Durbin R, Song YS. Efficiently inferring the demographic history of many populations with allele count data. J Am Stat Assoc 2019; 115:1472-1487. [PMID: 33012903 PMCID: PMC7531012 DOI: 10.1080/01621459.2019.1635482] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 03/29/2018] [Revised: 04/14/2019] [Accepted: 06/08/2019] [Indexed: 01/06/2023]
Abstract
The sample frequency spectrum (SFS), or histogram of allele counts, is an important summary statistic in evolutionary biology, and is often used to infer the history of population size changes, migrations, and other demographic events affecting a set of populations. The expected multipopulation SFS under a given demographic model can be efficiently computed when the populations in the model are related by a tree, scaling to hundreds of populations. Admixture, back-migration, and introgression are common natural processes that violate the assumption of a tree-like population history, however, and until now the expected SFS could be computed for only a handful of populations when the demographic history is not a tree. In this article, we present a new method for efficiently computing the expected SFS and linear functionals of it, for demographies described by general directed acyclic graphs. This method can scale to more populations than previously possible for complex demographic histories including admixture. We apply our method to an 8-population SFS to estimate the timing and strength of a proposed "basal Eurasian" admixture event in human history. We implement and release our method in a new open-source software package momi2.
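To make the phrase "histogram of allele counts" concrete, here is a minimal single-population sketch that tallies an observed, unfolded SFS from a small 0/1 haplotype matrix. It is not the momi2 API and does not compute an expected SFS; the function name and the toy data are assumptions made for illustration.

```python
from collections import Counter

def sample_frequency_spectrum(haplotypes):
    """Return the unfolded SFS as a list whose entry i counts sites with i derived copies.

    haplotypes: list of equal-length sequences of 0/1 (0 = ancestral, 1 = derived).
    """
    n = len(haplotypes)
    if n == 0:
        return []
    num_sites = len(haplotypes[0])
    # Derived-allele count at each site, tallied into a histogram.
    counts = Counter(sum(h[s] for h in haplotypes) for s in range(num_sites))
    # Entries 0 and n (monomorphic sites) are kept so that index == derived count.
    return [counts.get(i, 0) for i in range(n + 1)]

# Toy data: 4 haplotypes (rows) by 5 sites (columns).
haps = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
]
sfs = sample_frequency_spectrum(haps)
```

In the toy data the five sites carry 1, 4, 0, 2, and 0 derived copies respectively, so the spectrum has two sites at count 0 and one each at counts 1, 2, and 4.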
Affiliation(s)
- Jack Kamm
- Wellcome Sanger Institute, Hinxton, Cambridge, UK
- Department of Genetics, University of Cambridge, Cambridge, UK
- Chan Zuckerberg Biohub, San Francisco, USA
- Richard Durbin
- Wellcome Sanger Institute, Hinxton, Cambridge, UK
- Department of Genetics, University of Cambridge, Cambridge, UK
- Yun S. Song
- Computer Science Division, University of California, Berkeley, USA
- Department of Statistics, University of California, Berkeley, USA
- Chan Zuckerberg Biohub, San Francisco, USA
16
|
Flagel L, Brandvain Y, Schrider DR. The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference. Mol Biol Evol 2019; 36:220-238. [PMID: 30517664 PMCID: PMC6367976 DOI: 10.1093/molbev/msy224] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Indexed: 12/28/2022]
Abstract
Population-scale genomic data sets have given researchers incredible amounts of information from which to infer evolutionary histories. Concomitant with this flood of data, theoretical and methodological advances have sought to extract information from genomic sequences to infer demographic events such as population size changes and gene flow among closely related populations/species, construct recombination maps, and uncover loci underlying recent adaptation. To date, most methods make use of only one or a few summaries of the input sequences and therefore ignore potentially useful information encoded in the data. The most sophisticated of these approaches involve likelihood calculations, which require theoretical advances for each new problem, and often focus on a single aspect of the data (e.g., only allele frequency information) in the interest of mathematical and computational tractability. Directly interrogating the entirety of the input sequence data in a likelihood-free manner would thus offer a fruitful alternative. Here, we accomplish this by representing DNA sequence alignments as images and using a class of deep learning methods called convolutional neural networks (CNNs) to make population genetic inferences from these images. We apply CNNs to a number of evolutionary questions and find that they frequently match or exceed the accuracy of current methods. Importantly, we show that CNNs perform accurate evolutionary model selection and parameter estimation, even on problems that have not received detailed theoretical treatments. Thus, when applied to population genetic alignments, CNNs are capable of outperforming expert-derived statistical methods and offer a new path forward in cases where no likelihood approach exists.
Affiliation(s)
- Lex Flagel
- Monsanto Company, Chesterfield, MO
- Department of Plant and Microbial Biology, University of Minnesota, St. Paul, MN
- Yaniv Brandvain
- Department of Plant and Microbial Biology, University of Minnesota, St. Paul, MN
- Daniel R Schrider
- Department of Genetics, University of North Carolina, Chapel Hill, NC
17
|
Tournebize R, Poncet V, Jakobsson M, Vigouroux Y, Manel S. McSwan: A joint site frequency spectrum method to detect and date selective sweeps across multiple population genomes. Mol Ecol Resour 2018; 19:283-295. [PMID: 30358170 DOI: 10.1111/1755-0998.12957] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 06/10/2018] [Revised: 10/17/2018] [Accepted: 10/18/2018] [Indexed: 01/01/2023]
Abstract
Inferring the mode and tempo of natural selection helps further our understanding of adaptation to past environmental changes. Here, we introduce McSwan, a method to detect and date past and recent natural selection events in the case of a hard sweep. The method is based on the comparison of site frequency spectra obtained under various demographic models that include selection. McSwan demonstrated high power (high sensitivity and specificity) in capturing hard selective sweep events without requiring haplotype phasing. It performed slightly better than SweeD when the recent effective population size was low and the genomic region was small. We then applied our method to a European (CEU) and an African (LWK) human re-sequencing data set. Most hard sweeps were detected in the CEU population (96%). Moreover, hard sweeps in the African population were estimated to have occurred further back in time (mode: 43,625 years BP) compared to those of Europeans (mode: 24,850 years BP). Most of the estimated ages of hard sweeps in Europeans were associated with the Last Glacial Maximum and were enriched in immunity-associated genes.
Affiliation(s)
- Rémi Tournebize
- IRD, University of Montpellier, UMR DIADE BP 64501, Montpellier Cedex 5, France
- Valérie Poncet
- IRD, University of Montpellier, UMR DIADE BP 64501, Montpellier Cedex 5, France
- Mattias Jakobsson
- Department of Organismal Biology and SciLifeLab, Uppsala University, Uppsala, Sweden; Centre for Anthropological Research, Department of Anthropology and Development Studies, University of Johannesburg, Auckland Park, South Africa
- Yves Vigouroux
- IRD, University of Montpellier, UMR DIADE BP 64501, Montpellier Cedex 5, France
- Stéphanie Manel
- EPHE, PSL Research University, CNRS, University of Montpellier, Montpellier SupAgro, IRD, INRA, UMR:5175 CEFE, Montpellier, France
18
|
Beichman AC, Huerta-Sanchez E, Lohmueller KE. Using Genomic Data to Infer Historic Population Dynamics of Nonmodel Organisms. ANNUAL REVIEW OF ECOLOGY EVOLUTION AND SYSTEMATICS 2018. [DOI: 10.1146/annurev-ecolsys-110617-062431] [Citation(s) in RCA: 89] [Impact Index Per Article: 14.8] [Indexed: 11/09/2022]
Abstract
Genome sequence data are now being routinely obtained from many nonmodel organisms. These data contain a wealth of information about the demographic history of the populations from which they originate. Many sophisticated statistical inference procedures have been developed to infer the demographic history of populations from this type of genomic data. In this review, we discuss the different statistical methods available for inference of demography, providing an overview of the underlying theory and logic behind each approach. We also discuss the types of data required and the pros and cons of each method. We then discuss how these methods have been applied to a variety of nonmodel organisms. We conclude by presenting some recommendations for researchers looking to use genomic data to infer demographic history.
Affiliation(s)
- Annabel C. Beichman
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, California 90095, USA
- Emilia Huerta-Sanchez
- Department of Molecular and Cell Biology, University of California, Merced, California 95343, USA
- Current affiliation: Department of Ecology and Evolutionary Biology, Brown University, Providence, Rhode Island 02912, USA
- Kirk E. Lohmueller
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, California 90095, USA
- Interdepartmental Program in Bioinformatics and Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California 90095, USA
19
|
Ragsdale AP, Moreau C, Gravel S. Genomic inference using diffusion models and the allele frequency spectrum. Curr Opin Genet Dev 2018; 53:140-147. [PMID: 30366252 DOI: 10.1016/j.gde.2018.10.001] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 07/03/2018] [Revised: 09/14/2018] [Accepted: 10/07/2018] [Indexed: 01/25/2023]
Abstract
Evolutionary, biological, and demographic processes together shape observed variation in populations. Understanding how these processes influence variation allows us to infer past demography and the nature of selection in populations. Forward in time models such as the diffusion approximation provide a powerful tool for performing inference based on the distribution of allele frequencies. Here, we discuss recent computational developments and their application to reconstructing human demographic history. Using whole-genome sequence data for 797 French Canadian individuals, we assess the neutrality of synonymous variants and show that selection can bias inferred demography, mutation rates, and distributions of fitness effects. We argue that the simple evolutionary models investigated by Kimura and Ohta still provide important insight into modern genetic research.
Affiliation(s)
- Aaron P Ragsdale
- Department of Human Genetics, McGill University, Montreal, QC, Canada
- Claudia Moreau
- Département des Sciences Fondamentales, Université du Québec à Chicoutimi, Chicoutimi, QC, Canada
- Simon Gravel
- Department of Human Genetics, McGill University, Montreal, QC, Canada.
20
|
Reppell M, Zöllner S. An efficient algorithm for generating the internal branches of a Kingman coalescent. Theor Popul Biol 2018; 122:57-66. [PMID: 28709926 PMCID: PMC5764821 DOI: 10.1016/j.tpb.2017.05.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/22/2016] [Revised: 05/19/2017] [Accepted: 05/26/2017] [Indexed: 01/16/2023]
Abstract
Coalescent simulations are a widely used approach for simulating sample genealogies, but can become computationally burdensome in large samples. Methods exist to analytically calculate a sample's expected frequency spectrum without simulating full genealogies. However, statistics that rely on the distribution of the length of internal coalescent branches, such as the probability that two mutations of equal size arose on the same genealogical branch, have previously required full coalescent simulations to estimate. Here, we present a sampling method capable of efficiently generating limited portions of sample genealogies using a series of analytic equations that give probabilities for the number, start, and end of internal branches conditional on the number of final samples they subtend. These equations are independent of the coalescent waiting times and need only be calculated a single time, lending themselves to efficient computation. We compare our method with full coalescent simulations to show the resulting distribution of branch lengths and summary statistics are equivalent, but that for many conditions our method is at least 10 times faster.
Affiliation(s)
- M Reppell
- Department of Human Genetics, University of Chicago, Chicago, IL, USA.
- S Zöllner
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, USA; Department of Psychiatry, University of Michigan, Ann Arbor, MI, USA
21
|
Schrider DR, Ayroles J, Matute DR, Kern AD. Supervised machine learning reveals introgressed loci in the genomes of Drosophila simulans and D. sechellia. PLoS Genet 2018; 14:e1007341. [PMID: 29684059 PMCID: PMC5933812 DOI: 10.1371/journal.pgen.1007341] [Citation(s) in RCA: 69] [Impact Index Per Article: 11.5] [Received: 09/25/2017] [Revised: 05/03/2018] [Accepted: 03/28/2018] [Indexed: 12/30/2022]
Abstract
Hybridization and gene flow between species appears to be common. Even though it is clear that hybridization is widespread across all surveyed taxonomic groups, the magnitude and consequences of introgression are still largely unknown. Thus it is crucial to develop the statistical machinery required to uncover which genomic regions have recently acquired haplotypes via introgression from a sister population. We developed a novel machine learning framework, called FILET (Finding Introgressed Loci via Extra-Trees) capable of revealing genomic introgression with far greater power than competing methods. FILET works by combining information from a number of population genetic summary statistics, including several new statistics that we introduce, that capture patterns of variation across two populations. We show that FILET is able to identify loci that have experienced gene flow between related species with high accuracy, and in most situations can correctly infer which population was the donor and which was the recipient. Here we describe a data set of outbred diploid Drosophila sechellia genomes, and combine them with data from D. simulans to examine recent introgression between these species using FILET. Although we find that these populations may have split more recently than previously appreciated, FILET confirms that there has indeed been appreciable recent introgression (some of which might have been adaptive) between these species, and reveals that this gene flow is primarily in the direction of D. simulans to D. sechellia. Understanding the extent to which species or diverged populations hybridize in nature is crucially important if we are to understand the speciation process. Accordingly numerous research groups have developed methodology for finding the genetic evidence of such introgression. In this report we develop a supervised machine learning approach for uncovering loci which have introgressed across species boundaries. 
We show that our method, FILET, has greater accuracy and power than competing methods in discovering introgression, and in addition can detect the directionality associated with the gene flow between species. Using whole genome sequences from Drosophila simulans and Drosophila sechellia we show that FILET discovers quite extensive introgression between these species that has occurred mostly from D. simulans to D. sechellia. Our work highlights the complex process of speciation even within a well-studied system and points to the growing importance of supervised machine learning in population genetics.
Affiliation(s)
- Daniel R. Schrider
- Department of Genetics, Rutgers University, Piscataway, New Jersey, United States of America
- Human Genetics Institute of New Jersey, Rutgers University, Piscataway, New Jersey, United States of America
- Julien Ayroles
- Ecology and Evolutionary Biology Department, Princeton University, Princeton, New Jersey, United States of America
- Lewis Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Daniel R. Matute
- Biology Department, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Andrew D. Kern
- Department of Genetics, Rutgers University, Piscataway, New Jersey, United States of America
- Human Genetics Institute of New Jersey, Rutgers University, Piscataway, New Jersey, United States of America
22
|
Browning SR, Browning BL, Zhou Y, Tucci S, Akey JM. Analysis of Human Sequence Data Reveals Two Pulses of Archaic Denisovan Admixture. Cell 2018; 173:53-61.e9. [PMID: 29551270 PMCID: PMC5866234 DOI: 10.1016/j.cell.2018.02.031] [Citation(s) in RCA: 166] [Impact Index Per Article: 27.7] [Received: 10/04/2017] [Revised: 11/21/2017] [Accepted: 02/12/2018] [Indexed: 01/27/2023]
Abstract
Anatomically modern humans interbred with Neanderthals and with a related archaic population known as Denisovans. Genomes of several Neanderthals and one Denisovan have been sequenced, and these reference genomes have been used to detect introgressed genetic material in present-day human genomes. Segments of introgression also can be detected without use of reference genomes, and doing so can be advantageous for finding introgressed segments that are less closely related to the sequenced archaic genomes. We apply a new reference-free method for detecting archaic introgression to 5,639 whole-genome sequences from Eurasia and Oceania. We find Denisovan ancestry in populations from East and South Asia and Papuans. Denisovan ancestry comprises two components with differing similarity to the sequenced Altai Denisovan individual. This indicates that at least two distinct instances of Denisovan admixture into modern humans occurred, involving Denisovan populations that had different levels of relatedness to the sequenced Altai Denisovan.
Affiliation(s)
- Sharon R Browning
- Department of Biostatistics, University of Washington, Seattle, WA 98195, USA.
- Brian L Browning
- Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA 98195, USA
- Ying Zhou
- Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
- Serena Tucci
- Ecology and Evolutionary Biology, Princeton University, Princeton, NJ 08544, USA
- Joshua M Akey
- Ecology and Evolutionary Biology, Princeton University, Princeton, NJ 08544, USA; The Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
23
|
Population genomic analysis of elongated skulls reveals extensive female-biased immigration in Early Medieval Bavaria. Proc Natl Acad Sci U S A 2018. [PMID: 29531040 PMCID: PMC5879695 DOI: 10.1073/pnas.1719880115] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Indexed: 01/19/2023]
Abstract
Many modern European states trace their roots back to a period known as the Migration Period that spans from Late Antiquity to the early Middle Ages. We have conducted the first population-level analysis of people from this era, generating genomic data from 41 graves from archaeological sites in present-day Bavaria in southern Germany mostly dating to around 500 AD. While they are predominantly of northern/central European ancestry, we also find significant evidence for a nonlocal genetic provenance that is highly enriched among resident Early Medieval women, demonstrating artificial skull deformation. We infer that the most likely origin of the majority of these women was southeastern Europe, resolving a debate that has lasted for more than half a century. Modern European genetic structure demonstrates strong correlations with geography, while genetic analysis of prehistoric humans has indicated at least two major waves of immigration from outside the continent during periods of cultural change. However, population-level genome data that could shed light on the demographic processes occurring during the intervening periods have been absent. Therefore, we generated genomic data from 41 individuals dating mostly to the late 5th/early 6th century AD from present-day Bavaria in southern Germany, including 11 whole genomes (mean depth 5.56×). In addition we developed a capture array to sequence neutral regions spanning a total of 5 Mb and 486 functional polymorphic sites to high depth (mean 72×) in all individuals. Our data indicate that while men generally had ancestry that closely resembles modern northern and central Europeans, women exhibit a very high genetic heterogeneity; this includes signals of genetic ancestry ranging from western Europe to East Asia. Particularly striking are women with artificial skull deformations; the analysis of their collective genetic ancestry suggests an origin in southeastern Europe. 
In addition, functional variants indicate that they also differed in visible characteristics. This example of female-biased migration indicates that complex demographic processes during the Early Medieval period may have contributed in an unexpected way to shape the modern European genetic landscape. Examination of the panel of functional loci also revealed that many alleles associated with recent positive selection were already at modern-like frequencies in European populations ∼1,500 years ago.
24
|
Baharian S, Gravel S. On the decidability of population size histories from finite allele frequency spectra. Theor Popul Biol 2018; 120:42-51. [PMID: 29305873 DOI: 10.1016/j.tpb.2017.12.008] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 09/14/2017] [Revised: 12/15/2017] [Accepted: 12/20/2017] [Indexed: 10/18/2022]
Abstract
Understanding the historical events that shaped current genomic diversity has applications in historical, biological, and medical research. However, the amount of historical information that can be inferred from genetic data is finite, which leads to an identifiability problem. For example, different historical processes can lead to identical distribution of allele frequencies. This identifiability issue casts a shadow of uncertainty over the results of any study which uses the frequency spectrum to infer past demography. It has been argued that imposing mild 'reasonableness' constraints on demographic histories can enable unique reconstruction, at least in an idealized setting where the length of the genome is nearly infinite. Here, we discuss this problem for finite sample size and genome length. Using the diffusion approximation, we obtain bounds on likelihood differences between similar demographic histories, and use them to construct pairs of very different reasonable histories that produce almost-identical frequency distributions. The finite-genome problem therefore remains poorly determined even among reasonable histories. Where fits to few-parameter models produce narrow parameter confidence intervals, large uncertainties lurk hidden by model assumption.
Affiliation(s)
- Soheil Baharian
- Department of Human Genetics, McGill University, Montreal, QC, Canada; McGill University and Genome Quebec Innovation Centre, Montreal, QC, Canada
- Simon Gravel
- Department of Human Genetics, McGill University, Montreal, QC, Canada; McGill University and Genome Quebec Innovation Centre, Montreal, QC, Canada.
25
|
Comparison of Single Genome and Allele Frequency Data Reveals Discordant Demographic Histories. G3-GENES GENOMES GENETICS 2017; 7:3605-3620. [PMID: 28893846 PMCID: PMC5677151 DOI: 10.1534/g3.117.300259] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Indexed: 02/03/2023]
Abstract
Inference of demographic history from genetic data is a primary goal of population genetics of model and nonmodel organisms. Whole genome-based approaches such as the pairwise/multiple sequentially Markovian coalescent methods use genomic data from one to four individuals to infer the demographic history of an entire population, while site frequency spectrum (SFS)-based methods use the distribution of allele frequencies in a sample to reconstruct the same historical events. Although both methods are extensively used in empirical studies and perform well on data simulated under simple models, there have been only limited comparisons of them in more complex and realistic settings. Here we use published demographic models based on data from three human populations (Yoruba, descendants of northwest-Europeans, and Han Chinese) as an empirical test case to study the behavior of both inference procedures. We find that several of the demographic histories inferred by the whole genome-based methods do not predict the genome-wide distribution of heterozygosity, nor do they predict the empirical SFS. However, using simulated data, we also find that the whole genome methods can reconstruct the complex demographic models inferred by SFS-based methods, suggesting that the discordant patterns of genetic variation are not attributable to a lack of statistical power, but may reflect unmodeled complexities in the underlying demography. More generally, our findings indicate that demographic inference from a small number of genomes, routine in genomic studies of nonmodel organisms, should be interpreted cautiously, as these models cannot recapitulate other summaries of the data.
26
|
Amorim CEG, Gao Z, Baker Z, Diesel JF, Simons YB, Haque IS, Pickrell J, Przeworski M. The population genetics of human disease: The case of recessive, lethal mutations. PLoS Genet 2017; 13:e1006915. [PMID: 28957316 PMCID: PMC5619689 DOI: 10.1371/journal.pgen.1006915] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 12/04/2016] [Accepted: 07/09/2017] [Indexed: 01/08/2023]
Abstract
Do the frequencies of disease mutations in human populations reflect a simple balance between mutation and purifying selection? What other factors shape the prevalence of disease mutations? To begin to answer these questions, we focused on one of the simplest cases: recessive mutations that alone cause lethal diseases or complete sterility. To this end, we generated a hand-curated set of 417 Mendelian mutations in 32 genes reported to cause a recessive, lethal Mendelian disease. We then considered analytic models of mutation-selection balance in infinite and finite populations of constant sizes and simulations of purifying selection in a more realistic demographic setting, and tested how well these models fit allele frequencies estimated from 33,370 individuals of European ancestry. In doing so, we distinguished between CpG transitions, which occur at a substantially elevated rate, and three other mutation types. Intriguingly, the observed frequency for CpG transitions is slightly higher than expectation but close, whereas the frequencies observed for the three other mutation types are an order of magnitude higher than expected, with a bigger deviation from expectation seen for less mutable types. This discrepancy is even larger when subtle fitness effects in heterozygotes or lethal compound heterozygotes are taken into account. In principle, higher than expected frequencies of disease mutations could be due to widespread errors in reporting causal variants, compensation by other mutations, or balancing selection. It is unclear why these factors would have a greater impact on disease mutations that occur at lower rates, however. 
We argue instead that the unexpectedly high frequency of disease mutations and the relationship to the mutation rate likely reflect an ascertainment bias: of all the mutations that cause recessive lethal diseases, those that by chance have reached higher frequencies are more likely to have been identified and thus to have been included in this study. Beyond the specific application, this study highlights the parameters likely to be important in shaping the frequencies of Mendelian disease alleles. What determines the frequencies of disease mutations in human populations? To begin to answer this question, we focus on one of the simplest cases: mutations that cause completely recessive, lethal Mendelian diseases. We first review theory about what to expect from mutation and selection in a population of finite size and generate predictions based on simulations using a plausible demographic scenario of recent human evolution. For a highly mutable type of mutation, transitions at CpG sites, we find that the predictions are close to the observed frequencies of recessive lethal disease mutations. For less mutable types, however, predictions substantially under-estimate the observed frequency. We discuss possible explanations for the discrepancy and point to a complication that, to our knowledge, is not widely appreciated: that there exists ascertainment bias in disease mutation discovery. Specifically, we suggest that alleles that have been identified to date are likely the ones that by chance have reached higher frequencies and are thus more likely to have been mapped. More generally, our study highlights the factors that influence the frequencies of Mendelian disease alleles.
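For orientation, the textbook deterministic version of mutation-selection balance for a fully recessive lethal allele can be sketched as follows. This is the standard infinite-population, random-mating approximation, not the finite-population models or simulations used in the study; the symbols q (allele frequency) and u (per-generation mutation rate) and the numerical value in the last comment are assumptions introduced only for this illustration.

```latex
% Deterministic sketch: fully recessive lethal allele at frequency q,
% mutation rate u per generation (infinite, randomly mating population).
\begin{align*}
q_{\text{after selection}} &= \frac{q(1-q)}{1-q^{2}} = \frac{q}{1+q}
  && \text{(homozygotes, frequency } q^{2}\text{, leave no offspring)} \\
\Delta q_{\text{sel}} &= \frac{q}{1+q} - q = -\frac{q^{2}}{1+q} \approx -q^{2}
  && (q \ll 1) \\
\Delta q_{\text{mut}} &\approx u(1-q) \approx u \\
\Delta q_{\text{sel}} + \Delta q_{\text{mut}} = 0
  &\;\Longrightarrow\; \hat{q} \approx \sqrt{u}
\end{align*}
% Illustrative value only (not from the study): u = 10^{-8} gives \hat{q} \approx 10^{-4}.
```

Because the equilibrium frequency grows with the square root of u in this approximation, more mutable classes such as CpG transitions are expected to sit at higher frequencies than less mutable classes, which is the baseline against which the abstract compares observed frequencies.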
Affiliation(s)
- Carlos Eduardo G. Amorim
  - Department of Biological Sciences, Columbia University, New York, NY, United States of America
  - CAPES Foundation, Ministry of Education of Brazil, Brasília, DF, Brazil
- Ziyue Gao
  - Howard Hughes Medical Institute, Stanford University, Stanford, CA, United States of America
- Zachary Baker
  - Department of Systems Biology, Columbia University, New York, NY, United States of America
- Yuval B. Simons
  - Department of Biological Sciences, Columbia University, New York, NY, United States of America
- Imran S. Haque
  - Counsyl, 180 Kimball Way, South San Francisco, CA, United States of America
- Joseph Pickrell
  - Department of Biological Sciences, Columbia University, New York, NY, United States of America
  - New York Genome Center, New York, NY, United States of America
- Molly Przeworski
  - Department of Biological Sciences, Columbia University, New York, NY, United States of America
  - Department of Systems Biology, Columbia University, New York, NY, United States of America
27
Mostafavi H, Berisa T, Day FR, Perry JRB, Przeworski M, Pickrell JK. Identifying genetic variants that affect viability in large cohorts. PLoS Biol 2017; 15:e2002458. [PMID: 28873088 PMCID: PMC5584811 DOI: 10.1371/journal.pbio.2002458] [Citation(s) in RCA: 49] [Impact Index Per Article: 7.0] [Received: 03/14/2017] [Accepted: 08/03/2017] [Indexed: 12/20/2022] Open
Abstract
A number of open questions in human evolutionary genetics would become tractable if we were able to directly measure evolutionary fitness. As a step towards this goal, we developed a method to examine whether individual genetic variants, or sets of genetic variants, currently influence viability. The approach consists in testing whether the frequency of an allele varies across ages, accounting for variation in ancestry. We applied it to the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort and to the parents of participants in the UK Biobank. Across the genome, we found only a few common variants with large effects on age-specific mortality: tagging the APOE ε4 allele and near CHRNA3. These results suggest that when large, even late-onset effects are kept at low frequency by purifying selection. Testing viability effects of sets of genetic variants that jointly influence 1 of 42 traits, we detected a number of strong signals. In participants of the UK Biobank of British ancestry, we found that variants that delay puberty timing are associated with a longer parental life span (P ~ 6.2 × 10⁻⁶ for fathers and P ~ 2.0 × 10⁻³ for mothers), consistent with epidemiological studies. Similarly, variants associated with later age at first birth are associated with a longer maternal life span (P ~ 1.4 × 10⁻³). Signals are also observed for variants influencing cholesterol levels, risk of coronary artery disease (CAD), body mass index, as well as risk of asthma. These signals exhibit consistent effects in the GERA cohort and among participants of the UK Biobank of non-British ancestry. We also found marked differences between males and females, most notably at the CHRNA3 locus, and variants associated with risk of CAD and cholesterol levels. Beyond our findings, the analysis serves as a proof of principle for how upcoming biomedical data sets can be used to learn about selection effects in contemporary humans.
Our global understanding of adaptation in humans is limited to indirect statistical inferences from patterns of genetic variation, which are sensitive to past selection pressures. We introduced a method that allowed us to directly observe ongoing selection in humans by identifying genetic variants that affect survival to a given age (i.e., viability selection). We applied our approach to the GERA cohort and parents of the UK Biobank participants. We found viability effects of variants near the APOE and CHRNA3 genes, which are associated with the risk of Alzheimer disease and smoking behavior, respectively. We also tested for the joint effect of sets of genetic variants that influence quantitative traits. We uncovered an association between longer life span and genetic variants that delay puberty timing and age at first birth. We also detected detrimental effects of higher genetically predicted cholesterol levels, body mass index, risk of coronary artery disease (CAD), and risk of asthma on survival. Some of the observed effects differ between males and females, most notably those at the CHRNA3 gene and variants associated with risk of CAD and cholesterol levels. Beyond this application, our analysis shows how large biomedical data sets can be used to study natural selection in humans.
Affiliation(s)
- Hakhamanesh Mostafavi
  - Department of Biological Sciences, Columbia University, New York, New York, United States of America
- Tomaz Berisa
  - New York Genome Center, New York, New York, United States of America
- Felix R. Day
  - MRC Epidemiology Unit, Institute of Metabolic Science, University of Cambridge, Cambridge, United Kingdom
- John R. B. Perry
  - MRC Epidemiology Unit, Institute of Metabolic Science, University of Cambridge, Cambridge, United Kingdom
- Molly Przeworski
  - Department of Biological Sciences, Columbia University, New York, New York, United States of America
  - Department of Systems Biology, Columbia University, New York, New York, United States of America
- Joseph K. Pickrell
  - Department of Biological Sciences, Columbia University, New York, New York, United States of America
  - New York Genome Center, New York, New York, United States of America
28
Abstract
The degree to which adaptation in recent human evolution shapes genetic variation remains controversial. This is in part due to the limited evidence in humans for classic "hard selective sweeps", wherein a novel beneficial mutation rapidly sweeps through a population to fixation. However, positive selection may often proceed via "soft sweeps" acting on mutations already present within a population. Here, we examine recent positive selection across six human populations using a powerful machine learning approach that is sensitive to both hard and soft sweeps. We found evidence that soft sweeps are widespread and account for the vast majority of recent human adaptation. Surprisingly, our results also suggest that linked positive selection affects patterns of variation across much of the genome, and may increase the frequencies of deleterious mutations. Our results also reveal insights into the role of sexual selection, cancer risk, and central nervous system development in recent human evolution.
Affiliation(s)
- Daniel R. Schrider
  - Department of Genetics, Rutgers University, Piscataway, NJ
  - Human Genetics Institute of New Jersey, Rutgers University, Piscataway, NJ
- Andrew D. Kern
  - Department of Genetics, Rutgers University, Piscataway, NJ
  - Human Genetics Institute of New Jersey, Rutgers University, Piscataway, NJ
29
Dietary adaptation of FADS genes in Europe varied across time and geography. Nat Ecol Evol 2017; 1:167. [PMID: 29094686 PMCID: PMC5672832 DOI: 10.1038/s41559-017-0167] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 12/23/2016] [Accepted: 04/18/2017] [Indexed: 11/08/2022]
Abstract
Fatty acid desaturase (FADS) genes encode rate-limiting enzymes for the biosynthesis of omega-6 and omega-3 long chain polyunsaturated fatty acids (LCPUFAs). This biosynthesis is essential for individuals subsisting on LCPUFAs-poor diets (e.g. plant-based). Positive selection on FADS genes has been reported in multiple populations, but its presence and pattern in Europeans remain elusive. Here, using ancient and modern DNA, we demonstrate that positive selection acted on the same FADS variants both before and after the advent of farming in Europe, but on opposite (i.e. alternative) alleles. Selection in recent farmers also varied geographically, with the strongest signal in Southern Europe. These varying selection patterns concur with anthropological evidence of varying diets, and with the association of farming-adaptive alleles with higher FADS1 expression and thus enhanced LCPUFAs biosynthesis. Genome-wide association studies reveal that farming-adaptive alleles not only increase LCPUFAs, but also affect other lipid levels and protect against several inflammatory diseases.
30
Accuracy of Demographic Inferences from the Site Frequency Spectrum: The Case of the Yoruba Population. Genetics 2017; 206:439-449. [PMID: 28341655 DOI: 10.1534/genetics.116.192708] [Citation(s) in RCA: 54] [Impact Index Per Article: 7.7] [Received: 09/30/2016] [Accepted: 03/23/2017] [Indexed: 01/23/2023] Open
Abstract
Some methods for demographic inference based on the observed genetic diversity of current populations rely on the use of summary statistics such as the Site Frequency Spectrum (SFS). Demographic models can be either model-constrained with numerous parameters, such as growth rates, timing of demographic events, and migration rates, or model-flexible, with an unbounded collection of piecewise constant sizes. It is still debated whether demographic histories can be accurately inferred based on the SFS. Here, we illustrate this theoretical issue on an example of demographic inference for an African population. The SFS of the Yoruba population (data from the 1000 Genomes Project) is fit to a simple model of population growth described with a single parameter (e.g., founding time). We infer a time to the most recent common ancestor of 1.7 million years (MY) for this population. However, we show that the Yoruba SFS is not informative enough to discriminate between several different models of growth. We also show that for such simple demographies, the fit of one-parameter models outperforms the stairway plot, a recently developed model-flexible method. The use of this method on simulated data suggests that it is biased by the noise intrinsically present in the data.
31
Inference of the Distribution of Selection Coefficients for New Nonsynonymous Mutations Using Large Samples. Genetics 2017; 206:345-361. [PMID: 28249985 PMCID: PMC5419480 DOI: 10.1534/genetics.116.197145] [Citation(s) in RCA: 113] [Impact Index Per Article: 16.1] [Received: 10/23/2016] [Accepted: 02/14/2017] [Indexed: 12/23/2022] Open
Abstract
The distribution of fitness effects (DFE) has considerable importance in population genetics. To date, estimates of the DFE come from studies using a small number of individuals. Thus, estimates of the proportion of moderately to strongly deleterious new mutations may be unreliable because such variants are unlikely to be segregating in the data. Additionally, the true functional form of the DFE is unknown, and estimates of the DFE differ significantly between studies. Here we present a flexible and computationally tractable method, called Fit∂a∂i, to estimate the DFE of new mutations using the site frequency spectrum from a large number of individuals. We apply our approach to the frequency spectrum of 1300 Europeans from the Exome Sequencing Project ESP6400 data set, 1298 Danes from the LuCamp data set, and 432 Europeans from the 1000 Genomes Project to estimate the DFE of deleterious nonsynonymous mutations. We infer significantly fewer (0.38-0.84 fold) strongly deleterious mutations with selection coefficient |s| > 0.01 and more (1.24-1.43 fold) weakly deleterious mutations with selection coefficient |s| < 0.001 compared to previous estimates. Furthermore, a DFE that is a mixture distribution of a point mass at neutrality plus a gamma distribution fits better than a gamma distribution in two of the three data sets. Our results suggest that nearly neutral forces play a larger role in human evolution than previously thought.
32
Kamm JA, Terhorst J, Song YS. Efficient computation of the joint sample frequency spectra for multiple populations. J Comput Graph Stat 2017; 26:182-194. [PMID: 28239248 DOI: 10.1080/10618600.2016.1159212] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Indexed: 10/22/2022]
Abstract
A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences and provides a highly efficient dimensional reduction of large-scale population genomic variation data. Recently, there has been much interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. SFS-based inference methods require accurate computation of the expected SFS under a given demographic model. Although much methodological progress has been made, existing methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable accurate, efficient computation of the expected joint SFS for thousands of individuals sampled from hundreds of populations related by a complex demographic model with arbitrary population size histories (including piecewise-exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study we demonstrate our improvements to numerical stability and computational complexity.
Affiliation(s)
- John A Kamm
  - Department of Statistics, University of California, Berkeley
- Yun S Song
  - Departments of EECS, Statistics, and Integrative Biology, University of California, Berkeley
33
Bagley RK, Sousa VC, Niemiller ML, Linnen CR. History, geography and host use shape genomewide patterns of genetic variation in the redheaded pine sawfly (Neodiprion lecontei). Mol Ecol 2017; 26:1022-1044. [DOI: 10.1111/mec.13972] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 07/13/2015] [Revised: 11/10/2016] [Accepted: 12/01/2016] [Indexed: 01/03/2023]
Affiliation(s)
- Robin K. Bagley
  - Department of Biology, University of Kentucky, Lexington, KY 40506, USA
- Vitor C. Sousa
  - cE3c - Centre for Ecology, Evolution and Environmental Changes, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal
- Matthew L. Niemiller
  - Illinois Natural History Survey, Prairie Research Institute, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA
34
A Model of Compound Heterozygous, Loss-of-Function Alleles Is Broadly Consistent with Observations from Complex-Disease GWAS Datasets. PLoS Genet 2017; 13:e1006573. [PMID: 28103232 PMCID: PMC5289629 DOI: 10.1371/journal.pgen.1006573] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 04/18/2016] [Revised: 02/02/2017] [Accepted: 01/05/2017] [Indexed: 12/17/2022] Open
Abstract
The genetic component of complex disease risk in humans remains largely unexplained. A corollary is that the allelic spectrum of genetic variants contributing to complex disease risk is unknown. Theoretical models that relate population genetic processes to the maintenance of genetic variation for quantitative traits may suggest profitable avenues for future experimental design. Here we use forward simulation to model a genomic region evolving under a balance between recurrent deleterious mutation and Gaussian stabilizing selection. We consider multiple genetic and demographic models, and several different methods for identifying genomic regions harboring variants associated with complex disease risk. We demonstrate that the model of gene action, relating genotype to phenotype, has a qualitative effect on several relevant aspects of the population genetic architecture of a complex trait. In particular, the genetic model impacts genetic variance component partitioning across the allele frequency spectrum and the power of statistical tests. Models with partial recessivity closely match the minor allele frequency distribution of significant hits from empirical genome-wide association studies without requiring homozygous effect sizes to be small. We highlight a particular gene-based model of incomplete recessivity that is appealing from first principles. Under that model, deleterious mutations in a genomic region partially fail to complement one another. This model of gene-based recessivity predicts the empirically observed inconsistency between twin and SNP based estimates of dominance heritability. Furthermore, this model predicts considerable levels of unexplained variance associated with intralocus epistasis. Our results suggest a need for improved statistical tools for region based genetic association and heritability estimation.
Gene action determines how mutations affect phenotype. When placed in an evolutionary context, the details of the genotype-to-phenotype model can impact the maintenance of genetic variation for complex traits. Likewise, non-equilibrium demographic history may affect patterns of genetic variation. Here, we explore the impact of genetic model and population growth on the distribution of genetic variance across the allele frequency spectrum underlying risk for a complex disease. Using forward-in-time population genetic simulations, we show that the genetic model has important impacts on the composition of variation for complex disease risk in a population. We explicitly simulate genome-wide association studies (GWAS) and perform heritability estimation on population samples. A particular model of gene-based partial recessivity, based on allelic non-complementation, aligns well with empirical results. This model is congruent with the dominance variance estimates from both SNPs and twins, and the minor allele frequency distribution of GWAS hits.
35
Harpak A, Bhaskar A, Pritchard JK. Mutation Rate Variation is a Primary Determinant of the Distribution of Allele Frequencies in Humans. PLoS Genet 2016; 12:e1006489. [PMID: 27977673 PMCID: PMC5157949 DOI: 10.1371/journal.pgen.1006489] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 04/12/2016] [Accepted: 11/16/2016] [Indexed: 01/06/2023] Open
Abstract
The site frequency spectrum (SFS) has long been used to study demographic history and natural selection. Here, we extend this summary by examining the SFS conditional on the alleles found at the same site in other species. We refer to this extension as the "phylogenetically-conditioned SFS" or cSFS. Using recent large-sample data from the Exome Aggregation Consortium (ExAC), combined with primate genome sequences, we find that human variants that occurred independently in closely related primate lineages are at higher frequencies in humans than variants with parallel substitutions in more distant primates. We show that this effect is largely due to sites with elevated mutation rates causing significant departures from the widely-used infinite sites mutation model. Our analysis also suggests substantial variation in mutation rates even among mutations involving the same nucleotide changes. In summary, we show that variable mutation rates are key determinants of the SFS in humans.
Affiliation(s)
- Arbel Harpak
  - Department of Biology, Stanford University, Stanford, California, United States of America
- Anand Bhaskar
  - Department of Genetics, Stanford University, Stanford, California, United States of America
  - Howard Hughes Medical Institute, Stanford University, Stanford, California, United States of America
- Jonathan K. Pritchard
  - Department of Biology, Stanford University, Stanford, California, United States of America
  - Department of Genetics, Stanford University, Stanford, California, United States of America
  - Howard Hughes Medical Institute, Stanford University, Stanford, California, United States of America
36
Gao F, Keinan A. Explosive genetic evidence for explosive human population growth. Curr Opin Genet Dev 2016; 41:130-139. [PMID: 27710906 PMCID: PMC5161661 DOI: 10.1016/j.gde.2016.09.002] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 06/01/2016] [Revised: 08/26/2016] [Accepted: 09/11/2016] [Indexed: 11/19/2022]
Abstract
The advent of next-generation sequencing technology has allowed the collection of vast amounts of genetic variation data. A recurring discovery from studying larger and larger samples of individuals has been the extreme, previously unexpected, excess of very rare genetic variants, which has been shown to be mostly due to the recent explosive growth of human populations. Here, we review recent literature that inferred recent changes in population size in different human populations and with different methodologies, with many pointing to recent explosive growth, especially in European populations for which more data has been available. We also review the state-of-the-art methods and software for the inference of historical population size changes that led to these discoveries. Finally, we discuss the implications of recent population growth on personalized genomics, on purifying selection in the non-equilibrium state it entails and, as a consequence, on the genetic architecture underlying complex disease and the performance of mapping methods in discovering rare variants that contribute to complex disease risk.
Affiliation(s)
- Feng Gao
  - Department of Biological Statistics and Computational Biology, Ithaca, NY 14850, United States
- Alon Keinan
  - Department of Biological Statistics and Computational Biology, Ithaca, NY 14850, United States
37
Field Y, Boyle EA, Telis N, Gao Z, Gaulton KJ, Golan D, Yengo L, Rocheleau G, Froguel P, McCarthy MI, Pritchard JK. Detection of human adaptation during the past 2000 years. Science 2016; 354:760-764. [PMID: 27738015 PMCID: PMC5182071 DOI: 10.1126/science.aag0776] [Citation(s) in RCA: 234] [Impact Index Per Article: 29.3] [Received: 05/06/2016] [Accepted: 10/03/2016] [Indexed: 12/22/2022]
Abstract
Detection of recent natural selection is a challenging problem in population genetics. Here we introduce the singleton density score (SDS), a method to infer very recent changes in allele frequencies from contemporary genome sequences. Applied to data from the UK10K Project, SDS reflects allele frequency changes in the ancestors of modern Britons during the past ~2000 to 3000 years. We see strong signals of selection at lactase and the major histocompatibility complex, and in favor of blond hair and blue eyes. For polygenic adaptation, we find that recent selection for increased height has driven allele frequency shifts across most of the genome. Moreover, we identify shifts associated with other complex traits, suggesting that polygenic adaptation has played a pervasive role in shaping genotypic and phenotypic variation in modern humans.
Affiliation(s)
- Yair Field
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
  - Howard Hughes Medical Institute, Stanford University, Stanford, CA 94305, USA
- Evan A Boyle
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Natalie Telis
  - Program in Biomedical Informatics, Stanford University, Stanford, CA 94305, USA
- Ziyue Gao
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
  - Howard Hughes Medical Institute, Stanford University, Stanford, CA 94305, USA
- Kyle J Gaulton
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
  - Wellcome Trust Center for Human Genetics, and Oxford Center for Diabetes Endocrinology and Metabolism, University of Oxford, Oxford, UK
- David Golan
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Loic Yengo
  - Univ. Lille, CNRS, Institut Pasteur de Lille, UMR 8199-EGID, F-59000 Lille, France
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
- Ghislain Rocheleau
  - Univ. Lille, CNRS, Institut Pasteur de Lille, UMR 8199-EGID, F-59000 Lille, France
- Philippe Froguel
  - Univ. Lille, CNRS, Institut Pasteur de Lille, UMR 8199-EGID, F-59000 Lille, France
  - Imperial College, Department of Genomics of Common Disease, London Hammersmith Hospital, London, UK
- Mark I McCarthy
  - Wellcome Trust Center for Human Genetics, and Oxford Center for Diabetes Endocrinology and Metabolism, University of Oxford, Oxford, UK
- Jonathan K Pritchard
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
  - Howard Hughes Medical Institute, Stanford University, Stanford, CA 94305, USA
  - Department of Biology, Stanford University, Stanford, CA, USA
38
Schrider DR, Shanku AG, Kern AD. Effects of Linked Selective Sweeps on Demographic Inference and Model Selection. Genetics 2016; 204:1207-1223. [PMID: 27605051 PMCID: PMC5105852 DOI: 10.1534/genetics.116.190223] [Citation(s) in RCA: 90] [Impact Index Per Article: 11.3] [Received: 04/07/2016] [Accepted: 09/02/2016] [Indexed: 01/06/2023] Open
Abstract
The availability of large-scale population genomic sequence data has resulted in an explosion in efforts to infer the demographic histories of natural populations across a broad range of organisms. As demographic events alter coalescent genealogies, they leave detectable signatures in patterns of genetic variation within and between populations. Accordingly, a variety of approaches have been designed to leverage population genetic data to uncover the footprints of demographic change in the genome. The vast majority of these methods make the simplifying assumption that the measures of genetic variation used as their input are unaffected by natural selection. However, natural selection can dramatically skew patterns of variation not only at selected sites, but at linked, neutral loci as well. Here we assess the impact of recent positive selection on demographic inference by characterizing the performance of three popular methods through extensive simulation of data sets with varying numbers of linked selective sweeps. In particular, we examined three different demographic models relevant to a number of species, finding that positive selection can bias parameter estimates of each of these models, often severely. We find that selection can lead to incorrect inferences of population size changes when none have occurred. Moreover, we show that linked selection can lead to incorrect demographic model selection when multiple demographic scenarios are compared. We argue that natural populations may experience the amount of recent positive selection required to skew inferences. These results suggest that demographic studies conducted in many species to date may have exaggerated the extent and frequency of population size changes.
Affiliation(s)
- Daniel R Schrider
  - Department of Genetics, Rutgers University, Piscataway, New Jersey 08854
  - Human Genetics Institute of New Jersey, Rutgers University, Piscataway, New Jersey 08854
- Alexander G Shanku
  - Department of Genetics, Rutgers University, Piscataway, New Jersey 08854
  - Institute for Quantitative Biomedicine, Rutgers University, Piscataway, New Jersey 08854
- Andrew D Kern
  - Department of Genetics, Rutgers University, Piscataway, New Jersey 08854
  - Human Genetics Institute of New Jersey, Rutgers University, Piscataway, New Jersey 08854
39
Mathias RA, Taub MA, Gignoux CR, Fu W, Musharoff S, O'Connor TD, Vergara C, Torgerson DG, Pino-Yanes M, Shringarpure SS, Huang L, Rafaels N, Boorgula MP, Johnston HR, Ortega VE, Levin AM, Song W, Torres R, Padhukasahasram B, Eng C, Mejia-Mejia DA, Ferguson T, Qin ZS, Scott AF, Yazdanbakhsh M, Wilson JG, Marrugo J, Lange LA, Kumar R, Avila PC, Williams LK, Watson H, Ware LB, Olopade C, Olopade O, Oliveira R, Ober C, Nicolae DL, Meyers D, Mayorga A, Knight-Madden J, Hartert T, Hansel NN, Foreman MG, Ford JG, Faruque MU, Dunston GM, Caraballo L, Burchard EG, Bleecker E, Araujo MI, Herrera-Paz EF, Gietzen K, Grus WE, Bamshad M, Bustamante CD, Kenny EE, Hernandez RD, Beaty TH, Ruczinski I, Akey J, Barnes KC. A continuum of admixture in the Western Hemisphere revealed by the African Diaspora genome. Nat Commun 2016; 7:12522. [PMID: 27725671 PMCID: PMC5062574 DOI: 10.1038/ncomms12522] [Citation(s) in RCA: 102] [Impact Index Per Article: 12.8] [Received: 03/29/2016] [Accepted: 07/12/2016] [Indexed: 01/20/2023] Open
Abstract
The African Diaspora in the Western Hemisphere represents one of the largest forced migrations in history and had a profound impact on genetic diversity in modern populations. To date, the fine-scale population structure of descendants of the African Diaspora remains largely uncharacterized. Here we present genetic variation from deeply sequenced genomes of 642 individuals from North and South American, Caribbean and West African populations, substantially increasing the lexicon of human genomic variation and suggesting much variation remains to be discovered in African-admixed populations in the Americas. We summarize genetic variation in these populations, quantifying the postcolonial sex-biased European gene flow across multiple regions. Moreover, we refine estimates on the burden of deleterious variants carried across populations and how this varies with African ancestry. Our data are an important resource for empowering disease mapping studies in African-admixed individuals and will facilitate gene discovery for diseases disproportionately affecting individuals of African ancestry.
Collapse
Affiliation(s)
- Rasika Ann Mathias
- Department of Medicine, Johns Hopkins University, Baltimore, Maryland 21224, USA
- Department of Epidemiology, Bloomberg School of Public Health, JHU, Baltimore, Maryland 21205, USA
- Margaret A. Taub
- Department of Biostatistics, Bloomberg School of Public Health, JHU, Baltimore, Maryland 21205, USA
- Christopher R. Gignoux
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Wenqing Fu
- Department of Genomic Sciences, University of Washington, Seattle, Washington 98195, USA
- Shaila Musharoff
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Timothy D. O'Connor
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA
- Program in Personalized and Genomic Medicine, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA
- Department of Medicine, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA
- Candelaria Vergara
- Department of Medicine, Johns Hopkins University, Baltimore, Maryland 21224, USA
- Dara G. Torgerson
- Department of Medicine, University of California, San Francisco, San Francisco, California 94143, USA
- Maria Pino-Yanes
- Department of Medicine, University of California, San Francisco, San Francisco, California 94143, USA
- CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid 28029, Spain
- Suyash S. Shringarpure
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Lili Huang
- Department of Medicine, Johns Hopkins University, Baltimore, Maryland 21224, USA
- Nicholas Rafaels
- Department of Medicine, Johns Hopkins University, Baltimore, Maryland 21224, USA
- Henry Richard Johnston
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia 30322, USA
- Victor E. Ortega
- Center for Human Genomics and Personalized Medicine, Wake Forest School of Medicine, Winston-Salem, North Carolina 27157, USA
- Albert M. Levin
- Department of Public Health Sciences, Henry Ford Health System, Detroit, Michigan 48202, USA
- Wei Song
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA
- Program in Personalized and Genomic Medicine, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA
- Department of Medicine, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA
- Raul Torres
- Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, California 94158, USA
- Badri Padhukasahasram
- Center for Health Policy and Health Services Research, Henry Ford Health System, Detroit, Michigan 48202, USA
- Celeste Eng
- Department of Medicine, University of California, San Francisco, San Francisco, California 94143, USA
- Delmy-Aracely Mejia-Mejia
- Centro de Neumologia y Alergias, San Pedro Sula 21102, Honduras
- Faculty of Medicine, Centro Medico de la Familia, San Pedro Sula 21102, Honduras
- Trevor Ferguson
- Tropical Medicine Research Institute, The University of the West Indies, St. Michael BB11115, Barbados
- Zhaohui S. Qin
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia 30322, USA
- Alan F. Scott
- Department of Medicine, Johns Hopkins University, Baltimore, Maryland 21224, USA
- Maria Yazdanbakhsh
- Department of Parasitology, Leiden University Medical Center, Leiden 2333ZA, The Netherlands
- James G. Wilson
- Department of Physiology and Biophysics, University of Mississippi Medical Center, Jackson, Mississippi 39216, USA
- Javier Marrugo
- Instituto de Investigaciones Immunologicas, Universidad de Cartagena, Cartagena 130000, Colombia
- Leslie A. Lange
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27599, USA
- Rajesh Kumar
- Department of Pediatrics, Northwestern University, Chicago, Illinois 60637, USA
- The Ann & Robert H. Lurie Children's Hospital of Chicago, Chicago, Illinois 60637, USA
- Pedro C. Avila
- Department of Medicine, Northwestern University, Chicago, Illinois 60637, USA
- L. Keoki Williams
- Center for Health Policy and Health Services Research, Henry Ford Health System, Detroit, Michigan 48202, USA
- Department of Internal Medicine, Henry Ford Health System, Detroit, Michigan 48202, USA
- Harold Watson
- Faculty of Medical Sciences Cave Hill Campus, The University of the West Indies, Bridgetown BB11000, Barbados
- Queen Elizabeth Hospital, The University of the West Indies, St. Michael BB11115, Barbados
- Lorraine B. Ware
- Department of Medicine, Vanderbilt University, Nashville, Tennessee 37232, USA
- Department of Pathology, Microbiology and Immunology, Vanderbilt University, Nashville, Tennessee 37232, USA
- Christopher Olopade
- Department of Medicine and Center for Global Health, University of Chicago, Chicago, Illinois 60637, USA
- Ricardo Oliveira
- Laboratório de Patologia Experimental, Centro de Pesquisas Gonçalo Moniz, Salvador 40296-710, Brazil
- Carole Ober
- Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA
- Dan L. Nicolae
- Department of Medicine, University of Chicago, Chicago, Illinois 60637, USA
- Department of Statistics, University of Chicago, Chicago, Illinois 60637, USA
- Deborah Meyers
- Center for Human Genomics and Personalized Medicine, Wake Forest School of Medicine, Winston-Salem, North Carolina 27157, USA
- Alvaro Mayorga
- Centro de Neumologia y Alergias, San Pedro Sula 21102, Honduras
- Jennifer Knight-Madden
- Tropical Medicine Research Institute, The University of the West Indies, St. Michael BB11115, Barbados
- Tina Hartert
- Department of Medicine, Vanderbilt University, Nashville, Tennessee 37232, USA
- Nadia N. Hansel
- Department of Medicine, Johns Hopkins University, Baltimore, Maryland 21224, USA
- Marilyn G. Foreman
- Pulmonary and Critical Care Medicine, Morehouse School of Medicine, Atlanta, Georgia 30310, USA
- Jean G. Ford
- Department of Epidemiology, Bloomberg School of Public Health, JHU, Baltimore, Maryland 21205, USA
- Department of Medicine, The Brooklyn Hospital Center, Brooklyn, New York 11201, USA
- Mezbah U. Faruque
- National Human Genome Center, Howard University College of Medicine, Washington DC 20059, USA
- Georgia M. Dunston
- National Human Genome Center, Howard University College of Medicine, Washington DC 20059, USA
- Department of Microbiology, Howard University College of Medicine, Washington DC 20059, USA
- Luis Caraballo
- Institute for Immunological Research, Universidad de Cartagena, Cartagena 130000, Colombia
- Esteban G. Burchard
- Department of Medicine, University of California, San Francisco, San Francisco, California 94143, USA
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, California 94158, USA
- Eugene Bleecker
- Center for Human Genomics and Personalized Medicine, Wake Forest School of Medicine, Winston-Salem, North Carolina 27157, USA
- Maria Ilma Araujo
- Immunology Service, Universidade Federal da Bahia, Salvador 401110170, Brazil
- Edwin Francisco Herrera-Paz
- Centro de Neumologia y Alergias, San Pedro Sula 21102, Honduras
- Faculty of Medicine, Centro Medico de la Familia, San Pedro Sula 21102, Honduras
- Facultad de Medicina, Universidad Catolica de Honduras, San Pedro Sula 21102, Honduras
- Michael Bamshad
- Department of Pediatrics, University of Washington, Seattle, Washington 98195, USA
- Carlos D. Bustamante
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Eimear E. Kenny
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
- Department of Genetics and Genomics, Icahn School of Medicine at Mount Sinai, New York, New York 10029, USA
- Ryan D. Hernandez
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, California 94158, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, California 94143, USA
- California Institute for Quantitative Biosciences, University of California, San Francisco, California 94143, USA
- Terri H. Beaty
- Department of Epidemiology, Bloomberg School of Public Health, JHU, Baltimore, Maryland 21205, USA
- Ingo Ruczinski
- Department of Biostatistics, Bloomberg School of Public Health, JHU, Baltimore, Maryland 21205, USA
- Joshua Akey
- Department of Genomic Sciences, University of Washington, Seattle, Washington 98195, USA
- Kathleen C. Barnes
- Department of Medicine, Johns Hopkins University, Baltimore, Maryland 21224, USA
- Department of Epidemiology, Bloomberg School of Public Health, JHU, Baltimore, Maryland 21205, USA
|
40
|
Auer PL, Reiner AP, Wang G, Kang HM, Abecasis GR, Altshuler D, Bamshad MJ, Nickerson DA, Tracy RP, Rich SS, Leal SM. Guidelines for Large-Scale Sequence-Based Complex Trait Association Studies: Lessons Learned from the NHLBI Exome Sequencing Project. Am J Hum Genet 2016; 99:791-801. [PMID: 27666372 DOI: 10.1016/j.ajhg.2016.08.012] [Citation(s) in RCA: 69] [Impact Index Per Article: 8.6] [Received: 02/26/2016] [Accepted: 08/08/2016] [Indexed: 12/11/2022]
Abstract
Massively parallel whole-genome sequencing (WGS) data have ushered in a new era in human genetics. These data are now being used to understand the role of rare variants in complex traits and to advance the goals of precision medicine. The technological and computing advances that have enabled us to generate WGS data on thousands of individuals have also outpaced our ability to perform analyses in scientifically and statistically rigorous and thoughtful ways. The past several years have witnessed the application of whole-exome sequencing (WES) to complex traits and diseases. From our analysis of NHLBI Exome Sequencing Project (ESP) data, not only have a number of important disease and complex trait association findings emerged, but our collective experience offers some valuable lessons for WGS initiatives. These include caveats associated with generating automated pipelines for quality control and analysis of rare variants; the importance of studying minority populations; sample size requirements and efficient study designs for identifying rare-variant associations; and the significance of incidental findings in population-based genetic research. With the ESP as an example, we offer guidance and a framework on how to conduct a large-scale association study in the era of WGS.
Affiliation(s)
- Suzanne M Leal
- Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA.
|
41
|
Xue C, Chen H, Yu F. Base-Biased Evolution of Disease-Associated Mutations in the Human Genome. Hum Mutat 2016; 37:1209-1214. [PMID: 27507420 DOI: 10.1002/humu.23065] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 03/01/2016] [Revised: 08/02/2016] [Accepted: 08/07/2016] [Indexed: 11/08/2022]
Abstract
Understanding the evolution of disease-associated mutations is fundamental to analyzing the pathogenetics of diseases. Mutation, recombination (by GC-biased gene conversion, gBGC), and selection are known to shape the evolution of disease-associated mutations, but how these evolutionary forces work together is still an open question. In this study, we analyzed several large-scale human datasets (1000 Genomes, ESP6500, ExAC and ClinVar), and found that base-biased mutagenesis generates more GC→AT than AT→GC mutations, while gBGC promotes the fixation of AT→GC mutations to balance the impact of base-biased mutation on the genome. Due to this effect of gBGC, purifying selection removes more deleterious AT→GC mutations than GC→AT mutations from the population, but many high-frequency (fixed and nearly fixed) deleterious AT→GC mutations remain, possibly due to high genetic load. As a special subset, disease-associated mutations follow this evolutionary rule: disease-associated GC→AT mutations are more enriched among rare mutations than AT→GC mutations, while disease-associated AT→GC mutations are more enriched among high-frequency mutations. Thus, we present a base-biased evolutionary framework that explains the base-biased generation and accumulation of disease-associated mutations in human populations.
Affiliation(s)
- Cheng Xue
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas
- Hua Chen
- Center for Computational Genomics, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Fuli Yu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas.
|
42
|
Goldberg A, Mychajliw AM, Hadly EA. Post-invasion demography of prehistoric humans in South America. Nature 2016; 532:232-5. [PMID: 27049941 DOI: 10.1038/nature17176] [Citation(s) in RCA: 67] [Impact Index Per Article: 8.4] [Received: 11/03/2015] [Accepted: 01/26/2016] [Indexed: 01/25/2023]
Abstract
As the last habitable continent colonized by humans, the site of multiple domestication hotspots, and the location of the largest Pleistocene megafaunal extinction, South America is central to human prehistory. Yet remarkably little is known about human population dynamics during colonization, subsequent expansions, and domestication. Here we reconstruct the spatiotemporal patterns of human population growth in South America using a newly aggregated database of 1,147 archaeological sites and 5,464 calibrated radiocarbon dates spanning fourteen thousand to two thousand years ago (ka). We demonstrate that, rather than a steady exponential expansion, the demographic history of South Americans is characterized by two distinct phases. First, humans spread rapidly throughout the continent, but remained at low population sizes for 8,000 years, including a 4,000-year period of 'boom-and-bust' oscillations with no net growth. Supplementation of hunting with domesticated crops and animals had a minimal impact on population carrying capacity. Only with widespread sedentism, beginning ~5 ka, did a second demographic phase begin, with evidence for exponential population growth in cultural hotspots, characteristic of the Neolithic transition worldwide. The unique extent of humanity's ability to modify its environment to markedly increase carrying capacity in South America is therefore an unexpectedly recent phenomenon.
Affiliation(s)
- Amy Goldberg
- Biology Department, Stanford University, Stanford, California 94305, USA
- Alexis M Mychajliw
- Biology Department, Stanford University, Stanford, California 94305, USA
- Elizabeth A Hadly
- Biology Department, Stanford University, Stanford, California 94305, USA
- Woods Institute, Stanford University, Stanford, California 94305, USA
|
43
|
Li B, Wang GT, Leal SM. Generation of sequence-based data for pedigree-segregating Mendelian or Complex traits. Bioinformatics 2015; 31:3706-8. [PMID: 26177964 DOI: 10.1093/bioinformatics/btv412] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 04/10/2015] [Accepted: 07/07/2015] [Indexed: 01/05/2023]
Abstract
MOTIVATION: There is great interest in analyzing next generation sequence data that has been generated for pedigrees. However, unlike for population-based data there are only a limited number of rare variant methods to analyze pedigree data. One limitation is the ability to evaluate type I and II errors for family-based methods, due to lack of software that can simulate realistic sequence data for pedigrees. SUMMARY: We developed RarePedSim (Rare-variant Pedigree-based Simulator), a program to simulate region/gene-level genotype and phenotype data for complex and Mendelian traits for any given pedigree structure. Using a genetic model, sequence variant data can be generated either conditionally or unconditionally on pedigree members' qualitative or quantitative phenotypes. Additionally, qualitative or quantitative traits can be generated conditional on variant data. Sequence data can either be simulated using realistic population demographic models or obtained from sequence-based studies. Variant sites can be annotated with positions, allele frequencies and functionality. For rare variants, RarePedSim is the only program that can efficiently generate both genotypes and phenotypes, regardless of pedigree structure. Data generated by RarePedSim are in standard Linkage file (.ped) and Variant Call (.vcf) formats, ready to be used for a variety of purposes, including evaluation of type I error and power, for association methods including mixed models and linkage analysis methods. AVAILABILITY AND IMPLEMENTATION: bioinformatics.org/simped/rare CONTACT: sleal@bcm.edu.
Affiliation(s)
- Biao Li
- Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
- Gao T Wang
- Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
- Suzanne M Leal
- Center for Statistical Genetics, Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
|
44
|
|
45
|
Inference of Super-exponential Human Population Growth via Efficient Computation of the Site Frequency Spectrum for Generalized Models. Genetics 2015; 202:235-45. [PMID: 26450922 PMCID: PMC4701087 DOI: 10.1534/genetics.115.180570] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 07/07/2015] [Accepted: 09/28/2015] [Indexed: 01/08/2023]
Abstract
The site frequency spectrum (SFS) and other genetic summary statistics are at the heart of many population genetic studies. Previous studies have shown that human populations have undergone a recent epoch of fast growth in effective population size. These studies assumed that growth is exponential, and the ensuing models leave an excess amount of extremely rare variants. This suggests that human populations might have experienced a recent growth with speed faster than exponential. Recent studies have introduced a generalized growth model where the growth speed can be faster or slower than exponential. However, only simulation approaches were available for obtaining summary statistics under such generalized models. In this study, we provide expressions to accurately and efficiently evaluate the SFS and other summary statistics under generalized models, which we further implement in publicly available software. Investigating the power to infer deviation of growth from being exponential, we observed that adequate sample sizes facilitate accurate inference; e.g., a sample of 3000 individuals with the amount of data expected from exome sequencing allows observing and accurately estimating growth with speed deviating by ≥10% from that of exponential. Applying our inference framework to data from the NHLBI Exome Sequencing Project, we found that a model with a generalized growth epoch fits the observed SFS significantly better than the equivalent model with exponential growth (P-value = 3.85 × 10⁻⁶). The estimated growth speed significantly deviates from exponential (P-value ≪ 10⁻¹²), with the best-fit estimate being of growth speed 12% faster than exponential.
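To make the central summary statistic concrete, the following minimal sketch computes an unfolded SFS from a small 0/1 haplotype matrix. The function name `site_frequency_spectrum`, the toy matrix, and the convention that 1 marks the derived allele are illustrative assumptions; this is not the software described in the abstract.

```python
from collections import Counter


def site_frequency_spectrum(haplotypes):
    """Return the unfolded SFS of a 0/1 haplotype matrix.

    Entry i-1 of the result counts the sites at which exactly i of the
    n sampled haplotypes carry the derived allele (coded as 1).
    Monomorphic sites (0 or n derived copies) are not counted.
    """
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    # Derived-allele count at each site (column sums of the matrix).
    counts = Counter(sum(h[j] for h in haplotypes) for j in range(num_sites))
    return [counts.get(i, 0) for i in range(1, n)]


# Hypothetical sample: 4 haplotypes (rows) by 5 sites (columns).
sample = [
    [1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
sfs = site_frequency_spectrum(sample)
print(sfs)
```

In this toy sample three sites are singletons and one site has three derived copies; an excess of singletons relative to a model's expectation is the kind of signal that growth models are fitted to.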
|
46
|
Lohmueller KE. The distribution of deleterious genetic variation in human populations. Curr Opin Genet Dev 2015; 29:139-46. [PMID: 25461617 DOI: 10.1016/j.gde.2014.09.005] [Citation(s) in RCA: 86] [Impact Index Per Article: 9.6] [Received: 05/19/2014] [Revised: 08/28/2014] [Accepted: 09/05/2014] [Indexed: 11/19/2022]
Abstract
Population genetic studies suggest that most amino-acid changing mutations are deleterious. Such mutations are of tremendous interest in human population genetics as they are important for the evolutionary process and may contribute risk to common disease. Genomic studies over the past 5 years have documented differences across populations in the number of heterozygous deleterious genotypes, number of homozygous derived deleterious genotypes, number of deleterious segregating sites and proportion of sites that are potentially deleterious. These differences have been attributed to population history affecting the ability of natural selection to remove deleterious variants from the population. However, recent studies have suggested that the genetic load is the same across populations and that the efficacy of natural selection has not differed across human populations. Here I show that these observations are not incompatible with each other and that the apparent differences are due to examining different features of the genetic data and differing definitions of terms.
|
47
|
Chen H, Hey J, Chen K. Inferring Very Recent Population Growth Rate from Population-Scale Sequencing Data: Using a Large-Sample Coalescent Estimator. Mol Biol Evol 2015; 32:2996-3011. [PMID: 26187437 DOI: 10.1093/molbev/msv158] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 12/20/2022]
Abstract
Large-sample or population-level sequencing data provide unprecedented opportunities for inferring detailed population histories, especially recent demographic histories. On the other hand, such data challenge most existing population genetic methods: simulation-based approaches require intensive computation, and analytical approaches are often numerically intractable when the sample size is large. We propose a computationally efficient method for simultaneous estimation of population size and the rate and onset time of population growth in very recent history, using the pattern of the total number of segregating sites as a function of sample size. Coalescent simulation shows that it can accurately and efficiently estimate the parameters of recent population growth from large-scale data. This approach has the flexibility to model population history with multiple growth stages or other epochs, and it is robust when the sample size is very large or at the population scale, for which Kingman's coalescent assumption is not valid. This approach is applied to recently published data and estimates the recent population growth rate in the European population to be 1.49% with the onset time 7.26 ka, and the rate in the African population to be 0.735% with the onset time 10.01 ka.
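As a point of reference for "the total number of segregating sites as a function of sample size", the sketch below evaluates the classical constant-size neutral expectation E[S_n] = θ Σ_{i=1}^{n-1} 1/i (Watterson's result under Kingman's coalescent with infinite-sites mutation). This is only a baseline curve, not the estimator proposed in the abstract; the function name and the value of `theta` are hypothetical.

```python
def expected_segregating_sites(n, theta):
    """Expected number of segregating sites in a sample of size n.

    Assumes the standard neutral, constant-size coalescent with
    infinite-sites mutation: E[S_n] = theta * sum_{i=1}^{n-1} 1/i.
    """
    return theta * sum(1.0 / i for i in range(1, n))


theta = 10.0  # hypothetical scaled mutation rate for the region
curve = {n: expected_segregating_sites(n, theta) for n in (2, 10, 100, 1000)}
for n, expected in curve.items():
    print(f"n={n:>5}  E[S_n]={expected:.2f}")
```

Under this baseline the curve grows only logarithmically in n. Recent population growth is generally expected to add rare variants and so lift the observed curve above it at large n, which is plausibly why the shape of the curve carries information about growth.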
Affiliation(s)
- Hua Chen
- Center for Computational Genomics, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Jody Hey
- Center for Computational Genetics and Genomics, Temple University
- Kun Chen
- Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA
|
48
|
Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum. Proc Natl Acad Sci U S A 2015; 112:7677-82. [PMID: 26056264 DOI: 10.1073/pnas.1503717112] [Citation(s) in RCA: 63] [Impact Index Per Article: 7.0] [Indexed: 01/25/2023]
Abstract
The sample frequency spectrum (SFS) of DNA sequences from a collection of individuals is a summary statistic that is commonly used for parametric inference in population genetics. Despite the popularity of SFS-based inference methods, little is currently known about the information theoretic limit on the estimation accuracy as a function of sample size. Here, we show that using the SFS to estimate the size history of a population has a minimax error of at least O(1/log s), where s is the number of independent segregating sites used in the analysis. This rate is exponentially worse than known convergence rates for many classical estimation problems in statistics. Another surprising aspect of our theoretical bound is that it does not depend on the dimension of the SFS, which is related to the number of sampled individuals. This means that, for a fixed number s of segregating sites considered, using more individuals does not help to reduce the minimax error bound. Our result pertains to populations that have experienced a bottleneck, and we argue that it can be expected to apply to many populations in nature.
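The phrase "exponentially worse" can be illustrated numerically. The sketch below tabulates 1/log s next to the 1/√s rate typical of classical parametric estimation; constants are ignored, the natural logarithm is an assumption, and the chosen values of s are arbitrary.

```python
import math


def rate_table(site_counts):
    """Return (s, 1/log s, 1/sqrt s) for each number of segregating sites s."""
    return [(s, 1.0 / math.log(s), 1.0 / math.sqrt(s)) for s in site_counts]


rows = rate_table([10**2, 10**4, 10**6])
for s, log_rate, sqrt_rate in rows:
    print(f"s={s:>8}  1/log(s)={log_rate:.4f}  1/sqrt(s)={sqrt_rate:.4f}")
```

Halving a 1/√s error requires four times as many sites, whereas halving a 1/log s error requires squaring the number of sites, since log(s²) = 2 log s.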
|
49
|
Fregel R, Cabrera V, Larruga JM, Abu-Amero KK, González AM. Carriers of Mitochondrial DNA Macrohaplogroup N Lineages Reached Australia around 50,000 Years Ago following a Northern Asian Route. PLoS One 2015; 10:e0129839. [PMID: 26053380 PMCID: PMC4460043 DOI: 10.1371/journal.pone.0129839] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 09/03/2014] [Accepted: 05/13/2015] [Indexed: 01/17/2023]
Abstract
Background: The modern human colonization of Eurasia and Australia is mostly explained by a single out-of-Africa exit following a southern coastal route through Arabia and India. However, dispersal across the Levant would better explain the introgression with Neanderthals, and more than one exit would fit better with the different ancient genomic components discovered in indigenous Australians and in ancient Europeans. The existence of an additional northern route used by modern humans to reach Australia was previously deduced from the phylogeography of mtDNA macrohaplogroup N. Here, we present new mtDNA data and new multidisciplinary information that add more support to this northern route.
Methods: MtDNA hypervariable segments and haplogroup diagnostic coding positions were analyzed in 2,278 Saudi Arabs, of which 1,725 are new samples. In addition, we used 623 published mtDNA genomes belonging to macrohaplogroup N, but not R, to build updated phylogenetic trees and calculate their coalescence ages, and more than 70,000 partial mtDNA sequences were screened to establish their respective geographic ranges.
Results: The Saudi mtDNA profile confirms the absence of autochthonous mtDNA lineages in Arabia with coalescence ages deep enough to support population continuity in the region since the out-of-Africa episode. In contrast to Australia, where N(xR) haplogroups are found in high frequency and with deep coalescence ages, there are neither autochthonous N(xR) lineages in India nor N(xR) branches with coalescence ages as deep as those found in Australia. These patterns are at odds with the supposition that Australian colonizers harboring N(xR) lineages used a route involving India as a stage. The most ancient N(xR) lineages in Eurasia are found in China and, inconsistent with the coastal route, the N(xR) haplogroups with the southernmost geographical ranges all have more recent radiations than the Australian ones.
Conclusions: Apart from a single migration event via a southern route, the phylogeny and phylogeography of N(xR) lineages support the idea that people carrying mtDNA N lineages could have reached Australia following a northern route through Asia. Data from other disciplines also support this scenario.
Affiliation(s)
- Rosa Fregel
- Departamento de Genética, Facultad de Biología, Universidad de La Laguna, La Laguna, Tenerife, Spain
- Vicente Cabrera
- Departamento de Genética, Facultad de Biología, Universidad de La Laguna, La Laguna, Tenerife, Spain
- Jose M. Larruga
- Departamento de Genética, Facultad de Biología, Universidad de La Laguna, La Laguna, Tenerife, Spain
- Khaled K. Abu-Amero
- Department of Ophthalmology, College of Medicine, King Saud University, Riyadh, Saudi Arabia
- Ana M. González
- Departamento de Genética, Facultad de Biología, Universidad de La Laguna, La Laguna, Tenerife, Spain
|
50
|
Yu F, Lu J, Liu X, Gazave E, Chang D, Raj S, Hunter-Zinck H, Blekhman R, Arbiza L, Van Hout C, Morrison A, Johnson AD, Bis J, Cupples LA, Psaty BM, Muzny D, Yu J, Gibbs RA, Keinan A, Clark AG, Boerwinkle E. Population genomic analysis of 962 whole genome sequences of humans reveals natural selection in non-coding regions. PLoS One 2015; 10:e0121644. [PMID: 25807536 PMCID: PMC4373932 DOI: 10.1371/journal.pone.0121644] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 04/25/2014] [Accepted: 08/14/2014] [Indexed: 12/13/2022]
Abstract
Whole genome analysis in large samples from a single population is needed to provide adequate power to assess relative strengths of natural selection across different functional components of the genome. In this study, we analyzed next-generation sequencing data from 962 European Americans, and found that as expected approximately 60% of the top 1% of positive selection signals lie in intergenic regions, 33% in intronic regions, and slightly over 1% in coding regions. Several detailed functional annotation categories in intergenic regions showed statistically significant enrichment in positively selected loci when compared to the null distribution of the genomic span of ENCODE categories. There was a significant enrichment of purifying selection signals detected in enhancers, transcription factor binding sites, microRNAs and target sites, but not on lincRNA or piRNAs, suggesting different evolutionary constraints for these domains. Loci in “repressed or low activity regions” and loci near or overlapping the transcription start site were the most significantly over-represented annotations among the top 1% of signals for positive selection.
Affiliation(s)
- Fuli Yu
- Human Genome Sequencing Center, Molecular and Human Genetics Department, Baylor College of Medicine, Houston, Texas, United States of America
- Institute of Neurology, Tianjin Medical University General Hospital, Tianjin, China
- Jian Lu
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
- College of Life Sciences, State Key Laboratory of Protein and Plant Gene Research, Center for Bioinformatics, Peking-Tsinghua Center for Life Sciences, Peking University, Beijing, China
- Xiaoming Liu
- Human Genetic Center, University of Texas Health Science Center, Houston, Texas, United States of America
- Elodie Gazave
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
- Diana Chang
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
- Srilakshmi Raj
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
- Haley Hunter-Zinck
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
- Ran Blekhman
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
- Leonardo Arbiza
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
- Cris Van Hout
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
- Alanna Morrison
- Human Genetic Center, University of Texas Health Science Center, Houston, Texas, United States of America
- Andrew D. Johnson
- National Heart, Lung and Blood Institute (NHLBI) Framingham Heart Study, Framingham, Massachusetts, United States of America
- Joshua Bis
- Cardiovascular Health Research Unit, Departments of Medicine, Epidemiology, and Health Services, University of Washington, Seattle, Washington, United States of America
- L. Adrienne Cupples
- National Heart, Lung and Blood Institute (NHLBI) Framingham Heart Study, Framingham, Massachusetts, United States of America
- Bruce M. Psaty
- Cardiovascular Health Research Unit, Departments of Medicine, Epidemiology, and Health Services, University of Washington, Seattle, Washington, United States of America
- Donna Muzny
- Human Genome Sequencing Center, Molecular and Human Genetics Department, Baylor College of Medicine, Houston, Texas, United States of America
- Jin Yu
- Human Genome Sequencing Center, Molecular and Human Genetics Department, Baylor College of Medicine, Houston, Texas, United States of America
- Richard A. Gibbs
- Human Genome Sequencing Center, Molecular and Human Genetics Department, Baylor College of Medicine, Houston, Texas, United States of America
- Alon Keinan
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
- Andrew G. Clark
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, United States of America
- Eric Boerwinkle
- Human Genome Sequencing Center, Molecular and Human Genetics Department, Baylor College of Medicine, Houston, Texas, United States of America
- Human Genetic Center, University of Texas Health Science Center, Houston, Texas, United States of America
|