151
Layer RM, Kindlon N, Karczewski KJ, Quinlan AR. Efficient genotype compression and analysis of large genetic-variation data sets. Nat Methods 2016; 13:63-5. [PMID: 26550772 PMCID: PMC4697868 DOI: 10.1038/nmeth.3654] [Citation(s) in RCA: 49] [Impact Index Per Article: 6.1] [Received: 06/05/2015] [Accepted: 10/07/2015] [Indexed: 11/08/2022]
Abstract
Genotype Query Tools (GQT) is an indexing strategy that expedites analyses of genome-variation data sets in Variant Call Format based on sample genotypes, phenotypes and relationships. GQT's compressed genotype index minimizes decompression for analysis, and its performance relative to that of existing methods improves with cohort size. We show substantial (up to 443-fold) gains in performance over existing methods and demonstrate GQT's utility for exploring massive data sets involving thousands to millions of genomes. GQT can be accessed at https://github.com/ryanlayer/gqt.
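The idea of answering genotype queries from an index rather than from decompressed records can be illustrated with a toy bitmap index. This is a minimal sketch under stated assumptions, not GQT's actual index format or API: the genotype matrix, the 0/1/2 coding, and the function names `build_index` and `variants_where_all` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical genotype matrix: rows are variants, columns are samples.
# Assumed coding: 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate.
GENOTYPES: List[List[int]] = [
    [0, 1, 2, 0],
    [1, 1, 0, 0],
    [2, 0, 2, 1],
]


def build_index(genotypes: List[List[int]]) -> Dict[int, Dict[int, int]]:
    """Build one bitmap per (sample, genotype code); bit v is set if variant v has that code."""
    n_samples = len(genotypes[0])
    index = {s: {0: 0, 1: 0, 2: 0} for s in range(n_samples)}
    for v, row in enumerate(genotypes):
        for s, code in enumerate(row):
            index[s][code] |= 1 << v
    return index


def variants_where_all(index: Dict[int, Dict[int, int]], samples: List[int], code: int) -> List[int]:
    """Return variants at which every listed sample carries `code`, by AND-ing bitmaps."""
    if not samples:
        return []
    bits = index[samples[0]][code]
    for s in samples[1:]:
        bits &= index[s][code]
    # Decode the set bits back into variant row numbers.
    return [v for v in range(bits.bit_length()) if (bits >> v) & 1]


if __name__ == "__main__":
    idx = build_index(GENOTYPES)
    print(variants_where_all(idx, [0, 2], 2))
```

A query over a group of samples then costs one bitwise AND per sample instead of a scan over every genotype record, which is the general reason sample-oriented indexes of this kind scale well with cohort size.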
Affiliation(s)
- Ryan M Layer
  - Department of Human Genetics, University of Utah, Salt Lake City, Utah, USA
- Neil Kindlon
  - Department of Human Genetics, University of Utah, Salt Lake City, Utah, USA
- Konrad J Karczewski
  - Analytical and Translational Genetics Unit, Harvard Medical School, Boston, Massachusetts, USA
- Aaron R Quinlan
  - Department of Human Genetics, University of Utah, Salt Lake City, Utah, USA
  - Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA
  - USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, Utah, USA
152
Jacquin L, Cao TV, Grenier C, Ahmadi N. DHOEM: a statistical simulation software for simulating new markers in real SNP marker data. BMC Bioinformatics 2015; 16:404. [PMID: 26634451 PMCID: PMC4669601 DOI: 10.1186/s12859-015-0830-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2015] [Accepted: 11/16/2015] [Indexed: 11/10/2022]
Abstract
Background Numerous simulation tools based on specific assumptions have been proposed to simulate populations. Here we present a simulation tool named DHOEM (densification of haplotypes by loess regression and maximum likelihood) which is free from population assumptions and simulates new markers in real SNP marker data. The main objective of DHOEM is to generate a new population, which incorporates real and simulated SNP by statistical learning from an initial population, which match the realized features of the latter. Results To demonstrate DHOEM’s abilities, we used a sample of 704 haplotypes for 12 chromosomes with 8336 SNP from a synthetic population, used for breeding upland rice in Latin America. The distributions of allele frequencies, pairwise SNP LD coefficients and data structures, before and after marker densification of the associated marker data set, were shown to be in relatively good agreement at moderate degrees of marker densification. DHOEM is a user-friendly tool that allows the user to specify the level of marker density desired, with a user defined minor allele frequency (MAF) limit, which is produced in a reasonable computation time. Conclusions DHOEM is a user-friendly and useful tool for simulation and methodological studies in quantitative genetics and breeding.
Affiliation(s)
- Laval Jacquin
  - CIRAD, UMR AGAP, Centre de Coopération Internationale en Recherche Agronomique pour le Développement, Avenue Agropolis, Montpellier Cedex 5, 34398, France
- Tuong-Vi Cao
  - CIRAD, UMR AGAP, Centre de Coopération Internationale en Recherche Agronomique pour le Développement, Avenue Agropolis, Montpellier Cedex 5, 34398, France
- Cécile Grenier
  - CIRAD, UMR AGAP, Centre de Coopération Internationale en Recherche Agronomique pour le Développement, Avenue Agropolis, Montpellier Cedex 5, 34398, France
- Nourollah Ahmadi
  - CIRAD, UMR AGAP, Centre de Coopération Internationale en Recherche Agronomique pour le Développement, Avenue Agropolis, Montpellier Cedex 5, 34398, France
153
Cui R, Schumer M, Rosenthal GG. Admix’em: a flexible framework for forward-time simulations of hybrid populations with selection and mate choice. Bioinformatics 2015; 32:1103-5. [DOI: 10.1093/bioinformatics/btv700] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 04/28/2015] [Accepted: 11/25/2015] [Indexed: 11/13/2022]
154
Singhal S, Leffler EM, Sannareddy K, Turner I, Venn O, Hooper DM, Strand AI, Li Q, Raney B, Balakrishnan CN, Griffith SC, McVean G, Przeworski M. Stable recombination hotspots in birds. Science 2015; 350:928-32. [PMID: 26586757 PMCID: PMC4864528 DOI: 10.1126/science.aad0843] [Citation(s) in RCA: 198] [Impact Index Per Article: 22.0] [Indexed: 12/23/2022]
Abstract
The DNA-binding protein PRDM9 has a critical role in specifying meiotic recombination hotspots in mice and apes, but it appears to be absent from other vertebrate species, including birds. To study the evolution and determinants of recombination in species lacking the gene that encodes PRDM9, we inferred fine-scale genetic maps from population resequencing data for two bird species: the zebra finch, Taeniopygia guttata, and the long-tailed finch, Poephila acuticauda. We found that both species have recombination hotspots, which are enriched near functional genomic elements. Unlike in mice and apes, most hotspots are shared between the two species, and their conservation seems to extend over tens of millions of years. These observations suggest that in the absence of PRDM9, recombination targets functional features that both enable access to the genome and constrain its evolution.
Affiliation(s)
- Sonal Singhal
  - Department of Biological Sciences, Columbia University, New York, NY 10027, USA
  - Department of Systems Biology, Columbia University, New York, NY 10032, USA
- Ellen M Leffler
  - Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
- Keerthi Sannareddy
  - Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
- Isaac Turner
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
- Oliver Venn
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
- Daniel M Hooper
  - Committee on Evolutionary Biology, University of Chicago, Chicago, IL 60637, USA
- Alva I Strand
  - Department of Biological Sciences, Columbia University, New York, NY 10027, USA
- Qiye Li
  - China National Genebank, BGI-Shenzhen, Shenzhen 518083, China
- Brian Raney
  - Center for Biomolecular Science and Engineering, University of California-Santa Cruz, Santa Cruz, CA 95064, USA
- Simon C Griffith
  - Department of Biological Sciences, Macquarie University, Sydney, NSW 2109, Australia
- Gil McVean
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
- Molly Przeworski
  - Department of Biological Sciences, Columbia University, New York, NY 10027, USA
  - Department of Systems Biology, Columbia University, New York, NY 10032, USA
155
Stram AH, Marjoram P, Chen GK. al3c: high-performance software for parameter inference using Approximate Bayesian Computation. Bioinformatics 2015; 31:3549-51. [PMID: 26142186 PMCID: PMC4626746 DOI: 10.1093/bioinformatics/btv393] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/06/2015] [Revised: 05/21/2015] [Accepted: 06/24/2015] [Indexed: 11/14/2022]
Abstract
MOTIVATION The development of Approximate Bayesian Computation (ABC) algorithms for parameter inference which are both computationally efficient and scalable in parallel computing environments is an important area of research. Monte Carlo rejection sampling, a fundamental component of ABC algorithms, is trivial to distribute over multiple processors but is inherently inefficient. While development of algorithms such as ABC Sequential Monte Carlo (ABC-SMC) help address the inherent inefficiencies of rejection sampling, such approaches are not as easily scaled on multiple processors. As a result, current Bayesian inference software offerings that use ABC-SMC lack the ability to scale in parallel computing environments. RESULTS We present al3c, a C++ framework for implementing ABC-SMC in parallel. By requiring only that users define essential functions such as the simulation model and prior distribution function, al3c abstracts the user from both the complexities of parallel programming and the details of the ABC-SMC algorithm. By using the al3c framework, the user is able to scale the ABC-SMC algorithm in parallel computing environments for his or her specific application, with minimal programming overhead. AVAILABILITY AND IMPLEMENTATION al3c is offered as a static binary for Linux and OS-X computing environments. The user completes an XML configuration file and C++ plug-in template for the specific application, which are used by al3c to obtain the desired results. Users can download the static binaries, source code, reference documentation and examples (including those in this article) by visiting https://github.com/ahstram/al3c. CONTACT astram@usc.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
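The rejection-sampling step that the abstract describes as easy to distribute but inefficient can be sketched in a few lines. This is a toy illustration under stated assumptions, not al3c's plug-in interface and not ABC-SMC: the Gaussian simulation model, the uniform prior on [-5, 5], the sample-mean summary statistic, and the names `simulate` and `abc_rejection` are all hypothetical choices made for the example.

```python
import random
from statistics import mean
from typing import List


def simulate(theta: float, n: int, rng: random.Random) -> List[float]:
    """Assumed simulation model: n draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def abc_rejection(observed: List[float], n_draws: int, eps: float,
                  rng: random.Random) -> List[float]:
    """Keep prior draws whose simulated summary lies within eps of the observed summary."""
    obs_summary = mean(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)            # draw from the assumed prior
        sim = simulate(theta, len(observed), rng)  # simulate data under theta
        if abs(mean(sim) - obs_summary) < eps:     # compare summary statistics
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    rng = random.Random(0)
    observed = simulate(2.0, 50, rng)
    posterior = abc_rejection(observed, 5000, 0.2, rng)
    print(len(posterior), mean(posterior))
```

Each iteration of the loop is independent of the others, which is why plain rejection sampling parallelizes trivially; most draws are discarded, which is the inefficiency that sequential schemes such as ABC-SMC aim to reduce.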
Affiliation(s)
- Paul Marjoram
  - Division of Biostatistics, Department of Preventive Medicine, USC, Los Angeles, CA 90033, USA
- Gary K Chen
  - Division of Biostatistics, Department of Preventive Medicine, USC, Los Angeles, CA 90033, USA
156
Browning SR, Browning BL. Accurate Non-parametric Estimation of Recent Effective Population Size from Segments of Identity by Descent. Am J Hum Genet 2015; 97:404-18. [PMID: 26299365 PMCID: PMC4564943 DOI: 10.1016/j.ajhg.2015.07.012] [Citation(s) in RCA: 182] [Impact Index Per Article: 20.2] [Received: 04/13/2015] [Accepted: 07/28/2015] [Indexed: 10/23/2022]
Abstract
Existing methods for estimating historical effective population size from genetic data have been unable to accurately estimate effective population size during the most recent past. We present a non-parametric method for accurately estimating recent effective population size by using inferred long segments of identity by descent (IBD). We found that inferred segments of IBD contain information about effective population size from around 4 generations to around 50 generations ago for SNP array data and to over 200 generations ago for sequence data. In human populations that we examined, the estimates of effective size were approximately one-third of the census size. We estimate the effective population size of European-ancestry individuals in the UK four generations ago to be eight million and the effective population size of Finland four generations ago to be 0.7 million. Our method is implemented in the open-source IBDNe software package.
Affiliation(s)
- Sharon R Browning
  - Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
- Brian L Browning
  - Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA 98195, USA
157
Frantz LAF, Schraiber JG, Madsen O, Megens HJ, Cagan A, Bosse M, Paudel Y, Crooijmans RPMA, Larson G, Groenen MAM. Evidence of long-term gene flow and selection during domestication from analyses of Eurasian wild and domestic pig genomes. Nat Genet 2015; 47:1141-8. [PMID: 26323058 DOI: 10.1038/ng.3394] [Citation(s) in RCA: 173] [Impact Index Per Article: 19.2] [Received: 11/21/2014] [Accepted: 08/10/2015] [Indexed: 12/18/2022]
Abstract
Traditionally, the process of domestication is assumed to be initiated by humans, involve few individuals and rely on reproductive isolation between wild and domestic forms. We analyzed pig domestication using over 100 genome sequences and tested whether pig domestication followed a traditional linear model or a more complex, reticulate model. We found that the assumptions of traditional models, such as reproductive isolation and strong domestication bottlenecks, are incompatible with the genetic data. In addition, our results show that, despite gene flow, the genomes of domestic pigs have strong signatures of selection at loci that affect behavior and morphology. We argue that recurrent selection for domestic traits likely counteracted the homogenizing effect of gene flow from wild boars and created 'islands of domestication' in the genome. Our results have major ramifications for the understanding of animal domestication and suggest that future studies should employ models that do not assume reproductive isolation.
Affiliation(s)
- Laurent A F Frantz
  - Animal Breeding and Genomics Group, Wageningen University, Wageningen, the Netherlands
  - Palaeogenomics and Bio-Archaeology Research Network, Research Laboratory for Archaeology and History of Art, University of Oxford, Oxford, UK
- Joshua G Schraiber
  - Department of Integrative Biology, University of California, Berkeley, Berkeley, California, USA
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
- Ole Madsen
  - Animal Breeding and Genomics Group, Wageningen University, Wageningen, the Netherlands
- Hendrik-Jan Megens
  - Animal Breeding and Genomics Group, Wageningen University, Wageningen, the Netherlands
- Alex Cagan
  - Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Mirte Bosse
  - Animal Breeding and Genomics Group, Wageningen University, Wageningen, the Netherlands
- Yogesh Paudel
  - Animal Breeding and Genomics Group, Wageningen University, Wageningen, the Netherlands
- Greger Larson
  - Palaeogenomics and Bio-Archaeology Research Network, Research Laboratory for Archaeology and History of Art, University of Oxford, Oxford, UK
- Martien A M Groenen
  - Animal Breeding and Genomics Group, Wageningen University, Wageningen, the Netherlands
158
van Dorp L, Balding D, Myers S, Pagani L, Tyler-Smith C, Bekele E, Tarekegn A, Thomas MG, Bradman N, Hellenthal G. Evidence for a Common Origin of Blacksmiths and Cultivators in the Ethiopian Ari within the Last 4500 Years: Lessons for Clustering-Based Inference. PLoS Genet 2015; 11:e1005397. [PMID: 26291793 PMCID: PMC4546361 DOI: 10.1371/journal.pgen.1005397] [Citation(s) in RCA: 49] [Impact Index Per Article: 5.4] [Received: 12/16/2014] [Accepted: 06/26/2015] [Indexed: 01/02/2023]
Abstract
The Ari peoples of Ethiopia are comprised of different occupational groups that can be distinguished genetically, with Ari Cultivators and the socially marginalised Ari Blacksmiths recently shown to have a similar level of genetic differentiation between them (FST ≈ 0.023 − 0.04) as that observed among multiple ethnic groups sampled throughout Ethiopia. Anthropologists have proposed two competing theories to explain the origins of the Ari Blacksmiths as (i) remnants of a population that inhabited Ethiopia prior to the arrival of agriculturists (e.g. Cultivators), or (ii) relatively recently related to the Cultivators but presently marginalized in the community due to their trade. Two recent studies by different groups analysed genome-wide DNA from samples of Ari Blacksmiths and Cultivators and suggested that genetic patterns between the two groups were more consistent with model (i) and subsequent assimilation of the indigenous peoples into the expanding agriculturalist community. We analysed the same samples using approaches designed to attenuate signals of genetic differentiation that are attributable to allelic drift within a population. By doing so, we provide evidence that the genetic differences between Ari Blacksmiths and Cultivators can be entirely explained by bottleneck effects consistent with hypothesis (ii). This finding serves as both a cautionary tale about interpreting results from unsupervised clustering algorithms, and suggests that social constructions are contributing directly to genetic differentiation over a relatively short time period among previously genetically similar groups.

While it is widely recognized that DNA patterns vary across world-wide human populations, the primary features that drive these differences are less well understood. As an example, the Ari peoples of Ethiopia are presently socially divided according to occupation, with Ari Blacksmiths marginalised relative to Ari Cultivators. Two competing theories proposed by anthropologists to explain the existence of these occupational groupings suggest very different histories: (i) the Cultivators reflect migrants who moved into the region occupied by ancestors of the Blacksmiths perhaps many thousands of years ago, versus (ii) the Blacksmiths and Cultivators comprised the same ancestral group before the former was marginalised due solely to their trade. Recent genetic studies showed that Blacksmiths and Cultivators are distinguishable by their DNA, and suggested that overall DNA patterns among the two groups were consistent with (i). However, we demonstrate here that interpreting the results of currently popular algorithms that compare DNA is not always straightforward. Instead we use a variety of analyses to show that (ii) seems a more likely explanation, perhaps illustrating how social marginalisation can lead to groups becoming genetically distinguishable over a relatively short time period.
Affiliation(s)
- Lucy van Dorp
  - University College London Genetics Institute (UGI), University College London, London, United Kingdom
  - Centre for Mathematics and Physics in the Life Sciences and EXperimental Biology (CoMPLEX), University College London, London, United Kingdom
- David Balding
  - University College London Genetics Institute (UGI), University College London, London, United Kingdom
  - Schools of BioSciences and of Mathematics & Statistics, University of Melbourne, Melbourne, Australia
- Simon Myers
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
- Luca Pagani
  - The Wellcome Trust Sanger Institute, Hinxton, United Kingdom
  - Department of Archaeology and Anthropology, University of Cambridge, Cambridge, United Kingdom
- Mark G. Thomas
  - Research Department of Genetics, Evolution and Environment, University College London, London, United Kingdom
- Garrett Hellenthal
  - University College London Genetics Institute (UGI), University College London, London, United Kingdom
159
Gorjanc G, Bijma P, Hickey JM. Reliability of pedigree-based and genomic evaluations in selected populations. Genet Sel Evol 2015; 47:65. [PMID: 26271246 PMCID: PMC4536753 DOI: 10.1186/s12711-015-0145-1] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 08/31/2014] [Accepted: 07/29/2015] [Indexed: 11/14/2022]
Abstract
Background Reliability is an important parameter in breeding. It measures the precision of estimated breeding values (EBV) and, thus, potential response to selection on those EBV. The precision of EBV is commonly measured by relating the prediction error variance (PEV) of EBV to the base population additive genetic variance (base PEV reliability), while the potential for response to selection is commonly measured by the squared correlation between the EBV and breeding values (BV) on selection candidates (reliability of selection). While these two measures are equivalent for unselected populations, they are not equivalent for selected populations. The aim of this study was to quantify the effect of selection on these two measures of reliability and to show how this affects comparison of breeding programs using pedigree-based or genomic evaluations. Methods Two scenarios with random and best linear unbiased prediction (BLUP) selection were simulated, where the EBV of selection candidates were estimated using only pedigree, pedigree and phenotype, genome-wide marker genotypes and phenotype, or only genome-wide marker genotypes. The base PEV reliabilities of these EBV were compared to the corresponding reliabilities of selection. Realized genetic selection intensity was evaluated to quantify the potential of selection on the different types of EBV and, thus, to validate differences in reliabilities. Finally, the contribution of different underlying processes to changes in additive genetic variance and reliabilities was quantified. Results The simulations showed that, for selected populations, the base PEV reliability substantially overestimates the reliability of selection of EBV that are mainly based on old information from the parental generation, as is the case with pedigree-based prediction. 
Selection on such EBV gave very low realized genetic selection intensities, confirming the overestimation and importance of genotyping both male and female selection candidates. The two measures of reliability matched when the reductions in additive genetic variance due to the Bulmer effect, selection, and inbreeding were taken into account. Conclusions For populations under selection, EBV based on genome-wide information are more valuable than suggested by the comparison of the base PEV reliabilities between the different types of EBV. This implies that genome-wide marker information is undervalued for selected populations and that genotyping un-phenotyped female selection candidates should be reconsidered.
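The two reliability measures contrasted in this abstract can be written out as follows. The notation is an assumption introduced here for illustration ($\hat{a}$ for the EBV, $a$ for the true breeding value, $\sigma^2_{a,\text{base}}$ for the base-population additive genetic variance), and the last step relies on the standard BLUP property $\operatorname{Cov}(\hat{a}, a) = \operatorname{Var}(\hat{a})$, which the abstract itself does not state.

```latex
% Base PEV reliability: prediction error variance relative to the base-population variance
r^2_{\text{PEV}} = 1 - \frac{\operatorname{PEV}(\hat{a})}{\sigma^2_{a,\text{base}}},
\qquad \operatorname{PEV}(\hat{a}) = \operatorname{Var}(a - \hat{a})

% Reliability of selection: squared correlation among the selection candidates
r^2_{\text{sel}} = \operatorname{cor}^2(\hat{a}, a)
  = \frac{\operatorname{Cov}(\hat{a}, a)^2}{\operatorname{Var}(\hat{a})\,\operatorname{Var}(a)}
  = 1 - \frac{\operatorname{PEV}(\hat{a})}{\operatorname{Var}(a)}
```

Under these assumptions the two expressions coincide when $\operatorname{Var}(a) = \sigma^2_{a,\text{base}}$, as in an unselected population. When selection and inbreeding reduce $\operatorname{Var}(a)$ among candidates below the base variance, $r^2_{\text{PEV}}$ exceeds $r^2_{\text{sel}}$, which is consistent with the overestimation the abstract reports.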
Affiliation(s)
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- Piter Bijma
  - Wageningen University, Animal Breeding and Genomics Centre, Wageningen, The Netherlands
- John M Hickey
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
160
Abstract
Sophisticated inferential tools coupled with the coalescent model have recently emerged for estimating past population sizes from genomic data. Recent methods that model recombination require small sample sizes, make constraining assumptions about population size changes, and do not report measures of uncertainty for estimates. Here, we develop a Gaussian process-based Bayesian nonparametric method coupled with a sequentially Markov coalescent model that allows accurate inference of population sizes over time from a set of genealogies. In contrast to current methods, our approach considers a broad class of recombination events, including those that do not change local genealogies. We show that our method outperforms recent likelihood-based methods that rely on discretization of the parameter space. We illustrate the application of our method to multiple demographic histories, including population bottlenecks and exponential growth. In simulation, our Bayesian approach produces point estimates four times more accurate than maximum-likelihood estimation (based on the sum of absolute differences between the truth and the estimated values). Further, our method's credible intervals for population size as a function of time cover 90% of true values across multiple demographic scenarios, enabling formal hypothesis testing about population size differences over time. Using genealogies estimated with ARGweaver, we apply our method to European and Yoruban samples from the 1000 Genomes Project and confirm key known aspects of population size history over the past 150,000 years.
161
Schumer M, Cui R, Rosenthal GG, Andolfatto P. simMSG: an experimental design tool for high-throughput genotyping of hybrids. Mol Ecol Resour 2015; 16:183-92. [PMID: 26032857 DOI: 10.1111/1755-0998.12434] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 03/22/2015] [Revised: 05/19/2015] [Accepted: 05/22/2015] [Indexed: 11/30/2022]
Abstract
Hybridization between closely related species, whether naturally occurring or laboratory generated, is a useful tool for mapping the genetic basis of the phenotypic traits that distinguish species. The development of next-generation sequencing techniques has greatly improved our ability to assign ancestry to hybrid genomes. One such next-generation sequencing technique, multiplexed shotgun genotyping (or MSG), can be a powerful tool for genotyping hybrids. However, it is difficult a priori to predict the accuracy of MSG in natural hybrids because accuracy depends on ancestry tract length and number of ancestry informative markers. Here, we present a simulator, 'simMSG', that will allow researchers to design MSG experiments and show that in many cases MSG can accurately assign ancestry to hundreds of thousands of sites in the genomes of natural hybrids. The simMSG tool can be used to design experiments for diverse applications including QTL mapping, genotyping introgressed lines or admixture mapping.
Affiliation(s)
- Molly Schumer
  - Department of Ecology and Evolutionary Biology, Princeton University, Princeton, NJ, 08544, USA
  - Centro de Investigaciones Científicas de las Huastecas 'Aguazarca', Calnali, Hidalgo, Mexico
- Rongfeng Cui
  - Centro de Investigaciones Científicas de las Huastecas 'Aguazarca', Calnali, Hidalgo, Mexico
  - Department of Biology, Texas A&M University, TAMU, College Station, TX, USA
  - Max Planck Institute for the Biology of Aging, Cologne, Germany
- Gil G Rosenthal
  - Centro de Investigaciones Científicas de las Huastecas 'Aguazarca', Calnali, Hidalgo, Mexico
  - Department of Biology, Texas A&M University, TAMU, College Station, TX, USA
- Peter Andolfatto
  - Department of Ecology and Evolutionary Biology, Princeton University, Princeton, NJ, 08544, USA
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, 08544, USA
162
Jenko J, Gorjanc G, Cleveland MA, Varshney RK, Whitelaw CBA, Woolliams JA, Hickey JM. Potential of promotion of alleles by genome editing to improve quantitative traits in livestock breeding programs. Genet Sel Evol 2015; 47:55. [PMID: 26133579 PMCID: PMC4487592 DOI: 10.1186/s12711-015-0135-3] [Citation(s) in RCA: 71] [Impact Index Per Article: 7.9] [Received: 11/25/2014] [Accepted: 06/15/2015] [Indexed: 12/29/2022]
Abstract
Background Genome editing (GE) is a method that enables specific nucleotides in the genome of an individual to be changed. To date, use of GE in livestock has focussed on simple traits that are controlled by a few quantitative trait nucleotides (QTN) with large effects. The aim of this study was to evaluate the potential of GE to improve quantitative traits that are controlled by many QTN, referred to here as promotion of alleles by genome editing (PAGE). Methods Multiple scenarios were simulated to test alternative PAGE strategies for a quantitative trait. They differed in (i) the number of edits per sire (0 to 100), (ii) the number of edits per generation (0 to 500), and (iii) the extent of use of PAGE (i.e. editing all sires or only a proportion of them). The base line scenario involved selecting individuals on true breeding values (i.e., genomic selection only (GS only)-genomic selection with perfect accuracy) for several generations. Alternative scenarios complemented this base line scenario with PAGE (GS + PAGE). The effect of different PAGE strategies was quantified by comparing response to selection, changes in allele frequencies, the number of distinct QTN edited, the sum of absolute effects of the edited QTN per generation, and inbreeding. Results Response to selection after 20 generations was between 1.08 and 4.12 times higher with GS + PAGE than with GS only. Increases in response to selection were larger with more edits per sire and more sires edited. When the total resources for PAGE were limited, editing a few sires for many QTN resulted in greater response to selection and inbreeding compared to editing many sires for a few QTN. Between the scenarios GS only and GS + PAGE, there was little difference in the average change in QTN allele frequencies, but there was a major difference for the QTN with the largest effects. The sum of the effects of the edited QTN decreased across generations. 
Conclusions This study showed that PAGE has great potential for application in livestock breeding programs, but inbreeding needs to be managed.
Affiliation(s)
- Janez Jenko
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- Matthew A Cleveland
  - Genus plc., 100 Bluegrass Commons Blvd., Suite 2200, Hendersonville, TN, 37075, USA
- Rajeev K Varshney
  - International Crop Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India
- C Bruce A Whitelaw
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- John A Woolliams
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
- John M Hickey
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK
163
Pérez-Enciso M, Rincón JC, Legarra A. Sequence- vs. chip-assisted genomic selection: accurate biological information is advised. Genet Sel Evol 2015; 47:43. [PMID: 25956961 PMCID: PMC4424891 DOI: 10.1186/s12711-015-0117-5] [Citation(s) in RCA: 88] [Impact Index Per Article: 9.8] [Received: 10/15/2014] [Accepted: 03/31/2015] [Indexed: 12/29/2022]
Abstract
Background: The development of next-generation sequencing (NGS) technologies has made the use of whole-genome sequence data for routine genetic evaluations possible, which has triggered considerable interest in the animal and plant breeding fields. Here, we investigated whether complete or partial sequence data can improve upon existing SNP (single nucleotide polymorphism) array-based selection strategies by simulation using a mixed coalescence and gene-dropping approach. Results: We simulated 20 or 100 causal mutations (quantitative trait nucleotides, QTN) within 65 predefined ‘gene’ regions, each 10 kb long, within a genome composed of ten 3-Mb chromosomes. We compared prediction accuracy by cross-validation using a medium-density chip (7.5 k SNPs), a high-density (HD, 17 k) chip and sequence data (335 k). Genetic evaluation was based on a GBLUP method. The simulations showed: (1) a law of diminishing returns with increasing number of SNPs; (2) a modest effect of SNP ascertainment bias in arrays; (3) a small advantage of using whole-genome sequence data vs. HD arrays, i.e. ~4%; (4) a minor effect of NGS errors except when imputation error rates are high (≥20%); and (5) if QTN were known, prediction accuracy approached 1. Since this is obviously unrealistic, we explored milder assumptions. We showed that, if all SNPs within causal genes were included in the prediction model, accuracy could also dramatically increase by ~40%. However, this criterion was highly sensitive to either misspecification (including wrong genes) or to the use of an incomplete gene list; in these cases, accuracy fell rapidly towards that reached when all SNPs from sequence data were blindly included in the model. Conclusions: Our study shows that, unless an accurate prior estimate on the functionality of SNPs can be included in the predictor, there is a law of diminishing returns with increasing SNP density. As a result, use of whole-genome sequence data may not result in a highly increased selection response over high-density genotyping.
Affiliation(s)
- Miguel Pérez-Enciso
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, 08193, Bellaterra, Barcelona, Spain.
- Departament de Ciència Animal i dels Aliments, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain.
- Institut Català de Recerca i Estudis Avançats (ICREA), Carrer de Lluís Companys 23, Barcelona, 08010, Spain.
| | - Juan C Rincón
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, 08193, Bellaterra, Barcelona, Spain.
- Universidad Nacional de Colombia, Sede Medellín, Facultad de Ciencias Agrarias, Departamento de Producción Animal, Medellín, Colombia.
| | - Andrés Legarra
- INRA, UMR 1388 GENPHYSE, Génétique, Physiologie et Systèmes d'Elevage, Castanet-Tolosan, 31326, France.
| |
|
164
|
Bianco E, Soto HW, Vargas L, Pérez-Enciso M. The chimerical genome of Isla del Coco feral pigs (Costa Rica), an isolated population since 1793 but with remarkable levels of diversity. Mol Ecol 2015; 24:2364-78. [DOI: 10.1111/mec.13182] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 01/30/2015] [Revised: 03/18/2015] [Accepted: 03/24/2015] [Indexed: 01/27/2023]
Affiliation(s)
- E. Bianco
- Centre for Research in Agricultural Genomics (CRAG); CSIC-IRTA-UAB-UB Consortium; 08193 Bellaterra Spain
- Department of Animal Science; Universitat Autònoma de Barcelona; 08193 Bellaterra Spain
| | - H. W. Soto
- Escuela de Zootecnia; Universidad de Costa Rica; 10501 San José Costa Rica
| | - L. Vargas
- Sistema Nacional de Áreas de Conservación (SINAC); Ministerio de Ambiente y Energía (MINAE); Avenida 15, Calle 1, San José Costa Rica
| | - M. Pérez-Enciso
- Centre for Research in Agricultural Genomics (CRAG); CSIC-IRTA-UAB-UB Consortium; 08193 Bellaterra Spain
- Department of Animal Science; Universitat Autònoma de Barcelona; 08193 Bellaterra Spain
- Institut Català de Recerca I Estudis Avançats (ICREA); Carrer de Lluís Companys 23 Barcelona 08010 Spain
| |
|
165
|
Yunusbayev B, Metspalu M, Metspalu E, Valeev A, Litvinov S, Valiev R, Akhmetova V, Balanovska E, Balanovsky O, Turdikulova S, Dalimova D, Nymadawa P, Bahmanimehr A, Sahakyan H, Tambets K, Fedorova S, Barashkov N, Khidiyatova I, Mihailov E, Khusainova R, Damba L, Derenko M, Malyarchuk B, Osipova L, Voevoda M, Yepiskoposyan L, Kivisild T, Khusnutdinova E, Villems R. The genetic legacy of the expansion of Turkic-speaking nomads across Eurasia. PLoS Genet 2015; 11:e1005068. [PMID: 25898006 PMCID: PMC4405460 DOI: 10.1371/journal.pgen.1005068] [Citation(s) in RCA: 104] [Impact Index Per Article: 11.6] [Received: 11/28/2013] [Accepted: 02/11/2015] [Indexed: 12/28/2022] Open
Abstract
The Turkic peoples represent a diverse collection of ethnic groups defined by the Turkic languages. These groups have dispersed across a vast area, including Siberia, Northwest China, Central Asia, East Europe, the Caucasus, Anatolia, the Middle East, and Afghanistan. The origin and early dispersal history of the Turkic peoples is disputed, with candidates for their ancient homeland ranging from the Transcaspian steppe to Manchuria in Northeast Asia. Previous genetic studies have not identified a clear-cut unifying genetic signal for the Turkic peoples, which lends support for language replacement rather than demic diffusion as the model for the Turkic language’s expansion. We addressed the genetic origin of 373 individuals from 22 Turkic-speaking populations, representing their current geographic range, by analyzing genome-wide high-density genotype data. In agreement with the elite dominance model of language expansion, most of the Turkic peoples studied genetically resemble their geographic neighbors. However, western Turkic peoples sampled across West Eurasia shared an excess of long chromosomal tracts that are identical by descent (IBD) with populations from present-day South Siberia and Mongolia (SSM), an area where historians center a series of early Turkic and non-Turkic steppe polities. While SSM-matching IBD tracts (> 1 cM) are also observed in non-Turkic populations, Turkic peoples demonstrate a higher percentage of such tracts (p-values ≤ 0.01) compared to their non-Turkic neighbors. Finally, we used the ALDER method and inferred admixture dates (~9th–17th centuries) that overlap with the Turkic migrations of the 5th–16th centuries. Thus, our results indicate historical admixture among Turkic peoples, and the recent shared ancestry with modern populations in SSM supports one of the hypothesized homelands for their nomadic Turkic and related Mongolic ancestors.
Centuries of nomadic migrations have ultimately resulted in the distribution of Turkic languages over a large area ranging from Siberia, across Central Asia to Eastern Europe and the Middle East. Despite the profound cultural impact left by these nomadic peoples, little is known about their prehistoric origins. Moreover, because contemporary Turkic speakers tend to genetically resemble their geographic neighbors, it is not clear whether their nomadic ancestors left an identifiable genetic trace. In this study, we show that Turkic-speaking peoples sampled across the Middle East, Caucasus, East Europe, and Central Asia share varying proportions of Asian ancestry that originate in a single area, southern Siberia and Mongolia. Mongolic- and Turkic-speaking populations from this area bear an unusually high number of long chromosomal tracts that are identical by descent with Turkic peoples from across west Eurasia. Admixture induced linkage disequilibrium decay across chromosomes in these populations indicates that admixture occurred during the 9th–17th centuries, in agreement with the historically recorded Turkic nomadic migrations and later Mongol expansion. Thus, our findings reveal genetic traces of recent large-scale nomadic migrations and map their source to a previously hypothesized area of Mongolia and southern Siberia.
Affiliation(s)
- Bayazit Yunusbayev
- Evolutionary Biology group, Estonian Biocentre, Tartu, Estonia
- Institute of Biochemistry and Genetics, Ufa Research Centre, RAS, Ufa, Bashkortostan, Russia
| | - Mait Metspalu
- Evolutionary Biology group, Estonian Biocentre, Tartu, Estonia
- Department of Evolutionary Biology, University of Tartu, Tartu, Estonia
- Department of Integrative Biology, University of California Berkeley, Berkeley, California, United States of America
| | - Ene Metspalu
- Department of Evolutionary Biology, University of Tartu, Tartu, Estonia
| | - Albert Valeev
- Institute of Biochemistry and Genetics, Ufa Research Centre, RAS, Ufa, Bashkortostan, Russia
| | - Sergei Litvinov
- Evolutionary Biology group, Estonian Biocentre, Tartu, Estonia
- Institute of Biochemistry and Genetics, Ufa Research Centre, RAS, Ufa, Bashkortostan, Russia
| | - Ruslan Valiev
- Department of Genetics and Fundamental Medicine, Bashkir State University, Ufa, Bashkortostan, Russia
| | - Vita Akhmetova
- Institute of Biochemistry and Genetics, Ufa Research Centre, RAS, Ufa, Bashkortostan, Russia
| | | | - Oleg Balanovsky
- Research Centre for Medical Genetics, RAMS, Moscow, Russia
- Vavilov Institute for General Genetics, RAS, Moscow, Russia
| | - Shahlo Turdikulova
- Laboratory of Genomics, Institute of Bioorganic Chemistry, Academy of Sciences Republic of Uzbekistan, Tashkent, Uzbekistan
| | - Dilbar Dalimova
- Laboratory of Genomics, Institute of Bioorganic Chemistry, Academy of Sciences Republic of Uzbekistan, Tashkent, Uzbekistan
| | | | - Ardeshir Bahmanimehr
- Department of Medical Genetics, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Hovhannes Sahakyan
- Evolutionary Biology group, Estonian Biocentre, Tartu, Estonia
- Laboratory of Ethnogenomics, Institute of Molecular Biology, Academy of Sciences of Armenia, Yerevan, Armenia
| | | | - Sardana Fedorova
- Laboratory of Molecular Genetics, Yakut Research Center of Complex Medical Problems, Yakutsk, Sakha Republic, Russia
- Laboratory of Molecular Biology, North-Eastern Federal University, Yakutsk, Sakha Republic, Russia
| | - Nikolay Barashkov
- Laboratory of Molecular Genetics, Yakut Research Center of Complex Medical Problems, Yakutsk, Sakha Republic, Russia
- Laboratory of Molecular Biology, North-Eastern Federal University, Yakutsk, Sakha Republic, Russia
| | - Irina Khidiyatova
- Institute of Biochemistry and Genetics, Ufa Research Centre, RAS, Ufa, Bashkortostan, Russia
- Department of Genetics and Fundamental Medicine, Bashkir State University, Ufa, Bashkortostan, Russia
| | - Evelin Mihailov
- Estonian Genome Center, University of Tartu, Tartu, Estonia
- Gene Technology Workgroup, Estonian Biocentre, Tartu, Estonia
| | - Rita Khusainova
- Institute of Biochemistry and Genetics, Ufa Research Centre, RAS, Ufa, Bashkortostan, Russia
- Department of Genetics and Fundamental Medicine, Bashkir State University, Ufa, Bashkortostan, Russia
| | - Larisa Damba
- Institute of Internal Medicine, SB RAMS, Novosibirsk, Russia
| | | | | | - Ludmila Osipova
- Institute of Cytology and Genetics, SB RAS, Novosibirsk, Russia
| | - Mikhail Voevoda
- Institute of Internal Medicine, SB RAMS, Novosibirsk, Russia
- Institute of Cytology and Genetics, SB RAS, Novosibirsk, Russia
| | - Levon Yepiskoposyan
- Laboratory of Ethnogenomics, Institute of Molecular Biology, Academy of Sciences of Armenia, Yerevan, Armenia
| | - Toomas Kivisild
- Division of Biological Anthropology, University of Cambridge, Cambridge, United Kingdom
| | - Elza Khusnutdinova
- Institute of Biochemistry and Genetics, Ufa Research Centre, RAS, Ufa, Bashkortostan, Russia
- Department of Genetics and Fundamental Medicine, Bashkir State University, Ufa, Bashkortostan, Russia
| | - Richard Villems
- Evolutionary Biology group, Estonian Biocentre, Tartu, Estonia
- Department of Evolutionary Biology, University of Tartu, Tartu, Estonia
- Estonian Academy of Sciences, Tallinn, Estonia
| |
|
166
|
Reconstructing Past Admixture Processes from Local Genomic Ancestry Using Wavelet Transformation. Genetics 2015; 200:469-81. [PMID: 25852078 PMCID: PMC4492373 DOI: 10.1534/genetics.115.176842] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 10/29/2014] [Accepted: 04/03/2015] [Indexed: 11/18/2022] Open
Abstract
Admixture between long-separated populations is a defining feature of the genomes of many species. The mosaic block structure of admixed genomes can provide information about past contact events, including the time and extent of admixture. Here, we describe an improved wavelet-based technique that better characterizes ancestry block structure from observed genomic patterns. Principal components analysis is first applied to genomic data to identify the primary population structure, followed by wavelet decomposition to develop a new characterization of local ancestry information along the chromosomes. For testing purposes, this method is applied to human genome-wide genotype data from Indonesia, as well as virtual genetic data generated using genome-scale sequential coalescent simulations under a wide range of admixture scenarios. Time of admixture is inferred using an approximate Bayesian computation framework, providing robust estimates of both admixture times and their associated levels of uncertainty. Crucially, we demonstrate that this revised wavelet approach, which we have released as the R package adwave, provides improved statistical power over existing wavelet-based techniques and can be used to address a broad range of admixture questions.
|
167
|
Exploring population size changes using SNP frequency spectra. Nat Genet 2015; 47:555-9. [PMID: 25848749 PMCID: PMC4414822 DOI: 10.1038/ng.3254] [Citation(s) in RCA: 246] [Impact Index Per Article: 27.3] [Received: 11/03/2014] [Accepted: 02/26/2015] [Indexed: 02/05/2023]
Abstract
Inferring demographic history is an important task in population genetics. Many existing inference methods are based on pre-defined simplified population models, which are more suitable for hypothesis testing than for exploratory analysis. We developed a novel model-flexible method called stairway plot, which infers population size changes over time using SNP frequency spectra. This method is applicable for whole-genome sequences of hundreds of individuals. Using extensive simulation we demonstrated the usefulness of the method for inferring demographic history, especially recent population size changes. The method was applied to the whole genome sequence data of nine populations from the 1000 Genomes Project, and showed a pattern of human population fluctuations from 10 to 200 thousand years ago.
|
168
|
Cheng JY, Mailund T. Ancestral population genomics using coalescence hidden Markov models and heuristic optimisation algorithms. Comput Biol Chem 2015; 57:80-92. [PMID: 25819138 DOI: 10.1016/j.compbiolchem.2015.02.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2015] [Accepted: 02/02/2015] [Indexed: 10/23/2022]
Abstract
With full genome data from several closely related species now readily available, we have the ultimate data for demographic inference. Exploiting these full genomes, however, requires models that can explicitly model recombination along alignments of full chromosomal length. Over the last decade a class of models, based on the sequential Markov coalescence model combined with hidden Markov models, has been developed and used to make inference in simple demographic scenarios. To move forward to more complex demographic modelling we need better and more automated ways of specifying these models and efficient optimisation algorithms for inferring the parameters in complex and often high-dimensional models. In this paper we present a framework for building such coalescence hidden Markov models for pairwise alignments and present results for using heuristic optimisation algorithms for parameter estimation. We show that we can build more complex demographic models than our previous frameworks and that we obtain more accurate parameter estimates using heuristic optimisation algorithms than when using our previous gradient based approaches. Our new framework provides a flexible way of constructing coalescence hidden Markov models almost automatically. While estimating parameters in more complex models is still challenging we show that using heuristic optimisation algorithms we still get a fairly good accuracy.
Affiliation(s)
- Jade Yu Cheng
- Bioinformatics Research Centre, Aarhus University, C.F. Møllers Allé 8, 8000 Aarhus, Denmark.
| | - Thomas Mailund
- Bioinformatics Research Centre, Aarhus University, C.F. Møllers Allé 8, 8000 Aarhus, Denmark.
| |
|
169
|
Bianco E, Nevado B, Ramos-Onsins SE, Pérez-Enciso M. A deep catalog of autosomal single nucleotide variation in the pig. PLoS One 2015; 10:e0118867. [PMID: 25789620 PMCID: PMC4366260 DOI: 10.1371/journal.pone.0118867] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 10/20/2014] [Accepted: 12/27/2014] [Indexed: 12/31/2022] Open
Abstract
A comprehensive catalog of variability in a given species is useful for many important purposes, e.g., designing high-density arrays or pinpointing potential mutations of economic or physiological interest. Here we provide a genomewide, worldwide catalog of single nucleotide variants by simultaneously analyzing the shotgun sequence of 128 pigs and five suid outgroups. Despite the high SNP missing rate of some individuals (up to 88%), we retrieved over 48 million high-quality variants. Of them, we were able to assess the ancestral allele of more than 39M biallelic SNPs. We found SNPs in 21,455 out of the 25,322 annotated genes in pig assembly 10.2. The annotation showed that more than 40% of the variants were novel variants, not present in dbSNP. Surprisingly, we found a large variability in transition/transversion rate along the genome, which is very well explained (R² = 0.79) primarily by genome differences in CpG content and recombination rate. The number of SNPs per window also varied but was less dependent on known factors such as gene density, missing rate or recombination (R² = 0.48). When we divided the samples into four groups, Asian wild boar (ASWB), Asian domestics (ASDM), European wild boar (EUWB) and European domestics (EUDM), we found a marked correlation in allele frequencies between domestics and wild boars within Asia and within Europe, but not across continents, due to the large evolutionary distance between pigs of both continents (~1.2 MYA). In general, the porcine species showed a small percentage of SNPs exclusive to each population group. EUWB and EUDM were predicted to harbor a larger fraction of potentially deleterious mutations, according to the SIFT algorithm, than Asian samples, perhaps a result of background selection being less effective due to a lower effective population size in Europe.
Affiliation(s)
- Erica Bianco
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, Bellaterra, Spain
- Universitat Autònoma de Barcelona, Department of Animal Science, Bellaterra, Spain
| | - Bruno Nevado
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, Bellaterra, Spain
- Universitat Autònoma de Barcelona, Department of Animal Science, Bellaterra, Spain
| | | | - Miguel Pérez-Enciso
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB Consortium, Bellaterra, Spain
- Universitat Autònoma de Barcelona, Department of Animal Science, Bellaterra, Spain
- Institut Català de Recerca I Estudis Avançats (ICREA), Carrer de Lluís Companys 23, Barcelona, Spain
| |
|
170
|
The SMC' is a highly accurate approximation to the ancestral recombination graph. Genetics 2015; 200:343-55. [PMID: 25786855 DOI: 10.1534/genetics.114.173898] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 12/17/2014] [Accepted: 03/12/2015] [Indexed: 11/18/2022] Open
Abstract
Two sequentially Markov coalescent models (SMC and SMC') are available as tractable approximations to the ancestral recombination graph (ARG). We present a Markov process describing coalescence at two fixed points along a pair of sequences evolving under the SMC'. Using our Markov process, we derive a number of new quantities related to the pairwise SMC', thereby analytically quantifying for the first time the similarity between the SMC' and the ARG. We use our process to show that the joint distribution of pairwise coalescence times at recombination sites under the SMC' is the same as it is marginally under the ARG, which demonstrates that the SMC' is, in a particular well-defined, intuitive sense, the most appropriate first-order sequentially Markov approximation to the ARG. Finally, we use these results to show that population size estimates under the pairwise SMC are asymptotically biased, while under the pairwise SMC' they are approximately asymptotically unbiased.
|
171
|
Gorjanc G, Cleveland MA, Houston RD, Hickey JM. Potential of genotyping-by-sequencing for genomic selection in livestock populations. Genet Sel Evol 2015; 47:12. [PMID: 25887531 PMCID: PMC4344748 DOI: 10.1186/s12711-015-0102-z] [Citation(s) in RCA: 67] [Impact Index Per Article: 7.4] [Received: 05/11/2014] [Accepted: 01/29/2015] [Indexed: 12/12/2022] Open
Abstract
Background: Next-generation sequencing techniques, such as genotyping-by-sequencing (GBS), provide alternatives to single nucleotide polymorphism (SNP) arrays. The aim of this work was to evaluate the potential of GBS compared to SNP array genotyping for genomic selection in livestock populations. Methods: The value of GBS was quantified by simulation analyses in which three parameters were varied: (i) genome-wide sequence read depth (x) per individual from 0.01x to 20x or using SNP array genotyping; (ii) number of genotyped markers from 3000 to 300 000; and (iii) size of training and prediction sets from 500 to 50 000 individuals. The latter was achieved by distributing the total available x of 1000x, 5000x, or 10 000x per genotyped locus among the varying number of individuals. With SNP arrays, genotypes were called from sequence data directly. With GBS, genotypes were called from sequence reads that varied between loci and individuals according to a Poisson distribution with mean equal to x. Simulated data were analyzed with ridge regression and the accuracy and bias of genomic predictions and response to selection were quantified under the different scenarios. Results: Accuracies of genomic predictions using GBS data or SNP array data were comparable when large numbers of markers were used and x per individual was ~1x or higher. The bias of genomic predictions was very high at a very low x. When the total available x was distributed among the training individuals, the accuracy of prediction was maximized when a large number of individuals was used that had GBS data with low x for a large number of markers. Similarly, response to selection was maximized under the same conditions due to increasing both accuracy and selection intensity. Conclusions: GBS offers great potential for developing genomic selection in livestock populations because it makes it possible to cover large fractions of the genome and to vary the sequence read depth per individual. Thus, the accuracy of predictions is improved by increasing the size of training populations and the intensity of selection is increased by genotyping a larger number of selection candidates. Electronic supplementary material: The online version of this article (doi:10.1186/s12711-015-0102-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Gregor Gorjanc
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK.
| | - Matthew A Cleveland
- Genus Plc, 100 Bluegrass Commons Blvd., Suite 2200, Hendersonville, TN, 37075, USA.
| | - Ross D Houston
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK.
| | - John M Hickey
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK.
| |
|
172
|
Staab PR, Zhu S, Metzler D, Lunter G. scrm: efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics 2015; 31:1680-2. [PMID: 25596205 PMCID: PMC4426833 DOI: 10.1093/bioinformatics/btu861] [Citation(s) in RCA: 90] [Impact Index Per Article: 10.0] [Received: 09/28/2014] [Accepted: 12/23/2014] [Indexed: 11/13/2022]
Abstract
Motivation: Coalescent-based simulation software for genomic sequences allows the efficient in silico generation of short- and medium-sized genetic sequences. However, the simulation of genome-size datasets as produced by next-generation sequencing is currently only possible using fairly crude approximations. Results: We present the sequential coalescent with recombination model (SCRM), a new method that efficiently and accurately approximates the coalescent with recombination, closing the gap between current approximations and the exact model. We present an efficient implementation and show that it can simulate genomic-scale datasets with an essentially correct linkage structure. Availability and implementation: The open source implementation scrm is freely available at https://scrm.github.io under the conditions of the GPLv3 license. Contact: staab@bio.lmu.de or gerton.lunter@well.ox.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Paul R Staab
- Department of Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany and Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
| | - Sha Zhu
- Department of Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany and Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
| | - Dirk Metzler
- Department of Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany and Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
| | - Gerton Lunter
- Department of Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany and Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
| |
|
173
|
Depperschmidt A, Pardoux É, Pfaffelhuber P. A mixing tree-valued process arising under neutral evolution with recombination. Electron J Probab 2015. [DOI: 10.1214/ejp.v20-4286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
174
|
Peng B, Chen HS, Mechanic LE, Racine B, Clarke J, Gillanders E, Feuer EJ. Genetic data simulators and their applications: an overview. Genet Epidemiol 2014; 39:2-10. [PMID: 25504286 DOI: 10.1002/gepi.21876] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 06/15/2014] [Revised: 09/14/2014] [Accepted: 10/31/2014] [Indexed: 11/10/2022]
Abstract
Computer simulations have played an indispensable role in the development and applications of statistical models and methods for genetic studies across multiple disciplines. The need to simulate complex evolutionary scenarios and pseudo-datasets for various studies has fueled the development of dozens of computer programs with varying reliability, performance, and application areas. To help researchers compare and choose the most appropriate simulators for their studies, we have created the genetic simulation resources (GSR) website, which allows authors of simulation software to register their applications and describe them with more than 160 defined attributes. This article summarizes the properties of 93 simulators currently registered at GSR and provides an overview of the development and applications of genetic simulators. Unlike other review articles that address technical issues or compare simulators for particular application areas, we focus on software development, maintenance, and features of simulators, often from a historical perspective. Publications that cite these simulators are used to summarize both the applications of genetic simulations and the utilization of simulators.
Affiliation(s)
- Bo Peng
- Department of Bioinformatics and Computational Biology, The University of Texas, MD Anderson Cancer Center, Houston, Texas, United States of America
| | | | | | | | | | | | | |
|
175
|
Hobolth A, Jensen JL. Markovian approximation to the finite loci coalescent with recombination along multiple sequences. Theor Popul Biol 2014; 98:48-58. [DOI: 10.1016/j.tpb.2014.01.002] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 05/23/2013] [Revised: 10/23/2013] [Accepted: 01/18/2014] [Indexed: 10/25/2022]
|
176
|
Li P, Guo M, Wang C, Liu X, Zou Q. An overview of SNP interactions in genome-wide association studies. Brief Funct Genomics 2014; 14:143-55. [PMID: 25241224 DOI: 10.1093/bfgp/elu036] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.0] [Indexed: 12/28/2022] Open
Abstract
With the recent explosion in high-throughput genotyping technology, the amount and quality of single-nucleotide polymorphism (SNP) data has increased exponentially. Therefore, the identification of SNP interactions that are associated with common diseases is playing an increasing and important role in interpreting the genetic basis of disease susceptibility and in devising new diagnostic tests and treatments. However, because these data sets are large, although they typically have small sample sizes and low signal-to-noise ratios, there has been no major breakthrough despite many efforts, making this a major focus in the field of bioinformatics. In this article, we review the two main aspects of SNP interaction studies in recent years, namely the simulation and the identification of SNP interactions, and then discuss the principles, efficiency and differences between these methods.
|
177
|
Abstract
Recombination allows different parts of the genome to have different genealogical histories. When a species splits in two, allelic lineages sort into the two descendant species, and this lineage sorting varies along the genome. If speciation events are close in time, the lineage sorting process may be incomplete at the second speciation event and lead to gene genealogies that do not match the species phylogeny. We review different recent approaches to model lineage sorting along the genome and show how it is possible to learn about population sizes, natural selection, and recombination rates in ancestral species from application of these models to genome alignments of great ape species.
Affiliation(s)
- Thomas Mailund
- Bioinformatics Research Centre, Aarhus University, DK-8000 Aarhus C, Denmark
| | | | | |
|
178
|
Sequencing an Ashkenazi reference panel supports population-targeted personal genomics and illuminates Jewish and European origins. Nat Commun 2014; 5:4835. [PMID: 25203624 PMCID: PMC4164776 DOI: 10.1038/ncomms5835] [Citation(s) in RCA: 112] [Impact Index Per Article: 11.2] [Received: 06/24/2014] [Accepted: 07/28/2014] [Indexed: 12/17/2022] Open
Abstract
The Ashkenazi Jewish (AJ) population is a genetic isolate close to European and Middle Eastern groups, with genetic diversity patterns conducive to disease mapping. Here we report high-depth sequencing of 128 complete genomes of AJ controls. Compared with European samples, our AJ panel has 47% more novel variants per genome and is eightfold more effective at filtering benign variants out of AJ clinical genomes. Our panel improves imputation accuracy for AJ SNP arrays by 28%, and covers at least one haplotype in ≈67% of any AJ genome with long, identical-by-descent segments. Reconstruction of recent AJ history from such segments confirms a recent bottleneck of merely ≈350 individuals. Modelling of ancient histories for AJ and European populations using their joint allele frequency spectrum determines AJ to be an even admixture of European and likely Middle Eastern origins. We date the split between the two ancestral populations to ≈12–25 Kyr, suggesting a predominantly Near Eastern source for the repopulation of Europe after the Last Glacial Maximum. Ashkenazi Jews are a genetically isolated population with distinct patterns of genetic diversity. Here, the authors sequence the genomes of 128 Ashkenazi Jewish individuals and use the sequence information to provide insight into the population's European and Middle Eastern origins.
|
179
|
Wang Y, Zhou Y, Li L, Chen X, Liu Y, Ma ZM, Xu S. A new method for modeling coalescent processes with recombination. BMC Bioinformatics 2014; 15:273. [PMID: 25113665 PMCID: PMC4137079 DOI: 10.1186/1471-2105-15-273] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 03/07/2014] [Accepted: 07/17/2014] [Indexed: 11/10/2022] Open
Abstract
Background Recombination plays an important role in the maintenance of genetic diversity in many types of organisms, especially diploid eukaryotes. Recombination can be studied and used to map diseases. However, it adds a great deal of complexity to genetic data, which makes evolutionary parameters harder to estimate. After the coalescent process was formulated, models describing recombination with graphs, such as the ancestral recombination graph (ARG), were also developed. ARGs are typically simulated with one of two kinds of model: back-in-time models such as ms, and spatial models including Wiuf and Hein's algorithm, SMC, SMC′, and MaCS. Results In this study, a new method of modeling coalescence with recombination, the Spatial Coalescent simulator (SC), was developed, which considerably improves on the algorithm described by Wiuf and Hein. The present algorithm constructs the ARG spatially along the sequence, but it does not produce the redundant branches that are inevitable in Wiuf and Hein's algorithm. Interestingly, the distribution of ARGs generated by the new algorithm is identical to that generated by the typical back-in-time model adopted by ms, a program commonly used to model coalescence. It is demonstrated here that existing approximate methods such as the sequentially Markov coalescent (SMC), a related method called SMC′, and the Markovian coalescent simulator (MaCS) can be viewed as special cases of the present method. In simulations, the time to the most recent common ancestor (TMRCA) in the local trees of ARGs generated by the present algorithm was closer to that produced by ms than was the time produced by MaCS. Sample-consistent ARGs can be generated using the present method, which may significantly reduce the computational burden. Conclusion In summary, the present method and algorithm may facilitate the estimation and description of recombination in population genomics and evolutionary biology.
Affiliation(s)
- Zhi-Ming Ma
- Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China.
|
180
|
Abstract
Large whole-genome sequencing projects have provided access to much rare variation in human populations, which is highly informative about population structure and recent demography. Here, we show how the age of rare variants can be estimated from patterns of haplotype sharing and how these ages can be related to historical relationships between populations. We investigate the distribution of the age of variants occurring exactly twice (f₂ variants) in a worldwide sample sequenced by the 1000 Genomes Project, revealing enormous variation across populations. The median age of haplotypes carrying f₂ variants is 50 to 160 generations across populations within Europe or Asia, and 170 to 320 generations within Africa. Haplotypes shared between continents are much older with median ages for haplotypes shared between Europe and Asia ranging from 320 to 670 generations. The distribution of the ages of f₂ haplotypes is informative about their demography, revealing recent bottlenecks, ancient splits, and more modern connections between populations. We see the effect of selection in the observation that functional variants are significantly younger than nonfunctional variants of the same frequency. This approach is relatively insensitive to mutation rate and complements other nonparametric methods for demographic inference.
Affiliation(s)
- Iain Mathieson
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
- Gil McVean
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
|
181
|
|
182
|
Colonna V, Ayub Q, Chen Y, Pagani L, Luisi P, Pybus M, Garrison E, Xue Y, Tyler-Smith C, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. Human genomic regions with exceptionally high levels of population differentiation identified from 911 whole-genome sequences. Genome Biol 2014; 15:R88. [PMID: 24980144 PMCID: PMC4197830 DOI: 10.1186/gb-2014-15-6-r88] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Received: 05/03/2014] [Accepted: 06/30/2014] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND Population differentiation has proved to be effective for identifying loci under geographically localized positive selection, and has the potential to identify loci subject to balancing selection. We have previously investigated the pattern of genetic differentiation among human populations at 36.8 million genomic variants to identify sites in the genome showing high frequency differences. Here, we extend this dataset to include additional variants, survey sites with low levels of differentiation, and evaluate the extent to which highly differentiated sites are likely to result from selective or other processes. RESULTS We demonstrate that while sites with low differentiation represent sampling effects rather than balancing selection, sites showing extremely high population differentiation are enriched for positive selection events and that one half may be the result of classic selective sweeps. Among these, we rediscover known examples, where we actually identify the established functional SNP, and discover novel examples including the genes ABCA12, CALD1 and ZNF804, which we speculate may be linked to adaptations in skin, calcium metabolism and defense, respectively. CONCLUSIONS We identify known and many novel candidate regions for geographically restricted positive selection, and suggest several directions for further research.
|
183
|
Nevado B, Perez-Enciso M. Pipeliner: software to evaluate the performance of bioinformatics pipelines for next-generation resequencing. Mol Ecol Resour 2014; 15:99-106. [PMID: 24890372 DOI: 10.1111/1755-0998.12286] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 01/15/2014] [Revised: 05/19/2014] [Accepted: 05/23/2014] [Indexed: 12/30/2022]
Abstract
The choice of technology and bioinformatics approach is critical in obtaining accurate and reliable information from next-generation sequencing (NGS) experiments. An increasing number of software tools and methodological guidelines are being published, but deciding upon which approach and experimental design to use can depend on the particularities of the species and on the aims of the study. This leaves researchers unable to make informed decisions on these central questions. To address these issues, we developed Pipeliner, a tool to evaluate, by simulation, the performance of NGS pipelines in resequencing studies. Pipeliner provides a graphical interface allowing users to write and test their own bioinformatics pipelines with publicly available or custom software. It computes a number of statistics summarizing the performance in SNP calling, including the recovery, sensitivity and false discovery rate for heterozygous and homozygous SNP genotypes. Pipeliner can be used to answer many practical questions, for example, for a limited amount of NGS effort, how many more reliable SNPs can be detected by doubling coverage and halving sample size, or what is the false discovery rate provided by different SNP calling algorithms and options. Pipeliner thus allows researchers to carefully plan their study's sampling design and compare the suitability of alternative bioinformatics approaches for their specific study systems. Pipeliner is written in C++ and is freely available from http://github.com/brunonevado/Pipeliner.
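The sensitivity and false discovery rate named in this abstract can be illustrated with a small sketch. It is not Pipeliner's code and does not claim to match its exact definitions, which the abstract does not give ("recovery" is left out for that reason). The function name `snp_calling_summary` and the representation of SNPs as sets of positions are assumptions made here, with the usual definitions sensitivity = TP/(TP+FN) and FDR = FP/(TP+FP).

```python
def snp_calling_summary(true_snps, called_snps):
    """Summarize SNP-calling performance from simulated truth and pipeline calls.

    Both arguments are iterables of SNP positions (hypothetical input format).
    """
    true_snps, called_snps = set(true_snps), set(called_snps)
    tp = len(true_snps & called_snps)   # true SNPs that were called
    fp = len(called_snps - true_snps)   # calls that are not true SNPs
    fn = len(true_snps - called_snps)   # true SNPs that were missed
    # Guard against empty denominators rather than raising ZeroDivisionError.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    fdr = fp / (tp + fp) if (tp + fp) else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "sensitivity": sensitivity, "fdr": fdr}
```

In a simulation setting the true SNP set is known, so the same summary can be computed for each candidate pipeline or coverage level and the results compared directly.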
Affiliation(s)
- B Nevado
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, 08193, Bellaterra, Spain; Universitat Autònoma de Barcelona, 08193, Bellaterra, Spain
|
184
|
Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nat Genet 2014; 46:919-25. [PMID: 24952747 PMCID: PMC4116295 DOI: 10.1038/ng.3015] [Citation(s) in RCA: 591] [Impact Index Per Article: 59.1] [Received: 10/30/2013] [Accepted: 05/30/2014] [Indexed: 01/07/2023]
Abstract
The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model their ancestral relationship under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20-30 thousand years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The Multiple Sequentially Markovian Coalescent (MSMC) analyses the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago, and give information about human population history as recently as 2,000 years ago, including the bottleneck in the peopling of the Americas, and separations within Africa, East Asia and Europe.
Affiliation(s)
- Stephan Schiffels
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK
- Richard Durbin
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK
|
185
|
A C++ template library for efficient forward-time population genetic simulation of large populations. Genetics 2014; 198:157-66. [PMID: 24950894 DOI: 10.1534/genetics.114.165019] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Indexed: 11/18/2022] Open
Abstract
fwdpp is a C++ library of routines intended to facilitate the development of forward-time simulations under arbitrary mutation and fitness models. The library design provides a combination of speed, low memory overhead, and modeling flexibility not currently available from other forward simulation tools. The library is particularly useful when the simulation of large populations is required, as programs implemented using the library are much more efficient than other available forward simulation programs.
|
186
|
Moreno-Estrada A, Gignoux CR, Fernández-López JC, Zakharia F, Sikora M, Contreras AV, Acuña-Alonzo V, Sandoval K, Eng C, Romero-Hidalgo S, Ortiz-Tello P, Robles V, Kenny EE, Nuño-Arana I, Barquera-Lozano R, Macín-Pérez G, Granados-Arriola J, Huntsman S, Galanter JM, Via M, Ford JG, Chapela R, Rodriguez-Cintron W, Rodríguez-Santana JR, Romieu I, Sienra-Monge JJ, del Rio Navarro B, London SJ, Ruiz-Linares A, Garcia-Herrera R, Estrada K, Hidalgo-Miranda A, Jimenez-Sanchez G, Carnevale A, Soberón X, Canizales-Quinteros S, Rangel-Villalobos H, Silva-Zolezzi I, Burchard EG, Bustamante CD. Human genetics. The genetics of Mexico recapitulates Native American substructure and affects biomedical traits. Science 2014; 344:1280-5. [PMID: 24926019 PMCID: PMC4156478 DOI: 10.1126/science.1251688] [Citation(s) in RCA: 331] [Impact Index Per Article: 33.1] [Indexed: 12/18/2022]
Abstract
Mexico harbors great cultural and ethnic diversity, yet fine-scale patterns of human genome-wide variation from this region remain largely uncharacterized. We studied genomic variation within Mexico from over 1000 individuals representing 20 indigenous and 11 mestizo populations. We found striking genetic stratification among indigenous populations within Mexico at varying degrees of geographic isolation. Some groups were as differentiated as Europeans are from East Asians. Pre-Columbian genetic substructure is recapitulated in the indigenous ancestry of admixed mestizo individuals across the country. Furthermore, two independently phenotyped cohorts of Mexicans and Mexican Americans showed a significant association between subcontinental ancestry and lung function. Thus, accounting for fine-scale ancestry patterns is critical for medical and population genetic studies within Mexico, in Mexican-descent populations, and likely in many other populations worldwide.
Affiliation(s)
- Christopher R Gignoux
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA, USA.
- Fouad Zakharia
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Martin Sikora
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Victor Acuña-Alonzo
- Escuela Nacional de Antropología e Historia (ENAH), Mexico City, Mexico. Department of Genetics, Evolution and Environment, University College London, London, UK
- Karla Sandoval
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Celeste Eng
- Department of Medicine, University of California, San Francisco, CA, USA
- Patricia Ortiz-Tello
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Victoria Robles
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Eimear E Kenny
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Ismael Nuño-Arana
- Instituto de Investigación en Genética Molecular, Universidad de Guadalajara, Ocotlán, Mexico
- Julio Granados-Arriola
- Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Mexico City, Mexico
- Scott Huntsman
- Department of Medicine, University of California, San Francisco, CA, USA
- Joshua M Galanter
- Department of Medicine, University of California, San Francisco, CA, USA. Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA, USA
- Marc Via
- Department of Medicine, University of California, San Francisco, CA, USA
- Jean G Ford
- The Brooklyn Hospital Center, Brooklyn, NY, USA
- Rocío Chapela
- Instituto Nacional de Enfermedades Respiratorias (INER), Mexico City, Mexico
- Jose R Rodríguez-Santana
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA. Instituto Nacional de Medicina Genómica (INMEGEN), Mexico City, Mexico
- Stephanie J London
- National Institute of Environmental Health Sciences, National Institutes of Health, Department of Health and Human Services, Research Triangle Park, NC, USA
- Andrés Ruiz-Linares
- Department of Genetics, Evolution and Environment, University College London, London, UK
- Karol Estrada
- Instituto Nacional de Medicina Genómica (INMEGEN), Mexico City, Mexico
- Xavier Soberón
- Instituto Nacional de Medicina Genómica (INMEGEN), Mexico City, Mexico
- Samuel Canizales-Quinteros
- Instituto Nacional de Medicina Genómica (INMEGEN), Mexico City, Mexico. Facultad de Química, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Esteban Gonzalez Burchard
- Department of Medicine, University of California, San Francisco, CA, USA. Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA, USA.
- Carlos D Bustamante
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA.
|
187
|
Abstract
The "LD curve" relates the linkage disequilibrium (LD) between pairs of nucleotide sites to the distance that separates them along the chromosome. The shape of this curve reflects natural selection, admixture between populations, and the history of population size. This article derives new results about the last of these effects. When a population expands in size, the LD curve grows steeper, and this effect is especially pronounced following a bottleneck in population size. When a population shrinks, the LD curve rises but remains relatively flat. As LD converges toward a new equilibrium, its time path may not be monotonic. Following an episode of growth, for example, it declines to a low value before rising toward the new equilibrium. These changes happen at different rates for different LD statistics. They are especially slow for estimates of [Formula: see text], which therefore allow inferences about ancient population history. For the human population of Europe, these results suggest a history of population growth.
|
188
|
Kelleher J, Etheridge AM, Barton NH. Coalescent simulation in continuous space: algorithms for large neighbourhood size. Theor Popul Biol 2014; 95:13-23. [PMID: 24910324 DOI: 10.1016/j.tpb.2014.05.001] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 10/28/2013] [Revised: 05/20/2014] [Accepted: 05/22/2014] [Indexed: 11/15/2022]
Abstract
Many species have an essentially continuous distribution in space, in which there are no natural divisions between randomly mating subpopulations. Yet, the standard approach to modelling these populations is to impose an arbitrary grid of demes, adjusting deme sizes and migration rates in an attempt to capture the important features of the population. Such indirect methods are required because of the failure of the classical models of isolation by distance, which have been shown to have major technical flaws. A recently introduced model of extinction and recolonisation in two dimensions solves these technical problems, and provides a rigorous technical foundation for the study of populations evolving in a spatial continuum. The coalescent process for this model is simply stated, but direct simulation is very inefficient for large neighbourhood sizes. We present efficient and exact algorithms to simulate this coalescent process for arbitrary sample sizes and numbers of loci, and analyse these algorithms in detail.
Affiliation(s)
- J Kelleher
- Institute of Evolutionary Biology, University of Edinburgh, Kings Buildings, West Mains Road, Edinburgh EH9 3JT, UK.
- A M Etheridge
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK.
- N H Barton
- Institute of Science and Technology, Am Campus I, A-3400 Klosterneuburg, Austria.
|
189
|
Hickey JM, Gorjanc G, Hearne S, Huang BE. AlphaMPSim: flexible simulation of multi-parent crosses. Bioinformatics 2014; 30:2686-8. [DOI: 10.1093/bioinformatics/btu206] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022] Open
|
190
|
Nossa CW, Havlak P, Yue JX, Lv J, Vincent KY, Brockmann HJ, Putnam NH. Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication. Gigascience 2014; 3:9. [PMID: 24987520 PMCID: PMC4066314 DOI: 10.1186/2047-217x-3-9] [Citation(s) in RCA: 72] [Impact Index Per Article: 7.2] [Received: 10/23/2013] [Accepted: 04/23/2014] [Indexed: 11/11/2022] Open
Abstract
Background Horseshoe crabs are marine arthropods with a fossil record extending back approximately 450 million years. They exhibit remarkable morphological stability over their long evolutionary history, retaining a number of ancestral arthropod traits, and are often cited as examples of “living fossils.” As arthropods, they belong to the Ecdysozoa, an ancient super-phylum whose sequenced genomes (including insects and nematodes) have thus far shown more divergence from the ancestral pattern of eumetazoan genome organization than cnidarians, deuterostomes and lophotrochozoans. However, much of ecdysozoan diversity remains unrepresented in comparative genomic analyses. Results Here we apply a new strategy of combined de novo assembly and genetic mapping to examine the chromosome-scale genome organization of the Atlantic horseshoe crab, Limulus polyphemus. We constructed a genetic linkage map of this 2.7 Gbp genome by sequencing the nuclear DNA of 34 wild-collected, full-sibling embryos and their parents at a mean redundancy of 1.1x per sample. The map includes 84,307 sequence markers grouped into 1,876 distinct genetic intervals and 5,775 candidate conserved protein coding genes. Conclusions Comparison with other metazoan genomes shows that the L. polyphemus genome preserves ancestral bilaterian linkage groups, and that a common ancestor of modern horseshoe crabs underwent one or more ancient whole genome duplications 300 million years ago, followed by extensive chromosome fusion. These results provide a counter-example to the often noted correlation between whole genome duplication and evolutionary radiations. The new, low-cost genetic mapping method for obtaining a chromosome-scale view of non-model organism genomes that we demonstrate here does not require laboratory culture, and is potentially applicable to a broad range of other species.
Affiliation(s)
- Carlos W Nossa
- Department of Ecology and Evolutionary Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA; Current address: Gene by Gene, Ltd, Houston, TX 77008, USA
- Paul Havlak
- Department of Ecology and Evolutionary Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA
- Jia-Xing Yue
- Department of Ecology and Evolutionary Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA
- Jie Lv
- Department of Ecology and Evolutionary Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA
- Kimberly Y Vincent
- Department of Ecology and Evolutionary Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA
- H Jane Brockmann
- Department of Biology, University of Florida, P.O. Box 11-8525 Gainesville, FL 32611-8525, USA
- Nicholas H Putnam
- Department of Ecology and Evolutionary Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA; Department of Biochemistry and Cell Biology, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA
|
191
|
Abstract
In genomic prediction, common analysis methods rely on a linear mixed-model framework to estimate SNP marker effects and breeding values of animals or plants. Ridge regression-best linear unbiased prediction (RR-BLUP) is based on the assumptions that SNP marker effects are normally distributed, are uncorrelated, and have equal variances. We propose DAIRRy-BLUP, a parallel, Distributed-memory RR-BLUP implementation, based on single-trait observations (y), that uses the Average Information algorithm for restricted maximum-likelihood estimation of the variance components. The goal of DAIRRy-BLUP is to enable the analysis of large-scale data sets to provide more accurate estimates of marker effects and breeding values. A distributed-memory framework is required since the dimensionality of the problem, determined by the number of SNP markers, can become too large to be analyzed by a single computing node. Initial results show that DAIRRy-BLUP enables the analysis of very large-scale data sets (up to 1,000,000 individuals and 360,000 SNPs) and indicate that increasing the number of phenotypic and genotypic records has a more significant effect on the prediction accuracy than increasing the density of SNP arrays.
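For orientation, the RR-BLUP model described in this abstract is commonly written as below. This is the standard textbook formulation, not an equation taken from the DAIRRy-BLUP paper; the symbols X (fixed-effect design matrix), Z (SNP genotype matrix), b, u, and λ are notation assumed here.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e},
\qquad \mathbf{u} \sim N(\mathbf{0}, \mathbf{I}\sigma_u^2),
\qquad \mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2), \\[4pt]
\begin{bmatrix}
\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} \\
\mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \lambda\mathbf{I}
\end{bmatrix}
\begin{bmatrix} \hat{\mathbf{b}} \\ \hat{\mathbf{u}} \end{bmatrix}
&=
\begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{bmatrix},
\qquad \lambda = \frac{\sigma_e^2}{\sigma_u^2}.
\end{aligned}
```

The identity covariance for u encodes the three stated assumptions (normal, uncorrelated, equal-variance marker effects). The coefficient matrix has one row and column per SNP marker, which is why the number of markers determines the dimensionality of the problem.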
|
192
|
Hellenthal G, Busby GB, Band G, Wilson JF, Capelli C, Falush D, Myers S. A genetic atlas of human admixture history. Science 2014; 343:747-751. [PMID: 24531965 PMCID: PMC4209567 DOI: 10.1126/science.1243518] [Citation(s) in RCA: 477] [Impact Index Per Article: 47.7] [Indexed: 12/20/2022]
Abstract
Modern genetic data combined with appropriate statistical methods have the potential to contribute substantially to our understanding of human history. We have developed an approach that exploits the genomic structure of admixed populations to date and characterize historical mixture events at fine scales. We used this to produce an atlas of worldwide human admixture history, constructed by using genetic data alone and encompassing over 100 events occurring over the past 4000 years. We identified events whose dates and participants suggest they describe genetic impacts of the Mongol empire, Arab slave trade, Bantu expansion, first millennium CE migrations in Eastern Europe, and European colonialism, as well as unrecorded events, revealing admixture to be an almost universal force shaping human populations.
Affiliation(s)
- Garrett Hellenthal
- UCL Genetics Institute, University College London, Gower Street, London WC1E 6BT, UK
- George B.J. Busby
- Department of Zoology, Oxford University, South Parks Road, Oxford OX1 3PS, UK
- Gavin Band
- Wellcome Trust Centre for Human Genetics, Oxford University, Roosevelt Drive, Oxford OX3 7BN, UK
- James F. Wilson
- Centre for Population Health Sciences, University of Edinburgh, Teviot Place, Edinburgh, EH8 9AG, UK
- Cristian Capelli
- Department of Zoology, Oxford University, South Parks Road, Oxford OX1 3PS, UK
- Daniel Falush
- Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103 Leipzig, Germany
- Simon Myers
- Wellcome Trust Centre for Human Genetics, Oxford University, Roosevelt Drive, Oxford OX3 7BN, UK
- Department of Statistics, Oxford University, 1 South Parks Road, Oxford OX1 3TG, UK
|
193
|
Bouwman AC, Hickey JM, Calus MPL, Veerkamp RF. Imputation of non-genotyped individuals based on genotyped relatives: assessing the imputation accuracy of a real case scenario in dairy cattle. Genet Sel Evol 2014; 46:6. [PMID: 24490796 PMCID: PMC3929150 DOI: 10.1186/1297-9686-46-6] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 07/15/2013] [Accepted: 01/07/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Imputation of genotypes for ungenotyped individuals could enable the use of valuable phenotypes created before the genomic era in analyses that require genotypes. The objective of this study was to investigate the accuracy of imputation of non-genotyped individuals using genotype information from relatives. METHODS Genotypes were simulated for all individuals in the pedigree of a real (historical) dataset of phenotyped dairy cows and with part of the pedigree genotyped. The software AlphaImpute was used for imputation in its standard settings but also without phasing, i.e. using basic inheritance rules and segregation analysis only. Different scenarios were evaluated i.e.: (1) the real data scenario, (2) addition of genotypes of sires and maternal grandsires of the ungenotyped individuals, and (3) addition of one, two, or four genotyped offspring of the ungenotyped individuals to the reference population. RESULTS The imputation accuracy using AlphaImpute in its standard settings was lower than without phasing. Including genotypes of sires and maternal grandsires in the reference population improved imputation accuracy, i.e. the correlation of the true genotypes with the imputed genotype dosages, corrected for mean gene content, across all animals increased from 0.47 (real situation) to 0.60. Including one, two and four genotyped offspring increased the accuracy of imputation across all animals from 0.57 (no offspring) to 0.73, 0.82, and 0.92, respectively. CONCLUSIONS At present, the use of basic inheritance rules and segregation analysis appears to be the best imputation method for ungenotyped individuals. Comparison of our empirical animal-specific imputation accuracies to predictions based on selection index theory suggested that not correcting for mean gene content considerably overestimates the true accuracy. 
Imputation of ungenotyped individuals can help to include valuable phenotypes for genome-wide association studies or for genomic prediction, especially when the ungenotyped individuals have genotyped offspring.
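The accuracy measure in this abstract (correlation of true genotypes with imputed dosages, corrected for mean gene content) can be sketched as follows. This is one plausible reading of the abstract, not the authors' code: it assumes the correction means subtracting each SNP's mean gene content from both the true genotype and the imputed dosage before computing an animal's correlation across SNPs. The names `pearson` and `animal_accuracy` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def animal_accuracy(true_genotypes, imputed_dosages, mean_gene_content):
    """Animal-specific accuracy across SNPs after centering on mean gene content.

    All three arguments are per-SNP sequences for one animal (assumed layout);
    mean_gene_content[j] is the population mean allele count at SNP j.
    """
    t = [g - m for g, m in zip(true_genotypes, mean_gene_content)]
    d = [g - m for g, m in zip(imputed_dosages, mean_gene_content)]
    return pearson(t, d)
```

The centering matters because an "imputation" that simply assigns every animal the population mean at each SNP already correlates strongly with the true genotypes across SNPs. After centering, such dosages carry no signal, which is consistent with the abstract's remark that skipping the correction overestimates accuracy.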
Affiliation(s)
- Aniek C Bouwman
- Animal Breeding and Genomics Centre, Wageningen UR Livestock Research, P.O. Box 135, Wageningen 6700 AC, Netherlands.
|
194
|
Baldwin-Brown JG, Long AD, Thornton KR. The power to detect quantitative trait loci using resequenced, experimentally evolved populations of diploid, sexual organisms. Mol Biol Evol 2014; 31:1040-55. [PMID: 24441104 PMCID: PMC3969567 DOI: 10.1093/molbev/msu048] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.1] [Indexed: 01/10/2023] Open
Abstract
A novel approach for dissecting complex traits is to experimentally evolve laboratory populations under a controlled environment shift, resequence the resulting populations, and identify single nucleotide polymorphisms (SNPs) and/or genomic regions highly diverged in allele frequency. To better understand the power and localization ability of such an evolve and resequence (E&R) approach, we carried out forward-in-time population genetics simulations of 1 Mb genomic regions under a large combination of experimental conditions, then attempted to detect significantly diverged SNPs. Our analysis indicates that the ability to detect differentiation between populations is primarily affected by selection coefficient, population size, number of replicate populations, and number of founding haplotypes. We estimate that E&R studies can detect and localize causative sites with 80% success or greater when the number of founder haplotypes is over 500, experimental populations are replicated at least 25-fold, population size is at least 1,000 diploid individuals, and the selection coefficient on the locus of interest is at least 0.1. More achievable experimental designs (less replicated, fewer founder haplotypes, smaller effective population size, and smaller selection coefficients) can have power of greater than 50% to identify a handful of SNPs of which one is likely causative. Similarly, in cases where s ≥ 0.2, less demanding experimental designs can yield high power.
|
195
|
Durbin R. Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics 2014; 30:1266-72. [PMID: 24413527 PMCID: PMC3998136 DOI: 10.1093/bioinformatics/btu014] [Citation(s) in RCA: 241] [Impact Index Per Article: 24.1] [Indexed: 01/12/2023] Open
Abstract
MOTIVATION Over the last few years, methods based on suffix arrays using the Burrows-Wheeler Transform have been widely used for DNA sequence read matching and assembly. These provide very fast search algorithms, linear in the search pattern size, on a highly compressible representation of the dataset being searched. Meanwhile, algorithmic development for genotype data has concentrated on statistical methods for phasing and imputation, based on probabilistic matching to hidden Markov model representations of the reference data, which while powerful are much less computationally efficient. Here a theory of haplotype matching using suffix array ideas is developed, which should scale to much larger datasets than those currently handled by genotype algorithms. RESULTS Given M sequences with N bi-allelic variable sites, an O(NM) algorithm to derive a representation of the data based on positional prefix arrays is given, which is termed the positional Burrows-Wheeler transform (PBWT). On large datasets this compresses with run-length encoding to more than a hundred times smaller than gzip on the raw data. Using this representation a method is given to find all maximal haplotype matches within the set in O(NM) time rather than O(NM^2) as expected from naive pairwise comparison, and also a fast algorithm, empirically independent of M given sufficient memory for indexes, to find maximal matches between a new sequence and the set. The discussion includes some proposals about how these approaches could be used for imputation and phasing.
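As a rough illustration of the positional prefix arrays mentioned in this abstract, the following minimal Python sketch orders haplotypes by their reversed prefixes one site at a time. The function name `pbwt_prefix_arrays` and the list-of-lists input layout are assumptions made for this example and are not taken from the paper or its software.

```python
from typing import List


def pbwt_prefix_arrays(haplotypes: List[List[int]]) -> List[List[int]]:
    """Return the haplotype ordering before each site and after the last site.

    haplotypes[i][k] is the 0/1 allele of haplotype i at site k (hypothetical
    input layout). arrays[k] lists haplotype indices sorted by their reversed
    prefix over sites 0..k-1, with ties kept in input order.
    """
    m = len(haplotypes)
    n = len(haplotypes[0]) if m else 0
    order = list(range(m))  # before any site, keep the input order
    arrays = [order]
    for k in range(n):
        # Stable partition on the allele at site k: haplotypes carrying 0
        # come first, then those carrying 1, each group keeping its order.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        arrays.append(order)
    return arrays


if __name__ == "__main__":
    example = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
    for k, order in enumerate(pbwt_prefix_arrays(example)):
        print(k, order)
```

Each site costs one pass over the M haplotypes, so building all orderings takes O(NM) time, in line with the bound quoted in the abstract. The sketch omits the divergence arrays, run-length compression and match-finding steps.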
Affiliation(s)
- Richard Durbin
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, UK
|
196
|
Pérez-Enciso M. Genomic relationships computed from either next-generation sequence or array SNP data. J Anim Breed Genet 2014; 131:85-96. [PMID: 24397314 DOI: 10.1111/jbg.12074] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 10/02/2013] [Accepted: 12/02/2013] [Indexed: 01/18/2023]
Abstract
The use of sequence data in genomic prediction models is a topic of high interest, given the decreasing prices of current 'next'-generation sequencing technologies (NGS) and the theoretical possibility of directly interrogating the genomes for all causal mutations. Here, we compare by simulation how well genetic relationships (G) could be estimated using either NGS or ascertained SNP arrays. DNA sequences were simulated using the coalescence according to two scenarios: a 'cattle' scenario that consisted of a bottleneck followed by a split in two breeds without migration, and a 'pig' model where Chinese introgression into international pig breeds was simulated. We found that introgression results in a large amount of variability across the genome and between individuals, both in differentiation and in diversity. In general, NGS data allowed the most accurate estimates of G, provided enough sequencing depth was available, because shallow NGS (4×) may result in highly distorted estimates of G elements, especially if not standardized by allele frequency. However, high-density genotyping can also result in accurate estimates of G. Given that genotyping is much less noisy than NGS data, it is suggested that specific high-density arrays (~3M SNPs) that minimize the effects of ascertainment could be developed in the population of interest by sequencing the most influential animals and rely on those arrays for implementing genomic selection.
Affiliation(s)
- M Pérez-Enciso
- Centre for Research in Agricultural Genomics (CRAG), Bellaterra, Spain; Veterinary School, Universitat Autònoma de Barcelona, Bellaterra, Spain; Institut Català de Recerca i Estudis Avancats (ICREA), Barcelona, Spain; Animal Breeding and Genomics Group, Wageningen University, Wageningen, The Netherlands
|
197
|
Yang T, Deng HW, Niu T. Critical assessment of coalescent simulators in modeling recombination hotspots in genomic sequences. BMC Bioinformatics 2014; 15:3. [PMID: 24387001 PMCID: PMC3890628 DOI: 10.1186/1471-2105-15-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 09/09/2013] [Accepted: 12/30/2013] [Indexed: 12/04/2022] Open
Abstract
Background Coalescent simulation is pivotal for understanding population evolutionary models and demographic histories, as well as for developing novel analytical methods for genetic association studies for DNA sequence data. A plethora of coalescent simulators have been developed, but selecting the most appropriate program remains challenging. Results We extensively compared the performance of five widely used coalescent simulators (Hudson’s ms, msHOT, MaCS, Simcoal2, and fastsimcoal) to provide a practical guide considering three crucial factors: 1) speed, 2) scalability and 3) recombination hotspot position and intensity accuracy. Although ms represents a popular standard coalescent simulator, it lacks the ability to simulate sequences with recombination hotspots. An extended program, msHOT, has compensated for the deficiency of ms by incorporating recombination hotspots and gene conversion events at arbitrarily chosen locations and intensities, but remains limited in simulating long stretches of DNA sequences. Simcoal2, based on a discrete generation-by-generation approach, could simulate more complex demographic scenarios, but runs comparatively slowly. MaCS and fastsimcoal, both built on fast, modified sequential Markov coalescent algorithms to approximate the standard coalescent, are much more efficient whilst keeping salient features of msHOT and Simcoal2, respectively. Our simulations demonstrate that they have advantages over the other programs for a spectrum of evolutionary models. To validate recombination hotspots, LDhat 2.2 rhomap package, sequenceLDhot and Haploview were compared for hotspot detection, and sequenceLDhot exhibited the best performance based on both real and simulated data. Conclusions While ms remains an excellent choice for general coalescent simulations of DNA sequences, MaCS and fastsimcoal are much more scalable and flexible in simulating a variety of demographic events under different recombination hotspot models. Furthermore, sequenceLDhot appears to give the best performance in detecting and validating cross-over hotspots.
Affiliation(s)
- Tianhua Niu
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, Tulane University School of Public Health and Tropical Medicine, 1440 Canal Street, Suite 2001, New Orleans, LA 70112, USA.
|
198
|
Qian Y, Browning BL, Browning SR. Efficient clustering of identity-by-descent between multiple individuals. Bioinformatics 2013; 30:915-22. [PMID: 24363374 DOI: 10.1093/bioinformatics/btt734] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/23/2022]
Abstract
MOTIVATION Most existing identity-by-descent (IBD) detection methods only consider haplotype pairs; less attention has been paid to considering multiple haplotypes simultaneously, even though IBD is an equivalence relation on haplotypes that partitions a set of haplotypes into IBD clusters. Multiple-haplotype IBD clusters may have advantages over pairwise IBD in some applications, such as IBD mapping. Existing methods for detecting multiple-haplotype IBD clusters are often computationally expensive and unable to handle large samples with thousands of haplotypes. RESULTS We present a clustering method, efficient multiple-IBD, which uses pairwise IBD segments to infer multiple-haplotype IBD clusters. It expands clusters from seed haplotypes by adding qualified neighbors and extends clusters across sliding windows in the genome. Our method is an order of magnitude faster than existing methods and has comparable performance with respect to the quality of clusters it uncovers. We further investigate the potential application of multiple-haplotype IBD clusters in association studies by testing for association between multiple-haplotype IBD clusters and low-density lipoprotein cholesterol in the Northern Finland Birth Cohort. Using our multiple-haplotype IBD cluster approach, we found an association with a genomic interval covering the PCSK9 gene in these data that is missed by standard single-marker association tests. Previously published studies confirm association of PCSK9 with low-density lipoprotein. AVAILABILITY AND IMPLEMENTATION Source code is available under the GNU Public License http://cs.au.dk/~qianyuxx/EMI/.
Affiliation(s)
- Yu Qian
- Bioinformatics Research Center, Aarhus Universitet, 8000C Aarhus, Denmark, Department of Biostatistics and Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA
|
199
|
Koch E, Ristroph M, Kirkpatrick M. Long range linkage disequilibrium across the human genome. PLoS One 2013; 8:e80754. [PMID: 24349013 PMCID: PMC3861250 DOI: 10.1371/journal.pone.0080754] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 08/02/2013] [Accepted: 10/17/2013] [Indexed: 11/19/2022] Open
Abstract
Long-range linkage disequilibria (LRLD) between sites that are widely separated on chromosomes may suggest that population admixture, epistatic selection, or other evolutionary forces are at work. We quantified patterns of LRLD on a chromosome-wide level in the YRI population of the HapMap dataset of single nucleotide polymorphisms (SNPs). We calculated the disequilibrium between all pairs of SNPs on each chromosome (a total of >2×10^11 values) and evaluated significance of overall disequilibrium using randomization. The results show an excess of associations between pairs of distant sites (separated by >0.25 cM) on all of the 22 autosomes. We discuss possible explanations for this observation.
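The abstract does not state which disequilibrium statistic was computed for each SNP pair. Purely as an illustration, the sketch below computes two common choices, the coefficient D and the squared correlation r², from phased haplotypes coded 0/1. The function name `pairwise_ld` and the input coding are assumptions for this example, not the authors' method.

```python
from typing import Sequence, Tuple


def pairwise_ld(site_a: Sequence[int], site_b: Sequence[int]) -> Tuple[float, float]:
    """Return (D, r2) for two biallelic sites observed on the same haplotypes.

    site_a[i] and site_b[i] are the 0/1 alleles carried by haplotype i
    (hypothetical coding; phased data are assumed).
    """
    if len(site_a) != len(site_b) or len(site_a) == 0:
        raise ValueError("sites must be non-empty and of equal length")
    n = len(site_a)
    p_a = sum(site_a) / n  # frequency of allele 1 at the first site
    p_b = sum(site_b) / n  # frequency of allele 1 at the second site
    p_ab = sum(x & y for x, y in zip(site_a, site_b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r2 is undefined when either site is monomorphic in the sample
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2
```

Applying such a function to every pair of SNPs on a chromosome scales quadratically with the number of SNPs, which is consistent with the very large number of pairwise values reported in the abstract.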
Affiliation(s)
- Evan Koch
- Department of Integrative Biology, University of Texas, Austin, Texas, United States of America
- Mickey Ristroph
- Department of Integrative Biology, University of Texas, Austin, Texas, United States of America
- Mark Kirkpatrick
- Department of Integrative Biology, University of Texas, Austin, Texas, United States of America
|
200
|
Kessner D, Novembre J. forqs: forward-in-time simulation of recombination, quantitative traits and selection. Bioinformatics 2013; 30:576-7. [PMID: 24336146 PMCID: PMC3928523 DOI: 10.1093/bioinformatics/btt712] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Indexed: 11/30/2022]
Abstract
Summary: forqs is a forward-in-time simulation of recombination, quantitative traits and selection. It was designed to investigate haplotype patterns resulting from scenarios where substantial evolutionary change has taken place in a small number of generations due to recombination and/or selection on polygenic quantitative traits. Availability and implementation: forqs is implemented as a command-line C++ program. Source code and binary executables for Linux, OSX and Windows are freely available under a permissive BSD license: https://bitbucket.org/dkessner/forqs. Contact: jnovembre@uchicago.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Darren Kessner
- Bioinformatics Interdepartmental Program, University of California, Los Angeles, CA 90095 and Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
|