201
|
Gao C, Tignor NL, Salit J, Strulovici-Barel Y, Hackett NR, Crystal RG, Mezey JG. HEFT: eQTL analysis of many thousands of expressed genes while simultaneously controlling for hidden factors. ACTA ACUST UNITED AC 2013; 30:369-76. [PMID: 24307700 DOI: 10.1093/bioinformatics/btt690] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
Abstract
MOTIVATION Identification of expression Quantitative Trait Loci (eQTL), the genetic loci that contribute to heritable variation in gene expression, can be obstructed by factors that produce variation in expression profiles if these factors are unmeasured or hidden from direct analysis. METHODS We have developed a method for Hidden Expression Factor analysis (HEFT) that identifies individual and pleiotropic effects of eQTL in the presence of hidden factors. The HEFT model is a combined multivariate regression and factor analysis, where the complete likelihood of the model is used to derive a ridge estimator for simultaneous factor learning and detection of eQTL. HEFT requires no pre-estimation of hidden factor effects; it provides P-values and is extremely fast, requiring just a few hours to complete an eQTL analysis of thousands of expression variables when analyzing hundreds of thousands of single nucleotide polymorphisms on a standard 8 core 2.6 G desktop. RESULTS By analyzing simulated data, we demonstrate that HEFT can correct for an unknown number of hidden factors and significantly outperforms all related hidden factor methods for eQTL analysis when there are eQTL with univariate and multivariate (pleiotropic) effects. To demonstrate a real-world application, we applied HEFT to identify eQTL affecting gene expression in the human lung for a study that included presumptive hidden factors. HEFT identified all of the cis-eQTL found by other hidden factor methods and 91 additional cis-eQTL. HEFT also identified a number of eQTLs with direct relevance to lung disease that could not be found without a hidden factor analysis, including cis-eQTL for GTF2H1 and MTRR, genes that have been independently associated with lung cancer. AVAILABILITY Software is available at http://mezeylab.cb.bscb.cornell.edu/Software.aspx. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Chuan Gao
- Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14850, USA and Department of Genetic Medicine, Weill Cornell Medical College, New York, NY 10021, USA
| | | | | | | | | | | | | |
Collapse
|
202
|
Sikora MJ, Colonna V, Xue Y, Tyler-Smith C. Modeling the contrasting Neolithic male lineage expansions in Europe and Africa. INVESTIGATIVE GENETICS 2013; 4:25. [PMID: 24262073 PMCID: PMC4177147 DOI: 10.1186/2041-2223-4-25] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/16/2013] [Accepted: 10/21/2013] [Indexed: 11/10/2022]
Abstract
BACKGROUND Patterns of genetic variation in a population carry information about the prehistory of the population, and for the human Y chromosome an especially informative phylogenetic tree has previously been constructed from fully-sequenced chromosomes. This revealed contrasting bifurcating and starlike phylogenies for the major lineages associated with the Neolithic expansions in sub-Saharan Africa and Western Europe, respectively. RESULTS We used coalescent simulations to investigate the range of demographic models most likely to produce the phylogenetic structures observed in Africa and Europe, assessing the starting and ending genetic effective population sizes, duration of the expansion, and time when expansion ended. The best-fitting models in Africa and Europe are very different. In Africa, the expansion took about 12 thousand years, ending very recently; it started from approximately 40 men and numbers expanded approximately 50-fold. In Europe, the expansion was much more rapid, taking only a few generations and occurring as soon as the major R1b lineage entered Europe; it started from just one to three men, whose numbers expanded more than a thousandfold. CONCLUSIONS Although highly simplified, the demographic model we have used captures key elements of the differences between the male Neolithic expansions in Africa and Europe, and is consistent with archaeological findings.
Collapse
Affiliation(s)
- Michael J Sikora
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.
| | | | | | | |
Collapse
|
203
|
Browning BL, Browning SR. Detecting identity by descent and estimating genotype error rates in sequence data. Am J Hum Genet 2013; 93:840-51. [PMID: 24207118 DOI: 10.1016/j.ajhg.2013.09.014] [Citation(s) in RCA: 114] [Impact Index Per Article: 10.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2013] [Revised: 09/21/2013] [Accepted: 09/26/2013] [Indexed: 11/17/2022] Open
Abstract
Existing methods for identity by descent (IBD) segment detection were designed for SNP array data, not sequence data. Sequence data have a much higher density of genetic variants and a different allele frequency distribution, and can have higher genotype error rates. Consequently, best practices for IBD detection in SNP array data do not necessarily carry over to sequence data. We present a method, IBDseq, for detecting IBD segments in sequence data and a method, SEQERR, for estimating genotype error rates at low-frequency variants by using detected IBD. The IBDseq method estimates probabilities of genotypes observed with error for each pair of individuals under IBD and non-IBD models. The ratio of estimated probabilities under the two models gives a LOD score for IBD. We evaluate several IBD detection methods that are fast enough for application to sequence data (IBDseq, Beagle Refined IBD, PLINK, and GERMLINE) under multiple parameter settings, and we show that IBDseq achieves high power and accuracy for IBD detection in sequence data. The SEQERR method estimates genotype error rates by comparing observed and expected rates of pairs of homozygote and heterozygote genotypes at low-frequency variants in IBD segments. We demonstrate the accuracy of SEQERR in simulated data, and we apply the method to estimate genotype error rates in sequence data from the UK10K and 1000 Genomes projects.
Collapse
Affiliation(s)
- Brian L Browning
- Department of Medicine, Division of Medical Genetics, University of Washington, Seattle, WA 98195, USA.
| | | |
Collapse
|
204
|
Clark SA, Kinghorn BP, Hickey JM, van der Werf JHJ. The effect of genomic information on optimal contribution selection in livestock breeding programs. Genet Sel Evol 2013; 45:44. [PMID: 24171942 PMCID: PMC4176995 DOI: 10.1186/1297-9686-45-44] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2012] [Accepted: 10/13/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Long-term benefits in animal breeding programs require that increases in genetic merit be balanced with the need to maintain diversity (lost due to inbreeding). This can be achieved by using optimal contribution selection. The availability of high-density DNA marker information enables the incorporation of genomic data into optimal contribution selection but this raises the question about how this information affects the balance between genetic merit and diversity. METHODS The effect of using genomic information in optimal contribution selection was examined based on simulated and real data on dairy bulls. We compared the genetic merit of selected animals at various levels of co-ancestry restrictions when using estimated breeding values based on parent average, genomic or progeny test information. Furthermore, we estimated the proportion of variation in estimated breeding values that is due to within-family differences. RESULTS Optimal selection on genomic estimated breeding values increased genetic gain. Genetic merit was further increased using genomic rather than pedigree-based measures of co-ancestry under an inbreeding restriction policy. Using genomic instead of pedigree relationships to restrict inbreeding had a significant effect only when the population consisted of many large full-sib families; with a half-sib family structure, no difference was observed. In real data from dairy bulls, optimal contribution selection based on genomic estimated breeding values allowed for additional improvements in genetic merit at low to moderate inbreeding levels. Genomic estimated breeding values were more accurate and showed more within-family variation than parent average breeding values; for genomic estimated breeding values, 30 to 40% of the variation was due to within-family differences. Finally, there was no difference between constraining inbreeding via pedigree or genomic relationships in the real data. CONCLUSIONS The use of genomic estimated breeding values increased genetic gain in optimal contribution selection. Genomic estimated breeding values were more accurate and showed more within-family variation, which led to higher genetic gains for the same restriction on inbreeding. Using genomic relationships to restrict inbreeding provided no additional gain, except in the case of very large full-sib families.
Collapse
Affiliation(s)
- Samuel A Clark
- University of New England, Armidale, NSW 2351, Australia.
| | | | | | | |
Collapse
|
205
|
Ferretti L, Ramos-Onsins SE, Pérez-Enciso M. Population genomics from pool sequencing. Mol Ecol 2013; 22:5561-76. [DOI: 10.1111/mec.12522] [Citation(s) in RCA: 109] [Impact Index Per Article: 9.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2011] [Revised: 08/03/2013] [Accepted: 09/06/2013] [Indexed: 11/30/2022]
Affiliation(s)
- Luca Ferretti
- Center for Research in Agricultural Genomics (CRAG); UAB 08193 Bellaterra Spain
| | | | - Miguel Pérez-Enciso
- Center for Research in Agricultural Genomics (CRAG); UAB 08193 Bellaterra Spain
- Department of Animal Science and Food; Faculty of Veterinary; Universitat Autonoma de Barcelona; 08193 Bellaterra Spain
- Institut Català de Recerca i Estudis Avancats (ICREA); Passeig Lluís Companys 23 08010 Barcelona Spain
| |
Collapse
|
206
|
Khurana E, Fu Y, Colonna V, Mu XJ, Kang HM, Lappalainen T, Sboner A, Lochovsky L, Chen J, Harmanci A, Das J, Abyzov A, Balasubramanian S, Beal K, Chakravarty D, Challis D, Chen Y, Clarke D, Clarke L, Cunningham F, Evani US, Flicek P, Fragoza R, Garrison E, Gibbs R, Gümüş ZH, Herrero J, Kitabayashi N, Kong Y, Lage K, Liluashvili V, Lipkin SM, MacArthur DG, Marth G, Muzny D, Pers TH, Ritchie GRS, Rosenfeld JA, Sisu C, Wei X, Wilson M, Xue Y, Yu F, Dermitzakis ET, Yu H, Rubin MA, Tyler-Smith C, Gerstein M. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 2013; 342:1235587. [PMID: 24092746 PMCID: PMC3947637 DOI: 10.1126/science.1235587] [Citation(s) in RCA: 270] [Impact Index Per Article: 24.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Abstract
Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations ("ultrasensitive") and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, "motif-breakers"). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.
Collapse
Affiliation(s)
- Ekta Khurana
- Program in Computational Biology and Bioinformatics, Yale
University, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale
University, New Haven, CT 06520, USA
| | - Yao Fu
- Program in Computational Biology and Bioinformatics, Yale
University, New Haven, CT 06520, USA
| | - Vincenza Colonna
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus,
Cambridge, CB10 1SA, UK
- Institute of Genetics and Biophysics, National Research Council
(CNR), 80131 Naples, Italy
| | - Xinmeng Jasmine Mu
- Program in Computational Biology and Bioinformatics, Yale
University, New Haven, CT 06520, USA
| | - Hyun Min Kang
- Center for Statistical Genetics, Biostatistics, University of
Michigan, Ann Arbor, MI 48109, USA
| | - Tuuli Lappalainen
- Department of Genetic Medicine and Development, University of Geneva
Medical School, 1211 Geneva, Switzerland
- Institute for Genetics and Genomics in Geneva (iGE3), University of
Geneva, 1211 Geneva, Switzerland
- Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland
| | - Andrea Sboner
- Institute for Precision Medicine and the Department of Pathology and
Laboratory Medicine, Weill Cornell Medical College and New York-Presbyterian
Hospital, New York, NY 10065, USA
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute
for Computational Biomedicine, Weill Cornell Medical College, New York, NY 10021,
USA
| | - Lucas Lochovsky
- Program in Computational Biology and Bioinformatics, Yale
University, New Haven, CT 06520, USA
| | - Jieming Chen
- Program in Computational Biology and Bioinformatics, Yale
University, New Haven, CT 06520, USA
- Integrated Graduate Program in Physical and Engineering Biology,
Yale University, New Haven, CT 06520, USA
| | - Arif Harmanci
- Program in Computational Biology and Bioinformatics, Yale
University, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale
University, New Haven, CT 06520, USA
| | - Jishnu Das
- Department of Biological Statistics and Computational Biology,
Cornell University, Ithaca, NY 14853, USA
- Weill Institute for Cell and Molecular Biology, Cornell University,
Ithaca, NY 14853, USA
| | - Alexej Abyzov
- Program in Computational Biology and Bioinformatics, Yale
University, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale
University, New Haven, CT 06520, USA
| | - Suganthi Balasubramanian
- Program in Computational Biology and Bioinformatics, Yale
University, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale
University, New Haven, CT 06520, USA
| | - Kathryn Beal
- European Molecular Biology Laboratory, European Bioinformatics
Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Dimple Chakravarty
- Institute for Precision Medicine and the Department of Pathology and
Laboratory Medicine, Weill Cornell Medical College and New York-Presbyterian
Hospital, New York, NY 10065, USA
| | - Daniel Challis
- Baylor College of Medicine, Human Genome Sequencing Center,
Houston, TX 77030, USA
| | - Yuan Chen
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus,
Cambridge, CB10 1SA, UK
| | - Declan Clarke
- Department of Chemistry, Yale University, New Haven, CT 06520, USA
| | - Laura Clarke
- European Molecular Biology Laboratory, European Bioinformatics
Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Fiona Cunningham
- European Molecular Biology Laboratory, European Bioinformatics
Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Uday S. Evani
- Baylor College of Medicine, Human Genome Sequencing Center,
Houston, TX 77030, USA
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics
Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Robert Fragoza
- Weill Institute for Cell and Molecular Biology, Cornell University,
Ithaca, NY 14853, USA
- Department of Molecular Biology and Genetics, Cornell University,
Ithaca, NY 14853, USA
| | - Erik Garrison
- Department of Biology, Boston College, Chestnut Hill, MA 02467, USA
| | - Richard Gibbs
- Baylor College of Medicine, Human Genome Sequencing Center,
Houston, TX 77030, USA
| | - Zeynep H. Gümüş
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute
for Computational Biomedicine, Weill Cornell Medical College, New York, NY 10021,
USA
- Department of Physiology and Biophysics, Weill Cornell Medical
College, New York, NY, 10065, USA
| | - Javier Herrero
- European Molecular Biology Laboratory, European Bioinformatics
Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Naoki Kitabayashi
- Institute for Precision Medicine and the Department of Pathology and
Laboratory Medicine, Weill Cornell Medical College and New York-Presbyterian
Hospital, New York, NY 10065, USA
| | - Yong Kong
- Department of Molecular Biophysics and Biochemistry, Yale
University, New Haven, CT 06520, USA
- Keck Biotechnology Resource Laboratory, Yale University, New Haven,
CT 06511, USA
| | - Kasper Lage
- Pediatric Surgical Research Laboratories, MassGeneral Hospital for
Children, Massachusetts General Hospital, Boston, MA 02114, USA
- Analytical and Translational Genetics Unit, Massachusetts General
Hospital, Boston, MA 02114, USA
- Harvard Medical School, Boston, MA 02115, USA
- Center for Biological Sequence Analysis, Department of Systems
Biology, Technical University of Denmark, Lyngby, Denmark
- Center for Protein Research, University of Copenhagen, Copenhagen,
Denmark
| | - Vaja Liluashvili
- The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute
for Computational Biomedicine, Weill Cornell Medical College, New York, NY 10021,
USA
- Department of Physiology and Biophysics, Weill Cornell Medical
College, New York, NY, 10065, USA
| | - Steven M. Lipkin
- Department of Medicine, Weill Cornell Medical College, New York, NY
10065, USA
| | - Daniel G. MacArthur
- Analytical and Translational Genetics Unit, Massachusetts General
Hospital, Boston, MA 02114, USA
- Program in Medical and Population Genetics, Broad Institute of
Harvard and Massachusetts Institute of Technology (MIT), Cambridge, MA 02142,
USA
| | - Gabor Marth
- Department of Biology, Boston College, Chestnut Hill, MA 02467, USA
| | - Donna Muzny
- Baylor College of Medicine, Human Genome Sequencing Center,
Houston, TX 77030, USA
| | - Tune H. Pers
- Center for Biological Sequence Analysis, Department of Systems
Biology, Technical University of Denmark, Lyngby, Denmark
- Division of Endocrinology and Center for Basic and Translational
Obesity Research, Children’s Hospital, Boston, MA 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Graham R. S. Ritchie
- European Molecular Biology Laboratory, European Bioinformatics
Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Jeffrey A. Rosenfeld
- Department of Medicine, Rutgers New Jersey Medical School, Newark,
NJ 07101, USA
- IST/High Performance and Research Computing, Rutgers University
Newark, NJ 07101, USA
- Sackler Institute for Comparative Genomics, American Museum of
Natural History, New York, NY 10024, USA
| | - Cristina Sisu
- Program in Computational Biology and Bioinformatics, Yale
University, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale
University, New Haven, CT 06520, USA
| | - Xiaomu Wei
- Weill Institute for Cell and Molecular Biology, Cornell University,
Ithaca, NY 14853, USA
- Department of Medicine, Weill Cornell Medical College, New York, NY
10065, USA
| | - Michael Wilson
- Program in Computational Biology and Bioinformatics, Yale
University, New Haven, CT 06520, USA
- Child Study Center, Yale University, New Haven, CT 06520, USA
| | - Yali Xue
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus,
Cambridge, CB10 1SA, UK
| | - Fuli Yu
- Baylor College of Medicine, Human Genome Sequencing Center,
Houston, TX 77030, USA
| | | | - Emmanouil T. Dermitzakis
- Department of Genetic Medicine and Development, University of Geneva
Medical School, 1211 Geneva, Switzerland
- Institute for Genetics and Genomics in Geneva (iGE3), University of
Geneva, 1211 Geneva, Switzerland
- Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland
| | - Haiyuan Yu
- Department of Biological Statistics and Computational Biology,
Cornell University, Ithaca, NY 14853, USA
- Weill Institute for Cell and Molecular Biology, Cornell University,
Ithaca, NY 14853, USA
| | - Mark A. Rubin
- Institute for Precision Medicine and the Department of Pathology and
Laboratory Medicine, Weill Cornell Medical College and New York-Presbyterian
Hospital, New York, NY 10065, USA
| | - Chris Tyler-Smith
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus,
Cambridge, CB10 1SA, UK
| | - Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale
University, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale
University, New Haven, CT 06520, USA
- Department of Computer Science, Yale University, New Haven, CT
06520, USA
| |
Collapse
|
207
|
Abstract
High-throughput shotgun sequence data make it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual's genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an individual are limited to low coverage, preventing the confident calling of genotypes necessary to directly count the proportion of heterozygous sites. Here, we present a method for estimating an individual's genome-wide rate of heterozygosity from low-coverage sequence data, without an intermediate step that calls genotypes. Our method jointly learns the shared allele distribution between the individual and a panel of other individuals, together with the sequencing error distributions and the reference bias. We show our method works well, first, by its performance on simulated sequence data and, second, on real sequence data where we obtain estimates using low-coverage data consistent with those from higher coverage. We apply our method to obtain estimates of the rate of heterozygosity for 11 humans from diverse worldwide populations and through this analysis reveal the complex dependency of local sequencing coverage on the true underlying heterozygosity, which complicates the estimation of heterozygosity from sequence data. We show how we can use filters to correct for the confounding arising from sequencing depth. We find in practice that ratios of heterozygosity are more interpretable than absolute estimates and show that we obtain excellent conformity of ratios of heterozygosity with previous estimates from higher-coverage data.
Collapse
|
208
|
Cridland JM, Macdonald SJ, Long AD, Thornton KR. Abundance and distribution of transposable elements in two Drosophila QTL mapping resources. Mol Biol Evol 2013; 30:2311-27. [PMID: 23883524 PMCID: PMC3773372 DOI: 10.1093/molbev/mst129] [Citation(s) in RCA: 85] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022] Open
Abstract
Here we present computational machinery to efficiently and accurately identify transposable element (TE) insertions in 146 next-generation sequenced inbred strains of Drosophila melanogaster. The panel of lines we use in our study is composed of strains from a pair of genetic mapping resources: the Drosophila Genetic Reference Panel (DGRP) and the Drosophila Synthetic Population Resource (DSPR). We identified 23,087 TE insertions in these lines, of which 83.3% are found in only one line. There are marked differences in the distribution of elements over the genome, with TEs found at higher densities on the X chromosome, and in regions of low recombination. We also identified many more TEs per base pair of intronic sequence and fewer TEs per base pair of exonic sequence than expected if TEs are located at random locations in the euchromatic genome. There was substantial variation in TE load across genes. For example, the paralogs derailed and derailed-2 show a significant difference in the number of TE insertions, potentially reflecting differences in the selection acting on these loci. When considering TE families, we find a very weak effect of gene family size on TE insertions per gene, indicating that as gene family size increases the number of TE insertions in a given gene within that family also increases. TEs are known to be associated with certain phenotypes, and our data will allow investigators using the DGRP and DSPR to assess the functional role of TE insertions in complex trait variation more generally. Notably, because most TEs are very rare and often private to a single line, causative TEs resulting in phenotypic differences among individuals may typically fail to replicate across mapping panels since individual elements are unlikely to segregate in both panels. Our data suggest that “burden tests” that test for the effect of TEs as a class may be more fruitful.
Collapse
Affiliation(s)
- Julie M Cridland
- Department of Ecology, Evolution and Physiology, University of California, Irvine
| | | | | | | |
Collapse
|
209
|
Pavlidis P, Živkovic D, Stamatakis A, Alachiotis N. SweeD: likelihood-based detection of selective sweeps in thousands of genomes. Mol Biol Evol 2013; 30:2224-34. [PMID: 23777627 PMCID: PMC3748355 DOI: 10.1093/molbev/mst112] [Citation(s) in RCA: 278] [Impact Index Per Article: 25.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022] Open
Abstract
The advent of modern DNA sequencing technology is the driving force in obtaining complete intra-specific genomes that can be used to detect loci that have been subject to positive selection in the recent past. Based on selective sweep theory, beneficial loci can be detected by examining the single nucleotide polymorphism patterns in intraspecific genome alignments. In the last decade, a plethora of algorithms for identifying selective sweeps have been developed. However, the majority of these algorithms have not been designed for analyzing whole-genome data. We present SweeD (Sweep Detector), an open-source tool for the rapid detection of selective sweeps in whole genomes. It analyzes site frequency spectra and represents a substantial extension of the widely used SweepFinder program. The sequential version of SweeD is up to 22 times faster than SweepFinder and, more importantly, is able to analyze thousands of sequences. We also provide a parallel implementation of SweeD for multi-core processors. Furthermore, we implemented a checkpointing mechanism that allows to deploy SweeD on cluster systems with queue execution time restrictions, as well as to resume long-running analyses after processor failures. In addition, the user can specify various demographic models via the command-line to calculate their theoretically expected site frequency spectra. Therefore, (in contrast to SweepFinder) the neutral site frequencies can optionally be directly calculated from a given demographic model. We show that an increase of sample size results in more precise detection of positive selection. Thus, the ability to analyze substantially larger sample sizes by using SweeD leads to more accurate sweep detection. We validate SweeD via simulations and by scanning the first chromosome from the 1000 human Genomes project for selective sweeps. We compare SweeD results with results from a linkage-disequilibrium-based approach and identify common outliers.
Collapse
Affiliation(s)
- Pavlos Pavlidis
- Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies (HITS gGmbH), Schloss-Wolfsbrunnenweg, Heidelberg, Germany.
| | | | | | | |
Collapse
|
210
|
A sequential coalescent algorithm for chromosomal inversions. Heredity (Edinb) 2013; 111:200-9. [PMID: 23632894 DOI: 10.1038/hdy.2013.38] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2012] [Revised: 02/04/2013] [Accepted: 03/25/2013] [Indexed: 01/06/2023] Open
Abstract
Chromosomal inversions are common in natural populations and are believed to be involved in many important evolutionary phenomena, including speciation, the evolution of sex chromosomes and local adaptation. While recent advances in sequencing and genotyping methods are leading to rapidly increasing amounts of genome-wide sequence data that reveal interesting patterns of genetic variation within inverted regions, efficient simulation methods to study these patterns are largely missing. In this work, we extend the sequential Markovian coalescent, an approximation to the coalescent with recombination, to include the effects of polymorphic inversions on patterns of recombination. Results show that our algorithm is fast, memory-efficient and accurate, making it feasible to simulate large inversions in large populations for the first time. The SMC algorithm enables studies of patterns of genetic variation (for example, linkage disequilibria) and tests of hypotheses (using simulation-based approaches) that were previously intractable.
Collapse
|
211
|
Hara Y, Imanishi T, Satta Y. Reconstructing the demographic history of the human lineage using whole-genome sequences from human and three great apes. Genome Biol Evol 2013; 4:1133-45. [PMID: 22975719 PMCID: PMC3752010 DOI: 10.1093/gbe/evs075] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023] Open
Abstract
The demographic history of human would provide helpful information for identifying the evolutionary events that shaped the humanity but remains controversial even in the genomic era. To settle the controversies, we inferred the speciation times (T) and ancestral population sizes (N) in the lineage leading to human and great apes based on whole-genome alignment. A coalescence simulation determined the sizes of alignment blocks and intervals between them required to obtain recombination-free blocks with a high frequency. This simulation revealed that the size of the block strongly affects the parameter inference, indicating that recombination is an important factor for achieving optimum parameter inference. From the whole genome alignments (1.9 giga-bases) of human (H), chimpanzee (C), gorilla (G), and orangutan, 100-bp alignment blocks separated by ≥5-kb intervals were sampled and subjected to estimate τ = μT and θ = 4μgN using the Markov chain Monte Carlo method, where μ is the mutation rate and g is the generation time. Although the estimated τHC differed across chromosomes, τHC and τHCG were strongly correlated across chromosomes, indicating that variation in τ is subject to variation in μ, rather than T, and thus, all chromosomes share a single speciation time. Subsequently, we estimated Ts of the human lineage from chimpanzee, gorilla, and orangutan to be 6.0–7.6, 7.6–9.7, and 15–19 Ma, respectively, assuming variable μ across lineages and chromosomes. These speciation times were consistent with the fossil records. We conclude that the speciation times in our recombination-free analysis would be conclusive and the speciation between human and chimpanzee was a single event.
Collapse
Affiliation(s)
- Yuichiro Hara
- Biomedicinal Information Research Center, National Institute of Advanced Industrial Science and Technology, Koto-ku, Tokyo, Japan
| | | | | |
Collapse
|
212
|
Padhukasahasram B, Rannala B. Meiotic gene-conversion rate and tract length variation in the human genome. Eur J Hum Genet 2013:ejhg201330. [PMID: 23443031 DOI: 10.1038/ejhg.2013.30] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2012] [Revised: 12/17/2012] [Accepted: 01/10/2013] [Indexed: 01/11/2023] Open
Abstract
Meiotic recombination occurs in the form of two different mechanisms called crossing-over and gene-conversion and both processes have an important role in shaping genetic variation in populations. Although variation in crossing-over rates has been studied extensively using sperm-typing experiments, pedigree studies and population genetic approaches, our knowledge of variation in gene-conversion parameters (ie, rates and mean tract lengths) remains far from complete. To explore variability in population gene-conversion rates and its relationship to crossing-over rate variation patterns, we have developed and validated using coalescent simulations a comprehensive Bayesian full-likelihood method that can jointly infer crossing-over and gene-conversion rates as well as tract lengths from population genomic data under general variable rate models with recombination hotspots. Here, we apply this new method to SNP data from multiple human populations and attempt to characterize for the first time the fine-scale variation in gene-conversion parameters along the human genome. We find that the estimated ratio of gene-conversion to crossing-over rates varies considerably across genomic regions as well as between populations. However, there is a great degree of uncertainty associated with such estimates. We also find substantial evidence for variation in the mean conversion tract length. The estimated tract lengths did not show any negative relationship with the local heterozygosity levels in our analysis.European Journal of Human Genetics advance online publication, 27 February 2013; doi:10.1038/ejhg.2013.30.
Collapse
Affiliation(s)
- Badri Padhukasahasram
- 1] Center for Health Policy and Health Services Research, Henry Ford Health System, Detroit, MI, USA [2] Genome Center and Department of Evolution and Ecology, University of California, Davis, Davis, CA, USA
| | - Bruce Rannala
- Genome Center and Department of Evolution and Ecology, University of California, Davis, Davis, CA, USA
| |
Collapse
|
213
|
Abstract
Long-range migrations and the resulting admixtures between populations have been important forces shaping human genetic diversity. Most existing methods for detecting and reconstructing historical admixture events are based on allele frequency divergences or patterns of ancestry segments in chromosomes of admixed individuals. An emerging new approach harnesses the exponential decay of admixture-induced linkage disequilibrium (LD) as a function of genetic distance. Here, we comprehensively develop LD-based inference into a versatile tool for investigating admixture. We present a new weighted LD statistic that can be used to infer mixture proportions as well as dates with fewer constraints on reference populations than previous methods. We define an LD-based three-population test for admixture and identify scenarios in which it can detect admixture events that previous formal tests cannot. We further show that we can uncover phylogenetic relationships among populations by comparing weighted LD curves obtained using a suite of references. Finally, we describe several improvements to the computation and fitting of weighted LD curves that greatly increase the robustness and speed of the calculations. We implement all of these advances in a software package, ALDER, which we validate in simulations and apply to test for admixture among all populations from the Human Genome Diversity Project (HGDP), highlighting insights into the admixture history of Central African Pygmies, Sardinians, and Japanese.
Collapse
|
214
|
Abstract
Neanderthals were a group of archaic hominins that occupied most of Europe and parts of Western Asia from ∼30,000 to 300,000 years ago (KYA). They coexisted with modern humans during part of this time. Previous genetic analyses that compared a draft sequence of the Neanderthal genome with genomes of several modern humans concluded that Neanderthals made a small (1-4%) contribution to the gene pools of all non-African populations. This observation was consistent with a single episode of admixture from Neanderthals into the ancestors of all non-Africans when the two groups coexisted in the Middle East 50-80 KYA. We examined the relationship between Neanderthals and modern humans in greater detail by applying two complementary methods to the published draft Neanderthal genome and an expanded set of high-coverage modern human genome sequences. We find that, consistent with the recent finding of Meyer et al. (2012), Neanderthals contributed more DNA to modern East Asians than to modern Europeans. Furthermore we find that the Maasai of East Africa have a small but significant fraction of Neanderthal DNA. Because our analysis is of several genomic samples from each modern human population considered, we are able to document the extent of variation in Neanderthal ancestry within and among populations. Our results combined with those previously published show that a more complex model of admixture between Neanderthals and modern humans is necessary to account for the different levels of Neanderthal ancestry among human populations. In particular, at least some Neanderthal-modern human admixture must postdate the separation of the ancestors of modern European and modern East Asian populations.
Collapse
|
215
|
Shang J, Zhang J, Lei X, Zhao W, Dong Y. EpiSIM: simulation of multiple epistasis, linkage disequilibrium patterns and haplotype blocks for genome-wide interaction analysis. Genes Genomics 2013. [DOI: 10.1007/s13258-013-0081-9] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
|
216
|
Hickey JM, Kinghorn BP, Tier B, Clark SA, van der Werf JHJ, Gorjanc G. Genomic evaluations using similarity between haplotypes. J Anim Breed Genet 2012; 130:259-69. [PMID: 23855628 DOI: 10.1111/jbg.12020] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2012] [Accepted: 11/07/2012] [Indexed: 10/27/2022]
Abstract
Long-range phasing and haplotype library imputation methodologies are accurate and efficient methods to provide haplotype information that could be used in prediction of breeding value or phenotype. Modelling long haplotypes as independent effects in genomic prediction would be inefficient due to the many effects that need to be estimated and phasing errors, even if relatively low in frequency, exacerbate this problem. One approach to overcome this is to use similarity between haplotypes to model covariance of genomic effects by region or of animal breeding values. We developed a simple method to do this and tested impact on genomic prediction by simulation. Results show that the diagonal and off-diagonal elements of a genomic relationship matrix constructed using the haplotype similarity method had higher correlations with the true relationship between pairs of individuals than genomic relationship matrices built using unphased genotypes or assumed unrelated haplotypes. However, the prediction accuracy of such haplotype-based prediction methods was not higher than those based on unphased genotype information.
Collapse
Affiliation(s)
- J M Hickey
- School of Environmental and Rural Science, University of New England, Armidale, Australia
| | | | | | | | | | | |
Collapse
|
217
|
Chan AH, Jenkins PA, Song YS. Genome-wide fine-scale recombination rate variation in Drosophila melanogaster. PLoS Genet 2012; 8:e1003090. [PMID: 23284288 PMCID: PMC3527307 DOI: 10.1371/journal.pgen.1003090] [Citation(s) in RCA: 178] [Impact Index Per Article: 14.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2012] [Accepted: 09/29/2012] [Indexed: 01/18/2023] Open
Abstract
Estimating fine-scale recombination maps of Drosophila from population genomic data is a challenging problem, in particular because of the high background recombination rate. In this paper, a new computational method is developed to address this challenge. Through an extensive simulation study, it is demonstrated that the method allows more accurate inference, and exhibits greater robustness to the effects of natural selection and noise, compared to a well-used previous method developed for studying fine-scale recombination rate variation in the human genome. As an application, a genome-wide analysis of genetic variation data is performed for two Drosophila melanogaster populations, one from North America (Raleigh, USA) and the other from Africa (Gikongoro, Rwanda). It is shown that fine-scale recombination rate variation is widespread throughout the D. melanogaster genome, across all chromosomes and in both populations. At the fine-scale, a conservative, systematic search for evidence of recombination hotspots suggests the existence of a handful of putative hotspots each with at least a tenfold increase in intensity over the background rate. A wavelet analysis is carried out to compare the estimated recombination maps in the two populations and to quantify the extent to which recombination rates are conserved. In general, similarity is observed at very broad scales, but substantial differences are seen at fine scales. The average recombination rate of the X chromosome appears to be higher than that of the autosomes in both populations, and this pattern is much more pronounced in the African population than the North American population. The correlation between various genomic features—including recombination rates, diversity, divergence, GC content, gene content, and sequence quality—is examined using the wavelet analysis, and it is shown that the most notable difference between D. melanogaster and humans is in the correlation between recombination and diversity. Recombination is a process by which chromosomes exchange genetic material during meiosis. It is important in evolution because it provides offspring with new combinations of genes, and so estimating the rate of recombination is of fundamental importance in various population genomic inference problems. In this paper, we develop a new statistical method to enable robust estimation of fine-scale recombination maps of Drosophila, a genus of common fruit flies, in which the background recombination rate is high and natural selection has been prevalent. We apply our method to produce fine-scale recombination maps for a North American population and an African population of D. melanogaster. For both populations, we find extensive fine-scale variation in recombination rate throughout the genome. We provide a quantitative characterization of the similarities and differences between the recombination maps of the two populations; our study reveals high correlation at broad scales and low correlation at fine scales, as has been documented among human populations. We also examine the correlation between various genomic features. Furthermore, using a conservative approach, we find a handful of putative recombination “hotspot” regions with solid statistical support for a local elevation of at least 10 times the background recombination rate.
Collapse
Affiliation(s)
- Andrew H. Chan
- Computer Science Division, University of California Berkeley, Berkeley, California, United States of America
| | - Paul A. Jenkins
- Computer Science Division, University of California Berkeley, Berkeley, California, United States of America
| | - Yun S. Song
- Computer Science Division, University of California Berkeley, Berkeley, California, United States of America
- Department of Statistics, University of California Berkeley, Berkeley, California, United States of America
- * E-mail:
| |
Collapse
|
218
|
Pool JE, Corbett-Detig RB, Sugino RP, Stevens KA, Cardeno CM, Crepeau MW, Duchen P, Emerson JJ, Saelao P, Begun DJ, Langley CH. Population Genomics of sub-saharan Drosophila melanogaster: African diversity and non-African admixture. PLoS Genet 2012; 8:e1003080. [PMID: 23284287 PMCID: PMC3527209 DOI: 10.1371/journal.pgen.1003080] [Citation(s) in RCA: 229] [Impact Index Per Article: 19.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2012] [Accepted: 09/27/2012] [Indexed: 11/25/2022] Open
Abstract
Drosophila melanogaster has played a pivotal role in the development of modern population genetics. However, many basic questions regarding the demographic and adaptive history of this species remain unresolved. We report the genome sequencing of 139 wild-derived strains of D. melanogaster, representing 22 population samples from the sub-Saharan ancestral range of this species, along with one European population. Most genomes were sequenced above 25X depth from haploid embryos. Results indicated a pervasive influence of non-African admixture in many African populations, motivating the development and application of a novel admixture detection method. Admixture proportions varied among populations, with greater admixture in urban locations. Admixture levels also varied across the genome, with localized peaks and valleys suggestive of a non-neutral introgression process. Genomes from the same location differed starkly in ancestry, suggesting that isolation mechanisms may exist within African populations. After removing putatively admixed genomic segments, the greatest genetic diversity was observed in southern Africa (e.g. Zambia), while diversity in other populations was largely consistent with a geographic expansion from this potentially ancestral region. The European population showed different levels of diversity reduction on each chromosome arm, and some African populations displayed chromosome arm-specific diversity reductions. Inversions in the European sample were associated with strong elevations in diversity across chromosome arms. Genomic scans were conducted to identify loci that may represent targets of positive selection within an African population, between African populations, and between European and African populations. A disproportionate number of candidate selective sweep regions were located near genes with varied roles in gene regulation. Outliers for Europe-Africa F(ST) were found to be enriched in genomic regions of locally elevated cosmopolitan admixture, possibly reflecting a role for some of these loci in driving the introgression of non-African alleles into African populations.
Collapse
Affiliation(s)
- John E Pool
- Laboratory of Genetics, University of Wisconsin-Madison, Madison, Wisconsin, USA.
| | | | | | | | | | | | | | | | | | | | | |
Collapse
|
219
|
Mailund T, Halager AE, Westergaard M, Dutheil JY, Munch K, Andersen LN, Lunter G, Prüfer K, Scally A, Hobolth A, Schierup MH. A new isolation with migration model along complete genomes infers very different divergence processes among closely related great ape species. PLoS Genet 2012; 8:e1003125. [PMID: 23284294 PMCID: PMC3527290 DOI: 10.1371/journal.pgen.1003125] [Citation(s) in RCA: 75] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2012] [Accepted: 10/14/2012] [Indexed: 11/18/2022] Open
Abstract
We present a hidden Markov model (HMM) for inferring gradual isolation between two populations during speciation, modelled as a time interval with restricted gene flow. The HMM describes the history of adjacent nucleotides in two genomic sequences, such that the nucleotides can be separated by recombination, can migrate between populations, or can coalesce at variable time points, all dependent on the parameters of the model, which are the effective population sizes, splitting times, recombination rate, and migration rate. We show by extensive simulations that the HMM can accurately infer all parameters except the recombination rate, which is biased downwards. Inference is robust to variation in the mutation rate and the recombination rate over the sequence and also robust to unknown phase of genomes unless they are very closely related. We provide a test for whether divergence is gradual or instantaneous, and we apply the model to three key divergence processes in great apes: (a) the bonobo and common chimpanzee, (b) the eastern and western gorilla, and (c) the Sumatran and Bornean orang-utan. We find that the bonobo and chimpanzee appear to have undergone a clear split, whereas the divergence processes of the gorilla and orang-utan species occurred over several hundred thousands years with gene flow stopping quite recently. We also apply the model to the Homo/Pan speciation event and find that the most likely scenario involves an extended period of gene flow during speciation.
Collapse
Affiliation(s)
- Thomas Mailund
- Bioinformatics Research Center, Aarhus University, Aarhus, Denmark.
| | | | | | | | | | | | | | | | | | | | | |
Collapse
|
220
|
Genomic prediction in animals and plants: simulation of data, validation, reporting, and benchmarking. Genetics 2012; 193:347-65. [PMID: 23222650 DOI: 10.1534/genetics.112.147983] [Citation(s) in RCA: 239] [Impact Index Per Article: 19.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022] Open
Abstract
The genomic prediction of phenotypes and breeding values in animals and plants has developed rapidly into its own research field. Results of genomic prediction studies are often difficult to compare because data simulation varies, real or simulated data are not fully described, and not all relevant results are reported. In addition, some new methods have been compared only in limited genetic architectures, leading to potentially misleading conclusions. In this article we review simulation procedures, discuss validation and reporting of results, and apply benchmark procedures for a variety of genomic prediction methods in simulated and real example data. Plant and animal breeding programs are being transformed by the use of genomic data, which are becoming widely available and cost-effective to predict genetic merit. A large number of genomic prediction studies have been published using both simulated and real data. The relative novelty of this area of research has made the development of scientific conventions difficult with regard to description of the real data, simulation of genomes, validation and reporting of results, and forward in time methods. In this review article we discuss the generation of simulated genotype and phenotype data, using approaches such as the coalescent and forward in time simulation. We outline ways to validate simulated data and genomic prediction results, including cross-validation. The accuracy and bias of genomic prediction are highlighted as performance indicators that should be reported. We suggest that a measure of relatedness between the reference and validation individuals be reported, as its impact on the accuracy of genomic prediction is substantial. A large number of methods were compared in example simulated and real (pine and wheat) data sets, all of which are publicly available. In our limited simulations, most methods performed similarly in traits with a large number of quantitative trait loci (QTL), whereas in traits with fewer QTL variable selection did have some advantages. In the real data sets examined here all methods had very similar accuracies. We conclude that no single method can serve as a benchmark for genomic prediction. We recommend comparing accuracy and bias of new methods to results from genomic best linear prediction and a variable selection approach (e.g., BayesB), because, together, these methods are appropriate for a range of genetic architectures. An accompanying article in this issue provides a comprehensive review of genomic prediction methods and discusses a selection of topics related to application of genomic prediction in plants and animals.
Collapse
|
221
|
Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D. Ancient admixture in human history. Genetics 2012; 192:1065-93. [PMID: 22960212 PMCID: PMC3522152 DOI: 10.1534/genetics.112.145037] [Citation(s) in RCA: 1504] [Impact Index Per Article: 125.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2012] [Accepted: 08/28/2012] [Indexed: 12/11/2022] Open
Abstract
Population mixture is an important process in biology. We present a suite of methods for learning about population mixtures, implemented in a software package called ADMIXTOOLS, that support formal tests for whether mixture occurred and make it possible to infer proportions and dates of mixture. We also describe the development of a new single nucleotide polymorphism (SNP) array consisting of 629,433 sites with clearly documented ascertainment that was specifically designed for population genetic analyses and that we genotyped in 934 individuals from 53 diverse populations. To illustrate the methods, we give a number of examples that provide new insights about the history of human admixture. The most striking finding is a clear signal of admixture into northern Europe, with one ancestral population related to present-day Basques and Sardinians and the other related to present-day populations of northeast Asia and the Americas. This likely reflects a history of admixture between Neolithic migrants and the indigenous Mesolithic population of Europe, consistent with recent analyses of ancient bones from Sweden and the sequencing of the genome of the Tyrolean "Iceman."
Collapse
Affiliation(s)
- Nick Patterson
- Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, Massachusetts 02142, USA.
| | | | | | | | | | | | | | | | | |
Collapse
|
222
|
Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D. Ancient admixture in human history. Genetics 2012. [PMID: 22960212 DOI: 10.1534/genetics.112.145037/-/dc1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/28/2023] Open
Abstract
Population mixture is an important process in biology. We present a suite of methods for learning about population mixtures, implemented in a software package called ADMIXTOOLS, that support formal tests for whether mixture occurred and make it possible to infer proportions and dates of mixture. We also describe the development of a new single nucleotide polymorphism (SNP) array consisting of 629,433 sites with clearly documented ascertainment that was specifically designed for population genetic analyses and that we genotyped in 934 individuals from 53 diverse populations. To illustrate the methods, we give a number of examples that provide new insights about the history of human admixture. The most striking finding is a clear signal of admixture into northern Europe, with one ancestral population related to present-day Basques and Sardinians and the other related to present-day populations of northeast Asia and the Americas. This likely reflects a history of admixture between Neolithic migrants and the indigenous Mesolithic population of Europe, consistent with recent analyses of ancient bones from Sweden and the sequencing of the genome of the Tyrolean "Iceman."
Collapse
Affiliation(s)
- Nick Patterson
- Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, Massachusetts 02142, USA.
| | | | | | | | | | | | | | | | | |
Collapse
|
223
|
Haubold B, Pfaffelhuber P. Alignment-free population genomics: an efficient estimator of sequence diversity. G3 (BETHESDA, MD.) 2012; 2:883-9. [PMID: 22908037 PMCID: PMC3411244 DOI: 10.1534/g3.112.002527] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/09/2012] [Accepted: 05/29/2012] [Indexed: 11/18/2022]
Abstract
Comparative sequencing contributes critically to the functional annotation of genomes. One prerequisite for successful analysis of the increasingly abundant comparative sequencing data is the availability of efficient computational tools. We present here a strategy for comparing unaligned genomes based on a coalescent approach combined with advanced algorithms for indexing sequences. These algorithms are particularly efficient when analyzing large genomes, as their run time ideally grows only linearly with sequence length. Using this approach, we have derived and implemented a maximum-likelihood estimator of the average number of mismatches per site between two closely related sequences, π. By allowing for fluctuating coalescent times, we are able to improve a previously published alignment-free estimator of π. We show through simulation that our new estimator is fast and accurate even with moderate recombination (ρ ≤ π). To demonstrate its applicability to real data, we compare the unaligned genomes of Drosophila persimilis and D. pseudoobscura. In agreement with previous studies, our sliding window analysis locates the global divergence minimum between these two genomes to the pericentromeric region of chromosome 3.
Collapse
Affiliation(s)
- Bernhard Haubold
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Plön, Germany.
| | | |
Collapse
|
224
|
Theunert C, Tang K, Lachmann M, Hu S, Stoneking M. Inferring the history of population size change from genome-wide SNP data. Mol Biol Evol 2012; 29:3653-67. [PMID: 22787284 DOI: 10.1093/molbev/mss175] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Dense, genome-wide single-nucleotide polymorphism (SNP) data can be used to reconstruct the demographic history of human populations. However, demographic inferences from such data are complicated by recombination and ascertainment bias. We introduce two new statistics, allele frequency-identity by descent (AF-IBD) and allele frequency-identity by state (AF-IBS), that make use of linkage disequilibrium information and show defined relationships to the time of coalescence. These statistics, when conditioned on the derived allele frequency, are able to infer complex population size changes. Moreover, the AF-IBS statistic, which is based on genome-wide SNP data, is robust to varying ascertainment conditions. We constructed an efficient approximate Bayesian computation (ABC) pipeline based on AF-IBD and AF-IBS that can accurately estimate demographic parameters, even for fairly complex models. Finally, we applied this ABC approach to genome-wide SNP data and inferred the demographic histories of two human populations, Yoruba and French. Our results suggest a rather stable ancestral population size with a mild recent expansion for Yoruba, whereas the French seemingly experienced a long-lasting severe bottleneck followed by a drastic population growth. This approach should prove useful for new insights into populations, especially those with complex demographic histories.
Collapse
Affiliation(s)
- Christoph Theunert
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany.
| | | | | | | | | |
Collapse
|
225
|
Alachiotis N, Stamatakis A, Pavlidis P. OmegaPlus: a scalable tool for rapid detection of selective sweeps in whole-genome datasets. Bioinformatics 2012; 28:2274-5. [DOI: 10.1093/bioinformatics/bts419] [Citation(s) in RCA: 88] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
|
226
|
Pavlidis P, Jensen JD, Stephan W, Stamatakis A. A critical assessment of storytelling: gene ontology categories and the importance of validating genomic scans. Mol Biol Evol 2012; 29:3237-48. [PMID: 22617950 DOI: 10.1093/molbev/mss136] [Citation(s) in RCA: 159] [Impact Index Per Article: 13.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
In the age of whole-genome population genetics, so-called genomic scan studies often conclude with a long list of putatively selected loci. These lists are then further scrutinized to annotate these regions by gene function, corresponding biological processes, expression levels, or gene networks. Such annotations are often used to assess and/or verify the validity of the genome scan and the statistical methods that have been used to perform the analyses. Furthermore, these results are frequently considered to validate "true-positives" if the identified regions make biological sense a posteriori. Here, we show that this approach can be potentially misleading. By simulating neutral evolutionary histories, we demonstrate that it is possible not only to obtain an extremely high false-positive rate but also to make biological sense out of the false-positives and construct a sensible biological narrative. Results are compared with a recent polymorphism data set from Drosophila melanogaster.
Collapse
Affiliation(s)
- Pavlos Pavlidis
- The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies (HITS gGmbH), Heidelberg, Germany.
| | | | | | | |
Collapse
|
227
|
Adhikari K, AlChawa T, Ludwig K, Mangold E, Laird N, Lange C. Is it rare or common? Genet Epidemiol 2012; 36:419-29. [PMID: 22549767 DOI: 10.1002/gepi.21637] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2011] [Revised: 03/09/2012] [Accepted: 03/09/2012] [Indexed: 11/09/2022]
Abstract
Many genome-wide association studies (GWAS) have signals with unknown etiology. This paper addresses the question-is such an association signal caused by rare or common variants that lead to increased disease risk? For a genomic region implicated by a GWAS, we use single nucleotide polymorphism (SNP) data in a case-control setting to predict how many common or rare variants there are, using a Bayesian analysis. Our objective is to compute posterior probabilities for configurations of rare and/or common variants. We use an extension of coalescent trees--the ancestral recombination graphs--to model the genealogical history of the samples based on marker data. As we expect SNPs to be in linkage disequilibrium with common disease variants, we can expect the trees to reflect the type of variants. To demonstrate the application, we apply our method to candidate gene sequencing data from a German case-control study on nonsyndromic cleft lip with or without cleft palate.
Collapse
Affiliation(s)
- Kaustubh Adhikari
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115, USA.
| | | | | | | | | | | |
Collapse
|
228
|
Simulated data for genomic selection and genome-wide association studies using a combination of coalescent and gene drop methods. G3-GENES GENOMES GENETICS 2012; 2:425-7. [PMID: 22540033 PMCID: PMC3337470 DOI: 10.1534/g3.111.001297] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/30/2011] [Accepted: 11/09/2011] [Indexed: 11/18/2022]
Abstract
An approach is described for simulating data sequence, genotype, and phenotype data to study genomic selection and genome-wide association studies (GWAS). The simulation method, implemented in a software package called AlphaDrop, can be used to simulate genomic data and phenotypes with flexibility in terms of the historical population structure, recent pedigree structure, distribution of quantitative trait loci effects, and with sequence and single nucleotide polymorphism-phased alleles and genotypes. Ten replicates of a representative scenario used to study genomic selection in livestock were generated and have been made publically available. The simulated data sets were structured to encompass a spectrum of additive quantitative trait loci effect distributions, relationship structures, and single nucleotide polymorphism chip densities.
Collapse
|
229
|
Brown MD, Glazner CG, Zheng C, Thompson EA. Inferring coancestry in population samples in the presence of linkage disequilibrium. Genetics 2012; 190:1447-60. [PMID: 22298700 PMCID: PMC3316655 DOI: 10.1534/genetics.111.137570] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2011] [Accepted: 01/16/2012] [Indexed: 01/03/2023] Open
Abstract
In both pedigree linkage studies and in population-based association studies there has been much interest in the use of modern dense genetic marker data to infer segments of gene identity by descent (ibd) among individuals not known to be related, to increase power and resolution in localizing genes affecting complex traits. In this article, we present a hidden Markov model (HMM) for ibd among a set of chromosomes and describe methods and software for inference of ibd among the four chromosomes of pairs of individuals, using either phased (haplotypic) or unphased (genotypic) data. The model allows for missing data and typing error, but does not model linkage disequilibrium (LD), because fitting an accurate LD model requires large samples from well-studied populations. However, LD remains a major confounding factor, since LD is itself a reflection of coancestry at the population level. To study the impact of LD, we have developed a novel simulation approach to generate realistic dense marker data for the same set of markers but at varying levels of LD. Using this approach, we present results of a study of the impact of LD on the sensitivity and specificity of our HMM model in estimating segments of ibd among sets of four chromosomes and between genotype pairs. We show that, despite not incorporating LD, our model has been quite successful in detecting segments as small as 10(6) bp (1 Mpb); we present also comparisons with fastIBD which uses an LD model in estimating ibd.
Collapse
Affiliation(s)
- M. D. Brown
- Department of Statistics, University of Washington, Seattle, Washington 98195-4322
| | - C. G. Glazner
- Department of Statistics, University of Washington, Seattle, Washington 98195-4322
| | - C. Zheng
- Department of Statistics, University of Washington, Seattle, Washington 98195-4322
| | - E. A. Thompson
- Department of Statistics, University of Washington, Seattle, Washington 98195-4322
| |
Collapse
|
230
|
Clark SA, Hickey JM, Daetwyler HD, van der Werf JHJ. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet Sel Evol 2012. [PMID: 22321529 DOI: 10.1186/1297‐9686‐44‐4] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND The theory of genomic selection is based on the prediction of the effects of genetic markers in linkage disequilibrium with quantitative trait loci. However, genomic selection also relies on relationships between individuals to accurately predict genetic value. This study aimed to examine the importance of information on relatives versus that of unrelated or more distantly related individuals on the estimation of genomic breeding values. METHODS Simulated and real data were used to examine the effects of various degrees of relationship on the accuracy of genomic selection. Genomic Best Linear Unbiased Prediction (gBLUP) was compared to two pedigree based BLUP methods, one with a shallow one generation pedigree and the other with a deep ten generation pedigree. The accuracy of estimated breeding values for different groups of selection candidates that had varying degrees of relationships to a reference data set of 1750 animals was investigated. RESULTS The gBLUP method predicted breeding values more accurately than BLUP. The most accurate breeding values were estimated using gBLUP for closely related animals. Similarly, the pedigree based BLUP methods were also accurate for closely related animals, however when the pedigree based BLUP methods were used to predict unrelated animals, the accuracy was close to zero. In contrast, gBLUP breeding values, for animals that had no pedigree relationship with animals in the reference data set, allowed substantial accuracy. CONCLUSIONS An animal's relationship to the reference data set is an important factor for the accuracy of genomic predictions. Animals that share a close relationship to the reference data set had the highest accuracy from genomic predictions. However a baseline accuracy that is driven by the reference data set size and the overall population effective population size enables gBLUP to estimate a breeding value for unrelated animals within a population (breed), using information previously ignored by pedigree based BLUP methods.
Collapse
Affiliation(s)
- Samuel A Clark
- University of New England, Armidale, NSW 2351, Australia.
| | | | | | | |
Collapse
|
231
|
Clark SA, Hickey JM, Daetwyler HD, van der Werf JHJ. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet Sel Evol 2012; 44:4. [PMID: 22321529 PMCID: PMC3299588 DOI: 10.1186/1297-9686-44-4] [Citation(s) in RCA: 184] [Impact Index Per Article: 15.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2011] [Accepted: 02/09/2012] [Indexed: 01/17/2023] Open
Abstract
Background The theory of genomic selection is based on the prediction of the effects of genetic markers in linkage disequilibrium with quantitative trait loci. However, genomic selection also relies on relationships between individuals to accurately predict genetic value. This study aimed to examine the importance of information on relatives versus that of unrelated or more distantly related individuals on the estimation of genomic breeding values. Methods Simulated and real data were used to examine the effects of various degrees of relationship on the accuracy of genomic selection. Genomic Best Linear Unbiased Prediction (gBLUP) was compared to two pedigree based BLUP methods, one with a shallow one generation pedigree and the other with a deep ten generation pedigree. The accuracy of estimated breeding values for different groups of selection candidates that had varying degrees of relationships to a reference data set of 1750 animals was investigated. Results The gBLUP method predicted breeding values more accurately than BLUP. The most accurate breeding values were estimated using gBLUP for closely related animals. Similarly, the pedigree based BLUP methods were also accurate for closely related animals, however when the pedigree based BLUP methods were used to predict unrelated animals, the accuracy was close to zero. In contrast, gBLUP breeding values, for animals that had no pedigree relationship with animals in the reference data set, allowed substantial accuracy. Conclusions An animal's relationship to the reference data set is an important factor for the accuracy of genomic predictions. Animals that share a close relationship to the reference data set had the highest accuracy from genomic predictions. However a baseline accuracy that is driven by the reference data set size and the overall population effective population size enables gBLUP to estimate a breeding value for unrelated animals within a population (breed), using information previously ignored by pedigree based BLUP methods.
Collapse
Affiliation(s)
- Samuel A Clark
- University of New England, Armidale, NSW 2351, Australia.
| | | | | | | |
Collapse
|
232
|
|
233
|
Axelsson E, Webster MT, Ratnakumar A, Ponting CP, Lindblad-Toh K. Death of PRDM9 coincides with stabilization of the recombination landscape in the dog genome. Genome Res 2012; 22:51-63. [PMID: 22006216 PMCID: PMC3246206 DOI: 10.1101/gr.124123.111] [Citation(s) in RCA: 99] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2011] [Accepted: 10/05/2011] [Indexed: 11/25/2022]
Abstract
Analysis of diverse eukaryotes has revealed that recombination events cluster in discrete genomic locations known as hotspots. In humans, a zinc-finger protein, PRDM9, is believed to initiate recombination in >40% of hotspots by binding to a specific DNA sequence motif. However, the PRDM9 coding sequence is disrupted in the dog genome assembly, raising questions regarding the nature and control of recombination in dogs. By analyzing the sequences of PRDM9 orthologs in a number of dog breeds and several carnivores, we show here that this gene was inactivated early in canid evolution. We next use patterns of linkage disequilibrium using more than 170,000 SNP markers typed in almost 500 dogs to estimate the recombination rates in the dog genome using a coalescent-based approach. Broad-scale recombination rates show good correspondence with an existing linkage-based map. Significant variation in recombination rate is observed on the fine scale, and we are able to detect over 4000 recombination hotspots with high confidence. In contrast to human hotspots, 40% of canine hotspots are characterized by a distinct peak in GC content. A comparative genomic analysis indicates that these peaks are present also as weaker peaks in the panda, suggesting that the hotspots have been continually reinforced by accelerated and strongly GC biased nucleotide substitutions, consistent with the long-term action of biased gene conversion on the dog lineage. These results are consistent with the loss of PRDM9 in canids, resulting in a greater evolutionary stability of recombination hotspots. The genetic determinants of recombination hotspots in the dog genome may thus reflect a fundamental process of relevance to diverse animal species.
Collapse
Affiliation(s)
- Erik Axelsson
- Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, 75237 Uppsala, Sweden
| | - Matthew T. Webster
- Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, 75237 Uppsala, Sweden
| | - Abhirami Ratnakumar
- Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, 75237 Uppsala, Sweden
| | - Chris P. Ponting
- MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford OX1 3QX, United Kingdom
| | - Kerstin Lindblad-Toh
- Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, 75237 Uppsala, Sweden
- Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, Massachusetts 02139, USA
| |
Collapse
|
234
|
Abstract
The network structure that captures the common evolutionary history of a diploid population has been termed an ancestral recombinations graph. When the structure is a tree the number of internal nodes is usually [Formula: see text] where K is the number of samples. However, when the structure is not a tree, this number has been observed to be very large. We explore the possible redundancies in this structure. This has implications both in simulations and in reconstructability studies.
Collapse
Affiliation(s)
- Laxmi Parida
- IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA.
| |
Collapse
|
235
|
Metspalu M, Romero I, Yunusbayev B, Chaubey G, Mallick C, Hudjashov G, Nelis M, Mägi R, Metspalu E, Remm M, Pitchappan R, Singh L, Thangaraj K, Villems R, Kivisild T. Shared and unique components of human population structure and genome-wide signals of positive selection in South Asia. Am J Hum Genet 2011; 89:731-44. [PMID: 22152676 DOI: 10.1016/j.ajhg.2011.11.010] [Citation(s) in RCA: 121] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2011] [Revised: 09/06/2011] [Accepted: 11/12/2011] [Indexed: 02/06/2023] Open
Abstract
South Asia harbors one of the highest levels genetic diversity in Eurasia, which could be interpreted as a result of its long-term large effective population size and of admixture during its complex demographic history. In contrast to Pakistani populations, populations of Indian origin have been underrepresented in previous genomic scans of positive selection and population structure. Here we report data for more than 600,000 SNP markers genotyped in 142 samples from 30 ethnic groups in India. Combining our results with other available genome-wide data, we show that Indian populations are characterized by two major ancestry components, one of which is spread at comparable frequency and haplotype diversity in populations of South and West Asia and the Caucasus. The second component is more restricted to South Asia and accounts for more than 50% of the ancestry in Indian populations. Haplotype diversity associated with these South Asian ancestry components is significantly higher than that of the components dominating the West Eurasian ancestry palette. Modeling of the observed haplotype diversities suggests that both Indian ancestry components are older than the purported Indo-Aryan invasion 3,500 YBP. Consistent with the results of pairwise genetic distances among world regions, Indians share more ancestry signals with West than with East Eurasians. However, compared to Pakistani populations, a higher proportion of their genes show regionally specific signals of high haplotype homozygosity. Among such candidates of positive selection in India are MSTN and DOK5, both of which have potential implications in lipid metabolism and the etiology of type 2 diabetes.
Collapse
|
236
|
Yuan X, Miller DJ, Zhang J, Herrington D, Wang Y. An overview of population genetic data simulation. J Comput Biol 2011; 19:42-54. [PMID: 22149682 DOI: 10.1089/cmb.2010.0188] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Simulation studies in population genetics play an important role in helping to better understand the impact of various evolutionary and demographic scenarios on sequence variation and sequence patterns, and they also permit investigators to better assess and design analytical methods in the study of disease-associated genetic factors. To facilitate these studies, it is imperative to develop simulators with the capability to accurately generate complex genomic data under various genetic models. Currently, a number of efficient simulation software packages for large-scale genomic data are available, and new simulation programs with more sophisticated capabilities and features continue to emerge. In this article, we review the three basic simulation frameworks--coalescent, forward, and resampling--and some of the existing simulators that fall under these frameworks, comparing them with respect to their evolutionary and demographic scenarios, their computational complexity, and their specific applications. Additionally, we address some limitations in current simulation algorithms and discuss future challenges in the development of more powerful simulation tools.
Collapse
Affiliation(s)
- Xiguo Yuan
- School of Computer Science and Technology, Xidian University, Xi'an, P.R. China
| | | | | | | | | |
Collapse
|
237
|
Identification of genomic regions associated with phenotypic variation between dog breeds using selection mapping. PLoS Genet 2011; 7:e1002316. [PMID: 22022279 PMCID: PMC3192833 DOI: 10.1371/journal.pgen.1002316] [Citation(s) in RCA: 273] [Impact Index Per Article: 21.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2011] [Accepted: 07/30/2011] [Indexed: 12/30/2022] Open
Abstract
The extraordinary phenotypic diversity of dog breeds has been sculpted by a unique population history accompanied by selection for novel and desirable traits. Here we perform a comprehensive analysis using multiple test statistics to identify regions under selection in 509 dogs from 46 diverse breeds using a newly developed high-density genotyping array consisting of >170,000 evenly spaced SNPs. We first identify 44 genomic regions exhibiting extreme differentiation across multiple breeds. Genetic variation in these regions correlates with variation in several phenotypic traits that vary between breeds, and we identify novel associations with both morphological and behavioral traits. We next scan the genome for signatures of selective sweeps in single breeds, characterized by long regions of reduced heterozygosity and fixation of extended haplotypes. These scans identify hundreds of regions, including 22 blocks of homozygosity longer than one megabase in certain breeds. Candidate selection loci are strongly enriched for developmental genes. We chose one highly differentiated region, associated with body size and ear morphology, and characterized it using high-throughput sequencing to provide a list of variants that may directly affect these traits. This study provides a catalogue of genomic regions showing extreme reduction in genetic variation or population differentiation in dogs, including many linked to phenotypic variation. The many blocks of reduced haplotype diversity observed across the genome in dog breeds are the result of both selection and genetic drift, but extended blocks of homozygosity on a megabase scale appear to be best explained by selection. Further elucidation of the variants under selection will help to uncover the genetic basis of complex traits and disease. There are hundreds of dog breeds that exhibit massive differences in appearance and behavior sculpted by tightly controlled selective breeding. This large-scale natural experiment has provided an ideal resource that geneticists can use to search for genetic variants that control these differences. With this goal, we developed a high-density array that surveys variable sites at more than 170,000 positions in the dog genome and used it to analyze genetic variation in 46 breeds. We identify 44 chromosomal regions that are extremely variable between breeds and are likely to control many of the traits that vary between them, including curly tails and sociality. Many other regions also bear the signature of strong artificial selection. We characterize one such region, known to associate with body size and ear type, in detail using “next-generation” sequencing technology to identify candidate mutations that may control these traits. Our results suggest that artificial selection has targeted genes involved in development and metabolism and that it may have increased the incidence of disease in dog breeds. Knowledge of these regions will be of great importance for uncovering the genetic basis of variation between dog breeds and for finding mutations that cause disease.
Collapse
|
238
|
Esteve-Codina A, Kofler R, Himmelbauer H, Ferretti L, Vivancos AP, Groenen MAM, Folch JM, Rodríguez MC, Pérez-Enciso M. Partial short-read sequencing of a highly inbred Iberian pig and genomics inference thereof. Heredity (Edinb) 2011; 107:256-64. [PMID: 21407255 PMCID: PMC3183945 DOI: 10.1038/hdy.2011.13] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2010] [Revised: 01/20/2011] [Accepted: 01/27/2011] [Indexed: 11/08/2022] Open
Abstract
Despite dramatic reduction in sequencing costs with the advent of next generation sequencing technologies, obtaining a complete mammalian genome sequence at sufficient depth is still costly. An alternative is partial sequencing. Here, we have sequenced a reduced representation library of an Iberian sow from the Guadyerbas strain, a highly inbred strain that has been used in numerous QTL studies because of its extreme phenotypic characteristics. Using the Illumina Genome Analyzer II (San Diego, CA, USA), we resequenced ∼ 1% of the genome with average 4 × depth, identifying 68,778 polymorphisms. Of these, 55,457 were putative fixed differences with respect to the assembly, based on the genome of a Duroc pig, and 13,321 were heterozygous positions within Guadyerbas. Despite being highly inbred, the estimate of heterozygosity within Guadyerbas was ∼ 0.78 kb(-1) in autosomes, after correcting for low depth. Nucleotide variability was consistently higher at the telomeric regions than on the rest of the chromosome, likely a result of increased recombination rates. Further, variability was 50% lower in the X-chromosome than in autosomes, which may be explained by a recent bottleneck or by selection. We divided the whole genome in 500 kb windows and we analyzed overrepresented gene ontology terms in regions of low and high variability. Multi organism process, pigmentation and cell killing were overrepresented in high variability regions and metabolic process ontology, within low variability regions. Further, a genome wide Hudson-Kreitman-Aguadé test was carried out per window; overall, variability was in agreement with neutral expectations.
Collapse
Affiliation(s)
- A Esteve-Codina
- Departament de Ciència Animal i dels Aliments, Facultat de Veterinària, Universitat Autònoma de Barcelona, Bellaterra, Spain
| | - R Kofler
- Centre for Genomic Regulation (CRG), Universitat Pompeu Fabra, Barcelona, Spain
- Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - H Himmelbauer
- Centre for Genomic Regulation (CRG), Universitat Pompeu Fabra, Barcelona, Spain
| | - L Ferretti
- Departament de Ciència Animal i dels Aliments, Facultat de Veterinària, Universitat Autònoma de Barcelona, Bellaterra, Spain
- Department of Animal Science, Centre for Research in Agrigenomics (CRAG), Bellaterra, Spain
| | - A P Vivancos
- Centre for Genomic Regulation (CRG), Universitat Pompeu Fabra, Barcelona, Spain
| | - M A M Groenen
- Animal Breeding and Genomics Centre, Wageningen University and Research Centre, Wageningen, The Netherlands
| | - J M Folch
- Departament de Ciència Animal i dels Aliments, Facultat de Veterinària, Universitat Autònoma de Barcelona, Bellaterra, Spain
| | - M C Rodríguez
- Departamento de Mejora Genética Animal, Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Madrid, Spain
| | - M Pérez-Enciso
- Departament de Ciència Animal i dels Aliments, Facultat de Veterinària, Universitat Autònoma de Barcelona, Bellaterra, Spain
- Institut Català de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
| |
Collapse
|
239
|
Wegmann D, Kessner DE, Veeramah KR, Mathias RA, Nicolae DL, Yanek LR, Sun YV, Torgerson DG, Rafaels N, Mosley T, Becker LC, Ruczinski I, Beaty TH, Kardia SLR, Meyers DA, Barnes KC, Becker DM, Freimer NB, Novembre J. Recombination rates in admixed individuals identified by ancestry-based inference. Nat Genet 2011; 43:847-53. [DOI: 10.1038/ng.894] [Citation(s) in RCA: 96] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2011] [Accepted: 07/01/2011] [Indexed: 12/17/2022]
|
240
|
He Y, Li C, Amos CI, Xiong M, Ling H, Jin L. Accelerating haplotype-based genome-wide association study using perfect phylogeny and phase-known reference data. PLoS One 2011; 6:e22097. [PMID: 21789217 PMCID: PMC3137625 DOI: 10.1371/journal.pone.0022097] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2010] [Accepted: 06/17/2011] [Indexed: 11/18/2022] Open
Abstract
The genome-wide association study (GWAS) has become a routine approach for mapping disease risk loci with the advent of large-scale genotyping technologies. Multi-allelic haplotype markers can provide superior power compared with single-SNP markers in mapping disease loci. However, the application of haplotype-based analysis to GWAS is usually bottlenecked by prohibitive time cost for haplotype inference, also known as phasing. In this study, we developed an efficient approach to haplotype-based analysis in GWAS. By using a reference panel, our method accelerated the phasing process and reduced the potential bias generated by unrealistic assumptions in phasing process. The haplotype-based approach delivers great power and no type I error inflation for association studies. With only a medium-size reference panel, phasing error in our method is comparable to the genotyping error afforded by commercial genotyping solutions.
Collapse
Affiliation(s)
- Yungang He
- Department of Computational Genomics, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai, China
- * E-mail: (YH); (LJ)
| | - Cong Li
- Department of Computational Genomics, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai, China
| | - Christopher I. Amos
- Department of Epidemiology, University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
| | - Momiao Xiong
- Human Genetics Center, University of Texas School of Public Health, Houston, Texas, United States of America
- State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
| | - Hua Ling
- Center for Inherited Disease Research, Johns Hopkins University, Baltimore, Maryland, United States of America
| | - Li Jin
- Department of Computational Genomics, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai, China
- State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
- * E-mail: (YH); (LJ)
| |
Collapse
|
241
|
Clark SA, Hickey JM, van der Werf JHJ. Different models of genetic variation and their effect on genomic evaluation. Genet Sel Evol 2011; 43:18. [PMID: 21575265 PMCID: PMC3114710 DOI: 10.1186/1297-9686-43-18] [Citation(s) in RCA: 125] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2010] [Accepted: 05/17/2011] [Indexed: 12/30/2022] Open
Abstract
Background The theory of genomic selection is based on the prediction of the effects of quantitative trait loci (QTL) in linkage disequilibrium (LD) with markers. However, there is increasing evidence that genomic selection also relies on "relationships" between individuals to accurately predict genetic values. Therefore, a better understanding of what genomic selection actually predicts is relevant so that appropriate methods of analysis are used in genomic evaluations. Methods Simulation was used to compare the performance of estimates of breeding values based on pedigree relationships (Best Linear Unbiased Prediction, BLUP), genomic relationships (gBLUP), and based on a Bayesian variable selection model (Bayes B) to estimate breeding values under a range of different underlying models of genetic variation. The effects of different marker densities and varying animal relationships were also examined. Results This study shows that genomic selection methods can predict a proportion of the additive genetic value when genetic variation is controlled by common quantitative trait loci (QTL model), rare loci (rare variant model), all loci (infinitesimal model) and a random association (a polygenic model). The Bayes B method was able to estimate breeding values more accurately than gBLUP under the QTL and rare variant models, for the alternative marker densities and reference populations. The Bayes B and gBLUP methods had similar accuracies under the infinitesimal model. Conclusions Our results suggest that Bayes B is superior to gBLUP to estimate breeding values from genomic data. The underlying model of genetic variation greatly affects the predictive ability of genomic selection methods, and the superiority of Bayes B over gBLUP is highly dependent on the presence of large QTL effects. The use of SNP sequence data will outperform the less dense marker panels. However, the size and distribution of QTL effects and the size of reference populations still greatly influence the effectiveness of using sequence data for genomic prediction.
Collapse
Affiliation(s)
- Samuel A Clark
- School of Environmental and Rural Science, University of New England, Armidale, NSW 2351, Australia.
| | | | | |
Collapse
|
242
|
Amaral AJ, Ferretti L, Megens HJ, Crooijmans RPMA, Nie H, Ramos-Onsins SE, Perez-Enciso M, Schook LB, Groenen MAM. Genome-wide footprints of pig domestication and selection revealed through massive parallel sequencing of pooled DNA. PLoS One 2011; 6:e14782. [PMID: 21483733 PMCID: PMC3070695 DOI: 10.1371/journal.pone.0014782] [Citation(s) in RCA: 99] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2010] [Accepted: 01/29/2011] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Artificial selection has caused rapid evolution in domesticated species. The identification of selection footprints across domesticated genomes can contribute to uncover the genetic basis of phenotypic diversity. METHODOLOGY/MAIN FINDINGS Genome wide footprints of pig domestication and selection were identified using massive parallel sequencing of pooled reduced representation libraries (RRL) representing ∼2% of the genome from wild boar and four domestic pig breeds (Large White, Landrace, Duroc and Pietrain) which have been under strong selection for muscle development, growth, behavior and coat color. Using specifically developed statistical methods that account for DNA pooling, low mean sequencing depth, and sequencing errors, we provide genome-wide estimates of nucleotide diversity and genetic differentiation in pig. Widespread signals suggestive of positive and balancing selection were found and the strongest signals were observed in Pietrain, one of the breeds most intensively selected for muscle development. Most signals were population-specific but affected genomic regions which harbored genes for common biological categories including coat color, brain development, muscle development, growth, metabolism, olfaction and immunity. Genetic differentiation in regions harboring genes related to muscle development and growth was higher between breeds than between a given breed and the wild boar. CONCLUSIONS/SIGNIFICANCE These results, suggest that although domesticated breeds have experienced similar selective pressures, selection has acted upon different genes. This might reflect the multiple domestication events of European breeds or could be the result of subsequent introgression of Asian alleles. Overall, it was estimated that approximately 7% of the porcine genome has been affected by selection events. This study illustrates that the massive parallel sequencing of genomic pools is a cost-effective approach to identify footprints of selection.
Collapse
Affiliation(s)
- Andreia J. Amaral
- Animal Breeding and Genomics Centre, Wageningen University, Wageningen, The Netherlands
| | - Luca Ferretti
- Department of Animal Science and Food Technology, Universitat Autonoma de Barcelona, Bellaterra, Spain
- Animal Science Department, Centre for Research in Agricultural Genomics, Bellaterra, Spain
| | - Hendrik-Jan Megens
- Animal Breeding and Genomics Centre, Wageningen University, Wageningen, The Netherlands
| | | | - Haisheng Nie
- Animal Breeding and Genomics Centre, Wageningen University, Wageningen, The Netherlands
| | - Sebastian E. Ramos-Onsins
- Department of Animal Science and Food Technology, Universitat Autonoma de Barcelona, Bellaterra, Spain
- Animal Science Department, Centre for Research in Agricultural Genomics, Bellaterra, Spain
| | - Miguel Perez-Enciso
- Department of Animal Science and Food Technology, Universitat Autonoma de Barcelona, Bellaterra, Spain
- Animal Science Department, Centre for Research in Agricultural Genomics, Bellaterra, Spain
- Life and Medical Sciences, Institució Catalana de Recerca i Estudis Avançats, Barcelona, Spain
| | - Lawrence B. Schook
- Institute for Genomic Biology, University of Illinois, Urbana, Illinois, United States of America
| | - Martien A. M. Groenen
- Animal Breeding and Genomics Centre, Wageningen University, Wageningen, The Netherlands
| |
Collapse
|
243
|
Paul JS, Steinrücken M, Song YS. An accurate sequentially Markov conditional sampling distribution for the coalescent with recombination. Genetics 2011; 187:1115-28. [PMID: 21270390 PMCID: PMC3070520 DOI: 10.1534/genetics.110.125534] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2010] [Accepted: 01/21/2011] [Indexed: 02/07/2023] Open
Abstract
The sequentially Markov coalescent is a simplified genealogical process that aims to capture the essential features of the full coalescent model with recombination, while being scalable in the number of loci. In this article, the sequentially Markov framework is applied to the conditional sampling distribution (CSD), which is at the core of many statistical tools for population genetic analyses. Briefly, the CSD describes the probability that an additionally sampled DNA sequence is of a certain type, given that a collection of sequences has already been observed. A hidden Markov model (HMM) formulation of the sequentially Markov CSD is developed here, yielding an algorithm with time complexity linear in both the number of loci and the number of haplotypes. This work provides a highly accurate, practical approximation to a recently introduced CSD derived from the diffusion process associated with the coalescent with recombination. It is empirically demonstrated that the improvement in accuracy of the new CSD over previously proposed HMM-based CSDs increases substantially with the number of loci. The framework presented here can be adopted in a wide range of applications in population genetics, including imputing missing sequence data, estimating recombination rates, and inferring human colonization history.
Collapse
Affiliation(s)
- Joshua S. Paul
- Computer Science Division and Department of Statistics, University of California, Berkeley, California 94720
| | - Matthias Steinrücken
- Computer Science Division and Department of Statistics, University of California, Berkeley, California 94720
| | - Yun S. Song
- Computer Science Division and Department of Statistics, University of California, Berkeley, California 94720
| |
Collapse
|
244
|
Excoffier L, Foll M. fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics 2011; 27:1332-4. [DOI: 10.1093/bioinformatics/btr124] [Citation(s) in RCA: 343] [Impact Index Per Article: 26.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
|
245
|
Hickey JM, Kinghorn BP, Tier B, Wilson JF, Dunstan N, van der Werf JHJ. A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes. Genet Sel Evol 2011; 43:12. [PMID: 21388557 PMCID: PMC3068938 DOI: 10.1186/1297-9686-43-12] [Citation(s) in RCA: 96] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2010] [Accepted: 03/10/2011] [Indexed: 06/24/2024] Open
Abstract
BACKGROUND Knowing the phase of marker genotype data can be useful in genome-wide association studies, because it makes it possible to use analysis frameworks that account for identity by descent or parent of origin of alleles and it can lead to a large increase in data quantities via genotype or sequence imputation. Long-range phasing and haplotype library imputation constitute a fast and accurate method to impute phase for SNP data. METHODS A long-range phasing and haplotype library imputation algorithm was developed. It combines information from surrogate parents and long haplotypes to resolve phase in a manner that is not dependent on the family structure of a dataset or on the presence of pedigree information. RESULTS The algorithm performed well in both simulated and real livestock and human datasets in terms of both phasing accuracy and computation efficiency. The percentage of alleles that could be phased in both simulated and real datasets of varying size generally exceeded 98% while the percentage of alleles incorrectly phased in simulated data was generally less than 0.5%. The accuracy of phasing was affected by dataset size, with lower accuracy for dataset sizes less than 1000, but was not affected by effective population size, family data structure, presence or absence of pedigree information, and SNP density. The method was computationally fast. In comparison to a commonly used statistical method (fastPHASE), the current method made about 8% less phasing mistakes and ran about 26 times faster for a small dataset. For larger datasets, the differences in computational time are expected to be even greater. A computer program implementing these methods has been made available. CONCLUSIONS The algorithm and software developed in this study make feasible the routine phasing of high-density SNP chips in large datasets.
Collapse
|
246
|
Mailund T, Dutheil JY, Hobolth A, Lunter G, Schierup MH. Estimating divergence time and ancestral effective population size of Bornean and Sumatran orangutan subspecies using a coalescent hidden Markov model. PLoS Genet 2011; 7:e1001319. [PMID: 21408205 PMCID: PMC3048369 DOI: 10.1371/journal.pgen.1001319] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2009] [Accepted: 01/25/2011] [Indexed: 12/01/2022] Open
Abstract
Due to genetic variation in the ancestor of two populations or two species, the divergence time for DNA sequences from two populations is variable along the genome. Within genomic segments all bases will share the same divergence—because they share a most recent common ancestor—when no recombination event has occurred to split them apart. The size of these segments of constant divergence depends on the recombination rate, but also on the speciation time, the effective population size of the ancestral population, as well as demographic effects and selection. Thus, inference of these parameters may be possible if we can decode the divergence times along a genomic alignment. Here, we present a new hidden Markov model that infers the changing divergence (coalescence) times along the genome alignment using a coalescent framework, in order to estimate the speciation time, the recombination rate, and the ancestral effective population size. The model is efficient enough to allow inference on whole-genome data sets. We first investigate the power and consistency of the model with coalescent simulations and then apply it to the whole-genome sequences of the two orangutan sub-species, Bornean (P. p. pygmaeus) and Sumatran (P. p. abelii) orangutans from the Orangutan Genome Project. We estimate the speciation time between the two sub-species to be thousand years ago and the effective population size of the ancestral orangutan species to be , consistent with recent results based on smaller data sets. We also report a negative correlation between chromosome size and ancestral effective population size, which we interpret as a signature of recombination increasing the efficacy of selection. We present a hidden Markov model that uses variation in coalescence times between two distantly related populations, or closely related species, to infer population genetics parameters in ancestral population or species. The model infers the divergence times in segments along the alignment. Using coalescent simulations, we show that the model accurately estimates the divergence time between the two populations and the effective population size of the ancestral population. We apply the model to the recently sequenced orangutan sub-species and estimate their divergence time and the effective population size of their ancestor population.
Collapse
Affiliation(s)
- Thomas Mailund
- Bioinformatics Research Centre, Aarhus University, Denmark.
| | | | | | | | | |
Collapse
|
247
|
Parida L, Palamara PF, Javed A. A minimal descriptor of an ancestral recombinations graph. BMC Bioinformatics 2011; 12 Suppl 1:S6. [PMID: 21342589 PMCID: PMC3044314 DOI: 10.1186/1471-2105-12-s1-s6] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Ancestral Recombinations Graph (ARG) is a phylogenetic structure that encodes both duplication events, such as mutations, as well as genetic exchange events, such as recombinations: this captures the (genetic) dynamics of a population evolving over generations. RESULTS In this paper, we identify structure-preserving and samples-preserving core of an ARG G and call it the minimal descriptor ARG of G. Its structure-preserving characteristic ensures that all the branch lengths of the marginal trees of the minimal descriptor ARG are identical to that of G and the samples-preserving property asserts that the patterns of genetic variation in the samples of the minimal descriptor ARG are exactly the same as that of G. We also prove that even an unbounded G has a finite minimal descriptor, that continues to preserve certain (graph-theoretic) properties of G and for an appropriate class of ARGs, our estimate (Eqn 8) as well as empirical observation is that the expected reduction in the number of vertices is exponential. CONCLUSIONS Based on the definition of this lossless and bounded structure, we derive local properties of the vertices of a minimal descriptor ARG, which lend itself very naturally to the design of efficient sampling algorithms. We further show that a class of minimal descriptors, that of binary ARGs, models the standard coalescent exactly (Thm 6).
Collapse
Affiliation(s)
- Laxmi Parida
- Computational Genomics, IBM T J Watson Research, Yorktown, New York, USA.
| | | | | |
Collapse
|
248
|
Albert FW, Hodges E, Jensen JD, Besnier F, Xuan Z, Rooks M, Bhattacharjee A, Brizuela L, Good JM, Green RE, Burbano HA, Plyusnina IZ, Trut L, Andersson L, Schöneberg T, Carlborg O, Hannon GJ, Pääbo S. Targeted resequencing of a genomic region influencing tameness and aggression reveals multiple signals of positive selection. Heredity (Edinb) 2011; 107:205-14. [PMID: 21304545 DOI: 10.1038/hdy.2011.4] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023] Open
Abstract
The identification of the causative genetic variants in quantitative trait loci (QTL) influencing phenotypic traits is challenging, especially in crosses between outbred strains. We have previously identified several QTL influencing tameness and aggression in a cross between two lines of wild-derived, outbred rats (Rattus norvegicus) selected for their behavior towards humans. Here, we use targeted sequence capture and massively parallel sequencing of all genes in the strongest QTL in the founder animals of the cross. We identify many novel sequence variants, several of which are potentially functionally relevant. The QTL contains several regions where either the tame or the aggressive founders contain no sequence variation, and two regions where alternative haplotypes are fixed between the founders. A re-analysis of the QTL signal showed that the causative site is likely to be fixed among the tame founder animals, but that several causative alleles may segregate among the aggressive founder animals. Using a formal test for the detection of positive selection, we find 10 putative positively selected regions, some of which are close to genes known to influence behavior. Together, these results show that the QTL is probably not caused by a single selected site, but may instead represent the joint effects of several sites that were targets of polygenic selection.
Collapse
Affiliation(s)
- F W Albert
- Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany.
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
249
|
Yuan X, Zhang J, Wang Y. Simulating linkage disequilibrium structures in a human population for SNP association studies. Biochem Genet 2011; 49:395-409. [PMID: 21234669 DOI: 10.1007/s10528-011-9416-x] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2010] [Accepted: 12/02/2010] [Indexed: 12/22/2022]
Abstract
Existing simulation methods usually simulate linkage disequilibrium (LD) structures starting with an initial population that is randomly generated according to specified allele frequencies. These at random based methods might be unstable because the LD level of the initial population is generally extremely low. This study presents a new algorithm, SIMLD, to simulate genome populations with real LD structures. SIMLD begins from an initial population with possibly the highest LD level, and then the LD decays to fit the desired level through processes of mating and recombination over generations. SIMLD can produce case-control samples according to various disease models. Using empirical SNP marker information from three populations of HapMap data, we implement the proposed algorithm and demonstrate a set of experimental results.
Collapse
Affiliation(s)
- Xiguo Yuan
- School of Computer Science and Technology, Xidian University, Xi'an, China.
| | | | | |
Collapse
|
250
|
Le SQ, Durbin R. SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome Res 2010; 21:952-60. [PMID: 20980557 DOI: 10.1101/gr.113084.110] [Citation(s) in RCA: 124] [Impact Index Per Article: 8.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
Reductions in the cost of sequencing have enabled whole-genome sequencing to identify sequence variants segregating in a population. An efficient approach is to sequence many samples at low coverage, then to combine data across samples to detect shared variants. Here, we present methods to discover and genotype single-nucleotide polymorphism (SNP) sites from low-coverage sequencing data, making use of shared haplotype (linkage disequilibrium) information. For each population, we first collect SNP candidates based on independent sequence calls per site. We then use MARGARITA with genotype or phased haplotype data from the same samples to collect 20 ancestral recombination graphs (ARGs). We refine the posterior probability of SNP candidates by considering possible mutations at internal branches of the 40 marginal ancestral trees inferred from the 20 ARGs at the left and right flanking genotype sites. Using a population genetic prior distribution on tree-branch length and Bayesian inference, we determine a posterior probability of the SNP being real and also the most probable phased genotype call for each individual. We present experiments on both simulation data and real data from the 1000 Genomes Project to prove the applicability of the methods. We also explore the relative tradeoff between sequencing depth and the number of sequenced samples.
Collapse
Affiliation(s)
- Si Quang Le
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, United Kingdom
| | | |
Collapse
|