1. Ray DD, Flagel L, Schrider DR. IntroUNET: Identifying introgressed alleles via semantic segmentation. PLoS Genet 2024; 20:e1010657. [PMID: 38377104 PMCID: PMC10906877 DOI: 10.1371/journal.pgen.1010657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2023] [Revised: 03/01/2024] [Accepted: 01/29/2024] [Indexed: 02/22/2024]
Abstract
A growing body of evidence suggests that gene flow between closely related species is a widespread phenomenon. Alleles that introgress from one species into a close relative are typically neutral or deleterious, but sometimes confer a significant fitness advantage. Given the potential relevance to speciation and adaptation, numerous methods have therefore been devised to identify regions of the genome that have experienced introgression. Recently, supervised machine learning approaches have been shown to be highly effective for detecting introgression. One especially promising approach is to treat population genetic inference as an image classification problem, and feed an image representation of a population genetic alignment as input to a deep neural network that distinguishes among evolutionary models (i.e. introgression or no introgression). However, if we wish to investigate the full extent and fitness effects of introgression, merely identifying genomic regions in a population genetic alignment that harbor introgressed loci is insufficient; ideally we would be able to infer precisely which individuals have introgressed material and at which positions in the genome. Here we adapt a deep learning algorithm for semantic segmentation, the task of correctly identifying the type of object to which each individual pixel in an image belongs, to the task of identifying introgressed alleles. Our trained neural network is thus able to infer, for each individual in a two-population alignment, which of that individual's alleles were introgressed from the other population. We use simulated data to show that this approach is highly accurate, and that it can be readily extended to identify alleles that are introgressed from an unsampled "ghost" population, performing comparably to a supervised learning method tailored specifically to that task.
Finally, we apply this method to data from Drosophila, showing that it is able to accurately recover introgressed haplotypes from real data. This analysis reveals that introgressed alleles are typically confined to lower frequencies within genic regions, suggestive of purifying selection, but are found at much higher frequencies in a region previously shown to be affected by adaptive introgression. Our method's success in recovering introgressed haplotypes in challenging real-world scenarios underscores the utility of deep learning approaches for making richer evolutionary inferences from genomic data.
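As a rough illustration of the per-allele framing described in this abstract, the following minimal sketch treats an alignment as an individuals-by-sites matrix and scores a predicted introgression mask against a known one. The function names (`threshold_mask`, `per_allele_accuracy`), the 0.5 cutoff, and the toy values are assumptions made for illustration; they are not part of IntroUNET, and no neural network is involved.

```python
from typing import List

def threshold_mask(probs: List[List[float]], cutoff: float = 0.5) -> List[List[int]]:
    """Turn per-allele introgression probabilities into a 0/1 mask.

    Rows are individuals (haplotypes), columns are sites. The cutoff is
    an illustrative assumption, not a value taken from the paper.
    """
    return [[1 if p >= cutoff else 0 for p in row] for row in probs]

def per_allele_accuracy(pred: List[List[int]], truth: List[List[int]]) -> float:
    """Fraction of (individual, site) cells where the masks agree."""
    total = sum(len(row) for row in truth)
    correct = sum(
        p == t
        for pred_row, truth_row in zip(pred, truth)
        for p, t in zip(pred_row, truth_row)
    )
    return correct / total

# Toy example: 2 individuals x 2 sites (hypothetical values).
probs = [[0.9, 0.2], [0.4, 0.7]]
truth = [[1, 0], [1, 1]]
mask = threshold_mask(probs)
accuracy = per_allele_accuracy(mask, truth)
```

Scoring every cell of the matrix, rather than one label per window, is what separates segmentation-style output from window-level classification.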
Affiliation(s)
- Dylan D. Ray
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Lex Flagel
  - Division of Data Science, Gencove Inc., New York, New York, United States of America
  - Department of Plant and Microbial Biology, University of Minnesota, Saint Paul, Minnesota, United States of America
- Daniel R. Schrider
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
2. Ray DD, Flagel L, Schrider DR. IntroUNET: identifying introgressed alleles via semantic segmentation. bioRxiv 2024:2023.02.07.527435. [PMID: 36865105 PMCID: PMC9979274 DOI: 10.1101/2023.02.07.527435] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
A growing body of evidence suggests that gene flow between closely related species is a widespread phenomenon. Alleles that introgress from one species into a close relative are typically neutral or deleterious, but sometimes confer a significant fitness advantage. Given the potential relevance to speciation and adaptation, numerous methods have therefore been devised to identify regions of the genome that have experienced introgression. Recently, supervised machine learning approaches have been shown to be highly effective for detecting introgression. One especially promising approach is to treat population genetic inference as an image classification problem, and feed an image representation of a population genetic alignment as input to a deep neural network that distinguishes among evolutionary models (i.e. introgression or no introgression). However, if we wish to investigate the full extent and fitness effects of introgression, merely identifying genomic regions in a population genetic alignment that harbor introgressed loci is insufficient; ideally we would be able to infer precisely which individuals have introgressed material and at which positions in the genome. Here we adapt a deep learning algorithm for semantic segmentation, the task of correctly identifying the type of object to which each individual pixel in an image belongs, to the task of identifying introgressed alleles. Our trained neural network is thus able to infer, for each individual in a two-population alignment, which of that individual's alleles were introgressed from the other population. We use simulated data to show that this approach is highly accurate, and that it can be readily extended to identify alleles that are introgressed from an unsampled "ghost" population, performing comparably to a supervised learning method tailored specifically to that task.
Finally, we apply this method to data from Drosophila, showing that it is able to accurately recover introgressed haplotypes from real data. This analysis reveals that introgressed alleles are typically confined to lower frequencies within genic regions, suggestive of purifying selection, but are found at much higher frequencies in a region previously shown to be affected by adaptive introgression. Our method's success in recovering introgressed haplotypes in challenging real-world scenarios underscores the utility of deep learning approaches for making richer evolutionary inferences from genomic data.
Affiliation(s)
- Dylan D. Ray
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Lex Flagel
  - Division of Data Science, Gencove Inc., New York, NY 11101, USA
  - Department of Plant and Microbial Biology, University of Minnesota, St Paul, MN 55108, USA
- Daniel R. Schrider
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
3. Wakeley J, Fan WT(L), Koch E, Sunyaev S. Recurrent mutation in the ancestry of a rare variant. Genetics 2023; 224:iyad049. [PMID: 36967220 PMCID: PMC10324944 DOI: 10.1093/genetics/iyad049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Revised: 01/30/2023] [Accepted: 03/08/2023] [Indexed: 03/28/2023]
Abstract
Recurrent mutation produces multiple copies of the same allele which may be co-segregating in a population. Yet, most analyses of allele-frequency or site-frequency spectra assume that all observed copies of an allele trace back to a single mutation. We develop a sampling theory for the number of latent mutations in the ancestry of a rare variant, specifically a variant observed in relatively small count in a large sample. Our results follow from the statistical independence of low-count mutations, which we show to hold for the standard neutral coalescent or diffusion model of population genetics as well as for more general coalescent trees. For populations of constant size, these counts are distributed like the number of alleles in the Ewens sampling formula. We develop a Poisson sampling model for populations of varying size and illustrate it using new results for site-frequency spectra in an exponentially growing population. We apply our model to a large data set of human SNPs and use it to explain dramatic differences in site-frequency spectra across the range of mutation rates in the human genome.
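The abstract's reference to the number of alleles in the Ewens sampling formula can be made concrete with a short sketch. Under that formula, the number of distinct alleles K in a sample of size n has probability P(K = k) = c(n, k) θ^k / [θ(θ+1)...(θ+n−1)], where c(n, k) is an unsigned Stirling number of the first kind. The function name `num_alleles_distribution` and the parameter values below are illustrative assumptions; how θ and n map onto the paper's latent-mutation setting is not specified here.

```python
from typing import Dict

def num_alleles_distribution(n: int, theta: float) -> Dict[int, float]:
    """Distribution of the number of alleles K under the Ewens sampling formula.

    Uses the recurrence c(m+1, k) = m * c(m, k) + c(m, k-1) for unsigned
    Stirling numbers of the first kind. Plain floats are used, so this
    sketch is only suitable for small n.
    """
    if n < 1 or theta <= 0:
        raise ValueError("need n >= 1 and theta > 0")
    stirling = [1.0]  # row m = 0: c(0, 0) = 1
    for m in range(n):
        nxt = [0.0] * (m + 2)
        for k, value in enumerate(stirling):
            nxt[k] += m * value      # contributes m * c(m, k) to c(m+1, k)
            nxt[k + 1] += value      # contributes c(m, k) to c(m+1, k+1)
        stirling = nxt
    rising = 1.0                     # theta * (theta + 1) * ... * (theta + n - 1)
    for i in range(n):
        rising *= theta + i
    return {k: stirling[k] * theta ** k / rising for k in range(1, n + 1)}

def expected_num_alleles(n: int, theta: float) -> float:
    """E[K] = sum over i = 0..n-1 of theta / (theta + i)."""
    return sum(theta / (theta + i) for i in range(n))

dist = num_alleles_distribution(3, 1.0)
mean_from_dist = sum(k * p for k, p in dist.items())
```

For n = 3 and θ = 1 the probabilities are 2/6, 3/6, and 1/6 for K = 1, 2, 3, and the mean agrees with the closed-form sum 1 + 1/2 + 1/3.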
Affiliation(s)
- John Wakeley
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
- Wai-Tong (Louis) Fan
  - Department of Mathematics, Indiana University, Bloomington, IN 47405, USA
  - Center of Mathematical Sciences and Applications, Harvard University, Cambridge, MA 02138, USA
- Evan Koch
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
  - Division of Genetics, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
- Shamil Sunyaev
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
  - Division of Genetics, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
4. Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, Tsambos G, Zhu S, Eldon B, Ellerman EC, Galloway JG, Gladstein AL, Gorjanc G, Guo B, Jeffery B, Kretzschmar WW, Lohse K, Matschiner M, Nelson D, Pope NS, Quinto-Cortés CD, Rodrigues MF, Saunack K, Sellinger T, Thornton K, van Kemenade H, Wohns AW, Wong Y, Gravel S, Kern AD, Koskela J, Ralph PL, Kelleher J. Efficient ancestry and mutation simulation with msprime 1.0. Genetics 2021; 220:6460344. [PMID: 34897427 PMCID: PMC9176297 DOI: 10.1093/genetics/iyab229] [Citation(s) in RCA: 91] [Impact Index Per Article: 30.3] [Received: 09/22/2021] [Accepted: 12/03/2021] [Indexed: 11/13/2022]
Abstract
Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime’s many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.
Affiliation(s)
- Franz Baumdicker
  - Cluster of Excellence "Controlling Microbes to Fight Infections", Mathematical and Computational Population Genetics, University of Tübingen, 72076 Tübingen, Germany
- Gertjan Bisschop
  - Institute of Evolutionary Biology, The University of Edinburgh, EH9 3FL, UK
- Daniel Goldstein
  - Khoury College of Computer Sciences, Northeastern University, MA 02115, USA
  - No affiliation
- Graham Gower
  - Lundbeck GeoGenetics Centre, Globe Institute, University of Copenhagen, 1350 Copenhagen K, Denmark
- Aaron P Ragsdale
  - Department of Integrative Biology, University of Wisconsin-Madison, WI 53706, USA
- Georgia Tsambos
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Victoria, 3010, Australia
- Sha Zhu
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK
- Bjarki Eldon
  - Leibniz Institute for Evolution and Biodiversity Science, Museum für Naturkunde Berlin, 10115, Germany
- Jared G Galloway
  - Institute of Ecology and Evolution, Department of Biology, University of Oregon, OR 97403-5289, USA
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA 98102, USA
- Ariella L Gladstein
  - Department of Genetics, University of North Carolina at Chapel Hill, NC 27599-7264, USA
  - Embark Veterinary, Inc., Boston, MA 02111, USA
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, EH25 9RG, UK
- Bing Guo
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, 21201, USA
- Ben Jeffery
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK
- Warren W Kretzschmar
  - Center for Hematology and Regenerative Medicine, Karolinska Institute, 141 83 Huddinge, Sweden
- Konrad Lohse
  - Institute of Evolutionary Biology, The University of Edinburgh, EH9 3FL, UK
- Dominic Nelson
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Nathaniel S Pope
  - Department of Entomology, Pennsylvania State University, PA 16802, USA
- Consuelo D Quinto-Cortés
  - National Laboratory of Genomics for Biodiversity (LANGEBIO), Unit of Advanced Genomics, CINVESTAV, Irapuato, Mexico
- Murillo F Rodrigues
  - Institute of Ecology and Evolution, Department of Biology, University of Oregon, OR 97403-5289, USA
- Kumar Saunack
  - IIT Bombay, Powai, Mumbai 400 076, Maharashtra, India
- Thibaut Sellinger
  - Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, 85354 Freising, Germany
- Kevin Thornton
  - Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA
- Anthony W Wohns
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK
- Simon Gravel
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Andrew D Kern
  - Institute of Ecology and Evolution, Department of Biology, University of Oregon, OR 97403-5289, USA
- Jere Koskela
  - Department of Statistics, University of Warwick, CV4 7AL, UK
- Peter L Ralph
  - Institute of Ecology and Evolution, Department of Biology, University of Oregon, OR 97403-5289, USA
  - Department of Mathematics, University of Oregon, OR 97403-5289, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK
5. Wakeley J. Developments in coalescent theory from single loci to chromosomes. Theor Popul Biol 2020; 133:56-64. [DOI: 10.1016/j.tpb.2020.02.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/19/2019] [Revised: 02/19/2020] [Accepted: 02/26/2020] [Indexed: 10/24/2022]
6. Okii D, Badji A, Odong T, Talwana H, Tukamuhabwa P, Male A, Mukankusi C, Gepts P. Recombination fraction and genetic linkage among key disease resistance genes (Co-42/Phg-2 and Co-5/"P.ult") in common bean. Afr J Biotechnol 2019; 18:AJB-18-29-819. [PMID: 33281892 PMCID: PMC7672375 DOI: 10.5897/ajb2019.16776] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2019] [Accepted: 06/26/2019] [Indexed: 10/31/2022]
Abstract
Anthracnose (Colletotrichum lindemuthianum), angular leaf spot (Pseudocercospora griseola) and Pythium root rot are important diseases affecting common bean production in the tropics. A promising strategy to manage these diseases consists of combining several resistance (R) genes into one cultivar. The aim of the study was to determine genetic linkage between the gene pairs Co-42/Phg-2 on bean chromosome Pv08 and Co-5/"P.ult" on chromosome Pv07, to increase the efficiency of dual selection of resistance genes for major bean diseases with molecular markers. The level of recombination was determined by tracking molecular markers for both BC3F6 and F2 generations. The recombination fraction r among gene pairs, the likelihood of linkage, L(r), and logarithm of odds (LOD) scores were computed using the statistical relationship of likelihood, which assumes a binomial distribution. The SCAR marker pair SAB3/PYAA19 for the gene pair Co-5/"P.ult" exhibited moderate linkage (r = 32 cM with a high LOD score of 9.2) for the BC3F6 population, but relatively stronger linkage for the F2 population (r = 21 cM with a high LOD score of 18.7). However, the linkage among the SCAR marker pair SH18/SN02 for the gene pair Co-42/Phg-2 was incomplete for the BC3F6 population (r = 47 cM with a low LOD score of 0.16) as well as the F2 population (r = 44 cM with a low LOD score of 0.7). Generally, the weak or incomplete genetic linkage between the marker pairs studied showed that all four genes mentioned earlier have to be tagged with a corresponding linked marker during selection. The approaches used in this study will contribute to two-locus linkage mapping techniques in segregating plant populations.
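The LOD computation mentioned in this abstract can be sketched under the standard textbook two-point setup, which may differ in detail from the authors' procedure for BC3F6 and F2 populations. Assuming k recombinants are observed among n informative offspring, the binomial likelihood is L(r) = r^k (1 − r)^(n − k), and the LOD score compares L(r) with the likelihood under free recombination, L(0.5). The function names and the counts used below are hypothetical and are not data from the study.

```python
import math
from typing import Optional

def log10_likelihood(r: float, k: int, n: int) -> float:
    """log10 of the binomial likelihood r**k * (1 - r)**(n - k).

    The binomial coefficient is omitted because it cancels in the LOD ratio.
    """
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    if not 0.0 <= r <= 1.0:
        raise ValueError("need 0 <= r <= 1")
    value = 0.0
    if k > 0:
        if r == 0.0:
            return float("-inf")
        value += k * math.log10(r)
    if n - k > 0:
        if r == 1.0:
            return float("-inf")
        value += (n - k) * math.log10(1.0 - r)
    return value

def lod_score(k: int, n: int, r: Optional[float] = None) -> float:
    """LOD = log10(L(r) / L(0.5)); r defaults to the estimate k / n."""
    if n <= 0:
        raise ValueError("need n > 0")
    if r is None:
        r = k / n  # maximum-likelihood estimate under the binomial model
    return log10_likelihood(r, k, n) - log10_likelihood(0.5, k, n)

# Hypothetical counts: 20 recombinants among 100 offspring.
example_lod = lod_score(20, 100)
```

With no recombinants in 10 offspring the score is 10·log10(2), about 3.01, and with exactly half recombinant it is 0, which matches the intuition that r near 0.5 carries no evidence of linkage.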
Affiliation(s)
- Dennis Okii
  - Department of Agricultural Production, Makerere University, P. O. Box 7062, Kampala, Uganda
- Arfang Badji
  - Department of Agricultural Production, Makerere University, P. O. Box 7062, Kampala, Uganda
- Thomas Odong
  - Department of Agricultural Production, Makerere University, P. O. Box 7062, Kampala, Uganda
- Herbert Talwana
  - Department of Agricultural Production, Makerere University, P. O. Box 7062, Kampala, Uganda
- Phinehas Tukamuhabwa
  - Department of Agricultural Production, Makerere University, P. O. Box 7062, Kampala, Uganda
- Allan Male
  - International Centre for Tropical Agriculture (CIAT)/Pan African Bean Research Alliance (PABRA), P. O. Box 6247, Kampala, Uganda
- Clare Mukankusi
  - International Centre for Tropical Agriculture (CIAT)/Pan African Bean Research Alliance (PABRA), P. O. Box 6247, Kampala, Uganda
- Paul Gepts
  - Section of Crop and Ecosystem Sciences, Department of Plant Sciences/MS1, University of California, 1 Shields Avenue, Davis, CA 95616-8780, USA
7. Kelleher J, Etheridge AM, McVean G. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLoS Comput Biol 2016; 12:e1004842. [PMID: 27145223 PMCID: PMC4856371 DOI: 10.1371/journal.pcbi.1004842] [Citation(s) in RCA: 328] [Impact Index Per Article: 41.0] [Received: 12/10/2015] [Accepted: 03/02/2016] [Indexed: 01/23/2023]
Abstract
A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present day coalescent simulations do not scale well, or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods to store genealogies consume a great deal of space, are slow to parse and do not take advantage of shared structure in correlated trees. We solve these problems by introducing sparse trees and coalescence records as the key units of genealogical analysis. Using these tools, exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods. We can also analyse the results orders of magnitude more quickly than with existing methods.

Our understanding of the distribution of genetic variation in natural populations has been driven by mathematical models of the underlying biological and demographic processes. A key strength of such coalescent models is that they enable efficient simulation of data we might see under a variety of evolutionary scenarios. However, current methods are not well suited to simulating genome-scale data sets on hundreds of thousands of samples, which is essential if we are to understand the data generated by population-scale sequencing projects. Similarly, processing the results of large simulations also presents researchers with a major challenge, as it can take many days just to read the data files. In this paper we solve these problems by introducing a new way to represent information about the ancestral process. This new representation leads to huge gains in simulation speed and storage efficiency so that large simulations complete in minutes and the output files can be processed in seconds.
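To give a feel for what a coalescence record might hold, here is a minimal sketch of the basic Kingman coalescent without recombination, in which every merger is stored as one record covering the whole sequence. The field layout (`left`, `right`, `node`, `children`, `time`) and the function name `simulate_kingman` are assumptions for illustration; this is not the paper's algorithm or the msprime API, and it omits recombination, which is where records spanning only part of the sequence would arise.

```python
import random
from typing import List, NamedTuple, Optional, Tuple

class CoalescenceRecord(NamedTuple):
    left: float                 # start of the genomic interval covered
    right: float                # end of the genomic interval covered
    node: int                   # id of the new ancestral node
    children: Tuple[int, int]   # ids of the two lineages that merged
    time: float                 # time of the merger, in coalescent units

def simulate_kingman(n: int, length: float = 1.0,
                     seed: Optional[int] = None) -> List[CoalescenceRecord]:
    """Simulate one genealogy for n samples and return its n - 1 records."""
    if n < 2:
        raise ValueError("need at least two samples")
    rng = random.Random(seed)
    lineages = list(range(n))   # samples are nodes 0 .. n-1
    next_node = n
    time = 0.0
    records = []
    while len(lineages) > 1:
        k = len(lineages)
        # With k lineages, each of the k*(k-1)/2 pairs merges at rate 1.
        time += rng.expovariate(k * (k - 1) / 2)
        a, b = rng.sample(lineages, 2)
        lineages.remove(a)
        lineages.remove(b)
        records.append(
            CoalescenceRecord(0.0, length, next_node, (min(a, b), max(a, b)), time)
        )
        lineages.append(next_node)
        next_node += 1
    return records

records = simulate_kingman(5, seed=1)
```

A sample of n lineages always yields n − 1 records, and the last record's node is the root of the tree.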
Affiliation(s)
- Jerome Kelleher
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
- Gilean McVean
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
  - Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, United Kingdom
8. The SMC' is a highly accurate approximation to the ancestral recombination graph. Genetics 2015; 200:343-55. [PMID: 25786855 DOI: 10.1534/genetics.114.173898] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 12/17/2014] [Accepted: 03/12/2015] [Indexed: 11/18/2022]
Abstract
Two sequentially Markov coalescent models (SMC and SMC') are available as tractable approximations to the ancestral recombination graph (ARG). We present a Markov process describing coalescence at two fixed points along a pair of sequences evolving under the SMC'. Using our Markov process, we derive a number of new quantities related to the pairwise SMC', thereby analytically quantifying for the first time the similarity between the SMC' and the ARG. We use our process to show that the joint distribution of pairwise coalescence times at recombination sites under the SMC' is the same as it is marginally under the ARG, which demonstrates that the SMC' is, in a particular well-defined, intuitive sense, the most appropriate first-order sequentially Markov approximation to the ARG. Finally, we use these results to show that population size estimates under the pairwise SMC are asymptotically biased, while under the pairwise SMC' they are approximately asymptotically unbiased.
9.
Affiliation(s)
- John Wakeley
  - Harvard University, 4096 Biological Laboratories, 16 Divinity Avenue, Cambridge, MA 02138, USA
10. HIV-1 continues to replicate and evolve in patients with natural control of HIV infection. J Virol 2010; 84:12971-81. [PMID: 20926564 DOI: 10.1128/jvi.00387-10] [Citation(s) in RCA: 88] [Impact Index Per Article: 6.3] [Indexed: 11/20/2022]
Abstract
Elucidating mechanisms leading to the natural control of HIV-1 infection is of great importance for vaccine design and for understanding viral pathogenesis. Rare HIV-1-infected individuals, termed HIV-1 controllers, have plasma HIV-1 RNA levels below the limit of detection by standard clinical assays (<50 to 75 copies/ml) without antiretroviral therapy. Although several recent studies have documented persistent low-grade viremia in HIV-1 controllers at a level not significantly different from that in HIV-1-infected individuals undergoing treatment with combination antiretroviral therapy (cART), it is unclear if plasma viruses are undergoing full cycles of replication in vivo or if the infection of new cells is completely blocked by host immune mechanisms. We studied a cohort of 21 HIV-1 controllers with a median level of viremia below 1 copy/ml, followed for a median of 11 years. Less than half of the cohort carried known protective HLA types (B*57/27). By isolating HIV-1 RNA from large volumes of plasma, we amplified single genome sequences of both pro-rt and env longitudinally. This study is the first to document that HIV-1 pro-rt and env evolve in this patient group, albeit at rates somewhat lower than in HIV-1 noncontrollers, in HLA B*57/27-positive, as well as HLA B*57/27-negative, individuals. Viral diversity and adaptive events associated with immune escape were found to be restricted in HIV-1 controllers, suggesting that replication occurs in the face of less overall immune selection.
11. RoyChoudhury A, Wakeley J. Sufficiency of the number of segregating sites in the limit under finite-sites mutation. Theor Popul Biol 2010; 78:118-22. [DOI: 10.1016/j.tpb.2010.05.003] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 01/16/2010] [Revised: 05/17/2010] [Accepted: 05/19/2010] [Indexed: 10/19/2022]
12. Verhoeven KJF, Casella G, McIntyre LM. Epistasis: obstacle or advantage for mapping complex traits? PLoS One 2010; 5:e12264. [PMID: 20865037 PMCID: PMC2928725 DOI: 10.1371/journal.pone.0012264] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 04/15/2009] [Accepted: 04/19/2010] [Indexed: 01/22/2023]
Abstract
Identification of genetic loci in complex traits has focused largely on one-dimensional genome scans to search for associations between single markers and the phenotype. There is mounting evidence that locus interactions, or epistasis, are a crucial component of the genetic architecture of biologically relevant traits. However, epistasis is often viewed as a nuisance factor that reduces power for locus detection. Counter to expectations, recent work shows that fitting full models, instead of testing marker main effect and interaction components separately, in exhaustive multi-locus genome scans can have higher power to detect loci when epistasis is present than single-locus scans, an improvement that comes despite a much larger multiple testing alpha-adjustment in such searches. We demonstrate, both theoretically and via simulation, that the expected power to detect loci when fitting full models is often larger when these loci act epistatically than when they act additively. Additionally, we show that the power for single-locus detection may be improved in cases of epistasis compared to the additive model. Our exploration of a two-step model selection procedure shows that identifying the true model is difficult. However, this difficulty is certainly not exacerbated by the presence of epistasis; on the contrary, in some cases the presence of epistasis can aid in model selection. The impact of allele frequencies on both power and model selection is dramatic.
Affiliation(s)
- Koen J. F. Verhoeven
  - Netherlands Institute of Ecology (NIOO-KNAW), Department of Terrestrial Ecology, Heteren, The Netherlands
- George Casella
  - Department of Statistics and Genetics Institute, University of Florida, Gainesville, Florida, United States of America
- Lauren M. McIntyre
  - Genetics Institute, Department of Molecular Genetics and Microbiology and Department of Statistics, University of Florida, Gainesville, Florida, United States of America
13.
Abstract
The coalescent with recombination is a very useful tool in molecular population genetics. Under this framework, genealogies often represent the evolution of the substitution unit, and because of this, the few coalescent algorithms implemented for the simulation of coding sequences force recombination to occur only between codons. However, it is clear that recombination is expected to occur most often within codons. Here we have developed an algorithm that can evolve coding sequences under an ancestral recombination graph that represents the genealogies at each nucleotide site, thereby allowing for intracodon recombination. The algorithm is a modification of Hudson's coalescent in which, in addition to keeping track of events occurring in the ancestral material that reaches the sample, we need to keep track of events occurring in ancestral material that does not reach the sample but that is produced by intracodon recombination. We are able to show that at typical substitution rates the number of nonsynonymous changes induced by intracodon recombination is small and that intracodon recombination does not generally result in inflated estimates of the overall nonsynonymous/synonymous substitution ratio (omega). On the other hand, recombination can bias the estimation of omega at particular codons, resulting in apparent rate variation among sites and in the spurious identification of positively selected sites. Importantly, in this case, allowing for variable synonymous rates across sites greatly reduces the false-positive rate and recovers statistical power. Finally, coalescent simulations with intracodon recombination could be used to better represent the evolution of nuclear coding genes or fast-evolving pathogens such as HIV-1. We have implemented this algorithm in a computer program called NetRecodon, freely available at http://darwin.uvigo.es.
14.
Abstract
With incomplete lineage sorting (ILS), the genealogy of closely related species differs along their genomes. The amount of ILS depends on population parameters such as the ancestral effective population sizes and the recombination rate, but also on the number of generations between speciation events. We use a hidden Markov model parameterized according to coalescent theory to infer the genealogy along a four-species genome alignment of closely related species and estimate population parameters. We analyze a basic, panmictic demographic model and study its properties using an extensive set of coalescent simulations. We assess the effect of the model assumptions and demonstrate that the Markov property provides a good approximation to the ancestral recombination graph. Using a too restricted set of possible genealogies, necessary to reduce the computational load, can bias parameter estimates. We propose a simple correction for this bias and suggest directions for future extensions of the model. We show that the patterns of ILS along a sequence alignment can be recovered efficiently together with the ancestral recombination rate. Finally, we introduce an extension of the basic model that allows for mutation rate heterogeneity and reanalyze human-chimpanzee-gorilla-orangutan alignments, using the new models. We expect that this framework will prove useful for population genomics and provide exciting insights into genome evolution.
15. Eriksson A, Mahjani B, Mehlig B. Sequential Markov coalescent algorithms for population models with demographic structure. Theor Popul Biol 2009; 76:84-91. [PMID: 19433100 DOI: 10.1016/j.tpb.2009.05.002] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 01/19/2009] [Revised: 05/04/2009] [Accepted: 05/04/2009] [Indexed: 10/24/2022]
Abstract
We analyse sequential Markov coalescent algorithms for populations with demographic structure: for a bottleneck model, a population-divergence model, and for a two-island model with migration. The sequential Markov coalescent method is an approximation to the coalescent suggested by McVean and Cardin, and by Marjoram and Wall. Within this algorithm we compute, for two individuals randomly sampled from the population, the correlation between times to the most recent common ancestor and the linkage probability corresponding to two different loci with recombination rate R between them. These quantities characterise the linkage between the two loci in question. We find that the sequential Markov coalescent method approximates the coalescent well in general in models with demographic structure. An exception is the case where individuals are sampled from populations separated by reduced gene flow. In this situation, the correlations may be significantly underestimated. We explain why this is the case.
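As a point of reference for the correlation described in this abstract, the exact coalescent with recombination has a classical closed form in the simplest case: two sequences from a single panmictic population of constant size. There the correlation between the times to the most recent common ancestor at two loci is (R + 18) / (R² + 13R + 18). The sketch below assumes R is the population-scaled rate 4Nr, which may not match the scaling used in the paper, and it does not cover the bottleneck, divergence, or island models that the paper analyses. The function name is hypothetical.

```python
def tmrca_correlation(R: float) -> float:
    """Correlation of pairwise coalescence times at two loci.

    Classical result for a sample of two sequences under the standard
    neutral coalescent with recombination, assuming R = 4*N*r and a
    panmictic population of constant size.
    """
    if R < 0:
        raise ValueError("R must be non-negative")
    return (R + 18.0) / (R * R + 13.0 * R + 18.0)

# Correlation decays from 1 (complete linkage) toward 0 (free recombination).
values = [tmrca_correlation(R) for R in (0.0, 1.0, 10.0, 100.0)]
```

Comparing an approximation's two-locus correlation against a baseline of this kind is one way to quantify how much linkage information the approximation preserves.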
Affiliation(s)
- A Eriksson
- Department of Physics, University of Gothenburg, SE-41296 Gothenburg, Sweden
16
Yawata M, Yawata N, Draghi M, Little AM, Partheniou F, Parham P. Roles for HLA and KIR polymorphisms in natural killer cell repertoire selection and modulation of effector function. J Exp Med 2006; 203:633-45. [PMID: 16533882 PMCID: PMC2118260 DOI: 10.1084/jem.20051884] [Citation(s) in RCA: 436] [Impact Index Per Article: 24.2] [Indexed: 01/28/2023]
Abstract
Interactions between killer cell immunoglobulin-like receptors (KIRs) and human leukocyte antigen (HLA) class I ligands regulate the development and response of human natural killer (NK) cells. Natural selection drove an allele-level group A KIR haplotype and the HLA-C1 ligand to unusually high frequency in the Japanese, who provide a particularly informative population for investigating the mechanisms by which KIR and HLA polymorphism influence NK cell repertoire and function. HLA class I ligands increase the frequencies of NK cells expressing cognate KIR, an effect modified by gene dose, KIR polymorphism, and the presence of other cognate ligand-receptor pairs. The five common Japanese KIR3DL1 allotypes have distinguishable inhibitory capacity, frequency of cellular expression, and level of cell surface expression as measured by antibody binding. Although KIR haplotypes encoding 3DL1*001 or 3DL1*005, the strongest inhibitors, have no activating KIR, the dominant haplotype encodes a moderate inhibitor, 3DL1*01502, plus functional forms of the activating receptors 2DL4 and 2DS4. In the population, certain combinations of KIR and HLA class I ligand are overrepresented or underrepresented in women, but not men, and thus influence female fitness and survival. These findings show how KIR-HLA interactions shape the genetic and phenotypic KIR repertoires for both individual humans and the population.
Affiliation(s)
- Makoto Yawata
- Department of Structural Biology, School of Medicine, Stanford University, Stanford, CA 94305, USA, and Department of Haematology, The Royal Free Hospital, London, UK.
17
Abstract
Correlation of gene histories in the human genome determines the patterns of genetic variation (haplotype structure) and is crucial to understanding genetic factors in common diseases. We derive closed analytical expressions for the correlation of gene histories in established demographic models for genetic evolution and show how to extend the analysis to more realistic (but more complicated) models of demographic structure. We identify two contributions to the correlation of gene histories in divergent populations: linkage disequilibrium, and differences in the demographic history of individuals in the sample. These two factors contribute to correlations at different length scales: the former at small, and the latter at large scales. We show that recent mixing events in divergent populations limit the range of correlations and compare our findings to empirical results on the correlation of gene histories in the human genome.
Affiliation(s)
- A Eriksson
- Department of Physical Resource Theory, Chalmers and Göteborg University, Sweden
18
Zhang K, Sun F. Assessing the power of tag SNPs in the mapping of quantitative trait loci (QTL) with extremal and random samples. BMC Genet 2005; 6:51. [PMID: 16236175 PMCID: PMC1274312 DOI: 10.1186/1471-2156-6-51] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 06/08/2005] [Accepted: 10/19/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Recent studies have indicated that the human genome could be divided into regions with low haplotype diversity interspersed with regions of high haplotype diversity. In regions of low haplotype diversity, a small fraction of SNPs (tag SNPs) are sufficient to account for most of the haplotype diversity of the human genome. These tag SNPs can be extremely useful for testing the association of a marker locus with a qualitative or quantitative trait locus in that it may not be necessary to genotype all the SNPs. When tag SNPs are used to reduce the genotyping effort in association studies, it is important to know how much power is lost. It is also important to know how much power is gained when tag SNPs instead of the same number of randomly chosen SNPs are used. RESULTS We design a simulation study to tackle these problems for a variety of quantitative association tests using either case-parent samples or unrelated population samples. First, the samples are generated based on the quantitative trait model with the assumption of either an extremal sampling scheme or a random sampling scheme. Second, a small number of samples are selected to determine the haplotype blocks and the tag SNPs. Third, the statistical power of the tests is evaluated using four kinds of data: (1) all the SNPs and the corresponding haplotypes, (2) the tag SNPs and the corresponding haplotypes, (3) the same number of evenly spaced SNPs with minor allele frequency greater than a threshold and the corresponding haplotypes, (4) the same number of randomly chosen SNPs and their corresponding haplotypes. CONCLUSION Our results suggest that in most situations genotyping efforts can be significantly reduced by using tag SNPs for mapping the QTL in association studies without much loss of power, which is consistent with previous studies on association mapping of qualitative traits. 
For all situations considered, two-locus haplotype analysis using tag SNPs is more powerful than the same analysis using an equal number of randomly selected SNPs, but the size of this power difference depends on the sampling scheme and the population history.
Affiliation(s)
- Kui Zhang
- Section on Statistical Genetics, Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Fengzhu Sun
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
19
Takebayashi N, Newbigin E, Uyenoyama MK. Maximum-likelihood estimation of rates of recombination within mating-type regions. Genetics 2005; 167:2097-109. [PMID: 15342543 PMCID: PMC1471000 DOI: 10.1534/genetics.103.021535] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022] Open
Abstract
Features common to many mating-type regions include recombination suppression over large genomic tracts and cosegregation of genes of various functions, not necessarily related to reproduction. Model systems for homomorphic self-incompatibility (SI) in flowering plants share these characteristics. We introduce a method for the exact computation of the joint probability of numbers of neutral mutations segregating at the determinant of mating type and at a linked marker locus. The underlying Markov model incorporates strong balancing selection into a two-locus coalescent. We apply the method to obtain a maximum-likelihood estimate of the rate of recombination between a marker locus, 48A, and S-RNase, the determinant of SI specificity in pistils of Nicotiana alata. Even though the sampled haplotypes show complete allelic linkage disequilibrium and recombinants have never been detected, a highly significant deficiency of synonymous substitutions at 48A compared to S-RNase suggests a history of recombination. Our maximum-likelihood estimate indicates a rate of recombination of perhaps 3 orders of magnitude greater than the rate of synonymous mutation. This approach may facilitate the construction of genetic maps of regions tightly linked to targets of strong balancing selection.
Affiliation(s)
- Naoki Takebayashi
- Department of Biology, Duke University, Durham, North Carolina 27708-0338, USA
20
Abstract
Recent developments in population genetics are reviewed and placed in a historical context. Current and future challenges, both in computational methodology and in analytical theory, are to develop models and techniques to extract the most information possible from multilocus DNA datasets. As an example of the theoretical issues, five limiting forms of the island model of population subdivision with migration are presented in a unified framework. These approximations illustrate the interplay between migration and drift in structuring gene genealogies, and some of them make connections between the fairly complicated island-model genealogical process and the much simpler, unstructured neutral coalescent process which underlies most inferential techniques in population genetics.
Affiliation(s)
- J Wakeley
- Department of Organismic and Evolutionary Biology, Harvard University, 2102 Biological Laboratories, 16 Divinity Ave., Cambridge, MA 02138, USA.
21
Verhoeven KJF, Simonsen KL. Genomic haplotype blocks may not accurately reflect spatial variation in historic recombination intensity. Mol Biol Evol 2004; 22:735-40. [PMID: 15563716 DOI: 10.1093/molbev/msi058] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 12/25/2022] Open
Abstract
Recently, genomic data have revealed a "block-like" structure of haplotype diversity on human chromosomes. This structure is anticipated to facilitate gene mapping studies, because strong associations among loci within a block may allow haplotype variation to be tagged with a limited number of markers. But its usefulness to mapping efforts depends on the consistency of the block structure within and among populations, which in turn depends on how the block structure arises. Recombination hot spots are generally thought to underlie the block structure, but haplotype blocks can also develop stochastically under random recombination, in which case the block structure will show limited consistency among populations. Using coalescent models, which we upscaled to simulate the evolution of haplotypes with many markers at fixed distances, we show that the relationship between block boundaries and historic recombination intensity may be surprisingly weak. The majority of historic recombinations do not leave a footprint in present-day linkage disequilibrium patterns, and the block structure is sensitive to factors that affect the timing of recombination relative to marker mutation events in the genealogy, such as marker frequency bias and historic population size changes. Our results give insight into the potential of stochastic events to affect haplotype block structure, which can limit the usefulness of the block structure to mapping studies.
Affiliation(s)
- Koen J F Verhoeven
- Computational Genomics and Department of Agronomy, Purdue University, USA.
22
Eriksson A, Mehlig B. On the effect of fluctuating recombination rates on the decorrelation of gene histories in the human genome. Genetics 2004; 169:1175-8. [PMID: 15520270 PMCID: PMC1449132 DOI: 10.1534/genetics.103.018002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open
Abstract
We show how to incorporate fluctuations of the recombination rate along the chromosome into standard gene-genealogical models for the decorrelation of gene histories. This enables us to determine how small-scale fluctuations (Poissonian hot-spot model) and large-scale variations (Kong et al. 2002) of the recombination rate influence this decorrelation. We find that the empirically determined large-scale variations of the recombination rate give rise to a significantly slower decay of correlations compared to the standard, unstructured gene-genealogical model assuming constant recombination rate. A model with long-range recombination-rate variations and with demographic structure (divergent population) is found to be consistent with the empirically observed slow decorrelation of gene histories. Conversely, we show that small-scale recombination-rate fluctuations do not alter the large-scale decorrelation of gene histories.
Affiliation(s)
- A Eriksson
- Physics and Engineering Physics, Gothenburg University/Chalmers, 41296 Gothenburg, Sweden
23
Zhang K, Qin ZS, Liu JS, Chen T, Waterman MS, Sun F. Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies. Genome Res 2004; 14:908-16. [PMID: 15078859 PMCID: PMC479119 DOI: 10.1101/gr.1837404] [Citation(s) in RCA: 125] [Impact Index Per Article: 6.3] [Indexed: 11/24/2022]
Abstract
Recent studies have revealed that linkage disequilibrium (LD) patterns vary across the human genome with some regions of high LD interspersed by regions of low LD. A small fraction of SNPs (tag SNPs) is sufficient to capture most of the haplotype structure of the human genome. In this paper, we develop a method to partition haplotypes into blocks and to identify tag SNPs based on genotype data by combining a dynamic programming algorithm for haplotype block partitioning and tag SNP selection based on haplotype data with a variation of the expectation maximization (EM) algorithm for haplotype inference. We assess the effects of using either haplotype or genotype data in haplotype block identification and tag SNP selection as a function of several factors, including sample size, density or number of SNPs studied, allele frequencies, fraction of missing data, and genotyping error rate, using extensive simulations. We find that a modest number of haplotype or genotype samples will result in consistent block partitions and tag SNP selection. The power of association studies based on tag SNPs using genotype data is similar to that using haplotype data.
Affiliation(s)
- Kui Zhang
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, California 90089-1113, USA
24
Uyenoyama MK, Takebayashi N. A simple method for computing exact probabilities of mutation numbers. Theor Popul Biol 2004; 65:271-84. [PMID: 15066423 DOI: 10.1016/j.tpb.2003.12.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 06/03/2003] [Indexed: 10/26/2022]
Abstract
We describe a method for the recursive computation of exact probability distributions for the number of neutral mutations segregating in samples of arbitrary size and configuration. Construction of the recursions requires only characterization of evolutionary changes as a Markov process and determination of one-step transition matrices. We address the pattern of nucleotide diversity at a neutral marker locus linked to a determinant of mating type. Under a reformulation of parameters, the method also applies directly to metapopulation models with island migration among demes. Characterization of complete probability distributions facilitates parameter estimation and hypothesis testing by likelihood- as well as moment-based approaches.
Affiliation(s)
- Marcy K Uyenoyama
- Department of Biology, Box 90338, 107 Bio. Sci. Building, Duke University, Durham, NC 27708-0338, USA,
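The recursive idea described in the abstract above can be illustrated with a much simpler textbook case. The sketch below is not the authors' method: it assumes the standard neutral coalescent with infinite-sites mutation (no mating-type locus, no population structure), where the one-step transition from a state with m lineages is either a mutation, with probability theta / (m - 1 + theta), or a coalescence. The function name `segregating_sites_pmf` and the truncation parameter `k_max` are hypothetical.

```python
def segregating_sites_pmf(n, theta, k_max):
    """Return [P(S = 0), ..., P(S = k_max)] for a sample of size n.

    Assumption (illustration only): standard neutral coalescent with
    infinite-sites mutation and population-scaled mutation rate theta.
    With m lineages, the next event is a mutation with probability
    theta / (m - 1 + theta), otherwise a coalescence to m - 1 lineages.
    """
    # Base case: a single lineage carries no segregating mutations.
    prev = [1.0] + [0.0] * k_max
    for m in range(2, n + 1):
        p_mut = theta / (m - 1 + theta)
        p_coal = 1.0 - p_mut
        cur = [0.0] * (k_max + 1)
        for k in range(k_max + 1):
            # One-step recursion:
            # P_m(k) = p_mut * P_m(k - 1) + p_coal * P_{m-1}(k)
            cur[k] = p_coal * prev[k]
            if k > 0:
                cur[k] += p_mut * cur[k - 1]
        prev = cur
    return prev


if __name__ == "__main__":
    pmf = segregating_sites_pmf(n=5, theta=2.0, k_max=200)
    mean = sum(k * p for k, p in enumerate(pmf))
    print(round(sum(pmf), 6), round(mean, 4))
```

For a sample of size two the recursion reduces to a geometric distribution, which gives a quick sanity check on the implementation.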
25
Abstract
Complex diseases are generally caused by intricate interactions of multiple genes and environmental factors. Most available linkage and association methods are developed to identify individual susceptibility genes assuming a simple disease model blind to any possible gene-gene and gene-environment interactions. We used a set association method that uses single-nucleotide polymorphism markers to locate genetic variation responsible for complex diseases in which multiple genes are involved. Here we extended the set association method from bi-allelic to multiallelic markers. In addition, we studied the type I error rates and power for both approaches using simulations based on the coalescent process. Both bi-allelic set association (BSA) and multiallelic set association (MSA) tests have the correct type I error rates. In addition, BSA and MSA can have more power than individual marker analysis when multiple genes are involved in a complex disease. We applied the MSA approach to the simulated data sets from Genetic Analysis Workshop 13. High cholesterol level was used as the definitive phenotype for a disease. MSA failed to detect markers with significant linkage disequilibrium with genes responsible for cholesterol level. This is due to the wide spacing between the markers and the lack of association between the marker loci and the simulated phenotype.
Affiliation(s)
- Sung Kim
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, California, USA
- Kui Zhang
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, Alabama, USA
- Fengzhu Sun
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, California, USA
26
Abstract
We consider (approximate) likelihood methods for estimating the population-scaled recombination rate from population genetic data. We show that the dependence between the data from two regions of a chromosome decays inversely with the amount of recombination between the two regions. We use this result to show that the maximum likelihood estimator (mle) for the recombination rate, based on the composite likelihood of Fearnhead and Donnelly, is consistent. We also consider inference based on the pairwise likelihood of Hudson. We consider two approximations to this likelihood, and prove that the mle based on one of these approximations is consistent, while the mle based on the other approximation (which is used by McVean, Awadalla and Fearnhead) is not.
Affiliation(s)
- Paul Fearnhead
- Department of Mathematics and Statistics, Lancaster University, Fylde College, B Floor, Room 4b, LA1 4YF, Lancaster, UK.
27
Wakeley J, Lessard S. Theory of the effects of population structure and sampling on patterns of linkage disequilibrium applied to genomic data from humans. Genetics 2003; 164:1043-53. [PMID: 12871914 PMCID: PMC1462626 DOI: 10.1093/genetics/164.3.1043] [Citation(s) in RCA: 38] [Impact Index Per Article: 1.8] [Indexed: 11/12/2022] Open
Abstract
We develop predictions for the correlation of heterozygosity and for linkage disequilibrium between two loci using a simple model of population structure that includes migration among local populations, or demes. We compare the results for a sample of size two from the same deme (a single-deme sample) to those for a sample of size two from two different demes (a scattered sample). The correlation in heterozygosity for a scattered sample is surprisingly insensitive to both the migration rate and the number of demes. In contrast, the correlation in heterozygosity for a single-deme sample is sensitive to both, and the effect of an increase in the number of demes is qualitatively similar to that of a decrease in the migration rate: both increase the correlation in heterozygosity. These same conclusions hold for a commonly used measure of linkage disequilibrium (r²). We compare the predictions of the theory to genomic data from humans and show that subdivision might account for a substantial portion of the genetic associations observed within the human genome, even though migration rates among local populations of humans are relatively large. Because correlations due to subdivision rather than to physical linkage can be large even in a single-deme sample, then if long-term migration has been important in shaping patterns of human polymorphism, the common practice of disease mapping using linkage disequilibrium in "isolated" local populations may be subject to error.
Affiliation(s)
- John Wakeley
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts 02138, USA.
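The linkage disequilibrium measure r² mentioned in the abstract above has a standard sample definition: r² = D² / (p_A (1 - p_A) p_B (1 - p_B)), with D = p_AB - p_A p_B. As a minimal sketch, assuming phased biallelic haplotypes coded as 0/1 tuples (the function name `r_squared` and the data layout are illustrative choices, not taken from the paper), it can be computed as follows.

```python
def r_squared(haplotypes, i, j):
    """Sample r^2 between biallelic sites i and j.

    haplotypes: sequence of equal-length tuples of 0/1 alleles
    (phased data; this coding is an assumption of the sketch).
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at site i
    p_b = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at site j
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either site is monomorphic")
    return d * d / denom


if __name__ == "__main__":
    perfectly_linked = [(0, 0), (0, 0), (1, 1), (1, 1)]
    independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(r_squared(perfectly_linked, 0, 1), r_squared(independent, 0, 1))
```

Pooling haplotypes from differentiated demes into one sample can raise this statistic even for unlinked sites, which is the kind of structure-driven association the abstract discusses.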
28
Abstract
Recent experimental findings suggest that the assumption of a homogeneous recombination rate along the human genome is too naive. These findings point to block-structured recombination rates; certain regions (called hotspots) are more prone than other regions to recombination. In this report a coalescent model incorporating hotspot or block-structured recombination is developed and investigated analytically as well as by simulation. Our main results can be summarized as follows: (1) The expected number of recombination events is much lower in a model with pure hotspot recombination than in a model with pure homogeneous recombination, (2) hotspots give rise to large variation in recombination rates along the genome as well as in the number of historical recombination events, and (3) the size of a (nonrecombining) block in the hotspot model is likely to be overestimated grossly when estimated from SNP data. The results are discussed with reference to the current debate about block-structured recombination and, in addition, the results are compared to genome-wide variation in recombination rates. A number of new analytical results about the model are derived.
Affiliation(s)
- Carsten Wiuf
- Variagenics, Cambridge, Massachusetts 02139, USA.
29
Broughton RE, Harrison RG. Nuclear gene genealogies reveal historical, demographic and selective factors associated with speciation in field crickets. Genetics 2003; 163:1389-401. [PMID: 12702683 PMCID: PMC1462531 DOI: 10.1093/genetics/163.4.1389] [Citation(s) in RCA: 81] [Impact Index Per Article: 3.9] [Indexed: 11/12/2022] Open
Abstract
Population genetics theory predicts that genetic drift should eliminate shared polymorphism, leading to monophyly or exclusivity of populations, when the elapsed time between lineage-splitting events is large relative to effective population size. We examined patterns of nucleotide variation in introns at four nuclear loci to relate processes affecting the history of genes to patterns of divergence among natural populations and species. Ancestral polymorphisms were shared among three recognized species, Gryllus firmus, G. pennsylvanicus, and G. ovisopis, and genealogical patterns suggest that successive speciation events occurred recently and rapidly relative to effective population size. High levels of shared polymorphism among these morphologically, behaviorally, and ecologically distinct species indicate that only a small fraction of the genome needs to become differentiated for speciation to occur. Among the four nuclear gene loci there was a 10-fold range in nucleotide diversity, and patterns of polymorphism and divergence suggest that natural selection has acted to maintain or eliminate variation at some loci. While nuclear gene genealogies may have limited applications in phylogeography or other approaches dependent on population monophyly, they provide important insights into the historical, demographic, and selective forces that shape speciation.
Affiliation(s)
- Richard E Broughton
- Department of Ecology and Evolutionary Biology, Cornell University, Ithaca, New York 14853, USA.
30
Zhang K, Calabrese P, Nordborg M, Sun F. Haplotype block structure and its applications to association studies: power and study designs. Am J Hum Genet 2002; 71:1386-94. [PMID: 12439824 PMCID: PMC378580 DOI: 10.1086/344780] [Citation(s) in RCA: 207] [Impact Index Per Article: 9.4] [Received: 08/14/2002] [Accepted: 09/16/2002] [Indexed: 11/04/2022] Open
Abstract
Recent studies have shown that the human genome has a haplotype block structure, such that it can be divided into discrete blocks of limited haplotype diversity. In each block, a small fraction of single-nucleotide polymorphisms (SNPs), referred to as "tag SNPs," can be used to distinguish a large fraction of the haplotypes. These tag SNPs can potentially be extremely useful for association studies, in that it may not be necessary to genotype all SNPs; however, this depends on how much power is lost. Here we develop a simulation study to quantitatively assess the power loss for a variety of study designs, including case-control designs and case-parental control designs. First, a number of data sets containing case-parental or case-control samples are generated on the basis of a disease model. Second, a small fraction of case and control individuals in each data set are genotyped at all the loci, and a dynamic programming algorithm is used to determine the haplotype blocks and the tag SNPs based on the genotypes of the sampled individuals. Third, the statistical power of tests was evaluated on the basis of three kinds of data: (1) all of the SNPs and the corresponding haplotypes, (2) the tag SNPs and the corresponding haplotypes, and (3) the same number of randomly chosen SNPs as the number of tag SNPs and the corresponding haplotypes. We study the power of different association tests with a variety of disease models and block-partitioning criteria. Our study indicates that the genotyping efforts can be significantly reduced by the tag SNPs, without much loss of power. Depending on the specific haplotype block-partitioning algorithm and the disease model, when the identified tag SNPs are only 25% of all the SNPs, the power is reduced by only 4%, on average, compared with a power loss of approximately 12% when the same number of randomly chosen SNPs is used in a two-locus haplotype analysis. 
When the identified tag SNPs are approximately 14% of all the SNPs, the power is reduced by approximately 9%, compared with a power loss of approximately 21% when the same number of randomly chosen SNPs is used in a two-locus haplotype analysis. Our study also indicates that haplotype-based analysis can be much more powerful than marker-by-marker analysis.
Affiliation(s)
- Kui Zhang
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles 90089, USA
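To make the notion of tag SNPs in the abstract above concrete, here is a minimal brute-force sketch for a single block. It is not the dynamic programming algorithm the authors use, and it requires the tags to distinguish every distinct haplotype rather than "a large fraction" of them. Haplotypes are assumed to be equal-length strings of '0' and '1', and the function name `minimal_tag_snps` is hypothetical.

```python
from itertools import combinations


def minimal_tag_snps(haplotypes):
    """Return a smallest tuple of site indices that distinguishes
    all distinct haplotypes in one block (exhaustive search).

    Illustration only: exponential in the number of sites, so it
    is suitable for small blocks, not genome-scale data.
    """
    distinct = set(haplotypes)
    n_sites = len(haplotypes[0])
    for size in range(n_sites + 1):
        for subset in combinations(range(n_sites), size):
            # Project every distinct haplotype onto the candidate tag sites.
            projected = {tuple(h[s] for s in subset) for h in distinct}
            if len(projected) == len(distinct):
                return subset
    # Unreachable: the full set of sites always separates distinct haplotypes.
    return tuple(range(n_sites))


if __name__ == "__main__":
    block = ["0000", "0011", "1100", "1111", "0011"]
    print(minimal_tag_snps(block))
```

In this toy block, sites 0 and 1 always agree, as do sites 2 and 3, so one site from each pair is enough to tag all four haplotypes.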
31
Abstract
The degree of association between alleles at different loci, or linkage disequilibrium, is widely used to infer details of evolutionary processes. Here I explore how associations between alleles relate to properties of the underlying genealogy of sequences. Under the neutral, infinite-sites assumption I show that there is a direct correspondence between the covariance in coalescence times at different parts of the genome and the degree of linkage disequilibrium. These covariances can be calculated exactly under the standard neutral model and by Monte Carlo simulation under different demographic models. I show that the effects of population growth, population bottlenecks, and population structure on linkage disequilibrium can be described through their effects on the covariance in coalescence times.
32
Fearnhead P, Donnelly P. Approximate likelihood methods for estimating local recombination rates. J R Stat Soc Series B Stat Methodol 2002. [DOI: 10.1111/1467-9868.00355] [Citation(s) in RCA: 64] [Impact Index Per Article: 2.9] [Indexed: 11/30/2022]
33
Discussion on the meeting on 'Statistical modelling and analysis of genetic data'. J R Stat Soc Series B Stat Methodol 2002. [DOI: 10.1111/1467-9868.00359] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
34
Reich DE, Schaffner SF, Daly MJ, McVean G, Mullikin JC, Higgins JM, Richter DJ, Lander ES, Altshuler D. Human genome sequence variation and the influence of gene history, mutation and recombination. Nat Genet 2002; 32:135-42. [PMID: 12161752 DOI: 10.1038/ng947] [Citation(s) in RCA: 235] [Impact Index Per Article: 10.7] [Indexed: 11/09/2022]
Abstract
Variation in the human genome sequence is key to understanding susceptibility to disease in modern populations and the history of ancestral populations. Unlocking this information requires knowledge of the patterns and underlying causes of human sequence diversity. By applying a new population-genetic framework to two genome-wide polymorphism surveys, we find that the human genome contains sizeable regions (stretching over tens of thousands of base pairs) that have intrinsically high and low rates of sequence variation. We show that the primary determinant of these patterns is shared genealogical history. Only a fraction of the variation (at most 25%) is due to the local mutation rate. By measuring the average distance over which genealogical histories are typically preserved, these data provide the first genome-wide estimate of the average extent of correlation among variants (linkage disequilibrium). The results are best explained by extreme variability in the recombination rate at a fine scale, and provide the first empirical evidence that such recombination 'hot spots' are a general feature of the human genome and have a principal role in shaping genetic variation in the human population.
Affiliation(s)
- David E Reich
- Whitehead Institute/MIT Center for Genome Research, One Kendall Square, Cambridge, Massachusetts 02139, USA
35
Wakeley J, Nielsen R, Liu-Cordero SN, Ardlie K. The discovery of single-nucleotide polymorphisms--and inferences about human demographic history. Am J Hum Genet 2001; 69:1332-47. [PMID: 11704929 PMCID: PMC1235544 DOI: 10.1086/324521] [Citation(s) in RCA: 133] [Impact Index Per Article: 5.8] [Received: 08/16/2001] [Accepted: 09/24/2001] [Indexed: 11/03/2022] Open
Abstract
A method of historical inference that accounts for ascertainment bias is developed and applied to single-nucleotide polymorphism (SNP) data in humans. The data consist of 84 short fragments of the genome that were selected, from three recent SNP surveys, to contain at least two polymorphisms in their respective ascertainment samples and that were then fully resequenced in 47 globally distributed individuals. Ascertainment bias is the deviation, from what would be observed in a random sample, caused either by discovery of polymorphisms in small samples or by locus selection based on levels or patterns of polymorphism. The three SNP surveys from which the present data were derived differ both in their protocols for ascertainment and in the size of the samples used for discovery. We implemented a Monte Carlo maximum-likelihood method to fit a subdivided-population model that includes a possible change in effective size at some time in the past. Incorrectly assuming that ascertainment bias does not exist causes errors in inference, affecting both estimates of migration rates and historical changes in size. Migration rates are overestimated when ascertainment bias is ignored. However, the direction of error in inferences about changes in effective population size (whether the population is inferred to be shrinking or growing) depends on whether either the numbers of SNPs per fragment or the SNP-allele frequencies are analyzed. We use the abbreviation "SDL," for "SNP-discovered locus," in recognition of the genomic-discovery context of SNPs. When ascertainment bias is modeled fully, both the number of SNPs per SDL and their allele frequencies support a scenario of growth in effective size in the context of a subdivided population. If subdivision is ignored, however, the hypothesis of constant effective population size cannot be rejected. 
An important conclusion of this work is that, in demographic or other studies, SNP data are useful only to the extent that their ascertainment can be modeled.
Affiliation(s)
- J Wakeley
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.
|
36
|
Posada D, Crandall KA. Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proc Natl Acad Sci U S A 2001; 98:13757-62. [PMID: 11717435 PMCID: PMC61114 DOI: 10.1073/pnas.241370698] [Citation(s) in RCA: 1074] [Impact Index Per Article: 46.7] [Received: 07/18/2001] [Indexed: 11/18/2022] Open
Abstract
Recombination is a key evolutionary process that shapes the architecture of genomes and the genetic structure of populations. Although many statistical methods are available for the detection of recombination from DNA sequences, their absolute and relative performance is still unknown. Here we evaluated the performance of 14 different recombination detection algorithms. We used the coalescent with recombination to simulate DNA sequences with different levels of recombination, genetic diversity, and rate variation among sites. Recombination detection methods were applied to these data sets, and whether they detected or not recombination was recorded. Different recombination methods showed distinct performance depending on the amount of recombination, genetic diversity, and rate variation among sites. The model of nucleotide substitution under which the data were generated did not seem to have a significant effect. Most methods increase power with more sequence divergence. In general, recombination detection methods seem to capture the presence of recombination, but they are not very powerful. Methods that use substitution patterns or incompatibility among sites were more powerful than methods based on phylogenetic incongruence. Most methods do not seem to infer more false positives than expected by chance. Especially depending on the amount of diversity in the data, different methods could be used to attain maximum power while minimizing false positives. Results shown here will provide some guidance in the selection of the most appropriate method/s for the analysis of the particular data at hand.
Affiliation(s)
- D Posada
- Department of Zoology, Brigham Young University, Provo, UT 84602, USA.
|
37
|
Abstract
We introduce a new method for estimating recombination rates from population genetic data. The method uses a computationally intensive statistical procedure (importance sampling) to calculate the likelihood under a coalescent-based model. Detailed comparisons of the new algorithm with two existing methods (the importance sampling method of Griffiths and Marjoram and the MCMC method of Kuhner and colleagues) show it to be substantially more efficient. (The improvement over the existing importance sampling scheme is typically by four orders of magnitude.) The existing approaches not infrequently led to misleading results on the problems we investigated. We also performed a simulation study to look at the properties of the maximum-likelihood estimator of the recombination rate and its robustness to misspecification of the demographic model.
Affiliation(s)
- P Fearnhead
- Department of Statistics, University of Oxford, Oxford, OX1 3TG, United Kingdom
|
38
|
McIntyre LM, Martin ER, Simonsen KL, Kaplan NL. Circumventing multiple testing: a multilocus Monte Carlo approach to testing for association. Genet Epidemiol 2000; 19:18-29. [PMID: 10861894 DOI: 10.1002/1098-2272(200007)19:1<18::aid-gepi2>3.0.co;2-y] [Citation(s) in RCA: 81] [Impact Index Per Article: 3.4] [Indexed: 01/01/2023]
Abstract
Advances in marker technology have made a dense marker map a reality. If each marker is considered separately, and separate tests for association with a disease gene are performed, then multiple testing becomes an issue. A common solution uses a Bonferroni correction to account for multiple tests performed. However, with dense marker maps, neighboring markers are tightly linked and may have associated alleles; thus tests at nearby marker loci may not be independent. When alleles at different marker loci are associated, the Bonferroni correction may lead to a conservative test, and hence a power loss. As an alternative, for tests of association that use family data, we propose a Monte Carlo procedure that provides a global assessment of significance. We examine the case of tightly linked markers with varying amounts of association between them. Using computer simulations, we study a family-based test for association (the transmission/disequilibrium test), and compare its power when either the Bonferroni or Monte Carlo procedure is used to determine significance. Our results show that when the alleles at different marker loci are not associated, using either procedure results in tests with similar power. However, when alleles at linked markers are associated, the test using the Monte Carlo procedure is more powerful than the test using the Bonferroni procedure. This proposed Monte Carlo procedure can be applied whenever it is suspected that markers examined have high amounts of association, or as a general approach to ensure appropriate significance levels and optimal power.
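The idea of a single global significance level for several linked markers can be illustrated with a small sketch. This is not the authors' procedure, only a generic max-statistic Monte Carlo test built on the transmission/disequilibrium test (TDT). It assumes each parent's transmissions are coded per marker as +1 (allele 1 transmitted by a heterozygous parent), -1 (allele 2 transmitted), or 0 (uninformative), and that under the null hypothesis the transmitted and untransmitted haplotypes of a parent can be swapped jointly across all markers, which preserves the association between markers. The function names and the data encoding are hypothetical.

```python
import random


def tdt_chi2(b, c):
    """TDT statistic (b - c)^2 / (b + c) for b vs. c transmissions."""
    return 0.0 if b + c == 0 else (b - c) ** 2 / (b + c)


def max_tdt(transmissions, flips):
    """Largest per-marker TDT statistic after applying one sign flip per parent."""
    n_markers = len(transmissions[0])
    best = 0.0
    for m in range(n_markers):
        b = c = 0
        for parent, flip in zip(transmissions, flips):
            value = parent[m] * flip
            if value > 0:
                b += 1
            elif value < 0:
                c += 1
        best = max(best, tdt_chi2(b, c))
    return best


def monte_carlo_global_p(transmissions, n_reps=1000, seed=0):
    """Global p-value for the maximum TDT statistic over all markers.

    Each replicate flips every parent's transmissions with probability 1/2,
    jointly across markers, so inter-marker association is retained.
    """
    rng = random.Random(seed)
    observed = max_tdt(transmissions, [1] * len(transmissions))
    exceed = 0
    for _ in range(n_reps):
        flips = [rng.choice((1, -1)) for _ in transmissions]
        if max_tdt(transmissions, flips) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_reps + 1)
```

Because the null distribution is built for the maximum statistic directly, no per-marker correction such as Bonferroni is applied afterwards; the single returned value is compared with the chosen significance level.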
Affiliation(s)
- L M McIntyre
- Institute for Clinical and Epidemiological Research, Veterans Affairs Medical Center, Durham, North Carolina, USA
|
39
|
Abstract
In this paper we develop a coalescent model with intralocus gene conversion. Such models are of increasing importance in the analysis of intralocus variability and linkage disequilibrium. We derive the distribution of the waiting time until a gene conversion event occurs in a sample in terms of the distribution of the length of the transferred segment, zeta. We do not assume any specific form of the distribution of zeta. Further, given that a gene conversion event occurs we find the distribution of (sigma, tau), the end points of the transferred segment and derive results on correlations between local trees in positions chi(1) and chi(2). Among other results we show that the correlation between the branch lengths of two local trees in the coalescent with gene conversion (and no recombination) decreases toward a nonzero constant when the distance between chi(1) and chi(2) increases. Finally, we show that a model including both recombination and gene conversion might account for the lack of intralocus associations found in, e.g., Drosophila melanogaster.
Affiliation(s)
- C Wiuf
- Department of Statistics, University of Oxford, Oxford, OX1 3TG, England
|
40
|
Abstract
Three new estimators of the population recombination rate C = 4Nr are introduced. These estimators summarize the data using the number of distinct haplotypes and the estimated minimum number of recombination events, then calculate the value of C that maximizes the likelihood of obtaining the summarized data. They are compared with a number of previously proposed estimators of the recombination rate. One of the newly proposed estimators is generally better than the others for the parameter values considered here, while the three programs that calculate maximum-likelihood estimates give conflicting results.
Affiliation(s)
- J D Wall
- Department of Ecology and Evolution, University of Chicago, Illinois 60637, USA.
|
41
|
Abstract
Histories of sequences in the coalescent model with recombination can be simulated using an algorithm that takes as input a sample of extant sequences. The algorithm traces the history of the sequences going back in time, encountering recombinations and coalescence (duplications) until the ancestral material is located on one sequence for homologous positions in the present sequences. Here an alternative algorithm is formulated not as going back in time and operating on sequences, but by moving spatially along the sequences, updating the history of the sequences as recombination points are encountered. This algorithm focuses on spatial aspects of the coalescent with recombination rather than on temporal aspects as is the case of familiar algorithms. Mathematical results related to spatial aspects of the coalescent with recombination are derived.
Affiliation(s)
- C Wiuf
- Institute of Biological Sciences, University of Aarhus, Aarhus, DK-8000, Denmark
|
42
|
Abstract
In a sample of DNA sequences where recombination can occur to the ancestors of the sample, distinct parts of the sequences may have different most recent common ancestors. This paper presents a Markov chain Monte Carlo algorithm for computing the expected time to the most recent common ancestor along the sequences, conditional on where the mutations occur on the sequences.
Affiliation(s)
- R C Griffiths
- Mathematics Department, Monash University, Clayton, 3168, Australia.
|
43
|
Abstract
In this article we discuss the ancestry of sequences sampled from the coalescent with recombination with constant population size 2N. We have studied a number of variables based on simulations of sample histories, and some analytical results are derived. Consider the leftmost nucleotide in the sequences. We show that the number of nucleotides sharing a most recent common ancestor (MRCA) with the leftmost nucleotide is approximately log(1 + 4N Lr)/4Nr when two sequences are compared, where L denotes sequence length in nucleotides, and r the recombination rate between any two neighboring nucleotides per generation. For larger samples, the number of nucleotides sharing MRCA with the leftmost nucleotide decreases and becomes almost independent of 4N Lr. Further, we show that a segment of the sequences sharing a MRCA consists in mean of 3/8Nr nucleotides, when two sequences are compared, and that this decreases toward 1/4Nr nucleotides when the whole population is sampled. A measure of the correlation between the genealogies of two nucleotides on two sequences is introduced. We show analytically that even when the nucleotides are separated by a large genetic distance, but share MRCA, the genealogies will show only little correlation. This is surprising, because the time until the two nucleotides shared MRCA is reciprocal to the genetic distance. Using simulations, the mean time until all positions in the sample have found a MRCA increases logarithmically with increasing sequence length and is considerably lower than a theoretically predicted upper bound. On the basis of simulations, it turns out that important properties of the coalescent with recombinations of the whole population are reflected in the properties of a sample of low size.
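The closed-form approximations quoted in this abstract are easy to evaluate numerically. The sketch below assumes the fractions are meant as log(1 + 4NLr)/(4Nr), 3/(8Nr), and 1/(4Nr); that parenthesization is an interpretation of the abstract's notation, and the function names are hypothetical.

```python
import math


def shared_mrca_length(N, r, L):
    """Approximate number of nucleotides sharing an MRCA with the leftmost
    nucleotide when two sequences are compared: log(1 + 4NLr) / (4Nr)."""
    rho = 4 * N * r  # population recombination rate per nucleotide pair
    return math.log1p(rho * L) / rho


def mean_segment_length_pair(N, r):
    """Mean length of a segment sharing an MRCA, two sequences: 3 / (8Nr)."""
    return 3 / (8 * N * r)


def mean_segment_length_population(N, r):
    """Limiting mean segment length for the whole population: 1 / (4Nr)."""
    return 1 / (4 * N * r)


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the paper.
    N, r, L = 10_000, 1e-8, 10_000
    print(shared_mrca_length(N, r, L))
    print(mean_segment_length_pair(N, r))
    print(mean_segment_length_population(N, r))
```

Under this reading, the two-sequence mean segment length is 1.5 times the whole-population limit for any N and r, which agrees with the abstract's statement that the value decreases as the sample grows.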
Affiliation(s)
- C Wiuf
- Institute of Biological Sciences, University of Aarhus, DK-8000 Aarhus, Denmark.
|
44
|
Simonsen KL, Churchill GA. A Markov Chain Model of Coalescence with Recombination. Theor Popul Biol 1997; 52:43-59. [PMID: 9356323 DOI: 10.1006/tpbi.1997.1307] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.3] [Indexed: 02/05/2023]
Abstract
Trees that describe the ancestry of DNA sequences sampled from a population may differ between loci because of genetic recombination. We seek to understand the relationship between such trees for loci that are linked with non-zero recombination rate. We consider a coalescent process model with recombination, as described by Hudson (1983; 1990). For two loci and a sample size of two sequences, a detailed analysis of this process yields the joint distribution of the two trees (one at each locus). A number of interesting results follow from this analysis, including the distribution of the number of recombination events in the history of the sample. For the general case of m loci and samples of size n, we describe an algorithm for simulating the tree building process. Because analytic results are difficult to obtain in this case, we use simulation to study properties of trees at multiple linked loci such as total tree time and number of recombination events. Copyright 1997 Academic Press
Affiliation(s)
- KL Simonsen
- Center for Applied Math, Cornell University, Ithaca, New York, 14853
|
45
|
|
46
|
Griffiths RC, Marjoram P. Ancestral inference from samples of DNA sequences with recombination. J Comput Biol 1996; 3:479-502. [PMID: 9018600 DOI: 10.1089/cmb.1996.3.479] [Citation(s) in RCA: 251] [Impact Index Per Article: 9.0] [Indexed: 02/03/2023] Open
Abstract
The sampling distribution of a collection of DNA sequences is studied under a model where recombination can occur in the ancestry of the sequences. The infinitely-many-sites model of mutation is assumed where there may only be one mutation at a given site. Ancestral inference procedures are discussed for: estimating recombination and mutation rates; estimating the times to the most recent common ancestors along the sequences; estimating ages of mutations; and estimating the number of recombination events in the ancestry of the sample. Inferences are made conditional on the configuration of the pattern of mutations at sites in observed sample sequences. A computational algorithm based on a Markov chain simulation is developed, implemented, and illustrated with examples for these inference procedures. This algorithm is very computationally intensive.
Affiliation(s)
- R C Griffiths
- Mathematics Department, Monash University, Clayton, Australia
|
47
|
Griffiths RC. Which locus has the oldest allele? J Math Biol 1991; 29:763-77. [PMID: 1940668 DOI: 10.1007/bf00160191] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/29/2022]
Abstract
This paper studies aspects of the distribution of non-mutant ancestors of a sample of gametes in a two-locus infinitely-many-alleles model. The ancestral process of two gametes is considered in detail. Included are algorithms for calculating the probability that the oldest allele is from the first locus, and the expected age of the oldest allele. Extensions to an r-locus model in the cases of complete linkage and independence are also studied.
Affiliation(s)
- R C Griffiths
- Mathematics Department, Monash University, Clayton, Australia
|
48
|
|