1
Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns AW, Kelleher J. A general and efficient representation of ancestral recombination graphs. Genetics 2024; 228:iyae100. [PMID: 39013109 PMCID: PMC11373519 DOI: 10.1093/genetics/iyae100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Accepted: 06/05/2024] [Indexed: 07/18/2024] Open
Abstract
As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. However, this approach is out of step with some modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalizes these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
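To make the idea of "genomes and their intervals of genetic inheritance" concrete, here is a minimal sketch in Python. It assumes an ARG can be stored as a list of parent-child edges, each annotated with the half-open genome interval over which the child inherits from the parent. The `Edge` type, the `local_tree` helper, and the toy data are hypothetical illustrations, not the paper's formalism or any library's API.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """Hypothetical edge record: child inherits [left, right) from parent."""
    left: float   # inclusive start of the inherited interval
    right: float  # exclusive end of the inherited interval
    parent: int   # ID of the ancestral genome
    child: int    # ID of the descendant genome


def local_tree(edges, position):
    """Return a child -> parent mapping for the tree at one genome position."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


# Toy ARG (made-up data): sample genomes 0, 1, 2; ancestral genomes 3, 4, 5.
# Left of position 5 the topology is ((0,1),2); right of it, (0,(1,2)).
edges = [
    Edge(0, 5, 3, 0), Edge(5, 10, 5, 0),
    Edge(0, 5, 3, 1), Edge(5, 10, 4, 1),
    Edge(0, 5, 5, 2), Edge(5, 10, 4, 2),
    Edge(0, 5, 5, 3),
    Edge(5, 10, 5, 4),
]

print(local_tree(edges, 2.0))
print(local_tree(edges, 7.0))
```

In this sketch no node is labelled as a coalescence or recombination event; the change of tree at position 5 is implied entirely by the edge intervals.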
Affiliation(s)
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Anastasia Ignatieva
  - School of Mathematics and Statistics, University of Glasgow, Glasgow G12 8TA, UK
  - Department of Statistics, University of Oxford, Oxford OX1 3LB, UK
- Jere Koskela
  - School of Mathematics, Statistics and Physics, Newcastle University, Newcastle NE1 7RU, UK
  - Department of Statistics, University of Warwick, Coventry CV4 7AL, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK
- Anthony W Wohns
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305-5101, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
2
Wong Y, Ignatieva A, Koskela J, Gorjanc G, Wohns AW, Kelleher J. A general and efficient representation of ancestral recombination graphs. bioRxiv 2024:2023.11.03.565466. [PMID: 37961279 PMCID: PMC10635123 DOI: 10.1101/2023.11.03.565466] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/15/2023]
Abstract
As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. This approach is out of step with modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalises these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
Affiliation(s)
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Anastasia Ignatieva
  - School of Mathematics and Statistics, University of Glasgow, UK
  - Department of Statistics, University of Oxford, UK
- Jere Koskela
  - School of Mathematics, Statistics and Physics, Newcastle University, UK
  - Department of Statistics, University of Warwick, UK
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, UK
- Anthony W. Wohns
  - Broad Institute of MIT and Harvard, Cambridge, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
3
Rasmussen DA, Guo F. Espalier: Efficient Tree Reconciliation and Ancestral Recombination Graphs Reconstruction Using Maximum Agreement Forests. Syst Biol 2023; 72:1154-1170. [PMID: 37458753 PMCID: PMC10627558 DOI: 10.1093/sysbio/syad040] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 01/26/2022] [Revised: 06/26/2023] [Accepted: 06/30/2023] [Indexed: 11/08/2023] Open
Abstract
In the presence of recombination, individuals may inherit different regions of their genome from different ancestors, resulting in a mosaic of phylogenetic histories across their genome. Ancestral recombination graphs (ARGs) can capture how phylogenetic relationships vary across the genome due to recombination, but reconstructing ARGs from genomic sequence data is notoriously difficult. Here, we present a method for reconciling discordant phylogenetic trees and reconstructing ARGs using maximum agreement forests (MAFs). Given two discordant trees, a MAF identifies the smallest possible set of topologically concordant subtrees present in both trees. We show how discordant trees can be reconciled through their MAF in a way that retains discordances strongly supported by sequence data while eliminating conflicts likely attributable to phylogenetic noise. We further show how MAFs and our reconciliation approach can be combined to select a path of local trees across the genome that maximizes the likelihood of the genomic sequence data, minimizes discordance between neighboring local trees, and identifies the recombination events necessary to explain remaining discordances to obtain a fully connected ARG. While heuristic, our ARG reconstruction approach is often as accurate as more exact methods while being much more computationally efficient. Moreover, important demographic parameters such as recombination rates can be accurately estimated from reconstructed ARGs. Finally, we apply our approach to plant-infecting RNA viruses in the genus Potyvirus to demonstrate how true recombination events can be disentangled from phylogenetic noise using our ARG reconstruction methods.
Affiliation(s)
- David A Rasmussen
  - Department of Entomology and Plant Pathology, North Carolina State University, Campus Box 7613, Raleigh, NC 27695, USA
  - Bioinformatics Research Center, North Carolina State University, Campus Box 7566, Raleigh, NC 27695, USA
- Fangfang Guo
  - Department of Entomology and Plant Pathology, North Carolina State University, Campus Box 7613, Raleigh, NC 27695, USA
4
Harris K. Using enormous genealogies to map causal variants in space and time. Nat Genet 2023; 55:730-731. [PMID: 37127671 PMCID: PMC10350326 DOI: 10.1038/s41588-023-01389-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2023]
Abstract
A new method infers huge gene trees and tests the tree branches for phenotypic associations. This improves power to map the effects of rare variants that are missing from genotype arrays and imputation panels.
Affiliation(s)
- Kelley Harris
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
  - Computational Biology Program, Fred Hutchinson Cancer Center, Seattle, WA, USA
5
Guo F, Carbone I, Rasmussen DA. Recombination-aware phylogeographic inference using the structured coalescent with ancestral recombination. PLoS Comput Biol 2022; 18:e1010422. [PMID: 35984849 PMCID: PMC9447913 DOI: 10.1371/journal.pcbi.1010422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2022] [Revised: 09/06/2022] [Accepted: 07/21/2022] [Indexed: 11/19/2022] Open
Abstract
Movement of individuals between populations or demes is often restricted, especially between geographically isolated populations. The structured coalescent provides an elegant theoretical framework for describing how movement between populations shapes the genealogical history of sampled individuals and thereby structures genetic variation within and between populations. However, in the presence of recombination an individual may inherit different regions of their genome from different parents, resulting in a mosaic of genealogical histories across the genome, which can be represented by an Ancestral Recombination Graph (ARG). In this case, different genomic regions may have different ancestral histories and so different histories of movement between populations. Recombination therefore poses an additional challenge to phylogeographic methods that aim to reconstruct the movement of individuals from genealogies, although also a potential benefit in that different loci may contain additional information about movement. Here, we introduce the Structured Coalescent with Ancestral Recombination (SCAR) model, which builds on recent approximations to the structured coalescent by incorporating recombination into the ancestry of sampled individuals. The SCAR model allows us to infer how the migration history of sampled individuals varies across the genome from ARGs, and improves estimation of key population genetic parameters such as population sizes, recombination rates and migration rates. Using the SCAR model, we explore the potential and limitations of phylogeographic inference using full ARGs. We then apply the SCAR to lineages of the recombining fungus Aspergillus flavus sampled across the United States to explore patterns of recombination and migration across the genome.
Affiliation(s)
- Fangfang Guo
  - Department of Entomology and Plant Pathology, North Carolina State University, Raleigh, North Carolina, United States of America
- Ignazio Carbone
  - Department of Entomology and Plant Pathology, North Carolina State University, Raleigh, North Carolina, United States of America
  - Center for Integrated Fungal Research, North Carolina State University, Raleigh, North Carolina, United States of America
- David A. Rasmussen
  - Department of Entomology and Plant Pathology, North Carolina State University, Raleigh, North Carolina, United States of America
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
6
Melka AB, Louzoun Y. High fraction of silent recombination in a finite-population two-locus neutral birth-death-mutation model. Phys Rev E 2022; 106:024409. [PMID: 36109958 DOI: 10.1103/physreve.106.024409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2021] [Accepted: 07/25/2022] [Indexed: 06/15/2023]
Abstract
A precise estimate of allele and haplotype polymorphism is of great interest in theoretical population genetics, but also has practical applications, such as bone marrow registries management. Allele polymorphism is driven mainly by point mutations, while haplotype polymorphism is also affected by recombination. Current estimates treat recombination as mutations in an infinite site model. We here show that even in the simple case of two loci in a haploid individual, for a finite population, most recombination events produce existing haplotypes, and as such are silent. Silent recombination considerably reduces the total number of haplotypes expected from the infinite site model for populations that are not much larger than one over the mutation rate. Moreover, in contrast with mutations, the number of haplotypes does not grow linearly with the population size. We hence propose a more accurate estimate of the total number of haplotypes that takes into account silent recombination. We study large-scale human leukocyte antigen (HLA) haplotype frequencies from human populations to show that the current estimated recombination rate in the HLA region is underestimated.
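The notion of a "silent" recombination can be illustrated with a small Python sketch. It assumes two-locus haplotypes stored as `(allele_at_locus_1, allele_at_locus_2)` tuples and counts a recombination as silent when the recombinant haplotype is already present in the population. The function name and the example populations are hypothetical, and this is a plain enumeration over parent pairs, not the birth-death-mutation model analysed in the paper.

```python
from itertools import permutations


def silent_fraction(population):
    """Fraction of ordered parent pairs whose recombinant already exists.

    `population` is a list of at least two (locus1, locus2) tuples.
    A recombinant takes locus 1 from the first parent and locus 2
    from the second parent.
    """
    existing = set(population)
    total = 0
    silent = 0
    for (a1, _b1), (_a2, b2) in permutations(population, 2):
        total += 1
        if (a1, b2) in existing:
            silent += 1
    return silent / total


# Made-up populations for illustration only.
saturated = [("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A2", "B2")]
sparse = [("A1", "B1"), ("A2", "B2")]

print(silent_fraction(saturated))
print(silent_fraction(sparse))
```

When every allele combination is already present, no recombination can create a new haplotype, whereas in the sparse population every recombination between the two distinct haplotypes does.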
Affiliation(s)
- A B Melka
  - Department of Mathematics, Bar-Ilan University, Ramat Gan 52900, Israel
- Y Louzoun
  - Department of Mathematics, Bar-Ilan University, Ramat Gan 52900, Israel
  - Gonda Brain Research Center, Bar-Ilan University, Ramat Gan 52900, Israel
7
Mahmoudi A, Koskela J, Kelleher J, Chan YB, Balding D. Bayesian inference of ancestral recombination graphs. PLoS Comput Biol 2022; 18:e1009960. [PMID: 35263345 PMCID: PMC8936483 DOI: 10.1371/journal.pcbi.1009960] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/14/2021] [Revised: 03/21/2022] [Accepted: 02/23/2022] [Indexed: 11/18/2022] Open
Abstract
We present a novel algorithm, implemented in the software ARGinfer, for probabilistic inference of the Ancestral Recombination Graph under the Coalescent with Recombination. Our Markov Chain Monte Carlo algorithm takes advantage of the Succinct Tree Sequence data structure that has allowed great advances in simulation and point estimation, but not yet probabilistic inference. Unlike previous methods, which employ the Sequentially Markov Coalescent approximation, ARGinfer uses the Coalescent with Recombination, allowing more accurate inference of key evolutionary parameters. We show using simulations that ARGinfer can accurately estimate many properties of the evolutionary history of the sample, including the topology and branch lengths of the genealogical tree at each sequence site, and the times and locations of mutation and recombination events. ARGinfer approximates posterior probability distributions for these and other quantities, providing interpretable assessments of uncertainty that we show to be well calibrated. ARGinfer is currently limited to tens of DNA sequences of several hundreds of kilobases, but has scope for further computational improvements to increase its applicability.
Affiliation(s)
- Ali Mahmoudi
  - Melbourne Integrative Genomics / School of Mathematics and Statistics, The University of Melbourne, Melbourne, Australia
- Jere Koskela
  - Department of Statistics, The University of Warwick, Coventry, United Kingdom
- Jerome Kelleher
  - Big Data Institute, The University of Oxford, Oxford, United Kingdom
- Yao-ban Chan
  - Melbourne Integrative Genomics / School of Mathematics and Statistics, The University of Melbourne, Melbourne, Australia
- David Balding
  - Melbourne Integrative Genomics / School of Mathematics and Statistics, The University of Melbourne, Melbourne, Australia
  - School of BioSciences, The University of Melbourne, Melbourne, Australia
8
Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, Tsambos G, Zhu S, Eldon B, Ellerman EC, Galloway JG, Gladstein AL, Gorjanc G, Guo B, Jeffery B, Kretzschumar WW, Lohse K, Matschiner M, Nelson D, Pope NS, Quinto-Cortés CD, Rodrigues MF, Saunack K, Sellinger T, Thornton K, van Kemenade H, Wohns AW, Wong Y, Gravel S, Kern AD, Koskela J, Ralph PL, Kelleher J. Efficient ancestry and mutation simulation with msprime 1.0. Genetics 2022; 220:iyab229. [PMID: 34897427 PMCID: PMC9176297 DOI: 10.1093/genetics/iyab229] [Citation(s) in RCA: 116] [Impact Index Per Article: 58.0] [Received: 09/22/2021] [Accepted: 12/03/2021] [Indexed: 11/13/2022] Open
Abstract
Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime's many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.
Affiliation(s)
- Franz Baumdicker
  - Cluster of Excellence “Controlling Microbes to Fight Infections”, Mathematical and Computational Population Genetics, University of Tübingen, 72076 Tübingen, Germany
- Gertjan Bisschop
  - Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3FL, UK
- Daniel Goldstein
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Graham Gower
  - Lundbeck GeoGenetics Centre, Globe Institute, University of Copenhagen, 1350 Copenhagen K, Denmark
- Aaron P Ragsdale
  - Department of Integrative Biology, University of Wisconsin–Madison, Madison, WI 53706, USA
- Georgia Tsambos
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Parkville, VIC 3010, Australia
- Sha Zhu
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Bjarki Eldon
  - Leibniz Institute for Evolution and Biodiversity Science, Museum für Naturkunde, Berlin 10115, Germany
- Jared G Galloway
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA 98102, USA
- Ariella L Gladstein
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7264, USA
  - Embark Veterinary, Inc., Boston, MA 02111, USA
- Gregor Gorjanc
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK
- Bing Guo
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Ben Jeffery
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Warren W Kretzschumar
  - Center for Hematology and Regenerative Medicine, Karolinska Institute, 141 83 Huddinge, Sweden
- Konrad Lohse
  - Institute of Evolutionary Biology, The University of Edinburgh, Edinburgh EH9 3FL, UK
- Dominic Nelson
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Nathaniel S Pope
  - Department of Entomology, Pennsylvania State University, State College, PA 16802, USA
- Consuelo D Quinto-Cortés
  - National Laboratory of Genomics for Biodiversity (LANGEBIO), Unit of Advanced Genomics, CINVESTAV, Irapuato, Mexico
- Murillo F Rodrigues
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
- Thibaut Sellinger
  - Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, 85354 Freising, Germany
- Kevin Thornton
  - Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA
- Anthony W Wohns
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Yan Wong
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
- Simon Gravel
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
- Andrew D Kern
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
- Jere Koskela
  - Department of Statistics, University of Warwick, Coventry CV4 7AL, UK
- Peter L Ralph
  - Department of Biology, Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA
  - Department of Mathematics, University of Oregon, Eugene, OR 97403-5289, USA
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
9
Zhou Y, Browning BL, Browning SR. Population-Specific Recombination Maps from Segments of Identity by Descent. Am J Hum Genet 2020; 107:137-148. [PMID: 32533945 PMCID: PMC7332656 DOI: 10.1016/j.ajhg.2020.05.016] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 01/28/2020] [Accepted: 05/20/2020] [Indexed: 12/26/2022] Open
Abstract
Recombination rates vary significantly across the genome, and estimates of recombination rates are needed for downstream analyses such as haplotype phasing and genotype imputation. Existing methods for recombination rate estimation are limited by insufficient amounts of informative genetic data or by high computational cost. We present a method and software, called IBDrecomb, for using segments of identity by descent to infer recombination rates. IBDrecomb can be applied to sequenced population cohorts to obtain high-resolution, population-specific recombination maps. In simulated admixed data, IBDrecomb obtains higher accuracy than admixture-based estimation of recombination rates. When applied to 2,500 simulated individuals, IBDrecomb obtains similar accuracy to a linkage-disequilibrium (LD)-based method applied to 96 individuals (the largest number for which computation is tractable). Compared to LD-based maps, our IBD-based maps have the advantage of estimating recombination rates in the recent past rather than the distant past. We used IBDrecomb to generate new recombination maps for European Americans and for African Americans from TOPMed sequence data from the Framingham Heart Study (1,626 unrelated individuals) and the Jackson Heart Study (2,046 unrelated individuals), and we compare them to LD-based, admixture-based, and family-based maps.
Affiliation(s)
- Ying Zhou
  - Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
- Brian L Browning
  - Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA 98195, USA
- Sharon R Browning
  - Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
10
Hey J, Wang K. The effect of undetected recombination on genealogy sampling and inference under an isolation-with-migration model. Mol Ecol Resour 2019; 19:1593-1609. [PMID: 31479562 DOI: 10.1111/1755-0998.13083] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/18/2019] [Revised: 07/22/2019] [Accepted: 07/24/2019] [Indexed: 11/30/2022]
Abstract
Many methods for fitting demographic models to data sets of aligned sequences rely upon an assumption that the data have a branching coalescent history without recombination within regions or loci. To mitigate the effects of the failure of this assumption, a common approach is to filter data and sample regions that pass the four-gamete criterion for recombination, an approach that allows data to run, but that is expected to detect only a minority of recombination events. A series of empirical tests of this approach were conducted using computer simulations with and without recombination for a variety of isolation-with-migration (IM) models for two and three populations. Only the IMa3 program was used, but the general results should apply to related genealogy-sampling-based methods for IM models or subsets of IM models. It was found that the details of sampling intervals that pass a four-gamete filter have a moderate effect, and that schemes that use the longest intervals, or that use overlapping intervals, gave poorer results. A simple approach of using a random nonoverlapping interval returned the smallest difference between results with and without recombination, with the mean difference between parameter estimates usually less than 20% of the true value (usually much less). However, the posterior probability distributions for migration rates were flatter with recombination, suggesting that filtering based on the four-gamete criterion, while necessary for methods like these, leads to reduced resolution on migration. A distinct, alternative approach, of using a finite sites mutation model and not filtering the data, performed quite poorly.
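For readers unfamiliar with the four-gamete criterion mentioned in this abstract, the following Python sketch shows the standard pairwise check: under an infinite-sites model without recombination, two biallelic sites cannot display all four allele combinations. It assumes haplotypes are given as equal-length strings of `'0'` and `'1'`; the function name and the input encoding are hypothetical and are not taken from IMa3 or the paper.

```python
from itertools import combinations


def passes_four_gamete(haplotypes):
    """Return True if no pair of sites shows all four gametes.

    `haplotypes` is a non-empty list of equal-length strings over
    {'0', '1'}, one string per sampled sequence (biallelic sites).
    """
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        # Collect the distinct two-site allele combinations observed.
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False  # 00, 01, 10 and 11 all present
    return True


# Made-up alignments for illustration only.
compatible = ["000", "011", "100"]
incompatible = ["00", "01", "10", "11"]

print(passes_four_gamete(compatible))
print(passes_four_gamete(incompatible))
```

A region that passes this check may still contain recombination events that left no incompatible site pair, which is consistent with the abstract's remark that the filter detects only a minority of events.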
Affiliation(s)
- Jody Hey
  - Center for Computational Genetics and Genomics, Department of Biology, Temple University, Philadelphia, PA, USA
- Katherine Wang
  - Center for Computational Genetics and Genomics, Department of Biology, Temple University, Philadelphia, PA, USA
11
Inferring whole-genome histories in large population datasets. Nat Genet 2019; 51:1330-1338. [PMID: 31477934 PMCID: PMC6726478 DOI: 10.1038/s41588-019-0483-y] [Citation(s) in RCA: 121] [Impact Index Per Article: 24.2] [Received: 02/20/2019] [Accepted: 07/15/2019] [Indexed: 01/01/2023]
Abstract
Inferring the full genealogical history of a set of DNA sequences is a core problem in evolutionary biology, because this history encodes information about the events and forces that have influenced a species. However, current methods are limited, and the most accurate techniques are able to process no more than a hundred samples. As datasets that consist of millions of genomes are now being collected, there is a need for scalable and efficient inference methods to fully utilize these resources. Here we introduce an algorithm that is able to not only infer whole-genome histories with comparable accuracy to the state-of-the-art but also process four orders of magnitude more sequences. The approach also provides an 'evolutionary encoding' of the data, enabling efficient calculation of relevant statistics. We apply the method to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the inferred genealogies are rich in biological signal and efficient to process.
12
Hermann P, Heissl A, Tiemann-Boege I, Futschik A. LDJump: Estimating variable recombination rates from population genetic data. Mol Ecol Resour 2019; 19:623-638. [PMID: 30666785 PMCID: PMC6519033 DOI: 10.1111/1755-0998.12994] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 06/22/2018] [Revised: 12/13/2018] [Accepted: 01/11/2019] [Indexed: 11/27/2022]
Abstract
As recombination plays an important role in evolution, its estimation and the identification of hotspot positions is of considerable interest. We propose a novel approach for estimating population recombination rates based on genotyping or sequence data that involves a sequential multiscale change point estimator. Our method also permits demography to be taken into account. It uses several summary statistics within a regression model fitted on suitable scenarios. Our proposed method is accurate, computationally fast, and provides a parsimonious solution by ensuring a type I error control against too many changes in the recombination rate. An application to human genome data suggests a good congruence between our estimated and experimentally identified hotspots. Our method is implemented in the R‐package LDJump, which is freely available at https://github.com/PhHermann/LDJump.
Affiliation(s)
- Philipp Hermann
  - Department of Applied Statistics, Johannes Kepler University Linz, Linz, Austria
- Angelika Heissl
  - Institute of Biophysics, Johannes Kepler University Linz, Linz, Austria
- Andreas Futschik
  - Department of Applied Statistics, Johannes Kepler University Linz, Linz, Austria
13
Heine K, Beskos A, Jasra A, Balding D, De Iorio M. Bridging trees for posterior inference on ancestral recombination graphs. Proc Math Phys Eng Sci 2018; 474:20180568. [PMID: 30602937 PMCID: PMC6304023 DOI: 10.1098/rspa.2018.0568] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 08/21/2018] [Accepted: 11/01/2018] [Indexed: 11/08/2023] Open
Abstract
We present a new Markov chain Monte Carlo algorithm, implemented in the software Arbores, for inferring the history of a sample of DNA sequences. Our principal innovation is a bridging procedure, previously applied only for simple stochastic processes, in which the local computations within a bridge can proceed independently of the rest of the DNA sequence, facilitating large-scale parallelization.
Affiliation(s)
- K. Heine
  - Department of Mathematical Sciences, University of Bath, Claverton Down, Bath BA2 7AY, UK
- A. Beskos
  - Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, UK
- A. Jasra
  - Department of Statistics and Applied Probability, National University of Singapore, 6 Science Drive 2, 117546, Singapore
- D. Balding
  - Centre for Systems Genomics, School of BioSciences, University of Melbourne, Victoria 3010, Australia
  - School of Mathematics and Statistics, University of Melbourne, Victoria 3010, Australia
- M. De Iorio
  - Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, UK
  - Yale-NUS College, 16 College Avenue West, 138527, Singapore
14
Muñoz M, Ríos-Chaparro DI, Patarroyo MA, Ramírez JD. Determining Clostridium difficile intra-taxa diversity by mining multilocus sequence typing databases. BMC Microbiol 2017; 17:62. [PMID: 28288567 PMCID: PMC5348806 DOI: 10.1186/s12866-017-0969-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 10/14/2016] [Accepted: 03/03/2017] [Indexed: 12/18/2022] Open
Abstract
Background: Multilocus sequence typing (MLST) is a highly discriminatory typing strategy; it is reproducible and scalable. There is a MLST scheme for Clostridium difficile (CD), a gram positive bacillus causing different pathologies of the gastrointestinal tract. This work was aimed at describing the frequency of sequence types (STs) and Clades (C) reported and evaluating the intra-taxa diversity in the CD MLST database (CD-MLST-db) using an MLSA approach.
Results: Analysis of 1778 available isolates showed that clade 1 (C1) was the most frequent worldwide (57.7%), followed by C2 (29.1%). Regarding sequence types (STs), it was found that ST-1, belonging to C2, was the most frequent. The isolates analysed came from 17 countries, mostly from the United Kingdom (UK) (1541 STs, 87.0%). The diversity of the seven housekeeping genes in the MLST scheme was evaluated, and alleles from the profiles (STs), for identifying CD population structure. It was found that adk and atpA are conserved genes allowing a limited number of clusters to be discriminated; however, different genes such as dxr, glyA and particularly sodA showed high diversity indexes and grouped CD populations in many clusters, suggesting that these genes’ contribution to CD typing should be revised. It was identified that CD STs reported to date have a mostly clonal population structure with foreseen events of recombination; however, one group of STs was not assigned to a clade being highly different containing at least nine well-supported clusters, suggesting a greater number of clades for CD.
Conclusions: This study shows the usefulness of CD-MLST-db as a tool for studying CD distribution and population structure, identifying the need for reviewing the usefulness of sodA as housekeeping gene within the MLST scheme and suggesting the existence of a greater number of CD clades. The study also shows the plausible exchange of genetic material between STs, contributing towards intra-taxa genetic diversity.
Electronic supplementary material The online version of this article (doi:10.1186/s12866-017-0969-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Marina Muñoz
- Grupo de Investigaciones Microbiológicas-UR (GIMUR), Programa de Biología, Facultad de Ciencias Naturales y Matemáticas, Universidad del Rosario, Carrera 24 # 63C - 69, Bogotá, Colombia; Posgrado Interfacultades Doctorado en Biotecnología, Facultad de Ciencias, Universidad Nacional de Colombia, Bogotá, Colombia
- Dora Inés Ríos-Chaparro
- Grupo de Investigaciones Microbiológicas-UR (GIMUR), Programa de Biología, Facultad de Ciencias Naturales y Matemáticas, Universidad del Rosario, Carrera 24 # 63C - 69, Bogotá, Colombia
- Manuel Alfonso Patarroyo
- Molecular Biology and Immunology Department, Fundación Instituto de Inmunología de Colombia (FIDIC), Bogotá, Colombia; School of Medicine and Health Sciences, Universidad del Rosario, Bogotá, Colombia
- Juan David Ramírez
- Grupo de Investigaciones Microbiológicas-UR (GIMUR), Programa de Biología, Facultad de Ciencias Naturales y Matemáticas, Universidad del Rosario, Carrera 24 # 63C - 69, Bogotá, Colombia
|
15
|
Choi SC. Methods for delimiting species via population genetics and phylogenetics using genotype data. Genes Genomics 2016. [DOI: 10.1007/s13258-016-0458-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 10/21/2022]
|
16
|
Abstract
We prove a result concerning the joint distribution of alleles at linked loci on a chromosome drawn from the population at stationarity. For a neutral locus, the allele is a draw from the stationary distribution of the mutation process. Furthermore, this allele is independent of the alleles at different loci on any chromosomes in the population.
|
17
|
Gärtner K, Futschik A. Improved Versions of Common Estimators of the Recombination Rate. J Comput Biol 2016; 23:756-68. [PMID: 27409412 DOI: 10.1089/cmb.2016.0039] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 01/09/2023]
Abstract
The scaled recombination parameter ρ is one of the key parameters, turning up frequently in population genetic models. Accurate estimates of ρ are difficult to obtain, as recombination events do not always leave traces in the data. One of the most widely used approaches is composite likelihood. Here, we show that popular implementations of composite likelihood estimators can often be uniformly improved by optimizing the trade-off between bias and variance. The amount of possible improvement depends on parameters such as the sequence length, the sample size, and the mutation rate, and it can be considerable in some cases. It turns out that approximate Bayesian computation, with composite likelihood as a summary statistic, also leads to improved estimates, but now in terms of the posterior risk. Finally, we demonstrate a practical application on real data from Drosophila.
Affiliation(s)
- Kerstin Gärtner
- Vienna Graduate School of Population Genetics, Institut für Populationsgenetik, Vetmeduni Vienna, Vienna, Austria
- Andreas Futschik
- Department of Applied Statistics, Johannes Kepler University, Linz, Austria
|
18
|
Evolution of Neuroadaptation in the Periphery and Purifying Selection in the Brain Contribute to Compartmentalization of Simian Immunodeficiency Virus (SIV) in the Brains of Rhesus Macaques with SIV-Associated Encephalitis. J Virol 2016; 90:6112-6126. [PMID: 27122578 DOI: 10.1128/jvi.00137-16] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 01/22/2016] [Accepted: 04/16/2016] [Indexed: 12/25/2022]
Abstract
The emergence of a distinct subpopulation of human or simian immunodeficiency virus (HIV/SIV) sequences within the brain (compartmentalization) during infection is hypothesized to be linked to AIDS-related central nervous system (CNS) neuropathology. However, the exact evolutionary mechanism responsible for HIV/SIV brain compartmentalization has not been thoroughly investigated. Using extensive viral sampling from several different peripheral tissues and cell types and from three distinct regions within the brain from two well-characterized rhesus macaque models of the neurological complications of HIV infection (neuroAIDS), we have been able to perform in-depth evolutionary analyses that have been unattainable in HIV-infected subjects. The results indicate that, despite multiple introductions of virus into the brain over the course of infection, brain sequence compartmentalization in macaques with SIV-associated CNS neuropathology likely results from late viral entry of virus that has acquired through evolution in the periphery sufficient adaptation for the distinct microenvironment of the CNS.
Importance: HIV-associated neurocognitive disorders remain prevalent among HIV type 1-infected individuals, whereas our understanding of the critical components of disease pathogenesis, such as virus evolution and adaptation, remains limited. Building upon earlier findings of specific viral subpopulations in the brain, we present novel yet fundamental results concerning the evolutionary patterns driving this phenomenon in two well-characterized animal models of neuroAIDS and provide insight into the timing of entry of virus into the brain and selective pressure associated with viral adaptation to this particular microenvironment. Such knowledge is invaluable for therapeutic strategies designed to slow or even prevent neurocognitive impairment associated with AIDS.
|
19
|
Jenkins PA, Fearnhead P, Song YS. Tractable diffusion and coalescent processes for weakly correlated loci. Electron J Probab 2016; 20. [PMID: 27375350 DOI: 10.1214/ejp.v20-3564] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
Abstract
Widely used models in genetics include the Wright-Fisher diffusion and its moment dual, Kingman's coalescent. Each has a multilocus extension but under neither extension is the sampling distribution available in closed-form, and their computation is extremely difficult. In this paper we derive two new multilocus population genetic models, one a diffusion and the other a coalescent process, which are much simpler than the standard models, but which capture their key properties for large recombination rates. The diffusion model is based on a central limit theorem for density dependent population processes, and we show that the sampling distribution is a linear combination of moments of Gaussian distributions and hence available in closed-form. The coalescent process is based on a probabilistic coupling of the ancestral recombination graph to a simpler genealogical process which exposes the leading dynamics of the former. We further demonstrate that when we consider the sampling distribution as an asymptotic expansion in inverse powers of the recombination parameter, the sampling distributions of the new models agree with the standard ones up to the first two orders.
Affiliation(s)
- Paul A Jenkins
- Department of Statistics, University of Warwick, Coventry CV4 7AL, UK
- Paul Fearnhead
- Department of Mathematics and Statistics, Lancaster University, Lancaster LA1 4YF, UK
- Yun S Song
- Department of Statistics and Computer Science Division, University of California, Berkeley, Berkeley, CA 94720, USA
|
20
|
New Software for the Fast Estimation of Population Recombination Rates (FastEPRR) in the Genomic Era. G3 Genes Genomes Genetics 2016; 6:1563-71. [PMID: 27172192 PMCID: PMC4889653 DOI: 10.1534/g3.116.028233] [Citation(s) in RCA: 82] [Impact Index Per Article: 10.3] [Indexed: 11/21/2022]
Abstract
Genetic recombination is a very important evolutionary mechanism that mixes parental haplotypes and produces new raw material for organismal evolution. As a result, information on recombination rates is critical for biological research. In this paper, we introduce a new, extremely fast open-source software package (FastEPRR) that uses machine learning to estimate the recombination rate ρ (= 4N_e r) from intraspecific DNA polymorphism data. When ρ > 10 and the number of sampled diploid individuals is large enough (≥ 50), the variance of ρ_FastEPRR remains slightly smaller than that of ρ_LDhat. The new estimate ρ_comb (calculated by averaging ρ_FastEPRR and ρ_LDhat) has the smallest variance in all cases. When estimating ρ_FastEPRR, the finite-site model was employed to analyze cases with a high rate of recurrent mutations, and an additional method is proposed to consider the effect of variable recombination rates within windows. Simulations encompassing a wide range of parameters demonstrate that different evolutionary factors, such as demography and selection, may not increase the false positive rate of recombination hotspots. Overall, the accuracy of FastEPRR is similar to that of the well-known method LDhat, but FastEPRR requires far less computation time. Genetic maps for each human population (YRI, CEU, and CHB) extracted from the 1000 Genomes OMNI data set were obtained in less than 3 d using just a single CPU core. The pairwise Pearson correlation coefficient between the ρ_FastEPRR and ρ_LDhat maps is very high, ranging between 0.929 and 0.987 at a 5-Mb scale. Considering that sample sizes for these kinds of data are increasing dramatically with advances in next-generation sequencing technologies, FastEPRR (freely available at http://www.picb.ac.cn/evolgen/) is expected to become a widely used tool for establishing genetic maps and studying recombination hotspots in the population genomic era.
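The two quantities defined in this abstract, the population-scaled rate ρ = 4N_e r and the combined estimate obtained by averaging two estimates, can be written out as a minimal sketch. The function names and the numeric inputs below are hypothetical illustrations and are not part of FastEPRR or LDhat.

```python
def scaled_recombination_rate(effective_size: float, r: float) -> float:
    """Return rho = 4 * N_e * r for a diploid population.

    effective_size: effective population size N_e (hypothetical input)
    r: recombination rate per generation over the region of interest
    """
    return 4.0 * effective_size * r


def combined_estimate(rho_a: float, rho_b: float) -> float:
    """Average two estimates of rho, as described for the combined estimate."""
    return 0.5 * (rho_a + rho_b)


# Illustrative values only (assumptions, not results from the paper).
rho = scaled_recombination_rate(effective_size=10_000, r=1e-8)
rho_comb = combined_estimate(12.0, 16.0)
print(rho, rho_comb)
```

Averaging two estimators lowers the variance when their errors are not perfectly correlated, which is consistent with the abstract's report that the combined estimate has the smallest variance.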
|
21
|
Coalescent Inference Using Serially Sampled, High-Throughput Sequencing Data from Intrahost HIV Infection. Genetics 2016; 202:1449-72. [PMID: 26857628 DOI: 10.1534/genetics.115.177931] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 06/05/2015] [Accepted: 01/31/2016] [Indexed: 01/11/2023]
Abstract
Human immunodeficiency virus (HIV) is a rapidly evolving pathogen that causes chronic infections, so genetic diversity within a single infection can be very high. High-throughput "deep" sequencing can now measure this diversity in unprecedented detail, particularly since it can be performed at different time points during an infection, and this offers a potentially powerful way to infer the evolutionary dynamics of the intrahost viral population. However, population genomic inference from HIV sequence data is challenging because of high rates of mutation and recombination, rapid demographic changes, and ongoing selective pressures. In this article we develop a new method for inference using HIV deep sequencing data, using an approach based on importance sampling of ancestral recombination graphs under a multilocus coalescent model. The approach further extends recent progress in the approximation of so-called conditional sampling distributions, a quantity of key interest when approximating coalescent likelihoods. The chief novelties of our method are that it is able to infer rates of recombination and mutation, as well as the effective population size, while handling sampling over different time points and missing data without extra computational difficulty. We apply our method to a data set of HIV-1, in which several hundred sequences were obtained from an infected individual at seven time points over 2 years. We find mutation rate and effective population size estimates to be comparable to those produced by the software BEAST. Additionally, our method is able to produce local recombination rate estimates. The software underlying our method, Coalescenator, is freely available.
|
22
|
Zhang S, Bian Y, Li L, Sun K, Wang Z, Zhao Q, Zha L, Cai J, Gao Y, Ji C, Li C. Population genetic study of 34 X-Chromosome markers in 5 main ethnic groups of China. Sci Rep 2015; 5:17711. [PMID: 26634331 PMCID: PMC4669481 DOI: 10.1038/srep17711] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 08/13/2015] [Accepted: 11/04/2015] [Indexed: 01/09/2023]
Abstract
As a multi-ethnic country, China has indigenous population groups that vary in culture and social customs, perhaps as a result of geographic isolation and different traditions. However, with close interaction and intermarriage, admixture of different gene pools among these ethnic groups may occur. To gain more insight into the X-chromosome genetic background of these ethnic groups, a set of X-markers (18 X-STRs and 16 X-Indels) was genotyped in 5 main ethnic groups of China (HAN, HUI, Uygur, Mongolian, Tibetan). Twenty-three private alleles were detected in HAN, Uygur, Tibetan and Mongolian. Significant differences (p < 0.0001) were observed for all 3 heterozygosity parameters (Ho, He and UHe) among the 5 ethnic groups. The highest Nei genetic distances were always observed for the HUI-Uygur pair, whether X-STRs and X-Indels were analyzed separately or combined. Phylogenetic tree and PCA analyses revealed a clear pattern of population differentiation of HUI and Uygur. However, the HAN, Tibetan and Mongolian ethnic groups were closely clustered. Eighteen X-Indels exhibited, in general, a congruent phylogenetic signal and similar clustering among the 5 ethnic groups compared with 16 X-STRs. These results demonstrate the genetic polymorphism and potential of the 34 X-markers in the 5 ethnic groups.
Affiliation(s)
- Suhua Zhang
- Shanghai Key Laboratory of Forensic Medicine, Institute of Forensic Sciences, Ministry of Justice, P.R. China, Shanghai 200063, P.R. China; State Key Laboratory of Genetic Engineering, Institute of Genetics, School of Life Sciences, Fudan University, Shanghai 200433, P.R. China
- Yingnan Bian
- Shanghai Key Laboratory of Forensic Medicine, Institute of Forensic Sciences, Ministry of Justice, P.R. China, Shanghai 200063, P.R. China
- Li Li
- Shanghai Key Laboratory of Forensic Medicine, Institute of Forensic Sciences, Ministry of Justice, P.R. China, Shanghai 200063, P.R. China
- Kuan Sun
- Institute of Forensic Medicine, West China School of Basic Science and Forensic Medicine, Sichuan University, Chengdu 610041, P.R. China
- Zheng Wang
- Shanghai Key Laboratory of Forensic Medicine, Institute of Forensic Sciences, Ministry of Justice, P.R. China, Shanghai 200063, P.R. China
- Qi Zhao
- Shanghai Key Laboratory of Forensic Medicine, Institute of Forensic Sciences, Ministry of Justice, P.R. China, Shanghai 200063, P.R. China
- Lagabaiyila Zha
- Department of Forensic Science, School of Basic Medical Sciences, Central South University, Changsha 410013, P.R. China
- Jifeng Cai
- Department of Forensic Science, School of Basic Medical Sciences, Central South University, Changsha 410013, P.R. China
- Yuzhen Gao
- Department of Forensic Medicine, Medical College of Soochow University, Suzhou 215123, P.R. China
- Chaoneng Ji
- State Key Laboratory of Genetic Engineering, Institute of Genetics, School of Life Sciences, Fudan University, Shanghai 200433, P.R. China
- Chengtao Li
- Shanghai Key Laboratory of Forensic Medicine, Institute of Forensic Sciences, Ministry of Justice, P.R. China, Shanghai 200063, P.R. China
|
23
|
Assessing Differences Between Ancestral Recombination Graphs. J Mol Evol 2015; 80:258-64. [PMID: 25841763 DOI: 10.1007/s00239-015-9676-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/21/2015] [Accepted: 03/29/2015] [Indexed: 10/23/2022]
Abstract
Ancestral recombination graphs (ARGs) represent the history of portions of a genome with recombination. Attempts to infer ARGs have been hampered by the lack of an ARG comparison metric which could be used to measure how well inference succeeded. We propose a simple ARG comparison framework based on averaging standard tree comparison measures across either all sites or variable sites only. Using simulated data, we show that this framework, instantiated with an appropriate tree comparison measure, can distinguish better from worse inferences of an ARG.
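The framework described here, averaging a tree comparison measure across all sites or across variable sites only, can be sketched as follows. This is a minimal illustration under stated assumptions: each ARG is represented as a list of local trees on half-open intervals, each tree is reduced to its set of non-trivial clades, and the Robinson-Foulds (symmetric difference) distance stands in for the tree comparison measure, which the abstract does not name. All names and the toy data are hypothetical.

```python
from typing import FrozenSet, Iterable, Sequence, Tuple

Clade = FrozenSet[str]
# A local tree: half-open interval [start, end) plus its non-trivial clades.
LocalTree = Tuple[int, int, FrozenSet[Clade]]


def tree_at(arg: Sequence[LocalTree], site: int) -> FrozenSet[Clade]:
    """Return the clade set of the local tree covering `site`."""
    for start, end, clades in arg:
        if start <= site < end:
            return clades
    raise ValueError(f"site {site} is not covered by any local tree")


def rf_distance(a: FrozenSet[Clade], b: FrozenSet[Clade]) -> int:
    """Robinson-Foulds distance: clades found in exactly one of the trees."""
    return len(a ^ b)


def mean_tree_distance(arg_a: Sequence[LocalTree],
                       arg_b: Sequence[LocalTree],
                       sites: Iterable[int]) -> float:
    """Average the per-site tree distance over the chosen sites."""
    site_list = list(sites)
    if not site_list:
        raise ValueError("at least one site is required")
    total = sum(rf_distance(tree_at(arg_a, s), tree_at(arg_b, s))
                for s in site_list)
    return total / len(site_list)


def clades(*groups: str) -> FrozenSet[Clade]:
    """Helper: clades('ab', 'abc') -> {{a, b}, {a, b, c}}."""
    return frozenset(frozenset(g) for g in groups)


# Toy example: the reference ARG has a breakpoint at position 50,
# while the inferred ARG uses a single tree for the whole region.
true_arg = [(0, 50, clades("ab", "abc")), (50, 100, clades("ab", "cd"))]
inferred_arg = [(0, 100, clades("ab", "cd"))]

all_sites = mean_tree_distance(true_arg, inferred_arg, range(100))
variable_sites = mean_tree_distance(true_arg, inferred_arg, [10, 60, 70])
print(all_sites, variable_sites)
```

Passing every position gives the all-sites average, while passing only polymorphic positions gives the variable-sites variant; any other tree distance could be substituted for `rf_distance` without changing the averaging step.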
|
24
|
Frost SDW, Pybus OG, Gog JR, Viboud C, Bonhoeffer S, Bedford T. Eight challenges in phylodynamic inference. Epidemics 2015; 10:88-92. [PMID: 25843391 PMCID: PMC4383806 DOI: 10.1016/j.epidem.2014.09.001] [Citation(s) in RCA: 106] [Impact Index Per Article: 11.8] [Received: 02/21/2014] [Revised: 08/30/2014] [Accepted: 09/02/2014] [Indexed: 02/06/2023]
Abstract
The field of phylodynamics, which attempts to enhance our understanding of infectious disease dynamics using pathogen phylogenies, has made great strides in the past decade. Basic epidemiological and evolutionary models are now well characterized with inferential frameworks in place. However, significant challenges remain in extending phylodynamic inference to more complex systems. These challenges include accounting for evolutionary complexities such as changing mutation rates, selection, reassortment, and recombination, as well as epidemiological complexities such as stochastic population dynamics, host population structure, and different patterns at the within-host and between-host scales. An additional challenge exists in making efficient inferences from an ever increasing corpus of sequence data.
Affiliation(s)
- Simon D W Frost
- Department of Veterinary Medicine, University of Cambridge, Cambridge, UK; Institute of Public Health, University of Cambridge, Cambridge, UK
- Julia R Gog
- Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
- Cecile Viboud
- Fogarty International Center, National Institutes of Health, Bethesda, USA
- Trevor Bedford
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA
|
25
|
Bacterial phylogenetic reconstruction from whole genomes is robust to recombination but demographic inference is not. mBio 2014; 5:e02158. [PMID: 25425237 PMCID: PMC4251999 DOI: 10.1128/mbio.02158-14] [Citation(s) in RCA: 90] [Impact Index Per Article: 9.0] [Indexed: 12/24/2022]
Abstract
Phylogenetic inference in bacterial genomics is fundamental to understanding problems such as population history, antimicrobial resistance, and transmission dynamics. The field has been plagued by an apparent state of contradiction since the distorting effects of recombination on phylogeny were discovered more than a decade ago. Researchers persist with detailed phylogenetic analyses while simultaneously acknowledging that recombination seriously misleads inference of population dynamics and selection. Here we resolve this paradox by showing that phylogenetic tree topologies based on whole genomes robustly reconstruct the clonal frame topology but that branch lengths are badly skewed. Surprisingly, removing recombining sites can exacerbate branch length distortion caused by recombination. Phylogenetic tree reconstruction is a popular approach for understanding the relatedness of bacteria in a population from differences in their genome sequences. However, bacteria frequently exchange regions of their genomes by a process called homologous recombination, which violates a fundamental assumption of phylogenetic methods. Since many researchers continue to use phylogenetics for recombining bacteria, it is important to understand how recombination affects the conclusions drawn from these analyses. We find that whole-genome sequences afford great accuracy in reconstructing evolutionary relationships despite concerns surrounding the presence of recombination, but the branch lengths of the phylogenetic tree are indeed badly distorted. Surprisingly, methods to reduce the impact of recombination on branch lengths can exacerbate the problem.
|
26
|
Rasmussen MD, Hubisz MJ, Gronau I, Siepel A. Genome-wide inference of ancestral recombination graphs. PLoS Genet 2014; 10:e1004342. [PMID: 24831947 PMCID: PMC4022496 DOI: 10.1371/journal.pgen.1004342] [Citation(s) in RCA: 179] [Impact Index Per Article: 17.9] [Received: 12/03/2013] [Accepted: 03/17/2014] [Indexed: 01/23/2023]
Abstract
The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the "ancestral recombination graph" (ARG), a complete record of coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n − 1 chromosomes, an operation we call "threading." Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the posterior distribution over ARGs and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age in sites under directional selection. The patterns we observe near protein-coding genes are consistent with a primary influence from background selection rather than hitchhiking, although we cannot rule out a contribution from recurrent selective sweeps.
Affiliation(s)
- Matthew D. Rasmussen
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- * E-mail: (MDR); (AS)
- Melissa J. Hubisz
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- Ilan Gronau
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- Adam Siepel
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambs, United Kingdom
- * E-mail: (MDR); (AS)
|
27
|
Zheng C, Kuhner MK, Thompson EA. Bayesian inference of local trees along chromosomes by the sequential Markov coalescent. J Mol Evol 2014; 78:279-92. [PMID: 24817610 PMCID: PMC4104301 DOI: 10.1007/s00239-014-9620-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 08/22/2013] [Accepted: 04/18/2014] [Indexed: 11/30/2022]
Abstract
We propose a genealogy-sampling algorithm, Sequential Markov Ancestral Recombination Tree (SMARTree), that provides an approach to estimation from SNP haplotype data of the patterns of coancestry across a genome segment among a set of homologous chromosomes. To enable analysis across longer segments of genome, the sequence of coalescent trees is modeled via the modified sequential Markov coalescent (Marjoram and Wall, BMC Genetics 7:16, 2006). To assess performance in estimating these local trees, our SMARTree implementation is tested on simulated data. Our base data set is of the SNPs in 10 DNA sequences over 50 kb. We examine the effects of longer sequences and of more sequences, and of a recombination and/or mutational hotspot. The model underlying SMARTree is an approximation to the full recombinant-coalescent distribution. However, in a small trial on simulated data, recovery of local trees was similar to that of LAMARC (Kuhner et al. Genetics 156:1393-1401, 2000a), a sampler which uses the full model.
Affiliation(s)
- Chaozhi Zheng
- Department of Statistics, Box 354322, University of Washington, Seattle, WA 98115-4322, USA, Tel.: (206) 543-7237, Fax: (206) 685-7419
- Mary K. Kuhner
- Department of Genome Sciences, Box 355065, University of Washington, Seattle, WA 98115-5065, USA, Tel.: (206) 543-8751, Fax: (206) 685-7301
- Elizabeth A. Thompson
- Department of Statistics, Box 354322, University of Washington, Seattle, WA 98115-4322, USA, Tel.: (206) 685-0108, Fax: (206) 685-7419
|
28
|
A framework including recombination for analyzing the dynamics of within-host HIV genetic diversity. PLoS One 2014; 9:e87655. [PMID: 24516557 PMCID: PMC3917834 DOI: 10.1371/journal.pone.0087655] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/24/2013] [Accepted: 12/31/2013] [Indexed: 12/01/2022]
Abstract
This paper presents a novel population genetic model and a computationally and statistically tractable framework for analyzing within-host HIV diversity based on serial samples of HIV DNA sequences. This model considers within-host HIV evolution during the chronic phase of infection and assumes that the HIV population is homogeneous at the beginning, corresponding to the time of seroconversion, and evolves according to the Wright-Fisher reproduction model with recombination and variable mutation rate across nucleotide sites. In addition, the population size and generation time vary over time as piecewise constant functions of time. Under this model I approximate the genealogical and mutational processes for serial samples of DNA sequences by a continuous coalescent-recombination process and an inhomogeneous Poisson process, respectively. Based on these derivations, an efficient algorithm is described for generating polymorphisms in serial samples of DNA sequences under the model, including various substitution models. Extensions of the algorithm are also described for other demographic scenarios that can be more suitable for analyzing the dynamics of genetic diversity of other pathogens in vitro and in vivo. For the case of the infinite-sites model, I derive analytical formulas for the expected number of polymorphic sites in a sample of DNA sequences, and apply the developed simulation and analytical methods to explore the fit of the model to HIV genetic diversity based on serial samples of HIV DNA sequences from 9 HIV-infected individuals. In particular, the results show that the estimates of the ratio of recombination rate to mutation rate can vary over time between very high and low values, which can be considered a consequence of the impact of selection forces.
|
29
|
Lamers SL, Nolan DJ, Strickland SL, Prosperi M, Fogel GB, Goodenow MM, Salemi M. Longitudinal analysis of intra-host simian immunodeficiency virus recombination in varied tissues of the rhesus macaque model for neuroAIDS. J Gen Virol 2013; 94:2469-2479. [PMID: 23963535 DOI: 10.1099/vir.0.055335-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/16/2022]
Abstract
Human immunodeficiency virus intra-host recombination has never been studied in vivo both during early infection and throughout disease progression. The CD8-depleted rhesus macaque model of neuroAIDS was used to investigate the impact of recombination from early infection up to the onset of neuropathology in animals inoculated with a simian immunodeficiency virus (SIV) swarm. Several lymphoid and non-lymphoid tissues were collected longitudinally at 21 days post-infection (p.i.), 61 days p.i. and necropsy (75-118 days p.i.) from four macaques that developed SIV-encephalitis or meningitis, as well as from two animals euthanized at 21 days p.i. The number of recombinant sequences and breakpoints in different tissues and over time from each primate were compared. Breakpoint locations were mapped onto predicted RNA and protein secondary structures. Recombinants were found at each time point and in each primate as early as 21 days p.i. No association was found between recombination rates and specific tissue of origin. Several identical breakpoints were identified in sequences derived from different tissues in the same primate and among different primates. Breakpoints predominantly mapped to unpaired nucleotides or pseudoknots in RNA secondary structures, and proximal to glycosylation sites and cysteine residues in protein sequences, suggesting selective advantage in the emergence of specific recombinant sequences. Results indicate that recombinant sequences can become fixed very early after infection with a heterogeneous viral swarm. Features of RNA and protein secondary structure appear to play a role in driving the production of recombinants and their selection in the rapid disease model of neuroAIDS.
Affiliation(s)
| | - David J Nolan
- Emerging Pathogens Institute, University of Florida, Gainesville, FL 32610, USA.,Department of Pathology, Immunology and Laboratory Medicine, University of Florida, Gainesville, FL 32610, USA
| | - Samantha L Strickland
- Emerging Pathogens Institute, University of Florida, Gainesville, FL 32610, USA.,Department of Pathology, Immunology and Laboratory Medicine, University of Florida, Gainesville, FL 32610, USA
| | - Mattia Prosperi
- Emerging Pathogens Institute, University of Florida, Gainesville, FL 32610, USA.,Department of Pathology, Immunology and Laboratory Medicine, University of Florida, Gainesville, FL 32610, USA
| | - Gary B Fogel
- Natural Selection Inc., San Diego, CA 92121, USA
| | - Maureen M Goodenow
- Department of Pathology, Immunology and Laboratory Medicine, University of Florida, Gainesville, FL 32610, USA
| | - Marco Salemi
- Emerging Pathogens Institute, University of Florida, Gainesville, FL 32610, USA.,Department of Pathology, Immunology and Laboratory Medicine, University of Florida, Gainesville, FL 32610, USA
|
30
|
Thompson EA. Identity by descent: variation in meiosis, across genomes, and in populations. Genetics 2013; 194:301-26. [PMID: 23733848 PMCID: PMC3664843 DOI: 10.1534/genetics.112.148825] [Citation(s) in RCA: 199] [Impact Index Per Article: 18.1] [Received: 12/15/2012] [Accepted: 03/10/2013] [Indexed: 01/04/2023] Open
Abstract
Gene identity by descent (IBD) is a fundamental concept that underlies genetically mediated similarities among relatives. Gene IBD is traced through ancestral meioses and is defined relative to founders of a pedigree, or to some time point or mutational origin in the coalescent of a set of extant genes in a population. The random process underlying changes in the patterns of IBD across the genome is recombination, so the natural context for defining IBD is the ancestral recombination graph (ARG), which specifies the complete ancestry of a collection of chromosomes. The ARG determines both the sequence of coalescent ancestries across the chromosome and the extant segments of DNA descending unbroken by recombination from their most recent common ancestor (MRCA). DNA segments IBD from a recent common ancestor have high probability of being of the same allelic type. Non-IBD DNA is modeled as of independent allelic type, but the population frame of reference for defining allelic independence can vary. Whether of IBD, allelic similarity, or phenotypic covariance, comparisons may be made to other genomic regions of the same gametes, or to the same genomic regions in other sets of gametes or diploid individuals. In this review, I present IBD as the framework connecting evolutionary and coalescent theory with the analysis of genetic data observed on individuals. I focus on the high variance of the processes that determine IBD, its changes across the genome, and its impact on observable data.
Affiliation(s)
- Elizabeth A Thompson
- Department of Statistics, University of Washington, Seattle, WA 98195-4322, USA.
|
31
|
Sousa V, Hey J. Understanding the origin of species with genome-scale data: modelling gene flow. Nat Rev Genet 2013; 14:404-14. [PMID: 23657479 DOI: 10.1038/nrg3446] [Citation(s) in RCA: 181] [Impact Index Per Article: 16.5] [Indexed: 12/16/2022]
Abstract
As it becomes easier to sequence multiple genomes from closely related species, evolutionary biologists working on speciation are struggling to get the most out of very large population genomic data sets. Such data hold the potential to resolve long-standing questions in evolutionary biology about the role of gene exchange in species formation. In principle, the new population genomic data can be used to disentangle the conflicting roles of natural selection and gene flow during the divergence process. However, there are great challenges in taking full advantage of such data, especially with regard to including recombination in genetic models of the divergence process. Current data, models, methods and the potential pitfalls in using them will be considered here.
Affiliation(s)
- Vitor Sousa
- Department of Genetics, Rutgers, the State University of New Jersey, Piscataway, New Jersey 08854, USA
|
32
|
Abstract
Recombination is a fundamental evolutionary force. Therefore the population recombination rate ρ plays an important role in the analysis of population genetic data; however, it is notoriously difficult to estimate. This difficulty applies both to the accuracy of commonly used estimates and to the computational effort required to obtain them. Some particularly popular methods are based on approximations to the likelihood. They require considerably less computational effort than the full-likelihood method with not much less accuracy. Nevertheless, the computation of these approximate estimates can still be very time consuming, in particular when the sample size is large. Although auxiliary quantities for composite likelihood estimates can be computed in advance and stored in tables, these tables need to be recomputed if either the sample size or the mutation rate θ changes. Here we introduce a new method based on regression combined with boosting as a model selection technique. For large samples, it requires much less computational effort than other approximate methods, while providing similar levels of accuracy. Notably, for a sample of hundreds or thousands of individuals, the estimate of ρ using regression can be obtained on a single personal computer within a couple of minutes while other methods may need a couple of days or months (or even years). When the sample size is smaller (n ≤ 50), our new method remains computationally efficient but produces biased estimates. We expect the new estimates to be helpful when analyzing large samples and/or many loci with possibly different mutation rates.
|
33
|
Bhaskar A, Song YS. Closed-form asymptotic sampling distributions under the coalescent with recombination for an arbitrary number of loci. Adv Appl Probab 2012; 44:391-407. [PMID: 22859863 DOI: 10.1239/aap/1339878717] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/18/2022]
Abstract
Obtaining a closed-form sampling distribution for the coalescent with recombination is a challenging problem. In the case of two loci, a new framework based on asymptotic series has recently been developed to derive closed-form results when the recombination rate is moderate to large. In this paper, an arbitrary number of loci is considered and combinatorial approaches are employed to find closed-form expressions for the first couple of terms in an asymptotic expansion of the multi-locus sampling distribution. These expressions are universal in the sense that their functional form in terms of the marginal one-locus distributions applies to all finite- and infinite-alleles models of mutation.
|
34
|
Stopping-time resampling and population genetic inference under coalescent models. Stat Appl Genet Mol Biol 2012; 11:Article 9. [PMID: 22499685 DOI: 10.2202/1544-6115.1770] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
To extract full information from samples of DNA sequence data, it is necessary to use sophisticated model-based techniques such as importance sampling under the coalescent. However, these are limited in the size of datasets they can handle efficiently. Chen and Liu (2000) introduced the idea of stopping-time resampling and showed that it can dramatically improve the efficiency of importance sampling methods under a finite-alleles coalescent model. In this paper, a new framework is developed for designing stopping-time resampling schemes under more general models. It is implemented on data both from infinite sites and stepwise models of mutation, and extended to incorporate crossover recombination. A simulation study shows that this new framework offers a substantial improvement in the accuracy of likelihood estimation over a range of parameters, while a direct application of the scheme of Chen and Liu (2000) can actually diminish the estimate. The method imposes no additional computational burden and is robust to the choice of parameters.
|
35
|
Abstract
Performing inference on contemporary samples of DNA sequence data is an important and challenging task. Computationally intensive methods such as importance sampling (IS) are attractive because they make full use of the available data, but in the presence of recombination the large state space of genealogies can be prohibitive. In this article, we make progress by developing an efficient IS proposal distribution for a two-locus model of sequence data. We show that the proposal developed here leads to much greater efficiency, outperforming existing IS methods that could be adapted to this model. Among several possible applications, the algorithm can be used to find maximum likelihood estimates for mutation and crossover rates, and to perform ancestral inference. We illustrate the method on previously reported sequence data covering two loci either side of the well-studied TAP2 recombination hotspot. The two loci are themselves largely non-recombining, so we obtain a gene tree at each locus and are able to infer in detail the effect of the hotspot on their joint ancestry. We summarize this joint ancestry by introducing the gene graph, a summary of the well-known ancestral recombination graph.
Affiliation(s)
- Paul A Jenkins
- Department of Statistics, University of Oxford, Oxford, United Kingdom.
|
36
|
Grünwald NJ, Goss EM. Evolution and population genetics of exotic and re-emerging pathogens: novel tools and approaches. Annu Rev Phytopathol 2011; 49:249-267. [PMID: 21370974 DOI: 10.1146/annurev-phyto-072910-095246] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.9] [Indexed: 05/30/2023]
Abstract
Given human population growth and accelerated global trade, the rate of emergence of exotic plant pathogens is bound to increase. Understanding the processes that lead to the emergence of new pathogens can help manage emerging epidemics. Novel tools for analyzing population genetic variation can be used to infer the evolutionary history of populations or species, allowing for the unprecedented reconstruction of the demographic history of pathogens. Specifically, recent advances in the application of coalescent, maximum likelihood (ML), and Bayesian methods to population genetic data combined with increasing availability of affordable sequencing and parallel computing have created the opportunity to apply these methods to a broad range of questions regarding the evolution of emerging pathogens. These approaches are particularly powerful when used to test multiple competing hypotheses. We provide several examples illustrating how coalescent analysis provides critical insights into understanding migration pathways as well as processes of divergence, speciation, and recombination.
Affiliation(s)
- Niklaus J Grünwald
- Horticultural Crops Research Laboratory, USDA Agricultural Research Service, Corvallis, Oregon 97330, USA.
|
37
|
A principled approach to deriving approximate conditional sampling distributions in population genetics models with recombination. Genetics 2010; 186:321-38. [PMID: 20592264 DOI: 10.1534/genetics.110.117986] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Indexed: 11/18/2022] Open
Abstract
The multilocus conditional sampling distribution (CSD) describes the probability that an additionally sampled DNA sequence is of a certain type, given that a collection of sequences has already been observed. The CSD has a wide range of applications in both computational biology and population genomics analysis, including phasing genotype data into haplotype data, imputing missing data, estimating recombination rates, inferring local ancestry in admixed populations, and importance sampling of coalescent genealogies. Unfortunately, the true CSD under the coalescent with recombination is not known, so approximations, formulated as hidden Markov models, have been proposed in the past. These approximations have led to a number of useful statistical tools, but it is important to recognize that they were not derived from, though were certainly motivated by, principles underlying the coalescent process. The goal of this article is to develop a principled approach to derive improved CSDs directly from the underlying population genetics model. Our approach is based on the diffusion process approximation and the resulting mathematical expressions admit intuitive genealogical interpretations, which we utilize to introduce further approximations and make our method scalable in the number of loci. The general algorithm presented here applies to an arbitrary number of loci and an arbitrary finite-alleles recurrent mutation model. Empirical results are provided to demonstrate that our new CSDs are in general substantially more accurate than previously proposed approximations.
|
38
|
Jenkins PA, Song YS. An asymptotic sampling formula for the coalescent with recombination. Ann Appl Probab 2010; 20:1005-1028. [PMID: 20671802 DOI: 10.1214/09-aap646] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 11/19/2022]
Abstract
Ewens sampling formula (ESF) is a one-parameter family of probability distributions with a number of intriguing combinatorial connections. This elegant closed-form formula first arose in biology as the stationary probability distribution of a sample configuration at one locus under the infinite-alleles model of mutation. Since its discovery in the early 1970s, the ESF has been used in various biological applications, and has sparked several interesting mathematical generalizations. In the population genetics community, extending the underlying random-mating model to include recombination has received much attention in the past, but no general closed-form sampling formula is currently known even for the simplest extension, that is, a model with two loci. In this paper, we show that it is possible to obtain useful closed-form results in the case the population-scaled recombination rate ρ is large but not necessarily infinite. Specifically, we consider an asymptotic expansion of the two-locus sampling formula in inverse powers of ρ and obtain closed-form expressions for the first few terms in the expansion. Our asymptotic sampling formula applies to arbitrary sample sizes and configurations.
Affiliation(s)
- Paul A Jenkins
- Computer Science Division, University of California, Berkeley, Berkeley, CA 94720, USA
|
39
|
Matsen FA. constNJ: An Algorithm to Reconstruct Sets of Phylogenetic Trees Satisfying Pairwise Topological Constraints. J Comput Biol 2010; 17:799-818. [DOI: 10.1089/cmb.2009.0201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Affiliation(s)
- Frederick A. Matsen
- Program in Computational Biology, Fred Hutchinson Cancer Research Center 1100, Seattle, Washington, USA
|
40
|
Bulla I, Schultz AK, Schreiber F, Zhang M, Leitner T, Korber B, Morgenstern B, Stanke M. HIV classification using the coalescent theory. Bioinformatics 2010; 26:1409-15. [PMID: 20400454 DOI: 10.1093/bioinformatics/btq159] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Existing coalescent models and phylogenetic tools based on them are not designed for studying the genealogy of sequences like those of HIV, since in HIV recombinants with multiple cross-over points between the parental strains frequently arise. Hence, ambiguous cases in the classification of HIV sequences into subtypes and circulating recombinant forms (CRFs) have been treated with ad hoc methods for lack of tools based on a comprehensive coalescent model accounting for complex recombination patterns. RESULTS: We developed the program ARGUS that scores classifications of sequences into subtypes and recombinant forms. It reconstructs ancestral recombination graphs (ARGs) that reflect the genealogy of the input sequences given a classification hypothesis. An ARG with maximal probability is approximated using a Markov chain Monte Carlo approach. ARGUS was able to distinguish the correct classification with a low error rate from plausible alternative classifications in simulation studies with realistic parameters. We applied our algorithm to decide between two recently debated alternatives in the classification of CRF02 of HIV-1 and find that CRF02 is indeed a recombinant of Subtypes A and G. AVAILABILITY: ARGUS is implemented in C++ and the source code is available at http://gobics.de/software.
Affiliation(s)
- Ingo Bulla
- Abteilung Bioinformatik, Institut für Mikrobiologie und Genetik, Georg-August-Universität Göttingen, Goldschmidtstrasse 1, 37077 Göttingen, Germany.
|
41
|
Bloomquist EW, Suchard MA. Unifying vertical and nonvertical evolution: a stochastic ARG-based framework. Syst Biol 2009; 59:27-41. [PMID: 20525618 DOI: 10.1093/sysbio/syp076] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.1] [Indexed: 11/13/2022] Open
Abstract
Evolutionary biologists have introduced numerous statistical approaches to explore nonvertical evolution, such as horizontal gene transfer, recombination, and genomic reassortment, through collections of Markov-dependent gene trees. These tree collections allow for inference of nonvertical evolution, but only indirectly, making findings difficult to interpret and models difficult to generalize. An alternative approach to explore nonvertical evolution relies on phylogenetic networks. These networks provide a framework to model nonvertical evolution but leave unanswered questions such as the statistical significance of specific nonvertical events. In this paper, we begin to correct the shortcomings of both approaches by introducing the "stochastic model for reassortment and transfer events" (SMARTIE) drawing upon ancestral recombination graphs (ARGs). ARGs are directed graphs that allow for formal probabilistic inference on vertical speciation events and nonvertical evolutionary events. We apply SMARTIE to phylogenetic data. Because of this, we can typically infer a single most probable ARG, avoiding coarse population dynamic summary statistics. In addition, a focus on phylogenetic data suggests novel probability distributions on ARGs. To make inference with our model, we develop a reversible jump Markov chain Monte Carlo sampler to approximate the posterior distribution of SMARTIE. Using the BEAST phylogenetic software as a foundation, the sampler employs a parallel computing approach that allows for inference on large-scale data sets. To demonstrate SMARTIE, we explore 2 separate phylogenetic applications, one involving pathogenic Leptospirochete and the other Saccharomyces.
Affiliation(s)
- Erik W Bloomquist
- Department of Biostatistics, UCLA School of Public Health, Los Angeles, CA 90095, USA
|
42
|
Abstract
Genome assemblies are now available for nine primate species, and large-scale sequencing projects are underway or approved for six others. An explicitly evolutionary and phylogenetic approach to comparative genomics, called phylogenomics, will be essential in unlocking the valuable information about evolutionary history and genomic function that is contained within these genomes. However, most phylogenomic analyses so far have ignored the effects of variation in ancestral populations on patterns of sequence divergence. These effects can be pronounced in the primates, owing to large ancestral effective population sizes relative to the intervals between speciation events. In particular, local genealogies can vary considerably across loci, which can produce biases and diminished power in many phylogenomic analyses of interest, including phylogeny reconstruction, the identification of functional elements, and the detection of natural selection. At the same time, this variation in genealogies can be exploited to gain insight into the nature of ancestral populations. In this Perspective, I explore this area of intersection between phylogenetics and population genetics, and its implications for primate phylogenomics. I begin by "lifting the hood" on the conventional tree-like representation of the phylogenetic relationships between species, to expose the population-genetic processes that operate along its branches. Next, I briefly review an emerging literature that makes use of the complex relationships among coalescence, recombination, and speciation to produce inferences about evolutionary histories, ancestral populations, and natural selection. Finally, I discuss remaining challenges and future prospects at this nexus of phylogenetics, population genetics, and genomics.
Affiliation(s)
- Adam Siepel
- Department of Biological Statistics and Computational Biology, Cornell Center for Comparative and Population Genomics, Cornell University, Ithaca, New York 14853, USA.
|
43
|
Abstract
Sampling distributions play an important role in population genetics analyses, but closed-form sampling formulas are generally intractable to obtain. In the presence of recombination, there is no known closed-form sampling formula that holds for an arbitrary recombination rate. However, we recently showed that it is possible to obtain useful closed-form sampling formulas when the population-scaled recombination rate ρ is large. Specifically, in the case of the two-locus infinite-alleles model, we considered an asymptotic expansion of the sampling formula in inverse powers of ρ and obtained closed-form expressions for the first few terms in the expansion. In this article, we generalize this result to an arbitrary finite-alleles mutation model and show that, up to the first few terms in the expansion that we are able to compute analytically, the functional form of the asymptotic sampling formula is common to all mutation models. We carry out an extensive study of the accuracy of the asymptotic formula for the two-locus parent-independent mutation model and discuss in detail a concrete application in the context of the composite-likelihood method. Furthermore, using our asymptotic sampling formula, we establish a simple sufficient condition for a given two-locus sample configuration to have a finite maximum-likelihood estimate (MLE) of ρ. This condition is the first analytic result on the classification of the MLE of ρ and is instantaneous to check in practice, provided that one-locus probabilities are known.
|
44
|
Holsinger KE, Weir BS. Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet 2009; 10:639-50. [PMID: 19687804 DOI: 10.1038/nrg2611] [Citation(s) in RCA: 792] [Impact Index Per Article: 52.8] [Indexed: 02/07/2023]
Abstract
Wright's F-statistics, and especially F(ST), provide important insights into the evolutionary processes that influence the structure of genetic variation within and among populations, and they are among the most widely used descriptive statistics in population and evolutionary genetics. Estimates of F(ST) can identify regions of the genome that have been the target of selection, and comparisons of F(ST) from different parts of the genome can provide insights into the demographic history of populations. For these reasons and others, F(ST) has a central role in population and evolutionary genetics and has wide applications in fields that range from disease association mapping to forensic science. This Review clarifies how F(ST) is defined, how it should be estimated, how it is related to similar statistics and how estimates of F(ST) should be interpreted.
Affiliation(s)
- Kent E Holsinger
- Department of Ecology and Evolutionary Biology, U-3043, University of Connecticut, Storrs, Connecticut 06269-3043, USA.
|
45
|
Abstract
When a novel genetic trait arises in a population, it introduces a signal in the haplotype distribution of that population. Through recombination that signal's history becomes differentiated from the DNA distant to it, but remains similar to the DNA close by. Fine-scale mapping techniques rely on this differentiation to pinpoint trait loci. In this study, we analyzed the differentiation itself to better understand how much information is available to these techniques. Simulated alleles on known recombinant coalescent trees show the upper limit for fine-scale mapping. Varying characteristics of the population being studied increase or decrease this limit. The initial uncertainty in map position has the most direct influence on the final precision of the estimate, with wider initial areas resulting in wider final estimates, though the increase is sigmoidal rather than linear. The θ of the trait (4Nμ) is also important, with lower values of θ resulting in greater precision of trait placement up to a point; the increase is sigmoidal as θ decreases. Collecting data from more individuals can increase precision, though only logarithmically with the total number of individuals, so that each added individual contributes less to the final precision. However, a case/control analysis has the potential to greatly increase the effective number of individuals, as the bulk of the information lies in the differential between affected and unaffected genotypes. If haplotypes are unknown due to incomplete penetrance, much information is lost, with more information lost the less indicative phenotype is of the underlying genotype.
Affiliation(s)
- Lucian P Smith
- Department of Genome Sciences, University of Washington, Seattle, WA 98195-5065, USA.
|
46
|
Wang Y, Rannala B. Bayesian inference of fine-scale recombination rates using population genomic data. Philos Trans R Soc Lond B Biol Sci 2009; 363:3921-30. [PMID: 18852101 DOI: 10.1098/rstb.2008.0172] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.6] [Indexed: 02/03/2023] Open
Abstract
Recently, several statistical methods for estimating fine-scale recombination rates using population samples have been developed. However, currently available methods that can be applied to large-scale data are limited to approximated likelihoods. Here, we developed a full-likelihood Markov chain Monte Carlo method for estimating recombination rate under a Bayesian framework. Genealogies underlying a sampling of chromosomes are effectively modelled by using marginal individual single nucleotide polymorphism genealogies related through an ancestral recombination graph. The method is compared with two existing composite-likelihood methods using simulated data. Simulation studies show that our method performs well for different simulation scenarios. The method is applied to two human population genetic variation datasets that have been studied by sperm typing. Our results are consistent with the estimates from sperm crossover analysis.
Affiliation(s)
- Ying Wang
- Genome Center, University of California Davis, One Shields Avenue, Davis, CA 95616, USA
|
47
|
Abstract
As more human genomic data become available, fine-scale recombination rate variation can be inferred on a genome-wide scale. Current statistical methods to infer recombination rates that can be applied to moderate, or large, genomic regions are limited to approximated likelihoods. Here, we develop a Bayesian full-likelihood method using Markov Chain Monte Carlo (MCMC) to estimate background recombination rates and hotspots. The probability model is inspired by the observed patterns of recombination at several genomic regions analyzed in sperm-typing studies. Posterior probabilities and Bayes factors of recombination hotspots along chromosomes are inferred. For moderate-size genomic regions (e.g., with <100 SNPs), the full-likelihood method is used. Larger regions are split into subintervals (typically each having between 20 and 50 markers). The likelihood is approximated based on the genealogies for each subinterval. The background recombination rates, hotspots, and parameters are evaluated by using a parallel computing approach and assuming shared parameters across the subintervals. Simulation analyses show that our method can accurately estimate the variation in recombination rates across genomic regions. In particular, clusters of hotspots can be distinguished even though weaker hotspots are present. The method is applied to SNP data from the HLA region, the MS32, and chromosome 19.
|
48
|
Lee JY, Edwards SV. Divergence across Australia's Carpentarian Barrier: statistical phylogeography of the red-backed fairy wren (Malurus melanocephalus). Evolution 2008; 62:3117-34. [DOI: 10.1111/j.1558-5646.2008.00543.x] [Citation(s) in RCA: 140] [Impact Index Per Article: 8.8] [Indexed: 02/02/2023]
|
49
|
Estimating the recombination parameter: a commentary on 'Estimating the recombination parameter of a finite population model without selection' by Richard R. Hudson. Genet Res (Camb) 2008; 89:425-6. [PMID: 18976530 DOI: 10.1017/s0016672308009622] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022] Open
|
50
|
Janes DE, Ezaz T, Marshall Graves JA, Edwards SV. Recombination and nucleotide diversity in the sex chromosomal pseudoautosomal region of the emu, Dromaius novaehollandiae. J Hered 2008; 100:125-36. [PMID: 18775880 DOI: 10.1093/jhered/esn065] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022] Open
Abstract
Pseudoautosomal regions (PARs) shared by avian Z and W sex chromosomes are typically small homologous regions within which recombination still occurs and are hypothesized to share the properties of autosomes. We capitalized on the unusual structure of the sex chromosomes of emus, Dromaius novaehollandiae, which consist almost entirely of PAR shared by both sex chromosomes, to test this hypothesis. We compared recombination, linkage disequilibrium (LD), GC content, and nucleotide diversity between pseudoautosomal and autosomal loci derived from 11 emu bacterial artificial chromosome (BAC) clones that were mapped to chromosomes by fluorescent in situ hybridization. Nucleotide diversity (π = 4Nₑμ) was not significantly lower in pseudoautosomal loci (14 loci, 1.9 ± 2.4 × 10⁻³) than autosomal loci (8 loci, 4.2 ± 6.1 × 10⁻³). By contrast, recombination per site within BAC-end sequences (ρ = 4Nc) (pseudoautosomal, 3.9 ± 6.9 × 10⁻²; autosomal, 2.3 ± 3.7 × 10⁻²) was higher and average LD (D') (pseudoautosomal, 4.2 ± 0.2 × 10⁻¹; autosomal, 4.7 ± 0.5 × 10⁻¹) slightly lower in pseudoautosomal sequences. We also report evidence of deviation from a simple neutral model in the PAR and in autosomal loci, possibly caused by departures from demographic equilibrium, such as population growth. This study provides a snapshot of the population genetics of avian sex chromosomes at an early stage of differentiation.
Affiliation(s)
- Daniel E Janes
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.
|