1
Spence JP, Zeng T, Mostafavi H, Pritchard JK. Scaling the discrete-time Wright-Fisher model to biobank-scale datasets. Genetics 2023; 225:iyad168. [PMID: 37724741 PMCID: PMC10627256 DOI: 10.1093/genetics/iyad168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 06/01/2023] [Accepted: 09/08/2023] [Indexed: 09/21/2023] Open
Abstract
The discrete-time Wright-Fisher (DTWF) model and its diffusion limit are central to population genetics. These models can describe the forward-in-time evolution of allele frequencies in a population resulting from genetic drift, mutation, and selection. Computing likelihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large samples or in the presence of strong selection. Existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here, we present a scalable algorithm that approximates the DTWF model with provably bounded error. Our approach relies on two key observations about the DTWF model. The first is that transition probabilities under the model are approximately sparse. The second is that transition distributions for similar starting allele frequencies are extremely close as distributions. Together, these observations enable approximate matrix-vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for Hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the tens of millions, paving the way for rigorous biobank-scale inference. Finally, we use our results to estimate the impact of larger samples on estimating selection coefficients for loss-of-function variants. We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects.
Affiliation(s)
- Jeffrey P Spence
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Tony Zeng
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Jonathan K Pritchard
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Department of Biology, Stanford University, Stanford, CA 94305, USA
2
Spence JP, Zeng T, Mostafavi H, Pritchard JK. Scaling the Discrete-time Wright Fisher model to biobank-scale datasets. bioRxiv: the preprint server for biology 2023:2023.05.19.541517. [PMID: 37293115 PMCID: PMC10245735 DOI: 10.1101/2023.05.19.541517] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/10/2023]
Abstract
The Discrete-Time Wright Fisher (DTWF) model and its large population diffusion limit are central to population genetics. These models describe the forward-in-time evolution of the frequency of an allele in a population and can include the fundamental forces of genetic drift, mutation, and selection. Computing likelihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large sample sizes or in the presence of strong selection. Unfortunately, existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here we present an algorithm that approximates the DTWF model with provably bounded error and runs in time linear in the size of the population. Our approach relies on two key observations about Binomial distributions. The first is that Binomial distributions are approximately sparse. The second is that Binomial distributions with similar success probabilities are extremely close as distributions, allowing us to approximate the DTWF Markov transition matrix as a very low rank matrix. Together, these observations enable matrix-vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for Hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the billions, paving the way for rigorous biobank-scale population genetic inference. Finally, we use our results to estimate how increasing sample sizes will improve the estimation of selection coefficients acting on loss-of-function variants. We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects.
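Note: the first observation in this abstract, that Binomial distributions are approximately sparse, can be illustrated with a small sketch. It is not the authors' algorithm. It assumes a neutral DTWF step with no mutation or selection, so the next-generation allele count is Binomial(n, i/n), and it keeps only the counts within a fixed number of standard deviations of the mean. The function names and the `width` parameter are hypothetical, chosen for illustration.

```python
import math


def binom_logpmf(n, k, p):
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def sparse_transition_row(n, i, width=8.0):
    """One row of a neutral DTWF transition matrix, truncated to a window.

    Returns (lo, probs), where probs[j] is the probability of moving from
    allele count i to count lo + j in a population of n haploids.
    Assumption for illustration only: no mutation and no selection, so the
    next count is Binomial(n, i / n). Counts farther than `width` standard
    deviations from the mean are dropped.
    """
    p = i / n
    if p == 0.0 or p == 1.0:
        # Absorbing states: the allele is lost or fixed and stays that way.
        return i, [1.0]
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    lo = max(0, math.floor(mean - width * sd))
    hi = min(n, math.ceil(mean + width * sd))
    probs = [math.exp(binom_logpmf(n, k, p)) for k in range(lo, hi + 1)]
    return lo, probs


if __name__ == "__main__":
    n, i = 100_000, 1_000
    lo, probs = sparse_transition_row(n, i)
    # The window is a small fraction of the n + 1 possible counts,
    # yet it holds almost all of the probability mass.
    print(len(probs), n + 1, sum(probs))
```

In this toy setting each row has on the order of sqrt(n) stored entries rather than n + 1, which is the intuition behind the sparsity claim. The low-rank observation and the error bounds described in the abstract are not reproduced here.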
Affiliation(s)
- Tony Zeng
- Department of Genetics, Stanford University
- Jonathan K. Pritchard
- Department of Genetics, Stanford University
- Department of Biology, Stanford University
3
Dokmai N, Kockan C, Zhu K, Wang X, Sahinalp SC, Cho H. Privacy-preserving genotype imputation in a trusted execution environment. Cell Syst 2021; 12:983-993.e7. [PMID: 34450045 PMCID: PMC8542641 DOI: 10.1016/j.cels.2021.08.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/02/2021] [Revised: 07/14/2021] [Accepted: 08/02/2021] [Indexed: 01/02/2023]
Abstract
Genotype imputation is an essential tool in genomics research, whereby missing genotypes are inferred using reference genomes to enhance downstream analyses. Recently, public imputation servers have allowed researchers to leverage large-scale genomic data resources for imputation. However, privacy concerns about uploading one's genetic data to a server limit the utility of these services. We introduce a secure hardware-based solution for privacy-preserving genotype imputation, which keeps the input genomes private by processing them within Intel SGX's trusted execution environment. Our solution features SMac, an efficient and secure imputation algorithm designed for Intel SGX, which employs a state-of-the-art imputation strategy also utilized by existing imputation servers. SMac achieves imputation accuracy equivalent to existing tools and provides protection against known side-channel attacks on SGX while maintaining scalability. We also show the necessity of our enhanced security by identifying vulnerabilities in existing imputation software. Our work represents a step toward privacy-preserving genomic analysis services.
Affiliation(s)
- Natnatee Dokmai
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Can Kockan
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA; Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Kaiyuan Zhu
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA; Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
- XiaoFeng Wang
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA
- S Cenk Sahinalp
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Hyunghoon Cho
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
4
Zheng C, Tan L, Sang M, Ye M, Wu R. Genetic adaptation of Tibetan poplar (Populus szechuanica var. tibetica) to high altitudes on the Qinghai-Tibetan Plateau. Ecol Evol 2020; 10:10974-10985. [PMID: 33144942 PMCID: PMC7593140 DOI: 10.1002/ece3.6508] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/22/2019] [Revised: 05/14/2020] [Accepted: 05/28/2020] [Indexed: 12/26/2022] Open
Abstract
Plant adaptation to high altitudes has long been a substantial focus of ecological and evolutionary research. However, the genetic mechanisms underlying such adaptation remain poorly understood. Here, we address this issue by sampling, genotyping, and comparing populations of Tibetan poplar, Populus szechuanica var. tibetica, distributed from low (~2,000 m) to high altitudes (~3,000 m) of Sejila Mountain on the Qinghai-Tibet Plateau. Population structure analyses allow clear classification of two groups according to their altitudinal distributions. However, in contrast to the genetic variation within each population, differences between the two populations only explain a small portion of the total genetic variation (3.64%). We identified asymmetrical gene flow from high- to low-altitude populations. Integrating population genomic and landscape genomic analyses, we detected two hotspot regions, one containing four genes associated with altitudinal variation, and the other containing ten genes associated with response to solar radiation. These genes participate in abiotic stress resistance and regulation of reproductive processes. Our results provide insight into the genetic mechanisms underlying high-altitude adaptation in Tibetan poplar.
Affiliation(s)
- Chenfei Zheng
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Center for Computational Biology, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, China
- Lizhi Tan
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Center for Computational Biology, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, China
- Mengmeng Sang
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Center for Computational Biology, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, China
- Meixia Ye
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Center for Computational Biology, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, China
- Rongling Wu
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Center for Computational Biology, College of Biological Sciences and Technology, Beijing Forestry University, Beijing, China
- Center for Statistical Genetics, Pennsylvania State University, Hershey, PA, USA
5
Steinrücken M, Kamm J, Spence JP, Song YS. Inference of complex population histories using whole-genome sequences from multiple populations. Proc Natl Acad Sci U S A 2019; 116:17115-17120. [PMID: 31387977 PMCID: PMC6708337 DOI: 10.1073/pnas.1905060116] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Indexed: 11/18/2022] Open
Abstract
There has been much interest in analyzing genome-scale DNA sequence data to infer population histories, but inference methods developed hitherto are limited in model complexity and computational scalability. Here we present an efficient, flexible statistical method, diCal2, that can use whole-genome sequence data from multiple populations to infer complex demographic models involving population size changes, population splits, admixture, and migration. Applying our method to data from Australian, East Asian, European, and Papuan populations, we find that the population ancestral to Australians and Papuans started separating from East Asians and Europeans about 100,000 y ago, and that the separation of East Asians and Europeans started about 50,000 y ago, with pervasive gene flow between all pairs of populations.
Affiliation(s)
- Matthias Steinrücken
- Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637
- Department of Human Genetics, University of Chicago, Chicago, IL 60637
- Jack Kamm
- Department of Statistics, University of California, Berkeley, CA 94720
- Chan Zuckerberg Biohub, San Francisco, CA 94158
- Jeffrey P Spence
- Computational Biology Graduate Group, University of California, Berkeley, CA 94720
- Yun S Song
- Department of Statistics, University of California, Berkeley, CA 94720
- Chan Zuckerberg Biohub, San Francisco, CA 94158
- Computer Science Division, University of California, Berkeley, CA 94720
6
Spence JP, Steinrücken M, Terhorst J, Song YS. Inference of population history using coalescent HMMs: review and outlook. Curr Opin Genet Dev 2018; 53:70-76. [PMID: 30056275 PMCID: PMC6296859 DOI: 10.1016/j.gde.2018.07.002] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 05/16/2018] [Revised: 07/08/2018] [Accepted: 07/09/2018] [Indexed: 01/02/2023]
Abstract
Studying how diverse human populations are related is of historical and anthropological interest, in addition to providing a realistic null model for testing for signatures of natural selection or disease associations. Furthermore, understanding the demographic histories of other species is playing an increasingly important role in conservation genetics. A number of statistical methods have been developed to infer population demographic histories using whole-genome sequence data, with recent advances focusing on allowing for more flexible modeling choices, scaling to larger data sets, and increasing statistical power. Here we review coalescent hidden Markov models, a powerful class of population genetic inference methods that can utilize linkage disequilibrium information effectively. We highlight recent advances, give advice for practitioners, point out potential pitfalls, and present possible future research directions.
Affiliation(s)
- Jeffrey P Spence
- Computational Biology Graduate Group, University of California, Berkeley, United States
- Yun S Song
- Computer Science Division and Department of Statistics, University of California, Berkeley, United States; Chan Zuckerberg Biohub, San Francisco, United States
7
Abstract
With the advent of sequencing techniques population genomics took a major shift. The structure of data sets has evolved from a sample of a few loci in the genome, sequenced in dozens of individuals, to collections of complete genomes, virtually comprising all available loci. Initially sequenced in a few individuals, such genomic data sets are now reaching and even exceeding the size of traditional data sets in the number of haplotypes sequenced. Because all loci in a genome are not independent, this evolution of data sets is mirrored by a methodological change. The evolutionary processes that generate the observed sequences are now modeled spatially along genomes whereas it was previously described temporally (either in a forward or backward manner). Although the spatial process of sequence evolution is complex, approximations to the model feature Markovian properties, permitting efficient inference. In this chapter, we introduce these recent developments that enable the modeling of the evolutionary history of a sample of several individual genomes. Such models assume the occurrence of meiotic recombination, and therefore, to date, they are dedicated to the analysis of eukaryotic species.
8
Landscape Genomics: Understanding Relationships Between Environmental Heterogeneity and Genomic Characteristics of Populations. ACTA ACUST UNITED AC 2017. [DOI: 10.1007/13836_2017_2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 12/21/2022]
9
Terhorst J, Kamm JA, Song YS. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat Genet 2017; 49:303-309. [PMID: 28024154 PMCID: PMC5470542 DOI: 10.1038/ng.3748] [Citation(s) in RCA: 397] [Impact Index Per Article: 56.7] [Received: 09/16/2016] [Accepted: 11/23/2016] [Indexed: 12/20/2022]
Abstract
It has recently been demonstrated that inference methods based on genealogical processes with recombination can uncover past population history in unprecedented detail. However, these methods scale poorly with sample size, limiting resolution in the recent past, and they require phased genomes, which contain switch errors that can catastrophically distort the inferred history. Here we present SMC++, a new statistical tool capable of analyzing orders of magnitude more samples than existing methods while requiring only unphased genomes (its results are independent of phasing). SMC++ can jointly infer population size histories and split times in diverged populations, and it employs a novel spline regularization scheme that greatly reduces estimation error. We apply SMC++ to analyze sequence data from over a thousand human genomes in Africa and Eurasia, hundreds of genomes from a Drosophila melanogaster population in Africa, and tens of genomes from zebra finch and long-tailed finch populations in Australia.
Affiliation(s)
- Jonathan Terhorst
- Department of Statistics, University of California, Berkeley, Berkeley, California, USA
- John A Kamm
- Department of Statistics, University of California, Berkeley, Berkeley, California, USA
- Computer Science Division, University of California, Berkeley, Berkeley, California, USA
- Yun S Song
- Department of Statistics, University of California, Berkeley, Berkeley, California, USA
- Computer Science Division, University of California, Berkeley, Berkeley, California, USA
- Department of Integrative Biology, University of California, Berkeley, Berkeley, California, USA
- Departments of Biology and Mathematics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
10
Next-generation genotype imputation service and methods. Nat Genet 2016; 48:1284-1287. [PMID: 27571263 DOI: 10.1038/ng.3656] [Citation(s) in RCA: 2337] [Impact Index Per Article: 292.1] [Received: 01/27/2016] [Accepted: 08/02/2016] [Indexed: 02/07/2023]
Abstract
Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.
11
Estimating variable effective population sizes from multiple genomes: a sequentially Markov conditional sampling distribution approach. Genetics 2013; 194:647-62. [PMID: 23608192 DOI: 10.1534/genetics.112.149096] [Citation(s) in RCA: 124] [Impact Index Per Article: 11.3] [Indexed: 01/04/2023] Open
Abstract
Throughout history, the population size of modern humans has varied considerably due to changes in environment, culture, and technology. More accurate estimates of population size changes, and when they occurred, should provide a clearer picture of human colonization history and help remove confounding effects from natural selection inference. Demography influences the pattern of genetic variation in a population, and thus genomic data of multiple individuals sampled from one or more present-day populations contain valuable information about the past demographic history. Recently, Li and Durbin developed a coalescent-based hidden Markov model, called the pairwise sequentially Markovian coalescent (PSMC), for a pair of chromosomes (or one diploid individual) to estimate past population sizes. This is an efficient, useful approach, but its accuracy in the very recent past is hampered by the fact that, because of the small sample size, only few coalescence events occur in that period. Multiple genomes from the same population contain more information about the recent past, but are also more computationally challenging to study jointly in a coalescent framework. Here, we present a new coalescent-based method that can efficiently infer population size changes from multiple genomes, providing access to a new store of information about the recent past. Our work generalizes the recently developed sequentially Markov conditional sampling distribution framework, which provides an accurate approximation of the probability of observing a newly sampled haplotype given a set of previously sampled haplotypes. Simulation results demonstrate that we can accurately reconstruct the true population histories, with a significant improvement over the PSMC in the recent past. We apply our method, called diCal, to the genomes of multiple human individuals of European and African ancestry to obtain a detailed population size change history during recent times.
12
Steinrücken M, Paul JS, Song YS. A sequentially Markov conditional sampling distribution for structured populations with migration and recombination. Theor Popul Biol 2012; 87:51-61. [PMID: 23010245 DOI: 10.1016/j.tpb.2012.08.004] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Received: 04/04/2012] [Revised: 08/20/2012] [Accepted: 08/28/2012] [Indexed: 10/27/2022]
Abstract
Conditional sampling distributions (CSDs), sometimes referred to as copying models, underlie numerous practical tools in population genomic analyses. Though an important application that has received much attention is the inference of population structure, the explicit exchange of migrants at specified rates has not hitherto been incorporated into the CSD in a principled framework. Recently, in the case of a single panmictic population, a sequentially Markov CSD has been developed as an accurate, efficient approximation to a principled CSD derived from the diffusion process dual to the coalescent with recombination. In this paper, the sequentially Markov CSD framework is extended to incorporate subdivided population structure, thus providing an efficiently computable CSD that admits a genealogical interpretation related to the structured coalescent with migration and recombination. As a concrete application, it is demonstrated empirically that the CSD developed here can be employed to yield accurate estimation of a wide range of migration rates.