1
Subedi S, Sumida TS, Park YP. A scalable approach to topic modelling in single-cell data by approximate pseudobulk projection. Life Sci Alliance 2024; 7:e202402713. [PMID: 39107066 PMCID: PMC11303850 DOI: 10.26508/lsa.202402713] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2024] [Revised: 07/29/2024] [Accepted: 07/30/2024] [Indexed: 08/09/2024]
Abstract
Probabilistic topic modelling has become essential in many types of single-cell data analysis. Based on probabilistic topic assignments in each cell, we identify the latent representation of cellular states. A dictionary matrix, consisting of topic-specific gene frequency vectors, provides interpretable bases to be compared with known cell type-specific marker genes and other pathway annotations. However, fitting a topic model on a large number of cells would require heavy computational resources: specialized computing units, computing time, and memory. Here, we present a scalable approximation method customized for single-cell RNA-seq data analysis, termed ASAP, short for Annotating a Single-cell data matrix by Approximate Pseudobulk estimation. Our approach is more accurate than existing methods while requiring orders of magnitude less computing time and much less memory. We also show that our approach is widely applicable for atlas-scale data analysis; our method seamlessly integrates single-cell and bulk data in joint analysis, not requiring additional preprocessing or feature selection steps.
Affiliation(s)
- Sishir Subedi
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, Canada (https://ror.org/03rmrcq20)
  - BC Cancer Research, Vancouver, Canada
- Tomokazu S Sumida
  - Neurology, Program for Neuroinflammation, Yale School of Medicine, New Haven, CT, USA
- Yongjin P Park
  - BC Cancer Research, Vancouver, Canada
  - Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, Canada
  - Department of Statistics, University of British Columbia, Vancouver, Canada
2
Huang D, Niu S, Bai D, Zhao Z, Li C, Deng X, Wang Y. Analysis of population structure and genetic diversity of Camellia tachangensis in Guizhou based on SNP markers. Mol Biol Rep 2024; 51:715. [PMID: 38824248 PMCID: PMC11144125 DOI: 10.1007/s11033-024-09632-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2023] [Accepted: 05/10/2024] [Indexed: 06/03/2024]
Abstract
BACKGROUND: Camellia tachangensis F. C. Zhang is a tea group species with a five-compartment ovary and represents the original germline of early differentiation of some tea group plants. METHODS AND RESULTS: In this study, we used the genotyping-by-sequencing (GBS) method to analyze genome-wide single-nucleotide polymorphisms (SNPs) in 100 C. tachangensis accessions, constructed a phylogenetic tree, analyzed genetic diversity, and investigated population structure. A total of 91,959 high-quality SNPs were obtained. Population structure analysis showed that the 100 C. tachangensis accessions clustered into three groups: YQ-1 (Village Group), YQ-2 (Forest Group) and YQ-3 (Transition Group), consistent with the results of phylogenetic analysis and principal component analysis (PCA). In addition, a comparative analysis of genetic diversity among the three populations (Forest, Village, and Transition Groups) detected the highest genetic diversity in the Transition Group and the highest differentiation between the Forest and Village Groups. CONCLUSIONS: C. tachangensis plants growing in the forest had different genetic backgrounds from those growing in villages. This study provides a basis for the effective protection and utilization of C. tachangensis populations and lays a foundation for future C. tachangensis breeding.
Grants
- (2021YFD1200203-1) Project of the National Key R&D Plan
- (32060700) Project of the National Science Foundation, in PR China
- (2023009) the National Guidance Foundation for Local Science and Technology Development of China
- (Construction Technology Contract [2023] ·48-21) Guiyang Science and Technology Plan Project
- (KY [20211·042) Project of the key field project of Natural Science Foundation of Guizhou Provincial Department of Education
- ([2021] General 126) Science and Technology Plan Project of Guizhou Province, in PR China
Affiliation(s)
- Dejun Huang
  - Institute of Tea, Guizhou University, Jiaxiu South Road, Guiyang, Guizhou, China
- Suzhen Niu
  - Institute of Tea, Guizhou University, Jiaxiu South Road, Guiyang, Guizhou, China
  - Institute of Agro-Bioengineering, Guizhou University, Xueshi Road, Guiyang, Guizhou, China
- Dingchen Bai
  - Institute of Tea, Guizhou University, Jiaxiu South Road, Guiyang, Guizhou, China
- Zhifei Zhao
  - Institute of Tea, Guizhou University, Jiaxiu South Road, Guiyang, Guizhou, China
- Caiyun Li
  - Institute of Tea, Guizhou University, Jiaxiu South Road, Guiyang, Guizhou, China
- Xiuling Deng
  - Institute of Tea, Guizhou University, Jiaxiu South Road, Guiyang, Guizhou, China
- Yihan Wang
  - Institute of Tea, Guizhou University, Jiaxiu South Road, Guiyang, Guizhou, China
3
Mantes AD, Montserrat DM, Bustamante CD, Giró-i-Nieto X, Ioannidis AG. Neural ADMIXTURE for rapid genomic clustering. Nature Computational Science 2023; 3:621-629. [PMID: 37600116 PMCID: PMC10438426 DOI: 10.1038/s43588-023-00482-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 02/25/2022] [Accepted: 06/06/2023] [Indexed: 08/22/2023]
Abstract
Characterizing the genetic structure of large cohorts has become increasingly important as genetic studies extend to massive, increasingly diverse biobanks. Popular methods decompose individual genomes into fractional cluster assignments, with each cluster representing a vector of DNA variant frequencies. However, with rapidly increasing biobank sizes, these methods have become computationally intractable. Here we present Neural ADMIXTURE, a neural network autoencoder that follows the same modeling assumptions as the current standard algorithm, ADMIXTURE, while reducing the compute time by orders of magnitude, surpassing even the fastest alternatives. One month of continuous compute using ADMIXTURE can be reduced to just hours with Neural ADMIXTURE. A multi-head approach allows Neural ADMIXTURE to offer even further acceleration by calculating multiple cluster numbers in a single run. Furthermore, the models can be stored, allowing cluster assignment to be performed on new data in linear time without needing to share the training samples.
Affiliation(s)
- Albert Dominguez Mantes
  - Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, United States
  - Signal Theory and Communications Department, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
  - School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Vaud, Switzerland
- Daniel Mas Montserrat
  - Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, United States
- Xavier Giró-i-Nieto
  - Signal Theory and Communications Department, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
- Alexander G. Ioannidis
  - Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, United States
  - Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, United States
4
Ko S, Chu BB, Peterson D, Okenwa C, Papp JC, Alexander DH, Sobel EM, Zhou H, Lange KL. Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets. Am J Hum Genet 2023; 110:314-325. [PMID: 36610401 PMCID: PMC9943729 DOI: 10.1016/j.ajhg.2022.12.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Accepted: 12/12/2022] [Indexed: 01/09/2023]
Abstract
Admixture estimation plays a crucial role in ancestry inference and genome-wide association studies (GWASs). Computer programs such as ADMIXTURE and STRUCTURE are commonly employed to estimate the admixture proportions of sample individuals. However, these programs can be overwhelmed by the computational burdens imposed by the 10^5 to 10^6 samples and millions of markers commonly found in modern biobanks. An attractive strategy is to run these programs on a set of ancestry-informative SNP markers (AIMs) that exhibit substantially different frequencies across populations. Unfortunately, existing methods for identifying AIMs require knowing ancestry labels for a subset of the sample. This supervised learning approach creates a chicken-and-egg scenario. In this paper, we present an unsupervised, scalable framework that seamlessly carries out AIM selection and likelihood-based estimation of admixture proportions. Our simulated and real data examples show that this approach is scalable to modern biobank datasets. OpenADMIXTURE, our Julia implementation of the method, is open source and available for free.
Affiliation(s)
- Seyoon Ko
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Benjamin B. Chu
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Daniel Peterson
  - Department of Mathematics, Brigham Young University, Provo, UT 84602, USA
- Chidera Okenwa
  - Department of Mathematics, University of California, Berkeley, Berkeley, CA 94720, USA
- Jeanette C. Papp
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Eric M. Sobel (corresponding author)
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Hua Zhou
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Kenneth L. Lange
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095, USA
5
Dang T, Kumaishi K, Usui E, Kobori S, Sato T, Toda Y, Yamasaki Y, Tsujimoto H, Ichihashi Y, Iwata H. Stochastic variational variable selection for high-dimensional microbiome data. Microbiome 2022; 10:236. [PMID: 36566203 PMCID: PMC9789572 DOI: 10.1186/s40168-022-01439-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2021] [Accepted: 11/28/2022] [Indexed: 06/17/2023]
Abstract
BACKGROUND: The rapid and accurate identification of a minimal-size core set of representative microbial species plays an important role in the clustering of microbial community data and interpretation of clustering results. However, the huge dimensionality of microbial metagenomics datasets is a major challenge for the existing methods such as Dirichlet multinomial mixture (DMM) models. In the approach of the existing methods, the computational burden of identifying a small number of representative species from a large number of observed species remains a challenge. RESULTS: We propose a novel approach to improve the performance of the widely used DMM approach by combining three ideas: (i) we propose an indicator variable to identify representative operational taxonomic units that substantially contribute to the differentiation among clusters; (ii) to address the computational burden of high-dimensional microbiome data, we propose a stochastic variational inference, which approximates the posterior distribution using a controllable distribution called variational distribution, and stochastic optimization algorithms for fast computation; and (iii) we extend the finite DMM model to an infinite case by considering Dirichlet process mixtures and estimating the number of clusters as a variational parameter. Using the proposed method, stochastic variational variable selection (SVVS), we analyzed the root microbiome data collected in our soybean field experiment, the human gut microbiome data from three published datasets of large-scale case-control studies and the healthy human microbiome data from the Human Microbiome Project. CONCLUSIONS: SVVS demonstrates a better performance and significantly faster computation than those of the existing methods in all cases of testing datasets. In particular, SVVS is the only method that can analyze massive high-dimensional microbial data with more than 50,000 microbial species and 1000 samples. Furthermore, a core set of representative microbial species is identified using SVVS that can improve the interpretability of Bayesian mixture models for a wide range of microbiome studies.
Affiliation(s)
- Tung Dang
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan
- Kie Kumaishi
  - RIKEN BioResource Research Center, Tsukuba, Ibaraki, Japan
- Erika Usui
  - RIKEN BioResource Research Center, Tsukuba, Ibaraki, Japan
- Shungo Kobori
  - RIKEN BioResource Research Center, Tsukuba, Ibaraki, Japan
- Takumi Sato
  - RIKEN BioResource Research Center, Tsukuba, Ibaraki, Japan
- Yusuke Toda
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan
- Yuji Yamasaki
  - Arid Land Research Center, Tottori University, Tottori, Japan
- Hiroyoshi Iwata
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan
6
Jha J, Hashemi M, Vattikonda AN, Wang H, Jirsa V. Fully Bayesian estimation of virtual brain parameters with self-tuning Hamiltonian Monte Carlo. Machine Learning: Science and Technology 2022. [DOI: 10.1088/2632-2153/ac9037] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/11/2022]
Abstract
Virtual brain models are data-driven patient-specific brain models integrating individual brain imaging data with neural mass modeling in a single computational framework, capable of autonomously generating brain activity and its associated brain imaging signals. Using epilepsy as an example, we develop an efficient and accurate Bayesian methodology for estimating the parameters linked to the extent of the epileptogenic zone. State-of-the-art advances in Bayesian inference using Hamiltonian Monte Carlo (HMC) algorithms have remained elusive for large-scale differential-equations based models due to their slow convergence. We propose appropriate priors and a novel reparameterization to facilitate efficient exploration of the posterior distribution in terms of computational time and convergence diagnostics. The methodology is illustrated on an in-silico dataset and then applied to infer the personalized model parameters based on the empirical stereotactic electroencephalography (SEEG) recordings of retrospective patients. This improved methodology may pave the way to render HMC methods sufficiently easy and efficient to use, and thus applicable in personalized medicine.
7
Gewirtz AD, Townes FW, Engelhardt BE. Telescoping bimodal latent Dirichlet allocation to identify expression QTLs across tissues. Life Sci Alliance 2022; 5:e202101297. [PMID: 35977827 PMCID: PMC9387650 DOI: 10.26508/lsa.202101297] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/11/2021] [Revised: 07/15/2022] [Accepted: 07/18/2022] [Indexed: 11/24/2022]
Abstract
Expression quantitative trait loci (eQTLs), or single-nucleotide polymorphisms that affect average gene expression levels, provide important insights into context-specific gene regulation. Classic eQTL analyses use one-to-one association tests, which test gene-variant pairs individually and ignore correlations induced by gene regulatory networks and linkage disequilibrium. Probabilistic topic models, such as latent Dirichlet allocation, estimate latent topics for a collection of count observations. Prior multimodal frameworks that bridge genotype and expression data assume matched sample numbers between modalities. However, many data sets have a nested structure where one individual has several associated gene expression samples and a single germline genotype vector. Here, we build a telescoping bimodal latent Dirichlet allocation (TBLDA) framework to learn shared topics across gene expression and genotype data that allows multiple RNA sequencing samples to correspond to a single individual's genotype. By using raw count data, our model avoids possible adulteration via normalization procedures. Ancestral structure is captured in a genotype-specific latent space, effectively removing it from shared components. Using GTEx v8 expression data across 10 tissues and genotype data, we show that the estimated topics capture meaningful and robust biological signal in both modalities and identify associations within and across tissue types. We identify 4,645 cis-eQTLs and 995 trans-eQTLs by conducting eQTL mapping between the most informative features in each topic. Our TBLDA model is able to identify associations using raw sequencing count data when the samples in two separate data modalities are matched one-to-many, as is often the case in biological data. Our code is freely available at https://github.com/gewirtz/TBLDA.
Affiliation(s)
- Ariel Dh Gewirtz
  - Lewis-Sigler Institute of Integrative Genomics, Princeton University, Princeton, NJ, USA
- F William Townes
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
- Barbara E Engelhardt
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
  - Gladstone Institutes, San Francisco, CA, USA
8
Fast and accurate population admixture inference from genotype data from a few microsatellites to millions of SNPs. Heredity (Edinb) 2022; 129:79-92. [PMID: 35508539 PMCID: PMC9338324 DOI: 10.1038/s41437-022-00535-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/03/2021] [Revised: 04/04/2022] [Accepted: 04/05/2022] [Indexed: 11/08/2022]
Abstract
Model-based (likelihood and Bayesian) and non-model-based (PCA and K-means clustering) methods were developed to identify populations and assign individuals to the identified populations using marker genotype data. Model-based methods are favoured because they are based on a probabilistic model of population genetics with biologically meaningful parameters and thus produce results that are easily interpretable and applicable. Furthermore, they often yield more accurate structure inferences than non-model-based methods. However, current model-based methods either are computationally demanding and thus applicable to small problems only or use simplified admixture models that could yield inaccurate results in difficult situations such as unbalanced sampling. In this study, I propose new likelihood methods for fast and accurate population admixture inference using genotype data from a few multiallelic microsatellites to millions of diallelic SNPs. The methods conduct first a clustering analysis of coarse-grained population structure by using the mixture model and the simulated annealing algorithm, and then an admixture analysis of fine-grained population structure by using the clustering results as a starting point in an expectation maximisation algorithm. Extensive analyses of both simulated and empirical data show that the new methods compare favourably with existing methods in both accuracy and running speed. They can analyse small datasets with just a few multiallelic microsatellites but can also handle in parallel terabytes of data with millions of markers and millions of individuals. In difficult situations such as many and/or lowly differentiated populations, unbalanced or very small samples of individuals, the new methods are substantially more accurate than other methods.
9
Chiu AM, Molloy EK, Tan Z, Talwalkar A, Sankararaman S. Inferring population structure in biobank-scale genomic data. Am J Hum Genet 2022; 109:727-737. [PMID: 35298920 PMCID: PMC9069078 DOI: 10.1016/j.ajhg.2022.02.015] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/08/2021] [Accepted: 02/21/2022] [Indexed: 01/07/2023]
Abstract
Inferring the structure of human populations from genetic variation data is a key task in population and medical genomic studies. Although a number of methods for population structure inference have been proposed, current methods are impractical to run on biobank-scale genomic datasets containing millions of individuals and genetic variants. We introduce SCOPE, a method for population structure inference that is orders of magnitude faster than existing methods while achieving comparable accuracy. SCOPE infers population structure in about a day on a dataset containing one million individuals and variants as well as on the UK Biobank dataset containing 488,363 individuals and 569,346 variants. Furthermore, SCOPE can leverage allele frequencies from previous studies to improve the interpretability of population structure estimates.
Affiliation(s)
- Alec M Chiu
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Erin K Molloy
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Institute for Advanced Computer Studies, University of Maryland, College Park, College Park, MD 20742, USA
- Zilong Tan
  - Facebook, Inc., Menlo Park, CA 94025, USA
- Ameet Talwalkar
  - Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Sriram Sankararaman
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
10
Liu J, Xie H, Lin T, Tie C, Luo H, Yang B, Xiong D. Putative variants, genetic diversity and population structure among Soybean cultivars bred at different ages in Huang-Huai-Hai region. Sci Rep 2022; 12:2372. [PMID: 35149770 PMCID: PMC8837640 DOI: 10.1038/s41598-022-06447-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2021] [Accepted: 01/24/2022] [Indexed: 11/25/2022]
Abstract
Soybean cultivars bred in the Huang-Huai-Hai region (HR) are rich in pedigree information. To date, few reports have used genome-wide resequencing data to examine the genetic variants, population structure and genetic diversity of cultivars in this region. To characterize these features, we assembled a soybean population consisting entirely of cultivars and re-sequenced 181 soybean cultivar genomes at an average depth of 10.38×. In total, 11,185,589 single nucleotide polymorphisms (SNPs) and 2,520,208 insertion-deletions (InDels) were identified on all 20 chromosomes. A considerable number of putative variants were located in important genome regions and may substantially influence genes that participate in key biological processes. The 181 varieties were divided into five subpopulations according to their breeding years: SA (1963-1980), SB (1983-1988), SC (1991-2000), SD (2001-2011) and SE (2012-2017). PCA and population structure analysis showed no obvious grouping trend. The LD half-decay distances of SD and SE were 182 kb and 227 kb, respectively. SA had the highest nucleotide polymorphism (π). Nucleotide polymorphism decreased gradually through SB and SC but rose rapidly in SD and SE, indicating a sharp increase in genetic diversity during the latest 20 years and suggesting that breeders may have had different breeding goals in different breeding periods in HR. Analysis of the PIC statistics gave results very similar to those for π. This study analyzes the genetic variants and characterizes the structure and genetic diversity of soybean cultivars bred in different decades in HR, and provides a theoretical reference for similar studies.
Affiliation(s)
- Jialin Liu
  - College of Life Science, Nanchang University, Key Laboratory of Plant Resources in Jiangxi Province, Nanchang, China
- Huimin Xie
  - College of Life Science, Nanchang University, Key Laboratory of Plant Resources in Jiangxi Province, Nanchang, China
- Ting Lin
  - College of Life Science, Nanchang University, Key Laboratory of Plant Resources in Jiangxi Province, Nanchang, China
- Congxiao Tie
  - College of Life Science, Nanchang University, Key Laboratory of Plant Resources in Jiangxi Province, Nanchang, China
- Huolin Luo
  - College of Life Science, Nanchang University, Key Laboratory of Plant Resources in Jiangxi Province, Nanchang, China
- Boyun Yang
  - College of Life Science, Nanchang University, Key Laboratory of Plant Resources in Jiangxi Province, Nanchang, China
- Dongjin Xiong
  - College of Life Science, Nanchang University, Key Laboratory of Plant Resources in Jiangxi Province, Nanchang, China
11
Zhou R, Yang S, Zhang B, Qi Z, Xin D, Su A, Li S, Cheng P, Bai Y, Yin Z, Zhang B, Zhao Y, Zhao Y, Chen Q, Wu X. Analysis of the genetic diversity of grain legume germplasm resources in China and the development of universal SSR primers. Biotechnol Biotec Eq 2022. [DOI: 10.1080/13102818.2021.2006784] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Runnan Zhou
- Department of Agronomy, College of Agriculture, Northeast Agricultural University, Harbin, Heilongjiang, PR China
| | - Siqi Yang
- Department of Agronomy, College of Agriculture, Northeast Agricultural University, Harbin, Heilongjiang, PR China
| | - Bo Zhang
- Department of Agronomy, College of Agriculture, Northeast Agricultural University, Harbin, Heilongjiang, PR China
| | - Zhaoming Qi
- Department of Agronomy, College of Agriculture, Northeast Agricultural University, Harbin, Heilongjiang, PR China
| | - Dawei Xin
- Department of Agronomy, College of Agriculture, Northeast Agricultural University, Harbin, Heilongjiang, PR China
| | - Anyu Su
- Department of Land Remediation Engineering, College of Public Administration and Law, Northeast Agricultural University, Harbin, Heilongjiang, PR China
| | - Sinan Li
- Key Lab of Maize Genetics and Breeding, Department of National Corn Engineering Laboratory, Heilongjiang Academy of Agricultural Sciences, Harbin, Heilongjiang, PR China
| | - Peng Cheng
- Department of Agronomy, College of Agriculture, Northeast Agricultural University, Harbin, Heilongjiang, PR China
| | - Yunqi Bai
- Department of Agronomy, College of Agriculture, Northeast Agricultural University, Harbin, Heilongjiang, PR China
| | - Zhengong Yin
- Crop Resources Institute of Heilongjiang Academy of Agricultural Sciences, Harbin, Heilongjiang, PR China
| | - Binshuo Zhang
- Department of Agronomy, College of Agriculture, Northeast Agricultural University, Harbin, Heilongjiang, PR China
| | - Yujing Zhao
- Department of Agronomy, College of Agriculture, Northeast Agricultural University, Harbin, Heilongjiang, PR China
| | - Ying Zhao
- Department of Agronomy, College of Agriculture, Northeast Agricultural University, Harbin, Heilongjiang, PR China
| | - Qingshan Chen
- Department of Agronomy, College of Agriculture, Northeast Agricultural University, Harbin, Heilongjiang, PR China
| | - Xiaoxia Wu
- Department of Agronomy, College of Agriculture, Northeast Agricultural University, Harbin, Heilongjiang, PR China
| |
|
12
|
Carress H, Lawson DJ, Elhaik E. Population genetic considerations for using biobanks as international resources in the pandemic era and beyond. BMC Genomics 2021; 22:351. [PMID: 34001009 PMCID: PMC8127217 DOI: 10.1186/s12864-021-07618-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/16/2020] [Accepted: 04/14/2021] [Indexed: 12/11/2022] Open
Abstract
The past years have seen the rise of genomic biobanks and mega-scale meta-analysis of genomic data, which promises to reveal the genetic underpinnings of health and disease. However, the over-representation of Europeans in genomic studies not only limits the global understanding of disease risk but also inhibits viable research into the genomic differences between carriers and patients. Whilst the community has agreed that more diverse samples are required, it is not enough to blindly increase diversity; the diversity must be quantified, compared and annotated to lead to insight. Genetic annotations from separate biobanks need to be comparable and computable and to operate without access to raw data due to privacy concerns. Comparability is key both for regular research and to allow international comparison in response to pandemics. Here, we evaluate the appropriateness of the most common genomic tools used to depict population structure in a standardized and comparable manner. The end goal is to reduce the effects of confounding and learn from genuine variation in genetic effects on phenotypes across populations, which will improve the value of biobanks (locally and internationally), increase the accuracy of association analyses and inform developmental efforts.
Affiliation(s)
- Hannah Carress
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield, UK
| | - Daniel John Lawson
- School of Mathematics and Integrative Epidemiology Unit, University of Bristol, Bristol, UK
| | - Eran Elhaik
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield, UK; Department of Biology, Lund University, Lund, Sweden
| |
|
13
|
Shastry V, Adams PE, Lindtke D, Mandeville EG, Parchman TL, Gompert Z, Buerkle CA. Model-based genotype and ancestry estimation for potential hybrids with mixed-ploidy. Mol Ecol Resour 2021; 21:1434-1451. [PMID: 33482035 DOI: 10.1111/1755-0998.13330] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 07/31/2020] [Revised: 12/11/2020] [Accepted: 01/11/2021] [Indexed: 11/29/2022]
Abstract
Non-random mating among individuals can lead to spatial clustering of genetically similar individuals and population stratification. This deviation from panmixia is commonly observed in natural populations. Consequently, individuals can have parentage in single populations or involving hybridization between differentiated populations. Accounting for this mixture and structure is important when mapping the genetics of traits and learning about the formative evolutionary processes that shape genetic variation among individuals and populations. Stratified genetic relatedness among individuals is commonly quantified using estimates of ancestry that are derived from a statistical model. Development of these models for polyploid and mixed-ploidy individuals and populations has lagged behind those for diploids. Here, we extend and test a hierarchical Bayesian model, called entropy, which can use low-depth sequence data to estimate genotype and ancestry parameters in autopolyploid and mixed-ploidy individuals (including sex chromosomes and autosomes within individuals). Our analysis of simulated data illustrated the trade-off between sequencing depth and genome coverage and found lower error associated with low-depth sequencing across a larger fraction of the genome than with high-depth sequencing across a smaller fraction of the genome. The model has high accuracy and sensitivity as verified with simulated data and through analysis of admixture among populations of diploid and tetraploid Arabidopsis arenosa.
Affiliation(s)
- Paula E Adams
- Department of Biological Sciences, University of Alabama, Tuscaloosa, AL, USA
| | - Dorothea Lindtke
- Institute of Plant Sciences, University of Bern, Bern, Switzerland
| | - C Alex Buerkle
- Department of Botany, University of Wyoming, Laramie, WY, USA
| |
|
14
|
Bose A, Kalantzis V, Kontopoulou EM, Elkady M, Paschou P, Drineas P. TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes. Bioinformatics 2020; 35:3679-3683. [PMID: 30957838 DOI: 10.1093/bioinformatics/btz157] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 08/08/2018] [Revised: 02/26/2019] [Accepted: 04/04/2019] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: Principal Component Analysis is a key tool in the study of population structure in human genetics. As modern datasets become increasingly larger in size, traditional approaches based on loading the entire dataset in the system memory (Random Access Memory) become impractical and out-of-core implementations are the only viable alternative. RESULTS: We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method to perform Principal Component Analysis of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and is able to successfully operate even on commodity hardware with a system memory of just a few gigabytes. Moreover, TeraPCA has minimal dependencies on external libraries and only requires a working installation of the BLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on a million markers, TeraPCA requires <5 h (in multi-threaded mode) to accurately compute the 10 leading principal components. An extensive experimental analysis shows that TeraPCA is both fast and accurate and is competitive with current state-of-the-art software for the same task. AVAILABILITY AND IMPLEMENTATION: Source code and documentation are both available at https://github.com/aritra90/TeraPCA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Aritra Bose
- Computer Science Department, Purdue University, West Lafayette, IN, USA
| | - Vassilis Kalantzis
- IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY, USA
| | - Mai Elkady
- Computer Science Department, Purdue University, West Lafayette, IN, USA
| | - Peristera Paschou
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
| | - Petros Drineas
- Computer Science Department, Purdue University, West Lafayette, IN, USA
| |
|
15
|
Hashemi M, Vattikonda AN, Sip V, Guye M, Bartolomei F, Woodman MM, Jirsa VK. The Bayesian Virtual Epileptic Patient: A probabilistic framework designed to infer the spatial map of epileptogenicity in a personalized large-scale brain model of epilepsy spread. Neuroimage 2020; 217:116839. [PMID: 32387625 DOI: 10.1016/j.neuroimage.2020.116839] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 06/06/2019] [Revised: 04/02/2020] [Accepted: 04/07/2020] [Indexed: 12/28/2022] Open
Abstract
Despite the importance and frequent use of Bayesian frameworks in brain network modeling for parameter inference and model prediction, the advanced sampling algorithms implemented in probabilistic programming languages to overcome the inference difficulties have received relatively little attention in this context. In this technical note, we propose a probabilistic framework, namely the Bayesian Virtual Epileptic Patient (BVEP), which relies on the fusion of structural data of individuals to infer the spatial map of epileptogenicity in a personalized large-scale brain model of epilepsy spread. To invert the individualized whole-brain model employed in this study, we use the recently developed algorithms known as No-U-Turn Sampler (NUTS) as well as Automatic Differentiation Variational Inference (ADVI). Our results indicate that NUTS and ADVI accurately estimate the degree of epileptogenicity of brain regions, therefore, the hypothetical brain areas responsible for the seizure initiation and propagation, while the convergence diagnostics and posterior behavior analysis validate the reliability of the estimations. Moreover, we illustrate the efficiency of the transformed non-centered parameters in comparison to centered form of parameterization. The Bayesian framework used in this work proposes an appropriate patient-specific strategy for estimating the epileptogenicity of the brain regions to improve outcome after epilepsy surgery.
Affiliation(s)
- M Hashemi
- Aix Marseille Univ, INSERM, INS, Inst Neurosci Syst, Marseille, France.
| | - A N Vattikonda
- Aix Marseille Univ, INSERM, INS, Inst Neurosci Syst, Marseille, France
| | - V Sip
- Aix Marseille Univ, INSERM, INS, Inst Neurosci Syst, Marseille, France
| | - M Guye
- Aix Marseille Univ, CNRS, CRMBM, Marseille, France
| | - F Bartolomei
- Aix Marseille Univ, INSERM, INS, Inst Neurosci Syst, Marseille, France; Epileptology Department, and Clinical Neurophysiology Department, Assistance Publique des Hôpitaux de Marseille, Marseille, France
| | - M M Woodman
- Aix Marseille Univ, INSERM, INS, Inst Neurosci Syst, Marseille, France
| | - V K Jirsa
- Aix Marseille Univ, INSERM, INS, Inst Neurosci Syst, Marseille, France.
| |
|
16
|
Greenbaum G, Rubin A, Templeton AR, Rosenberg NA. Network-based hierarchical population structure analysis for large genomic data sets. Genome Res 2019; 29:2020-2033. [PMID: 31694865 PMCID: PMC6886512 DOI: 10.1101/gr.250092.119] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/06/2019] [Accepted: 11/01/2019] [Indexed: 01/24/2023]
Abstract
Analysis of population structure in natural populations using genetic data is a common practice in ecological and evolutionary studies. With large genomic data sets of populations now appearing more frequently across the taxonomic spectrum, it is becoming increasingly possible to reveal many hierarchical levels of structure, including fine-scale genetic clusters. To analyze these data sets, methods need to be appropriately suited to the challenges of extracting multilevel structure from whole-genome data. Here, we present a network-based approach for constructing population structure representations from genetic data. The use of community-detection algorithms from network theory generates a natural hierarchical perspective on the representation that the method produces. The method is computationally efficient, and it requires relatively few assumptions regarding the biological processes that underlie the data. We show the approach by analyzing population structure in the model plant species Arabidopsis thaliana and in human populations. These examples illustrate how network-based approaches for population structure analysis are well-suited to extracting valuable ecological and evolutionary information in the era of large genomic data sets.
Affiliation(s)
- Gili Greenbaum
- Department of Biology, Stanford University, Stanford, California 94305, USA
| | - Amir Rubin
- Department of Computer Science, Ben-Gurion University of the Negev, Be'er-Sheva, 8410501, Israel
| | - Alan R Templeton
- Department of Biology, Washington University, St. Louis, Missouri 63130, USA
- Department of Evolutionary and Environmental Ecology, University of Haifa, Haifa, 31905, Israel
| | - Noah A Rosenberg
- Department of Biology, Stanford University, Stanford, California 94305, USA
| |
|
17
|
Hao W, Storey JD. Extending Tests of Hardy-Weinberg Equilibrium to Structured Populations. Genetics 2019; 213:759-770. [PMID: 31537622 PMCID: PMC6827367 DOI: 10.1534/genetics.119.302370] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 05/28/2019] [Accepted: 08/21/2019] [Indexed: 12/22/2022] Open
Abstract
Testing for Hardy-Weinberg equilibrium (HWE) is an important component in almost all analyses of population genetic data. Genetic markers that violate HWE are often treated as special cases; for example, they may be flagged as possible genotyping errors, or they may be investigated more closely for evolutionary signatures of interest. The presence of population structure is one reason why genetic markers may fail a test of HWE. This is problematic because almost all natural populations studied in the modern setting show some degree of structure. Therefore, it is important to be able to detect deviations from HWE for reasons other than structure. To this end, we extend statistical tests of HWE to allow for population structure, which we call a test of "structural HWE." Additionally, our new test allows one to automatically choose tuning parameters and identify accurate models of structure. We demonstrate our approach on several important studies, provide theoretical justification for the test, and present empirical evidence for its utility. We anticipate the proposed test will be useful in a broad range of analyses of genome-wide population genetic data.
Affiliation(s)
- Wei Hao
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, New Jersey 08544
| | - John D Storey
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, New Jersey 08544
| |
|
18
|
Joseph TA, Pe'er I. Inference of Population Structure from Time-Series Genotype Data. Am J Hum Genet 2019; 105:317-333. [PMID: 31256878 PMCID: PMC6698887 DOI: 10.1016/j.ajhg.2019.06.002] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 01/22/2019] [Accepted: 06/04/2019] [Indexed: 10/26/2022] Open
Abstract
Sequencing ancient DNA can offer direct probing of population history. Yet, such data are commonly analyzed with standard tools that assume DNA samples are all contemporary. We present DyStruct, a model and inference algorithm for inferring shared ancestry from temporally sampled genotype data. DyStruct explicitly incorporates temporal dynamics by modeling individuals as mixtures of unobserved populations whose allele frequencies drift over time. We develop an efficient inference algorithm for our model using stochastic variational inference. On simulated data, we show that DyStruct outperforms the current state of the art when individuals are sampled over time. Using a dataset of 296 modern and 80 ancient samples, we demonstrate DyStruct is able to capture a well-supported admixture event of steppe ancestry into modern Europe. We further apply DyStruct to a genome-wide dataset of 2,067 modern and 262 ancient samples used to study the origin of farming in the Near East. We show that DyStruct provides new insight into population history when compared with alternate approaches, within feasible run time.
Affiliation(s)
- Tyler A Joseph
- Department of Computer Science, Columbia University, New York, NY 10027, USA.
| | - Itsik Pe'er
- Department of Computer Science, Columbia University, New York, NY 10027, USA; Department of Systems Biology, Columbia University, New York, NY 10027, USA; Data Science Institute, Columbia University, New York, NY 10027, USA.
| |
|
19
|
Cabreros I, Storey JD. A Likelihood-Free Estimator of Population Structure Bridging Admixture Models and Principal Components Analysis. Genetics 2019; 212:1009-1029. [PMID: 31028112 PMCID: PMC6707457 DOI: 10.1534/genetics.119.302159] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 02/07/2019] [Accepted: 04/08/2019] [Indexed: 11/18/2022] Open
Abstract
We introduce a simple and computationally efficient method for fitting the admixture model of genetic population structure, called ALStructure. The strategy of ALStructure is to first estimate the low-dimensional linear subspace of the population admixture components, and then search for a model within this subspace that is consistent with the admixture model's natural probabilistic constraints. Central to this strategy is the observation that all models belonging to this constrained space of solutions are risk-minimizing and have equal likelihood, rendering any additional optimization unnecessary. The low-dimensional linear subspace is estimated through a recently introduced principal components analysis method that is appropriate for genotype data, thereby providing a solution that has both principal components and probabilistic admixture interpretations. Our approach differs fundamentally from other existing methods for estimating admixture, which aim to fit the admixture model directly by searching for parameters that maximize the likelihood function or the posterior probability. We observe that ALStructure typically outperforms existing methods both in accuracy and computational speed under a wide array of simulated and real human genotype datasets. Throughout this work, we emphasize that the admixture model is a special case of a much broader class of models for which algorithms similar to ALStructure may be successfully employed.
Affiliation(s)
- Irineo Cabreros
- Program in Applied and Computational Mathematics, Princeton University, New Jersey 08544
| | - John D Storey
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, New Jersey 08544
| |
|
20
|
Abstract
The pattern of molecular evolution varies among gene sites and genes in a genome. By taking into account the complex heterogeneity of evolutionary processes among sites in a genome, Bayesian infinite mixture models of genomic evolution enable robust phylogenetic inference. With large modern data sets, however, the computational burden of Markov chain Monte Carlo sampling techniques becomes prohibitive. Here, we have developed a variational Bayesian procedure to speed up the widely used PhyloBayes MPI program, which deals with the heterogeneity of amino acid profiles. Rather than sampling from the posterior distribution, the procedure approximates the (unknown) posterior distribution using a manageable distribution called the variational distribution. The parameters in the variational distribution are estimated by minimizing Kullback-Leibler divergence. To examine performance, we analyzed three empirical data sets consisting of mitochondrial, plastid-encoded, and nuclear proteins. Our variational method accurately approximated the Bayesian inference of phylogenetic tree, mixture proportions, and the amino acid propensity of each component of the mixture while using orders of magnitude less computational time.
Affiliation(s)
- Tung Dang
- Department of Agricultural and Environmental Biology, University of Tokyo, Tokyo, Japan
| | - Hirohisa Kishino
- Department of Agricultural and Environmental Biology, University of Tokyo, Tokyo, Japan
| |
|
21
|
Pan X, Wang Y, Wong EHM, Telenti A, Venter JC, Jin L. Fine population structure analysis method for genomes of many. Sci Rep 2017; 7:12608. [PMID: 28974706 PMCID: PMC5626719 DOI: 10.1038/s41598-017-12319-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/02/2017] [Accepted: 09/01/2017] [Indexed: 12/22/2022] Open
Abstract
Fine population structure can be examined through the clustering of individuals into subpopulations. The clustering of individuals in large sequence datasets into subpopulations makes the calculation of subpopulation specific allele frequency possible, which may shed light on selection of candidate variants for rare diseases. However, as the magnitude of the data increases, computational burden becomes a challenge in fine population structure analysis. To address this issue, we propose fine population structure analysis (FIPSA), which is an individual-based non-parametric method for dissecting fine population structure. FIPSA maximizes the likelihood ratio of the contingency table of the allele counts multiplied by the group. We demonstrated that its speed and accuracy were superior to existing non-parametric methods when the simulated sample size was up to 5,000 individuals. When applied to real data, the method showed high resolution on the Human Genome Diversity Project (HGDP) East Asian dataset. FIPSA was independently validated on 11,257 human genomes. The group assignment given by FIPSA was 99.1% similar to those assigned based on supervised learning. Thus, FIPSA provides high resolution and is compatible with a real dataset of more than ten thousand individuals.
Affiliation(s)
- Xuedong Pan
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, China
| | - Yi Wang
- Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, China
| | - Li Jin
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, China.
| |
|
22
|
Novembre J, Peter BM. Recent advances in the study of fine-scale population structure in humans. Curr Opin Genet Dev 2016; 41:98-105. [PMID: 27662060 DOI: 10.1016/j.gde.2016.08.007] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 07/06/2016] [Revised: 08/18/2016] [Accepted: 08/24/2016] [Indexed: 01/17/2023]
Abstract
Empowered by modern genotyping and large samples, population structure can be accurately described and quantified even when it only explains a fraction of a percent of total genetic variance. This is especially relevant and interesting for humans, where fine-scale population structure can both confound disease-mapping studies and reveal the history of migration and divergence that shaped our species' diversity. Here we review notable recent advances in the detection, use, and understanding of population structure. Our work addresses multiple areas where substantial progress is being made: improved statistics and models for better capturing differentiation, admixture, and the spatial distribution of variation; computational speed-ups that allow methods to scale to modern data; and advances in haplotypic modeling that have wide ranging consequences for the analysis of population structure. We conclude by outlining four important open challenges: the limitations of discrete population models, uncertainty in individual origins, the incorporation of both fine-scale structure and ancient DNA in parametric models, and the development of efficient computational tools, particularly for haplotype-based methods.
Affiliation(s)
- John Novembre
- Department of Human Genetics, University of Chicago, IL 60636, United States; Department of Ecology and Evolutionary Biology, University of Chicago, IL 60636, United States
| | - Benjamin M Peter
- Department of Human Genetics, University of Chicago, IL 60636, United States
| |
|