1
|
Link V, Schraiber JG, Fan C, Dinh B, Mancuso N, Chiang CWK, Edge MD. Tree-based QTL mapping with expected local genetic relatedness matrices. Am J Hum Genet 2023; 110:2077-2091. [PMID: 38065072 PMCID: PMC10716520 DOI: 10.1016/j.ajhg.2023.10.017] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2023] [Revised: 10/26/2023] [Accepted: 10/27/2023] [Indexed: 12/18/2023] Open
Abstract
Understanding the genetic basis of complex phenotypes is a central pursuit of genetics. Genome-wide association studies (GWASs) are a powerful way to find genetic loci associated with phenotypes. GWASs are widely and successfully used, but they face challenges related to the fact that variants are tested for association with a phenotype independently, whereas in reality variants at different sites are correlated because of their shared evolutionary history. One way to model this shared history is through the ancestral recombination graph (ARG), which encodes a series of local coalescent trees. Recent computational and methodological breakthroughs have made it feasible to estimate approximate ARGs from large-scale samples. Here, we explore the potential of an ARG-based approach to quantitative-trait locus (QTL) mapping, echoing existing variance-components approaches. We propose a framework that relies on the conditional expectation of a local genetic relatedness matrix (local eGRM) given the ARG. Simulations show that our method is especially beneficial for finding QTLs in the presence of allelic heterogeneity. By framing QTL mapping in terms of the estimated ARG, we can also facilitate the detection of QTLs in understudied populations. We use local eGRM to analyze two chromosomes containing known body size loci in a sample of Native Hawaiians. Our investigations can provide intuition about the benefits of using estimated ARGs in population- and statistical-genetic methods in general.
Collapse
Affiliation(s)
- Vivian Link
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Joshua G Schraiber
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Caoqi Fan
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Bryan Dinh
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Nicholas Mancuso
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Charleston W K Chiang
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Michael D Edge
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
| |
Collapse
|
2
|
Link V, Schraiber JG, Fan C, Dinh B, Mancuso N, Chiang CW, Edge MD. Tree-based QTL mapping with expected local genetic relatedness matrices. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.04.07.536093. [PMID: 37066144 PMCID: PMC10104234 DOI: 10.1101/2023.04.07.536093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/18/2023]
Abstract
Understanding the genetic basis of complex phenotypes is a central pursuit of genetics. Genome-wide Association Studies (GWAS) are a powerful way to find genetic loci associated with phenotypes. GWAS are widely and successfully used, but they face challenges related to the fact that variants are tested for association with a phenotype independently, whereas in reality variants at different sites are correlated because of their shared evolutionary history. One way to model this shared history is through the ancestral recombination graph (ARG), which encodes a series of local coalescent trees. Recent computational and methodological breakthroughs have made it feasible to estimate approximate ARGs from large-scale samples. Here, we explore the potential of an ARG-based approach to quantitative-trait locus (QTL) mapping, echoing existing variance-components approaches. We propose a framework that relies on the conditional expectation of a local genetic relatedness matrix given the ARG (local eGRM). Simulations show that our method is especially beneficial for finding QTLs in the presence of allelic heterogeneity. By framing QTL mapping in terms of the estimated ARG, we can also facilitate the detection of QTLs in understudied populations. We use local eGRM to identify a large-effect BMI locus, the CREBRF gene, in a sample of Native Hawaiians in which it was not previously detectable by GWAS because of a lack of population-specific imputation resources. Our investigations can provide intuition about the benefits of using estimated ARGs in population- and statistical-genetic methods in general.
Collapse
Affiliation(s)
- Vivian Link
- Department of Quantitative and Computational Biology, University of Southern California
| | - Joshua G. Schraiber
- Department of Quantitative and Computational Biology, University of Southern California
| | - Caoqi Fan
- Department of Quantitative and Computational Biology, University of Southern California
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
| | - Bryan Dinh
- Department of Quantitative and Computational Biology, University of Southern California
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
| | - Nicholas Mancuso
- Department of Quantitative and Computational Biology, University of Southern California
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
| | - Charleston W.K. Chiang
- Department of Quantitative and Computational Biology, University of Southern California
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
| | - Michael D. Edge
- Department of Quantitative and Computational Biology, University of Southern California
| |
Collapse
|
3
|
Kabisch M, Hamann U, Lorenzo Bermejo J. Imputation of missing genotypes within LD-blocks relying on the basic coalescent and beyond: consideration of population growth and structure. BMC Genomics 2017; 18:798. [PMID: 29041903 PMCID: PMC5646149 DOI: 10.1186/s12864-017-4208-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2016] [Accepted: 10/12/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Genotypes not directly measured in genetic studies are often imputed to improve statistical power and to increase mapping resolution. The accuracy of standard imputation techniques strongly depends on the similarity of linkage disequilibrium (LD) patterns in the study and reference populations. Here we develop a novel approach for genotype imputation in low-recombination regions that relies on the coalescent and permits to explicitly account for population demographic factors. To test the new method, study and reference haplotypes were simulated and gene trees were inferred under the basic coalescent and also considering population growth and structure. The reference haplotypes that first coalesced with study haplotypes were used as templates for genotype imputation. Computer simulations were complemented with the analysis of real data. Genotype concordance rates were used to compare the accuracies of coalescent-based and standard (IMPUTE2) imputation. RESULTS Simulations revealed that, in LD-blocks, imputation accuracy relying on the basic coalescent was higher and less variable than with IMPUTE2. Explicit consideration of population growth and structure, even if present, did not practically improve accuracy. The advantage of coalescent-based over standard imputation increased with the minor allele frequency and it decreased with population stratification. Results based on real data indicated that, even in low-recombination regions, further research is needed to incorporate recombination in coalescence inference, in particular for studies with genetically diverse and admixed individuals. CONCLUSIONS To exploit the full potential of coalescent-based methods for the imputation of missing genotypes in genetic studies, further methodological research is needed to reduce computer time, to take into account recombination, and to implement these methods in user-friendly computer programs. Here we provide reproducible code which takes advantage of publicly available software to facilitate further developments in the field.
Collapse
Affiliation(s)
- Maria Kabisch
- Molecular Genetics of Breast Cancer, German Cancer Research Center (DKFZ), 69120, Heidelberg, Germany.,Institute of Medical Biometry and Informatics, University of Heidelberg, 69120, Heidelberg, Germany
| | - Ute Hamann
- Molecular Genetics of Breast Cancer, German Cancer Research Center (DKFZ), 69120, Heidelberg, Germany
| | - Justo Lorenzo Bermejo
- Institute of Medical Biometry and Informatics, University of Heidelberg, 69120, Heidelberg, Germany.
| |
Collapse
|
4
|
Xavier A, Muir WM, Rainey KM. Impact of imputation methods on the amount of genetic variation captured by a single-nucleotide polymorphism panel in soybeans. BMC Bioinformatics 2016; 17:55. [PMID: 26830693 PMCID: PMC4736474 DOI: 10.1186/s12859-016-0899-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2015] [Accepted: 01/19/2016] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND Success in genome-wide association studies and marker-assisted selection depends on good phenotypic and genotypic data. The more complete this data is, the more powerful will be the results of analysis. Nevertheless, there are next-generation technologies that seek to provide genotypic information in spite of great proportions of missing data. The procedures these technologies use to impute genetic data, therefore, greatly affect downstream analyses. This study aims to (1) compare the genetic variance in a single-nucleotide polymorphism panel of soybean with missing data imputed using various methods, (2) evaluate the imputation accuracy and post-imputation quality associated with these methods, and (3) evaluate the impact of imputation method on heritability and the accuracy of genome-wide prediction of soybean traits. The imputation methods we evaluated were as follows: multivariate mixed model, hidden Markov model, logical algorithm, k-nearest neighbor, single value decomposition, and random forest. We used raw genotypes from the SoyNAM project and the following phenotypes: plant height, days to maturity, grain yield, and seed protein composition. RESULTS We propose an imputation method based on multivariate mixed models using pedigree information. Our methods comparison indicate that heritability of traits can be affected by the imputation method. Genotypes with missing values imputed with methods that make use of genealogic information can favor genetic analysis of highly polygenic traits, but not genome-wide prediction accuracy. The genotypic matrix captured the highest amount of genetic variance when missing loci were imputed by the method proposed in this paper. CONCLUSIONS We concluded that hidden Markov models and random forest imputation are more suitable to studies that aim analyses of highly heritable traits while pedigree-based methods can be used to best analyze traits with low heritability. Despite the notable contribution to heritability, advantages in genomic prediction were not observed by changing the imputation method. We identified significant differences across imputation methods in a dataset missing 20 % of the genotypic values. It means that genotypic data from genotyping technologies that provide a high proportion of missing values, such as GBS, should be handled carefully because the imputation method will impact downstream analysis.
Collapse
Affiliation(s)
- A Xavier
- Department of Agronomy, Purdue University, Lilly Hall of Life Sciences, 915 W. State St., West Lafayette, Indiana, 47907, USA.
| | - William M Muir
- Department of Animal Science, Purdue University, Lilly Hall of Life Sciences, 915 W. State St., West Lafayette, Indiana, 47907, USA.
| | - Katy M Rainey
- Department of Agronomy, Purdue University, Lilly Hall of Life Sciences, 915 W. State St., West Lafayette, Indiana, 47907, USA.
| |
Collapse
|
5
|
Burkett KM, Greenwood CMT, McNeney B, Graham J. Gene genealogies for genetic association mapping, with application to Crohn's disease. Front Genet 2013; 4:260. [PMID: 24348515 PMCID: PMC3845011 DOI: 10.3389/fgene.2013.00260] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2013] [Accepted: 11/12/2013] [Indexed: 11/30/2022] Open
Abstract
A gene genealogy describes relationships among haplotypes sampled from a population. Knowledge of the gene genealogy for a set of haplotypes is useful for estimation of population genetic parameters and it also has potential application in finding disease-predisposing genetic variants. As the true gene genealogy is unknown, Markov chain Monte Carlo (MCMC) approaches have been used to sample genealogies conditional on data at multiple genetic markers. We previously implemented an MCMC algorithm to sample from an approximation to the distribution of the gene genealogy conditional on haplotype data. Our approach samples ancestral trees, recombination and mutation rates at a genomic focal point. In this work, we describe how our sampler can be used to find disease-predisposing genetic variants in samples of cases and controls. We use a tree-based association statistic that quantifies the degree to which case haplotypes are more closely related to each other around the focal point than control haplotypes, without relying on a disease model. As the ancestral tree is a latent variable, so is the tree-based association statistic. We show how the sampler can be used to estimate the posterior distribution of the latent test statistic and corresponding latent p-values, which together comprise a fuzzy p-value. We illustrate the approach on a publicly-available dataset from a study of Crohn's disease that consists of genotypes at multiple SNP markers in a small genomic region. We estimate the posterior distribution of the tree-based association statistic and the recombination rate at multiple focal points in the region. Reassuringly, the posterior mean recombination rates estimated at the different focal points are consistent with previously published estimates. The tree-based association approach finds multiple sub-regions where the case haplotypes are more genetically related than the control haplotypes, and that there may be one or multiple disease-predisposing loci.
Collapse
Affiliation(s)
- Kelly M Burkett
- Department of Statistics and Actuarial Science, Simon Fraser University Burnaby, BC, Canada ; Department of Epidemiology, Biostatistics and Occupational Health, McGill University Montreal, QC, Canada
| | - Celia M T Greenwood
- Department of Oncology, Department of Epidemiology, Biostatistics and Occupational Health, and Division of Cancer Epidemiology, McGill University Montreal, QC, Canada ; Lady Davis Institute for Medical Research, Jewish General Hospital Montreal, QC, Canada
| | - Brad McNeney
- Department of Statistics and Actuarial Science, Simon Fraser University Burnaby, BC, Canada
| | - Jinko Graham
- Department of Statistics and Actuarial Science, Simon Fraser University Burnaby, BC, Canada
| |
Collapse
|
6
|
Fang M. A fast expectation-maximum algorithm for fine-scale QTL mapping. TAG. THEORETICAL AND APPLIED GENETICS. THEORETISCHE UND ANGEWANDTE GENETIK 2012; 125:1727-1734. [PMID: 22865126 DOI: 10.1007/s00122-012-1949-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/14/2012] [Accepted: 07/15/2012] [Indexed: 06/01/2023]
Abstract
The recent technology of the single-nucleotide-polymorphism (SNP) array makes it possible to genotype millions of SNP markers on genome, which in turn requires to develop fast and efficient method for fine-scale quantitative trait loci (QTL) mapping. The single-marker association (SMA) is the simplest method for fine-scale QTL mapping, but it usually shows many false-positive signals and has low QTL-detection power. Compared with SMA, the haplotype-based method of Meuwissen and Goddard who assume QTL effect to be random and estimate variance components (VC) with identity-by-descent (IBD) matrices that inferred from unknown historic population is more powerful for fine-scale QTL mapping; furthermore, their method also tends to show continuous QTL-detection profile to diminish many false-positive signals. However, as we know, the variance component estimation is usually very time consuming and difficult to converge. Thus, an extremely fast EMF (Expectation-Maximization algorithm under Fixed effect model) is proposed in this research, which assumes a biallelic QTL and uses an expectation-maximization (EM) algorithm to solve model effects. The results of simulation experiments showed that (1) EMF was computationally much faster than VC method; (2) EMF and VC performed similarly in QTL detection power and parameter estimations, and both outperformed the paired-marker analysis and SMA. However, the power of EMF would be lower than that of VC if the QTL was multiallelic.
Collapse
Affiliation(s)
- Ming Fang
- Life Science College, Heilongjiang Bayi Agricultural University, Daqing 163319, People's Republic of China.
| |
Collapse
|
7
|
Abstract
In the age of high-density genome-wide association (GWAS) data, correcting for multiple comparisons is a substantial issue for genetic epidemiological studies. However, the current manuscript review process generally requires both stringent correction and independent replication. The result of this stringency is that studies that are published suffer from inflated Type 2 error rates (false negatives), thereby removing many likely real signals from follow-up. Elimination of these alleles, if they are truly associated, from further study will slow research progress in studies of complex disease. We argue that this method of correction is overly conservative, especially in an age when high-density follow-up experiments are possible and reasonably inexpensive.
Collapse
Affiliation(s)
- Scott M Williams
- Center for Human Genetics Research, Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville, TN.
| | | |
Collapse
|
8
|
Rosenberg NA, Huang L, Jewett EM, Szpiech ZA, Jankovic I, Boehnke M. Genome-wide association studies in diverse populations. Nat Rev Genet 2010; 11:356-66. [PMID: 20395969 PMCID: PMC3079573 DOI: 10.1038/nrg2760] [Citation(s) in RCA: 414] [Impact Index Per Article: 29.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
Genome-wide association (GWA) studies have identified a large number of SNPs associated with disease phenotypes. As most GWA studies have been performed in populations of European descent, this Review examines the issues involved in extending the consideration of GWA studies to diverse worldwide populations. Although challenges exist with issues such as imputation, admixture and replication, investigation of a greater diversity of populations could make substantial contributions to the goal of mapping the genetic determinants of complex diseases for the human population as a whole.
Collapse
Affiliation(s)
- Noah A Rosenberg
- Department of Human Genetics, University of Michigan, Ann Arbor, Michigan 48109, USA.
| | | | | | | | | | | |
Collapse
|
9
|
Current world literature. Curr Opin Lipidol 2010; 21:148-52. [PMID: 20616627 DOI: 10.1097/mol.0b013e3283390e49] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
10
|
Affiliation(s)
- Eran Halperin
- International Computer Science Institute, Berkeley, CA, USA
| | | |
Collapse
|