1
Naito T, Okada Y. Genotype imputation methods for whole and complex genomic regions utilizing deep learning technology. J Hum Genet 2024; 69:481-486. [PMID: 38225263 PMCID: PMC11422162 DOI: 10.1038/s10038-023-01213-6] [Received: 09/13/2023] [Revised: 11/23/2023] [Accepted: 12/04/2023] [Indexed: 01/17/2024]
Abstract
The imputation of unmeasured genotypes is essential in human genetic research, particularly in enhancing the power of genome-wide association studies and conducting subsequent fine-mapping. Recently, several deep learning-based genotype imputation methods for genome-wide variants with the capability of learning complex linkage disequilibrium patterns have been developed. Additionally, deep learning-based imputation has been applied to a distinct genomic region known as the major histocompatibility complex, referred to as HLA imputation. Despite their various advantages, the current deep learning-based genotype imputation methods do have certain limitations and have not yet become standard. These limitations include the modest accuracy improvement over statistical and conventional machine learning-based methods. However, their benefits include other aspects, such as their "reference-free" nature, which ensures complete privacy protection, and their higher computational efficiency. Furthermore, the continuing evolution of deep learning technologies is expected to contribute to further improvements in prediction accuracy and usability in the future.
Affiliation(s)
- Tatsuhiko Naito
  - Department of Statistical Genetics, Osaka University Graduate School of Medicine, 2-2, Yamadaoka, Suita-shi, Osaka, 565-0871, Japan
  - Laboratory for Systems Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22, Suehiro-cho, Tsurumi-ku, Yokohama City, Kanagawa, 230-0045, Japan
- Yukinori Okada
  - Department of Statistical Genetics, Osaka University Graduate School of Medicine, 2-2, Yamadaoka, Suita-shi, Osaka, 565-0871, Japan
  - Laboratory for Systems Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22, Suehiro-cho, Tsurumi-ku, Yokohama City, Kanagawa, 230-0045, Japan
  - Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan
  - Integrated Frontier Research for Medical Science Division, Institute for Open and Transdisciplinary Research Initiatives, Osaka University, 2-2, Yamadaoka, Suita-shi, Osaka, 565-0871, Japan
  - Premium Research Institute for Human Metaverse Medicine (WPI-PRIMe), Osaka University, 2-2, Yamadaoka, Suita-shi, Osaka, 565-0871, Japan
2
Cahoon JL, Rui X, Tang E, Simons C, Langie J, Chen M, Lo YC, Chiang CWK. Imputation accuracy across global human populations. Am J Hum Genet 2024; 111:979-989. [PMID: 38604166 PMCID: PMC11080279 DOI: 10.1016/j.ajhg.2024.03.011] [Received: 10/31/2023] [Revised: 03/14/2024] [Accepted: 03/15/2024] [Indexed: 04/13/2024]
Abstract
Genotype imputation is now fundamental for genome-wide association studies but lacks fairness due to the underrepresentation of references from non-European ancestries. The state-of-the-art imputation reference panel released by the Trans-Omics for Precision Medicine (TOPMed) initiative improved the imputation of admixed African-ancestry and Hispanic/Latino samples, but imputation for populations primarily residing outside of North America may still fall short in performance due to persisting underrepresentation. To illustrate this point, we imputed the genotypes of over 43,000 individuals across 123 populations around the world and identified numerous populations where imputation accuracy paled in comparison to that of European-ancestry populations. For instance, the mean imputation r-squared (Rsq) for variants with minor allele frequencies between 1% and 5% in Saudi Arabians (n = 1,061), Vietnamese (n = 1,264), Thai (n = 2,435), and Papua New Guineans (n = 776) were 0.79, 0.78, 0.76, and 0.62, respectively, compared to 0.90-0.93 for comparable European populations matched in sample size and SNP array content. Outside of Africa and Latin America, Rsq appeared to decrease as genetic distances to European-ancestry reference increased, as predicted. Using sequencing data as ground truth, we also showed that Rsq may over-estimate imputation accuracy for non-European populations more than European populations, suggesting further disparity in accuracy between populations. Using 1,496 sequenced individuals from Taiwan Biobank as a second reference panel to TOPMed, we also assessed a strategy to improve imputation for non-European populations with meta-imputation, but this design did not improve accuracy across frequency spectra. Taken together, our analyses suggest that we must ultimately strive to increase diversity and size to promote equity within genetics research.
Affiliation(s)
- Jordan L Cahoon
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA; Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Xinyue Rui
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
- Echo Tang
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Christopher Simons
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Jalen Langie
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
- Minhui Chen
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
- Ying-Chu Lo
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
- Charleston W K Chiang
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA; Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
3
Kamprasert N, Aliloo H, van der Werf JHJ, Clark SA. Short communication: Accuracy of whole-genome sequence imputation in Angus cattle using within-breed and multi breed reference populations. Animal 2024; 18:101087. [PMID: 38364656 DOI: 10.1016/j.animal.2024.101087] [Received: 10/04/2023] [Revised: 01/16/2024] [Accepted: 01/19/2024] [Indexed: 02/18/2024]
Abstract
Genotype imputation is a standard approach used in the field of genetics. It can be used to fill in missing genotypes or to increase genotype density. Accurate imputed genotypes are required for downstream analyses. In this study, the accuracy of whole-genome sequence imputation for Angus beef cattle was examined using two different ways to form the reference panel, a within-breed reference population and a multi breed reference population. A stepwise imputation was conducted by imputing medium-density (50k) genotypes to high-density, and then to the whole genome sequence (WGS). The reference population consisted of animals with WGS information from the 1 000 Bull Genomes project. The within-breed reference panel comprised 396 Angus cattle, while an additional 2 380 Taurine cattle were added to the reference population for the multi breed reference scenario. Imputation accuracies were variant-wise average accuracies from a 10-fold cross-validation and expressed as concordance rates (CR) and Pearson's correlations (PR). The two imputation scenarios achieved moderate to high imputation accuracies ranging from 0.896 to 0.966 for CR and from 0.779 to 0.834 for PR. The accuracies from two different scenarios were similar, except for PR from WGS imputation, where the within-breed scenario outperformed the multi breed scenario. The result indicated that including a large number of animals from other breeds in the reference panel to impute purebred Angus did not improve the accuracy and may negatively impact the results. In conclusion, the imputed WGS in Angus cattle can be obtained with high accuracy using a within-breed reference panel.
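The two accuracy measures named in this abstract, concordance rate (CR) and Pearson's correlation (PR), can be illustrated with a minimal sketch. It assumes genotypes coded as 0/1/2 allele counts and averages both measures over variants; the function names and the toy data are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a monomorphic variant
    return sxy / sqrt(sxx * syy)


def variant_wise_accuracy(true_geno, imputed_geno):
    """Average per-variant CR and PR.

    Each argument is a list of variants; each variant is a list of
    0/1/2 genotypes, one per animal (an assumed coding for this sketch).
    """
    crs, prs = [], []
    for t, p in zip(true_geno, imputed_geno):
        # Concordance rate: share of animals whose genotype call matches.
        crs.append(sum(a == b for a, b in zip(t, p)) / len(t))
        r = pearson(t, p)
        if r is not None:
            prs.append(r)
    mean_cr = sum(crs) / len(crs)
    mean_pr = sum(prs) / len(prs) if prs else None
    return mean_cr, mean_pr


# Toy data: two variants, four animals (hypothetical values).
true_geno = [[0, 1, 2, 0], [1, 2, 0, 1]]
imputed_geno = [[0, 1, 2, 0], [1, 2, 1, 1]]
mean_cr, mean_pr = variant_wise_accuracy(true_geno, imputed_geno)
```

In this sketch, variants for which the correlation is undefined are skipped when averaging PR but still count towards CR, which is one of several possible conventions.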
Affiliation(s)
- N Kamprasert
  - School of Environmental and Rural Science, University of New England, 2351, Armidale, NSW, Australia
- H Aliloo
  - School of Environmental and Rural Science, University of New England, 2351, Armidale, NSW, Australia
- J H J van der Werf
  - School of Environmental and Rural Science, University of New England, 2351, Armidale, NSW, Australia
- S A Clark
  - School of Environmental and Rural Science, University of New England, 2351, Armidale, NSW, Australia
4
Wang D, Xie K, Wang Y, Hu J, Li W, Yang A, Zhang Q, Ning C, Fan X. Cost-effectively dissecting the genetic architecture of complex wool traits in rabbits by low-coverage sequencing. Genet Sel Evol 2022; 54:75. [PMCID: PMC9673297 DOI: 10.1186/s12711-022-00766-y] [Received: 05/16/2022] [Accepted: 10/31/2022] [Indexed: 11/19/2022]
Abstract
Background: Rabbit wool traits are important in fiber production and for model organism research on hair growth, but their genetic architecture remains obscure. In this study, we focused on wool characteristics in Angora rabbits, a breed well-known for the quality of its wool. Considering the cost to generate population-scale sequence data and the biased detection of variants using chip data, developing an effective genotyping strategy using low-coverage whole-genome sequencing (LCS) data is necessary to conduct genetic analyses.
Results: Different genotype imputation strategies (BaseVar + STITCH, Bcftools + Beagle4, and GATK + Beagle5), sequencing coverages (0.1X, 0.5X, 1.0X, 1.5X, and 2.0X), and sample sizes (100, 200, 300, 400, 500, and 600) were compared. Our results showed that using BaseVar + STITCH at a sequencing depth of 1.0X with a sample size larger than 300 resulted in the highest genotyping accuracy, with a genotype concordance higher than 98.8% and genotype accuracy higher than 0.97. We performed multivariate genome-wide association studies (GWAS), followed by conditional GWAS and estimation of the confidence intervals of quantitative trait loci (QTL) to investigate the genetic architecture of wool traits. Six QTL were detected, which explained 0.4 to 7.5% of the phenotypic variation. Gene-level mapping identified the fibroblast growth factor 10 (FGF10) gene as associated with fiber growth and diameter, which agrees with previous results from functional data analyses on the FGF gene family in other species, and is relevant for wool rabbit breeding.
Conclusions: We suggest that LCS followed by imputation can be a cost-effective alternative to array and high-depth sequencing for assessing common variants. GWAS combined with LCS can identify new QTL and candidate genes that are associated with quantitative traits. This study provides a cost-effective and powerful method for investigating the genetic architecture of complex traits, which will be useful for genomic breeding applications.
Affiliation(s)
- Dan Wang
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China
- Kerui Xie
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China
- Yanyan Wang
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China
- Jiaqing Hu
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China
- Wenqiang Li
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China
- Aiguo Yang
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China
- Qin Zhang
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China
- Chao Ning
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China
- Xinzhong Fan
  - College of Animal Science and Veterinary Medicine, Shandong Agricultural University, Tai’an, China
5
De Marino A, Mahmoud AA, Bose M, Bircan KO, Terpolovsky A, Bamunusinghe V, Bohn S, Khan U, Novković B, Yazdi PG. A comparative analysis of current phasing and imputation software. PLoS One 2022; 17:e0260177. [PMID: 36260643 PMCID: PMC9581364 DOI: 10.1371/journal.pone.0260177] [Received: 11/02/2021] [Accepted: 09/01/2022] [Indexed: 12/02/2022]
Abstract
Whole-genome data has become significantly more accessible over the last two decades. This can largely be attributed to both reduced sequencing costs and imputation models which make it possible to obtain nearly whole-genome data from less expensive genotyping methods, such as microarray chips. Although there are many different approaches to imputation, the Hidden Markov Model (HMM) remains the most widely used. In this study, we compared the latest versions of the most popular HMM-based tools for phasing and imputation: Beagle5.4, Eagle2.4.1, ShapeIT4, Impute5 and Minimac4. We benchmarked them on four input datasets with three levels of chip density. We assessed each imputation software on the basis of accuracy, speed and memory usage, and showed how the choice of imputation accuracy metric can result in different interpretations. The highest average concordance rate was achieved by Beagle5.4, followed by Impute5 and Minimac4, using a reference-based approach during phasing and the highest density chip. IQS and R2 metrics revealed that Impute5 and Minimac4 obtained better results for low frequency markers, while Beagle5.4 remained more accurate for common markers (MAF > 5%). Computational load as measured by run time was lower for Beagle5.4 than for Minimac4 and Impute5, while Minimac4 utilized the least memory of the imputation tools we compared. ShapeIT4 used the least memory of the phasing tools examined with genotype chip data, while Eagle2.4.1 used the least memory when phasing WGS data. Finally, we determined the combination of phasing software, imputation software, and reference panel best suited for different situations and analysis needs, and created an automated pipeline that provides a way for users to create customized chips designed to optimize their imputation results.
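The abstract's point that the choice of accuracy metric changes the interpretation can be made concrete with a small sketch. It contrasts the plain concordance rate with the imputation quality score (IQS), taken here in its commonly used chance-corrected form, IQS = (Po - Pc) / (1 - Pc), where Po is the observed concordance and Pc the agreement expected by chance. The abstract does not state the formula, so this definition, the function names, and the toy genotypes are assumptions for illustration only.

```python
from collections import Counter


def concordance(true_calls, imputed_calls):
    """Proportion of samples whose imputed genotype equals the true genotype."""
    matches = sum(t == i for t, i in zip(true_calls, imputed_calls))
    return matches / len(true_calls)


def iqs(true_calls, imputed_calls):
    """Chance-corrected concordance, (Po - Pc) / (1 - Pc).

    Assumed definition: Pc is the agreement expected if true and imputed
    calls were independent, given their observed genotype frequencies.
    Returns None when Pc equals 1 (both call sets are monomorphic).
    """
    n = len(true_calls)
    po = concordance(true_calls, imputed_calls)
    true_counts = Counter(true_calls)
    imputed_counts = Counter(imputed_calls)
    pc = sum(true_counts[g] * imputed_counts[g] for g in true_counts) / n**2
    if pc == 1:
        return None
    return (po - pc) / (1 - pc)


# Hypothetical rare variant: one heterozygote among ten samples,
# imputed as homozygous reference everywhere.
rare_true = [0] * 9 + [1]
rare_imputed = [0] * 10
rare_concordance = concordance(rare_true, rare_imputed)
rare_iqs = iqs(rare_true, rare_imputed)
```

In this toy case the concordance rate is high because the major genotype dominates, while the chance-corrected score gives no credit for always calling the common genotype, which is why the two metrics can rank tools differently for low-frequency markers.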
Affiliation(s)
- Adriano De Marino
  - Research & Development, SelfDecode, Miami, FL, United States of America
- Madhuchanda Bose
  - Research & Development, SelfDecode, Miami, FL, United States of America
- Sandra Bohn
  - Research & Development, SelfDecode, Miami, FL, United States of America
- Umar Khan
  - Research & Development, SelfDecode, Miami, FL, United States of America
- Biljana Novković
  - Research & Development, SelfDecode, Miami, FL, United States of America
- Puya G. Yazdi
  - Research & Development, SelfDecode, Miami, FL, United States of America
6
Srikanth K, von Pfeil DJF, Stanley BJ, Griffitts C, Huson HJ. Genome Wide Association Study with Imputed Whole Genome Sequence Data Identifies a 431 kb Risk Haplotype on CFA18 for Congenital Laryngeal Paralysis in Alaskan Sled Dogs. Genes (Basel) 2022; 13:genes13101808. [PMID: 36292693 PMCID: PMC9602090 DOI: 10.3390/genes13101808] [Received: 09/01/2022] [Revised: 10/03/2022] [Accepted: 10/04/2022] [Indexed: 11/16/2022]
Abstract
Congenital laryngeal paralysis (CLP) is an inherited disorder that affects the ability of the dog to exercise and precludes it from functioning as a working sled dog. Although CLP has been known to occur in Alaskan sled dogs (ASDs) since 1986, the genetic mutation underlying the disease has not been reported. Using a genome-wide association study (GWAS), we identified a 708 kb region on CFA 18 harboring 226 SNPs that was significantly associated with CLP. The significant SNPs explained 47.06% of the heritability of CLP. We narrowed the region to 431 kb through autozygosity mapping and found 18 of the 20 cases to be homozygous for the risk haplotype. Whole genome sequencing of two cases and a control ASD, and comparison with the genomes of 657 dogs from various breeds, confirmed the homozygous status of the risk haplotype to be unique to the CLP cases. Most of the dogs that were homozygous for the risk allele had blue eyes. Gene annotation and a gene-based association study showed that the risk haplotype encompasses genes implicated in developmental and neurodegenerative disorders. Pathway analysis showed enrichment of glycoprotein and glycosaminoglycan biosynthesis, which play a key role in repairing damaged nerves. In conclusion, our results suggest an important role for the identified candidate region in CLP.
Affiliation(s)
- Krishnamoorthy Srikanth
  - Department of Animal Science, College of Agriculture and Life Science, Cornell University, Ithaca, NY 14850, USA
- Bryden J. Stanley
  - Department of Small Animal Clinical Sciences, Michigan State University, East Lansing, MI 48824, USA
- Heather J. Huson
  - Department of Animal Science, College of Agriculture and Life Science, Cornell University, Ithaca, NY 14850, USA
7
Nawaz MY, Bernardes PA, Savegnago RP, Lim D, Lee SH, Gondro C. Evaluation of Whole-Genome Sequence Imputation Strategies in Korean Hanwoo Cattle. Animals (Basel) 2022; 12:ani12172265. [PMID: 36077985 PMCID: PMC9454883 DOI: 10.3390/ani12172265] [Received: 07/01/2022] [Revised: 08/25/2022] [Accepted: 08/30/2022] [Indexed: 11/29/2022]
Abstract
Simple Summary: In this study, we evaluated various imputation strategies for the Korean Hanwoo cattle. We observed that a large reference panel consisting of many cattle breeds did not improve the imputation accuracy when compared to a proportionally small purebred Hanwoo reference. This was because the multi-breed reference did not contain animals sufficiently related to the Hanwoo to improve the accuracies and, although not detrimental, in effect, only added to the computational burden of the imputation. Despite the large multi-breed reference, when the Hanwoo were removed from the reference, the imputation accuracies were low. These results suggest additional sequencing efforts are needed for underrepresented breeds, particularly those less genetically related to the main European breeds.
Abstract: This study evaluated the accuracy of sequence imputation in Hanwoo beef cattle using different reference panels: a large multi-breed reference with no Hanwoo (n = 6269), a much smaller Hanwoo purebred reference (n = 88), and both datasets combined (n = 6357). The target animals were 136 cattle both sequenced and genotyped with the Illumina BovineSNP50 v2 (50K). The average imputation accuracy measured by the Pearson correlation (R) was 0.695 with the multi-breed reference, 0.876 with the purebred Hanwoo, and 0.887 with the combined data; the average concordance rates (CR) were 88.16%, 94.49%, and 94.84%, respectively. The accuracy gain from adding a large multi-breed reference of 6269 samples to only 88 Hanwoo was marginal; however, the concordance rate for the heterozygotes decreased from 85% to 82%, and the concordance rate for fixed SNPs in Hanwoo also decreased from 99.98% to 98.73%. Although the multi-breed panel was large, it was not sufficiently representative of the breed for accurate imputation without the Hanwoo animals. Additionally, we evaluated the value of high-density 700K genotypes (n = 991) as an intermediary step in the imputation process. The imputation accuracy differences were negligible between a single-step imputation strategy from 50K directly to sequence and a two-step imputation approach (50K-700K-sequence). We also observed that imputed sequence data can be used as a reference panel for imputation (mean R = 0.9650, mean CR = 98.35%). Finally, we identified 31 poorly imputed genomic regions in the Hanwoo genome and demonstrated that imputation accuracies were particularly low at the chromosomal ends.
Affiliation(s)
- Muhammad Yasir Nawaz
  - Genetics and Genome Sciences Graduate Program, Michigan State University, East Lansing, MI 48824, USA
  - Corresponding author
- Priscila Arrigucci Bernardes
  - Department of Animal Science and Rural Development, Federal University of Santa Catarina, Florianopolis 88034-000, SC, Brazil
- Dajeong Lim
  - Animal Genome & Bioinformatics Division, National Institute of Animal Science, RDA, Wanju 55365, Korea
- Seung Hwan Lee
  - Division of Animal and Dairy Science, Chungnam National University, Daejeon 305764, Korea
- Cedric Gondro
  - Department of Animal Science, Michigan State University, East Lansing, MI 48824, USA
  - Corresponding author
8
Genotype imputation and polygenic score estimation in northwestern Russian population. PLoS One 2022; 17:e0269434. [PMID: 35763490 PMCID: PMC9239469 DOI: 10.1371/journal.pone.0269434] [Received: 11/26/2021] [Accepted: 05/21/2022] [Indexed: 11/19/2022]
Abstract
Numerous studies have demonstrated the lack of transferability of polygenic score (PGS) models across populations and the problems arising from unequal representation of ancestries across genetic studies. However, even within European ancestry there are ethnic groups that are rarely represented in genetic studies. For instance, Russians are one of the largest, diverse, and yet understudied groups in Europe. In this study, we evaluated the reliability of genotype imputation for the Russian cohort by testing several commonly used imputation reference panels (e.g. HRC, 1000G, HGDP). HRC, in comparison with the two other panels, showed the most accurate results based on both imputation accuracy and allele frequency concordance between masked and imputed genotypes. We built polygenic score models based on GWAS results from the UK Biobank, measured the explained phenotypic variance in the Russian cohort attributed to polygenic scores for 11 phenotypes collected in the clinic for each participant, and finally explored the role of allele frequency discordance between the UK Biobank and the study cohort in the resulting PGS performance.
9
Chen D, Tashman K, Palmer DS, Neale B, Roeder K, Bloemendal A, Churchhouse C, Ke ZT. A data harmonization pipeline to leverage external controls and boost power in GWAS. Hum Mol Genet 2021; 31:481-489. [PMID: 34508597 PMCID: PMC8825237 DOI: 10.1093/hmg/ddab261] [Received: 01/22/2021] [Revised: 09/02/2021] [Accepted: 09/03/2021] [Indexed: 11/12/2022]
Abstract
The use of external controls in genome-wide association studies (GWAS) can significantly increase the size and diversity of the control sample, enabling high-resolution ancestry matching and enhancing the power to detect association signals. However, the aggregation of controls from multiple sources is challenging due to batch effects, difficulty in identifying genotyping errors, and the use of different genotyping platforms. These obstacles have impeded the use of external controls in GWAS and can lead to spurious results if not carefully addressed. We propose a unified data harmonization pipeline that includes an iterative approach to quality control (QC) and imputation, implemented before and after merging cohorts and arrays. We apply this harmonization pipeline to aggregate 27 517 European control samples from 16 collections within dbGaP. We leverage these harmonized controls to conduct a GWAS of Crohn's disease. We demonstrate a boost in power over using the cohort samples alone, and show that our procedure results in summary statistics free of any significant batch effects. This harmonization pipeline for aggregating genotype data from multiple sources can also serve other applications where individual-level genotypes, rather than summary statistics, are required.
Affiliation(s)
- Danfeng Chen
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, 08544, New Jersey, United States
- Katherine Tashman
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States
- Duncan S Palmer
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States
- Benjamin Neale
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States; Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, 02142, Massachusetts, United States
- Kathryn Roeder
  - Department of Statistics, Carnegie Mellon University, Pittsburgh, 15213, Pennsylvania, United States
- Alex Bloemendal
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States
- Claire Churchhouse
  - Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States; Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, 02142, Massachusetts, United States
- Zheng Tracy Ke
  - Department of Statistics, Harvard University, Cambridge, 02138, Massachusetts, United States
10
Aleknonytė-Resch M, Szymczak S, Freitag-Wolf S, Dempfle A, Krawczak M. Genotype imputation in case-only studies of gene-environment interaction: validity and power. Hum Genet 2021; 140:1217-1228. [PMID: 34041609 PMCID: PMC8263402 DOI: 10.1007/s00439-021-02294-z] [Received: 02/24/2021] [Accepted: 05/10/2021] [Indexed: 11/26/2022]
Abstract
Case-only (CO) studies are a powerful means to uncover gene-environment (G × E) interactions for complex human diseases. Moreover, such studies may in principle also draw upon genotype imputation to increase statistical power even further. However, genotype imputation usually employs healthy controls such as the Haplotype Reference Consortium (HRC) data as an imputation base, which may systematically perturb CO studies in genomic regions with main effects upon disease risk. Using genotype data from 719 German Crohn Disease (CD) patients, we investigated the level of imputation accuracy achievable for single nucleotide polymorphisms (SNPs) with or without a genetic main effect, and with varying minor allele frequency (MAF). Genotypes were imputed from neighbouring SNPs at different levels of linkage disequilibrium (LD) to the target SNP using the HRC data as an imputation base. Comparison of the true and imputed genotypes revealed lower imputation accuracy for SNPs with strong main effects. We also simulated different levels of G × E interaction to evaluate the potential loss of statistical validity and power incurred by the use of imputed genotypes. Simulations under the null hypothesis revealed that genotype imputation does not inflate the type I error rate of CO studies of G × E. However, the statistical power was found to be reduced by imputation, particularly for SNPs with low MAF, and a gradual loss of statistical power resulted when the level of LD to the SNPs driving the imputation decreased. Our study thus highlights that genotype imputation should be employed with great care in CO studies of G × E interaction.
Affiliation(s)
- Silke Szymczak
  - Institute of Medical Informatics and Statistics, Kiel University, Kiel, Germany
  - Institute of Medical Biometry and Statistics, University of Lübeck, Lübeck, Germany
- Sandra Freitag-Wolf
  - Institute of Medical Informatics and Statistics, Kiel University, Kiel, Germany
- Astrid Dempfle
  - Institute of Medical Informatics and Statistics, Kiel University, Kiel, Germany
- Michael Krawczak
  - Institute of Medical Informatics and Statistics, Kiel University, Kiel, Germany
11
Torkamaneh D, Belzile F. Accurate Imputation of Untyped Variants from Deep Sequencing Data. Methods Mol Biol 2021; 2243:271-281. [PMID: 33606262 DOI: 10.1007/978-1-0716-1103-6_13] [Indexed: 03/09/2023]
Abstract
The quality, statistical power, and resolution of genome-wide association studies (GWAS) are largely dependent on the comprehensiveness of genotypic data. Over the last few years, despite the constant decrease in the price of sequencing, whole-genome sequencing (WGS) of association panels comprising a large number of samples has remained cost-prohibitive. Therefore, most GWAS populations are still genotyped using low-coverage genotyping methods, resulting in incomplete datasets. Imputation of untyped variants is a powerful method to maximize the number of SNPs identified in study samples; it increases the power and resolution of GWAS and allows genotyping datasets obtained from various sources to be integrated. Here, we describe the key concepts underlying imputation of untyped variants, including the architecture of reference panels, and review some of the associated challenges and how these can be addressed. We also discuss the need and available methods to rigorously assess the accuracy of imputed data prior to their use in any genetic study.
Affiliation(s)
- Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec City, QC, Canada
- Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada

- François Belzile
- Département de Phytologie, Université Laval, Québec City, QC, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec City, QC, Canada
|
12
|
Torkamaneh D, Laroche J, Valliyodan B, O'Donoughue L, Cober E, Rajcan I, Vilela Abdelnoor R, Sreedasyam A, Schmutz J, Nguyen HT, Belzile F. Soybean (Glycine max) Haplotype Map (GmHapMap): a universal resource for soybean translational and functional genomics. PLANT BIOTECHNOLOGY JOURNAL 2021; 19:324-334. [PMID: 32794321 DOI: 10.1101/534578] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/02/2019] [Revised: 07/24/2020] [Accepted: 08/07/2020] [Indexed: 05/27/2023]
Abstract
Here, we describe a worldwide haplotype map for soybean (GmHapMap) constructed using whole-genome sequence data for 1007 Glycine max accessions and yielding 14.9 million variants as well as 4.3 M tag single-nucleotide polymorphisms (SNPs). When sampling random subsets of these accessions, the number of variants and tag SNPs plateaued beyond approximately 800 and 600 accessions, respectively. This suggests extensive coverage of diversity within the cultivated soybean. GmHapMap variants were imputed onto 21 618 previously genotyped accessions with up to 96% success for common alleles. A local association analysis was performed with the imputed data using markers located in a 1-Mb region known to contribute to seed oil content and enabled us to identify a candidate causal SNP residing in the NPC1 gene. We determined gene-centric haplotypes (407 867 GCHs) for the 55 589 genes and showed that such haplotypes can help to identify alleles that differ in the resulting phenotype. Finally, we predicted 18 031 putative loss-of-function (LOF) mutations in 10 662 genes and illustrated how such a resource can be used to explore gene function. The GmHapMap provides a unique worldwide resource for applied soybean genomics and breeding.
Affiliation(s)
- Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec City, QC, Canada
- Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada

- Jérôme Laroche
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec City, QC, Canada

- Babu Valliyodan
- National Center for Soybean Biotechnology and Division of Plant Sciences, University of Missouri, Columbia, MO, USA

- Louise O'Donoughue
- CÉROM, Centre de recherche Sur Les Grains Inc., Saint-Mathieu de Beloeil, QC, Canada

- Elroy Cober
- Agriculture and Agri-Food Canada, Ottawa, ON, Canada

- Istvan Rajcan
- Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada

- Ricardo Vilela Abdelnoor
- Brazilian Corporation of Agricultural Research (Embrapa Soja), Warta County, PR, Brazil
- Londrina State University (UEL), Londrina, PR, Brazil

- Jeremy Schmutz
- Institute for Biotechnology, HudsonAlpha, Huntsville, AL, USA
- Department of Energy, Joint Genome Institute, Walnut Creek, CA, USA

- Henry T Nguyen
- National Center for Soybean Biotechnology and Division of Plant Sciences, University of Missouri, Columbia, MO, USA

- François Belzile
- Département de Phytologie, Université Laval, Québec City, QC, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec City, QC, Canada
|
13
|
Torkamaneh D, Laroche J, Valliyodan B, O’Donoughue L, Cober E, Rajcan I, Vilela Abdelnoor R, Sreedasyam A, Schmutz J, Nguyen HT, Belzile F. Soybean (Glycine max) Haplotype Map (GmHapMap): a universal resource for soybean translational and functional genomics. PLANT BIOTECHNOLOGY JOURNAL 2021; 19:324-334. [PMID: 32794321 PMCID: PMC7868971 DOI: 10.1111/pbi.13466] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/02/2019] [Revised: 07/24/2020] [Accepted: 08/07/2020] [Indexed: 05/10/2023]
Abstract
Here, we describe a worldwide haplotype map for soybean (GmHapMap) constructed using whole-genome sequence data for 1007 Glycine max accessions and yielding 14.9 million variants as well as 4.3 M tag single-nucleotide polymorphisms (SNPs). When sampling random subsets of these accessions, the number of variants and tag SNPs plateaued beyond approximately 800 and 600 accessions, respectively. This suggests extensive coverage of diversity within the cultivated soybean. GmHapMap variants were imputed onto 21 618 previously genotyped accessions with up to 96% success for common alleles. A local association analysis was performed with the imputed data using markers located in a 1-Mb region known to contribute to seed oil content and enabled us to identify a candidate causal SNP residing in the NPC1 gene. We determined gene-centric haplotypes (407 867 GCHs) for the 55 589 genes and showed that such haplotypes can help to identify alleles that differ in the resulting phenotype. Finally, we predicted 18 031 putative loss-of-function (LOF) mutations in 10 662 genes and illustrated how such a resource can be used to explore gene function. The GmHapMap provides a unique worldwide resource for applied soybean genomics and breeding.
Affiliation(s)
- Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec City, QC, Canada
- Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada

- Jérôme Laroche
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec City, QC, Canada

- Babu Valliyodan
- National Center for Soybean Biotechnology and Division of Plant Sciences, University of Missouri, Columbia, MO, USA

- Louise O’Donoughue
- CÉROM, Centre de recherche Sur Les Grains Inc., Saint‐Mathieu de Beloeil, QC, Canada

- Elroy Cober
- Agriculture and Agri‐Food Canada, Ottawa, ON, Canada

- Istvan Rajcan
- Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada

- Ricardo Vilela Abdelnoor
- Brazilian Corporation of Agricultural Research (Embrapa Soja), Warta County, PR, Brazil
- Londrina State University (UEL), Londrina, PR, Brazil

- Jeremy Schmutz
- Institute for Biotechnology, HudsonAlpha, Huntsville, AL, USA
- Department of Energy, Joint Genome Institute, Walnut Creek, CA, USA

- Henry T. Nguyen
- National Center for Soybean Biotechnology and Division of Plant Sciences, University of Missouri, Columbia, MO, USA

- François Belzile
- Département de Phytologie, Université Laval, Québec City, QC, Canada
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec City, QC, Canada
|
14
|
Chen SF, Dias R, Evans D, Salfati EL, Liu S, Wineinger NE, Torkamani A. Genotype imputation and variability in polygenic risk score estimation. Genome Med 2020; 12:100. [PMID: 33225976 PMCID: PMC7682022 DOI: 10.1186/s13073-020-00801-x] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2020] [Accepted: 11/09/2020] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND: Polygenic risk scores (PRSs) are a summarization of an individual's genetic risk for a disease or trait. These scores are being generated in research and commercial settings to study how they may be used to guide healthcare decisions. PRSs should be updated as genetic knowledgebases improve; however, no guidelines exist for their generation or updating. METHODS: Here, we characterize the variability introduced in PRS calculation by a common computational process used in their generation: genotype imputation. We evaluated PRS variability when performing genotype imputation using 3 different pre-phasing tools (Beagle, Eagle, SHAPEIT) and 2 different imputation tools (Beagle, Minimac4), relative to a WGS-based gold standard. Fourteen different PRSs spanning different disease architectures and PRS generation approaches were evaluated. RESULTS: We find that genotype imputation can introduce variability in calculated PRSs at the individual level without any change to the underlying genetic model. The degree of variability introduced by genotype imputation differs across algorithms, where pre-phasing algorithms with stochastic elements introduce the greatest degree of score variability. In most cases, PRS variability due to imputation is minor (< 5 percentile rank change) and does not influence the interpretation of the score. PRS percentile fluctuations are also reduced in the more informative tails of the PRS distribution. However, in rare instances, PRS instability at the individual level can result in singular PRS calculations that differ substantially from a whole genome sequence-based gold standard score. CONCLUSIONS: Our study highlights some challenges in applying population genetics tools to individual-level genetic analysis including return of results. Rare individual-level variability events are masked by a high degree of overall score reproducibility at the population level. In order to avoid PRS result fluctuations during updates, we suggest that deterministic imputation processes or the average of multiple iterations of stochastic imputation processes be used to generate and deliver PRS results.
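The closing suggestion of this abstract, averaging over multiple iterations of a stochastic imputation process, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats a PRS as a simple additive score (a weighted sum of effect-allele dosages), which is just one of the PRS generation approaches the study covers, and the function names, weights, and dosages are hypothetical rather than taken from the paper.

```python
from statistics import mean


def polygenic_score(dosages, weights):
    """Additive PRS: sum of effect-allele dosage times per-variant weight.

    Assumption: a plain weighted sum; real PRS pipelines may differ.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def averaged_score(dosage_runs, weights):
    """Mean PRS for one individual over repeated stochastic imputation runs."""
    return mean(polygenic_score(run, weights) for run in dosage_runs)


# Hypothetical example: three variants, two imputation runs that disagree
# on the dosage of the second variant.
example_weights = [0.5, -0.2, 0.1]
example_runs = [
    [1.0, 0.0, 2.0],
    [1.0, 1.0, 2.0],
]
print(averaged_score(example_runs, example_weights))
```

Averaging damps run-to-run fluctuation in the delivered score but does not remove any systematic difference from a sequencing-based score.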
Affiliation(s)
- Shang-Fu Chen
- Scripps Research Translational Institute, Scripps Research, La Jolla, CA, 92037, USA
- Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, 92037, USA

- Raquel Dias
- Scripps Research Translational Institute, Scripps Research, La Jolla, CA, 92037, USA
- Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, 92037, USA

- Doug Evans
- Scripps Research Translational Institute, Scripps Research, La Jolla, CA, 92037, USA
- Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, 92037, USA

- Elias L Salfati
- Scripps Research Translational Institute, Scripps Research, La Jolla, CA, 92037, USA
- Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, 92037, USA

- Shuchen Liu
- Scripps Research Translational Institute, Scripps Research, La Jolla, CA, 92037, USA
- Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, 92037, USA

- Nathan E Wineinger
- Scripps Research Translational Institute, Scripps Research, La Jolla, CA, 92037, USA
- Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, 92037, USA

- Ali Torkamani
- Scripps Research Translational Institute, Scripps Research, La Jolla, CA, 92037, USA
- Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, 92037, USA
|
15
|
Hermisdorff IDC, Costa RB, de Albuquerque LG, Pausch H, Kadri NK. Investigating the accuracy of imputing autosomal variants in Nellore cattle using the ARS-UCD1.2 assembly of the bovine genome. BMC Genomics 2020; 21:772. [PMID: 33167856 PMCID: PMC7654006 DOI: 10.1186/s12864-020-07184-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2020] [Accepted: 10/26/2020] [Indexed: 11/22/2022] Open
Abstract
Background: Imputation accuracy among other things depends on the size of the reference panel, the marker’s minor allele frequency (MAF), and the correct placement of single nucleotide polymorphism (SNP) on the reference genome assembly. Using high-density genotypes of 3938 Nellore cattle from Brazil, we investigated the accuracy of imputation from 50 K to 777 K SNP density using Minimac3, when map positions were determined according to the bovine genome assemblies UMD3.1 and ARS-UCD1.2. We assessed the effect of reference and target panel sizes on the pre-phasing based imputation quality using ten-fold cross-validation. Further, we compared the reliability of the model-based imputation quality score (Rsq) from Minimac3 to the empirical imputation accuracy. Results: The overall accuracy of imputation measured as the squared correlation between true and imputed allele dosages (R2dose) was almost identical using either the UMD3.1 or ARS-UCD1.2 genome assembly. When the size of the reference panel increased from 250 to 2000, R2dose increased from 0.845 to 0.917, and the number of polymorphic markers in the imputed data set increased from 586,701 to 618,660. Advantages in both accuracy and marker density were also observed when larger target panels were imputed, likely resulting from more accurate haplotype inference. Imputation accuracy increased from 0.903 to 0.913, and the marker density in the imputed data increased from 593,239 to 595,570 when haplotypes were inferred in 500 and 2900 target animals. The model-based imputation quality scores from Minimac3 (Rsq) were systematically higher than empirically estimated accuracies. However, both metrics were positively correlated and the correlation increased with the size of the reference panel and MAF of imputed variants. Conclusions: Accurate imputation of BovineHD BeadChip markers is possible in Nellore cattle using the new bovine reference genome assembly ARS-UCD1.2. The use of large reference and target panels improves the accuracy of the imputed genotypes and provides genotypes for more markers segregating at low frequency for downstream genomic analyses. The model-based imputation quality score from Minimac3 (Rsq) can be used to detect poorly imputed variants but its reliability depends on the size of the reference panel and MAF of the imputed variants. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-020-07184-8.
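The empirical accuracy measure named in this abstract, the squared correlation between true and imputed allele dosages (R2dose), can be illustrated with a minimal Python sketch for a single marker. The function name `dosage_r2` and the example dosages are hypothetical and not taken from the study; Minimac3's model-based Rsq is a different quantity and is not computed here.

```python
from math import fsum, isnan


def dosage_r2(true_dosages, imputed_dosages):
    """Squared Pearson correlation between true and imputed allele dosages."""
    n = len(true_dosages)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_true = fsum(true_dosages) / n
    mean_imp = fsum(imputed_dosages) / n
    cov = fsum((t - mean_true) * (i - mean_imp)
               for t, i in zip(true_dosages, imputed_dosages))
    var_true = fsum((t - mean_true) ** 2 for t in true_dosages)
    var_imp = fsum((i - mean_imp) ** 2 for i in imputed_dosages)
    if var_true == 0 or var_imp == 0:
        # Monomorphic in one of the two sets: the correlation is undefined.
        return float("nan")
    return cov * cov / (var_true * var_imp)


# Hypothetical marker: true genotypes coded 0/1/2, imputed dosages in [0, 2].
true_d = [0, 1, 2, 1, 0, 2]
imputed_d = [0.1, 0.9, 1.8, 1.2, 0.0, 1.9]
r2 = dosage_r2(true_d, imputed_d)
print("undefined" if isnan(r2) else round(r2, 3))
```

An overall accuracy figure would then be a summary of this per-marker value across markers; how the study aggregated it is not stated in the abstract.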
Affiliation(s)
- Isis da Costa Hermisdorff
- School of Veterinary Medicine and Animal Science, Federal University of Bahia (UFBA), Salvador, Brazil
- Animal Genomics, ETH Zurich, Zurich, Switzerland

- Raphael Bermal Costa
- School of Veterinary Medicine and Animal Science, Federal University of Bahia (UFBA), Salvador, Brazil

- Lucia Galvão de Albuquerque
- Animal Science Department, School of Agricultural and Veterinary Sciences, São Paulo State University (Unesp), Jaboticabal, São Paulo, Brazil
|
16
|
Rowan TN, Hoff JL, Crum TE, Taylor JF, Schnabel RD, Decker JE. A multi-breed reference panel and additional rare variants maximize imputation accuracy in cattle. Genet Sel Evol 2019; 51:77. [PMID: 31878893 PMCID: PMC6933688 DOI: 10.1186/s12711-019-0519-x] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2019] [Accepted: 12/16/2019] [Indexed: 01/08/2023] Open
Abstract
Background: During the last decade, the use of common-variant array-based single nucleotide polymorphism (SNP) genotyping in the beef and dairy industries has produced an astounding amount of medium-to-low density genomic data. Although low-density assays work well in the context of genomic prediction, they are less useful for detecting and mapping causal variants and the effects of rare variants are not captured. The objective of this project was to maximize the accuracies of genotype imputation from medium- and low-density assays to the marker set obtained by combining two high-density research assays (~850,000 SNPs), the Illumina BovineHD and the GGP-F250 assays, which contains a large proportion of rare and potentially functional variants and for which the assay design is described here. This 850 K SNP set is useful for both imputation to sequence-level genotypes and direct downstream analysis. Results: We found that a large multi-breed composite imputation reference panel that includes 36,131 samples with either BovineHD and/or GGP-F250 genotypes significantly increased imputation accuracy compared with a within-breed reference panel, particularly at variants with low minor allele frequencies. Individual animal imputation accuracies were maximized when more genetically similar animals were represented in the composite reference panel, particularly with complete 850 K genotypes. The addition of rare variants from the GGP-F250 assay to our composite reference panel significantly increased the imputation accuracy of rare variants that are exclusively present on the BovineHD assay. In addition, we show that an assay marker density of 50 K SNPs balances cost and accuracy for imputation to 850 K. Conclusions: Using high-density genotypes on all available individuals in a multi-breed reference panel maximized imputation accuracy for tested cattle populations. Admixed animals or those from breeds with a limited representation in the composite reference panel were still imputed at high accuracy, which is expected to further increase as the reference panel expands. We anticipate that the addition of rare variants from the GGP-F250 assay will increase the accuracy of imputation to sequence level.
Affiliation(s)
- Troy N Rowan
- Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA

- Jesse L Hoff
- Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA

- Tamar E Crum
- Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA

- Jeremy F Taylor
- Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA

- Robert D Schnabel
- Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA
- Informatics Institute, University of Missouri, Columbia, MO, 65211, USA

- Jared E Decker
- Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA
- Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
|
17
|
Saccone NL, Emery LS, Sofer T, Gogarten SM, Becker DM, Bottinger EP, Chen LS, Culverhouse RC, Duan W, Hancock DB, Hosgood HD, Johnson EO, Loos RJF, Louie T, Papanicolaou G, Perreira KM, Rodriquez EJ, Schurmann C, Stilp AM, Szpiro AA, Talavera GA, Taylor KD, Thrasher JF, Yanek LR, Laurie CC, Pérez-Stable EJ, Bierut LJ, Kaplan RC. Genome-Wide Association Study of Heavy Smoking and Daily/Nondaily Smoking in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL). Nicotine Tob Res 2019; 20:448-457. [PMID: 28520984 DOI: 10.1093/ntr/ntx107] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2016] [Accepted: 05/11/2017] [Indexed: 02/07/2023]
Abstract
Introduction: Genetic variants associated with nicotine dependence have previously been identified, primarily in European-ancestry populations. No genome-wide association studies (GWAS) have been reported for smoking behaviors in Hispanics/Latinos in the United States and Latin America, who are of mixed ancestry with European, African, and American Indigenous components. Methods: We examined genetic associations with smoking behaviors in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL) (N = 12 741 with smoking data, 5119 ever-smokers), using ~2.3 million genotyped variants imputed to the 1000 Genomes Project phase 3. Mixed logistic regression models accounted for population structure, sampling, relatedness, sex, and age. Results: The known region of CHRNA5, which encodes the α5 cholinergic nicotinic receptor subunit, was associated with heavy smoking at genome-wide significance (p ≤ 5 × 10⁻⁸) in a comparison of 1929 ever-smokers reporting cigarettes per day (CPD) > 10 versus 3156 reporting CPD ≤ 10. The functional variant rs16969968 in CHRNA5 had a p value of 2.20 × 10⁻⁷ and odds ratio (OR) of 1.32 for the minor allele (A); its minor allele frequency was 0.22 overall and similar across Hispanic/Latino background groups (Central American = 0.17; South American = 0.19; Mexican = 0.18; Puerto Rican = 0.22; Cuban = 0.29; Dominican = 0.19). CHRNA4 on chromosome 20 attained p < 10⁻⁴, supporting prior findings in non-Hispanics. For nondaily smoking, which is prevalent in Hispanic/Latino smokers, compared to daily smoking, loci on chromosomes 2 and 4 achieved genome-wide significance; replication attempts were limited by small Hispanic/Latino sample sizes. Conclusions: Associations of nicotinic receptor gene variants with smoking, first reported in non-Hispanic European-ancestry populations, generalized to Hispanics/Latinos despite different patterns of smoking behavior. Implications: We conducted the first large-scale genome-wide association study (GWAS) of smoking behavior in a US Hispanic/Latino cohort, and the first GWAS of daily/nondaily smoking in any population. Results show that the region of the nicotinic receptor subunit gene CHRNA5, which in non-Hispanic European-ancestry smokers has been associated with heavy smoking as well as cessation and treatment efficacy, is also significantly associated with heavy smoking in this Hispanic/Latino cohort. The results are an important addition to understanding the impact of genetic variants in understudied Hispanic/Latino smokers.
Affiliation(s)
- Nancy L Saccone
- Department of Genetics, Washington University School of Medicine, St. Louis, MO

- Leslie S Emery
- Department of Biostatistics, University of Washington, Seattle, WA

- Tamar Sofer
- Department of Biostatistics, University of Washington, Seattle, WA

- Diane M Becker
- GeneSTAR Research Program, Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD

- Erwin P Bottinger
- Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY

- Li-Shiun Chen
- Department of Psychiatry, Washington University School of Medicine, St. Louis, MO

- Weimin Duan
- Department of Genetics, Washington University School of Medicine, St. Louis, MO

- Dana B Hancock
- Behavioral and Urban Health Program, Behavioral Health and Criminal Justice Division, RTI International, Research Triangle Park, NC

- H Dean Hosgood
- Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, NY

- Eric O Johnson
- Fellow Program and Behavioral Health and Criminal Justice Division, RTI International, Research Triangle Park, NC

- Ruth J F Loos
- Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY

- Tin Louie
- Department of Biostatistics, University of Washington, Seattle, WA

- George Papanicolaou
- Division of Cardiovascular Sciences, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD

- Krista M Perreira
- Department of Public Policy, University of North Carolina at Chapel Hill, Chapel Hill, NC

- Erik J Rodriquez
- National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD
- Division of General Internal Medicine, University of California, San Francisco, San Francisco, CA

- Claudia Schurmann
- Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY

- Adrienne M Stilp
- Department of Biostatistics, University of Washington, Seattle, WA

- Adam A Szpiro
- Department of Biostatistics, University of Washington, Seattle, WA

- Gregory A Talavera
- Graduate School of Public Health, San Diego State University, San Diego, CA

- Kent D Taylor
- Institute for Translational Genomics and Population Sciences, Los Angeles Biomedical Research Institute and Department of Pediatrics, Harbor-UCLA Medical Center, Torrance, CA

- James F Thrasher
- Department of Health Promotion, Education and Behavior, University of South Carolina, Columbia, SC

- Lisa R Yanek
- GeneSTAR Research Program, Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD

- Cathy C Laurie
- Department of Biostatistics, University of Washington, Seattle, WA

- Eliseo J Pérez-Stable
- National Institute on Minority Health and Health Disparities, National Institutes of Health, Bethesda, MD

- Laura J Bierut
- Department of Psychiatry, Washington University School of Medicine, St. Louis, MO

- Robert C Kaplan
- Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, NY
|
18
|
Ullah E, Mall R, Abbas MM, Kunji K, Nato AQ, Bensmail H, Wijsman EM, Saad M. Comparison and assessment of family- and population-based genotype imputation methods in large pedigrees. Genome Res 2018; 29:125-134. [PMID: 30514702 PMCID: PMC6314157 DOI: 10.1101/gr.236315.118] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2018] [Accepted: 11/30/2018] [Indexed: 01/19/2023]
Abstract
Genotype imputation is widely used in genome-wide association studies to boost variant density, allowing increased power in association testing. Many studies currently include pedigree data due to increasing interest in rare variants coupled with the availability of appropriate analysis tools. The performance of population-based (subjects are unrelated) imputation methods is well established. However, the performance of family- and population-based imputation methods on family data has been subject to much less scrutiny. Here, we extensively compare several family- and population-based imputation methods on family data of large pedigrees with both European and African ancestry. Our comparison includes many widely used family- and population-based tools and another method, Ped_Pop, which combines family- and population-based imputation results. We also compare four subject selection strategies for full sequencing to serve as the reference panel for imputation: GIGI-Pick, ExomePicks, PRIMUS, and random selection. Moreover, we compare two imputation accuracy metrics: the Imputation Quality Score and Pearson's correlation R² for predicting power of association analysis using imputation results. Our results show that (1) GIGI outperforms Merlin; (2) family-based imputation outperforms population-based imputation for rare variants but not for common ones; (3) combining family- and population-based imputation outperforms all imputation approaches for all minor allele frequencies; (4) GIGI-Pick gives the best selection strategy based on the R² criterion; and (5) R² is the best measure of imputation accuracy. Our study is the first to extensively evaluate the imputation performance of many available family- and population-based tools on the same family data and provides guidelines for future studies.
Affiliation(s)
- Ehsan Ullah
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar

- Raghvendra Mall
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar

- Mostafa M Abbas
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar

- Khalid Kunji
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar

- Alejandro Q Nato
- Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, Washington 98195-9460, USA
- Department of Biomedical Sciences, Joan C. Edwards School of Medicine, Marshall University, Huntington, West Virginia 25755, USA

- Halima Bensmail
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar

- Ellen M Wijsman
- Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, Washington 98195-9460, USA
- Department of Biostatistics, University of Washington, Seattle, Washington 98195-9460, USA

- Mohamad Saad
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
|
19
|
Vergara C, Parker MM, Franco L, Cho MH, Valencia-Duarte AV, Beaty TH, Duggal P. Genotype imputation performance of three reference panels using African ancestry individuals. Hum Genet 2018; 137:281-292. [PMID: 29637265 PMCID: PMC6209094 DOI: 10.1007/s00439-018-1881-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2018] [Accepted: 03/31/2018] [Indexed: 12/22/2022]
Abstract
Genotype imputation estimates unobserved genotypes from genome-wide markers, to increase genome coverage and power for genome-wide association studies. Imputation has been successful for European ancestry populations in which very large reference panels are available. Smaller subsets of African descent populations are available in 1000 Genomes (1000G), the Consortium on Asthma among African ancestry Populations in the Americas (CAAPA) and the Haplotype Reference Consortium (HRC). We compared the performance of these reference panels when imputing variation in 3747 African Americans (AA) from two cohorts (HCV and COPDGene) genotyped using Illumina Omni microarrays. The haplotypes of 2504 (1000G), 883 (CAAPA) and 32,470 individuals (HRC) were used as reference. We compared the number of variants, imputation quality, imputation accuracy and coverage between panels. In both cohorts, 1000G imputed 1.5-1.6× more variants than CAAPA and 1.2× more than HRC. Similar findings were observed for variants with imputation R² > 0.5 and for rare, low-frequency, and common variants. When merging imputed variants of the three panels, the total number was 62-63 M with 20 M overlapping variants imputed by all three panels, and a range of 5-15 M variants imputed exclusively with one of them. For overlapping variants, imputation quality was highest for HRC, followed by 1000G, then CAAPA, and improved as the minor allele frequency increased. 1000G, HRC and CAAPA provided high performance and accuracy for imputation of African American individuals, increasing the number of variants available for subsequent analyses. These panels are complementary and would benefit from the development of an integrated African reference panel.
Collapse
Affiliation(s)
- Margaret M Parker
- Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Liliana Franco
- National School of Public Health, Universidad de Antioquia, Medellín, Colombia
- School of Medicine, Universidad Pontificia Bolivariana, Medellín, Colombia
- Michael H Cho
- Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Terri H Beaty
- Johns Hopkins University, Bloomberg School of Public Health, Baltimore, MD, USA
- Priya Duggal
- Johns Hopkins University, Bloomberg School of Public Health, Baltimore, MD, USA.
|
20
|
Inclusion of Population-specific Reference Panel from India to the 1000 Genomes Phase 3 Panel Improves Imputation Accuracy. Sci Rep 2017; 7:6733. [PMID: 28751670 PMCID: PMC5532257 DOI: 10.1038/s41598-017-06905-6] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 01/30/2017] [Accepted: 06/20/2017] [Indexed: 12/23/2022]
Abstract
Imputation is a computational method based on the principle of haplotype sharing that allows enrichment of genome-wide association study datasets. It depends on the haplotype structure of the population and the density of the genotype data. The 1000 Genomes Project led to the generation of imputation reference panels which have been used globally. However, recent studies have shown that population-specific panels provide better enrichment of genome-wide variants. We compared the imputation accuracy using the 1000 Genomes phase 3 reference panel and a panel generated from genome-wide data on 407 individuals from Western India (WIP). The concordance of imputed variants was cross-checked with next-generation re-sequencing data on a subset of genomic regions. Further, using the genome-wide data from 1880 individuals, we demonstrate that WIP works better than the 1000 Genomes phase 3 panel and, when merged with it, significantly improves the imputation accuracy throughout the minor allele frequency range. We also show that imputation using only the South Asian component of the 1000 Genomes phase 3 panel performs as well as the merged panel, making it a computationally less intensive option. Thus, our study stresses that imputation accuracy using the 1000 Genomes phase 3 panel can be further improved by including population-specific reference panels from South Asia.
|