1
Keys KL, Mak ACY, White MJ, Eckalbar WL, Dahl AW, Mefford J, Mikhaylova AV, Contreras MG, Elhawary JR, Eng C, Hu D, Huntsman S, Oh SS, Salazar S, Lenoir MA, Ye JC, Thornton TA, Zaitlen N, Burchard EG, Gignoux CR. On the cross-population generalizability of gene expression prediction models. PLoS Genet 2020; 16:e1008927. [PMID: 32797036 PMCID: PMC7449671 DOI: 10.1371/journal.pgen.1008927] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 11/19/2019] [Revised: 08/26/2020] [Accepted: 06/10/2020] [Indexed: 11/21/2022] Open
Abstract
The genetic control of gene expression is a core component of human physiology. For the past several years, transcriptome-wide association studies have leveraged large datasets of linked genotype and RNA sequencing information to create a powerful gene-based test of association that has been used in dozens of studies. While numerous discoveries have been made, the populations in the training data are overwhelmingly of European descent, and little is known about the generalizability of these models to other populations. Here, we test for cross-population generalizability of gene expression prediction models using a dataset of African American individuals with RNA-Seq data in whole blood. We find that the default models trained in large datasets such as GTEx and DGN fare poorly in African Americans, with a notable reduction in prediction accuracy when compared to European Americans. We replicate these limitations in cross-population generalizability using the five populations in the GEUVADIS dataset. Via realistic simulations of both populations and gene expression, we show that accurate cross-population generalizability of transcriptome prediction only arises when eQTL architecture is substantially shared across populations. In contrast, models with non-identical eQTLs showed patterns similar to real-world data. Therefore, generating RNA-Seq data in diverse populations is a critical step towards multi-ethnic utility of gene expression prediction.
Affiliation(s)
- Kevin L. Keys
  - Department of Medicine, University of California, San Francisco, California, United States of America
  - Berkeley Institute for Data Science, University of California, Berkeley, California, United States of America
- Angel C. Y. Mak
  - Department of Medicine, University of California, San Francisco, California, United States of America
- Marquitta J. White
  - Department of Medicine, University of California, San Francisco, California, United States of America
- Walter L. Eckalbar
  - Department of Medicine, University of California, San Francisco, California, United States of America
- Andrew W. Dahl
  - Department of Medicine, University of California, San Francisco, California, United States of America
- Joel Mefford
  - Department of Medicine, University of California, San Francisco, California, United States of America
- Anna V. Mikhaylova
  - Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
- María G. Contreras
  - Department of Medicine, University of California, San Francisco, California, United States of America
  - San Francisco State University, San Francisco, California, United States of America
- Jennifer R. Elhawary
  - Department of Medicine, University of California, San Francisco, California, United States of America
- Celeste Eng
  - Department of Medicine, University of California, San Francisco, California, United States of America
- Donglei Hu
  - Department of Medicine, University of California, San Francisco, California, United States of America
- Scott Huntsman
  - Department of Medicine, University of California, San Francisco, California, United States of America
- Sam S. Oh
  - Department of Medicine, University of California, San Francisco, California, United States of America
- Sandra Salazar
  - Department of Medicine, University of California, San Francisco, California, United States of America
- Jimmie C. Ye
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, California, United States of America
  - Department of Bioengineering and Therapeutic Biosciences, University of California, San Francisco, California, United States of America
- Timothy A. Thornton
  - Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
- Noah Zaitlen
  - Department of Neurology, University of California, Los Angeles, California, United States of America
- Esteban G. Burchard
  - Department of Medicine, University of California, San Francisco, California, United States of America
  - Department of Bioengineering and Therapeutic Biosciences, University of California, San Francisco, California, United States of America
- Christopher R. Gignoux
  - Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, Colorado, United States of America
  - Department of Biostatistics and Informatics, School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, Colorado, United States of America
2
Inclusion of Population-specific Reference Panel from India to the 1000 Genomes Phase 3 Panel Improves Imputation Accuracy. Sci Rep 2017; 7:6733. [PMID: 28751670 PMCID: PMC5532257 DOI: 10.1038/s41598-017-06905-6] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 01/30/2017] [Accepted: 06/20/2017] [Indexed: 12/23/2022] Open
Abstract
Imputation is a computational method, based on the principle of haplotype sharing, that enriches genome-wide association study datasets. It depends on the haplotype structure of the population and the density of the genotype data. The 1000 Genomes Project led to the generation of imputation reference panels that have been used globally. However, recent studies have shown that population-specific panels provide better enrichment of genome-wide variants. We compared the imputation accuracy of the 1000 Genomes phase 3 reference panel with that of a panel generated from genome-wide data on 407 individuals from Western India (WIP). The concordance of imputed variants was cross-checked with next-generation re-sequencing data on a subset of genomic regions. Further, using genome-wide data from 1880 individuals, we demonstrate that WIP performs better than the 1000 Genomes phase 3 panel and, when merged with it, significantly improves imputation accuracy throughout the minor allele frequency range. We also show that imputation using only the South Asian component of the 1000 Genomes phase 3 panel performs as well as the merged panel while being computationally less intensive. Thus, our study stresses that imputation accuracy using the 1000 Genomes phase 3 panel can be further improved by including population-specific reference panels from South Asia.
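The concordance check mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define the exact metric, so the code below assumes the simplest form: the fraction of jointly called sites at which the imputed genotype equals the sequenced genotype. The function name `genotype_concordance`, the 0/1/2 alternate-allele coding, and the toy genotype vectors are all hypothetical and not taken from the paper.

```python
def genotype_concordance(imputed, sequenced):
    """Fraction of sites where imputed and sequenced genotypes agree.

    Genotypes are coded as alternate-allele counts (0, 1, 2).
    None marks a missing call; sites missing in either list are skipped.
    This is an assumed, simplified metric, not the paper's definition.
    """
    pairs = [
        (i, s)
        for i, s in zip(imputed, sequenced)
        if i is not None and s is not None
    ]
    if not pairs:
        raise ValueError("no overlapping genotype calls")
    matches = sum(1 for i, s in pairs if i == s)
    return matches / len(pairs)


# Toy data (hypothetical): eight sites, two of them missing in one source.
imputed = [0, 1, 2, 1, 0, None, 2, 1]
sequenced = [0, 1, 2, 0, 0, 1, 2, None]

concordance = genotype_concordance(imputed, sequenced)
print(f"concordance = {concordance:.3f}")
```

In this toy example, six sites are called in both sources and five of them agree. Studies of imputation quality often also stratify such a measure by minor allele frequency, which this sketch does not attempt.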
3
A thrifty variant in CREBRF strongly influences body mass index in Samoans. Nat Genet 2016; 48:1049-1054. [PMID: 27455349 PMCID: PMC5069069 DOI: 10.1038/ng.3620] [Citation(s) in RCA: 157] [Impact Index Per Article: 19.6] [Received: 12/07/2015] [Accepted: 06/15/2016] [Indexed: 12/14/2022]
Abstract
Samoans are a unique founder population with a high prevalence of obesity, making them well suited for identifying new genetic contributors to obesity. We conducted a genome-wide association study (GWAS) in 3,072 Samoans, discovered a variant, rs12513649, strongly associated with body mass index (BMI) (P = 5.3 × 10⁻¹⁴), and replicated the association in 2,102 additional Samoans (P = 1.2 × 10⁻⁹). Targeted sequencing identified a strongly associated missense variant, rs373863828 (p.Arg457Gln), in CREBRF (meta P = 1.4 × 10⁻²⁰). Although this variant is extremely rare in other populations, it is common in Samoans (frequency of 0.259), with an effect size much larger than that of any other known common BMI risk variant (1.36-1.45 kg/m² per copy of the risk-associated allele). In comparison to wild-type CREBRF, the Arg457Gln variant when overexpressed selectively decreased energy use and increased fat storage in an adipocyte cell model. These data, in combination with evidence of positive selection of the allele encoding p.Arg457Gln, support a 'thrifty' variant hypothesis as a factor in human obesity.