1
|
Cahoon JL, Rui X, Tang E, Simons C, Langie J, Chen M, Lo YC, Chiang CWK. Imputation accuracy across global human populations. Am J Hum Genet 2024; 111:979-989. [PMID: 38604166 PMCID: PMC11080279 DOI: 10.1016/j.ajhg.2024.03.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Revised: 03/14/2024] [Accepted: 03/15/2024] [Indexed: 04/13/2024] Open
Abstract
Genotype imputation is now fundamental for genome-wide association studies but lacks fairness due to the underrepresentation of references from non-European ancestries. The state-of-the-art imputation reference panel released by the Trans-Omics for Precision Medicine (TOPMed) initiative improved the imputation of admixed African-ancestry and Hispanic/Latino samples, but imputation for populations primarily residing outside of North America may still fall short in performance due to persisting underrepresentation. To illustrate this point, we imputed the genotypes of over 43,000 individuals across 123 populations around the world and identified numerous populations where imputation accuracy paled in comparison to that of European-ancestry populations. For instance, the mean imputation r-squared (Rsq) for variants with minor allele frequencies between 1% and 5% in Saudi Arabians (n = 1,061), Vietnamese (n = 1,264), Thai (n = 2,435), and Papua New Guineans (n = 776) were 0.79, 0.78, 0.76, and 0.62, respectively, compared to 0.90-0.93 for comparable European populations matched in sample size and SNP array content. Outside of Africa and Latin America, Rsq appeared to decrease as genetic distances to European-ancestry reference increased, as predicted. Using sequencing data as ground truth, we also showed that Rsq may over-estimate imputation accuracy for non-European populations more than European populations, suggesting further disparity in accuracy between populations. Using 1,496 sequenced individuals from Taiwan Biobank as a second reference panel to TOPMed, we also assessed a strategy to improve imputation for non-European populations with meta-imputation, but this design did not improve accuracy across frequency spectra. Taken together, our analyses suggest that we must ultimately strive to increase diversity and size to promote equity within genetics research.
Collapse
Affiliation(s)
- Jordan L Cahoon
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, Los Angeles, CA 90033, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, Los Angeles, CA 90089, USA; Department of Computer Science, University of Southern California, Los Angeles, Los Angeles, CA 90089, USA
| | - Xinyue Rui
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, Los Angeles, CA 90033, USA
| | - Echo Tang
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, Los Angeles, CA 90089, USA
| | - Christopher Simons
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, Los Angeles, CA 90089, USA
| | - Jalen Langie
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, Los Angeles, CA 90033, USA
| | - Minhui Chen
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, Los Angeles, CA 90033, USA
| | - Ying-Chu Lo
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, Los Angeles, CA 90033, USA
| | - Charleston W K Chiang
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, Los Angeles, CA 90033, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, Los Angeles, CA 90089, USA; Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, Los Angeles, CA 90033, USA.
| |
Collapse
|
2
|
Dinh BL, Tang E, Taparra K, Nakatsuka N, Chen F, Chiang CWK. Recombination map tailored to Native Hawaiians may improve robustness of genomic scans for positive selection. Hum Genet 2024; 143:85-99. [PMID: 38157018 PMCID: PMC10794367 DOI: 10.1007/s00439-023-02625-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2023] [Accepted: 11/25/2023] [Indexed: 01/03/2024]
Abstract
Recombination events establish the patterns of haplotypic structure in a population and estimates of recombination rates are used in several downstream population and statistical genetic analyses. Using suboptimal maps from distantly related populations may reduce the efficacy of genomic analyses, particularly for underrepresented populations such as the Native Hawaiians. To overcome this challenge, we constructed recombination maps using genome-wide array data from two study samples of Native Hawaiians: one reflecting the current admixed state of Native Hawaiians (NH map) and one based on individuals of enriched Polynesian ancestries (PNS map) with the potential to be used for less admixed Polynesian populations such as the Samoans. We found the recombination landscape to be less correlated with those from other continental populations (e.g. Spearman's rho = 0.79 between PNS and CEU (Utah residents with Northern and Western European ancestry) compared to 0.92 between YRI (Yoruba in Ibadan, Nigeria) and CEU at 50 kb resolution), likely driven by the unique demographic history of the Native Hawaiians. PNS also shared the fewest recombination hotspots with other populations (e.g. 8% of hotspots shared between PNS and CEU compared to 27% of hotspots shared between YRI and CEU). We found that downstream analyses in the Native Hawaiian population, such as local ancestry inference, imputation, and IBD segment and relatedness detections, would achieve similar efficacy when using the NH map compared to an omnibus map. However, for genome scans of adaptive loci using integrated haplotype scores, we found several loci with apparent genome-wide significant signals (|Z-score|> 4) in Native Hawaiians that would not have been significant when analyzed using NH-specific maps. Population-specific recombination maps may therefore improve the robustness of haplotype-based statistics and help us better characterize the evolutionary history that may underlie Native Hawaiian-specific health conditions that persist today.
Collapse
Affiliation(s)
- Bryan L Dinh
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Echo Tang
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Kekoa Taparra
- Department of Radiation Oncology, Stanford University, Palo Alto, CA, USA
| | | | - Fei Chen
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Charleston W K Chiang
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA.
| |
Collapse
|
3
|
Lo YC, Chan TF, Jeon S, Maskarinec G, Taparra K, Nakatsuka N, Yu M, Chen CY, Lin YF, Wilkens LR, Le Marchand L, Haiman CA, Chiang CWK. The accuracy of polygenic score models for anthropometric traits and Type II Diabetes in the Native Hawaiian Population. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.12.25.23300499. [PMID: 38234828 PMCID: PMC10793530 DOI: 10.1101/2023.12.25.23300499] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/19/2024]
Abstract
Polygenic scores (PGS) are promising in stratifying individuals based on the genetic susceptibility to complex diseases or traits. However, the accuracy of PGS models, typically trained in European- or East Asian-ancestry populations, tend to perform poorly in other ethnic minority populations, and their accuracies have not been evaluated for Native Hawaiians. Using body mass index, height, and type-2 diabetes as examples of highly polygenic traits, we evaluated the prediction accuracies of PGS models in a large Native Hawaiian sample from the Multiethnic Cohort with up to 5,300 individuals. We evaluated both publicly available PGS models or genome-wide PGS models trained in this study using the largest available GWAS. We found evidence of lowered prediction accuracies for the PGS models in some cases, particularly for height. We also found that using the Native Hawaiian samples as an optimization cohort during training did not consistently improve PGS performance. Moreover, even the best performing PGS models among Native Hawaiians would have lowered prediction accuracy among the subset of individuals most enriched with Polynesian ancestry. Our findings indicate that factors such as admixture histories, sample size and diversity in GWAS can influence PGS performance for complex traits among Native Hawaiian samples. This study provides an initial survey of PGS performance among Native Hawaiians and exposes the current gaps and challenges associated with improving polygenic prediction models for underrepresented minority populations.
Collapse
Affiliation(s)
- Ying-Chu Lo
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Tsz Fung Chan
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Soyoung Jeon
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Gertraud Maskarinec
- Epidemiology Program, University of Hawai'i Cancer Center, University of Hawai'i, Manoa, Honolulu, HI, USA
| | - Kekoa Taparra
- Standard Health Care, Department of Radiation Oncology, Palo Alto, CA, USA
| | | | - Mingrui Yu
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Neuropsychiatric Research, National Health Research Institutes, Miaoli, Taiwan
| | - Chia-Yen Chen
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Neuropsychiatric Research, National Health Research Institutes, Miaoli, Taiwan
- Biogen, Cambridge, MA, USA
- Psychiatric and Neurodevelopmental Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
| | - Yen-Feng Lin
- Center for Neuropsychiatric Research, National Health Research Institutes, Miaoli, Taiwan
- Department of Public Health & Medical Humanities, School of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Institute of Behavioral Medicine, College of Medicine, National Cheng Kung University, Tainan, Taiwan
| | - Lynne R Wilkens
- Epidemiology Program, University of Hawai'i Cancer Center, University of Hawai'i, Manoa, Honolulu, HI, USA
| | - Loic Le Marchand
- Epidemiology Program, University of Hawai'i Cancer Center, University of Hawai'i, Manoa, Honolulu, HI, USA
| | - Christopher A Haiman
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Cancer Epidemiology Program, Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, USA
| | - Charleston W K Chiang
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Cancer Epidemiology Program, Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, USA
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| |
Collapse
|
4
|
Cahoon JL, Rui X, Tang E, Simons C, Langie J, Chen M, Lo YC, Chiang CWK. Imputation Accuracy Across Global Human Populations. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.22.541241. [PMID: 37292811 PMCID: PMC10245797 DOI: 10.1101/2023.05.22.541241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Genotype imputation is now fundamental for genome-wide association studies but lacks fairness due to the underrepresentation of populations with non-European ancestries. The state-of-the-art imputation reference panel released by the Trans-Omics for Precision Medicine (TOPMed) initiative contains a substantial number of admixed African-ancestry and Hispanic/Latino samples to impute these populations with nearly the same accuracy as European-ancestry cohorts. However, imputation for populations primarily residing outside of North America may still fall short in performance due to persisting underrepresentation. To illustrate this point, we curated genome-wide array data from 23 publications published between 2008 to 2021. In total, we imputed over 43k individuals across 123 populations around the world. We identified a number of populations where imputation accuracy paled in comparison to that of European-ancestry populations. For instance, the mean imputation r-squared (Rsq) for 1-5% alleles in Saudi Arabians (N=1061), Vietnamese (N=1264), Thai (N=2435), and Papua New Guineans (N=776) were 0.79, 0.78, 0.76, and 0.62, respectively. In contrast, the mean Rsq ranged from 0.90 to 0.93 for comparable European populations matched in sample size and SNP content. Outside of Africa and Latin America, Rsq appeared to decrease as genetic distances to European reference increased, as predicted. Further analysis using sequencing data as ground truth suggested that imputation software may over-estimate imputation accuracy for non-European populations than European populations, suggesting further disparity between populations. Using 1496 whole genome sequenced individuals from Taiwan Biobank as a reference, we also assessed a strategy to improve imputation for non-European populations with meta-imputation, which can combine results from TOPMed with smaller population-specific reference panels. We found that meta-imputation in this design did not improve Rsq genome-wide. Taken together, our analysis suggests that with the current size of alternative reference panels, meta-imputation alone cannot improve imputation efficacy for underrepresented cohorts and we must ultimately strive to increase diversity and size to promote equity within genetics research.
Collapse
Affiliation(s)
- Jordan L. Cahoon
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Xinyue Rui
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Echo Tang
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
| | - Christopher Simons
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
| | - Jalen Langie
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Minhui Chen
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Ying-Chu Lo
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Charleston W. K. Chiang
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
| |
Collapse
|
5
|
Dinh BL, Tang E, Taparra K, Nakatsuka N, Chen F, Chiang CWK. Recombination map tailored to Native Hawaiians improves robustness of genomic scans for positive selection. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.07.12.548735. [PMID: 37503129 PMCID: PMC10370006 DOI: 10.1101/2023.07.12.548735] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/29/2023]
Abstract
Recombination events establish the patterns of haplotypic structure in a population and estimates of recombination rates are used in several downstream population and statistical genetic analyses. Using suboptimal maps from distantly related populations may reduce the efficacy of genomic analyses, particularly for underrepresented populations such as the Native Hawaiians. To overcome this challenge, we constructed recombination maps using genome-wide array data from two study samples of Native Hawaiians: one reflecting the current admixed state of Native Hawaiians (NH map), and one based on individuals of enriched Polynesian ancestries (PNS map) with the potential to be used for less admixed Polynesian populations such as the Samoans. We found the recombination landscape to be less correlated with those from other continental populations (e.g. Spearman's rho = 0.79 between PNS and CEU (Utah residents with Northern and Western European ancestry) compared to 0.92 between YRI (Yoruba in Ibadan, Nigeria) and CEU at 50 kb resolution), likely driven by the unique demographic history of the Native Hawaiians. PNS also shared the fewest recombination hotspots with other populations (e.g. 8% of hotspots shared between PNS and CEU compared to 27% of hotspots shared between YRI and CEU). We found that downstream analyses in the Native Hawaiian population, such as local ancestry inference, imputation, and IBD segment and relatedness detections, would achieve similar efficacy when using the NH map compared to an omnibus map. However, for genome scans of adaptive loci using integrated haplotype scores, we found several loci with apparent genome-wide significant signals (|Z-score| > 4) in Native Hawaiians that would not have been significant when analyzed using NH-specific maps. Population-specific recombination maps may therefore improve the robustness of haplotype-based statistics and help us better characterize the evolutionary history that may underlie Native Hawaiian-specific health conditions that persist today.
Collapse
Affiliation(s)
- Bryan L Dinh
- Department of Quantitative and Computational Biology, University of Southern California
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
| | - Echo Tang
- Department of Quantitative and Computational Biology, University of Southern California
| | - Kekoa Taparra
- Department of Radiation Oncology, Stanford University, Palo Alto, California
| | | | - Fei Chen
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
| | - Charleston W K Chiang
- Department of Quantitative and Computational Biology, University of Southern California
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
| |
Collapse
|
6
|
Fan C, Mancuso N, Chiang CWK. A genealogical estimate of genetic relationships. Am J Hum Genet 2022; 109:812-824. [PMID: 35417677 PMCID: PMC9118131 DOI: 10.1016/j.ajhg.2022.03.016] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2021] [Accepted: 03/25/2022] [Indexed: 12/23/2022] Open
Abstract
The application of genetic relationships among individuals, characterized by a genetic relationship matrix (GRM), has far-reaching effects in human genetics. However, the current standard to calculate the GRM treats linked markers as independent and does not explicitly model the underlying genealogical history of the study sample. Here, we propose a coalescent-informed framework, namely the expected GRM (eGRM), to infer the expected relatedness between pairs of individuals given an ancestral recombination graph (ARG) of the sample. Through extensive simulations, we show that the eGRM is an unbiased estimate of latent pairwise genome-wide relatedness and is robust when computed with ARG inferred from incomplete genetic data. As a result, the eGRM better captures the structure of a population than the canonical GRM, even when using the same genetic information. More importantly, our framework allows a principled approach to estimate the eGRM at different time depths of the ARG, thereby revealing the time-varying nature of population structure in a sample. When applied to SNP array genotypes from a population sample from Northern and Eastern Finland, we find that clustering analysis with the eGRM reveals population structure driven by subpopulations that would not be apparent via the canonical GRM and that temporally the population model is consistent with recent divergence and expansion. Taken together, our proposed eGRM provides a robust tree-centric estimate of relatedness with wide application to genetic studies.
Collapse
Affiliation(s)
- Caoqi Fan
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
| | - Nicholas Mancuso
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Charleston W K Chiang
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
| |
Collapse
|