1
Tsuo K, Shi Z, Ge T, Mandla R, Hou K, Ding Y, Pasaniuc B, Wang Y, Martin AR. All of Us diversity and scale improve polygenic prediction contextually with greatest improvements for under-represented populations. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.08.06.606846. [PMID: 39149254 PMCID: PMC11326295 DOI: 10.1101/2024.08.06.606846] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/17/2024]
Abstract
Recent studies have demonstrated that polygenic risk scores (PRS) trained on multi-ancestry data can improve prediction accuracy in groups historically underrepresented in genomic studies, but the availability of linked health and genetic data from large-scale diverse cohorts representative of a wide spectrum of human diversity remains limited. To address this need, the All of Us research program (AoU) generated whole-genome sequences of 245,388 individuals who collectively reflect the diversity of the USA. Leveraging this resource and another widely used population-scale biobank, the UK Biobank (UKB) with a half million participants, we developed PRS trained on multi-ancestry and multi-biobank data with up to ~750,000 participants for 32 common, complex traits and diseases across a range of genetic architectures. We then compared effects of ancestry, PRS methodology, and genetic architecture on PRS accuracy across a held-out subset of ancestrally diverse AoU participants. Due to the more heterogeneous study design of AoU, we found lower heritability on average compared to UKB (0.075 vs 0.165), which limited the maximal achievable PRS accuracy in AoU. Overall, we found that the increased diversity of AoU significantly improved PRS performance in some participants in AoU, especially underrepresented individuals, across multiple phenotypes. Notably, maximizing sample size by combining discovery data across AoU and UKB is not the optimal approach for predicting some phenotypes in African ancestry populations; rather, using data from only AoU for these traits resulted in the greatest accuracy. This was especially true for less polygenic traits with large ancestry-enriched effects, such as neutrophil count (R²: 0.055 vs. 0.035 using AoU vs. cross-biobank meta-analysis, respectively, owing to, e.g., DARC). Lastly, we calculated individual-level PRS accuracies rather than grouping by continental ancestry, a critical step towards interpretability in precision medicine. Individualized PRS accuracy decayed linearly as a function of ancestry divergence, but the slope was smaller using multi-ancestry GWAS compared to using European GWAS. Our results highlight the potential of biobanks with more balanced representations of human diversity to facilitate more accurate PRS for the individuals least represented in genomic studies.
Affiliation(s)
- Kristin Tsuo
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Zhuozheng Shi
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Tian Ge
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Psychiatry, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
  - Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Center for Precision Psychiatry, Massachusetts General Hospital, Boston, MA, USA
- Ravi Mandla
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Kangcheng Hou
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Yi Ding
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Bogdan Pasaniuc
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Institute for Precision Health, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Ying Wang
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Alicia R Martin
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
2
Tubbs JD, Chen Y, Duan R, Huang H, Ge T. Real-time dynamic polygenic prediction for streaming data. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.07.12.24310357. [PMID: 39040195 PMCID: PMC11261927 DOI: 10.1101/2024.07.12.24310357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/24/2024]
Abstract
Polygenic risk scores (PRSs) are promising tools for advancing precision medicine. However, existing PRS construction methods rely on static summary statistics derived from genome-wide association studies (GWASs), which are often updated at lengthy intervals. As genetic data and health outcomes are continuously being generated at an ever-increasing pace, the current PRS training and deployment paradigm is suboptimal in maximizing the prediction accuracy of PRSs for incoming patients in healthcare settings. Here, we introduce real-time PRS-CS (rtPRS-CS), which enables online, dynamic refinement and calibration of PRS as each new sample is collected, without the need to perform intermediate GWASs. Through extensive simulation studies, we evaluate the performance of rtPRS-CS across various genetic architectures and training sample sizes. Leveraging quantitative traits from the Mass General Brigham Biobank and UK Biobank, we show that rtPRS-CS can integrate massive streaming data to enhance PRS prediction over time. We further apply rtPRS-CS to 22 schizophrenia cohorts in 7 Asian regions, demonstrating the clinical utility of rtPRS-CS in dynamically predicting and stratifying disease risk across diverse genetic ancestries.
Affiliation(s)
- Justin D. Tubbs
  - Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA
  - Center for Precision Psychiatry, Department of Psychiatry, Massachusetts General Hospital, Boston, MA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA
- Yu Chen
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA
  - Department of Medicine, Massachusetts General Hospital, Boston, MA
- Rui Duan
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
- Hailiang Huang
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA
  - Department of Medicine, Massachusetts General Hospital, Boston, MA
- Tian Ge
  - Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA
  - Center for Precision Psychiatry, Department of Psychiatry, Massachusetts General Hospital, Boston, MA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA
3
Boye C, Nirmalan S, Ranjbaran A, Luca F. Genotype × environment interactions in gene regulation and complex traits. Nat Genet 2024; 56:1057-1068. [PMID: 38858456 DOI: 10.1038/s41588-024-01776-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Accepted: 04/25/2024] [Indexed: 06/12/2024]
Abstract
Genotype × environment interactions (GxE) have long been recognized as a key mechanism underlying human phenotypic variation. Technological developments over the past 15 years have dramatically expanded our appreciation of the role of GxE in both gene regulation and complex traits. The richness and complexity of these datasets also required parallel efforts to develop robust and sensitive statistical and computational approaches. Although our understanding of the genetic architecture of molecular and complex traits has been maturing, a large proportion of complex trait heritability remains unexplained. Furthermore, there are increasing efforts to characterize the effect of environmental exposure on human health. We therefore review GxE in human gene regulation and complex traits, advocating for a comprehensive approach that jointly considers genetic and environmental factors in human health and disease.
Affiliation(s)
- Carly Boye
  - Center for Molecular Medicine and Genetics, Wayne State University, Detroit, MI, US
- Shreya Nirmalan
  - Center for Molecular Medicine and Genetics, Wayne State University, Detroit, MI, US
- Ali Ranjbaran
  - Center for Molecular Medicine and Genetics, Wayne State University, Detroit, MI, US
- Francesca Luca
  - Center for Molecular Medicine and Genetics, Wayne State University, Detroit, MI, US
  - Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI, US
  - Department of Biology, University of Rome "Tor Vergata", Rome, Italy
4
Hong SC, Muyas F, Cortés-Ciriano I, Hormoz S. scAI-SNP: a method for inferring ancestry from single-cell data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.14.594208. [PMID: 38798590 PMCID: PMC11118306 DOI: 10.1101/2024.05.14.594208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
Collaborative efforts, such as the Human Cell Atlas, are rapidly accumulating large amounts of single-cell data. To ensure that single-cell atlases are representative of human genetic diversity, we need to determine the ancestry of the donors from whom single-cell data are generated. Self-reporting of race and ethnicity, although important, can be biased and is not always available for the datasets already collected. Here, we introduce scAI-SNP, a tool to infer ancestry directly from single-cell genomics data. To train scAI-SNP, we identified 4.5 million ancestry-informative single-nucleotide polymorphisms (SNPs) in the 1000 Genomes Project dataset across 3201 individuals from 26 population groups. For a query single-cell data set, scAI-SNP uses these ancestry-informative SNPs to compute the contribution of each of the 26 population groups to the ancestry of the donor from whom the cells were obtained. Using diverse single-cell data sets with matched whole-genome sequencing data, we show that scAI-SNP is robust to the sparsity of single-cell data, can accurately and consistently infer ancestry from samples derived from diverse types of tissues and cancer cells, and can be applied to different modalities of single-cell profiling assays, such as single-cell RNA-seq and single-cell ATAC-seq. Finally, we argue that ensuring that single-cell atlases represent diverse ancestry, ideally alongside race and ethnicity, is ultimately important for improved and equitable health outcomes by accounting for human diversity.
Affiliation(s)
- Sung Chul Hong
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Francesc Muyas
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK
- Isidro Cortés-Ciriano
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK
- Sahand Hormoz
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
5
Peyrot WJ, Panagiotaropoulou G, Olde Loohuis LM, Adams MJ, Awasthi S, Ge T, McIntosh AM, Mitchell BL, Mullins N, O'Connell KS, Penninx BWJH, Posthuma D, Ripke S, Ruderfer DM, Uffelmann E, Vilhjalmsson BJ, Zhu Z, Smoller JW, Price AL. Distinguishing different psychiatric disorders using DDx-PRS. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.02.02.24302228. [PMID: 38352307 PMCID: PMC10862992 DOI: 10.1101/2024.02.02.24302228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/24/2024]
Abstract
Despite great progress on methods for case-control polygenic prediction (e.g. schizophrenia vs. control), there remains an unmet need for a method that genetically distinguishes clinically related disorders (e.g. schizophrenia (SCZ) vs. bipolar disorder (BIP) vs. depression (MDD) vs. control); such a method could have important clinical value, especially at disorder onset when differential diagnosis can be challenging. Here, we introduce a method, Differential Diagnosis-Polygenic Risk Score (DDx-PRS), that jointly estimates posterior probabilities of each possible diagnostic category (e.g. SCZ=50%, BIP=25%, MDD=15%, control=10%) by modeling variance/covariance structure across disorders, leveraging case-control polygenic risk scores (PRS) for each disorder (computed using existing methods) and prior clinical probabilities for each diagnostic category. DDx-PRS uses only summary-level training data and does not use tuning data, facilitating implementation in clinical settings. In simulations, DDx-PRS was well-calibrated (whereas a simpler approach that analyzes each disorder marginally was poorly calibrated), and effective in distinguishing each diagnostic category vs. the rest. We then applied DDx-PRS to Psychiatric Genomics Consortium SCZ/BIP/MDD/control data, including summary-level training data from 3 case-control GWAS (N = 41,917-173,140 cases; total N = 1,048,683) and held-out test data from different cohorts with equal numbers of each diagnostic category (total N = 11,460). DDx-PRS was well-calibrated and well-powered relative to these training sample sizes, attaining AUCs of 0.66 for SCZ vs. rest, 0.64 for BIP vs. rest, 0.59 for MDD vs. rest, and 0.68 for control vs. rest. DDx-PRS produced comparable results to methods that leverage tuning data, confirming that DDx-PRS is an effective method. True diagnosis probabilities in top deciles of predicted diagnosis probabilities were considerably larger than prior baseline probabilities, particularly in projections to larger training sample sizes, implying considerable potential for clinical utility under certain circumstances. In conclusion, DDx-PRS is an effective method for distinguishing clinically related disorders.
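The idea of combining per-disorder PRS with prior clinical probabilities to obtain posterior probabilities for each diagnostic category can be illustrated with a generic Bayes calculation. The sketch below is not the DDx-PRS implementation: it assumes, for illustration only, that the vector of PRS values is multivariate normal within each category with a shared covariance matrix, and the function names, means, and covariance values are hypothetical. How DDx-PRS derives these quantities from summary statistics is not described in the abstract.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def mvn_logpdf(x, mean, cov):
    """Log density of a multivariate normal, computed via the Cholesky factor."""
    n = len(x)
    low = cholesky(cov)
    # Forward substitution: solve low @ z = (x - mean).
    z = []
    for i in range(n):
        s = sum(low[i][k] * z[k] for k in range(i))
        z.append((x[i] - mean[i] - s) / low[i][i])
    logdet = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    quad = sum(v * v for v in z)
    return -0.5 * (n * math.log(2.0 * math.pi) + logdet + quad)


def diagnosis_posteriors(prs, priors, means, cov):
    """Posterior probability of each category given a PRS vector.

    prs    : list of PRS values, one per disorder (hypothetical input)
    priors : dict category -> prior clinical probability (sums to 1)
    means  : dict category -> assumed mean PRS vector in that category
    cov    : assumed shared covariance matrix of the PRS vector
    """
    labels = list(priors)
    logs = [math.log(priors[k]) + mvn_logpdf(prs, means[k], cov) for k in labels]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return {k: w / total for k, w in zip(labels, weights)}


# Hypothetical inputs: three PRS (SCZ, BIP, MDD) and four categories.
priors = {"SCZ": 0.25, "BIP": 0.25, "MDD": 0.25, "control": 0.25}
means = {
    "SCZ": [0.8, 0.5, 0.2],
    "BIP": [0.5, 0.7, 0.2],
    "MDD": [0.2, 0.2, 0.4],
    "control": [0.0, 0.0, 0.0],
}
cov = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
posteriors = diagnosis_posteriors([1.1, 0.6, 0.1], priors, means, cov)
```

Using the full covariance matrix rather than treating each PRS independently mirrors the abstract's point that a marginal, one-disorder-at-a-time analysis was poorly calibrated.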
6
Uffelmann E, Price AL, Posthuma D, Peyrot WJ. Estimating Disorder Probability Based on Polygenic Prediction Using the BPC Approach. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.01.12.24301157. [PMID: 38260678 PMCID: PMC10802765 DOI: 10.1101/2024.01.12.24301157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
Polygenic Scores (PGSs) summarize an individual's genetic propensity for a given trait in a single value, based on SNP effect sizes derived from Genome-Wide Association Study (GWAS) results. Methods have been developed that apply Bayesian approaches to improve the prediction accuracy of PGSs through optimization of estimated effect sizes. While these methods are generally well-calibrated for continuous traits (implying the predicted values are on average equal to the true trait values), they are not well-calibrated for binary disorder traits in ascertained samples. This is a problem because well-calibrated PGSs are needed to reliably compute the absolute disorder probability for an individual to facilitate future clinical implementation. Here we introduce the Bayesian polygenic score Probability Conversion (BPC) approach, which computes an individual's predicted disorder probability using GWAS summary statistics, an existing Bayesian PGS method (e.g. PRScs, SBayesR), the individual's genotype data, and a prior disorder probability. The BPC approach transforms the PGS to its underlying liability scale, computes the variances of the PGS in cases and controls, and applies Bayes' Theorem to compute the absolute disorder probability; it is practical in its application as it does not require a tuning dataset with both genotype and phenotype data. We applied the BPC approach to extensive simulated data and empirical data of nine disorders. The BPC approach yielded well-calibrated results that were consistently better than the results of another recently published approach.
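The final step described above, applying Bayes' Theorem to a PGS with known distributions in cases and controls and a prior disorder probability, can be sketched as follows. This is only an illustration of that one step, not the BPC software: it assumes the PGS is normally distributed within cases and within controls, and the function names and all parameter values are hypothetical. The liability-scale transformation and the way BPC obtains the case and control variances are not reproduced here.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def disorder_probability(pgs, prior, mean_case, var_case, mean_control, var_control):
    """P(case | PGS) by Bayes' Theorem under within-group normality (an assumption)."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    like_case = normal_pdf(pgs, mean_case, var_case)
    like_control = normal_pdf(pgs, mean_control, var_control)
    numerator = prior * like_case
    return numerator / (numerator + (1.0 - prior) * like_control)


# Hypothetical values: 10% prior probability, cases shifted upward in PGS.
low_risk = disorder_probability(-1.0, 0.10, 0.5, 1.0, -0.05, 1.0)
high_risk = disorder_probability(2.0, 0.10, 0.5, 1.0, -0.05, 1.0)
```

Because the output is an absolute probability tied to an explicit prior, the same PGS yields a different predicted probability in settings with different baseline risk.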
Affiliation(s)
- Emil Uffelmann
  - Department of Complex Trait Genetics, Center for Neurogenomics and Cognitive Research, Amsterdam Neuroscience, Vrije Universiteit Amsterdam
- Alkes L. Price
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Danielle Posthuma
  - Department of Complex Trait Genetics, Center for Neurogenomics and Cognitive Research, Amsterdam Neuroscience, Vrije Universiteit Amsterdam
  - Department of Child and Adolescent Psychiatry and Pediatric Psychology, Section Complex Trait Genetics, Amsterdam Neuroscience, Vrije Universiteit Medical Center, Amsterdam University Medical Center, Amsterdam, The Netherlands
- Wouter J. Peyrot
  - Department of Complex Trait Genetics, Center for Neurogenomics and Cognitive Research, Amsterdam Neuroscience, Vrije Universiteit Amsterdam
  - Department of Psychiatry, Amsterdam UMC, The Netherlands
7
Veller C, Przeworski M, Coop G. Causal interpretations of family GWAS in the presence of heterogeneous effects. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.13.566950. [PMID: 38014124 PMCID: PMC10680648 DOI: 10.1101/2023.11.13.566950] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/29/2023]
Abstract
Family-based genome-wide association studies (GWAS) have emerged as a gold standard for assessing causal effects of alleles and polygenic scores. Notably, family studies are often claimed to provide an unbiased estimate of the average causal effect (or average treatment effect; ATE) of an allele, on the basis of an analogy between the random transmission of alleles from parents to children and a randomized controlled trial. Here, we show that this interpretation does not hold in general. Because Mendelian segregation only randomizes alleles among children of heterozygotes, the effects of alleles in the children of homozygotes are not observable. Consequently, if an allele has different average effects in the children of homozygotes and heterozygotes, as can arise in the presence of gene-by-environment interactions, gene-by-gene interactions, or differences in LD patterns, family studies provide a biased estimate of the average effect in the sample. At a single locus, family-based association studies can be thought of as providing an unbiased estimate of the average effect in the children of heterozygotes (i.e., a local average treatment effect; LATE). This interpretation does not extend to polygenic scores, however, because different sets of SNPs are heterozygous in each family. Therefore, other than under specific conditions, the within-family regression slope of a PGS cannot be assumed to provide an unbiased estimate for any subset or weighted average of families. Instead, family-based studies can be reinterpreted as enabling an unbiased estimate of the extent to which Mendelian segregation at loci in the PGS contributes to the population-level variance in the trait. Because this estimate does not include the between-family variance, however, this interpretation applies to only (roughly) half of the sample PGS variance. In practice, the potential biases of a family-based GWAS are likely smaller than those arising from confounding in a standard, population-based GWAS, and so family studies remain important for the dissection of genetic contributions to phenotypic variation. Nonetheless, the causal interpretation of family-based GWAS estimates is less straightforward than has been widely appreciated.
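The single-locus distinction between the ATE and the LATE can be made concrete with a toy calculation. The sketch below is a deliberately simplified two-group model, not the authors' derivation: it assumes children fall into exactly two groups (those with a heterozygous parent, where segregation randomizes the allele, and those without), each with its own average allelic effect. The function name and all numbers are hypothetical.

```python
def average_effects(frac_het_children, effect_het, effect_hom):
    """Return (ATE, LATE) for one locus in a two-group toy model.

    frac_het_children : fraction of children with a heterozygous parent
    effect_het        : average allelic effect among those children
    effect_hom        : average allelic effect among children of homozygotes
    """
    if not 0.0 <= frac_het_children <= 1.0:
        raise ValueError("frac_het_children must lie in [0, 1]")
    # The sample-wide average effect weights both groups by their share.
    ate = frac_het_children * effect_het + (1.0 - frac_het_children) * effect_hom
    # A family design only observes randomization in children of heterozygotes.
    late = effect_het
    return ate, late


# Hypothetical values: the effect is three times larger in children of homozygotes.
ate, late = average_effects(0.4, 0.10, 0.30)
bias = late - ate  # what a family study would miss relative to the ATE
```

When the two effects are equal the bias vanishes, which matches the abstract's statement that the discrepancy arises only when effects differ between children of homozygotes and heterozygotes.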
Affiliation(s)
- Carl Veller
  - Department of Ecology and Evolution, University of Chicago
- Molly Przeworski
  - Department of Biological Sciences, Columbia University
  - Department of Systems Biology, Columbia University
- Graham Coop
  - Center for Population Biology and Department of Evolution and Ecology, University of California, Davis