1
Zhao B, Yang X, Zhu H. Estimating trans-ancestry genetic correlation with unbalanced data resources. J Am Stat Assoc 2024; 119:839-850. [PMID: 39219674 PMCID: PMC11364214 DOI: 10.1080/01621459.2024.2344703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2022] [Accepted: 04/07/2024] [Indexed: 09/04/2024]
Abstract
The aim of this paper is to propose a novel method for estimating trans-ancestry genetic correlations in genome-wide association studies (GWAS) using genetically-predicted observations. These correlations describe how genetic architecture of complex traits varies among populations. Our new estimator corrects for biases arising from prediction errors in high-dimensional weak GWAS signals, while addressing the ethnic diversity inherent in GWAS data, such as linkage disequilibrium (LD) differences. A distinguishing feature of our approach is its flexibility regarding sample sizes: it necessitates a large GWAS sample only from one population, while the secondary population may have a much smaller cohort, even in the hundreds. This design directly addresses the existing imbalance in GWAS data resources, where datasets for European populations typically outnumber those of non-European ancestries. Through extensive simulations and real data analysis from the UK Biobank study encompassing 26 complex traits, we validate the reliability of our method. Our results illuminate the broader implications of transferring genetic findings across diverse populations.
Affiliation(s)
- Bingxin Zhao
- Department of Statistics and Data Science, University of Pennsylvania
- Hongtu Zhu
- Department of Biostatistics, University of North Carolina at Chapel Hill
2
Zhao B, Zou F, Zhu H. Cross-trait prediction accuracy of summary statistics in genome-wide association studies. Biometrics 2023; 79:841-853. [PMID: 35278218 PMCID: PMC9464799 DOI: 10.1111/biom.13661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2020] [Accepted: 02/25/2022] [Indexed: 11/27/2022]
Abstract
In the era of big data, univariate models have widely been used as a workhorse tool for quickly producing marginal estimators; and this is true even when in a high-dimensional dense setting, in which many features are "true," but weak signals. Genome-wide association studies (GWAS) epitomize this type of setting. Although the GWAS marginal estimator is popular, it has long been criticized for ignoring the correlation structure of genetic variants (i.e., the linkage disequilibrium [LD] pattern). In this paper, we study the effects of LD pattern on the GWAS marginal estimator and investigate whether or not additionally accounting for the LD can improve the prediction accuracy of complex traits. We consider a general high-dimensional dense setting for GWAS and study a class of ridge-type estimators, including the popular marginal estimator and the best linear unbiased prediction (BLUP) estimator as two special cases. We show that the performance of GWAS marginal estimator depends on the LD pattern through the first three moments of its eigenvalue distribution. Furthermore, we uncover that the relative performance of GWAS marginal and BLUP estimators highly depends on the ratio of GWAS sample size over the number of genetic variants. Particularly, our finding reveals that the marginal estimator can easily become near-optimal within this class when the sample size is relatively small, even though it ignores the LD pattern. On the other hand, BLUP estimator has substantially better performance than the marginal estimator as the sample size increases toward the number of genetic variants, which is typically in millions. Therefore, adjusting for the LD (such as in the BLUP) is most needed when GWAS sample size is large. We illustrate the importance of our results by using the simulated data and real GWAS.
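As a rough illustration of the estimator class this abstract describes, the sketch below writes a ridge-type estimator in one common parametrization, $(X^\top X + \lambda I)^{-1} X^\top y$. The scaling, the function names, and the simulated data are assumptions made here for illustration and may differ from the paper's own conventions. In this parametrization the BLUP under i.i.d. Gaussian effects corresponds to $\lambda = \sigma_\varepsilon^2 / \sigma_\beta^2$, and the marginal estimator $X^\top y / n$ is recovered as the rescaled limit $\lambda \to \infty$.

```python
import numpy as np

def ridge_type_estimator(X, y, lam):
    """Ridge-type estimate (X'X + lam * I)^{-1} X'y for a penalty lam > 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def marginal_estimator(X, y):
    """GWAS-style marginal estimate X'y / n; ignores correlation among columns."""
    return X.T @ y / X.shape[0]

def blup_estimator(X, y, sigma2_beta, sigma2_eps):
    """BLUP assuming i.i.d. N(0, sigma2_beta) effects and N(0, sigma2_eps) noise."""
    return ridge_type_estimator(X, y, sigma2_eps / sigma2_beta)

# Toy data (hypothetical sizes, not taken from the paper).
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = 0.1 * rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

# With a very large penalty, lam * ridge / n approaches the marginal estimator.
lam = 1e8
rescaled = lam * ridge_type_estimator(X, y, lam) / n
max_gap = np.max(np.abs(rescaled - marginal_estimator(X, y)))

blup = blup_estimator(X, y, sigma2_beta=0.01, sigma2_eps=1.0)
```

Sweeping `lam` between the BLUP value and very large values traces out the class between the two special cases named in the abstract.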
Affiliation(s)
- Bingxin Zhao
- Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599, U.S.A
- Fei Zou
- Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599, U.S.A
- Hongtu Zhu
- Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599, U.S.A
3
Carpentier A, Collier O, Comminges L, Tsybakov AB, Wang Y. Estimation of the ℓ2-norm and testing in sparse linear regression with unknown variance. Bernoulli 2022. [DOI: 10.3150/21-bej1436] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Olivier Collier
- Modal’X, Université Paris-Nanterre, Nanterre and CREST, Paris, France
- Yuhao Wang
- Tsinghua University, Beijing, China and Shanghai Qi Zhi Institute, Shanghai, China
4
Chen HY, Li H, Argos M, Persky VW, Turyk ME. Statistical Methods for Assessing the Explained Variation of a Health Outcome by a Mixture of Exposures. Int J Environ Res Public Health 2022; 19:2693. [PMID: 35270383 PMCID: PMC8910055 DOI: 10.3390/ijerph19052693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2021] [Revised: 02/13/2022] [Accepted: 02/18/2022] [Indexed: 12/04/2022]
Abstract
Exposures to environmental pollutants are often composed of mixtures of chemicals that can be highly correlated because of similar sources and/or chemical structures. The effect of an individual chemical on a health outcome can be weak and difficult to detect because of the relatively low level of exposures to many environmental pollutants. To tackle the challenging problem of assessing the health risk of exposure to a mixture of environmental pollutants, we propose a statistical approach to assessing the proportion of the variation of an outcome explained by a mixture of pollutants. The proposed approach avoids the difficult task of identifying specific pollutants that are responsible for the effects and may also be used to assess interactions among exposures. Extensive simulation results demonstrate that the proposed approach has very good performance. Application of the proposed approach is illustrated by investigating the main and interaction effects of the chemical pollutants on systolic and diastolic blood pressure in participants from the National Health and Nutrition Examination Survey.
Affiliation(s)
- Hua Yun Chen
- Division of Epidemiology & Biostatistics, School of Public Health, University of Illinois at Chicago, 1603 West Taylor Street, Chicago, IL 60612, USA; (H.L.); (M.A.); (V.W.P.); (M.E.T.)
5
Livne I, Azriel D, Goldberg Y. Improved estimators for semi-supervised high-dimensional regression model. Electron J Stat 2022. [DOI: 10.1214/22-ejs2070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Ilan Livne
- The Faculty of Industrial Engineering and Management, Technion, Israel
- David Azriel
- The Faculty of Industrial Engineering and Management, Technion, Israel
- Yair Goldberg
- The Faculty of Industrial Engineering and Management, Technion, Israel
6
Chen X, Liu Q, Tong XT. Dimension independent excess risk by stochastic gradient descent. Electron J Stat 2022. [DOI: 10.1214/22-ejs2055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Xi Chen
- Stern School of Business, New York University
- Qiang Liu
- School of Statistics and Management, Shanghai University of Finance and Economics
- Xin T. Tong
- Department of Mathematics, National University of Singapore
7
Zhang Y, Lu Q, Ye Y, Huang K, Liu W, Wu Y, Zhong X, Li B, Yu Z, Travers BG, Werling DM, Li JJ, Zhao H. SUPERGNOVA: local genetic correlation analysis reveals heterogeneous etiologic sharing of complex traits. Genome Biol 2021; 22:262. [PMID: 34493297 PMCID: PMC8422619 DOI: 10.1186/s13059-021-02478-w] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Received: 10/18/2020] [Accepted: 08/23/2021] [Indexed: 01/09/2023] Open
Abstract
Local genetic correlation quantifies the genetic similarity of complex traits in specific genomic regions. However, accurate estimation of local genetic correlation remains challenging, due to linkage disequilibrium in local genomic regions and sample overlap across studies. We introduce SUPERGNOVA, a statistical framework to estimate local genetic correlations using summary statistics from genome-wide association studies. We demonstrate that SUPERGNOVA outperforms existing methods through simulations and analyses of 30 complex traits. In particular, we show that the positive yet paradoxical genetic correlation between autism spectrum disorder and cognitive performance could be explained by two etiologically distinct genetic signatures with bidirectional local genetic correlations.
Affiliation(s)
- Yiliang Zhang
- Department of Biostatistics, Yale School of Public Health, 60 College Street, New Haven, CT, 06520, USA
- Qiongshi Lu
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Department of Statistics, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Center for Demography of Health and Aging, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Yixuan Ye
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06510, USA
- Kunling Huang
- Department of Statistics, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Wei Liu
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06510, USA
- Yuchang Wu
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Xiaoyuan Zhong
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Boyang Li
- Department of Biostatistics, Yale School of Public Health, 60 College Street, New Haven, CT, 06520, USA
- Zhaolong Yu
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06510, USA
- Brittany G Travers
- Occupational Therapy Program in the Department of Kinesiology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Waisman Center, University of Wisconsin-Madison, Madison, WI, 53705, USA
- Donna M Werling
- Waisman Center, University of Wisconsin-Madison, Madison, WI, 53705, USA
- Laboratory of Genetics, University of Wisconsin-Madison, Madison, WI, 53706, USA
- James J Li
- Waisman Center, University of Wisconsin-Madison, Madison, WI, 53705, USA
- Department of Psychology, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Hongyu Zhao
- Department of Biostatistics, Yale School of Public Health, 60 College Street, New Haven, CT, 06520, USA
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06510, USA
- Department of Genetics, Yale School of Medicine, New Haven, CT, 06510, USA
8
Zhang Y, Cheng Y, Jiang W, Ye Y, Lu Q, Zhao H. Comparison of methods for estimating genetic correlation between complex traits using GWAS summary statistics. Brief Bioinform 2021; 22:bbaa442. [PMID: 33497438 PMCID: PMC8425307 DOI: 10.1093/bib/bbaa442] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 10/13/2020] [Revised: 12/12/2020] [Accepted: 12/30/2020] [Indexed: 01/03/2023] Open
Abstract
Genetic correlation is the correlation of phenotypic effects by genetic variants across the genome on two phenotypes. It is an informative metric to quantify the overall genetic similarity between complex traits, which provides insights into their polygenic genetic architecture. Several methods have been proposed to estimate genetic correlation based on data collected from genome-wide association studies (GWAS). Due to the easy access of GWAS summary statistics and computational efficiency, methods only requiring GWAS summary statistics as input have become more popular than methods utilizing individual-level genotype data. Here, we present a benchmark study for different summary-statistics-based genetic correlation estimation methods through simulation and real data applications. We focus on two major technical challenges in estimating genetic correlation: marker dependency caused by linkage disequilibrium (LD) and sample overlap between different studies. To assess the performance of different methods in the presence of these two challenges, we first conducted comprehensive simulations with diverse LD patterns and sample overlaps. Then we applied these methods to real GWAS summary statistics for a wide spectrum of complex traits. Based on these experiments, we conclude that methods relying on accurate LD estimation are less robust in real data applications due to the imprecision of LD obtained from reference panels. Our findings offer guidance on how to choose appropriate methods for genetic correlation estimation in post-GWAS analysis.
Affiliation(s)
- Yiliang Zhang
- Department of Biostatistics, Yale School of Public Health, New Haven, CT 06510, USA
- Youshu Cheng
- Department of Biostatistics, Yale School of Public Health, New Haven, CT 06510, USA
- Wei Jiang
- Department of Biostatistics, Yale School of Public Health, New Haven, CT 06510, USA
- Yixuan Ye
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT 06510, USA
- Qiongshi Lu
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706, USA
- Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA
- Center for Demography of Health and Aging, University of Wisconsin-Madison, Madison, WI 53706, USA
- Hongyu Zhao
- Department of Biostatistics, Yale School of Public Health, New Haven, CT 06510, USA
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT 06510, USA
- Department of Genetics, Yale School of Medicine, New Haven, CT 06510, USA
9
Comminges L, Collier O, Ndaoud M, Tsybakov AB. Adaptive robust estimation in sparse vector model. Ann Stat 2021. [DOI: 10.1214/20-aos2002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Affiliation(s)
- L. Comminges
- CEREMADE, Université Paris-Dauphine, PSL and CREST
- O. Collier
- Modal’X, UPL, Université Paris Nanterre and CREST
10
Zhao B, Zhu H. On Genetic Correlation Estimation With Summary Statistics From Genome-Wide Association Studies. J Am Stat Assoc 2021; 117:1-11. [PMID: 35757777 PMCID: PMC9232179 DOI: 10.1080/01621459.2021.1906684] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/16/2019] [Revised: 03/12/2021] [Accepted: 03/16/2021] [Indexed: 01/03/2023]
Abstract
Cross-trait polygenic risk score (PRS) method has gained popularity for assessing genetic correlation of complex traits using summary statistics from biobank-scale genome-wide association studies (GWAS). However, empirical evidence has shown a common bias phenomenon that highly significant cross-trait PRS can only account for a very small amount of genetic variance (R² can be < 1%) in independent testing GWAS. The aim of this paper is to investigate and address the bias phenomenon of cross-trait PRS in numerous GWAS applications. We show that the estimated genetic correlation can be asymptotically biased toward zero. A consistent cross-trait PRS estimator is then proposed to correct such asymptotic bias. In addition, we investigate whether or not SNP screening by GWAS p-values can lead to improved estimation and show the effect of overlapping samples among GWAS. We analyze GWAS summary statistics of reaction time and brain structural magnetic resonance imaging-based features measured in the Pediatric Imaging, Neurocognition, and Genetics study. We find that the raw cross-trait PRS estimators heavily underestimate the genetic similarity between cognitive function and human brain structures (mean R² = 1.32%), whereas the bias-corrected estimators uncover the moderate degree of genetic overlap between these closely related heritable traits (mean R² = 22.42%). Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
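The attenuation toward zero described in this abstract can be seen in a small simulation. The sketch below is not the paper's estimator or its bias correction. It only builds a cross-trait PRS from marginal estimates under assumed settings (independent standard-normal genotypes, every variant causal, and hypothetical sample sizes, heritability, and genetic correlation), then compares the raw PRS-phenotype correlation with the true genetic correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, p = 500, 500, 2000   # hypothetical sizes: fewer samples than variants
h2, rg = 0.5, 0.8                     # assumed heritability and true genetic correlation

# Correlated per-variant effects for two traits, scaled so each trait has heritability h2.
cov = (h2 / p) * np.array([[1.0, rg], [rg, 1.0]])
effects = rng.multivariate_normal(np.zeros(2), cov, size=p)
beta1, beta2 = effects[:, 0], effects[:, 1]

def simulate(n, beta):
    """Draw standardized genotypes and a phenotype with total variance close to 1."""
    X = rng.standard_normal((n, p))
    y = X @ beta + np.sqrt(1.0 - h2) * rng.standard_normal(n)
    return X, y

X_train, y_train = simulate(n_train, beta1)   # GWAS for trait 1
X_test, y_test = simulate(n_test, beta2)      # trait 2 measured in independent samples

beta1_hat = X_train.T @ y_train / n_train     # marginal (summary-statistic) estimates
prs = X_test @ beta1_hat                      # cross-trait polygenic risk score
raw_corr = np.corrcoef(prs, y_test)[0, 1]

print(f"true genetic correlation: {rg:.2f}")
print(f"raw PRS-phenotype correlation: {raw_corr:.2f}")
```

Under these assumptions the raw correlation falls well below `rg`, because the noise in `beta1_hat` inflates the variance of the PRS when the number of variants exceeds the sample size.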
Affiliation(s)
- Bingxin Zhao
- Department of Biostatistics, University of North Carolina at Chapel Hill, NC
- Hongtu Zhu
- Department of Biostatistics, University of North Carolina at Chapel Hill, NC
11
Abstract
Summary
Genome-wide association studies have identified thousands of genetic variants that are associated with complex traits. Many complex traits are shown to share genetic etiology. Although various genetic correlation measures and their estimators have been developed, rigorous statistical analysis of their properties, including their robustness to model assumptions, is still lacking. We develop a method of moments estimator of genetic correlation between two traits in the framework of high-dimensional linear models. We show that the genetic correlation defined based on the regression coefficients and the linkage disequilibrium matrix can be decomposed into both the pleiotropic effects and correlations due to linkage disequilibrium between the causal loci of the two traits. The proposed estimator can be computed from summary association statistics when the raw genotype data are not available. Theoretical properties of the estimator in terms of consistency and asymptotic normality are provided. The proposed estimator is closely related to the estimator from the linkage disequilibrium score regression. However, our analysis reveals that the linkage disequilibrium score regression method does not make full use of the linkage disequilibrium information, and its jackknife variance estimate can be biased when the model assumptions are violated. Simulations and real data analysis results show that the proposed estimator is more robust and has better interpretability than the linkage disequilibrium score regression method under different genetic architectures.
Affiliation(s)
- Jianqiao Wang
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A
- Hongzhe Li
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, U.S.A
12
Guo Z, Renaux C, Bühlmann P, Cai T. Group inference in high dimensions with applications to hierarchical testing. Electron J Stat 2021. [DOI: 10.1214/21-ejs1955] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
Affiliation(s)
- Zijian Guo
- Department of Statistics, Rutgers University, New Jersey 08854, U.S.A
- Claude Renaux
- Seminar für Statistik, ETH Zürich, 8092 Zürich, Switzerland
- Peter Bühlmann
- Seminar für Statistik, ETH Zürich, 8092 Zürich, Switzerland
- Tony Cai
- Department of Statistics, University of Pennsylvania, Pennsylvania 19104, U.S.A
13
Javanmard A, Lee JD. A flexible framework for hypothesis testing in high dimensions. J R Stat Soc Series B Stat Methodol 2020. [DOI: 10.1111/rssb.12373] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/28/2022]
14
Tony Cai T, Guo Z. Semisupervised inference for explained variance in high dimensional linear regression and its applications. J R Stat Soc Series B Stat Methodol 2020. [DOI: 10.1111/rssb.12357] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 11/29/2022]
Affiliation(s)
- T. Tony Cai
- University of Pennsylvania; Philadelphia USA