1
Affiliation(s)
- Erik W. Zwet
  - Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
- Eric A. Cator
  - Faculty of Science, Radboud University, Nijmegen, The Netherlands

2
Sundar VS, Fan CC, Holland D, Dale AM. Determining Genetic Causal Variants Through Multivariate Regression Using Mixture Model Penalty. Front Genet 2018; 9:77. [PMID: 29556250 PMCID: PMC5844985 DOI: 10.3389/fgene.2018.00077] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/28/2017] [Accepted: 02/19/2018] [Indexed: 01/16/2023]
Abstract
With the availability of high-throughput sequencing data, accurate identification of causal genetic variants requires efficiently incorporating functional annotation data into the optimization routine. This motivates the development of novel methods for genome-wide association studies, with a special focus on fine-mapping capabilities. This work proposes a penalty function method that is simple to implement and capable of integrating functional annotation information into the estimation procedure. The idea is to use the prior distribution of the effect sizes explicitly as a penalty function. The resulting estimates are shown to be better correlated with the true effect sizes than those of a few existing techniques. An increase in the positive and negative predictive values is demonstrated using Hapgen2 simulated data.
Affiliation(s)
- V. S. Sundar
  - Center for Multimodal Imaging and Genetics, University of California, San Diego, La Jolla, CA, United States
  - Department of Radiology, University of California, San Diego, La Jolla, CA, United States
  - *Correspondence: V. S. Sundar
- Chun-Chieh Fan
  - Center for Multimodal Imaging and Genetics, University of California, San Diego, La Jolla, CA, United States
  - Department of Cognitive Sciences, University of California, San Diego, La Jolla, CA, United States
- Dominic Holland
  - Center for Multimodal Imaging and Genetics, University of California, San Diego, La Jolla, CA, United States
  - Department of Neuroscience, University of California, San Diego, La Jolla, CA, United States
- Anders M. Dale
  - Center for Multimodal Imaging and Genetics, University of California, San Diego, La Jolla, CA, United States
  - Department of Radiology, University of California, San Diego, La Jolla, CA, United States
  - Department of Neuroscience, University of California, San Diego, La Jolla, CA, United States
  - Department of Psychiatry, University of California, San Diego, La Jolla, CA, United States
  - Anders M. Dale

3
Hu J, Zhang W, Li X, Pan D, Li Q. Efficient estimation of disease odds ratios for follow-up genetic association studies. Stat Methods Med Res 2017; 28:1927-1941. [PMID: 29157118 DOI: 10.1177/0962280217741771] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/16/2022]
Abstract
In the past decade, genome-wide association studies have identified thousands of susceptible variants associated with complex human diseases and traits. Conducting follow-up genetic association studies has become a standard approach to validate the findings of genome-wide association studies. One problem of high interest in genetic association studies is to accurately estimate the strength of the association, which is often quantified by odds ratios in case-control studies. However, estimating the association directly by follow-up studies is inefficient since this approach ignores information from the genome-wide association studies. In this article, an estimator called GFcom, which integrates information from genome-wide association studies and follow-up studies, is proposed. The estimator includes both the point estimate and corresponding confidence interval. GFcom is more efficient than competing estimators regarding MSE and the length of confidence intervals. The superiority of GFcom is particularly evident when the genome-wide association study suffers from severe selection bias. Comprehensive simulation studies and applications to three real follow-up studies demonstrate the performance of the proposed estimator. An R package, "GFcom", implementing our method is publicly available at https://github.com/JiyuanHu/GFcom .
Affiliation(s)
- Jiyuan Hu
  - Shanghai Center for Mathematical Sciences, Fudan University, Shanghai, PR China
  - Department of Population Health, New York University, New York, NY, USA
- Wei Zhang
  - Biostatistics and Bioinformatics Branch, National Institute of Child Health and Human Development, Bethesda, MD, USA
- Xinmin Li
  - School of Mathematics and Statistics, Qingdao University, Qingdao, PR China
- Dongdong Pan
  - Yunnan Key Laboratory of Statistical Modeling and Data Analysis, Yunnan University, Kunming, PR China
- Qizhai Li
  - LSC, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, PR China

4
Grinde KE, Arbet J, Green A, O'Connell M, Valcarcel A, Westra J, Tintle N. Illustrating, Quantifying, and Correcting for Bias in Post-hoc Analysis of Gene-Based Rare Variant Tests of Association. Front Genet 2017; 8:117. [PMID: 28959274 PMCID: PMC5603735 DOI: 10.3389/fgene.2017.00117] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/24/2017] [Accepted: 08/25/2017] [Indexed: 11/13/2022]
Abstract
To date, gene-based rare variant testing approaches have focused on aggregating information across sets of variants to maximize statistical power in identifying genes showing significant association with diseases. Beyond identifying genes that are associated with diseases, the identification of causal variant(s) in those genes and estimation of their effect is crucial for planning replication studies and characterizing the genetic architecture of the locus. However, we illustrate that straightforward single-marker association statistics can suffer from substantial bias introduced by conditioning on gene-based test significance, due to the phenomenon often referred to as "winner's curse." We illustrate the ramifications of this bias on variant effect size estimation and variant prioritization/ranking approaches, outline parameters of genetic architecture that affect this bias, and propose a bootstrap resampling method to correct for this bias. We find that our correction method significantly reduces the bias due to winner's curse (average two-fold decrease in bias, p < 2.2 × 10⁻⁶) and, consequently, substantially improves mean squared error and variant prioritization/ranking. The method is particularly helpful in adjustment for winner's curse effects when the initial gene-based test has low power and for relatively more common, non-causal variants. Adjustment for winner's curse is recommended for all post-hoc estimation and ranking of variants after a gene-based test. Further work is necessary to continue seeking ways to reduce bias and improve inference in post-hoc analysis of gene-based tests under a wide variety of genetic architectures.
Affiliation(s)
- Kelsey E Grinde
  - Department of Biostatistics, University of Washington, Seattle, WA, United States
- Jaron Arbet
  - Department of Biostatistics, University of Minnesota, Minneapolis, MN, United States
- Alden Green
  - Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, United States
- Michael O'Connell
  - Department of Biostatistics, University of Minnesota, Minneapolis, MN, United States
- Alessandra Valcarcel
  - Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, PA, United States
- Jason Westra
  - Department of Statistics, Iowa State University, Ames, IA, United States
  - Department of Mathematics, Statistics, and Computer Science, Dordt College, Sioux Center, IA, United States
- Nathan Tintle
  - Department of Mathematics, Statistics, and Computer Science, Dordt College, Sioux Center, IA, United States

5
Reid S, Taylor J, Tibshirani R. Post-selection point and interval estimation of signal sizes in Gaussian samples. Can J Stat 2017. [DOI: 10.1002/cjs.11320] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 11/09/2022]
Affiliation(s)
- Stephen Reid
  - Department of Statistics; 390 Serra Mall, Stanford University; Stanford, CA 94305 U.S.A.
- Jonathan Taylor
  - Department of Statistics; 390 Serra Mall, Stanford University; Stanford, CA 94305 U.S.A.
- Robert Tibshirani
  - Department of Statistics; 390 Serra Mall, Stanford University; Stanford, CA 94305 U.S.A.
  - Department of Health Research and Policy; 150 Governor's Lane, HRP Redwood Building, Stanford University School of Medicine; Stanford, CA 94305 U.S.A.

6
Bigdeli TB, Lee D, Webb BT, Riley BP, Vladimirov VI, Fanous AH, Kendler KS, Bacanu SA. A simple yet accurate correction for winner's curse can predict signals discovered in much larger genome scans. Bioinformatics 2016; 32:2598-603. [PMID: 27187203 PMCID: PMC5013908 DOI: 10.1093/bioinformatics/btw303] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 11/12/2015] [Accepted: 05/06/2016] [Indexed: 11/14/2022]
Abstract
Motivation: For genetic studies, statistically significant variants explain far less trait variance than ‘sub-threshold’ association signals. To dimension follow-up studies, researchers need to accurately estimate ‘true’ effect sizes at each SNP, e.g. the true mean of odds ratios (ORs)/regression coefficients (RRs) or Z-score noncentralities. Naïve estimates of effect sizes incur winner’s curse biases, which are reduced only by laborious winner’s curse adjustments (WCAs). Given that Z-score estimates can be theoretically translated to other scales, we propose a simple method to compute WCA for Z-scores, i.e. their true means/noncentralities. Results: WCA of Z-scores shrinks these towards zero while, on the P-value scale, multiple testing adjustment (MTA) shrinks P-values toward one, which corresponds to the zero Z-score value. Thus, WCA on the Z-score scale is a proxy for MTA on the P-value scale. Therefore, to estimate Z-score noncentralities for all SNPs in genome scans, we propose FDR Inverse Quantile Transformation (FIQT). It (i) performs the simpler MTA of P-values using FDR and (ii) obtains noncentralities by back-transforming MTA P-values to the Z-score scale. When compared to competitors, realistic simulations suggest that FIQT is more (i) accurate and (ii) computationally efficient by orders of magnitude. Practical application of FIQT to the Psychiatric Genetic Consortium schizophrenia cohort predicts a non-trivial fraction of sub-threshold signals which become significant in much larger supersamples. Conclusions: FIQT is a simple, yet accurate, WCA method for Z-scores (and ORs/RRs, via simple transformations). Availability and Implementation: A 10-line R function implementation is available at https://github.com/bacanusa/FIQT. Contact: sabacanu@vcu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
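The two FIQT steps named in the abstract (FDR adjustment of P-values, then back-transformation to the Z-score scale) can be sketched in Python as below. This is an illustrative sketch only, not the authors' R function: the names `bh_adjust` and `fiqt_sketch`, the choice of the Benjamini-Hochberg procedure as the FDR adjustment, and the `min_p` floor that keeps P-values away from zero are all assumptions made for this example.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values (assumed FDR procedure)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices, ascending P
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def fiqt_sketch(z_scores, min_p=1e-300):
    """Shrink Z-scores toward zero via FDR-adjusted two-sided P-values."""
    # Step (i): two-sided P-values, floored at min_p, then FDR-adjusted.
    pvals = [max(2.0 * _STD_NORMAL.cdf(-abs(z)), min_p) for z in z_scores]
    adj = bh_adjust(pvals)
    # Step (ii): back-transform each adjusted P-value to a Z-score magnitude
    # and restore the sign of the original Z-score.
    shrunk = []
    for z, p in zip(z_scores, adj):
        magnitude = abs(_STD_NORMAL.inv_cdf(p / 2.0))
        shrunk.append(magnitude if z >= 0 else -magnitude)
    return shrunk


example = fiqt_sketch([5.0, 0.1, -0.2, 0.3])
```

Because an adjusted P-value is never smaller than the raw one, every output has magnitude no larger than its input, which is the shrinkage toward zero that the abstract describes.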
Affiliation(s)
- T Bernard Bigdeli
  - Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics
- Donghyung Lee
  - Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics
- Bradley Todd Webb
  - Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics
- Brien P Riley
  - Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics
- Vladimir I Vladimirov
  - Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics
  - Center for Biomarker Research & Personalized Medicine, Virginia Commonwealth University, Richmond, VA 23298, USA
  - Lieber Institute for Brain Development, Johns Hopkins University, Baltimore, MD 21205, USA
- Ayman H Fanous
  - Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics
- Kenneth S Kendler
  - Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics
- Silviu-Alin Bacanu
  - Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics

7
Jiang W, Yu W. Power estimation and sample size determination for replication studies of genome-wide association studies. BMC Genomics 2016; 17 Suppl 1:3. [PMID: 26818952 PMCID: PMC4895704 DOI: 10.1186/s12864-015-2296-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 11/25/2022]
Abstract
Background: A replication study is a commonly used verification method for filtering out false positives in genome-wide association studies (GWAS). If an association can be confirmed in a replication study, it is very likely to be a true positive. To design a replication study, traditional approaches calculate power by treating the replication study as another independent primary study. These approaches do not use the information given by the primary study. In addition, they need to specify a minimum detectable effect size, which may be subjective. One may think of replacing the minimum effect size with the observed effect sizes in the power calculation. However, this approach will make the designed replication study underpowered, since we are only interested in the positive associations from the primary study and the problem of the “winner’s curse” will occur. Results: An Empirical Bayes (EB) based method is proposed to estimate the power of the replication study for each association. The corresponding credible interval is estimated in the proposed approach. Simulation experiments show that our method is better than other plug-in based estimators in terms of overcoming the winner’s curse and providing higher estimation accuracy. The coverage probability of the given credible interval is well-calibrated in the simulation experiments. A weighted average method is used to estimate the average power of all underlying true associations. This is used to determine the sample size of the replication study. Sample sizes are estimated on 6 diseases from the Wellcome Trust Case Control Consortium (WTCCC) using our method. They are higher than the sample sizes estimated by plugging observed effect sizes into the power calculation. Conclusions: Our new method can objectively determine a replication study’s sample size by using information extracted from the primary study. The winner’s curse is also alleviated. Thus, it is a better choice when designing replication studies of GWAS.
The R-package is available at: http://bioinformatics.ust.hk/RPower.html.
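The underpowering argument in the abstract above can be illustrated with a textbook one-sided z-test power formula. This is a sketch under stated assumptions, not the Empirical Bayes estimator of the paper or the interface of its R package: the function name `replication_power`, the one-sided test, the assumption that the noncentrality scales with the square root of the sample-size ratio, and the numbers used are all hypothetical choices made here.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def replication_power(z_primary, n_primary, n_replication, alpha=0.05):
    """Power of a one-sided z-test in the replication study.

    z_primary is treated as the true noncentrality of the primary study;
    the replication noncentrality is assumed to scale with sqrt(n).
    """
    ncp = z_primary * sqrt(n_replication / n_primary)
    critical = _STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - _STD_NORMAL.cdf(critical - ncp)


# Hypothetical values: the observed Z-score is inflated by selection,
# while the underlying noncentrality is smaller.
observed_z = 5.5
true_z = 4.0
plug_in_power = replication_power(observed_z, n_primary=2000, n_replication=1000)
actual_power = replication_power(true_z, n_primary=2000, n_replication=1000)
```

Plugging in the inflated observed Z-score gives a higher power figure than the true noncentrality supports, so a replication sample size chosen from the plug-in value would be too small.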
Affiliation(s)
- Wei Jiang
  - Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
- Weichuan Yu
  - Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China

8
Faye LL, Machiela MJ, Kraft P, Bull SB, Sun L. Re-ranking sequencing variants in the post-GWAS era for accurate causal variant identification. PLoS Genet 2013; 9:e1003609. [PMID: 23950724 PMCID: PMC3738448 DOI: 10.1371/journal.pgen.1003609] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Received: 11/08/2012] [Accepted: 05/20/2013] [Indexed: 11/30/2022]
Abstract
Next generation sequencing has dramatically increased our ability to localize disease-causing variants by providing base-pair level information at costs increasingly feasible for the large sample sizes required to detect complex-trait associations. Yet, identification of causal variants within an established region of association remains a challenge. Counter-intuitively, certain factors that increase power to detect an associated region can decrease power to localize the causal variant. First, combining GWAS with imputation or low coverage sequencing to achieve the large sample sizes required for high power can have the unintended effect of producing differential genotyping error among SNPs. This tends to bias the relative evidence for association toward better genotyped SNPs. Second, re-use of GWAS data for fine-mapping exploits previous findings to ensure genome-wide significance in GWAS-associated regions. However, using GWAS findings to inform fine-mapping analysis can bias evidence away from the causal SNP toward the tag SNP and SNPs in high LD with the tag. Together these factors can reduce power to localize the causal SNP by more than half. Other strategies commonly employed to increase power to detect association, namely increasing sample size and using higher density genotyping arrays, can, in certain common scenarios, actually exacerbate these effects and further decrease power to localize causal variants. We develop a re-ranking procedure that accounts for these adverse effects and substantially improves the accuracy of causal SNP identification, often doubling the probability that the causal SNP is top-ranked. Application to the NCI BPC3 aggressive prostate cancer GWAS with imputation meta-analysis identified a new top SNP at 2 of 3 associated loci and several additional possible causal SNPs at these loci that may have otherwise been overlooked. This method is simple to implement using R scripts provided on the author's website. 
As next-generation sequencing (NGS) costs continue to fall and genome-wide association study (GWAS) platform coverage improves, the human genetics community is positioned to identify potentially causal variants. However, current NGS or imputation-based studies of either the whole genome or regions previously identified by GWAS have not yet been very successful in identifying causal variants. A major hurdle is the development of methods to distinguish disease-causing variants from their highly-correlated proxies within an associated region. We show that various common factors, such as differential sequencing or imputation accuracy rates and linkage disequilibrium patterns, with or without GWAS-informed region selection, can substantially decrease the probability of identifying the correct causal SNP, often by more than half. We then describe a novel and easy-to-implement re-ranking procedure that can double the probability that the causal SNP is top-ranked in many settings. Application to the NCI Breast and Prostate Cancer (BPC3) Cohort Consortium aggressive prostate cancer data identified new top SNPs within two associated loci previously established via GWAS, as well as several additional possible causal SNPs that had been previously overlooked.
Affiliation(s)
- Laura L. Faye
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
  - Samuel Lunenfeld Research Institute, Prosserman Centre for Health Research, Mount Sinai Hospital, Toronto, Ontario, Canada
- Mitchell J. Machiela
  - Program in Molecular and Genetic Epidemiology, Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Peter Kraft
  - Program in Molecular and Genetic Epidemiology, Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Shelley B. Bull
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
  - Samuel Lunenfeld Research Institute, Prosserman Centre for Health Research, Mount Sinai Hospital, Toronto, Ontario, Canada
- Lei Sun
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
  - * E-mail:

9
Liu D, Leal S. Estimating genetic effects and quantifying missing heritability explained by identified rare-variant associations. Am J Hum Genet 2012; 91:585-96. [PMID: 23022102 DOI: 10.1016/j.ajhg.2012.08.008] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 02/16/2012] [Revised: 06/19/2012] [Accepted: 08/08/2012] [Indexed: 01/01/2023]
Abstract
Next-generation sequencing has led to many complex-trait rare-variant (RV) association studies. Although single-variant association analysis can be performed, it is grossly underpowered. Therefore, researchers have developed many RV association tests that aggregate multiple variant sites across a genetic region (e.g., gene), and test for the association between the trait and the aggregated genotype. After these aggregate tests detect an association, it is only possible to estimate the average genetic effect for a group of RVs. As a result of the "winner's curse," such an estimate can be biased. Although for common variants one can obtain unbiased estimates of genetic parameters by analyzing a replication sample, for RVs it is desirable to obtain unbiased genetic estimates for the study where the association is identified. This is because there can be substantial heterogeneity of RV sites and frequencies even among closely related populations. In order to obtain an unbiased estimate for aggregated RV analysis, we developed bootstrap-sample-split algorithms to reduce the bias of the winner's curse. The unbiased estimates are greatly important for understanding the population-specific contribution of RVs to the heritability of complex traits. We also demonstrate both theoretically and via simulations that for aggregate RV analysis the genetic variance for a gene or region will always be underestimated, sometimes substantially, because of the presence of noncausal variants or because of the presence of causal variants with effects of different magnitudes or directions. Therefore, even if RVs play a major role in the complex-trait etiologies, a portion of the heritability will remain missing, and the contribution of RVs to the complex-trait etiologies will be underestimated.

10
Ferguson JP, Cho JH, Yang C, Zhao H. Empirical Bayes correction for the Winner's Curse in genetic association studies. Genet Epidemiol 2012; 37:60-8. [PMID: 23012258 DOI: 10.1002/gepi.21683] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 06/24/2012] [Revised: 08/14/2012] [Accepted: 08/17/2012] [Indexed: 01/03/2023]
Abstract
We consider an Empirical Bayes method to correct for the Winner's Curse phenomenon in genome-wide association studies. Our method utilizes the collective distribution of all odds ratios (ORs) to determine the appropriate correction for a particular single-nucleotide polymorphism (SNP). We can show that this approach is squared error optimal provided that this collective distribution is accurately estimated in its tails. To improve the performance when correcting the OR estimates for the most highly associated SNPs, we develop a second estimator that adaptively combines the Empirical Bayes estimator with a previously considered Conditional Likelihood estimator. The applications of these methods to both simulated and real data suggest improved performance in reducing selection bias.
Affiliation(s)
- John P Ferguson
  - Section of Digestive Diseases, Yale School of Medicine, New Haven, Connecticut 06511, USA

11
Zhou XK, Liu F, Dannenberg AJ. A Bayesian model averaging approach for observational gene expression studies. Ann Appl Stat 2012. [DOI: 10.1214/11-aoas526] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/19/2022]

12
Sun L, Dimitromanolakis A, Faye LL, Paterson AD, Waggott D, Bull SB. BR-squared: a practical solution to the winner's curse in genome-wide scans. Hum Genet 2011; 129:545-52. [PMID: 21246217 PMCID: PMC3074069 DOI: 10.1007/s00439-011-0948-2] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.5] [Received: 10/05/2010] [Accepted: 01/03/2011] [Indexed: 11/26/2022]
Abstract
The detrimental effects of the winner's curse, including overestimation of the genetic effects of associated variants and underestimation of sufficient sample sizes for replication studies are well-recognized in genome-wide association studies (GWAS). These effects can be expected to worsen as the field moves from GWAS into whole genome sequencing. To date, few studies have reported statistical adjustments to the naive estimates, due to the lack of suitable statistical methods and computational tools. We have developed an efficient genome-wide non-parametric method that explicitly accounts for the threshold, ranking, and allele frequency effects in whole genome scans. Here, we implement the method to provide bias-reduced estimates via bootstrap re-sampling (BR-squared) for association studies of both disease status and quantitative traits, and we report the results of applying BR-squared to GWAS of psoriasis and HbA1c. We observed over 50% reduction in the genetic effect size estimation for many associated SNPs. This translates into a greater than fourfold increase in sample size requirements for successful replication studies, which in part explains some of the apparent failures in replicating the original signals. Our analysis suggests that adjusting for the winner's curse is critical for interpreting findings from whole genome scans and planning replication and meta-GWAS studies, as well as in attempts to translate findings into the clinical setting.
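Two points in the abstract above lend themselves to a small numerical illustration: estimates that pass a significance threshold overstate the true effect, and halving an effect estimate roughly quadruples the sample size needed to replicate it. The sketch below is not the BR-squared method. The function names, the simulated effect size and standard error, and the assumption that the required sample size is proportional to the inverse square of the effect size are all hypothetical choices made for illustration.

```python
import random


def simulate_winners_curse(true_effect, std_error, z_threshold, n_studies, seed=0):
    """Mean effect estimate among simulated studies that pass the threshold."""
    rng = random.Random(seed)  # seeded for reproducibility
    selected = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, std_error)
        # Keep only the "winners": estimates whose Z-score exceeds the threshold.
        if estimate / std_error > z_threshold:
            selected.append(estimate)
    if not selected:
        raise ValueError("no simulated study passed the threshold")
    return sum(selected) / len(selected)


def sample_size_inflation(shrinkage_fraction):
    """Growth factor in required sample size when an effect estimate is
    reduced by shrinkage_fraction, assuming n is proportional to 1/effect**2."""
    if not 0.0 <= shrinkage_fraction < 1.0:
        raise ValueError("shrinkage_fraction must be in [0, 1)")
    return 1.0 / (1.0 - shrinkage_fraction) ** 2


mean_selected = simulate_winners_curse(
    true_effect=0.10, std_error=0.05, z_threshold=3.0, n_studies=10_000
)
print(mean_selected > 0.10)        # True: every selected estimate exceeds 0.15
print(sample_size_inflation(0.5))  # 4.0
```

Under the inverse-square assumption, a 50% reduction in the effect estimate gives a factor of 1 / 0.5² = 4, matching the fourfold figure quoted in the abstract.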
Affiliation(s)
- Lei Sun
  - Dalla Lana School of Public Health, University of Toronto, 155 College Street, Toronto, Ontario, Canada