1
Deng L, Fu S, Yu K. Bias and mean squared error in Mendelian randomization with invalid instrumental variables. Genet Epidemiol 2024; 48:27-41. [PMID: 37970963; DOI: 10.1002/gepi.22541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2023] [Revised: 10/09/2023] [Accepted: 10/26/2023] [Indexed: 11/19/2023]
Abstract
Mendelian randomization (MR) is a statistical method that utilizes genetic variants as instrumental variables (IVs) to investigate causal relationships between risk factors and outcomes. Although MR has gained popularity in recent years due to its ability to analyze summary statistics from genome-wide association studies (GWAS), it requires a substantial number of single nucleotide polymorphisms (SNPs) as IVs to ensure sufficient power for detecting causal effects. Unfortunately, the complex genetic heritability of many traits can lead to the use of invalid IVs that affect both the risk factor and the outcome directly or through an unobserved confounder. This can result in biased and imprecise estimates, as reflected by a larger mean squared error (MSE). In this study, we focus on the widely used two-stage least squares (2SLS) method and derive formulas for its bias and MSE when estimating causal effects using invalid IVs. Using those formulas, we identify conditions under which the 2SLS estimate is unbiased and reveal how the independent or correlated pleiotropic effects influence the accuracy and precision of the 2SLS estimate. We validate these formulas through extensive simulation studies and demonstrate the application of those formulas in an MR study to evaluate the causal effect of the waist-to-hip ratio on various sleeping patterns. Our results can aid in designing future MR studies and serve as benchmarks for assessing more sophisticated MR methods.
Affiliation(s)
- Lu Deng
  - School of Statistics and Data Science, Nankai University, Tianjin, China
- Sheng Fu
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, USA
- Kai Yu
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, USA
2
Katki HA, Berndt SI, Machiela MJ, Stewart DR, Garcia-Closas M, Kim J, Shi J, Yu K, Rothman N. Increase in power by obtaining 10 or more controls per case when type-1 error is small in large-scale association studies. BMC Med Res Methodol 2023; 23:153. [PMID: 37386403; PMCID: PMC10308790; DOI: 10.1186/s12874-023-01973-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2023] [Accepted: 06/10/2023] [Indexed: 07/01/2023] [Open Access]
Abstract
BACKGROUND: The rule of thumb that there is little gain in statistical power from obtaining more than 4 controls per case is based on type-1 error α = 0.05. However, association studies that evaluate thousands or millions of associations use smaller α and may have access to plentiful controls. We investigate power gains, and reductions in p-values, when increasing well beyond 4 controls per case, for small α.
METHODS: We calculate the power, the median expected p-value, and the minimum detectable odds-ratio (OR), as a function of the number of controls/case, as α decreases.
RESULTS: As α decreases, at each ratio of controls per case, the increase in power is larger than for α = 0.05. For α between 10⁻⁶ and 10⁻⁹ (typical for thousands or millions of associations), increasing from 4 controls per case to 10-50 controls per case increases power. For example, a study with power = 0.2 (α = 5 × 10⁻⁸) with 1 control/case has power = 0.65 with 4 controls/case, but with 10 controls/case has power = 0.78, and with 50 controls/case has power = 0.84. For situations where obtaining more than 4 controls per case provides small increases in power beyond 0.9 (at small α), the expected p-value can decrease by orders of magnitude below α. Increasing from 1 to 4 controls/case reduces the minimum detectable OR toward the null by 20.9%, and from 4 to 50 controls/case reduces it by an additional 9.7%, a result which applies regardless of α and hence also applies to "regular" α = 0.05 epidemiology.
CONCLUSIONS: At small α, versus 4 controls/case, recruiting 10 or more controls/case can increase power, reduce the expected p-value by 1-2 orders of magnitude, and meaningfully reduce the minimum detectable OR. These benefits of increasing the controls/case ratio increase as the number of cases increases, although the amount of benefit depends on exposure frequencies and true OR. Provided that controls are comparable to cases, our findings suggest greater sharing of comparable controls in large-scale association studies.
Affiliation(s)
- Hormuzd A Katki
  - Division of Cancer Epidemiology and Genetics, Department of Health and Human Services, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Sonja I Berndt
  - Division of Cancer Epidemiology and Genetics, Department of Health and Human Services, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Mitchell J Machiela
  - Division of Cancer Epidemiology and Genetics, Department of Health and Human Services, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Douglas R Stewart
  - Division of Cancer Epidemiology and Genetics, Department of Health and Human Services, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Montserrat Garcia-Closas
  - Division of Cancer Epidemiology and Genetics, Department of Health and Human Services, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Jung Kim
  - Division of Cancer Epidemiology and Genetics, Department of Health and Human Services, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Jianxin Shi
  - Division of Cancer Epidemiology and Genetics, Department of Health and Human Services, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Kai Yu
  - Division of Cancer Epidemiology and Genetics, Department of Health and Human Services, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Nathaniel Rothman
  - Division of Cancer Epidemiology and Genetics, Department of Health and Human Services, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
3
Shi G. Genome-wide variance quantitative trait locus analysis suggests small interaction effects in blood pressure traits. Sci Rep 2022; 12:12649. [PMID: 35879408; PMCID: PMC9314370; DOI: 10.1038/s41598-022-16908-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2022] [Accepted: 07/18/2022] [Indexed: 11/09/2022] [Open Access]
Abstract
Genome-wide variance quantitative trait loci (vQTL) analysis complements genome-wide association study (GWAS) and has the potential to identify novel variants associated with the trait, explain additional trait variance and lead to the identification of factors that modulate the genetic effects. I conducted genome-wide analysis of the UK Biobank data and identified 27 vQTLs associated with systolic blood pressure (SBP), diastolic blood pressure (DBP) and pulse pressure (PP). The top single-nucleotide polymorphisms (SNPs) are enriched for expression QTLs (eQTLs) or splicing QTLs (sQTLs) annotated by GTEx, suggesting their regulatory roles in mediating the associations with blood pressure (BP). Of the 27 vQTLs, 14 are known BP-associated QTLs discovered by GWASs. The heteroscedasticity effects of the 13 novel vQTLs are larger than their genetic main effects, which were not detected by existing GWASs. The total R-squared of the 27 top SNPs due to variance heteroscedasticity is 0.28%, compared with 0.50% owing to their main effects. The overall effect size of the variance heteroscedasticity is small in GWAS SNPs compared with their main effects. For the 411, 384 and 285 GWAS SNPs associated with SBP, DBP and PP, respectively, their heteroscedasticity effects were 0.52%, 0.43%, and 0.16%, and their main effects were 5.13%, 5.61%, and 3.75%, respectively. The number and effects of the vQTLs are small, which suggests that the effects of gene-environment and gene-gene interactions are small. The main effects of the SNPs remain the major source of genetic variance for BP, which would probably be true for other complex traits as well.
Affiliation(s)
- Gang Shi
  - School of Telecommunications Engineering, Xidian University, 2 South Taibai Road, Xi'an, 710071, Shaanxi, China
4
Deng L, Zhang H, Yu K. Power calculation for the general two-sample Mendelian randomization analysis. Genet Epidemiol 2020; 44:290-299. [DOI: 10.1002/gepi.22284] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 09/19/2019] [Revised: 01/09/2020] [Accepted: 01/28/2020] [Indexed: 11/08/2022]
Affiliation(s)
- Lu Deng
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland
- Han Zhang
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland
- Kai Yu
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland
5
Davenport S, Nichols TE. Selective peak inference: Unbiased estimation of raw and standardized effect size at local maxima. Neuroimage 2019; 209:116375. [PMID: 31866164; DOI: 10.1016/j.neuroimage.2019.116375] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 08/30/2019] [Revised: 11/11/2019] [Accepted: 11/17/2019] [Indexed: 11/29/2022] [Open Access]
Abstract
The spatial signals in neuroimaging mass univariate analyses can be characterized in a number of ways, but one widely used approach is peak inference: the identification of peaks in the image. Peak locations and magnitudes provide a useful summary of activation and are routinely reported; however, the magnitudes reflect selection bias, as these points have both survived a threshold and are local maxima. In this paper we propose the use of resampling methods to estimate and correct this bias in order to estimate both the raw units change and the standardized effect size, measured with Cohen's d and partial R². We evaluate our method with a massive open dataset, and discuss how the corrected estimates can be used to perform power analyses. Keywords: fMRI, selective inference, winner's curse, regression to the mean, bias, bootstrap, local maxima, UK Biobank, power analyses, massive linear modeling.
Affiliation(s)
- Samuel Davenport
  - Department of Statistics, University of Oxford, Oxford, OX1 3LB, UK
- Thomas E Nichols
  - Oxford Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, Nuffield Department of Population Health, University of Oxford, Oxford, OX3 7LF, UK
  - Wellcome Centre for Integrative Neuroimaging, FMRIB, Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, OX3 9DU, UK
  - Department of Statistics, University of Warwick, Coventry, CV4 7AL, UK
6
Deng L, Zhang H, Song L, Yu K. Approximation of bias and mean-squared error in two-sample Mendelian randomization analyses. Biometrics 2019; 76:369-379. [PMID: 31651042; PMCID: PMC7182476; DOI: 10.1111/biom.13169] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/29/2019] [Revised: 08/08/2019] [Accepted: 10/08/2019] [Indexed: 12/12/2022]
Abstract
Mendelian randomization (MR) is a type of instrumental variable (IV) analysis that uses genetic variants as IVs for a risk factor to study its causal effect on an outcome. Extensive investigations on the performance of IV analysis procedures, such as the one based on the two-stage least squares (2SLS) procedure, have been conducted under the one-sample scenario, where measures on IVs, the risk factor, and the outcome are assumed to be available for each study participant. Recent MR analyses are usually performed with data from two independent or partially overlapping genetic association studies (the two-sample setting), with one providing information on the association between the IVs and the outcome, and the other on the association between the IVs and the risk factor. We investigate the performance of 2SLS in two-sample MR when the IVs are weakly associated with the risk factor. We derive closed-form formulas for the bias and mean squared error of the 2SLS estimate and verify them with numerical simulations under realistic circumstances. Using these analytic formulas, we can study the pros and cons of conducting MR analysis under one-sample and two-sample settings and assess the impact of having overlapping samples. We also propose and validate a bias-corrected estimator for the causal effect.
Affiliation(s)
- Lu Deng
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland
- Han Zhang
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland
- Lei Song
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland
- Kai Yu
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland
7
Grinde KE, Arbet J, Green A, O'Connell M, Valcarcel A, Westra J, Tintle N. Illustrating, Quantifying, and Correcting for Bias in Post-hoc Analysis of Gene-Based Rare Variant Tests of Association. Front Genet 2017; 8:117. [PMID: 28959274; PMCID: PMC5603735; DOI: 10.3389/fgene.2017.00117] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/24/2017] [Accepted: 08/25/2017] [Indexed: 11/13/2022] [Open Access]
Abstract
To date, gene-based rare variant testing approaches have focused on aggregating information across sets of variants to maximize statistical power in identifying genes showing significant association with diseases. Beyond identifying genes that are associated with diseases, the identification of causal variant(s) in those genes and estimation of their effect is crucial for planning replication studies and characterizing the genetic architecture of the locus. However, we illustrate that straightforward single-marker association statistics can suffer from substantial bias introduced by conditioning on gene-based test significance, due to the phenomenon often referred to as "winner's curse." We illustrate the ramifications of this bias on variant effect size estimation and variant prioritization/ranking approaches, outline parameters of genetic architecture that affect this bias, and propose a bootstrap resampling method to correct for this bias. We find that our correction method significantly reduces the bias due to winner's curse (average two-fold decrease in bias, p < 2.2 × 10⁻⁶) and, consequently, substantially improves mean squared error and variant prioritization/ranking. The method is particularly helpful in adjustment for winner's curse effects when the initial gene-based test has low power and for relatively more common, non-causal variants. Adjustment for winner's curse is recommended for all post-hoc estimation and ranking of variants after a gene-based test. Further work is necessary to continue seeking ways to reduce bias and improve inference in post-hoc analysis of gene-based tests under a wide variety of genetic architectures.
Affiliation(s)
- Kelsey E Grinde
  - Department of Biostatistics, University of Washington, Seattle, WA, United States
- Jaron Arbet
  - Department of Biostatistics, University of Minnesota, Minneapolis, MN, United States
- Alden Green
  - Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, United States
- Michael O'Connell
  - Department of Biostatistics, University of Minnesota, Minneapolis, MN, United States
- Alessandra Valcarcel
  - Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, PA, United States
- Jason Westra
  - Department of Statistics, Iowa State University, Ames, IA, United States
  - Department of Mathematics, Statistics, and Computer Science, Dordt College, Sioux Center, IA, United States
- Nathan Tintle
  - Department of Mathematics, Statistics, and Computer Science, Dordt College, Sioux Center, IA, United States
8
Pan DD, Li ZB, Li QZ, Kam Fung W. A Novel Powerful Joint Analysis with Data Fusion in Two-stage Case–Control Genome-wide Association Studies. COMMUN STAT-SIMUL C 2016. [DOI: 10.1080/03610918.2014.901360] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
9
Zou W, Ouyang H. Using local multiplicity to improve effect estimation from a hypothesis-generating pharmacogenetics study. The Pharmacogenomics Journal 2016; 16:107-112. [PMID: 25802090; DOI: 10.1038/tpj.2015.19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2014] [Revised: 12/29/2014] [Accepted: 01/28/2015] [Indexed: 06/04/2023]
Abstract
We propose a multiple estimation adjustment (MEA) method to correct effect overestimation due to selection bias from a hypothesis-generating study (HGS) in pharmacogenetics. MEA uses a hierarchical Bayesian approach to jointly model individual effect estimates from maximum likelihood estimation (MLE) in a region and shrinks them toward the regional effect. Unlike many methods that model a fixed selection scheme, MEA capitalizes on local multiplicity independent of selection. We compared mean square errors (MSEs) in simulated HGSs from naive MLE, MEA and a conditional likelihood adjustment (CLA) method that models threshold selection bias. We observed that MEA effectively reduced MSE from MLE on null effects with or without selection, and had a clear advantage over CLA on extreme MLE estimates from null effects under lenient threshold selection in small samples, which are common among 'top' associations from a pharmacogenetics HGS.
Affiliation(s)
- W Zou
  - Biostatistics, Genentech, Inc., 1 DNA Way, South San Francisco, CA, USA
- H Ouyang
  - Global Statistical Sciences (GSS) - Oncology, Lilly Corporate Center, Eli Lilly and Company, Indianapolis, IN, USA
10
Zheng J, Rao DC, Shi G. An update on genome-wide association studies of hypertension. ACTA ACUST UNITED AC 2015. [DOI: 10.1186/s40535-015-0013-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 01/07/2023]
11
Poirier JG, Faye LL, Dimitromanolakis A, Paterson AD, Sun L, Bull SB. Resampling to Address the Winner's Curse in Genetic Association Analysis of Time to Event. Genet Epidemiol 2015; 39:518-28. [PMID: 26411674; PMCID: PMC4609263; DOI: 10.1002/gepi.21920] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 12/23/2014] [Revised: 06/10/2015] [Accepted: 07/17/2015] [Indexed: 01/27/2023]
Abstract
The “winner's curse” is a subtle and difficult problem in interpretation of genetic association, in which association estimates from large‐scale gene detection studies are larger in magnitude than those from subsequent replication studies. This is practically important because use of a biased estimate from the original study will yield an underestimate of sample size requirements for replication, leaving the investigators with an underpowered study. Motivated by investigation of the genetics of type 1 diabetes complications in a longitudinal cohort of participants in the Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications (DCCT/EDIC) Genetics Study, we apply a bootstrap resampling method in analysis of time to nephropathy under a Cox proportional hazards model, examining 1,213 single‐nucleotide polymorphisms (SNPs) in 201 candidate genes custom genotyped in 1,361 white probands. Among 15 top‐ranked SNPs, bias reduction in log hazard ratio estimates ranges from 43.1% to 80.5%. In simulation studies based on the observed DCCT/EDIC genotype data, genome‐wide bootstrap estimates for false‐positive SNPs and for true‐positive SNPs with low‐to‐moderate power are closer to the true values than uncorrected naïve estimates, but tend to overcorrect SNPs with high power. This bias‐reduction technique is generally applicable for complex trait studies including quantitative, binary, and time‐to‐event traits.
Affiliation(s)
- Julia G Poirier
  - Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Canada
- Laura L Faye
  - Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Canada
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
- Andrew D Paterson
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
  - Hospital for Sick Children Research Institute, Toronto, Canada
- Lei Sun
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
  - Department of Statistical Sciences, University of Toronto, Toronto, Canada
- Shelley B Bull
  - Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Canada
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
12
Schielzeth H, Husby A. Challenges and prospects in genome-wide quantitative trait loci mapping of standing genetic variation in natural populations. Ann N Y Acad Sci 2014; 1320:35-57. [PMID: 24689944; DOI: 10.1111/nyas.12397] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Indexed: 02/05/2023]
Abstract
A considerable challenge in evolutionary genetics is to understand the genetic mechanisms that facilitate or impede evolutionary adaptation in natural populations. For this, we must understand the genetic loci contributing to trait variation and the selective forces acting on them. The decreased costs and increased feasibility of obtaining genotypic data on a large number of individuals have greatly facilitated gene mapping in natural populations, particularly because organisms whose genetics have been historically difficult to study are now within reach. Here we review the methods available to evolutionary ecologists interested in dissecting the genetic basis of traits in natural populations. Our focus lies on standing genetic variation in outbred populations. We present an overview of the current state of research in the field, covering studies on both plants and animals. We also draw attention to particular challenges associated with the discovery of quantitative trait loci and discuss parallels to studies on crops, livestock, and humans. Finally, we point to some likely future developments in genetic mapping studies.
Affiliation(s)
- Holger Schielzeth
  - Department of Evolutionary Biology, Bielefeld University, Bielefeld, Germany
13
Robust Joint Analysis with Data Fusion in Two-Stage Quantitative Trait Genome-Wide Association Studies. Computational and Mathematical Methods in Medicine 2013; 2013:843563. [PMID: 24288575; PMCID: PMC3832968; DOI: 10.1155/2013/843563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2013] [Accepted: 07/29/2013] [Indexed: 11/17/2022]
Abstract
Genome-wide association studies (GWASs) have proved to be pioneering work in identifying disease-associated genetic variants. Two-stage design and analysis are often adopted in GWASs. Considering the genetic model uncertainty, many robust procedures have been proposed and applied in GWASs. However, the existing approaches have mostly focused on binary traits, and little work has been done on continuous (quantitative) traits, since the statistical significance of these robust tests is difficult to calculate. In this paper, we develop a powerful F-statistic-based robust joint analysis method for quantitative traits using the combined raw data from both stages in the framework of two-stage GWASs. Explicit expressions are obtained to calculate the statistical significance and power. We show using simulations that the proposed method is substantially more robust than the F-test based on the additive model when the underlying genetic model is unknown. An example for rheumatic arthritis (RA) is used for illustration.
14
Oki NO, Motsinger-Reif AA. Multifactor dimensionality reduction as a filter-based approach for genome wide association studies. Front Genet 2011; 2:80. [PMID: 22303374; PMCID: PMC3268633; DOI: 10.3389/fgene.2011.00080] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 06/26/2011] [Accepted: 10/26/2011] [Indexed: 11/13/2022] [Open Access]
Abstract
Advances in genotyping technology and the multitude of genetic data available now provide a vast amount of data that is proving to be useful in the quest for a better understanding of human genetic diseases through the study of genetic variation. This has led to the development of approaches such as genome wide association studies (GWAS) designed specifically for interrogating variants across the genome for association with disease, typically by testing single locus, univariate associations. More recently it has been accepted that epistatic (interaction) effects may also be great contributors to these genetic effects, and GWAS methods are now being applied to find epistatic effects. The challenge for these methods still remains in the prioritization and interpretation of results, as it has also become standard for initial findings to be independently investigated in replication cohorts or functional studies. This is motivating the development and implementation of filter-based approaches to prioritize variants found to be significant in a discovery stage for follow-up in replication. Such filters must be able to detect both univariate and interactive effects. In the current study we present and evaluate the use of multifactor dimensionality reduction (MDR) as such a filter, with simulated data and a wide range of effect sizes. Additionally, we compare the performance of the MDR filter to a similar filter approach using logistic regression (LR), the more traditional approach used in GWAS analysis, as well as evaporative cooling (EC), another prominent machine learning filtering method. The results of our simulation study show that MDR is an effective method for such prioritization, and that it can detect main effects, and interactions with or without marginal effects. Importantly, it performed as well as EC and LR for main effect models. It also significantly outperforms LR for various two-locus epistatic models, while its results are equivalent to those of EC for the epistatic models. The results of this study demonstrate the potential of MDR as a filter to detect gene-gene interactions in GWAS studies.
Affiliation(s)
- Noffisat O. Oki
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
- Alison A. Motsinger-Reif
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
  - Department of Statistics, North Carolina State University, Raleigh, NC, USA
15
Faye LL, Sun L, Dimitromanolakis A, Bull SB. A flexible genome-wide bootstrap method that accounts for ranking and threshold-selection bias in GWAS interpretation and replication study design. Stat Med 2011; 30:1898-912. [PMID: 21538984; DOI: 10.1002/sim.4228] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Received: 08/24/2010] [Accepted: 02/15/2011] [Indexed: 11/07/2022]
Abstract
The phenomenon known as the winner's curse is a form of selection bias that affects estimates of genetic association. In genome-wide association studies (GWAS) the bias is exacerbated by the use of stringent selection thresholds and ranking over hundreds of thousands of single nucleotide polymorphisms (SNPs). We develop an improved multi-locus bootstrap point estimate and confidence interval, which accounts for both ranking- and threshold-selection bias in the presence of genome-wide SNP linkage disequilibrium structure. The bootstrap method easily adapts to various study designs and alternative test statistics as well as complex SNP selection criteria. The latter is demonstrated by our application to the Wellcome Trust Case Control Consortium findings, in which the selection criterion was the minimum of the p-values for the additive and genotypic genetic effect models. In contrast, existing likelihood-based bias-reduced estimators account for the selection criterion applied to an SNP as if it were the only one tested, and so are computationally simpler, but do not address ranking across SNPs. Our simulation studies show that the bootstrap bias-reduced estimates are usually closer to the true genetic effect than the likelihood estimates and are less variable, with a narrower confidence interval. Replication study sample size requirements computed from the bootstrap bias-reduced estimates are adequate 75-90 per cent of the time, compared to 53-60 per cent of the time for the likelihood method. The bootstrap methods are implemented in a user-friendly package able to provide point and interval estimation for both binary and quantitative phenotypes in large-scale GWAS.
Affiliation(s)
- Laura L Faye
  - Dalla Lana School of Public Health, University of Toronto, ON, Canada
16
Xu L, Craiu RV, Sun L. Bayesian methods to overcome the winner’s curse in genetic studies. Ann Appl Stat 2011. [DOI: 10.1214/10-aoas373] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 11/19/2022]
17
Xiao R, Boehnke M. Quantifying and correcting for the winner's curse in quantitative-trait association studies. Genet Epidemiol 2011; 35:133-8. [PMID: 21284035; DOI: 10.1002/gepi.20551] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Received: 05/28/2010] [Revised: 10/05/2010] [Accepted: 10/28/2010] [Indexed: 11/06/2022]
Abstract
Quantitative traits (QT) are an important focus of human genetic studies both because of interest in the traits themselves and because of their role as risk factors for many human diseases. For large-scale QT association studies including genome-wide association studies, investigators usually focus on genetic loci showing significant evidence for SNP-QT association, and genetic effect size tends to be overestimated as a consequence of the winner's curse. In this paper, we study the impact of the winner's curse on QT association studies in which the genetic effect size is parameterized as the slope in a linear regression model. We demonstrate by analytical calculation that the overestimation in the regression slope estimate decreases as power increases. To reduce the ascertainment bias, we propose a three-parameter maximum likelihood method and then simplify this to a one-parameter method by excluding nuisance parameters. We show that both methods reduce the bias when power to detect association is low or moderate, and that the one-parameter model generally results in smaller variance in the estimate.
Affiliation(s)
- Rui Xiao
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania 19104-6021, USA.
|
18
|
Pan D, Li Q, Jiang N, Liu A, Yu K. Robust joint analysis allowing for model uncertainty in two-stage genetic association studies. BMC Bioinformatics 2011; 12:9. [PMID: 21211060 PMCID: PMC3027114 DOI: 10.1186/1471-2105-12-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 06/29/2010] [Accepted: 01/07/2011] [Indexed: 01/10/2023] Open
Abstract
Background: The cost-efficient two-stage design is often used in genome-wide association studies (GWASs) in searching for genetic loci underlying the susceptibility for complex diseases. Replication-based analysis, which considers data from each stage separately, often suffers from loss of efficiency. A joint test that combines data from both stages has been proposed and widely used to improve efficiency. However, existing joint analyses are based on test statistics derived under an assumed genetic model, and thus might not have robust performance when the assumed genetic model is not appropriate. Results: In this paper, we propose joint analyses based on two robust tests, MERT and MAX3, for GWASs under a two-stage design. We developed computationally efficient procedures and formulas for significance level evaluation and power calculation. The performances of the proposed approaches are investigated through extensive simulation studies and a real example. Numerical results show that the joint analysis based on the MAX3 test statistic has the best overall performance. Conclusions: MAX3 joint analysis is the most robust procedure among the considered joint analyses, and we recommend using it in a two-stage genome-wide association study.
Affiliation(s)
- Dongdong Pan
- Department of Statistics, Yunnan University, Kunming 650091, PR China
|
19
|
Shi G, Boerwinkle E, Morrison AC, Gu CC, Chakravarti A, Rao DC. Mining gold dust under the genome wide significance level: a two-stage approach to analysis of GWAS. Genet Epidemiol 2010; 35:111-8. [PMID: 21254218 DOI: 10.1002/gepi.20556] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Received: 06/14/2010] [Revised: 10/27/2010] [Accepted: 11/17/2010] [Indexed: 12/14/2022]
Abstract
We propose a two-stage approach to analyze genome-wide association data in order to identify a set of promising single-nucleotide polymorphisms (SNPs). In stage one, we select a list of top signals from single SNP analyses by controlling false discovery rate. In stage two, we use the least absolute shrinkage and selection operator (LASSO) regression to reduce false positives. The proposed approach was evaluated using simulated quantitative traits based on genome-wide SNP data on 8,861 Caucasian individuals from the Atherosclerosis Risk in Communities (ARIC) Study. Our first stage, targeted at controlling false negatives, yields better power than using Bonferroni-corrected significance level. The LASSO regression reduces the number of significant SNPs in stage two: it reduces false-positive SNPs and it reduces true-positive SNPs also at simulated causal loci due to linkage disequilibrium. Interestingly, the LASSO regression preserves the power from stage one, i.e., the number of causal loci detected from the LASSO regression in stage two is almost the same as in stage one, while reducing false positives further. Real data on systolic blood pressure in the ARIC study was analyzed using our two-stage approach which identified two significant SNPs, one of which was reported to be genome-significant in a meta-analysis containing a much larger sample size. On the other hand, a single SNP association scan did not yield any significant results.
Affiliation(s)
- Gang Shi
- Division of Biostatistics, Washington University School of Medicine, Saint Louis, Missouri 63110-1093, USA.
|
20
|
Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet 2010; 42:570-5. [PMID: 20562874 DOI: 10.1038/ng.610] [Citation(s) in RCA: 491] [Impact Index Per Article: 35.1] [Received: 02/03/2010] [Accepted: 05/26/2010] [Indexed: 02/07/2023]
Abstract
We report a set of tools to estimate the number of susceptibility loci and the distribution of their effect sizes for a trait on the basis of discoveries from existing genome-wide association studies (GWASs). We propose statistical power calculations for future GWASs using estimated distributions of effect sizes. Using reported GWAS findings for height, Crohn's disease and breast, prostate and colorectal (BPC) cancers, we determine that each of these traits is likely to harbor additional loci within the spectrum of low-penetrance common variants. These loci, which can be identified from sufficiently powerful GWASs, together could explain at least 15-20% of the known heritability of these traits. However, for BPC cancers, which have modest familial aggregation, our analysis suggests that risk models based on common variants alone will have modest discriminatory power (63.5% area under curve), even with new discoveries.
|
21
|
Abstract
A two-stage design is cost-effective for genome-wide association studies (GWAS) testing hundreds of thousands of single nucleotide polymorphisms (SNPs). In this design, each SNP is genotyped in stage 1 using a fraction of case-control samples. Top-ranked SNPs are selected and genotyped in stage 2 using additional samples. A joint analysis, combining statistics from both stages, is applied in the second stage. Follow-up studies can be regarded as a two-stage design. Once some potential SNPs are identified, independent samples are further genotyped and analyzed separately or jointly with previous data to confirm the findings. When the underlying genetic model is known, an asymptotically optimal trend test (TT) can be used at each analysis. In practice, however, genetic models for SNPs with true associations are usually unknown. In this case, the existing methods for analysis of the two-stage design and follow-up studies are not robust across different genetic models. We propose a simple robust procedure with genetic model selection to the two-stage GWAS. Our results show that, if the optimal TT has about 80% power when the genetic model is known, then the existing methods for analysis of the two-stage design have minimum powers about 20% across the four common genetic models (when the true model is unknown), while our robust procedure has minimum powers about 70% across the same genetic models. The results can be also applied to follow-up and replication studies with a joint analysis.
Affiliation(s)
- Minjung Kwak
- Office of Biostatistics Research, National Heart, Lung and Blood Institute, 6701 Rockledge Drive, MSC 7913, Bethesda, Maryland 20892-7913, USA
|
22
|
Zhong H, Prentice RL. Correcting "winner's curse" in odds ratios from genomewide association findings for major complex human diseases. Genet Epidemiol 2010; 34:78-91. [PMID: 19639606 DOI: 10.1002/gepi.20437] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Indexed: 01/27/2023]
Abstract
Genome-wide association studies (GWAS) provide an important approach for identifying common genetic variants that predispose to human disease. However, odds ratio (OR) estimates for the reported findings from GWAS discovery data are typically affected by a bias away from the null, sometimes referred to as the "winner's curse". Also, standard confidence intervals (CIs) may have far from the desired coverage rates. We applied a bias reduction method to GWAS findings from several major complex human diseases, including breast cancer, colorectal cancer, lung cancer, prostate cancer, type I diabetes, and type II diabetes. We found the simple bias correction procedure allows one to estimate bias-adjusted ORs that have substantial consistency with ORs from subsequent replication studies, and that corresponding selection-adjusted CIs appear to help quantify the uncertainty of the findings. Selection-adjusted ORs and CIs can provide a reliable summary of GWAS data, and can help to choose single nucleotide polymorphisms for subsequent validation studies.
Affiliation(s)
- Hua Zhong
- Department of Genetics, Rosetta Inpharmatics, LLC, 401 Terry Ave North, Seattle, WA 98109, USA.
|
23
|
Joo J, Kwak M, Chen Z, Zheng G. Efficiency robust statistics for genetic linkage and association studies under genetic model uncertainty. Stat Med 2010; 29:158-80. [PMID: 19918942 DOI: 10.1002/sim.3759] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Indexed: 01/09/2023]
Abstract
When testing genetic linkage and association, test statistics that follow normal or chi-square distributions are often used. These statistics are usually derived under a specific mode of inheritance (genetic model). Common genetic models include, but are not limited to, the recessive, additive, multiplicative, and dominant models. For many diseases, the underlying genetic models are often unknown. Instead, a family of scientifically plausible genetic models may be available, which includes the four commonly used models. Hence, the optimal test is not available. Employing a single test statistic that is optimal for one model may lead to a substantial loss of power when the model is misspecified. In this situation efficient robust tests are useful. In this tutorial, we first review several commonly used robust statistics, including maximum efficiency robust tests, maximal tests, and constrained likelihood ratio tests for three common designs in genetic studies: (i) linkage analysis using affected sib-pairs, (ii) association studies using parents-offspring trios, and (iii) case-control association studies (unmatched and matched). Codes in the R statistical language for applying these robust statistics to test for linkage and association are presented with examples. We also provide some comparisons of the performance of the various robust tests via simulation studies. Guidelines for applications are also given for each study design. Finally, applications of robust tests to genome-wide association studies and meta-analysis are discussed.
Affiliation(s)
- Jungnam Joo
- Office of Biostatistics Research, National Heart, Lung and Blood Institute, Bethesda, MD 20892, USA
|
24
|
Thomas DC, Casey G, Conti DV, Haile RW, Lewinger JP, Stram DO. Methodological Issues in Multistage Genome-wide Association Studies. Stat Sci 2009; 24:414-429. [PMID: 20607129 DOI: 10.1214/09-sts288] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Indexed: 11/19/2022]
Abstract
Because of the high cost of commercial genotyping chip technologies, many investigations have used a two-stage design for genome-wide association studies, using part of the sample for an initial discovery of "promising" SNPs at a less stringent significance level and the remainder in a joint analysis of just these SNPs using custom genotyping. Typical cost savings of about 50% are possible with this design to obtain comparable levels of overall type I error and power by using about half the sample for stage I and carrying about 0.1% of SNPs forward to the second stage, the optimal design depending primarily upon the ratio of costs per genotype for stages I and II. However, with the rapidly declining costs of the commercial panels, the generally low observed ORs of current studies, and many studies aiming to test multiple hypotheses and multiple endpoints, many investigators are abandoning the two-stage design in favor of simply genotyping all available subjects using a standard high-density panel. Concern is sometimes raised about the absence of a "replication" panel in this approach, as required by some high-profile journals, but it must be appreciated that the two-stage design is not a discovery/replication design but simply a more efficient design for discovery using a joint analysis of the data from both stages. Once a subset of highly-significant associations has been discovered, a truly independent "exact replication" study is needed in a similar population of the same promising SNPs using similar methods. This can then be followed by (1) "generalizability" studies to assess the full scope of replicated associations across different races, different endpoints, different interactions, etc.; (2) fine-mapping or re-sequencing to try to identify the causal variant; and (3) experimental studies of the biological function of these genes. Multistage sampling designs may be more useful at this stage, say for selecting subsets of subjects for deep re-sequencing of regions identified in the GWAS.
Affiliation(s)
- Duncan C Thomas
- Department of Preventive Medicine, University of Southern California
|
25
|
Abstract
Replication helps ensure that a genotype-phenotype association observed in a genome-wide association (GWA) study represents a credible association and is not a chance finding or an artifact due to uncontrolled biases. We discuss prerequisites for exact replication; issues of heterogeneity; advantages and disadvantages of different methods of data synthesis across multiple studies; frequentist vs. Bayesian inferences for replication; and challenges that arise from multi-team collaborations. While consistent replication can greatly improve the credibility of a genotype-phenotype association, it may not eliminate spurious associations due to biases shared by many studies. Conversely, lack of replication in well-powered follow-up studies usually invalidates the initially proposed association, although occasionally it may point to differences in linkage disequilibrium or effect modifiers across studies.
Affiliation(s)
- Peter Kraft
- Departments of Epidemiology and Biostatistics, Harvard School of Public Health, Boston, MA, USA
|
26
|
Meta-analysis of genetic association studies: methodologies, between-study heterogeneity and winner's curse. J Hum Genet 2009; 54:615-23. [PMID: 19851339 DOI: 10.1038/jhg.2009.95] [Citation(s) in RCA: 74] [Impact Index Per Article: 4.9] [Indexed: 01/01/2023]
|
27
|
Rosenberg NA, Vanliere JM. Replication of genetic associations as pseudoreplication due to shared genealogy. Genet Epidemiol 2009; 33:479-87. [PMID: 19191270 DOI: 10.1002/gepi.20400] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/06/2022]
Abstract
The genotypes of individuals in replicate genetic association studies have some level of correlation due to shared descent in the complete pedigree of all living humans. As a result of this genealogical sharing, replicate studies that search for genotype-phenotype associations using linkage disequilibrium between marker loci and disease-susceptibility loci can be considered as "pseudoreplicates" rather than true replicates. We examine the size of the pseudoreplication effect in association studies simulated from evolutionary models of the history of a population, evaluating the excess probability that both of a pair of studies detect a disease association compared to the probability expected under the assumption that the two studies are independent. Each of nine combinations of a demographic model and a penetrance model leads to a detectable pseudoreplication effect, suggesting that the degree of support that can be attributed to a replicated genetic association result is less than that which can be attributed to a replicated result in a context of true independence.
Affiliation(s)
- Noah A Rosenberg
- Department of Human Genetics, Center for Computational Medicine and Biology, and the Life Sciences Institute, University of Michigan, Ann Arbor, Michigan 48109-2218, USA.
|
28
|
Empirical Bayes and semi-Bayes adjustments for a vast number of estimations. Eur J Epidemiol 2009; 24:737-41. [PMID: 19813100 DOI: 10.1007/s10654-009-9393-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 02/28/2009] [Accepted: 09/24/2009] [Indexed: 10/20/2022]
Abstract
Investigators in modern molecular/genetic epidemiology studies commonly analyze data on a vast number of candidate genetic markers. In such situations, rather than conventional estimation of effects (odds ratios), more accurate estimation methods are needed. The author proposes consideration of empirical Bayes and semi-Bayes methods, which yield 'adjustments for multiple estimations' by shrinking conventional effect estimates towards the overall average effect.
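The shrinkage idea in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's procedure: the function name `semi_bayes_shrink`, the use of an unweighted mean as the "overall average effect", and the example numbers are all hypothetical. The prior variance is supplied by the analyst, as in a semi-Bayes analysis; an empirical Bayes analysis would instead estimate it from the data.

```python
import statistics

def semi_bayes_shrink(estimates, std_errors, prior_var):
    """Shrink each estimate (e.g. a log odds ratio) toward the overall mean.

    Assumption for this sketch: the prior mean is the unweighted average of
    the estimates, and prior_var is a pre-specified prior variance.
    """
    prior_mean = statistics.fmean(estimates)
    shrunk = []
    for est, se in zip(estimates, std_errors):
        # Weight on the observed estimate: near 1 when the estimate is
        # precise (small se), near 0 when it is noisy (large se).
        weight = prior_var / (prior_var + se ** 2)
        shrunk.append(prior_mean + weight * (est - prior_mean))
    return shrunk

# Hypothetical log odds ratios and standard errors for three markers.
log_ors = [0.0, 0.2, 0.8]
ses = [0.1, 0.1, 0.4]
adjusted = semi_bayes_shrink(log_ors, ses, prior_var=0.04)
for raw, adj in zip(log_ors, adjusted):
    print(f"raw={raw:.3f} adjusted={adj:.3f}")
```

The least precise estimate (standard error 0.4) is pulled furthest toward the average, which is the behaviour the abstract describes as an adjustment for multiple estimations.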
|
29
|
Strömberg U, Björk J, Vineis P, Broberg K, Zeggini E. Ranking of genome-wide association scan signals by different measures. Int J Epidemiol 2009; 38:1364-73. [PMID: 19734549 PMCID: PMC3072755 DOI: 10.1093/ije/dyp285] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND: The P-value approach has been employed for prioritizing genome-wide association (GWA) scan signals, with genome-wide significance defined by a prior P-value threshold, although this is not ideal. A rationale put forward is that the association signals should rather be expected to give less support for single nucleotide polymorphisms (SNPs) that are rare (with associated low-power tests) than for common SNPs with equivalent P-values, unless investigators believe, a priori, that rare causative variants contribute to the disease and have more pronounced effects. METHODS: Using data from a GWA scan for type 2 diabetes (1924 cases, 2938 controls, 393 453 SNPs), we compared P-values with four alternative signal measures: likelihood ratio (LR), Bayes factor (BF; with a specified prior distribution for true effects), 'frequentist factor' (FF; reflecting the ratio between estimated (post-data) 'power' and P-value) and probability of pronounced effect size (PrPES). RESULTS: The 19 common SNPs [minor allele frequency (MAF) among the controls >29%] yielding strong P-value signals (P < 5 × 10^-7) were also top ranked by the other approaches. There was a strong similarity between the P-values, LR and BF signals, in terms of ranking SNPs. In contrast, FF and PrPES signals down-weighted rare SNPs (control MAF <10%) with low P-values. CONCLUSIONS: For prioritization of signals that do not achieve compelling levels of evidence for association, the main driving force behind observed differences between the various association signals appears to be SNP MAF. The statistical power afforded by follow-up samples for establishing replication should be taken into account when tailoring the signal selection strategy.
Affiliation(s)
- Ulf Strömberg
- Department of Occupational and Environmental Medicine, Lund University Hospital, Lund, Sweden.
|
30
|
Xiao R, Boehnke M. Quantifying and correcting for the winner's curse in genetic association studies. Genet Epidemiol 2009; 33:453-62. [PMID: 19140131 DOI: 10.1002/gepi.20398] [Citation(s) in RCA: 136] [Impact Index Per Article: 9.1] [Indexed: 11/08/2022]
Abstract
Genetic association studies are a powerful tool to detect genetic variants that predispose to human disease. Once an associated variant is identified, investigators are also interested in estimating the effect of the identified variant on disease risk. Estimates of the genetic effect based on new association findings tend to be upwardly biased due to a phenomenon known as the "winner's curse." Overestimation of genetic effect size in initial studies may cause follow-up studies to be underpowered and so to fail. In this paper, we quantify the impact of the winner's curse on the allele frequency difference and odds ratio estimators for one- and two-stage case-control association studies. We then propose an ascertainment-corrected maximum likelihood method to reduce the bias of these estimators. We show that overestimation of the genetic effect by the uncorrected estimator decreases as the power of the association study increases and that the ascertainment-corrected method reduces absolute bias and mean square error unless power to detect association is high.
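The relationship between power and overestimation stated in the abstract above can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not taken from the paper: effect estimates are drawn from a normal distribution around a true effect, an estimate is "reported" only if its z-score exceeds a threshold, and the function name `simulate_winners_curse` and all numeric values are hypothetical. It does not implement the ascertainment-corrected maximum likelihood method.

```python
import random
import statistics

def simulate_winners_curse(true_beta, se, z_threshold, n_studies=20000, seed=0):
    """Return (mean of all estimates, mean of significant estimates, power)."""
    rng = random.Random(seed)
    # Each simulated study yields a noisy estimate of the same true effect.
    estimates = [rng.gauss(true_beta, se) for _ in range(n_studies)]
    # Only estimates passing the two-sided significance threshold are reported.
    significant = [b for b in estimates if abs(b / se) > z_threshold]
    power = len(significant) / n_studies
    mean_all = statistics.fmean(estimates)
    mean_sig = statistics.fmean(significant) if significant else float("nan")
    return mean_all, mean_sig, power

# Low-power setting: the true effect lies below the selection threshold.
low = simulate_winners_curse(true_beta=0.10, se=0.05, z_threshold=3.0)
# High-power setting: the true effect lies well above the threshold.
high = simulate_winners_curse(true_beta=0.30, se=0.05, z_threshold=3.0)

for label, truth, (mean_all, mean_sig, power) in (("low", 0.10, low), ("high", 0.30, high)):
    print(f"{label} power={power:.3f} bias of significant estimates={mean_sig - truth:.4f}")
```

In this toy setting the average of the significant estimates exceeds the true effect by a wide margin when power is low and by very little when power is high, while the average over all estimates stays close to the truth in both cases.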
Affiliation(s)
- Rui Xiao
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, USA
|
31
|
Bowden J, Dudbridge F. Unbiased estimation of odds ratios: combining genomewide association scans with replication studies. Genet Epidemiol 2009; 33:406-18. [PMID: 19140132 PMCID: PMC2726957 DOI: 10.1002/gepi.20394] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.1] [Indexed: 11/24/2022]
Abstract
Odds ratios or other effect sizes estimated from genome scans are upwardly biased, because only the top-ranking associations are reported, and moreover only if they reach a defined level of significance. No unbiased estimate exists based on data selected in this fashion, but replication studies are routinely performed that allow unbiased estimation of the effect sizes. Estimation based on replication data alone is inefficient in the sense that the initial scan could, in principle, contribute information on the effect size. We propose an unbiased estimator combining information from both the initial scan and the replication study, which is more efficient than that based just on the replication. Specifically, we adjust the standard combined estimate to allow for selection by rank and significance in the initial scan. Our approach explicitly allows for multiple associations arising from a scan, and is robust to mis-specification of a significance threshold. We require replication data to be available but argue that, in most applications, estimates of effect sizes are only useful when associations have been replicated. We illustrate our approach on some recently completed scans and explore its efficiency by simulation. Genet. Epidemiol. 33:406–418, 2009. © 2009 Wiley-Liss, Inc.
|
32
|
Li Q, Yu K. Inference of non-centrality parameter of a truncated non-central chi-squared distribution. J Stat Plan Inference 2009. [DOI: 10.1016/j.jspi.2008.11.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/16/2022]
|
33
|
Li Q, Wacholder S, Hunter DJ, Hoover RN, Chanock S, Thomas G, Yu K. Genetic background comparison using distance-based regression, with applications in population stratification evaluation and adjustment. Genet Epidemiol 2009; 33:432-41. [PMID: 19140130 PMCID: PMC2706300 DOI: 10.1002/gepi.20396] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 11/07/2022]
Abstract
Population stratification (PS) can lead to an inflated rate of false-positive findings in genome-wide association studies (GWAS). The commonly used approach of adjustment for a fixed number of principal components (PCs) could have a deleterious impact on power when selected PCs are equally distributed in cases and controls, or the adjustment of certain covariates, such as self-identified ethnicity or recruitment center, already included in the association analyses, correctly maps to major axes of genetic heterogeneity. We propose a computationally efficient procedure, PC-Finder, to identify a minimal set of PCs while permitting an effective correction for PS. A general pseudo F statistic, derived from a non-parametric multivariate regression model, can be used to assess whether PS exists or has been adequately corrected by a set of selected PCs. Empirical data from two GWAS conducted as part of the Cancer Genetic Markers of Susceptibility (CGEMS) project demonstrate the application of the procedure. Furthermore, simulation studies show the power advantage of the proposed procedure in GWAS over currently used PS correction strategies, particularly when the PCs with substantial genetic variation are distributed similarly in cases and controls and therefore do not induce PS.
Affiliation(s)
- Qizhai Li
- Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892 USA
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- Sholom Wacholder
- Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892 USA
- David J. Hunter
- Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892 USA
- Program in Molecular and Genetic Epidemiology, Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA
- Robert N. Hoover
- Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892 USA
- Stephen Chanock
- Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892 USA
- Gilles Thomas
- Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892 USA
- Kai Yu
- Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892 USA
|
34
|
Wacholder S, Rotunno M. Control Selection Options for Genome-Wide Association Studies in Cohorts. Cancer Epidemiol Biomarkers Prev 2009; 18:695-7. [DOI: 10.1158/1055-9965.epi-08-1114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
Investigators planning studies within cohorts have many options for choosing an efficient sampling design for genome-wide association and other molecular epidemiology studies. Consideration of person-year and proportional hazards analyses of full cohorts may add further insight into ramifications of different designs. Empirical evidence from genome-wide association studies can supplement intuition and simulations in comparing properties of various case-control designs within cohorts. Additional theoretical and empirical work, justification of sampling choice in publications, and consideration of context and scientific aims can improve designs and, thereby, increase the scientific value and cost effectiveness of future studies. (Cancer Epidemiol Biomarkers Prev 2009;18(3):695–7)
Affiliation(s)
- Sholom Wacholder
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, Bethesda, Maryland
- Melissa Rotunno
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, Bethesda, Maryland
|
35
|
Scherag A, Hebebrand J, Schäfer H, Müller HH. Flexible designs for genomewide association studies. Biometrics 2009; 65:815-21. [PMID: 19173695 DOI: 10.1111/j.1541-0420.2008.01174.x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 01/08/2023]
Abstract
Genomewide association studies attempting to unravel the genetic etiology of complex traits have recently gained attention. Frequently, these studies employ a sequential genotyping strategy: A large panel of markers is examined in a subsample of subjects, and the most promising markers are genotyped in the remaining subjects. In this article, we introduce a novel method for such designs enabling investigators to, for example, modify marker densities and sample proportions while strongly controlling the family-wise type I error rate. Loss of efficiency is avoided by redistributing conditional type I error rates of discarded markers. Our approach can be combined with cost optimal designs and entails a greater flexibility than all previously suggested designs. Among other features, it allows for marker selections based upon biological criteria instead of statistical criteria alone, or the option to modify the sample size at any time during the course of the project. For practical applicability, we develop a new algorithm, subsequently evaluate it by simulations, and illustrate it using a real data set.
Affiliation(s)
- André Scherag
- Institute of Medical Biometry and Epidemiology, Philipps-University, Marburg, Germany
|
36
|
Abstract
The estimated effect of a marker allele from the initial study reporting the marker-allele association is often exaggerated relative to the estimated effect in follow-up studies (the "winner's curse" phenomenon). This is a particular concern for genome-wide association studies, where markers typically must pass very stringent significance thresholds to be selected for replication. A related problem is the overestimation of the predictive accuracy that occurs when the same data set is used to select a multilocus risk model from a wide range of possible models and then estimate the accuracy of the final model ("over-fitting"). Even in the absence of these quantitative biases, researchers can over-state the qualitative importance of their findings--for example, by focusing on relative risks in a context where sensitivity and specificity may be more appropriate measures. Epidemiologists need to be aware of these potential problems: as authors, to avoid or minimize them, and as readers, to detect them.
|
37
|
Han B, Kang HM, Seo MS, Zaitlen N, Eskin E. Efficient association study design via power-optimized tag SNP selection. Ann Hum Genet 2008; 72:834-47. [PMID: 18702637 DOI: 10.1111/j.1469-1809.2008.00469.x] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Indexed: 11/27/2022]
Abstract
Discovering statistical correlation between causal genetic variation and clinical traits through association studies is an important method for identifying the genetic basis of human diseases. Since fully resequencing a cohort is prohibitively costly, genetic association studies take advantage of local correlation structure (or linkage disequilibrium) between single nucleotide polymorphisms (SNPs) by selecting a subset of SNPs to be genotyped (tag SNPs). While many current association studies are performed using commercially available high-throughput genotyping products that define a set of tag SNPs, choosing tag SNPs remains an important problem for both custom follow-up studies as well as designing the high-throughput genotyping products themselves. The most widely used tag SNP selection method optimizes the correlation between SNPs (r²). However, tag SNPs chosen based on an r² criterion do not necessarily maximize the statistical power of an association study. We propose a study design framework that chooses SNPs to maximize power and efficiently measures the power through empirical simulation. Empirical results based on the HapMap data show that our method gains considerable power over a widely used r²-based method, or equivalently reduces the number of tag SNPs required to attain the desired power of a study. Our power-optimized 100k whole genome tag set provides equivalent power to the Affymetrix 500k chip for the CEU population. For the design of custom follow-up studies, our method provides up to twice the power increase using the same number of tag SNPs as r²-based methods. Our method is publicly available via web server at http://design.cs.ucla.edu.
Affiliation(s)
- B Han
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093, USA
|
38
|
Katki HA. Invited Commentary: Evidence-based Evaluation of p Values and Bayes Factors. Am J Epidemiol 2008. [DOI: 10.1093/aje/kwn148] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Indexed: 11/14/2022]
|
39
|
Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 2008; 9:356-69. [PMID: 18398418 DOI: 10.1038/nrg2344] [Citation(s) in RCA: 1873] [Impact Index Per Article: 117.1] [Indexed: 12/13/2022]
Abstract
The past year has witnessed substantial advances in understanding the genetic basis of many common phenotypes of biomedical importance. These advances have been the result of systematic, well-powered, genome-wide surveys exploring the relationships between common sequence variation and disease predisposition. This approach has revealed over 50 disease-susceptibility loci and has provided insights into the allelic architecture of multifactorial traits. At the same time, much has been learned about the successful prosecution of association studies on such a scale. This Review highlights the knowledge gained, defines areas of emerging consensus, and describes the challenges that remain as researchers seek to obtain more complete descriptions of the susceptibility architecture of biomedical traits of interest and to translate the information gathered into improvements in clinical management.
|
40
|
Ghosh A, Zou F, Wright FA. Estimating odds ratios in genome scans: an approximate conditional likelihood approach. Am J Hum Genet 2008; 82:1064-74. [PMID: 18423522 DOI: 10.1016/j.ajhg.2008.03.002] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.4] [Received: 12/24/2007] [Revised: 02/28/2008] [Accepted: 03/05/2008] [Indexed: 11/29/2022]
Abstract
In modern whole-genome scans, the use of stringent thresholds to control the genome-wide testing error distorts the estimation process, producing estimated effect sizes that may be on average far greater in magnitude than the true effect sizes. We introduce a method, based on the estimate of genetic effect and its standard error as reported by standard statistical software, to correct for this bias in case-control association studies. Our approach is widely applicable, is far easier to implement than competing approaches, and may often be applied to published studies without access to the original data. We evaluate the performance of our approach via extensive simulations for a range of genetic models, minor allele frequencies, and genetic effect sizes. Compared to the naive estimation procedure, our approach reduces the bias and the mean squared error, especially for modest effect sizes. We also develop a principled method to construct confidence intervals for the genetic effect that acknowledges the conditioning on statistical significance. Our approach is described in the specific context of odds ratios and logistic modeling but is more widely applicable. Application to recently published data sets demonstrates the relevance of our approach to modern genome scans.
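The idea of conditioning on statistical significance can be illustrated with a minimal sketch. It assumes a single normally distributed test statistic z = estimate / standard error with unit variance and a two-sided reporting threshold c, and it maximizes the likelihood of z given |z| > c by a simple grid search. The function names and the grid search are illustrative assumptions, not the estimator published by the authors.

```python
import math


def norm_cdf(x):
    """Standard normal CDF, written with erfc to stay accurate in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def conditional_loglik(mu, z, c):
    """Log-density of Z ~ N(mu, 1) at z, given that |Z| > c was required."""
    log_density = -0.5 * (z - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)
    # Probability that a statistic with mean mu passes the two-sided threshold.
    prob_selected = norm_cdf(mu - c) + norm_cdf(-mu - c)
    return log_density - math.log(prob_selected)


def conditional_mle(z, c, step=0.001):
    """Grid-search maximizer of the conditional likelihood over mu."""
    bound = abs(z) + 1.0
    n_steps = int(round(2.0 * bound / step))
    best_mu, best_ll = 0.0, -math.inf
    for i in range(n_steps + 1):
        mu = -bound + i * step
        ll = conditional_loglik(mu, z, c)
        if ll > best_ll:
            best_mu, best_ll = mu, ll
    return best_mu


def corrected_log_or(log_or, se, c):
    """Shrink a reported log odds ratio using only the estimate and its SE."""
    z = log_or / se
    return conditional_mle(z, c) * se


if __name__ == "__main__":
    # Hypothetical numbers: a log OR of 0.30 with SE 0.055, threshold |z| > 5.
    print(corrected_log_or(0.30, 0.055, 5.0))
```

A statistic that barely clears the threshold is pulled noticeably toward zero, while one far above it is left almost unchanged, which matches the abstract's remark that the correction matters most for modest effect sizes.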
Affiliation(s)
- Arpita Ghosh
- Department of Biostatistics, The University of North Carolina at Chapel Hill, NC 27599, USA
|
41
|
Li Q, Zheng G, Li Z, Yu K. Efficient approximation of P-value of the maximum of correlated tests, with applications to genome-wide association studies. Ann Hum Genet 2008; 72:397-406. [PMID: 18318785 DOI: 10.1111/j.1469-1809.2008.00437.x] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.5] [Indexed: 11/26/2022]
Abstract
Genome-wide association study (GWAS), typically involving 100,000 to 500,000 single-nucleotide polymorphisms (SNPs), is a powerful approach to identify disease susceptibility loci. In a GWAS, single-marker analysis, which tests one SNP at a time, is usually used as the first stage to screen SNPs across the genome in order to identify a small fraction of promising SNPs with relatively low p-values for further and more focused studies. For single-marker analysis, the trend test derived for an additive genetic model is often used. This may not be robust when the additive assumption is not appropriate for the true underlying disease model. A robust test, MAX, based on the maximum of three trend test statistics derived for recessive, additive, and dominant models, has been proposed recently for GWAS. But its p-value has to be evaluated through a resampling-based procedure, which is computationally challenging for the analysis of GWAS. Obtaining the p-value for MAX with adjustment for the covariates can be even more time-consuming. In this article, we provide a simple approximation for the p-value of the MAX test with or without adjusting for the covariates. The new method avoids resampling steps and thus makes the MAX test readily applicable to GWAS. We use simulation studies as well as real datasets on 17 confirmed disease-associated SNPs to assess the accuracy of the proposed method. We also apply the method to the GWAS of coronary artery disease.
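As a rough illustration of the MAX statistic itself (not of the p-value approximation the article proposes), the sketch below computes Cochran-Armitage trend statistics under recessive, additive, and dominant genotype scores and takes the largest absolute value. It assumes a 2x3 table of case and control counts ordered by 0, 1, and 2 copies of the risk allele, no covariates, and the standard hypergeometric variance of the trend statistic; the names `SCORES`, `trend_z`, and `max_statistic` are hypothetical.

```python
import math

# Genotype scores for 0, 1, and 2 copies of the risk allele.
SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}


def trend_z(cases, controls, scores):
    """Cochran-Armitage trend statistic (standardized) for one score vector."""
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    t = sum(x * (n_controls * r - n_cases * s)
            for x, r, s in zip(scores, cases, controls))
    sum_xn = sum(x * m for x, m in zip(scores, totals))
    sum_x2n = sum(x * x * m for x, m in zip(scores, totals))
    # Variance under the null; it is zero if the scored genotypes are absent.
    var = (n_cases * n_controls / n) * (n * sum_x2n - sum_xn ** 2)
    return t / math.sqrt(var)


def max_statistic(cases, controls):
    """Return the model with the largest |Z|, that |Z|, and all three Z values."""
    z = {model: trend_z(cases, controls, x) for model, x in SCORES.items()}
    best = max(z, key=lambda model: abs(z[model]))
    return best, abs(z[best]), z


if __name__ == "__main__":
    # Hypothetical counts for genotypes with 0, 1, 2 risk alleles.
    print(max_statistic((30, 30, 40), (35, 35, 30)))
```

Because the three statistics are computed from the same table they are correlated, so the maximum cannot be referred to a standard normal distribution; that is the gap the resampling procedure or the article's approximation is meant to fill.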
Affiliation(s)
- Qizhai Li
- Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA
|
42
|
Zhong H, Prentice RL. Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics 2008; 9:621-34. [PMID: 18310059 DOI: 10.1093/biostatistics/kxn001] [Citation(s) in RCA: 121] [Impact Index Per Article: 7.6] [Indexed: 11/13/2022]
Abstract
Genome-wide association studies (GWAS) provide an important approach to identifying common genetic variants that predispose to human disease. A typical GWAS may genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) located throughout the human genome in a set of cases and controls. Logistic regression is often used to test for association between a SNP genotype and case versus control status, with corresponding odds ratios (ORs) typically reported only for those SNPs meeting selection criteria. However, when these estimates are based on the original data used to detect the variant, the results are affected by a selection bias sometimes referred to as the "winner's curse" (Capen and others, 1971). The actual genetic association is typically overestimated. We show that such selection bias may be severe in the sense that the conditional expectation of the standard OR estimator may be quite far away from the underlying parameter. Also, standard confidence intervals (CIs) may have coverage far from the desired rate for the selected ORs. We propose and evaluate 3 bias-reduced estimators, and also corresponding weighted estimators that combine corrected and uncorrected estimators, to reduce selection bias. Their corresponding CIs are also proposed. We study the performance of these estimators using simulated data sets and show that they reduce the bias and give CI coverage close to the desired level under various scenarios, even for associations having only small statistical power.
Affiliation(s)
- Hua Zhong
- Department of Biostatistics, University of Washington, Seattle, WA 98105, USA.
|
43
|
Methods for meta-analysis in genetic association studies: a review of their potential and pitfalls. Hum Genet 2007; 123:1-14. [PMID: 18026754 DOI: 10.1007/s00439-007-0445-9] [Citation(s) in RCA: 140] [Impact Index Per Article: 8.2] [Received: 05/27/2007] [Accepted: 10/29/2007] [Indexed: 12/14/2022]
Abstract
Meta-analysis offers the opportunity to combine evidence from retrospectively accumulated or prospectively generated data. Meta-analyses may provide summary estimates and can help in detecting and addressing potential inconsistency between the combined datasets. Application of meta-analysis in genetic associations presents considerable potential and several pitfalls. In this review, we present basic principles of meta-analytic methods, adapted for human genome epidemiology. We describe issues that arise in the retrospective or the prospective collection of relevant data through various sources, common traps to consider in the appraisal of evidence and potential biases that may interfere. We describe the relative merits and caveats for common methods used to trace inconsistency across studies along with possible reasons for non-replication of proposed associations. Different statistical models may be employed to combine data and some common misconceptions may arise in the process. Several meta-analysis diagnostics are often applied or misapplied in the literature, and we comment on their use and limitations. An alternative to overcome limitations arising from retrospective combination of data from published studies is to create networks of research teams working in the same field and perform collaborative meta-analyses of individual participant data, ideally on a prospective basis. We discuss the advantages and the challenges inherent in such collaborative approaches. Meta-analysis can be a useful tool in dissecting the genetics of complex diseases and traits, provided its methods are properly applied and interpreted.
|