1
Deng Y, He Y, Xu G, Pan W. Speeding up Monte Carlo simulations for the adaptive sum of powered score test with importance sampling. Biometrics 2022; 78:261-273. [PMID: 33215683 PMCID: PMC8134502 DOI: 10.1111/biom.13407] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/30/2019] [Revised: 08/30/2020] [Accepted: 10/29/2020] [Indexed: 12/21/2022]
Abstract
A central but challenging problem in genetic studies is to test for (usually weak) associations between a complex trait (e.g., a disease status) and sets of multiple genetic variants. Due to the lack of a uniformly most powerful test, data-adaptive tests, such as the adaptive sum of powered score (aSPU) test, are advantageous in maintaining high power against a wide range of alternatives. However, there is often no closed-form to accurately and analytically calculate the p-values of many adaptive tests like aSPU, thus Monte Carlo (MC) simulations are often used, which can be time consuming to achieve a stringent significance level (e.g., 5e-8) used in genome-wide association studies (GWAS). To estimate such a small p-value, we need a huge number of MC simulations (e.g., 1e+10). As an alternative, we propose using importance sampling to speed up such calculations. We develop some theory to motivate a proposed algorithm for the aSPU test, and show that the proposed method is computationally more efficient than the standard MC simulations. Using both simulated and real data, we demonstrate the superior performance of the new method over the standard MC simulations.
Affiliation(s)
- Yangqing Deng
- Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA; Department of Mathematics, University of North Texas, Denton, TX 76203, USA
- Yinqiu He
- Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
- Gongjun Xu
- Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
- Wei Pan
- Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA (corresponding author)
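The abstract of entry 1 contrasts plain Monte Carlo with importance sampling for very small p-values. The sketch below illustrates that contrast on a toy problem only: it estimates the upper tail probability of a standard normal variable, not the aSPU statistic, and the shifted proposal N(t, 1) and all function names are assumptions made for illustration rather than the algorithm proposed in the paper.

```python
import math
import random


def tail_prob_naive(t, n, rng):
    """Plain Monte Carlo estimate of P(Z > t) for Z ~ N(0, 1)."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > t)
    return hits / n


def tail_prob_importance(t, n, rng):
    """Importance-sampling estimate of P(Z > t) using the proposal N(t, 1)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(t, 1.0)  # draw from the shifted proposal
        if x > t:
            # Likelihood ratio phi(x) / phi(x - t) = exp(-t*x + t^2 / 2)
            total += math.exp(-t * x + 0.5 * t * t)
    return total / n


rng = random.Random(0)
t = 5.0
exact = 0.5 * math.erfc(t / math.sqrt(2.0))  # closed form for the normal tail
naive = tail_prob_naive(t, 100_000, rng)
weighted = tail_prob_importance(t, 100_000, rng)
print(f"exact={exact:.3e} naive={naive:.3e} importance={weighted:.3e}")
```

With the same number of draws, the plain estimator rarely sees a single exceedance at this threshold, while the weighted estimator places about half of its draws in the tail and corrects for the shift through the likelihood ratio.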
2
Kawaguchi ES, Li G, Lewinger JP, Gauderman WJ. Two-step hypothesis testing to detect gene-environment interactions in a genome-wide scan with a survival endpoint. Stat Med 2022; 41:1644-1657. [PMID: 35075649 PMCID: PMC9007892 DOI: 10.1002/sim.9319] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/13/2021] [Revised: 11/10/2021] [Accepted: 12/26/2021] [Indexed: 01/13/2023]
Abstract
Defined by their genetic profile, individuals may exhibit differential clinical outcomes due to an environmental exposure. Identifying subgroups based on specific exposure-modifying genes can lead to targeted interventions and focused studies. Genome-wide interaction scans (GWIS) can be performed to identify such genes, but these scans typically suffer from low power due to the large multiple testing burden. We provide a novel framework for powerful two-step hypothesis tests for GWIS with a time-to-event endpoint under the Cox proportional hazards model. In the Cox regression setting, we develop an approach that prioritizes genes for Step-2 G × E testing based on a carefully constructed Step-1 screening procedure. Simulation results demonstrate this two-step approach can lead to substantially higher power for identifying gene-environment (G × E) interactions compared to the standard GWIS while preserving the family-wise error rate over a range of scenarios. In a taxane-anthracycline chemotherapy study for breast cancer patients, the two-step approach identifies several gene expression by treatment interactions that would not be detected using the standard GWIS.
Affiliation(s)
- Eric S Kawaguchi
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
- Gang Li
- Department of Biostatistics, University of California, Los Angeles, Los Angeles, California, USA; Department of Computational Medicine, University of California, Los Angeles, Los Angeles, California, USA
- Juan Pablo Lewinger
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
- W James Gauderman
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
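The two-step idea in the abstract of entry 2 can be illustrated with a generic subset-testing sketch: Step 1 screens all markers, and Step 2 applies a Bonferroni correction only over the markers that passed. The function name, the thresholds, and the p-values below are hypothetical, and this is not the Cox-model procedure developed in the paper. It also assumes the Step-1 statistic is independent of the Step-2 statistic under the null, which is what makes the reduced correction valid.

```python
def two_step_test(screen_p, test_p, screen_alpha=0.05, fwer=0.05):
    """Return indices declared significant by a subset-testing two-step rule."""
    if len(screen_p) != len(test_p):
        raise ValueError("screen_p and test_p must have the same length")
    # Step 1: keep only the markers whose screening p-value is small.
    passed = [i for i, p in enumerate(screen_p) if p <= screen_alpha]
    if not passed:
        return []
    # Step 2: Bonferroni over the screened subset instead of all markers.
    threshold = fwer / len(passed)
    return [i for i in passed if test_p[i] <= threshold]


# Hypothetical p-values for ten markers.
screen_p = [0.001, 0.5, 0.02, 0.9, 0.3, 0.04, 0.7, 0.6, 0.8, 0.2]
test_p = [0.01, 0.0001, 0.03, 0.4, 0.6, 0.2, 0.9, 0.5, 0.7, 0.8]

significant = two_step_test(screen_p, test_p)
one_step = [i for i, p in enumerate(test_p) if p <= 0.05 / len(test_p)]
print(significant, one_step)
```

In this toy input the two-step rule detects marker 0, which a one-step Bonferroni scan over all ten markers misses, but it loses marker 1 because that marker failed the screen. The power gain depends on how informative the screening statistic is.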
3
Schweiger R, Fisher E, Weissbrod O, Rahmani E, Müller-Nurasyid M, Kunze S, Gieger C, Waldenberger M, Rosset S, Halperin E. Detecting heritable phenotypes without a model using fast permutation testing for heritability and set-tests. Nat Commun 2018; 9:4919. [PMID: 30464216 PMCID: PMC6249264 DOI: 10.1038/s41467-018-07276-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 11/07/2017] [Accepted: 10/26/2018] [Indexed: 01/08/2023]
Abstract
Testing for association between a set of genetic markers and a phenotype is a fundamental task in genetic studies. Standard approaches for heritability and set testing strongly rely on parametric models that make specific assumptions regarding phenotypic variability. Here, we show that resulting p-values may be inflated by up to 15 orders of magnitude, in a heritability study of methylation measurements, and in a heritability and expression quantitative trait loci analysis of gene expression profiles. We propose FEATHER, a method for fast permutation-based testing of marker sets and of heritability, which properly controls for false-positive results. FEATHER eliminated 47% of methylation sites found to be heritable by the parametric test, suggesting a substantial inflation of false-positive findings by alternative methods. Our approach can rapidly identify heritable phenotypes out of millions of phenotypes acquired via high-throughput technologies, does not suffer from model misspecification and is highly efficient. Standard approaches for heritability and set testing in statistical genetics rely on parametric models that might not hold in reality and give inflated p-values. Here, the authors develop a fast method for permutation-based testing of marker sets and of heritability that does not suffer from model misspecification.
Affiliation(s)
- Regev Schweiger
- Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, 6997801, Israel
- Eyal Fisher
- School of Mathematical Sciences, Department of Statistics, Tel Aviv University, Tel Aviv, 69978, Israel
- Omer Weissbrod
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, 02115, MA, USA
- Elior Rahmani
- Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, 6997801, Israel
- Martina Müller-Nurasyid
- Institute of Genetic Epidemiology, Helmholtz Zentrum München-German Research Center for Environmental Health, Neuherberg, 85764, Germany; Department of Medicine I, Ludwig-Maximilians-Universität, Munich, 80539, Germany; DZHK (German Centre for Cardiovascular Research), partner site Munich Heart Alliance, Munich, 80636, Germany
- Sonja Kunze
- Institute of Epidemiology II, Helmholtz Zentrum München - German Research Center for Environmental Health, 85764, Neuherberg, Germany; Research Unit of Molecular Epidemiology, Helmholtz Zentrum München-German Research Center for Environmental Health, 85764, Neuherberg, Germany
- Christian Gieger
- Institute of Epidemiology II, Helmholtz Zentrum München - German Research Center for Environmental Health, 85764, Neuherberg, Germany; Research Unit of Molecular Epidemiology, Helmholtz Zentrum München-German Research Center for Environmental Health, 85764, Neuherberg, Germany
- Melanie Waldenberger
- DZHK (German Centre for Cardiovascular Research), partner site Munich Heart Alliance, Munich, 80636, Germany; Institute of Epidemiology II, Helmholtz Zentrum München - German Research Center for Environmental Health, 85764, Neuherberg, Germany; Research Unit of Molecular Epidemiology, Helmholtz Zentrum München-German Research Center for Environmental Health, 85764, Neuherberg, Germany
- Saharon Rosset
- School of Mathematical Sciences, Department of Statistics, Tel Aviv University, Tel Aviv, 69978, Israel
- Eran Halperin
- Los Angeles, University of California Los Angeles, Los Angeles, 90095, CA, USA; Department of Anesthesiology and Perioperative Medicine, University of California, Los Angeles, 90095, CA, USA
4
Segal BD, Braun T, Elliott MR, Jiang H. Fast approximation of small p-values in permutation tests by partitioning the permutations. Biometrics 2017. [PMID: 29542118 DOI: 10.1111/biom.12731] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 11/26/2022]
Abstract
Researchers in genetics and other life sciences commonly use permutation tests to evaluate differences between groups. Permutation tests have desirable properties, including exactness if data are exchangeable, and are applicable even when the distribution of the test statistic is analytically intractable. However, permutation tests can be computationally intensive. We propose both an asymptotic approximation and a resampling algorithm for quickly estimating small permutation p-values (e.g., <10^-6) for the difference and ratio of means in two-sample tests. Our methods are based on the distribution of test statistics within and across partitions of the permutations, which we define. In this article, we present our methods and demonstrate their use through simulations and an application to cancer genomic data. Through simulations, we find that our resampling algorithm is more computationally efficient than another leading alternative, particularly for extremely small p-values (e.g., <10^-30). Through application to cancer genomic data, we find that our methods can successfully identify up- and down-regulated genes. While we focus on the difference and ratio of means, we speculate that our approaches may work in other settings.
Affiliation(s)
- Brian D Segal
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109-2029, U.S.A
- Thomas Braun
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109-2029, U.S.A
- Michael R Elliott
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109-2029, U.S.A
- Hui Jiang
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109-2029, U.S.A
5
Zhou YH, Wright FA. Hypothesis testing at the extremes: fast and robust association for high-throughput data. Biostatistics 2015; 16:611-25. [PMID: 25792622 DOI: 10.1093/biostatistics/kxv007] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 09/17/2014] [Accepted: 02/16/2015] [Indexed: 01/16/2023]
Abstract
A number of biomedical problems require performing many hypothesis tests, with an attendant need to apply stringent thresholds. Often the data take the form of a series of predictor vectors, each of which must be compared with a single response vector, perhaps with nuisance covariates. Parametric tests of association are often used, but can result in inaccurate type I error at the extreme thresholds, even for large sample sizes. Furthermore, standard two-sided testing can reduce power compared with the doubled p-value, due to asymmetry in the null distribution. Exact (permutation) testing is attractive, but can be computationally intensive and cumbersome. We present an approximation to exact association tests of trend that is accurate and fast enough for standard use in high-throughput settings, and can easily provide standard two-sided or doubled p-values. The approach is shown to be equivalent under permutation to likelihood ratio tests for the most commonly used generalized linear models (GLMs). For linear regression, covariates are handled by working with covariate-residualized responses and predictors. For GLMs, stratified covariates can be handled in a manner similar to exact conditional testing. Simulations and examples illustrate the wide applicability of the approach. The accompanying mcc package is available on CRAN http://cran.r-project.org/web/packages/mcc/index.html.
Affiliation(s)
- Yi-Hui Zhou
- Bioinformatics Research Center, Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA
- Fred A Wright
- Bioinformatics Research Center, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
6
Che R, Jack JR, Motsinger-Reif AA, Brown CC. An adaptive permutation approach for genome-wide association study: evaluation and recommendations for use. BioData Min 2014; 7:9. [PMID: 24976866 PMCID: PMC4070098 DOI: 10.1186/1756-0381-7-9] [Citation(s) in RCA: 62] [Impact Index Per Article: 6.2] [Received: 09/24/2013] [Accepted: 06/02/2014] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Permutation testing is a robust and popular approach for significance testing in genomic research, which has the broad advantage of estimating significance non-parametrically, thereby safeguarding against inflated type I error rates. However, the computational efficiency remains a challenging issue that limits its wide application, particularly in genome-wide association studies (GWAS). Because of this, adaptive permutation strategies can be employed to make permutation approaches feasible. While these approaches have been used in practice, there is little research into the statistical properties of these approaches, and little guidance into the proper application of such a strategy for accurate p-value estimation at the GWAS level.
METHODS: In this work, we advocate an adaptive permutation procedure that is statistically valid as well as computationally feasible in GWAS. We perform extensive simulation experiments to evaluate the robustness of the approach to violations of modeling assumptions and compare the power of the adaptive approach versus standard approaches. We also evaluate the parameter choices in implementing the adaptive permutation approach to provide guidance on proper implementation in real studies. Additionally, we provide an example of the application of adaptive permutation testing on real data.
RESULTS: The results provide sufficient evidence that the adaptive test is robust to violations of modeling assumptions. In addition, even when modeling assumptions are correct, the power achieved by adaptive permutation is identical to the parametric approach over a range of significance thresholds and effect sizes under the alternative. A framework for proper implementation of the adaptive procedure is also generated.
CONCLUSIONS: While the adaptive permutation approach presented here is not novel, the current study provides evidence of the validity of the approach, and importantly provides guidance on the proper implementation of such a strategy. Additionally, tools are made available to aid investigators in implementing these approaches.
Affiliation(s)
- Ronglin Che
- Bioinformatics Research Center, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- John R Jack
- Bioinformatics Research Center, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Alison A Motsinger-Reif
- Bioinformatics Research Center, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Chad C Brown
- Bioinformatics Research Center, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
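Adaptive permutation, as described in the abstract of entry 6, spends few permutations on clearly non-significant tests and many on promising ones. A minimal sketch of one such rule follows, using a two-sample difference in means. The stopping rule (stop after `min_exceed` exceedances), the estimators, and every name here are illustrative assumptions, not the parameter choices evaluated in the paper.

```python
import random


def mean_diff(x, y):
    """Absolute difference in sample means, used as the test statistic."""
    return abs(sum(x) / len(x) - sum(y) / len(y))


def adaptive_permutation_pvalue(x, y, max_perms=100_000, min_exceed=10, rng=None):
    """Return (p-value estimate, permutations used) with early stopping."""
    rng = rng or random.Random()
    observed = mean_diff(x, y)
    pooled = list(x) + list(y)
    n_x = len(x)
    exceed = 0
    for b in range(1, max_perms + 1):
        rng.shuffle(pooled)  # random relabelling of the pooled sample
        if mean_diff(pooled[:n_x], pooled[n_x:]) >= observed:
            exceed += 1
            if exceed >= min_exceed:
                # Enough exceedances: the p-value is clearly not small, so stop.
                return exceed / b, b
    # Budget exhausted: add one to numerator and denominator to avoid p = 0.
    return (exceed + 1) / (max_perms + 1), max_perms


rng = random.Random(1)
# Similar groups: the loop should stop after a handful of permutations.
p_null, used_null = adaptive_permutation_pvalue(
    [1.0, 2.0, 3.0, 4.0, 5.0], [1.5, 2.5, 3.5, 4.5, 5.5], max_perms=1000, rng=rng
)
# Well-separated groups: the full budget is spent and the estimate is small.
p_alt, used_alt = adaptive_permutation_pvalue(
    list(range(8)), list(range(100, 108)), max_perms=200, rng=rng
)
print(p_null, used_null, p_alt, used_alt)
```

The smallest p-value this sketch can report is 1 / (max_perms + 1), so resolving genome-wide thresholds still requires a large budget for the few tests that reach it.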
7
Rapid and robust resampling-based multiple-testing correction with application in a genome-wide expression quantitative trait loci study. Genetics 2012; 190:1511-20. [PMID: 22298711 DOI: 10.1534/genetics.111.137737] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 11/18/2022]
Abstract
Genome-wide expression quantitative trait loci (eQTL) studies have emerged as a powerful tool to understand the genetic basis of gene expression and complex traits. In a typical eQTL study, the huge number of genetic markers and expression traits and their complicated correlations present a challenging multiple-testing correction problem. The resampling-based test using permutation or bootstrap procedures is a standard approach to address the multiple-testing problem in eQTL studies. A brute force application of the resampling-based test to large-scale eQTL data sets is often computationally infeasible. Several computationally efficient methods have been proposed to calculate approximate resampling-based P-values. However, these methods rely on certain assumptions about the correlation structure of the genetic markers, which may not be valid for certain studies. We propose a novel algorithm, rapid and exact multiple testing correction by resampling (REM), to address this challenge. REM calculates the exact resampling-based P-values in a computationally efficient manner. The computational advantage of REM lies in its strategy of pruning the search space by skipping genetic markers whose upper bounds on test statistics are small. REM does not rely on any assumption about the correlation structure of the genetic markers. It can be applied to a variety of resampling-based multiple-testing correction methods including permutation and bootstrap methods. We evaluate REM on three eQTL data sets (yeast, inbred mouse, and human rare variants) and show that it achieves accurate resampling-based P-value estimation with much less computational cost than existing methods. The software is available at http://csbio.unc.edu/eQTL.
8
Zhang Y, Liu JS. Fast and Accurate Approximation to Significance Tests in Genome-Wide Association Studies. J Am Stat Assoc 2011; 106:846-857. [PMID: 22140288 PMCID: PMC3226809 DOI: 10.1198/jasa.2011.ap10657] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 11/21/2022]
Abstract
Genome-wide association studies commonly involve simultaneous tests of millions of single nucleotide polymorphisms (SNP) for disease association. The SNPs in nearby genomic regions, however, are often highly correlated due to linkage disequilibrium (LD, a genetic term for correlation). Simple Bonferroni correction for multiple comparisons is therefore too conservative. Permutation tests, which are often employed in practice, are both computationally expensive for genome-wide studies and limited in their scopes. We present an accurate and computationally efficient method, based on Poisson de-clumping heuristics, for approximating genome-wide significance of SNP associations. Compared with permutation tests and other multiple comparison adjustment approaches, our method computes the most accurate and robust p-value adjustments for millions of correlated comparisons within seconds. We demonstrate analytically that the accuracy and the efficiency of our method are nearly independent of the sample size, the number of SNPs, and the scale of p-values to be adjusted. In addition, our method can be easily adopted to estimate false discovery rate. When applied to genome-wide SNP datasets, we observed highly variable p-value adjustment results evaluated from different genomic regions. The variation in adjustments along the genome, however, is well conserved between the European and the African populations. The p-value adjustments are significantly correlated with LD among SNPs, recombination rates, and SNP densities. Given the large variability of sequence features in the genome, we further discuss a novel approach of using SNP-specific (local) thresholds to detect genome-wide significant associations. This article has supplementary material online.
Affiliation(s)
- Yu Zhang
- Department of Statistics, The Pennsylvania State University, 422A Thomas Building, University Park, PA 16803
- Jun S. Liu
- Department of Statistics, Harvard University, 715 Science Center, 1 Oxford St., Cambridge, MA 02138
9
Gao X. Multiple testing corrections for imputed SNPs. Genet Epidemiol 2011; 35:154-8. [PMID: 21254223 DOI: 10.1002/gepi.20563] [Citation(s) in RCA: 89] [Impact Index Per Article: 6.8] [Received: 06/14/2010] [Revised: 10/14/2010] [Accepted: 12/20/2010] [Indexed: 01/13/2023]
Abstract
Multiple testing corrections are an active research topic in genetic association studies, especially for genome-wide association studies (GWAS), where tests of association with traits are conducted at millions of imputed SNPs with estimated allelic dosages now. Failure to address multiple comparisons appropriately can introduce excess false-positive results and make subsequent studies following up those results inefficient. Permutation tests are considered the gold standard in multiple testing adjustment; however, this procedure is computationally demanding, especially for GWAS. Notably, the permutation thresholds for the huge number of estimated allelic dosages in real data sets have not been reported. Although many researchers have recently developed algorithms to rapidly approximate the permutation thresholds with accuracy similar to the permutation test, these methods have not been verified with estimated allelic dosages. In this study, we compare recently published multiple testing correction methods using 2.5M estimated allelic dosages. We also derive permutation significance levels based on 10,000 GWAS results under the null hypothesis of no association. Our results show that the simpleM method works well with estimated allelic dosages and gives the closest approximation to the permutation threshold while requiring the least computation time.
Affiliation(s)
- Xiaoyi Gao
- Division of Statistical Genomics, Washington University School of Medicine, St. Louis, Missouri 63108, USA.
10
Yu K, Liang F, Ciampa J, Chatterjee N. Efficient p-value evaluation for resampling-based tests. Biostatistics 2011; 12:582-93. [PMID: 21209154 DOI: 10.1093/biostatistics/kxq078] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
The resampling-based test, which often relies on permutation or bootstrap procedures, has been widely used for statistical hypothesis testing when the asymptotic distribution of the test statistic is unavailable or unreliable. It requires repeated calculations of the test statistic on a large number of simulated data sets for its significance level assessment, and thus it could become very computationally intensive. Here, we propose an efficient p-value evaluation procedure by adapting the stochastic approximation Markov chain Monte Carlo algorithm. The new procedure can be used easily for estimating the p-value for any resampling-based test. We show through numeric simulations that the proposed procedure can be 100 to 500,000 times as efficient (in terms of computing time) as the standard resampling-based procedure when evaluating a test statistic with a small p-value (e.g. less than 10^-6). With its computational burden reduced by this proposed procedure, the versatile resampling-based test would become computationally feasible for a much wider range of applications. We demonstrate the application of the new method by applying it to a large-scale genetic association study of prostate cancer.
Affiliation(s)
- Kai Yu
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20892, USA.
11
Chadeau-Hyam M, Ebbels TMD, Brown IJ, Chan Q, Stamler J, Huang CC, Daviglus ML, Ueshima H, Zhao L, Holmes E, Nicholson JK, Elliott P, De Iorio M. Metabolic profiling and the metabolome-wide association study: significance level for biomarker identification. J Proteome Res 2011; 9:4620-7. [PMID: 20701291 DOI: 10.1021/pr1003449] [Citation(s) in RCA: 87] [Impact Index Per Article: 6.7] [Indexed: 01/16/2023]
Abstract
High throughput metabolic profiling via the metabolome-wide association study (MWAS) is a powerful new approach to identify biomarkers of disease risk, but there are methodological challenges: high dimensionality, high level of collinearity, the existence of peak overlap within metabolic spectral data, multiple testing, and selection of a suitable significance threshold. We define the metabolome-wide significance level (MWSL) as the threshold required to control the family wise error rate through a permutation approach. We used 1H NMR spectroscopic profiles of 24 h urinary collections from the INTERMAP study. Our results show that the MWSL primarily depends on sample size and spectral resolution. The MWSL estimates can be used to guide selection of discriminatory biomarkers in MWA studies. In a simulation study, we compare statistical performance of the MWSL approach to two variants of orthogonal partial least-squares (OPLS) method with respect to statistical power, false positive rate and correspondence of ranking of the most significant spectral variables. Our results show that the MWSL approach as estimated by the univariate t test is not outperformed by OPLS and offers a fast and simple method to detect disease-related discriminatory features in human NMR urinary metabolic profiles.
Affiliation(s)
- Marc Chadeau-Hyam
- Department of Epidemiology and Biostatistics, School of Public Health, Imperial College, London W2 1PG, United Kingdom
12
Pahl R, Schäfer H. PERMORY: an LD-exploiting permutation test algorithm for powerful genome-wide association testing. Bioinformatics 2010; 26:2093-100. [PMID: 20605926 DOI: 10.1093/bioinformatics/btq399] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Indexed: 11/14/2022]
Abstract
MOTIVATION: In genome-wide association studies (GWAS) examining hundreds of thousands of genetic markers, the potentially high number of false positive findings requires statistical correction for multiple testing. Permutation tests are considered the gold standard for multiple testing correction in GWAS, because they simultaneously provide unbiased type I error control and high power. At the same time, they demand heavy computational effort, especially with large-scale datasets of modern GWAS. In recent years, the computational problem has been circumvented by using approximations to permutation tests, which, however, may be biased.
RESULTS: We have tackled the original computational problem of permutation testing in GWAS and herein present a permutation test algorithm one or more orders of magnitude faster than existing implementations, which enables efficient permutation testing on a genome-wide scale. Our algorithm does not rely on any kind of approximation and hence produces unbiased results identical to a standard permutation test. A noteworthy feature of our algorithm is a particularly effective performance when analyzing high-density marker sets.
AVAILABILITY: Freely available on the web at http://www.permory.org.
Affiliation(s)
- Roman Pahl
- Institut für Medizinische Biometrie und Epidemiologie, Philipps-Universität Marburg, Germany.
13
ParaHaplo 2.0: a program package for haplotype-estimation and haplotype-based whole-genome association study using parallel computing. Source Code for Biology and Medicine 2010; 5:5. [PMID: 20525312 PMCID: PMC2892495 DOI: 10.1186/1751-0473-5-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/16/2010] [Accepted: 06/04/2010] [Indexed: 12/30/2022]
Abstract
Background: The use of haplotype-based association tests can improve the power of genome-wide association studies. Since the observed genotypes are unordered pairs of alleles, haplotype phase must be inferred. However, estimating haplotype phase is time consuming. When millions of single-nucleotide polymorphisms (SNPs) are analyzed in a genome-wide association study, faster methods for haplotype estimation are required.
Methods: We developed a program package for parallel computation of haplotype estimation. Our program package, ParaHaplo 2.0, is intended for use in workstation clusters using the Intel Message Passing Interface (MPI). We compared the performance of our algorithm to that of the regular permutation test on both Japanese in Tokyo, Japan and Han Chinese in Beijing, China of the HapMap dataset.
Results: The parallel version of ParaHaplo 2.0 can estimate haplotypes 100 times faster than a non-parallel version of ParaHaplo.
Conclusion: ParaHaplo 2.0 is an invaluable tool for conducting haplotype-based genome-wide association studies (GWAS). The need for fast haplotype estimation using parallel computing will become increasingly important as the data sizes of such projects continue to increase. The executable binaries and program sources of ParaHaplo are available at the following address: http://en.sourceforge.jp/projects/parallelgwas/releases/
14
Qin H, Feng T, Zhang S, Sha Q. A data-driven weighting scheme for family-based genome-wide association studies. Eur J Hum Genet 2009; 18:596-603. [PMID: 19935828 DOI: 10.1038/ejhg.2009.201] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/09/2022]
Abstract
Recently, Steen et al proposed a novel two-stage approach for family-based genome-wide association studies. In the first stage, a test based on between-family information is used to rank SNPs according to their P-values or conditional power of the test. In the second stage, the R most promising SNPs are tested using a family-based association test. We call this two-stage approach top R method. Ionita-Laza et al proposed an exponential weighting method within a two-stage framework. In the second stage of this approach, instead of testing top R SNPs, it tests all SNPs and weights the P-values of association test according to the information of the first stage. However, both of the top R and exponential weighting methods only use the information from the first stage to rank SNPs. It seems that the two methods do not use information from the first stage efficiently. Furthermore, it may be unreasonable for the exponential weighting method to use the same weight for all SNPs within a group when only one or a few SNPs are related with a disease. In this article, we propose a data-driven weighting scheme within a two-stage framework. In this method, we use the information from the first stage to determine a SNP-specific weight for each SNP. We use simulation studies to evaluate the performance of our method. The simulation results showed that our proposed method is consistently more powerful than the top R method and the exponential weighting method, regardless of the LD structure, population structure, and family structure.
Affiliation(s)
- Huaizhen Qin
- Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, USA
|
15
|
Misawa K, Kamatani N. ParaHaplo: A program package for haplotype-based whole-genome association study using parallel computing. Source Code for Biology and Medicine 2009; 4:7. [PMID: 19845960 PMCID: PMC2774321 DOI: 10.1186/1751-0473-4-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 08/26/2009] [Accepted: 10/21/2009] [Indexed: 01/28/2023]
Abstract
Background Since more than a million single-nucleotide polymorphisms (SNPs) are analyzed in any given genome-wide association study (GWAS), performing multiple comparisons can be problematic. To cope with multiple-comparison problems in GWAS, haplotype-based algorithms were developed to correct for multiple comparisons at multiple SNP loci in linkage disequilibrium. A permutation test can also control problems inherent in multiple testing; however, both the calculation of exact probability and the execution of permutation tests are time-consuming. Faster methods for calculating exact probabilities and executing permutation tests are required. Methods We developed a set of computer programs for the parallel computation of accurate P-values in haplotype-based GWAS. Our program, ParaHaplo, is intended for workstation clusters using the Intel Message Passing Interface (MPI). We compared the performance of our algorithm to that of the regular permutation test on JPT and CHB of HapMap. Results ParaHaplo can detect smaller differences between 2 populations than SNP-based GWAS. We also found that parallel-computing techniques made ParaHaplo 100-fold faster than a non-parallel version of the program. Conclusion ParaHaplo is a useful tool in conducting haplotype-based GWAS. Since the data sizes of such projects continue to increase, the use of fast computations with parallel computing--such as that used in ParaHaplo--will become increasingly important. The executable binaries and program sources of ParaHaplo are available at the following address:
Affiliation(s)
- Kazuharu Misawa
- Research Program for Computational Science, Research and Development Group for Next-Generation Integrated Living Matter Simulation, Fusion of Data and Analysis Research and Development Team, RIKEN, 4-6-1 Shirokane-dai, Minato-ku, Tokyo 108-8639, Japan.
|
16
|
Tang R, Feng T, Sha Q, Zhang S. A variable-sized sliding-window approach for genetic association studies via principal component analysis. Ann Hum Genet 2009; 73:631-7. [PMID: 19735491 DOI: 10.1111/j.1469-1809.2009.00543.x] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Indexed: 11/30/2022]
Abstract
Recently, with the rapid improvements in high-throughput genotyping techniques, researchers are facing the very challenging task of analysing large-scale genetic associations, especially at the whole-genome level, without an optimal solution. In this study, we propose a new approach for genetic association analysis that is based on a variable-sized sliding-window framework and employs principal component analysis to find the optimum window size. With the help of the bisection algorithm in window-size searching, our method is more computationally efficient than available approaches. We evaluate the performance of the proposed method by comparing it with two other methods: a single-marker method and a variable-length Markov chain method. We demonstrate that, in most cases, the proposed method outperforms the other two methods. Furthermore, since the proposed method is based on genotype data, it does not require any computationally intensive phasing program to account for uncertain haplotype phase.
Affiliation(s)
- Rui Tang
- Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, USA
|
17
|
Lyons EJ, Amos W, Berkley JA, Mwangi I, Shafi M, Williams TN, Newton CR, Peshu N, Marsh K, Scott JAG, Hill AVS. Homozygosity and risk of childhood death due to invasive bacterial disease. BMC Medical Genetics 2009; 10:55. [PMID: 19523202 PMCID: PMC2714084 DOI: 10.1186/1471-2350-10-55] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Received: 02/27/2009] [Accepted: 06/12/2009] [Indexed: 11/21/2022]
Abstract
Background Genetic heterozygosity is increasingly being shown to be a key predictor of fitness in natural populations, both through inbreeding depression, inbred individuals having low heterozygosity, and also through chance linkage between a marker and a gene under balancing selection. One important component of fitness that is often highlighted is resistance to parasites and other pathogens. However, the significance of equivalent loci in human populations remains unclear. Consequently, we performed a case-control study of fatal invasive bacterial disease in Kenyan children using a genome-wide screen with microsatellite markers. Methods 148 cases, comprising children aged <13 years who died of invasive bacterial disease (variously, bacteraemia, bacterial meningitis or neonatal sepsis), and 137 age-matched, healthy children were sampled in a prospective study conducted at Kilifi District Hospital, Kenya. Samples were genotyped for 134 microsatellite markers using the ABI LD20 marker set and analysed for an association between homozygosity and mortality. Results At five markers homozygosity was strongly associated with mortality (odds ratio range 4.7–12.2) with evidence of interactions between some markers. Mortality was associated with different non-overlapping marker groups in Gram positive and Gram negative bacterial disease. Homozygosity at susceptibility markers was common (prevalence 19–49%) and, with the large effect sizes, this suggests that bacterial disease mortality may be strongly genetically determined. Conclusion Balanced polymorphisms appear to be more widespread in humans than previously appreciated and play a critical role in modulating susceptibility to infectious disease. The effect sizes we report, coupled with the stochasticity of exposure to pathogens, suggest that infection and mortality are far from random due to a strong genetic basis.
Affiliation(s)
- Emily J Lyons
- The Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK.
|
18
|
Han B, Kang HM, Eskin E. Rapid and accurate multiple testing correction and power estimation for millions of correlated markers. PLoS Genet 2009; 5:e1000456. [PMID: 19381255 PMCID: PMC2663787 DOI: 10.1371/journal.pgen.1000456] [Citation(s) in RCA: 116] [Impact Index Per Article: 7.7] [Received: 11/24/2008] [Accepted: 03/17/2009] [Indexed: 11/18/2022] Open
Abstract
With the development of high-throughput sequencing and genotyping technologies, the number of markers collected in genetic association studies is growing rapidly, increasing the importance of methods for correcting for multiple hypothesis testing. The permutation test is widely considered the gold standard for accurate multiple testing correction, but it is often computationally impractical for these large datasets. Recently, several studies proposed efficient alternative approaches to the permutation test based on the multivariate normal distribution (MVN). However, they cannot accurately correct for multiple testing in genome-wide association studies for two reasons. First, these methods require partitioning of the genome into many disjoint blocks and ignore all correlations between markers from different blocks. Second, the true null distribution of the test statistic often fails to follow the asymptotic distribution at the tails of the distribution. We propose an accurate and efficient method for multiple testing correction in genome-wide association studies—SLIDE. Our method accounts for all correlation within a sliding window and corrects for the departure of the true null distribution of the statistic from the asymptotic distribution. In simulations using the Wellcome Trust Case Control Consortium data, the error rate of SLIDE's corrected p-values is more than 20 times smaller than the error rate of the previous MVN-based methods' corrected p-values, while SLIDE is orders of magnitude faster than the permutation test and other competing methods. We also extend the MVN framework to the problem of estimating the statistical power of an association study with correlated markers and propose an efficient and accurate power estimation method SLIP. SLIP and SLIDE are available at http://slide.cs.ucla.edu. 
In genome-wide association studies, it is important to account for the fact that a large number of genetic variants are tested in order to adequately control for false positives. The simplest way to correct for multiple hypothesis testing is the Bonferroni correction, which multiplies the p-values by the number of markers assuming the markers are independent. Since the markers are correlated due to linkage disequilibrium, this approach leads to a conservative estimate of false positives, thus adversely affecting statistical power. The permutation test is considered the gold standard for accurate multiple testing correction, but is often computationally impractical for large association studies. We propose a method that efficiently and accurately corrects for multiple hypotheses in genome-wide association studies by fully accounting for the local correlation structure between markers. Our method also corrects for the departure of the true distribution of test statistics from the asymptotic distribution, which dramatically improves the accuracy, particularly when many rare variants are included in the tests. Our method shows a near identical accuracy to permutation and shows greater computational efficiency than previously suggested methods. We also provide a method to accurately and efficiently estimate the statistical power of genome-wide association studies.
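The Bonferroni correction described in this summary (multiply each p-value by the number of markers, capped at 1) can be sketched as follows. This is a minimal illustration, not part of SLIDE or SLIP; the function name `bonferroni_adjust` and the input p-values are assumptions made for the example.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p-values: min(1, p * m) for m tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical per-marker p-values, for illustration only.
raw = [1e-8, 0.002, 0.04, 0.5]
adjusted = bonferroni_adjust(raw)
```

Because the adjustment treats all m tests as independent, it tends to be conservative when markers are correlated through linkage disequilibrium, which is the gap that permutation-based and MVN-based corrections aim to close.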
Affiliation(s)
- Buhm Han
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, United States of America
- Hyun Min Kang
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, United States of America
- Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
|
19
|
Wu TT, Chen YF, Hastie T, Sobel E, Lange K. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 2009; 25:714-21. [PMID: 19176549 PMCID: PMC2732298 DOI: 10.1093/bioinformatics/btp041] [Citation(s) in RCA: 456] [Impact Index Per Article: 30.4] [Received: 09/29/2008] [Revised: 12/11/2008] [Accepted: 01/18/2009] [Indexed: 01/13/2023] Open
Abstract
MOTIVATION In ordinary regression, imposition of a lasso penalty makes continuous model selection straightforward. Lasso penalized regression is particularly advantageous when the number of predictors far exceeds the number of observations. METHOD The present article evaluates the performance of lasso penalized logistic regression in case-control disease gene mapping with a large number of SNP (single nucleotide polymorphism) predictors. The strength of the lasso penalty can be tuned to select a predetermined number of the most relevant SNPs and other predictors. For a given value of the tuning constant, the penalized likelihood is quickly maximized by cyclic coordinate ascent. Once the most potent marginal predictors are identified, their two-way and higher order interactions can also be examined by lasso penalized logistic regression. RESULTS This strategy is tested on both simulated and real data. Our findings on coeliac disease replicate the previous SNP results and shed light on possible interactions among the SNPs. AVAILABILITY The software discussed is available in Mendel 9.0 at the UCLA Human Genetics web site. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
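To illustrate the idea of cyclic coordinate updates under a lasso penalty, here is a simplified sketch. It deliberately swaps in squared-error loss (ordinary lasso regression, minimized by coordinate descent) rather than the logistic likelihood the abstract describes, because the one-variable update then has a closed form. It is not the Mendel implementation; all names and the toy data are assumptions, and every predictor column is assumed to be non-zero.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within +/-penalty become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=100):
    """Minimize (1/2n)||y - Xb||^2 + penalty * ||b||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of predictor j with the residual that excludes its own fit.
            rho = 0.0
            for i in range(n):
                fitted_others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fitted_others)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, penalty) / scale
    return beta


# Hypothetical toy data: two orthogonal predictors, only the first is relevant.
X = [[1, 0], [0, 1], [-1, 0], [0, -1]]
y = [2.0, 0.1, -2.0, -0.1]
beta = lasso_coordinate_descent(X, y, penalty=0.1)  # approximately [1.8, 0.0]
```

The weak second predictor is set exactly to zero, which is the variable-selection behaviour that makes the lasso attractive when predictors far outnumber observations; raising `penalty` would retain fewer predictors.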
Affiliation(s)
- Tong Tong Wu
- Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD 20742, USA
|
20
|
Association mapping and significance estimation via the coalescent. Am J Hum Genet 2008; 83:675-83. [PMID: 19026399 DOI: 10.1016/j.ajhg.2008.10.017] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 04/21/2008] [Revised: 08/31/2008] [Accepted: 10/21/2008] [Indexed: 11/22/2022] Open
Abstract
The central questions asked in whole-genome association studies are how to locate associated regions in the genome and how to estimate the significance of these findings. Researchers usually do this by testing each SNP separately for association and then applying a suitable correction for multiple-hypothesis testing. However, SNPs are correlated by the unobserved genealogy of the population, and a more powerful statistical methodology would attempt to take this genealogy into account. Leveraging the genealogy in association studies is challenging, however, because the inference of the genealogy from the genotypes is a computationally intensive task, in particular when recombination is modeled, as in ancestral recombination graphs. Furthermore, if large numbers of genealogies are imputed from the genotypes, the power of the study might decrease if these imputed genealogies create an additional multiple-hypothesis testing burden. Indeed, we show in this paper that several existing methods that aim to address this problem suffer either from low power or from a very high false-positive rate; their performance is generally not better than the standard approach of separate testing of SNPs. We suggest a new genealogy-based approach, CAMP (coalescent-based association mapping), that takes into account the trade-off between the complexity of the genealogy and the power lost due to the additional multiple hypotheses. Our experiments show that CAMP yields a significant increase in power relative to that of previous methods and that it can more accurately locate the associated region.
|
21
|
Bacanu SA, Nelson MR, Ehm MG. Comparison of association methods for dense marker data. Genet Epidemiol 2008; 32:791-9. [DOI: 10.1002/gepi.20347] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/06/2022]
|
22
|
Misawa K, Fujii S, Yamazaki T, Takahashi A, Takasaki J, Yanagisawa M, Ohnishi Y, Nakamura Y, Kamatani N. New correction algorithms for multiple comparisons in case-control multilocus association studies based on haplotypes and diplotype configurations. J Hum Genet 2008; 53:789-801. [PMID: 18651098 DOI: 10.1007/s10038-008-0312-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 03/21/2008] [Accepted: 06/03/2008] [Indexed: 01/19/2023]
Abstract
The multiple comparison problem arises in population-based studies when the association between phenotypes and multilocus genotypes is examined. Although Bonferroni's correction is often used to cope with such a problem, it may yield overly conservative conclusions because all of the tests are assumed to be independent. We have developed new correction algorithms for the test of independence between phenotypes and multilocus genotypes at loci in linkage disequilibrium. In one of the algorithms, the exact type I error rate is calculated for the independence test. We found that such exact probabilities can be calculated using a 128-CPU PC cluster if the numbers of cases and controls are not more than 50. As an alternative method, we developed algorithms to calculate asymptotically the type I error rates using a Markov-chain Monte Carlo sampler that provided a good approximation to values calculated by the exact method. When the new algorithms were applied to both simulation and real data, the real overall type I error rates for the loci in linkage disequilibrium were from one-third to half as high as those obtained by Bonferroni's correction. These algorithms are likely to be useful for multilocus association studies for data obtained by case-control and cohort studies.
Affiliation(s)
- Kazuharu Misawa
- Research Program for Computational Science, Research and Development Group for Next-Generation Integrated Living Matter Simulation, Fusion of Data and Analysis Research and Development Team, RIKEN, 4-6-1 Shirokane-dai, Minato-ku, Tokyo, 108-8639, Japan.
- Shoogo Fujii
- Laboratory for Statistical Analysis, RIKEN Center for Genomic Medicine, Tokyo, Japan; Department of Computer Science, Waseda University, Tokyo, Japan
- Toshimasa Yamazaki
- Laboratory for Statistical Analysis, RIKEN Center for Genomic Medicine, Tokyo, Japan
- Atsushi Takahashi
- Laboratory for Statistical Analysis, RIKEN Center for Genomic Medicine, Tokyo, Japan
- Junichi Takasaki
- Laboratory for Statistical Analysis, RIKEN Center for Genomic Medicine, Tokyo, Japan
- Yozo Ohnishi
- Laboratory for SNP Analysis, RIKEN Center for Genomic Medicine, Tokyo, Japan
- Yusuke Nakamura
- Laboratory for Pharmacogenetics, RIKEN Center for Genomic Medicine, Tokyo, Japan
- Naoyuki Kamatani
- Laboratory for Statistical Analysis, RIKEN Center for Genomic Medicine, Tokyo, Japan; Division of Genomic Medicine, Department of Advanced Biomedical Engineering and Science, and Institute of Rheumatology, Tokyo Women's Medical University, Tokyo, Japan
|
23
|
Browning BL. PRESTO: rapid calculation of order statistic distributions and multiple-testing adjusted P-values via permutation for one and two-stage genetic association studies. BMC Bioinformatics 2008; 9:309. [PMID: 18620604 PMCID: PMC2483288 DOI: 10.1186/1471-2105-9-309] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.4] [Received: 02/06/2008] [Accepted: 07/13/2008] [Indexed: 11/29/2022] Open
Abstract
Background Large-scale genetic association studies can test hundreds of thousands of genetic markers for association with a trait. Since the genetic markers may be correlated, a Bonferroni correction is typically too stringent a correction for multiple testing. Permutation testing is a standard statistical technique for determining statistical significance when performing multiple correlated tests for genetic association. However, permutation testing for large-scale genetic association studies is computationally demanding and calls for optimized algorithms and software. PRESTO is a new software package for genetic association studies that performs fast computation of multiple-testing adjusted P-values via permutation of the trait. Results PRESTO is an order of magnitude faster than other existing permutation testing software, and can analyze a large genome-wide association study (500 K markers, 5 K individuals, 1 K permutations) in approximately one hour of computing time. PRESTO has several unique features that are useful in a wide range of studies: it reports empirical null distributions for the top-ranked statistics (i.e. order statistics), it performs user-specified combinations of allelic and genotypic tests, it performs stratified analysis when sampled individuals are from multiple populations and each individual's population of origin is specified, and it determines significance levels for one and two-stage genotyping designs. PRESTO is designed for case-control studies, but can also be applied to trio data (parents and affected offspring) if transmitted parental alleles are coded as case alleles and untransmitted parental alleles are coded as control alleles. Conclusion PRESTO is a platform-independent software package that performs fast and flexible permutation testing for genetic association studies. The PRESTO executable file, Java source code, example data, and documentation are freely available at .
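The general idea of multiple-testing adjusted P-values via permutation of the trait can be sketched as below. This is a generic max-statistic permutation scheme written for illustration, not PRESTO's optimized algorithm; the test statistic (absolute case-control difference in mean allele count), the function names, and the toy genotypes are all assumptions.

```python
import random


def mean_diff_stats(genotypes, trait):
    """Absolute case-control difference in mean allele count, per marker."""
    cases = [i for i, t in enumerate(trait) if t == 1]
    controls = [i for i, t in enumerate(trait) if t == 0]
    stats = []
    for marker in genotypes:  # one list of allele counts per marker
        case_mean = sum(marker[i] for i in cases) / len(cases)
        control_mean = sum(marker[i] for i in controls) / len(controls)
        stats.append(abs(case_mean - control_mean))
    return stats


def permutation_adjusted_p(genotypes, trait, n_perm=1000, seed=0):
    """Max-statistic permutation p-values, adjusted across all markers."""
    rng = random.Random(seed)
    observed = mean_diff_stats(genotypes, trait)
    exceed = [0] * len(observed)
    shuffled = list(trait)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the trait, keep genotypes fixed
        max_stat = max(mean_diff_stats(genotypes, shuffled))
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return [(1 + e) / (1 + n_perm) for e in exceed]


# Hypothetical toy data: 3 markers, 4 cases followed by 4 controls.
trait = [1, 1, 1, 1, 0, 0, 0, 0]
genotypes = [
    [2, 2, 2, 2, 0, 0, 0, 0],  # strongly associated marker
    [1, 0, 1, 0, 1, 0, 1, 0],  # no case-control difference
    [1, 1, 2, 1, 1, 0, 1, 0],  # moderate difference
]
adjusted_p = permutation_adjusted_p(genotypes, trait)
```

Permuting the trait while leaving the genotype matrix intact preserves the correlation between markers, which is why this kind of adjustment is less stringent than Bonferroni when markers are correlated. The cost grows with markers, individuals, and permutations, which is the motivation for optimized software.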
Affiliation(s)
- Brian L Browning
- Department of Statistics, The University of Auckland, Auckland, New Zealand.
|
24
|
Hoggart CJ, Clark TG, De Iorio M, Whittaker JC, Balding DJ. Genome-wide significance for dense SNP and resequencing data. Genet Epidemiol 2008; 32:179-85. [PMID: 18200594 DOI: 10.1002/gepi.20292] [Citation(s) in RCA: 143] [Impact Index Per Article: 8.9] [Indexed: 11/11/2022]
Abstract
The problem of multiple testing is an important aspect of genome-wide association studies, and will become more important as marker densities increase. The problem has been tackled with permutation and false discovery rate procedures and with Bayes factors, but each approach faces difficulties that we briefly review. In the current context of multiple studies on different genotyping platforms, we argue for the use of truly genome-wide significance thresholds, based on all polymorphisms whether or not typed in the study. We approximate genome-wide significance thresholds in contemporary West African, East Asian and European populations by simulating sequence data, based on all polymorphisms as well as for a range of single nucleotide polymorphism (SNP) selection criteria. Overall we find that significance thresholds vary by a factor of >20 over the SNP selection criteria and statistical tests that we consider and can be highly dependent on sample size. We compare our results for sequence data to those derived by the HapMap Consortium and find notable differences which may be due to the small sample sizes used in the HapMap estimate.
Affiliation(s)
- Clive J Hoggart
- Department of Epidemiology and Public Health, Imperial College London, Norfolk Place, London, UK.
|
25
|
Ziegler A, König IR, Thompson JR. Biostatistical Aspects of Genome-Wide Association Studies. Biom J 2008; 50:8-28. [DOI: 10.1002/bimj.200710398] [Citation(s) in RCA: 113] [Impact Index Per Article: 7.1] [Indexed: 01/17/2023]
|
26
|
Conneely KN, Boehnke M. So many correlated tests, so little time! Rapid adjustment of P values for multiple correlated tests. Am J Hum Genet 2007; 81:1158-68. [PMID: 17966093 DOI: 10.1086/522036] [Citation(s) in RCA: 327] [Impact Index Per Article: 19.2] [Received: 05/30/2007] [Accepted: 08/01/2007] [Indexed: 11/03/2022] Open
Abstract
Contemporary genetic association studies may test hundreds of thousands of genetic variants for association, often with multiple binary and continuous traits or under more than one model of inheritance. Many of these association tests may be correlated with one another because of linkage disequilibrium between nearby markers and correlation between traits and models. Permutation tests and simulation-based methods are often employed to adjust groups of correlated tests for multiple testing, since conventional methods such as Bonferroni correction are overly conservative when tests are correlated. We present here a method of computing P values adjusted for correlated tests (P_ACT) that attains the accuracy of permutation or simulation-based tests in much less computation time, and we show that our method applies to many common association tests that are based on multiple traits, markers, and genetic models. Simulation demonstrates that P_ACT attains the power of permutation testing and provides a valid adjustment for hundreds of correlated association tests. In data analyzed as part of the Finland-United States Investigation of NIDDM Genetics (FUSION) study, we observe a near one-to-one relationship (r² > 0.999) between P_ACT and the corresponding permutation-based P values, achieving the same precision as permutation testing but thousands of times faster.
Affiliation(s)
- Karen N Conneely
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, USA.
|
27
|
Kimmel G, Jordan MI, Halperin E, Shamir R, Karp RM. A randomization test for controlling population stratification in whole-genome association studies. Am J Hum Genet 2007; 81:895-905. [PMID: 17924333 PMCID: PMC2265648 DOI: 10.1086/521372] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.3] [Received: 05/02/2007] [Accepted: 07/04/2007] [Indexed: 11/03/2022] Open
Abstract
Population stratification can be a serious obstacle in the analysis of genomewide association studies. We propose a method for evaluating the significance of association scores in whole-genome cohorts with stratification. Our approach is a randomization test akin to a standard permutation test. It conditions on the genotype matrix and thus takes into account not only the population structure but also the complex linkage disequilibrium structure of the genome. As we show in simulation experiments, our method achieves higher power and significantly better control over false-positive rates than do existing methods. In addition, it can be easily applied to whole-genome association studies.
Affiliation(s)
- Gad Kimmel
- Computer Science Division, University of California Berkeley, Berkeley, CA 94720, USA.
|
28
|
Zhang Z, Zhang S, Sha Q. A multi-marker test based on family data in genome-wide association study. BMC Genet 2007; 8:65. [PMID: 17894890 PMCID: PMC2121104 DOI: 10.1186/1471-2156-8-65] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 06/24/2007] [Accepted: 09/25/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Complex diseases are believed to result from many genes and environmental factors. Hence, multi-marker methods that can use the information of markers from different genes are appropriate for mapping complex disease genes. Several multi-marker methods have already been proposed for case-control studies. In this article, we propose a multi-marker test called the Multi-marker Pedigree Disequilibrium Test (MPDT) to analyze family data from genome-wide association studies. If the parental phenotypes are available, we also propose a two-stage test in which a genomic screening test is used to select SNPs, and then the MPDT is used to test the association of the selected SNPs. RESULTS We use simulation studies to evaluate the performance of the MPDT and the two-stage approach. The results show that the MPDT consistently outperforms the single marker transmission/disequilibrium test (TDT). Which of the two-stage and one-stage approaches is more powerful depends on the prevalence: when the prevalence is no less than 10%, the two-stage approach may be more powerful than the one-stage approach; otherwise, the one-stage approach is more powerful. CONCLUSION The proposed MPDT is more powerful than the single marker TDT. When the parental phenotypes are available and the prevalence is no less than 10%, the proposed two-stage approach is more powerful than the one-stage approach.
Affiliation(s)
- Zhaogong Zhang
- Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, US
- School of Computer Science and Technology, Heilongjiang University, Harbin, 150080, China
- Shuanglin Zhang
- Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, US
- Department of Mathematics, Heilongjiang University, Harbin, 150080, China
- Qiuying Sha
- Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, US
|
29
|
Knowles JW, Assimes TL, Li J, Quertermous T, Cooke JP. Genetic susceptibility to peripheral arterial disease: a dark corner in vascular biology. Arterioscler Thromb Vasc Biol 2007; 27:2068-78. [PMID: 17656669 PMCID: PMC4321902 DOI: 10.1161/01.atv.0000282199.66398.8c] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.4] [Indexed: 11/16/2022]
Abstract
Peripheral arterial disease (PAD) is characterized by reduced blood flow to the limbs, usually as a consequence of atherosclerosis, and affects approximately 12 million Americans. It is a common cause of cardiovascular morbidity and an independent predictor of cardiovascular mortality. Similar to other atherosclerotic diseases, such as coronary artery disease, PAD is the result of the complex interplay between injurious environmental stimuli and genetic predisposing factors of the host. Genetic susceptibility to PAD is likely conferred by sequence variants in multiple genes, each with modest effects. Although many of these variants probably alter susceptibility both to PAD and to coronary artery disease, it is likely that there exists a set of variants that specifically alter susceptibility to PAD. Despite the prevalence of PAD and its high societal burden, relatively little is known about such genetic variants. This review summarizes our limited present knowledge and gives an overview of recent, more powerful approaches to elucidating the genetic basis of PAD. We discuss the advantages and limitations of genetic studies and highlight the need for collaborative networks of PAD investigators for shedding light on this dark corner of vascular biology.
Affiliation(s)
- Joshua W Knowles
- Falk Cardiovascular Research Building, Division of Cardiovascular Medicine, Stanford University School of Medicine, Stanford, CA, 94305-5406, USA.
|
30
|
Feng T, Zhang S, Sha Q. Two-stage association tests for genome-wide association studies based on family data with arbitrary family structure. Eur J Hum Genet 2007; 15:1169-75. [PMID: 17653107 DOI: 10.1038/sj.ejhg.5201902] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Indexed: 01/06/2023] Open
Abstract
Recently, Steen et al proposed a two-stage approach for genome-wide family-based association studies. In the first stage, a screening test is used to select markers, and in the second stage, a family-based association test is performed on a much smaller set of the selected markers. The two-stage method can be much more powerful than the traditional family-based association tests. In this article, we extend the approach so that it can incorporate parental information and can be applied to an arbitrary pedigree structure. We use simulation studies to evaluate the type I error rates and the power of the proposed methods. Our results show that the two-stage approach that incorporates founders' phenotypes has the correct type I error rates, and is much more powerful than the two-stage approach that uses children's phenotypes only. Also, by carefully choosing the number of markers retained in the first stage, the power of a two-stage approach can be much greater than that of the corresponding one-stage approach.
Affiliation(s)
- Tao Feng
- Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, USA
|
31
|
Davidovich O, Kimmel G, Shamir R. GEVALT: an integrated software tool for genotype analysis. BMC Bioinformatics 2007; 8:36. [PMID: 17270038 PMCID: PMC1797190 DOI: 10.1186/1471-2105-8-36] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.1] [Received: 11/03/2006] [Accepted: 02/01/2007] [Indexed: 11/20/2022]
Abstract
Background: Genotype information generated by individual and international efforts carries the promise of revolutionizing disease studies and the association of phenotypes with alleles and haplotypes. Given the enormous amounts of public genotype data, tools for analyzing, interpreting and visualizing these data sets are of critical importance to researchers. In past work we have developed algorithms for genotype phasing and tag SNP selection, which were shown to be quick and accurate. Both algorithms were available until now only as batch executables. Results: Here we present GEVALT (GEnotype Visualization and ALgorithmic Tool), a software package designed to simplify and expedite the process of genotype analysis by providing a common interface to several tasks relating to such analysis. GEVALT combines the strong visual abilities of Haploview with our quick and powerful algorithms for genotype phasing (GERBIL), tag SNP selection (STAMPA) and permutation testing for evaluating significance of association. All of the above are provided in a visually appealing and interactive interface. Conclusion: GEVALT is an integrated viewer that uses state-of-the-art phasing and tag SNP selection algorithms. By streamlining the application of GERBIL and STAMPA together with strong visualization for assessment of the results, GEVALT makes the algorithms accessible to the broad community of researchers in genetics.
Affiliation(s)
- Ofir Davidovich
- School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
- Gad Kimmel
- School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
- Ron Shamir
- School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
|