1
|
Yu D, Koslovsky M, Steiner MC, Mohammadi K, Zhang C, Swartz MD. TRIO RVEMVS: A Bayesian framework for rare variant association analysis with expectation-maximization variable selection using family trio data. PLoS One 2024; 19:e0314502. [PMID: 39630689 PMCID: PMC11616829 DOI: 10.1371/journal.pone.0314502] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2024] [Accepted: 11/11/2024] [Indexed: 12/07/2024] Open
Abstract
It is commonly reported that rare variants may be more functionally related to complex diseases than common variants. However, individual rare variant association tests remain challenging due to low minor allele frequency in the available samples. This paper proposes an expectation maximization variable selection (EMVS) method to simultaneously detect common and rare variants at the individual variant level using family trio data. TRIO_RVEMVS was assessed in both large (1500 families) and small (350 families) datasets based on simulation. The performance of TRIO_RVEMVS was compared with gene-level kernel and burden association tests that use pedigree data (PedGene) and rare-variant extensions of the transmission disequilibrium test (RV-TDT). At the region level, TRIO_RVEMVS outperformed PedGene and RV-TDT when common variants were included. TRIO_RVEMVS performed competitively with PedGene and outperformed RV-TDT when the analysis was only restricted to rare variants. At the individual variants level, with 1,500 trios, the average true positive rate of individual rare variants that were polymorphic across 500 datasets was 12.20%, and the average false positive rate was 0.74%. In the datasets with 350 trios, the average true and false positive rates of individual rare variants were 13.10% and 1.30%, respectively. When applying TRIO_RVEMVS to real data from the Gabriella Miller Kids First Pediatric Research Program, it identified 3 rare variants in q24.21 and q24.22 associated with the risk of orofacial clefts in the Kids First European population.
Collapse
Affiliation(s)
- Duo Yu
- Division of Biostatistics, Data Science Institute, Medical College of Wisconsin, Milwaukee, Wisconsin, United States of America
| | - Matthew Koslovsky
- Department of Statistics, Colorado State University, Fort Collins, Colorado, United States of America
| | - Margaret C. Steiner
- Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
| | - Kusha Mohammadi
- Department of Biostatistics and Data Management, Regeneron Pharmaceuticals, Inc., Tarrytown, New York, United States of America
| | - Chenguang Zhang
- Biostatistics and Research Decision Sciences, Merck & Co., Inc., North Wales, Pennsylvania, United States of America
| | - Michael D. Swartz
- Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
| |
Collapse
|
2
|
Chen Z, Liang H, Wei P. Data-adaptive and pathway-based tests for association studies between somatic mutations and germline variations in human cancers. Genet Epidemiol 2023; 47:617-636. [PMID: 37822029 DOI: 10.1002/gepi.22537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2022] [Revised: 07/22/2023] [Accepted: 09/18/2023] [Indexed: 10/13/2023]
Abstract
Cancer is a disease driven by a combination of inherited genetic variants and somatic mutations. Recently available large-scale sequencing data of cancer genomes have provided an unprecedented opportunity to study the interactions between them. However, previous studies on this topic have been limited by simple, low statistical power tests such as Fisher's exact test. In this paper, we design data-adaptive and pathway-based tests based on the score statistic for association studies between somatic mutations and germline variations. Previous research has shown that two single-nucleotide polymorphism (SNP)-set-based association tests, adaptive sum of powered score (aSPU) and data-adaptive pathway-based (aSPUpath) tests, increase the power in genome-wide association studies (GWASs) with a single disease trait in a case-control study. We extend aSPU and aSPUpath to multi-traits, that is, somatic mutations of multiple genes in a cohort study, allowing extensive information aggregation at both SNP and gene levels.p $p$ -values from different parameters assuming varying genetic architecture are combined to yield data-adaptive tests for somatic mutations and germline variations. Extensive simulations show that, in comparison with some commonly used methods, our data-adaptive somatic mutations/germline variations tests can be applied to multiple germline SNPs/genes/pathways, and generally have much higher statistical powers while maintaining the appropriate type I error. The proposed tests are applied to a large-scale real-world International Cancer Genome Consortium whole genome sequencing data set of 2583 subjects, detecting more significant and biologically relevant associations compared with the other existing methods on both gene and pathway levels. Our study has systematically identified the associations between various germline variations and somatic mutations across different cancer types, which potentially provides valuable utility for cancer risk prediction, prognosis, and therapeutics.
Collapse
Affiliation(s)
- Zhongyuan Chen
- Division of Biostatistics, Medical College of Wisconsin, Milwaukee, Wisconsin, USA
| | - Han Liang
- Department of Bioinformatics and Computational Biology, MD Anderson Cancer Center, Houston, Texas, USA
| | - Peng Wei
- Department of Biostatistics, MD Anderson Cancer Center, Houston, Texas, USA
| |
Collapse
|
3
|
Boutry S, Helaers R, Lenaerts T, Vikkula M. Rare variant association on unrelated individuals in case-control studies using aggregation tests: existing methods and current limitations. Brief Bioinform 2023; 24:bbad412. [PMID: 37974506 DOI: 10.1093/bib/bbad412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2023] [Revised: 10/14/2023] [Accepted: 10/28/2023] [Indexed: 11/19/2023] Open
Abstract
Over the past years, progress made in next-generation sequencing technologies and bioinformatics have sparked a surge in association studies. Especially, genome-wide association studies (GWASs) have demonstrated their effectiveness in identifying disease associations with common genetic variants. Yet, rare variants can contribute to additional disease risk or trait heterogeneity. Because GWASs are underpowered for detecting association with such variants, numerous statistical methods have been recently proposed. Aggregation tests collapse multiple rare variants within a genetic region (e.g. gene, gene set, genomic loci) to test for association. An increasing number of studies using such methods successfully identified trait-associated rare variants and led to a better understanding of the underlying disease mechanism. In this review, we compare existing aggregation tests, their statistical features and scope of application, splitting them into the five classical classes: burden, adaptive burden, variance-component, omnibus and other. Finally, we describe some limitations of current aggregation tests, highlighting potential direction for further investigations.
Collapse
Affiliation(s)
- Simon Boutry
- Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, 1050 Brussels, Belgium
| | - Raphaël Helaers
- Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
| | - Tom Lenaerts
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, 1050 Brussels, Belgium
- Machine Learning Group, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Artificial Intelligence laboratory, Vrije Universiteit Brussel, 1050 Brussels, Belgium
| | - Miikka Vikkula
- Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
- WELBIO department, WEL Research Institute, avenue Pasteur, 6, 1300 Wavre, Belgium
| |
Collapse
|
4
|
Boutry S, Helaers R, Lenaerts T, Vikkula M. Excalibur: A new ensemble method based on an optimal combination of aggregation tests for rare-variant association testing for sequencing data. PLoS Comput Biol 2023; 19:e1011488. [PMID: 37708232 PMCID: PMC10522036 DOI: 10.1371/journal.pcbi.1011488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2023] [Revised: 09/26/2023] [Accepted: 09/04/2023] [Indexed: 09/16/2023] Open
Abstract
The development of high-throughput next-generation sequencing technologies and large-scale genetic association studies produced numerous advances in the biostatistics field. Various aggregation tests, i.e. statistical methods that analyze associations of a trait with multiple markers within a genomic region, have produced a variety of novel discoveries. Notwithstanding their usefulness, there is no single test that fits all needs, each suffering from specific drawbacks. Selecting the right aggregation test, while considering an unknown underlying genetic model of the disease, remains an important challenge. Here we propose a new ensemble method, called Excalibur, based on an optimal combination of 36 aggregation tests created after an in-depth study of the limitations of each test and their impact on the quality of result. Our findings demonstrate the ability of our method to control type I error and illustrate that it offers the best average power across all scenarios. The proposed method allows for novel advances in Whole Exome/Genome sequencing association studies, able to handle a wide range of association models, providing researchers with an optimal aggregation analysis for the genetic regions of interest.
Collapse
Affiliation(s)
- Simon Boutry
- Human Molecular Genetics, de Duve Institute, University of Louvain, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, Brussels, Belgium
| | - Raphaël Helaers
- Human Molecular Genetics, de Duve Institute, University of Louvain, Brussels, Belgium
| | - Tom Lenaerts
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, Brussels, Belgium
- Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium
- Artificial Intelligence laboratory, Vrije Universiteit Brussel, Brussels, Belgium
| | - Miikka Vikkula
- Human Molecular Genetics, de Duve Institute, University of Louvain, Brussels, Belgium
- WELBIO department, WEL Research Institute, Wavre, Belgium
| |
Collapse
|
5
|
Xiong W, Chen Y, Ma S. Unified model-free interaction screening via CV-entropy filter. Comput Stat Data Anal 2023; 180:107684. [PMID: 36910335 PMCID: PMC9997997 DOI: 10.1016/j.csda.2022.107684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]
Abstract
For many practical high-dimensional problems, interactions have been increasingly found to play important roles beyond main effects. A representative example is gene-gene interaction. Joint analysis, which analyzes all interactions and main effects in a single model, can be seriously challenged by high dimensionality. For high-dimensional data analysis in general, marginal screening has been established as effective for reducing computational cost, increasing stability, and improving estimation/selection performance. Most of the existing marginal screening methods are designed for the analysis of main effects only. The existing screening methods for interaction analysis are often limited by making stringent model assumptions, lacking robustness, and/or requiring predictors to be continuous (and hence lacking flexibility). A unified marginal screening approach tailored to interaction analysis is developed, which can be applied to regression, classification, and survival analysis. Predictors are allowed to be continuous and discrete. The proposed approach is built on Coefficient of Variation (CV) filters based on information entropy. Statistical properties are rigorously established. It is shown that the CV filters are almost insensitive to the distribution tails of predictors, correlation structure among predictors, and sparsity level of signals. An efficient two-stage algorithm is developed to make the proposed approach scalable to ultrahigh-dimensional data. Simulations and the analysis of TCGA LUAD data further establish the practical superiority of the proposed approach.
Collapse
Affiliation(s)
- Wei Xiong
- School of Statistics, University of International Business and Economics, Beijing 100872, PR China
| | - Yaxian Chen
- Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong
| | - Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, USA
| |
Collapse
|
6
|
Jiang Y, Wang X, Wen C, Jiang Y, Zhang H. Nonparametric two-sample tests of high dimensional mean vectors via random integration. J Am Stat Assoc 2022; 119:701-714. [PMID: 38644938 PMCID: PMC11028772 DOI: 10.1080/01621459.2022.2141636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2021] [Accepted: 10/24/2022] [Indexed: 11/06/2022]
Abstract
Testing the equality of the means in two samples is a fundamental statistical inferential problem. Most of the existing methods are based on the sum-of-squares or supremum statistics. They are possibly powerful in some situations, but not in others, and they do not work in a unified way. Using random integration of the difference, we develop a framework that includes and extends many existing methods, especially in high-dimensional settings, without restricting the same covariance matrices or sparsity. Under a general multivariate model, we can derive the asymptotic properties of the proposed test statistic without specifying a relationship between the data dimension and sample size explicitly. Specifically, the new framework allows us to better understand the test's properties and select a powerful procedure accordingly. For example, we prove that our proposed test can achieve the power of 1 when nonzero signals in the true mean differences are weakly dense with nearly the same sign. In addition, we delineate the conditions under which the asymptotic relative Pitman efficiency of our proposed test to its competitor is greater than or equal to 1. Extensive numerical studies and a real data example demonstrate the potential of our proposed test.
Collapse
Affiliation(s)
- Yunlu Jiang
- Department of Statistics, College of Economics, Jinan University, Guangzhou, GD 510632, China
| | - Xueqin Wang
- Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei, AH 230026, China
| | - Canhong Wen
- Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei, AH 230026 China
| | - Yukang Jiang
- Department of Statistical Science, School of Mathematics, Sun Yat-Sen University, Guangzhou, GD 510275, China
| | - Heping Zhang
- Department of Biostatistics Yale University School of Public Health, New Haven, CT 06520-8034, U.S.A
| |
Collapse
|
7
|
Monroe JG, McKay JK, Weigel D, Flood PJ. The population genomics of adaptive loss of function. Heredity (Edinb) 2021; 126:383-395. [PMID: 33574599 PMCID: PMC7878030 DOI: 10.1038/s41437-021-00403-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2020] [Revised: 12/28/2020] [Accepted: 01/01/2021] [Indexed: 12/23/2022] Open
Abstract
Discoveries of adaptive gene knockouts and widespread losses of complete genes have in recent years led to a major rethink of the early view that loss-of-function alleles are almost always deleterious. Today, surveys of population genomic diversity are revealing extensive loss-of-function and gene content variation, yet the adaptive significance of much of this variation remains unknown. Here we examine the evolutionary dynamics of adaptive loss of function through the lens of population genomics and consider the challenges and opportunities of studying adaptive loss-of-function alleles using population genetics models. We discuss how the theoretically expected existence of allelic heterogeneity, defined as multiple functionally analogous mutations at the same locus, has proven consistent with empirical evidence and why this impedes both the detection of selection and causal relationships with phenotypes. We then review technical progress towards new functionally explicit population genomic tools and genotype-phenotype methods to overcome these limitations. More broadly, we discuss how the challenges of studying adaptive loss of function highlight the value of classifying genomic variation in a way consistent with the functional concept of an allele from classical population genetics.
Collapse
Affiliation(s)
- J Grey Monroe
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076, Tübingen, Germany.
- Department of Plant Sciences, University of California Davis, Davis, CA, 95616, USA.
| | - John K McKay
- College of Agriculture, Colorado State University, Fort Collins, CO, 80523, USA
| | - Detlef Weigel
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076, Tübingen, Germany
| | - Pádraic J Flood
- Department of Plant Developmental Biology, Max Planck Institute for Plant Breeding Research, 50829, Cologne, Germany
- Department of Plant Breeding, Wageningen University, Wageningen, The Netherlands
| |
Collapse
|
8
|
Xue Y, Ding J, Wang J, Zhang S, Pan D. Two-phase SSU and SKAT in genetic association studies. J Genet 2020. [DOI: 10.1007/s12041-019-1166-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
|
9
|
Xue Y, Ding J, Wang J, Zhang S, Pan D. Two-phase SSU and SKAT in genetic association studies. J Genet 2020; 99:9. [PMID: 32089528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
The sum of squared score (SSU) and sequence kernel association test (SKAT) are the two good alternative tests for genetic association studies in case-control data. Both SSU and SKAT are derived through assuming a dose-response model between the risk of disease and genotypes. However, in practice, the real genetic mode of inheritance is impossible to know. Thus, these two tests might losepower substantially as shown in simulation results when the genetic model is misspecified. Here, to make both the tests suitable in broad situations, we propose two-phase SSU (tpSSU) and two-phase SKAT (tpSKAT), where the Hardy-Weinberg equilibrium test is adopted to choose the genetic model in the first phase and the SSU and SKAT are constructed corresponding to the selected genetic model in the second phase. We found that both tpSSU and tpSKAT outperformed the original SSU and SKAT in most of our simulation scenarios. Byapplying tpSSU and tpSKAT to the study of type 2 diabetes data, we successfully identified some genes that have direct effects on obesity. Besides, we also detected the significant chromosomal region 10q21.22 in GAW16 rheumatoid arthritis dataset, with P<10-6. These findings suggest that tpSSU and tpSKAT can be effective in identifying genetic variants for complex diseases in case-control association studies.
Collapse
Affiliation(s)
- Yuan Xue
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, People's Republic of China.
| | | | | | | | | |
Collapse
|
10
|
The impact of a fine-scale population stratification on rare variant association test results. PLoS One 2018; 13:e0207677. [PMID: 30521541 PMCID: PMC6283567 DOI: 10.1371/journal.pone.0207677] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2018] [Accepted: 11/05/2018] [Indexed: 12/28/2022] Open
Abstract
Population stratification is a well-known confounding factor in both common and rare variant association analyses. Rare variants tend to be more geographically clustered than common variants, because of their more recent origin. However, it is not yet clear if population stratification at a very fine scale (neighboring administrative regions within a country) would lead to statistical bias in rare variant analyses. As the inclusion of convenience controls from external studies is indeed a common procedure, in order to increase the power to detect genetic associations, this problem is important. We studied through simulation the impact of a fine scale population structure on different rare variant association strategies, assessing type I error and power. We showed that principal component analysis (PCA) based methods of adjustment for population stratification adequately corrected type I error inflation at the largest geographical scales, but not at finest scales. We also showed in our simulations that adding controls obviously increased power, but at a considerably lower level when controls were drawn from another population.
Collapse
|
11
|
Lee JY, Moon S, Kim YK, Lee SH, Lee BS, Park MY, Park JE, Jang Y, Han BG. Genome-based exome sequencing analysis identifies GYG1, DIS3L and DDRGK1 are associated with myocardial infarction in Koreans. J Genet 2017; 96:1041-1046. [DOI: 10.1007/s12041-017-0854-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
|
12
|
Grinde KE, Arbet J, Green A, O'Connell M, Valcarcel A, Westra J, Tintle N. Illustrating, Quantifying, and Correcting for Bias in Post-hoc Analysis of Gene-Based Rare Variant Tests of Association. Front Genet 2017; 8:117. [PMID: 28959274 PMCID: PMC5603735 DOI: 10.3389/fgene.2017.00117] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2017] [Accepted: 08/25/2017] [Indexed: 11/13/2022] Open
Abstract
To date, gene-based rare variant testing approaches have focused on aggregating information across sets of variants to maximize statistical power in identifying genes showing significant association with diseases. Beyond identifying genes that are associated with diseases, the identification of causal variant(s) in those genes and estimation of their effect is crucial for planning replication studies and characterizing the genetic architecture of the locus. However, we illustrate that straightforward single-marker association statistics can suffer from substantial bias introduced by conditioning on gene-based test significance, due to the phenomenon often referred to as "winner's curse." We illustrate the ramifications of this bias on variant effect size estimation and variant prioritization/ranking approaches, outline parameters of genetic architecture that affect this bias, and propose a bootstrap resampling method to correct for this bias. We find that our correction method significantly reduces the bias due to winner's curse (average two-fold decrease in bias, p < 2.2 × 10-6) and, consequently, substantially improves mean squared error and variant prioritization/ranking. The method is particularly helpful in adjustment for winner's curse effects when the initial gene-based test has low power and for relatively more common, non-causal variants. Adjustment for winner's curse is recommended for all post-hoc estimation and ranking of variants after a gene-based test. Further work is necessary to continue seeking ways to reduce bias and improve inference in post-hoc analysis of gene-based tests under a wide variety of genetic architectures.
Collapse
Affiliation(s)
- Kelsey E Grinde
- Department of Biostatistics, University of WashingtonSeattle, WA, United States
| | - Jaron Arbet
- Department of Biostatistics, University of MinnesotaMinneapolis, MN, United States
| | - Alden Green
- Department of Statistics, Carnegie Mellon UniversityPittsburgh, PA, United States
| | - Michael O'Connell
- Department of Biostatistics, University of MinnesotaMinneapolis, MN, United States
| | - Alessandra Valcarcel
- Department of Biostatistics and Epidemiology, University of PennsylvaniaPhiladelphia, PA, United States
| | - Jason Westra
- Department of Statistics, Iowa State UniversityAmes, IA, United States.,Department of Mathematics, Statistics, and Computer Science, Dordt CollegeSioux Center, IA, United States
| | - Nathan Tintle
- Department of Mathematics, Statistics, and Computer Science, Dordt CollegeSioux Center, IA, United States
| |
Collapse
|
13
|
Li YM, Xu C, Xiang Y, Peng C, Deng HW. An adaptive strategy for association analysis of common or rare variants using entropy theory. J Hum Genet 2017; 62:777-781. [PMID: 28381878 PMCID: PMC5584517 DOI: 10.1038/jhg.2017.39] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2016] [Revised: 02/07/2017] [Accepted: 03/07/2017] [Indexed: 11/25/2022]
Abstract
Advances in DNA sequencing technology have been promoting the development of sequencing studies to identify rare variants associated with complex traits. Adaptive strategy can be effective to reduce the noise provided by non-causal variants. However, the existing adaptive strategies depend on many assumptions. In this paper, we proposed a new adaptive strategy using entropy theory for association analysis. This entropy-based strategy is based on the magnitude of association between variants and disease and does not depend on the detailed association pattern with causal variants. We considered multi-marker test and Sum test with collapsing method to construct the entropy-based adaptive strategy. Using simulation studies, we investigated the performance of our method for rare variant analyses as well as for common variant analyses with multi-marker test and compared it with several existing adaptive strategies. The results showed that our method can improve the power and achieve good performance when there is a large number of non-causal variants and effects of causal variants are in the same direction for rare variant.
Collapse
Affiliation(s)
- Yu-Mei Li
- School of Mathematics and Computational Science, Huaihua University, Hunan, China
- Center for Bioinformatics and Genomics, Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA, USA
| | - Chao Xu
- Center for Bioinformatics and Genomics, Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA, USA
| | - Yang Xiang
- School of Mathematics and Computational Science, Huaihua University, Hunan, China
| | - Cheng Peng
- Center for Bioinformatics and Genomics, Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA, USA
- Department of Geriatrics, National Key Clinical Specialty, Guangzhou First People’s Hospital, Guangzhou Medical University, Guangdong, China
| | - Hong-Wen Deng
- Center for Bioinformatics and Genomics, Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA, USA
| |
Collapse
|
14
|
Konigorski S, Yilmaz YE, Pischon T. Comparison of single-marker and multi-marker tests in rare variant association studies of quantitative traits. PLoS One 2017; 12:e0178504. [PMID: 28562689 PMCID: PMC5451057 DOI: 10.1371/journal.pone.0178504] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2017] [Accepted: 05/15/2017] [Indexed: 11/19/2022] Open
Abstract
In genetic association studies of rare variants, low statistical power and potential violations of established estimator properties are among the main challenges of association tests. Multi-marker tests (MMTs) have been proposed to target these challenges, but any comparison with single-marker tests (SMTs) has to consider that their aim is to identify causal genomic regions instead of variants. Valid power comparisons have been performed for the analysis of binary traits indicating that MMTs have higher power, but there is a lack of conclusive studies for quantitative traits. The aim of our study was therefore to fairly compare SMTs and MMTs in their empirical power to identify the same causal loci associated with a quantitative trait. The results of extensive simulation studies indicate that previous results for binary traits cannot be generalized. First, we show that for the analysis of quantitative traits, conventional estimation methods and test statistics of single-marker approaches have valid properties yielding association tests with valid type I error, even when investigating singletons or doubletons. Furthermore, SMTs lead to more powerful association tests for identifying causal genes than MMTs when the effect sizes of causal variants are large, and less powerful tests when causal variants have small effect sizes. For moderate effect sizes, whether SMTs or MMTs have higher power depends on the sample size and percentage of causal SNVs. For a more complete picture, we also compare the power in studies of quantitative and binary traits, and the power to identify causal genes with the power to identify causal rare variants. In a genetic association analysis of systolic blood pressure in the Genetic Analysis Workshop 19 data, SMTs yielded smaller p-values compared to MMTs for most of the investigated blood pressure genes, and were least influenced by the definition of gene regions.
Collapse
Affiliation(s)
- Stefan Konigorski
- Molecular Epidemiology Research Group, Max Delbrück Center (MDC) for Molecular Medicine in the Helmholtz Association, Berlin, Germany
| | - Yildiz E. Yilmaz
- Department of Mathematics and Statistics, Memorial University of Newfoundland, St. John’s, Newfoundland and Labrador, Canada
- Discipline of Genetics, Faculty of Medicine, Memorial University of Newfoundland, St. John’s, Newfoundland and Labrador, Canada
- Discipline of Medicine, Faculty of Medicine, Memorial University of Newfoundland, St. John’s, Newfoundland and Labrador, Canada
| | - Tobias Pischon
- Molecular Epidemiology Research Group, Max Delbrück Center (MDC) for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Charité Universitätsmedizin Berlin, Berlin, Germany
- DZHK (German Center for Cardiovascular Research), Berlin, Germany
| |
Collapse
|
15
|
Abstract
Several two-sample tests for high-dimensional data have been proposed recently, but they are powerful only against certain alternative hypotheses. In practice, since the true alternative hypothesis is unknown, it is unclear how to choose a powerful test. We propose an adaptive test that maintains high power across a wide range of situations and study its asymptotic properties. Its finite-sample performance is compared with that of existing tests. We apply it and other tests to detect possible associations between bipolar disease and a large number of single nucleotide polymorphisms on each chromosome based on data from a genome-wide association study. Numerical studies demonstrate the superior performance and high power of the proposed test across a wide spectrum of applications.
Collapse
Affiliation(s)
- Gongjun Xu
- School of Statistics, University of Minnesota, Minneapolis, Minnesota, U.S.A. 55455
| | - Lifeng Lin
- Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, U.S.A. 55455
| | - Peng Wei
- Division of Biostatistics and Human Genetics Center, University of Texas School of Public Health, Houston, Texas, U.S.A. 77030
| | - Wei Pan
- Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, U.S.A. 55455
| |
Collapse
|
16
|
Detecting disease association with rare variants in case-parents studies. J Hum Genet 2017; 62:549-552. [DOI: 10.1038/jhg.2017.1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2016] [Revised: 12/27/2016] [Accepted: 12/27/2016] [Indexed: 11/09/2022]
|
17
|
Fang H, Zhang H, Yang Y. Poisson Approximation-Based Score Test for Detecting Association of Rare Variants. Ann Hum Genet 2016; 80:221-34. [PMID: 27346734 DOI: 10.1111/ahg.12154] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2015] [Accepted: 02/26/2016] [Indexed: 11/30/2022]
Abstract
Genome-wide association study (GWAS) has achieved great success in identifying genetic variants, but the nature of GWAS has determined its inherent limitations. Under the common disease rare variants (CDRV) hypothesis, the traditional association analysis methods commonly used in GWAS for common variants do not have enough power for detecting rare variants with a limited sample size. As a solution to this problem, pooling rare variants by their functions provides an efficient way for identifying susceptible genes. Rare variant typically have low frequencies of minor alleles, and the distribution of the total number of minor alleles of the rare variants can be approximated by a Poisson distribution. Based on this fact, we propose a new test method, the Poisson Approximation-based Score Test (PAST), for association analysis of rare variants. Two testing methods, namely, ePAST and mPAST, are proposed based on different strategies of pooling rare variants. Simulation results and application to the CRESCENDO cohort data show that our methods are more powerful than the existing methods.
Collapse
Affiliation(s)
- Hongyan Fang
- Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui, 230026, China
| | - Hong Zhang
- Institute of Biostatistics, Fudan School of Life Sciences, Fudan, Shanghai, 200433, China
| | - Yaning Yang
- Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui, 230026, China
| |
Collapse
|
18
|
Kao CF, Liu JR, Hung H, Kuo PH. A robust GWSS method to simultaneously detect rare and common variants for complex disease. PLoS One 2015; 10:e0120873. [PMID: 25880329 PMCID: PMC4399906 DOI: 10.1371/journal.pone.0120873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2014] [Accepted: 01/26/2015] [Indexed: 11/19/2022] Open
Abstract
The rapid advances in sequencing technologies and the resulting next-generation sequencing data provide the opportunity to detect disease-associated variants with a better solution, in particular for low-frequency variants. Although both common and rare variants might exert their independent effects on the risk for the trait of interest, previous methods to detect the association effects rarely consider them simultaneously. We proposed a class of test statistics, the generalized weighted-sum statistic (GWSS), to detect disease associations in the presence of common and rare variants with a case-control study design. Information of rare variants was aggregated using a weighted sum method, while signal directions and strength of the variants were considered at the same time. Permutations were performed to obtain the empirical p-values of the test statistics. Our simulation showed that, compared to the existing methods, the GWSS method had better performance in most of the scenarios. The GWSS (in particular VDWSS-t) method is particularly robust for opposite association directions, association strength, and varying distributions of minor-allele frequencies. It is therefore promising for detecting disease-associated loci. For empirical data application, we also applied our GWSS method to the Genetic Analysis Workshop 17 data, and the results were consistent with the simulation, suggesting good performance of our method. As re-sequencing studies become more popular to identify putative disease loci, we recommend the use of this newly developed GWSS to detect associations with both common and rare variants.
Collapse
Affiliation(s)
- Chung-Feng Kao
- Department of Public Health, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
| | - Jia-Rou Liu
- Department of Public Health, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
- Department of Public Health, Chang Gung University, Taoyuan,Taiwan
| | - Hung Hung
- Department of Public Health, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
- Research Center for Genes, Environment and Human Health, National Taiwan University, Taipei, Taiwan
- * E-mail: (PHK); (HH)
| | - Po-Hsiu Kuo
- Department of Public Health, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
- Research Center for Genes, Environment and Human Health, National Taiwan University, Taipei, Taiwan
- * E-mail: (PHK); (HH)
| |
Collapse
|
19
|
Zhang Q. Associating rare genetic variants with human diseases. Front Genet 2015; 6:133. [PMID: 25904936 PMCID: PMC4389536 DOI: 10.3389/fgene.2015.00133] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2015] [Accepted: 03/19/2015] [Indexed: 11/20/2022] Open
Affiliation(s)
- Qunyuan Zhang
- Division of Statistical Genomics, Washington University School of Medicine St. Louis, MO, USA
| |
Collapse
|
20
|
Pan W, Chen YM, Wei P. Testing for polygenic effects in genome-wide association studies. Genet Epidemiol 2015; 39:306-16. [PMID: 25847094 DOI: 10.1002/gepi.21899] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2014] [Revised: 01/30/2015] [Accepted: 02/23/2015] [Indexed: 12/20/2022]
Abstract
To confirm associations with a large number of single nucleotide polymorphisms (SNPs), each with only a small effect size, as hypothesized in the polygenic theory for schizophrenia, the International Schizophrenia Consortium (2009, Nature 460:748-752) proposed a polygenic risk score (PRS) test and demonstrated its effectiveness when applied to psychiatric disorders. The basic idea of the PRS test is to use a half of the sample to select and up-weight those more likely to be associated SNPs, and then use the other half of the sample to test for aggregated effects of the selected SNPs. Intrigued by the novelty and increasing use of the PRS test, we aimed to evaluate and improve its performance for GWAS data. First, by an analysis of the PRS test, we point out its connection with the Sum test [Chapman and Whittaker, Genet Epidemiol 32:560-566; Pan, Genet Epidemiol 33:497-507]; given the known advantages and disadvantages of the Sum test, this connection motivated the development of several other polygenic tests, some of which may be more powerful than the PRS test under certain situations. Second, more importantly, to overcome the low statistical efficiency of the data-splitting strategy as adopted in the PRS test, we reformulate and thus modify the PRS test, obtaining several adaptive tests, which are closely related to the adaptive sum of powered score (SPU) test studied in the context of rare variant analysis [Pan et al., 2014, Genetics 197:1081-1095]. We use both simulated data and a real GWAS dataset of alcohol dependence to show dramatically improved power of the new tests over the PRS test; due to its superior performance and simplicity, we recommend the whole sample-based adaptive SPU test for polygenic testing. We hope to raise the awareness of the limitations of the PRS test and potential power gain of the adaptive SPU test.
Collapse
Affiliation(s)
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
| | | | | |
Collapse
|
21
|
Zhang Q, Wang L, Koboldt D, Boreki IB, Province MA. Adjusting family relatedness in data-driven burden test of rare variants. Genet Epidemiol 2014; 38:722-7. [PMID: 25169066 DOI: 10.1002/gepi.21848] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2014] [Revised: 07/01/2014] [Accepted: 07/16/2014] [Indexed: 11/08/2022]
Abstract
Family data represent a rich resource for detecting association between rare variants (RVs) and human traits. However, most RV association analysis methods developed in recent years are data-driven burden tests which can adaptively learn weights from data but require permutation to evaluate significance, thus are not readily applicable to family data, because random permutation will destroy family structure. Direct application of these methods to family data may result in a significant inflation of false positives. To overcome this issue, we have developed a generalized, weighted sum mixed model (WSMM), and corresponding computational techniques that can incorporate family information into data-driven burden tests, and allow adaptive and efficient permutation test in family data. Using simulated and real datasets, we demonstrate that the WSMM method can be used to appropriately adjust for genetic relatedness among family members and has a good control for the inflation of false positives. We compare WSMM with a nondata-driven, family-based Sequence Kernel Association Test (famSKAT), showing that WSMM has significantly higher power in some cases. WSMM provides a generalized, flexible framework for adapting different data-driven burden tests to analyze data with any family structures, and it can be extended to binary and time-to-onset traits, with or without covariates.
Collapse
Affiliation(s)
- Qunyuan Zhang
- Division of Statistical Genomics, Washington University School of Medicine, St. Louis, Missouri, United States of America
| | | | | | | | | |
Collapse
|
22
|
Zhan X, Epstein MP, Ghosh D. An Adaptive Genetic Association Test Using Double Kernel Machines. STATISTICS IN BIOSCIENCES 2014; 7:262-281. [PMID: 26640602 DOI: 10.1007/s12561-014-9116-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
Abstract
Recently, gene set-based approaches have become very popular in gene expression profiling studies for assessing how genetic variants are related to disease outcomes. Since most genes are not differentially expressed, existing pathway tests considering all genes within a pathway suffer from considerable noise and power loss. Moreover, for a differentially expressed pathway, it is of interest to select important genes that drive the effect of the pathway. In this article, we propose an adaptive association test using double kernel machines (DKM), which can both select important genes within the pathway as well as test for the overall genetic pathway effect. This DKM procedure first uses the garrote kernel machines (GKM) test for the purposes of subset selection and then the least squares kernel machine (LSKM) test for testing the effect of the subset of genes. An appealing feature of the kernel machine framework is that it can provide a flexible and unified method for multi-dimensional modeling of the genetic pathway effect allowing for both parametric and nonparametric components. This DKM approach is illustrated with application to simulated data as well as to data from a neuroimaging genetics study.
Collapse
Affiliation(s)
- Xiang Zhan
- Department of Statistics, Pennsylvania State University, University Park, PA 16802, U.S.A. Tel.: +1-8143213493
| | - Michael P Epstein
- Department of Human Genetics, Emory University, Atlanta, GA 30322, U.S.A
| | - Debashis Ghosh
- Department of Statistics, Department of Public Health Sciences, Pennsylvania State University, University Park, PA 16802, U.S.A
| |
Collapse
|
23
|
Abstract
This article focuses on conducting global testing for association between a binary trait and a set of rare variants (RVs), although its application can be much broader to other types of traits, common variants (CVs), and gene set or pathway analysis. We show that many of the existing tests have deteriorating performance in the presence of many nonassociated RVs: their power can dramatically drop as the proportion of nonassociated RVs in the group to be tested increases. We propose a class of so-called sum of powered score (SPU) tests, each of which is based on the score vector from a general regression model and hence can deal with different types of traits and adjust for covariates, e.g., principal components accounting for population stratification. The SPU tests generalize the sum test, a representative burden test based on pooling or collapsing genotypes of RVs, and a sum of squared score (SSU) test that is closely related to several other powerful variance component tests; a previous study (Basu and Pan 2011) has demonstrated good performance of one, but not both, of the Sum and SSU tests in many situations. The SPU tests are versatile in the sense that one of them is often powerful, although its identity varies with the unknown true association parameters. We propose an adaptive SPU (aSPU) test to approximate the most powerful SPU test for a given scenario, consequently maintaining high power and being highly adaptive across various scenarios. We conducted extensive simulations to show superior performance of the aSPU test over several state-of-the-art association tests in the presence of many nonassociated RVs. Finally we applied the SPU and aSPU tests to the GAW17 mini-exome sequence data to compare its practical performance with some existing tests, demonstrating their potential usefulness.
Collapse
|
24
|
Cook K, Benitez A, Fu C, Tintle N. Evaluating the impact of genotype errors on rare variant tests of association. Front Genet 2014; 5:62. [PMID: 24744770 PMCID: PMC3978329 DOI: 10.3389/fgene.2014.00062] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2013] [Accepted: 03/11/2014] [Indexed: 01/23/2023] Open
Abstract
The new class of rare variant tests has usually been evaluated assuming perfect genotype information. In reality, rare variant genotypes may be incorrect, and so rare variant tests should be robust to imperfect data. Errors and uncertainty in SNP genotyping are already known to dramatically impact statistical power for single marker tests on common variants and, in some cases, inflate the type I error rate. Recent results show that uncertainty in genotype calls derived from sequencing reads are dependent on several factors, including read depth, calling algorithm, number of alleles present in the sample, and the frequency at which an allele segregates in the population. We have recently proposed a general framework for the evaluation and investigation of rare variant tests of association, classifying most rare variant tests into one of two broad categories (length or joint tests). We use this framework to relate factors affecting genotype uncertainty to the power and type I error rate of rare variant tests. We find that non-differential genotype errors (an error process that occurs independent of phenotype) decrease power, with larger decreases for extremely rare variants, and for the common homozygote to heterozygote error. Differential genotype errors (an error process that is associated with phenotype status), lead to inflated type I error rates which are more likely to occur at sites with more common homozygote to heterozygote errors than vice versa. Finally, our work suggests that certain rare variant tests and study designs may be more robust to the inclusion of genotype errors. Further work is needed to directly integrate genotype calling algorithm decisions, study costs and test statistic choices to provide comprehensive design and analysis advice which appropriately accounts for the impact of genotype errors.
Collapse
Affiliation(s)
- Kaitlyn Cook
- Department of Mathematics, Carleton College Northfield, MN, USA
| | - Alejandra Benitez
- Department of Applied Mathematics, Brown University Providence, RI, USA
| | - Casey Fu
- Department of Mathematics, Massachusetts Institute of Technology Boston, MA, USA
| | - Nathan Tintle
- Department of Mathematics, Statistics and Computer Science, Dordt College Sioux Center, IA, USA
| |
Collapse
|
25
|
Li B, Liu DJ, Leal SM. Identifying rare variants associated with complex traits via sequencing. ACTA ACUST UNITED AC 2014; Chapter 1:Unit 1.26. [PMID: 23853079 DOI: 10.1002/0471142905.hg0126s78] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
Although genome-wide association studies have been successful in detecting associations with common variants, there is currently an increasing interest in identifying low-frequency and rare variants associated with complex traits. Next-generation sequencing technologies make it feasible to survey the full spectrum of genetic variation in coding regions or the entire genome. The association analysis for rare variants is challenging, and traditional methods are ineffective, however, due to the low frequency of rare variants, coupled with allelic heterogeneity. Recently a battery of new statistical methods has been proposed for identifying rare variants associated with complex traits. These methods test for associations by aggregating multiple rare variants across a gene or a genomic region or among a group of variants in the genome. In this unit, we describe key concepts for rare variant association for complex traits, survey some of the recent methods, discuss their statistical power under various scenarios, and provide practical guidance on analyzing next-generation sequencing data for identifying rare variants associated with complex traits.
Collapse
Affiliation(s)
- Bingshan Li
- Department of Molecular Physiology and Biophysics, Center for Human Genetics Research, Vanderbilt University, Nashville, Tennessee, USA
| | | | | |
Collapse
|
26
|
Freytag S, Bickeböller H. Comparison of three summary statistics for ranking genes in genome-wide association studies. Stat Med 2013; 33:1828-41. [PMID: 24323702 DOI: 10.1002/sim.6063] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2013] [Revised: 09/11/2013] [Accepted: 11/18/2013] [Indexed: 01/30/2023]
Abstract
Problems associated with insufficient power have haunted the analysis of genome-wide association studies and are likely to be the main challenge for the analysis of next-generation sequencing data. Ranking genes according to their strength of association with the investigated phenotype is one solution. To obtain rankings for genes, researchers can draw from a wide range of statistics summarizing the relationships between variants mapped to a gene and the phenotype. Hence, it is of interest to explore the performance of these statistics in the context of rankings. To this end, we conducted a simulation study (limited to genes of equal sizes) of three different summary statistics examining the ability to rank genes in a meaningful order. The weighted sum of squared marginal score test (Pan, 2009), RareCover algorithm (Bahtia et al., 2010) and the elastic net regularization (Zou and Hastie, 2005) were chosen, because they can handle common as well as rare variants. The test based on the score statistic outperformed both other methods in almost all investigated scenarios. It was the only measure to consistently detect genes with interacting causal variants. However, the RareCover algorithm proved better at identifying genes including causal variants with small effect sizes and low minor allele frequency than the weighted sum of squared marginal score test. The performance of the elastic net regularization was unimpressive for all but the simplest scenarios.
Collapse
Affiliation(s)
- Saskia Freytag
- Institute of Genetic Epidemiology, University of Göttingen, Humboltallee 32, Medical School, 37073 Göttingen, Germany
| | | |
Collapse
|
27
|
Cardinale CJ, Kelsen JR, Baldassano RN, Hakonarson H. Impact of exome sequencing in inflammatory bowel disease. World J Gastroenterol 2013; 19:6721-9. [PMID: 24187447 PMCID: PMC3812471 DOI: 10.3748/wjg.v19.i40.6721] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/11/2013] [Revised: 09/11/2013] [Accepted: 09/16/2013] [Indexed: 02/06/2023] Open
Abstract
Approaches to understanding the genetic contribution to inflammatory bowel disease (IBD) have continuously evolved from family- and population-based epidemiology, to linkage analysis, and most recently, to genome-wide association studies (GWAS). The next stage in this evolution seems to be the sequencing of the exome, that is, the regions of the human genome which encode proteins. The GWAS approach has been very fruitful in identifying at least 163 loci as being associated with IBD, and now, exome sequencing promises to take our genetic understanding to the next level. In this review we will discuss the possible contributions that can be made by an exome sequencing approach both at the individual patient level to aid with disease diagnosis and future therapies, as well as in advancing knowledge of the pathogenesis of IBD.
Collapse
|
28
|
|
29
|
Cardinale CJ, Wei Z, Panossian S, Wang F, Kim CE, Mentch FD, Chiavacci RM, Kachelries KE, Pandey R, Grant SFA, Baldassano RN, Hakonarson H. Targeted resequencing identifies defective variants of decoy receptor 3 in pediatric-onset inflammatory bowel disease. Genes Immun 2013; 14:447-52. [DOI: 10.1038/gene.2013.43] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2013] [Accepted: 07/19/2013] [Indexed: 12/14/2022]
|
30
|
Fang H, Hou B, Wang Q, Yang Y. Rare variants analysis by risk-based variable-threshold method. Comput Biol Chem 2013; 46:32-8. [PMID: 23764529 DOI: 10.1016/j.compbiolchem.2013.04.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2012] [Revised: 04/03/2013] [Accepted: 04/10/2013] [Indexed: 11/17/2022]
Abstract
Genome-wide association studies, as a powerful approach for detecting common variants associated with diseases, have revealed many disease-associated loci. However, the traditional association analysis methods do not have enough power for detecting the effects of rare variants with limited sample size. As a solution to this problem, pooling rare variants by their functions into a composite variant provides an alternative way for identifying susceptible genes. In this paper, we propose a new pooling method to test the variant-disease association and to identify the functional rare variants related with the disease. Variants with smaller and larger risk measures defined as the ratio of allele frequencies between cases and controls are pooled and a chi-square test of the resultant pooled table is calculated. We vary the threshold of pooling over all possible values and use the maximal chi-square as test statistic. The maximal chi-square is in fact the global maximum over all possible poolings. Our approach is similar to the existing variable-threshold method, but we threshold on the risk measure instead of allele frequencies of controls. Simulation results show that our method performs better in both association testing and variant selection.
Collapse
Affiliation(s)
- Hongyan Fang
- Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui 230026, China
| | | | | | | |
Collapse
|
31
|
Long N, Dickson SP, Maia JM, Kim HS, Zhu Q, Allen AS. Leveraging prior information to detect causal variants via multi-variant regression. PLoS Comput Biol 2013; 9:e1003093. [PMID: 23762022 PMCID: PMC3675126 DOI: 10.1371/journal.pcbi.1003093] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2012] [Accepted: 04/29/2013] [Indexed: 01/03/2023] Open
Abstract
Although many methods are available to test sequence variants for association with complex diseases and traits, methods that specifically seek to identify causal variants are less developed. Here we develop and evaluate a Bayesian hierarchical regression method that incorporates prior information on the likelihood of variant causality through weighting of variant effects. By simulation studies using both simulated and real sequence variants, we compared a standard single variant test for analyzing variant-disease association with the proposed method using different weighting schemes. We found that by leveraging linkage disequilibrium of variants with known GWAS signals and sequence conservation (phastCons), the proposed method provides a powerful approach for detecting causal variants while controlling false positives. The decline in DNA sequencing cost permits the interrogation of potentially all variants across the entire allele frequency spectrum for their associations with complex human diseases and traits. However, the identification of causal variants remains challenging. Existing single variant tests do not distinguish between causal association and association induced by linkage disequilibrium and tend to be underpowered for rare or low-frequency variants, whereas variant grouping methods do not identify individual causal variants. We propose a novel Bayesian hierarchical regression approach that estimates effects of multiple variants on a disease trait simultaneously and incorporates prior information on the likelihood of causality. By simulation, we show that by combining linkage disequilibrium with known genome wide association signals and functional conservation, the proposed method, the first of its kind, is powerful to correctly detect causal variants.
Collapse
Affiliation(s)
- Nanye Long
- Center for Human Genome Variation, Duke University School of Medicine, Durham, North Carolina, United States of America.
| | | | | | | | | | | |
Collapse
|
32
|
Schaid DJ, McDonnell SK, Sinnwell JP, Thibodeau SN. Multiple genetic variant association testing by collapsing and kernel methods with pedigree or population structured data. Genet Epidemiol 2013; 37:409-18. [PMID: 23650101 DOI: 10.1002/gepi.21727] [Citation(s) in RCA: 73] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2013] [Revised: 03/11/2013] [Accepted: 04/01/2013] [Indexed: 11/11/2022]
Abstract
Searching for rare genetic variants associated with complex diseases can be facilitated by enriching for diseased carriers of rare variants by sampling cases from pedigrees enriched for disease, possibly with related or unrelated controls. This strategy, however, complicates analyses because of shared genetic ancestry, as well as linkage disequilibrium among genetic markers. To overcome these problems, we developed broad classes of "burden" statistics and kernel statistics, extending commonly used methods for unrelated case-control data to allow for known pedigree relationships, for autosomes and the X chromosome. Furthermore, by replacing pedigree-based genetic correlation matrices with estimates of genetic relationships based on large-scale genomic data, our methods can be used to account for population-structured data. By simulations, we show that the type I error rates of our developed methods are near the asymptotic nominal levels, allowing rapid computation of P-values. Our simulations also show that a linear weighted kernel statistic is generally more powerful than a weighted "burden" statistic. Because the proposed statistics are rapid to compute, they can be readily used for large-scale screening of the association of genomic sequence data with disease status.
Collapse
Affiliation(s)
- Daniel J Schaid
- Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota 55905, USA.
| | | | | | | |
Collapse
|
33
|
Liu K, Fast S, Zawistowski M, Tintle NL. A geometric framework for evaluating rare variant tests of association. Genet Epidemiol 2013; 37:345-57. [PMID: 23526307 DOI: 10.1002/gepi.21722] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2012] [Revised: 02/12/2013] [Accepted: 02/13/2013] [Indexed: 11/08/2022]
Abstract
The wave of next-generation sequencing data has arrived. However, many questions still remain about how to best analyze sequence data, particularly the contribution of rare genetic variants to human disease. Numerous statistical methods have been proposed to aggregate association signals across multiple rare variant sites in an effort to increase statistical power; however, the precise relation between the tests is often not well understood. We present a geometric representation for rare variant data in which rare allele counts in case and control samples are treated as vectors in Euclidean space. The geometric framework facilitates a rigorous classification of existing rare variant tests into two broad categories: tests for a difference in the lengths of the case and control vectors, and joint tests for a difference in either the lengths or angles of the two vectors. We demonstrate that genetic architecture of a trait, including the number and frequency of risk alleles, directly relates to the behavior of the length and joint tests. Hence, the geometric framework allows prediction of which tests will perform best under different disease models. Furthermore, the structure of the geometric framework immediately suggests additional classes and types of rare variant tests. We consider two general classes of tests which show robustness to noncausal and protective variants. The geometric framework introduces a novel and unique method to assess current rare variant methodology and provides guidelines for both applied and theoretical researchers.
Collapse
Affiliation(s)
- Keli Liu
- Department of Statistics, Harvard University, Cambridge, MA, USA
| | | | | | | |
Collapse
|
34
|
Assessing the impact of differential genotyping errors on rare variant tests of association. PLoS One 2013; 8:e56626. [PMID: 23472072 PMCID: PMC3589406 DOI: 10.1371/journal.pone.0056626] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2012] [Accepted: 01/15/2013] [Indexed: 11/19/2022] Open
Abstract
Genotyping errors are well-known to impact the power and type I error rate in single marker tests of association. Genotyping errors that happen according to the same process in cases and controls are known as non-differential genotyping errors, whereas genotyping errors that occur with different processes in the cases and controls are known as differential genotype errors. For single marker tests, non-differential genotyping errors reduce power, while differential genotyping errors increase the type I error rate. However, little is known about the behavior of the new generation of rare variant tests of association in the presence of genotyping errors. In this manuscript we use a comprehensive simulation study to explore the effects of numerous factors on the type I error rate of rare variant tests of association in the presence of differential genotyping error. We find that increased sample size, decreased minor allele frequency, and an increased number of single nucleotide variants (SNVs) included in the test all increase the type I error rate in the presence of differential genotyping errors. We also find that the greater the relative difference in case-control genotyping error rates the larger the type I error rate. Lastly, as is the case for single marker tests, genotyping errors classifying the common homozygote as the heterozygote inflate the type I error rate significantly more than errors classifying the heterozygote as the common homozygote. In general, our findings are in line with results from single marker tests. To ensure that type I error inflation does not occur when analyzing next-generation sequencing data careful consideration of study design (e.g. use of randomization), caution in meta-analysis and using publicly available controls, and the use of standard quality control metrics is critical.
Collapse
|
35
|
Zhang Y, Guan W, Pan W. Adjustment for population stratification via principal components in association analysis of rare variants. Genet Epidemiol 2012; 37:99-109. [PMID: 23065775 DOI: 10.1002/gepi.21691] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2012] [Revised: 09/11/2012] [Accepted: 09/13/2012] [Indexed: 11/07/2022]
Abstract
For unrelated samples, principal component (PC) analysis has been established as a simple and effective approach to adjusting for population stratification in association analysis of common variants (CVs, with minor allele frequencies MAF > 5%). However, it is less clear how it would perform in analysis of low-frequency variants (LFVs, MAF between 1% and 5%), or of rare variants (RVs, MAF < 5%). Furthermore, with next-generation sequencing data, it is unknown whether PCs should be constructed based on CVs, LFVs, or RVs. In this study, we used the 1000 Genomes Project sequence data to explore the construction of PCs and their use in association analysis of LFVs or RVs for unrelated samples. It is shown that a few top PCs based on either CVs or LFVs could separate two continental groups, European and African samples, but those based on only RVs performed less well. When applied to several association tests in simulated data with population stratification, using PCs based on either CVs or LFVs was effective in controlling Type I error rates, while nonadjustment led to inflated Type I error rates. Perhaps the most interesting observation is that, although the PCs based on LFVs could better separate the two continental groups than those based on CVs, the use of the former could lead to overadjustment in the sense of substantial power loss in the absence of population stratification; in contrast, we did not see any problem with the use of the PCs based on CVs in all our examples.
Collapse
Affiliation(s)
- Yiwei Zhang
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455-0392, USA
| | | | | |
Collapse
|
36
|
Abstract
It is widely believed that both common and rare variants contribute to the risks of common diseases or complex traits and the cumulative effects of multiple rare variants can explain a significant proportion of trait variances. Advances in high-throughput DNA sequencing technologies allow us to genotype rare causal variants and investigate the effects of such rare variants on complex traits. We developed an adaptive ridge regression method to analyze the collective effects of multiple variants in the same gene or the same functional unit. Our model focuses on continuous trait and incorporates covariate factors to remove potential confounding effects. The proposed method estimates and tests multiple rare variants collectively but does not depend on the assumption of same direction of each rare variant effect. Compared with the Bayesian hierarchical generalized linear model approach, the state-of-the-art method of rare variant detection, the proposed new method is easy to implement, yet it has higher statistical power. Application of the new method is demonstrated using the well-known data from the Dallas Heart Study.
Collapse
Affiliation(s)
- Haimao Zhan
- Department of Botany and Plant Sciences, University of California Riverside, Riverside, California, United States of America
| | - Shizhong Xu
- Department of Botany and Plant Sciences, University of California Riverside, Riverside, California, United States of America
| |
Collapse
|
37
|
Meyer NJ, Daye ZJ, Rushefski M, Aplenc R, Lanken PN, Shashaty MGS, Christie JD, Feng R. SNP-set analysis replicates acute lung injury genetic risk factors. BMC MEDICAL GENETICS 2012; 13:52. [PMID: 22742663 PMCID: PMC3512475 DOI: 10.1186/1471-2350-13-52] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/31/2012] [Accepted: 06/18/2012] [Indexed: 12/19/2022]
Abstract
BACKGROUND We used a gene - based replication strategy to test the reproducibility of prior acute lung injury (ALI) candidate gene associations. METHODS We phenotyped 474 patients from a prospective severe trauma cohort study for ALI. Genomic DNA from subjects' blood was genotyped using the IBC chip, a multiplex single nucleotide polymorphism (SNP) array. Results were filtered for 25 candidate genes selected using prespecified literature search criteria and present on the IBC platform. For each gene, we grouped SNPs according to haplotype blocks and tested the joint effect of all SNPs on susceptibility to ALI using the SNP-set kernel association test. Results were compared to single SNP analysis of the candidate SNPs. Analyses were separate for genetically determined ancestry (African or European). RESULTS We identified 4 genes in African ancestry and 2 in European ancestry trauma subjects which replicated their associations with ALI. Ours is the first replication of IL6, IL10, IRAK3, and VEGFA associations in non-European populations with ALI. Only one gene - VEGFA - demonstrated association with ALI in both ancestries, with distinct haplotype blocks in each ancestry driving the association. We also report the association between trauma-associated ALI and NFKBIA in European ancestry subjects. CONCLUSIONS Prior ALI genetic associations are reproducible and replicate in a trauma cohort. Kernel - based SNP-set analysis is a more powerful method to detect ALI association than single SNP analysis, and thus may be more useful for replication testing. Further, gene-based replication can extend candidate gene associations to diverse ethnicities.
Collapse
Affiliation(s)
- Nuala J Meyer
- Department of Medicine: Pulmonary, Allergy, and Critical Care Division, Perelman School of Medicine University of Pennsylvania, 3600 Spruce Street, 874 Maloney, Philadelphia, PA 19104, USA.
| | | | | | | | | | | | | | | |
Collapse
|
38
|
Chen GK, Chen G, Wei P, DeStefano AL. Incorporating biological information into association studies of sequencing data. Genet Epidemiol 2012; 35 Suppl 1:S29-34. [PMID: 22128055 DOI: 10.1002/gepi.20646] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
We summarize the methodological contributions from Group 3 of Genetic Analysis Workshop 17 (GAW17). The overarching goal of these methods was the evaluation and enhancement of state-of-the-art approaches in integration of biological knowledge into association studies of rare variants. We found that methods loosely fell into three major categories: (1) hypothesis testing of index scores based on aggregating rare variants at the gene level, (2) variable selection techniques that incorporate biological prior information, and (3) novel approaches that integrate external (i.e., not provided by GAW17) prior information, such as pathway and single-nucleotide polymorphism (SNP) annotations. Commonalities among the findings from these contributions are that gene-based analysis of rare variants is advantageous to single-SNP analysis and that the minor allele frequency threshold to identify rare variants may influence power and thus needs to be carefully considered. A consistent increase in power was also identified by considering only nonsynonymous SNPs in the analyses. Overall, we found that no single method had an appreciable advantage over the other methods. However, methods that carried out sensitivity analyses by comparing biologically informative to noninformative prior probabilities demonstrated that integrating biological knowledge into statistical analyses always, at the least, enabled subtle improvements in the performance of any statistical method applied to these simulated data. Although these statistical improvements reflect the simulation model assumed for GAW17, our hope is that the simulation models provide a reasonable representation of the underlying biology and that these methods can thus be of utility in real data.
Collapse
Affiliation(s)
- Gary K Chen
- Division of Biostatistics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
| | | | | | | |
Collapse
|
39
|
Dai Y, Jiang R, Dong J. Weighted selective collapsing strategy for detecting rare and common variants in genetic association study. BMC Genet 2012; 13:7. [PMID: 22309429 PMCID: PMC3296579 DOI: 10.1186/1471-2156-13-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2011] [Accepted: 02/06/2012] [Indexed: 01/12/2023] Open
Abstract
Background Genome-wide association studies (GWAS) have been used successfully in detecting associations between common genetic variants and complex diseases. However, common SNPs detected by current GWAS only explain a small proportion of heritable variability. With the development of next-generation sequencing technologies, researchers find more and more evidence to support the role played by rare variants in heritable variability. However, rare and common variants are often studied separately. The objective of this paper is to develop a robust strategy to analyze association between complex traits and genetic regions using both common and rare variants. Results We propose a weighted selective collapsing strategy for both candidate gene studies and genome-wide association scans. The strategy considers genetic information from both common and rare variants, selectively collapses all variants in a given region by a forward selection procedure, and uses an adaptive weight to favor more likely causal rare variants. Under this strategy, two tests are proposed. One test denoted by BwSC is sensitive to the directions of genetic effects, and it separates the deleterious and protective effects into two components. Another denoted by BwSCd is robust in the directions of genetic effects, and it considers the difference of the two components. In our simulation studies, BwSC achieves a higher power when the casual variants have the same genetic effect, while BwSCd is as powerful as several existing tests when a mixed genetic effect exists. Both of the proposed tests work well with and without the existence of genetic effects from common variants. Conclusions Two tests using a weighted selective collapsing strategy provide potentially powerful methods for association studies of sequencing data. The tests have a higher power when both common and rare variants contribute to the heritable variability and the effect of common variants is not strong enough to be detected by traditional methods. Our simulation studies have demonstrated a substantially higher power for both tests in all scenarios regardless whether the common SNPs are associated with the trait or not.
Collapse
Affiliation(s)
- Yilin Dai
- Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, USA.
| | | | | |
Collapse
|
40
|
Design and Statistical Analysis of Pooled Next Generation Sequencing for Rare Variants. JOURNAL OF PROBABILITY AND STATISTICS 2012. [DOI: 10.1155/2012/524724] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Next generation sequencing (NGS) is a revolutionary technology for biomedical research. One highly cost-efficient application of NGS is to detect disease association based on pooled DNA samples. However, several key issues need to be addressed for pooled NGS. One of them is the high sequencing error rate and its high variability across genomic positions and experiment runs, which, if not well considered in the experimental design and analysis, could lead to either inflated false positive rates or loss in statistical power. Another important issue is how to test association of a group of rare variants. To address the first issue, we proposed a new blocked pooling design in which multiple pools of DNA samples from cases and controls are sequenced together on same NGS functional units. To address the second issue, we proposed a testing procedure that does not require individual genotypes but by taking advantage of multiple DNA pools. Through a simulation study, we demonstrated that our approach provides a good control of the type I error rate, and yields satisfactory power compared to the test-based on individual genotypes. Our results also provide guidelines for designing an efficient pooled.
Collapse
|
41
|
Yi N, Liu N, Zhi D, Li J. Hierarchical generalized linear models for multiple groups of rare and common variants: jointly estimating group and individual-variant effects. PLoS Genet 2011; 7:e1002382. [PMID: 22144906 PMCID: PMC3228815 DOI: 10.1371/journal.pgen.1002382] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2011] [Accepted: 09/29/2011] [Indexed: 12/19/2022] Open
Abstract
Complex diseases and traits are likely influenced by many common and rare genetic variants and environmental factors. Detecting disease susceptibility variants is a challenging task, especially when their frequencies are low and/or their effects are small or moderate. We propose here a comprehensive hierarchical generalized linear model framework for simultaneously analyzing multiple groups of rare and common variants and relevant covariates. The proposed hierarchical generalized linear models introduce a group effect and a genetic score (i.e., a linear combination of main-effect predictors for genetic variants) for each group of variants, and jointly they estimate the group effects and the weights of the genetic scores. This framework includes various previous methods as special cases, and it can effectively deal with both risk and protective variants in a group and can simultaneously estimate the cumulative contribution of multiple variants and their relative importance. Our computational strategy is based on extending the standard procedure for fitting generalized linear models in the statistical software R to the proposed hierarchical models, leading to the development of stable and flexible tools. The methods are illustrated with sequence data in gene ANGPTL4 from the Dallas Heart Study. The performance of the proposed procedures is further assessed via simulation studies. The methods are implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).
Collapse
Affiliation(s)
- Nengjun Yi
- Department of Biostatistics, Section on Statistical Genetics, University of Alabama at Birmingham, Birmingham, Alabama, USA.
| | | | | | | |
Collapse
|
42
|
Powers S, Gopalakrishnan S, Tintle N. Assessing the impact of non-differential genotyping errors on rare variant tests of association. Hum Hered 2011; 72:153-60. [PMID: 22004945 DOI: 10.1159/000332222] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2011] [Accepted: 08/24/2011] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND/AIMS We aim to quantify the effect of non-differential genotyping errors on the power of rare variant tests and identify those situations when genotyping errors are most harmful. METHODS We simulated genotype and phenotype data for a range of sample sizes, minor allele frequencies, disease relative risks and numbers of rare variants. Genotype errors were then simulated using five different error models covering a wide range of error rates. RESULTS Even at very low error rates, misclassifying a common homozygote as a heterozygote translates into a substantial loss of power, a result that is exacerbated even further as the minor allele frequency decreases. While the power loss from heterozygote to common homozygote errors tends to be smaller for a given error rate, in practice heterozygote to homozygote errors are more frequent and, thus, will have measurable impact on power. CONCLUSION Error rates from genotype-calling technology for next-generation sequencing data suggest that substantial power loss may be seen when applying current rare variant tests of association to called genotypes.
Collapse
Affiliation(s)
- Scott Powers
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC, USA
| | | | | |
Collapse
|
43
|
Pan W, Basu S, Shen X. Adaptive tests for detecting gene-gene and gene-environment interactions. Hum Hered 2011; 72:98-109. [PMID: 21934325 DOI: 10.1159/000330632] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2011] [Accepted: 07/02/2011] [Indexed: 12/14/2022] Open
Abstract
There has been an increasing interest in detecting gene-gene and gene-environment interactions in genetic association studies. A major statistical challenge is how to deal with a large number of parameters measuring possible interaction effects, which leads to reduced power of any statistical test due to a large number of degrees of freedom or high cost of adjustment for multiple testing. Hence, a popular idea is to first apply some dimension reduction techniques before testing, while another is to apply only statistical tests that are developed for and robust to high-dimensional data. To combine both ideas, we propose applying an adaptive sum of squared score (SSU) test and several other adaptive tests. These adaptive tests are extensions of the adaptive Neyman test [Fan, 1996], which was originally proposed for high-dimensional data, providing a simple and effective way for dimension reduction. On the other hand, the original SSU test coincides with a version of a test specifically developed for high-dimensional data. We apply these adaptive tests and their original nonadaptive versions to simulated data to detect interactions between two groups of SNPs (e.g. multiple SNPs in two candidate regions). We found that for sparse models (i.e. with only few non-zero interaction parameters), the adaptive SSU test and its close variant, an adaptive version of the weighted sum of squared score (SSUw) test, improved the power over their non-adaptive versions, and performed consistently well across various scenarios. The proposed adaptive tests are built in the general framework of regression analysis, and can thus be applied to various types of traits in the presence of covariates.
Collapse
Affiliation(s)
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, USA. weip @ biostat.umn.edu
| | | | | |
Collapse
|
44
|
Zhang Q, Irvin MR, Arnett DK, Province MA, Borecki I. A data-driven method for identifying rare variants with heterogeneous trait effects. Genet Epidemiol 2011; 35:679-85. [PMID: 21818776 DOI: 10.1002/gepi.20618] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2011] [Revised: 06/14/2011] [Accepted: 06/15/2011] [Indexed: 11/07/2022]
Abstract
Collapsing multiple variants into one variable and testing their collective effect is a useful strategy for rare variant association analysis. Direct collapsing, however, is not valid or may significantly lose power when a group of variants to be collapsed have heterogeneous effects on target traits (i.e. some positive and some negative). This could be especially true for quantitative traits (such as blood pressure and body mass index), regardless of whether subjects are sampled randomly from a population or selectively from two extreme tails of the trait distribution. To deal with this problem, we propose a novel, data-driven method, the P-value Weighted Sum Test (PWST), which allows each variant to be individually weighted according to the evidence of association from the data itself. Specifically, both significance and direction of individual variant effects are used to calculate a single weighted sum score based on rescaled left-tail P-values from single-variant analysis, after which a permutation test of association is performed between the score and the trait. Our simulation under different sampling strategies shows that PWST significantly increases statistical power when there are heterogeneous variant effects. The appeal of the PWST approach is illustrated in an application to sequence data by detecting the collective effect of variants in the peroxisome proliferator-activated receptor alpha (PPARα) gene on triglycerides (TG) response to fenofibrate treatment from 300 subjects in the Genetics of Lipid Lowering and Diet Network study.
Collapse
Affiliation(s)
- Qunyuan Zhang
- Division of Statistical Genomics, Washington University School of Medicine, St. Louis, Missouri 63108, USA.
| | | | | | | | | |
Collapse
|
45
|
Basu S, Pan W. Comparison of statistical tests for disease association with rare variants. Genet Epidemiol 2011; 35:606-19. [PMID: 21769936 DOI: 10.1002/gepi.20609] [Citation(s) in RCA: 188] [Impact Index Per Article: 13.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2010] [Revised: 03/23/2011] [Accepted: 06/03/2011] [Indexed: 01/31/2023]
Abstract
In anticipation of the availability of next-generation sequencing data, there is increasing interest in investigating association between complex traits and rare variants (RVs). In contrast to association studies for common variants (CVs), due to the low frequencies of RVs, common wisdom suggests that existing statistical tests for CVs might not work, motivating the recent development of several new tests for analyzing RVs, most of which are based on the idea of pooling/collapsing RVs. However, there is a lack of evaluations of, and thus guidance on the use of, existing tests. Here we provide a comprehensive comparison of various statistical tests using simulated data. We consider both independent and correlated rare mutations, and representative tests for both CVs and RVs. As expected, if there are no or few non-causal (i.e. neutral or non-associated) RVs in a locus of interest while the effects of causal RVs on the trait are all (or mostly) in the same direction (i.e. either protective or deleterious, but not both), then the simple pooled association tests (without selecting RVs and their association directions) and a new test called kernel-based adaptive clustering (KBAC) perform similarly and are most powerful; KBAC is more robust than simple pooled association tests in the presence of non-causal RVs; however, as the number of non-causal CVs increases and/or in the presence of opposite association directions, the winners are two methods originally proposed for CVs and a new test called C-alpha test proposed for RVs, each of which can be regarded as testing on a variance component in a random-effects model. Interestingly, several methods based on sequential model selection (i.e. selecting causal RVs and their association directions), including two new methods proposed here, perform robustly and often have statistical power between those of the above two classes.
Collapse
Affiliation(s)
- Saonli Basu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455-0392, USA
| | | |
Collapse
|