1601
Zheng HF. An example design of large-scale next-generation sequencing study for bone mineral density. Bonekey Rep 2013. [DOI: 10.1038/bonekey.2013.132] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
1602
Clarke GM, Rivas MA, Morris AP. A flexible approach for the analysis of rare variants allowing for a mixture of effects on binary or quantitative traits. PLoS Genet 2013; 9:e1003694. [PMID: 23966874 PMCID: PMC3744430 DOI: 10.1371/journal.pgen.1003694] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 12/21/2012] [Accepted: 06/19/2013] [Indexed: 11/18/2022]
Abstract
Multiple rare variants either within or across genes have been hypothesised to collectively influence complex human traits. The increasing availability of high throughput sequencing technologies offers the opportunity to study the effect of rare variants on these traits. However, appropriate and computationally efficient analytical methods are required to account for collections of rare variants that display a combination of protective, deleterious and null effects on the trait. We have developed a novel method for the analysis of rare genetic variation in a gene, region or pathway that, by simply aggregating summary statistics at each variant, can: (i) test for the presence of a mixture of effects on a trait; (ii) be applied to both binary and quantitative traits in population-based and family-based data; (iii) adjust for covariates to allow for non-genetic risk factors; and (iv) incorporate imputed genetic variation. In addition, for preliminary identification of promising genes, the method can be applied to association summary statistics, available from meta-analysis of published data, for example, without the need for individual level genotype data. Through simulation, we show that our method is immune to the presence of bi-directional effects, with no apparent loss in power across a range of different mixtures, and can achieve greater power than existing approaches as long as summary statistics at each variant are robust. We apply our method to investigate association of type-1 diabetes with imputed rare variants within genes in the major histocompatibility complex using genotype data from the Wellcome Trust Case Control Consortium.
Affiliation(s)
- Geraldine M Clarke
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom.
1603
Hu YJ, Berndt S, Gustafsson S, Ganna A, Hirschhorn J, North KE, Ingelsson E, Lin DY; Genetic Investigation of ANthropometric Traits (GIANT) Consortium. Meta-analysis of gene-level associations for rare variants based on single-variant statistics. Am J Hum Genet 2013; 93:236-48. [PMID: 23891470 DOI: 10.1016/j.ajhg.2013.06.011] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.7] [Received: 04/30/2013] [Revised: 06/05/2013] [Accepted: 06/12/2013] [Indexed: 02/02/2023]
Abstract
Meta-analysis of genome-wide association studies (GWASs) has led to the discoveries of many common variants associated with complex human diseases. There is a growing recognition that identifying "causal" rare variants also requires large-scale meta-analysis. The fact that association tests with rare variants are performed at the gene level rather than at the variant level poses unprecedented challenges in the meta-analysis. First, different studies may adopt different gene-level tests, so the results are not compatible. Second, gene-level tests require multivariate statistics (i.e., components of the test statistic and their covariance matrix), which are difficult to obtain. To overcome these challenges, we propose to perform gene-level tests for rare variants by combining the results of single-variant analysis (i.e., p values of association tests and effect estimates) from participating studies. This simple strategy is possible because of an insight that multivariate statistics can be recovered from single-variant statistics, together with the correlation matrix of the single-variant test statistics, which can be estimated from one of the participating studies or from a publicly available database. We show both theoretically and numerically that the proposed meta-analysis approach provides accurate control of the type I error and is as powerful as joint analysis of individual participant data. This approach accommodates any disease phenotype and any study design and produces all commonly used gene-level tests. An application to the GWAS summary results of the Genetic Investigation of ANthropometric Traits (GIANT) consortium reveals rare and low-frequency variants associated with human height. The relevant software is freely available.
1604
Fang YH, Chiu YF. A novel support vector machine-based approach for rare variant detection. PLoS One 2013; 8:e71114. [PMID: 23940698 PMCID: PMC3737136 DOI: 10.1371/journal.pone.0071114] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/11/2013] [Accepted: 06/24/2013] [Indexed: 01/06/2023]
Abstract
Advances in next-generation sequencing technologies have enabled the identification of multiple rare single nucleotide polymorphisms involved in diseases or traits. Several strategies for identifying rare variants that contribute to disease susceptibility have recently been proposed. An important feature of many of these statistical methods is the pooling or collapsing of multiple rare single nucleotide variants to achieve a reasonably high frequency and effect. However, if the pooled rare variants are associated with the trait in different directions, then the pooling may weaken the signal, thereby reducing its statistical power. In the present paper, we propose a backward support vector machine (BSVM)-based variant selection procedure to identify informative disease-associated rare variants. In the selection procedure, the rare variants are weighted and collapsed according to their positive or negative associations with the disease, which may be associated with common variants and rare variants with protective, deleterious, or neutral effects. This nonparametric variant selection procedure is able to account for confounding factors and can also be adopted in other regression frameworks. The results of a simulation study and a data example show that the proposed BSVM approach is more powerful than four other approaches under the considered scenarios, while maintaining valid type I errors.
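The cancellation problem this abstract describes can be seen in a small toy example. The sketch below is not the BSVM procedure; it only compares an unsigned burden score with a signed one on made-up genotypes, and the function name `burden_difference` and all data are hypothetical.

```python
# Toy illustration (hypothetical data): pooling rare variants with opposite
# effect directions can cancel the association signal, while signing the
# weights by direction of effect preserves it.

def mean(values):
    return sum(values) / len(values)

def burden_difference(genotypes, status, weights):
    """Mean weighted burden in cases minus mean weighted burden in controls."""
    scores = [sum(w * g for w, g in zip(weights, row)) for row in genotypes]
    cases = [s for s, y in zip(scores, status) if y == 1]
    controls = [s for s, y in zip(scores, status) if y == 0]
    return mean(cases) - mean(controls)

# Rows are individuals; columns are minor-allele counts at two rare variants.
genotypes = [
    [1, 0], [1, 0], [0, 0], [0, 0],  # cases: two carry variant 0 (deleterious)
    [0, 1], [0, 1], [0, 0], [0, 0],  # controls: two carry variant 1 (protective)
]
status = [1, 1, 1, 1, 0, 0, 0, 0]

unsigned = burden_difference(genotypes, status, weights=[1, 1])
signed = burden_difference(genotypes, status, weights=[1, -1])
print(unsigned, signed)
```

With equal weights the case and control burdens are identical, so the pooled score carries no signal; giving the protective variant a negative weight restores the difference. In practice the direction of each variant is unknown and has to be estimated, which is the step a selection procedure such as the one described above addresses.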
Affiliation(s)
- Yao-Hwei Fang
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Miaoli County, Taiwan, ROC
- Yen-Feng Chiu
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Miaoli County, Taiwan, ROC
1605
Panoutsopoulou K, Tachmazidou I, Zeggini E. In search of low-frequency and rare variants affecting complex traits. Hum Mol Genet 2013; 22:R16-21. [PMID: 23922232 PMCID: PMC3782074 DOI: 10.1093/hmg/ddt376] [Citation(s) in RCA: 64] [Impact Index Per Article: 5.3] [Indexed: 11/13/2022]
Abstract
The allelic architecture of complex traits is likely to be underpinned by a combination of multiple common frequency and rare variants. Targeted genotyping arrays and next-generation sequencing technologies at the whole-genome sequencing (WGS) and whole-exome scales (WES) are increasingly employed to access sequence variation across the full minor allele frequency (MAF) spectrum. Different study design strategies that make use of diverse technologies, imputation and sample selection approaches are an active target of development and evaluation efforts. Initial insights into the contribution of rare variants in common diseases and medically relevant quantitative traits point to low-frequency and rare alleles acting either independently or in aggregate and in several cases alongside common variants. Studies conducted in population isolates have been successful in detecting rare variant associations with complex phenotypes. Statistical methodologies that enable the joint analysis of rare variants across regions of the genome continue to evolve with current efforts focusing on incorporating information such as functional annotation, and on the meta-analysis of these burden tests. In addition, population stratification, defining genome-wide statistical significance thresholds and the design of appropriate replication experiments constitute important considerations for the powerful analysis and interpretation of rare variant association studies. Progress in addressing these emerging challenges and the accrual of sufficiently large data sets are poised to help the field of complex trait genetics enter a promising era of discovery.
Affiliation(s)
- Eleftheria Zeggini
- To whom correspondence should be addressed at: Wellcome Trust Sanger Institute, The Morgan Building, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1HH, UK. Tel: +44-1223496868; Fax: +44-1223496826
1606
Bromberg Y. Building a genome analysis pipeline to predict disease risk and prevent disease. J Mol Biol 2013; 425:3993-4005. [PMID: 23928561 DOI: 10.1016/j.jmb.2013.07.038] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 06/03/2013] [Revised: 07/26/2013] [Accepted: 07/28/2013] [Indexed: 12/24/2022]
Abstract
Reduced costs and increased speed and accuracy of sequencing can bring the genome-based evaluation of individual disease risk to the bedside. While past efforts have identified a number of actionable mutations, the bulk of genetic risk remains hidden in sequence data. The biggest challenge facing genomic medicine today is the development of new techniques to predict the specifics of a given human phenome (set of all expressed phenotypes) encoded by each individual variome (full set of genome variants) in the context of the given environment. Numerous tools exist for the computational identification of the functional effects of a single variant. However, the pipelines taking advantage of full genomic, exomic, transcriptomic (and other) sequences have only recently become a reality. This review looks at the building of methodologies for predicting "variome"-defined disease risk. It also discusses some of the challenges for incorporating such a pipeline into everyday medical practice.
Affiliation(s)
- Y Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, 76 Lipman Drive, New Brunswick, NJ 08873, USA.
1607
Liang F, Xiong M. Bayesian detection of causal rare variants under posterior consistency. PLoS One 2013; 8:e69633. [PMID: 23922764 PMCID: PMC3724943 DOI: 10.1371/journal.pone.0069633] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 03/15/2013] [Accepted: 06/12/2013] [Indexed: 12/17/2022]
Abstract
Identification of causal rare variants that are associated with complex traits poses a central challenge for genome-wide association studies. However, most current research focuses only on testing the global association, that is, whether the rare variants in a given genomic region are collectively associated with the trait. Although some recent work, e.g., the Bayesian risk index method, has tried to address this problem, it is unclear whether the causal rare variants can be consistently identified by such methods in the small-n-large-P situation. We develop a new Bayesian method, the so-called Bayesian Rare Variant Detector (BRVD), to tackle this problem. The new method simultaneously addresses two issues: (i) (Global association test) Are any of the variants associated with the disease? and (ii) (Causal variant detection) Which variants, if any, are driving the association? The BRVD ensures that the causal rare variants are consistently identified in the small-n-large-P situation by imposing appropriate prior distributions on the model and model-specific parameters. The numerical results indicate that the BRVD is more powerful for testing the global association than the existing methods, such as the combined multivariate and collapsing test, weighted sum statistic test, RARECOVER, sequence kernel association test, and Bayesian risk index, and also more powerful for identification of causal rare variants than the Bayesian risk index method. The BRVD has also been successfully applied to the Early-Onset Myocardial Infarction (EOMI) Exome Sequence Data. It identified a few causal rare variants that have been verified in the literature.
Affiliation(s)
- Faming Liang
- Department of Statistics, Texas A&M University, College Station, Texas, United States of America.
1608
Larson NB, Schaid DJ. A kernel regression approach to gene-gene interaction detection for case-control studies. Genet Epidemiol 2013; 37:695-703. [PMID: 23868214 DOI: 10.1002/gepi.21749] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 01/18/2013] [Revised: 05/07/2013] [Accepted: 06/12/2013] [Indexed: 01/13/2023]
Abstract
Gene-gene interactions are increasingly being addressed as a potentially important contributor to the variability of complex traits. Consequently, attentions have moved beyond single locus analysis of association to more complex genetic models. Although several single-marker approaches toward interaction analysis have been developed, such methods suffer from very high testing dimensionality and do not take advantage of existing information, notably the definition of genes as functional units. Here, we propose a comprehensive family of gene-level score tests for identifying genetic elements of disease risk, in particular pairwise gene-gene interactions. Using kernel machine methods, we devise score-based variance component tests under a generalized linear mixed model framework. We conducted simulations based upon coalescent genetic models to evaluate the performance of our approach under a variety of disease models. These simulations indicate that our methods are generally higher powered than alternative gene-level approaches and at worst competitive with exhaustive SNP-level (where SNP is single-nucleotide polymorphism) analyses. Furthermore, we observe that simulated epistatic effects resulted in significant marginal testing results for the involved genes regardless of whether or not true main effects were present. We detail the benefits of our methods and discuss potential genome-wide analysis strategies for gene-gene interaction analysis in a case-control study design.
Affiliation(s)
- Nicholas B Larson
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
1609
Rivas MA, Pirinen M, Neville MJ, Gaulton KJ, Moutsianas L, Lindgren CM, Karpe F, McCarthy MI, Donnelly P. Assessing association between protein truncating variants and quantitative traits. Bioinformatics 2013; 29:2419-26. [PMID: 23860716 PMCID: PMC3777107 DOI: 10.1093/bioinformatics/btt409] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 01/21/2023]
Abstract
MOTIVATION: In sequencing studies of common diseases and quantitative traits, power to test rare and low frequency variants individually is weak. To improve power, a common approach is to combine statistical evidence from several genetic variants in a region. Major challenges are how to do the combining and which statistical framework to use. General approaches for testing association between rare variants and quantitative traits include aggregating genotypes and trait values, referred to as 'collapsing', or using a score-based variance component test. However, little attention has been paid to alternative models tailored for protein truncating variants. Recent studies have highlighted the important role that protein truncating variants, commonly referred to as 'loss of function' variants, may have on disease susceptibility and quantitative levels of biomarkers. We propose a Bayesian modelling framework for the analysis of protein truncating variants and quantitative traits. RESULTS: Our simulation results show that our models have an advantage over the commonly used methods. We apply our models to sequence and exome-array data and discover strong evidence of association between low plasma triglyceride levels and protein truncating variants at APOC3 (Apolipoprotein C3). AVAILABILITY: Software is available from http://www.well.ox.ac.uk/~rivas/mamba
Affiliation(s)
- Manuel A Rivas
- Wellcome Trust Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford OX3 7BN, UK, Institute for Molecular Medicine Finland, University of Helsinki, Helsinki 00290, Finland, Oxford Centre for Diabetes, Endocrinology and Metabolism, Radcliffe Department of Medicine, Oxford OX3 7LJ, UK, NIHR Oxford Biomedical Research Centre, OUH Trust, Oxford OX3 7LE, UK and Department of Statistics, University of Oxford, Oxford OX1 3TG, UK
1610
Lee S, Teslovich T, Boehnke M, Lin X. General framework for meta-analysis of rare variants in sequencing association studies. Am J Hum Genet 2013; 93:42-53. [PMID: 23768515 PMCID: PMC3710762 DOI: 10.1016/j.ajhg.2013.05.010] [Citation(s) in RCA: 169] [Impact Index Per Article: 14.1] [Received: 02/04/2013] [Revised: 04/19/2013] [Accepted: 05/14/2013] [Indexed: 12/22/2022]
Abstract
We propose a general statistical framework for meta-analysis of gene- or region-based multimarker rare variant association tests in sequencing association studies. In genome-wide association studies, single-marker meta-analysis has been widely used to increase statistical power by combining results via regression coefficients and standard errors from different studies. In analysis of rare variants in sequencing studies, region-based multimarker tests are often used to increase power. We propose meta-analysis methods for commonly used gene- or region-based rare variants tests, such as burden tests and variance component tests. Because estimation of regression coefficients of individual rare variants is often unstable or not feasible, the proposed method avoids this difficulty by calculating score statistics instead that only require fitting the null model for each study and then aggregating these score statistics across studies. Our proposed meta-analysis rare variant association tests are conducted based on study-specific summary statistics, specifically score statistics for each variant and between-variant covariance-type (linkage disequilibrium) relationship statistics for each gene or region. The proposed methods are able to incorporate different levels of heterogeneity of genetic effects across studies and are applicable to meta-analysis of multiple ancestry groups. We show that the proposed methods are essentially as powerful as joint analysis by directly pooling individual level genotype data. We conduct extensive simulations to evaluate the performance of our methods by varying levels of heterogeneity across studies, and we apply the proposed methods to meta-analysis of rare variant effects in a multicohort study of the genetics of blood lipid levels.
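As a rough illustration of the idea of aggregating per-study score statistics, the sketch below sums score vectors and their covariance matrices across studies and forms a weighted burden statistic with one degree of freedom. It assumes homogeneous effects across studies and is not the authors' software; the function name `meta_burden_test` and the numbers are hypothetical, and variance component tests are not shown.

```python
import math

def meta_burden_test(score_vectors, covariance_matrices, weights):
    """Burden test from per-study summary statistics.

    score_vectors: one list of per-variant score statistics per study.
    covariance_matrices: one m x m covariance matrix of the scores per study.
    weights: per-variant weights (length m).
    Returns (statistic, p_value); the statistic is compared with chi-square(1).
    """
    m = len(weights)
    # Sum scores and covariances over studies (homogeneous-effects assumption).
    total_score = [sum(u[j] for u in score_vectors) for j in range(m)]
    total_cov = [[sum(v[i][j] for v in covariance_matrices) for j in range(m)]
                 for i in range(m)]
    numerator = sum(w * u for w, u in zip(weights, total_score))
    variance = sum(weights[i] * total_cov[i][j] * weights[j]
                   for i in range(m) for j in range(m))
    statistic = numerator ** 2 / variance
    # Upper tail of a chi-square with 1 df: P(X > q) = erfc(sqrt(q / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Two hypothetical studies, two variants in the gene.
scores = [[1.0, 2.0], [2.0, 1.0]]
covariances = [[[2.0, 0.5], [0.5, 1.0]], [[1.0, 0.5], [0.5, 2.0]]]
stat, p = meta_burden_test(scores, covariances, weights=[1.0, 1.0])
print(stat, p)
```

Only the null model has to be fitted in each study to obtain the scores and their covariance, which is why this kind of aggregation avoids estimating unstable per-variant regression coefficients.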
Affiliation(s)
- Seunggeun Lee
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA
- Tanya M. Teslovich
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
- Michael Boehnke
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
- Xihong Lin
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA
1611
Quantitative trait analysis in sequencing studies under trait-dependent sampling. Proc Natl Acad Sci U S A 2013; 110:12247-52. [PMID: 23847208 DOI: 10.1073/pnas.1221713110] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Indexed: 12/13/2022]
Abstract
It is not economically feasible to sequence all study subjects in a large cohort. A cost-effective strategy is to sequence only the subjects with the extreme values of a quantitative trait. In the National Heart, Lung, and Blood Institute Exome Sequencing Project, subjects with the highest or lowest values of body mass index, LDL, or blood pressure were selected for whole-exome sequencing. Failure to account for such trait-dependent sampling can cause severe inflation of type I error and substantial loss of power in quantitative trait analysis, especially when combining results from multiple studies with different selection criteria. We present valid and efficient statistical methods for association analysis of sequencing data under trait-dependent sampling. We pay special attention to gene-based analysis of rare variants. Our methods can be used to perform quantitative trait analysis not only for the trait that is used to select subjects for sequencing but for any other traits that are measured. For a particular trait of interest, our approach properly combines the association results from all studies with measurements of that trait. This meta-analysis is substantially more powerful than the analysis of any single study. By contrast, meta-analysis of standard linear regression results (ignoring trait-dependent sampling) can be less powerful than the analysis of a single study. The advantages of the proposed methods are demonstrated through simulation studies and the National Heart, Lung, and Blood Institute Exome Sequencing Project data. The methods are applicable to other types of genetic association studies and nongenetic studies.
1612
Talluri R, Shete S. A linkage disequilibrium-based approach to selecting disease-associated rare variants. PLoS One 2013; 8:e69226. [PMID: 23874919 PMCID: PMC3708889 DOI: 10.1371/journal.pone.0069226] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 01/24/2013] [Accepted: 06/06/2013] [Indexed: 11/18/2022]
Abstract
Rare variants have increasingly been cited as major contributors in the disease etiology of several complex disorders. Recently, several approaches have been proposed for analyzing the association of rare variants with disease. These approaches include collapsing rare variants, summing rare variant test statistics within a particular locus to improve power, and selecting a subset of rare variants for association testing, e.g., the step-up approach. We found that (a) if the variants being pooled are in linkage disequilibrium, the standard step-up method of selecting the best subset of variants results in loss of power compared to a model that pools all rare variants and (b) if the variants are in linkage equilibrium, performing a subset selection using step-based selection methods results in a gain of power of association compared to a model that pools all rare variants. Therefore, we propose an approach to selecting the best subset of variants to include in the model that is based on the linkage disequilibrium pattern among the rare variants. The proposed linkage disequilibrium-based variant selection model is flexible and borrows strength from the model that pools all rare variants when the rare variants are in linkage disequilibrium and from step-based selection methods when the variants are in linkage equilibrium. We performed simulations under three different realistic scenarios based on: (1) the HapMap3 dataset of the DRD2 gene, and CHRNA3/A5/B4 gene cluster (2) the block structure of linkage disequilibrium, and (3) linkage equilibrium. We proposed a permutation-based approach to control the type 1 error rate. The power comparisons after controlling the type 1 error show that the proposed linkage disequilibrium-based subset selection approach is an attractive alternative method for subset selection of rare variants.
Affiliation(s)
- Rajesh Talluri
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Sanjay Shete
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
1613
Sverdlov S, Thompson EA. Correlation between relatives given complete genotypes: from identity by descent to identity by function. Theor Popul Biol 2013; 88:57-67. [PMID: 23851163 DOI: 10.1016/j.tpb.2013.06.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/27/2012] [Revised: 04/18/2013] [Accepted: 06/12/2013] [Indexed: 02/06/2023]
Abstract
In classical quantitative genetics, the correlation between the phenotypes of individuals with unknown genotypes and a known pedigree relationship is expressed in terms of probabilities of IBD states. In existing approaches to the inverse problem where genotypes are observed but pedigree relationships are not, dependence between phenotypes is either modeled as Bayesian uncertainty or mapped to an IBD model via inferred relatedness parameters. Neither approach yields a relationship between genotypic similarity and phenotypic similarity with a probabilistic interpretation corresponding to a generative model. We introduce a generative model for diploid allele effect based on the classic infinite allele mutation process. This approach motivates the concept of IBF (Identity by Function). The phenotypic covariance between two individuals given their diploid genotypes is expressed in terms of functional identity states. The IBF parameters define a genetic architecture for a trait without reference to specific alleles or population. Given full genome sequences, we treat a gene-scale functional region, rather than a SNP, as a QTL, modeling patterns of dominance for multiple alleles. Applications demonstrated by simulation include phenotype and effect prediction and association, and estimation of heritability and classical variance components. A simulation case study of the Missing Heritability problem illustrates a decomposition of heritability under the IBF framework into Explained and Unexplained components.
Affiliation(s)
- Serge Sverdlov
- Department of Statistics, University of Washington, Box 354322, Seattle, WA 98195, USA.
1614
Hendricks AE, Dupuis J, Logue MW, Myers RH, Lunetta KL. Correction for multiple testing in a gene region. Eur J Hum Genet 2013; 22:414-8. [PMID: 23838599 DOI: 10.1038/ejhg.2013.144] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Received: 05/08/2012] [Revised: 04/27/2013] [Accepted: 05/11/2013] [Indexed: 11/09/2022]
Abstract
Several methods to correct for multiple testing within a gene region have been proposed. These methods are useful for candidate gene studies and for fine-mapping gene regions from GWAS. The Bonferroni correction and permutation are common adjustments, but are overly conservative and computationally intensive, respectively. Other options include calculating the effective number of independent single-nucleotide polymorphisms (SNPs) or using theoretical approximations. Here, we compare a theoretical approximation based on extreme tail theory with four methods for calculating the effective number of independent SNPs. We evaluate the type-I error rates of these methods using single SNP association tests over 10 gene regions simulated using 1000 Genomes data. Overall, we find that the effective number of independent SNP method by Gao et al., as well as extreme tail theory, produce type-I error rates at or close to the chosen significance level. The type-I error rates for the other effective number of independent SNP methods vary by gene region characteristics. We find Gao et al. and extreme tail theory to be efficient alternatives to more computationally intensive approaches to control for multiple testing in gene regions.
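As a rough illustration of why an effective number of independent SNPs matters, the sketch below compares per-test significance thresholds for a raw SNP count and for a smaller effective count. The function names and the numbers (40 SNPs, an effective count of 12) are hypothetical and are not taken from the paper. Estimating the effective count itself, for example with the Gao et al. method, is not shown.

```python
def bonferroni_threshold(alpha: float, n_tests: float) -> float:
    """Per-test significance level under the Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def sidak_threshold(alpha: float, n_tests: float) -> float:
    """Per-test level under the Sidak correction (assumes independent tests)."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


if __name__ == "__main__":
    alpha = 0.05
    n_snps = 40   # hypothetical number of SNPs tested in the region
    m_eff = 12    # hypothetical effective number of independent SNPs
    print("Bonferroni, raw count:      ", bonferroni_threshold(alpha, n_snps))
    print("Bonferroni, effective count:", bonferroni_threshold(alpha, m_eff))
    print("Sidak, effective count:     ", sidak_threshold(alpha, m_eff))
```

Because SNPs in a region are correlated, dividing by the raw count is stricter than necessary. Substituting a smaller effective count relaxes the threshold without the cost of permutation.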
Affiliation(s)
- Audrey E Hendricks
- Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
- Josée Dupuis
- [1] Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA [2] Bioinformatics Program, Boston University, Boston, MA, USA
- Mark W Logue
- [1] Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA [2] Department of Biomedical Genetics, Boston University School of Medicine, Boston, MA, USA
- Richard H Myers
- Department of Neurology, Boston University School of Medicine, Boston, MA, USA
- Kathryn L Lunetta
- Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
|
1615
|
Ayers KL, Cordell HJ. Identification of grouped rare and common variants via penalized logistic regression. Genet Epidemiol 2013; 37:592-602. [PMID: 23836590 PMCID: PMC3842118 DOI: 10.1002/gepi.21746] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2012] [Revised: 05/24/2013] [Accepted: 05/24/2013] [Indexed: 11/09/2022]
Abstract
In spite of the success of genome-wide association studies in finding many common variants associated with disease, these variants seem to explain only a small proportion of the estimated heritability. Data collection has turned toward exome and whole genome sequencing, but it is well known that single marker methods frequently used for common variants have low power to detect rare variants associated with disease, even with very large sample sizes. In response, a variety of methods have been developed that attempt to cluster rare variants so that they may gather strength from one another under the premise that there may be multiple causal variants within a gene. Most of these methods group variants by gene or proximity, and test one gene or marker window at a time. We propose a penalized regression method (PeRC) that analyzes all genes at once, allowing grouping of all (rare and common) variants within a gene, along with subgrouping of the rare variants, thus borrowing strength from both rare and common variants within the same gene. The method can incorporate either a burden-based weighting of the rare variants or one in which the weights are data driven. In simulations, our method performs favorably when compared to many previously proposed approaches, including its predecessor, the sparse group lasso [Friedman et al., 2010].
Affiliation(s)
- Kristin L Ayers
- Institute of Genetic Medicine, Newcastle University, Newcastle upon Tyne NE1 3BZ, United Kingdom.
|
1616
|
Hu H, Huff CD, Moore B, Flygare S, Reese MG, Yandell M. VAAST 2.0: improved variant classification and disease-gene identification using a conservation-controlled amino acid substitution matrix. Genet Epidemiol 2013; 37:622-34. [PMID: 23836555 PMCID: PMC3791556 DOI: 10.1002/gepi.21743] [Citation(s) in RCA: 99] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2013] [Revised: 04/09/2013] [Accepted: 05/21/2013] [Indexed: 12/03/2022]
Abstract
The need for improved algorithmic support for variant prioritization and disease-gene identification in personal genome data is widely acknowledged. We previously presented the Variant Annotation, Analysis, and Search Tool (VAAST), which employs an aggregative variant association test that combines both amino acid substitution (AAS) and allele frequencies. Here we describe and benchmark VAAST 2.0, which uses a novel conservation-controlled AAS matrix (CASM) to incorporate information about phylogenetic conservation. We show that the CASM approach improves VAAST’s variant prioritization accuracy compared to its previous implementation, and compared to SIFT, PolyPhen-2, and MutationTaster. We also show that VAAST 2.0 outperforms KBAC, WSS, SKAT, and variable threshold (VT) using published case-control datasets for Crohn disease (NOD2), hypertriglyceridemia (LPL), and breast cancer (CHEK2). VAAST 2.0 also improves search accuracy on simulated datasets across a wide range of allele frequencies, population-attributable disease risks, and allelic heterogeneity, factors that compromise the accuracies of other aggregative variant association tests. We also demonstrate that, although most aggregative variant association tests are designed for common genetic diseases, these tests can be easily adopted as rare Mendelian disease-gene finders with a simple ranking-by-statistical-significance protocol, and the performance compares very favorably to state-of-the-art filtering approaches. The latter, despite their popularity, have suboptimal performance, especially as case sample size increases.
Affiliation(s)
- Hao Hu
- Department of Epidemiology, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, USA
|
1617
|
Jiao S, Hsu L, Bézieau S, Brenner H, Chan AT, Chang-Claude J, Le Marchand L, Lemire M, Newcomb PA, Slattery ML, Peters U. SBERIA: set-based gene-environment interaction test for rare and common variants in complex diseases. Genet Epidemiol 2013; 37:452-64. [PMID: 23720162 PMCID: PMC3713231 DOI: 10.1002/gepi.21735] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2013] [Revised: 04/04/2013] [Accepted: 04/30/2013] [Indexed: 01/28/2023]
Abstract
Identification of gene-environment interaction (G × E) is important in understanding the etiology of complex diseases. However, partially due to the lack of power, there have been very few replicated G × E findings compared to the success in marginal association studies. The existing G × E testing methods mainly focus on improving the power for individual markers. In this paper, we took a different strategy and proposed a set-based gene-environment interaction test (SBERIA), which can improve the power by reducing the multiple testing burdens and aggregating signals within a set. The major challenge of the signal aggregation within a set is how to tell signals from noise and how to determine the direction of the signals. SBERIA takes advantage of the established correlation screening for G × E to guide the aggregation of genotypes within a marker set. The correlation screening has been shown to be an efficient way of selecting potential G × E candidate SNPs in case-control studies for complex diseases. Importantly, the correlation screening in case-control combined samples is independent of the interaction test. With this desirable feature, SBERIA maintains the correct type I error level and can be easily implemented in a regular logistic regression setting. We showed that SBERIA had higher power than benchmark methods in various simulation scenarios, both for common and rare variants. We also applied SBERIA to real genome-wide association studies (GWAS) data of 10,729 colorectal cancer cases and 13,328 controls and found evidence of interaction between the set of known colorectal cancer susceptibility loci and smoking.
Affiliation(s)
- Shuo Jiao
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, USA.
|
1618
|
Huang YT, Lin X. Gene set analysis using variance component tests. BMC Bioinformatics 2013; 14:210. [PMID: 23806107 PMCID: PMC3776447 DOI: 10.1186/1471-2105-14-210] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2013] [Accepted: 05/10/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Gene set analyses have become increasingly important in genomic research, as many complex diseases arise jointly from alterations of numerous genes. Genes often coordinate together as a functional repertoire, e.g., a biological pathway/network, and are highly correlated. However, most of the existing gene set analysis methods do not fully account for the correlation among the genes. Here we propose to tackle this important feature of a gene set to improve statistical power in gene set analyses. RESULTS We propose to model the effects of an independent variable, e.g., exposure/biological status (yes/no), on multiple gene expression values in a gene set using a multivariate linear regression model, where the correlation among the genes is explicitly modeled using a working covariance matrix. We develop TEGS (Test for the Effect of a Gene Set), a variance component test for the gene set effects by assuming a common distribution for regression coefficients in multivariate linear regression models, and calculate the p-values using permutation and a scaled chi-square approximation. We show using simulations that type I error is protected under different choices of working covariance matrices and power is improved as the working covariance approaches the true covariance. The global test is a special case of TEGS when correlation among genes in a gene set is ignored. Using both simulation data and a published diabetes dataset, we show that our test outperforms the commonly used approaches, the global test and gene set enrichment analysis (GSEA). CONCLUSION We develop a gene set analysis method (TEGS) under the multivariate regression framework, which directly models the interdependence of the expression values in a gene set using a working covariance. TEGS outperforms two widely used methods, GSEA and the global test, in both simulations and a diabetes microarray dataset.
Affiliation(s)
- Yen-Tsung Huang
- Department of Epidemiology, Brown University, 121 South Main Street, Providence, RI 02912, USA
|
1619
|
Ma C, Blackwell T, Boehnke M, Scott LJ. Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants. Genet Epidemiol 2013; 37:539-50. [PMID: 23788246 DOI: 10.1002/gepi.21742] [Citation(s) in RCA: 105] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2013] [Revised: 05/12/2013] [Accepted: 05/20/2013] [Indexed: 11/05/2022]
Abstract
In genome-wide association studies of binary traits, investigators typically use logistic regression to test common variants for disease association within studies, and combine association results across studies using meta-analysis. For common variants, logistic regression tests are well calibrated, and meta-analysis of study-specific association results is only slightly less powerful than joint analysis of the combined individual-level data. In recent sequencing and dense chip based association studies, investigators increasingly test low-frequency variants for disease association. In this paper, we seek to (1) identify the association test with maximal power among tests with well controlled type I error rate and (2) compare the relative power of joint and meta-analysis tests. We use analytic calculation and simulation to compare the empirical type I error rate and power of four logistic regression based tests: Wald, score, likelihood ratio, and Firth bias-corrected. We demonstrate for low-count variants (roughly minor allele count [MAC] < 400) that: (1) for joint analysis, the Firth test has the best combination of type I error and power; (2) for meta-analysis of balanced studies (equal numbers of cases and controls), the score test is best, but is less powerful than Firth test based joint analysis; and (3) for meta-analysis of sufficiently unbalanced studies, all four tests can be anti-conservative, particularly the score test. We also establish MAC as the key parameter determining test calibration for joint and meta-analysis.
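The abstract's emphasis on minor allele count (MAC) rather than minor allele frequency can be made concrete with a small sketch. The helper names below are hypothetical. The default cutoff of 400 is the rough "low-count" boundary quoted in the abstract, used here only for illustration.

```python
from typing import Sequence


def minor_allele_count(genotypes: Sequence[int]) -> int:
    """MAC from genotypes coded as 0, 1 or 2 copies of one allele per individual."""
    copies = sum(genotypes)
    total_alleles = 2 * len(genotypes)
    # The minor allele is whichever of the two alleles is less common.
    return min(copies, total_alleles - copies)


def expected_mac(maf: float, n_individuals: int) -> float:
    """Expected MAC for a given minor allele frequency and sample size."""
    return 2 * n_individuals * maf


def is_low_count(mac: float, threshold: float = 400) -> bool:
    """Flag variants below the (approximate) low-count boundary."""
    return mac < threshold


if __name__ == "__main__":
    # The same 1% frequency gives very different counts at different sample sizes.
    for n in (1_000, 100_000):
        mac = expected_mac(0.01, n)
        print(n, mac, is_low_count(mac))
```

The same allele frequency can fall on either side of the boundary depending on sample size, which is one way to see why the count, rather than the frequency, is a natural calibration parameter.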
Affiliation(s)
- Clement Ma
- Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan 48109-2029, USA
|
1620
|
Peiffer JA, Flint-Garcia SA, De Leon N, McMullen MD, Kaeppler SM, Buckler ES. The genetic architecture of maize stalk strength. PLoS One 2013; 8:e67066. [PMID: 23840585 PMCID: PMC3688621 DOI: 10.1371/journal.pone.0067066] [Citation(s) in RCA: 74] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2012] [Accepted: 05/14/2013] [Indexed: 01/16/2023] Open
Abstract
Stalk strength is an important trait in maize (Zea mays L.). Strong stalks reduce lodging and maximize harvestable yield. Studies show rind penetrometer resistance (RPR), or the force required to pierce a stalk rind with a spike, is a valid approximation of strength. We measured RPR across 4,692 recombinant inbreds (RILs) comprising the maize nested association mapping (NAM) panel derived from crosses of diverse inbreds to the inbred, B73. An intermated B73×Mo17 family (IBM) of 196 RILs and a panel of 2,453 diverse inbreds from the North Central Regional Plant Introduction Station (NCRPIS) were also evaluated. We measured RPR in three environments. Family-nested QTL were identified by joint-linkage mapping in the NAM panel. We also performed a genome-wide association study (GWAS) and genomic best linear unbiased prediction (GBLUP) in each panel. Broad sense heritability computed on a line means basis was low for RPR. Only 8 of 26 families had a heritability above 0.20. The NCRPIS diversity panel had a heritability of 0.54. Across NAM and IBM families, 18 family-nested QTL and 141 significant GWAS associations were identified for RPR. Numerous weak associations were also found in the NCRPIS diversity panel. However, few were linked to loci involved in phenylpropanoid and cellulose synthesis or vegetative phase transition. Using an identity-by-state (IBS) relationship matrix estimated from 1.6 million single nucleotide polymorphisms (SNPs) and RPR measures from 20% of the NAM panel, genomic prediction by GBLUP explained 64±2% of variation in the remaining RILs. In the NCRPIS diversity panel, an IBS matrix estimated from 681,257 SNPs and RPR measures from 20% of the panel explained 33±3% of variation in the remaining inbreds. These results indicate the high genetic complexity of stalk strength and the potential for genomic prediction to hasten its improvement.
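The identity-by-state (IBS) relationship matrix mentioned in the abstract can be sketched as follows. This uses one common definition of IBS (the mean proportion of shared alleles across SNPs). The abstract does not state which estimator the authors used, so the definition, function names and toy genotypes are assumptions.

```python
from typing import List, Sequence


def ibs_similarity(g1: Sequence[int], g2: Sequence[int]) -> float:
    """Mean identity-by-state between two lines.

    Genotypes are coded as allele counts (0, 1 or 2) per SNP. At each SNP
    the two lines share 2 - |g1 - g2| alleles out of 2.
    """
    if len(g1) != len(g2) or len(g1) == 0:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    shared = sum(2 - abs(a - b) for a, b in zip(g1, g2))
    return shared / (2 * len(g1))


def ibs_matrix(genotypes: Sequence[Sequence[int]]) -> List[List[float]]:
    """Symmetric matrix of pairwise IBS values; the diagonal is 1.0."""
    n = len(genotypes)
    return [[ibs_similarity(genotypes[i], genotypes[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Three hypothetical inbred lines scored at four SNPs.
    lines = [
        [0, 2, 0, 2],
        [0, 2, 2, 2],
        [2, 0, 2, 0],
    ]
    for row in ibs_matrix(lines):
        print(row)
```

In a GBLUP-style analysis a matrix of this kind would serve as the relationship matrix among lines. The mixed-model fitting step is not shown here.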
Affiliation(s)
- Jason A Peiffer
- Department of Plant Breeding and Genetics, Cornell University, Ithaca, New York, USA.
|
1621
|
Chung RH, Shih CC. SeqSIMLA: a sequence and phenotype simulation tool for complex disease studies. BMC Bioinformatics 2013; 14:199. [PMID: 23782512 PMCID: PMC3693898 DOI: 10.1186/1471-2105-14-199] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2013] [Accepted: 06/14/2013] [Indexed: 11/22/2022] Open
Abstract
Background Association studies based on next-generation sequencing (NGS) technology have become popular, and statistical association tests for NGS data have been developed rapidly. A flexible tool for simulating sequence data in either unrelated case–control or family samples with different disease and quantitative trait models would be useful for evaluating the statistical power for planning a study design and for comparing power among statistical methods based on NGS data. Results We developed a simulation tool, SeqSIMLA, which can simulate sequence data with user-specified disease and quantitative trait models. We implemented two disease models, in which the user can flexibly specify the number of disease loci, effect sizes or population attributable risk, disease prevalence, and risk or protective loci. We also implemented a quantitative trait model, in which the user can specify the number of quantitative trait loci (QTL), proportions of variance explained by the QTL, and genetic models. We compiled recombination rates from the HapMap project so that genomic structures similar to the real data can be simulated. Conclusions SeqSIMLA can efficiently simulate sequence data with disease or quantitative trait models specified by the user. SeqSIMLA will be very useful for evaluating statistical properties for new study designs and new statistical methods using NGS. SeqSIMLA can be downloaded for free at http://seqsimla.sourceforge.net.
Affiliation(s)
- Ren-Hua Chung
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan, Miaoli, Taiwan.
|
1622
|
Belonogova NM, Svishcheva GR, van Duijn CM, Aulchenko YS, Axenovich TI. Region-based association analysis of human quantitative traits in related individuals. PLoS One 2013; 8:e65395. [PMID: 23799013 PMCID: PMC3684601 DOI: 10.1371/journal.pone.0065395] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2013] [Accepted: 04/24/2013] [Indexed: 01/27/2023] Open
Abstract
Regional-based association analysis instead of individual testing of each SNP was introduced in genome-wide association studies to increase the power of gene mapping, especially for rare genetic variants. For regional association tests, the kernel machine-based regression approach was recently proposed as a more powerful alternative to collapsing-based methods. However, the vast majority of existing algorithms and software for the kernel machine-based regression are applicable only to unrelated samples. In this paper, we present a new method for the kernel machine-based regression association analysis of quantitative traits in samples of related individuals. The method is based on the GRAMMAR+ transformation of phenotypes of related individuals, followed by use of existing kernel machine-based regression software for unrelated samples. We compared the performance of kernel-based association analysis on the material of the Genetic Analysis Workshop 17 family sample and real human data by using our transformation, the original untransformed trait, and environmental residuals. We demonstrated that only the GRAMMAR+ transformation produced type I errors close to the nominal value and that this method had the highest empirical power. The new method can be applied to analysis of related samples by using existing software for kernel-based association analysis developed for unrelated samples.
Affiliation(s)
- Nadezhda M. Belonogova
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Gulnara R. Svishcheva
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Yurii S. Aulchenko
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Tatiana I. Axenovich
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
|
1623
|
Morrison AC, Voorman A, Johnson AD, Liu X, Yu J, Li A, Muzny D, Yu F, Rice K, Zhu C, Bis J, Heiss G, O'Donnell CJ, Psaty BM, Cupples LA, Gibbs R, Boerwinkle E. Whole-genome sequence-based analysis of high-density lipoprotein cholesterol. Nat Genet 2013; 45:899-901. [PMID: 23770607 PMCID: PMC4030301 DOI: 10.1038/ng.2671] [Citation(s) in RCA: 122] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2013] [Accepted: 05/24/2013] [Indexed: 12/15/2022]
Abstract
We describe initial steps for interrogating whole genome sequence (WGS) data to characterize the genetic architecture of a complex trait, such as high density lipoprotein cholesterol (HDL-C). We estimate that common variation contributes more to HDL-C heritability than rare variation, and screening for Mendelian dyslipidemia variants identified individuals with extreme HDL-C. WGS analyses highlight the value of regulatory and non-protein coding regions of the genome in addition to protein coding regions.
Affiliation(s)
- Alanna C Morrison
- Human Genetics Center, University of Texas Health Science Center at Houston, Houston, Texas, USA
|
1624
|
Fang H, Hou B, Wang Q, Yang Y. Rare variants analysis by risk-based variable-threshold method. Comput Biol Chem 2013; 46:32-8. [PMID: 23764529 DOI: 10.1016/j.compbiolchem.2013.04.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2012] [Revised: 04/03/2013] [Accepted: 04/10/2013] [Indexed: 11/17/2022]
Abstract
Genome-wide association studies, as a powerful approach for detecting common variants associated with diseases, have revealed many disease-associated loci. However, traditional association analysis methods do not have enough power to detect the effects of rare variants with limited sample size. As a solution to this problem, pooling rare variants by their functions into a composite variant provides an alternative way of identifying susceptibility genes. In this paper, we propose a new pooling method to test the variant-disease association and to identify the functional rare variants related to the disease. Variants with smaller and larger risk measures, defined as the ratio of allele frequencies between cases and controls, are pooled, and a chi-square test of the resultant pooled table is calculated. We vary the threshold of pooling over all possible values and use the maximal chi-square as the test statistic. The maximal chi-square is in fact the global maximum over all possible poolings. Our approach is similar to the existing variable-threshold method, but we threshold on the risk measure instead of allele frequencies of controls. Simulation results show that our method performs better in both association testing and variant selection.
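The thresholding idea described above can be sketched in a few lines. This is a simplified reading of the abstract, not the authors' implementation. It pools only the variants with the largest risk measures and builds the 2x2 table from allele counts, and it stabilises the frequency ratio with a pseudocount of 0.5. All function names and the toy counts are hypothetical.

```python
from typing import List, Sequence, Tuple


def chi_square_2x2(a: float, b: float, c: float, d: float) -> float:
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def risk_measure(case_count: int, control_count: int,
                 n_case_alleles: int, n_control_alleles: int,
                 pseudo: float = 0.5) -> float:
    """Ratio of case to control allele frequency (pseudocount is an assumption)."""
    case_freq = (case_count + pseudo) / (n_case_alleles + 2 * pseudo)
    control_freq = (control_count + pseudo) / (n_control_alleles + 2 * pseudo)
    return case_freq / control_freq


def max_pooled_chi_square(case_counts: Sequence[int], control_counts: Sequence[int],
                          n_case_alleles: int, n_control_alleles: int
                          ) -> Tuple[float, List[int]]:
    """Scan every risk threshold; return the largest chi-square and the pooled set."""
    order = sorted(
        range(len(case_counts)),
        key=lambda j: risk_measure(case_counts[j], control_counts[j],
                                   n_case_alleles, n_control_alleles),
        reverse=True,
    )
    best_stat, best_set = 0.0, []
    pooled_case = pooled_control = 0
    for k, j in enumerate(order, start=1):
        pooled_case += case_counts[j]
        pooled_control += control_counts[j]
        # Pooled minor alleles versus all other alleles at the k pooled sites.
        stat = chi_square_2x2(pooled_case, k * n_case_alleles - pooled_case,
                              pooled_control, k * n_control_alleles - pooled_control)
        if stat > best_stat:
            best_stat, best_set = stat, sorted(order[:k])
    return best_stat, best_set


if __name__ == "__main__":
    stat, pooled = max_pooled_chi_square([6, 1, 3, 0], [1, 4, 1, 2], 200, 200)
    print(round(stat, 3), pooled)
```

Because the statistic is a maximum over many candidate poolings, it does not follow a standard chi-square distribution under the null. A separate calibration step would be needed to obtain a p-value, and is not shown.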
Affiliation(s)
- Hongyan Fang
- Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui 230026, China
|
1625
|
Arnedo J, del Val C, de Erausquin GA, Romero-Zaliz R, Svrakic D, Cloninger CR, Zwir I. PGMRA: a web server for (phenotype x genotype) many-to-many relation analysis in GWAS. Nucleic Acids Res 2013; 41:W142-9. [PMID: 23761451 PMCID: PMC3692099 DOI: 10.1093/nar/gkt496] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
Abstract
It has been proposed that single nucleotide polymorphisms (SNPs) discovered by genome-wide association studies (GWAS) account for only a small fraction of the genetic variation of complex traits in human populations. The remaining unexplained variance, or missing heritability, is thought to be due to marginal effects of many loci with small effects and has eluded attempts to identify its sources. Combining different studies appears to resolve this problem in part. However, neither individual GWAS nor meta-analytic combinations thereof are helpful for disclosing which genetic variants contribute to explaining a particular phenotype. Here, we propose that most of the missing heritability is latent in the GWAS data, which conceals intermediate phenotypes. To uncover such latent information, we propose the PGMRA server, which introduces phenomics (the full set of phenotype features of an individual) to identify SNP-set structures in a broader sense, i.e. causally cohesive genotype-phenotype relations. These relations are agnostically identified (without considering disease status of the subjects) and organized in an interpretable fashion. Then, by incorporating a posteriori the subject status within each relation, we can establish the risk surface of a disease in an unbiased mode. This approach complements, rather than replaces, current analysis methods. The server is publicly available at http://phop.ugr.es/fenogeno.
Affiliation(s)
- Javier Arnedo
- Department of Computer Science and Artificial Intelligence, University of Granada, E-18071 Granada, Spain
|
1626
|
Auer PL, Wang G, Leal SM. Testing for rare variant associations in the presence of missing data. Genet Epidemiol 2013; 37:529-38. [PMID: 23757187 DOI: 10.1002/gepi.21736] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2013] [Revised: 04/01/2013] [Accepted: 04/17/2013] [Indexed: 11/07/2022]
Abstract
For studies of genetically complex diseases, many association methods have been developed to analyze rare variants. When variant calls are missing, naïve implementation of rare variant association (RVA) methods may lead to inflated type I error rates as well as a reduction in power. To overcome these problems, we developed extensions for four commonly used RVA tests. Data from the National Heart Lung and Blood Institute-Exome Sequencing Project were used to demonstrate that missing variant calls can lead to increased false-positive rates and that the extended RVA methods control type I error without reducing power. We suggest a combined strategy of data filtering based on variant and sample level missing genotypes along with implementation of these extended RVA tests.
Affiliation(s)
- Paul L Auer
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
|
1627
|
Goldstein DB, Allen A, Keebler J, Margulies EH, Petrou S, Petrovski S, Sunyaev S. Sequencing studies in human genetics: design and interpretation. Nat Rev Genet 2013; 14:460-70. [PMID: 23752795 DOI: 10.1038/nrg3455] [Citation(s) in RCA: 185] [Impact Index Per Article: 15.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Abstract
Next-generation sequencing is becoming the primary discovery tool in human genetics. There have been many clear successes in identifying genes that are responsible for Mendelian diseases, and sequencing approaches are now poised to identify the mutations that cause undiagnosed childhood genetic diseases and those that predispose individuals to more common complex diseases. There are, however, growing concerns that the complexity and magnitude of complete sequence data could lead to an explosion of weakly justified claims of association between genetic variants and disease. Here, we provide an overview of the basic workflow in next-generation sequencing studies and emphasize, where possible, measures and considerations that facilitate accurate inferences from human sequencing studies.
Affiliation(s)
- David B Goldstein
- Center for Human Genome Variation, Duke University School of Medicine, 308 Research Drive, Box 91009, LSRC B Wing, Room 330, Durham, North Carolina 27708, USA.
|
1628
|
Ionita-Laza I, Lee S, Makarov V, Buxbaum J, Lin X. Sequence kernel association tests for the combined effect of rare and common variants. Am J Hum Genet 2013; 92:841-53. [PMID: 23684009 DOI: 10.1016/j.ajhg.2013.04.015] [Citation(s) in RCA: 333] [Impact Index Per Article: 27.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2013] [Revised: 03/20/2013] [Accepted: 04/18/2013] [Indexed: 01/08/2023] Open
Abstract
Recent developments in sequencing technologies have made it possible to uncover both rare and common genetic variants. Genome-wide association studies (GWASs) can test for the effect of common variants, whereas sequence-based association studies can evaluate the cumulative effect of both rare and common variants on disease risk. Many groupwise association tests, including burden tests and variance-component tests, have been proposed for this purpose. Although such tests do not exclude common variants from their evaluation, they focus mostly on testing the effect of rare variants by upweighting rare-variant effects and downweighting common-variant effects and can therefore lose substantial power when both rare and common genetic variants in a region influence trait susceptibility. There is increasing evidence that the allelic spectrum of risk variants at a given locus might include novel, rare, low-frequency, and common genetic variants. Here, we introduce several sequence kernel association tests to evaluate the cumulative effect of rare and common variants. The proposed tests are computationally efficient and are applicable to both binary and continuous traits. Furthermore, they can readily combine GWAS and whole-exome-sequencing data on the same individuals, when available, and are also applicable to deep-resequencing data of GWAS loci. We evaluate these tests on data simulated under comprehensive scenarios and show that compared with the most commonly used tests, including the burden and variance-component tests, they can achieve substantial increases in power. We next show applications to sequencing studies for Crohn disease and autism spectrum disorders. The proposed tests have been incorporated into the software package SKAT.
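To illustrate the contrast between burden and variance-component statistics mentioned above, the sketch below computes both from per-variant score contributions, using Beta(1, 25) density weights that upweight rare variants. It shows the generic textbook forms only, with no covariates and no p-value calculation. It is not the combined rare-and-common test proposed in the paper, nor the API of the SKAT package, and all names and data are hypothetical.

```python
import math
from typing import List, Sequence


def beta_density_weight(maf: float, a: float = 1.0, b: float = 25.0) -> float:
    """Beta(a, b) density at the MAF; larger for rarer variants when a=1, b=25."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * maf ** (a - 1.0) * (1.0 - maf) ** (b - 1.0)


def variant_scores(genotypes: Sequence[Sequence[int]],
                   phenotype: Sequence[float]) -> List[float]:
    """Score per variant: sum_i g_ij * (y_i - mean(y)); genotypes[j][i] is 0/1/2."""
    mean_y = sum(phenotype) / len(phenotype)
    return [sum(g * (y - mean_y) for g, y in zip(variant, phenotype))
            for variant in genotypes]


def burden_statistic(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Square of the weighted sum: effects in opposite directions cancel."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2


def variance_component_statistic(scores: Sequence[float],
                                 weights: Sequence[float]) -> float:
    """Sum of weighted squares: insensitive to the direction of each effect."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))


if __name__ == "__main__":
    y = [1, 1, 1, 0, 0, 0]                  # toy binary trait
    G = [[1, 1, 0, 0, 0, 0],                # variant enriched in cases
         [0, 0, 0, 1, 1, 0]]                # variant enriched in controls
    mafs = [sum(v) / (2 * len(v)) for v in G]
    w = [beta_density_weight(m) for m in mafs]
    s = variant_scores(G, y)
    print("burden:", burden_statistic(s, w))
    print("variance component:", variance_component_statistic(s, w))
```

In this toy example the two variants have opposite effects, so the burden statistic is zero while the variance-component statistic is positive. With these weights a common variant would receive a much smaller weight than a rare one, which is the downweighting the abstract refers to.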
|
1629
|
Tyrer JP, Guo Q, Easton DF, Pharoah PDP. The admixture maximum likelihood test to test for association between rare variants and disease phenotypes. BMC Bioinformatics 2013; 14:177. [PMID: 23738568 PMCID: PMC3698090 DOI: 10.1186/1471-2105-14-177] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2012] [Accepted: 05/22/2013] [Indexed: 01/28/2023] Open
Abstract
BACKGROUND The development of genotyping arrays containing hundreds of thousands of rare variants across the genome and advances in high-throughput sequencing technologies have made feasible empirical genetic association studies to search for rare disease susceptibility alleles. As single variant testing is underpowered to detect associations, the development of statistical methods to combine analysis across variants - so-called "burden tests" - is an area of active research interest. We previously developed a method, the admixture maximum likelihood test, to test multiple, common variants for association with a trait of interest. We have extended this method, called the rare admixture maximum likelihood test (RAML), for the analysis of rare variants. In this paper we compare the performance of RAML with six other burden tests designed to test for association of rare variants. RESULTS We used simulation testing over a range of scenarios to test the power of RAML compared to the other rare variant association testing methods. These scenarios modelled differences in effect variability, the average direction of effect and the proportion of associated variants. We evaluated the power for all the different scenarios. RAML tended to have the greatest power for most scenarios where the proportion of associated variants was small, whereas SKAT-O performed a little better for the scenarios with a higher proportion of associated variants. CONCLUSIONS The RAML method makes no assumptions about the proportion of variants that are associated with the phenotype of interest or the magnitude and direction of their effect. The method is flexible and can be applied to both dichotomous and quantitative traits and allows for the inclusion of covariates in the underlying regression model. The RAML method performed well compared to the other methods over a wide range of scenarios. 
Generally, power was moderate in most of the scenarios, underlining the need for large sample sizes in any form of association testing.
Affiliation(s)
- Jonathan P Tyrer
  - Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge, UK
- Qi Guo
  - Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge, UK
- Douglas F Easton
  - Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge, UK
  - Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Paul DP Pharoah
  - Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge, UK
  - Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
|
1630
|
Long N, Dickson SP, Maia JM, Kim HS, Zhu Q, Allen AS. Leveraging prior information to detect causal variants via multi-variant regression. PLoS Comput Biol 2013; 9:e1003093. [PMID: 23762022 PMCID: PMC3675126 DOI: 10.1371/journal.pcbi.1003093] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 10/23/2012] [Accepted: 04/29/2013] [Indexed: 01/03/2023] Open
Abstract
Although many methods are available to test sequence variants for association with complex diseases and traits, methods that specifically seek to identify causal variants are less developed. Here we develop and evaluate a Bayesian hierarchical regression method that incorporates prior information on the likelihood of variant causality through weighting of variant effects. By simulation studies using both simulated and real sequence variants, we compared a standard single variant test for analyzing variant-disease association with the proposed method using different weighting schemes. We found that by leveraging linkage disequilibrium of variants with known GWAS signals and sequence conservation (phastCons), the proposed method provides a powerful approach for detecting causal variants while controlling false positives. The decline in DNA sequencing cost permits the interrogation of potentially all variants across the entire allele frequency spectrum for their associations with complex human diseases and traits. However, the identification of causal variants remains challenging. Existing single variant tests do not distinguish between causal association and association induced by linkage disequilibrium and tend to be underpowered for rare or low-frequency variants, whereas variant grouping methods do not identify individual causal variants. We propose a novel Bayesian hierarchical regression approach that estimates effects of multiple variants on a disease trait simultaneously and incorporates prior information on the likelihood of causality. By simulation, we show that by combining linkage disequilibrium with known genome wide association signals and functional conservation, the proposed method, the first of its kind, is powerful to correctly detect causal variants.
Affiliation(s)
- Nanye Long
  - Center for Human Genome Variation, Duke University School of Medicine, Durham, North Carolina, United States of America.
|
1631
|
Lin WY, Yi N, Lou XY, Zhi D, Zhang K, Gao G, Tiwari HK, Liu N. Haplotype kernel association test as a powerful method to identify chromosomal regions harboring uncommon causal variants. Genet Epidemiol 2013; 37:560-70. [PMID: 23740760 DOI: 10.1002/gepi.21740] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 12/13/2012] [Revised: 05/01/2013] [Accepted: 05/06/2013] [Indexed: 01/09/2023]
Abstract
For most complex diseases, the fraction of heritability that can be explained by the variants discovered from genome-wide association studies is minor. Although the so-called "rare variants" (minor allele frequency [MAF] < 1%) have attracted increasing attention, they are unlikely to account for much of the "missing heritability" because very few people may carry these rare variants. The genetic variants that are likely to fill in the "missing heritability" include uncommon causal variants (MAF < 5%), which are generally untyped in association studies using tagging single-nucleotide polymorphisms (SNPs) or commercial SNP arrays. Developing powerful statistical methods can help to identify chromosomal regions harboring uncommon causal variants, while bypassing the genome-wide or exome-wide next-generation sequencing. In this work, we propose a haplotype kernel association test (HKAT) that is equivalent to testing the variance component of random effects for distinct haplotypes. With an appropriate weighting scheme given to haplotypes, we can further enhance the ability of HKAT to detect uncommon causal variants. With scenarios simulated according to the population genetics theory, HKAT is shown to be a powerful method for detecting chromosomal regions harboring uncommon causal variants.
Affiliation(s)
- Wan-Yu Lin
  - Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
|
1632
|
Lutz SM, Fingerlin T, Fardo DW. Statistical Approaches to Combine Genetic Association Data. Journal of Biometrics & Biostatistics 2013; 4:1000166. [PMID: 24009987 PMCID: PMC3760734 DOI: 10.4172/2155-6180.1000166] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/02/2023]
Abstract
In an attempt to discover and unravel genetic predisposition to complex traits, new statistical methods have emerged that utilize multiple sources of data. This appeal to data aggregation is seen on various levels: across genetic variants, across genomic/biological/environmental measures and across different studies, often with fundamentally differing designs. While combining data can increase power to detect genetic variants associated with disease phenotypes, care must be taken in the design, analysis, and interpretation of such studies. Here, we explore methodologies employed to combine sources of genetic data and discuss the prospects for novel advances in the fields of statistical genetics and genetic epidemiology.
Affiliation(s)
- Sharon M Lutz
  - Department of Biostatistics, University of Colorado, 13001 E. 17 St, Aurora CO 80045, USA
- Tasha Fingerlin
  - Department of Biostatistics, University of Colorado, 13001 E. 17 St, Aurora CO 80045, USA
- David W Fardo
  - Department of Biostatistics, University of Kentucky, 725 Rose St, Lexington KY 40536, USA
|
1633
|
Imputation-based meta-analysis of severe malaria in three African populations. PLoS Genet 2013; 9:e1003509. [PMID: 23717212 PMCID: PMC3662650 DOI: 10.1371/journal.pgen.1003509] [Citation(s) in RCA: 85] [Impact Index Per Article: 7.1] [Received: 08/24/2012] [Accepted: 03/28/2013] [Indexed: 01/15/2023] Open
Abstract
Combining data from genome-wide association studies (GWAS) conducted at different locations, using genotype imputation and fixed-effects meta-analysis, has been a powerful approach for dissecting complex disease genetics in populations of European ancestry. Here we investigate the feasibility of applying the same approach in Africa, where genetic diversity, both within and between populations, is far more extensive. We analyse genome-wide data from approximately 5,000 individuals with severe malaria and 7,000 population controls from three different locations in Africa. Our results show that the standard approach is well powered to detect known malaria susceptibility loci when sample sizes are large, and that modern methods for association analysis can control the potential confounding effects of population structure. We show that the pattern of association around the haemoglobin S allele differs substantially across populations due to differences in haplotype structure. Motivated by these observations we consider new approaches to association analysis that might prove valuable for multicentre GWAS in Africa: we relax the assumptions of SNP-based fixed effect analysis; we apply Bayesian approaches to allow for heterogeneity in the effect of an allele on risk across studies; and we introduce a region-based test to allow for heterogeneity in the location of causal alleles.
|
1634
|
Pers TH, Dworzyński P, Thomas CE, Lage K, Brunak S. MetaRanker 2.0: a web server for prioritization of genetic variation data. Nucleic Acids Res 2013; 41:W104-8. [PMID: 23703204 PMCID: PMC3692047 DOI: 10.1093/nar/gkt387] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Indexed: 01/06/2023] Open
Abstract
MetaRanker 2.0 is a web server for prioritization of common and rare frequency genetic variation data. Based on heterogeneous data sets including genetic association data, protein–protein interactions, large-scale text-mining data, copy number variation data and gene expression experiments, MetaRanker 2.0 prioritizes the protein-coding part of the human genome to shortlist candidate genes for targeted follow-up studies. MetaRanker 2.0 is made freely available at www.cbs.dtu.dk/services/MetaRanker-2.0.
Affiliation(s)
- Tune H Pers
  - Department of Systems Biology, Center for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark
|
1635
|
Abstract
SUMMARY MASS is a command-line program to perform meta-analysis of sequencing studies by combining the score statistics from multiple studies. It implements three types of multivariate tests that encompass all commonly used association tests for rare variants. The input files can be generated from the accompanying software SCORE-Seq. This bundle of programs allows analysis of large sequencing studies in a time- and memory-efficient manner. AVAILABILITY AND IMPLEMENTATION MASS and SCORE-Seq, including documentation and executables, are available at http://dlin.web.unc.edu/software/. CONTACT lin@bios.unc.edu.
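As a rough illustration of what combining score statistics across studies can look like, the sketch below sums per-study score vectors and their covariance matrices and then forms a weighted burden-type statistic. It is not the MASS implementation and does not read SCORE-Seq output; the function names `combine_scores` and `burden_test`, the in-memory input layout, and the choice of weights are assumptions made for this example, and it covers only one simple burden-type test rather than the three families of tests the abstract mentions.

```python
import math


def combine_scores(studies):
    """Sum score vectors U and covariance matrices V over studies.

    `studies` is a list of (U, V) pairs; every study is assumed (for this
    sketch) to report the same variants in the same order.
    """
    m = len(studies[0][0])
    total_u = [0.0] * m
    total_v = [[0.0] * m for _ in range(m)]
    for u, v in studies:
        for i in range(m):
            total_u[i] += u[i]
            for j in range(m):
                total_v[i][j] += v[i][j]
    return total_u, total_v


def burden_test(u, v, weights):
    """Burden-type statistic T = (w'U)^2 / (w'Vw), compared with chi-square(1)."""
    m = len(u)
    num = sum(weights[i] * u[i] for i in range(m))
    den = sum(weights[i] * v[i][j] * weights[j]
              for i in range(m) for j in range(m))
    stat = num * num / den
    # Upper tail of a chi-square with 1 df equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

Calling `combine_scores` on a list of `(U, V)` pairs and passing the result to `burden_test` with one weight per variant returns the statistic and its asymptotic p-value; equal weights give an unweighted burden test.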
Affiliation(s)
- Zheng-Zheng Tang
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599-7420, USA
|
1636
|
Evangelou E, Ioannidis JPA. Meta-analysis methods for genome-wide association studies and beyond. Nat Rev Genet 2013; 14:379-89. [PMID: 23657481 DOI: 10.1038/nrg3472] [Citation(s) in RCA: 404] [Impact Index Per Article: 33.7] [Indexed: 12/15/2022]
Abstract
Meta-analysis of genome-wide association studies (GWASs) has become a popular method for discovering genetic risk variants. Here, we overview both widely applied and newer statistical methods for GWAS meta-analysis, including issues of interpretation and assessment of sources of heterogeneity. We also discuss extensions of these meta-analysis methods to complex data. Where possible, we provide guidelines for researchers who are planning to use these methods. Furthermore, we address special issues that may arise for meta-analysis of sequencing data and rare variants. Finally, we discuss challenges and solutions surrounding the goals of making meta-analysis data publicly available and building powerful consortia.
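One of the widely applied approaches in this area is inverse-variance fixed-effects meta-analysis, with Cochran's Q as a basic heterogeneity check. The sketch below is a minimal single-variant version under the assumption that each study supplies an effect estimate and its standard error; the function name `fixed_effect_meta` and the returned dictionary keys are hypothetical and are not taken from the review.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance fixed-effects meta-analysis for one variant.

    betas: per-study effect estimates (e.g. log odds ratios)
    ses:   per-study standard errors, in the same order
    """
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    # Cochran's Q: weighted squared deviations from the pooled estimate,
    # referred to a chi-square with (number of studies - 1) df.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    return {"beta": beta, "se": se, "z": z, "p": p_value,
            "q": q, "df": len(betas) - 1}
```

A large Q relative to its degrees of freedom suggests between-study heterogeneity, in which case a random-effects or Bayesian model may be the more appropriate choice.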
Affiliation(s)
- Evangelos Evangelou
  - Clinical and Molecular Epidemiology Unit, Department of Hygiene and Epidemiology, University of Ioannina Medical School, Ioannina 45110, Greece
|
1637
|
Moore CB, Wallace JR, Frase AT, Pendergrass SA, Ritchie MD. BioBin: a bioinformatics tool for automating the binning of rare variants using publicly available biological knowledge. BMC Med Genomics 2013; 6 Suppl 2:S6. [PMID: 23819467 PMCID: PMC3654874 DOI: 10.1186/1755-8794-6-s2-s6] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Indexed: 12/21/2022] Open
Abstract
Background With the recent decreasing cost of genome sequence data, there has been increasing interest in rare variants and methods to detect their association to disease. We developed BioBin, a flexible collapsing method inspired by biological knowledge that can be used to automate the binning of low frequency variants for association testing. We also built the Library of Knowledge Integration (LOKI), a repository of data assembled from public databases, which contains resources such as: dbSNP and gene Entrez database information from the National Center for Biotechnology (NCBI), pathway information from Gene Ontology (GO), Protein families database (Pfam), Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, NetPath - signal transduction pathways, Open Regulatory Annotation Database (ORegAnno), Biological General Repository for Interaction Datasets (BioGrid), Pharmacogenomics Knowledge Base (PharmGKB), Molecular INTeraction database (MINT), and evolutionary conserved regions (ECRs) from UCSC Genome Browser. The novelty of BioBin is access to comprehensive knowledge-guided multi-level binning. For example, bin boundaries can be formed using genomic locations from: functional regions, evolutionary conserved regions, genes, and/or pathways. Methods We tested BioBin using simulated data and 1000 Genomes Project low coverage data to test our method with simulated causative variants and a pairwise comparison of rare variant (MAF < 0.03) burden differences between Yoruba individuals (YRI) and individuals of European descent (CEU). Lastly, we analyzed the NHLBI GO Exome Sequencing Project Kabuki dataset, a congenital disorder affecting multiple organs and often intellectual disability, contrasted with Complete Genomics data as controls. Results The results from our simulation studies indicate type I error rate is controlled, however, power falls quickly for small sample sizes using variants with modest effect sizes. 
Using BioBin, we were able to find simulated variants in genes with fewer than 20 loci, but found the sensitivity to be much lower in large bins. We also highlighted the scale of population stratification between two 1000 Genomes Project populations, CEU and YRI. Lastly, we were able to apply BioBin to natural biological data from dbGaP and identify an interesting candidate gene for further study. Conclusions We have established that BioBin will be a very practical and flexible tool to analyze sequence data and potentially uncover novel associations between low frequency variants and complex disease.
Affiliation(s)
- Carrie B Moore
  - Center for Human Genetics Research, Vanderbilt University, Nashville, TN 37232, USA
|
1638
|
Wu G, Zhi D. Pathway-based approaches for sequencing-based genome-wide association studies. Genet Epidemiol 2013; 37:478-94. [PMID: 23650134 DOI: 10.1002/gepi.21728] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 12/29/2012] [Revised: 03/04/2013] [Accepted: 03/29/2013] [Indexed: 01/07/2023]
Abstract
For analyzing complex trait association with sequencing data, most current studies test aggregated effects of variants in a gene or genomic region. Although gene-based tests have insufficient power even for moderately sized samples, pathway-based analyses combine information across multiple genes in biological pathways and may offer additional insight. However, most existing pathway association methods were originally designed for genome-wide association studies, and have not been comprehensively evaluated for sequencing data. Moreover, region-based rare variant association methods, although potentially applicable to pathway-based analysis by extending their region definition to gene sets, have never been rigorously tested. In the context of exome-based studies, we use simulated and real datasets to evaluate pathway-based association tests. Our simulation strategy adopts a genome-wide genetic model that distributes total genetic effects hierarchically into pathways, genes, and individual variants, allowing the evaluation of pathway-based methods with realistic quantifiable assumptions on the underlying genetic architectures. The results show that, although no single pathway-based association method offers superior performance in all simulated scenarios, a modification of the Gene Set Enrichment Analysis approach using statistics from single-marker tests without gene-level collapsing (weighted Kolmogorov-Smirnov [WKS]-Variant method) is consistently powerful. Interestingly, directly applying rare variant association tests (e.g., sequence kernel association test) to pathway analysis offers similar power, but its results are sensitive to assumptions of genetic architecture. We applied pathway association analysis to an exome-sequencing dataset for chronic obstructive pulmonary disease, and found that the WKS-Variant method confirms previously published associated genes.
Affiliation(s)
- Guodong Wu
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, Alabama 35294, USA
|
1639
|
Schaid DJ, McDonnell SK, Sinnwell JP, Thibodeau SN. Multiple genetic variant association testing by collapsing and kernel methods with pedigree or population structured data. Genet Epidemiol 2013; 37:409-18. [PMID: 23650101 DOI: 10.1002/gepi.21727] [Citation(s) in RCA: 73] [Impact Index Per Article: 6.1] [Received: 01/17/2013] [Revised: 03/11/2013] [Accepted: 04/01/2013] [Indexed: 11/11/2022]
Abstract
Searching for rare genetic variants associated with complex diseases can be facilitated by enriching for diseased carriers of rare variants by sampling cases from pedigrees enriched for disease, possibly with related or unrelated controls. This strategy, however, complicates analyses because of shared genetic ancestry, as well as linkage disequilibrium among genetic markers. To overcome these problems, we developed broad classes of "burden" statistics and kernel statistics, extending commonly used methods for unrelated case-control data to allow for known pedigree relationships, for autosomes and the X chromosome. Furthermore, by replacing pedigree-based genetic correlation matrices with estimates of genetic relationships based on large-scale genomic data, our methods can be used to account for population-structured data. By simulations, we show that the type I error rates of our developed methods are near the asymptotic nominal levels, allowing rapid computation of P-values. Our simulations also show that a linear weighted kernel statistic is generally more powerful than a weighted "burden" statistic. Because the proposed statistics are rapid to compute, they can be readily used for large-scale screening of the association of genomic sequence data with disease status.
Affiliation(s)
- Daniel J Schaid
  - Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota 55905, USA.
|
1640
|
Sun J, Zheng Y, Hsu L. A unified mixed-effects model for rare-variant association in sequencing studies. Genet Epidemiol 2013; 37:334-44. [PMID: 23483651 PMCID: PMC3740585 DOI: 10.1002/gepi.21717] [Citation(s) in RCA: 107] [Impact Index Per Article: 8.9] [Received: 09/20/2012] [Revised: 12/17/2012] [Accepted: 02/05/2013] [Indexed: 01/16/2023]
Abstract
For rare-variant association analysis, because of the extremely low frequencies of these variants, it is necessary to aggregate them into sets defined a priori (e.g., genes and pathways) in order to achieve adequate power. In this paper, we consider hierarchical models to relate a set of rare variants to phenotype by modeling the effects of variants as a function of variant characteristics while allowing for variant-specific effect (heterogeneity). We derive a set of two score statistics, testing the group effect by variant characteristics and the heterogeneity effect. We make a novel modification to these score statistics so that they are independent under the null hypothesis and their asymptotic distributions can be derived. As a result, the computational burden is greatly reduced compared with permutation-based tests. Our approach provides a general testing framework for rare-variant association, which includes many commonly used tests, such as the burden test [Li and Leal, 2008] and the sequence kernel association test [Wu et al., 2011], as special cases. Furthermore, in contrast to these tests, our proposed test has an added capacity to identify which components of variant characteristics and heterogeneity contribute to the association. Simulations under a wide range of scenarios show that the proposed test is valid, robust, and powerful. An application to the Dallas Heart Study illustrates that apart from identifying genes with significant associations, the new method also provides additional information regarding the source of the association. Such information may be useful for generating hypotheses in future studies.
Affiliation(s)
- Jianping Sun
  - Division of Public Health Sciences, Fred Hutchinson Cancer Research Center Seattle, WA, USA
- Yingye Zheng
  - Division of Public Health Sciences, Fred Hutchinson Cancer Research Center Seattle, WA, USA
- Li Hsu
  - Division of Public Health Sciences, Fred Hutchinson Cancer Research Center Seattle, WA, USA
|
1641
|
Machiela MJ, Chen C, Liang L, Diver WR, Stevens VL, Tsilidis KK, Haiman CA, Chanock SJ, Hunter DJ, Kraft P. One thousand genomes imputation in the National Cancer Institute Breast and Prostate Cancer Cohort Consortium aggressive prostate cancer genome-wide association study. Prostate 2013; 73:677-89. [PMID: 23255287 PMCID: PMC3962143 DOI: 10.1002/pros.22608] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 09/13/2012] [Accepted: 10/05/2012] [Indexed: 12/23/2022]
Abstract
BACKGROUND Genotype imputation substantially increases available markers for analysis in genome-wide association studies (GWAS) by leveraging linkage disequilibrium from a reference panel. We sought to (i) investigate the performance of imputation from the August 2010 release of the 1000 Genomes Project (1000GP) in an existing GWAS of prostate cancer, (ii) look for novel associations with prostate cancer risk, (iii) fine-map known prostate cancer susceptibility regions using an approximate Bayesian framework and stepwise regression, and (iv) compare power and efficiency of imputation and de novo sequencing. METHODS We used 2,782 aggressive prostate cancer cases and 4,458 controls from the NCI Breast and Prostate Cancer Cohort Consortium aggressive prostate cancer GWAS to infer 5.8 million well-imputed autosomal single nucleotide polymorphisms (SNPs). RESULTS Imputation quality, as measured by correlation between imputed and true allele counts, was higher among common variants than rare variants. We found no novel prostate cancer associations among a subset of 1.2 million well-imputed low-frequency variants. At a genome-wide sequencing cost of $2,500, imputation from SNP arrays is a more powerful strategy than sequencing for detecting disease associations of SNPs with minor allele frequencies (MAF) above 1%. CONCLUSIONS 1000GP imputation provided dense coverage of previously identified prostate cancer susceptibility regions, highlighting its potential as an inexpensive first-pass approach to fine mapping in regions such as 5p15 and 8q24. Our study shows 1000GP imputation can accurately identify low-frequency variants and stresses the importance of large sample size when studying these variants.
Affiliation(s)
- Mitchell J. Machiela
  - Program in Molecular and Genetic Epidemiology, Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland
- Constance Chen
  - Program in Molecular and Genetic Epidemiology, Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts
- Liming Liang
  - Program in Molecular and Genetic Epidemiology, Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts
- W. Ryan Diver
  - Epidemiology Research Program, American Cancer Society, Atlanta, Georgia
- Konstantinos K. Tsilidis
  - Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece
  - Cancer Epidemiology Unit, Nuffield Department of Clinical Medicine, University of Oxford, Oxford, United Kingdom
- Christopher A. Haiman
  - Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California
- Stephen J. Chanock
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland
- David J. Hunter
  - Program in Molecular and Genetic Epidemiology, Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts
  - Department of Nutrition, Harvard School of Public Health, Boston, Massachusetts
  - Channing Laboratory, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts
- Peter Kraft
  - Program in Molecular and Genetic Epidemiology, Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts
|
1642
|
Bacanu SA. Testing for modes of inheritance involving compound heterozygotes. Genet Epidemiol 2013; 37:522-8. [PMID: 23633151 DOI: 10.1002/gepi.21732] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/23/2012] [Revised: 03/26/2013] [Accepted: 04/01/2013] [Indexed: 11/09/2022]
Abstract
Functional variants change the protein product or the expression of genes. Due to the latest advances in sequencing technology, most known functional variants can now be assayed in a cost-effective manner. However, to fully use the information from functional variants, researchers need to model the joint effect of these variants. In this article, we propose methods that model the action/interaction of loss-of-function (LOF) mutations, i.e., those mutations that eliminate the protein product of a gene. When multiple LOFs occur in the same causal gene/region, their effect on a phenotype might depend on whether these mutations lie on the same DNA strand/haplotype. When compared to LOFs occurring on the same strand, if these mutations lie on different strands, both copies of the gene are impaired and the impact on the relevant phenotypes is likely to be more severe. To use the information from LOF strand colocalization, we propose three methods that utilize the information from the estimated number of affected strands. We compare the performance of the proposed and competing methods by using simulations of common and rare LOF variants. Two of the proposed methods exhibited desirable power profiles, the first for both common and rare LOFs and the second only for common LOFs. One of the existing methods, collapsed double heterozygosity, exhibits good power to detect compound models for rare variants, especially when no haplotype harbors two or more rare alleles. Consequently, we recommend these three methods to be used for the analysis of functional variants coming from sequencing studies.
Affiliation(s)
- Silviu-Alin Bacanu
  - Virginia Institute for Psychiatric and Behavioral Genetics BIOTECH I, Richmond, Virginia, USA.
|
1643
|
Listgarten J, Lippert C, Kang EY, Xiang J, Kadie CM, Heckerman D. A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics 2013; 29:1526-33. [PMID: 23599503 PMCID: PMC3673214 DOI: 10.1093/bioinformatics/btt177] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.0] [Indexed: 12/22/2022]
Abstract
MOTIVATION Approaches for testing sets of variants, such as a set of rare or common variants within a gene or pathway, for association with complex traits are important. In particular, set tests allow for aggregation of weak signal within a set, can capture interplay among variants and reduce the burden of multiple hypothesis testing. Until now, these approaches did not address confounding by family relatedness and population structure, a problem that is becoming more important as larger datasets are used to increase power. RESULTS We introduce a new approach for set tests that handles confounders. Our model is based on the linear mixed model and uses two random effects: one to capture the set association signal and one to capture confounders. We also introduce a computational speedup for two random-effects models that makes this approach feasible even for extremely large cohorts. Using this model with both the likelihood ratio test and score test, we find that the former yields more power while controlling type I error. Application of our approach to richly structured Genetic Analysis Workshop 14 data demonstrates that our method successfully corrects for population structure and family relatedness, whereas application of our method to a 15 000-individual Crohn's disease case-control cohort demonstrates that it additionally recovers genes not recoverable by univariate analysis. AVAILABILITY A Python-based library implementing our approach is available at http://mscompbio.codeplex.com.
|
1644
|
O'Brien KM, Orlow I, Antonescu CR, Ballman K, McCall L, DeMatteo R, Engel LS. Gastrointestinal stromal tumors, somatic mutations and candidate genetic risk variants. PLoS One 2013; 8:e62119. [PMID: 23637977 PMCID: PMC3630216 DOI: 10.1371/journal.pone.0062119] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 12/19/2012] [Accepted: 03/18/2013] [Indexed: 02/07/2023] Open
Abstract
Gastrointestinal stromal tumors (GISTs) are rare but treatable soft tissue sarcomas. Nearly all GISTs have somatic mutations in either the KIT or PDGFRA gene, but there are no known inherited genetic risk factors. We assessed the relationship between KIT/PDGFRA mutations and select deletions or single nucleotide polymorphisms (SNPs) in 279 participants from a clinical trial of adjuvant imatinib mesylate. Given previous evidence that certain susceptibility loci and carcinogens are associated with characteristic mutations, or "signatures" in other cancers, we hypothesized that the characteristic somatic mutations in the KIT and PDGFRA genes in GIST tumors may similarly be mutational signatures that are causally linked to specific mutagens or susceptibility loci. As previous epidemiologic studies suggest environmental risk factors such as dioxin and radiation exposure may be linked to sarcomas, we chose 208 variants in 39 candidate genes related to DNA repair and dioxin metabolism or response. We calculated adjusted odds ratios (ORs) and 95% confidence intervals (CIs) for the association between each variant and 7 categories of tumor mutation using logistic regression. We also evaluated gene-level effects using the sequence kernel association test (SKAT). Although none of the association p-values were statistically significant after adjustment for multiple comparisons, SNPs in CYP1B1 were strongly associated with KIT exon 11 codon 557-8 deletions (OR = 1.9, 95% CI: 1.3-2.9 for rs2855658 and OR = 1.8, 95% CI: 1.2-2.7 for rs1056836) and wild type GISTs (OR = 2.7, 95% CI: 1.5-4.8 for rs1800440 and OR = 0.5, 95% CI: 0.3-0.9 for rs1056836). CYP1B1 was also associated with these mutations categories in the SKAT analysis (p = 0.002 and p = 0.003, respectively). Other potential risk variants included GSTM1, RAD23B and ERCC2. 
This preliminary analysis of inherited genetic risk factors for GIST offers some clues about the disease's genetic origins and provides a starting point for future candidate gene or gene-environment research.
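As a rough illustration of the OR and 95% CI figures quoted in this abstract, the sketch below computes an unadjusted odds ratio with a Wald interval from a 2x2 table. It is a simplification under stated assumptions: the study reports adjusted ORs from logistic regression, which this does not reproduce, and the function name and cell layout are hypothetical.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval from a 2x2 table.

    Hypothetical cell layout (not taken from the study):
      a: variant carriers with the tumor mutation category
      b: variant carriers without it
      c: non-carriers with it
      d: non-carriers without it
    z=1.96 gives an approximate 95% interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

The interval is symmetric on the log scale, so an interval that excludes 1 (as with the CYP1B1 estimates above) corresponds to a nominally significant unadjusted association, before any correction for multiple comparisons.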
Affiliation(s)
- Katie M. O'Brien: Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Irene Orlow: Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
- Cristina R. Antonescu: Department of Pathology, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
- Karla Ballman: Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Linda McCall: American College of Surgeons Oncology Group, Durham, North Carolina, United States of America
- Ronald DeMatteo: Department of Surgery, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
- Lawrence S. Engel: Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
|
1645
|
Thomas DC. Some surprising twists on the road to discovering the contribution of rare variants to complex diseases. Hum Hered 2013; 74:113-7. [PMID: 23594489 DOI: 10.1159/000347020] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
|
1646
|
Abstract
The role of rare variants has become a focus in the search for association with complex traits. Imputation is a powerful and cost-efficient tool to access variants that have not been directly typed, but there are several challenges when imputing rare variants, most notably reference panel selection. Extensions to rare variant association tests to incorporate genotype uncertainty from imputation are discussed, as well as the use of imputed low-frequency and rare variants in the study of population isolates.
|
1647
|
Tachmazidou I, Morris A, Zeggini E. Rare variant association testing for next-generation sequencing data via hierarchical clustering. Hum Hered 2013; 74:165-71. [PMID: 23594494 PMCID: PMC3668801 DOI: 10.1159/000346022] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
Abstract
OBJECTIVES It is thought that a proportion of the genetic susceptibility to complex diseases is due to low-frequency and rare variants. Next-generation sequencing in large populations facilitates the detection of rare variant associations to disease risk. In order to achieve adequate power to detect association at low-frequency and rare variants, locus-specific statistical methods are being developed that combine information across variants within a functional unit and test for association with this enriched signal through so-called burden tests. METHODS We propose a hierarchical clustering approach and a similarity kernel-based association test for continuous phenotypes. This method clusters individuals into groups, within which samples are assumed to be genetically similar, and subsequently tests the group effects among the different clusters. RESULTS The power of this approach is comparable to that of collapsing methods when causal variants have the same direction of effect, but its power is significantly higher compared to burden tests when both protective and risk variants are present in the region of interest. Overall, we observe that the Sequence Kernel Association Test (SKAT) is the most powerful approach under the allelic architectures considered. CONCLUSIONS In our overall comparison, we find the analytical framework within which SKAT operates to yield higher power and to control type I error appropriately.
|
1648
|
Kazma R, Cardin NJ, Witte JS. Does accounting for gene-environment interactions help uncover association between rare variants and complex diseases? Hum Hered 2013; 74:205-14. [PMID: 23594498 DOI: 10.1159/000346825] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
Abstract
OBJECTIVE To determine whether accounting for gene-environment (G×E) interactions improves the power to detect associations between rare variants and a disease, we have extended three statistical methods and compared their power under various simulated disease models. METHODS To test for association of a group of rare variants with a disease, Min-P uses the lowest p value within the group of variants, CAST (Cohort Allelic Sums Test) uses an indicator variable to quantify the rare alleles within the group of variants, and SKAT (Sequence Kernel Association Test) uses a logistic regression based on kernel machine. For each method, we incorporate a term for the G×E interaction and test for association and interaction jointly. RESULTS When testing for disease association with a set of rare variants, accounting for G×E interactions can improve power in specific situations (pure interaction or high proportion of causal variants interacting with the environment). However, the power of this approach can decrease, in particular in the presence of main genetic or environmental effects. Among the methods compared, the optimized and weighted SKAT performed best, whether to test for genetic association or to test it jointly with G×E interactions. CONCLUSION This approach can be used in specific situations but is not appropriate for a primary analysis.
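Two of the statistics named in this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names and the genotype encoding (per-person lists of rare-allele counts 0/1/2) are assumptions, and the G×E interaction term the abstract describes is omitted.

```python
def cast_indicator(genotypes):
    """CAST-style collapsing: 1 if a person carries any rare allele in the group.

    genotypes: one list per person, holding rare-allele counts (0, 1 or 2)
    for each variant in the group (hypothetical encoding).
    """
    return [1 if any(count > 0 for count in person) else 0 for person in genotypes]


def carrier_table(indicators, is_case):
    """Cross-tabulate carrier status against case status.

    Returns (carrier cases, carrier controls,
             non-carrier cases, non-carrier controls).
    """
    pairs = list(zip(indicators, is_case))
    a = sum(1 for carrier, case in pairs if carrier and case)
    b = sum(1 for carrier, case in pairs if carrier and not case)
    c = sum(1 for carrier, case in pairs if not carrier and case)
    d = sum(1 for carrier, case in pairs if not carrier and not case)
    return a, b, c, d


def min_p(p_values):
    """Min-P statistic: the smallest single-variant p value in the group.

    The raw minimum is only a test statistic; judging its significance
    needs a correction for the number of variants (for example by
    permutation), which is not shown here.
    """
    if not p_values:
        raise ValueError("need at least one p value")
    return min(p_values)
```

The resulting 2x2 carrier table can then be passed to any standard test of association; the choice of test is left open here.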
Affiliation(s)
- Rémi Kazma: Department of Epidemiology and Biostatistics and Institute for Human Genetics, University of California, San Francisco, CA, USA
|
1649
|
Analysis of rare, exonic variation amongst subjects with autism spectrum disorders and population controls. PLoS Genet 2013; 9:e1003443. [PMID: 23593035 PMCID: PMC3623759 DOI: 10.1371/journal.pgen.1003443] [Citation(s) in RCA: 116] [Impact Index Per Article: 9.7] [Received: 09/17/2012] [Accepted: 02/26/2013] [Indexed: 01/09/2023]
Abstract
We report on results from whole-exome sequencing (WES) of 1,039 subjects diagnosed with autism spectrum disorders (ASD) and 870 controls selected from the NIMH repository to be of similar ancestry to cases. The WES data came from two centers using different methods to produce sequence and to call variants from it. Therefore, an initial goal was to ensure the distribution of rare variation was similar for data from different centers. This proved straightforward by filtering called variants by fraction of missing data, read depth, and balance of alternative to reference reads. Results were evaluated using seven samples sequenced at both centers and by results from the association study. Next we addressed how the data and/or results from the centers should be combined. Gene-based analysis of association was an obvious choice, but should statistics for association be combined across centers (meta-analysis) or should data be combined and then analyzed (mega-analysis)? Because of the nature of many gene-based tests, we showed by theory and simulations that mega-analysis has better power than meta-analysis. Finally, before analyzing the data for association, we explored the impact of population structure on rare variant analysis in these data. Like other recent studies, we found evidence that population structure can confound case-control studies by the clustering of rare variants in ancestry space; yet, unlike some recent studies, for these data we found that principal component-based analyses were sufficient to control for ancestry and produce test statistics with appropriate distributions. After using a variety of gene-based tests and both meta- and mega-analysis, we found no new risk genes for ASD in this sample. Our results suggest that standard gene-based tests will require much larger samples of cases and controls before being effective for gene discovery, even for a disorder like ASD.
This study evaluates association of rare variants and autism spectrum disorders (ASD) in case and control samples sequenced by two centers. Before doing association analyses, we studied how to combine information across studies. We first harmonized the whole-exome sequence (WES) data, across centers, in terms of the distribution of rare variation. Key features included filtering called variants by fraction of missing data, read depth, and balance of alternative to reference reads. After filtering, the vast majority of variant calls from seven samples sequenced at both centers matched. We also evaluated whether one should combine summary statistics from data from each center (meta-analysis) or combine data and analyze it together (mega-analysis). For many gene-based tests, we showed that mega-analysis yields more power. After quality control of data from 1,039 ASD cases and 870 controls and a range of analyses, no gene showed exome-wide evidence of significant association. Our results comport with recent results demonstrating that hundreds of genes affect risk for ASD; they suggest that rare risk variants are scattered across these many genes, and thus larger samples will be required to identify those genes.
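The three filtering criteria named in this abstract (fraction of missing data, read depth, and balance of alternative to reference reads) could be applied along the following lines. This is only a sketch: the record layout, the function name, and every threshold are hypothetical, since the abstract does not state the cut-offs used, and allele balance is computed here as alt / (alt + ref), which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical per-variant summary; field names are illustrative."""
    variant_id: str
    missing_fraction: float  # share of samples with no genotype call
    mean_depth: float        # mean read depth at the site
    alt_reads: int           # reads supporting the alternative allele
    ref_reads: int           # reads supporting the reference allele


def passes_qc(call, max_missing=0.1, min_depth=10.0, balance_range=(0.3, 0.7)):
    """Return True if a call passes all three filters.

    The default thresholds are placeholders chosen for illustration,
    not values taken from the study.
    """
    total_reads = call.alt_reads + call.ref_reads
    if total_reads == 0:
        return False  # no reads, so allele balance is undefined
    balance = call.alt_reads / total_reads
    low, high = balance_range
    return (
        call.missing_fraction <= max_missing
        and call.mean_depth >= min_depth
        and low <= balance <= high
    )


def filter_calls(calls, **thresholds):
    """Keep only the calls that pass QC, preserving input order."""
    return [call for call in calls if passes_qc(call, **thresholds)]
```

Applying the same filter to data from both centers is one way to make the distribution of rare variation comparable before the data are pooled.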
|
1650
|
Perdry H, Müller-Myhsok B, Clerget-Darpoux F. Using Affected Sib-Pairs to Uncover Rare Disease Variants. Hum Hered 2013; 74:129-41. [DOI: 10.1159/000346788] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
|