151
|
Bacanu SA. On optimal gene-based analysis of genome scans. Genet Epidemiol 2012; 36:333-9. [PMID: 22508187 DOI: 10.1002/gepi.21625] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 09/21/2011] [Revised: 12/22/2011] [Accepted: 01/30/2012] [Indexed: 11/06/2022]
Abstract
Univariate analysis of markers has modest power when there are multiple causal variants within a gene. Under this scenario, combining the effects of all variants from a gene in a gene-wide statistic is thought to increase power. However, it is unclear (1) how the most commonly used gene-wide methods perform in whole-genome scans and (2) how well these methods scale to more computationally intensive analyses, e.g. analysis of genome-wide sequence data. We attempt to answer these questions by using realistic simulations to assess the performance of a range of gene-based methods: (1) commonly used, e.g. VEGAS and GATES; (2) less commonly used, e.g. Simes, adaptive sum (aSUM), and kernel methods; and (3) a combination of univariate and multivariate tests we proposed for the analysis of markers in linkage disequilibrium. Simes is the fastest method and has good power for single causal variant models. The aSUM method has good power for multiple causal variant models, especially at lower gene lengths. Our proposed statistic yields good power for all causal models. Given the extreme data volumes coming from sequencing studies, we recommend a two-step analysis of genome scans. The initial step uses the very fast Simes procedure to flag possibly interesting genes. The second step refines interesting signals by using more computationally intensive methods, e.g. (1) aSUM for shorter and (2) VEGAS for larger gene lengths. Alternatively, genome scans can be analyzed using only our proposed method while sacrificing only a modest amount of power.
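As a rough illustration of the screening step described in this abstract, the Simes procedure combines the per-marker p-values of a gene into one gene-level p-value by taking the minimum of m·p(k)/k over the sorted p-values. The sketch below is not the author's code; the function names, the gene labels, and the 0.05 threshold are assumptions made for the example.

```python
def simes_pvalue(pvalues):
    """Combine the per-marker p-values of one gene with the Simes procedure."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    ordered = sorted(pvalues)
    m = len(ordered)
    # Simes: minimum over ranks k of m * p_(k) / k, capped at 1.
    return min(1.0, min(m * p / k for k, p in enumerate(ordered, start=1)))


def flag_genes(gene_pvalues, threshold):
    """Screening step: keep genes whose Simes p-value is at or below threshold."""
    return [gene for gene, ps in gene_pvalues.items()
            if simes_pvalue(ps) <= threshold]


# Hypothetical per-marker p-values for two genes (illustrative numbers only).
example = {
    "GENE_A": [0.001, 0.20, 0.65],
    "GENE_B": [0.30, 0.45, 0.80],
}
flagged = flag_genes(example, threshold=0.05)  # only GENE_A passes the screen
```

Only the flagged genes would then be passed to a slower method in a second step; which method to use there depends on gene length, as the abstract notes.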
|
152
|
Combined linkage and association mapping reveals CYCD5;1 as a quantitative trait gene for endoreduplication in Arabidopsis. Proc Natl Acad Sci U S A 2012; 109:4678-83. [PMID: 22392991 DOI: 10.1073/pnas.1120811109] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Indexed: 01/08/2023] Open
Abstract
Endoreduplication is the process where a cell replicates its genome without mitosis and cytokinesis, often followed by cell differentiation. This alternative cell cycle results in various levels of endoploidy, reaching 4× or higher one haploid set of chromosomes. Endoreduplication is found in animals and is widespread in plants, where it plays a major role in cellular differentiation and plant development. Here, we show that variation in endoreduplication between Arabidopsis thaliana accessions Columbia-0 and Kashmir is controlled by two major quantitative trait loci, ENDO-1 and ENDO-2. A local candidate gene association analysis in a set of 87 accessions, combined with expression analysis, identified CYCD5;1 as the most likely candidate gene underlying ENDO-2, operating as a rate-determining factor of endoreduplication. In accordance, both the overexpression and silencing of CYCD5;1 were effective in changing DNA ploidy levels, confirming CYCD5;1 to be a previously undescribed quantitative trait gene underlying endoreduplication in Arabidopsis.
|
153
|
Daye ZJ, Li H, Wei Z. A powerful test for multiple rare variants association studies that incorporates sequencing qualities. Nucleic Acids Res 2012; 40:e60. [PMID: 22262732 PMCID: PMC3340416 DOI: 10.1093/nar/gks024] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Indexed: 11/24/2022] Open
Abstract
Next-generation sequencing data will soon become routinely available for association studies between complex traits and rare variants. Sequencing data, however, are characterized by the presence of sequencing errors at each individual genotype. This makes it especially challenging to perform association studies of rare variants, which, due to their low minor allele frequencies, can be easily perturbed by genotype errors. In this article, we develop the quality-weighted multivariate score association test (qMSAT), a new procedure that allows powerful association tests between complex traits and multiple rare variants in the presence of sequencing errors. Simulation results based on quality scores from real data show that the qMSAT often dominates current methods that do not utilize quality information. In particular, the qMSAT can dramatically increase power over existing methods under moderate sample sizes and relatively low coverage. Moreover, in an obesity data study, using the qMSAT we identified two functional regions (MGLL promoter and MGLL 3′-untranslated region) where rare variants are associated with extreme obesity. Due to the high cost of sequencing data, the qMSAT is especially valuable for large-scale studies involving rare variants, as it can potentially increase power without additional experimental cost. qMSAT is freely available at http://qmsat.sourceforge.net/.
Affiliation(s)
- Z John Daye
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA
|
154
|
Pongpanich M, Neely ML, Tzeng JY. On the Aggregation of Multimarker Information for Marker-Set and Sequencing Data Analysis: Genotype Collapsing vs. Similarity Collapsing. Front Genet 2012; 2:110. [PMID: 22303404 PMCID: PMC3266618 DOI: 10.3389/fgene.2011.00110] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 09/15/2011] [Accepted: 12/25/2011] [Indexed: 12/12/2022] Open
Abstract
Methods that collapse information across genetic markers when searching for association signals are gaining momentum in the literature. Although originally developed to achieve a better balance between retaining information and controlling degrees of freedom when performing multimarker association analysis, these methods have recently been proven to be a powerful tool for identifying rare variants that contribute to complex phenotypes. The information among markers can be collapsed at the genotype level, which focuses on the mean of genetic information, or the similarity level, which focuses on the variance of genetic information. The aim of this work is to understand the strengths and weaknesses of these two collapsing strategies. Our results show that neither collapsing strategy outperforms the other across all simulated scenarios. Two factors that dominate the performance of these strategies are the signal-to-noise ratio and the underlying genetic architecture of the causal variants. Genotype collapsing is more sensitive to the marker set being contaminated by noise loci than similarity collapsing. In addition, genotype collapsing performs best when the genetic architecture of the causal variants is not complex (e.g., causal loci with similar effects and similar frequencies). Similarity collapsing is more robust as the complexity of the genetic architecture increases and outperforms genotype collapsing when the genetic architecture of the marker set becomes more sophisticated (e.g., causal loci with various effect sizes or frequencies and potential non-linear or interactive effects). Because the underlying genetic architecture is not known a priori, we also considered a two-stage analysis that combines the two top-performing methods from different collapsing strategies. We find that it is reasonably robust across all simulated scenarios.
Affiliation(s)
- Monnat Pongpanich
- Bioinformatics Research Center, North Carolina State University Raleigh, NC, USA
|
155
|
Design and Statistical Analysis of Pooled Next Generation Sequencing for Rare Variants. Journal of Probability and Statistics 2012. [DOI: 10.1155/2012/524724] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022] Open
Abstract
Next generation sequencing (NGS) is a revolutionary technology for biomedical research. One highly cost-efficient application of NGS is to detect disease association based on pooled DNA samples. However, several key issues need to be addressed for pooled NGS. One of them is the high sequencing error rate and its high variability across genomic positions and experiment runs, which, if not well considered in the experimental design and analysis, could lead to either inflated false positive rates or loss in statistical power. Another important issue is how to test association of a group of rare variants. To address the first issue, we proposed a new blocked pooling design in which multiple pools of DNA samples from cases and controls are sequenced together on the same NGS functional units. To address the second issue, we proposed a testing procedure that does not require individual genotypes but instead takes advantage of multiple DNA pools. Through a simulation study, we demonstrated that our approach provides good control of the type I error rate and yields satisfactory power compared to the test based on individual genotypes. Our results also provide guidelines for designing an efficient pooled.
|
156
|
Li L, Zheng W, Lee JS, Zhang X, Ferguson J, Yan X, Zhao H. Collapsing-based and kernel-based single-gene analyses applied to Genetic Analysis Workshop 17 mini-exome data. BMC Proc 2011; 5 Suppl 9:S117. [PMID: 22373309 PMCID: PMC3287841 DOI: 10.1186/1753-6561-5-s9-s117] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022] Open
Abstract
Recently there has been great interest in identifying rare variants associated with common diseases. We apply several collapsing-based and kernel-based single-gene association tests to Genetic Analysis Workshop 17 (GAW17) rare variant association data with unrelated individuals without knowledge of the simulation model. We also implement modified versions of these methods using additional information, such as minor allele frequency (MAF) and functional annotation. For each of four given traits provided in GAW17, we use the Bayesian mixed-effects model to estimate the phenotypic variance explained by the given environmental and genotypic data and to infer an individual-specific genetic effect to use directly in single-gene association tests. After obtaining information on the GAW17 simulation model, we compare the performance of all methods and examine the top genes identified by those methods. We find that collapsing-based methods with weights based on MAFs are sensitive to the “lower MAF, larger effect size” assumption, whereas kernel-based methods are more robust when this assumption is violated. In addition, many false-positive genes identified by multiple methods often contain variants with exactly the same genotype distribution as the causal variants used in the simulation model. When the sample size is much smaller than the number of rare variants, it is more likely that causal and noncausal variants will share the same or similar genotype distribution. This likely contributes to the low power and large number of false-positive results of all methods in detecting causal variants associated with disease in the GAW17 data set.
Affiliation(s)
- Lun Li
- Division of Biostatistics, Yale School of Public Health, Yale University, 60 College St., PO Box 208034, New Haven, CT 06520-8034, USA; Hubei Bioinformatics and Molecular Imaging Key Laboratory, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Wei Zheng
- Keck Biotechnology Resource Laboratory, Yale University, 300 George St., New Haven, CT 06511, USA
- Joon Sang Lee
- Division of Biostatistics, Yale School of Public Health, Yale University, 60 College St., PO Box 208034, New Haven, CT 06520-8034, USA
- Xianghua Zhang
- Division of Biostatistics, Yale School of Public Health, Yale University, 60 College St., PO Box 208034, New Haven, CT 06520-8034, USA; Department of Electronic Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, China
- John Ferguson
- Division of Biostatistics, Yale School of Public Health, Yale University, 60 College St., PO Box 208034, New Haven, CT 06520-8034, USA
- Xiting Yan
- Division of Biostatistics, Yale School of Public Health, Yale University, 60 College St., PO Box 208034, New Haven, CT 06520-8034, USA
- Hongyu Zhao
- Division of Biostatistics, Yale School of Public Health, Yale University, 60 College St., PO Box 208034, New Haven, CT 06520-8034, USA
|
157
|
Abstract
Gene-based and single-nucleotide polymorphism (SNP) set association studies provide an important complement to SNP analysis. Kernel-based nonparametric regression has recently emerged as a powerful and flexible tool for this purpose. Our goal is to explore whether this approach can be extended to incorporate and test for interaction effects, especially for genes containing rare variant SNPs. Here, we construct nonparametric regression models that can be used to include a gene-environment interaction effect under the framework of the least-squares kernel machine and examine the performance of the proposed method on the Genetic Analysis Workshop 17 unrelated individuals data set. Two hundred simulated replicates were used to explore the power for detecting interaction. We demonstrate through a genome scan of the quantitative phenotype Q1 that the simulated gene-environment interaction effect in the data can be detected with reasonable power by using the least-squares kernel machine method.
|
158
|
Abstract
We found from our analysis of the Genetic Analysis Workshop 17 data that the population structure of the 697 unrelated individuals was an important confounding factor for association studies, even if it was not explicitly considered when simulating the phenotypes. We uncovered structures beyond the reported ethnicities and found ample evidence of phenotype–population structure associations. The first 10 principal components of the genotype data of the 697 individuals demonstrated much stronger associations with Q1, Q2, and the disease than did the individuals’ ethnicities. In addition, we observed that population structure was a confounding factor for the Q1-gene association when identifying the significant genes both with and without adjusting for the causal single-nucleotide polymorphisms, the ethnicities, and the principal components. Many false discoveries remained after adjusting for the causal single-nucleotide polymorphisms. Adjusting for the principal components appeared more effective than did adjusting for ethnicity in terms of preventing false discoveries. This analysis was performed with knowledge of the causal loci.
Affiliation(s)
- Huaizhen Qin
- Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA.
|
159
|
Tzeng JY, Zhang D, Pongpanich M, Smith C, McCarthy MI, Sale MM, Worrall BB, Hsu FC, Thomas DC, Sullivan PF. Studying gene and gene-environment effects of uncommon and common variants on continuous traits: a marker-set approach using gene-trait similarity regression. Am J Hum Genet 2011; 89:277-88. [PMID: 21835306 DOI: 10.1016/j.ajhg.2011.07.007] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.6] [Received: 12/07/2010] [Revised: 06/16/2011] [Accepted: 07/13/2011] [Indexed: 11/15/2022] Open
Abstract
Genomic association analyses of complex traits demand statistical tools that are capable of detecting small effects of common and rare variants and modeling complex interaction effects and yet are computationally feasible. In this work, we introduce a similarity-based regression method for assessing the main genetic and interaction effects of a group of markers on quantitative traits. The method uses genetic similarity to aggregate information from multiple polymorphic sites and integrates adaptive weights that depend on allele frequencies to accommodate common and uncommon variants. Collapsing information at the similarity level instead of the genotype level avoids canceling signals that have the opposite etiological effects and is applicable to any class of genetic variants without the need for dichotomizing the allele types. To assess gene-trait associations, we regress trait similarities for pairs of unrelated individuals on their genetic similarities and assess association by using a score test whose limiting distribution is derived in this work. The proposed regression framework allows for covariates, has the capacity to model both main and interaction effects, can be applied to a mixture of different polymorphism types, and is computationally efficient. These features make it an ideal tool for evaluating associations between phenotype and marker sets defined by linkage disequilibrium (LD) blocks, genes, or pathways in whole-genome analysis.
Affiliation(s)
- Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.
|
160
|
Lin X, Cai T, Wu MC, Zhou Q, Liu G, Christiani DC, Lin X. Kernel machine SNP-set analysis for censored survival outcomes in genome-wide association studies. Genet Epidemiol 2011; 35:620-31. [PMID: 21818772 DOI: 10.1002/gepi.20610] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Received: 09/10/2010] [Revised: 05/06/2011] [Accepted: 06/03/2011] [Indexed: 02/01/2023]
Abstract
In this article, we develop a powerful test for identifying single nucleotide polymorphism (SNP)-sets that are predictive of survival with data from genome-wide association studies. We first group typed SNPs into SNP-sets based on genomic features and then apply a score test to assess the overall effect of each SNP-set on the survival outcome through a kernel machine Cox regression framework. This approach uses genetic information from all SNPs in the SNP-set simultaneously and accounts for linkage disequilibrium (LD), leading to a powerful test with reduced degrees of freedom when the typed SNPs are in LD with each other. This type of test also has the advantage of capturing the potentially nonlinear effects of the SNPs, SNP-SNP interactions (epistasis), and the joint effects of multiple causal variants. By simulating SNP data based on the LD structure of real genes from the HapMap project, we demonstrate that our proposed test is more powerful than the standard single SNP minimum P-value-based test for association studies with censored survival outcomes. We illustrate the proposed test with a real data application.
Affiliation(s)
- Xinyi Lin
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115, USA
|
161
|
Basu S, Pan W. Comparison of statistical tests for disease association with rare variants. Genet Epidemiol 2011; 35:606-19. [PMID: 21769936 DOI: 10.1002/gepi.20609] [Citation(s) in RCA: 188] [Impact Index Per Article: 13.4] [Received: 11/30/2010] [Revised: 03/23/2011] [Accepted: 06/03/2011] [Indexed: 01/31/2023]
Abstract
In anticipation of the availability of next-generation sequencing data, there is increasing interest in investigating association between complex traits and rare variants (RVs). In contrast to association studies for common variants (CVs), due to the low frequencies of RVs, common wisdom suggests that existing statistical tests for CVs might not work, motivating the recent development of several new tests for analyzing RVs, most of which are based on the idea of pooling/collapsing RVs. However, there is a lack of evaluations of, and thus guidance on the use of, existing tests. Here we provide a comprehensive comparison of various statistical tests using simulated data. We consider both independent and correlated rare mutations, and representative tests for both CVs and RVs. As expected, if there are no or few non-causal (i.e. neutral or non-associated) RVs in a locus of interest while the effects of causal RVs on the trait are all (or mostly) in the same direction (i.e. either protective or deleterious, but not both), then the simple pooled association tests (without selecting RVs and their association directions) and a new test called kernel-based adaptive clustering (KBAC) perform similarly and are most powerful; KBAC is more robust than simple pooled association tests in the presence of non-causal RVs; however, as the number of non-causal CVs increases and/or in the presence of opposite association directions, the winners are two methods originally proposed for CVs and a new test called C-alpha test proposed for RVs, each of which can be regarded as testing on a variance component in a random-effects model. Interestingly, several methods based on sequential model selection (i.e. selecting causal RVs and their association directions), including two new methods proposed here, perform robustly and often have statistical power between those of the above two classes.
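The contrast this abstract draws between pooled tests and variance-component tests can be seen with a toy calculation. The sketch below assumes we already have one signed score per variant (for example, the covariance between genotype and trait residual); it uses unstandardised statistics and hypothetical numbers, and is only in the spirit of the tests compared in the paper, not a reimplementation of any of them.

```python
def pooled_statistic(scores):
    """Pooled/collapsing style: aggregate signed scores first, then square."""
    return sum(scores) ** 2


def variance_component_statistic(scores):
    """Variance-component style: square each score first, then aggregate."""
    return sum(s * s for s in scores)


# Hypothetical per-variant scores (illustrative numbers only).
same_direction = [1.5, 2.0, 1.0]       # all variants push the trait one way
opposite_direction = [1.5, -2.0, 0.5]  # protective and deleterious mixed

pooled_same = pooled_statistic(same_direction)            # 20.25
pooled_opposite = pooled_statistic(opposite_direction)    # 0.0, signals cancel
vc_same = variance_component_statistic(same_direction)          # 7.25
vc_opposite = variance_component_statistic(opposite_direction)  # 6.5
```

When effects share a direction the pooled statistic accumulates signal, whereas mixed directions cancel inside the sum; squaring each score before summing keeps the signal regardless of sign.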
Affiliation(s)
- Saonli Basu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455-0392, USA
|
162
|
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 2011; 89:82-93. [PMID: 21737059 DOI: 10.1016/j.ajhg.2011.05.029] [Citation(s) in RCA: 1736] [Impact Index Per Article: 124.0] [Received: 03/16/2011] [Revised: 05/27/2011] [Accepted: 05/30/2011] [Indexed: 01/18/2023] Open
Abstract
Sequencing studies are increasingly being conducted to identify rare variants associated with complex traits. The limited power of classical single-marker association analysis for rare variants poses a central challenge in such studies. We propose the sequence kernel association test (SKAT), a supervised, flexible, computationally efficient regression method to test for association between genetic variants (common and rare) in a region and a continuous or dichotomous trait while easily adjusting for covariates. As a score-based variance-component test, SKAT can quickly calculate p values analytically by fitting the null model containing only the covariates, and so can easily be applied to genome-wide data. Using SKAT to analyze a genome-wide sequencing study of 1000 individuals, by segmenting the whole genome into 30 kb regions, requires only 7 hr on a laptop. Through analysis of simulated data across a wide range of practical scenarios and triglyceride data from the Dallas Heart Study, we show that SKAT can substantially outperform several alternative rare-variant association tests. We also provide analytic power and sample-size calculations to help design candidate-gene, whole-exome, and whole-genome sequence association studies.
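A minimal sketch of a score-based variance-component statistic of the kind described here, for a continuous trait with a weighted linear kernel: Q is the weighted sum over variants of the squared genotype-residual score. To stay short it assumes an intercept-only null model (no covariates) and flat weights by default, and it stops at the statistic; the analytic p-value step is not reproduced. The function name `skat_like_q` and the toy data are assumptions for illustration, not the SKAT software's API.

```python
def skat_like_q(genotypes, trait, weights=None):
    """Q = sum_j w_j * (sum_i g_ij * r_i)^2 with r_i = y_i - mean(y).

    genotypes: list of rows, one per individual, holding minor-allele
               counts (0, 1 or 2) for each variant in the region.
    trait:     list of continuous trait values, one per individual.
    weights:   optional per-variant weights (assumed flat if omitted).
    """
    n = len(trait)
    n_variants = len(genotypes[0])
    if weights is None:
        weights = [1.0] * n_variants
    # Null model with intercept only: residuals are deviations from the mean.
    mean = sum(trait) / n
    residuals = [y - mean for y in trait]
    q = 0.0
    for j in range(n_variants):
        score = sum(genotypes[i][j] * residuals[i] for i in range(n))
        q += weights[j] * score ** 2
    return q


# Toy data: four individuals, two variants (illustrative numbers only).
toy_genotypes = [[0, 1], [1, 0], [2, 1], [0, 0]]
toy_trait = [1.0, 2.0, 3.0, 2.0]
q_flat = skat_like_q(toy_genotypes, toy_trait)
q_weighted = skat_like_q(toy_genotypes, toy_trait, weights=[0.5, 2.0])
```

Because only the null model is fitted, the residuals are computed once and reused for every region, which is what makes this style of test cheap to apply genome-wide.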
Affiliation(s)
- Michael C Wu
- Department of Biostatistics, The University of North Carolina at Chapel Hill, 27599, USA
|
163
|
Basu S, Pan W, Oetting WS. A dimension reduction approach for modeling multi-locus interaction in case-control studies. Hum Hered 2011; 71:234-45. [PMID: 21734407 DOI: 10.1159/000328842] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 11/05/2010] [Accepted: 04/12/2011] [Indexed: 01/01/2023] Open
Abstract
Studying one locus or one single nucleotide polymorphism (SNP) at a time may not be sufficient to understand complex diseases because they are unlikely to result from the effect of only one SNP. Each SNP alone may have little or no effect on the risk of the disease, but together they may increase the risk substantially. Analyses focusing on individual SNPs ignore the possibility of interaction among SNPs. In this paper, we propose a parsimonious model to assess the joint effect of a group of SNPs in a case-control study. The model implements a data reduction strategy within a likelihood framework and uses a test to assess the statistical significance of the effect of the group of SNPs on the binary trait. The primary advantage of the proposed approach is that the dimension reduction technique produces a test statistic with degrees of freedom significantly lower than a multiple logistic regression with only main effects of the SNPs, and our parsimonious model can incorporate the possibility of interaction among the SNPs. Moreover, the proposed approach estimates the direction of association of each SNP with the disease and provides an estimate of the average effect of the group of SNPs positively and negatively associated with the disease in the given SNP set. We illustrate the proposed model on simulated and real data, and compare its performance with a few other existing approaches. Our proposed approach appeared to outperform the other approaches for independent SNPs in our simulation studies.
Affiliation(s)
- Saonli Basu
- Division of Biostatistics, University of Minnesota, Minneapolis, USA. saonli @ umn.edu
|
164
|
Chun H, Ballard DH, Cho J, Zhao H. Identification of association between disease and multiple markers via sparse partial least-squares regression. Genet Epidemiol 2011; 35:479-86. [PMID: 21678491 DOI: 10.1002/gepi.20596] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 11/11/2010] [Revised: 03/29/2011] [Accepted: 04/19/2011] [Indexed: 11/08/2022]
Abstract
Although genome-wide association studies have led to the identifications of hundreds of genes underlying dozens of traits in recent years, most published studies have primarily used single marker-based analysis. Intuitively, more information may be utilized when multiple markers are jointly analyzed. Therefore, many methods have been proposed in the literature for association analysis between traits and multiple markers. Among these methods, simulation and real data analyses have shown that it is often more effective to reduce the dimensionality of the markers in a region through principal components analysis of all the markers first, and then to perform association analysis between traits and those principal components that account for most of the genetic variations in the region. However, one major limitation of this approach is that the principal components are derived purely from marker genotypes, without consideration of their relevance to traits. Furthermore, these components are constructed as linear combinations of all the markers even when only a limited number are potentially relevant to traits. In this manuscript, we propose the use of sparse partial least-squares regression to derive the components that are linear combinations of only relevant markers. This approach is able to use information from both traits and marker genotypes. Extensive simulations and real data analyses on a Crohn's disease data set suggest the superiority of this approach over existing methods.
Affiliation(s)
- Hyonho Chun
- Department of Epidemiology and Public Health, Yale University, 300 George Street 503, New Haven, CT 06511, USA.
|
165
|
Gusev A, Kenny EE, Lowe JK, Salit J, Saxena R, Kathiresan S, Altshuler DM, Friedman JM, Breslow JL, Pe'er I. DASH: a method for identical-by-descent haplotype mapping uncovers association with recent variation. Am J Hum Genet 2011; 88:706-717. [PMID: 21620352 DOI: 10.1016/j.ajhg.2011.04.023] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.6] [Received: 02/21/2011] [Revised: 04/13/2011] [Accepted: 04/26/2011] [Indexed: 02/01/2023] Open
Abstract
Rare variants affecting phenotype pose a unique challenge for human genetics. Although genome-wide association studies have successfully detected many common causal variants, they are underpowered in identifying disease variants that are too rare or population-specific to be imputed from a general reference panel and thus are poorly represented on commercial SNP arrays. We set out to overcome these challenges and detect association between disease and rare alleles using SNP arrays by relying on long stretches of genomic sharing that are identical by descent. We have developed an algorithm, DASH, which builds upon pairwise identical-by-descent shared segments to infer clusters of individuals likely to be sharing a single haplotype. DASH constructs a graph with nodes representing individuals and links on the basis of such segments spanning a locus and uses an iterative minimum cut algorithm to identify densely connected components. We have applied DASH to simulated data and diverse GWAS data sets by constructing haplotype clusters and testing them for association. In simulations we show this approach to be significantly more powerful than single-marker testing in an isolated population that is from Kosrae, Federated States of Micronesia and has abundant IBD, and we provide orthogonal information for rare, recent variants in the outbred Wellcome Trust Case-Control Consortium (WTCCC) data. In both cohorts, we identified a number of haplotype associations, five such loci in the WTCCC data and ten in the isolated, that were conditionally significant beyond any individual nearby markers. We have replicated one of these loci in an independent European cohort and identified putative structural changes in low-pass whole-genome sequence of the cluster carriers.
Affiliation(s)
- Alexander Gusev
- Department of Computer Science, Columbia University, New York, NY 10027, USA
| | - Eimear E Kenny
- Department of Computer Science, Columbia University, New York, NY 10027, USA; Medical Sciences and Human Genetics, Rockefeller University, New York, NY 10065, USA
| | - Jennifer K Lowe
- Department of Molecular Biology, Massachusetts General Hospital, Boston, MA 02114, USA; Program in Medical and Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
| | - Jaqueline Salit
- Medical Sciences and Human Genetics, Rockefeller University, New York, NY 10065, USA
| | - Richa Saxena
- Program in Medical and Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
| | - Sekar Kathiresan
- Program in Medical and Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Cardiovascular Disease Prevention Center, Cardiology Division, Department of Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
| | - David M Altshuler
- Program in Medical and Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Center for Human Genetic Research and Department of Molecular Biology, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
| | - Jeffrey M Friedman
- Medical Sciences and Human Genetics, Rockefeller University, New York, NY 10065, USA
| | - Jan L Breslow
- Medical Sciences and Human Genetics, Rockefeller University, New York, NY 10065, USA
| | - Itsik Pe'er
- Department of Computer Science, Columbia University, New York, NY 10027, USA.
|
166
|
Abstract
Genomic data provide a valuable source of information for modeling covariance structures, allowing a more accurate prediction of total genetic values (GVs). We apply the kriging concept, originally developed in the geostatistical context for predictions in the low-dimensional space, to the high-dimensional space spanned by genomic single nucleotide polymorphism (SNP) vectors and study its properties in different gene-action scenarios. Two different kriging methods [“universal kriging” (UK) and “simple kriging” (SK)] are presented. As a novelty, we suggest use of the family of Matérn covariance functions to model the covariance structure of SNP vectors. A genomic best linear unbiased prediction (GBLUP) is applied as a reference method. The three approaches are compared in a whole-genome simulation study considering additive, additive-dominance, and epistatic gene-action models. Predictive performance is measured in terms of correlation between true and predicted GVs and average true GVs of the individuals ranked best by prediction. We show that UK outperforms GBLUP in the presence of dominance and epistatic effects. In a limiting case, it is shown that the genomic covariance structure proposed by VanRaden (2008) can be considered as a covariance function with corresponding quadratic variogram. We also prove theoretically that if a specific linear relationship exists between covariance matrices for two linear mixed models, the GVs resulting from BLUP are linked by a scaling factor. Finally, the relation of kriging to other models is discussed and further options for modeling the covariance structure, which might be more appropriate in the genomic context, are suggested.
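As a rough illustration of the covariance-function idea described in this abstract, the sketch below builds a Matérn covariance from Euclidean distances between SNP vectors and uses it for a simple-kriging-style prediction with an assumed known mean of zero. The smoothness value (ν = 3/2, chosen here only because it has a closed form), the range parameter `rho`, the nugget, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def matern32(dist, rho=1.0, sigma2=1.0):
    """Matern covariance with smoothness nu = 3/2 (closed form)."""
    a = np.sqrt(3.0) * dist / rho
    return sigma2 * (1.0 + a) * np.exp(-a)

def pairwise_dist(A, B):
    """Euclidean distances between the rows of A and the rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def simple_kriging(X_train, y_train, X_new, rho=1.0, nugget=1e-6):
    """Predict centered values at X_new, assuming a known mean of zero."""
    K = matern32(pairwise_dist(X_train, X_train), rho)
    K = K + nugget * np.eye(len(X_train))   # hypothetical nugget for numerical stability
    k_new = matern32(pairwise_dist(X_new, X_train), rho)
    weights = np.linalg.solve(K, y_train)   # K^{-1} y
    return k_new @ weights

# Toy genotypes coded 0/1/2 (rows = individuals, columns = SNPs); values are made up.
X_train = np.array([[0, 1, 2, 0], [2, 0, 1, 1], [1, 1, 0, 2], [0, 2, 2, 1]], dtype=float)
y_train = np.array([0.5, -1.0, 0.2, 0.3])   # centered genetic values (synthetic)
X_new = np.array([[0, 1, 2, 1]], dtype=float)
print(simple_kriging(X_train, y_train, X_new))
```

With a very small nugget the predictor nearly interpolates the training values, which is a convenient sanity check; in practice the covariance parameters would have to be estimated from the data.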
|
167
|
Fridley BL, Biernacka JM. Gene set analysis of SNP data: benefits, challenges, and future directions. Eur J Hum Genet 2011; 19:837-43. [PMID: 21487444 DOI: 10.1038/ejhg.2011.57] [Citation(s) in RCA: 108] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023] Open
Abstract
The last decade of human genetic research witnessed the completion of hundreds of genome-wide association studies (GWASs). However, the genetic variants discovered through these efforts account for only a small proportion of the heritability of complex traits. One explanation for the missing heritability is that the common analysis approach, assessing the effect of each single-nucleotide polymorphism (SNP) individually, is not well suited to the detection of small effects of multiple SNPs. Gene set analysis (GSA) is one of several approaches that may contribute to the discovery of additional genetic risk factors for complex traits. Complex phenotypes are thought to be controlled by networks of interacting biochemical and physiological pathways influenced by the products of sets of genes. By assessing the overall evidence of association of a phenotype with all measured variation in a set of genes, GSA may identify functionally relevant sets of genes corresponding to relevant biomolecular pathways, which will enable more focused studies of genetic risk factors. This approach may thus contribute to the discovery of genetic variants responsible for some of the missing heritability. With the increased use of these approaches for the secondary analysis of data from GWAS, it is important to understand the different GSA methods and their strengths and weaknesses, and consider challenges inherent in these types of analyses. This paper provides an overview of GSA, highlighting the key challenges, potential solutions, and directions for ongoing research.
Affiliation(s)
- Brooke L Fridley
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA.
|
168
|
Platelet CD36 surface expression levels affect functional responses to oxidized LDL and are associated with inheritance of specific genetic polymorphisms. Blood 2011; 117:6355-66. [PMID: 21478428 DOI: 10.1182/blood-2011-02-338582] [Citation(s) in RCA: 84] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
CD36 modulates platelet function via binding to oxidized LDL (oxLDL), cell-derived microparticles, and thrombospondin-1. We hypothesized that the level of platelet CD36 expression may be associated with inheritance of specific genetic polymorphisms and that this would determine platelet reactivity to oxLDL. Analysis of more than 500 subjects revealed that CD36 expression levels were consistent in individual donors over time but varied widely among donors (200-14,000 molecules per platelet). Platelet aggregometry and flow cytometry in a subset of subjects with various CD36 expression levels revealed a high level of correlation (r² = 0.87) between platelet activation responses to oxLDL and level of CD36 expression. A genome-wide association study of 374 white subjects from the Cleveland Clinic ASCLOGEN study showed strong associations of single nucleotide polymorphisms in CD36 with platelet surface CD36 expression. Most of these findings were replicated in a smaller subset of 25 black subjects. An innovative gene-based genome-wide scan provided further evidence that single nucleotide polymorphisms in CD36 were strongly associated with CD36 expression. These studies show that CD36 expression on platelets varies widely, correlates with functional responses to oxLDL, and is associated with inheritance of specific CD36 genetic polymorphisms, and suggest that inheritance of specific CD36 polymorphisms could affect thrombotic risk.
|
169
|
Tang H, Dong X, Hassan M, Abbruzzese JL, Li D, Askari F, Su GL, Lok AS, Marrero JA. Body mass index and obesity- and diabetes-associated genotypes and risk for pancreatic cancer. Cancer Epidemiol Biomarkers Prev 2011. [PMID: 21357378 DOI: 10.1158/1055-9965] [Citation(s) in RCA: 269] [Impact Index Per Article: 19.2] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND The genetic factors predisposing individuals with obesity or diabetes to pancreatic cancer have not been identified. AIMS To investigate the hypothesis that obesity- and diabetes-related genes modify the risk of pancreatic cancer. METHODS We genotyped 15 single nucleotide polymorphisms of fat mass and obesity-associated (FTO), peroxisome proliferator-activated receptor gamma (PPARγ), nuclear receptor family 5 member 2 (NR5A2), AMPK, and ADIPOQ genes in 1,070 patients with pancreatic cancer and 1,175 cancer-free controls. Information on risk factors was collected by personal interview. Adjusted ORs (AOR) and 95% CIs were calculated using unconditional logistic regression. RESULTS The PPARγ P12A GG genotype was inversely associated with risk of pancreatic cancer (AOR, 0.21; 95% CI, 0.07-0.62). Three NR5A2 variants that were previously identified in a genome-wide association study were significantly associated with reduced risk of pancreatic cancer, with AORs ranging from 0.57 to 0.79. Two FTO gene variants and one ADIPOQ variant were differentially associated with pancreatic cancer according to levels of body mass index (BMI; P(interaction) = 0.0001, 0.0015, and 0.03). For example, the AOR (95% CI) for FTO IVS1-2777AC/AA genotype was 0.72 (0.55-0.96) and 1.54 (1.14-2.09) in participants with a BMI of less than 25 kg/m² or 25 kg/m² or more, respectively. We observed no significant association between AMPK genotype and pancreatic cancer and no genotype interactions with diabetes or smoking. CONCLUSION Our findings suggest the PPARγ P12A GG genotype and NR5A2 variants may reduce the risk for pancreatic cancer. A positive association of FTO and ADIPOQ gene variants with pancreatic cancer may be limited to persons who are overweight. IMPACT The discovery of genetic factors modifying the risk of pancreatic cancer may help to identify high-risk individuals for prevention efforts.
Affiliation(s)
- Hongwei Tang
- Department of Gastrointestinal Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
|
170
|
Shriner D, Vaughan LK. A unified framework for multi-locus association analysis of both common and rare variants. BMC Genomics 2011; 12:89. [PMID: 21281506 PMCID: PMC3040731 DOI: 10.1186/1471-2164-12-89] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2010] [Accepted: 01/31/2011] [Indexed: 11/10/2022] Open
Abstract
Background: Common, complex diseases are hypothesized to result from a combination of common and rare genetic variants. We developed a unified framework for the joint association testing of both types of variants. Within the framework, we developed a union-intersection test suitable for genome-wide analysis of single nucleotide polymorphisms (SNPs), candidate gene data, as well as medical sequencing data. The union-intersection test is a composite test of association of genotype frequencies and differential correlation among markers. Results: We demonstrated by computer simulation that the false positive error rate was controlled at the expected level. We also demonstrated scenarios in which the multi-locus test was more powerful than traditional single marker analysis. To illustrate use of the union-intersection test with real data, we analyzed a publicly available data set of 319,813 autosomal SNPs genotyped for 938 cases of Parkinson disease and 863 neurologically normal controls for which no genome-wide significant results were found by traditional single marker analysis. We also analyzed an independent follow-up sample of 183 cases and 248 controls for replication. Conclusions: We identified a single risk haplotype with a directionally consistent effect in both samples in the gene GAK, which is involved in clathrin-mediated membrane trafficking. We also found suggestive evidence that directionally inconsistent marginal effects from single marker analysis appeared to result from risk being driven by different haplotypes in the two samples for the genes SYN3 and NGLY1, which are involved in neurotransmitter release and proteasomal degradation, respectively. These results illustrate the utility of our unified framework for genome-wide association analysis of common, complex diseases.
Affiliation(s)
- Daniel Shriner
- Center for Research on Genomics and Global Health, National Human Genome Research Institute, Bethesda, MD 20892, USA.
|
171
|
Locke AE, Dooley KJ, Tinker SW, Cheong SY, Feingold E, Allen EG, Freeman SB, Torfs CP, Cua CL, Epstein MP, Wu MC, Lin X, Capone G, Sherman SL, Bean LJH. Variation in folate pathway genes contributes to risk of congenital heart defects among individuals with Down syndrome. Genet Epidemiol 2011; 34:613-23. [PMID: 20718043 DOI: 10.1002/gepi.20518] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Cardiac abnormalities are one of the most common congenital defects observed in individuals with Down syndrome. Considerable research has implicated both folate deficiency and genetic variation in folate pathway genes with birth defects, including both congenital heart defects (CHD) and Down syndrome (DS). Here, we test variation in folate pathway genes for a role in the major DS-associated CHD atrioventricular septal defect (AVSD). In a group of 121 case families (mother, father, and proband with DS and AVSD) and 122 control families (mother, father, and proband with DS and no CHD), tag SNPs were genotyped in and around five folate pathway genes: 5,10-methylenetetrahydrofolate reductase (MTHFR), methionine synthase (MTR), methionine synthase reductase (MTRR), cystathionine beta-synthase (CBS), and the reduced folate carrier (SLC19A1, RFC1). SLC19A1 was found to be associated with AVSD using a multilocus allele-sharing test. Individual SNP tests also showed nominally significant associations with odds ratios of between 1.34 and 3.78, depending on the SNP and genetic model. Interestingly, all marginally significant SNPs in SLC19A1 are in strong linkage disequilibrium (r² ≥ 0.8) with the nonsynonymous coding SNP rs1051266 (c.80A>G), which has previously been associated with nonsyndromic cases of CHD. In addition to SLC19A1, the known functional polymorphism MTHFR c.1298A was over-transmitted to cases with AVSD (P=0.05) and under-transmitted to controls (P=0.02). We conclude, therefore, that disruption of the folate pathway contributes to the incidence of AVSD among individuals with DS.
Affiliation(s)
- Adam E Locke
- Department of Human Genetics, Emory University School of Medicine, Atlanta, Georgia, USA
|
172
|
Lourenço VM, Pires AM, Kirst M. Robust linear regression methods in association studies. Bioinformatics 2011; 27:815-21. [PMID: 21217123 DOI: 10.1093/bioinformatics/btr006] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
MOTIVATION It is well known that data deficiencies, such as coding/rounding errors, outliers or missing values, may lead to misleading results for many statistical methods. Robust statistical methods are designed to accommodate certain types of those deficiencies, allowing for reliable results under various conditions. We analyze the case of statistical tests to detect associations between genomic individual variations (SNP) and quantitative traits when deviations from the normality assumption are observed. We consider the classical analysis of variance tests for the parameters of the appropriate linear model and a robust version of those tests based on M-regression. We then compare their empirical power and level using simulated data with several degrees of contamination. RESULTS Data normality is nothing but a mathematical convenience. In practice, experiments usually yield data with non-conforming observations. In the presence of this type of data, classical least squares statistical methods perform poorly, giving biased estimates, raising the number of spurious associations and often failing to detect true ones. We show through a simulation study and a real data example, that the robust methodology can be more powerful and thus more adequate for association studies than the classical approach. AVAILABILITY The code of the robustified version of function lmekin() from the R package kinship is provided as Supplementary Material.
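To make the contrast between least squares and M-regression concrete, here is a minimal sketch of a Huber M-estimator fitted by iteratively reweighted least squares for a quantitative trait regressed on a genotype coded 0/1/2. It is not the robustified `lmekin()` code mentioned in the abstract; the function name, the tuning constant 1.345, the MAD-based scale, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50, tol=1e-8):
    """Huber M-estimate of regression coefficients via iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # start from ordinary least squares
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust residual scale: normalized median absolute deviation
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        scale = max(scale, 1e-12)
        u = np.abs(resid) / scale
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))   # Huber weights
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Synthetic example: true slope 0.5, with five contaminated observations.
g = np.tile([0.0, 1.0, 2.0], 20)                 # genotypes for 60 individuals
y = 1.0 + 0.5 * g + 0.1 * np.sin(np.arange(60))  # deterministic pseudo-noise
y[np.where(g == 2.0)[0][:5]] += 10.0             # outliers among genotype-2 carriers
X = np.column_stack([np.ones(60), g])
print("OLS slope:  ", np.linalg.lstsq(X, y, rcond=None)[0][1])
print("Huber slope:", huber_irls(X, y)[1])
```

In this toy setting the outliers sit in one genotype group, so they pull the least-squares slope away from the simulated value, while the Huber weights bound their influence.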
Affiliation(s)
- V M Lourenço
- Department of Mathematics, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal.
|
173
|
Wang K, Li M, Hakonarson H. Analysing biological pathways in genome-wide association studies. Nat Rev Genet 2010; 11:843-54. [PMID: 21085203 DOI: 10.1038/nrg2884] [Citation(s) in RCA: 581] [Impact Index Per Article: 38.7] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
Genome-wide association (GWA) studies have typically focused on the analysis of single markers, which often lacks the power to uncover the relatively small effect sizes conferred by most genetic variants. Recently, pathway-based approaches have been developed, which use prior biological knowledge on gene function to facilitate more powerful analysis of GWA study data sets. These approaches typically examine whether a group of related genes in the same functional pathway are jointly associated with a trait of interest. Here we review the development of pathway-based approaches for GWA studies, discuss their practical use and caveats, and suggest that pathway-based approaches may also be useful for future GWA studies with sequencing data.
Affiliation(s)
- Kai Wang
- Center for Applied Genomics, The Children's Hospital of Philadelphia, Pennsylvania 19104, USA
|
174
|
Taylor K, Small C, Epstein M, Sherman S, Tang W, Wilson M, Bouzyk M, Marcus M. Associations of progesterone receptor polymorphisms with age at menarche and menstrual cycle length. Horm Res Paediatr 2010; 74:421-7. [PMID: 20814185 PMCID: PMC3021500 DOI: 10.1159/000316961] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/28/2010] [Accepted: 06/11/2010] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Age at menarche and menstrual cycle characteristics are indicators of endocrine function and may be risk factors for diseases such as reproductive cancers. The progesterone receptor gene (PGR) has been identified as a candidate gene for age at menarche and menstrual function. METHODS Women office workers ages 19-41 self-reported age at menarche and participated in a prospective study of menstrual function and fertility. First-morning urine was used as the DNA source. 444 women were genotyped for a functional variant in PGR, rs1042838 (Val660Leu), and 264 women were also genotyped for 29 other SNPs across the extended gene region. RESULTS Genetic variation across PGR was associated with age at menarche using a global score statistic (p = 0.03 among non-Hispanic whites). Women carrying two copies of the Val660Leu variant experienced menarche 1 year later than women carrying one or no copies of the variant (13.6 ± 0.5 vs. 12.6 ± 0.1; p = 0.03). The Val660Leu variant was also associated with decreased odds of short menstrual cycles (17-24 days) (OR, 95% CI: 0.54 [0.36, 0.80]; p = 0.002). CONCLUSION Genetic variation in PGR was associated with age at menarche and menstrual cycle length in this population. Further investigation of these associations in a replication dataset is warranted.
Affiliation(s)
- K.C. Taylor
- Department of Epidemiology, Emory University, Atlanta, Ga., USA; correspondence: Kira C. Taylor, University of North Carolina at Chapel Hill, 137 E Franklin St., Suite 306, Chapel Hill, NC 27514 (USA), Tel. +1 404 808 1718, Fax +1 919 966 9800
| | - C.M. Small
- Department of Epidemiology, Emory University, Atlanta, Ga., USA
| | - M.P. Epstein
- Department of Human Genetics, Emory University, Atlanta, Ga., USA
| | - S.L. Sherman
- Department of Human Genetics, Emory University, Atlanta, Ga., USA
| | - W. Tang
- Department of Biomarker Service Center, Emory University, Atlanta, Ga., USA
| | - M.M. Wilson
- Department of Biomarker Service Center, Emory University, Atlanta, Ga., USA
| | - M. Bouzyk
- Department of Biomarker Service Center, Emory University, Atlanta, Ga., USA
| | - M. Marcus
- Department of Epidemiology, Emory University, Atlanta, Ga., USA
|
175
|
King CR, Rathouz PJ, Nicolae DL. An evolutionary framework for association testing in resequencing studies. PLoS Genet 2010; 6:e1001202. [PMID: 21085648 PMCID: PMC2978703 DOI: 10.1371/journal.pgen.1001202] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2010] [Accepted: 10/07/2010] [Indexed: 11/17/2022] Open
Abstract
Sequencing technologies are becoming cheap enough to apply to large numbers of study participants and promise to provide new insights into human phenotypes by bringing to light rare and previously unknown genetic variants. We develop a new framework for the analysis of sequence data that incorporates all of the major features of previously proposed approaches, including those focused on allele counts and allele burden, but is both more general and more powerful. We harness population genetic theory to provide prior information on effect sizes and to create a pooling strategy for information from rare variants. Our method, EMMPAT (Evolutionary Mixed Model for Pooled Association Testing), generates a single test per gene (substantially reducing multiple testing concerns), facilitates graphical summaries, and improves the interpretation of results by allowing calculation of attributable variance. Simulations show that, relative to previously used approaches, our method increases the power to detect genes that affect phenotype when natural selection has kept alleles with large effect sizes rare. We demonstrate our approach on a population-based re-sequencing study of association between serum triglycerides and variation in ANGPTL4.
Affiliation(s)
- C Ryan King
- Department of Health Studies, University of Chicago, Chicago, Illinois, United States of America.
|
176
|
Bansal V, Libiger O, Torkamani A, Schork NJ. Statistical analysis strategies for association studies involving rare variants. Nat Rev Genet 2010; 11:773-85. [PMID: 20940738 PMCID: PMC3743540 DOI: 10.1038/nrg2867] [Citation(s) in RCA: 342] [Impact Index Per Article: 22.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
The limitations of genome-wide association (GWA) studies that focus on the phenotypic influence of common genetic variants have motivated human geneticists to consider the contribution of rare variants to phenotypic expression. The increasing availability of high-throughput sequencing technologies has enabled studies of rare variants but these methods will not be sufficient for their success as appropriate analytical methods are also needed. We consider data analysis approaches to testing associations between a phenotype and collections of rare variants in a defined genomic region or set of regions. Ultimately, although a wide variety of analytical approaches exist, more work is needed to refine them and determine their properties and power in different contexts.
Affiliation(s)
- Vikas Bansal
- The Scripps Translational Science Institute, 3344 North Torrey Pines Court, Suite 300, La Jolla, California 92037, USA
|
177
|
Vounou M, Nichols TE, Montana G. Discovering genetic associations with high-dimensional neuroimaging phenotypes: A sparse reduced-rank regression approach. Neuroimage 2010; 53:1147-59. [PMID: 20624472 DOI: 10.1016/j.neuroimage.2010.07.002] [Citation(s) in RCA: 142] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2010] [Revised: 06/24/2010] [Accepted: 07/01/2010] [Indexed: 10/19/2022] Open
Abstract
There is growing interest in performing genome-wide searches for associations between genetic variants and brain imaging phenotypes. While much work has focused on single scalar valued summaries of brain phenotype, accounting for the richness of imaging data requires a brain-wide, genome-wide search. In particular, the standard approach based on mass-univariate linear modelling (MULM) does not account for the structured patterns of correlations present in each domain. In this work, we propose sparse reduced rank regression (sRRR), a strategy for multivariate modelling of high-dimensional imaging responses (measurements taken over regions of interest or individual voxels) and genetic covariates (single nucleotide polymorphisms or copy number variations), which enforces sparsity in the regression coefficients. Such sparsity constraints ensure that the model performs simultaneous genotype and phenotype selection. Using simulation procedures that accurately reflect realistic human genetic variation and imaging correlations, we present detailed evaluations of the sRRR method in comparison with the more traditional MULM approach. In all settings considered, sRRR has better power to detect deleterious genetic variants compared to MULM. Important issues concerning model selection and connections to existing latent variable models are also discussed. This work shows that sRRR offers a promising alternative for detecting brain-wide, genome-wide associations.
Affiliation(s)
- Maria Vounou
- Statistics Section, Department of Mathematics, Imperial College London, UK
|
178
|
Schaid DJ. Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum Hered 2010; 70:109-31. [PMID: 20610906 DOI: 10.1159/000312641] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2009] [Accepted: 03/09/2010] [Indexed: 01/05/2023] Open
Abstract
Measures of genomic similarity are the basis of many statistical analytic methods. We review the mathematical and statistical basis of similarity methods, particularly based on kernel methods. A kernel function converts information for a pair of subjects to a quantitative value representing either similarity (larger values meaning more similar) or distance (smaller values meaning more similar), with the requirement that it must create a positive semidefinite matrix when applied to all pairs of subjects. This review emphasizes the wide range of statistical methods and software that can be used when similarity is based on kernel methods, such as nonparametric regression, linear mixed models and generalized linear mixed models, hierarchical models, score statistics, and support vector machines. The mathematical rigor for these methods is summarized, as is the mathematical framework for making kernels. This review provides a framework to move from intuitive and heuristic approaches to define genomic similarities to more rigorous methods that can take advantage of powerful statistical modeling and existing software. A companion paper reviews novel approaches to creating kernels that might be useful for genomic analyses, providing insights with examples [1].
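The positive semidefinite requirement mentioned in this abstract can be checked numerically. The sketch below, which assumes genotypes coded 0/1/2 and uses hypothetical helper names, builds a linear kernel from centered genotypes and tests whether the smallest eigenvalue is non-negative up to a tolerance. For contrast it applies the same check to a raw Euclidean distance matrix, which has a zero diagonal and therefore cannot pass it.

```python
import numpy as np

def linear_kernel(G):
    """Linear kernel on column-centered genotypes: K = Z Z^T / (number of SNPs)."""
    Z = G - G.mean(axis=0)
    return Z @ Z.T / G.shape[1]

def euclidean_distance_matrix(G):
    """Pairwise Euclidean distances between the rows (subjects) of G."""
    diff = G[:, None, :] - G[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def is_positive_semidefinite(K, tol=1e-10):
    """True if K is symmetric and its smallest eigenvalue is >= -tol."""
    if not np.allclose(K, K.T):
        return False
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

# Toy genotype matrix (rows = subjects, columns = SNPs); values are made up.
G = np.array([[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]], dtype=float)
print(is_positive_semidefinite(linear_kernel(G)))
print(is_positive_semidefinite(euclidean_distance_matrix(G)))
```

The distance matrix fails because its trace is zero while its entries are not, so at least one eigenvalue must be negative; a distance would need to be transformed into a similarity before it could serve as a valid kernel in this sense.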
Affiliation(s)
- Daniel J Schaid
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minn., USA
|
179
|
Nonparametric Bayesian variable selection with applications to multiple quantitative trait loci mapping with epistasis and gene-environment interaction. Genetics 2010; 186:385-94. [PMID: 20551445 DOI: 10.1534/genetics.109.113688] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
The joint action of multiple genes is an important source of variation for complex traits and human diseases. However, mapping genes with epistatic effects and gene-environment interactions is a difficult problem because of relatively small sample sizes and very large parameter spaces for quantitative trait locus models that include such interactions. Here we present a nonparametric Bayesian method to map multiple quantitative trait loci (QTL) by considering epistatic and gene-environment interactions. The proposed method is not restricted to pairwise interactions among genes, as is typically done in parametric QTL analysis. Rather than modeling each main and interaction term explicitly, our nonparametric Bayesian method measures the importance of each QTL, irrespective of whether it is mostly due to a main effect or due to some interaction effect(s), via an unspecified function of the genotypes at all candidate QTL. A Gaussian process prior is assigned to this unknown function. In addition to the candidate QTL, nongenetic factors and covariates, such as age, gender, and environmental conditions, can also be included in the unspecified function. The importance of each genetic factor (QTL) and each nongenetic factor/covariate included in the function is estimated by a single hyperparameter, which enters the covariance function and captures any main or interaction effect associated with a given factor/covariate. An initial evaluation of the performance of the proposed method is obtained via analysis of simulated and real data.
|
180
|
Mukhopadhyay I, Feingold E, Weeks DE, Thalamuthu A. Association tests using kernel-based measures of multi-locus genotype similarity between individuals. Genet Epidemiol 2010; 34:213-21. [PMID: 19697357 DOI: 10.1002/gepi.20451] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/16/2023]
Abstract
In a genetic association study, it is often desirable to perform an overall test of whether any or all single-nucleotide polymorphisms (SNPs) in a gene are associated with a phenotype. Several such tests exist, but most of them are powerful only under very specific assumptions about the genetic effects of the individual SNPs. In addition, some of the existing tests assume that the direction of the effect of each SNP is known, which is a highly unlikely scenario. Here, we propose a new kernel-based association test of joint association of several SNPs. Our test is non-parametric and robust, and does not make any assumption about the directions of individual SNP effects. It can be used to test multiple correlated SNPs within a gene and can also be used to test independent SNPs or genes in a biological pathway. Our test uses an analysis of variance paradigm to compare variation between cases and controls to the variation within the groups. The variation is measured using kernel functions for each marker, and then a composite statistic is constructed to combine the markers into a single test. We present simulation results comparing our statistic to the U-statistic-based method by Schaid et al. ([2005] Am. J. Hum. Genet. 76:780-793) and another statistic by Wessel and Schork ([2006] Am. J. Hum. Genet. 79:792-806). We consider a variety of different disease models and assumptions about how many SNPs within the gene are actually associated with disease. Our results indicate that our statistic has higher power than other statistics under most realistic conditions.
|
181
|
Abstract
GWAS have emerged as popular tools for identifying genetic variants that are associated with disease risk. Standard analysis of a case-control GWAS involves assessing the association between each individual genotyped SNP and disease risk. However, this approach suffers from limited reproducibility and difficulties in detecting multi-SNP and epistatic effects. As an alternative analytical strategy, we propose grouping SNPs together into SNP sets on the basis of proximity to genomic features such as genes or haplotype blocks, then testing the joint effect of each SNP set. Testing of each SNP set proceeds via the logistic kernel-machine-based test, which is based on a statistical framework that allows for flexible modeling of epistatic and nonlinear SNP effects. This flexibility and the ability to naturally adjust for covariate effects are important features of our test that make it appealing in comparison to individual SNP tests and existing multimarker tests. Using simulated data based on the International HapMap Project, we show that SNP-set testing can have improved power over standard individual-SNP analysis under a wide range of settings. In particular, we find that our approach has higher power than individual-SNP analysis when the median correlation between the disease-susceptibility variant and the genotyped SNPs is moderate to high. When the correlation is low, both individual-SNP analysis and the SNP-set analysis tend to have low power. We apply SNP-set analysis to analyze the Cancer Genetic Markers of Susceptibility (CGEMS) breast cancer GWAS discovery-phase data.
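The idea of testing a whole SNP set at once can be sketched with a score-type statistic of the form Q = rᵀKr, where r holds the null-model residuals and K is a kernel matrix over subjects. The version below is a deliberately simplified illustration: it assumes no covariates (so the null fit is just the case fraction), uses a linear kernel, and obtains a p-value by permuting case-control labels rather than from the asymptotic distribution used in the kernel-machine literature. All names and the toy data are hypothetical, and any constant scaling of Q is omitted.

```python
import numpy as np

def snp_set_statistic(y, G):
    """Q = r^T K r with r = y - mean(y) and linear kernel K = G G^T."""
    r = y - y.mean()
    s = G.T @ r                 # one score contribution per SNP
    return float(s @ s)         # equals r^T (G G^T) r without forming K

def permutation_pvalue(y, G, n_perm=999, seed=0):
    """Permutation p-value for Q, shuffling phenotype labels across subjects."""
    rng = np.random.default_rng(seed)
    q_obs = snp_set_statistic(y, G)
    exceed = sum(
        snp_set_statistic(rng.permutation(y), G) >= q_obs for _ in range(n_perm)
    )
    return (exceed + 1) / (n_perm + 1)

# Synthetic SNP set: two SNPs in perfect LD, with every carrier a case.
snp = np.array([0.0] * 20 + [1.0] * 10 + [2.0] * 10)
G = np.column_stack([snp, snp])
y = (snp > 0).astype(float)
print(snp_set_statistic(y, G), permutation_pvalue(y, G))
```

Computing Q through the per-SNP scores avoids building the n-by-n kernel, which only works for the linear kernel; other kernels, such as those intended to capture epistatic effects, would require forming K explicitly.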
|
182
|
Liu CY, Wu MC, Chen F, Ter-Minassian M, Asomaning K, Zhai R, Wang Z, Su L, Heist RS, Kulke MH, Lin X, Liu G, Christiani DC. A large-scale genetic association study of esophageal adenocarcinoma risk. Carcinogenesis 2010; 31:1259-63. [PMID: 20453000 DOI: 10.1093/carcin/bgq092] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.9] [Indexed: 12/20/2022]
Abstract
The incidence of esophageal adenocarcinoma (EA) has been increasing rapidly, particularly among white males, over the past few decades in the USA. However, the etiology of EA and the striking male predominance are not fully explained by known risk factors. To identify susceptibility genes for EA, we conducted a pathway-based candidate gene association study on 335 Caucasian EA cases and 319 Caucasian controls. A total of 1330 single-nucleotide polymorphisms (SNPs) selected from 354 genes were analyzed using an Illumina GoldenGate assay. The genotyped common SNPs include missense and exonic SNPs, SNPs within untranslated regions and 2 kb 5' of the gene, and tagSNPs for genes with little functional information available. Logistic regression adjusted for potential confounders was used to assess the genetic effect of each SNP on EA risk. We also tested gene-gender interactions using likelihood ratio tests. We found that genetic variants in the apoptosis pathway were significantly associated with EA risk after correcting for multiple comparisons. SNPs rs3127075 in the Caspase-7 (CASP7) gene and rs4661636 in the Caspase-9 (CASP9) gene, both of which play a critical role in apoptosis, were found to be associated with an increased risk of EA. A protective effect of SNP rs572483 in the progesterone receptor (PGR) gene was observed among women carrying the variant G allele [adjusted odds ratio (OR) = 0.19; 95% confidence interval (CI) = 0.08-0.46] but was not observed among men (adjusted OR = 1.38; 95% CI = 0.95-2.00). In conclusion, this study suggests that the genetic variants of CASP7 and CASP9 in the apoptosis pathway may be important predictive markers for EA susceptibility and that PGR in the sex hormone signaling pathway may be associated with the gender differences in EA risk.
Affiliation(s)
- Chen-Yu Liu
- Department of Environmental Health, Harvard School of Public Health, Boston, MA 02115, USA
|
183
|
Ballard DH, Cho J, Zhao H. Comparisons of multi-marker association methods to detect association between a candidate region and disease. Genet Epidemiol 2010; 34:201-12. [PMID: 19810024 PMCID: PMC3158797 DOI: 10.1002/gepi.20448] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.3] [Indexed: 02/02/2023]
Abstract
The joint use of information from multiple markers may be more effective in revealing association between a genomic region and a trait than single marker analysis. In this article, we compare the performance of seven multi-marker methods. These methods include (1) single marker analysis (either the best-scoring single nucleotide polymorphism in a candidate region or a combined test based on Fisher's method); (2) fixed effects regression models where the predictors are either the observed genotypes in the region, principal components that explain a proportion of the genetic variation, or predictors based on Fourier transformation for the genotypes; and (3) variance components analysis. In our simulation studies, we consider genetic models where the association is due to one, two, or three markers, and the disease-causing markers have varying allele frequencies. We use information from either all the markers in a region or information only from tagging markers. Our simulation results suggest that when there is one disease-causing variant, the best-scoring marker method is preferred whereas the variance components method and the principal components method work well for more common disease-causing variants. When there is more than one disease-causing variant, the principal components method seems to perform well over all the scenarios studied. When these methods are applied to analyze associations between all the markers in or near a gene and disease status for an inflammatory bowel disease data set, the analysis based on the principal components method leads to biologically more consistent discoveries than other methods.
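As a rough illustration of the "combined test based on Fisher's method" mentioned in this abstract, the sketch below combines per-marker p-values into one regional p-value. It is not the authors' implementation: the function name `fisher_combined_pvalue` is hypothetical, and the chi-square reference distribution assumes independent p-values, which markers in linkage disequilibrium generally violate (the abstract does not say how significance was calibrated).

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine p-values with Fisher's method, assuming they are independent.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null when the k tests are
    independent (an assumption of this sketch).
    """
    if not pvalues or any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    # For even degrees of freedom 2k, the chi-square upper tail is
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check. For correlated markers, a permutation-based reference distribution would be the safer choice.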
Affiliation(s)
- David H Ballard
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06511, USA.
|
184
|
Lindström S, Hunter DJ, Grönberg H, Stattin P, Wiklund F, Xu J, Chanock SJ, Hayes R, Kraft P. Sequence variants in the TLR4 and TLR6-1-10 genes and prostate cancer risk. Results based on pooled analysis from three independent studies. Cancer Epidemiol Biomarkers Prev 2010; 19:873-6. [PMID: 20200442 DOI: 10.1158/1055-9965.epi-09-0618] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Indexed: 11/16/2022]
Abstract
BACKGROUND: Genetic variation in two members of the Toll-like receptor family, TLR4 and the gene cluster TLR6-1-10, has been implicated in prostate cancer in several studies, but the associated alleles have not been consistent across reports. METHODS: We did a pooled analysis combining genotype data from three case-control studies, Cancer of the Prostate in Sweden, the Health Professionals Follow-up Study, and the Prostate, Lung, Colon and Ovarian Cancer Screening Trial, with data from 3,101 prostate cancer cases and 2,523 controls. We did imputation to obtain dense coverage of the genes and comparable genotype data for all cohorts. In total, 58 single nucleotide polymorphisms in TLR4 and 96 single nucleotide polymorphisms in TLR6-1-10 were genotyped or imputed and analyzed in the entire data set. We did a cohort-specific analysis as well as meta-analysis and pooled analysis. We also evaluated whether the analyses differed by age or disease severity. RESULTS: We observed no overall association between genetic variation at the TLR4 and TLR6-1-10 loci and risk of prostate cancer. CONCLUSIONS: Common germ line genetic variation in TLR4 and TLR6-1-10 did not seem to have a strong association with risk of prostate cancer. IMPACT: This study suggests that earlier associations between prostate cancer risk and TLR4 and TLR6-1-10 sequence variants were chance findings. To definitively assess the causal relationship between TLR sequence variants and prostate cancer risk, very large sample sizes are needed.
Affiliation(s)
- Sara Lindström
- Department of Epidemiology, Harvard School of Public Health, Building 2, Room 249B, 655 Huntington Avenue, Boston, MA 02115, USA.
|
185
|
Yu K, Li Q, Bergen AW, Pfeiffer RM, Rosenberg PS, Caporaso N, Kraft P, Chatterjee N. Pathway analysis by adaptive combination of P-values. Genet Epidemiol 2009; 33:700-9. [PMID: 19333968 DOI: 10.1002/gepi.20422] [Citation(s) in RCA: 222] [Impact Index Per Article: 14.8] [Indexed: 11/08/2022]
Abstract
It is increasingly recognized that pathway analysis (a joint test of association between the outcome and a group of single nucleotide polymorphisms (SNPs) within a biological pathway) could potentially complement single-SNP analysis and provide additional insights into the genetic architecture of complex diseases. Building upon existing P-value combining methods, we propose a class of highly flexible pathway analysis approaches based on an adaptive rank truncated product statistic that can effectively combine evidence of associations over different SNPs and genes within a pathway. The statistical significance of the pathway-level test statistics is evaluated using a highly efficient permutation algorithm that remains computationally feasible irrespective of the size of the pathway and complexity of the underlying test statistics for summarizing SNP- and gene-level associations. We demonstrate through simulation studies that a gene-based analysis that treats the underlying genes, as opposed to the underlying SNPs, as the basic units for hypothesis testing, is a very robust and powerful approach to pathway-based association testing. We also illustrate the advantage of the proposed methods using a study of the association between the nicotinic receptor pathway and cigarette smoking behaviors.
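The adaptive rank truncated product idea in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the names `rtp_statistic` and `adaptive_rtp_pvalue` are hypothetical, the permuted p-value sets are assumed to have been computed elsewhere (for example by shuffling the phenotype and re-running the single-SNP tests), and the same permutations are reused both to calibrate each truncation point and to adjust for taking the minimum over truncation points.

```python
import math


def rtp_statistic(pvalues, k):
    """Log of the product of the k smallest p-values (smaller = stronger)."""
    return sum(math.log(p) for p in sorted(pvalues)[:k])


def adaptive_rtp_pvalue(observed, permuted, truncation_points):
    """Adaptive rank truncated product p-value from one layer of permutations.

    observed: list of SNP p-values from the real data.
    permuted: list of lists, one set of SNP p-values per permutation
              (assumed to be supplied by the caller).
    truncation_points: candidate values of k to try.
    """
    replicates = [observed] + permuted  # index 0 is the real data
    n = len(replicates)

    def empirical_pvalues(stats):
        # Fraction of replicates whose statistic is at least as extreme.
        return [sum(1 for other in stats if other <= s) / n for s in stats]

    per_k = [
        empirical_pvalues([rtp_statistic(ps, k) for ps in replicates])
        for k in truncation_points
    ]
    # Best (smallest) empirical p-value over truncation points, per replicate.
    min_p = [min(column[b] for column in per_k) for b in range(n)]
    # Adjust for having picked the best truncation point.
    return sum(1 for m in min_p if m <= min_p[0]) / n
```

With B permutations the smallest attainable p-value is 1/(B+1), so the number of permutations bounds the resolution of the result.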
Affiliation(s)
- Kai Yu
- Division of Cancer Epidemiology and Genetics, NCI, Rockville, Maryland 20892, USA.
|
186
|
Shen YF, Zhu J. Power analysis of principal components regression in genetic association studies. J Zhejiang Univ Sci B 2009; 10:721-30. [PMID: 19816996 DOI: 10.1631/jzus.b0830866] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/11/2022]
Abstract
Association analysis provides an opportunity to find genetic variants underlying complex traits. A principal components regression (PCR)-based approach was shown to outperform some competing approaches. However, a limitation of this method is that the principal components (PCs) selected from single nucleotide polymorphisms (SNPs) may be unrelated to the phenotype. In this article, we investigate the theoretical properties of such a method in more detail. We first derive the exact power function of the test based on PCR, and hence clarify the relationship between the test power and the degrees of freedom (DF). Next, we extend the PCR test to a general weighted PCs test, which provides a unified framework for understanding the properties of some related statistics. We then compare the performance of these tests. We also introduce several data-driven adaptive alternatives to overcome difficulties in the PCR approach. Finally, we illustrate our results using simulations based on real genotype data. The simulation study shows the risk of using an unsupervised rule to determine the number of PCs, and demonstrates that there is no single uniformly powerful method for detecting genetic variants.
Affiliation(s)
- Yan-feng Shen
- Department of Mathematics, Zhejiang University, Hangzhou 310027, China
|
187
|
Feng R. Discussion: Why do we test multiple traits in genetic association studies? J Korean Stat Soc 2009. [DOI: 10.1016/j.jkss.2008.10.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
188
|
Edwards AO, Fridley BL, James KM, Sharma AK, Cunningham JM, Tosakulwong N. Evaluation of clustering and genotype distribution for replication in genome wide association studies: the age-related eye disease study. PLoS One 2008; 3:e3813. [PMID: 19043567 PMCID: PMC2583911 DOI: 10.1371/journal.pone.0003813] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Received: 08/01/2008] [Accepted: 11/06/2008] [Indexed: 01/22/2023]
Abstract
Genome-wide association studies (GWASs) assess correlation between traits and DNA sequence variation using large numbers of genetic variants such as single nucleotide polymorphisms (SNPs) distributed across the genome. A GWAS produces many trait-SNP associations with low p-values, but few are replicated in subsequent studies. We sought to determine if characteristics of the genomic loci associated with a trait could be used to identify initial associations with a higher chance of replication in a second cohort. Data from the age-related eye disease study (AREDS) of 100,000 SNPs on 395 subjects with and 198 without age-related macular degeneration (AMD) were employed. Loci highly associated with AMD were characterized based on the distribution of genotypes, level of significance, and clustering of adjacent SNPs also associated with AMD, suggesting linkage disequilibrium or multiple effects. Forty-nine loci were highly associated with AMD, including 3 loci (CFH, C2/BF, LOC387715/HTRA1) already known to contain important genetic risks for AMD. One additional locus (C3) reported during the course of this study was identified and replicated in an additional study group. Tag-SNPs and haplotypes for each locus were evaluated for association with AMD in additional cohorts to account for population differences between discovery and replication subjects, but no additional clearly significant associations were identified. Relying on a significant genotype test using a log-additive model would have excluded 57% of the non-replicated and none of the replicated loci, while use of other SNP features and clustering might have missed true associations.
Affiliation(s)
- Albert O Edwards
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota, United States of America.
|
189
|
Wei Z, Li M, Rebbeck T, Li H. U-statistics-based tests for multiple genes in genetic association studies. Ann Hum Genet 2008; 72:821-33. [PMID: 18691161 DOI: 10.1111/j.1469-1809.2008.00473.x] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Indexed: 11/30/2022]
Abstract
As our understanding of biological pathways and the genes that regulate these pathways increases, consideration of these biological pathways has become an increasingly important part of genetic and molecular epidemiology. Pathway-based genetic association studies often involve genotyping of variants in genes acting in certain biological pathways. Such pathway-based genetic association studies can potentially capture the highly heterogeneous nature of many complex traits, with multiple causative loci and multiple alleles at some of the causative loci. In this paper, we develop two nonparametric test statistics that consider simultaneously the effects of multiple markers. Our approach, which is based on data-adaptive U-statistics, can handle both qualitative data such as case-control data and quantitative continuous phenotype data. Simulations demonstrate that our proposed methods are more powerful than standard methods, especially when there are multiple risk loci each with small genetic effects. When the number of disease-predisposing genes is small, the data-adaptive weighting of the U-statistics over all the markers produces similar power to commonly used single marker tests. We further illustrate the potential merits of our proposed tests in the analysis of a data set from a pathway-based candidate gene association study of breast cancer and hormone metabolism pathways. Finally, potential applications of the proposed tests to genome-wide association studies are also discussed.
Affiliation(s)
- Zhi Wei
- Department of Computer Science, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA
|