351
Freytag S, Bickeböller H. Comparison of three summary statistics for ranking genes in genome-wide association studies. Stat Med 2013; 33:1828-41. [PMID: 24323702 DOI: 10.1002/sim.6063] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/12/2013] [Revised: 09/11/2013] [Accepted: 11/18/2013] [Indexed: 01/30/2023]
Abstract
Problems associated with insufficient power have haunted the analysis of genome-wide association studies and are likely to be the main challenge for the analysis of next-generation sequencing data. Ranking genes according to their strength of association with the investigated phenotype is one solution. To obtain rankings for genes, researchers can draw from a wide range of statistics summarizing the relationships between variants mapped to a gene and the phenotype. Hence, it is of interest to explore the performance of these statistics in the context of rankings. To this end, we conducted a simulation study (limited to genes of equal sizes) of three different summary statistics examining the ability to rank genes in a meaningful order. The weighted sum of squared marginal score test (Pan, 2009), RareCover algorithm (Bhatia et al., 2010) and the elastic net regularization (Zou and Hastie, 2005) were chosen, because they can handle common as well as rare variants. The test based on the score statistic outperformed both other methods in almost all investigated scenarios. It was the only measure to consistently detect genes with interacting causal variants. However, the RareCover algorithm proved better at identifying genes including causal variants with small effect sizes and low minor allele frequency than the weighted sum of squared marginal score test. The performance of the elastic net regularization was unimpressive for all but the simplest scenarios.
Affiliation(s)
- Saskia Freytag
- Institute of Genetic Epidemiology, University of Göttingen, Humboldtallee 32, Medical School, 37073 Göttingen, Germany
352
Integrating GWASs and human protein interaction networks identifies a gene subnetwork underlying alcohol dependence. Am J Hum Genet 2013; 93:1027-34. [PMID: 24268660 DOI: 10.1016/j.ajhg.2013.10.021] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.7] [Received: 08/14/2013] [Revised: 10/14/2013] [Accepted: 10/21/2013] [Indexed: 12/31/2022] Open
Abstract
Despite a significant genetic contribution to alcohol dependence (AD), few AD-risk genes have been identified to date. In the current study, we aimed to integrate genome-wide association studies (GWASs) and human protein interaction networks to investigate whether a subnetwork of genes whose protein products interact with one another might collectively contribute to AD. By using two discovery GWAS data sets of the Study of Addiction: Genetics and Environment (SAGE) and the Collaborative Study on the Genetics of Alcoholism (COGA), we identified a subnetwork of 39 genes that not only was enriched for genes associated with AD, but also collectively associated with AD in both European Americans (p < 0.0001) and African Americans (p = 0.0008). We replicated the association of the gene subnetwork with AD in three independent samples, including two samples of European descent (p = 0.001 and p = 0.006) and one sample of African descent (p = 0.0069). To evaluate whether the significant associations are likely to be false-positive findings and to ascertain their specificity, we examined the same gene subnetwork in three other human complex disorders (bipolar disorder, major depressive disorder, and type 2 diabetes) and found no significant associations. Functional enrichment analysis revealed that the gene subnetwork was enriched for genes involved in cation transport, synaptic transmission, and transmission of nerve impulses, all of which are biologically meaningful processes that may underlie the risk for AD. In conclusion, we identified a gene subnetwork underlying AD that is biologically meaningful and highly reproducible, providing important clues for future research into AD etiology and treatment.
353
Wang X, Lee S, Zhu X, Redline S, Lin X. GEE-based SNP set association test for continuous and discrete traits in family-based association studies. Genet Epidemiol 2013; 37:778-86. [PMID: 24166731 PMCID: PMC4007511 DOI: 10.1002/gepi.21763] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.8] [Received: 05/16/2013] [Revised: 08/17/2013] [Accepted: 09/10/2013] [Indexed: 12/17/2022]
Abstract
Family-based genetic association studies of related individuals provide opportunities to detect genetic variants that complement studies of unrelated individuals. Most statistical methods for family association studies for common variants are single marker based, which test one SNP at a time. In this paper, we consider testing the effect of an SNP set, e.g., SNPs in a gene, in family studies, for both continuous and discrete traits. Specifically, we propose a generalized estimating equations (GEEs) based kernel association test, a variance component based testing method, to test for the association between a phenotype and multiple variants in an SNP set jointly using family samples. The proposed approach allows for both continuous and discrete traits, where the correlation among family members is taken into account through the use of an empirical covariance estimator. We derive the theoretical distribution of the proposed statistic under the null and develop analytical methods to calculate the P-values. We also propose an efficient resampling method for correcting for small sample size bias in family studies. The proposed method allows for easily incorporating covariates and SNP-SNP interactions. Simulation studies show that the proposed method properly controls for type I error rates under both random and ascertained sampling schemes in family studies. We demonstrate through simulation studies that our approach has superior performance for association mapping compared to the single marker based minimum P-value GEE test for an SNP-set effect over a range of scenarios. We illustrate the application of the proposed method using data from the Cleveland Family GWAS Study.
Affiliation(s)
- Xuefeng Wang
- Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA 02115
- Seunggeun Lee
- Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA 02115
- Xiaofeng Zhu
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA 44106
- Susan Redline
- Department of Medicine, Brigham and Women’s Hospital and Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA
- Xihong Lin
- Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA 02115
354
Qu L, Guennel T, Marshall SL. Linear score tests for variance components in linear mixed models and applications to genetic association studies. Biometrics 2013; 69:883-92. [PMID: 24328714 DOI: 10.1111/biom.12095] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Received: 09/01/2012] [Revised: 06/01/2013] [Accepted: 07/01/2013] [Indexed: 01/16/2023]
Abstract
Following the rapid development of genome-scale genotyping technologies, genetic association mapping has become a popular tool to detect genomic regions responsible for certain (disease) phenotypes, especially in early-phase pharmacogenomic studies with limited sample size. In response to such applications, a good association test needs to be (1) applicable to a wide range of possible genetic models, including, but not limited to, the presence of gene-by-environment or gene-by-gene interactions and non-linearity of a group of marker effects, (2) accurate in small samples, fast to compute on the genomic scale, and amenable to large scale multiple testing corrections, and (3) reasonably powerful to locate causal genomic regions. The kernel machine method represented in linear mixed models provides a viable solution by transforming the problem into testing the nullity of variance components. In this study, we consider score-based tests by choosing a statistic linear in the score function. When the model under the null hypothesis has only one error variance parameter, our test is exact in finite samples. When the null model has more than one variance parameter, we develop a new moment-based approximation that performs well in simulations. Through simulations and analysis of real data, we demonstrate that the new test possesses most of the aforementioned characteristics, especially when compared to existing quadratic score tests or restricted likelihood ratio tests.
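For orientation, the idea of testing whether a variance component is zero can be sketched with the simpler quadratic score statistic, the family of tests this abstract compares against, rather than the authors' linear score test. The sketch assumes unrelated individuals, an intercept-only null model, a linear kernel, and a permutation p-value; all function names are illustrative and are not taken from the paper.

```python
import random


def linear_kernel(genotypes):
    """Linear kernel K[i][j] = sum_k g_ik * g_jk for an n x p genotype matrix."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in genotypes]
        for row_i in genotypes
    ]


def score_statistic(y, kernel):
    """Quadratic score statistic Q = r' K r, where r = y - mean(y).

    Assumption: the null model contains only an intercept, so the
    residuals are simply the centred phenotypes.
    """
    mean_y = sum(y) / len(y)
    r = [value - mean_y for value in y]
    n = len(y)
    return sum(r[i] * kernel[i][j] * r[j] for i in range(n) for j in range(n))


def permutation_p_value(y, kernel, n_perm=999, seed=0):
    """Permutation p-value: shuffle phenotypes, recompute Q, count exceedances."""
    rng = random.Random(seed)
    observed = score_statistic(y, kernel)
    shuffled = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score_statistic(shuffled, kernel) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)


# Toy data: four individuals, one SNP coded as minor-allele counts.
phenotype = [1.0, 2.0, 3.0, 4.0]
genotypes = [[0], [1], [1], [2]]
kernel = linear_kernel(genotypes)
q_stat = score_statistic(phenotype, kernel)
p_value = permutation_p_value(phenotype, kernel, n_perm=99, seed=1)
```

A permutation p-value is used here only because it needs no distribution theory; the exact finite-sample and moment-based approximations described in the abstract are not reproduced.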
Affiliation(s)
- Long Qu
- Department of Mathematics and Statistics, Wright State University, Dayton, Ohio 45435, U.S.A
355
Regional replication of association with refractive error on 15q14 and 15q25 in the Age-Related Eye Disease Study cohort. Mol Vis 2013; 19:2173-86. [PMID: 24227913 PMCID: PMC3826323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2013] [Accepted: 10/30/2013] [Indexed: 11/25/2022] Open
Abstract
PURPOSE Refractive error is a complex trait with multiple genetic and environmental risk factors, and is the most common cause of preventable blindness worldwide. The common nature of the trait suggests the presence of many genetic factors that individually may have modest effects. To achieve an adequate sample size to detect these common variants, large, international collaborations have formed. These consortia typically use meta-analysis to combine multiple studies from many different populations. This approach is robust to differences between populations; however, it does not compensate for the different haplotypes in each genetic background evidenced by different alleles in linkage disequilibrium with the causative variant. We used the Age-Related Eye Disease Study (AREDS) cohort to replicate published significant associations at two loci on chromosome 15 from two genome-wide association studies (GWASs). The single nucleotide polymorphisms (SNPs) that exhibited association on chromosome 15 in the original studies did not show evidence of association with refractive error in the AREDS cohort. This paper seeks to determine whether the non-replication in this AREDS sample may be due to the limited number of SNPs chosen for replication. METHODS We selected all SNPs genotyped on the Illumina Omni2.5v1_B array or custom TaqMan assays or imputed from the GWAS data, in the region surrounding the SNPs from the Consortium for Refractive Error and Myopia study. We analyzed the SNPs for association with refractive error using standard regression methods in PLINK. The effective number of tests was calculated using the Genetic Type I Error Calculator. RESULTS Although use of the same SNPs used in the Consortium for Refractive Error and Myopia study did not show any evidence of association with refractive error in this AREDS sample, other SNPs within the candidate regions demonstrated an association with refractive error. 
Significant evidence of association was found using the hyperopia categorical trait, with the most significant SNPs rs1357179 on 15q14 (p=1.69×10⁻³) and rs7164400 on 15q25 (p=8.39×10⁻⁴), which passed the replication thresholds. CONCLUSIONS This study adds to the growing body of evidence that attempting to replicate the most significant SNPs found in one population may not be significant in another population due to differences in the linkage disequilibrium structure and/or allele frequency. This suggests that replication studies should include less significant SNPs in an associated region rather than only a few selected SNPs chosen by a significance threshold.
356
Yang W, Gu C. A whole-genome simulator capable of modeling high-order epistasis for complex disease. Genet Epidemiol 2013; 37:686-94. [PMID: 24114848 PMCID: PMC4143152 DOI: 10.1002/gepi.21761] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 01/10/2013] [Revised: 08/09/2013] [Accepted: 08/14/2013] [Indexed: 11/10/2022]
Abstract
Genome-wide association studies (GWAS) have been successful in finding numerous new risk variants for complex diseases, but the results almost exclusively rely on single-marker scans. Methods that can analyze joint effects of many variants in GWAS data are still being developed and trialed. To evaluate the performance of such methods it is essential to have a GWAS data simulator that can rapidly simulate a large number of samples, and capture key features of real GWAS data such as linkage disequilibrium (LD) among single-nucleotide polymorphisms (SNPs) and joint effects of multiple loci (multilocus epistasis). In the current study, we combine techniques for specifying high-order epistasis among risk SNPs with an existing program GWAsimulator [Li and Li, 2008] to achieve rapid whole-genome simulation with accurate modeling of complex interactions. We considered various approaches to specifying interaction models including the following: departure from product of marginal effects for pairwise interactions, product terms in logistic regression models for low-order interactions, and penetrance tables conforming to marginal effect constraints for high-order interactions or prescribing known biological interactions. Methods for conversion among different model specifications are developed using the penetrance table as the fundamental characterization of disease models. The new program, called simGWA, is capable of efficiently generating large samples of GWAS data with high precision. We show that data simulated by simGWA are faithful to template LD structures, and conform to prespecified disease models with (or without) interactions.
Affiliation(s)
- Wei Yang
- Division of Biostatistics, Washington University School of Medicine, St. Louis, MO
- Charles Gu
- Division of Biostatistics, Washington University School of Medicine, St. Louis, MO
- Department of Genetics, Washington University School of Medicine, St. Louis, MO
357
O'Brien KM, Orlow I, Antonescu CR, Ballman K, McCall L, Dematteo R, Engel LS. Gastrointestinal stromal tumors: a case-only analysis of single nucleotide polymorphisms and somatic mutations. Clin Sarcoma Res 2013; 3:12. [PMID: 24159917 PMCID: PMC3827940 DOI: 10.1186/2045-3329-3-12] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 09/18/2013] [Accepted: 10/23/2013] [Indexed: 12/28/2022] Open
Abstract
Background: Gastrointestinal stromal tumors are rare soft tissue sarcomas that typically develop from mesenchymal cells with acquired gain-of-function mutations in KIT or PDGFRA oncogenes. These somatic mutations have been well-characterized, but little is known about inherited genetic risk factors. Given evidence that certain susceptibility loci and carcinogens are associated with characteristic mutations in other cancers, we hypothesized that these signature KIT or PDGFRA mutations may be similarly fundamental to understanding gastrointestinal stromal tumor etiology. Therefore, we examined associations between 522 single nucleotide polymorphisms and seven KIT or PDGFRA tumor mutation types. Candidate pathways included dioxin response, toxin metabolism, matrix metalloproteinase production, and immune and inflammatory response. Methods: We estimated odds ratios and 95% confidence intervals for associations between each candidate SNP and tumor mutation type in 279 individuals from a clinical trial of adjuvant imatinib mesylate. We used sequence kernel association tests to look for pathway-level associations. Results: One variant, rs1716 on ITGAE, was significantly associated with KIT exon 11 non-codon 557–8 deletions (odds ratio = 2.86, 95% confidence interval: 1.71-4.78) after adjustment for multiple comparisons. Other noteworthy associations included rs3024498 (IL10) and rs1050783 (F13A1) with PDGFRA mutations, rs2071888 (TAPBP) with wild type tumors and several matrix metalloproteinase SNPs with KIT exon 11 codon 557–558 deletions. Several pathways were strongly associated with somatic mutations in PDGFRA, including defense response (p = 0.005) and negative regulation of immune response (p = 0.01). Conclusions: This exploratory analysis offers novel insights into gastrointestinal stromal tumor etiology and provides a starting point for future studies of genetic and environmental risk factors for the disease.
Affiliation(s)
- Lawrence S Engel
- Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC, USA.
358
Thompson PM, Ge T, Glahn DC, Jahanshad N, Nichols TE. Genetics of the connectome. Neuroimage 2013; 80:475-88. [PMID: 23707675 PMCID: PMC3905600 DOI: 10.1016/j.neuroimage.2013.05.013] [Citation(s) in RCA: 132] [Impact Index Per Article: 11.0] [Received: 03/01/2013] [Revised: 05/05/2013] [Accepted: 05/08/2013] [Indexed: 11/24/2022] Open
Abstract
Connectome genetics attempts to discover how genetic factors affect brain connectivity. Here we review a variety of genetic analysis methods--such as genome-wide association studies (GWAS), linkage and candidate gene studies--that have been fruitfully adapted to imaging data to implicate specific variants in the genome for brain-related traits. We then review studies that emphasized the genetic influences on brain connectivity. Some of these analyze brain integrity and connectivity using diffusion MRI, and others have mapped genetic effects on functional networks using resting state functional MRI. Connectome-wide genome-wide scans have also been conducted, and we review the multivariate methods required to handle the extremely high dimension of the genomic and network data. We also review some consortium efforts, such as ENIGMA, that offer the power to detect robust common genetic associations using phenotypic harmonization procedures and meta-analysis. Current work on connectome genetics is advancing on many fronts and promises to shed light on how disease risk genes affect the brain. It is already discovering new genetic loci and even entire genetic networks that affect brain organization and connectivity.
Affiliation(s)
- Paul M Thompson
- Imaging Genetics Center, Laboratory of NeuroImaging, Dept. of Neurology, UCLA School of Medicine, Los Angeles, CA 90095, USA.
359
Harmon QE, Engel SM, Olshan AF, Moran T, Stuebe AM, Luo J, Wu MC, Avery CL. Association of polymorphisms in natural killer cell-related genes with preterm birth. Am J Epidemiol 2013; 178:1208-18. [PMID: 23982189 DOI: 10.1093/aje/kwt108] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 12/20/2022] Open
Abstract
Inflammation is implicated in preterm birth, but genetic studies of inflammatory genes have yielded inconsistent results. Maternal DNA from 1,646 participants in the Pregnancy, Infection, and Nutrition Cohort, enrolled in Orange and Wake counties, North Carolina (1995-2005), was genotyped for 432 tag single-nucleotide polymorphisms (SNPs) in 30 candidate genes. Gene-level and SNP associations were modeled within strata of genetic ancestry. Six genes were associated with preterm birth among European Americans: interleukin 12A (IL12A); colony-stimulating factor 2 (CSF2); interferon γ receptor 2 (IFNGR2); killer cell immunoglobulin-like receptor, three domain, long cytoplasmic tail, 2 (KIR3DL2); interleukin 4 (IL4); and interleukin 13 (IL13). Of these, relatively strong single-SNP associations were seen in IFNGR2 and KIR3DL2. Among the 4 genes related to natural killer cell function, 2 (IL12A and CSF2) were consistently associated with reduced risk of prematurity for both European and African Americans. SNPs tagging a locus control region for IL4 and IL13 were associated with an increased risk of spontaneous preterm birth for European Americans (rs3091307; risk ratio = 1.9; 95% confidence interval: 1.4, 2.5). Although gene-level associations were detected only in European Americans, single-SNP associations among European and African Americans were often similar in direction, though estimated with less precision among African Americans. In conclusion, we identified novel associations between variants in the natural killer cell immune pathway and prematurity in this biracial US population.
360
Dai H, Zhao Y, Qian C, Cai M, Zhang R, Chu M, Dai J, Hu Z, Shen H, Chen F. Weighted SNP set analysis in genome-wide association study. PLoS One 2013; 8:e75897. [PMID: 24098741 PMCID: PMC3786949 DOI: 10.1371/journal.pone.0075897] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/03/2013] [Accepted: 08/19/2013] [Indexed: 11/18/2022] Open
Abstract
Genome-wide association studies (GWAS) are popular for identifying genetic variants that are associated with disease risk. Many approaches have been proposed to test multiple single nucleotide polymorphisms (SNPs) in a region simultaneously, addressing the disadvantages of single-locus association analysis. Kernel machine based SNP set analysis, which borrows information from SNPs correlated with causal or tag SNPs, is more powerful than single-locus analysis. Four types of kernel machine functions and a principal component based approach (PCA) have also been compared. However, given the loss of power caused by low minor allele frequencies (MAF), we extended PCA to a new method called weighted PCA (wPCA). Comparative analysis was performed for wPCA, the logistic kernel machine based test (LKM) and PCA based on SNP sets under different MAFs and linkage disequilibrium (LD) structures. We also applied the three methods to two SNP sets extracted from a real GWAS dataset of non-small cell lung cancer in a Han Chinese population. Simulation results show that when the MAF of the causal SNP is low, weighted principal component and weighted IBS are more powerful than PCA and other kernel machine functions at different LD structures and different numbers of causal SNPs. Application of the three methods to a real GWAS dataset indicates that wPCA and wIBS have better performance than the linear kernel, IBS kernel and PCA.
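The IBS kernel named in this abstract has a simple closed form when genotypes are coded as minor-allele counts (0, 1, 2): the similarity of two individuals is the number of alleles they share identical by state, summed over SNPs and scaled to lie between 0 and 1. The sketch below is a generic illustration under that coding assumption, not the authors' code. The optional per-SNP weights are placeholders, since the abstract does not say how the weights are defined.

```python
def ibs_kernel(genotypes, weights=None):
    """Return the n x n (optionally weighted) IBS kernel.

    genotypes: n x p list of lists of minor-allele counts in {0, 1, 2}.
    weights:   optional length-p list of non-negative SNP weights
               (hypothetical; None gives the unweighted IBS kernel).

    K[i][j] = sum_k w_k * (2 - |g_ik - g_jk|) / (2 * sum_k w_k)
    """
    n_snps = len(genotypes[0])
    if weights is None:
        weights = [1.0] * n_snps
    scale = 2.0 * sum(weights)
    kernel = []
    for row_i in genotypes:
        kernel.append([
            sum(w * (2 - abs(a - b)) for w, a, b in zip(weights, row_i, row_j)) / scale
            for row_j in genotypes
        ])
    return kernel


# Toy data: three individuals, two SNPs.
genotypes = [[0, 1], [2, 1], [0, 1]]
unweighted = ibs_kernel(genotypes)
# Up-weighting the first SNP makes individuals 0 and 1 look less similar,
# because they differ only at that SNP.
upweighted = ibs_kernel(genotypes, weights=[3.0, 1.0])
```

Weights that grow as the minor allele frequency falls give rare variants more influence on the similarity, which is the general motivation for weighting described in the abstract.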
Affiliation(s)
- Hui Dai
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China
- Yang Zhao
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China
- Cheng Qian
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China
- Min Cai
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China
- Ruyang Zhang
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China
- Minjie Chu
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China
- Juncheng Dai
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China
- Zhibin Hu
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China
- Section of Clinical Epidemiology, Jiangsu Key Laboratory of Cancer Biomarkers, Prevention and Treatment, Cancer Center, Nanjing Medical University, Nanjing, China
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, China
- Hongbing Shen
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China
- Section of Clinical Epidemiology, Jiangsu Key Laboratory of Cancer Biomarkers, Prevention and Treatment, Cancer Center, Nanjing Medical University, Nanjing, China
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, China
- Feng Chen
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China
361
Carbonetto P, Stephens M. Integrated enrichment analysis of variants and pathways in genome-wide association studies indicates central role for IL-2 signaling genes in type 1 diabetes, and cytokine signaling genes in Crohn's disease. PLoS Genet 2013; 9:e1003770. [PMID: 24098138 PMCID: PMC3789883 DOI: 10.1371/journal.pgen.1003770] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.7] [Received: 08/21/2012] [Accepted: 07/22/2013] [Indexed: 12/17/2022] Open
Abstract
Pathway analyses of genome-wide association studies aggregate information over sets of related genes, such as genes in common pathways, to identify gene sets that are enriched for variants associated with disease. We develop a model-based approach to pathway analysis, and apply this approach to data from the Wellcome Trust Case Control Consortium (WTCCC) studies. Our method offers several benefits over existing approaches. First, our method not only interrogates pathways for enrichment of disease associations, but also estimates the level of enrichment, which yields a coherent way to promote variants in enriched pathways, enhancing discovery of genes underlying disease. Second, our approach allows for multiple enriched pathways, a feature that leads to novel findings in two diseases where the major histocompatibility complex (MHC) is a major determinant of disease susceptibility. Third, by modeling disease as the combined effect of multiple markers, our method automatically accounts for linkage disequilibrium among variants. Interrogation of pathways from eight pathway databases yields strong support for enriched pathways, indicating links between Crohn's disease (CD) and cytokine-driven networks that modulate immune responses; between rheumatoid arthritis (RA) and "Measles" pathway genes involved in immune responses triggered by measles infection; and between type 1 diabetes (T1D) and IL2-mediated signaling genes. Prioritizing variants in these enriched pathways yields many additional putative disease associations compared to analyses without enrichment. For CD and RA, 7 of 8 additional non-MHC associations are corroborated by other studies, providing validation for our approach. For T1D, prioritization of IL-2 signaling genes yields strong evidence for 7 additional non-MHC candidate disease loci, as well as suggestive evidence for several more. 
Of the 7 strongest associations, 4 are validated by other studies, and 3 (near IL-2 signaling genes RAF1, MAPK14, and FYN) constitute novel putative T1D loci for further study.
Affiliation(s)
- Peter Carbonetto
- Dept. of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Matthew Stephens
- Dept. of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Dept. of Statistics, University of Chicago, Chicago, Illinois, United States of America
362
A fast multilocus test with adaptive SNP selection for large-scale genetic-association studies. Eur J Hum Genet 2013; 22:696-702. [PMID: 24022295 DOI: 10.1038/ejhg.2013.201] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 03/22/2013] [Revised: 07/02/2013] [Accepted: 08/07/2013] [Indexed: 12/20/2022] Open
Abstract
As increasing evidence suggests that multiple correlated genetic variants could jointly influence the outcome, a multilocus test that aggregates association evidence across multiple genetic markers in a considered gene or a genomic region may be more powerful than a single-marker test for detecting susceptibility loci. We propose a multilocus test, AdaJoint, which adopts a variable selection procedure to identify a subset of genetic markers that jointly show the strongest association signal, and defines the test statistic based on the selected genetic markers. The P-value from the AdaJoint test is evaluated by a computationally efficient algorithm that effectively adjusts for multiple-comparison, and is hundreds of times faster than the standard permutation method. Simulation studies demonstrate that AdaJoint has the most robust performance among several commonly used multilocus tests. We perform multilocus analysis of over 26,000 genes/regions on two genome-wide association studies of pancreatic cancer. Compared with its competitors, AdaJoint identifies a much stronger association between the gene CLPTM1L and pancreatic cancer risk (6.0 × 10⁻⁸), with the signal optimally captured by two correlated single-nucleotide polymorphisms (SNPs). Finally, we show AdaJoint as a powerful tool for mapping cis-regulating methylation quantitative trait loci on normal breast tissues, and find many CpG sites whose methylation levels are jointly regulated by multiple SNPs nearby.
363
Lin X, Lee S, Christiani DC, Lin X. Test for interactions between a genetic marker set and environment in generalized linear models. Biostatistics 2013; 14:667-81. [PMID: 23462021 PMCID: PMC3769996 DOI: 10.1093/biostatistics/kxt006] [Citation(s) in RCA: 76] [Impact Index Per Article: 6.3] [Received: 08/30/2012] [Revised: 01/21/2013] [Accepted: 01/28/2013] [Indexed: 11/13/2022] Open
Abstract
We consider in this paper testing for interactions between a genetic marker set and an environmental variable. A common practice in studying gene-environment (GE) interactions is to analyze one single-nucleotide polymorphism (SNP) at a time. It is of significant interest to analyze SNPs in a biologically defined set simultaneously, e.g. gene or pathway. In this paper, we first show that if the main effects of multiple SNPs in a set are associated with a disease/trait, the classical single SNP-GE interaction analysis can be biased. We derive the asymptotic bias and study the conditions under which the classical single SNP-GE interaction analysis is unbiased. We further show that the simple minimum p-value-based SNP-set GE analysis can be biased and have an inflated Type 1 error rate. To overcome these difficulties, we propose a computationally efficient and powerful gene-environment set association test (GESAT) in generalized linear models. Our method tests for SNP-set by environment interactions using a variance component test, and estimates the main SNP effects under the null hypothesis using ridge regression. We evaluate the performance of GESAT using simulation studies, and apply GESAT to data from the Harvard lung cancer genetic study to investigate GE interactions between the SNPs in the 15q24-25.1 region and smoking on lung cancer risk.
Affiliation(s)
- Xinyi Lin
- Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
|
364
|
Zhang H, Maity A, Arshad H, Holloway J, Karmaus W. Variable selection in semi-parametric models. Stat Methods Med Res 2013; 25:1736-52. [PMID: 23990355 DOI: 10.1177/0962280213499679] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Abstract
We propose Bayesian variable selection methods in semi-parametric models in the framework of partially linear Gaussian and probit regressions. Reproducing kernels are utilized to evaluate the possibly non-linear joint effect of a set of variables. Indicator variables are introduced into the reproducing kernels for the inclusion or exclusion of a variable. Different scenarios based on posterior probabilities of including a variable are proposed to select important variables. Simulations are used to demonstrate and evaluate the methods. It was found that the proposed methods can efficiently select the correct variables regardless of whether the effects are linear or non-linear in an unknown form. The proposed methods are applied to two real data sets to identify cytosine phosphate guanine methylation sites associated with maternal smoking and cytosine phosphate guanine sites associated with cotinine levels with creatinine levels adjusted. The selected methylation sites have the potential to advance our understanding of the underlying mechanism for the impact of smoking exposure on health outcomes, and consequently benefit medical research in disease intervention.
Affiliation(s)
- Hongmei Zhang
- Division of Epidemiology, Biostatistics, and Environmental Health, School of Public Health, University of Memphis, TN, USA
- Arnab Maity
- Department of Statistics, North Carolina State University, NC, USA
- Hasan Arshad
- The David Hide Asthma and Allergy Research Center, St. Mary's Hospital, Isle of Wight, UK; Allergy and Clinical Immunology, University of Southampton, Southampton, UK
- John Holloway
- Faculty of Medicine, University of Southampton, Southampton, UK
- Wilfried Karmaus
- Division of Epidemiology, Biostatistics, and Environmental Health, School of Public Health, University of Memphis, TN, USA
|
365
|
Combined genotype and haplotype tests for region-based association studies. BMC Genomics 2013; 14:569. [PMID: 23964661 PMCID: PMC3852120 DOI: 10.1186/1471-2164-14-569] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 04/09/2013] [Accepted: 08/13/2013] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Although single-SNP analysis has proven to be useful in identifying many disease-associated loci, region-based analysis has several advantages. Empirically, it has been shown that region-based genotype and haplotype approaches may possess much higher power than single-SNP statistical tests. Both high quality haplotypes and genotypes may be available for analysis given the development of next generation sequencing technologies and haplotype assembly algorithms. RESULTS Because it is generally unknown whether genotypes or haplotypes are more relevant for identifying an association, we propose to use both of them with the purpose of preserving high power under both genotype and haplotype disease scenarios. We suggest two approaches for a combined association test and investigate the performance of these two approaches based on a theoretical model, population genetics simulations and analysis of a real data set. CONCLUSIONS Based on a theoretical model, population genetics simulations and analysis of a central corneal thickness (CCT) Genome Wide Association Study (GWAS) data set, we have shown that a combined genotype and haplotype approach has high potential utility for applications in association studies.
|
366
|
Pan WC, Kile ML, Seow WJ, Lin X, Quamruzzaman Q, Rahman M, Mahiuddin G, Mostofa G, Lu Q, Christiani DC. Genetic susceptible locus in NOTCH2 interacts with arsenic in drinking water on risk of type 2 diabetes. PLoS One 2013; 8:e70792. [PMID: 23967108 PMCID: PMC3743824 DOI: 10.1371/journal.pone.0070792] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 03/13/2013] [Accepted: 06/24/2013] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Chronic exposure to arsenic in drinking water is associated with increased risk of type 2 diabetes mellitus (T2DM) but the underlying molecular mechanism remains unclear. OBJECTIVES This study evaluated the interaction between single nucleotide polymorphisms (SNPs) in genes associated with diabetes and arsenic exposure in drinking water on the risk of developing T2DM. METHODS In 2009-2011, we conducted a follow up study of 957 Bangladeshi adults who participated in a case-control study of arsenic-induced skin lesions in 2001-2003. Logistic regression models were used to evaluate the association between 38 SNPs in 18 genes and risk of T2DM measured at follow up. T2DM was defined as having a blood hemoglobin A1C level greater than or equal to 6.5% at follow-up. Arsenic exposure was characterized by drinking water samples collected from participants' tubewells. False discovery rates were applied in the analysis to control for multiple comparisons. RESULTS Median arsenic levels in 2001-2003 were higher among diabetic participants compared with non-diabetic ones (71.6 µg/L vs. 12.5 µg/L, p-value <0.001). Three SNPs in ADAMTS9 were nominally associated with increased risk of T2DM (rs17070905, Odds Ratio (OR) = 2.30, 95% confidence interval (CI) 1.17-4.50; rs17070967, OR = 2.02, 95%CI 1.00-4.06; rs6766801, OR = 2.33, 95%CI 1.18-4.60), but these associations did not reach the statistical significance after adjusting for multiple comparisons. A significant interaction between arsenic and NOTCH2 (rs699780) was observed which significantly increased the risk of T2DM (p for interaction = 0.003; q-value = 0.021). Further restricted analysis among participants exposed to water arsenic of less than 148 µg/L showed consistent results for interaction between the NOTCH2 variant and arsenic exposure on T2DM (p for interaction = 0.048; q-value = 0.004). 
CONCLUSIONS These findings suggest that genetic variation in NOTCH2 increased susceptibility to T2DM among people exposed to inorganic arsenic. Additionally, genetic variants in ADAMTS9 may increase the risk of T2DM.
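The abstract reports q-values from a false discovery rate adjustment but does not name the procedure. As a minimal sketch, assuming the common Benjamini-Hochberg step-up adjustment (an assumption, not something the abstract states), q-values can be derived from a list of p-values as follows. The function name `bh_qvalues` and the toy p-values are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values).

    Assumption: the study's FDR procedure is not named in the abstract;
    Benjamini-Hochberg is used here only as a common illustration.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


# Toy p-values (made up for illustration, not taken from the study).
example_q = bh_qvalues([0.01, 0.04, 0.03, 0.005])
```

A test is then declared significant at FDR level α when its q-value is at most α.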
Affiliation(s)
- Wen-Chi Pan
- Department of Environmental Health, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Molly L. Kile
- Department of Public Health, College of Public Health and Human Sciences, Oregon State University, Corvallis, Oregon, United States of America
- Wei Jie Seow
- Department of Environmental Health, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Xihong Lin
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Quazi Quamruzzaman
- Department of Environmental Research, Dhaka Community Hospital, Dhaka, Bangladesh
- Mahmuder Rahman
- Department of Environmental Research, Dhaka Community Hospital, Dhaka, Bangladesh
- Golam Mahiuddin
- Department of Environmental Research, Dhaka Community Hospital, Dhaka, Bangladesh
- Golam Mostofa
- Department of Environmental Research, Dhaka Community Hospital, Dhaka, Bangladesh
- Quan Lu
- Department of Environmental Health, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Department of Genetics and Complex Diseases, Harvard School of Public Health, Boston, Massachusetts, United States of America
- David C. Christiani
- Department of Environmental Health, Harvard School of Public Health, Boston, Massachusetts, United States of America
|
367
|
Larson NB, Schaid DJ. A kernel regression approach to gene-gene interaction detection for case-control studies. Genet Epidemiol 2013; 37:695-703. [PMID: 23868214 DOI: 10.1002/gepi.21749] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 01/18/2013] [Revised: 05/07/2013] [Accepted: 06/12/2013] [Indexed: 01/13/2023]
Abstract
Gene-gene interactions are increasingly being addressed as a potentially important contributor to the variability of complex traits. Consequently, attention has moved beyond single-locus analysis of association to more complex genetic models. Although several single-marker approaches toward interaction analysis have been developed, such methods suffer from very high testing dimensionality and do not take advantage of existing information, notably the definition of genes as functional units. Here, we propose a comprehensive family of gene-level score tests for identifying genetic elements of disease risk, in particular pairwise gene-gene interactions. Using kernel machine methods, we devise score-based variance component tests under a generalized linear mixed model framework. We conducted simulations based upon coalescent genetic models to evaluate the performance of our approach under a variety of disease models. These simulations indicate that our methods are generally higher powered than alternative gene-level approaches and at worst competitive with exhaustive SNP-level (where SNP is single-nucleotide polymorphism) analyses. Furthermore, we observe that simulated epistatic effects resulted in significant marginal testing results for the involved genes regardless of whether or not true main effects were present. We detail the benefits of our methods and discuss potential genome-wide analysis strategies for gene-gene interaction analysis in a case-control study design.
Affiliation(s)
- Nicholas B Larson
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
|
368
|
Jiao S, Hsu L, Bézieau S, Brenner H, Chan AT, Chang-Claude J, Le Marchand L, Lemire M, Newcomb PA, Slattery ML, Peters U. SBERIA: set-based gene-environment interaction test for rare and common variants in complex diseases. Genet Epidemiol 2013; 37:452-64. [PMID: 23720162 PMCID: PMC3713231 DOI: 10.1002/gepi.21735] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Received: 03/05/2013] [Revised: 04/04/2013] [Accepted: 04/30/2013] [Indexed: 01/28/2023]
Abstract
Identification of gene-environment interaction (G × E) is important in understanding the etiology of complex diseases. However, partially due to the lack of power, there have been very few replicated G × E findings compared to the success in marginal association studies. The existing G × E testing methods mainly focus on improving the power for individual markers. In this paper, we took a different strategy and proposed a set-based gene-environment interaction test (SBERIA), which can improve the power by reducing the multiple testing burdens and aggregating signals within a set. The major challenge of the signal aggregation within a set is how to tell signals from noise and how to determine the direction of the signals. SBERIA takes advantage of the established correlation screening for G × E to guide the aggregation of genotypes within a marker set. The correlation screening has been shown to be an efficient way of selecting potential G × E candidate SNPs in case-control studies for complex diseases. Importantly, the correlation screening in case-control combined samples is independent of the interaction test. With this desirable feature, SBERIA maintains the correct type I error level and can be easily implemented in a regular logistic regression setting. We showed that SBERIA had higher power than benchmark methods in various simulation scenarios, both for common and rare variants. We also applied SBERIA to real genome-wide association studies (GWAS) data of 10,729 colorectal cancer cases and 13,328 controls and found evidence of interaction between the set of known colorectal cancer susceptibility loci and smoking.
Affiliation(s)
- Shuo Jiao
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, USA.
|
369
|
Belonogova NM, Svishcheva GR, van Duijn CM, Aulchenko YS, Axenovich TI. Region-based association analysis of human quantitative traits in related individuals. PLoS One 2013; 8:e65395. [PMID: 23799013 PMCID: PMC3684601 DOI: 10.1371/journal.pone.0065395] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 01/29/2013] [Accepted: 04/24/2013] [Indexed: 01/27/2023] Open
Abstract
Region-based association analysis, instead of individual testing of each SNP, was introduced in genome-wide association studies to increase the power of gene mapping, especially for rare genetic variants. For regional association tests, the kernel machine-based regression approach was recently proposed as a more powerful alternative to collapsing-based methods. However, the vast majority of existing algorithms and software for kernel machine-based regression are applicable only to unrelated samples. In this paper, we present a new method for kernel machine-based regression association analysis of quantitative traits in samples of related individuals. The method is based on the GRAMMAR+ transformation of phenotypes of related individuals, followed by use of existing kernel machine-based regression software for unrelated samples. We compared the performance of kernel-based association analysis on the material of the Genetic Analysis Workshop 17 family sample and real human data by using our transformation, the original untransformed trait, and environmental residuals. We demonstrated that only the GRAMMAR+ transformation produced type I errors close to the nominal value and that this method had the highest empirical power. The new method can be applied to analysis of related samples by using existing software for kernel-based association analysis developed for unrelated samples.
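For orientation, kernel machine-based regression tests for a quantitative trait are usually built on a quadratic-form score statistic Q = r'Kr, where r holds the trait residuals under the null model and K is a kernel over the region's genotypes. The sketch below is an illustration under stated assumptions, not the authors' method: it uses an unweighted linear kernel K = GG', takes residuals from an intercept-only model, treats subjects as unrelated, and omits both the GRAMMAR+ transformation and the p-value computation. The name `linear_kernel_q` and the toy data are hypothetical.

```python
def linear_kernel_q(genotypes, trait):
    """Score statistic Q = r' G G' r for an unweighted linear kernel.

    Assumptions (for illustration only): residuals come from an
    intercept-only null model and the subjects are unrelated.
    """
    n = len(trait)
    mean = sum(trait) / n
    residuals = [y - mean for y in trait]
    n_variants = len(genotypes[0])
    q = 0.0
    for j in range(n_variants):
        # Per-variant score: genotype column j weighted by the residuals.
        score_j = sum(genotypes[i][j] * residuals[i] for i in range(n))
        q += score_j ** 2
    return q


# Toy data (made up): 4 subjects x 2 variants, minor-allele counts.
toy_genotypes = [[0, 1], [1, 0], [2, 1], [0, 0]]
toy_trait = [1.0, 2.0, 4.0, 1.0]
toy_q = linear_kernel_q(toy_genotypes, toy_trait)
```

With the linear kernel, Q reduces to the sum of squared per-variant scores, which is why the loop never forms the n × n kernel matrix.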
Affiliation(s)
- Nadezhda M. Belonogova
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Gulnara R. Svishcheva
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Yurii S. Aulchenko
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Tatiana I. Axenovich
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
|
370
|
Arnedo J, del Val C, de Erausquin GA, Romero-Zaliz R, Svrakic D, Cloninger CR, Zwir I. PGMRA: a web server for (phenotype x genotype) many-to-many relation analysis in GWAS. Nucleic Acids Res 2013; 41:W142-9. [PMID: 23761451 PMCID: PMC3692099 DOI: 10.1093/nar/gkt496] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Indexed: 01/07/2023] Open
Abstract
It has been proposed that single nucleotide polymorphisms (SNPs) discovered by genome-wide association studies (GWAS) account for only a small fraction of the genetic variation of complex traits in human populations. The remaining unexplained variance, or missing heritability, is thought to be due to marginal effects of many loci with small effects and has eluded attempts to identify its sources. Combining different studies appears to resolve this problem in part. However, neither individual GWAS nor meta-analytic combinations thereof are helpful for disclosing which genetic variants contribute to explaining a particular phenotype. Here, we propose that most of the missing heritability is latent in the GWAS data, which conceals intermediate phenotypes. To uncover such latent information, we propose the PGMRA server, which introduces phenomics (the full set of phenotype features of an individual) to identify SNP-set structures in a broader sense, i.e. causally cohesive genotype-phenotype relations. These relations are agnostically identified (without considering disease status of the subjects) and organized in an interpretable fashion. Then, by incorporating a posteriori the subject status within each relation, we can establish the risk surface of a disease in an unbiased mode. This approach complements, rather than replaces, current analysis methods. The server is publicly available at http://phop.ugr.es/fenogeno.
Affiliation(s)
- Javier Arnedo
- Department of Computer Science and Artificial Intelligence, University of Granada, E-18071 Granada, Spain
|
371
|
Lee D, Lee GK, Yoon KA, Lee JS. Pathway-based analysis using genome-wide association data from a Korean non-small cell lung cancer study. PLoS One 2013; 8:e65396. [PMID: 23762359 PMCID: PMC3675130 DOI: 10.1371/journal.pone.0065396] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 02/05/2013] [Accepted: 04/24/2013] [Indexed: 11/18/2022] Open
Abstract
Pathway-based analysis, used in conjunction with genome-wide association study (GWAS) techniques, is a powerful tool to detect subtle but systematic patterns in the genome that can help elucidate complex diseases, like cancers. Here, we stepped back from genetic polymorphisms at a single locus and examined how multiple association signals can be orchestrated to find pathways related to lung cancer susceptibility. We used single-nucleotide polymorphism (SNP) array data from 869 non-small cell lung cancer (NSCLC) cases from a previous GWAS at the National Cancer Center and 1,533 controls from the Korean Association Resource project for the pathway-based analysis. After mapping single-nucleotide polymorphisms to genes, considering their coding region and regulatory elements (±20 kbp), multivariate logistic regression models under additive and dominant genetic models were fitted against disease status, with adjustments for age, gender, and smoking status. Pathway statistics were evaluated using Gene Set Enrichment Analysis (GSEA) and Adaptive Rank Truncated Product (ARTP) methods. Among 880 pathways, 11 showed relatively significant statistics compared to our positive controls (P_GSEA ≤ 0.025, false discovery rate ≤ 0.25). Candidate pathways were validated using the ARTP method and similarities between pathways were computed against each other. The top-ranked pathways were ABC Transporters (P_GSEA < 0.001, P_ARTP = 0.001), VEGF Signaling Pathway (P_GSEA < 0.001, P_ARTP = 0.008), G1/S Check Point (P_GSEA = 0.004, P_ARTP = 0.013), and NRAGE Signals Death through JNK (P_GSEA = 0.006, P_ARTP = 0.001). Our results demonstrate that pathway analysis can shed light on post-GWAS research and help identify potential targets for cancer susceptibility.
MESH Headings
- Adult
- Aged
- Aged, 80 and over
- Asian People
- Carcinoma, Non-Small-Cell Lung/diagnosis
- Carcinoma, Non-Small-Cell Lung/ethnology
- Carcinoma, Non-Small-Cell Lung/genetics
- Carcinoma, Non-Small-Cell Lung/metabolism
- Case-Control Studies
- Databases, Genetic
- Female
- Gene Expression Regulation, Neoplastic
- Genetic Predisposition to Disease
- Genome, Human
- Genome-Wide Association Study
- Humans
- Logistic Models
- Lung Neoplasms/diagnosis
- Lung Neoplasms/ethnology
- Lung Neoplasms/genetics
- Lung Neoplasms/metabolism
- Male
- Metabolic Networks and Pathways/genetics
- Middle Aged
- Models, Genetic
- Polymorphism, Single Nucleotide
- Signal Transduction
Affiliation(s)
- Donghoon Lee
- Lung Cancer Branch, Research Institute and Hospital, National Cancer Center, Gyeonggi, Republic of Korea
- Geon Kook Lee
- Lung Cancer Branch, Research Institute and Hospital, National Cancer Center, Gyeonggi, Republic of Korea
- Kyong-Ah Yoon
- Lung Cancer Branch, Research Institute and Hospital, National Cancer Center, Gyeonggi, Republic of Korea
- Jin Soo Lee
- Lung Cancer Branch, Research Institute and Hospital, National Cancer Center, Gyeonggi, Republic of Korea
|
372
|
Lin WY, Yi N, Lou XY, Zhi D, Zhang K, Gao G, Tiwari HK, Liu N. Haplotype kernel association test as a powerful method to identify chromosomal regions harboring uncommon causal variants. Genet Epidemiol 2013; 37:560-70. [PMID: 23740760 DOI: 10.1002/gepi.21740] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 12/13/2012] [Revised: 05/01/2013] [Accepted: 05/06/2013] [Indexed: 01/09/2023]
Abstract
For most complex diseases, the fraction of heritability that can be explained by the variants discovered from genome-wide association studies is minor. Although the so-called "rare variants" (minor allele frequency [MAF] < 1%) have attracted increasing attention, they are unlikely to account for much of the "missing heritability" because very few people may carry these rare variants. The genetic variants that are likely to fill in the "missing heritability" include uncommon causal variants (MAF < 5%), which are generally untyped in association studies using tagging single-nucleotide polymorphisms (SNPs) or commercial SNP arrays. Developing powerful statistical methods can help to identify chromosomal regions harboring uncommon causal variants, while bypassing the genome-wide or exome-wide next-generation sequencing. In this work, we propose a haplotype kernel association test (HKAT) that is equivalent to testing the variance component of random effects for distinct haplotypes. With an appropriate weighting scheme given to haplotypes, we can further enhance the ability of HKAT to detect uncommon causal variants. With scenarios simulated according to the population genetics theory, HKAT is shown to be a powerful method for detecting chromosomal regions harboring uncommon causal variants.
Affiliation(s)
- Wan-Yu Lin
- Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
|
373
|
Petersen A, Alvarez C, DeClaire S, Tintle NL. Assessing methods for assigning SNPs to genes in gene-based tests of association using common variants. PLoS One 2013; 8:e62161. [PMID: 23741293 PMCID: PMC3669368 DOI: 10.1371/journal.pone.0062161] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Received: 12/24/2012] [Accepted: 03/18/2013] [Indexed: 11/18/2022] Open
Abstract
Gene-based tests of association are frequently applied to common SNPs (MAF>5%) as an alternative to single-marker tests. In this analysis we conduct a variety of simulation studies applied to five popular gene-based tests investigating general trends related to their performance in realistic situations. In particular, we focus on the impact of non-causal SNPs and a variety of LD structures on the behavior of these tests. Ultimately, we find that non-causal SNPs can significantly impact the power of all gene-based tests. On average, we find that the "noise" from 6-12 non-causal SNPs will cancel out the "signal" of one causal SNP across five popular gene-based tests. Furthermore, we find complex and differing behavior of the methods in the presence of LD within and between non-causal and causal SNPs. Ultimately, better approaches for a priori prioritization of potentially causal SNPs (e.g., predicting functionality of non-synonymous SNPs), application of these methods to sequenced or fully imputed datasets, and limited use of window-based methods for assigning inter-genic SNPs to genes will improve power. However, significant power loss from non-causal SNPs may remain unless alternative statistical approaches robust to the inclusion of non-causal SNPs are developed.
Affiliation(s)
- Ashley Petersen
- Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
- Carolina Alvarez
- Department of Biostatistics, Florida International University, Miami, Florida, United States of America
- Scott DeClaire
- Department of Mathematics, Hope College, Holland, Michigan, United States of America
- Nathan L. Tintle
- Department of Mathematics, Statistics and Computer Science, Dordt College, Sioux Center, Iowa, United States of America
|
374
|
Zhu H, Li L, Zhou H. Nonlinear dimension reduction with Wright-Fisher kernel for genotype aggregation and association mapping. Bioinformatics 2013; 28:i375-i381. [PMID: 22962455 PMCID: PMC3436833 DOI: 10.1093/bioinformatics/bts406] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/09/2023] Open
Abstract
MOTIVATION Association tests based on next-generation sequencing data are often under-powered due to the presence of rare variants and large amount of neutral or protective variants. A successful strategy is to aggregate genetic information within meaningful single-nucleotide polymorphism (SNP) sets, e.g. genes or pathways, and test association on SNP sets. Many existing methods for group-wise tests require specific assumptions about the direction of individual SNP effects and/or perform poorly in the presence of interactions. RESULTS We propose a joint association test strategy based on two key components: a nonlinear supervised dimension reduction approach for effective SNP information aggregation and a novel kernel specially designed for qualitative genotype data. The new test demonstrates superior performance in identifying causal genes over existing methods across a large variety of disease models simulated from sequence data of real genes. In general, the proposed method provides an association test strategy that can (i) detect both rare and common causal variants, (ii) deal with both additive and interaction effect, (iii) handle both quantitative traits and disease dichotomies and (iv) incorporate non-genetic covariates. In addition, the new kernel can potentially boost the power of the entire family of kernel-based methods for genetic data analysis. AVAILABILITY The method is implemented in MATLAB. Source code is available upon request. CONTACT hongjie.zhu@duke.edu.
Affiliation(s)
- Hongjie Zhu
- Department of Psychiatry and Behavior Science, Duke University, Durham, NC 27710, USA.
|
375
|
SNP set association analysis for genome-wide association studies. PLoS One 2013; 8:e62495. [PMID: 23658731 PMCID: PMC3643925 DOI: 10.1371/journal.pone.0062495] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 11/23/2012] [Accepted: 03/22/2013] [Indexed: 11/29/2022] Open
Abstract
The genome-wide association study (GWAS) is a promising approach for identifying common genetic variants associated with diseases on the basis of millions of single nucleotide polymorphisms (SNPs). To avoid the low power caused by excessive correction for multiple comparisons in single-locus association studies, some methods group SNPs into a SNP set based on genomic features and then test the joint effect of the SNP set. We compare the performance of principal component analysis (PCA), supervised principal component analysis (SPCA), kernel principal component analysis (KPCA), and sliced inverse regression (SIR). Simulated SNP sets are generated under scenarios with 0, 1 and ≥2 causal SNPs. Our simulation results show that all of these methods can control the type I error at the nominal significance level. SPCA is always more powerful than the other methods at different settings of linkage disequilibrium structures and minor allele frequency of the simulated datasets. We also apply these four methods to a real GWAS of non-small cell lung cancer (NSCLC) in a Han Chinese population.
|
376
|
Listgarten J, Lippert C, Kang EY, Xiang J, Kadie CM, Heckerman D. A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics 2013; 29:1526-33. [PMID: 23599503 PMCID: PMC3673214 DOI: 10.1093/bioinformatics/btt177] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.0] [Indexed: 12/22/2022]
Abstract
MOTIVATION Approaches for testing sets of variants, such as a set of rare or common variants within a gene or pathway, for association with complex traits are important. In particular, set tests allow for aggregation of weak signal within a set, can capture interplay among variants and reduce the burden of multiple hypothesis testing. Until now, these approaches did not address confounding by family relatedness and population structure, a problem that is becoming more important as larger datasets are used to increase power. RESULTS We introduce a new approach for set tests that handles confounders. Our model is based on the linear mixed model and uses two random effects: one to capture the set association signal and one to capture confounders. We also introduce a computational speedup for two random-effects models that makes this approach feasible even for extremely large cohorts. Using this model with both the likelihood ratio test and score test, we find that the former yields more power while controlling type I error. Application of our approach to richly structured Genetic Analysis Workshop 14 data demonstrates that our method successfully corrects for population structure and family relatedness, whereas application of our method to a 15 000 individual Crohn's disease case-control cohort demonstrates that it additionally recovers genes not recoverable by univariate analysis. AVAILABILITY A Python-based library implementing our approach is available at http://mscompbio.codeplex.com.
|
377
|
O'Brien KM, Orlow I, Antonescu CR, Ballman K, McCall L, DeMatteo R, Engel LS. Gastrointestinal stromal tumors, somatic mutations and candidate genetic risk variants. PLoS One 2013; 8:e62119. [PMID: 23637977 PMCID: PMC3630216 DOI: 10.1371/journal.pone.0062119] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 12/19/2012] [Accepted: 03/18/2013] [Indexed: 02/07/2023] Open
Abstract
Gastrointestinal stromal tumors (GISTs) are rare but treatable soft tissue sarcomas. Nearly all GISTs have somatic mutations in either the KIT or PDGFRA gene, but there are no known inherited genetic risk factors. We assessed the relationship between KIT/PDGFRA mutations and select deletions or single nucleotide polymorphisms (SNPs) in 279 participants from a clinical trial of adjuvant imatinib mesylate. Given previous evidence that certain susceptibility loci and carcinogens are associated with characteristic mutations, or "signatures" in other cancers, we hypothesized that the characteristic somatic mutations in the KIT and PDGFRA genes in GIST tumors may similarly be mutational signatures that are causally linked to specific mutagens or susceptibility loci. As previous epidemiologic studies suggest environmental risk factors such as dioxin and radiation exposure may be linked to sarcomas, we chose 208 variants in 39 candidate genes related to DNA repair and dioxin metabolism or response. We calculated adjusted odds ratios (ORs) and 95% confidence intervals (CIs) for the association between each variant and 7 categories of tumor mutation using logistic regression. We also evaluated gene-level effects using the sequence kernel association test (SKAT). Although none of the association p-values were statistically significant after adjustment for multiple comparisons, SNPs in CYP1B1 were strongly associated with KIT exon 11 codon 557-8 deletions (OR = 1.9, 95% CI: 1.3-2.9 for rs2855658 and OR = 1.8, 95% CI: 1.2-2.7 for rs1056836) and wild type GISTs (OR = 2.7, 95% CI: 1.5-4.8 for rs1800440 and OR = 0.5, 95% CI: 0.3-0.9 for rs1056836). CYP1B1 was also associated with these mutations categories in the SKAT analysis (p = 0.002 and p = 0.003, respectively). Other potential risk variants included GSTM1, RAD23B and ERCC2. 
This preliminary analysis of inherited genetic risk factors for GIST offers some clues about the disease's genetic origins and provides a starting point for future candidate gene or gene-environment research.
Affiliation(s)
- Katie M. O'Brien
- Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Irene Orlow
- Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
- Cristina R. Antonescu
- Department of Pathology, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
- Karla Ballman
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Linda McCall
- American College of Surgeons Oncology Group, Durham, North Carolina, United States of America
- Ronald DeMatteo
- Department of Surgery, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
- Lawrence S. Engel
- Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
378
Kazma R, Cardin NJ, Witte JS. Does accounting for gene-environment interactions help uncover association between rare variants and complex diseases? Hum Hered 2013; 74:205-14. [PMID: 23594498 DOI: 10.1159/000346825] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
Abstract
OBJECTIVE To determine whether accounting for gene-environment (G×E) interactions improves the power to detect associations between rare variants and a disease, we have extended three statistical methods and compared their power under various simulated disease models. METHODS To test for association of a group of rare variants with a disease, Min-P uses the lowest p value within the group of variants, CAST (Cohort Allelic Sums Test) uses an indicator variable to quantify the rare alleles within the group of variants, and SKAT (Sequence Kernel Association Test) uses a logistic regression based on kernel machine. For each method, we incorporate a term for the G×E interaction and test for association and interaction jointly. RESULTS When testing for disease association with a set of rare variants, accounting for G×E interactions can improve power in specific situations (pure interaction or high proportion of causal variants interacting with the environment). However, the power of this approach can decrease, in particular in the presence of main genetic or environmental effects. Among the methods compared, the optimized and weighted SKAT performed best, whether to test for genetic association or to test it jointly with G×E interactions. CONCLUSION This approach can be used in specific situations but is not appropriate for a primary analysis.
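As a rough illustration of two of the statistics named in this abstract, the sketch below computes a CAST-style carrier indicator and the Min-P statistic on toy data. The function names and data layout are hypothetical and are not taken from the paper; SKAT, the G×E interaction terms and the null distributions of the statistics are deliberately left out.

```python
from typing import List, Sequence


def cast_indicator(rare_allele_counts: Sequence[int]) -> int:
    """Return 1 if the subject carries at least one rare allele in the group, else 0."""
    return int(any(count > 0 for count in rare_allele_counts))


def carrier_table(genotypes: Sequence[Sequence[int]],
                  is_case: Sequence[bool]) -> List[List[int]]:
    """Count carriers and non-carriers among cases and controls.

    Returns [[case carriers, case non-carriers],
             [control carriers, control non-carriers]].
    """
    table = [[0, 0], [0, 0]]
    for row, case in zip(genotypes, is_case):
        r = 0 if case else 1
        c = 0 if cast_indicator(row) else 1
        table[r][c] += 1
    return table


def min_p(p_values: Sequence[float]) -> float:
    """Min-P statistic: the smallest per-variant p value in the group.

    Assumption: its significance would still have to be assessed
    separately (for example by permutation), which is not shown here.
    """
    return min(p_values)
```

The 2x2 table produced by `carrier_table` could then be passed to any standard test of association; which test the authors used is not stated in the abstract.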
Affiliation(s)
- Rémi Kazma
- Department of Epidemiology and Biostatistics and Institute for Human Genetics, University of California, San Francisco, CA, USA
379
Wu MC, Maity A, Lee S, Simmons EM, Harmon QE, Lin X, Engel SM, Molldrem JJ, Armistead PM. Kernel machine SNP-set testing under multiple candidate kernels. Genet Epidemiol 2013; 37:267-75. [PMID: 23471868 PMCID: PMC3769109 DOI: 10.1002/gepi.21715] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.3] [Received: 10/04/2012] [Revised: 01/15/2013] [Accepted: 02/05/2013] [Indexed: 11/10/2022]
Abstract
Joint testing for the cumulative effect of multiple single-nucleotide polymorphisms grouped on the basis of prior biological knowledge has become a popular and powerful strategy for the analysis of large-scale genetic association studies. The kernel machine (KM)-testing framework is a useful approach that has been proposed for testing associations between multiple genetic variants and many different types of complex traits by comparing pairwise similarity in phenotype between subjects to pairwise similarity in genotype, with similarity in genotype defined via a kernel function. An advantage of the KM framework is its flexibility: choosing different kernel functions allows for different assumptions concerning the underlying model and can allow for improved power. In practice, it is difficult to know which kernel to use a priori because this depends on the unknown underlying trait architecture and selecting the kernel which gives the lowest P-value can lead to inflated type I error. Therefore, we propose practical strategies for KM testing when multiple candidate kernels are present based on constructing composite kernels and based on efficient perturbation procedures. We demonstrate through simulations and real data applications that the procedures protect the type I error rate and can lead to substantially improved power over poor choices of kernels and only modest differences in power vs. using the best candidate kernel.
Affiliation(s)
- Michael C Wu
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599-7420, USA.
380
Wang X, Morris NJ, Zhu X, Elston RC. A variance component based multi-marker association test using family and unrelated data. BMC Genet 2013; 14:17. [PMID: 23497289 PMCID: PMC3614458 DOI: 10.1186/1471-2156-14-17] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 07/09/2012] [Accepted: 02/11/2013] [Indexed: 02/02/2023]
Abstract
Background Incorporating family data in genetic association studies has become increasingly appreciated, especially for its potential value in testing rare variants. We introduce here a variance-component based association test that can test multiple common or rare variants jointly using both family and unrelated samples. Results The proposed approach implemented in our R package aggregates or collapses the information across a region based on genetic similarity instead of genotype scores, which avoids the power loss when the effects are in different directions or have different association strengths. The method is also able to effectively leverage the LD information in a region and it can produce a test statistic with an adaptively estimated number of degrees of freedom. Our method can readily allow for the adjustment of non-genetic contributions to the familial similarity, as well as multiple covariates. Conclusions We demonstrate through simulations that the proposed method achieves good performance in terms of Type I error control and statistical power. The method is implemented in the R package “fassoc”, which provides a useful tool for data analysis and exploration.
Affiliation(s)
- Xuefeng Wang
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA
381
Freytag S, Bickeböller H, Amos CI, Kneib T, Schlather M. A novel kernel for correcting size bias in the logistic kernel machine test with an application to rheumatoid arthritis. Hum Hered 2013; 74:97-108. [PMID: 23466369 DOI: 10.1159/000347188] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 11/20/2012] [Accepted: 01/17/2013] [Indexed: 01/07/2023]
Abstract
OBJECTIVES The logistic kernel machine test (LKMT) is a testing procedure tailored towards high-dimensional genetic data. Its use in pathway analyses of case-control genome-wide association studies results from its computational efficiency and flexibility in incorporating additional information via the kernel. The kernel can be any positive definite function; unfortunately, its form strongly influences the test's power and bias. Most authors have recommended the use of a simple linear kernel. We demonstrate via a simulation that the probability of rejecting the null hypothesis of no association just by chance increases with the number of SNPs or genes in the pathway when applying a simple linear kernel. METHODS We propose a novel kernel that includes an appropriate standardization in order to protect against any inflation of false positive results. Moreover, our novel kernel contains information on gene membership of SNPs in the pathway. RESULTS When applying the novel kernel to data from the North American Rheumatoid Arthritis Consortium, we find that even this basic genomic structure can improve the ability of the LKMT to identify meaningful associations. We also demonstrate that the standardization effectively eliminates problems of size bias. CONCLUSION We recommend the use of our standardized kernel and urge caution when using non-adjusted kernels in the LKMT to conduct pathway analyses.
Affiliation(s)
- Saskia Freytag
- Department of Genetic Epidemiology, Medical School, Georg-August University Göttingen, Göttingen, Germany.
382
Gene-based testing of interactions in association studies of quantitative traits. PLoS Genet 2013; 9:e1003321. [PMID: 23468652 PMCID: PMC3585009 DOI: 10.1371/journal.pgen.1003321] [Citation(s) in RCA: 63] [Impact Index Per Article: 5.3] [Received: 08/03/2012] [Accepted: 12/31/2012] [Indexed: 01/05/2023]
Abstract
Various methods have been developed for identifying gene–gene interactions in genome-wide association studies (GWAS). However, most methods focus on individual markers as the testing unit, and the large number of such tests drastically erodes statistical power. In this study, we propose novel interaction tests of quantitative traits that are gene-based and that confer advantage in both statistical power and biological interpretation. The framework of gene-based gene–gene interaction (GGG) tests combines marker-based interaction tests between all pairs of markers in two genes to produce a gene-level test for interaction between the two. The tests are based on an analytical formula we derive for the correlation between marker-based interaction tests due to linkage disequilibrium. We propose four GGG tests that extend the following P value combining methods: minimum P value, extended Simes procedure, truncated tail strength, and truncated P value product. Extensive simulations point to correct type I error rates of all tests and show that the two truncated tests are more powerful than the other tests in cases of markers involved in the underlying interaction not being directly genotyped and in cases of multiple underlying interactions. We applied our tests to pairs of genes that exhibit a protein–protein interaction to test for gene-level interactions underlying lipid levels using genotype data from the Atherosclerosis Risk in Communities study. We identified five novel interactions that are not evident from marker-based interaction testing and successfully replicated one of these interactions, between SMAD3 and NEDD9, in an independent sample from the Multi-Ethnic Study of Atherosclerosis. We conclude that our GGG tests show improved power to identify gene-level interactions in existing, as well as emerging, association studies.
Epistasis is likely to play a significant role in complex diseases or traits and is one of the many possible explanations for “missing heritability.” However, epistatic interactions have been difficult to detect in genome-wide association studies (GWAS) due to the limited power caused by the multiple-testing correction from the large number of tests conducted. Gene-based gene–gene interaction (GGG) tests might hold the key to relaxing the multiple-testing correction burden and increasing the power for identifying epistatic interactions in GWAS. Here, we developed GGG tests of quantitative traits by extending four P value combining methods and evaluated their type I error rates and power using extensive simulations. All four GGG tests are more powerful than a principal component-based test. We also applied our GGG tests to data from the Atherosclerosis Risk in Communities study and found five gene-level interactions associated with the levels of total cholesterol and high-density lipoprotein cholesterol (HDL-C). One interaction between SMAD3 and NEDD9 on HDL-C was further replicated in an independent sample from the Multi-Ethnic Study of Atherosclerosis.
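To make the idea of P value combining concrete, the sketch below computes three textbook statistics on a list of marker-pair P values: the minimum P value, the plain Simes statistic, and a truncated P value product. This is only an illustration under the simplifying assumption of independent tests. The GGG tests described above extend such methods to account for correlation due to linkage disequilibrium, and that correction is not reproduced here. The function names and the default threshold `tau` are hypothetical.

```python
import math
from typing import Sequence


def minimum_p(p_values: Sequence[float]) -> float:
    """Smallest P value among the marker-pair interaction tests."""
    return min(p_values)


def simes(p_values: Sequence[float]) -> float:
    """Plain Simes statistic: min over i of m * p_(i) / i for sorted P values.

    Assumption: this is the standard, unextended procedure.
    """
    m = len(p_values)
    ordered = sorted(p_values)
    return min(m * p / rank for rank, p in enumerate(ordered, start=1))


def truncated_product(p_values: Sequence[float], tau: float = 0.05) -> float:
    """Product of all P values that do not exceed tau (1.0 if there are none)."""
    return math.prod(p for p in p_values if p <= tau)
```

Smaller values of each statistic indicate stronger combined evidence; turning them into gene-level P values requires a reference distribution, which this sketch does not attempt to provide.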
383
Thornton KR, Foran AJ, Long AD. Properties and modeling of GWAS when complex disease risk is due to non-complementing, deleterious mutations in genes of large effect. PLoS Genet 2013; 9:e1003258. [PMID: 23437004 PMCID: PMC3578756 DOI: 10.1371/journal.pgen.1003258] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.3] [Received: 09/14/2012] [Accepted: 12/02/2012] [Indexed: 01/08/2023]
Abstract
Current genome-wide association studies (GWAS) have high power to detect intermediate frequency SNPs making modest contributions to complex disease, but they are underpowered to detect rare alleles of large effect (RALE). This has led to speculation that the bulk of variation for most complex diseases is due to RALE. One concern with existing models of RALE is that they do not make explicit assumptions about the evolution of a phenotype and its molecular basis. Rather, much of the existing literature relies on arbitrary mapping of phenotypes onto genotypes obtained either from standard population-genetic simulation tools or from non-genetic models. We introduce a novel simulation of a 100-kilobase gene region, based on the standard definition of a gene, in which mutations are unconditionally deleterious, are continuously arising, have partially recessive and non-complementing effects on phenotype (analogous to what is widely observed for most Mendelian disorders), and are interspersed with neutral markers that can be genotyped. Genes evolving according to this model exhibit a characteristic GWAS signature consisting of an excess of marginally significant markers. Existing tests for an excess burden of rare alleles in cases have low power while a simple new statistic has high power to identify disease genes evolving under our model. The structure of linkage disequilibrium between causative mutations and significantly associated markers under our model differs fundamentally from that seen when rare causative markers are assumed to be neutral. Rather than tagging single haplotypes bearing a large number of rare causative alleles, we find that significant SNPs in a GWAS tend to tag single causative mutations of small effect relative to other mutations in the same gene. Our results emphasize the importance of evaluating the power to detect associations under models that are genetically and evolutionarily motivated. 
Current GWA studies typically only explain a small fraction of heritable variation in complex traits, resulting in speculation that a large fraction of variation in such traits may be due to rare alleles of large effect (RALE). The most parsimonious evolutionary mechanism that results in an inverse relationship between the frequency and effect size of causative alleles is an equilibrium between newly arising deleterious mutations and selection eliminating those mutations, resulting in an inverse relation between effect size and average frequency. This assumption is not built into many current models of RALE and, as a result, power calculations may be misleading. We use forward population genetic simulations to explore the ability of GWAS to detect genes in which unconditionally deleterious, partially recessive mutations arise each generation. Our model is based on the standard definition of a gene as a region within which loss-of-function mutations fail to complement, consistent with the multi-allelic basis for Mendelian disorders. Our model predicts that it may not be uncommon for single genes evolving under our model to contribute upwards of 5% to variation in a complex trait, and that such genes could be routinely detected via modified GWAS approaches.
Affiliation(s)
- Kevin R. Thornton
- Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine, California, United States of America
- * E-mail: (KRT); (ADL)
- Andrew J. Foran
- Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine, California, United States of America
- Anthony D. Long
- Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine, California, United States of America
- * E-mail: (KRT); (ADL)
384
Zakharov S, Salim A, Thalamuthu A. Comparison of similarity-based tests and pooling strategies for rare variants. BMC Genomics 2013; 14:50. [PMID: 23343094 PMCID: PMC3600007 DOI: 10.1186/1471-2164-14-50] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/09/2012] [Accepted: 01/17/2013] [Indexed: 11/10/2022]
Abstract
Background As several rare genomic variants have been shown to affect common phenotypes, rare variants association analysis has received considerable attention. Several efficient association tests using genotype and phenotype similarity measures have been proposed in the literature. The major advantages of similarity-based tests are their ability to accommodate multiple types of DNA variations within one association test, and to account for the possible interaction within a region. However, not much work has been done to compare the performance of similarity-based tests on rare variants association scenarios, especially when applied with different rare variants pooling strategies. Results Based on the population genetics simulations and analysis of a publicly-available sequencing data set, we compared the performance of four similarity-based tests and two rare variants pooling strategies. We showed that the weighting approach outperforms collapsing in the presence of a strong effect from rare variants and in the presence of a moderate effect from common variants, whereas collapsing of rare variants is preferable when common variants possess a strong effect. We also demonstrated that the difference in statistical power between the two pooling strategies may be substantial. The results also highlighted consistently high power of two similarity-based approaches when applied with an appropriate pooling strategy. Conclusions Population genetics simulations and sequencing data set analysis showed high power of two similarity-based tests and a substantial difference in power between the two pooling strategies.
Affiliation(s)
- Sergii Zakharov
- Human Genetics, Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672, Singapore.
385
Wang J, Zhao Z, Cao Z, Yang A, Zhang J. A probabilistic method for identifying rare variants underlying complex traits. BMC Genomics 2013; 14 Suppl 1:S11. [PMID: 23369113 PMCID: PMC3549819 DOI: 10.1186/1471-2164-14-s1-s11] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
BACKGROUND Identifying the genetic variants that contribute to disease susceptibilities is important both for developing methodologies and for studying complex diseases in molecular biology. It has been demonstrated that the spectrum of minor allelic frequencies (MAFs) of risk genetic variants ranges from common to rare. Although association studies are shifting to incorporate rare variants (RVs) affecting complex traits, existing approaches do not show a high degree of success, and more efforts should be considered. RESULTS In this article, we focus on detecting associations between multiple rare variants and traits. Similar to RareCover, a widely used approach, we assume that variants located close to each other tend to have similar impacts on traits. Therefore, we introduce elevated regions and background regions, where the elevated regions are considered to have a higher chance of harboring causal variants. We propose a hidden Markov random field (HMRF) model to select a set of rare variants that potentially underlie the phenotype, and then, a statistical test is applied. Thus, the association analysis can be achieved without pre-selection by experts. In our model, each variant has two hidden states that represent the causal/non-causal status and the region status. In addition, two Bayesian processes are used to compare and estimate the genotype, phenotype and model parameters. We compare our approach to the three current methods using different types of datasets, and though these are simulation experiments, our approach has higher statistical power than the other methods. The software package, RareProb and the simulation datasets are available at: http://www.engr.uconn.edu/~jiw09003.
Affiliation(s)
- Jiayin Wang
- Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, PR China.
386
Machiela MJ, Lindström S, Allen NE, Haiman CA, Albanes D, Barricarte A, Berndt SI, Bueno-de-Mesquita HB, Chanock S, Gaziano JM, Gapstur SM, Giovannucci E, Henderson BE, Jacobs EJ, Kolonel LN, Krogh V, Ma J, Stampfer MJ, Stevens VL, Stram DO, Tjønneland A, Travis R, Willett WC, Hunter DJ, Le Marchand L, Kraft P. Association of type 2 diabetes susceptibility variants with advanced prostate cancer risk in the Breast and Prostate Cancer Cohort Consortium. Am J Epidemiol 2012. [PMID: 23193118 DOI: 10.1093/aje/kws191] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.5] [Indexed: 12/14/2022]
Abstract
Observational studies have found an inverse association between type 2 diabetes (T2D) and prostate cancer (PCa), and genome-wide association studies have found common variants near 3 loci associated with both diseases. The authors examined whether a genetic background that favors T2D is associated with risk of advanced PCa. Data from the National Cancer Institute's Breast and Prostate Cancer Cohort Consortium, a genome-wide association study of 2,782 advanced PCa cases and 4,458 controls, were used to evaluate whether individual single nucleotide polymorphisms or aggregations of these 36 T2D susceptibility loci are associated with PCa. Ten T2D markers near 9 loci (NOTCH2, ADCY5, JAZF1, CDKN2A/B, TCF7L2, KCNQ1, MTNR1B, FTO, and HNF1B) were nominally associated with PCa (P < 0.05); the association for single nucleotide polymorphism rs757210 at the HNF1B locus was significant when multiple comparisons were accounted for (adjusted P = 0.001). Genetic risk scores weighted by the T2D log odds ratio and multilocus kernel tests also indicated a significant relation between T2D variants and PCa risk. A mediation analysis of 9,065 PCa cases and 9,526 controls failed to produce evidence that diabetes mediates the association of the HNF1B locus with PCa risk. These data suggest a shared genetic component between T2D and PCa and add to the evidence for an interrelation between these diseases.
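The weighted genetic risk score mentioned in this abstract can be sketched as a sum of risk-allele counts, each multiplied by the log odds ratio reported for that variant. The function name, argument layout and example values below are hypothetical and are not drawn from the study; the sketch only shows the arithmetic of the weighting.

```python
import math
from typing import Sequence


def weighted_risk_score(risk_allele_counts: Sequence[int],
                        odds_ratios: Sequence[float]) -> float:
    """Sum of risk-allele counts (0, 1 or 2 per variant) weighted by log(OR)."""
    if len(risk_allele_counts) != len(odds_ratios):
        raise ValueError("need exactly one odds ratio per variant")
    return sum(count * math.log(odds_ratio)
               for count, odds_ratio in zip(risk_allele_counts, odds_ratios))
```

For a subject with counts `[2, 0, 1]` at three variants with illustrative odds ratios `[1.5, 1.2, 1.1]`, the score is `2 * log(1.5) + log(1.1)`. A variant with an odds ratio of 1 contributes nothing regardless of genotype.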
Affiliation(s)
- Mitchell J Machiela
- Program in Molecular and Genetic Epidemiology, Department of Epidemiology, Harvard School of Public Health, Boston, MA 02115, USA
387
Kazma R, Mefford JA, Cheng I, Plummer SJ, Levin AM, Rybicki BA, Casey G, Witte JS. Association of the innate immunity and inflammation pathway with advanced prostate cancer risk. PLoS One 2012; 7:e51680. [PMID: 23272139 PMCID: PMC3522730 DOI: 10.1371/journal.pone.0051680] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.5] [Received: 08/09/2012] [Accepted: 11/05/2012] [Indexed: 01/13/2023]
Abstract
Prostate cancer is the most frequent and second most lethal cancer in men in the United States. Innate immunity and inflammation may increase the risk of prostate cancer. To determine the role of innate immunity and inflammation in advanced prostate cancer, we investigated the association of 320 single nucleotide polymorphisms, located in 46 genes involved in this pathway, with disease risk using 494 cases with advanced disease and 536 controls from Cleveland, Ohio. Taken together, the whole pathway was associated with advanced prostate cancer risk (P = 0.02). Two sub-pathways (intracellular antiviral molecules and extracellular pattern recognition) and four genes in these sub-pathways (TLR1, TLR6, OAS1, and OAS2) were nominally associated with advanced prostate cancer risk and harbor several SNPs nominally associated with advanced prostate cancer risk. Our results suggest that the innate immunity and inflammation pathway may play a modest role in the etiology of advanced prostate cancer through multiple small effects.
Affiliation(s)
- Rémi Kazma
- Department of Epidemiology and Biostatistics and Institute for Human Genetics, University of California San Francisco, San Francisco, California, United States of America
- Joel A. Mefford
- Department of Epidemiology and Biostatistics and Institute for Human Genetics, University of California San Francisco, San Francisco, California, United States of America
- Iona Cheng
- Epidemiology Program, University of Hawai’i Cancer Center, University of Hawai’i, Honolulu, Hawai’i, United States of America
- Sarah J. Plummer
- Department of Preventive Medicine, Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America
- Albert M. Levin
- Department of Biostatistics and Research Epidemiology, Henry Ford Health System, Detroit, Michigan, United States of America
- Benjamin A. Rybicki
- Department of Biostatistics and Research Epidemiology, Henry Ford Health System, Detroit, Michigan, United States of America
- Graham Casey
- Department of Preventive Medicine, Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America
- John S. Witte
- Department of Epidemiology and Biostatistics and Institute for Human Genetics, University of California San Francisco, San Francisco, California, United States of America
388
Power of a reproducing kernel-based method for testing the joint effect of a set of single-nucleotide polymorphisms. Genetica 2012. [PMID: 23180006 DOI: 10.1007/s10709-012-9690-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 10/27/2022]
Abstract
This study explored a semi-parametric method built upon reproducing kernels for estimating and testing the joint effect of a set of single nucleotide polymorphisms (SNPs). The kernel adopted is the identity-by-state kernel that measures SNP similarity between subjects. In this article, through simulations we first assessed its statistical power under different situations. It was found that in addition to the effect of sample size, the testing power was impacted by the strength of association between SNPs and the outcome of interest, and by the SNP similarity among the subjects. A quadratic relationship between SNP similarity and testing power was identified, and this relationship was further affected by sample sizes. Next we applied the method to a SNP-lung function data set to estimate and test the joint effect of a set of SNPs on forced vital capacity, one type of lung function measure. The findings were then connected to the patterns observed in simulation studies and further explored via variable importance indices of each SNP inferred from a variable selection procedure.
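For readers unfamiliar with the identity-by-state (IBS) kernel, the sketch below shows one common way to compute it from additively coded genotypes (0, 1 or 2 copies of an allele per SNP). The exact form used in the study is not given in the abstract, so the formula here, the average over SNPs of `(2 - |g_i - g_j|) / 2`, is an assumption, and the function name is hypothetical.

```python
from typing import List, Sequence


def ibs_kernel(genotypes: Sequence[Sequence[int]]) -> List[List[float]]:
    """Pairwise IBS similarity between subjects.

    Each row of `genotypes` holds one subject's allele counts (0, 1 or 2)
    across the SNP set. Entry [i][j] is the proportion of alleles shared
    identical-by-state, so it lies in [0, 1] and equals 1 on the diagonal.
    """
    n_snps = len(genotypes[0])
    return [
        [
            sum(2 - abs(a - b) for a, b in zip(g_i, g_j)) / (2 * n_snps)
            for g_j in genotypes
        ]
        for g_i in genotypes
    ]
```

Two subjects with identical genotypes at every SNP get similarity 1, while opposite homozygotes at every SNP get similarity 0.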
389
Identification of genetic associations of SP110/MYBBP1A/RELA with pulmonary tuberculosis in the Chinese Han population. Hum Genet 2012; 132:265-73. [PMID: 23129390 DOI: 10.1007/s00439-012-1244-5] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Received: 06/18/2012] [Accepted: 10/17/2012] [Indexed: 10/27/2022]
Abstract
Genetic factors play important roles in the development of tuberculosis (TB). SP110 is a promising candidate target for controlling TB infections. However, several studies associating SP110 single nucleotide polymorphisms (SNPs) with TB have yielded conflicting results. This may be partly resolved by studying other genes associated with SP110, such as MYBBP1A and RELA. Here, we genotyped 6 SP110 SNPs, 8 MYBBP1A SNPs and 5 RELA SNPs in 702 Chinese pulmonary TB patients and 425 healthy subjects using MassARRAY and SNaPshot methods. Using SNP-based analysis with Bonferroni correction, rs3809849 in MYBBP1A [Pcorrected (cor) = 0.0038] and rs9061 in SP110 (Pcor = 0.019) were found to be significantly associated with TB. Furthermore, meta-analysis of rs9061 in East Asian populations showed that the rs9061 T allele conferred significant risk for TB [P = 0.002, pooled odds ratio (OR), 1.24, 95% confidence interval (CI) = 1.08-1.43]. The MYBBP1A GTCTTGGG haplotype and haplotypes CGACCG/TGATTG within SP110 were found to be markedly and significantly associated with TB (P = 2.00E-06, 5.00E-6 and 2.59E-4, respectively). Gene-based analysis also demonstrated that SP110 and MYBBP1A were each associated with TB (Pcor = 0.011 and 0.035, respectively). The logistic regression analysis results supported interactions between SP110 and MYBBP1A, indicating that subjects carrying a GC/CC genotype in MYBBP1A and CC genotype in SP110 had a high risk of developing TB (P = 1.74E-12). Our study suggests that a combination of SP110 and MYBBP1A gene polymorphisms may serve as a novel marker for identifying the risk of developing TB in the Chinese Han population.
390
Shui IM, Mucci LA, Wilson KM, Kraft P, Penney KL, Stampfer MJ, Giovannucci E. Common genetic variation of the calcium-sensing receptor and lethal prostate cancer risk. Cancer Epidemiol Biomarkers Prev 2012; 22:118-26. [PMID: 23125333 DOI: 10.1158/1055-9965.epi-12-0670-t] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Indexed: 01/31/2023]
Abstract
BACKGROUND Bony metastases cause substantial morbidity and mortality from prostate cancer (PCa). The calcium-sensing receptor (CaSR) is expressed on prostate tumors and may participate in bone metastases development. We assessed whether (i) common genetic variation in CaSR was associated with PCa risk and (ii) these associations varied by calcium intake or plasma 25-hydroxyvitamin D [25(OH)D] levels. METHODS We included 1,193 PCa cases and 1,244 controls nested in the prospective Health Professionals Follow-up Study (1993-2004). We genotyped 18 CaSR single-nucleotide polymorphisms (SNPs) to capture common variation. The main outcome was risk of lethal PCa (n = 113); secondary outcomes were overall (n = 1,193) and high-grade PCa (n = 225). We used the kernel machine approach to conduct a gene-level multimarker analysis and unconditional logistic regression to compute per-allele ORs and 95% confidence intervals (CI) for individual SNPs. RESULTS The joint association of SNPs in CaSR was significant for lethal PCa (P = 0.04); this association was stronger in those with low 25(OH)D (P = 0.009). No individual SNPs were associated after considering multiple testing; three SNPs were nominally associated (P < 0.05) with lethal PCa with ORs (95% CI) of 0.65(0.42-0.99): rs6438705; 0.65(0.47-0.89): rs13083990; and 1.55(1.09-2.20): rs2270916. The three nonsynonymous SNPs (rs1801725, rs1042636, and rs1801726) were not significantly associated; however, the association for rs1801725 was stronger in men with low 25(OH)D [OR(95%CI): 0.54(0.31-0.95)]. There were no significant associations with overall or high-grade PCa. CONCLUSIONS Our findings indicate that CaSR may be involved in PCa progression. IMPACT Further studies investigating potential mechanisms for CaSR and PCa, including bone remodeling and metastases are warranted.
Affiliation(s)
- Irene M Shui
- Department of Epidemiology, Harvard School of Public Health, 677 Huntington Avenue, Boston, MA 02115, USA.
|
391
|
Jia P, Zhao Z. Searching joint association signals in CATIE schizophrenia genome-wide association studies through a refined integrative network approach. BMC Genomics 2012; 13 Suppl 6:S15. [PMID: 23134571 PMCID: PMC3481439 DOI: 10.1186/1471-2164-13-s6-s15] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/24/2022]
Abstract
Background Genome-wide association studies (GWAS) have generated a wealth of valuable genotyping data for complex diseases/traits. A large proportion of these data are embedded with many weakly associated markers that have been missed in traditional single marker analyses, but they may provide valuable insights in dissecting the genetic components of diseases. Gene set analysis (GSA) augmented by protein-protein interaction network data provides a promising way to examine GWAS data by analyzing the combined effects of multiple genes/markers, each of which may have only individually weak to moderate association effects. A critical issue in GSA of GWAS data is the definition of gene-wise P values based on multiple SNPs mapped to a gene. Results In this study, we proposed an alternative restricted search approach based on our previously developed dense module search algorithm, and we demonstrated it in the CATIE GWAS dataset for schizophrenia. Specifically, we explored three ways of computing gene-wise P values and examined their effects on the resultant module genes. These methods calculate gene-wise P values based on all the SNPs, the top ranked SNPs, or the most significant SNP among all the SNPs mapped to a gene. We applied the restricted search approach and identified a module gene set for each of the gene-wise P value data sets. In our evaluation using an independent method, ALIGATOR, we showed that although each of these input datasets generated a unique set of module genes, all of them were significant in the GWAS dataset. Further functional enrichment analysis of these module genes showed that at the pathway level, they were all consistently related to neuro- and immune-related pathways. Finally, we compared our method with a previously reported method. Conclusion Our results showed that the approaches to computing gene-wise P values in GWAS data are critical in GSA. This work is useful for evaluating key factors in GSA of GWAS data.
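The three gene-wise P value strategies named in this abstract (all SNPs, top-ranked SNPs, most significant SNP) can be illustrated roughly as follows. The abstract does not say how the SNP P values are combined, so Fisher's method is used here purely as a stand-in; it assumes independent SNPs, which linkage disequilibrium violates, and none of the three outputs is corrected for gene size. The function names and the `top_k` parameter are hypothetical.

```python
import math

def fisher_combined_p(pvalues):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k df under independence."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # half of the chi-square statistic
    # Closed-form chi-square survival function for even df = 2k:
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total

def gene_wise_p(snp_pvalues, strategy="all", top_k=3):
    """Summarise the SNP P values mapped to one gene (illustrative only)."""
    ranked = sorted(snp_pvalues)
    if strategy == "min":   # most significant SNP, uncorrected for gene size
        return ranked[0]
    if strategy == "top":   # top-ranked SNPs only (selection makes this optimistic)
        return fisher_combined_p(ranked[:top_k])
    if strategy == "all":   # every SNP mapped to the gene
        return fisher_combined_p(ranked)
    raise ValueError(f"unknown strategy: {strategy}")

snps = [0.001, 0.03, 0.2, 0.6, 0.9]
summary = {s: gene_wise_p(snps, s) for s in ("min", "top", "all")}
```

Because the three summaries can rank the same gene very differently, the choice of summary feeds directly into which module genes a downstream search selects, which is the sensitivity the abstract reports.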
Affiliation(s)
- Peilin Jia
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
|
392
|
Zhang Y, Guan W, Pan W. Adjustment for population stratification via principal components in association analysis of rare variants. Genet Epidemiol 2012; 37:99-109. [PMID: 23065775 DOI: 10.1002/gepi.21691] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Received: 04/12/2012] [Revised: 09/11/2012] [Accepted: 09/13/2012] [Indexed: 11/07/2022]
Abstract
For unrelated samples, principal component (PC) analysis has been established as a simple and effective approach to adjusting for population stratification in association analysis of common variants (CVs, with minor allele frequencies MAF > 5%). However, it is less clear how it would perform in analysis of low-frequency variants (LFVs, MAF between 1% and 5%), or of rare variants (RVs, MAF < 1%). Furthermore, with next-generation sequencing data, it is unknown whether PCs should be constructed based on CVs, LFVs, or RVs. In this study, we used the 1000 Genomes Project sequence data to explore the construction of PCs and their use in association analysis of LFVs or RVs for unrelated samples. It is shown that a few top PCs based on either CVs or LFVs could separate two continental groups, European and African samples, but those based on only RVs performed less well. When applied to several association tests in simulated data with population stratification, using PCs based on either CVs or LFVs was effective in controlling Type I error rates, while nonadjustment led to inflated Type I error rates. Perhaps the most interesting observation is that, although the PCs based on LFVs could better separate the two continental groups than those based on CVs, the use of the former could lead to overadjustment in the sense of substantial power loss in the absence of population stratification; in contrast, we did not see any problem with the use of the PCs based on CVs in all our examples.
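The variant classes used in this abstract can be written as a small classifier. This is a sketch that takes RVs to be variants with MAF below 1%, the lower end of the LFV range; how the exact boundary values (1% and 5%) are assigned is an assumption, as are the function names and the 0/1/2 genotype coding.

```python
def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0, 1 or 2 copies of one allele (no missing data)."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1.0 - freq)

def maf_class(maf):
    """Classify a variant as common (CV), low-frequency (LFV) or rare (RV)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie in [0, 0.5]")
    if maf > 0.05:
        return "CV"
    if maf >= 0.01:   # boundary handling is an assumption
        return "LFV"
    return "RV"

# 100 individuals, 3 heterozygotes: 3 minor alleles out of 200 -> MAF 0.015
example_genotypes = [1, 1, 1] + [0] * 97
example_class = maf_class(minor_allele_frequency(example_genotypes))
```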
Affiliation(s)
- Yiwei Zhang
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455-0392, USA
|
393
|
Yuan Z, Gao Q, He Y, Zhang X, Li F, Zhao J, Xue F. Detection for gene-gene co-association via kernel canonical correlation analysis. BMC Genet 2012; 13:83. [PMID: 23039928 PMCID: PMC3506484 DOI: 10.1186/1471-2156-13-83] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 03/18/2012] [Accepted: 09/28/2012] [Indexed: 11/15/2022]
Abstract
Background Currently, most methods for detecting gene-gene interaction (GGI) in genomewide association studies (GWASs) are limited in their use of single nucleotide polymorphism (SNP) as the unit of association. One way to address this drawback is to consider higher level units such as genes or regions in the analysis. Earlier we proposed a statistic based on canonical correlations (CCU) as a gene-based method for detecting gene-gene co-association. However, it can only capture linear relationships, not nonlinear correlation between genes. We therefore proposed a counterpart (KCCU) based on kernel canonical correlation analysis (KCCA). Results Through simulation the KCCU statistic was shown to be a valid test and more powerful than the CCU statistic with respect to sample size and interaction odds ratio. Analysis of data from regions involving three genes on rheumatoid arthritis (RA) from Genetic Analysis Workshop 16 (GAW16) indicated that only the KCCU statistic was able to identify interactions reported earlier. Conclusions The KCCU statistic is a valid and powerful gene-based method for detecting gene-gene co-association.
Affiliation(s)
- Zhongshang Yuan
- Department of Epidemiology and Health Statistics, School of Public Health, Shandong University, Jinan, 250012, China
|
394
|
Wei P, Tang H, Li D. Insights into pancreatic cancer etiology from pathway analysis of genome-wide association study data. PLoS One 2012; 7:e46887. [PMID: 23056513 PMCID: PMC3464266 DOI: 10.1371/journal.pone.0046887] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.7] [Received: 06/22/2012] [Accepted: 09/06/2012] [Indexed: 12/15/2022]
Abstract
BACKGROUND Pancreatic cancer is the fourth leading cause of cancer death in the U.S. and the etiology of this highly lethal disease has not been well defined. To identify genetic susceptibility factors for pancreatic cancer, we conducted pathway analysis of genome-wide association study (GWAS) data in 3,141 pancreatic cancer patients and 3,367 controls with European ancestry. METHODS Using the gene set ridge regression in association studies (GRASS) method, we analyzed 197 pathways identified from the Kyoto Encyclopedia of Genes and Genomes database. We used the logistic kernel machine (LKM) test to identify major contributing genes to each pathway. We conducted functional enrichment analysis of the most significant genes (P<0.01) using the Database for Annotation, Visualization, and Integrated Discovery (DAVID). RESULTS Two pathways were significantly associated with risk of pancreatic cancer after adjusting for multiple comparisons (P<0.00025) and in replication testing: neuroactive ligand-receptor interaction (Ps<0.00002) and the olfactory transduction pathway (P = 0.0001). LKM test identified four genes that were significantly associated with risk of pancreatic cancer after Bonferroni correction (P<1×10⁻⁵): ABO, HNF1A, OR13C4, and SHH. Functional enrichment analysis using DAVID consistently found the G protein-coupled receptor signaling pathway (including both neuroactive ligand-receptor interaction and olfactory transduction pathways) to be the most significant pathway for pancreatic cancer risk in this study population. CONCLUSION These novel findings provide new perspectives on genetic susceptibility to and molecular mechanisms of pancreatic cancer.
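The pathway-level threshold quoted in this abstract matches a Bonferroni correction over the 197 pathways tested. The short check below assumes a family-wise alpha of 0.05, which the abstract does not state explicitly.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

# 197 KEGG pathways; alpha = 0.05 is an assumption
pathway_threshold = bonferroni_threshold(0.05, 197)  # about 0.000254

# The two pathway P values reported in the abstract (upper bound and exact value)
reported = {"neuroactive ligand-receptor interaction": 0.00002,
            "olfactory transduction": 0.0001}
significant = {name: p < pathway_threshold for name, p in reported.items()}
```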
Affiliation(s)
- Peng Wei
- Division of Biostatistics and Human Genetics Center, School of Public Health, University of Texas Health Science Center, Houston, Texas, United States of America
- Hongwei Tang
- Department of Gastrointestinal Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Donghui Li
- Department of Gastrointestinal Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
|
395
|
Liu J, Vidaillet H, Burnside E, Page D. A Collective Ranking Method for Genome-wide Association Studies. ACM-BCB ... ... : THE ... ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE. ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE 2012; 2012:313-320. [PMID: 32355913 DOI: 10.1145/2382936.2382976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
Abstract
Genome-wide association studies (GWAS) analyze genetic variation (SNPs) across the entire human genome, searching for SNPs that are associated with certain phenotypes, most often diseases, such as breast cancer. In GWAS, we seek a ranking of SNPs in terms of their relevance to the given phenotype. However, because certain SNPs are known to be highly correlated with one another across individuals, it can be beneficial to take into account these correlations when ranking. If a SNP appears associated with the phenotype, and we question whether this association is real, the extent to which its neighbors (correlated SNPs) also appear associated can be informative. Therefore, we propose CollectRank, a ranking approach which allows SNPs to reinforce one another via the correlation structure. CollectRank is loosely analogous to the well-known PageRank algorithm. We first evaluate CollectRank on synthetic data generated from a variety of genetic models under different settings. The numerical results suggest CollectRank can significantly outperform common GWAS methods at the cost of a small amount of extra computation. We further evaluate CollectRank on two real-world GWAS on breast cancer and atrial fibrillation/flutter, and CollectRank performs well in both studies. We finally provide a theoretical analysis that also suggests CollectRank's advantages.
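The abstract does not give CollectRank's update rule, so the sketch below is only a PageRank-style illustration of the general idea it describes (SNP scores reinforcing one another along the correlation structure). It is not the authors' algorithm. The damping factor, the row normalisation of absolute correlations, and every name in the code are assumptions.

```python
def propagate_scores(scores, corr, damping=0.85, iters=50):
    """Spread single-SNP association scores over a SNP correlation matrix.

    scores: non-negative per-SNP association scores (for example -log10 P).
    corr:   square matrix of pairwise SNP correlations.
    """
    n = len(scores)
    total = sum(scores)
    base = [s / total for s in scores]  # normalised starting scores

    # Transition weights: absolute correlation with each neighbour, rows sum to 1.
    # A SNP with no correlated neighbour keeps a zero row and passes nothing on.
    weights = []
    for i in range(n):
        row = [abs(corr[i][j]) if i != j else 0.0 for j in range(n)]
        row_sum = sum(row)
        weights.append([w / row_sum if row_sum > 0 else 0.0 for w in row])

    rank = base[:]
    for _ in range(iters):
        rank = [
            (1 - damping) * base[j]
            + damping * sum(rank[i] * weights[i][j] for i in range(n))
            for j in range(n)
        ]
    return rank

# SNP 0 has a strong signal; SNP 1 is tightly correlated with it; SNP 2 is not.
example_scores = [3.0, 1.0, 1.0]
example_corr = [
    [1.00, 0.90, 0.05],
    [0.90, 1.00, 0.05],
    [0.05, 0.05, 1.00],
]
example_rank = propagate_scores(example_scores, example_corr)
```

In this toy example SNPs 1 and 2 start with the same score, but SNP 1 ends up ranked higher because its strongly associated neighbour reinforces it.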
Affiliation(s)
- Jie Liu
- Dept. of Computer Sciences, Univ. of Wisconsin-Madison
- David Page
- Biostat. & Med. Info. Dept. Univ. of Wisconsin-Madison
|
396
|
Zhao Y, Chen F, Zhai R, Lin X, Diao N, Christiani DC. Association test based on SNP set: logistic kernel machine based test vs. principal component analysis. PLoS One 2012; 7:e44978. [PMID: 23028716 PMCID: PMC3441747 DOI: 10.1371/journal.pone.0044978] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 05/08/2012] [Accepted: 08/16/2012] [Indexed: 01/04/2023]
Abstract
GWAS has facilitated greatly the discovery of risk SNPs associated with complex diseases. Traditional methods analyze SNPs individually and are limited by low power and reproducibility since correction for multiple comparisons is necessary. Several methods have been proposed based on grouping SNPs into SNP sets using biological knowledge and/or genomic features. In this article, we compare the logistic kernel machine based test (LKM) and principal components analysis based approach (PCA) using simulated datasets under the scenarios of 0 to 3 causal SNPs, as well as simple and complex linkage disequilibrium (LD) structures of the simulated regions. Our simulation study demonstrates that both LKM and PCA can control the type I error at the significance level of 0.05. If the causal SNP is in strong LD with the genotyped SNPs, both the PCA with a small number of principal components (PCs) and the LKM with kernel of linear or identical-by-state function are valid tests. However, if the LD structure is complex, such as several LD blocks in the SNP set, or when the causal SNP is not in the LD block in which most of the genotyped SNPs reside, more PCs should be included to capture the information of the causal SNP. Simulation studies also demonstrate the ability of LKM and PCA to combine information from multiple causal SNPs and to provide increased power over individual SNP analysis. We also apply LKM and PCA to analyze two SNP sets extracted from an actual GWAS dataset on non-small cell lung cancer.
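The two kernels mentioned in this abstract, linear and identical-by-state (IBS), measure genetic similarity between pairs of subjects across a SNP set. The sketch below uses the commonly cited forms for genotypes coded 0/1/2; it covers only kernel construction, not the kernel machine test itself, and the function names are illustrative.

```python
def linear_kernel(g1, g2):
    """Linear kernel: inner product of two genotype vectors."""
    return sum(a * b for a, b in zip(g1, g2))

def ibs_kernel(g1, g2):
    """IBS kernel: mean fraction of alleles shared identical-by-state.

    With 0/1/2 coding, 2 - |a - b| counts the alleles shared at one SNP.
    """
    n_snps = len(g1)
    return sum(2 - abs(a - b) for a, b in zip(g1, g2)) / (2 * n_snps)

def kernel_matrix(genotypes, kernel):
    """Pairwise kernel matrix for a list of per-subject genotype vectors."""
    return [[kernel(gi, gj) for gj in genotypes] for gi in genotypes]

subjects = [
    [0, 1, 2, 0],
    [0, 1, 1, 0],
    [2, 1, 0, 2],
]
K_ibs = kernel_matrix(subjects, ibs_kernel)
K_lin = kernel_matrix(subjects, linear_kernel)
```

Each subject has IBS similarity 1 with itself, and the first and third subjects, who share only the heterozygous SNP, have the lowest similarity.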
Affiliation(s)
- Yang Zhao
- Department of Environmental Health, Harvard School of Public Health, Harvard University, Boston, Massachusetts, United States of America
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China
- Feng Chen
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China
- Rihong Zhai
- Department of Environmental Health, Harvard School of Public Health, Harvard University, Boston, Massachusetts, United States of America
- Xihong Lin
- Department of Biostatistics, Harvard School of Public Health, Harvard University, Boston, Massachusetts, United States of America
- Nancy Diao
- Department of Environmental Health, Harvard School of Public Health, Harvard University, Boston, Massachusetts, United States of America
- David C. Christiani
- Department of Environmental Health, Harvard School of Public Health, Harvard University, Boston, Massachusetts, United States of America
|
397
|
Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SLR, Peyser PA, Lin X. SNP set association analysis for familial data. Genet Epidemiol 2012; 36:797-810. [PMID: 22968922 DOI: 10.1002/gepi.21676] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.8] [Received: 03/12/2012] [Revised: 07/06/2012] [Accepted: 07/30/2012] [Indexed: 11/06/2022]
Abstract
Genome-wide association studies (GWAS) are a popular approach for identifying common genetic variants and epistatic effects associated with a disease phenotype. The traditional statistical analysis of such GWAS attempts to assess the association between each individual single-nucleotide polymorphism (SNP) and the observed phenotype. Recently, kernel machine-based tests for association between a SNP set (e.g., SNPs in a gene) and the disease phenotype have been proposed as a useful alternative to the traditional individual-SNP approach, and allow for flexible modeling of the potentially complicated joint SNP effects in a SNP set while adjusting for covariates. We extend the kernel machine framework to accommodate related subjects from multiple independent families, and provide a score-based variance component test for assessing the association of a given SNP set with a continuous phenotype, while adjusting for additional covariates and accounting for within-family correlation. We illustrate the proposed method using simulation studies and an application to genetic data from the Genetic Epidemiology Network of Arteriopathy (GENOA) study.
Affiliation(s)
- Elizabeth D Schifano
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts
|
398
|
Li MX, Kwan J, Sham P. HYST: a hybrid set-based test for genome-wide association studies, with application to protein-protein interaction-based association analysis. Am J Hum Genet 2012; 91:478-88. [PMID: 22958900 DOI: 10.1016/j.ajhg.2012.08.004] [Citation(s) in RCA: 83] [Impact Index Per Article: 6.4] [Received: 03/09/2012] [Revised: 04/30/2012] [Accepted: 08/07/2012] [Indexed: 11/25/2022]
Abstract
The extended Simes' test (known as GATES) and scaled chi-square test were proposed to combine a set of dependent genome-wide association signals at multiple single-nucleotide polymorphisms (SNPs) for assessing the overall significance of association at the gene or pathway levels. The two tests use different strategies to combine association p values and can outperform each other when the number of and linkage disequilibrium between SNPs vary. In this paper, we introduce a hybrid set-based test (HYST) combining the two tests for genome-wide association studies (GWASs). We describe how HYST can be used to evaluate statistical significance for association at the protein-protein interaction (PPI) level in order to increase power for detecting disease-susceptibility genes of moderate effect size. Computer simulations demonstrated that HYST had a reasonable type 1 error rate and was generally more powerful than its parents and other alternative tests to detect a PPI pair where both genes are associated with the disease of interest. We applied the method to three complex disease GWAS data sets in the public domain; the method detected a number of highly connected significant PPI pairs involving multiple confirmed disease-susceptibility genes not found in the SNP- and gene-based association analyses. These results indicate that HYST can be effectively used to examine a collection of predefined SNP sets based on prior biological knowledge for revealing additional disease-predisposing genes of modest effects in GWASs.
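For orientation, the plain Simes combination that the extended test builds on can be sketched as below. This is the unextended version: it uses the raw number of SNPs and their raw ranks, whereas GATES substitutes effective numbers of independent tests estimated from linkage disequilibrium, which is not shown here. The function name is illustrative.

```python
def simes_p(pvalues):
    """Plain Simes combined P value: min over ranks i of m * p_(i) / i."""
    m = len(pvalues)
    ranked = sorted(pvalues)
    return min(m * p / i for i, p in enumerate(ranked, start=1))

# Three SNP P values mapped to one gene:
# candidates are 3*0.01/1, 3*0.04/2 and 3*0.5/3, so the smallest is about 0.03
gene_p = simes_p([0.04, 0.5, 0.01])
```

When the smallest P value drives the result, as here, Simes reduces to a Bonferroni-style correction of that single value; it gains over Bonferroni when several SNPs are moderately significant at once.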
|
399
|
Li S, Cui Y. Gene-centric gene–gene interaction: A model-based kernel machine method. Ann Appl Stat 2012. [DOI: 10.1214/12-aoas545] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Indexed: 11/19/2022]
|
400
|
Lu Q, Wei C, Ye C, Li M, Elston RC. A likelihood ratio-based Mann-Whitney approach finds novel replicable joint gene action for type 2 diabetes. Genet Epidemiol 2012; 36:583-93. [PMID: 22760990 PMCID: PMC3634342 DOI: 10.1002/gepi.21651] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 01/12/2012] [Revised: 04/09/2012] [Accepted: 05/09/2012] [Indexed: 12/29/2022]
Abstract
The potential importance of the joint action of genes, whether modeled with or without a statistical interaction term, has long been recognized. However, identifying such action has been a great challenge, especially when millions of genetic markers are involved. We propose a likelihood ratio-based Mann-Whitney test to search for joint gene action either among candidate genes or genome-wide. It extends the traditional univariate Mann-Whitney test to assess the joint association of genotypes at multiple loci with disease, allowing for high-order statistical interactions. Because only one overall significance test is conducted for the entire analysis, it avoids the issue of multiple testing. Moreover, the approach adopts a computationally efficient algorithm, making a genome-wide search feasible in a reasonable amount of time on a high performance personal computer. We evaluated the approach using both theoretical and real data. By applying the approach to 40 type 2 diabetes (T2D) susceptibility single-nucleotide polymorphisms (SNPs), we identified a four-locus model strongly associated with T2D in the Wellcome Trust (WT) study (permutation P-value < 0.001), and replicated the same finding in the Nurses' Health Study/Health Professionals Follow-Up Study (NHS/HPFS) (P-value = 3.03×10⁻¹¹). We also conducted a genome-wide search on 385,598 SNPs in the WT study. The analysis took approximately 55 hr on a personal computer, identifying the same first two loci, but overall a different set of four SNPs, jointly associated with T2D (P-value = 1.29×10⁻⁵). The nominal significance of this same association reached 4.01×10⁻⁶ in the NHS/HPFS.
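The traditional univariate Mann-Whitney statistic that this abstract says is being extended can be sketched as follows. This is the classical statistic only; how the authors derive multi-locus scores from likelihood ratios is not described in the abstract and is not attempted here. Function names and the example scores are illustrative.

```python
def mann_whitney_u(cases, controls):
    """Count case-control pairs in which the case scores higher (ties count 0.5)."""
    u = 0.0
    for x in cases:
        for y in controls:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

def auc_from_u(cases, controls):
    """U divided by the number of pairs: probability a case outranks a control."""
    return mann_whitney_u(cases, controls) / (len(cases) * len(controls))

case_scores = [3, 4]
control_scores = [1, 2, 3]
u_stat = mann_whitney_u(case_scores, control_scores)  # 2.5 + 3 = 5.5
auc = auc_from_u(case_scores, control_scores)
```

The normalised statistic equals the area under the ROC curve, so a value near 0.5 indicates that the score does not separate cases from controls.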
Affiliation(s)
- Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
- Changshuai Wei
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
- Chengyin Ye
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
- Ming Li
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
- Robert C. Elston
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio
|