101
An asymptotically minimax kernel machine. Stat Probab Lett 2014. [DOI: 10.1016/j.spl.2014.08.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
102
Fan R, Wang Y, Mills JL, Carter TC, Lobach I, Wilson AF, Bailey-Wilson JE, Weeks DE, Xiong M. Generalized functional linear models for gene-based case-control association studies. Genet Epidemiol 2014; 38:622-637. [PMID: 25203683 PMCID: PMC4189986 DOI: 10.1002/gepi.21840] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 01/14/2014] [Revised: 04/29/2014] [Accepted: 05/28/2014] [Indexed: 01/23/2023]
Abstract
By using functional data analysis techniques, we developed generalized functional linear models for testing association between a dichotomous trait and multiple genetic variants in a genetic region while adjusting for covariates. Both fixed and mixed effect models are developed and compared. Extensive simulations show that Rao's efficient score tests of the fixed effect models are very conservative since they generate lower type I errors than nominal levels, and global tests of the mixed effect models generate accurate type I errors. Furthermore, we found that the Rao's efficient score test statistics of the fixed effect models have higher power than the sequence kernel association test (SKAT) and its optimal unified version (SKAT-O) in most cases when the causal variants are both rare and common. When the causal variants are all rare (i.e., minor allele frequencies less than 0.03), the Rao's efficient score test statistics and the global tests have similar or slightly lower power than SKAT and SKAT-O. In practice, it is not known whether rare variants or common variants in a gene region are disease related. All we can assume is that a combination of rare and common variants influences disease susceptibility. Thus, the improved performance of our models when the causal variants are both rare and common shows that the proposed models can be very useful in dissecting complex traits. We compare the performance of our methods with SKAT and SKAT-O on real neural tube defects and Hirschsprung's disease datasets. The Rao's efficient score test statistics and the global tests are more sensitive than SKAT and SKAT-O in the real data analysis. Our methods can be used in either gene-disease genome-wide/exome-wide association studies or candidate gene analyses.
Affiliation(s)
- Ruzong Fan
  - Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Rockville, MD 20852
- Yifan Wang
  - Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Rockville, MD 20852
- James L. Mills
  - Epidemiology Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Rockville, MD 20852
- Tonia C. Carter
  - Center for Human Genetics, Marshfield Clinic, Marshfield, WI 54449
- Iryna Lobach
  - Department of Neurology, School of Medicine, University of California, San Francisco, CA 94185
- Alexander F. Wilson
  - Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892
- Joan E. Bailey-Wilson
  - Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892
- Daniel E. Weeks
  - Departments of Human Genetics and Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261
- Momiao Xiong
  - Human Genetics Center, University of Texas - Houston, P.O. Box 20334, Houston, Texas 77225
103
Chen H, Malzahn D, Balliu B, Li C, Bailey JN. Testing genetic association with rare and common variants in family data. Genet Epidemiol 2014; 38 Suppl 1:S37-43. [PMID: 25112186 DOI: 10.1002/gepi.21823] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 12/25/2022]
Abstract
With the advance of next-generation sequencing technologies in recent years, rare genetic variant data have now become available for genetic epidemiology studies. For family samples, however, only a few statistical methods for association analysis of rare genetic variants have been developed. Rare variant approaches are of great interest, particularly for family data, because samples enriched for trait-relevant variants can be ascertained and rare variants are putatively enriched through segregation. To facilitate the evaluation of existing and new rare variant testing approaches for analyzing family data, Genetic Analysis Workshop 18 (GAW18) provided genotype and next-generation sequencing data and longitudinal blood pressure traits from extended pedigrees of Mexican American families from the San Antonio Family Study. Our GAW18 group members analyzed real and simulated phenotype data from GAW18 by using generalized linear mixed-effects models or principal components to adjust for familial correlation or by testing binary traits using a correction factor for familial effects. With one exception, approaches dealt with the extended pedigrees in their original state using information based on the kinship matrix or alternative genetic similarity measures. For simulated data our group demonstrated that the family-based kernel machine score test is superior in power to family-based single-marker or burden tests, except in a few specific scenarios. For real data three contributions identified significant associations. They substantially reduced the number of tests before performing the association analysis. We conclude from our real data analyses that further development of strategies for targeted testing or more focused screening of genetic variants is strongly desirable.
Affiliation(s)
- Han Chen
  - Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, United States of America
104
Morota G, Gianola D. Kernel-based whole-genome prediction of complex traits: a review. Front Genet 2014; 5:363. [PMID: 25360145 PMCID: PMC4199321 DOI: 10.3389/fgene.2014.00363] [Citation(s) in RCA: 102] [Impact Index Per Article: 9.3] [Received: 08/09/2014] [Accepted: 09/29/2014] [Indexed: 01/18/2023]
Abstract
Prediction of genetic values has been a focus of applied quantitative genetics since the beginning of the 20th century, with renewed interest following the advent of the era of whole genome-enabled prediction. Opportunities offered by the emergence of high-dimensional genomic data fueled by post-Sanger sequencing technologies, especially molecular markers, have driven researchers to extend Ronald Fisher and Sewall Wright's models to confront new challenges. In particular, kernel methods are gaining consideration as a regression method of choice for genome-enabled prediction. Complex traits are presumably influenced by many genomic regions working in concert with others (clearly so when considering pathways), thus generating interactions. Motivated by this view, a growing number of statistical approaches based on kernels attempt to capture non-additive effects, either parametrically or non-parametrically. This review centers on whole-genome regression using kernel methods applied to a wide range of quantitative traits of agricultural importance in animals and plants. We discuss various kernel-based approaches tailored to capturing total genetic variation, with the aim of arriving at an enhanced predictive performance in the light of available genome annotation information. Connections between prediction machines born in animal breeding, statistics, and machine learning are revisited, and their empirical prediction performance is discussed. Overall, while some encouraging results have been obtained with non-parametric kernels, recovering non-additive genetic variation in a validation dataset remains a challenge in quantitative genetics.
Affiliation(s)
- Gota Morota
  - Department of Animal Science, University of Nebraska-Lincoln, Lincoln, NE, USA
- Daniel Gianola
  - Department of Animal Sciences, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Dairy Science, University of Wisconsin-Madison, Madison, WI, USA
105
Huang YT. Integrative modeling of multi-platform genomic data under the framework of mediation analysis. Stat Med 2014; 34:162-78. [PMID: 25316269 DOI: 10.1002/sim.6326] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 02/05/2014] [Revised: 07/02/2014] [Accepted: 09/22/2014] [Indexed: 12/24/2022]
Abstract
Given the availability of genomic data, there has been emerging interest in integrating multi-platform data. Here, we propose to model genetics (single nucleotide polymorphism (SNP)), epigenetics (DNA methylation), and gene expression data as a biological process to delineate phenotypic traits under the framework of causal mediation modeling. We propose a regression model for the joint effect of SNPs, methylation, gene expression, and their nonlinear interactions on the outcome and develop a variance component score test for any arbitrary set of regression coefficients. The test statistic under the null follows a mixture of chi-square distributions, which can be approximated using a characteristic function inversion method or a perturbation procedure. We construct tests for candidate models determined by different combinations of SNPs, DNA methylation, gene expression, and interactions and further propose an omnibus test to accommodate different models. We then study three path-specific effects: the direct effect of SNPs on the outcome, the effect mediated through expression, and the effect through methylation. We characterize correspondences between the three path-specific effects and coefficients in the regression model, which are influenced by causal relations among SNPs, DNA methylation, and gene expression. We illustrate the utility of our method in two genomic studies and numerical simulation studies.
Affiliation(s)
- Yen-Tsung Huang
  - Department of Epidemiology, Brown University, 121 S. Main St., Box G-S121-2, Providence, RI, 02912, U.S.A
106
Simpson CL, Wojciechowski R, Oexle K, Murgia F, Portas L, Li X, Verhoeven VJM, Vitart V, Schache M, Hosseini SM, Hysi PG, Raffel LJ, Cotch MF, Chew E, Klein BEK, Klein R, Wong TY, van Duijn CM, Mitchell P, Saw SM, Fossarello M, Wang JJ, Polašek O, Campbell H, Rudan I, Oostra BA, Uitterlinden AG, Hofman A, Rivadeneira F, Amin N, Karssen LC, Vingerling JR, Döring A, Bettecken T, Bencic G, Gieger C, Wichmann HE, Wilson JF, Venturini C, Fleck B, Cumberland PM, Rahi JS, Hammond CJ, Hayward C, Wright AF, Paterson AD, Baird PN, Klaver CCW, Rotter JI, Pirastu M, Meitinger T, Bailey-Wilson JE, Stambolian D. Genome-wide meta-analysis of myopia and hyperopia provides evidence for replication of 11 loci. PLoS One 2014; 9:e107110. [PMID: 25233373 PMCID: PMC4169415 DOI: 10.1371/journal.pone.0107110] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 12/16/2013] [Accepted: 08/12/2014] [Indexed: 01/01/2023]
Abstract
Refractive error (RE) is a complex, multifactorial disorder characterized by a mismatch between the optical power of the eye and its axial length that causes object images to be focused off the retina. The two major subtypes of RE are myopia (nearsightedness) and hyperopia (farsightedness), which represent opposite ends of the distribution of the quantitative measure of spherical refraction. We performed a fixed effects meta-analysis of genome-wide association results of myopia and hyperopia from 9 studies of European-derived populations: AREDS, KORA, FES, OGP-Talana, MESA, RSI, RSII, RSIII and ERF. One genome-wide significant region was observed for myopia, corresponding to a previously identified myopia locus on 8q12 (p = 1.25×10⁻⁸), which has been reported by Kiefer et al. as significantly associated with myopia age at onset and by Verhoeven et al. as significantly associated with mean spherical-equivalent (MSE) refractive error. We observed two genome-wide significant associations with hyperopia. These regions overlapped with loci on 15q14 (minimum p value = 9.11×10⁻¹¹) and 8q12 (minimum p value = 1.82×10⁻¹¹) previously reported for MSE and myopia age at onset. We also used an intermarker linkage-disequilibrium-based method for calculating the effective number of tests in targeted regional replication analyses. We analyzed myopia (which represents the closest phenotype in our data to the one used by Kiefer et al.) and showed replication of 10 additional loci associated with myopia previously reported by Kiefer et al. This is the first replication of these loci using myopia as the trait under analysis. "Replication-level" association was also seen between hyperopia and 12 of Kiefer et al.'s published loci. For the loci that show evidence of association with both myopia and hyperopia, the estimated effects of the risk alleles were in opposite directions for the two traits. This suggests that these loci are important contributors to variation of refractive error across the distribution.
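As a reading aid for the fixed effects meta-analysis mentioned in this abstract, the following is a minimal sketch of one common way to pool per-study estimates: inverse-variance weighting. The abstract does not state which weighting scheme the authors used, so the scheme, the function name `fixed_effect_meta`, and the example numbers are all assumptions made for illustration only.

```python
import math


def fixed_effect_meta(betas, ses):
    """Pool per-study effect estimates with inverse-variance weights.

    betas: per-study effect estimates (e.g. log odds ratios), hypothetical inputs
    ses:   per-study standard errors, same order as betas
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and of equal length")
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se ** 2) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value under a standard normal reference distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p


# Illustrative (made-up) estimates from two studies with equal precision.
beta, se, z, p = fixed_effect_meta([0.10, 0.20], [0.10, 0.10])
print(beta, se, z, p)
```

With equal standard errors the pooled estimate is simply the average of the study estimates; studies with smaller standard errors would pull the pooled value toward their own estimate.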
Affiliation(s)
- Claire L. Simpson
  - National Human Genome Research Institute, National Institutes of Health, Baltimore, Maryland, United States of America
- Robert Wojciechowski
  - National Human Genome Research Institute, National Institutes of Health, Baltimore, Maryland, United States of America
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, United States of America
- Konrad Oexle
  - Institute of Human Genetics, Technische Universität München, Munich, Germany
- Federico Murgia
  - Institute of Population Genetics, National Research Council of Italy, Sassari, Italy
- Laura Portas
  - Institute of Population Genetics, National Research Council of Italy, Sassari, Italy
- Xiaohui Li
  - Institute for Translational Genomics and Population Sciences, Los Angeles BioMedical Research Institute at Harbor-UCLA Medical Center, Torrance, California, United States of America
- Virginie J. M. Verhoeven
  - Department of Ophthalmology, Erasmus Medical Center, Rotterdam, the Netherlands
  - Department of Epidemiology, Erasmus Medical Center, Rotterdam, the Netherlands
- Veronique Vitart
  - MRC Human Genetics Unit, IGMM, University of Edinburgh, Edinburgh, United Kingdom
- Maria Schache
  - Centre for Eye Research Australia, University of Melbourne, Royal Victorian Eye and Ear Hospital, Melbourne, Australia
- S. Mohsen Hosseini
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada, and DCCT/EDIC Research Group, The Diabetes Control and Complications Trial and Follow-up Study, The Biostatistics Center, The George Washington University, Rockville, Maryland, United States of America
- Pirro G. Hysi
  - Department of Twin Research & Genetic Epidemiology, King's College London, St Thomas' Hospital, London, United Kingdom
- Leslie J. Raffel
  - Medical Genetics Institute, Cedars-Sinai Medical Center, Los Angeles, California, United States of America
- Mary Frances Cotch
  - Division of Epidemiology and Clinical Applications, National Eye Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Emily Chew
  - Division of Epidemiology and Clinical Applications, National Eye Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Barbara E. K. Klein
  - Department of Ophthalmology and Visual Sciences, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, United States of America
- Ronald Klein
  - Department of Ophthalmology and Visual Sciences, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, United States of America
- Tien Yin Wong
  - Centre for Eye Research Australia, University of Melbourne, Royal Victorian Eye and Ear Hospital, Melbourne, Australia
  - Singapore Eye Research Institute, National University of Singapore, Singapore, Singapore
- Paul Mitchell
  - Centre for Vision Research, Department of Ophthalmology and Westmead Millennium Institute, University of Sydney, Sydney, Australia
- Seang Mei Saw
  - Department of Epidemiology and Public Health, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Maurizio Fossarello
  - Dipartimento di Scienze Chirurgiche, Clinica Oculistica, Università degli Studi di Cagliari, Cagliari, Italy
- Jie Jin Wang
  - Centre for Eye Research Australia, University of Melbourne, Royal Victorian Eye and Ear Hospital, Melbourne, Australia
  - Centre for Vision Research, Department of Ophthalmology and Westmead Millennium Institute, University of Sydney, Sydney, Australia
- DCCT/EDIC Research Group
  - The Diabetes Control and Complications Trial and Follow-up Study, The Biostatistics Center, The George Washington University, Rockville, Maryland, United States of America
- Ozren Polašek
  - Croatian Centre for Global Health, University of Split Medical School, Split, Croatia
- Harry Campbell
  - Centre for Population Health Sciences, University of Edinburgh, Edinburgh, United Kingdom
- Igor Rudan
  - Centre for Population Health Sciences, University of Edinburgh, Edinburgh, United Kingdom
- Ben A. Oostra
  - Department of Clinical Genetics, Erasmus Medical Center, Rotterdam, the Netherlands
- André G. Uitterlinden
  - Department of Epidemiology, Erasmus Medical Center, Rotterdam, the Netherlands
  - Department of Internal Medicine, Erasmus Medical Center, Rotterdam, the Netherlands
  - Netherlands Consortium for Healthy Ageing, Netherlands Genomics Initiative, The Hague, the Netherlands
- Albert Hofman
  - Department of Epidemiology, Erasmus Medical Center, Rotterdam, the Netherlands
  - Netherlands Consortium for Healthy Ageing, Netherlands Genomics Initiative, The Hague, the Netherlands
- Fernando Rivadeneira
  - Department of Epidemiology, Erasmus Medical Center, Rotterdam, the Netherlands
  - Department of Internal Medicine, Erasmus Medical Center, Rotterdam, the Netherlands
  - Netherlands Consortium for Healthy Ageing, Netherlands Genomics Initiative, The Hague, the Netherlands
- Najaf Amin
  - Department of Epidemiology, Erasmus Medical Center, Rotterdam, the Netherlands
- Lennart C. Karssen
  - Department of Epidemiology, Erasmus Medical Center, Rotterdam, the Netherlands
- Johannes R. Vingerling
  - Department of Ophthalmology, Erasmus Medical Center, Rotterdam, the Netherlands
  - Department of Epidemiology, Erasmus Medical Center, Rotterdam, the Netherlands
- Angela Döring
  - Institute of Epidemiology, Helmholtz Zentrum München, Neuherberg, Germany
- Thomas Bettecken
  - Institute of Human Genetics, Helmholtz Zentrum München, Neuherberg, Germany
- Goran Bencic
  - Department of Ophthalmology, Hospital “Sestre Milosrdnice”, Zagreb, Croatia
- Christian Gieger
  - Institute of Genetic Epidemiology, Helmholtz Zentrum München, Neuherberg, Germany
- H.-Erich Wichmann
  - Institute of Epidemiology, Helmholtz Zentrum München, Neuherberg, Germany
- James F. Wilson
  - Centre for Population Health Sciences, University of Edinburgh, Edinburgh, United Kingdom
- Cristina Venturini
  - Department of Twin Research & Genetic Epidemiology, King's College London, St Thomas' Hospital, London, United Kingdom
- Brian Fleck
  - Princess Alexandra Eye Pavilion, Edinburgh, United Kingdom
- Phillippa M. Cumberland
  - MRC Centre of Epidemiology for Child Health, Institute of Child Health, University College London, London, United Kingdom
- Jugnoo S. Rahi
  - MRC Centre of Epidemiology for Child Health, Institute of Child Health, University College London, London, United Kingdom
  - Institute of Ophthalmology, University College London, London, United Kingdom
  - Ulverscroft Vision Research Group, Institute of Child Health, University College London, London, United Kingdom
- Chris J. Hammond
  - Department of Twin Research & Genetic Epidemiology, King's College London, St Thomas' Hospital, London, United Kingdom
- Caroline Hayward
  - MRC Human Genetics Unit, IGMM, University of Edinburgh, Edinburgh, United Kingdom
- Alan F. Wright
  - MRC Human Genetics Unit, IGMM, University of Edinburgh, Edinburgh, United Kingdom
- Andrew D. Paterson
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada, and DCCT/EDIC Research Group, The Diabetes Control and Complications Trial and Follow-up Study, The Biostatistics Center, The George Washington University, Rockville, Maryland, United States of America
- Paul N. Baird
  - Centre for Eye Research Australia, University of Melbourne, Royal Victorian Eye and Ear Hospital, Melbourne, Australia
- Caroline C. W. Klaver
  - Department of Ophthalmology, Erasmus Medical Center, Rotterdam, the Netherlands
  - Department of Epidemiology, Erasmus Medical Center, Rotterdam, the Netherlands
- Jerome I. Rotter
  - Institute for Translational Genomics and Population Sciences, Los Angeles BioMedical Research Institute at Harbor-UCLA Medical Center, Torrance, California, United States of America
- Mario Pirastu
  - Institute of Population Genetics, National Research Council of Italy, Sassari, Italy
- Thomas Meitinger
  - Institute of Human Genetics, Technische Universität München, Munich, Germany
  - Institute of Human Genetics, Helmholtz Zentrum München, Neuherberg, Germany
- Joan E. Bailey-Wilson
  - National Human Genome Research Institute, National Institutes of Health, Baltimore, Maryland, United States of America
- Dwight Stambolian
  - Department of Ophthalmology, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
107
Kohler JR, Guennel T, Marshall SL. Analytical strategies for discovery and replication of genetic effects in pharmacogenomic studies. Pharmacogenomics & Personalized Medicine 2014; 7:217-25. [PMID: 25206308 PMCID: PMC4157400 DOI: 10.2147/pgpm.s66841] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/24/2022]
Abstract
In the past decade, the pharmaceutical industry and biomedical research sector have devoted considerable resources to pharmacogenomics (PGx) with the hope that understanding genetic variation in patients would deliver on the promise of personalized medicine. With the advent of new technologies and the improved collection of DNA samples, the roadblock to advancements in PGx discovery is no longer the lack of high-density genetic information captured on patient populations, but rather the development, adaptation, and tailoring of analytical strategies to effectively harness this wealth of information. The current analytical paradigm in PGx considers the single-nucleotide polymorphism (SNP) as the genomic feature of interest and performs single SNP association tests to discover PGx effects (i.e., genetic effects impacting drug response). While it can be straightforward to process single SNP results and to consider how this information may be extended for use in downstream patient stratification, the rate of replication for single SNP associations has been low and the desired success of producing clinically and commercially viable biomarkers has not been realized. This may be because single SNP association testing is suboptimal given the complexities of PGx discovery in the clinical trial setting, including: 1) relatively small sample sizes; 2) diverse clinical cohorts within and across trials due to genetic ancestry (potentially impacting the ability to replicate findings); and 3) the potential polygenic nature of a drug response. Consequently, a shift in the current paradigm is proposed: to consider the gene as the genomic feature of interest in PGx discovery. The proof-of-concept study presented in this manuscript demonstrates that genomic region-based association testing has the potential to improve the power of detecting single SNP or complex PGx effects in the discovery stage (by leveraging the underlying genetic architecture and reducing the multiplicity burden), and it can also improve power in the replication stage.
108
Jiang Y, Conneely KN, Epstein MP. Flexible and robust methods for rare-variant testing of quantitative traits in trios and nuclear families. Genet Epidemiol 2014; 38:542-51. [PMID: 25044337 DOI: 10.1002/gepi.21839] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 03/24/2014] [Revised: 05/21/2014] [Accepted: 05/29/2014] [Indexed: 11/07/2022]
Abstract
Most rare-variant association tests for complex traits are applicable only to population-based or case-control resequencing studies. There are fewer rare-variant association tests for family-based resequencing studies, which is unfortunate because pedigrees possess many attractive characteristics for such analyses. Family-based studies can be more powerful than their population-based counterparts due to increased genetic load and further enable the implementation of rare-variant association tests that, by design, are robust to confounding due to population stratification. With this in mind, we propose a rare-variant association test for quantitative traits in families; this test integrates the QTDT approach of Abecasis et al. [Abecasis et al., ] into the kernel-based SNP association test KMFAM of Schifano et al. [Schifano et al., ]. The resulting within-family test enjoys the many benefits of the kernel framework for rare-variant association testing, including rapid evaluation of P-values and preservation of power when a region harbors rare causal variation that acts in different directions on phenotype. Additionally, by design, this within-family test is robust to confounding due to population stratification. Although within-family association tests are generally less powerful than their counterparts that use all genetic information, we show that we can recover much of this power (although still ensuring robustness to population stratification) using a straightforward screening procedure. Our method accommodates covariates and allows for missing parental genotype data, and we have written software implementing the approach in R for public use.
Affiliation(s)
- Yunxuan Jiang
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, United States of America
109
Zhan X, Epstein MP, Ghosh D. An Adaptive Genetic Association Test Using Double Kernel Machines. Statistics in Biosciences 2014; 7:262-281. [PMID: 26640602 DOI: 10.1007/s12561-014-9116-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 10/25/2022]
Abstract
Recently, gene set-based approaches have become very popular in gene expression profiling studies for assessing how genetic variants are related to disease outcomes. Since most genes are not differentially expressed, existing pathway tests considering all genes within a pathway suffer from considerable noise and power loss. Moreover, for a differentially expressed pathway, it is of interest to select important genes that drive the effect of the pathway. In this article, we propose an adaptive association test using double kernel machines (DKM), which can both select important genes within the pathway as well as test for the overall genetic pathway effect. This DKM procedure first uses the garrote kernel machines (GKM) test for the purposes of subset selection and then the least squares kernel machine (LSKM) test for testing the effect of the subset of genes. An appealing feature of the kernel machine framework is that it can provide a flexible and unified method for multi-dimensional modeling of the genetic pathway effect allowing for both parametric and nonparametric components. This DKM approach is illustrated with application to simulated data as well as to data from a neuroimaging genetics study.
Affiliation(s)
- Xiang Zhan
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802, U.S.A. Tel.: +1-8143213493
- Michael P Epstein
  - Department of Human Genetics, Emory University, Atlanta, GA 30322, U.S.A
- Debashis Ghosh
  - Department of Statistics, Department of Public Health Sciences, Pennsylvania State University, University Park, PA 16802, U.S.A
110
Wang X, Epstein MP, Tzeng JY. Analysis of gene-gene interactions using gene-trait similarity regression. Hum Hered 2014; 78:17-26. [PMID: 24969398 DOI: 10.1159/000360161] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 11/20/2013] [Accepted: 01/30/2014] [Indexed: 12/14/2022]
Abstract
OBJECTIVE: Gene-gene interactions (G×G) are important to study because of their extensiveness in biological systems and their potential in explaining missing heritability of complex traits. In this work, we propose a new similarity-based test to assess G×G at the gene level, which permits the study of epistasis at biologically functional units with amplified interaction signals. METHODS: Under the framework of gene-trait similarity regression (SimReg), we propose a gene-based test for detecting G×G. SimReg uses a regression model to correlate trait similarity with genotypic similarity across a gene. Unlike existing gene-level methods based on leading principal components (PCs), SimReg summarizes all information on genotypic variation within a gene and can be used to assess the joint/interactive effects of two genes as well as the effect of one gene conditional on another. RESULTS: Using simulations and a real data application to the Warfarin study, we show that the SimReg G×G tests have satisfactory power and robustness under different genetic architecture when compared to existing gene-based interaction tests such as PC analysis or partial least squares. A genome-wide association study with approx. 20,000 genes may be completed on a parallel computing system in 2 weeks.
Affiliation(s)
- Xin Wang
  - Bioinformatics Research Center, North Carolina State University, Raleigh, N.C., USA
111
Gong X, Lu W, Kendrick KM, Pu W, Wang C, Jin L, Lu G, Liu Z, Liu H, Feng J. A brain-wide association study of DISC1 genetic variants reveals a relationship with the structure and functional connectivity of the precuneus in schizophrenia. Hum Brain Mapp 2014; 35:5414-30. [PMID: 24909300 DOI: 10.1002/hbm.22560] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Received: 10/04/2013] [Revised: 04/07/2014] [Accepted: 05/13/2014] [Indexed: 01/05/2023]
Abstract
The Disrupted in Schizophrenia Gene 1 (DISC1) plays a role in both neural signaling and development and is associated with schizophrenia, although its links to altered brain structure and function in this disorder are not fully established. Here we have used structural and functional MRI to investigate links with six DISC1 single nucleotide polymorphisms (SNPs). We employed a brain-wide association analysis (BWAS) together with a Jackknife internal validation approach in 46 schizophrenia patients and 24 matched healthy control subjects. Results from structural MRI showed significant associations between all six DISC1 variants and gray matter volume in the precuneus, post-central gyrus and middle cingulate gyrus. Associations with specific SNPs were found for rs2738880 in the left precuneus and right post-central gyrus, and rs1535530 in the right precuneus and middle cingulate gyrus. Using regions showing structural associations as seeds, a resting-state functional connectivity analysis revealed significant associations between all six SNPs and connectivity between the right precuneus and inferior frontal gyrus. The connection between the right precuneus and inferior frontal gyrus was also specifically associated with rs821617. Importantly, schizophrenia patients showed positive correlations between the six DISC1 SNPs associated gray matter volume in the left precuneus and right post-central gyrus and negative symptom severity. No correlations with illness duration were found. Our results provide the first evidence suggesting a key role for structural and functional connectivity associations between DISC1 polymorphisms and the precuneus in schizophrenia.
Affiliation(s)
- Xiaohong Gong
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China; State Key Laboratory of Genetic Engineering and MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai, 200433, China
|
112
|
Svishcheva GR, Belonogova NM, Axenovich TI. FFBSKAT: fast family-based sequence kernel association test. PLoS One 2014; 9:e99407. [PMID: 24905468 PMCID: PMC4048315 DOI: 10.1371/journal.pone.0099407] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 02/16/2014] [Accepted: 05/14/2014] [Indexed: 11/28/2022] Open
Abstract
Kernel machine-based regression is an efficient approach to region-based association analysis aimed at identification of rare genetic variants. However, this method is computationally complex. The running time of kernel-based association analysis becomes especially long for samples with genetic (sub)structures, thus increasing the need to develop new and effective methods, algorithms, and software packages. We have developed a new R package called fast family-based sequence kernel association test (FFBSKAT) for analysis of quantitative traits in samples of related individuals. This software implements a score-based variance component test to assess the association of a given set of single nucleotide polymorphisms with a continuous phenotype. We compared the performance of our software with that of two existing software packages for family-based sequence kernel association testing, namely, ASKAT and famSKAT, using the Genetic Analysis Workshop 17 family sample. Results demonstrate that FFBSKAT is several times faster than other available programs. In addition, the calculations of the three compared programs were similarly accurate. With respect to the available analysis modes, we combined the advantages of both ASKAT and famSKAT and added new options to empower FFBSKAT users. The FFBSKAT package is fast, user-friendly, and provides an easy-to-use method to perform whole-exome kernel machine-based regression association analysis of quantitative traits in samples of related individuals. The FFBSKAT package, along with its manual, is available for free download at http://mga.bionet.nsc.ru/soft/FFBSKAT/.
Affiliation(s)
- Gulnara R. Svishcheva
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Nadezhda M. Belonogova
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Tatiana I. Axenovich
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Novosibirsk State University, Novosibirsk, Russia
|
113
|
Fan R, Wang Y, Mills JL, Wilson AF, Bailey-Wilson JE, Xiong M. Functional linear models for association analysis of quantitative traits. Genet Epidemiol 2014; 37:726-42. [PMID: 24130119 DOI: 10.1002/gepi.21757] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.5] [Received: 04/24/2013] [Revised: 07/15/2013] [Accepted: 08/14/2013] [Indexed: 12/19/2022]
Abstract
Functional linear models are developed in this paper for testing associations between quantitative traits and genetic variants, which can be rare variants or common variants or the combination of the two. By treating multiple genetic variants of an individual in a human population as a realization of a stochastic process, the genome of an individual in a chromosome region is a continuum of sequence data rather than discrete observations. The genome of an individual is viewed as a stochastic function that contains both linkage and linkage disequilibrium (LD) information of the genetic markers. By using techniques of functional data analysis, both fixed and mixed effect functional linear models are built to test the association between quantitative traits and genetic variants adjusting for covariates. After extensive simulation analysis, it is shown that the F-distributed tests of the proposed fixed effect functional linear models have higher power than that of sequence kernel association test (SKAT) and its optimal unified test (SKAT-O) for three scenarios in most cases: (1) the causal variants are all rare, (2) the causal variants are both rare and common, and (3) the causal variants are common. The superior performance of the fixed effect functional linear models is most likely due to its optimal utilization of both genetic linkage and LD information of multiple genetic variants in a genome and similarity among different individuals, while SKAT and SKAT-O only model the similarities and pairwise LD but do not model linkage and higher order LD information sufficiently. In addition, the proposed fixed effect models generate accurate type I error rates in simulation studies. We also show that the functional kernel score tests of the proposed mixed effect functional linear models are preferable in candidate gene analysis and small sample problems. The methods are applied to analyze three biochemical traits in data from the Trinity Students Study.
Affiliation(s)
- Ruzong Fan
- Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Rockville, Maryland, United States of America
|
114
|
Li M, He Z, Zhang M, Zhan X, Wei C, Elston RC, Lu Q. A generalized genetic random field method for the genetic association analysis of sequencing data. Genet Epidemiol 2014; 38:242-53. [PMID: 24482034 PMCID: PMC5241166 DOI: 10.1002/gepi.21790] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 09/10/2013] [Revised: 11/28/2013] [Accepted: 12/21/2013] [Indexed: 01/23/2023]
Abstract
With the advance of high-throughput sequencing technologies, it has become feasible to investigate the influence of the entire spectrum of sequencing variations on complex human diseases. Although association studies utilizing the new sequencing technologies hold great promise to unravel novel genetic variants, especially rare genetic variants that contribute to human diseases, the statistical analysis of high-dimensional sequencing data remains a challenge. Advanced analytical methods are in great need to facilitate high-dimensional sequencing data analyses. In this article, we propose a generalized genetic random field (GGRF) method for association analyses of sequencing data. Like other similarity-based methods (e.g., SIMreg and SKAT), the new method has the advantages of avoiding the need to specify thresholds for rare variants and allowing for testing multiple variants acting in different directions and magnitude of effects. The method is built on the generalized estimating equation framework and thus accommodates a variety of disease phenotypes (e.g., quantitative and binary phenotypes). Moreover, it has a nice asymptotic property, and can be applied to small-scale sequencing data without need for small-sample adjustment. Through simulations, we demonstrate that the proposed GGRF attains an improved or comparable power over a commonly used method, SKAT, under various disease scenarios, especially when rare variants play a significant role in disease etiology. We further illustrate GGRF with an application to a real dataset from the Dallas Heart Study. By using GGRF, we were able to detect the association of two candidate genes, ANGPTL3 and ANGPTL4, with serum triglyceride.
Affiliation(s)
- Ming Li
- Division of Biostatistics, Department of Pediatrics, University of Arkansas for Medical Sciences, Little Rock, Arkansas, United States of America
- Zihuai He
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of America
- Min Zhang
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of America
- Xiaowei Zhan
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of America
- Changshuai Wei
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, United States of America
- Robert C. Elston
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio, United States of America
- Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, United States of America
|
115
|
Zeng P, Zhao Y, Zhang L, Huang S, Chen F. Rare variants detection with kernel machine learning based on likelihood ratio test. PLoS One 2014; 9:e93355. [PMID: 24675868 PMCID: PMC3968153 DOI: 10.1371/journal.pone.0093355] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 11/16/2013] [Accepted: 03/03/2014] [Indexed: 11/18/2022] Open
Abstract
This paper mainly utilizes likelihood-based tests to detect rare variants associated with a continuous phenotype under the framework of kernel machine learning. Both the likelihood ratio test (LRT) and the restricted likelihood ratio test (ReLRT) are investigated. The relationship between the kernel machine learning and the mixed effects model is discussed. By using the eigenvalue representation of LRT and ReLRT, their exact finite sample distributions are obtained in a simulation manner. Numerical studies are performed to evaluate the performance of the proposed approaches under the contexts of standard mixed effects model and kernel machine learning. The results have shown that the LRT and ReLRT can control the type I error correctly at the given α level. The LRT and ReLRT consistently outperform the SKAT, regardless of the sample size and the proportion of the negative causal rare variants, and suffer from fewer power reductions compared to the SKAT when both positive and negative effects of rare variants are present. The LRT and ReLRT performed under the context of kernel machine learning have slightly higher powers than those performed under the context of standard mixed effects model. We use the Genetic Analysis Workshop 17 exome sequencing SNP data as an illustrative example. Some interesting results are observed from the analysis. Finally, we give the discussion.
Affiliation(s)
- Ping Zeng
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China
- Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical College, Xuzhou, Jiangsu, China
- Yang Zhao
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China
- Liwei Zhang
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China
- Shuiping Huang
- Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical College, Xuzhou, Jiangsu, China
- Feng Chen
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China
|
116
|
|
117
|
Huang YT, Vanderweele TJ, Lin X. Joint analysis of SNP and gene expression data in genetic association studies of complex diseases. Ann Appl Stat 2014; 8:352-376. [PMID: 24729824 DOI: 10.1214/13-aoas690] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.2] [Indexed: 11/19/2022]
Abstract
Genetic association studies have been a popular approach for assessing the association between common Single Nucleotide Polymorphisms (SNPs) and complex diseases. However, other genomic data involved in the mechanism from SNPs to disease, e.g., gene expressions, are usually neglected in these association studies. In this paper, we propose to exploit gene expression information to more powerfully test the association between SNPs and diseases by jointly modeling the relations among SNPs, gene expressions and diseases. We propose a variance component test for the total effect of SNPs and a gene expression on disease risk. We cast the test within the causal mediation analysis framework with the gene expression as a potential mediator. For eQTL SNPs, the use of gene expression information can enhance power to test for the total effect of a SNP-set, which is the combined direct and indirect effects of the SNPs mediated through the gene expression, on disease risk. We show that the test statistic under the null hypothesis follows a mixture of χ² distributions, which can be evaluated analytically or empirically using the resampling-based perturbation method. We construct tests for each of three disease models: one determined by SNPs only, one by SNPs and gene expression, and one that also includes their interactions. As the true disease model is unknown in practice, we further propose an omnibus test to accommodate different underlying disease models. We evaluate the finite sample performance of the proposed methods using simulation studies, and show that our proposed test performs well and the omnibus test can almost reach the optimal power where the disease model is known and correctly specified. We apply our method to re-analyze the overall effect of the SNP-set and expression of the ORMDL3 gene on the risk of asthma.
|
118
|
Lu M, Lee HS, Hadley D, Huang JZ, Qian X. Supervised categorical principal component analysis for genome-wide association analyses. BMC Genomics 2014; 15 Suppl 1:S10. [PMID: 24564304 PMCID: PMC4046680 DOI: 10.1186/1471-2164-15-s1-s10] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 02/22/2023] Open
Abstract
To better understand the heritability left unexplained for complex diseases by conventional Genome-Wide Association Studies (GWAS), aggregated association analyses based on predefined functional regions, such as genes and pathways, have recently become popular. They enable evaluation of the joint effect of multiple Single-Nucleotide Polymorphisms (SNPs), which helps increase the detection power, especially when investigating genetic variants with weak individual effects. In this paper, we focus on aggregated analysis methods based on the idea of Principal Component Analysis (PCA). Past approaches using PCA mostly make inherent assumptions about the genotype data and/or the risk effect model, which may hinder the accurate detection of potential disease SNPs that influence disease phenotypes. In this paper, we derive a general Supervised Categorical Principal Component Analysis (SCPCA), which explicitly models categorical SNP data without imposing any risk effect model assumption. We have evaluated the efficacy of SCPCA by comparing it to a traditional Supervised PCA (SPCA) and a previously developed Supervised Logistic Principal Component Analysis (SLPCA), based on both genotype data simulated by HAPGEN2 and the genotype data of Crohn's Disease (CD) from the Wellcome Trust Case Control Consortium (WTCCC). Our preliminary results have demonstrated the superiority of SCPCA over both SPCA and SLPCA due to its modeling explicitly designed for categorical SNP data as well as its flexibility on the risk effect model assumption.
|
119
|
Schatzberg AF, Keller J, Tennakoon L, Lembke A, Williams G, Kraemer FB, Sarginson JE, Lazzeroni LC, Murphy GM. HPA axis genetic variation, cortisol and psychosis in major depression. Mol Psychiatry 2014; 19:220-7. [PMID: 24166410 PMCID: PMC4339288 DOI: 10.1038/mp.2013.129] [Citation(s) in RCA: 90] [Impact Index Per Article: 8.2] [Received: 02/15/2013] [Revised: 06/27/2013] [Accepted: 07/10/2013] [Indexed: 01/07/2023]
Abstract
Genetic variation underlying hypothalamic pituitary adrenal (HPA) axis overactivity in healthy controls (HCs) and patients with severe forms of major depression has not been well explored, but could explain risk for cortisol dysregulation. In total, 95 participants were studied: 40 patients with psychotic major depression (PMD); 26 patients with non-psychotic major depression (NPMD); and 29 HCs. Collection of genetic material was added one third of the way into a larger study on cortisol, cognition and psychosis in major depression. Subjects were assessed using the Brief Psychiatric Rating Scale, the Hamilton Depression Rating Scale and the Structured Clinical Interview for Diagnostic and Statistical Manual of Mental Disorders. Blood was collected hourly for determination of cortisol from 1800 to 0900 h and for the assessment of alleles for six genes involved in HPA axis regulation. Two of the six genes contributed significantly to cortisol levels, psychosis measures or depression severity. After accounting for age, depression and psychosis, and medication status, only allelic variation for the glucocorticoid receptor (GR) gene accounted for a significant variance for mean cortisol levels from 1800 to 0100 h (r² = 0.288) and from 0100 to 0900 h (r² = 0.171). In addition, GR and corticotropin-releasing hormone receptor 1 (CRHR1) genotypes contributed significantly to psychosis measures and CRHR1 contributed significantly to depression severity rating.
MESH Headings
- Adult
- Affective Disorders, Psychotic/diagnosis
- Affective Disorders, Psychotic/genetics
- Affective Disorders, Psychotic/physiopathology
- Corticotropin-Releasing Hormone/genetics
- Depressive Disorder, Major/diagnosis
- Depressive Disorder, Major/genetics
- Depressive Disorder, Major/physiopathology
- Female
- Humans
- Hydrocortisone/blood
- Hypothalamo-Hypophyseal System/physiopathology
- Interview, Psychological
- Linkage Disequilibrium
- Male
- Pituitary-Adrenal System/physiopathology
- Psychiatric Status Rating Scales
- Receptors, Corticotropin-Releasing Hormone/genetics
- Receptors, Glucocorticoid/genetics
- Receptors, Mineralocorticoid/genetics
- Tacrolimus Binding Proteins/genetics
Affiliation(s)
- Alan F. Schatzberg
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine
- Jennifer Keller
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine
- Lakshika Tennakoon
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine
- Anna Lembke
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine
- Jane E. Sarginson
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine
- Laura C. Lazzeroni
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine
- Greer M. Murphy
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine
|
120
|
Incorporating prior knowledge to increase the power of genome-wide association studies. Methods Mol Biol 2014; 1019:519-41. [PMID: 23756909 DOI: 10.1007/978-1-62703-447-0_25] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 02/10/2023]
Abstract
Typical methods of analyzing genome-wide single nucleotide variant (SNV) data in cases and controls involve testing each variant's genotypes separately for phenotype association, and then using a substantial multiple-testing penalty to minimize the rate of false positives. This approach, however, can result in low power for modestly associated SNVs. Furthermore, simply looking at the most associated SNVs may not directly yield biological insights about disease etiology. SNVset methods attempt to address both limitations of the traditional approach by testing biologically meaningful sets of SNVs (e.g., genes or pathways). The number of tests run in a SNVset analysis is typically much lower (hundreds or thousands instead of millions) than in a traditional analysis, so the false-positive rate is lower. Additionally, because the SNVsets tested are biologically meaningful, finding a significant set may more quickly yield insights into disease etiology. In this chapter we summarize the short history of SNVset testing and provide an overview of the many recently proposed methods. Furthermore, we provide detailed step-by-step instructions on how to perform a SNVset analysis, including a substantial number of practical tips and questions that researchers should consider before undertaking a SNVset analysis. Lastly, we describe a companion R package (snvset) that implements recently proposed SNVset methods. While SNVset testing is a new approach, with many new methods still being developed and many open questions, the promise of the approach is worth serious consideration when choosing analytic methods for GWAS.
|
121
|
Larson NB, Schaid DJ. Regularized rare variant enrichment analysis for case-control exome sequencing data. Genet Epidemiol 2013; 38:104-13. [PMID: 24382715 DOI: 10.1002/gepi.21783] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 07/25/2013] [Revised: 11/04/2013] [Accepted: 12/02/2013] [Indexed: 11/09/2022]
Abstract
Rare variants have recently garnered an immense amount of attention in genetic association analysis. However, unlike methods traditionally used for single marker analysis in GWAS, rare variant analysis often requires some method of aggregation, since single marker approaches are poorly powered for typical sequencing study sample sizes. Advancements in sequencing technologies have rendered next-generation sequencing platforms a realistic alternative to traditional genotyping arrays. Exome sequencing in particular not only provides base-level resolution of genetic coding regions, but also a natural paradigm for aggregation via genes and exons. Here, we propose the use of penalized regression in combination with variant aggregation measures to identify rare variant enrichment in exome sequencing data. In contrast to marginal gene-level testing, we simultaneously evaluate the effects of rare variants in multiple genes, focusing on gene-based least absolute shrinkage and selection operator (LASSO) and exon-based sparse group LASSO models. By using gene membership as a grouping variable, the sparse group LASSO can be used as a gene-centric analysis of rare variants while also providing a penalized approach toward identifying specific regions of interest. We apply extensive simulations to evaluate the performance of these approaches with respect to specificity and sensitivity, comparing these results to multiple competing marginal testing methods. Finally, we discuss our findings and outline future research.
Affiliation(s)
- Nicholas B Larson
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
|
122
|
Jin L, Zhu W, Yu Y, Kou C, Meng X, Tao Y, Guo J. Nonparametric tests of associations with disease based on U-statistics. Ann Hum Genet 2013; 78:141-53. [PMID: 24328673 DOI: 10.1111/ahg.12049] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/07/2013] [Accepted: 09/01/2013] [Indexed: 11/25/2022]
Abstract
In case-control studies, association analysis was designed to test whether genetic variants were associated with human diseases. To evaluate the association, analysing one genetic marker at a time suffered from weak power, because of the correction for multiple testing and possibly small genetic effects. An alternative strategy was to test simultaneous effects of multiple markers, which was believed to be more powerful. However, when the number of markers under investigation was large, they would be subjected to weak power as well, because of the greater degrees of freedom. To conquer these limitations in case-control studies, we proposed a novel method that could test joint association of several loci (i.e. haplotype), with only a single degree of freedom. In this research, we developed a nonparametric approach, which was based on U-statistics. We also introduced a new kernel for U-statistic, which could combine the haplotype structure information, and was expected to enhance the power. Simulations indicated that our proposed approach offered merits in identifying the associations between diseases and haplotypes. Application of our method to a study of candidate genes for internalising disorder illustrated its virtue in utility and interpretation, and provided an excellent result in detecting the associations.
Affiliation(s)
- Lina Jin
- Key Laboratory for Applied Statistics of MOE and School of Mathematics and Statistics, Northeast Normal University, Changchun, Jilin, 130024, China; School of Public Health, Jilin University, Changchun, Jilin, 130021, China
|
123
|
Taub MA, Schwender HR, Younkin SG, Louis TA, Ruczinski I. On multi-marker tests for association in case-control studies. Front Genet 2013; 4:252. [PMID: 24379823 PMCID: PMC3863805 DOI: 10.3389/fgene.2013.00252] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/16/2013] [Accepted: 11/07/2013] [Indexed: 11/13/2022] Open
Abstract
Genome-wide association studies (GWAS) have identified thousands of DNA loci associated with a variety of traits. Statistical inference is almost always based on single marker hypothesis tests of association and the respective p-values with Bonferroni correction. Since commercially available genomic arrays interrogate hundreds of thousands or even millions of loci simultaneously, many causal yet undetected loci are believed to exist because the conditional power to achieve a genome-wide significance level can be low, in particular for markers with small effect sizes and low minor allele frequencies and in studies with modest sample size. However, the correlation between neighboring markers in the human genome due to linkage disequilibrium (LD), which results in correlated marker test statistics, can be incorporated into multi-marker hypothesis tests, thereby increasing power to detect association. Herein, we establish a theoretical benchmark by quantifying the maximum power achievable for multi-marker tests of association in case-control studies, achievable only when the causal marker is known. Using the fact that genotype correlations within an LD block translate into an asymptotically multivariate normal distribution for score test statistics, we develop a set of weights for the markers that maximize the non-centrality parameter, and assess the relative loss of power for other approaches. We find that the method of Conneely and Boehnke (2007) based on the maximum absolute test statistic observed in an LD block is a practical and powerful method in a variety of settings. We also explore the effect on the power that prior biological or functional knowledge used to narrow down the locus of the causal marker can have, and conclude that this prior knowledge has to be very strong and specific for the power to approach the maximum achievable level, or even beat the power observed for methods such as the one proposed by Conneely and Boehnke (2007).
Affiliation(s)
- Margaret A Taub
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Holger R Schwender
- Mathematical Institute, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Samuel G Younkin
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Thomas A Louis
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Ingo Ruczinski
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
|
124
|
Qu L, Guennel T, Marshall SL. Linear score tests for variance components in linear mixed models and applications to genetic association studies. Biometrics 2013; 69:883-92. [PMID: 24328714 DOI: 10.1111/biom.12095] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Received: 09/01/2012] [Revised: 06/01/2013] [Accepted: 07/01/2013] [Indexed: 01/16/2023]
Abstract
Following the rapid development of genome-scale genotyping technologies, genetic association mapping has become a popular tool to detect genomic regions responsible for certain (disease) phenotypes, especially in early-phase pharmacogenomic studies with limited sample size. In response to such applications, a good association test needs to be (1) applicable to a wide range of possible genetic models, including, but not limited to, the presence of gene-by-environment or gene-by-gene interactions and non-linearity of a group of marker effects, (2) accurate in small samples, fast to compute on the genomic scale, and amenable to large scale multiple testing corrections, and (3) reasonably powerful to locate causal genomic regions. The kernel machine method represented in linear mixed models provides a viable solution by transforming the problem into testing the nullity of variance components. In this study, we consider score-based tests by choosing a statistic linear in the score function. When the model under the null hypothesis has only one error variance parameter, our test is exact in finite samples. When the null model has more than one variance parameter, we develop a new moment-based approximation that performs well in simulations. Through simulations and analysis of real data, we demonstrate that the new test possesses most of the aforementioned characteristics, especially when compared to existing quadratic score tests or restricted likelihood ratio tests.
Affiliation(s)
- Long Qu
- Department of Mathematics and Statistics, Wright State University, Dayton, Ohio 45435, U.S.A
|
125
|
Regional replication of association with refractive error on 15q14 and 15q25 in the Age-Related Eye Disease Study cohort. Mol Vis 2013; 19:2173-86. [PMID: 24227913 PMCID: PMC3826323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2013] [Accepted: 10/30/2013] [Indexed: 11/25/2022] Open
Abstract
PURPOSE Refractive error is a complex trait with multiple genetic and environmental risk factors, and is the most common cause of preventable blindness worldwide. The common nature of the trait suggests the presence of many genetic factors that individually may have modest effects. To achieve an adequate sample size to detect these common variants, large, international collaborations have formed. These consortia typically use meta-analysis to combine multiple studies from many different populations. This approach is robust to differences between populations; however, it does not compensate for the different haplotypes in each genetic background evidenced by different alleles in linkage disequilibrium with the causative variant. We used the Age-Related Eye Disease Study (AREDS) cohort to replicate published significant associations at two loci on chromosome 15 from two genome-wide association studies (GWASs). The single nucleotide polymorphisms (SNPs) that exhibited association on chromosome 15 in the original studies did not show evidence of association with refractive error in the AREDS cohort. This paper seeks to determine whether the non-replication in this AREDS sample may be due to the limited number of SNPs chosen for replication. METHODS We selected all SNPs genotyped on the Illumina Omni2.5v1_B array or custom TaqMan assays or imputed from the GWAS data, in the region surrounding the SNPs from the Consortium for Refractive Error and Myopia study. We analyzed the SNPs for association with refractive error using standard regression methods in PLINK. The effective number of tests was calculated using the Genetic Type I Error Calculator. RESULTS Although use of the same SNPs used in the Consortium for Refractive Error and Myopia study did not show any evidence of association with refractive error in this AREDS sample, other SNPs within the candidate regions demonstrated an association with refractive error. 
Significant evidence of association was found using the hyperopia categorical trait, with the most significant SNPs rs1357179 on 15q14 (p=1.69×10⁻³) and rs7164400 on 15q25 (p=8.39×10⁻⁴), which passed the replication thresholds. CONCLUSIONS This study adds to the growing body of evidence that the most significant SNPs found in one population may not be significant in another population because of differences in linkage disequilibrium structure and/or allele frequency. This suggests that replication studies should include less significant SNPs in an associated region rather than only a few SNPs selected by a significance threshold.
|
126
|
Schaid DJ, Sinnwell JP, McDonnell SK, Thibodeau SN. Detecting genomic clustering of risk variants from sequence data: cases versus controls. Hum Genet 2013; 132:1301-9. [PMID: 23842950 PMCID: PMC3797865 DOI: 10.1007/s00439-013-1335-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 04/03/2013] [Accepted: 07/02/2013] [Indexed: 02/02/2023]
Abstract
As the ability to measure dense genetic markers approaches the limit of the DNA sequence itself, taking advantage of possible clustering of genetic variants in, and around, a gene would benefit genetic association analyses, and likely provide biological insights. The greatest benefit might be realized when multiple rare variants cluster in a functional region. Several statistical tests have been developed, one of which is based on the popular Kulldorff scan statistic for spatial clustering of disease. We extended another popular spatial clustering method--Tango's statistic--to genomic sequence data. An advantage of Tango's method is that it is rapid to compute, and when a single test statistic is computed, its distribution is well approximated by a scaled χ² distribution, making computation of p values very rapid. We compared the Type-I error rates and power of several clustering statistics, as well as the omnibus sequence kernel association test. Although our version of Tango's statistic, which we call the "Kernel Distance" statistic, took approximately half as long to compute as the Kulldorff scan statistic, it had slightly less power than the scan statistic. Our results showed that the Ionita-Laza version of Kulldorff's scan statistic had the greatest power over a range of clustering scenarios.
Affiliation(s)
- Daniel J Schaid
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
|
127
|
Larson NB, Schaid DJ. A kernel regression approach to gene-gene interaction detection for case-control studies. Genet Epidemiol 2013; 37:695-703. [PMID: 23868214 DOI: 10.1002/gepi.21749] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 01/18/2013] [Revised: 05/07/2013] [Accepted: 06/12/2013] [Indexed: 01/13/2023]
Abstract
Gene-gene interactions are increasingly being addressed as a potentially important contributor to the variability of complex traits. Consequently, attention has moved beyond single locus analysis of association to more complex genetic models. Although several single-marker approaches toward interaction analysis have been developed, such methods suffer from very high testing dimensionality and do not take advantage of existing information, notably the definition of genes as functional units. Here, we propose a comprehensive family of gene-level score tests for identifying genetic elements of disease risk, in particular pairwise gene-gene interactions. Using kernel machine methods, we devise score-based variance component tests under a generalized linear mixed model framework. We conducted simulations based upon coalescent genetic models to evaluate the performance of our approach under a variety of disease models. These simulations indicate that our methods are generally higher powered than alternative gene-level approaches and at worst competitive with exhaustive SNP-level (where SNP is single-nucleotide polymorphism) analyses. Furthermore, we observe that simulated epistatic effects resulted in significant marginal testing results for the involved genes regardless of whether or not true main effects were present. We detail the benefits of our methods and discuss potential genome-wide analysis strategies for gene-gene interaction analysis in a case-control study design.
Affiliation(s)
- Nicholas B Larson
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
|
128
|
Jiao S, Hsu L, Bézieau S, Brenner H, Chan AT, Chang-Claude J, Le Marchand L, Lemire M, Newcomb PA, Slattery ML, Peters U. SBERIA: set-based gene-environment interaction test for rare and common variants in complex diseases. Genet Epidemiol 2013; 37:452-64. [PMID: 23720162 PMCID: PMC3713231 DOI: 10.1002/gepi.21735] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Received: 03/05/2013] [Revised: 04/04/2013] [Accepted: 04/30/2013] [Indexed: 01/28/2023]
Abstract
Identification of gene-environment interaction (G × E) is important in understanding the etiology of complex diseases. However, partially due to the lack of power, there have been very few replicated G × E findings compared to the success in marginal association studies. The existing G × E testing methods mainly focus on improving the power for individual markers. In this paper, we took a different strategy and proposed a set-based gene-environment interaction test (SBERIA), which can improve the power by reducing the multiple testing burdens and aggregating signals within a set. The major challenge of the signal aggregation within a set is how to tell signals from noise and how to determine the direction of the signals. SBERIA takes advantage of the established correlation screening for G × E to guide the aggregation of genotypes within a marker set. The correlation screening has been shown to be an efficient way of selecting potential G × E candidate SNPs in case-control studies for complex diseases. Importantly, the correlation screening in case-control combined samples is independent of the interaction test. With this desirable feature, SBERIA maintains the correct type I error level and can be easily implemented in a regular logistic regression setting. We showed that SBERIA had higher power than benchmark methods in various simulation scenarios, both for common and rare variants. We also applied SBERIA to real genome-wide association studies (GWAS) data of 10,729 colorectal cancer cases and 13,328 controls and found evidence of interaction between the set of known colorectal cancer susceptibility loci and smoking.
Affiliation(s)
- Shuo Jiao
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, USA.
|
129
|
Belonogova NM, Svishcheva GR, van Duijn CM, Aulchenko YS, Axenovich TI. Region-based association analysis of human quantitative traits in related individuals. PLoS One 2013; 8:e65395. [PMID: 23799013 PMCID: PMC3684601 DOI: 10.1371/journal.pone.0065395] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 01/29/2013] [Accepted: 04/24/2013] [Indexed: 01/27/2023] Open
Abstract
Regional-based association analysis instead of individual testing of each SNP was introduced in genome-wide association studies to increase the power of gene mapping, especially for rare genetic variants. For regional association tests, the kernel machine-based regression approach was recently proposed as a more powerful alternative to collapsing-based methods. However, the vast majority of existing algorithms and software for the kernel machine-based regression are applicable only to unrelated samples. In this paper, we present a new method for the kernel machine-based regression association analysis of quantitative traits in samples of related individuals. The method is based on the GRAMMAR+ transformation of phenotypes of related individuals, followed by use of existing kernel machine-based regression software for unrelated samples. We compared the performance of kernel-based association analysis on the material of the Genetic Analysis Workshop 17 family sample and real human data by using our transformation, the original untransformed trait, and environmental residuals. We demonstrated that only the GRAMMAR+ transformation produced type I errors close to the nominal value and that this method had the highest empirical power. The new method can be applied to analysis of related samples by using existing software for kernel-based association analysis developed for unrelated samples.
Affiliation(s)
- Nadezhda M. Belonogova
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Gulnara R. Svishcheva
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Yurii S. Aulchenko
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Tatiana I. Axenovich
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
|
130
|
Lee D, Lee GK, Yoon KA, Lee JS. Pathway-based analysis using genome-wide association data from a Korean non-small cell lung cancer study. PLoS One 2013; 8:e65396. [PMID: 23762359 PMCID: PMC3675130 DOI: 10.1371/journal.pone.0065396] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 02/05/2013] [Accepted: 04/24/2013] [Indexed: 11/18/2022] Open
Abstract
Pathway-based analysis, used in conjunction with genome-wide association study (GWAS) techniques, is a powerful tool to detect subtle but systematic patterns in the genome that can help elucidate complex diseases, like cancers. Here, we stepped back from genetic polymorphisms at a single locus and examined how multiple association signals can be orchestrated to find pathways related to lung cancer susceptibility. We used single-nucleotide polymorphism (SNP) array data from 869 non-small cell lung cancer (NSCLC) cases from a previous GWAS at the National Cancer Center and 1,533 controls from the Korean Association Resource project for the pathway-based analysis. After mapping single-nucleotide polymorphisms to genes, considering their coding region and regulatory elements (±20 kbp), multivariate logistic regressions under additive and dominant genetic models were fitted against disease status, with adjustments for age, gender, and smoking status. Pathway statistics were evaluated using Gene Set Enrichment Analysis (GSEA) and Adaptive Rank Truncated Product (ARTP) methods. Among 880 pathways, 11 showed relatively significant statistics compared to our positive controls (P_GSEA ≤ 0.025, false discovery rate ≤ 0.25). Candidate pathways were validated using the ARTP method and similarities between pathways were computed against each other. The top-ranked pathways were ABC Transporters (P_GSEA < 0.001, P_ARTP = 0.001), VEGF Signaling Pathway (P_GSEA < 0.001, P_ARTP = 0.008), G1/S Check Point (P_GSEA = 0.004, P_ARTP = 0.013), and NRAGE Signals Death through JNK (P_GSEA = 0.006, P_ARTP = 0.001). Our results demonstrate that pathway analysis can shed light on post-GWAS research and help identify potential targets for cancer susceptibility.
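The SNP-to-gene mapping step mentioned in this abstract (gene body plus a ±20 kbp flanking window) can be illustrated with a minimal Python sketch. The function name `map_snps_to_genes`, the input layout, and the toy coordinates are assumptions made for illustration; the authors' actual pipeline is not described here.

```python
def map_snps_to_genes(snps, genes, flank=20_000):
    """Assign each SNP to every gene whose flanked interval contains it.

    snps  : dict of SNP id -> (chromosome, position)
    genes : dict of gene name -> (chromosome, start, end)
    flank : window added on both sides of the gene, in base pairs
    """
    gene_to_snps = {name: [] for name in genes}
    for snp_id, (chrom, pos) in snps.items():
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start - flank <= pos <= end + flank:
                gene_to_snps[name].append(snp_id)
    return gene_to_snps


# Toy coordinates, invented for illustration only.
toy_genes = {"GENE_A": ("1", 100_000, 150_000), "GENE_B": ("1", 180_000, 200_000)}
toy_snps = {
    "snp1": ("1", 85_000),   # within 20 kb upstream of GENE_A
    "snp2": ("1", 165_000),  # inside the flanked windows of both genes
    "snp3": ("1", 260_000),  # outside every window
    "snp4": ("2", 120_000),  # matching position, different chromosome
}
mapping = map_snps_to_genes(toy_snps, toy_genes)
print(mapping)
```

A SNP that falls in overlapping windows is assigned to both genes in this sketch; whether the original study did the same is not stated in the abstract.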
Collapse
MESH Headings
- Adult
- Aged
- Aged, 80 and over
- Asian People
- Carcinoma, Non-Small-Cell Lung/diagnosis
- Carcinoma, Non-Small-Cell Lung/ethnology
- Carcinoma, Non-Small-Cell Lung/genetics
- Carcinoma, Non-Small-Cell Lung/metabolism
- Case-Control Studies
- Databases, Genetic
- Female
- Gene Expression Regulation, Neoplastic
- Genetic Predisposition to Disease
- Genome, Human
- Genome-Wide Association Study
- Humans
- Logistic Models
- Lung Neoplasms/diagnosis
- Lung Neoplasms/ethnology
- Lung Neoplasms/genetics
- Lung Neoplasms/metabolism
- Male
- Metabolic Networks and Pathways/genetics
- Middle Aged
- Models, Genetic
- Polymorphism, Single Nucleotide
- Signal Transduction
Affiliation(s)
- Donghoon Lee
- Lung Cancer Branch, Research Institute and Hospital, National Cancer Center, Gyeonggi, Republic of Korea
- Geon Kook Lee
- Lung Cancer Branch, Research Institute and Hospital, National Cancer Center, Gyeonggi, Republic of Korea
- Kyong-Ah Yoon
- Lung Cancer Branch, Research Institute and Hospital, National Cancer Center, Gyeonggi, Republic of Korea
- Jin Soo Lee
- Lung Cancer Branch, Research Institute and Hospital, National Cancer Center, Gyeonggi, Republic of Korea
|
131
|
Zhu H, Li L, Zhou H. Nonlinear dimension reduction with Wright-Fisher kernel for genotype aggregation and association mapping. Bioinformatics 2013; 28:i375-i381. [PMID: 22962455 PMCID: PMC3436833 DOI: 10.1093/bioinformatics/bts406] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/09/2023] Open
Abstract
MOTIVATION Association tests based on next-generation sequencing data are often under-powered due to the presence of rare variants and a large number of neutral or protective variants. A successful strategy is to aggregate genetic information within meaningful single-nucleotide polymorphism (SNP) sets, e.g. genes or pathways, and test association on SNP sets. Many existing methods for group-wise tests require specific assumptions about the direction of individual SNP effects and/or perform poorly in the presence of interactions. RESULTS We propose a joint association test strategy based on two key components: a nonlinear supervised dimension reduction approach for effective SNP information aggregation and a novel kernel specially designed for qualitative genotype data. The new test demonstrates superior performance in identifying causal genes over existing methods across a large variety of disease models simulated from sequence data of real genes. In general, the proposed method provides an association test strategy that can (i) detect both rare and common causal variants, (ii) deal with both additive and interaction effects, (iii) handle both quantitative traits and disease dichotomies and (iv) incorporate non-genetic covariates. In addition, the new kernel can potentially boost the power of the entire family of kernel-based methods for genetic data analysis. AVAILABILITY The method is implemented in MATLAB. Source code is available upon request. CONTACT hongjie.zhu@duke.edu.
Affiliation(s)
- Hongjie Zhu
- Department of Psychiatry and Behavior Science, Duke University, Durham, NC 27710, USA.
|
132
|
Schaid DJ, McDonnell SK, Sinnwell JP, Thibodeau SN. Multiple genetic variant association testing by collapsing and kernel methods with pedigree or population structured data. Genet Epidemiol 2013; 37:409-18. [PMID: 23650101 DOI: 10.1002/gepi.21727] [Citation(s) in RCA: 73] [Impact Index Per Article: 6.1] [Received: 01/17/2013] [Revised: 03/11/2013] [Accepted: 04/01/2013] [Indexed: 11/11/2022]
Abstract
Searching for rare genetic variants associated with complex diseases can be facilitated by enriching for diseased carriers of rare variants by sampling cases from pedigrees enriched for disease, possibly with related or unrelated controls. This strategy, however, complicates analyses because of shared genetic ancestry, as well as linkage disequilibrium among genetic markers. To overcome these problems, we developed broad classes of "burden" statistics and kernel statistics, extending commonly used methods for unrelated case-control data to allow for known pedigree relationships, for autosomes and the X chromosome. Furthermore, by replacing pedigree-based genetic correlation matrices with estimates of genetic relationships based on large-scale genomic data, our methods can be used to account for population-structured data. By simulations, we show that the type I error rates of our developed methods are near the asymptotic nominal levels, allowing rapid computation of P-values. Our simulations also show that a linear weighted kernel statistic is generally more powerful than a weighted "burden" statistic. Because the proposed statistics are rapid to compute, they can be readily used for large-scale screening of the association of genomic sequence data with disease status.
Affiliation(s)
- Daniel J Schaid
- Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota 55905, USA.
|
133
|
Wu MC, Maity A, Lee S, Simmons EM, Harmon QE, Lin X, Engel SM, Molldrem JJ, Armistead PM. Kernel machine SNP-set testing under multiple candidate kernels. Genet Epidemiol 2013; 37:267-75. [PMID: 23471868 PMCID: PMC3769109 DOI: 10.1002/gepi.21715] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.3] [Received: 10/04/2012] [Revised: 01/15/2013] [Accepted: 02/05/2013] [Indexed: 11/10/2022]
Abstract
Joint testing for the cumulative effect of multiple single-nucleotide polymorphisms grouped on the basis of prior biological knowledge has become a popular and powerful strategy for the analysis of large-scale genetic association studies. The kernel machine (KM)-testing framework is a useful approach that has been proposed for testing associations between multiple genetic variants and many different types of complex traits by comparing pairwise similarity in phenotype between subjects to pairwise similarity in genotype, with similarity in genotype defined via a kernel function. An advantage of the KM framework is its flexibility: choosing different kernel functions allows for different assumptions concerning the underlying model and can allow for improved power. In practice, it is difficult to know which kernel to use a priori because this depends on the unknown underlying trait architecture, and selecting the kernel that gives the lowest P-value can lead to inflated type I error. Therefore, we propose practical strategies for KM testing when multiple candidate kernels are present, based on constructing composite kernels and on efficient perturbation procedures. We demonstrate through simulations and real data applications that the procedures protect the type I error rate and can lead to substantially improved power over poor choices of kernels and only modest differences in power vs. using the best candidate kernel.
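The idea of comparing pairwise phenotype similarity with pairwise genotype similarity can be sketched in plain Python. This is a simplified illustration, not the procedure from the paper: it assumes a quantitative trait, an intercept-only null model, an equal-weight average as the composite kernel, and a permutation p-value in place of the perturbation procedures. All function names and the toy data are hypothetical.

```python
import random


def linear_kernel(G):
    """K[i][j] = sum over variants of g_i * g_j (genotypes coded 0/1/2)."""
    n = len(G)
    return [[sum(a * b for a, b in zip(G[i], G[j])) for j in range(n)]
            for i in range(n)]


def ibs_kernel(G):
    """K[i][j] = proportion of alleles shared identical-by-state."""
    n, m = len(G), len(G[0])
    return [[sum(2 - abs(a - b) for a, b in zip(G[i], G[j])) / (2 * m)
             for j in range(n)] for i in range(n)]


def composite_kernel(kernels):
    """Equal-weight average of candidate kernels (the weights are an assumption)."""
    n = len(kernels[0])
    return [[sum(K[i][j] for K in kernels) / len(kernels) for j in range(n)]
            for i in range(n)]


def km_statistic(resid, K):
    """Q = r' K r: large when subjects with similar genotypes have similar residuals."""
    n = len(resid)
    return sum(resid[i] * K[i][j] * resid[j] for i in range(n) for j in range(n))


def km_permutation_pvalue(y, K, n_perm=999, seed=0):
    """Permutation p-value for Q under an intercept-only null model."""
    rng = random.Random(seed)
    mean_y = sum(y) / len(y)
    resid = [v - mean_y for v in y]
    q_obs = km_statistic(resid, K)
    hits = 0
    for _ in range(n_perm):
        perm = resid[:]
        rng.shuffle(perm)
        if km_statistic(perm, K) >= q_obs:
            hits += 1
    return q_obs, (hits + 1) / (n_perm + 1)


# Toy data: 6 subjects, 3 variants (invented for illustration).
G = [[0, 1, 0], [0, 1, 1], [1, 0, 0], [2, 1, 0], [2, 2, 1], [1, 2, 2]]
y = [0.1, 0.3, 0.9, 2.1, 2.4, 1.8]
K = composite_kernel([linear_kernel(G), ibs_kernel(G)])
q, p = km_permutation_pvalue(y, K)
print(q, p)
```

Building one composite kernel and testing it once avoids the inflation that comes from picking the smallest p-value across kernels, which is the concern the abstract raises.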
Affiliation(s)
- Michael C Wu
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599-7420, USA.
|
134
|
Wang X, Morris NJ, Zhu X, Elston RC. A variance component based multi-marker association test using family and unrelated data. BMC Genet 2013; 14:17. [PMID: 23497289 PMCID: PMC3614458 DOI: 10.1186/1471-2156-14-17] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 07/09/2012] [Accepted: 02/11/2013] [Indexed: 02/02/2023] Open
Abstract
Background Incorporating family data in genetic association studies has become increasingly appreciated, especially for its potential value in testing rare variants. We introduce here a variance-component based association test that can test multiple common or rare variants jointly using both family and unrelated samples. Results The proposed approach implemented in our R package aggregates or collapses the information across a region based on genetic similarity instead of genotype scores, which avoids the power loss when the effects are in different directions or have different association strengths. The method is also able to effectively leverage the LD information in a region and it can produce a test statistic with an adaptively estimated number of degrees of freedom. Our method can readily allow for the adjustment of non-genetic contributions to the familial similarity, as well as multiple covariates. Conclusions We demonstrate through simulations that the proposed method achieves good performance in terms of Type I error control and statistical power. The method is implemented in the R package “fassoc”, which provides a useful tool for data analysis and exploration.
Affiliation(s)
- Xuefeng Wang
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA
|
135
|
Gene-based testing of interactions in association studies of quantitative traits. PLoS Genet 2013; 9:e1003321. [PMID: 23468652 PMCID: PMC3585009 DOI: 10.1371/journal.pgen.1003321] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.9] [Received: 08/03/2012] [Accepted: 12/31/2012] [Indexed: 01/05/2023] Open
Abstract
Various methods have been developed for identifying gene–gene interactions in genome-wide association studies (GWAS). However, most methods focus on individual markers as the testing unit, and the large number of such tests drastically erodes statistical power. In this study, we propose novel interaction tests of quantitative traits that are gene-based and that confer advantage in both statistical power and biological interpretation. The framework of gene-based gene–gene interaction (GGG) tests combines marker-based interaction tests between all pairs of markers in two genes to produce a gene-level test for interaction between the two. The tests are based on an analytical formula we derive for the correlation between marker-based interaction tests due to linkage disequilibrium. We propose four GGG tests that extend the following P value combining methods: minimum P value, extended Simes procedure, truncated tail strength, and truncated P value product. Extensive simulations point to correct type I error rates of all tests and show that the two truncated tests are more powerful than the other tests in cases of markers involved in the underlying interaction not being directly genotyped and in cases of multiple underlying interactions. We applied our tests to pairs of genes that exhibit a protein–protein interaction to test for gene-level interactions underlying lipid levels using genotype data from the Atherosclerosis Risk in Communities study. We identified five novel interactions that are not evident from marker-based interaction testing and successfully replicated one of these interactions, between SMAD3 and NEDD9, in an independent sample from the Multi-Ethnic Study of Atherosclerosis. We conclude that our GGG tests show improved power to identify gene-level interactions in existing, as well as emerging, association studies.
Epistasis is likely to play a significant role in complex diseases or traits and is one of the many possible explanations for “missing heritability.” However, epistatic interactions have been difficult to detect in genome-wide association studies (GWAS) due to the limited power caused by the multiple-testing correction from the large number of tests conducted. Gene-based gene–gene interaction (GGG) tests might hold the key to relaxing the multiple-testing correction burden and increasing the power for identifying epistatic interactions in GWAS. Here, we developed GGG tests of quantitative traits by extending four P value combining methods and evaluated their type I error rates and power using extensive simulations. All four GGG tests are more powerful than a principal component-based test. We also applied our GGG tests to data from the Atherosclerosis Risk in Communities study and found five gene-level interactions associated with the levels of total cholesterol and high-density lipoprotein cholesterol (HDL-C). One interaction between SMAD3 and NEDD9 on HDL-C was further replicated in an independent sample from the Multi-Ethnic Study of Atherosclerosis.
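Three of the P value combining ideas named above (minimum P value, Simes, truncated P value product) can be sketched in Python. This is the textbook form of each method and assumes independent marker-level tests; the GGG tests in the paper instead adjust for the correlation induced by linkage disequilibrium, which is not reproduced here. The function names, the truncation point `tau=0.05`, and the Monte Carlo null are illustrative assumptions.

```python
import math
import random


def min_p_sidak(pvals):
    """Minimum P value with a Sidak correction for m independent tests."""
    m = len(pvals)
    return 1.0 - (1.0 - min(pvals)) ** m


def simes(pvals):
    """Simes combined P value: min over ranks of m * p_(k) / k."""
    m = len(pvals)
    ordered = sorted(pvals)
    return min(1.0, min(m * p / k for k, p in enumerate(ordered, start=1)))


def log_truncated_product(pvals, tau=0.05):
    """log of the product of all P values <= tau (0.0 if none qualify)."""
    return sum(math.log(p) for p in pvals if p <= tau)


def truncated_product_pvalue(pvals, tau=0.05, n_sim=10_000, seed=0):
    """Monte Carlo P value for the truncated product under independence."""
    rng = random.Random(seed)
    m = len(pvals)
    w_obs = log_truncated_product(pvals, tau)
    hits = 0
    for _ in range(n_sim):
        # 1 - random() lies in (0, 1], so log() never sees zero.
        null_p = [1.0 - rng.random() for _ in range(m)]
        if log_truncated_product(null_p, tau) <= w_obs:
            hits += 1
    return (hits + 1) / (n_sim + 1)


# Hypothetical marker-pair interaction P values for one gene pair.
pair_pvals = [0.01, 0.04, 0.5]
print(min_p_sidak(pair_pvals), simes(pair_pvals), truncated_product_pvalue(pair_pvals))
```

With correlated marker pairs, these independence-based P values would generally be miscalibrated, which is the gap the analytical correlation formula in the paper is meant to close.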
|
136
|
Zakharov S, Salim A, Thalamuthu A. Comparison of similarity-based tests and pooling strategies for rare variants. BMC Genomics 2013; 14:50. [PMID: 23343094 PMCID: PMC3600007 DOI: 10.1186/1471-2164-14-50] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/09/2012] [Accepted: 01/17/2013] [Indexed: 11/10/2022] Open
Abstract
Background As several rare genomic variants have been shown to affect common phenotypes, rare-variant association analysis has received considerable attention. Several efficient association tests using genotype and phenotype similarity measures have been proposed in the literature. The major advantages of similarity-based tests are their ability to accommodate multiple types of DNA variations within one association test, and to account for the possible interaction within a region. However, not much work has been done to compare the performance of similarity-based tests in rare-variant association scenarios, especially when applied with different rare-variant pooling strategies. Results Based on population genetics simulations and analysis of a publicly available sequencing data set, we compared the performance of four similarity-based tests and two rare-variant pooling strategies. We showed that the weighting approach outperforms collapsing in the presence of a strong effect from rare variants and in the presence of a moderate effect from common variants, whereas collapsing of rare variants is preferable when common variants have a strong effect. We also demonstrated that the difference in statistical power between the two pooling strategies may be substantial. The results also highlighted the consistently high power of two similarity-based approaches when applied with an appropriate pooling strategy. Conclusions Population genetics simulations and sequencing data set analysis showed high power of two similarity-based tests and a substantial difference in power between the two pooling strategies.
Affiliation(s)
- Sergii Zakharov
- Human Genetics, Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672, Singapore.
|
137
|
Wang J, Zhao Z, Cao Z, Yang A, Zhang J. A probabilistic method for identifying rare variants underlying complex traits. BMC Genomics 2013; 14 Suppl 1:S11. [PMID: 23369113 PMCID: PMC3549819 DOI: 10.1186/1471-2164-14-s1-s11] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Identifying the genetic variants that contribute to disease susceptibilities is important both for developing methodologies and for studying complex diseases in molecular biology. It has been demonstrated that the spectrum of minor allelic frequencies (MAFs) of risk genetic variants ranges from common to rare. Although association studies are shifting to incorporate rare variants (RVs) affecting complex traits, existing approaches do not show a high degree of success, and more efforts should be considered. RESULTS In this article, we focus on detecting associations between multiple rare variants and traits. Similar to RareCover, a widely used approach, we assume that variants located close to each other tend to have similar impacts on traits. Therefore, we introduce elevated regions and background regions, where the elevated regions are considered to have a higher chance of harboring causal variants. We propose a hidden Markov random field (HMRF) model to select a set of rare variants that potentially underlie the phenotype, and then a statistical test is applied. Thus, the association analysis can be achieved without pre-selection by experts. In our model, each variant has two hidden states that represent the causal/non-causal status and the region status. In addition, two Bayesian processes are used to compare and estimate the genotype, phenotype and model parameters. We compare our approach to three current methods using different types of datasets, and, although these are simulation experiments, our approach has higher statistical power than the other methods. The software package RareProb and the simulation datasets are available at: http://www.engr.uconn.edu/~jiw09003.
Affiliation(s)
- Jiayin Wang
- Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, PR China.
|
138
|
Schaid DJ, Sinnwell JP, Jenkins GD. Regression modeling of allele frequencies and testing Hardy Weinberg Equilibrium. Hum Hered 2013; 74:71-82. [PMID: 23328647 DOI: 10.1159/000345846] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 09/06/2012] [Accepted: 11/13/2012] [Indexed: 11/19/2022]
Abstract
BACKGROUND/AIMS Tests for whether observed genotype proportions fit Hardy Weinberg Equilibrium (HWE) are widely used in population genetics analyses, as well as to evaluate quality of genotype data. To date, all methods testing for HWE require subjects to be classified into discrete categories, yet it is becoming clear that the distribution of allele frequencies tends to be smooth over geographic regions. METHODS To evaluate the HWE assumption, we develop new approaches to model allele frequencies as functions of covariates, and use these models to test whether there is residual correlation between the two alleles of subjects; lack of residual correlation supports the null hypothesis of HWE, but conditional on how the covariates influence the allele frequencies. RESULTS By simulations, we illustrate that a simple statistical test of residual correlation of alleles adequately controls the type I error rate, while maintaining power that is comparable to standard tests for HWE. CONCLUSION Our approach can be implemented in standard software, enabling more flexible and powerful ways to evaluate the association of covariates with allele frequencies and whether these associations 'explain' departures from HWE when the covariates are ignored, opening new strategies to evaluate the quality of genotype data generated by next-generation sequencing assays.
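For orientation, the "standard tests for HWE" used as the comparison above can be illustrated with the classical Pearson chi-square goodness-of-fit test on genotype counts. This is only a minimal sketch of that baseline, not the authors' regression-based method; the function name `hwe_chi_square` and its arguments are hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square statistic (1 df) for departure from HWE
    at a biallelic marker, given the three genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotyped subject is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    # Genotype counts expected under HWE: n*p^2, 2*n*p*q, n*q^2
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip cells with zero expectation (monomorphic marker)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

The statistic is compared with a chi-square distribution with one degree of freedom. Note that this baseline needs subjects grouped into discrete categories, which is the limitation the abstract sets out to relax.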
Affiliation(s)
- Daniel J Schaid
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA.
|
139
|
Chen H, Meigs JB, Dupuis J. Sequence kernel association test for quantitative traits in family samples. Genet Epidemiol 2012; 37:196-204. [PMID: 23280576 DOI: 10.1002/gepi.21703] [Citation(s) in RCA: 172] [Impact Index Per Article: 13.2] [Received: 08/15/2012] [Revised: 11/12/2012] [Accepted: 11/22/2012] [Indexed: 11/12/2022]
Abstract
A large number of rare genetic variants have been discovered with the development in sequencing technology and the lowering of sequencing costs. Rare variant analysis may help identify novel genes associated with diseases and quantitative traits, adding to our knowledge of explaining heritability of these phenotypes. Many statistical methods for rare variant analysis have been developed in recent years, but some of them require the strong assumption that all rare variants in the analysis share the same direction of effect, and others, which require permutation to calculate P-values, are computationally intensive. Among these methods, the sequence kernel association test (SKAT) is a powerful method under many different scenarios. It does not require any assumption on the directionality of effects, and statistical significance is computed analytically. In this paper, we extend SKAT to be applicable to family data. The family-based SKAT (famSKAT) has a different test statistic and null distribution compared to SKAT, but is equivalent to SKAT when there is no familial correlation. Our simulation studies show that SKAT has inflated type I error if familial correlation is inappropriately ignored, but has appropriate type I error if applied to a single individual per family to obtain an unrelated subset. In contrast, famSKAT has the correct type I error when analyzing correlated observations, and it has higher power than competing methods in many different scenarios. We illustrate our approach to analyze the association of rare genetic variants using glycemic traits from the Framingham Heart Study.
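To make the SKAT statistic referred to above more concrete, here is a minimal sketch of the unscaled variance-component score statistic for unrelated subjects and a quantitative trait, Q = sum over variants j of (w_j * sum over subjects i of g_ij * (y_i - mean(y)))^2. It assumes an intercept-only null model, leaves the variant weights to the caller, and does not compute a p-value (which requires a mixture of chi-square distributions). It is not the famSKAT statistic, which additionally models familial correlation. The name `skat_q` is hypothetical.

```python
def skat_q(genotypes, phenotypes, weights=None):
    """Unscaled SKAT-type score statistic with a weighted linear kernel.

    genotypes  : list of rows, one per subject, each a list of minor-allele counts
    phenotypes : list of quantitative trait values, one per subject
    weights    : optional per-variant weights (assumed equal to 1 if omitted)
    """
    n = len(phenotypes)
    m = len(genotypes[0])
    if weights is None:
        weights = [1.0] * m
    # Residuals under the intercept-only null model (an assumption of this sketch)
    mean_y = sum(phenotypes) / n
    resid = [y - mean_y for y in phenotypes]
    q = 0.0
    for j in range(m):
        # Score contribution of variant j: genotype-by-residual cross product
        score_j = sum(genotypes[i][j] * resid[i] for i in range(n))
        q += (weights[j] * score_j) ** 2
    return q
```

Because each variant's contribution is squared before summing, variants with opposite directions of effect do not cancel, which is why no assumption on directionality is needed.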
Affiliation(s)
- Han Chen
- Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA.
|
140
|
Machiela MJ, Lindström S, Allen NE, Haiman CA, Albanes D, Barricarte A, Berndt SI, Bueno-de-Mesquita HB, Chanock S, Gaziano JM, Gapstur SM, Giovannucci E, Henderson BE, Jacobs EJ, Kolonel LN, Krogh V, Ma J, Stampfer MJ, Stevens VL, Stram DO, Tjønneland A, Travis R, Willett WC, Hunter DJ, Le Marchand L, Kraft P. Association of type 2 diabetes susceptibility variants with advanced prostate cancer risk in the Breast and Prostate Cancer Cohort Consortium. Am J Epidemiol 2012. [PMID: 23193118 DOI: 10.1093/aje/kws191] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.5] [Indexed: 12/14/2022]
Abstract
Observational studies have found an inverse association between type 2 diabetes (T2D) and prostate cancer (PCa), and genome-wide association studies have found common variants near 3 loci associated with both diseases. The authors examined whether a genetic background that favors T2D is associated with risk of advanced PCa. Data from the National Cancer Institute's Breast and Prostate Cancer Cohort Consortium, a genome-wide association study of 2,782 advanced PCa cases and 4,458 controls, were used to evaluate whether individual single nucleotide polymorphisms or aggregations of these 36 T2D susceptibility loci are associated with PCa. Ten T2D markers near 9 loci (NOTCH2, ADCY5, JAZF1, CDKN2A/B, TCF7L2, KCNQ1, MTNR1B, FTO, and HNF1B) were nominally associated with PCa (P < 0.05); the association for single nucleotide polymorphism rs757210 at the HNF1B locus was significant when multiple comparisons were accounted for (adjusted P = 0.001). Genetic risk scores weighted by the T2D log odds ratio and multilocus kernel tests also indicated a significant relation between T2D variants and PCa risk. A mediation analysis of 9,065 PCa cases and 9,526 controls failed to produce evidence that diabetes mediates the association of the HNF1B locus with PCa risk. These data suggest a shared genetic component between T2D and PCa and add to the evidence for an interrelation between these diseases.
Affiliation(s)
- Mitchell J Machiela
- Program in Molecular and Genetic Epidemiology, Department of Epidemiology, Harvard School of Public Health, Boston, MA 02115, USA
|
141
|
Ge T, Feng J, Hibar DP, Thompson PM, Nichols TE. Increasing power for voxel-wise genome-wide association studies: the random field theory, least square kernel machines and fast permutation procedures. Neuroimage 2012; 63:858-73. [PMID: 22800732 PMCID: PMC3635688 DOI: 10.1016/j.neuroimage.2012.07.012] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.9] [Received: 01/19/2012] [Revised: 07/04/2012] [Accepted: 07/07/2012] [Indexed: 12/20/2022]
Abstract
Imaging traits are thought to have more direct links to genetic variation than diagnostic measures based on cognitive or clinical assessments and provide a powerful substrate to examine the influence of genetics on human brains. Although imaging genetics has attracted growing attention and interest, most brain-wide genome-wide association studies focus on voxel-wise single-locus approaches, without taking advantage of the spatial information in images or combining the effect of multiple genetic variants. In this paper we present a fast implementation of voxel- and cluster-wise inferences based on the random field theory to fully use the spatial information in images. The approach is combined with a multi-locus model based on least square kernel machines to associate the joint effect of several single nucleotide polymorphisms (SNP) with imaging traits. A fast permutation procedure is also proposed which significantly reduces the number of permutations needed relative to the standard empirical method and provides accurate small p-value estimates based on parametric tail approximation. We explored the relation between 448,294 single nucleotide polymorphisms and 18,043 genes in 31,662 voxels of the entire brain across 740 elderly subjects from the Alzheimer's disease neuroimaging initiative (ADNI). Structural MRI scans were analyzed using tensor-based morphometry (TBM) to compute 3D maps of regional brain volume differences compared to an average template image based on healthy elderly subjects. We find our method to be more sensitive compared with voxel-wise single-locus approaches. A number of genes were identified as having significant associations with volumetric changes. The most associated gene was GRIN2B, which encodes the N-methyl-d-aspartate (NMDA) glutamate receptor NR2B subunit and affects both the parietal and temporal lobes in human brains. Its role in Alzheimer's disease has been widely acknowledged and studied, suggesting the validity of the approach. The various advantages over existing approaches indicate a great potential offered by this novel framework to detect genetic influences on human brains.
Affiliation(s)
- Tian Ge
- Centre for Computational Systems Biology and School of Mathematical Sciences, Fudan University, Shanghai, China
- Department of Computer Science, The University of Warwick, Coventry, UK
- Jianfeng Feng
- Centre for Computational Systems Biology and School of Mathematical Sciences, Fudan University, Shanghai, China
- Department of Computer Science, The University of Warwick, Coventry, UK
- Derrek P. Hibar
- Laboratory of Neuro Imaging, Department of Neurology, UCLA School of Medicine, Los Angeles, CA, USA
- Paul M. Thompson
- Laboratory of Neuro Imaging, Department of Neurology, UCLA School of Medicine, Los Angeles, CA, USA
- Thomas E. Nichols
- Department of Statistics & Warwick Manufacturing Group, The University of Warwick, Coventry, UK
- Oxford Centre for Functional MRI of the Brain (FMRIB), Nuffield Department of Clinical Neurosciences, Oxford University, UK
|
142
|
Zhang Y, Guan W, Pan W. Adjustment for population stratification via principal components in association analysis of rare variants. Genet Epidemiol 2012; 37:99-109. [PMID: 23065775 DOI: 10.1002/gepi.21691] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Received: 04/12/2012] [Revised: 09/11/2012] [Accepted: 09/13/2012] [Indexed: 11/07/2022]
Abstract
For unrelated samples, principal component (PC) analysis has been established as a simple and effective approach to adjusting for population stratification in association analysis of common variants (CVs, with minor allele frequencies MAF > 5%). However, it is less clear how it would perform in analysis of low-frequency variants (LFVs, MAF between 1% and 5%), or of rare variants (RVs, MAF < 1%). Furthermore, with next-generation sequencing data, it is unknown whether PCs should be constructed based on CVs, LFVs, or RVs. In this study, we used the 1000 Genomes Project sequence data to explore the construction of PCs and their use in association analysis of LFVs or RVs for unrelated samples. It is shown that a few top PCs based on either CVs or LFVs could separate two continental groups, European and African samples, but those based on only RVs performed less well. When applied to several association tests in simulated data with population stratification, using PCs based on either CVs or LFVs was effective in controlling Type I error rates, while nonadjustment led to inflated Type I error rates. Perhaps the most interesting observation is that, although the PCs based on LFVs could better separate the two continental groups than those based on CVs, the use of the former could lead to overadjustment in the sense of substantial power loss in the absence of population stratification; in contrast, we did not see any problem with the use of the PCs based on CVs in all our examples.
Affiliation(s)
- Yiwei Zhang
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455-0392, USA
|
143
|
Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SLR, Peyser PA, Lin X. SNP set association analysis for familial data. Genet Epidemiol 2012; 36:797-810. [PMID: 22968922 DOI: 10.1002/gepi.21676] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.8] [Received: 03/12/2012] [Revised: 07/06/2012] [Accepted: 07/30/2012] [Indexed: 11/06/2022]
Abstract
Genome-wide association studies (GWAS) are a popular approach for identifying common genetic variants and epistatic effects associated with a disease phenotype. The traditional statistical analysis of such GWAS attempts to assess the association between each individual single-nucleotide polymorphism (SNP) and the observed phenotype. Recently, kernel machine-based tests for association between a SNP set (e.g., SNPs in a gene) and the disease phenotype have been proposed as a useful alternative to the traditional individual-SNP approach, and allow for flexible modeling of the potentially complicated joint SNP effects in a SNP set while adjusting for covariates. We extend the kernel machine framework to accommodate related subjects from multiple independent families, and provide a score-based variance component test for assessing the association of a given SNP set with a continuous phenotype, while adjusting for additional covariates and accounting for within-family correlation. We illustrate the proposed method using simulation studies and an application to genetic data from the Genetic Epidemiology Network of Arteriopathy (GENOA) study.
Affiliation(s)
- Elizabeth D Schifano
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts
|
144
|
Li S, Cui Y. Gene-centric gene–gene interaction: A model-based kernel machine method. Ann Appl Stat 2012. [DOI: 10.1214/12-aoas545] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Indexed: 11/19/2022]
|
145
|
Maity A, Sullivan PF, Tzeng JY. Multivariate phenotype association analysis by marker-set kernel machine regression. Genet Epidemiol 2012; 36:686-95. [PMID: 22899176 DOI: 10.1002/gepi.21663] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.2] [Received: 04/01/2012] [Revised: 05/23/2012] [Accepted: 06/18/2012] [Indexed: 11/06/2022]
Abstract
Genetic studies of complex diseases often collect multiple phenotypes relevant to the disorders. As these phenotypes can be correlated and share common genetic mechanisms, jointly analyzing these traits may bring more power to detect genes influencing individual or multiple phenotypes. Given the advancement brought by the multivariate phenotype approaches and the multimarker kernel machine regression, we construct a multivariate regression based on kernel machine to facilitate the joint evaluation of multimarker effects on multiple phenotypes. The kernel machine serves as a powerful dimension-reduction tool to capture complex effects among markers. The multivariate framework incorporates the potentially correlated multidimensional phenotypic information and accommodates common or different environmental covariates for each trait. We derive the multivariate kernel machine test based on a score-like statistic, and conduct simulations to evaluate the validity and efficacy of the method. We also study the performance of the commonly adopted strategies for kernel machine analysis on multiple phenotypes, including the multiple univariate kernel machine tests with original phenotypes or with their principal components. Our results suggest that none of these approaches has the uniformly best power, and the optimal test depends on the magnitude of the phenotype correlation and the effect patterns. However, the multivariate test remains a reasonable approach when the multiple phenotypes have no or mild correlations, and gives the best power once the correlation becomes stronger or when there exist genes that affect more than one phenotype. We illustrate the utility of the multivariate kernel machine method through the Clinical Antipsychotic Trials of Intervention Effectiveness antibody study.
Affiliation(s)
- Arnab Maity
- Department of Statistics, North Carolina State University, Raleigh, USA
|
146
|
Lin WY, Tiwari HK, Gao G, Zhang K, Arcaroli JJ, Abraham E, Liu N. Similarity-based multimarker association tests for continuous traits. Ann Hum Genet 2012; 76:246-60. [PMID: 22497480 DOI: 10.1111/j.1469-1809.2012.00706.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Testing multiple markers simultaneously not only can capture the linkage disequilibrium patterns but also can decrease the number of tests and thus alleviate the multiple-testing penalty. If a gene is associated with a phenotype, subjects with similar genotypes in this gene should also have similar phenotypes. Based on this concept, we have developed a general framework that is applicable to continuous traits. Two similarity-based tests (namely, SIMc and SIMp tests) were derived as special cases of the general framework. In our simulation study, we compared the power of the two tests with that of the single-marker analysis, a standard haplotype regression, and a popular and powerful kernel machine regression. Our SIMc test outperforms other tests when the average R² (a measure of linkage disequilibrium) between the causal variant and the surrounding markers is larger than 0.3 or when the causal allele is common (say, frequency = 0.3). Our SIMp test outperforms other tests when the causal variant was introduced at common haplotypes (the maximum frequency of risk haplotypes >0.4). We also applied our two tests to an adiposity data set to show their utility.
Affiliation(s)
- Wan-Yu Lin
- Department of Biostatistics, University of Alabama at Birmingham, USA
|
147
|
Cai T, Lin X, Carroll RJ. Identifying genetic marker sets associated with phenotypes via an efficient adaptive score test. Biostatistics 2012; 13:776-90. [PMID: 22734045 DOI: 10.1093/biostatistics/kxs015] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Indexed: 12/22/2022]
Abstract
In recent years, genome-wide association studies (GWAS) and gene-expression profiling have generated a large number of valuable datasets for assessing how genetic variations are related to disease outcomes. With such datasets, it is often of interest to assess the overall effect of a set of genetic markers, assembled based on biological knowledge. Genetic marker-set analyses have been advocated as more reliable and powerful approaches compared with the traditional marginal approaches (Curtis and others, 2005. Pathways to the analysis of microarray data. TRENDS in Biotechnology 23, 429-435; Efroni and others, 2007. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One 2, 425). Procedures for testing the overall effect of a marker-set have been actively studied in recent years. For example, score tests derived under an Empirical Bayes (EB) framework (Liu and others, 2007. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics 63, 1079-1088; Liu and others, 2008. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC bioinformatics 9, 292-2; Wu and others, 2010. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of Human Genetics 86, 929) have been proposed as powerful alternatives to the standard Rao score test (Rao, 1948. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44, 50-57). The advantages of these EB-based tests are most apparent when the markers are correlated, due to the reduction in the degrees of freedom. 
In this paper, we propose an adaptive score test which up- or down-weights the contributions from each member of the marker-set based on the Z-scores of their effects. Such an adaptive procedure gains power over the existing procedures when the signal is sparse and the correlation among the markers is weak. By combining evidence from both the EB-based score test and the adaptive test, we further construct an omnibus test that attains good power in most settings. The null distributions of the proposed test statistics can be approximated well either via simple perturbation procedures or via distributional approximations. Through extensive simulation studies, we demonstrate that the proposed procedures perform well in finite samples. We apply the tests to a breast cancer genetic study to assess the overall effect of the FGFR2 gene on breast cancer risk.
Affiliation(s)
- Tianxi Cai
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA.
|
148
|
Wang K, Fingert JH. Statistical tests for detecting rare variants using variance-stabilising transformations. Ann Hum Genet 2012; 76:402-9. [PMID: 22724536 DOI: 10.1111/j.1469-1809.2012.00718.x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
Abstract
Next generation sequencing holds great promise for detecting rare variants underlying complex human traits. Due to their extremely low allele frequencies, the normality approximation for a proportion no longer works well. The Fisher's exact method appears to be suitable but it is conservative. We investigate the utility of various variance-stabilising transformations in single marker association analysis on rare variants. Unlike a proportion itself, the variance of the transformed proportions no longer depends on the proportion, making application of such transformations to rare variant association analysis extremely appealing. Simulation studies demonstrate that tests based on such transformations are more powerful than the Fisher's exact test while controlling for type I error rate. Based on theoretical considerations and results from simulation studies, we recommend the test based on the Anscombe transformation over tests with other transformations.
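As an illustration of the variance-stabilising idea described above, the following minimal sketch applies the Anscombe arcsine transformation to allele counts in cases and controls and forms a two-sample z-statistic. It assumes the textbook form of the transformation, arcsin(sqrt((x + 3/8) / (n + 3/4))), whose variance is approximately 1 / (4n + 2) regardless of the underlying proportion. The exact test in the paper may differ, and all function names here are hypothetical.

```python
import math

def anscombe(x, n):
    """Anscombe arcsine transform of a binomial count x out of n."""
    return math.asin(math.sqrt((x + 0.375) / (n + 0.75)))

def anscombe_z(x_case, n_case, x_ctrl, n_ctrl):
    """Two-sample z-statistic on the transformed scale.

    x_* are minor-allele counts and n_* are total allele counts
    (assumed binomial; this is a simplification of the sketch).
    """
    diff = anscombe(x_case, n_case) - anscombe(x_ctrl, n_ctrl)
    # Approximate variance of each transformed proportion: 1 / (4n + 2)
    var = 1.0 / (4 * n_case + 2) + 1.0 / (4 * n_ctrl + 2)
    return diff / math.sqrt(var)

def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))
```

For example, `two_sided_p(anscombe_z(10, 1000, 2, 1000))` tests whether 10 minor alleles among 1000 case chromosomes differ from 2 among 1000 control chromosomes. The variance term depends only on the sample sizes, which is the property the abstract highlights.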
Affiliation(s)
- Kai Wang
- Department of Biostatistics, College of Public Health, The University of Iowa, Iowa City, IA 52242, USA.
|
149
|
Abstract
Many common human diseases are complex and are expected to be highly heterogeneous, with multiple causative loci and multiple rare and common variants at some of the causative loci contributing to the risk of these diseases. Data from the genome-wide association studies (GWAS) and metadata such as known gene functions and pathways provide the possibility of identifying genetic variants, genes and pathways that are associated with complex phenotypes. Single-marker-based tests have been very successful in identifying thousands of genetic variants for hundreds of complex phenotypes. However, these variants only explain very small percentages of the heritabilities. To account for the locus- and allelic-heterogeneity, gene-based and pathway-based tests can be very useful in the next stage of the analysis of GWAS data. U-statistics, which summarize the genomic similarity between pairs of individuals and link the genomic similarity to phenotype similarity, have proved to be very useful for testing the associations between a set of single nucleotide polymorphisms and the phenotypes. Compared to single marker analysis, the advantage afforded by the U-statistics-based methods is large when the number of markers involved is large. We review several formulations of U-statistics in genetic association studies and point out the links of these statistics with other similarity-based tests of genetic association. Finally, potential applications of U-statistics in the analysis of next-generation sequencing data and rare variant association studies are discussed.
Affiliation(s)
- Hongzhe Li
- Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
|
150
|
Sun YV, Sung YJ, Tintle N, Ziegler A. Identification of genetic association of multiple rare variants using collapsing methods. Genet Epidemiol 2012; 35 Suppl 1:S101-6. [PMID: 22128049 DOI: 10.1002/gepi.20658] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 11/12/2022]
Abstract
Next-generation sequencing technology allows investigation of both common and rare variants in humans. Exomes are sequenced on the population level or in families to further study the genetics of human diseases. Genetic Analysis Workshop 17 (GAW17) provided exomic data from the 1000 Genomes Project and simulated phenotypes. These data enabled evaluations of existing and newly developed statistical methods for rare variant sequence analysis for which standard statistical methods fail because of the rareness of the alleles. Various alternative approaches have been proposed that overcome the rareness problem by combining multiple rare variants within a gene. These approaches are termed collapsing methods, and our GAW17 group focused on studying the performance of existing and novel collapsing methods using rare variants. All tested methods performed similarly, as measured by type I error and power. Inflated type I error fractions were consistently observed and might be caused by gametic phase disequilibrium between causal and noncausal rare variants in this relatively small sample as well as by population stratification. Incorporating prior knowledge, such as appropriate covariates and information on functionality of SNPs, increased the power of detecting associated genes. Overall, collapsing rare variants can increase the power of identifying disease-associated genes. However, studying genetic associations of rare variants remains a challenging task that requires further development and improvement in data collection, management, analysis, and computation.
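The collapsing idea described above, combining multiple rare variants within a gene into a single variable, can be sketched as a simple carrier indicator. This is a generic illustration rather than any specific method evaluated by the GAW17 group; the function name `collapse_rare` and the default 1% MAF threshold are assumptions made for the example.

```python
def collapse_rare(genotypes, mafs, maf_threshold=0.01):
    """Collapse rare variants in a gene into one carrier indicator per subject.

    genotypes     : list of rows, one per subject, each a list of minor-allele counts
    mafs          : minor allele frequency of each variant (same order as columns)
    maf_threshold : variants with MAF below this value are treated as rare
                    (the 1% default is an assumption of this sketch)
    """
    # Column indices of the variants considered rare
    rare = [j for j, f in enumerate(mafs) if f < maf_threshold]
    # A subject is coded 1 if they carry at least one rare allele, else 0
    return [1 if any(row[j] > 0 for j in rare) else 0 for row in genotypes]
```

The resulting 0/1 vector can then be tested against the phenotype with a standard single-variable test, which sidesteps the sparsity of each individual rare allele.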
Affiliation(s)
- Yan V Sun
- Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA.
|