1
Yin A, Yuan A, Tan MT. Highly robust causal semiparametric U-statistic with applications in biomedical studies. Int J Biostat 2024; 20:69-91. [PMID: 36433631 PMCID: PMC10225018 DOI: 10.1515/ijb-2022-0047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2022] [Accepted: 10/31/2022] [Indexed: 11/28/2022]
Abstract
With our increased ability to capture large data, causal inference has received renewed attention and is playing an ever-important role in biomedicine and economics. However, one major methodological hurdle is that existing methods rely on many unverifiable model assumptions. Thus robust modeling is a critically important approach complementary to sensitivity analysis, where it compares results under various model assumptions. The more robust a method is with respect to model assumptions, the more worthy it is. The doubly robust estimator (DRE) is a significant advance in this direction. However, in practice, many outcome measures are functionals of multiple distributions, and so are the associated estimands, which can only be estimated via U-statistics. Thus most existing DREs do not apply. This article proposes a broad class of highly robust U-statistic estimators (HREs), which use semiparametric specifications for both the propensity score and outcome models in constructing the U-statistic. Thus, the HRE is more robust than the existing DREs. We derive comprehensive asymptotic properties of the proposed estimators and perform extensive simulation studies to evaluate their finite sample performance and compare them with the corresponding parametric U-statistics and the naive estimators, which show significant advantages. Then we apply the method to analyze a clinical trial from the AIDS Clinical Trials Group.
Affiliation(s)
- Anqi Yin
  - Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington, DC 20057, USA
- Ao Yuan
  - Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington, DC 20057, USA
- Ming T. Tan
  - Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington, DC 20057, USA
2
Boutry S, Helaers R, Lenaerts T, Vikkula M. Rare variant association on unrelated individuals in case-control studies using aggregation tests: existing methods and current limitations. Brief Bioinform 2023; 24:bbad412. [PMID: 37974506 DOI: 10.1093/bib/bbad412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Revised: 10/14/2023] [Accepted: 10/28/2023] [Indexed: 11/19/2023] Open
Abstract
Over the past years, progress made in next-generation sequencing technologies and bioinformatics has sparked a surge in association studies. In particular, genome-wide association studies (GWASs) have demonstrated their effectiveness in identifying disease associations with common genetic variants. Yet, rare variants can contribute to additional disease risk or trait heterogeneity. Because GWASs are underpowered for detecting association with such variants, numerous statistical methods have been recently proposed. Aggregation tests collapse multiple rare variants within a genetic region (e.g. gene, gene set, genomic locus) to test for association. An increasing number of studies using such methods have successfully identified trait-associated rare variants and led to a better understanding of the underlying disease mechanism. In this review, we compare existing aggregation tests, their statistical features and scope of application, splitting them into the five classical classes: burden, adaptive burden, variance-component, omnibus and other. Finally, we describe some limitations of current aggregation tests, highlighting potential directions for further investigations.
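The collapsing step that the abstract describes can be sketched as follows. This is only a minimal illustration of the simplest (burden) class, not a method taken from the review: the function names, the 1% frequency threshold, the unweighted sum, and the toy data are all assumptions made for the example, and no p-value is computed.

```python
from statistics import mean

def burden_scores(genotypes, maf, maf_threshold=0.01):
    """Collapse each individual's rare variants in a region into one score.

    genotypes: one list per individual of minor-allele counts (0, 1 or 2).
    maf: minor allele frequency of each variant, in the same column order.
    maf_threshold: hypothetical cutoff below which a variant counts as rare.
    """
    rare = [j for j, freq in enumerate(maf) if freq < maf_threshold]
    return [sum(row[j] for j in rare) for row in genotypes]

def burden_difference(scores, is_case):
    """Mean burden score in cases minus mean burden score in controls."""
    cases = [s for s, c in zip(scores, is_case) if c]
    controls = [s for s, c in zip(scores, is_case) if not c]
    return mean(cases) - mean(controls)

# Toy data (made up): 6 individuals, 4 variants.
# The last variant is common (MAF 0.30) and is excluded from the score.
genotypes = [
    [1, 0, 0, 2],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 2],
    [0, 0, 1, 0],
]
maf = [0.005, 0.004, 0.002, 0.30]
is_case = [True, True, True, False, False, False]

scores = burden_scores(genotypes, maf)
diff = burden_difference(scores, is_case)
print(scores, diff)
```

In practice the per-region score would enter a regression or permutation test; the adaptive, variance-component and omnibus classes named in the abstract differ in how variants are weighted and combined, which this sketch does not attempt to show.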
Affiliation(s)
- Simon Boutry
  - Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, 1050 Brussels, Belgium
- Raphaël Helaers
  - Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
- Tom Lenaerts
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, 1050 Brussels, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, 1050 Brussels, Belgium
  - Artificial Intelligence laboratory, Vrije Universiteit Brussel, 1050 Brussels, Belgium
- Miikka Vikkula
  - Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
  - WELBIO department, WEL Research Institute, avenue Pasteur, 6, 1300 Wavre, Belgium
3
Pluta D, Shen T, Xue G, Chen C, Ombao H, Yu Z. Ridge-penalized adaptive Mantel test and its application in imaging genetics. Stat Med 2021; 40:5313-5332. [PMID: 34216035 DOI: 10.1002/sim.9127] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/20/2020] [Revised: 06/01/2021] [Accepted: 06/16/2021] [Indexed: 01/23/2023]
Abstract
We propose a ridge-penalized adaptive Mantel test (AdaMant) for evaluating the association of two high-dimensional sets of features. By introducing a ridge penalty, AdaMant tests the association across many metrics simultaneously. We demonstrate how ridge penalization bridges Euclidean and Mahalanobis distances and their corresponding linear models from the perspective of association measurement and testing. This result is not only theoretically interesting but also has important implications in penalized hypothesis testing, especially in high-dimensional settings such as imaging genetics. Applying the proposed method to an imaging genetic study of visual working memory in healthy adults, we identified interesting associations of brain connectivity (measured by electroencephalogram coherence) with selected genetic features.
Affiliation(s)
- Dustin Pluta
  - Department of Statistics, University of California, Irvine, Irvine, California, USA
- Tong Shen
  - Department of Statistics, University of California, Irvine, Irvine, California, USA
- Gui Xue
  - Center for Brain and Learning Science, Beijing Normal University, Beijing, China
- Chuansheng Chen
  - Department of Psychology and Social Behavior, University of California, Irvine, Irvine, California, USA
- Hernando Ombao
  - Statistics Program, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Zhaoxia Yu
  - Department of Statistics, University of California, Irvine, Irvine, California, USA
4
Determining population stratification and subgroup effects in association studies of rare genetic variants for nicotine dependence. Psychiatr Genet 2020; 29:111-119. [PMID: 31033776 PMCID: PMC6636808 DOI: 10.1097/ypg.0000000000000227] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/26/2022]
Abstract
Background: Rare variants (minor allele frequency < 1% or 5%) can help researchers to deal with the confounding issue of ‘missing heritability’ and have a proven role in dissecting the etiology for human diseases and complex traits. Methods: We extended the combined multivariate and collapsing (CMC) and weighted sum statistic (WSS) methods and accounted for the effects of population stratification and subgroup effects using stratified analyses by the principal component analysis, named here as ‘str-CMC’ and ‘str-WSS’. To evaluate the validity of the extended methods, we analyzed the Genetic Architecture of Smoking and Smoking Cessation database, which includes African Americans and European Americans genotyped on Illumina Human Omni2.5, and we compared the results with those obtained with the sequence kernel association test (SKAT) and its modification, SKAT-O, that included population stratification and subgroup effect as covariates. We utilized the Cochran–Mantel–Haenszel test to check for possible differences in single nucleotide polymorphism allele frequency between subgroups within a gene. We aimed to detect rare variants and considered population stratification and subgroup effects in the genomic region containing 39 acetylcholine receptor-related genes. Results: The Cochran–Mantel–Haenszel test as applied to GABRG2 (P = 0.001) was significant. However, GABRG2 was detected both by str-CMC (P = 8.04E-06) and str-WSS (P = 0.046) in African Americans but not by SKAT or SKAT-O. Conclusions: Our results imply that if associated rare variants are only specific to a subgroup, a stratified analysis might be a better approach than a combined analysis.
5
Gloaguen E, Dizier MH, Boissel M, Rocheleau G, Canouil M, Froguel P, Tichet J, Roussel R, Julier C, Balkau B, Mathieu F. General regression model: A "model-free" association test for quantitative traits allowing to test for the underlying genetic model. Ann Hum Genet 2019; 84:280-290. [PMID: 31834638 DOI: 10.1111/ahg.12372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2018] [Revised: 11/19/2019] [Accepted: 11/20/2019] [Indexed: 11/26/2022]
Abstract
Most genome-wide association studies used genetic-model-based tests assuming an additive mode of inheritance, leading to underpowered association tests in case of departure from additivity. The general regression model (GRM) association test proposed by Fisher and Wilson in 1980 makes no assumption on the genetic model. Interestingly, it also allows formal testing of the underlying genetic model. We conducted a simulation study of quantitative traits to compare the power of the GRM test to the classical linear regression tests, the maximum of the three statistics (MAX), and the allele-based (allelic) tests. Simulations were performed on two sample sizes, using a large panel of genetic models, varying genetic models, minor allele frequencies, and the percentage of explained variance. In case of departure from additivity, the GRM was more powerful than the additive regression tests (power gain reaching 80%) and had similar power when the true model is additive. GRM was also as or more powerful than the MAX or allelic tests. The true simulated model was mostly retained by the GRM test. Application of GRM to HbA1c illustrates its gain in power. To conclude, GRM increases power to detect association for quantitative traits, allows determining the genetic model and is easily applicable.
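A genotype test that assumes no mode of inheritance can be illustrated with a small sketch. It rests on an assumption that is not stated in the abstract: that the model regresses the trait on an additive term (minor-allele count 0, 1, 2) plus a deviation-from-additivity term (heterozygote indicator). With both terms the model is saturated over the three genotype groups, so the joint 2-df test reduces to a one-way ANOVA F statistic. The function name and toy data are hypothetical, and this is not the authors' implementation.

```python
from statistics import mean

def genotype_f_statistic(genotypes, trait):
    """One-way ANOVA F statistic of a quantitative trait across genotype groups.

    With all three genotypes (0, 1, 2) present this has (2, n - 3) degrees of
    freedom and equals the joint test of the additive and deviation terms.
    Assumes at least two groups and some within-group variation.
    """
    groups = {}
    for g, y in zip(genotypes, trait):
        groups.setdefault(g, []).append(y)
    n = len(trait)
    k = len(groups)
    grand_mean = mean(trait)
    ss_between = sum(
        len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values()
    )
    ss_within = sum(
        (y - mean(ys)) ** 2 for ys in groups.values() for y in ys
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy recessive pattern (made up): only the homozygous minor group is shifted.
genotypes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
trait = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

f_stat = genotype_f_statistic(genotypes, trait)
print(f_stat)
```

An additive-only test fits a single slope through the three group means, which is where power is lost when the heterozygote mean sits at one of the homozygote means, as in this toy example.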
Affiliation(s)
- Emilie Gloaguen
  - Inserm UMRS-958, Paris, France
  - Université Paris Diderot, Sorbonne Paris Cité, Paris, France
- Marie-Hélène Dizier
  - Inserm UMR-946, Paris, France
  - Université Paris Diderot, Sorbonne Paris Cité, Paris, France
- Mathilde Boissel
  - Université de Lille, UMR 8199 - EGID, Lille, France
  - CNRS, Paris, France
  - Institut Pasteur de Lille, Lille, France
- Ghislain Rocheleau
  - Université de Lille, UMR 8199 - EGID, Lille, France
  - CNRS, Paris, France
  - Institut Pasteur de Lille, Lille, France
- Mickaël Canouil
  - Université de Lille, UMR 8199 - EGID, Lille, France
  - CNRS, Paris, France
  - Institut Pasteur de Lille, Lille, France
- Philippe Froguel
  - Université de Lille, UMR 8199 - EGID, Lille, France
  - CNRS, Paris, France
  - Institut Pasteur de Lille, Lille, France
  - Department of Genomics of Common Disease, Imperial College London, London, United Kingdom
- Ronan Roussel
  - Inserm U1138, Centre de Recherche des Cordeliers, Paris, France
  - Université Paris Diderot, Sorbonne Paris Cité, Paris, France
  - Diabetology, Endocrinology and Nutrition Department, DHU FIRE, Hôpital Bichat, AP-HP, Paris, France
-
  - Inserm UMRS-958, Paris, France
- Cécile Julier
  - Inserm UMRS-958, Paris, France
  - Université Paris Diderot, Sorbonne Paris Cité, Paris, France
- Flavie Mathieu
  - Mission Associations Recherche & Société - Inserm Siège, DISC, Paris, France
  - Paris Diderot, Sorbonne Paris Cité, Paris, France
6
Liu Y, Huang J, Urbanowicz RJ, Chen K, Manduchi E, Greene CS, Moore JH, Scheet P, Chen Y. Embracing study heterogeneity for finding genetic interactions in large-scale research consortia. Genet Epidemiol 2019; 44:52-66. [PMID: 31583758 DOI: 10.1002/gepi.22262] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/13/2018] [Revised: 08/02/2019] [Accepted: 08/09/2019] [Indexed: 11/12/2022]
Abstract
Genetic interactions have been recognized as a potentially important contributor to the heritability of complex diseases. Nevertheless, due to small effect sizes and stringent multiple-testing correction, identifying genetic interactions in complex diseases is particularly challenging. To address the above challenges, many genomic research initiatives collaborate to form large-scale consortia and develop open access to enable sharing of genome-wide association study (GWAS) data. Despite the perceived benefits of data sharing from large consortia, a number of practical issues have arisen, such as privacy concerns on individual genomic information and heterogeneous data sources from distributed GWAS databases. In the context of large consortia, we demonstrate that the heterogeneously appearing marginal effects over distributed GWAS databases can offer new insights into genetic interactions for which conventional methods have had limited success. In this paper, we develop a novel two-stage testing procedure, named phylogenY-based effect-size tests for interactions using first 2 moments (YETI2), to detect genetic interactions through both pooled marginal effects, in terms of averaging site-specific marginal effects, and heterogeneity in marginal effects across sites, using a meta-analytic framework. YETI2 can not only be applied to large consortia without shared personal information but also can be used to leverage underlying heterogeneity in marginal effects to prioritize potential genetic interactions. We investigate the performance of YETI2 through simulation studies and apply YETI2 to bladder cancer data from dbGaP.
Affiliation(s)
- Yulun Liu
  - Department of Population and Data Sciences, The University of Texas Southwestern Medical Center, Dallas, Texas
- Jing Huang
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
- Ryan J Urbanowicz
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
- Kun Chen
  - Department of Statistics, University of Connecticut, Storrs, Connecticut
- Elisabetta Manduchi
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
- Casey S Greene
  - Department of Pharmacology, University of Pennsylvania, Philadelphia, Pennsylvania
- Jason H Moore
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
- Paul Scheet
  - Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, Texas
- Yong Chen
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania
7
Wei C, Li M, Wen Y, Ye C, Lu Q. A multi-locus predictiveness curve and its summary assessment for genetic risk prediction. Stat Methods Med Res 2019; 29:44-56. [PMID: 30612522 DOI: 10.1177/0962280218819202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Genetic association studies using high-throughput genotyping and sequencing technologies have identified a large number of genetic variants associated with complex human diseases. These findings have provided an unprecedented opportunity to identify individuals in the population at high risk for disease who carry causal genetic mutations and hold great promise for early intervention and individualized medicine. While interest is high in building risk prediction models based on recent genetic findings, it is crucial to have appropriate statistical measurements to assess the performance of a genetic risk prediction model. Predictiveness curves were recently proposed as a graphic tool for evaluating a risk prediction model on the basis of a single continuous biomarker. The curve evaluates a risk prediction model for classification performance as well as its usefulness when applied to a population. In this article, we extend the predictiveness curve to measure the collective contribution of multiple genetic variants. We further propose a nonparametric, U-statistics-based measurement, referred to as the U-Index, to quantify the performance of a multi-locus predictiveness curve. In particular, a global U-Index and a partial U-Index can be used in the general population and a subpopulation of particular clinical interest, respectively. Through simulation studies, we demonstrate that the proposed U-Index has advantages over several existing summary statistics under various disease models. We also show that the partial U-Index can have its own uniqueness when rare variants have a substantial contribution to disease risk. Finally, we use the proposed predictiveness curve and its corresponding U-Index to evaluate the performance of a genetic risk prediction model for nicotine dependence.
Affiliation(s)
- Changshuai Wei
  - Core Artificial Intelligence, Amazon.com Inc, Seattle, WA, USA
- Ming Li
  - Department of Epidemiology and Biostatistics, Indiana University at Bloomington, Bloomington, IN, USA
- Yalu Wen
  - Department of Statistics, University of Auckland, Auckland, New Zealand
- Chengyin Ye
  - Department of Health Management, Hangzhou Normal University, Hangzhou, China
- Qing Lu
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, USA
8
Wang X, Wang S, Meng X. A novel SNP-set analytical method without distinguishing common variants or rare variants in genome-wide association study. Int J Biomath 2018. [DOI: 10.1142/s1793524518500948] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Single nucleotide polymorphism (SNP)-set analysis in genome-wide association studies (GWASs) has become a hot topic. Most existing SNP-set analytical methods are designed for, and work well under, a particular nature of the variants (common or rare) and of the associated diseases. But whether the disease-associated variants are common or rare cannot be known in advance. Therefore, in this research, we proposed a new and powerful weighted function method without distinguishing common or rare variants to select tagging SNP-set. We applied our selection method to sequence kernel association test (SKAT) and compared the power with some existing methods. The simulation results showed that our method has higher power not only than unweighted SKAT, but also than SKAT with other weight functions. Moreover, the power is improved significantly when the minor allele frequency (MAF) of causal SNP is relatively small.
Affiliation(s)
- Xinzeng Wang
  - State Key Laboratory of Mining Disaster Prevention and Control Co-founded by Shandong Province and the Ministry of Science and Technology, Shandong University of Science and Technology, Qingdao 266590, P. R. China
  - College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao 266510, P. R. China
- Shudong Wang
  - College of Computer and Communication Engineering, China University of Petroleum (East China), Qingdao, Shandong 266580, P. R. China
- Xinzhu Meng
  - State Key Laboratory of Mining Disaster Prevention and Control Co-founded by Shandong Province and the Ministry of Science and Technology, Shandong University of Science and Technology, Qingdao 266590, P. R. China
  - College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao 266510, P. R. China
9
Wei C, Lu Q. A generalized association test based on U statistics. Bioinformatics 2018; 33:1963-1971. [PMID: 28334117 DOI: 10.1093/bioinformatics/btx103] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 11/01/2016] [Accepted: 02/15/2017] [Indexed: 11/13/2022] Open
Abstract
Motivation: Second generation sequencing technologies are being increasingly used for genetic association studies, where the main research interest is to identify sets of genetic variants that contribute to various phenotypes. The phenotype can be univariate disease status, multivariate responses and even high-dimensional outcomes. Considering the genotype and phenotype as two complex objects, this also poses a general statistical problem of testing association between complex objects. Results: We here proposed a similarity-based test, generalized similarity U (GSU), that can test the association between complex objects. We first studied the theoretical properties of the test in a general setting and then focused on the application of the test to sequencing association studies. Based on theoretical analysis, we proposed to use Laplacian Kernel-based similarity for GSU to boost power and enhance robustness. Through simulation, we found that GSU did have advantages over existing methods in terms of power and robustness. We further performed a whole genome sequencing (WGS) scan for Alzheimer's Disease Neuroimaging Initiative data, identifying three genes, APOE, APOC1 and TOMM40, associated with imaging phenotype. Availability and Implementation: We developed a C++ package for analysis of WGS data using GSU. The source codes can be downloaded at https://github.com/changshuaiwei/gsu. Contact: weichangshuai@gmail.com; qlu@epi.msu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Changshuai Wei
  - Department of Biostatistics and Epidemiology, University of North Texas Health Science Center, Fort Worth, TX 76107
- Qing Lu
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
10
Reexamining Dis/Similarity-Based Tests for Rare-Variant Association with Case-Control Samples. Genetics 2018; 209:105-113. [PMID: 29545466 DOI: 10.1534/genetics.118.300769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2018] [Accepted: 03/02/2018] [Indexed: 11/18/2022] Open
Abstract
A properly designed distance-based measure can capture informative genetic differences among individuals with different phenotypes and can be used to detect variants responsible for the phenotypes. To detect associated variants, various tests have been designed to contrast genetic dissimilarity or similarity scores of certain subject groups in different ways, among which the most widely used strategy is to quantify the difference between the within-group genetic dissimilarity/similarity (i.e., case-case and control-control similarities) and the between-group dissimilarity/similarity (i.e., case-control similarities). While it has been noted that for common variants, the within-group and the between-group measures should all be included; in this work, we show that for rare variants, comparison based on the two within-group measures can more effectively quantify the genetic difference between cases and controls. The between-group measure tends to overlap with one of the two within-group measures for rare variants, although such overlap is not present for common variants. Consequently, a dissimilarity or similarity test that includes the between-group information tends to attenuate the association signals and leads to power loss. Based on these findings, we propose a dissimilarity test that compares the degree of SNP dissimilarity within cases to that within controls to better characterize the difference between two disease phenotypes. We provide the statistical properties, asymptotic distribution, and computation details for a small sample size of the proposed test. We use simulated and real sequence data to assess the performance of the proposed test, comparing it with other rare-variant methods including those similarity-based tests that use both within-group and between-group information. As similarity-based approaches serve as one of the dominating approaches in rare-variant analysis, our results provide some insight for the effective detection of rare variants.
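The within-group comparison described above can be sketched in a few lines. This is only an illustration under assumptions of our own: Manhattan distance between genotype vectors as the dissimilarity, an unstandardized difference of average pairwise dissimilarities as the contrast, and made-up toy data. The abstract does not give the paper's actual statistic or its null distribution, so none of this should be read as the proposed test itself.

```python
from itertools import combinations

def mean_pairwise_dissimilarity(group):
    """Average Manhattan distance over all unordered pairs of genotype vectors.

    Requires at least two individuals in the group.
    """
    pairs = list(combinations(group, 2))
    total = sum(sum(abs(a - b) for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)

def within_group_contrast(cases, controls):
    """Within-case dissimilarity minus within-control dissimilarity.

    Case-control (between-group) pairs are deliberately left out.
    """
    return mean_pairwise_dissimilarity(cases) - mean_pairwise_dissimilarity(controls)

# Toy data (made up): each case carries a different rare allele,
# while controls are almost all non-carriers.
cases = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
controls = [[0, 0, 0], [0, 0, 0], [0, 0, 1]]

contrast = within_group_contrast(cases, controls)
print(contrast)
```

Each average is a U-statistic of order two over one group, so a reference distribution could in principle be obtained by permuting case-control labels; that step is omitted here.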
11
Jadhav S, Tong X, Lu Q. A functional U-statistic method for association analysis of sequencing data. Genet Epidemiol 2017; 41:636-643. [PMID: 28850771 DOI: 10.1002/gepi.22063] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/09/2017] [Revised: 06/06/2017] [Accepted: 07/10/2017] [Indexed: 11/08/2022]
Abstract
Although sequencing studies hold great promise for uncovering novel variants predisposing to human diseases, the high dimensionality of the sequencing data brings tremendous challenges to data analysis. Moreover, for many complex diseases (e.g., psychiatric disorders) multiple related phenotypes are collected. These phenotypes can be different measurements of an underlying disease, or measurements characterizing multiple related diseases for studying common genetic mechanism. Although jointly analyzing these phenotypes could potentially increase the power of identifying disease-associated genes, the different types of phenotypes pose challenges for association analysis. To address these challenges, we propose a nonparametric method, functional U-statistic method (FU), for multivariate analysis of sequencing data. It first constructs smooth functions from individuals' sequencing data, and then tests the association of these functions with multiple phenotypes by using a U-statistic. The method provides a general framework for analyzing various types of phenotypes (e.g., binary and continuous phenotypes) with unknown distributions. Fitting the genetic variants within a gene using a smoothing function also allows us to capture complexities of gene structure (e.g., linkage disequilibrium, LD), which could potentially increase the power of association analysis. Through simulations, we compared our method to the multivariate outcome score test (MOST), and found that our test attained better performance than MOST. In a real data application, we apply our method to the sequencing data from Minnesota Twin Study (MTS) and found potential associations of several nicotine receptor subunit (CHRN) genes, including CHRNB3, associated with nicotine dependence and/or alcohol dependence.
Affiliation(s)
- Sneha Jadhav
  - Department of Statistics and Probability, Michigan State University, East Lansing, Michigan, United States of America
- Xiaoran Tong
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, United States of America
- Qing Lu
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, United States of America
12
Dizier MH, Demenais F, Mathieu F. Gain of power of the general regression model compared to Cochran-Armitage Trend tests: simulation study and application to bipolar disorder. BMC Genet 2017; 18:24. [PMID: 28283021 PMCID: PMC5345257 DOI: 10.1186/s12863-017-0486-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 10/18/2016] [Accepted: 03/02/2017] [Indexed: 11/25/2022] Open
Abstract
Background: Most genome-wide association studies assumed an additive model of inheritance, which may result in significant loss of power when there is a strong departure from additivity. The General Regression Model (GRM), which allows performing an assumption-free test for association by testing for both additive effect and deviation from additive effect, may be more appropriate for association tests. Additionally, GRM allows testing the underlying genetic model. We compared the power of the GRM association test to additive and other Cochran-Armitage Trend (CAT) tests through simulations and by applying GRM to a large case/control sample, the bipolar Wellcome Trust Case Control Cohort data. Simulations were performed on two sets of case/control samples (1000/1000 and 2000/2000), using a large panel of genetic models. Four association tests (GRM and additive, recessive and dominant CAT tests) were applied to all replicates. Results: We showed that GRM power to detect association was similar to or greater than that of the additive CAT test, in particular in case of recessive inheritance, with up to 67% gain in power. GRM analysis of genome-wide bipolar disorder Wellcome Trust Consortium data (1998 cases/3004 controls) showed significant association in the 16p12 region (rs420259; P = 3.4E-7) which has not been identified using the additive CAT test. As expected, rs420259 fitted a non-additive (recessive) model. Conclusions: GRM provides increased power compared to the additive CAT test for association studies and is easily applicable. Electronic supplementary material: The online version of this article (doi:10.1186/s12863-017-0486-6) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Marie-Hélène Dizier
  - Genetic Variation and Human Diseases Unit, UMR-946, Inserm, Université Paris Diderot, Université Sorbonne Paris Cité, Paris, France
- Florence Demenais
  - Genetic Variation and Human Diseases Unit, UMR-946, Inserm, Université Paris Diderot, Université Sorbonne Paris Cité, Paris, France
- Flavie Mathieu
  - Inserm Siège, Université Paris Diderot, Université Sorbonne Paris Cité, Paris, France
13
Kwak IY, Pan W. Gene- and pathway-based association tests for multiple traits with GWAS summary statistics. Bioinformatics 2017; 33:64-71. [PMID: 27592708 PMCID: PMC5198520 DOI: 10.1093/bioinformatics/btw577] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 04/29/2016] [Revised: 08/08/2016] [Accepted: 08/29/2016] [Indexed: 11/15/2022] Open
Abstract
To identify novel genetic variants associated with complex traits and to shed new insights on underlying biology, in addition to the most popular single SNP-single trait association analysis, it would be useful to explore multiple correlated (intermediate) traits at the gene- or pathway-level by mining existing single GWAS or meta-analyzed GWAS data. For this purpose, we present an adaptive gene-based test and a pathway-based test for association analysis of multiple traits with GWAS summary statistics. The proposed tests are adaptive at both the SNP- and trait-levels; that is, they account for possibly varying association patterns (e.g. signal sparsity levels) across SNPs and traits, thus maintaining high power across a wide range of situations. Furthermore, the proposed methods are general: they can be applied to mixed types of traits, and to Z-statistics or P-values as summary statistics obtained from either a single GWAS or a meta-analysis of multiple GWAS. Our numerical studies with simulated and real data demonstrated the promising performance of the proposed methods. Availability and implementation: The methods are implemented in R package aSPU, freely and publicly available at: https://cran.r-project.org/web/packages/aSPU/ Contact: weip@biostat.umn.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Il-Youp Kwak
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
| | - Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
| |
|
14
|
Yoo YJ, Sun L, Poirier JG, Paterson AD, Bull SB. Multiple linear combination (MLC) regression tests for common variants adapted to linkage disequilibrium structure. Genet Epidemiol 2016; 41:108-121. [PMID: 27885705 PMCID: PMC5245123 DOI: 10.1002/gepi.22024] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 01/31/2016] [Revised: 05/25/2016] [Accepted: 09/27/2016] [Indexed: 11/21/2022]
Abstract
By jointly analyzing multiple variants within a gene, instead of one at a time, gene‐based multiple regression can improve power, robustness, and interpretation in genetic association analysis. We investigate multiple linear combination (MLC) test statistics for analysis of common variants under realistic trait models with linkage disequilibrium (LD) based on HapMap Asian haplotypes. MLC is a directional test that exploits LD structure in a gene to construct clusters of closely correlated variants recoded such that the majority of pairwise correlations are positive. It combines variant effects within the same cluster linearly, and aggregates cluster‐specific effects in a quadratic sum of squares and cross‐products, producing a test statistic with reduced degrees of freedom (df) equal to the number of clusters. By simulation studies of 1000 genes from across the genome, we demonstrate that MLC is a well‐powered and robust choice among existing methods across a broad range of gene structures. Compared to minimum P‐value, variance‐component, and principal‐component methods, the mean power of MLC is never much lower than that of other methods, and can be higher, particularly with multiple causal variants. Moreover, the variation in gene‐specific MLC test size and power across 1000 genes is less than that of other methods, suggesting it is a complementary approach for discovery in genome‐wide analysis. The cluster construction of the MLC test statistics helps reveal within‐gene LD structure, allowing interpretation of clustered variants as haplotypic effects, while multiple regression helps to distinguish direct and indirect associations.
Affiliation(s)
- Yun Joo Yoo
- Department of Mathematics Education, Seoul National University, Seoul, South Korea.,Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
| | - Lei Sun
- Department of Statistical Sciences, University of Toronto, Toronto, Canada.,Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
| | - Julia G Poirier
- Prosserman Centre for Health Research, Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Canada
| | - Andrew D Paterson
- Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada.,Program in Genetics and Genome Biology, Hospital for Sick Children Research Institute, Toronto, Canada
| | - Shelley B Bull
- Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada.,Prosserman Centre for Health Research, Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Canada
| |
|
15
|
Li M, Wei C, Wen Y, Wang T, Lu Q. Detecting Gene-Gene Interactions Associated with Multiple Complex Traits with U-Statistics. Curr Genomics 2016; 17:403-415. [PMID: 28479869 PMCID: PMC5320542 DOI: 10.2174/1389202917666160513100946] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/11/2015] [Revised: 05/26/2015] [Accepted: 06/06/2015] [Indexed: 12/02/2022] Open
Abstract
Many complex diseases, such as psychiatric and behavioral disorders, are commonly characterized through various measurements that reflect physical, behavioral and psychological aspects of diseases. While it remains a great challenge to find a unified measurement to characterize a disease, the available multiple phenotypes can be analyzed jointly in the genetic association study. Simultaneously testing these phenotypes has many advantages, including considering different aspects of the disease in the analysis, and utilizing correlated phenotypes to improve the power of detecting disease-associated variants. Furthermore, complex diseases are likely caused by the interplay of multiple genetic variants through complicated mechanisms. Considering gene-gene interactions in the joint association analysis of complex diseases could further increase our ability to discover genetic variants involving complex disease pathways. In this article, we propose a stepwise U-test for joint association analysis of multiple loci and multiple phenotypes. Through simulations, we demonstrated that testing multiple phenotypes simultaneously could attain higher power than testing one single phenotype at a time, especially when there are shared genes contributing to multiple phenotypes. We also illustrated the proposed method with an application to Nicotine Dependence (ND), using datasets from the Study of Addiction: Genetics and Environment (SAGE). The joint analysis of three ND phenotypes identified two SNPs, rs10508649 and rs2491397, and reached a nominal P-value of 3.79e-13. The association was further replicated in two independent datasets with P-values of 2.37e-05 and 7.46e-05.
Affiliation(s)
- Ming Li
- 1Department of Epidemiology and Biostatistics, Indiana University at Bloomington, Bloomington, IN 47405, U.S.A; 2Department of Epidemiology and Biostatistics, University of North Texas Health Science Center, Fort Worth, TX 76107, U.S.A; 3Department of Statistics, University of Auckland, Auckland 1010, New Zealand; 4Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, P.R. China; 5Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, U.S.A
| | - Changshuai Wei
- 1Department of Epidemiology and Biostatistics, Indiana University at Bloomington, Bloomington, IN 47405, U.S.A; 2Department of Epidemiology and Biostatistics, University of North Texas Health Science Center, Fort Worth, TX 76107, U.S.A; 3Department of Statistics, University of Auckland, Auckland 1010, New Zealand; 4Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, P.R. China; 5Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, U.S.A
| | - Yalu Wen
- 1Department of Epidemiology and Biostatistics, Indiana University at Bloomington, Bloomington, IN 47405, U.S.A; 2Department of Epidemiology and Biostatistics, University of North Texas Health Science Center, Fort Worth, TX 76107, U.S.A; 3Department of Statistics, University of Auckland, Auckland 1010, New Zealand; 4Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, P.R. China; 5Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, U.S.A
| | - Tong Wang
- 1Department of Epidemiology and Biostatistics, Indiana University at Bloomington, Bloomington, IN 47405, U.S.A; 2Department of Epidemiology and Biostatistics, University of North Texas Health Science Center, Fort Worth, TX 76107, U.S.A; 3Department of Statistics, University of Auckland, Auckland 1010, New Zealand; 4Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, P.R. China; 5Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, U.S.A
| | - Qing Lu
- 1Department of Epidemiology and Biostatistics, Indiana University at Bloomington, Bloomington, IN 47405, U.S.A; 2Department of Epidemiology and Biostatistics, University of North Texas Health Science Center, Fort Worth, TX 76107, U.S.A; 3Department of Statistics, University of Auckland, Auckland 1010, New Zealand; 4Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, P.R. China; 5Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, U.S.A
| |
|
16
|
Hu X, Zhang W, Zhang S, Ma S, Li Q. Group-combined P-values with applications to genetic association studies. Bioinformatics 2016; 32:2737-43. [PMID: 27259542 DOI: 10.1093/bioinformatics/btw314] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 01/18/2016] [Accepted: 05/13/2016] [Indexed: 01/01/2023] Open
Abstract
MOTIVATION In large-scale genetic association studies with tens of hundreds of single nucleotide polymorphisms (SNPs) genotyped, the traditional statistical framework of logistic regression using the maximum likelihood estimator (MLE) to infer the odds ratios of SNPs may not work appropriately. This is because a large number of odds ratios need to be estimated, and the MLEs may not be stable when some of the SNPs are in high linkage disequilibrium. In this situation, P-value combination procedures seem to provide good alternatives, as they are constructed on the basis of single-marker analysis. RESULTS The commonly used P-value combination methods (such as Fisher's combined test, the truncated product method, the truncated tail strength and the adaptive rank truncated product, ARTP) may lose power when the significance level varies across SNPs. To tackle this problem, a group-combined P-value method (GCP) is proposed, in which the P-values are divided into multiple groups and then combined at the group level. With this strategy, the significance values are integrated at different levels, and the power is improved. Simulations show that the GCP effectively controls the type I error rate and has additional power over the existing methods; the power increase can exceed 50% in some situations. The proposed GCP method is applied to data from the Genetic Analysis Workshop 16. Among all the methods, only the GCP and ARTP reach significance in identifying a genomic region covering the gene DSC3 as associated with rheumatoid arthritis, and the GCP provides the smaller P-value. AVAILABILITY AND IMPLEMENTATION http://www.statsci.amss.ac.cn/yjscy/yjy/lqz/201510/t20151027_313273.html CONTACT liqz@amss.ac.cn SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
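Illustrative note: the abstract names Fisher's combined test as one of the baseline P-value combination methods. The sketch below shows only that baseline, not the GCP method itself, and it assumes the P-values are independent. The function name `fisher_combined_pvalue` is hypothetical and chosen for illustration.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent P-values with Fisher's method (illustrative sketch)."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one P-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("P-values must lie in (0, 1]")
    # Fisher's statistic is X = -2 * sum(ln p_i); only X / 2 is needed below.
    half_stat = -sum(math.log(p) for p in pvalues)
    # Under the null, X follows a chi-square distribution with 2k degrees of
    # freedom. For even degrees of freedom the upper tail has a closed form:
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total
```

With a single P-value the function returns that P-value unchanged, which is a quick sanity check on the closed-form tail probability.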
Affiliation(s)
- Xiaonan Hu
- School of Mathematical Sciences, University of Chinese Academy of Sciences Key Laboratory of Big Data Mining and Knowledge Management
| | - Wei Zhang
- Key Laboratory of Systems and Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
| | - Sanguo Zhang
- School of Mathematical Sciences, University of Chinese Academy of Sciences Key Laboratory of Big Data Mining and Knowledge Management
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, CT, USA
| | - Qizhai Li
- Key Laboratory of Systems and Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
| |
|
17
|
Powerful and Adaptive Testing for Multi-trait and Multi-SNP Associations with GWAS and Sequencing Data. Genetics 2016; 203:715-31. [PMID: 27075728 DOI: 10.1534/genetics.115.186502] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 12/22/2015] [Accepted: 04/02/2016] [Indexed: 11/18/2022] Open
Abstract
Testing for genetic association with multiple traits has become increasingly important, not only because of its potential to boost statistical power, but also for its direct relevance to applications. For example, there is accumulating evidence showing that some complex neurodegenerative and psychiatric diseases like Alzheimer's disease are due to disrupted brain networks, for which it would be natural to identify genetic variants associated with a disrupted brain network, represented as a set of multiple traits, one for each of multiple brain regions of interest. In spite of its promise, testing for multivariate trait associations is challenging: if not appropriately used, its power can be much lower than testing on each univariate trait separately (with a proper control for multiple testing). Furthermore, differing from most existing methods for single-SNP-multiple-trait associations, we consider SNP set-based association testing to decipher complicated joint effects of multiple SNPs on multiple traits. Because the power of a test critically depends on several unknown factors such as the proportions of associated SNPs and of traits, we propose a highly adaptive test at both the SNP and trait levels, giving higher weights to those likely associated SNPs and traits, to yield high power across a wide spectrum of situations. We illuminate relationships among the proposed and some existing tests, showing that the proposed test covers several existing tests as special cases. We compare the performance of the new test with that of several existing tests, using both simulated and real data. The methods were applied to structural magnetic resonance imaging data drawn from the Alzheimer's Disease Neuroimaging Initiative to identify genes associated with gray matter atrophy in the human brain default mode network (DMN). 
For genome-wide association studies (GWAS), genes AMOTL1 on chromosome 11 and APOE on chromosome 19 were discovered by the new test to be significantly associated with the DMN. Notably, gene AMOTL1 was not detected by single SNP-based analyses. To our knowledge, AMOTL1 has not been highlighted in other Alzheimer's disease studies before, although it was indicated to be related to cognitive impairment. The proposed method is also applicable to rare variants in sequencing data and can be extended to pathway analysis.
|
18
|
Wei C, Elston RC, Lu Q. A weighted U statistic for association analyses considering genetic heterogeneity. Stat Med 2016; 35:2802-14. [PMID: 26833871 DOI: 10.1002/sim.6877] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 06/24/2015] [Revised: 11/11/2015] [Accepted: 12/28/2015] [Indexed: 11/10/2022]
Abstract
Converging evidence suggests that common complex diseases with the same or similar clinical manifestations could have different underlying genetic etiologies. While current research interests have shifted toward uncovering rare variants and structural variations predisposing to human diseases, the impact of heterogeneity in genetic studies of complex diseases has been largely overlooked. Most of the existing statistical methods assume the disease under investigation has a homogeneous genetic effect and could, therefore, have low power if the disease undergoes heterogeneous pathophysiological and etiological processes. In this paper, we propose a heterogeneity-weighted U (HWU) method for association analyses considering genetic heterogeneity. HWU can be applied to various types of phenotypes (e.g., binary and continuous) and is computationally efficient for high-dimensional genetic data. Through simulations, we showed the advantage of HWU when the underlying genetic etiology of a disease was heterogeneous, as well as the robustness of HWU against different model assumptions (e.g., phenotype distributions). Using HWU, we conducted a genome-wide analysis of nicotine dependence from the Study of Addiction: Genetics and Environments dataset. The genome-wide analysis of nearly one million genetic markers took 7h, identifying heterogeneous effects of two new genes (i.e., CYP3A5 and IKBKB) on nicotine dependence. Copyright © 2016 John Wiley & Sons, Ltd.
Affiliation(s)
- Changshuai Wei
- Department of Biostatistics and Epidemiology, University of North Texas Health Science Center, Fort Worth, TX, U.S.A
| | - Robert C Elston
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, U.S.A
| | - Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, U.S.A
| |
|
19
|
Wang C, Kao WH, Hsiao CK. Using Hamming Distance as Information for SNP-Sets Clustering and Testing in Disease Association Studies. PLoS One 2015; 10:e0135918. [PMID: 26302001 PMCID: PMC4547758 DOI: 10.1371/journal.pone.0135918] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 10/28/2014] [Accepted: 07/28/2015] [Indexed: 11/27/2022] Open
Abstract
The availability of high-throughput genomic data has led to several challenges in recent genetic association studies, including the large number of genetic variants that must be considered and the computational complexity of statistical analyses. Tackling these problems with a marker-set study such as SNP-set analysis can be an efficient solution. To construct SNP-sets, we first propose a clustering algorithm, which employs Hamming distance to measure the similarity between strings of SNP genotypes and evaluates whether the given SNPs or SNP-sets should be clustered. A dendrogram can then be constructed based on this distance measure, and the number of clusters can be determined. With the resulting SNP-sets, we next develop an association test, HDAT, to examine susceptibility to the disease of interest. This proposed test assesses, based on Hamming distance, whether the similarity between a diseased and a normal individual differs from the similarity between two individuals of the same disease status. In our proposed methodology, only genotype information is needed. No inference of haplotypes is required, and the SNPs under consideration do not need to be located in nearby regions. The proposed clustering algorithm and association test are illustrated with applications and simulation studies. Compared with other existing methods, the clustering algorithm is faster and better at identifying sets containing SNPs exerting a similar effect. In addition, the simulation studies demonstrated that the proposed test works well for SNP-sets containing a large proportion of neutral SNPs. Furthermore, employing the clustering algorithm before testing a large set of data helps to confine the genetic regions containing susceptibility markers.
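Illustrative note: the Hamming distance referred to in the abstract counts the positions at which two equal-length genotype strings differ. The sketch below shows only this distance, not the clustering algorithm or the HDAT test. The 0/1/2 genotype coding (copies of the minor allele) and the function name `hamming_distance` are assumptions made for illustration.

```python
def hamming_distance(genotypes_a, genotypes_b):
    """Count positions where two equal-length genotype sequences differ."""
    if len(genotypes_a) != len(genotypes_b):
        raise ValueError("genotype sequences must have the same length")
    # Each mismatching position contributes 1 to the distance.
    return sum(a != b for a, b in zip(genotypes_a, genotypes_b))


# Two individuals genotyped at four SNPs, coded 0/1/2 (assumed coding).
individual_1 = "0120"
individual_2 = "0210"
distance = hamming_distance(individual_1, individual_2)
```

A smaller distance means the two genotype strings are more similar, which is the notion of similarity the abstract builds on.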
Affiliation(s)
- Charlotte Wang
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, 100, Taiwan
| | - Wen-Hsin Kao
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, 100, Taiwan
| | - Chuhsing Kate Hsiao
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, 100, Taiwan
- Bioinformatics and Biostatistics Core, Division of Genomic Medicine, Research Center for Medical Excellence, National Taiwan University, Taipei, 100, Taiwan
- Department of Public Health, National Taiwan University, Taipei, 100, Taiwan
| |
|
20
|
Clique-Based Clustering of Correlated SNPs in a Gene Can Improve Performance of Gene-Based Multi-Bin Linear Combination Test. BioMed Research International 2015; 2015:852341. [PMID: 26346579 PMCID: PMC4539439 DOI: 10.1155/2015/852341] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 11/14/2014] [Revised: 02/03/2015] [Accepted: 02/14/2015] [Indexed: 11/18/2022]
Abstract
Gene-based analysis of multiple single nucleotide polymorphisms (SNPs) in a gene region is an alternative to single SNP analysis. The multi-bin linear combination test (MLC) proposed in previous studies utilizes the correlation among SNPs within a gene to construct a gene-based global test. SNPs are partitioned into clusters of highly correlated SNPs, and the MLC test statistic quadratically combines linear combination statistics constructed for each cluster. The test has degrees of freedom equal to the number of clusters and can be more powerful than a fully quadratic or fully linear test statistic. In this study, we develop a new SNP clustering algorithm designed to find cliques, which are complete subnetworks of SNPs with all pairwise correlations above a threshold. We evaluate the performance of the MLC test using the clique-based CLQ algorithm versus using the tag-SNP-based LDSelect algorithm. In our numerical power calculations we observed that the two clustering algorithms produce identical clusters about 40-60% of the time, yielding similar power on average. However, because the CLQ algorithm tends to produce smaller clusters with stronger positive correlation, the MLC test is less likely to be affected by the occurrence of opposing signs in the individual SNP effect coefficients.
|
21
|
Zhang W, Li Q. Nonparametric Risk and Nonparametric Odds in Quantitative Genetic Association Studies. Sci Rep 2015; 5:12105. [PMID: 26174851 PMCID: PMC5378889 DOI: 10.1038/srep12105] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 08/22/2014] [Accepted: 06/17/2015] [Indexed: 12/30/2022] Open
Abstract
The coefficient in a linear regression model is commonly employed to evaluate the genetic effect of a single nucleotide polymorphism associated with a quantitative trait, under the assumption that the trait value follows a normal distribution or is approximately normally distributed after a certain transformation. When this assumption is violated, distribution-free tests are preferred. In this work, we propose the nonparametric risk (NR) and nonparametric odds (NO), obtain the asymptotic normal distribution of the estimated NR and then construct confidence intervals. We also define genetic models using NR, and construct a test statistic under a given genetic model as well as a robust test that is free of genetic model uncertainty. Simulation studies show that the proposed confidence intervals have satisfactory coverage probabilities and that the proposed test controls the type I error rate and is more powerful than the existing ones under most of the considered scenarios. Applications to the gene PTPN22 and the genomic region 6p21.33 from the Genetic Analysis Workshop 16, for association with the anticyclic citrullinated protein antibody, further illustrate their performance.
Affiliation(s)
- Wei Zhang
- Key Laboratory of Systems Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
| | - Qizhai Li
- Key Laboratory of Systems Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
| |
|
22
|
Wang YT, Sung PY, Lin PL, Yu YW, Chung RH. A multi-SNP association test for complex diseases incorporating an optimal P-value threshold algorithm in nuclear families. BMC Genomics 2015; 16:381. [PMID: 25975968 PMCID: PMC4433014 DOI: 10.1186/s12864-015-1620-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 11/13/2014] [Accepted: 05/05/2015] [Indexed: 01/22/2023] Open
Abstract
Background Genome-wide association studies (GWAS) have become a common approach to identifying single nucleotide polymorphisms (SNPs) associated with complex diseases. As complex diseases are caused by the joint effects of multiple genes, while the effect of an individual gene or SNP is modest, a method considering the joint effects of multiple SNPs can be more powerful than testing individual SNPs. The multi-SNP analysis aims to test association based on a SNP set, usually defined based on biological knowledge such as gene or pathway, which may contain only a portion of SNPs with effects on the disease. Therefore, a challenge for the multi-SNP analysis is how to effectively select a subset of SNPs with promising association signals from the SNP set. Results We developed the Optimal P-value Threshold Pedigree Disequilibrium Test (OPTPDT). The OPTPDT uses general nuclear families. A variable p-value threshold algorithm is used to determine an optimal p-value threshold for selecting a subset of SNPs. A permutation procedure is used to assess the significance of the test. We used simulations to verify that the OPTPDT has correct type I error rates. Our power studies showed that the OPTPDT can be more powerful than the set-based test in PLINK, the multi-SNP FBAT test, and the p-value based test GATES. We applied the OPTPDT to a family-based autism GWAS dataset for gene-based association analysis and identified MACROD2-AS1 with genome-wide significance (p-value = 2.5 × 10⁻⁶). Conclusions Our simulation results suggested that the OPTPDT is a valid and powerful test. The OPTPDT will be helpful for gene-based or pathway association analysis. The method is ideal for the secondary analysis of existing GWAS datasets, which may identify a set of SNPs with joint effects on the disease. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1620-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yi-Ting Wang
- Institute of Statistics, National Tsing Hua University, Hsin-Chu, Taiwan.
| | - Pei-Yuan Sung
- Institute of Statistics, National Tsing Hua University, Hsin-Chu, Taiwan.
| | - Peng-Lin Lin
- Department of Medical Science, National Tsing Hua University, Hsin-Chu, Taiwan.
| | - Ya-Wen Yu
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan, Taiwan.
| | - Ren-Hua Chung
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan, Taiwan.
| |
|
23
|
Yan B, Wang S, Jia H, Liu X, Wang X. An efficient weighted tag SNP-set analytical method in genome-wide association studies. BMC Genet 2015; 16:25. [PMID: 25879733 PMCID: PMC4373116 DOI: 10.1186/s12863-015-0182-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 12/14/2014] [Accepted: 02/17/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Single-nucleotide polymorphism (SNP)-set analysis in genome-wide association studies (GWAS) has emerged as a research hotspot for identifying genetic variants associated with disease susceptibility. However, most existing methods of SNP-set analysis are affected by the quality of the SNP-set, and a poor-quality SNP-set can lead to low power in GWAS. RESULTS In this research, we propose an efficient weighted tag-SNP-set analytical method to detect disease associations. In our method, we first design a fast algorithm to select a subset of SNPs (called the tag SNP-set) from a given original SNP-set based on the linkage disequilibrium (LD) between SNPs, then assign a proper weight to each selected tag SNP and test the joint effect of these weighted tag SNPs. The intensive simulation results show that the power of the weighted tag SNP-set-based test is much higher than that of the weighted original SNP-set-based test and that of the unweighted tag SNP-set-based test. We also compare the powers of the weighted tag SNP-set-based test based on four types of tag SNP-sets. The simulation results indicate that the method of selecting the tag SNP-set greatly affects the power and that the power of our proposed method is the highest. CONCLUSIONS From the analysis of simulated replicated data sets, we conclude that the weighted tag SNP-set-based test is a powerful SNP-set test in GWAS. We also designed a faster algorithm for selecting tag SNPs that retain most of the information in the original SNP-set, and a better weight function that describes the status of each tag SNP in GWAS.
Affiliation(s)
- Bin Yan
- College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, Shandong, 266590, China.
| | - Shudong Wang
- College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, Shandong, 266590, China.,College of Computer and Communication Engineering, China University of Petroleum, Qingdao, Shandong, 266580, China.,State Key Laboratory of Mining Disaster Prevention and Control Co-founded by Shandong Province and the Ministry of Science and Technology, Shandong University of Science and Technology, Qingdao, Shandong, 266590, China.
| | - Huaqian Jia
- College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, Shandong, 266590, China.
| | - Xing Liu
- College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, Shandong, 266590, China.
| | - Xinzeng Wang
- College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, Shandong, 266590, China.
| |
|
24
|
A powerful nonparametric statistical framework for family-based association analyses. Genetics 2015; 200:69-78. [PMID: 25745024 DOI: 10.1534/genetics.115.175174] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 01/28/2015] [Accepted: 02/23/2015] [Indexed: 01/04/2023] Open
Abstract
Family-based study design is commonly used in genetic research. It has many ideal features, including being robust to population stratification (PS). With the advance of high-throughput technologies and ever-decreasing genotyping cost, it has become common for family studies to examine a large number of variants for their associations with disease phenotypes. The yield from the analysis of these family-based genetic data can be enhanced by adopting computationally efficient and powerful statistical methods. We propose a general framework of a family-based U-statistic, referred to as family-U, for family-based association studies. Unlike existing parametric-based methods, the proposed method makes no assumption of the underlying disease models and can be applied to various phenotypes (e.g., binary and quantitative phenotypes) and pedigree structures (e.g., nuclear families and extended pedigrees). By using only within-family information, it can offer robust protection against PS. In the absence of PS, it can also utilize additional information (i.e., between-family information) for power improvement. Through simulations, we demonstrated that family-U attained higher power over a commonly used method, family-based association tests, under various disease scenarios. We further illustrated the new method with an application to large-scale family data from the Framingham Heart Study. By utilizing additional information (i.e., between-family information), family-U confirmed a previous association of CHRNA5 with nicotine dependence.
|
25
|
Pan W. Relationship between genomic distance-based regression and kernel machine regression for multi-marker association testing. Genet Epidemiol 2015; 35:211-6. [PMID: 21308765 DOI: 10.1002/gepi.20567] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 08/18/2010] [Revised: 11/21/2010] [Accepted: 01/04/2011] [Indexed: 11/10/2022]
Abstract
To detect genetic association with common and complex diseases, two powerful yet quite different multimarker association tests have been proposed, genomic distance-based regression (GDBR) (Wessel and Schork [2006] Am J Hum Genet 79:821–833) and kernel machine regression (KMR) (Kwee et al. [2008] Am J Hum Genet 82:386–397; Wu et al. [2010] Am J Hum Genet 86:929–942). GDBR is based on relating a multimarker similarity metric for a group of subjects to variation in their trait values, while KMR is based on nonparametric estimates of the effects of the multiple markers on the trait through a kernel function or kernel matrix. Since the two approaches are both powerful and general, but appear quite different, it is important to know their specific relationships. In this report, we show that, under the condition that there is no other covariate, there is a striking correspondence between the two approaches for a quantitative or a binary trait: if the same positive semi-definite matrix is used as the centered similarity matrix in GDBR and as the kernel matrix in KMR, the F-test statistic in GDBR and the score test statistic in KMR are equal (up to some ignorable constants). The result is based on the connections of both methods to linear or logistic (random-effects) regression models.
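The correspondence described in this abstract can be checked numerically on toy data. The sketch below is only an illustration under stated assumptions, not the paper's code: it uses a linear kernel on simulated 0/1/2 genotypes, a quantitative trait, no covariates, and a pseudo-F without degrees-of-freedom constants. All names (`double_center`, `f_stat`, `q_stat`) are hypothetical. With the same centred matrix used as G and K, the pseudo-F reduces to a monotone function of the quadratic-form score statistic.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product, adequate for a toy-sized example."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def double_center(k):
    """Return (I - 11'/n) K (I - 11'/n) for a symmetric matrix K."""
    n = len(k)
    row_mean = [sum(r) / n for r in k]
    grand = sum(row_mean) / n
    return [[k[i][j] - row_mean[i] - row_mean[j] + grand
             for j in range(n)] for i in range(n)]


rng = random.Random(1)
n, p = 8, 5
geno = [[rng.randint(0, 2) for _ in range(p)] for _ in range(n)]  # toy genotypes
trait = [rng.gauss(0.0, 1.0) for _ in range(n)]                   # toy trait

# Linear kernel Z Z' (positive semi-definite), centred; used as both G and K.
gram = matmul(geno, [list(col) for col in zip(*geno)])
g = double_center(gram)

mean_y = sum(trait) / n
yc = [v - mean_y for v in trait]
ss = sum(v * v for v in yc)

# GDBR-style pseudo-F: tr(HGH) / tr((I-H)G(I-H)), with H the projection
# onto the centred trait (degrees-of-freedom constants omitted).
h = [[yc[i] * yc[j] / ss for j in range(n)] for i in range(n)]
r = [[(1.0 if i == j else 0.0) - h[i][j] for j in range(n)] for i in range(n)]
f_stat = trace(matmul(matmul(h, g), h)) / trace(matmul(matmul(r, g), r))

# KMR-style score statistic: quadratic form of the centred trait in the kernel.
q_stat = sum(yc[i] * g[i][j] * yc[j] for i in range(n) for j in range(n))

# Same information in both: F = Q / (ss * tr(G) - Q).
f_from_q = q_stat / (ss * trace(g) - q_stat)
print(f_stat, f_from_q)
```

Because `ss` and `trace(g)` do not change when the trait is permuted, the two statistics rank permutations identically in this setting.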
Affiliation(s)
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455–0392, USA.
26
Li Z, Yuan A, Han G, Gao G, Li Q. Rank-based tests for identifying multiple genetic variants associated with quantitative traits. Ann Hum Genet 2015; 78:306-10. [PMID: 24942081 DOI: 10.1111/ahg.12067] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/25/2022]
Abstract
We consider the analysis of multiple genetic variants within a gene or a region that are expected to confer risks to human complex diseases with quantitative traits, where the trait values do not follow the normal distribution even after some transformations. We rank the phenotypic values, calculate a score to measure the trend effect of a particular allele for each marker, and then construct three statistics to combine the scores for different marker loci, based on the quadratic frameworks of Hotelling's T², the summation of squared univariate statistics, and the inverse of the square root weighted statistics. Simulation results show that these three test statistics control the type I error rate well and are more robust than standard tests based on linear regression. Application to GAW16 data for rheumatoid arthritis successfully detects the association between the HLA-DRB1 gene and the anticyclic citrullinated protein measure, while the standard methods based on the normality assumption cannot detect this association.
27
Chien LC, Chiu YF, Liang KY, Chuang LM. Simultaneous estimation of the locations and effects of multiple disease loci in case-control studies. Biostatistics 2014; 16:222-39. [PMID: 25481194 DOI: 10.1093/biostatistics/kxu052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022] Open
Abstract
The genetic basis of complex diseases often involves multiple causative loci. Under such a disease etiology, assuming one disease locus in linkage disequilibrium mapping is likely to induce bias and lead to efficiency loss in disease locus estimation. An approach is needed for simultaneously localizing multiple functional loci within the same region. However, due to the increasing number of parameters accompanying disease loci, these estimates can be computationally infeasible. To circumvent this problem, we propose to estimate the main and two-adjacent-locus joint effects and a nuisance parameter at the disease loci separately through a linear approximation. Estimates of the genetic effects are entered into a generalized estimating equation to estimate disease loci, and the procedure is conducted iteratively until convergence. The proposed method provides estimates and confidence intervals (CIs) for the disease loci, the genetic main effects, and the joint effects of two adjacent disease loci, with the CIs for the disease loci providing useful regions for further fine-mapping. We apply the proposed approach to a data example of case-control studies. Results of the simulations and data example suggest that the developed method performs well in terms of bias, variance, and coverage probability under scenarios with up to three disease loci.
Affiliation(s)
- Li-Chu Chien
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Miaoli 35053, Taiwan, ROC
- Yen-Feng Chiu
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Miaoli 35053, Taiwan, ROC; Institute of Statistics, National Chiao Tung University Hsinchu 30010, Taiwan, ROC; Biostatistics Center, China Medical University, Taichung 40402, Taiwan, ROC
- Kung-Yee Liang
- Institution of Public Health and Department of Public Health, National Yang Ming University, Taipei 11221, Taiwan
- Lee-Ming Chuang
- Department of Internal Medicine, National Taiwan University Hospital, Taipei, Taiwan; Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei 10051, Taiwan
28
Wei C, Li M, He Z, Vsevolozhskaya O, Schaid DJ, Lu Q. A weighted U-statistic for genetic association analyses of sequencing data. Genet Epidemiol 2014; 38:699-708. [PMID: 25331574 DOI: 10.1002/gepi.21864] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 02/13/2014] [Revised: 08/15/2014] [Accepted: 09/05/2014] [Indexed: 12/13/2022]
Abstract
With advancements in next-generation sequencing technology, a massive amount of sequencing data is generated, which offers a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, the high-dimensional sequencing data poses a great challenge for statistical analysis. The association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a Weighted U Sequencing test, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a nonparametric U-statistic, WU-SEQ makes no assumption of the underlying disease model and phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed a commonly used sequence kernel association test (SKAT) method when the underlying assumptions were violated (e.g., the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained comparable performance to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS), and detected an association between ANGPTL 4 and very low density lipoprotein cholesterol.
Affiliation(s)
- Changshuai Wei
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, United States of America; Department of Biostatistics and Epidemiology, University of North Texas Health Science Center, Fort Worth, Texas, United States of America
29
Lu M, Lee HS, Hadley D, Huang JZ, Qian X. Supervised categorical principal component analysis for genome-wide association analyses. BMC Genomics 2014; 15 Suppl 1:S10. [PMID: 24564304 PMCID: PMC4046680 DOI: 10.1186/1471-2164-15-s1-s10] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 02/22/2023] Open
Abstract
In order to have a better understanding of unexplained heritability for complex diseases in conventional Genome-Wide Association Studies (GWAS), aggregated association analyses based on predefined functional regions, such as genes and pathways, become popular recently as they enable evaluating joint effect of multiple Single-Nucleotide Polymorphisms (SNPs), which helps increase the detection power, especially when investigating genetic variants with weak individual effects. In this paper, we focus on aggregated analysis methods based on the idea of Principal Component Analysis (PCA). The past approaches using PCA mostly make some inherent genotype data and/or risk effect model assumptions, which may hinder the accurate detection of potential disease SNPs that influence disease phenotypes. In this paper, we derive a general Supervised Categorical Principal Component Analysis (SCPCA), which explicitly models categorical SNP data without imposing any risk effect model assumption. We have evaluated the efficacy of SCPCA with the comparison to a traditional Supervised PCA (SPCA) and a previously developed Supervised Logistic Principal Component Analysis (SLPCA) based on both the simulated genotype data by HAPGEN2 and the genotype data of Crohn's Disease (CD) from Wellcome Trust Case Control Consortium (WTCCC). Our preliminary results have demonstrated the superiority of SCPCA over both SPCA and SLPCA due to its modeling explicitly designed for categorical SNP data as well as its flexibility on the risk effect model assumption.
30
Larson NB, Jenkins GD, Larson MC, Vierkant RA, Sellers TA, Phelan CM, Schildkraut JM, Sutphen R, Pharoah PPD, Gayther SA, Wentzensen N, Goode EL, Fridley BL. Kernel canonical correlation analysis for assessing gene-gene interactions and application to ovarian cancer. Eur J Hum Genet 2014; 22:126-31. [PMID: 23591404 PMCID: PMC3865403 DOI: 10.1038/ejhg.2013.69] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 08/21/2012] [Revised: 01/11/2013] [Accepted: 01/16/2013] [Indexed: 01/24/2023] Open
Abstract
Although single-locus approaches have been widely applied to identify disease-associated single-nucleotide polymorphisms (SNPs), complex diseases are thought to be the product of multiple interactions between loci. This has led to the recent development of statistical methods for detecting statistical interactions between two loci. Canonical correlation analysis (CCA) has previously been proposed to detect gene-gene coassociation. However, this approach is limited to detecting linear relations and can only be applied when the number of observations exceeds the number of SNPs in a gene. This limitation is particularly important for next-generation sequencing, which could yield a large number of novel variants on a limited number of subjects. To overcome these limitations, we propose an approach to detect gene-gene interactions on the basis of a kernelized version of CCA (KCCA). Our simulation studies showed that KCCA controls the Type-I error, and is more powerful than leading gene-based approaches under a disease model with negligible marginal effects. To demonstrate the utility of our approach, we also applied KCCA to assess interactions between 200 genes in the NF-κB pathway in relation to ovarian cancer risk in 3869 cases and 3276 controls. We identified 13 significant gene pairs relevant to ovarian cancer risk (local false discovery rate <0.05). Finally, we discuss the advantages of KCCA in gene-gene interaction analysis and its future role in genetic association studies.
Affiliation(s)
- Nicholas B Larson
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Gregory D Jenkins
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Melissa C Larson
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Robert A Vierkant
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Rebecca Sutphen
- Department of Pediatrics, University of South Florida College of Medicine, Tampa, FL, USA
- Simon A Gayther
- Department of Preventative Medicine, University of Southern California, Los Angeles, CA, USA
- Nicolas Wentzensen
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
- Ovarian Cancer Association Consortium
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, USA
- Duke Comprehensive Cancer Center, Duke University, Durham, NC, USA
- Department of Pediatrics, University of South Florida College of Medicine, Tampa, FL, USA
- Department of Oncology, University of Cambridge, Cambridge, UK
- Department of Preventative Medicine, University of Southern California, Los Angeles, CA, USA
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
- Department of Biostatistics, University of Kansas Medical Center, Kansas City, KS, USA
- Ellen L Goode
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Brooke L Fridley
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Department of Biostatistics, University of Kansas Medical Center, Kansas City, KS, USA
31
Jin L, Zhu W, Yu Y, Kou C, Meng X, Tao Y, Guo J. Nonparametric tests of associations with disease based on U-statistics. Ann Hum Genet 2013; 78:141-53. [PMID: 24328673 DOI: 10.1111/ahg.12049] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/07/2013] [Accepted: 09/01/2013] [Indexed: 11/25/2022]
Abstract
In case-control studies, association analysis was designed to test whether genetic variants were associated with human diseases. To evaluate the association, analysing one genetic marker at a time suffered from weak power, because of the correction for multiple testing and possibly small genetic effects. An alternative strategy was to test simultaneous effects of multiple markers, which was believed to be more powerful. However, when the number of markers under investigation was large, they would be subjected to weak power as well, because of the greater degrees of freedom. To conquer these limitations in case-control studies, we proposed a novel method that could test joint association of several loci (i.e. haplotype), with only a single degree of freedom. In this research, we developed a nonparametric approach, which was based on U-statistics. We also introduced a new kernel for U-statistic, which could combine the haplotype structure information, and was expected to enhance the power. Simulations indicated that our proposed approach offered merits in identifying the associations between diseases and haplotypes. Application of our method to a study of candidate genes for internalising disorder illustrated its virtue in utility and interpretation, and provided an excellent result in detecting the associations.
Affiliation(s)
- Lina Jin
- Key Laboratory for Applied Statistics of MOE and School of Mathematics and Statistics, Northeast Normal University, Changchun, Jilin, 130024, China; School of Public Health, Jilin University, Changchun, Jilin, 130021, China
32
Taub MA, Schwender HR, Younkin SG, Louis TA, Ruczinski I. On multi-marker tests for association in case-control studies. Front Genet 2013; 4:252. [PMID: 24379823 PMCID: PMC3863805 DOI: 10.3389/fgene.2013.00252] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/16/2013] [Accepted: 11/07/2013] [Indexed: 11/13/2022] Open
Abstract
Genome-wide association studies (GWAs) have identified thousands of DNA loci associated with a variety of traits. Statistical inference is almost always based on single marker hypothesis tests of association and the respective p-values with Bonferroni correction. Since commercially available genomic arrays interrogate hundreds of thousands or even millions of loci simultaneously, many causal yet undetected loci are believed to exist because the conditional power to achieve a genome-wide significance level can be low, in particular for markers with small effect sizes and low minor allele frequencies and in studies with modest sample size. However, the correlation between neighboring markers in the human genome due to linkage disequilibrium (LD) resulting in correlated marker test statistics can be incorporated into multi-marker hypothesis tests, thereby increasing power to detect association. Herein, we establish a theoretical benchmark by quantifying the maximum power achievable for multi-marker tests of association in case-control studies, achievable only when the causal marker is known. Using that genotype correlations within an LD block translate into an asymptotically multivariate normal distribution for score test statistics, we develop a set of weights for the markers that maximize the non-centrality parameter, and assess the relative loss of power for other approaches. We find that the method of Conneely and Boehnke (2007) based on the maximum absolute test statistic observed in an LD block is a practical and powerful method in a variety of settings. We also explore the effect on the power that prior biological or functional knowledge used to narrow down the locus of the causal marker can have, and conclude that this prior knowledge has to be very strong and specific for the power to approach the maximum achievable level, or even beat the power observed for methods such as the one proposed by Conneely and Boehnke (2007).
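The idea of a maximum-absolute-statistic test within an LD block can be sketched as follows. This is a Monte Carlo stand-in written for illustration, not the procedure of Conneely and Boehnke (2007) or of this paper. It assumes the marker z-statistics are jointly normal under the null with a known correlation matrix, and the function names, the AR(1)-style correlation, and the observed z-values are all hypothetical.

```python
import math
import random


def cholesky(r):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(r)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(r[i][i] - s)
            else:
                low[i][j] = (r[i][j] - s) / low[j][j]
    return low


def max_abs_pvalue(z_obs, corr, n_draws=20000, seed=0):
    """Estimate P(max_j |Z_j| >= observed max) for Z ~ N(0, corr)."""
    rng = random.Random(seed)
    low = cholesky(corr)
    m = len(z_obs)
    t_obs = max(abs(z) for z in z_obs)
    hits = 0
    for _ in range(n_draws):
        e = [rng.gauss(0.0, 1.0) for _ in range(m)]
        z = [sum(low[i][k] * e[k] for k in range(i + 1)) for i in range(m)]
        if max(abs(v) for v in z) >= t_obs:
            hits += 1
    return (hits + 1) / (n_draws + 1)


# Hypothetical LD block: 5 markers, correlation 0.8 ** |i - j|.
m = 5
corr = [[0.8 ** abs(i - j) for j in range(m)] for i in range(m)]
z_obs = [1.2, 2.9, 2.5, 0.4, -0.3]

t_obs = max(abs(z) for z in z_obs)
p_single = math.erfc(t_obs / math.sqrt(2.0))   # two-sided p, best marker alone
p_bonf = min(1.0, m * p_single)                # Bonferroni-adjusted p
p_mc = max_abs_pvalue(z_obs, corr)             # correlation-aware p
print(p_single, p_mc, p_bonf)
```

With strongly correlated markers the correlation-aware p-value falls between the unadjusted and the Bonferroni-adjusted values, which is where the power gain over Bonferroni correction comes from.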
Affiliation(s)
- Margaret A Taub
- Department of Biostatistics, Johns Hopkins University Baltimore, MD, USA
- Holger R Schwender
- Mathematical Institute, Heinrich Heine University Düsseldorf Düsseldorf, Germany
- Samuel G Younkin
- Department of Biostatistics, Johns Hopkins University Baltimore, MD, USA
- Thomas A Louis
- Department of Biostatistics, Johns Hopkins University Baltimore, MD, USA
- Ingo Ruczinski
- Department of Biostatistics, Johns Hopkins University Baltimore, MD, USA
33
Di Camillo B, Sambo F, Toffolo G, Cobelli C. ABACUS: an entropy-based cumulative bivariate statistic robust to rare variants and different direction of genotype effect. Bioinformatics 2013; 30:384-91. [PMID: 24292361 DOI: 10.1093/bioinformatics/btt697] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/17/2023]
Abstract
Motivation: In the past years, both sequencing and microarray have been widely used to search for relations between genetic variations and predisposition to complex pathologies such as diabetes or neurological disorders. These studies, however, have been able to explain only a small fraction of disease heritability, possibly because complex pathologies cannot be referred to few dysfunctional genes, but are rather heterogeneous and multicausal, as a result of a combination of rare and common variants possibly impairing multiple regulatory pathways. Rare variants, though, are difficult to detect, especially when the effects of causal variants are in different directions, i.e. with protective and detrimental effects. Results: Here, we propose ABACUS, an Algorithm based on a BivAriate CUmulative Statistic to identify single nucleotide polymorphisms (SNPs) significantly associated with a disease within predefined sets of SNPs such as pathways or genomic regions. ABACUS is robust to the concurrent presence of SNPs with protective and detrimental effects and of common and rare variants; moreover, it is powerful even when few SNPs in the SNP-set are associated with the phenotype. We assessed ABACUS performance on simulated and real data and compared it with three state-of-the-art methods. When ABACUS was applied to type 1 and 2 diabetes data, besides observing a wide overlap with already known associations, we found a number of biologically sound pathways, which might shed light on diabetes mechanism and etiology. Availability and implementation: ABACUS is available at http://www.dei.unipd.it/~dicamill/pagine/Software.html.
Affiliation(s)
- Barbara Di Camillo
- Department of Information Engineering, University of Padova, via Gradenigo 6B, 35131 Padova, Italy
34
Zeggini E, Asimit JL. An evaluation of power to detect low-frequency variant associations using allele-matching tests that account for uncertainty. Pacific Symposium on Biocomputing 2013:100-5. [PMID: 21121037 DOI: 10.1142/9789814335058_0011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/10/2023]
Abstract
There is growing interest in the role of rare variants in multifactorial disease etiology, and increasing evidence that rare variants are associated with complex traits. Single SNP tests are underpowered in rare variant association analyses, so locus-based tests must be used. Quality scores at both the SNP and genotype level are available for sequencing data and they are rarely accounted for. A locus-based method that has high power in the presence of rare variants is extended to incorporate such quality scores as weights, and its power is compared with the original method via a simulation study. Preliminary results suggest that taking uncertainty into account does not improve the power.
Affiliation(s)
- E Zeggini
- Wellcome Trust Sanger Institute, Hinxton, CB10 1HH, UK.
35
A fast multilocus test with adaptive SNP selection for large-scale genetic-association studies. Eur J Hum Genet 2013; 22:696-702. [PMID: 24022295 DOI: 10.1038/ejhg.2013.201] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 03/22/2013] [Revised: 07/02/2013] [Accepted: 08/07/2013] [Indexed: 12/20/2022] Open
Abstract
As increasing evidence suggests that multiple correlated genetic variants could jointly influence the outcome, a multilocus test that aggregates association evidence across multiple genetic markers in a considered gene or a genomic region may be more powerful than a single-marker test for detecting susceptibility loci. We propose a multilocus test, AdaJoint, which adopts a variable selection procedure to identify a subset of genetic markers that jointly show the strongest association signal, and defines the test statistic based on the selected genetic markers. The P-value from the AdaJoint test is evaluated by a computationally efficient algorithm that effectively adjusts for multiple-comparison, and is hundreds of times faster than the standard permutation method. Simulation studies demonstrate that AdaJoint has the most robust performance among several commonly used multilocus tests. We perform multilocus analysis of over 26,000 genes/regions on two genome-wide association studies of pancreatic cancer. Compared with its competitors, AdaJoint identifies a much stronger association between the gene CLPTM1L and pancreatic cancer risk (6.0 × 10⁻⁸), with the signal optimally captured by two correlated single-nucleotide polymorphisms (SNPs). Finally, we show AdaJoint as a powerful tool for mapping cis-regulating methylation quantitative trait loci on normal breast tissues, and find many CpG sites whose methylation levels are jointly regulated by multiple SNPs nearby.
36
Larson NB, Schaid DJ. A kernel regression approach to gene-gene interaction detection for case-control studies. Genet Epidemiol 2013; 37:695-703. [PMID: 23868214 DOI: 10.1002/gepi.21749] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 01/18/2013] [Revised: 05/07/2013] [Accepted: 06/12/2013] [Indexed: 01/13/2023]
Abstract
Gene-gene interactions are increasingly being addressed as a potentially important contributor to the variability of complex traits. Consequently, attentions have moved beyond single locus analysis of association to more complex genetic models. Although several single-marker approaches toward interaction analysis have been developed, such methods suffer from very high testing dimensionality and do not take advantage of existing information, notably the definition of genes as functional units. Here, we propose a comprehensive family of gene-level score tests for identifying genetic elements of disease risk, in particular pairwise gene-gene interactions. Using kernel machine methods, we devise score-based variance component tests under a generalized linear mixed model framework. We conducted simulations based upon coalescent genetic models to evaluate the performance of our approach under a variety of disease models. These simulations indicate that our methods are generally higher powered than alternative gene-level approaches and at worst competitive with exhaustive SNP-level (where SNP is single-nucleotide polymorphism) analyses. Furthermore, we observe that simulated epistatic effects resulted in significant marginal testing results for the involved genes regardless of whether or not true main effects were present. We detail the benefits of our methods and discuss potential genome-wide analysis strategies for gene-gene interaction analysis in a case-control study design.
Affiliation(s)
- Nicholas B Larson
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
37
Jiao S, Hsu L, Bézieau S, Brenner H, Chan AT, Chang-Claude J, Le Marchand L, Lemire M, Newcomb PA, Slattery ML, Peters U. SBERIA: set-based gene-environment interaction test for rare and common variants in complex diseases. Genet Epidemiol 2013; 37:452-64. [PMID: 23720162 PMCID: PMC3713231 DOI: 10.1002/gepi.21735] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Received: 03/05/2013] [Revised: 04/04/2013] [Accepted: 04/30/2013] [Indexed: 01/28/2023]
Abstract
Identification of gene-environment interaction (G × E) is important in understanding the etiology of complex diseases. However, partially due to the lack of power, there have been very few replicated G × E findings compared to the success in marginal association studies. The existing G × E testing methods mainly focus on improving the power for individual markers. In this paper, we took a different strategy and proposed a set-based gene-environment interaction test (SBERIA), which can improve the power by reducing the multiple testing burdens and aggregating signals within a set. The major challenge of the signal aggregation within a set is how to tell signals from noise and how to determine the direction of the signals. SBERIA takes advantage of the established correlation screening for G × E to guide the aggregation of genotypes within a marker set. The correlation screening has been shown to be an efficient way of selecting potential G × E candidate SNPs in case-control studies for complex diseases. Importantly, the correlation screening in case-control combined samples is independent of the interaction test. With this desirable feature, SBERIA maintains the correct type I error level and can be easily implemented in a regular logistic regression setting. We showed that SBERIA had higher power than benchmark methods in various simulation scenarios, both for common and rare variants. We also applied SBERIA to real genome-wide association studies (GWAS) data of 10,729 colorectal cancer cases and 13,328 controls and found evidence of interaction between the set of known colorectal cancer susceptibility loci and smoking.
Affiliation(s)
- Shuo Jiao
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, USA.
38
Zakharov S, Salim A, Thalamuthu A. Comparison of similarity-based tests and pooling strategies for rare variants. BMC Genomics 2013; 14:50. [PMID: 23343094 PMCID: PMC3600007 DOI: 10.1186/1471-2164-14-50] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/09/2012] [Accepted: 01/17/2013] [Indexed: 11/10/2022] Open
Abstract
Background: As several rare genomic variants have been shown to affect common phenotypes, rare variants association analysis has received considerable attention. Several efficient association tests using genotype and phenotype similarity measures have been proposed in the literature. The major advantages of similarity-based tests are their ability to accommodate multiple types of DNA variations within one association test, and to account for the possible interaction within a region. However, not much work has been done to compare the performance of similarity-based tests on rare variants association scenarios, especially when applied with different rare variants pooling strategies. Results: Based on the population genetics simulations and analysis of a publicly-available sequencing data set, we compared the performance of four similarity-based tests and two rare variants pooling strategies. We showed that weighting approach outperforms collapsing under the presence of strong effect from rare variants and under the presence of moderate effect from common variants, whereas collapsing of rare variants is preferable when common variants possess a strong effect. We also demonstrated that the difference in statistical power between the two pooling strategies may be substantial. The results also highlighted consistently high power of two similarity-based approaches when applied with an appropriate pooling strategy. Conclusions: Population genetics simulations and sequencing data set analysis showed high power of two similarity-based tests and a substantial difference in power between the two pooling strategies.
Affiliation(s)
- Sergii Zakharov
- Human Genetics, Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672, Singapore.
39
Xu J, Zheng G, Yuan A. Case-Control Genome-wide Joint Association Study Using Semiparametric Empirical Model and Approximate Bayes Factor. J Stat Comput Simul 2013; 83:1191-1209. [PMID: 24532860 DOI: 10.1080/00949655.2011.654119] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/14/2022]
Abstract
We propose a semiparametric approach for the analysis of case-control genome-wide association study. Parametric components are used to model both the conditional distribution of the case status given the covariates and the distribution of genotype counts, whereas the distribution of the covariates are modeled nonparametrically. This yields a direct and joint modeling of the case status, covariates and genotype counts, and gives better understanding of the disease mechanism and results in more reliable conclusions. Side information, such as the disease prevalence, can be conveniently incorporated into the model by empirical likelihood approach and leads to more efficient estimates and powerful test in the detection of disease-associated SNPs. Profiling is used to eliminate a nuisance nonparametric component, and the resulting profile empirical likelihood estimates are shown to be consistent and asymptotically normal. For the hypothesis test on disease association, we apply the approximate Bayes factor (ABF) which is computationally simple and most desirable in genome-wide association studies where hundreds of thousands to a million genetic markers are tested. We treat the approximate Bayes factor as a hybrid Bayes factor which replaces the full data by the maximum likelihood estimates of the parameters of interest in the full model and derive it under a general setting. The deviation from Hardy-Weinberg Equilibrium (HWE) is also taken into account and the ABF for HWE using cases is shown to provide evidence of association between a disease and a genetic marker. Simulation studies and an application are further provided to illustrate the utility of the proposed methodology.
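For readers unfamiliar with approximate Bayes factors, the sketch below shows the widely used normal-prior form attributed to Wakefield. It is offered only as background and is an assumption here: the ABF derived in this paper under its semiparametric model may differ. The function names, the prior variance `w`, and the prior odds are hypothetical illustration values.

```python
import math


def approx_bayes_factor(z, v, w):
    """Normal-prior ABF for H0 (no effect) versus H1.

    z: Wald statistic beta_hat / sqrt(v)
    v: sampling variance of the estimated log odds ratio
    w: prior variance of the effect under H1 (an assumed value)
    Values below 1 favour association; values above 1 favour the null.
    """
    shrink = w / (v + w)
    return math.sqrt((v + w) / v) * math.exp(-0.5 * z * z * shrink)


def posterior_prob_association(abf, prior_odds_null):
    """Posterior probability of H1 given the ABF and prior odds of H0."""
    return 1.0 / (1.0 + abf * prior_odds_null)


abf_null = approx_bayes_factor(0.0, 0.01, 0.04)    # no signal at all
abf_signal = approx_bayes_factor(5.0, 0.01, 0.04)  # strong signal
print(abf_null, abf_signal, posterior_prob_association(abf_signal, 9999.0))
```

Only the estimate and its variance enter the calculation, which is why this kind of factor is cheap enough to evaluate at every marker in a genome-wide scan.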
Affiliation(s)
- Jinfeng Xu
- Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546
- Gang Zheng
- Office of Biostatistics Research, DPPS, National Heart, Lung and Blood Institute, 6701 Rockledge Drive, Bethesda, MD 20892, USA
- Ao Yuan
- National Human Genome Center, Howard University, 2216 Sixth Street N.W., Washington, DC 20059
|
40
|
Li S, Cui Y. Gene-centric gene–gene interaction: A model-based kernel machine method. Ann Appl Stat 2012. [DOI: 10.1214/12-aoas545] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Indexed: 11/19/2022]
|
41
|
Lu Q, Wei C, Ye C, Li M, Elston RC. A likelihood ratio-based Mann-Whitney approach finds novel replicable joint gene action for type 2 diabetes. Genet Epidemiol 2012; 36:583-93. [PMID: 22760990 PMCID: PMC3634342 DOI: 10.1002/gepi.21651] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 01/12/2012] [Revised: 04/09/2012] [Accepted: 05/09/2012] [Indexed: 12/29/2022]
Abstract
The potential importance of the joint action of genes, whether modeled with or without a statistical interaction term, has long been recognized. However, identifying such action has been a great challenge, especially when millions of genetic markers are involved. We propose a likelihood ratio-based Mann-Whitney test to search for joint gene action either among candidate genes or genome-wide. It extends the traditional univariate Mann-Whitney test to assess the joint association of genotypes at multiple loci with disease, allowing for high-order statistical interactions. Because only one overall significance test is conducted for the entire analysis, it avoids the issue of multiple testing. Moreover, the approach adopts a computationally efficient algorithm, making a genome-wide search feasible in a reasonable amount of time on a high performance personal computer. We evaluated the approach using both theoretical and real data. By applying the approach to 40 type 2 diabetes (T2D) susceptibility single-nucleotide polymorphisms (SNPs), we identified a four-locus model strongly associated with T2D in the Wellcome Trust (WT) study (permutation P-value < 0.001), and replicated the same finding in the Nurses' Health Study/Health Professionals Follow-Up Study (NHS/HPFS) (P-value = 3.03×10⁻¹¹). We also conducted a genome-wide search on 385,598 SNPs in the WT study. The analysis took approximately 55 hr on a personal computer, identifying the same first two loci, but overall a different set of four SNPs, jointly associated with T2D (P-value = 1.29×10⁻⁵). The nominal significance of this same association reached 4.01×10⁻⁶ in the NHS/HPFS.
Affiliation(s)
- Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
- Changshuai Wei
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
- Chengyin Ye
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
- Ming Li
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
- Robert C. Elston
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio
|
42
|
Lin WY, Tiwari HK, Gao G, Zhang K, Arcaroli JJ, Abraham E, Liu N. Similarity-based multimarker association tests for continuous traits. Ann Hum Genet 2012; 76:246-60. [PMID: 22497480 DOI: 10.1111/j.1469-1809.2012.00706.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Testing multiple markers simultaneously not only can capture the linkage disequilibrium patterns but also can decrease the number of tests and thus alleviate the multiple-testing penalty. If a gene is associated with a phenotype, subjects with similar genotypes in this gene should also have similar phenotypes. Based on this concept, we have developed a general framework that is applicable to continuous traits. Two similarity-based tests (namely, SIMc and SIMp tests) were derived as special cases of the general framework. In our simulation study, we compared the power of the two tests with that of the single-marker analysis, a standard haplotype regression, and a popular and powerful kernel machine regression. Our SIMc test outperforms other tests when the average R² (a measure of linkage disequilibrium) between the causal variant and the surrounding markers is larger than 0.3 or when the causal allele is common (say, frequency = 0.3). Our SIMp test outperforms other tests when the causal variant was introduced at common haplotypes (the maximum frequency of risk haplotypes >0.4). We also applied our two tests to an adiposity data set to show their utility.
Affiliation(s)
- Wan-Yu Lin
- Department of Biostatistics, University of Alabama at Birmingham, USA
|
43
|
Abstract
Many common human diseases are complex and are expected to be highly heterogeneous, with multiple causative loci and multiple rare and common variants at some of the causative loci contributing to the risk of these diseases. Data from genome-wide association studies (GWAS) and metadata such as known gene functions and pathways provide the possibility of identifying genetic variants, genes and pathways that are associated with complex phenotypes. Single-marker-based tests have been very successful in identifying thousands of genetic variants for hundreds of complex phenotypes. However, these variants only explain very small percentages of the heritabilities. To account for the locus- and allelic-heterogeneity, gene-based and pathway-based tests can be very useful in the next stage of the analysis of GWAS data. U-statistics, which summarize the genomic similarity between pairs of individuals and link the genomic similarity to phenotype similarity, have proved to be very useful for testing the associations between a set of single nucleotide polymorphisms and the phenotypes. Compared to single marker analysis, the advantage afforded by U-statistics-based methods is large when the number of markers involved is large. We review several formulations of U-statistics in genetic association studies and point out the links of these statistics with other similarity-based tests of genetic association. Finally, potential applications of U-statistics in the analysis of next-generation sequencing data and rare-variant association studies are discussed.
Affiliation(s)
- Hongzhe Li
- Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
|
44
|
Wang K. Statistical tests of genetic association for case-control study designs. Biostatistics 2012; 13:724-33. [DOI: 10.1093/biostatistics/kxs002] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022]
|
45
|
A general framework for detecting disease associations with rare variants in sequencing studies. Am J Hum Genet 2011; 89:354-67. [PMID: 21885029 DOI: 10.1016/j.ajhg.2011.07.015] [Citation(s) in RCA: 209] [Impact Index Per Article: 16.1] [Received: 04/28/2011] [Revised: 07/21/2011] [Accepted: 07/26/2011] [Indexed: 12/19/2022]
Abstract
Biological and empirical evidence suggests that rare variants account for a large proportion of the genetic contributions to complex human diseases. Recent technological advances in high-throughput sequencing platforms have made it possible for researchers to generate comprehensive information on rare variants in large samples. We provide a general framework for association testing with rare variants by combining mutation information across multiple variant sites within a gene and relating the enriched genetic information to disease phenotypes through appropriate regression models. Our framework covers all major study designs (i.e., case-control, cross-sectional, cohort and family studies) and all common phenotypes (e.g., binary, quantitative, and age at onset), and it allows arbitrary covariates (e.g., environmental factors and ancestry variables). We derive theoretically optimal procedures for combining rare mutations and construct suitable test statistics for various biological scenarios. The allele-frequency threshold can be fixed or variable. The effects of the combined rare mutations on the phenotype can be in the same direction or different directions. The proposed methods are statistically more powerful and computationally more efficient than existing ones. An application to a deep-resequencing study of drug targets led to a discovery of rare variants associated with total cholesterol. The relevant software is freely available.
|
46
|
Han F, Pan W. A composite likelihood approach to latent multivariate Gaussian modeling of SNP data with application to genetic association testing. Biometrics 2011; 68:307-15. [PMID: 21838810 DOI: 10.1111/j.1541-0420.2011.01649.x] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 12/14/2022]
Abstract
Many statistical tests have been proposed for case-control data to detect disease association with multiple single nucleotide polymorphisms (SNPs) in linkage disequilibrium. The main reason for the existence of so many tests is that each test aims to detect one or two aspects of many possible distributional differences between cases and controls, largely due to the lack of a general and yet simple model for discrete genotype data. Here we propose a latent variable model to represent SNP data: the observed SNP data are assumed to be obtained by discretizing a latent multivariate Gaussian variate. Because the latent variate is multivariate Gaussian, its distribution is completely characterized by its mean vector and covariance matrix, in contrast to much more complex forms of a general distribution for discrete multivariate SNP data. We propose a composite likelihood approach for parameter estimation. A direct application of this latent variable model is to association testing with multiple SNPs in a candidate gene or region. In contrast to many existing tests that aim to detect only one or two aspects of many possible distributional differences of discrete SNP data, we can exclusively focus on testing the mean and covariance parameters of the latent Gaussian distributions for cases and controls. Our simulation results demonstrate potential power gains of the proposed approach over some existing methods.
Affiliation(s)
- Fang Han
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455, USA
|
47
|
Tzeng JY, Zhang D, Pongpanich M, Smith C, McCarthy MI, Sale MM, Worrall BB, Hsu FC, Thomas DC, Sullivan PF. Studying gene and gene-environment effects of uncommon and common variants on continuous traits: a marker-set approach using gene-trait similarity regression. Am J Hum Genet 2011; 89:277-88. [PMID: 21835306 DOI: 10.1016/j.ajhg.2011.07.007] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.0] [Received: 12/07/2010] [Revised: 06/16/2011] [Accepted: 07/13/2011] [Indexed: 11/15/2022]
Abstract
Genomic association analyses of complex traits demand statistical tools that are capable of detecting small effects of common and rare variants and modeling complex interaction effects and yet are computationally feasible. In this work, we introduce a similarity-based regression method for assessing the main genetic and interaction effects of a group of markers on quantitative traits. The method uses genetic similarity to aggregate information from multiple polymorphic sites and integrates adaptive weights that depend on allele frequencies to accommodate common and uncommon variants. Collapsing information at the similarity level instead of the genotype level avoids canceling signals that have the opposite etiological effects and is applicable to any class of genetic variants without the need for dichotomizing the allele types. To assess gene-trait associations, we regress trait similarities for pairs of unrelated individuals on their genetic similarities and assess association by using a score test whose limiting distribution is derived in this work. The proposed regression framework allows for covariates, has the capacity to model both main and interaction effects, can be applied to a mixture of different polymorphism types, and is computationally efficient. These features make it an ideal tool for evaluating associations between phenotype and marker sets defined by linkage disequilibrium (LD) blocks, genes, or pathways in whole-genome analysis.
Affiliation(s)
- Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.
|
48
|
Basu S, Pan W, Oetting WS. A dimension reduction approach for modeling multi-locus interaction in case-control studies. Hum Hered 2011; 71:234-45. [PMID: 21734407 DOI: 10.1159/000328842] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 11/05/2010] [Accepted: 04/12/2011] [Indexed: 01/01/2023]
Abstract
Studying one locus or one single nucleotide polymorphism (SNP) at a time may not be sufficient to understand complex diseases because they are unlikely to result from the effect of only one SNP. Each SNP alone may have little or no effect on the risk of the disease, but together they may increase the risk substantially. Analyses focusing on individual SNPs ignore the possibility of interaction among SNPs. In this paper, we propose a parsimonious model to assess the joint effect of a group of SNPs in a case-control study. The model implements a data reduction strategy within a likelihood framework and uses a test to assess the statistical significance of the effect of the group of SNPs on the binary trait. The primary advantage of the proposed approach is that the dimension reduction technique produces a test statistic with degrees of freedom significantly lower than a multiple logistic regression with only main effects of the SNPs, and our parsimonious model can incorporate the possibility of interaction among the SNPs. Moreover, the proposed approach estimates the direction of association of each SNP with the disease and provides an estimate of the average effect of the group of SNPs positively and negatively associated with the disease in the given SNP set. We illustrate the proposed model on simulated and real data, and compare its performance with a few other existing approaches. Our proposed approach appeared to outperform the other approaches for independent SNPs in our simulation studies.
Affiliation(s)
- Saonli Basu
- Division of Biostatistics, University of Minnesota, Minneapolis, USA. saonli @ umn.edu
|
49
|
Wang L, Jia P, Wolfinger RD, Chen X, Zhao Z. Gene set analysis of genome-wide association studies: methodological issues and perspectives. Genomics 2011; 98:1-8. [PMID: 21565265 PMCID: PMC3852939 DOI: 10.1016/j.ygeno.2011.04.006] [Citation(s) in RCA: 164] [Impact Index Per Article: 12.6] [Received: 12/06/2010] [Revised: 03/02/2011] [Accepted: 04/15/2011] [Indexed: 12/25/2022]
Abstract
Recent studies have demonstrated that gene set analysis, which tests disease association with genetic variants in a group of functionally related genes, is a promising approach for analyzing and interpreting genome-wide association studies (GWAS) data. These approaches aim to increase power by combining association signals from multiple genes in the same gene set. In addition, gene set analysis can shed more light on the biological processes underlying complex diseases. However, current approaches for gene set analysis are still in an early stage of development in that analysis results are often prone to sources of bias, including gene set size and gene length, linkage disequilibrium patterns and the presence of overlapping genes. In this paper, we provide an in-depth review of the gene set analysis procedures, along with parameter choices and the particular methodological challenges at each stage. In addition to providing a survey of recently developed tools, we also classify the analysis methods into larger categories and discuss their strengths and limitations. In the last section, we outline several important areas for improving the analytical strategies in gene set analysis.
Affiliation(s)
- Lily Wang
- Department of Biostatistics, Vanderbilt University, Nashville, TN 37232, USA
- Peilin Jia
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
- Department of Psychiatry, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
- Xi Chen
- Division of Cancer Biostatistics, Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
- Zhongming Zhao
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
- Department of Psychiatry, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
- Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
|
50
|
Li M, Ye C, Fu W, Elston RC, Lu Q. Detecting genetic interactions for quantitative traits with U-statistics. Genet Epidemiol 2011; 35:457-68. [PMID: 21618602 DOI: 10.1002/gepi.20594] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 01/25/2011] [Revised: 03/09/2011] [Accepted: 04/19/2011] [Indexed: 11/08/2022]
Abstract
The genetic etiology of complex human diseases has been commonly viewed as a process that involves multiple genetic variants, environmental factors, as well as their interactions. Statistical approaches, such as the multifactor dimensionality reduction (MDR) and generalized MDR (GMDR), have recently been proposed to test the joint association of multiple genetic variants with either dichotomous or continuous traits. In this study, we propose a novel Forward U-Test to evaluate the combined effect of multiple loci on quantitative traits with consideration of gene-gene/gene-environment interactions. In this new approach, a U-Statistic-based forward algorithm is first used to select potential disease-susceptibility loci and then a weighted U-statistic is used to test the joint association of the selected loci with the disease. Through a simulation study, we found the Forward U-Test outperformed GMDR in terms of greater power. Aside from that, our approach is less computationally intensive, making it feasible for high-dimensional gene-gene/gene-environment research. We illustrate our method with a real data application to nicotine dependence (ND), using three independent datasets from the Study of Addiction: Genetics and Environment. Our gene-gene interaction analysis of 155 SNPs in 67 candidate genes identified two SNPs, rs16969968 within gene CHRNA5 and rs1122530 within gene NTRK2, jointly associated with the level of ND (P-value = 5.31e-7). The association, which involves essential interaction, is replicated in two independent datasets with P-values of 1.08e-5 and 0.02, respectively. Our finding suggests that joint action may exist between the two gene products.
Affiliation(s)
- Ming Li
- Department of Epidemiology, Michigan State University, East Lansing, MI 48824, USA
|