1
Pluta D, Shen T, Xue G, Chen C, Ombao H, Yu Z. Ridge-penalized adaptive Mantel test and its application in imaging genetics. Stat Med 2021; 40:5313-5332. [PMID: 34216035 DOI: 10.1002/sim.9127] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/20/2020] [Revised: 06/01/2021] [Accepted: 06/16/2021] [Indexed: 01/23/2023]
Abstract
We propose a ridge-penalized adaptive Mantel test (AdaMant) for evaluating the association of two high-dimensional sets of features. By introducing a ridge penalty, AdaMant tests the association across many metrics simultaneously. We demonstrate how ridge penalization bridges Euclidean and Mahalanobis distances and their corresponding linear models from the perspective of association measurement and testing. This result is not only theoretically interesting but also has important implications in penalized hypothesis testing, especially in high-dimensional settings such as imaging genetics. Applying the proposed method to an imaging genetic study of visual working memory in healthy adults, we identified interesting associations of brain connectivity (measured by electroencephalogram coherence) with selected genetic features.
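As a rough illustration of how a ridge penalty can interpolate between the two distances, consider a generic ridge-regularized quadratic form. This is a sketch under stated assumptions, not necessarily the statistic defined in the paper: the symbols x_i, x_j (feature vectors), S (an invertible sample covariance matrix), and λ (the ridge parameter) are introduced here only for illustration.

```latex
% Ridge-regularized squared distance between feature vectors x_i and x_j
% (illustrative notation; S is assumed invertible).
d_\lambda^2(x_i, x_j) = (x_i - x_j)^\top (S + \lambda I)^{-1} (x_i - x_j),
\qquad \lambda \ge 0 .

% As \lambda \to 0, the squared Mahalanobis distance is recovered:
\lim_{\lambda \to 0} d_\lambda^2(x_i, x_j) = (x_i - x_j)^\top S^{-1} (x_i - x_j) .

% As \lambda \to \infty, the rescaled distance approaches the squared
% Euclidean distance, because \lambda (S + \lambda I)^{-1} = (S/\lambda + I)^{-1} \to I:
\lim_{\lambda \to \infty} \lambda \, d_\lambda^2(x_i, x_j) = \lVert x_i - x_j \rVert_2^2 .
```

Under this reading, scanning over several values of λ amounts to testing association under a family of metrics lying between the Mahalanobis and Euclidean extremes.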
Affiliation(s)
- Dustin Pluta
- Department of Statistics, University of California, Irvine, Irvine, California, USA
- Tong Shen
- Department of Statistics, University of California, Irvine, Irvine, California, USA
- Gui Xue
- Center for Brain and Learning Science, Beijing Normal University, Beijing, China
- Chuansheng Chen
- Department of Psychology and Social Behavior, University of California, Irvine, Irvine, California, USA
- Hernando Ombao
- Statistics Program, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Zhaoxia Yu
- Department of Statistics, University of California, Irvine, Irvine, California, USA
2
Manavalan R, Priya S. Genetic interactions effects for cancer disease identification using computational models: a review. Med Biol Eng Comput 2021; 59:733-758. [PMID: 33839998 DOI: 10.1007/s11517-021-02343-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/20/2020] [Accepted: 03/10/2021] [Indexed: 11/29/2022]
Abstract
Genome-wide association studies (GWAS) provide insight into the genetic variations and environmental influences responsible for various human diseases. Identifying cancer through genetic interactions (epistasis) is an active area of GWAS research. Cancer cell growth arises from multi-locus and complex genetic interactions, and it is impractical for a physician to detect cancer by manually examining SNP interactions. Several computational approaches have therefore been developed to infer epistatic effects. This article presents a comprehensive and multifaceted review of all relevant genetic studies published between 2001 and 2020. It covers multifactor dimensionality reduction-based approaches, statistical strategies, machine learning, and optimization-based techniques, together with their evaluation results, and describes the strengths and limitations of each. It also analyzes the issues behind computational methods for identifying cancer through genetic interactions and the evaluation parameters used by researchers. The review is intended to help researchers and medical professionals learn techniques for discovering epistasis and to aid the design of novel automatic epistasis detection systems that are robust and efficient.
Affiliation(s)
- R Manavalan
- Department of Computer Science, Arignar Anna Government Arts College, Villupuram, Tamil Nadu, 605602, India.
- S Priya
- Computer Science, Arignar Anna Government Arts College, Villupuram, Tamil Nadu, India
3
Yi M, Negishi M, Lee SJ. Estrogen Sulfotransferase (SULT1E1): Its Molecular Regulation, Polymorphisms, and Clinical Perspectives. J Pers Med 2021; 11:jpm11030194. [PMID: 33799763 PMCID: PMC8001535 DOI: 10.3390/jpm11030194] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 01/11/2021] [Revised: 03/05/2021] [Accepted: 03/08/2021] [Indexed: 12/18/2022]
Abstract
Estrogen sulfotransferase (SULT1E1) is a phase II enzyme that sulfates estrogens to inactivate them and regulate their homeostasis. This enzyme is also involved in the sulfation of thyroid hormones and several marketed medicines. Though the action of SULT1E1 in molecular/pathological biology has been extensively studied, its genetic variants and their functional characterization have received comparatively little attention. Genetic variants of this gene are associated with some diseases, especially sex-hormone-related cancers. Understanding the role and polymorphisms of SULT1E1 is crucial to establishing its clinical relevance; therefore, this study gathered and reviewed the literature to outline several aspects of the function, molecular regulation, and polymorphisms of SULT1E1.
Affiliation(s)
- MyeongJin Yi
- Pharmacogenetics Section, Reproductive and Developmental Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC 27709, USA
- Masahiko Negishi
- Pharmacogenetics Section, Reproductive and Developmental Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC 27709, USA
- Su-Jun Lee
- Department of Pharmacology and Pharmacogenomics Research Center, Inje University College of Medicine, Inje University, Bokji-ro 75, Busanjin-gu, Busan 47392, Korea
- Correspondence: Tel.: +82-51-890-8665
4
Wei C, Li M, Wen Y, Ye C, Lu Q. A multi-locus predictiveness curve and its summary assessment for genetic risk prediction. Stat Methods Med Res 2020; 29:44-56. [PMID: 30612522 PMCID: PMC6612460 DOI: 10.1177/0962280218819202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Genetic association studies using high-throughput genotyping and sequencing technologies have identified a large number of genetic variants associated with complex human diseases. These findings have provided an unprecedented opportunity to identify individuals in the population at high risk for disease who carry causal genetic mutations and hold great promise for early intervention and individualized medicine. While interest is high in building risk prediction models based on recent genetic findings, it is crucial to have appropriate statistical measurements to assess the performance of a genetic risk prediction model. Predictiveness curves were recently proposed as a graphic tool for evaluating a risk prediction model on the basis of a single continuous biomarker. The curve evaluates a risk prediction model for classification performance as well as its usefulness when applied to a population. In this article, we extend the predictiveness curve to measure the collective contribution of multiple genetic variants. We further propose a nonparametric, U-statistics-based measurement, referred to as the U-Index, to quantify the performance of a multi-locus predictiveness curve. In particular, a global U-Index and a partial U-Index can be used in the general population and a subpopulation of particular clinical interest, respectively. Through simulation studies, we demonstrate that the proposed U-Index has advantages over several existing summary statistics under various disease models. We also show that the partial U-Index can have its own uniqueness when rare variants have a substantial contribution to disease risk. Finally, we use the proposed predictiveness curve and its corresponding U-Index to evaluate the performance of a genetic risk prediction model for nicotine dependence.
Affiliation(s)
- Changshuai Wei
- Core Artificial Intelligence, Amazon.com Inc, Seattle, WA, USA
- Ming Li
- Department of Epidemiology and Biostatistics, Indiana University at Bloomington, Bloomington, IN, USA
- Yalu Wen
- Department of Statistics, University of Auckland, Auckland, New Zealand
- Chengyin Ye
- Department of Health Management, Hangzhou Normal University, Hangzhou, China
- Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, USA
5
Zhao Y, Zhu H, Lu Z, Knickmeyer RC, Zou F. Structured Genome-Wide Association Studies with Bayesian Hierarchical Variable Selection. Genetics 2019; 212:397-415. [PMID: 31010934 PMCID: PMC6553832 DOI: 10.1534/genetics.119.301906] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/07/2019] [Accepted: 04/08/2019] [Indexed: 02/04/2023]
Abstract
It is increasingly important to use genome-wide association studies (GWAS) to select genetic information associated with qualitative or quantitative traits. Currently, the discovery of biological association among SNPs motivates various strategies to construct SNP-sets along the genome and to incorporate such set information into the selection procedure for higher selection power, while facilitating more biologically meaningful results. The aim of this paper is to propose a novel Bayesian framework for hierarchical variable selection at both the SNP-set (group) level and the SNP (within-group) level. We overcome a key limitation of the posterior updating schemes in most Bayesian variable selection methods by proposing a novel sampling scheme that explicitly accommodates the ultrahigh dimensionality of genetic data. Specifically, by constructing an auxiliary variable selection model at the SNP-set level, the new procedure uses the posterior samples of the auxiliary model to guide the posterior inference for the targeted hierarchical selection model. We apply the proposed method to a variety of simulation studies and show that our method is computationally efficient and achieves substantially better performance than competing approaches in both SNP-set and SNP selection. Applying the method to the Alzheimer's Disease Neuroimaging Initiative (ADNI) data, we identify biologically meaningful genetic factors under several neuroimaging volumetric phenotypes. Our method is general and can be readily applied to a wide range of biomedical studies.
Affiliation(s)
- Yize Zhao
- Department of Healthcare Policy and Research, Cornell University Weill Cornell, New York, New York 10065
- Hongtu Zhu
- Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599
- Zhaohua Lu
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, Tennessee 38105
- Rebecca C Knickmeyer
- Department of Pediatrics and Human Development, Michigan State University, East Lansing, Michigan 48824
- Fei Zou
- Department of Biostatistics, University of Florida, Gainesville, Florida 32611
6
Hou TT, Lin F, Bai S, Cleves MA, Xu HM, Lou XY. Generalized multifactor dimensionality reduction approaches to identification of genetic interactions underlying ordinal traits. Genet Epidemiol 2018; 43:24-36. [PMID: 30387901 DOI: 10.1002/gepi.22169] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 05/18/2018] [Revised: 08/31/2018] [Accepted: 09/21/2018] [Indexed: 12/11/2022]
Abstract
The manifestation of complex traits is influenced by gene-gene and gene-environment interactions, and the identification of multifactor interactions is an important but challenging undertaking for genetic studies. Many complex phenotypes such as disease severity are measured on an ordinal scale with more than two categories. A proportional odds model can improve statistical power for these outcomes, when compared to a logit model either collapsing the categories into two mutually exclusive groups or limiting the analysis to pairs of categories. In this study, we propose a proportional odds model-based generalized multifactor dimensionality reduction (GMDR) method for detection of interactions underlying polytomous ordinal phenotypes. Computer simulations demonstrated that this new GMDR method has a higher power and more accurate predictive ability than the GMDR methods based on a logit model and a multinomial logit model. We applied this new method to the genetic analysis of low-density lipoprotein (LDL) cholesterol, a causal risk factor for coronary artery disease, in the Multi-Ethnic Study of Atherosclerosis, and identified a significant joint action of the CELSR2, SERPINA12, HPGD, and APOB genes. This finding provides new information to advance the limited knowledge about genetic regulation and gene interactions in metabolic pathways of LDL cholesterol. In conclusion, the proportional odds model-based GMDR is a useful tool that can boost statistical power and prediction accuracy in studying multifactor interactions underlying ordinal traits.
Affiliation(s)
- Ting-Ting Hou
- Biostatistics Program, Department of Pediatrics, University of Arkansas for Medical Sciences, Little Rock, Arkansas; Arkansas Children's Research Institute, Little Rock, Arkansas; Institute of Bioinformatics and Institute of Crop Science, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Feng Lin
- Institute of Bioinformatics and Institute of Crop Science, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Shasha Bai
- Biostatistics Program, Department of Pediatrics, University of Arkansas for Medical Sciences, Little Rock, Arkansas; Arkansas Children's Research Institute, Little Rock, Arkansas
- Mario A Cleves
- Biostatistics Program, Department of Pediatrics, University of Arkansas for Medical Sciences, Little Rock, Arkansas; Arkansas Children's Research Institute, Little Rock, Arkansas
- Hai-Ming Xu
- Biostatistics Program, Department of Pediatrics, University of Arkansas for Medical Sciences, Little Rock, Arkansas; Arkansas Children's Research Institute, Little Rock, Arkansas; Institute of Bioinformatics and Institute of Crop Science, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Xiang-Yang Lou
- Biostatistics Program, Department of Pediatrics, University of Arkansas for Medical Sciences, Little Rock, Arkansas; Arkansas Children's Research Institute, Little Rock, Arkansas; Arkansas Children's Nutrition Center, Little Rock, Arkansas
7
Reexamining Dis/Similarity-Based Tests for Rare-Variant Association with Case-Control Samples. Genetics 2018; 209:105-113. [PMID: 29545466 DOI: 10.1534/genetics.118.300769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2018] [Accepted: 03/02/2018] [Indexed: 11/18/2022]
Abstract
A properly designed distance-based measure can capture informative genetic differences among individuals with different phenotypes and can be used to detect variants responsible for the phenotypes. To detect associated variants, various tests have been designed to contrast genetic dissimilarity or similarity scores of certain subject groups in different ways, among which the most widely used strategy is to quantify the difference between the within-group genetic dissimilarity/similarity (i.e., case-case and control-control similarities) and the between-group dissimilarity/similarity (i.e., case-control similarities). While it has been noted that for common variants the within-group and the between-group measures should all be included, in this work we show that for rare variants, comparison based on the two within-group measures can more effectively quantify the genetic difference between cases and controls. The between-group measure tends to overlap with one of the two within-group measures for rare variants, although such overlap is not present for common variants. Consequently, a dissimilarity or similarity test that includes the between-group information tends to attenuate the association signals and leads to power loss. Based on these findings, we propose a dissimilarity test that compares the degree of SNP dissimilarity within cases to that within controls to better characterize the difference between two disease phenotypes. We provide the statistical properties, asymptotic distribution, and small-sample computation details of the proposed test. We use simulated and real sequence data to assess the performance of the proposed test, comparing it with other rare-variant methods including those similarity-based tests that use both within-group and between-group information. As similarity-based approaches are among the dominant approaches in rare-variant analysis, our results provide some insight for the effective detection of rare variants.
8
Jadhav S, Tong X, Lu Q. A functional U-statistic method for association analysis of sequencing data. Genet Epidemiol 2017; 41:636-643. [PMID: 28850771 DOI: 10.1002/gepi.22063] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/09/2017] [Revised: 06/06/2017] [Accepted: 07/10/2017] [Indexed: 11/08/2022]
Abstract
Although sequencing studies hold great promise for uncovering novel variants predisposing to human diseases, the high dimensionality of the sequencing data brings tremendous challenges to data analysis. Moreover, for many complex diseases (e.g., psychiatric disorders) multiple related phenotypes are collected. These phenotypes can be different measurements of an underlying disease, or measurements characterizing multiple related diseases for studying common genetic mechanism. Although jointly analyzing these phenotypes could potentially increase the power of identifying disease-associated genes, the different types of phenotypes pose challenges for association analysis. To address these challenges, we propose a nonparametric method, functional U-statistic method (FU), for multivariate analysis of sequencing data. It first constructs smooth functions from individuals' sequencing data, and then tests the association of these functions with multiple phenotypes by using a U-statistic. The method provides a general framework for analyzing various types of phenotypes (e.g., binary and continuous phenotypes) with unknown distributions. Fitting the genetic variants within a gene using a smoothing function also allows us to capture complexities of gene structure (e.g., linkage disequilibrium, LD), which could potentially increase the power of association analysis. Through simulations, we compared our method to the multivariate outcome score test (MOST), and found that our test attained better performance than MOST. In a real data application, we apply our method to the sequencing data from Minnesota Twin Study (MTS) and found potential associations of several nicotine receptor subunit (CHRN) genes, including CHRNB3, associated with nicotine dependence and/or alcohol dependence.
Affiliation(s)
- Sneha Jadhav
- Department of Statistics and Probability, Michigan State University, East Lansing, Michigan, United States of America
- Xiaoran Tong
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, United States of America
- Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, United States of America
9
Links Between the Sequence Kernel Association and the Kernel-Based Adaptive Cluster Tests. Statistics in Biosciences 2017. [DOI: 10.1007/s12561-016-9175-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/26/2022]
10
Gordon D, Londono D, Patel P, Kim W, Finch SJ, Heiman GA. An Analytic Solution to the Computation of Power and Sample Size for Genetic Association Studies under a Pleiotropic Mode of Inheritance. Hum Hered 2017; 81:194-209. [PMID: 28315880 DOI: 10.1159/000457135] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/01/2016] [Accepted: 01/20/2017] [Indexed: 01/14/2023]
Abstract
Our motivation here is to calculate the power of 3 statistical tests used when there are genetic traits that operate under a pleiotropic mode of inheritance and when qualitative phenotypes are defined by use of thresholds for the multiple quantitative phenotypes. Specifically, we formulate a multivariate function that provides the probability that an individual has a vector of specific quantitative trait values conditional on having a risk locus genotype, and we apply thresholds to define qualitative phenotypes (affected, unaffected) and compute penetrances and conditional genotype frequencies based on the multivariate function. We extend the analytic power and minimum-sample-size-necessary (MSSN) formulas for 2 categorical data-based tests (genotype, linear trend test [LTT]) of genetic association to the pleiotropic model. We further compare the MSSN of the genotype test and the LTT with that of a multivariate ANOVA (Pillai). We approximate the MSSN for statistics by linear models using a factorial design and ANOVA. With ANOVA decomposition, we determine which factors most significantly change the power/MSSN for all statistics. Finally, we determine which test statistics have the smallest MSSN. In this work, MSSN calculations are for 2 traits (bivariate distributions) only (for illustrative purposes). We note that the calculations may be extended to address any number of traits. Our key findings are that the genotype test usually has lower MSSN requirements than the LTT. More inclusive thresholds (top/bottom 25% vs. top/bottom 10%) have higher sample size requirements. The Pillai test has a much larger MSSN than both the genotype test and the LTT, as a result of sample selection. With these formulas, researchers can specify how many subjects they must collect to localize genes for pleiotropic phenotypes.
Affiliation(s)
- Derek Gordon
- Department of Genetics, The State University of New Jersey, Piscataway, NJ, USA
11
Li M, Wei C, Wen Y, Wang T, Lu Q. Detecting Gene-Gene Interactions Associated with Multiple Complex Traits with U-Statistics. Curr Genomics 2016; 17:403-415. [PMID: 28479869 PMCID: PMC5320542 DOI: 10.2174/1389202917666160513100946] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 02/11/2015] [Revised: 05/26/2015] [Accepted: 06/06/2015] [Indexed: 12/02/2022]
Abstract
Many complex diseases, such as psychiatric and behavioral disorders, are commonly characterized through various measurements that reflect physical, behavioral and psychological aspects of diseases. While it remains a great challenge to find a unified measurement to characterize a disease, the available multiple phenotypes can be analyzed jointly in the genetic association study. Simultaneously testing these phenotypes has many advantages, including considering different aspects of the disease in the analysis, and utilizing correlated phenotypes to improve the power of detecting disease-associated variants. Furthermore, complex diseases are likely caused by the interplay of multiple genetic variants through complicated mechanisms. Considering gene-gene interactions in the joint association analysis of complex diseases could further increase our ability to discover genetic variants involved in complex disease pathways. In this article, we propose a stepwise U-test for joint association analysis of multiple loci and multiple phenotypes. Through simulations, we demonstrated that testing multiple phenotypes simultaneously could attain higher power than testing one single phenotype at a time, especially when there are shared genes contributing to multiple phenotypes. We also illustrated the proposed method with an application to Nicotine Dependence (ND), using datasets from the Study of Addiction, Genetics and Environment (SAGE). The joint analysis of three ND phenotypes identified two SNPs, rs10508649 and rs2491397, and reached a nominal P-value of 3.79e-13. The association was further replicated in two independent datasets with P-values of 2.37e-05 and 7.46e-05.
Affiliation(s)
- Qing Lu
- Address correspondence to this author at the Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, P.R. China; Tel: 517.353.8623 x137; Fax: 517.432.1130
12
Lu ZH, Zhu H, Knickmeyer RC, Sullivan PF, Williams SN, Zou F. Multiple SNP Set Analysis for Genome-Wide Association Studies Through Bayesian Latent Variable Selection. Genet Epidemiol 2015; 39:664-77. [PMID: 26515609 DOI: 10.1002/gepi.21932] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 03/25/2015] [Revised: 07/23/2015] [Accepted: 08/18/2015] [Indexed: 11/07/2022]
Abstract
The power of genome-wide association studies (GWAS) for mapping complex traits with single-SNP analysis (where SNP is single-nucleotide polymorphism) may be undermined by modest SNP effect sizes, unobserved causal SNPs, correlation among adjacent SNPs, and SNP-SNP interactions. Alternative approaches for testing the association between a single SNP set and individual phenotypes have been shown to be promising for improving the power of GWAS. We propose a Bayesian latent variable selection (BLVS) method to simultaneously model the joint association mapping between a large number of SNP sets and complex traits. Compared with single SNP set analysis, such joint association mapping not only accounts for the correlation among SNP sets but also is capable of detecting causal SNP sets that are marginally uncorrelated with traits. The spike-and-slab prior assigned to the effects of SNP sets can greatly reduce the dimension of effective SNP sets, while speeding up computation. An efficient Markov chain Monte Carlo algorithm is developed. Simulations demonstrate that BLVS outperforms several competing variable selection methods in some important scenarios.
Affiliation(s)
- Zhao-Hua Lu
- Department of Biostatistics, University of North Carolina at Chapel Hill, North Carolina, United States of America
- Hongtu Zhu
- Department of Biostatistics, University of North Carolina at Chapel Hill, North Carolina, United States of America; Biomedical Research Imaging Center, University of North Carolina at Chapel Hill, North Carolina, United States of America
- Rebecca C Knickmeyer
- Department of Psychiatry, University of North Carolina at Chapel Hill, North Carolina, United States of America
- Patrick F Sullivan
- Department of Genetics, University of North Carolina at Chapel Hill, North Carolina, United States of America
- Stephanie N Williams
- Department of Genetics, University of North Carolina at Chapel Hill, North Carolina, United States of America
- Fei Zou
- Department of Biostatistics, University of North Carolina at Chapel Hill, North Carolina, United States of America
13
Wang C, Kao WH, Hsiao CK. Using Hamming Distance as Information for SNP-Sets Clustering and Testing in Disease Association Studies. PLoS One 2015; 10:e0135918. [PMID: 26302001 PMCID: PMC4547758 DOI: 10.1371/journal.pone.0135918] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 10/28/2014] [Accepted: 07/28/2015] [Indexed: 11/27/2022]
Abstract
The availability of high-throughput genomic data has led to several challenges in recent genetic association studies, including the large number of genetic variants that must be considered and the computational complexity in statistical analyses. Tackling these problems with a marker-set study such as SNP-set analysis can be an efficient solution. To construct SNP-sets, we first propose a clustering algorithm, which employs Hamming distance to measure the similarity between strings of SNP genotypes and evaluates whether the given SNPs or SNP-sets should be clustered. A dendrogram can then be constructed based on this distance measure, and the number of clusters can be determined. With the resulting SNP-sets, we next develop an association test, HDAT, to examine susceptibility to the disease of interest. This proposed test assesses, based on Hamming distance, whether the similarity between a diseased and a normal individual differs from the similarity between two individuals of the same disease status. In our proposed methodology, only genotype information is needed. No inference of haplotypes is required, and the SNPs under consideration do not need to be located in nearby regions. The proposed clustering algorithm and association test are illustrated with applications and simulation studies. Compared with other existing methods, the clustering algorithm is faster and better at identifying sets containing SNPs exerting a similar effect. In addition, the simulation studies demonstrated that the proposed test works well for SNP-sets containing a large proportion of neutral SNPs. Furthermore, employing the clustering algorithm before testing a large set of data helps narrow down the genetic regions containing susceptibility markers.
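The Hamming-distance idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' clustering algorithm or the HDAT statistic: the 0/1/2 genotype coding, the function names, and the toy data are all assumptions made here for the example. It computes the distance between genotype strings and contrasts mean within-group and between-group distances.

```python
from itertools import combinations, product
from typing import List, Sequence

# Assumed coding: one integer per SNP, e.g. the minor-allele count 0, 1 or 2.
Genotype = Sequence[int]


def hamming_distance(a: Genotype, b: Genotype) -> int:
    """Count the SNP positions at which two genotype strings differ."""
    if len(a) != len(b):
        raise ValueError("genotype strings must cover the same SNPs")
    return sum(x != y for x, y in zip(a, b))


def mean_within(group: List[Genotype]) -> float:
    """Mean Hamming distance over all unordered pairs inside one group."""
    pairs = list(combinations(group, 2))
    if not pairs:
        raise ValueError("need at least two individuals in the group")
    return sum(hamming_distance(a, b) for a, b in pairs) / len(pairs)


def mean_between(group_a: List[Genotype], group_b: List[Genotype]) -> float:
    """Mean Hamming distance over all pairs with one member from each group."""
    pairs = list(product(group_a, group_b))
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(hamming_distance(a, b) for a, b in pairs) / len(pairs)


# Toy data (hypothetical): three cases and two controls typed at four SNPs.
cases = [[0, 1, 2, 0], [0, 1, 2, 1], [0, 1, 1, 0]]
controls = [[2, 0, 0, 0], [2, 0, 1, 0]]

within_cases = mean_within(cases)
between = mean_between(cases, controls)
```

A between-group mean that is much larger than the within-group means, as in this toy data, is the kind of contrast a distance-based association test would then assess for significance, for example by permuting the case and control labels.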
Affiliation(s)
- Charlotte Wang
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, 100, Taiwan
- Wen-Hsin Kao
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, 100, Taiwan
- Chuhsing Kate Hsiao
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, 100, Taiwan
- Bioinformatics and Biostatistics Core, Division of Genomic Medicine, Research Center for Medical Excellence, National Taiwan University, Taipei, 100, Taiwan
- Department of Public Health, National Taiwan University, Taipei, 100, Taiwan
14
Zhang W, Li Q. Nonparametric Risk and Nonparametric Odds in Quantitative Genetic Association Studies. Sci Rep 2015; 5:12105. [PMID: 26174851 PMCID: PMC5378889 DOI: 10.1038/srep12105] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 08/22/2014] [Accepted: 06/17/2015] [Indexed: 12/30/2022]
Abstract
The coefficient in a linear regression model is commonly employed to evaluate the genetic effect of a single nucleotide polymorphism associated with a quantitative trait under the assumption that the trait value follows a normal distribution or is approximately normally distributed after a certain transformation. When this assumption is violated, distribution-free tests are preferred. In this work, we propose the nonparametric risk (NR) and nonparametric odds (NO), obtain the asymptotic normal distribution of the estimated NR and then construct the confidence intervals. We also define the genetic models using NR, and construct the test statistic under a given genetic model as well as a robust test that is free of the genetic uncertainty. Simulation studies show that the proposed confidence intervals have satisfactory coverage probabilities and that the proposed test can control the type I error rates and is more powerful than the existing ones under most of the considered scenarios. Application to the PTPN22 gene and the genomic region 6p21.33 from the Genetic Analysis Workshop 16 for association with the anti-cyclic citrullinated protein antibody further demonstrates their performance.
Affiliation(s)
- Wei Zhang
- Key Laboratory of Systems Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
| | - Qizhai Li
- Key Laboratory of Systems Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
|
15
|
A powerful nonparametric statistical framework for family-based association analyses. Genetics 2015; 200:69-78. [PMID: 25745024 DOI: 10.1534/genetics.115.175174] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 01/28/2015] [Accepted: 02/23/2015] [Indexed: 01/04/2023] Open
Abstract
Family-based study design is commonly used in genetic research. It has many ideal features, including being robust to population stratification (PS). With the advance of high-throughput technologies and ever-decreasing genotyping cost, it has become common for family studies to examine a large number of variants for their associations with disease phenotypes. The yield from the analysis of these family-based genetic data can be enhanced by adopting computationally efficient and powerful statistical methods. We propose a general framework of a family-based U-statistic, referred to as family-U, for family-based association studies. Unlike existing parametric-based methods, the proposed method makes no assumption of the underlying disease models and can be applied to various phenotypes (e.g., binary and quantitative phenotypes) and pedigree structures (e.g., nuclear families and extended pedigrees). By using only within-family information, it can offer robust protection against PS. In the absence of PS, it can also utilize additional information (i.e., between-family information) for power improvement. Through simulations, we demonstrated that family-U attained higher power over a commonly used method, family-based association tests, under various disease scenarios. We further illustrated the new method with an application to large-scale family data from the Framingham Heart Study. By utilizing additional information (i.e., between-family information), family-U confirmed a previous association of CHRNA5 with nicotine dependence.
|
16
|
Pan W. Relationship between genomic distance-based regression and kernel machine regression for multi-marker association testing. Genet Epidemiol 2011; 35:211-6. [PMID: 21308765 DOI: 10.1002/gepi.20567] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Received: 08/18/2010] [Revised: 11/21/2010] [Accepted: 01/04/2011] [Indexed: 11/10/2022]
Abstract
To detect genetic association with common and complex diseases, two powerful yet quite different multimarker association tests have been proposed, genomic distance-based regression (GDBR) (Wessel and Schork [2006] Am J Hum Genet 79:821–833) and kernel machine regression (KMR) (Kwee et al. [2008] Am J Hum Genet 82:386–397; Wu et al. [2010] Am J Hum Genet 86:929–942). GDBR is based on relating a multimarker similarity metric for a group of subjects to variation in their trait values, while KMR is based on nonparametric estimates of the effects of the multiple markers on the trait through a kernel function or kernel matrix. Since the two approaches are both powerful and general, but appear quite different, it is important to know their specific relationships. In this report, we show that, under the condition that there is no other covariate, there is a striking correspondence between the two approaches for a quantitative or a binary trait: if the same positive semi-definite matrix is used as the centered similarity matrix in GDBR and as the kernel matrix in KMR, the F-test statistic in GDBR and the score test statistic in KMR are equal (up to some ignorable constants). The result is based on the connections of both methods to linear or logistic (random-effects) regression models.
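The correspondence stated in this abstract can be checked numerically in the simplest case (quantitative trait, no covariates). The sketch below assumes the usual pseudo-F form F = tr(HGH) / tr((I-H)G(I-H)) with H built from the centered trait, and takes Q = y'Gy as the quadratic form at the core of a kernel score statistic; it then confirms that F is a monotone function of Q. The helper names and toy data are hypothetical, and this is an illustration of the algebra rather than the paper's derivation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matmul(A, B):
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


def double_center(K):
    """Return (I - 11'/n) K (I - 11'/n) for a square matrix K."""
    n = len(K)
    row_means = [sum(r) / n for r in K]
    col_means = [sum(c) / n for c in zip(*K)]
    grand = sum(row_means) / n
    return [[K[i][j] - row_means[i] - col_means[j] + grand
             for j in range(n)] for i in range(n)]


# Toy data (hypothetical): 5 subjects, 3 SNPs coded 0/1/2, one quantitative trait.
X = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2], [1, 0, 0]]
trait = [1.2, 0.4, 2.5, 1.9, 0.1]
n = len(trait)

K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]  # linear kernel XX'
G = double_center(K)                                         # centered similarity

mean = sum(trait) / n
y = [t - mean for t in trait]
yy = dot(y, y)

# Quadratic form used by a kernel score test (constants dropped).
Q = sum(y[i] * G[i][j] * y[j] for i in range(n) for j in range(n))

# Pseudo-F with H = y (y'y)^{-1} y'.
H = [[y[i] * y[j] / yy for j in range(n)] for i in range(n)]
R = [[(1.0 if i == j else 0.0) - H[i][j] for j in range(n)] for i in range(n)]
F = trace(matmul(matmul(H, G), H)) / trace(matmul(matmul(R, G), R))

# Closed form: F = Q / (y'y * tr(G) - Q), which is increasing in Q.
F_from_Q = Q / (yy * trace(G) - Q)
print(F, F_from_Q)
```

Because tr(G) and y'y are unchanged when the trait values are permuted, ranking permutations by F or by Q gives the same ordering in this setting, which is one way to read "equal up to some ignorable constants".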
Affiliation(s)
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455–0392, USA.
|
17
|
Li Z, Yuan A, Han G, Gao G, Li Q. Rank-based tests for identifying multiple genetic variants associated with quantitative traits. Ann Hum Genet 2015; 78:306-10. [PMID: 24942081 DOI: 10.1111/ahg.12067] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/25/2022]
Abstract
We consider the analysis of multiple genetic variants within a gene or a region that are expected to confer risks to human complex diseases with quantitative traits, where the trait values do not follow the normal distribution even after some transformations. We rank the phenotypic values, calculate a score to measure the trend effect of a particular allele for each marker, and then construct three statistics that combine the scores for different marker loci, based on the quadratic frameworks of Hotelling's T², the summation of squared univariate statistics, and the inverse of the square root weighted statistics. Simulation results show that these three test statistics can control the type I error rate well and are more robust than standard tests constructed based on linear regression. Application to GAW16 data for rheumatoid arthritis successfully detects the association between the HLA-DRB1 gene and the anti-cyclic citrullinated protein measure, while the standard methods based on the normality assumption cannot detect this association.
|
18
|
Wei C, Li M, He Z, Vsevolozhskaya O, Schaid DJ, Lu Q. A weighted U-statistic for genetic association analyses of sequencing data. Genet Epidemiol 2014; 38:699-708. [PMID: 25331574 DOI: 10.1002/gepi.21864] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 02/13/2014] [Revised: 08/15/2014] [Accepted: 09/05/2014] [Indexed: 12/13/2022]
Abstract
With advancements in next-generation sequencing technology, a massive amount of sequencing data is generated, which offers a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, the high-dimensional sequencing data poses a great challenge for statistical analysis. The association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a Weighted U Sequencing test, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a nonparametric U-statistic, WU-SEQ makes no assumption of the underlying disease model and phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed a commonly used sequence kernel association test (SKAT) method when the underlying assumptions were violated (e.g., the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained comparable performance to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS), and detected an association between ANGPTL4 and very low density lipoprotein cholesterol.
Affiliation(s)
- Changshuai Wei
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, United States of America; Department of Biostatistics and Epidemiology, University of North Texas Health Science Center, Fort Worth, Texas, United States of America
|
19
|
Xu Z, Shen X, Pan W. Longitudinal analysis is more powerful than cross-sectional analysis in detecting genetic association with neuroimaging phenotypes. PLoS One 2014; 9:e102312. [PMID: 25098835 PMCID: PMC4123854 DOI: 10.1371/journal.pone.0102312] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Received: 04/15/2014] [Accepted: 06/17/2014] [Indexed: 01/08/2023] Open
Abstract
Most existing genome-wide association analyses are cross-sectional, utilizing only phenotypic data at a single time point, e.g. baseline. On the other hand, longitudinal studies, such as the Alzheimer's Disease Neuroimaging Initiative (ADNI), collect phenotypic information at multiple time points. In this article, as a case study, we conducted both longitudinal and cross-sectional analyses of the ADNI data with several brain imaging (not clinical diagnosis) phenotypes, demonstrating the power gains of longitudinal analysis over cross-sectional analysis. Specifically, we scanned genome-wide single nucleotide polymorphisms (SNPs) with 56 brain-wide imaging phenotypes processed by FreeSurfer on 638 subjects. At the genome-wide significance level (P < 1.8 × 10⁻⁹) or a less stringent level (e.g. P < 10⁻⁷), longitudinal analysis of the phenotypic data from the baseline to month 48 identified more SNP-phenotype associations than cross-sectional analysis of only the baseline data. In particular, at the genome-wide significance level, both SNP rs429358 in gene APOE and SNP rs2075650 in gene TOMM40 were confirmed to be associated with various imaging phenotypes in multiple regions of interest (ROIs) by both analyses, though longitudinal analysis detected more regional phenotypes associated with the two SNPs and indicated another significant SNP rs439401 in gene APOE. In light of the power advantage of longitudinal analysis, we advocate its use in current and future longitudinal neuroimaging studies.
Affiliation(s)
- Zhiyuan Xu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Xiaotong Shen
- School of Statistics, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, United States of America
|
20
|
Wen Y, Lu Q. A multiclass likelihood ratio approach for genetic risk prediction allowing for phenotypic heterogeneity. Genet Epidemiol 2013; 37:715-25. [PMID: 23934726 DOI: 10.1002/gepi.21751] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 02/07/2013] [Revised: 06/09/2013] [Accepted: 07/03/2013] [Indexed: 01/04/2023]
Abstract
The translation of human genome discoveries into health practice is one of the major challenges in the coming decades. The use of emerging genetic knowledge for early disease prediction, prevention, and pharmacogenetics will advance genome medicine and lead to more effective prevention/treatment strategies. For this reason, studies to assess the combined role of genetic and environmental discoveries in early disease prediction represent high priority research projects, as manifested in the multiple risk prediction studies now underway. However, the risk prediction models formed to date lack sufficient accuracy for clinical use. Converging evidence suggests that diseases with the same or similar clinical manifestations could have different pathophysiological and etiological processes. When heterogeneous subphenotypes are treated as a single entity, the effect size of predictors can be reduced substantially, leading to a low-accuracy risk prediction model. The use of more refined subphenotypes facilitates the identification of new predictors and leads to improved risk prediction models. To account for the phenotypic heterogeneity, we have developed a multiclass likelihood-ratio approach, which simultaneously determines the optimum number of subphenotype groups and builds a risk prediction model for each group. Simulation results demonstrated that the new approach had more accurate and robust performance than existing approaches under various underlying disease models. The empirical study of type II diabetes (T2D) by using data from the Genes and Environment Initiatives suggested heterogeneous etiology underlying obese and nonobese T2D patients. Considering phenotypic heterogeneity in the analysis leads to improved risk prediction models for both obese and nonobese T2D subjects.
Affiliation(s)
- Yalu Wen
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
|
21
|
Jiao S, Hsu L, Bézieau S, Brenner H, Chan AT, Chang-Claude J, Le Marchand L, Lemire M, Newcomb PA, Slattery ML, Peters U. SBERIA: set-based gene-environment interaction test for rare and common variants in complex diseases. Genet Epidemiol 2013; 37:452-64. [PMID: 23720162 PMCID: PMC3713231 DOI: 10.1002/gepi.21735] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Received: 03/05/2013] [Revised: 04/04/2013] [Accepted: 04/30/2013] [Indexed: 01/28/2023]
Abstract
Identification of gene-environment interaction (G × E) is important in understanding the etiology of complex diseases. However, partially due to the lack of power, there have been very few replicated G × E findings compared to the success in marginal association studies. The existing G × E testing methods mainly focus on improving the power for individual markers. In this paper, we took a different strategy and proposed a set-based gene-environment interaction test (SBERIA), which can improve the power by reducing the multiple testing burdens and aggregating signals within a set. The major challenge of the signal aggregation within a set is how to tell signals from noise and how to determine the direction of the signals. SBERIA takes advantage of the established correlation screening for G × E to guide the aggregation of genotypes within a marker set. The correlation screening has been shown to be an efficient way of selecting potential G × E candidate SNPs in case-control studies for complex diseases. Importantly, the correlation screening in case-control combined samples is independent of the interaction test. With this desirable feature, SBERIA maintains the correct type I error level and can be easily implemented in a regular logistic regression setting. We showed that SBERIA had higher power than benchmark methods in various simulation scenarios, both for common and rare variants. We also applied SBERIA to real genome-wide association studies (GWAS) data of 10,729 colorectal cancer cases and 13,328 controls and found evidence of interaction between the set of known colorectal cancer susceptibility loci and smoking.
Affiliation(s)
- Shuo Jiao
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, USA.
|
22
|
Lu Q, Wei C, Ye C, Li M, Elston RC. A likelihood ratio-based Mann-Whitney approach finds novel replicable joint gene action for type 2 diabetes. Genet Epidemiol 2012; 36:583-93. [PMID: 22760990 PMCID: PMC3634342 DOI: 10.1002/gepi.21651] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 01/12/2012] [Revised: 04/09/2012] [Accepted: 05/09/2012] [Indexed: 12/29/2022]
Abstract
The potential importance of the joint action of genes, whether modeled with or without a statistical interaction term, has long been recognized. However, identifying such action has been a great challenge, especially when millions of genetic markers are involved. We propose a likelihood ratio-based Mann-Whitney test to search for joint gene action either among candidate genes or genome-wide. It extends the traditional univariate Mann-Whitney test to assess the joint association of genotypes at multiple loci with disease, allowing for high-order statistical interactions. Because only one overall significance test is conducted for the entire analysis, it avoids the issue of multiple testing. Moreover, the approach adopts a computationally efficient algorithm, making a genome-wide search feasible in a reasonable amount of time on a high performance personal computer. We evaluated the approach using both theoretical and real data. By applying the approach to 40 type 2 diabetes (T2D) susceptibility single-nucleotide polymorphisms (SNPs), we identified a four-locus model strongly associated with T2D in the Wellcome Trust (WT) study (permutation P-value < 0.001), and replicated the same finding in the Nurses' Health Study/Health Professionals Follow-Up Study (NHS/HPFS) (P-value = 3.03×10⁻¹¹). We also conducted a genome-wide search on 385,598 SNPs in the WT study. The analysis took approximately 55 hr on a personal computer, identifying the same first two loci, but overall a different set of four SNPs, jointly associated with T2D (P-value = 1.29×10⁻⁵). The nominal significance of this same association reached 4.01×10⁻⁶ in the NHS/HPFS.
Affiliation(s)
- Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
| | - Changshuai Wei
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
| | - Chengyin Ye
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
| | - Ming Li
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
| | - Robert C. Elston
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio
|
23
|
Abstract
Many common human diseases are complex and are expected to be highly heterogeneous, with multiple causative loci and multiple rare and common variants at some of the causative loci contributing to the risk of these diseases. Data from genome-wide association studies (GWAS) and metadata such as known gene functions and pathways provide the possibility of identifying genetic variants, genes and pathways that are associated with complex phenotypes. Single-marker-based tests have been very successful in identifying thousands of genetic variants for hundreds of complex phenotypes. However, these variants explain only very small percentages of the heritabilities. To account for locus and allelic heterogeneity, gene-based and pathway-based tests can be very useful in the next stage of the analysis of GWAS data. U-statistics, which summarize the genomic similarity between pairs of individuals and link the genomic similarity to phenotype similarity, have proved to be very useful for testing the associations between a set of single nucleotide polymorphisms and the phenotypes. Compared to single-marker analysis, the advantages afforded by the U-statistics-based methods are large when the number of markers involved is large. We review several formulations of U-statistics in genetic association studies and point out the links of these statistics with other similarity-based tests of genetic association. Finally, potential applications of U-statistics in the analysis of next-generation sequencing data and rare variant association studies are discussed.
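To make the idea of linking genomic similarity to phenotype similarity concrete, here is a minimal Python sketch of one generic similarity-based U-statistic with a permutation p-value. The choices made here (identity-by-state sharing as the genetic similarity, the centered trait cross-product as the phenotype similarity, a one-sided permutation test) are assumptions for illustration and do not reproduce any specific formulation from the review; all names and data are hypothetical.

```python
import random
from itertools import combinations


def ibs_similarity(g1, g2):
    """Average identity-by-state sharing (0 to 1) for genotypes coded 0/1/2."""
    return sum(2 - abs(a - b) for a, b in zip(g1, g2)) / (2 * len(g1))


def u_statistic(genotypes, trait):
    """Mean over pairs of genetic similarity times centered trait cross-product."""
    n = len(trait)
    mean = sum(trait) / n
    y = [t - mean for t in trait]
    pairs = list(combinations(range(n), 2))
    return sum(ibs_similarity(genotypes[i], genotypes[j]) * y[i] * y[j]
               for i, j in pairs) / len(pairs)


def permutation_p_value(genotypes, trait, n_perm=999, seed=0):
    """One-sided p-value: how often a shuffled trait gives U >= observed U."""
    rng = random.Random(seed)
    observed = u_statistic(genotypes, trait)
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if u_statistic(genotypes, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 6 subjects, 4 SNPs, one quantitative trait.
genotypes = [[0, 0, 1, 0], [0, 1, 1, 0], [2, 2, 0, 1],
             [2, 1, 0, 2], [1, 0, 2, 1], [0, 0, 1, 1]]
trait = [0.3, 0.5, 2.1, 2.4, 1.0, 0.2]
observed, p_value = permutation_p_value(genotypes, trait)
print(observed, p_value)
```

A large positive value of the statistic means that genetically similar pairs also tend to have similar trait values; the permutation step replaces an asymptotic null distribution, which keeps the sketch free of distributional assumptions.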
Affiliation(s)
- Hongzhe Li
- Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
|
24
|
Braem M, Schouten L, Peeters P, den Brandt PV, Onland-Moret N. Genetic susceptibility to sporadic ovarian cancer: A systematic review. Biochim Biophys Acta Rev Cancer 2011; 1816:132-46. [DOI: 10.1016/j.bbcan.2011.05.002] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Received: 02/26/2011] [Revised: 05/18/2011] [Accepted: 05/18/2011] [Indexed: 11/29/2022]
|
25
|
Han F, Pan W. A composite likelihood approach to latent multivariate Gaussian modeling of SNP data with application to genetic association testing. Biometrics 2011; 68:307-15. [PMID: 21838810 DOI: 10.1111/j.1541-0420.2011.01649.x] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 12/14/2022]
Abstract
Many statistical tests have been proposed for case-control data to detect disease association with multiple single nucleotide polymorphisms (SNPs) in linkage disequilibrium. The main reason for the existence of so many tests is that each test aims to detect one or two aspects of many possible distributional differences between cases and controls, largely due to the lack of a general and yet simple model for discrete genotype data. Here we propose a latent variable model to represent SNP data: the observed SNP data are assumed to be obtained by discretizing a latent multivariate Gaussian variate. Because the latent variate is multivariate Gaussian, its distribution is completely characterized by its mean vector and covariance matrix, in contrast to much more complex forms of a general distribution for discrete multivariate SNP data. We propose a composite likelihood approach for parameter estimation. A direct application of this latent variable model is to association testing with multiple SNPs in a candidate gene or region. In contrast to many existing tests that aim to detect only one or two aspects of many possible distributional differences of discrete SNP data, we can exclusively focus on testing the mean and covariance parameters of the latent Gaussian distributions for cases and controls. Our simulation results demonstrate potential power gains of the proposed approach over some existing methods.
Affiliation(s)
- Fang Han
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455, USA
|
26
|
Tzeng JY, Zhang D, Pongpanich M, Smith C, McCarthy MI, Sale MM, Worrall BB, Hsu FC, Thomas DC, Sullivan PF. Studying gene and gene-environment effects of uncommon and common variants on continuous traits: a marker-set approach using gene-trait similarity regression. Am J Hum Genet 2011; 89:277-88. [PMID: 21835306 DOI: 10.1016/j.ajhg.2011.07.007] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.6] [Received: 12/07/2010] [Revised: 06/16/2011] [Accepted: 07/13/2011] [Indexed: 11/15/2022] Open
Abstract
Genomic association analyses of complex traits demand statistical tools that are capable of detecting small effects of common and rare variants and modeling complex interaction effects and yet are computationally feasible. In this work, we introduce a similarity-based regression method for assessing the main genetic and interaction effects of a group of markers on quantitative traits. The method uses genetic similarity to aggregate information from multiple polymorphic sites and integrates adaptive weights that depend on allele frequencies to accommodate common and uncommon variants. Collapsing information at the similarity level instead of the genotype level avoids canceling signals that have the opposite etiological effects and is applicable to any class of genetic variants without the need for dichotomizing the allele types. To assess gene-trait associations, we regress trait similarities for pairs of unrelated individuals on their genetic similarities and assess association by using a score test whose limiting distribution is derived in this work. The proposed regression framework allows for covariates, has the capacity to model both main and interaction effects, can be applied to a mixture of different polymorphism types, and is computationally efficient. These features make it an ideal tool for evaluating associations between phenotype and marker sets defined by linkage disequilibrium (LD) blocks, genes, or pathways in whole-genome analysis.
Affiliation(s)
- Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.
|
27
|
Li M, Ye C, Fu W, Elston RC, Lu Q. Detecting genetic interactions for quantitative traits with U-statistics. Genet Epidemiol 2011; 35:457-68. [PMID: 21618602 DOI: 10.1002/gepi.20594] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Received: 01/25/2011] [Revised: 03/09/2011] [Accepted: 04/19/2011] [Indexed: 11/08/2022]
Abstract
The genetic etiology of complex human diseases has been commonly viewed as a process that involves multiple genetic variants, environmental factors, as well as their interactions. Statistical approaches, such as the multifactor dimensionality reduction (MDR) and generalized MDR (GMDR), have recently been proposed to test the joint association of multiple genetic variants with either dichotomous or continuous traits. In this study, we propose a novel Forward U-Test to evaluate the combined effect of multiple loci on quantitative traits with consideration of gene-gene/gene-environment interactions. In this new approach, a U-Statistic-based forward algorithm is first used to select potential disease-susceptibility loci and then a weighted U-statistic is used to test the joint association of the selected loci with the disease. Through a simulation study, we found the Forward U-Test outperformed GMDR in terms of greater power. Aside from that, our approach is less computationally intensive, making it feasible for high-dimensional gene-gene/gene-environment research. We illustrate our method with a real data application to nicotine dependence (ND), using three independent datasets from the Study of Addiction: Genetics and Environment. Our gene-gene interaction analysis of 155 SNPs in 67 candidate genes identified two SNPs, rs16969968 within gene CHRNA5 and rs1122530 within gene NTRK2, jointly associated with the level of ND (P-value = 5.31e-7). The association, which involves essential interaction, is replicated in two independent datasets with P-values of 1.08e-5 and 0.02, respectively. Our finding suggests that joint action may exist between the two gene products.
Affiliation(s)
- Ming Li
- Department of Epidemiology, Michigan State University, East Lansing, MI 48824, USA
|
28
|
Han F, Pan W. Powerful multi-marker association tests: unifying genomic distance-based regression and logistic regression. Genet Epidemiol 2011; 34:680-8. [PMID: 20976795 DOI: 10.1002/gepi.20529] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Indexed: 11/11/2022]
Abstract
To detect genetic association with common and complex diseases, many statistical tests have been proposed for candidate gene or genome-wide association studies with the case-control design. Due to linkage disequilibrium (LD), multi-marker association tests can gain power over single-marker tests with a Bonferroni multiple testing adjustment. Among many existing multi-marker association tests, most aim to detect only one of many possible aspects of distributional differences between the genotypes of cases and controls, such as allele frequency differences, while a few new ones target two or three aspects, all of which can be implemented in logistic regression. In contrast to logistic regression, a genomic distance-based regression (GDBR) approach aims to detect some high-order genotypic differences between cases and controls. A recent study has confirmed the high power of GDBR tests. At this moment, the popular logistic regression and the emerging GDBR approaches are completely unrelated; for example, one has to choose between the two. In this article, we reformulate GDBR as logistic regression, opening an avenue to constructing other powerful tests while overcoming some limitations of GDBR. For example, asymptotic distributions can replace time-consuming permutations for deriving P-values, and covariates, including gene-gene interactions, can be easily incorporated. Importantly, this reformulation facilitates combining GDBR with other existing methods in a unified framework of logistic regression. In particular, we show that Fisher's P-value combining method can boost statistical power by incorporating information from allele frequencies, Hardy-Weinberg disequilibrium, LD patterns, and other higher-order interactions among multi-markers as captured by GDBR.
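Fisher's P-value combining method mentioned in this abstract is simple to sketch. Assuming k independent p-values, the statistic -2 Σ ln p follows a chi-square distribution with 2k degrees of freedom under the null, and for even degrees of freedom its tail probability has a closed form, so no statistics library is needed. The function name and the input p-values below are hypothetical, and the independence assumption is the sketch's own, not a claim about the paper's component tests.

```python
import math


def fisher_combined_p(p_values):
    """Return (statistic, combined p-value) for independent p-values."""
    if not p_values or any(not 0 < p <= 1 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # statistic / 2
    # Chi-square survival function with 2k df:
    #   exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return 2 * half, math.exp(-half) * total


# Hypothetical component p-values, e.g. from tests of different genotype features.
stat, p_comb = fisher_combined_p([0.04, 0.20, 0.03])
print(stat, p_comb)
```

With a single input the combined p-value equals the input, and with two inputs it reduces to p1·p2·(1 − ln(p1·p2)), which gives a quick sanity check on the implementation.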
Affiliation(s)
- Fang Han
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455–0392, USA
|
29
|
He J, Wang K, Edmondson AC, Rader DJ, Li C, Li M. Gene-based interaction analysis by incorporating external linkage disequilibrium information. Eur J Hum Genet 2010; 19:164-72. [PMID: 20924406 DOI: 10.1038/ejhg.2010.164] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 02/06/2023] Open
Abstract
Gene-gene interactions have an important role in complex human diseases. Detection of gene-gene interactions has long been a challenge due to their complexity. The standard method aiming at detecting SNP-SNP interactions may be inadequate as it does not model linkage disequilibrium (LD) among SNPs in each gene and may lose power due to a large number of comparisons. To improve power, we propose a principal component (PC)-based framework for gene-based interaction analysis. We analytically derive the optimal weight for both quantitative and binary traits based on pairwise LD information. We then use PCs to summarize the information in each gene and test for interactions between the PCs. We further extend this gene-based interaction analysis procedure to allow the use of imputation dosage scores obtained from a popular imputation software package, MACH, which incorporates multilocus LD information. To evaluate the performance of the gene-based interaction tests, we conducted extensive simulations under various settings. We demonstrate that gene-based interaction tests are more powerful than SNP-based tests when more than two variants interact with each other; moreover, tests that incorporate external LD information are generally more powerful than those that use genotyped markers only. We also apply the proposed gene-based interaction tests to a candidate gene study on high-density lipoprotein. As our method operates at the gene level, it can be applied to a genome-wide association setting and used as a screening tool to detect gene-gene interactions.
Affiliation(s)
- Jing He
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA
30
Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered 2010; 70:42-54. [PMID: 20413981 PMCID: PMC2912645 DOI: 10.1159/000288704] [Citation(s) in RCA: 241] [Impact Index Per Article: 16.1] [Received: 11/09/2009] [Accepted: 02/05/2010] [Indexed: 12/14/2022]
Abstract
Since associations between complex diseases and common variants are typically weak, and approaches to genotyping rare variants (e.g. by next-generation resequencing) multiply, there is an urgent demand to develop powerful association tests that are able to detect disease associations with both common and rare variants. In this article we present such a test. It is based on data-adaptive modifications to a so-called Sum test originally proposed for common variants, which aims to strike a balance between utilizing information on multiple markers in linkage disequilibrium and reducing the cost of large degrees of freedom or of multiple testing adjustment. When applied to multiple common or rare variants in a candidate region, the proposed test is easy to use with 1 degree of freedom and without the need for multiple testing adjustment. We show that the proposed test has high power across a wide range of scenarios with either common or rare variants, or both. In particular, in some situations the proposed test performs better than several commonly used methods.
Affiliation(s)
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minn., USA
31
Pan W. Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genet Epidemiol 2009; 33:497-507. [PMID: 19170135 DOI: 10.1002/gepi.20402] [Citation(s) in RCA: 180] [Impact Index Per Article: 11.3] [Indexed: 11/09/2022]
Abstract
We consider detecting associations between a trait and multiple single nucleotide polymorphisms (SNPs) in linkage disequilibrium (LD). To maximize the use of information contained in multiple SNPs while minimizing the cost of large degrees of freedom (DF) in testing multiple parameters, we first theoretically explore the sum test derived under a working assumption of a common association strength between the trait and each SNP, testing on the corresponding parameter with only one DF. Under the scenarios that the association strengths between the trait and the SNPs are close to each other (and in the same direction), as considered by Wang and Elston [Am. J. Hum. Genet. [2007] 80:353-360], we show with simulated data that the sum test was powerful as compared to several existing tests; otherwise, the sum test might have much reduced power. To overcome the limitation of the sum test, based on our theoretical analysis of the sum test, we propose five new tests that are closely related to each other and are shown to consistently perform similarly well across a wide range of scenarios. We point out the close connection of the proposed tests to the Goeman test. Furthermore, we derive the asymptotic distributions of the proposed tests so that P-values can be easily calculated, in contrast to the use of computationally demanding permutations or simulations for the Goeman test. A distinguishing feature of the five new tests is their use of a diagonal working covariance matrix, rather than a full covariance matrix as used in the usual Wald or score test. We recommend the routine use of two of the new tests, along with several other tests, to detect disease associations with multiple linked SNPs.
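As a rough illustration of the sum test described in this abstract, the sketch below regresses a trait on the per-subject sum of SNP codes, so that a single common-effect parameter is tested with one degree of freedom. It assumes a quantitative trait, ordinary linear regression, and a normal approximation for the p-value; the five new tests proposed in the paper are not shown. The function name `sum_test` is hypothetical.

```python
import math
import numpy as np

def sum_test(X, y):
    """1-DF sum test sketch: regress y on the row sums of the SNP matrix X.

    Illustrative assumption: a linear model y = b0 + bc * sum_j x_j + e,
    testing bc = 0 with a normal approximation.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    s = X.sum(axis=1)                 # common-effect regressor: sum over SNPs
    s_c = s - s.mean()
    y_c = y - y.mean()
    sxx = float(s_c @ s_c)
    beta = float(s_c @ y_c) / sxx     # OLS slope for the summed score
    resid = y_c - beta * s_c
    sigma2 = float(resid @ resid) / (len(y) - 2)
    z = beta / math.sqrt(sigma2 / sxx)
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal p-value
    return beta, z, p
```

The sketch also makes the limitation noted in the abstract visible: when SNP effects differ in direction, their contributions to the summed score can cancel, which is why the sum test may lose power outside the common-effect scenario.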
Affiliation(s)
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455-0392, USA.