1
Ren J, Zhou F, Li X, Ma S, Jiang Y, Wu C. Robust Bayesian variable selection for gene-environment interactions. Biometrics 2023; 79:684-694. [PMID: 35394058 PMCID: PMC11086965 DOI: 10.1111/biom.13670] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/15/2020] [Revised: 03/23/2022] [Accepted: 03/28/2022] [Indexed: 11/30/2022]
Abstract
Gene-environment (G×E) interactions have important implications to elucidate the etiology of complex diseases beyond the main genetic and environmental effects. Outliers and data contamination in disease phenotypes of G×E studies have been commonly encountered, leading to the development of a broad spectrum of robust regularization methods. Nevertheless, within the Bayesian framework, the issue has not been taken care of in existing studies. We develop a fully Bayesian robust variable selection method for G×E interaction studies. The proposed Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. In particular, for the robust sparse group selection, the spike-and-slab priors have been imposed on both individual and group levels to identify important main and interaction effects robustly. An efficient Gibbs sampler has been developed to facilitate fast computation. Extensive simulation studies, analysis of diabetes data with single-nucleotide polymorphism measurements from the Nurses' Health Study, and The Cancer Genome Atlas melanoma data with gene expression measurements demonstrate the superior performance of the proposed method over multiple competing alternatives.
Affiliation(s)
- Jie Ren
- Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, Kansas, USA
- Xiaoxi Li
- Department of Statistics, Kansas State University, Manhattan, Kansas, USA
- Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut, USA
- Yu Jiang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, Tennessee, USA
- Cen Wu
- Department of Statistics, Kansas State University, Manhattan, Kansas, USA
2
Zhou F, Lu X, Ren J, Fan K, Ma S, Wu C. Sparse group variable selection for gene-environment interactions in the longitudinal study. Genet Epidemiol 2022; 46:317-340. [PMID: 35766061 DOI: 10.1002/gepi.22461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2021] [Revised: 01/31/2022] [Accepted: 03/15/2022] [Indexed: 11/06/2022]
Abstract
Penalized variable selection for high-dimensional longitudinal data has received much attention as it can account for the correlation among repeated measurements while providing additional and essential information for improved identification and prediction performance. Despite the success, in longitudinal studies, the potential of penalization methods is far from fully understood for accommodating structured sparsity. In this article, we develop a sparse group penalization method to conduct the bi-level gene-environment (G × E) interaction study under the repeatedly measured phenotype. Within the quadratic inference function framework, the proposed method can achieve simultaneous identification of main and interaction effects on both the group and individual levels. Simulation studies have shown that the proposed method outperforms major competitors. In the case study of asthma data from the Childhood Asthma Management Program, we conduct G × E study by using high-dimensional single nucleotide polymorphism data as genetic factors and the longitudinal trait, forced expiratory volume in 1 s, as the phenotype. Our method leads to improved prediction and identification of main and interaction effects with important implications.
Affiliation(s)
- Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
- Xi Lu
- Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
- Jie Ren
- Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, Indiana, 46202, USA
- Kun Fan
- Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
- Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut, 06520, USA
- Cen Wu
- Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
3
A Constrained Generalized Functional Linear Model for Multi-Loci Genetic Mapping. STATS 2021. [DOI: 10.3390/stats4030033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022] Open
Abstract
In genome-wide association studies (GWAS), efficient incorporation of linkage disequilibria (LD) among densely typed genetic variants into association analysis is a critical yet challenging problem. Functional linear models (FLM), which impose a smoothing structure on the coefficients of correlated covariates, are advantageous in genetic mapping of multiple variants with high LD. Here we propose a novel constrained generalized FLM (cGFLM) framework to perform simultaneous association tests on a block of linked SNPs with various trait types, including continuous, binary and zero-inflated count phenotypes. The new cGFLM applies a set of inequality constraints on the FLM to ensure model identifiability under different genetic codings. The method is implemented via B-splines, and an augmented Lagrangian algorithm is employed for parameter estimation. For hypotheses testing, a test statistic that accounts for the model constraints was derived, following a mixture of chi-square distributions. Simulation results show that cGFLM is effective in identifying causal loci and gene clusters compared to several competing methods based on single markers and SKAT-C. We applied the proposed method to analyze a candidate gene-based COGEND study and a large-scale GWAS data on dental caries risk.
4
Lau A, So HC. Turning genome-wide association study findings into opportunities for drug repositioning. Comput Struct Biotechnol J 2020; 18:1639-1650. [PMID: 32670504 PMCID: PMC7334463 DOI: 10.1016/j.csbj.2020.06.015] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 11/10/2019] [Revised: 06/05/2020] [Accepted: 06/05/2020] [Indexed: 02/02/2023] Open
Abstract
Drug development is a very costly and lengthy process, while repositioned or repurposed drugs could be brought into clinical practice within a shorter time-frame and at a much reduced cost. Numerous computational approaches to drug repositioning have been developed, but methods utilizing genome-wide association studies (GWASs) data are less explored. The past decade has observed a massive growth in the amount of data from GWAS; the rich information contained in GWAS has great potential to guide drug repositioning or discovery. While multiple tools are available for finding the most relevant genes from GWAS hits, searching for top susceptibility genes is only one way to guide repositioning, which has its own limitations. Here we provide a comprehensive review of different computational approaches that employ GWAS data to guide drug repositioning. These methods include selecting top candidate genes from GWAS as drug targets, deducing drug candidates based on drug-drug and disease-disease similarities, searching for reversed expression profiles between drugs and diseases, pathway-based methods as well as approaches based on analysis of biological networks. Each method is illustrated with examples, and their respective strengths and limitations are discussed. We also discussed several areas for future research.
Affiliation(s)
- Alexandria Lau
- School of Biomedical Sciences, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong SAR, China
- Hon-Cheong So
- School of Biomedical Sciences, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong SAR, China
- KIZ-CUHK Joint Laboratory of Bioresources and Molecular Research of Common Diseases, Kunming Institute of Zoology and The Chinese University of Hong Kong, Hong Kong SAR, China
- Department of Psychiatry, The Chinese University of Hong Kong, Hong Kong SAR, China
- Margaret K.L. Cheung Research Centre for Management of Parkinsonism, The Chinese University of Hong Kong, Hong Kong SAR, China
- Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, China
- Brain and Mind Institute, The Chinese University of Hong Kong, Hong Kong SAR, China
- Hong Kong Branch of the Chinese Academy of Sciences Center for Excellence in Animal Evolution and Genetics, The Chinese University of Hong Kong, Hong Kong SAR, China
- Corresponding author at: School of Biomedical Sciences, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong SAR, China.
5
Chanda P, Costa E, Hu J, Sukumar S, Van Hemert J, Walia R. Information Theory in Computational Biology: Where We Stand Today. ENTROPY (BASEL, SWITZERLAND) 2020; 22:E627. [PMID: 33286399 PMCID: PMC7517167 DOI: 10.3390/e22060627] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 04/30/2020] [Revised: 05/31/2020] [Accepted: 06/03/2020] [Indexed: 12/30/2022]
Abstract
"A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address the problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and have played a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information-theoretic concepts and describe their key applications in multiple major areas of research in computational biology: gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.
Affiliation(s)
- Pritam Chanda
- Corteva Agriscience™, Indianapolis, IN 46268, USA
- Computer and Information Science, Indiana University-Purdue University, Indianapolis, IN 46202, USA
- Eduardo Costa
- Corteva Agriscience™, Mogi Mirim, Sao Paulo 13801-540, Brazil
- Jie Hu
- Corteva Agriscience™, Indianapolis, IN 46268, USA
- Rasna Walia
- Corteva Agriscience™, Johnston, IA 50131, USA
6
Deng Y, He T, Fang R, Li S, Cao H, Cui Y. Genome-Wide Gene-Based Multi-Trait Analysis. Front Genet 2020; 11:437. [PMID: 32508874 PMCID: PMC7248273 DOI: 10.3389/fgene.2020.00437] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 02/24/2020] [Accepted: 04/08/2020] [Indexed: 11/29/2022] Open
Abstract
Genome-wide association studies focusing on a single phenotype have been broadly conducted to identify genetic variants associated with a complex disease. The commonly applied single variant analysis is limited by failing to consider the complex interactions between variants, which motivated the development of association analyses focusing on genes or gene sets. Moreover, when multiple correlated phenotypes are available, methods based on a multi-trait analysis can improve the association power. However, most currently available multi-trait analyses are single variant-based analyses; thus have limited power when disease variants function as a group in a gene or a gene set. In this work, we propose a genome-wide gene-based multi-trait analysis method by considering genes as testing units. For a given phenotype, we adopt a rapid and powerful kernel-based testing method which can evaluate the joint effect of multiple variants within a gene. The joint effect, either linear or nonlinear, is captured through kernel functions. Given a series of candidate kernel functions, we propose an omnibus test strategy to integrate the test results based on different candidate kernels. A p-value combination method is then applied to integrate dependent p-values to assess the association between a gene and multiple correlated phenotypes. Simulation studies show a reasonable type I error control and an excellent power of the proposed method compared to its counterparts. We further show the utility of the method by applying it to two data sets: the Human Liver Cohort and the Alzheimer Disease Neuroimaging Initiative data set, and novel genes are identified. Our method has broad applications in other fields in which the interest is to evaluate the joint effect (linear or nonlinear) of a set of variants.
Affiliation(s)
- Yamin Deng
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Tao He
- Department of Mathematics, San Francisco State University, San Francisco, CA, United States
- Ruiling Fang
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Shaoyu Li
- Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC, United States
- Hongyan Cao
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Yuehua Cui
- Department of Statistics and Probability, Michigan State University, East Lansing, MI, United States
7
Ren J, Zhou F, Li X, Chen Q, Zhang H, Ma S, Jiang Y, Wu C. Semiparametric Bayesian variable selection for gene-environment interactions. Stat Med 2020; 39:617-638. [PMID: 31863500 PMCID: PMC7467082 DOI: 10.1002/sim.8434] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 05/31/2019] [Revised: 09/26/2019] [Accepted: 11/02/2019] [Indexed: 11/06/2022]
Abstract
Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. Study of gene-environment (G×E) interactions is important for elucidating the disease etiology. Existing Bayesian methods for G×E interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. Many studies have shown the advantages of penalization methods in detecting G×E interactions in "large p, small n" settings. However, Bayesian variable selection, which can provide fresh insight into G×E study, has not been widely examined. We propose a novel and powerful semiparametric Bayesian variable selection model that can investigate linear and nonlinear G×E interactions simultaneously. Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from main-effects-only case within the Bayesian framework. Spike-and-slab priors are incorporated on both individual and group levels to identify the sparse main and interaction effects. The proposed method conducts Bayesian variable selection more efficiently than existing methods. Simulation shows that the proposed model outperforms competing alternatives in terms of both identification and prediction. The proposed Bayesian method leads to the identification of main and interaction effects with important implications in a high-throughput profiling study with high-dimensional SNP data.
Affiliation(s)
- Jie Ren
- Department of Statistics, Kansas State University, Manhattan, Kansas
- Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, Kansas
- Xiaoxi Li
- Department of Statistics, Kansas State University, Manhattan, Kansas
- Qi Chen
- Department of Pharmacology, Toxicology and Therapeutics, University of Kansas Medical Center, Kansas City, Kansas
- Hongmei Zhang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, Tennessee
- Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut
- Yu Jiang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, Tennessee
- Cen Wu
- Department of Statistics, Kansas State University, Manhattan, Kansas
8
Ren J, Du Y, Li S, Ma S, Jiang Y, Wu C. Robust network-based regularization and variable selection for high-dimensional genomic data in cancer prognosis. Genet Epidemiol 2019; 43:276-291. [PMID: 30746793 PMCID: PMC6446588 DOI: 10.1002/gepi.22194] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 08/05/2018] [Revised: 11/19/2018] [Accepted: 11/29/2018] [Indexed: 12/21/2022]
Abstract
In cancer genomic studies, an important objective is to identify prognostic markers associated with patients' survival. Network-based regularization has achieved success in variable selections for high-dimensional cancer genomic data, because of its ability to incorporate the correlations among genomic features. However, as survival time data usually follow skewed distributions, and are contaminated by outliers, network-constrained regularization that does not take the robustness into account leads to false identifications of network structure and biased estimation of patients' survival. In this study, we develop a novel robust network-based variable selection method under the accelerated failure time model. Extensive simulation studies show the advantage of the proposed method over the alternative methods. Two case studies of lung cancer datasets with high-dimensional gene expression measurements demonstrate that the proposed approach has identified markers with important implications.
Affiliation(s)
- Jie Ren
- Department of Statistics, Kansas State University, Manhattan, KS
- Yinhao Du
- Department of Statistics, Kansas State University, Manhattan, KS
- Shaoyu Li
- Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC
- Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, CT
- Yu Jiang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN
- Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS
9
Bonham LW, Evans DS, Liu Y, Cummings SR, Yaffe K, Yokoyama JS. Neurotransmitter Pathway Genes in Cognitive Decline During Aging: Evidence for GNG4 and KCNQ2 Genes. Am J Alzheimers Dis Other Demen 2018; 33:153-165. [PMID: 29338302 PMCID: PMC6209098 DOI: 10.1177/1533317517739384] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/17/2022]
Abstract
BACKGROUND/RATIONALE: Experimental studies support the role of neurotransmitter genes in dementia risk, but human studies utilizing single variants in candidate genes have had limited success. METHODS: We used the gene-based testing program Versatile Gene-based Association Study to assess whether aggregate variation across 6 neurotransmitter pathways influences risk of cognitive decline in 8159 cognitively normal elderly (≥65 years old) adults from 3 community-based cohorts. RESULTS: Common genetic variation in GNG4 and KCNQ2 was associated with cognitive decline. In human brain tissue data sets, both GNG4 and KCNQ2 show higher expression in hippocampus relative to other brain regions; GNG4 expression decreases with advancing age. Both GNG4 and KCNQ2 show highest expression in fetal astrocytes. CONCLUSION: Genetic variation analyses and gene expression data suggest that GNG4 and KCNQ2 may be associated with cognitive decline in normal aging. Gene-based testing of neurotransmitter pathways may confirm and reveal novel risk genes in future studies of healthy cognitive aging.
Affiliation(s)
- Luke W. Bonham
- Department of Neurology, Memory and Aging Center, University of California, San Francisco, CA, USA
- Daniel S. Evans
- California Pacific Medical Center Research Institute, San Francisco, CA, USA
- Yongmei Liu
- Department of Epidemiology and Prevention, Public Health Sciences, Wake Forest School of Medicine, Winston-Salem, NC, USA
- Steven R. Cummings
- California Pacific Medical Center Research Institute, San Francisco, CA, USA
- Kristine Yaffe
- Department of Neurology, Memory and Aging Center, University of California, San Francisco, CA, USA
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA
- Department of Veterans Affairs, San Francisco Veterans Affairs Medical Center, San Francisco, CA, USA
- Department of Psychiatry, University of California, San Francisco, CA, USA
- Jennifer S. Yokoyama
- Department of Neurology, Memory and Aging Center, University of California, San Francisco, CA, USA
10
Additive varying-coefficient model for nonlinear gene-environment interactions. Stat Appl Genet Mol Biol 2018; 17:sagmb-2017-0008. [DOI: 10.1515/sagmb-2017-0008] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 11/15/2022]
Abstract
Gene-environment (G×E) interaction plays a pivotal role in understanding the genetic basis of complex disease. When environmental factors are measured continuously, one can assess the genetic sensitivity over different environmental conditions on a disease trait. Motivated by the increasing awareness of gene set based association analysis over single variant based approaches, we proposed an additive varying-coefficient model to jointly model variants in a genetic system. The model allows us to examine how variants in a gene set are moderated by an environment factor to affect a disease phenotype. We approached the problem from a variable selection perspective. In particular, we select variants with varying, constant and zero coefficients, which correspond to cases of G×E interaction, no G×E interaction and no genetic effect, respectively. The procedure was implemented through a two-stage iterative estimation algorithm via the smoothly clipped absolute deviation penalty function. Under certain regularity conditions, we established the consistency property in variable selection as well as effect separation of the two stage iterative estimators, and showed the optimal convergence rates of the estimates for varying effects. In addition, we showed that the estimate of non-zero constant coefficients enjoy the oracle property. The utility of our procedure was demonstrated through simulation studies and real data analysis.
11
Wu C, Jiang Y, Ren J, Cui Y, Ma S. Dissecting gene-environment interactions: A penalized robust approach accounting for hierarchical structures. Stat Med 2017; 37:437-456. [PMID: 29034484 DOI: 10.1002/sim.7518] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 12/06/2016] [Revised: 07/30/2017] [Accepted: 09/07/2017] [Indexed: 12/26/2022]
Abstract
Identification of gene-environment (G × E) interactions associated with disease phenotypes has posed a great challenge in high-throughput cancer studies. The existing marginal identification methods have suffered from not being able to accommodate the joint effects of a large number of genetic variants, while some of the joint-effect methods have been limited by failing to respect the "main effects, interactions" hierarchy, by ignoring data contamination, and by using inefficient selection techniques under complex structural sparsity. In this article, we develop an effective penalization approach to identify important G × E interactions and main effects, which can account for the hierarchical structures of the 2 types of effects. Possible data contamination is accommodated by adopting the least absolute deviation loss function. The advantage of the proposed approach over the alternatives is convincingly demonstrated in both simulation and a case study on lung cancer prognosis with gene expression measurements and clinical covariates under the accelerated failure time model.
Affiliation(s)
- Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Yu Jiang
- Division of Epidemiology, Biostatistics, and Environmental Health, University of Memphis, Memphis, TN 38111, USA
- Jie Ren
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Yuehua Cui
- Department of Statistics and Probability, Michigan State University, 619 Red Cedar Rd, East Lansing, MI 48824, USA
- Shuangge Ma
- Department of Biostatistics, Yale University, 60 College Street, New Haven, CT 06520, USA
12
Luo T, Liu X, Cui Y. A Genome-wide Association Analysis in Four Populations Reveals Strong Genetic Heterogeneity For Birth Weight. Curr Genomics 2017; 17:416-426. [PMID: 28479870 PMCID: PMC5320544 DOI: 10.2174/1389202917666160726152033] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/31/2015] [Revised: 08/24/2015] [Accepted: 08/31/2015] [Indexed: 11/22/2022] Open
Abstract
Low or high birth weight is one of the main causes of neonatal morbidity and mortality, and both are also associated with chronic illness in adulthood. Birth weight is a complex trait that is affected by the baby's genes, the maternal environment, and the complex interactions between them. To understand the genetic basis of birth weight, we reanalyzed a genome-wide association study data set consisting of four populations (Thai, Afro-Caribbean, European, and Hispanic) with regular linear models. In addition to fitting the data with parametric linear models, we fitted the data with a nonparametric varying-coefficient model to identify variants that are nonlinearly modulated by the mother's condition to affect birth weight. For this purpose, we used the baby's cord glucose level as the mother's environmental variable. At the 10⁻⁵ genome-wide threshold, we identified 33 SNP variants in the Thai population, 26 SNPs in the Afro-Caribbean population, 18 SNPs in the European population, and 7 SNPs in the Hispanic population. Some of the variants are significantly modulated by the baby's cord glucose level either linearly or nonlinearly, implying potential interactions between the baby's genes and the mother's glucose level that affect the baby's birth weight. There is no overlap between variants identified in the four populations, indicating strong genetic heterogeneity of birth weight between the four ethnic groups. The findings of this study provide insights into the genetic basis of birth weight and reveal its genetic heterogeneity.
Affiliation(s)
- Tiane Luo
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Shanxi, 030001, China
- Xu Liu
- Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
- Yuehua Cui
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Shanxi, 030001, China
- Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
13
A Nonlinear Model for Gene-Based Gene-Environment Interaction. Int J Mol Sci 2016; 17:ijms17060882. [PMID: 27271617 PMCID: PMC4926416 DOI: 10.3390/ijms17060882] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 03/08/2016] [Revised: 05/07/2016] [Accepted: 05/21/2016] [Indexed: 11/16/2022] Open
Abstract
A vast amount of literature has confirmed the role of gene-environment (G×E) interaction in the etiology of complex human diseases. Traditional methods are predominantly focused on the analysis of interaction between a single nucleotide polymorphism (SNP) and an environmental variable. Given that genes are the functional units, it is crucial to understand how gene effects (rather than single SNP effects) are influenced by an environmental variable to affect disease risk. Motivated by the increasing awareness of the power of gene-based association analysis over single-variant-based approaches, in this work we proposed a sparse principal component regression (sPCR) model to understand the gene-based G×E interaction effect on complex disease. We first extracted the sparse principal components for SNPs in a gene; the effect of each principal component was then modeled by a varying-coefficient (VC) model. The model can jointly model variants in a gene whose effects are nonlinearly influenced by an environmental variable. In addition, the varying-coefficient sPCR (VC-sPCR) model is readily interpretable, since the sparsity of the principal component loadings indicates the relative importance of the corresponding SNPs in each component. We applied our method to a human birth weight dataset in a Thai population. We analyzed 12,005 genes across 22 chromosomes and found one significant interaction effect using the Bonferroni correction method and one suggestive interaction. The model performance was further evaluated through simulation studies. Our model provides a systems approach to evaluating gene-based G×E interaction.
14
Associations of Genetic Variants at Nongenic Susceptibility Loci with Breast Cancer Risk and Heterogeneity by Tumor Subtype in Southern Han Chinese Women. BIOMED RESEARCH INTERNATIONAL 2016; 2016:3065493. [PMID: 27022606 PMCID: PMC4789034 DOI: 10.1155/2016/3065493] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/17/2015] [Revised: 01/06/2016] [Accepted: 02/04/2016] [Indexed: 12/05/2022]
Abstract
Current understanding of cancer genomes is mainly “gene centric.” However, GWAS have identified some nongenic breast cancer susceptibility loci. Validation studies showed inconsistent results among different populations. To further explore this inconsistency and to investigate associations by intrinsic subtype (Luminal-A, Luminal-B, ER−&PR−&HER2+, and triple negative) among Southern Han Chinese women, we genotyped five nongenic polymorphisms (2q35: rs13387042, 5p12: rs981782 and rs4415084, and 8q24: rs1562430 and rs13281615) using MassARRAY IPLEX platform in 609 patients and 882 controls. Significant associations with breast cancer were observed for rs13387042 and rs4415084 with OR (95% CI) per-allele 1.29 (1.00–1.66) and 0.83 (0.71–0.97), respectively. In subtype specific analysis, rs13387042 (per-allele adjusted OR = 1.36, 95% CI = 1.00–1.87) and rs4415084 (per-allele adjusted OR = 0.82, 95% CI = 0.66–1.00) showed slightly significant association with Luminal-A subtype; however, only rs13387042 was associated with ER−&PR−&HER2+ tumors (per-allele adjusted OR = 1.55, 95% CI = 1.00–2.40), and none of them were linked to Luminal-B and triple negative subtype. Collectively, nongenic SNPs were heterogeneous according to the intrinsic subtype. Further studies with larger datasets along with intrinsic subtype categorization should explore and confirm the role of these variants in increasing breast cancer risk.
|
15
|
Wen Y, He Z, Li M, Lu Q. Risk Prediction Modeling of Sequencing Data Using a Forward Random Field Method. Sci Rep 2016; 6:21120. [PMID: 26892725 PMCID: PMC4759688 DOI: 10.1038/srep21120] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2015] [Accepted: 01/18/2016] [Indexed: 11/09/2022] Open
Abstract
With the advance in high-throughput sequencing technology, it is feasible to investigate the role of common and rare variants in disease risk prediction. While the new technology holds great promise to improve disease prediction, the massive amount of data and low frequency of rare variants pose great analytical challenges on risk prediction modeling. In this paper, we develop a forward random field method (FRF) for risk prediction modeling using sequencing data. In FRF, subjects' phenotypes are treated as stochastic realizations of a random field on a genetic space formed by subjects' genotypes, and an individual's phenotype can be predicted by adjacent subjects with similar genotypes. The FRF method allows for multiple similarity measures and candidate genes in the model, and adaptively chooses the optimal similarity measure and disease-associated genes to reflect the underlying disease model. It also avoids the specification of the threshold of rare variants and allows for different directions and magnitudes of genetic effects. Through simulations, we demonstrate the FRF method attains higher or comparable accuracy over commonly used support vector machine based methods under various disease models. We further illustrate the FRF method with an application to the sequencing data obtained from the Dallas Heart Study.
Affiliation(s)
- Yalu Wen
- Department of Statistics, University of Auckland, Auckland 1010, New Zealand
| | - Zihuai He
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, U.S.A
| | - Ming Li
- Department of Epidemiology and Biostatistics, Indiana University at Bloomington, Bloomington, IN 47405, U.S.A
| | - Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, U.S.A
|
16
|
Wu C, Shi X, Cui Y, Ma S. A penalized robust semiparametric approach for gene-environment interactions. Stat Med 2015; 34:4016-30. [PMID: 26239060 DOI: 10.1002/sim.6609] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2014] [Revised: 06/28/2015] [Accepted: 07/06/2015] [Indexed: 11/09/2022]
Abstract
In genetic and genomic studies, gene-environment (G×E) interactions have important implications. Some of the existing G×E interaction methods are limited by analyzing a small number of G factors at a time, by assuming linear effects of E factors, by assuming no data contamination, and by adopting ineffective selection techniques. In this study, we propose a new approach for identifying important G×E interactions. It jointly models the effects of all E and G factors and their interactions. A partially linear varying coefficient model is adopted to accommodate possible nonlinear effects of E factors. A rank-based loss function is used to accommodate possible data contamination. Penalization, which has been extensively used with high-dimensional data, is adopted for selection. The proposed penalized estimation approach can automatically determine if a G factor has an interaction with an E factor, main effect but not interaction, or no effect at all. The proposed approach can be effectively realized using a coordinate descent algorithm. Simulation shows that it has satisfactory performance and outperforms several competing alternatives. The proposed approach is used to analyze a lung cancer study with gene expression measurements and clinical variables. Copyright © 2015 John Wiley & Sons, Ltd.
Affiliation(s)
- Cen Wu
- Department of Biostatistics, School of Public Health, Yale University, 60 College Street, New Haven, CT, 06520, U.S.A.,Department of Statistics, Kansas State University, 1116 Mid-Campus Drive N., Manhattan, KS, 66506, U.S.A
| | - Xingjie Shi
- Department of Statistics, Nanjing University of Finance and Economics, Nanjing, China
| | - Yuehua Cui
- Department of Statistics and Probability, Michigan State University, 619 Red Cedar Rd, East Lansing, MI, 48824, U.S.A
| | - Shuangge Ma
- Department of Biostatistics, School of Public Health, Yale University, 60 College Street, New Haven, CT, 06520, U.S.A.,VA Cooperative Studies Program Coordinating Center, West Haven, CT, 06516, U.S.A
|
17
|
Wu C, Cui Y, Ma S. Integrative analysis of gene-environment interactions under a multi-response partially linear varying coefficient model. Stat Med 2014; 33:4988-98. [PMID: 25146388 PMCID: PMC4225006 DOI: 10.1002/sim.6287] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2014] [Revised: 06/10/2014] [Accepted: 07/28/2014] [Indexed: 12/29/2022]
Abstract
Consider the integrative analysis of genetic data with multiple correlated response variables. The goal is to identify important gene-environment (G × E) interactions along with main gene and environment effects that are associated with the responses. The homogeneity and heterogeneity models can be adopted to describe the genetic basis of multiple responses. To accommodate possible nonlinear effects of some environment effects, a multi-response partially linear varying coefficient model is assumed. Penalization is adopted for marker selection. The proposed penalization method can select genetic variants with G × E interactions, no G × E interactions, and no main effects simultaneously. It adopts different penalties to accommodate the homogeneity and heterogeneity models. The proposed method can be effectively computed using a coordinate descent algorithm. Simulation study and the analysis of Health Professionals Follow-up Study, which has two correlated continuous traits, SNP measurements and multiple environment effects, show superior performance of the proposed method over its competitors.
Affiliation(s)
- Cen Wu
- Department of Biostatistics, School of Public Health, Yale University, 60 College Street, New Haven, CT, 06520, U.S.A
|
18
|
Li S, Cui Y, Romero R. Entropy-based selection for maternal-fetal genotype incompatibility with application to preterm prelabor rupture of membranes. BMC Genet 2014; 15:66. [PMID: 24916189 PMCID: PMC4057811 DOI: 10.1186/1471-2156-15-66] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2014] [Accepted: 05/23/2014] [Indexed: 12/02/2022] Open
Abstract
Background: Maternal-fetal genotype incompatibility (MFGI) is increasingly reported to influence human diseases, especially pregnancy-related complications. In practice, it is challenging to identify the ideal incompatibility model for analysis, since the true MFGI mechanism is generally unknown. The underlying MFGI mechanism for different genetic variants can vary, and using a single incompatibility model for all circumstances would cause power loss in testing MFGI. Results: In this article, we propose a practical two-step procedure that incorporates a model selection strategy based on an entropy measurement to select the most appropriate MFGI model represented by the data and to test the significance of the MFGI effect using the chosen model within the generalized linear regression framework. Conclusions: Our simulation studies show that the proposed two-step procedure controls the type I error rate and increases the testing power under various scenarios. In a real data application, our analysis reveals genes having an MFGI effect, which may not be detected with a non-model selection counterpart.
Affiliation(s)
- Shaoyu Li
- Department of Biostatistics, St Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, USA.
|
19
|
Dai H, Leeder JS, Cui Y. A modified generalized Fisher method for combining probabilities from dependent tests. Front Genet 2014; 5:32. [PMID: 24600471 PMCID: PMC3929847 DOI: 10.3389/fgene.2014.00032] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2013] [Accepted: 01/27/2014] [Indexed: 11/24/2022] Open
Abstract
Rapid developments in molecular technology have yielded a large amount of high throughput genetic data to understand the mechanism for complex traits. The increase of genetic variants requires hundreds and thousands of statistical tests to be performed simultaneously in analysis, which poses a challenge to control the overall Type I error rate. Combining p-values from multiple hypothesis testing has shown promise for aggregating effects in high-dimensional genetic data analysis. Several p-value combining methods have been developed and applied to genetic data; see Dai et al. (2012b) for a comprehensive review. However, there is a lack of investigations conducted for dependent genetic data, especially for weighted p-value combining methods. Single nucleotide polymorphisms (SNPs) are often correlated due to linkage disequilibrium (LD). Other genetic data, including variants from next generation sequencing, gene expression levels measured by microarray, protein and DNA methylation data, etc. also contain complex correlation structures. Ignoring correlation structures among genetic variants may lead to severe inflation of Type I error rates for omnibus testing of p-values. In this work, we propose modifications to the Lancaster procedure by taking the correlation structure among p-values into account. The weight function in the Lancaster procedure allows meaningful biological information to be incorporated into the statistical analysis, which can increase the power of the statistical testing and/or remove the bias in the process. Extensive empirical assessments demonstrate that the modified Lancaster procedure largely reduces the Type I error rates due to correlation among p-values, and retains considerable power to detect signals among p-values. We applied our method to reassess published renal transplant data, and identified a novel association between B cell pathways and allograft tolerance.
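For orientation, the unweighted baseline that the Lancaster procedure generalizes is Fisher's method, which can be sketched in a few lines of Python. This is a minimal illustration and not the modified procedure proposed in the paper: it assumes independent p-values, which is precisely the assumption the abstract warns against for SNPs in LD. The function name `fisher_combined_pvalue` is hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2n degrees of freedom under the global null, provided the
    p-values are independent (an assumption of this sketch).
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    statistic = -2.0 * sum(math.log(p) for p in pvalues)

    # Chi-square survival function for even degrees of freedom 2n:
    # P(X > x) = exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return statistic, math.exp(-half) * total
```

With correlated p-values this null distribution no longer holds, and the combined p-value can be anti-conservative, which is the motivation given in the abstract for adjusting the procedure.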
Affiliation(s)
- Hongying Dai
- Department of Pediatrics, Research Development and Clinical Investigation, Children's Mercy Hospital Kansas City, MO, USA ; Department of Pediatrics, University of Missouri-Kansas City Kansas City, MO, USA ; Department of Informatic Medicine and Personalized Health, University of Missouri-Kansas City Kansas City, MO, USA
| | - J Steven Leeder
- Department of Pediatrics, University of Missouri-Kansas City Kansas City, MO, USA ; Department of Pediatrics, Clinical Pharmacology and Therapeutic Innovation, Children's Mercy Hospital Kansas City, MO, USA
| | - Yuehua Cui
- Department of Statistics and Probability, Michigan State University East Lansing, MI, USA
|
20
|
Kang G, Jiang B, Cui Y. Gene-based Genomewide Association Analysis: A Comparison Study. Curr Genomics 2013; 14:250-5. [PMID: 24294105 PMCID: PMC3731815 DOI: 10.2174/13892029113149990001] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2013] [Revised: 05/01/2013] [Accepted: 05/07/2013] [Indexed: 11/22/2022] Open
Abstract
The study of gene-based genetic associations has gained conceptual popularity recently. Biologic insight into the etiology of a complex disease can be gained by focusing on genes as testing units. Several gene-based methods (e.g., minimum p-value (or maximum test statistic) or entropy-based method) have been developed and have more power than a single nucleotide polymorphism (SNP)-based analysis. The objective of this study is to compare the performance of the entropy-based method with the minimum p-value and single SNP–based analysis and to explore their strengths and weaknesses. Simulation studies show that: 1) all three methods can reasonably control the false-positive rate; 2) the minimum p-value method outperforms the entropy-based and the single SNP–based method when only one disease-related SNP occurs within the gene; 3) the entropy-based method outperforms the other methods when there are more than two disease-related SNPs in the gene; and 4) the entropy-based method is computationally more efficient than the minimum p-value method. Application to a real data set shows that more significant genes were identified by the entropy-based method than by the other two methods.
Affiliation(s)
- Guolian Kang
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105
|
21
|
Wu C, Cui Y. Boosting signals in gene-based association studies via efficient SNP selection. Brief Bioinform 2013; 15:279-91. [PMID: 23325548 DOI: 10.1093/bib/bbs087] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Set-based association studies based on genes or pathways have shown great promise in interpreting association signals associated with complex diseases. These approaches are particularly useful when variants in a set have moderate effects and are difficult to detect with single marker analysis, especially when variants function jointly in a complicated manner. The set-based analyses use a summary statistic such as the maximum or average of individual signals (e.g. a chi-square statistic) over all variants in a set, or consider their joint distribution to assess the significance of the set. The signal obtained with this treatment, however, could be diluted when noisy variants are not properly handled, leading to inflated false negatives or false positives. Thus, the selection of disease informative single-nucleotide polymorphisms (diSNPs) plays a crucial role in improving the power of the set-based association study. In this work, we propose an efficient diSNP selection method based on information theory. We select diSNP variants by considering their relative information contribution to disease status, which is different from the usual tag SNP selection. The relative merit of pre-selecting diSNPs in a set-based association analysis is demonstrated through extensive simulation studies and real data analysis.
Affiliation(s)
- Cen Wu
- Department of Statistics and Probability, Michigan State University, 619 Red Cedar Road, Rm C432, East Lansing, MI 48824, USA. Tel.: +1-517-432-7098; Fax: +1-517-432-1405;
|
22
|
Wu C, Li S, Cui Y. Genetic association studies: an information content perspective. Curr Genomics 2012; 13:566-73. [PMID: 23633916 PMCID: PMC3468889 DOI: 10.2174/138920212803251382] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2012] [Revised: 06/04/2012] [Accepted: 06/18/2012] [Indexed: 01/02/2023] Open
Abstract
The availability of high-density single nucleotide polymorphism (SNP) data has made it possible for human genetic association studies to identify common and rare variants underlying complex diseases on a genome-wide scale. A handful of novel genetic variants have been identified, which gives much hope and prospects for the future of genetic association studies. In this process, statistical and computational methods play key roles, among which information-based association tests have gained large popularity. This paper is intended to give a comprehensive review of the current literature in genetic association analysis cast in the framework of information theory. We focus our review on the following topics: (1) information theoretic approaches in genetic linkage and association studies; (2) entropy-based strategies for optimal SNP subset selection; and (3) the usage of theoretic information criteria in gene clustering and gene regulatory network construction.
Affiliation(s)
- Cen Wu
- Department of Statistics and Probability, Michigan State University, East Lansing, Michigan 48824
| | - Shaoyu Li
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105
| | - Yuehua Cui
- Department of Statistics and Probability, Michigan State University, East Lansing, Michigan 48824
- Center for Computational Biology, Beijing Forestry University, Beijing, China 100083
|
23
|
Li S, Cui Y. Gene-centric gene–gene interaction: A model-based kernel machine method. Ann Appl Stat 2012. [DOI: 10.1214/12-aoas545] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
24
|
Hong MG, Reynolds CA, Feldman AL, Kallin M, Lambert JC, Amouyel P, Ingelsson E, Pedersen NL, Prince JA. Genome-wide and gene-based association implicates FRMD6 in Alzheimer disease. Hum Mutat 2012; 33:521-9. [PMID: 22190428 PMCID: PMC3326347 DOI: 10.1002/humu.22009] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2011] [Accepted: 12/02/2011] [Indexed: 12/29/2022]
Abstract
Genome-wide association studies (GWAS) that allow for allelic heterogeneity may facilitate the discovery of novel genes not detectable by models that require replication of a single variant site. One strategy to accomplish this is to focus on genes rather than markers as units of association, and so potentially capture a spectrum of causal alleles that differ across populations. Here, we conducted a GWAS of Alzheimer disease (AD) in 2,586 Swedes and performed gene-based meta-analysis with three additional studies from France, Canada, and the United States, in total encompassing 4,259 cases and 8,284 controls. Implementing a newly designed gene-based algorithm, we identified two loci apart from the region around APOE that achieved study-wide significance in combined samples, the strongest finding being for FRMD6 on chromosome 14q (P = 2.6 × 10⁻¹⁴) and a weaker signal for NARS2 that is immediately adjacent to GAB2 on chromosome 11q (P = 7.8 × 10⁻⁹). Ontology-based pathway analyses revealed significant enrichment of genes involved in glycosylation. Results suggest that gene-based approaches that accommodate allelic heterogeneity in GWAS can provide a complementary avenue for gene discovery and may help to explain a portion of the missing heritability not detectable with single nucleotide polymorphisms (SNPs) derived from marker-specific meta-analysis.
Affiliation(s)
- Mun-Gwan Hong
- Department of Medical Epidemiology and Biostatistics, Karolinska Institute, 171 77 Stockholm, Sweden
| | - Chandra A. Reynolds
- Department of Psychology, University of California at Riverside, 92521 Riverside, USA
| | - Adina L. Feldman
- Department of Medical Epidemiology and Biostatistics, Karolinska Institute, 171 77 Stockholm, Sweden
| | - Mikael Kallin
- Department of Medical Epidemiology and Biostatistics, Karolinska Institute, 171 77 Stockholm, Sweden
| | - Jean-Charles Lambert
- Inserm U744, F-59019 Lille, France
- Institut Pasteur de Lille, F-59019 Lille, France
- Université de Lille Nord de France, F-59000 Lille, France
| | - Philippe Amouyel
- Inserm U744, F-59019 Lille, France
- Institut Pasteur de Lille, F-59019 Lille, France
- Université de Lille Nord de France, F-59000 Lille, France
| | - Erik Ingelsson
- Department of Medical Epidemiology and Biostatistics, Karolinska Institute, 171 77 Stockholm, Sweden
| | - Nancy L. Pedersen
- Department of Medical Epidemiology and Biostatistics, Karolinska Institute, 171 77 Stockholm, Sweden
| | - Jonathan A. Prince
- Department of Medical Epidemiology and Biostatistics, Karolinska Institute, 171 77 Stockholm, Sweden
|
25
|
Wilson IJ, Howey RA, Houniet DT, Santibanez-Koref M. Finding genes that influence quantitative traits with tree-based clustering. BMC Proc 2011; 5 Suppl 9:S98. [PMID: 22373331 PMCID: PMC3287940 DOI: 10.1186/1753-6561-5-s9-s98] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
We present a new statistical method to identify genes in which one or more variants influence quantitative traits. We use the Genetic Analysis Workshop 17 (GAW17) data set of unrelated individuals as a test of the method on the raw GAW17 phenotypes and on residuals after fitting linear models to individual-based covariates. By performing appropriate randomization tests, we found many significant results for a proportion of the genes that contain variants that directly contribute to disease but that have an increased type I error for analyses of raw phenotypes. Power calculations show that our methods have the ability to reliably identify a subset of the loci contributing to disease. When we applied our method to derived phenotypes, we removed many false positives, giving appropriate type I error rates at little cost to power. The correlation between genome-wide heterozygosity and the value of the trait Q1 appears to drive much of the type I error in this data set.
Affiliation(s)
- Ian J Wilson
- Institute of Genetic Medicine, Newcastle University, Newcastle NE3 1NB, UK.
|
26
|
Lehne B, Lewis CM, Schlitt T. From SNPs to genes: disease association at the gene level. PLoS One 2011; 6:e20133. [PMID: 21738570 PMCID: PMC3128073 DOI: 10.1371/journal.pone.0020133] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2010] [Accepted: 04/26/2011] [Indexed: 01/16/2023] Open
Abstract
Interpreting Genome-Wide Association Studies (GWAS) at a gene level is an important step towards understanding the molecular processes that lead to disease. In order to incorporate prior biological knowledge such as pathways and protein interactions in the analysis of GWAS data it is necessary to derive one measure of association for each gene. We compare three different methods to obtain gene-wide test statistics from Single Nucleotide Polymorphism (SNP) based association data: choosing the test statistic from the most significant SNP; the mean test statistics of all SNPs; and the mean of the top quartile of all test statistics. We demonstrate that the gene-wide test statistics can be controlled for the number of SNPs within each gene and show that all three methods perform considerably better than expected by chance at identifying genes with confirmed associations. By applying each method to GWAS data for Crohn's Disease and Type 1 Diabetes we identified new potential disease genes.
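The three gene-wide summaries compared in this abstract can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the function name `gene_wide_statistics` is hypothetical, the input is assumed to be a list of per-SNP test statistics (larger meaning stronger association), and taking the top quartile as the largest `ceil(n / 4)` statistics is an assumption. The correction for the number of SNPs per gene mentioned in the abstract is not reproduced here.

```python
import math


def gene_wide_statistics(snp_statistics):
    """Summarize per-SNP test statistics for one gene in three ways."""
    if not snp_statistics:
        raise ValueError("a gene needs at least one SNP statistic")

    ordered = sorted(snp_statistics, reverse=True)
    n = len(ordered)

    # Assumed definition of the top quartile: the ceil(n / 4) largest values.
    top_n = max(1, math.ceil(n / 4))

    return {
        "max": ordered[0],                       # most significant SNP
        "mean": sum(ordered) / n,                # mean over all SNPs
        "top_quartile_mean": sum(ordered[:top_n]) / top_n,
    }
```

All three summaries tend to grow or shrink with the number of SNPs in a gene in different ways, which is why the abstract stresses controlling the gene-wide statistic for gene size before comparing genes.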
Affiliation(s)
- Benjamin Lehne
- Department of Medical and Molecular Genetics, King's College London, London, United Kingdom
| | - Cathryn M. Lewis
- Department of Medical and Molecular Genetics, King's College London, London, United Kingdom
- Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, King's College London, London, United Kingdom
| | - Thomas Schlitt
- Department of Medical and Molecular Genetics, King's College London, London, United Kingdom
|
27
|
Jiang B, Zhang X, Zuo Y, Kang G. A powerful truncated tail strength method for testing multiple null hypotheses in one dataset. J Theor Biol 2011; 277:67-73. [PMID: 21295595 DOI: 10.1016/j.jtbi.2011.01.029] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2010] [Revised: 01/14/2011] [Accepted: 01/19/2011] [Indexed: 10/18/2022]
Abstract
In microarray analysis, medical imaging analysis and functional magnetic resonance imaging, we often need to test an overall null hypothesis involving a large number of single hypotheses (usually larger than 1000) in one dataset. A tail strength statistic (Taylor and Tibshirani, 2006) and Fisher's probability method are useful and can be applied to measure an overall significance for a large set of independent single hypothesis tests with the overall null hypothesis assuming that all single hypotheses are true. In this paper we propose a new method that improves the tail strength statistic by considering only the values whose corresponding p-values are less than some pre-specified cutoff. We call it truncated tail strength statistic. We illustrate our method using a simulation study and two genome-wide datasets by chromosome. Our method not only controls type one error rate quite well, but also has significantly higher power than the tail strength method and Fisher's method in most cases.
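As a rough illustration of the idea, the tail strength of Taylor and Tibshirani can be written as TS = (1/m) Σ_k [1 − p_(k) (m + 1)/k] over the ordered p-values p_(1) ≤ … ≤ p_(m), and a truncated variant keeps only the terms whose p-value falls below a cutoff. The Python sketch below is an assumption-laden illustration: the function name `tail_strength` is hypothetical, and dividing the truncated sum by m is a choice made here for simplicity, not a normalization taken from the paper.

```python
def tail_strength(pvalues, cutoff=None):
    """Tail strength of a set of p-values, optionally truncated.

    With cutoff=None this is the plain tail strength. With a cutoff, only
    ordered p-values strictly below the cutoff contribute; the divisor m
    is kept unchanged (an assumption of this sketch).
    """
    m = len(pvalues)
    if m == 0:
        raise ValueError("at least one p-value is required")

    ordered = sorted(pvalues)
    total = 0.0
    for k, p in enumerate(ordered, start=1):
        if cutoff is not None and p >= cutoff:
            break  # ordered ascending, so all later p-values also fail
        # Under the null, E[p_(k)] = k / (m + 1), so each term has mean 0.
        total += 1.0 - p * (m + 1) / k
    return total / m
```

Values near zero are what the global null predicts, while clearly positive values indicate an excess of small p-values. Significance would still need a reference distribution, which this sketch does not provide.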
Affiliation(s)
- Bo Jiang
- Department of Biostatistics, University of Alabama at Birmingham, AL 35294, USA
|
28
|
Romero R, Friel LA, Velez Edwards DR, Kusanovic JP, Hassan SS, Mazaki-Tovi S, Vaisbuch E, Kim CJ, Erez O, Chaiworapongsa T, Pearce BD, Bartlett J, Salisbury BA, Anant MK, Vovis GF, Lee MS, Gomez R, Behnke E, Oyarzun E, Tromp G, Williams SM, Menon R. A genetic association study of maternal and fetal candidate genes that predispose to preterm prelabor rupture of membranes (PROM). Am J Obstet Gynecol 2010; 203:361.e1-361.e30. [PMID: 20673868 DOI: 10.1016/j.ajog.2010.05.026] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2010] [Revised: 04/10/2010] [Accepted: 05/18/2010] [Indexed: 01/19/2023]
Abstract
OBJECTIVE We sought to determine whether maternal/fetal single-nucleotide polymorphisms (SNPs) in candidate genes are associated with preterm prelabor rupture of membranes (pPROM). STUDY DESIGN A case-control study was conducted in patients with pPROM (225 mothers and 155 fetuses) and 599 mothers and 628 fetuses with a normal pregnancy; 190 candidate genes and 775 SNPs were studied. Single locus/haplotype association analyses were performed; false discovery rate was used to correct for multiple testing (q* = 0.15). RESULTS First, a SNP in tissue inhibitor of metalloproteinase 2 in mothers was significantly associated with pPROM (odds ratio, 2.12; 95% confidence interval, 1.47-3.07; P = .000068), and this association remained significant after correction for multiple comparisons. Second, haplotypes for Alpha 3 type IV collagen isoform precursor in the mother were associated with pPROM (global P = .003). Third, multilocus analysis identified a 3-locus model, which included maternal SNPs in collagen type I alpha 2, defensin alpha 5 gene, and endothelin 1. CONCLUSION DNA variants in a maternal gene involved in extracellular matrix metabolism doubled the risk of pPROM.
Affiliation(s)
- Roberto Romero
- Perinatology Research Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development/National Institutes of Health/Department of Health and Human Services, Bethesda, MD, USA.
|
29
|
Entropy and Information Approaches to Genetic Diversity and its Expression: Genomic Geography. ENTROPY 2010. [DOI: 10.3390/e12071765] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/03/2023]
|
30
|
Liu JZ, McRae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, Hayward NK, Montgomery GW, Visscher PM, Martin NG, Macgregor S. A versatile gene-based test for genome-wide association studies. Am J Hum Genet 2010; 87:139-45. [PMID: 20598278 DOI: 10.1016/j.ajhg.2010.06.009] [Citation(s) in RCA: 647] [Impact Index Per Article: 43.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2010] [Revised: 06/07/2010] [Accepted: 06/11/2010] [Indexed: 12/14/2022] Open
Abstract
We have derived a versatile gene-based test for genome-wide association studies (GWAS). Our approach, called VEGAS (versatile gene-based association study), is applicable to all GWAS designs, including family-based GWAS, meta-analyses of GWAS on the basis of summary data, and DNA-pooling-based GWAS, where existing approaches based on permutation are not possible, as well as singleton data, where they are. The test incorporates information from a full set of markers (or a defined subset) within a gene and accounts for linkage disequilibrium between markers by using simulations from the multivariate normal distribution. We show that for an association study using singletons, our approach produces results equivalent to those obtained via permutation in a fraction of the computation time. We demonstrate proof-of-principle by using the gene-based test to replicate several genes known to be associated on the basis of results from a family-based GWAS for height in 11,536 individuals and a DNA-pooling-based GWAS for melanoma in approximately 1300 cases and controls. Our method has the potential to identify novel associated genes; provide a basis for selecting SNPs for replication; and be directly used in network (pathway) approaches that require per-gene association test statistics. We have implemented the approach in both an easy-to-use web interface, which only requires the uploading of markers with their association p-values, and a separate downloadable application.
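The simulation idea in this abstract can be sketched in Python as follows. This is a hedged illustration of the general approach, not the VEGAS software: it assumes the gene-level statistic is the sum of per-SNP 1-df chi-square statistics, that the LD matrix is a positive-definite correlation matrix, and that an empirical p-value with a +1 correction is acceptable. The names `cholesky_lower` and `gene_based_pvalue` are hypothetical.

```python
import math
import random


def cholesky_lower(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def gene_based_pvalue(snp_chi2, ld_matrix, n_sim=2000, seed=0):
    """Empirical gene-level p-value from per-SNP chi-square statistics.

    Null draws are correlated standard normal vectors with correlation
    given by ld_matrix; each draw is squared and summed, then compared
    with the observed sum of chi-square statistics.
    """
    rng = random.Random(seed)
    n = len(snp_chi2)
    observed = sum(snp_chi2)
    lower = cholesky_lower(ld_matrix)

    exceed = 0
    for _ in range(n_sim):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Multiply by the Cholesky factor to induce the LD correlation.
        correlated = [
            sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n)
        ]
        if sum(v * v for v in correlated) >= observed:
            exceed += 1
    # The +1 correction keeps the empirical p-value strictly positive.
    return (exceed + 1) / (n_sim + 1)
```

Because only per-SNP statistics and an LD matrix are needed, a calculation of this kind does not require individual-level genotypes, which matches the abstract's point about summary-data and pooling designs where permutation is not possible.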
|
31
|
Genetic variants in thymic stromal lymphopoietin are associated with atopic dermatitis and eczema herpeticum. J Allergy Clin Immunol 2010; 125:1403-1407.e4. [PMID: 20466416 DOI: 10.1016/j.jaci.2010.03.016] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2009] [Revised: 03/15/2010] [Accepted: 03/18/2010] [Indexed: 01/16/2023]
|
32
|
An entropy test for single-locus genetic association analysis. BMC Genet 2010; 11:19. [PMID: 20331859 PMCID: PMC2860340 DOI: 10.1186/1471-2156-11-19] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2009] [Accepted: 03/23/2010] [Indexed: 11/13/2022] Open
Abstract
Background: The etiology of complex diseases is due to the combination of genetic and environmental factors, usually many of them, and each with a small effect. The identification of these small-effect contributing factors is still a demanding task. Clearly, there is a need for more powerful tests of genetic association, and especially for the identification of rare effects. Results: We introduce a new genetic association test based on symbolic dynamics and symbolic entropy. Using freely available software, we have applied this entropy test, and a conventional test, to simulated and real datasets, to illustrate the method and estimate type I error and power. We have also compared this new entropy test to the Fisher exact test for assessment of association with low-frequency SNPs. The entropy test is generally more powerful than the conventional test, and can be significantly more powerful when the genotypic test is applied to low allele-frequency markers. We have also shown that both the Fisher and entropy methods are optimal to test for association with low-frequency SNPs (MAF around 1-5%), and both are conservative for very rare SNPs (MAF<1%). Conclusions: We have developed a new, simple, consistent and powerful test to detect genetic association of biallelic/SNP markers in case-control data, by using symbolic dynamics and symbolic entropy as a measure of gene dependence. We also provide a standard asymptotic distribution of this test statistic. Given that the test is based on entropy measures, it avoids smoothed nonparametric estimation. The entropy test is generally as good as or even more powerful than the conventional and Fisher tests. Furthermore, the entropy test is more computationally efficient than Fisher's exact test, especially for large numbers of markers. Therefore, this entropy-based test has the advantage of being optimal for most SNPs, regardless of their allele frequency (minor allele frequency (MAF) between 1-50%). This property is quite beneficial, since many researchers tend to discard low allele-frequency SNPs from their analysis. Now they can apply the same statistical test of association to all SNPs in a single analysis, which can be especially helpful to detect rare effects.
|
33
|
Zuo Y, Kang G. A mixed two-stage method for detecting interactions in genomewide association studies. J Theor Biol 2010; 262:576-83. [DOI: 10.1016/j.jtbi.2009.10.029] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 01/22/2009] [Revised: 10/15/2009] [Accepted: 10/26/2009] [Indexed: 10/20/2022]
|
34
|
Cui Y, Li G, Li S, Wu R. Designs for linkage analysis and association studies of complex diseases. Methods Mol Biol 2010; 620:219-242. [PMID: 20652506 DOI: 10.1007/978-1-60761-580-4_6] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 05/29/2023]
Abstract
Genetic linkage analysis has been a traditional means for identifying regions of the genome with large genetic effects that contribute to a disease. Following linkage analysis, association studies are widely pursued to fine-tune regions with significant linkage signals. For complex diseases, which often involve multiple genetic variants each with a small or moderate effect, linkage analysis has little power compared to association studies. In this chapter, we give a brief review of design issues related to linkage analysis and association studies with human genetic data. We introduce methods commonly used for linkage and association studies and compare the relative merits of family-based and population-based association studies. Compared to candidate gene studies, a genome-wide blind search for disease variants is proving to be a more powerful approach. We briefly review the commonly used two-stage designs in genome-wide association studies. As more and more biological evidence indicates the role of genomic imprinting in disease, identifying imprinted genes becomes critically important. Design and analysis in genetic mapping of imprinted genes are introduced in this chapter. Recent efforts in integrating gene expression analysis and genetic mapping, termed expression quantitative trait loci (eQTL) mapping or genetical genomics analysis, offer new prospects in elucidating the genetic architecture of gene expression. Designs in genetical genomics analysis are also covered in this chapter.
Affiliation(s)
- Yuehua Cui
- Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA
|
35
|
Guo YF, Li J, Chen Y, Zhang LS, Deng HW. A new permutation strategy of pathway-based approach for genome-wide association study. BMC Bioinformatics 2009; 10:429. [PMID: 20021635 PMCID: PMC2809078 DOI: 10.1186/1471-2105-10-429] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Received: 06/10/2009] [Accepted: 12/18/2009] [Indexed: 01/02/2023]
Abstract
BACKGROUND The recently introduced pathway-based approach is promising and advantageous for improving the efficiency of analyzing genome-wide association scan (GWAS) data to identify disease variants by jointly considering variants of the genes that belong to the same biological pathway. However, the currently available pathway-based approaches for analyzing GWAS have limited power and efficiency. RESULTS We proposed a new and efficient permutation strategy based on SNP randomization for determining significance in pathway analysis of GWAS. The developed permutation strategy was evaluated and compared to two previously available methods, i.e. sample permutation and gene permutation, through simulation studies and a study on a real dataset. Results showed that the proposed permutation strategy is more powerful and efficient while greatly reducing computational complexity. CONCLUSION Our findings indicate the improved performance of SNP permutation and thus render pathway-based analysis of GWAS more applicable and attractive.
Affiliation(s)
- Yan-Fang Guo
- School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, PR China.
|
36
|
Candidate genes and their interactions with other genetic/environmental risk factors in the etiology of schizophrenia. Brain Res Bull 2009; 83:86-92. [PMID: 19729054 DOI: 10.1016/j.brainresbull.2009.08.023] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Received: 02/04/2009] [Revised: 08/04/2009] [Accepted: 08/25/2009] [Indexed: 11/21/2022]
Abstract
Identification of causative factors for common, chronic disorders is a major focus of current human health science research. These disorders are likely to be caused by multiple etiological agents. Available evidence also suggests that interactions between the risk factors may explain some of their pathogenic effects. While progress in genomics and allied biological research has brought forth powerful analytic techniques, the predicted complexity poses daunting analytic challenges. The search for pathogenesis of schizophrenia shares most of these challenges. We have reviewed the analytic and logistic problems associated with the search for pathogenesis. Evidence for pathogenic interactions is presented for selected diseases and for schizophrenia. We end by suggesting 'recursive analyses' as a potential design to address these challenges. This scheme involves initial focused searches for interactions motivated by available evidence, typically involving identified individual risk factors, such as candidate gene variants. Putative interactions are tested rigorously for replication and for biological plausibility. Support for the interactions from statistical and functional analyses motivates a progressively larger array of interactants that are evaluated recursively. The risk explained by the interactions is assessed concurrently and further elaborate searches may be guided by the results of such analyses. By way of example, we summarize our ongoing analyses of dopaminergic polymorphisms, as well as infectious etiological factors in schizophrenia genesis.
|