1
Fu S, Wheeler W, Wang X, Hua X, Godbole D, Duan J, Zhu B, Deng L, Qin F, Zhang H, Shi J, Yu K. A comprehensive framework for trans-ancestry pathway analysis using GWAS summary data from diverse populations. PLoS Genet 2024; 20:e1011322. [PMID: 39441834 PMCID: PMC11534268 DOI: 10.1371/journal.pgen.1011322] [Received: 05/30/2024] [Revised: 11/04/2024] [Accepted: 10/07/2024] [Indexed: 10/25/2024]
Abstract
As more multi-ancestry GWAS summary data become available, we have developed a comprehensive trans-ancestry pathway analysis framework that effectively utilizes this diverse genetic information. Within this framework, we evaluated various strategies for integrating genetic data at different levels (SNP, gene, and pathway) from multiple ancestry groups. Through extensive simulation studies, we have identified robust strategies that demonstrate superior performance across diverse scenarios. Applying these methods, we analyzed 6,970 pathways for their association with schizophrenia, incorporating data from African, East Asian, and European populations. Our analysis identified over 200 pathways significantly associated with schizophrenia, even after excluding genes near genome-wide significant loci. This approach substantially enhances detection efficiency compared to traditional single-ancestry pathway analysis and the conventional approach that amalgamates single-ancestry pathway analysis results across different ancestry groups. Our framework provides a flexible and effective tool for leveraging the expanding pool of multi-ancestry GWAS summary data, thereby improving our ability to identify biologically relevant pathways that contribute to disease susceptibility.
Affiliation(s)
- Sheng Fu
  - School of Statistics and Data Science, Nankai University, Tianjin, China
  - Key Laboratory of Pure Mathematics and Combinatorics, Nankai University, Tianjin, China
- William Wheeler
  - Information Management Services, Inc, Bethesda, Maryland, United States of America
- Xiaoyu Wang
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
  - Cancer Genomics Research Laboratory, Frederick National Laboratory for Cancer Research, Leidos Biomedical Research Inc, Rockville, Maryland, United States of America
- Xing Hua
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
  - Cancer Genomics Research Laboratory, Frederick National Laboratory for Cancer Research, Leidos Biomedical Research Inc, Rockville, Maryland, United States of America
- Devika Godbole
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
  - Cancer Genomics Research Laboratory, Frederick National Laboratory for Cancer Research, Leidos Biomedical Research Inc, Rockville, Maryland, United States of America
- Jubao Duan
  - Center for Psychiatric Genetics, NorthShore University HealthSystem, Evanston, Illinois, United States of America
  - Department of Psychiatry and Behavioral Neuroscience, University of Chicago, Chicago, Illinois, United States of America
- Bin Zhu
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
- Lu Deng
  - School of Statistics and Data Science, Nankai University, Tianjin, China
- Fei Qin
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
- Haoyu Zhang
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
- Jianxin Shi
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
- Kai Yu
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
2
Chen Z, Liang H, Wei P. Data-adaptive and pathway-based tests for association studies between somatic mutations and germline variations in human cancers. Genet Epidemiol 2023; 47:617-636. [PMID: 37822029 DOI: 10.1002/gepi.22537] [Received: 10/20/2022] [Revised: 07/22/2023] [Accepted: 09/18/2023] [Indexed: 10/13/2023]
Abstract
Cancer is a disease driven by a combination of inherited genetic variants and somatic mutations. Recently available large-scale sequencing data of cancer genomes have provided an unprecedented opportunity to study the interactions between them. However, previous studies on this topic have been limited by simple, low statistical power tests such as Fisher's exact test. In this paper, we design data-adaptive and pathway-based tests based on the score statistic for association studies between somatic mutations and germline variations. Previous research has shown that two single-nucleotide polymorphism (SNP)-set-based association tests, adaptive sum of powered score (aSPU) and data-adaptive pathway-based (aSPUpath) tests, increase the power in genome-wide association studies (GWASs) with a single disease trait in a case-control study. We extend aSPU and aSPUpath to multi-traits, that is, somatic mutations of multiple genes in a cohort study, allowing extensive information aggregation at both SNP and gene levels. $p$-values from different parameters assuming varying genetic architecture are combined to yield data-adaptive tests for somatic mutations and germline variations. Extensive simulations show that, in comparison with some commonly used methods, our data-adaptive somatic mutations/germline variations tests can be applied to multiple germline SNPs/genes/pathways, and generally have much higher statistical powers while maintaining the appropriate type I error. The proposed tests are applied to a large-scale real-world International Cancer Genome Consortium whole genome sequencing data set of 2583 subjects, detecting more significant and biologically relevant associations compared with the other existing methods on both gene and pathway levels. Our study has systematically identified the associations between various germline variations and somatic mutations across different cancer types, which potentially provides valuable utility for cancer risk prediction, prognosis, and therapeutics.
Affiliation(s)
- Zhongyuan Chen
  - Division of Biostatistics, Medical College of Wisconsin, Milwaukee, Wisconsin, USA
- Han Liang
  - Department of Bioinformatics and Computational Biology, MD Anderson Cancer Center, Houston, Texas, USA
- Peng Wei
  - Department of Biostatistics, MD Anderson Cancer Center, Houston, Texas, USA
3
Kim Y, Chi YY, Shen J, Zou F. Robust genetic model-based SNP-set association test using CauchyGM. Bioinformatics 2023; 39:6831090. [PMID: 36383169 DOI: 10.1093/bioinformatics/btac728] [Received: 03/31/2022] [Revised: 10/26/2022] [Accepted: 11/15/2022] [Indexed: 11/17/2022]
Abstract
MOTIVATION Association testing on genome-wide association studies (GWAS) data is commonly performed under a single (mostly additive) genetic model framework. However, the underlying true genetic mechanisms are often unknown in practice for most complex traits. When the employed inheritance model deviates from the underlying model, statistical power may be reduced. To overcome this challenge, an integrative association test that directly infers the underlying genetic model from GWAS data has previously been proposed for single-SNP analysis. RESULTS In this article, we propose a Cauchy combination Genetic Model-based association test (CauchyGM) under a generalized linear model framework for SNP-set level analysis. CauchyGM does not require prior knowledge on the underlying inheritance pattern of each SNP. It performs a score test that first estimates an individual P-value of each SNP in an SNP-set with both minor allele frequency (MAF) > 1% and three genotypes and further aggregates the rest SNPs using SKAT. CauchyGM then combines the correlated P-values across multiple SNPs and different genetic models within the set using Cauchy Combination Test. To further accommodate both sparse and dense signal patterns, we also propose an omnibus association test (CauchyGM-O) by combining CauchyGM with SKAT and the burden test. Our extensive simulations show that both CauchyGM and CauchyGM-O maintain the type I error well at the genome-wide significance level and provide substantial power improvement compared to existing methods. We apply our methods to a pharmacogenomic GWAS data from a large cardiovascular randomized clinical trial. Both CauchyGM and CauchyGM-O identify several novel genome-wide significant genes. AVAILABILITY AND IMPLEMENTATION The R package CauchyGM is publicly available on github: https://github.com/ykim03517/CauchyGM. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
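The abstract above relies on the Cauchy Combination Test to merge correlated P-values. As a rough illustration only, the generic form of that test can be sketched as follows; this is not the CauchyGM package's code, the function name `cauchy_combination` is hypothetical, and equal weights are an assumption made here for simplicity.

```python
import math


def cauchy_combination(pvalues, weights=None):
    """Combine p-values with the generic Cauchy combination statistic.

    Hypothetical helper for illustration; weights default to equal (an assumption).
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p < 1.0) for p in pvalues):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0] * len(pvalues)
    total = sum(weights)
    # Each p-value is mapped to a standard Cauchy variate: tan((0.5 - p) * pi).
    stat = sum(w * math.tan((0.5 - p) * math.pi)
               for w, p in zip(weights, pvalues)) / total
    # The upper tail of the standard Cauchy distribution gives the combined p-value.
    return 0.5 - math.atan(stat) / math.pi
```

With a single input the function returns that p-value unchanged, which is a quick sanity check; a small p-value among several inputs dominates the weighted sum, which is why the approach suits sparse signals.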
Affiliation(s)
- Yeonil Kim
  - Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, NJ 07065, USA
- Yueh-Yun Chi
  - Department of Pediatrics, Keck School of Medicine, University of Southern California, Los Angeles, CA 90089, USA
- Judong Shen
  - Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, NJ 07065, USA
- Fei Zou
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
4
Alexander-Bloch AF, Sood R, Shinohara RT, Moore TM, Calkins ME, Chertavian C, Wolf DH, Gur RC, Satterthwaite TD, Gur RE, Barzilay R. Connectome-wide Functional Connectivity Abnormalities in Youth With Obsessive-Compulsive Symptoms. Biol Psychiatry Cogn Neurosci Neuroimaging 2022; 7:1068-1077. [PMID: 34375730 PMCID: PMC8821731 DOI: 10.1016/j.bpsc.2021.07.014] [Received: 04/05/2021] [Revised: 07/16/2021] [Accepted: 07/29/2021] [Indexed: 12/16/2022]
Abstract
BACKGROUND Obsessive-compulsive symptomatology (OCS) is common in adolescence but usually does not meet the diagnostic threshold for obsessive-compulsive disorder. Nevertheless, both obsessive-compulsive disorder and subthreshold OCS are associated with increased likelihood of experiencing other serious psychiatric conditions, including depression and suicidal ideation. Unfortunately, there is limited information on the neurobiology of OCS. METHODS Here, we undertook one of the first brain imaging studies of OCS in a large adolescent sample (analyzed n = 832) from the Philadelphia Neurodevelopmental Cohort. We investigated resting-state functional magnetic resonance imaging functional connectivity using complementary analytic approaches that focus on different neuroanatomical scales, from known functional systems to connectome-wide tests. RESULTS We found a robust pattern of connectome-wide, OCS-related differences, as well as evidence of specific abnormalities involving known functional systems, including dorsal and ventral attention, frontoparietal, and default mode systems. Analysis of cerebral perfusion imaging and high-resolution structural imaging did not show OCS-related differences, consistent with domain specificity to functional connectivity. CONCLUSIONS The brain connectomic associations with OCS reported here, together with early studies of its clinical relevance, support the potential for OCS as an early marker of psychiatric risk that may enhance our understanding of mechanisms underlying the onset of adolescent psychopathology.
Affiliation(s)
- Aaron F Alexander-Bloch
  - Department of Child and Adolescent Psychiatry and Behavioral Science, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania; Department of Psychiatry, University of Pennsylvania, Philadelphia, Pennsylvania; CHOP/Penn Lifespan Brain Institute, University of Pennsylvania, Philadelphia, Pennsylvania
- Rahul Sood
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, Pennsylvania
- Russell T Shinohara
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania; Penn Statistics in Imaging and Visualization Center, University of Pennsylvania, Philadelphia, Pennsylvania; Penn Center for Biomedical Image Computing and Analytics, University of Pennsylvania, Philadelphia, Pennsylvania
- Tyler M Moore
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, Pennsylvania; CHOP/Penn Lifespan Brain Institute, University of Pennsylvania, Philadelphia, Pennsylvania
- Monica E Calkins
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, Pennsylvania; CHOP/Penn Lifespan Brain Institute, University of Pennsylvania, Philadelphia, Pennsylvania
- Casey Chertavian
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, Pennsylvania
- Daniel H Wolf
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, Pennsylvania
- Ruben C Gur
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, Pennsylvania; CHOP/Penn Lifespan Brain Institute, University of Pennsylvania, Philadelphia, Pennsylvania
- Theodore D Satterthwaite
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, Pennsylvania; Penn Lifespan Informatics and Neuroimaging Center, University of Pennsylvania, Philadelphia, Pennsylvania; CHOP/Penn Lifespan Brain Institute, University of Pennsylvania, Philadelphia, Pennsylvania
- Raquel E Gur
  - Department of Child and Adolescent Psychiatry and Behavioral Science, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania; Department of Psychiatry, University of Pennsylvania, Philadelphia, Pennsylvania; CHOP/Penn Lifespan Brain Institute, University of Pennsylvania, Philadelphia, Pennsylvania
- Ran Barzilay
  - Department of Child and Adolescent Psychiatry and Behavioral Science, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania; Department of Psychiatry, University of Pennsylvania, Philadelphia, Pennsylvania; CHOP/Penn Lifespan Brain Institute, University of Pennsylvania, Philadelphia, Pennsylvania
5
Monte AA, Mackenzie IA, Pattee J, Kaiser S, Willems E, Rumack B, Reynolds KM, Dart RC, Heard KJ. Genetic variants associated with ALT elevation from therapeutic acetaminophen. Clin Toxicol (Phila) 2022; 60:1198-1204. [PMID: 36102175 PMCID: PMC9701448 DOI: 10.1080/15563650.2022.2117053] [Received: 06/07/2022] [Revised: 08/16/2022] [Accepted: 08/18/2022] [Indexed: 11/03/2022]
Abstract
BACKGROUND Several studies have suggested genetic variants associated with acetaminophen induced liver injury (DILI) following overdose. Genetic variation associated with acetaminophen-induced alanine aminotransferase elevation during therapeutic dosing has not been examined. METHODS We performed genetic analyses on patients that ingested therapeutic doses of 4 grams of acetaminophen for up to 16 days. We examined 20 genes previously implicated in the metabolism of acetaminophen or the development of immune-mediated DILI using the Illumina Multi-Ethnic Global Array 2. Autosomes were aligned and imputed using TOPMed. A candidate gene region analysis was performed by testing each gene individually using linkage disequilibrium (LD) pruned variants with the adaptive sum of powered scores (aSPU) test from the aSPU R package. The highest measured ALT during therapy, the maximum ALT, was used as the outcome. RESULTS 192 subjects taking therapeutic APAP were included in the genetic analysis. 136 (70.8%) were female, 133 (69.2%) were Caucasian race, and the median age was 34 years (IQR: 26, 46). Age > 50 years was the only clinical factor associated with maximum ALT increase. Variants in SULT1E1, the gene responsible for Sulfotransferase Family 1E Member 1 enzyme production, were associated with maximum ALT. No single variant drove this association, but rather the association was due to the additive effects of numerous variants within the gene. No other genes were associated with maximum ALT increase in this cohort. CONCLUSION Acetaminophen induced ALT elevation at therapeutic doses was not associated with variation in most genes associated with acetaminophen metabolism or immune-induced DILI in this cohort. The role of SULT1E1 polymorphism in acetaminophen-induced elevated ALT needs further examination.
Affiliation(s)
- Andrew A. Monte
  - University of Colorado School of Medicine, Department of Emergency Medicine, Aurora, CO
  - University of Colorado School of Medicine, Center for Bioinformatics & Personalized Medicine, Aurora, CO
  - University of Colorado, Skaggs School of Pharmacy, Aurora, CO
  - Rocky Mountain Poison & Drug Safety, Denver Health and Hospital Authority, Denver, CO
- Ian Arriaga Mackenzie
  - Department of Biostatistics & Informatics, Colorado School of Public Health, University of Colorado-Denver Anschutz Medical Campus, Aurora, CO
- Jack Pattee
  - Department of Biostatistics & Informatics, Colorado School of Public Health, University of Colorado-Denver Anschutz Medical Campus, Aurora, CO
- Sasha Kaiser
  - Rocky Mountain Poison & Drug Safety, Denver Health and Hospital Authority, Denver, CO
- Emileigh Willems
  - Department of Biostatistics & Informatics, Colorado School of Public Health, University of Colorado-Denver Anschutz Medical Campus, Aurora, CO
- Barry Rumack
  - University of Colorado School of Medicine, Department of Emergency Medicine, Aurora, CO
  - Rocky Mountain Poison & Drug Safety, Denver Health and Hospital Authority, Denver, CO
- Kate M. Reynolds
  - Rocky Mountain Poison & Drug Safety, Denver Health and Hospital Authority, Denver, CO
- Richard C. Dart
  - Rocky Mountain Poison & Drug Safety, Denver Health and Hospital Authority, Denver, CO
- Kennon J. Heard
  - University of Colorado School of Medicine, Department of Emergency Medicine, Aurora, CO
  - Rocky Mountain Poison & Drug Safety, Denver Health and Hospital Authority, Denver, CO
6
Deng Y, He Y, Xu G, Pan W. Speeding up Monte Carlo simulations for the adaptive sum of powered score test with importance sampling. Biometrics 2022; 78:261-273. [PMID: 33215683 PMCID: PMC8134502 DOI: 10.1111/biom.13407] [Received: 06/30/2019] [Revised: 08/30/2020] [Accepted: 10/29/2020] [Indexed: 12/21/2022]
Abstract
A central but challenging problem in genetic studies is to test for (usually weak) associations between a complex trait (e.g., a disease status) and sets of multiple genetic variants. Due to the lack of a uniformly most powerful test, data-adaptive tests, such as the adaptive sum of powered score (aSPU) test, are advantageous in maintaining high power against a wide range of alternatives. However, there is often no closed-form to accurately and analytically calculate the p-values of many adaptive tests like aSPU, thus Monte Carlo (MC) simulations are often used, which can be time consuming to achieve a stringent significance level (e.g., 5e-8) used in genome-wide association studies (GWAS). To estimate such a small p-value, we need a huge number of MC simulations (e.g., 1e+10). As an alternative, we propose using importance sampling to speed up such calculations. We develop some theory to motivate a proposed algorithm for the aSPU test, and show that the proposed method is computationally more efficient than the standard MC simulations. Using both simulated and real data, we demonstrate the superior performance of the new method over the standard MC simulations.
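To make the computational burden described above concrete, here is a minimal sketch of a plain Monte Carlo aSPU p-value, the baseline that the abstract proposes to speed up. It does not implement the authors' importance-sampling algorithm. The function names `spu_stats` and `aspu_pvalue` are hypothetical, and the null scores are drawn as independent standard normals, an assumption that ignores correlation among variants.

```python
import bisect
import random


def spu_stats(scores, gammas):
    """Return |SPU(gamma)| for each gamma; None stands for gamma = infinity."""
    stats = []
    for g in gammas:
        if g is None:
            stats.append(max(abs(u) for u in scores))       # SPU(inf) = max_j |U_j|
        else:
            stats.append(abs(sum(u ** g for u in scores)))   # SPU(g) = sum_j U_j^g
    return stats


def aspu_pvalue(scores, gammas=(1, 2, 3, 4, None), n_sim=2000, seed=0):
    """Plain Monte Carlo aSPU p-value (illustrative; independent N(0, 1) null scores)."""
    rng = random.Random(seed)
    k = len(scores)
    obs = spu_stats(scores, gammas)
    # Simulate null score vectors and their SPU statistics.
    null = [spu_stats([rng.gauss(0.0, 1.0) for _ in range(k)], gammas)
            for _ in range(n_sim)]
    cols = [sorted(row[i] for row in null) for i in range(len(gammas))]

    def exceed(i, value):
        # Number of null statistics strictly greater than `value` for gamma i.
        return n_sim - bisect.bisect_right(cols[i], value)

    # Minimum p-value over gamma for the observed data.
    min_obs = min(exceed(i, obs[i]) / n_sim for i in range(len(gammas)))
    # Minimum p-value over gamma for each null replicate, ranked against the others.
    min_null = [min(exceed(i, row[i]) / (n_sim - 1) for i in range(len(gammas)))
                for row in null]
    # Calibrate the observed minimum against the null minima.
    return (1 + sum(m <= min_obs for m in min_null)) / (n_sim + 1)
```

Because the smallest p-value this scheme can report is about 1/(n_sim + 1), resolving a threshold such as 5e-8 requires a very large number of simulations, which is the cost the abstract's importance-sampling proposal targets.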
Affiliation(s)
- Yangqing Deng
  - Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA; Department of Mathematics, University of North Texas, Denton, TX 76203, USA
- Yinqiu He
  - Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
- Gongjun Xu
  - Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
- Wei Pan
  - Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA
7
Banerjee K, Chen J, Zhan X. Adaptive and powerful microbiome multivariate association analysis via feature selection. NAR Genom Bioinform 2022; 4:lqab120. [PMID: 35047812 PMCID: PMC8759573 DOI: 10.1093/nargab/lqab120] [Received: 05/30/2021] [Revised: 11/13/2021] [Accepted: 12/24/2021] [Indexed: 02/06/2023]
Abstract
The important role of human microbiome is being increasingly recognized in health and disease conditions. Since microbiome data is typically high dimensional, one popular mode of statistical association analysis for microbiome data is to pool individual microbial features into a group, and then conduct group-based multivariate association analysis. A corresponding challenge within this approach is to achieve adequate power to detect an association signal between a group of microbial features and the outcome of interest across a wide range of scenarios. Recognizing some existing methods' susceptibility to the adverse effects of noise accumulation, we introduce the Adaptive Microbiome Association Test (AMAT), a novel and powerful tool for multivariate microbiome association analysis, which unifies both blessings of feature selection in high-dimensional inference and robustness of adaptive statistical association testing. AMAT first alleviates the burden of noise accumulation via distance correlation learning, and then conducts a data-adaptive association test under the flexible generalized linear model framework. Extensive simulation studies and real data applications demonstrate that AMAT is highly robust and often more powerful than several existing methods, while preserving the correct type I error rate. A free implementation of AMAT in R computing environment is available at https://github.com/kzb193/AMAT.
Affiliation(s)
- Xiang Zhan
  - To whom correspondence should be addressed. Tel: +86 10 62744132; Fax: +86 10 62744134
8
Yang Y, Basu S, Zhang L. A Bayesian hierarchically structured prior for gene-based association testing with multiple traits in genome-wide association studies. Genet Epidemiol 2022; 46:63-72. [PMID: 34787916 PMCID: PMC8795481 DOI: 10.1002/gepi.22437] [Received: 06/24/2021] [Revised: 09/28/2021] [Accepted: 10/18/2021] [Indexed: 02/03/2023]
Abstract
Although genome-wide association studies (GWAS) often collect data on multiple correlated traits for complex diseases, conventional gene-based analysis is usually univariate and therefore treats traits as uncorrelated. Multivariate analysis of multiple correlated traits can potentially increase the power to detect genes that affect some or all of these traits. In this study, we propose the multivariate hierarchically structured variable selection (HSVS-M) model, a flexible Bayesian model that tests the association of a gene with multiple correlated traits. With only summary statistics, HSVS-M can account for the correlations among genetic variants and among traits simultaneously and can also estimate the various directions and magnitudes of associations between a gene and multiple traits. Simulation studies show that HSVS-M substantially outperforms competing methods in various scenarios, particularly when variants in a gene are associated with a trait in similar directions and magnitudes. We applied HSVS-M to the summary statistics of a meta-analysis GWAS on four lipid traits from the Global Lipids Genetics Consortium and identified 15 genes that have also been confirmed as risk factors in previous studies.
Affiliation(s)
- Yi Yang
  - Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA; Department of Biostatistics, Columbia University, New York, NY 10032, USA
- Saonli Basu
  - Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA
- Lin Zhang
  - Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA
9
Dutta D, VandeHaar P, Fritsche LG, Zöllner S, Boehnke M, Scott LJ, Lee S. A powerful subset-based method identifies gene set associations and improves interpretation in UK Biobank. Am J Hum Genet 2021; 108:669-681. [PMID: 33730541 DOI: 10.1016/j.ajhg.2021.02.016] [Received: 08/23/2020] [Accepted: 02/19/2021] [Indexed: 02/06/2023]
Abstract
Tests of association between a phenotype and a set of genes in a biological pathway can provide insights into the genetic architecture of complex phenotypes beyond those obtained from single-variant or single-gene association analysis. However, most existing gene set tests have limited power to detect gene set-phenotype association when a small fraction of the genes are associated with the phenotype and cannot identify the potentially "active" genes that might drive a gene set-based association. To address these issues, we have developed Gene set analysis Association Using Sparse Signals (GAUSS), a method for gene set association analysis that requires only GWAS summary statistics. For each significantly associated gene set, GAUSS identifies the subset of genes that have the maximal evidence of association and can best account for the gene set association. Using pre-computed correlation structure among test statistics from a reference panel, our p value calculation is substantially faster than other permutation- or simulation-based approaches. In simulations with varying proportions of causal genes, we find that GAUSS effectively controls type 1 error rate and has greater power than several existing methods, particularly when a small proportion of genes account for the gene set signal. Using GAUSS, we analyzed UK Biobank GWAS summary statistics for 10,679 gene sets and 1,403 binary phenotypes. We found that GAUSS is scalable and identified 13,466 phenotype and gene set association pairs. Within these gene sets, we identify an average of 17.2 (max = 405) genes that underlie these gene set associations.
Affiliation(s)
- Diptavo Dutta
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA
- Peter VandeHaar
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Lars G Fritsche
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Sebastian Zöllner
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Michael Boehnke
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Laura J Scott
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Seunggeun Lee
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA; Graduate School of Data Science, Seoul National University, Seoul 08826, Republic of Korea
10
Yang Y, Basu S, Zhang L. A Bayesian hierarchically structured prior for rare-variant association testing. Genet Epidemiol 2021; 45:413-424. [PMID: 33565109 DOI: 10.1002/gepi.22379] [Received: 05/06/2020] [Revised: 01/08/2021] [Accepted: 01/25/2021] [Indexed: 12/12/2022]
Abstract
Although genome-wide association studies have been widely used to identify associations between complex diseases and genetic variants, standard single-variant analyses often have limited power when applied to rare variants. To overcome this problem, set-based methods have been developed with the aim of boosting power by borrowing strength from multiple rare variants. We propose the adaptive hierarchically structured variable selection (HSVS-A) prior to test for association of rare variants in a set with continuous or dichotomous phenotypes and to estimate the effect of individual rare variants simultaneously. HSVS-A has the flexibility to integrate a pairwise weighting scheme, which adaptively induces desirable correlations among variants of similar significance such that we can borrow information from potentially causal and noncausal rare variants to boost power. Simulation studies show that for both continuous and dichotomous phenotypes, HSVS-A is powerful when there are multiple causal rare variants, either in the same or opposite direction of effect, with the presence of a large number of noncausal variants. We also apply HSVS-A to the Wellcome Trust Case Control Consortium Crohn's disease data for testing the association of Crohn's disease with rare variants in pathways. HSVS-A identifies two pathways harboring novel protective rare variants for Crohn's disease.
Affiliation(s)
- Yi Yang
  - Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA; Department of Biostatistics, Columbia University, New York, New York, USA
- Saonli Basu
  - Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA
- Lin Zhang
  - Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA
11
Novel directions in data pre-processing and genome-wide association study (GWAS) methodologies to overcome ongoing challenges. Informatics in Medicine Unlocked 2021. [DOI: 10.1016/j.imu.2021.100586] [Indexed: 11/18/2022]
12
Xue Y, Ding J, Wang J, Zhang S, Pan D. Two-phase SSU and SKAT in genetic association studies. J Genet 2020. [DOI: 10.1007/s12041-019-1166-2] [Indexed: 11/24/2022]
13
Knutson KA, Deng Y, Pan W. Implicating causal brain imaging endophenotypes in Alzheimer's disease using multivariable IWAS and GWAS summary data. Neuroimage 2020; 223:117347. [PMID: 32898681 PMCID: PMC7778364 DOI: 10.1016/j.neuroimage.2020.117347] [Received: 04/04/2020] [Revised: 08/24/2020] [Accepted: 08/28/2020] [Indexed: 02/06/2023]
Abstract
Recent evidence suggests the existence of many undiscovered heritable brain phenotypes involved in Alzheimer's Disease (AD) pathogenesis. This finding necessitates methods for the discovery of causal brain changes in AD that integrate Magnetic Resonance Imaging measures and genotypic data. However, existing approaches for causal inference in this setting, such as the univariate Imaging Wide Association Study (UV-IWAS), suffer from inconsistent effect estimation and inflated Type I errors in the presence of genetic pleiotropy, the phenomenon in which a variant affects multiple causal intermediate risk phenotypes. In this study, we implement a multivariate extension to the IWAS model, namely MV-IWAS, to consistently estimate and test for the causal effects of multiple brain imaging endophenotypes from the Alzheimer's Disease Neuroimaging Initiative (ADNI) in the presence of pleiotropic and possibly correlated SNPs. We further extend MV-IWAS to incorporate variant-specific direct effects on AD, analogous to the existing Egger regression Mendelian Randomization approach, which allows for testing of remaining pleiotropy after adjusting for multiple intermediate pathways. We propose a convenient approach for implementing MV-IWAS that solely relies on publicly available GWAS summary data and a reference panel. Through simulations with either individual-level or summary data, we demonstrate the well controlled Type I errors and superior power of MV-IWAS over UV-IWAS in the presence of pleiotropic SNPs. We apply the summary statistic based tests to 1578 heritable imaging derived phenotypes (IDPs) from the UK Biobank. MV-IWAS detected numerous IDPs as possible false positives by UV-IWAS while uncovering many additional causal neuroimaging phenotypes in AD which are strongly supported by the existing literature.
Affiliation(s)
- Katherine A Knutson
- Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, United States
- Yangqing Deng
- Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, United States
- Wei Pan
- Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, United States
|
14
|
Zhang L, Papachristou C, Choudhary PK, Biswas S. A Bayesian Hierarchical Framework for Pathway Analysis in Genome-Wide Association Studies. Hum Hered 2020; 84:240-255. [PMID: 32966977 DOI: 10.1159/000508664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2019] [Accepted: 05/14/2020] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Pathway analysis allows joint consideration of multiple SNPs belonging to multiple genes, which in turn belong to a biologically defined pathway. This type of analysis is usually more powerful than single-SNP analyses for detecting joint effects of variants in a pathway. METHODS We develop a Bayesian hierarchical model by fully modeling the 3-level hierarchy, namely, SNP-gene-pathway that is naturally inherent in the structure of the pathways, unlike the currently used ad hoc ways of combining such information. We model the effects at each level conditional on the effects of the levels preceding them within the generalized linear model framework. To deal with the high dimensionality, we regularize the regression coefficients through an appropriate choice of priors. The model is fit using a combination of iteratively weighted least squares and expectation-maximization algorithms to estimate the posterior modes and their standard errors. A normal approximation is used for inference. RESULTS We conduct simulations to study the proposed method and find that our method has higher power than some standard approaches in several settings for identifying pathways with multiple modest-sized variants. We illustrate the method by analyzing data from two genome-wide association studies on breast and renal cancers. CONCLUSION Our method can be helpful in detecting pathway association.
Affiliation(s)
- Lei Zhang
- Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Pankaj K Choudhary
- Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Swati Biswas
- Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas, USA
|
15
|
Wu C, Xu G, Shen X, Pan W. A Regularization-Based Adaptive Test for High-Dimensional Generalized Linear Models. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2020; 21:128. [PMID: 32802002 PMCID: PMC7425805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
In spite of its urgent importance in the era of big data, testing high-dimensional parameters in generalized linear models (GLMs) in the presence of high-dimensional nuisance parameters has been largely under-studied, especially with regard to constructing powerful tests for general (and unknown) alternatives. Most existing tests are powerful only against certain alternatives and may yield incorrect Type I error rates under high-dimensional nuisance parameter situations. In this paper, we propose the adaptive interaction sum of powered score (aiSPU) test in the framework of penalized regression with a non-convex penalty, called truncated Lasso penalty (TLP), which can maintain correct Type I error rates while yielding high statistical power across a wide range of alternatives. To calculate its p-values analytically, we derive its asymptotic null distribution. Via simulations, its superior finite-sample performance is demonstrated over several representative existing methods. In addition, we apply it and other representative tests to an Alzheimer's Disease Neuroimaging Initiative (ADNI) data set, detecting possible gene-gender interactions for Alzheimer's disease. We also put R package "aispu" implementing the proposed test on GitHub.
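For orientation, the generic sum of powered score (SPU) family that adaptive tests of this kind build on can be written as below. This is the standard form from the earlier SPU/aSPU literature, given as background only: $U_j$ denotes the score statistic for the $j$-th parameter under test, $\Gamma$ a candidate set of powers, and $P_{\mathrm{SPU}(\gamma)}$ the p-value of each member test. How aiSPU obtains the scores after the TLP-penalized fit of the nuisance parameters, and how it standardizes them, is not described in the abstract and is not reproduced here.

```latex
\mathrm{SPU}(\gamma) = \sum_{j=1}^{p} U_j^{\gamma}, \qquad
\mathrm{SPU}(\infty) = \max_{1 \le j \le p} |U_j|, \qquad
T_{\mathrm{aSPU}} = \min_{\gamma \in \Gamma} P_{\mathrm{SPU}(\gamma)}.
```

Small powers such as $\gamma = 1$ behave like a burden-type sum, $\gamma = 2$ like a variance-component-type statistic, and large $\gamma$ emphasizes the strongest individual scores, which is why taking the minimum p-value adapts to the unknown alternative.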
Affiliation(s)
- Chong Wu
- Department of Statistics, Florida State University, FL, USA
- Gongjun Xu
- Department of Statistics, University of Michigan, MI, USA
- Xiaotong Shen
- School of Statistics, University of Minnesota, MN, USA
- Wei Pan
- Division of Biostatistics, University of Minnesota, MN, USA
|
16
|
Coombes BJ, Ploner A, Bergen SE, Biernacka JM. A principal component approach to improve association testing with polygenic risk scores. Genet Epidemiol 2020; 44:676-686. [PMID: 32691445 DOI: 10.1002/gepi.22339] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 01/17/2020] [Revised: 05/13/2020] [Accepted: 07/10/2020] [Indexed: 12/16/2022]
Abstract
Polygenic risk scores (PRSs) have become an increasingly popular approach for demonstrating polygenic influences on complex traits and for establishing common polygenic signals between different traits. PRSs are typically constructed using pruning and thresholding (P+T), but the best choice of parameters is uncertain; thus multiple settings are used and the best is chosen. Optimization can lead to inflated Type I error. Permutation procedures can correct this, but they can be computationally intensive. Alternatively, a single parameter setting can be chosen a priori for the PRS, but choosing suboptimal settings results in loss of power. We propose computing PRSs under a range of parameter settings, performing principal component analysis (PCA) on the resulting set of PRSs, and using the first PRS-PC in association tests. The first PC reweights the variants included in the PRS to achieve maximum variation over all PRS settings used. Using simulations and a real data application to study PRS association with bipolar disorder and psychosis in bipolar disorder, we compare the performance of the proposed PRS-PCA approach with a permutation test and an a priori selected p-value threshold. The PRS-PCA approach is simple to implement, outperforms the other strategies in most scenarios, and provides an unbiased estimate of prediction performance.
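The PRS-PCA idea described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy genotypes, effect sizes, p-values, the threshold grid, and the helper name `prs_first_pc` are all made up for the example, LD pruning is skipped, and standardizing each PRS before PCA is an assumption.

```python
import numpy as np

def prs_first_pc(prs_matrix):
    """First principal component of PRSs computed under several settings.

    prs_matrix: array of shape (n_individuals, n_settings).
    Returns (pc1_scores, loadings, explained_variance_ratio).
    """
    x = np.asarray(prs_matrix, dtype=float)
    # Standardize each PRS so no single setting dominates by scale (an assumption).
    x = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Rows of vt are the principal axes of the standardized PRS matrix.
    _, s, vt = np.linalg.svd(x, full_matrices=False)
    loadings = vt[0]
    if loadings.sum() < 0:  # fix the arbitrary sign of the first axis
        loadings = -loadings
    return x @ loadings, loadings, s[0] ** 2 / np.sum(s ** 2)

# Toy data, purely illustrative.
rng = np.random.default_rng(0)
n_individuals, n_variants = 200, 50
genotypes = rng.binomial(2, 0.3, size=(n_individuals, n_variants)).astype(float)
effects = rng.normal(0.0, 0.1, size=n_variants)
gwas_p = np.linspace(0.001, 1.0, n_variants)  # stand-in GWAS p-values
thresholds = [0.05, 0.1, 0.5, 1.0]            # hypothetical thresholding grid

# One PRS per p-value threshold: weighted allele counts of the retained variants.
prs = np.column_stack([
    genotypes[:, gwas_p <= t] @ effects[gwas_p <= t] for t in thresholds
])
pc1, loadings, explained = prs_first_pc(prs)
```

The resulting `pc1` would then take the place of a single PRS as the predictor in the association model, so only one test is run instead of one per threshold.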
Affiliation(s)
- Brandon J Coombes
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
- Alexander Ploner
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Sarah E Bergen
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Joanna M Biernacka
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota; Department of Psychiatry and Psychology, Mayo Clinic, Rochester, Minnesota
|
17
|
Rotroff DM. A Bioinformatics Crash Course for Interpreting Genomics Data. Chest 2020; 158:S113-S123. [PMID: 32658646 PMCID: PMC8176646 DOI: 10.1016/j.chest.2020.03.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2019] [Revised: 11/11/2019] [Accepted: 03/09/2020] [Indexed: 10/23/2022] Open
Abstract
Reductions in genotyping costs and improvements in computational power have made conducting genome-wide association studies (GWAS) standard practice for many complex diseases. GWAS is the assessment of genetic variants across the genome of many individuals to determine which, if any, genetic variants are associated with a specific trait. As with any analysis, there are evolving best practices that should be followed to ensure scientific rigor and reliability in the conclusions. This article presents a brief summary for many of the key bioinformatics considerations when either planning or evaluating GWAS. This review is meant to serve as a guide to those without deep expertise in bioinformatics and GWAS and give them tools to critically evaluate this popular approach to investigating complex diseases. In addition, a checklist is provided that can be used by investigators to evaluate whether a GWAS has appropriately accounted for the many potential sources of bias and generally followed current best practices.
Affiliation(s)
- Daniel M Rotroff
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH
|
18
|
Bao F, Deng Y, Du M, Ren Z, Wan S, Liang KY, Liu S, Wang B, Xin J, Chen F, Christiani DC, Wang M, Dai Q. Explaining the Genetic Causality for Complex Phenotype via Deep Association Kernel Learning. PATTERNS 2020; 1:100057. [PMID: 33205126 PMCID: PMC7660384 DOI: 10.1016/j.patter.2020.100057] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/26/2020] [Revised: 05/25/2020] [Accepted: 06/01/2020] [Indexed: 02/07/2023]
Abstract
The genetic effect explains the causality from genetic mutations to the development of complex diseases. Existing genome-wide association study (GWAS) approaches are always built under a linear assumption, restricting their generalization in dissecting complicated causality such as the recessive genetic effect. Therefore, a sophisticated and general GWAS model that can work with different types of genetic effects is highly desired. Here, we introduce a deep association kernel learning (DAK) model to enable automatic causal genotype encoding for GWAS at pathway level. DAK can detect both common and rare variants with complicated genetic effects where existing approaches fail. When applied to four real-world GWAS datasets including cancers and schizophrenia, our DAK discovered potential causal pathways, including the association between dilated cardiomyopathy pathway and schizophrenia.
Affiliation(s)
- Feng Bao
- Department of Automation, Tsinghua University, Beijing 100084, China; Institute for Brain and Cognitive Sciences, Tsinghua University, Beijing 100084, China
- Yue Deng
- School of Astronautics, Beihang University, Beijing 100191, China; Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing 100191, China
- Mulong Du
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing 211166, China
- Zhiquan Ren
- Department of Automation, Tsinghua University, Beijing 100084, China
- Sen Wan
- Department of Automation, Tsinghua University, Beijing 100084, China
- Kenny Ye Liang
- Department of Automation, Tsinghua University, Beijing 100084, China
- Shaohua Liu
- School of Astronautics, Beihang University, Beijing 100191, China
- Bo Wang
- School of Astronautics, Beihang University, Beijing 100191, China
- Junyi Xin
- Department of Environmental Genomics, Jiangsu Key Laboratory of Cancer Biomarkers, Prevention and Treatment, Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing 211166, China; Department of Genetic Toxicology, The Key Laboratory of Modern Toxicology of Ministry of Education, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing 211166, China
- Feng Chen
- Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing 211166, China
- David C Christiani
- Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA
- Meilin Wang
- Department of Environmental Genomics, Jiangsu Key Laboratory of Cancer Biomarkers, Prevention and Treatment, Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing 211166, China; Department of Genetic Toxicology, The Key Laboratory of Modern Toxicology of Ministry of Education, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing 211166, China
- Qionghai Dai
- Department of Automation, Tsinghua University, Beijing 100084, China; Institute for Brain and Cognitive Sciences, Tsinghua University, Beijing 100084, China
|
19
|
Zhang J, Xie S, Gonzales S, Liu J, Wang X. A fast and powerful eQTL weighted method to detect genes associated with complex trait using GWAS summary data. Genet Epidemiol 2020; 44:550-563. [PMID: 32350919 DOI: 10.1002/gepi.22297] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 10/14/2019] [Revised: 04/13/2020] [Accepted: 04/14/2020] [Indexed: 02/06/2023]
Abstract
Although genome-wide association studies (GWASs) have identified many genetic variants underlying complex traits, a large fraction of heritability still remains unexplained. Integrative analysis that incorporates additional information, such as expression quantitative trait locus (eQTL) data, into sequencing studies (denoted as transcriptome-wide association study [TWAS]) can aid the discovery of trait-associated genetic variants. However, general TWAS methods only incorporate one eQTL-derived weight (e.g., cis-effect), and thus can suffer a substantial loss of power when the single estimated cis-effect is not predictive of the effect size of a genetic variant, when there are estimation errors in the estimated cis-effect, or when the data are not consistent with the model assumption. In this study, we propose an omnibus test (OT) which utilizes a Cauchy association test to integrate association evidence demonstrated by three different traditional tests (burden test, quadratic test, and adaptive test) using GWAS summary data with multiple eQTL-derived weights. The p value of the proposed test can be calculated analytically, and thus it is fast and efficient. We applied our proposed test to two schizophrenia (SCZ) GWAS summary data sets and two lipid trait (HDL) GWAS summary data sets. Compared with the three traditional tests, our proposed OT can identify more trait-associated genes.
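The reason the omnibus p value can be computed analytically is that a Cauchy-type combination has a closed-form tail probability. The sketch below shows the generic Cauchy combination rule only; the function name `cauchy_combination`, the equal weights, and the three input p-values are illustrative assumptions, and the paper's actual choice of tests and eQTL-derived weights is not reproduced.

```python
import math

def cauchy_combination(pvalues, weights=None):
    """Combine p-values with the Cauchy combination rule; returns a p-value."""
    pvalues = list(pvalues)
    if not pvalues or any(not (0.0 < p < 1.0) for p in pvalues):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0] * len(pvalues)  # equal weights: an assumption
    total = float(sum(weights))
    # Each p-value is mapped to a standard Cauchy variate under the null,
    # and the variates are averaged with normalized weights.
    stat = sum(w / total * math.tan((0.5 - p) * math.pi)
               for w, p in zip(weights, pvalues))
    # Upper-tail probability of a standard Cauchy at the combined statistic.
    return 0.5 - math.atan(stat) / math.pi

# Made-up p-values standing in for burden, quadratic, and adaptive tests.
p_burden, p_quadratic, p_adaptive = 0.01, 0.2, 0.6
p_omnibus = cauchy_combination([p_burden, p_quadratic, p_adaptive])
```

Because no permutation or resampling is involved, the cost of combining is negligible next to computing the component tests themselves.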
Affiliation(s)
- Jianjun Zhang
- Department of Mathematics, University of North Texas, Denton, Texas
- Sicong Xie
- Beijing National Day School, Beijing, China
- Samantha Gonzales
- Department of Computer Science and Engineering, University of North Texas, Denton, Texas
- Jianguo Liu
- Department of Mathematics, University of North Texas, Denton, Texas
- Xuexia Wang
- Department of Mathematics, University of North Texas, Denton, Texas
|
20
|
Zhang M, Gelfman S, McCarthy J, Harms MB, Moreno CAM, Goldstein DB, Allen AS. Incorporating external information to improve sparse signal detection in rare-variant gene-set-based analyses. Genet Epidemiol 2020; 44:330-338. [PMID: 32043633 DOI: 10.1002/gepi.22283] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/28/2019] [Revised: 12/17/2019] [Accepted: 01/27/2020] [Indexed: 01/30/2023]
Abstract
Gene-set analyses are used to assess whether there is any evidence of association with disease among a set of biologically related genes. Such an analysis typically treats all genes within the sets similarly, even though there is substantial, external, information concerning the likely importance of each gene within each set. For example, for traits that are under purifying selection, we would expect genes showing extensive genic constraint to be more likely to be trait associated than unconstrained genes. Here we improve gene-set analyses by incorporating such external information into a higher-criticism-based signal detection analysis. We show that when this external information is predictive of whether a gene is associated with disease, our approach can lead to a significant increase in power. Further, our approach is particularly powerful when the signal is sparse, that is when only a small number of genes within the set are associated with the trait. We illustrate our approach with a gene-set analysis of amyotrophic lateral sclerosis (ALS) and implicate a number of gene-sets containing SOD1 and NEK1 as well as showing enrichment of small p values for gene-sets containing known ALS genes. We implement our approach in the R package wHC.
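As background for the higher-criticism-based analysis mentioned here, the sketch below computes the standard, unweighted higher criticism statistic from a set of gene-level p-values. It does not implement the weighting by external information that the paper proposes (the wHC package is not used); the function name, the `alpha0` search fraction, and the toy p-values are assumptions for illustration.

```python
import numpy as np

def higher_criticism(pvalues, alpha0=0.5):
    """Standard (unweighted) higher criticism statistic for a set of p-values."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against division by zero
    n = p.size
    ranks = np.arange(1, n + 1)
    # Standardized gap between the empirical and the uniform null distribution.
    hc = np.sqrt(n) * (ranks / n - p) / np.sqrt(p * (1 - p))
    # Search only the smallest alpha0 fraction of p-values, a common convention.
    k = max(1, int(np.floor(alpha0 * n)))
    return float(hc[:k].max())

# Toy example: evenly spread null p-values versus a sparse signal
# in which three genes carry very small p-values.
null_p = (np.arange(1, 101) - 0.5) / 100
sparse_p = null_p.copy()
sparse_p[:3] = 1e-6

hc_null = higher_criticism(null_p)
hc_sparse = higher_criticism(sparse_p)
```

The statistic reacts strongly when only a handful of p-values are extreme, which is the sparse-signal setting the abstract highlights.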
Affiliation(s)
- Mengqi Zhang
- Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina; Center for Genomic and Computational Biology, Duke University, Durham, North Carolina; Center for Statistical Genetics and Genomics, Duke University, Durham, North Carolina
- Sahar Gelfman
- Institute of Genomic Medicine, Columbia University, New York City, New York
- Janice McCarthy
- Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina
- Matthew B Harms
- Institute of Genomic Medicine, Columbia University, New York City, New York; Department of Neurology, Columbia University, New York City, New York; Center for Motor Neuron Biology and Disease, Columbia University, New York City, New York
- Cristiane A M Moreno
- Institute of Genomic Medicine, Columbia University, New York City, New York; Center for Motor Neuron Biology and Disease, Columbia University, New York City, New York
- David B Goldstein
- Institute of Genomic Medicine, Columbia University, New York City, New York
- Andrew S Allen
- Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina; Center for Genomic and Computational Biology, Duke University, Durham, North Carolina; Center for Statistical Genetics and Genomics, Duke University, Durham, North Carolina
|
21
|
Xue Y, Ding J, Wang J, Zhang S, Pan D. Two-phase SSU and SKAT in genetic association studies. J Genet 2020; 99:9. [PMID: 32089528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
The sum of squared score (SSU) and sequence kernel association test (SKAT) are two good alternative tests for genetic association studies in case-control data. Both SSU and SKAT are derived through assuming a dose-response model between the risk of disease and genotypes. However, in practice, the real genetic mode of inheritance is impossible to know. Thus, these two tests might lose power substantially, as shown in simulation results, when the genetic model is misspecified. Here, to make both the tests suitable in broad situations, we propose two-phase SSU (tpSSU) and two-phase SKAT (tpSKAT), where the Hardy-Weinberg equilibrium test is adopted to choose the genetic model in the first phase and the SSU and SKAT are constructed corresponding to the selected genetic model in the second phase. We found that both tpSSU and tpSKAT outperformed the original SSU and SKAT in most of our simulation scenarios. By applying tpSSU and tpSKAT to the study of type 2 diabetes data, we successfully identified some genes that have direct effects on obesity. Besides, we also detected the significant chromosomal region 10q21.22 in GAW16 rheumatoid arthritis dataset, with P < 10^-6. These findings suggest that tpSSU and tpSKAT can be effective in identifying genetic variants for complex diseases in case-control association studies.
Affiliation(s)
- Yuan Xue
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, People's Republic of China
|
22
|
Yang T, Kim J, Wu C, Ma Y, Wei P, Pan W. An adaptive test for meta-analysis of rare variant association studies. Genet Epidemiol 2020; 44:104-116. [PMID: 31830326 PMCID: PMC6980317 DOI: 10.1002/gepi.22273] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/15/2019] [Revised: 11/12/2019] [Accepted: 11/25/2019] [Indexed: 01/02/2023]
Abstract
Single genome-wide studies may be underpowered to detect trait-associated rare variants with moderate or weak effect sizes. As a viable alternative, meta-analysis is widely used to increase power by combining different studies. The power of meta-analysis critically depends on the underlying association patterns and heterogeneity levels, which are unknown and vary from locus to locus. However, existing methods mainly focus on one or only a few combinations of the association pattern and heterogeneity level, thus may lose power in many situations. To address this issue, we propose a general and unified framework by combining a class of tests including and beyond some existing ones, leading to high power across a wide range of scenarios. We demonstrate that the proposed test is more powerful than some existing methods in simulation studies, then show their performance with the NHLBI Exome-Sequencing Project (ESP) data. One gene (B4GALNT2) was found by our proposed test, but not by others, to be statistically significantly associated with plasma triglyceride. The signal was driven by African-ancestry subjects but it was previously reported to be associated with coronary artery disease among European-ancestry subjects. We implemented our method in an R package aSPUmeta, publicly available at https://github.com/ytzhong/metaRV and will be on CRAN soon.
Affiliation(s)
- Tianzhong Yang
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Junghi Kim
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Chong Wu
- Department of Statistics, Florida State University, Tallahassee, FL, USA
- Yiding Ma
- Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Peng Wei
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
|
23
|
Yang Y, Basu S, Zhang L. A Bayesian hierarchical variable selection prior for pathway-based GWAS using summary statistics. Stat Med 2019; 39:724-739. [PMID: 31777110 DOI: 10.1002/sim.8442] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 02/19/2019] [Revised: 10/27/2019] [Accepted: 11/10/2019] [Indexed: 12/23/2022]
Abstract
While genome-wide association studies (GWASs) have been widely used to uncover associations between diseases and genetic variants, standard SNP-level GWASs often lack the power to identify SNPs that individually have a moderate effect size but jointly contribute to the disease. To overcome this problem, pathway-based GWAS methods have been developed as an alternative strategy that complements SNP-level approaches. We propose a Bayesian method that uses the generalized fused hierarchical structured variable selection prior to identify pathways associated with the disease using SNP-level summary statistics. Our prior has the flexibility to take in pathway structural information so that it can model the gene-level correlation based on prior biological knowledge, an important feature that makes it appealing compared to existing pathway-based methods. Using simulations, we show that our method outperforms competing methods in various scenarios, particularly when we have pathway structural information that involves complex gene-gene interactions. We apply our method to the Wellcome Trust Case Control Consortium Crohn's disease GWAS data, demonstrating its practical application to real data.
Affiliation(s)
- Yi Yang
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
- Saonli Basu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
- Lin Zhang
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
|
24
|
Shimada M, Miyagawa T, Takeshima A, Kakita A, Toyoda H, Niizato K, Oshima K, Tokunaga K, Honda M. Epigenome-wide association study of narcolepsy-affected lateral hypothalamic brains, and overlapping DNA methylation profiles between narcolepsy and multiple sclerosis. Sleep 2019; 43:5574506. [DOI: 10.1093/sleep/zsz198] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/15/2019] [Revised: 07/07/2019] [Indexed: 01/05/2023] Open
Abstract
Narcolepsy with cataplexy is a sleep disorder caused by a deficiency in hypocretin neurons in the lateral hypothalamus (LH). Here we performed an epigenome-wide association study (EWAS) of DNA methylation for narcolepsy and replication analyses using DNA samples extracted from two brain regions: LH (Cases: N = 4; Controls: N = 4) and temporal cortex (Cases: N = 7; Controls: N = 7). Seventy-seven differentially methylated regions (DMRs) were identified in the LH analysis, with the top association of a DMR in the myelin basic protein (MBP) region. Only five DMRs were detected in the temporal cortex analysis. Genes annotated to LH DMRs were significantly associated with pathways related to fatty acid response or metabolism. Two additional analyses applying the EWAS data were performed: (1) investigation of methylation profiles shared between narcolepsy and other disorders and (2) an integrative analysis of DNA methylation data and a genome-wide association study for narcolepsy. The results of the two approaches, which included significant overlap of methylated positions associated with narcolepsy and multiple sclerosis, indicated that the two diseases may partly share their pathogenesis. In conclusion, DNA methylation in LH where loss of orexin-producing neurons occurs may play a role in the pathophysiology of the disease.
Affiliation(s)
- Mihoko Shimada
- Department of Psychiatry and Behavioral Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan
- Department of Human Genetics, Graduate School of Medicine, University of Tokyo, Tokyo, Japan
- Taku Miyagawa
- Department of Psychiatry and Behavioral Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan
- Department of Human Genetics, Graduate School of Medicine, University of Tokyo, Tokyo, Japan
- Akari Takeshima
- Department of Pathology, Brain Research Institute, Niigata University, Niigata, Japan
- Akiyoshi Kakita
- Department of Pathology, Brain Research Institute, Niigata University, Niigata, Japan
- Hiromi Toyoda
- Department of Human Genetics, Graduate School of Medicine, University of Tokyo, Tokyo, Japan
- Kazuhiro Niizato
- Department of Psychiatry, Tokyo Metropolitan Matsuzawa Hospital, Tokyo, Japan
- Kenichi Oshima
- Department of Psychiatry, Tokyo Metropolitan Matsuzawa Hospital, Tokyo, Japan
- Katsushi Tokunaga
- Department of Human Genetics, Graduate School of Medicine, University of Tokyo, Tokyo, Japan
- Makoto Honda
- Department of Psychiatry and Behavioral Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan
- Seiwa Hospital, Institute of Neuropsychiatry, Tokyo, Japan
|
25
|
Zhang J, Zhao Z, Guo X, Guo B, Wu B. Powerful statistical method to detect disease-associated genes using publicly available genome-wide association studies summary data. Genet Epidemiol 2019; 43:941-951. [PMID: 31392781 DOI: 10.1002/gepi.22251] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/27/2018] [Revised: 07/14/2019] [Accepted: 07/16/2019] [Indexed: 12/11/2022]
Abstract
Genome-wide association studies (GWAS) have thus far achieved substantial success. In the last decade, a large number of common variants underlying complex diseases have been identified through GWAS. In most existing GWAS, the identified common variants are obtained by single marker-based tests, that is, testing one single-nucleotide polymorphism (SNP) at a time. Generally, the basic functional unit of inheritance is a gene, rather than a SNP. Thus, results from gene-level association tests can be more readily integrated with downstream functional and pathogenic investigation. In this paper, we propose a general gene-based p-value adaptive combination approach (GPA) which can integrate association evidence of multiple genetic variants using only GWAS summary statistics (either p-values or other test statistics). The proposed method could be used to test genetic association for both continuous and binary traits through not only one study but also multiple studies, which would be helpful to overcome the limitation of existing methods that can only be applied to a specific type of data. We conducted thorough simulation studies to verify that the proposed method controls type I errors well, and performs favorably compared to single-marker analysis and other existing methods. We demonstrated the utility of our proposed method through analysis of GWAS meta-analysis results for fasting glucose and lipids from the international MAGIC consortium and Global Lipids Consortium, respectively. The proposed method identified some novel trait-associated genes which can improve our understanding of the mechanisms involved in β-cell function, glucose homeostasis, and lipid traits.
Affiliation(s)
- Jianjun Zhang
- Department of Mathematics, University of North Texas, Denton, Texas
- Zihan Zhao
- Texas Academy of Mathematics & Science, University of North Texas, Denton, Texas
- Xuan Guo
- Department of Computer Science and Engineering, University of North Texas, Denton, Texas
- Bin Guo
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
- Baolin Wu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
|
26
|
Shimada M, Miyagawa T, Toyoda H, Tokunaga K, Honda M. Epigenome-wide association study of DNA methylation in narcolepsy: an integrated genetic and epigenetic approach. Sleep 2019; 41:4841708. [PMID: 29425374 DOI: 10.1093/sleep/zsy019] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 11/14/2022] Open
Abstract
Narcolepsy with cataplexy, which is a hypersomnia characterized by excessive daytime sleepiness and cataplexy, is a multifactorial disease caused by both genetic and environmental factors. Several genetic factors including HLA-DQB1*06:02 have been identified; however, the disease etiology is still unclear. Epigenetic modifications, such as DNA methylation, have been suggested to play an important role in the pathogenesis of complex diseases. Here, we examined DNA methylation profiles of blood samples from narcolepsy and healthy control individuals and performed an epigenome-wide association study (EWAS) to investigate methylation loci associated with narcolepsy. Moreover, data from the EWAS and a previously performed narcolepsy genome-wide association study were integrated to search for methylation loci with causal links to the disease. We found that (1) genes annotated to the top-ranked differentially methylated positions (DMPs) in narcolepsy were associated with pathways of hormone secretion and monocarboxylic acid metabolism. (2) Top-ranked narcolepsy-associated DMPs were significantly more abundant in non-CpG island regions and more than 95 per cent of such sites were hypomethylated in narcolepsy patients. (3) The integrative analysis identified the CCR3 region where both a single methylation site and multiple single-nucleotide polymorphisms were found to be associated with the disease as a candidate region responsible for narcolepsy. The findings of this study suggest the importance of future replication studies, using methylation technologies with wider genome coverage and/or larger number of samples, to confirm and expand on these results.
Affiliation(s)
- Mihoko Shimada
- Sleep Disorders Project, Department of Psychiatry and Behavioral Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan; Department of Human Genetics, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Taku Miyagawa
- Sleep Disorders Project, Department of Psychiatry and Behavioral Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan; Department of Human Genetics, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Hiromi Toyoda
- Sleep Disorders Project, Department of Psychiatry and Behavioral Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan
- Katsushi Tokunaga
- Department of Human Genetics, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Makoto Honda
- Sleep Disorders Project, Department of Psychiatry and Behavioral Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan; Seiwa Hospital, Neuropsychiatric Research Institute, Tokyo, Japan
|
27
|
Banerjee K, Zhao N, Srinivasan A, Xue L, Hicks SD, Middleton FA, Wu R, Zhan X. An Adaptive Multivariate Two-Sample Test With Application to Microbiome Differential Abundance Analysis. Front Genet 2019; 10:350. [PMID: 31068967 PMCID: PMC6491633 DOI: 10.3389/fgene.2019.00350] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 01/08/2019] [Accepted: 04/01/2019] [Indexed: 01/21/2023] Open
Abstract
Differential abundance analysis is a crucial task in many microbiome studies, where the central goal is to identify microbiome taxa associated with certain biological or clinical conditions. There are two different modes of microbiome differential abundance analysis: the individual-based univariate differential abundance analysis and the group-based multivariate differential abundance analysis. The univariate analysis identifies differentially abundant microbiome taxa subject to multiple correction under certain statistical error measurements such as false discovery rate, which is typically complicated by the high-dimensionality of taxa and complex correlation structure among taxa. The multivariate analysis evaluates the overall shift in the abundance of microbiome composition between two conditions, which provides useful preliminary differential information for the necessity of follow-up validation studies. In this paper, we present a novel Adaptive multivariate two-sample test for Microbiome Differential Analysis (AMDA) to examine whether the composition of a taxa-set are different between two conditions. Our simulation studies and real data applications demonstrated that the AMDA test was often more powerful than several competing methods while preserving the correct type I error rate. A free implementation of our AMDA method in R software is available at https://github.com/xyz5074/AMDA.
Affiliation(s)
- Kalins Banerjee
- Department of Public Health Sciences, Pennsylvania State University, Hershey, PA, United States
- Ni Zhao
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, United States
- Arun Srinivasan
- Department of Statistics, Pennsylvania State University, University Park, PA, United States
- Lingzhou Xue
- Department of Statistics, Pennsylvania State University, University Park, PA, United States
- Steven D. Hicks
- Department of Pediatrics, Pennsylvania State University, Hershey, PA, United States
- Frank A. Middleton
- Department of Neuroscience, State University of New York Upstate Medical University, Syracuse, NY, United States
- Rongling Wu
- Department of Public Health Sciences, Pennsylvania State University, Hershey, PA, United States
- Xiang Zhan
- Department of Public Health Sciences, Pennsylvania State University, Hershey, PA, United States. Correspondence: Xiang Zhan
|
28
|
Jackknife Model Averaging Prediction Methods for Complex Phenotypes with Gene Expression Levels by Integrating External Pathway Information. Computational and Mathematical Methods in Medicine 2019; 2019:2807470. [PMID: 31089389 PMCID: PMC6476151 DOI: 10.1155/2019/2807470] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 01/13/2019] [Accepted: 03/20/2019] [Indexed: 01/03/2023]
Abstract
Motivation: In the past few years many prediction approaches have been proposed and widely employed in high dimensional genetic data for disease risk evaluation. However, those approaches typically ignore in model fitting the important group structures that naturally exist in genetic data.
Methods: In the present study, we applied a novel model-averaging approach, called jackknife model averaging prediction (JMAP), for high dimensional genetic risk prediction while incorporating pathway information into the model specification. JMAP selects the optimal weights across candidate models by minimizing a cross validation criterion in a jackknife way. Compared with previous approaches, one of the primary features of JMAP is to allow model weights to vary from 0 to 1 but without the limitation that the summation of weights is equal to one. We evaluated the performance of JMAP using extensive simulation studies and compared it with existing methods. We finally applied JMAP to four real cancer datasets that are publicly available from TCGA.
Results: The simulations showed that compared with other existing approaches (e.g., gsslasso), JMAP performed best or is among the best methods across a range of scenarios. For example, among 14 out of 16 simulation settings with PVE = 0.3, JMAP has an average of 0.075 higher prediction accuracy compared with gsslasso. We further found that in the simulation, the model weights for the true candidate models have much smaller chances to be zero compared with those for the null candidate models and are substantially greater in magnitude. In the real data application, JMAP also behaves comparably or better compared with the other methods for continuous phenotypes. For example, for the COAD, CRC, and PAAD datasets, the average gains of predictive accuracy of JMAP are 0.019, 0.064, and 0.052 compared with gsslasso.
Conclusion: The proposed method JMAP is a novel model-averaging approach for high dimensional genetic risk prediction while incorporating external useful group structures into the model specification.
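The weight-selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `loo_predictions` and `jackknife_weights` are hypothetical, each candidate model is assumed to be a one-predictor least-squares fit (the paper builds candidates from pathways), and the jackknife criterion is assumed to be the squared error between the outcomes and the weighted leave-one-out predictions. Weights are restricted to [0, 1] but are not forced to sum to one, and the box-constrained minimization is done by plain coordinate descent.

```python
def fit_simple(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx if sxx else 0.0
    return my - slope * mx, slope


def loo_predictions(x, y):
    """Jackknife (leave-one-out) predictions: refit without sample i, predict i."""
    preds = []
    for i in range(len(x)):
        a, b = fit_simple(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(a + b * x[i])
    return preds


def jackknife_weights(y, loo_preds, n_sweeps=200):
    """Minimize sum_i (y_i - sum_m w_m * P_mi)^2 subject to 0 <= w_m <= 1.

    loo_preds holds one list of leave-one-out predictions per candidate model.
    No sum-to-one constraint is imposed (assumption taken from the abstract).
    """
    w = [0.0] * len(loo_preds)
    resid = list(y)  # residual of the weighted prediction, starting from w = 0
    for _ in range(n_sweeps):
        for j, p in enumerate(loo_preds):
            norm = sum(v * v for v in p)
            if norm == 0.0:
                continue
            # Exact one-dimensional minimizer for w_j, then clip to [0, 1]
            step = sum(r * v for r, v in zip(resid, p)) / norm
            new_w = min(1.0, max(0.0, w[j] + step))
            delta = new_w - w[j]
            if delta != 0.0:
                resid = [r - delta * v for r, v in zip(resid, p)]
                w[j] = new_w
    return w


if __name__ == "__main__":
    # Toy data: y depends on x1 only; x2 is an unrelated predictor.
    x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    x2 = [2.0, -1.0, 0.0, 3.0, -2.0, 1.0]
    y = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
    print(jackknife_weights(y, [loo_predictions(x1, y), loo_predictions(x2, y)]))
```

In this toy setting the informative candidate should receive a weight close to one and the uninformative candidate a weight close to zero, which mirrors the abstract's observation that weights for null candidate models tend to vanish.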
|
29
|
Ma Y, Wei P. FunSPU: A versatile and adaptive multiple functional annotation-based association test of whole-genome sequencing data. PLoS Genet 2019; 15:e1008081. [PMID: 31034468 PMCID: PMC6508749 DOI: 10.1371/journal.pgen.1008081] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 07/06/2018] [Revised: 05/09/2019] [Accepted: 03/11/2019] [Indexed: 11/19/2022] Open
Abstract
Despite ongoing large-scale population-based whole-genome sequencing (WGS) projects such as the NIH NHLBI TOPMed program and the NHGRI Genome Sequencing Program, WGS-based association analysis of complex traits remains a tremendous challenge due to the large number of rare variants, many of which are non-trait-associated neutral variants. External biological knowledge, such as functional annotations based on the ENCODE, Epigenomics Roadmap and GTEx projects, may be helpful in distinguishing causal rare variants from neutral ones; however, each functional annotation can only provide certain aspects of the biological functions. Our knowledge for selecting informative annotations a priori is limited, and incorporating non-informative annotations will introduce noise and lose power. We propose FunSPU, a versatile and adaptive test that incorporates multiple biological annotations and is adaptive at both the annotation and variant levels and thus maintains high power even in the presence of noninformative annotations. In addition to extensive simulations, we illustrate our proposed test using the TWINSUK cohort (n = 1,752) of UK10K WGS data based on six functional annotations: CADD, RegulomeDB, FunSeq, Funseq2, GERP++, and GenoSkyline. We identified genome-wide significant genetic loci on chromosome 19 near gene TOMM40 and APOC4-APOC2 associated with low-density lipoprotein (LDL), which are replicated in the UK10K ALSPAC cohort (n = 1,497). These replicated LDL-associated loci were missed by existing rare variant association tests that either ignore external biological information or rely on a single source of biological knowledge. We have implemented the proposed test in an R package "FunSPU".
Affiliation(s)
- Yiding Ma
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center, Houston, Texas, United States of America
- Peng Wei
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
|
30
|
Larson NB, Chen J, Schaid DJ. A review of kernel methods for genetic association studies. Genet Epidemiol 2019; 43:122-136. [PMID: 30604442 DOI: 10.1002/gepi.22180] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 08/27/2018] [Revised: 11/09/2018] [Accepted: 11/26/2018] [Indexed: 12/17/2022]
Abstract
Evaluating the association of multiple genetic variants with a trait of interest by use of kernel-based methods has made a significant impact on how genetic association analyses are conducted. An advantage of kernel methods is that they tend to be robust when the genetic variants have effects that are a mixture of positive and negative effects, as well as when there is a small fraction of causal variants. Another advantage is that kernel methods fit within the framework of mixed models, providing flexible ways to adjust for additional covariates that influence traits. Herein, we review the basic ideas behind the use of kernel methods for genetic association analysis as well as recent methodological advancements for different types of traits, multivariate traits, pedigree data, and longitudinal data. Finally, we discuss opportunities for future research.
Affiliation(s)
- Nicholas B Larson
- Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
- Jun Chen
- Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
- Daniel J Schaid
- Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
|
31
|
Lee S, Park T. Integration of a Large-Scale Genetic Analysis Workbench Increases the Accessibility of a High-Performance Pathway-Based Analysis Method. Genomics Inform 2018; 16:e39. [PMID: 30602100 PMCID: PMC6440666 DOI: 10.5808/gi.2018.16.4.e39] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2018] [Accepted: 12/14/2018] [Indexed: 11/20/2022] Open
Abstract
The rapid increase in genetic dataset volume has demanded extensive adoption of biological knowledge to reduce the computational complexity, and the biological pathway is one well-known source of such knowledge. In this regard, we have introduced a novel statistical method that enables the pathway-based association study of large-scale genetic dataset—namely, PHARAOH. However, researcher-level application of the PHARAOH method has been limited by a lack of generally used file formats and the absence of various quality control options that are essential to practical analysis. In order to overcome these limitations, we introduce our integration of the PHARAOH method into our recently developed all-in-one workbench. The proposed new PHARAOH program not only supports various de facto standard genetic data formats but also provides many quality control measures and filters based on those measures. We expect that our updated PHARAOH provides advanced accessibility of the pathway-level analysis of large-scale genetic datasets to researchers.
Affiliation(s)
- Sungyoung Lee
- Center for Precision Medicine, Seoul National University Hospital, Seoul 03080, Korea
- Taesung Park
- Department of Statistics, Seoul National University, Seoul 08826, Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, Korea
|
32
|
Hui X, Hu Y, Sun MA, Shu X, Han R, Ge Q, Wang Y. EBT: a statistic test identifying moderate size of significant features with balanced power and precision for genome-wide rate comparisons. Bioinformatics 2018; 33:2631-2641. [PMID: 28472273 DOI: 10.1093/bioinformatics/btx294] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 07/09/2016] [Accepted: 05/02/2017] [Indexed: 11/14/2022] Open
Abstract
Motivation: In genome-wide rate comparison studies, there is a big challenge for effective identification of an appropriate number of significant features objectively, since traditional statistical comparisons without multi-testing correction can generate a large number of false positives while multi-testing correction tremendously decreases the statistic power.
Results: In this study, we proposed a new exact test based on the translation of rate comparison to two binomial distributions. With modeling and real datasets, the exact binomial test (EBT) showed an advantage in balancing the statistical precision and power, by providing an appropriate size of significant features for further studies. Both correlation analysis and bootstrapping tests demonstrated that EBT is as robust as the typical rate-comparison methods, e.g. χ² test, Fisher's exact test and Binomial test. Performance comparison among machine learning models with features identified by different statistical tests further demonstrated the advantage of EBT. The new test was also applied to analyze the genome-wide somatic gene mutation rate difference between lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), two main lung cancer subtypes and a list of new markers were identified that could be lineage-specifically associated with carcinogenesis of LUAD and LUSC, respectively. Interestingly, three cilia genes were found selectively with high mutation rates in LUSC, possibly implying the importance of cilia dysfunction in the carcinogenesis.
Availability and implementation: An R package implementing EBT could be downloaded from the website freely: http://www.szu-bioinf.org/EBT.
Contact: wangyj@szu.edu.cn.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xinjie Hui
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Shenzhen University Health Science Center, Shenzhen 518060, China
- Yueming Hu
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Shenzhen University Health Science Center, Shenzhen 518060, China
- Ming-An Sun
- Epigenomics and Computational Biology Lab, Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24060, USA
- Xingsheng Shu
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Shenzhen University Health Science Center, Shenzhen 518060, China
- Rongfei Han
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Shenzhen University Health Science Center, Shenzhen 518060, China
- Qinggang Ge
- Department of Critical Care Unit, Peking University Third Hospital, Beijing 100191, China
- Yejun Wang
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Shenzhen University Health Science Center, Shenzhen 518060, China
|
33
|
Wu C, Pan W. Integrating eQTL data with GWAS summary statistics in pathway-based analysis with application to schizophrenia. Genet Epidemiol 2018; 42:303-316. [PMID: 29411426 PMCID: PMC5851843 DOI: 10.1002/gepi.22110] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 09/19/2017] [Revised: 01/04/2018] [Accepted: 01/04/2018] [Indexed: 12/11/2022]
Abstract
Many genetic variants affect complex traits through gene expression, which can be exploited to boost statistical power and enhance interpretation in genome-wide association studies (GWASs) as demonstrated by the transcriptome-wide association study (TWAS) approach. Furthermore, due to polygenic inheritance, a complex trait is often affected by multiple genes with similar functions as annotated in gene pathways. Here, we extend TWAS from gene-based analysis to pathway-based analysis: we integrate public pathway collections, expression quantitative trait locus (eQTL) data and GWAS summary association statistics (or GWAS individual-level data) to identify gene pathways associated with complex traits. The basic idea is to weight the SNPs of the genes in a pathway based on their estimated cis-effects on gene expression, then adaptively test for association of the pathway with a GWAS trait by effectively aggregating possibly weak association signals across the genes in the pathway. The P values can be calculated analytically and thus fast. We applied our proposed test with the KEGG and GO pathways to two schizophrenia (SCZ) GWAS summary association data sets, denoted by SCZ1 and SCZ2 with about 20,000 and 150,000 subjects, respectively. Most of the significant pathways identified by analyzing the SCZ1 data were reproduced by the SCZ2 data. Importantly, we identified 15 novel pathways associated with SCZ, such as GABA receptor complex (GO:1902710), which could not be uncovered by the standard single SNP-based analysis or gene-based TWAS. The newly identified pathways may help us gain insights into the biological mechanism underlying SCZ. Our results showcase the power of incorporating gene expression information and gene functional annotations into pathway-based association testing for GWAS.
Affiliation(s)
- Chong Wu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, United States of America
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, United States of America
|
34
|
An epigenome-wide methylation study of healthy individuals with or without depressive symptoms. J Hum Genet 2018; 63:319-326. [PMID: 29305581 DOI: 10.1038/s10038-017-0382-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 08/24/2017] [Revised: 09/29/2017] [Accepted: 10/11/2017] [Indexed: 12/23/2022]
Abstract
Major depressive disorder is a common psychiatric disorder that is thought to be triggered by both genetic and environmental factors. Depressive symptoms are an important public health problem and contribute to vulnerability to major depression. Although a substantial number of genetic and epigenetic studies have been performed to date, the detailed etiology of depression remains unclear and there are no validated biomarkers. DNA methylation is one of the major epigenetic modifications that play diverse roles in the etiology of complex diseases. In this study, we performed an epigenome-wide association study (EWAS) of DNA methylation on subjects with (N = 20) or without (N = 27) depressive symptoms in order to examine whether different levels of DNA methylation were associated with depressive tendencies. Employing methylation-array technology, a total of 363,887 methylation sites across the genomes were investigated and several candidate CpG sites associated with depressive symptoms were identified, especially annotated to genes linked to a G-protein coupled receptor protein signaling pathway. These data provide a strong impetus for validation studies using a larger cohort and support the possibility that G-protein coupled receptor protein signaling pathways are involved in the pathogenesis of depression.
|
35
|
Park JY, Wu C, Basu S, McGue M, Pan W. Adaptive SNP-Set Association Testing in Generalized Linear Mixed Models with Application to Family Studies. Behav Genet 2018; 48:55-66. [PMID: 29150721 PMCID: PMC5754233 DOI: 10.1007/s10519-017-9883-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 04/12/2017] [Accepted: 11/07/2017] [Indexed: 10/18/2022]
Abstract
In genome-wide association studies (GWAS), it has been increasingly recognized that, as a complementary approach to standard single SNP analyses, it may be beneficial to analyze a group of functionally related SNPs together. Among the existent population-based SNP-set association tests, two adaptive tests, the aSPU test and the aSPUpath test, offer a powerful and general approach at the gene- and pathway-levels by data-adaptively combining the results across multiple SNPs (and genes) such that high statistical power can be maintained across a wide range of scenarios. We extend the aSPU and the aSPUpath test to familial data under the framework of the generalized linear mixed models (GLMMs), which can take account of both subject relatedness and possible population structure. As in population-based GWAS, the proposed aSPU and aSPUpath tests require only fitting a single and common GLMM (under the null hypothesis) for all the SNPs, thus are computationally efficient and feasible for large GWAS data. We illustrate our approaches in identifying genes and pathways associated with alcohol dependence in the Minnesota Twin Family Study. The aSPU test detected a gene associated with the trait, in contrast to none by the standard single SNP analysis. Our aSPU test also controlled Type I errors satisfactorily in a small simulation study. We provide R code to conduct the aSPU and aSPUpath tests for familial and other correlated data.
Affiliation(s)
- Jun Young Park
- Division of Biostatistics, University of Minnesota, A460 Mayo Building, MMC 303, 420 Delaware St. SE, Minneapolis, MN, 55455, USA
- Chong Wu
- Division of Biostatistics, University of Minnesota, A460 Mayo Building, MMC 303, 420 Delaware St. SE, Minneapolis, MN, 55455, USA
- Saonli Basu
- Division of Biostatistics, University of Minnesota, A460 Mayo Building, MMC 303, 420 Delaware St. SE, Minneapolis, MN, 55455, USA
- Matt McGue
- Department of Psychology, University of Minnesota, Minneapolis, MN, USA
- Wei Pan
- Division of Biostatistics, University of Minnesota, A460 Mayo Building, MMC 303, 420 Delaware St. SE, Minneapolis, MN, 55455, USA
|
36
|
Xu Z, Wu C, Pan W. Imaging-wide association study: Integrating imaging endophenotypes in GWAS. Neuroimage 2017; 159:159-169. [PMID: 28736311 PMCID: PMC5671364 DOI: 10.1016/j.neuroimage.2017.07.036] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.0] [Received: 05/02/2017] [Revised: 06/22/2017] [Accepted: 07/18/2017] [Indexed: 10/19/2022] Open
Abstract
A new and powerful approach, called imaging-wide association study (IWAS), is proposed to integrate imaging endophenotypes with GWAS to boost statistical power and enhance biological interpretation for GWAS discoveries. IWAS extends the promising transcriptome-wide association study (TWAS) from using gene expression endophenotypes to using imaging and other endophenotypes with a much wider range of possible applications. As illustration, we use gray-matter volumes of several brain regions of interest (ROIs) drawn from the ADNI-1 structural MRI data as imaging endophenotypes, which are then applied to the individual-level GWAS data of ADNI-GO/2 and a large meta-analyzed GWAS summary statistics dataset (based on about 74,000 individuals), uncovering some novel genes significantly associated with Alzheimer's disease (AD). We also compare the performance of IWAS with TWAS, showing much larger numbers of significant AD-associated genes discovered by IWAS, presumably due to the stronger link between brain atrophy and AD than that between gene expression of normal individuals and the risk for AD. The proposed IWAS is general and can be applied to other imaging endophenotypes, and GWAS individual-level or summary association data.
Affiliation(s)
- Zhiyuan Xu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
- Chong Wu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
|
37
|
Powerful Genetic Association Analysis for Common or Rare Variants with High-Dimensional Structured Traits. Genetics 2017. [PMID: 28642271 DOI: 10.1534/genetics.116.199646] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Indexed: 01/03/2023] Open
Abstract
Many genetic association studies collect a wide range of complex traits. As these traits may be correlated and share a common genetic mechanism, joint analysis can be statistically more powerful and biologically more meaningful. However, most existing tests for multiple traits cannot be used for high-dimensional and possibly structured traits, such as network-structured transcriptomic pathway expressions. To overcome potential limitations, in this article we propose the dual kernel-based association test (DKAT) for testing the association between multiple traits and multiple genetic variants, both common and rare. In DKAT, two individual kernels are used to describe the phenotypic and genotypic similarity, respectively, between pairwise subjects. Using kernels allows for capturing structure while accommodating dimensionality. Then, the association between traits and genetic variants is summarized by a coefficient which measures the association between two kernel matrices. Finally, DKAT evaluates the hypothesis of nonassociation with an analytical P-value calculation without any computationally expensive resampling procedures. By collapsing information in both traits and genetic variants using kernels, the proposed DKAT is shown to have a correct type-I error rate and higher power than other existing methods in both simulation studies and application to a study of genetic regulation of pathway gene expressions.
|
38
|
Larson NB, McDonnell S, Cannon Albright L, Teerlink C, Stanford J, Ostrander EA, Isaacs WB, Xu J, Cooney KA, Lange E, Schleutker J, Carpten JD, Powell I, Bailey-Wilson JE, Cussenot O, Cancel-Tassin G, Giles GG, MacInnis RJ, Maier C, Whittemore AS, Hsieh CL, Wiklund F, Catalona WJ, Foulkes W, Mandal D, Eeles R, Kote-Jarai Z, Ackerman MJ, Olson TM, Klein CJ, Thibodeau SN, Schaid DJ. gsSKAT: Rapid gene set analysis and multiple testing correction for rare-variant association studies using weighted linear kernels. Genet Epidemiol 2017; 41:297-308. [PMID: 28211093 DOI: 10.1002/gepi.22036] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 09/16/2016] [Revised: 11/16/2016] [Accepted: 12/09/2016] [Indexed: 01/28/2023]
Abstract
Next-generation sequencing technologies have afforded unprecedented characterization of low-frequency and rare genetic variation. Due to low power for single-variant testing, aggregative methods are commonly used to combine observed rare variation within a single gene. Causal variation may also aggregate across multiple genes within relevant biomolecular pathways. Kernel-machine regression and adaptive testing methods for aggregative rare-variant association testing have been demonstrated to be powerful approaches for pathway-level analysis, although these methods tend to be computationally intensive at high-variant dimensionality and require access to complete data. An additional analytical issue in scans of large pathway definition sets is multiple testing correction. Gene set definitions may exhibit substantial genic overlap, and the impact of the resultant correlation in test statistics on Type I error rate control for large agnostic gene set scans has not been fully explored. Herein, we first outline a statistical strategy for aggregative rare-variant analysis using component gene-level linear kernel score test summary statistics as well as derive simple estimators of the effective number of tests for family-wise error rate control. We then conduct extensive simulation studies to characterize the behavior of our approach relative to direct application of kernel and adaptive methods under a variety of conditions. We also apply our method to two case-control studies, respectively, evaluating rare variation in hereditary prostate cancer and schizophrenia. Finally, we provide open-source R code for public use to facilitate easy application of our methods to existing rare-variant analysis results.
Affiliation(s)
- Nicholas B Larson
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Shannon McDonnell
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Lisa Cannon Albright
- Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, Utah, United States of America
- Craig Teerlink
- Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, Utah, United States of America
- Janet Stanford
- Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Elaine A Ostrander
- National Human Genome Research Institute, Bethesda, Maryland, United States of America
- William B Isaacs
- Brady Urological Institute, School of Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America
- Jianfeng Xu
- NorthShore University HealthSystem Research Institute, Chicago, Illinois, United States of America
- Kathleen A Cooney
- Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, Utah, United States of America
- Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, Michigan, United States of America
- Department of Urology, University of Michigan Medical School, Ann Arbor, Michigan, United States of America
- Ethan Lange
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Johanna Schleutker
- Department of Medical Biochemistry and Genetics, Institute of Biomedicine, University of Turku, Turku, Finland
- John D Carpten
- Department of Translational Genomics, University of Southern California, Los Angeles, California, United States of America
- Isaac Powell
- Department of Urology, Wayne State University, Detroit, Michigan, United States of America
- Joan E Bailey-Wilson
- Statistical Genetics Section, National Human Genome Research Institute, Bethesda, Maryland, United States of America
- Graham G Giles
- Cancer Epidemiology Centre, Cancer Council Victoria, Melbourne, Australia
- Centre for Epidemiology and Biostatistics, School of Population and Global Health, University of Melbourne, Melbourne, Australia
- Robert J MacInnis
- Cancer Epidemiology Centre, Cancer Council Victoria, Melbourne, Australia
- Centre for Epidemiology and Biostatistics, School of Population and Global Health, University of Melbourne, Melbourne, Australia
- Alice S Whittemore
- Department of Health Research and Policy, Stanford University, Stanford, California, United States of America
- Chih-Lin Hsieh
- Department of Urology, University of Southern California, Los Angeles, California, United States of America
- Fredrik Wiklund
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- William J Catalona
- Department of Urology, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, United States of America
- William Foulkes
- Department of Oncology, Montreal General Hospital, Montreal, Quebec, Canada
- Department of Human Genetics, Montreal General Hospital, Montreal, Quebec, Canada
- Diptasri Mandal
- Department of Genetics, LSU Health Sciences Center, New Orleans, Louisiana, United States of America
- Zsofia Kote-Jarai
- The Institute of Cancer Research, London, UK
- The Institute of Cancer Research and Royal Marsden NHS Foundation Trust, London
- Michael J Ackerman
- Department of Pediatric and Adolescent Medicine, Mayo Clinic, Rochester, Minnesota, United States of America
- Timothy M Olson
- Department of Pediatric and Adolescent Medicine, Mayo Clinic, Rochester, Minnesota, United States of America
- Christopher J Klein
- Department of Neurology, Mayo Clinic, Rochester, Minnesota, United States of America
- Stephen N Thibodeau
- Department of Laboratory Medicine/Pathology, Mayo Clinic, Rochester, Minnesota, United States of America
- Daniel J Schaid
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
|
39
|
Bao F, Deng Y, Zhao Y, Suo J, Dai Q. Bosco: Boosting Corrections for Genome-Wide Association Studies With Imbalanced Samples. IEEE Trans Nanobioscience 2017; 16:69-77. [PMID: 28141527 DOI: 10.1109/tnb.2017.2660498] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 11/05/2022]
Abstract
In genome-wide association studies (GWAS), the acquired sequential data may be imbalanced, with abundant control samples but few case samples. This imbalance is particularly serious when investigating rare diseases, or common diseases in rare populations. Conventional GWAS methods may be severely biased toward the majority group, which costs power to uncover truly associated loci. We introduce a boosting correction method, termed Bosco, to deal with this imbalance. Bosco is motivated by boosting theory in machine learning and is implemented in a coarse-to-fine learning framework: the coarse step assigns importance scores to all samples in the majority group, and the fine step calculates P-values by a weighted logistic regression. On simulated data sets, we demonstrate that the proposed method can dramatically improve discovery power even on extremely imbalanced data sets, while keeping false positives well controlled. Bosco is also applied to a genome-scale gastric cancer data set to conduct a genome-wide analysis. Our method replicates previously reported findings (from the likelihood ratio test) with high statistical significance and shows the ability to identify new candidate SNPs.
|
40
|
Epigenome-wide association study of DNA methylation in panic disorder. Clin Epigenetics 2017; 9:6. [PMID: 28149334 PMCID: PMC5270210 DOI: 10.1186/s13148-016-0307-1] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.1] [Received: 08/27/2016] [Accepted: 12/26/2016] [Indexed: 12/22/2022] Open
Abstract
Background: Panic disorder (PD) is considered to be a multifactorial disorder emerging from interactions among multiple genetic and environmental factors. To date, although genetic studies reported several susceptibility genes with PD, few of them were replicated and the pathogenesis of PD remains to be clarified. Epigenetics is considered to play an important role in etiology of complex traits and diseases, and DNA methylation is one of the major forms of epigenetic modifications. In this study, we performed an epigenome-wide association study of PD using DNA methylation arrays so as to investigate the possibility that different levels of DNA methylation might be associated with PD. Methods: The DNA methylation levels of CpG sites across the genome were examined with genomic DNA samples (PD, N = 48, control, N = 48) extracted from peripheral blood. Methylation arrays were used for the analysis. β values, which represent the levels of DNA methylation, were normalized via an appropriate pipeline. Then, β values were converted to M values via the logit transformation for epigenome-wide association study. The relationship between each DNA methylation site and PD was assessed by linear regression analysis with adjustments for the effects of leukocyte subsets. Results: Forty CpG sites showed significant association with PD at 5% FDR correction, though the differences of the DNA methylation levels were relatively small. Most of the significant CpG sites (37/40 CpG sites) were located in or around CpG islands. Many of the significant CpG sites (27/40 CpG sites) were located upstream of genes, and all such CpG sites with the exception of two were hypomethylated in PD subjects. A pathway analysis on the genes annotated to the significant CpG sites identified several pathways, including “positive regulation of lymphocyte activation.” Conclusions: Although future studies with larger number of samples are necessary to confirm the small DNA methylation abnormalities associated with PD, there is a possibility that several CpG sites might be associated, together as a group, with PD.
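The conversion from β values to M values mentioned in the Methods can be sketched as follows. This is a minimal illustration rather than the study's pipeline: it assumes the commonly used base-2 logit, M = log2(β / (1 − β)), and the function names `beta_to_m` and `m_to_beta` as well as the clamping constant `eps` are hypothetical choices made here.

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Map a methylation beta value in [0, 1] to an M value (base-2 logit)."""
    # Clamp away from 0 and 1 so the logarithm stays finite.
    # eps is an assumed guard value, not taken from the study.
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))

def m_to_beta(m):
    """Inverse map: recover the beta value from an M value."""
    return 2.0 ** m / (2.0 ** m + 1.0)

# Example: a half-methylated site maps to M = 0.
betas = [0.1, 0.5, 0.8]
m_values = [beta_to_m(b) for b in betas]
```

M values are unbounded and tend to have more homogeneous variance than β values, which is the usual motivation for running the per-site linear regression on the M scale.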
|
41
|
Kwak IY, Pan W. Gene- and pathway-based association tests for multiple traits with GWAS summary statistics. Bioinformatics 2017; 33:64-71. [PMID: 27592708 PMCID: PMC5198520 DOI: 10.1093/bioinformatics/btw577] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 04/29/2016] [Revised: 08/08/2016] [Accepted: 08/29/2016] [Indexed: 11/15/2022] Open
Abstract
To identify novel genetic variants associated with complex traits and to shed new insights on underlying biology, in addition to the most popular single SNP-single trait association analysis, it would be useful to explore multiple correlated (intermediate) traits at the gene- or pathway-level by mining existing single GWAS or meta-analyzed GWAS data. For this purpose, we present an adaptive gene-based test and a pathway-based test for association analysis of multiple traits with GWAS summary statistics. The proposed tests are adaptive at both the SNP- and trait-levels; that is, they account for possibly varying association patterns (e.g. signal sparsity levels) across SNPs and traits, thus maintaining high power across a wide range of situations. Furthermore, the proposed methods are general: they can be applied to mixed types of traits, and to Z-statistics or P-values as summary statistics obtained from either a single GWAS or a meta-analysis of multiple GWAS. Our numerical studies with simulated and real data demonstrated the promising performance of the proposed methods. AVAILABILITY AND IMPLEMENTATION The methods are implemented in R package aSPU, freely and publicly available at: https://cran.r-project.org/web/packages/aSPU/ CONTACT: weip@biostat.umn.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Il-Youp Kwak
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
|
42
|
Abstract
With the advance of sequencing technologies, it has become a routine practice to test for association between a quantitative trait and a set of rare variants (RVs). While a number of RV association tests have been proposed, there is a dearth of studies on the robustness of RV association testing for nonnormal distributed traits, e.g., due to skewness, which is ubiquitous in cohort studies. By extensive simulations, we demonstrate that commonly used RV tests, including sequence kernel association test (SKAT) and optimal unified SKAT (SKAT-O), are not robust to heavy-tailed or right-skewed trait distributions with inflated type I error rates; in contrast, the adaptive sum of powered score (aSPU) test is much more robust. Here we further propose a robust version of the aSPU test, called aSPUr. We conduct extensive simulations to evaluate the power of the tests, finding that for a larger number of RVs, aSPU is often more powerful than SKAT and SKAT-O, owing to its high data-adaptivity. We also compare different tests by conducting association analysis of triglyceride levels using the NHLBI ESP whole-exome sequencing data. The QQ plots for SKAT and SKAT-O were severely inflated (λ = 1.89 and 1.78, respectively), while those for aSPU and aSPUr behaved normally. Due to its relatively high robustness to outliers and high power of the aSPU test, we recommend its use complementary to SKAT and SKAT-O. If there is evidence of inflated type I error rate from the aSPU test, we would recommend the use of the more robust, but less powerful, aSPUr test.
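The λ values quoted for the QQ plots are presumably genomic inflation factors. Assuming that usual definition (the median of the observed 1-df chi-square statistics divided by the median of the null chi-square distribution, about 0.455), a minimal sketch of the calculation from a list of p-values looks like this. The function name `genomic_inflation` is hypothetical, and p-values are assumed to lie strictly between 0 and 1 or equal 1.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.4549):
# P(Z^2 <= m) = 0.5 is equivalent to Phi(sqrt(m)) = 0.75.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2

def genomic_inflation(p_values):
    """Estimate lambda as median observed chi-square(1) over its null median."""
    # Convert each two-sided p-value back to a 1-df chi-square statistic.
    chi2 = [_STD_NORMAL.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN

# Evenly spread p-values mimic a well-calibrated test, so lambda is close to 1.
calibrated = [(i + 0.5) / 1000 for i in range(1000)]
lam = genomic_inflation(calibrated)
```

Values well above 1, such as those reported for SKAT and SKAT-O here, indicate that the test statistics are larger than expected under the null across the bulk of the distribution.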
|
43
|
Yoo YJ, Sun L, Poirier JG, Paterson AD, Bull SB. Multiple linear combination (MLC) regression tests for common variants adapted to linkage disequilibrium structure. Genet Epidemiol 2016; 41:108-121. [PMID: 27885705 PMCID: PMC5245123 DOI: 10.1002/gepi.22024] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 01/31/2016] [Revised: 05/25/2016] [Accepted: 09/27/2016] [Indexed: 11/21/2022]
Abstract
By jointly analyzing multiple variants within a gene, instead of one at a time, gene‐based multiple regression can improve power, robustness, and interpretation in genetic association analysis. We investigate multiple linear combination (MLC) test statistics for analysis of common variants under realistic trait models with linkage disequilibrium (LD) based on HapMap Asian haplotypes. MLC is a directional test that exploits LD structure in a gene to construct clusters of closely correlated variants recoded such that the majority of pairwise correlations are positive. It combines variant effects within the same cluster linearly, and aggregates cluster‐specific effects in a quadratic sum of squares and cross‐products, producing a test statistic with reduced degrees of freedom (df) equal to the number of clusters. By simulation studies of 1000 genes from across the genome, we demonstrate that MLC is a well‐powered and robust choice among existing methods across a broad range of gene structures. Compared to minimum P‐value, variance‐component, and principal‐component methods, the mean power of MLC is never much lower than that of other methods, and can be higher, particularly with multiple causal variants. Moreover, the variation in gene‐specific MLC test size and power across 1000 genes is less than that of other methods, suggesting it is a complementary approach for discovery in genome‐wide analysis. The cluster construction of the MLC test statistics helps reveal within‐gene LD structure, allowing interpretation of clustered variants as haplotypic effects, while multiple regression helps to distinguish direct and indirect associations.
Affiliation(s)
- Yun Joo Yoo
- Department of Mathematics Education, Seoul National University, Seoul, South Korea
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
- Lei Sun
- Department of Statistical Sciences, University of Toronto, Toronto, Canada
- Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
- Julia G Poirier
- Prosserman Centre for Health Research, Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Canada
- Andrew D Paterson
- Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
- Program in Genetics and Genome Biology, Hospital for Sick Children Research Institute, Toronto, Canada
- Shelley B Bull
- Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
- Prosserman Centre for Health Research, Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Canada
|
44
|
Kao PYP, Leung KH, Chan LWC, Yip SP, Yap MKH. Pathway analysis of complex diseases for GWAS, extending to consider rare variants, multi-omics and interactions. Biochim Biophys Acta Gen Subj 2016; 1861:335-353. [PMID: 27888147 DOI: 10.1016/j.bbagen.2016.11.030] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Received: 06/22/2016] [Revised: 10/17/2016] [Accepted: 11/19/2016] [Indexed: 12/20/2022]
Abstract
BACKGROUND Genome-wide association studies (GWAS) are a major method for studying the genetics of complex diseases. Finding all sequence variants to explain fully the aetiology of a disease is difficult because of their small effect sizes. To better explain disease mechanisms, pathway analysis is used to consolidate the effects of multiple variants, and hence increase the power of the study. While pathway analysis has previously been performed within GWAS only, it can now be extended to examining rare variants, other "-omics" and interaction data. SCOPE OF REVIEW 1. Factors to consider in the choice of software for GWAS pathway analysis. 2. Examples of how pathway analysis is used to analyse rare variants, other "-omics" and interaction data. MAJOR CONCLUSIONS To choose appropriate software tools, factors for consideration include covariate compatibility, null hypothesis, whether a one- or two-step analysis is required, curation method of gene sets, size of pathways, and size of flanking regions to define gene boundaries. For rare variants, analysis performance depends on consistency between assumed and actual effect distribution of variants. Integration of other "-omics" data and interaction can better explain gene functions. GENERAL SIGNIFICANCE Pathway analysis methods will be more readily used for integration of multiple sources of data, and enable more accurate prediction of phenotypes.
Affiliation(s)
- Patrick Y P Kao
- Centre for Myopia Research, School of Optometry, The Hong Kong Polytechnic University, Hong Kong SAR, China
- Kim Hung Leung
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong SAR, China
- Lawrence W C Chan
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong SAR, China
- Shea Ping Yip
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong SAR, China
- Maurice K H Yap
- Centre for Myopia Research, School of Optometry, The Hong Kong Polytechnic University, Hong Kong SAR, China
|
45
|
Zhan X, Girirajan S, Zhao N, Wu MC, Ghosh D. A novel copy number variants kernel association test with application to autism spectrum disorders studies. Bioinformatics 2016; 32:3603-3610. [PMID: 27497442 DOI: 10.1093/bioinformatics/btw500] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 05/20/2016] [Revised: 06/28/2016] [Accepted: 07/22/2016] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Copy number variants (CNVs) have been implicated in a variety of neurodevelopmental disorders, including autism spectrum disorders, intellectual disability and schizophrenia. Recent advances in high-throughput genomic technologies have enabled rapid discovery of many genetic variants including CNVs. As a result, there is increasing interest in studying the role of CNVs in the etiology of many complex diseases. Despite the availability of an unprecedented wealth of CNV data, methods for testing association between CNVs and disease-related traits are still under-developed due to the low prevalence and complicated multi-scale features of CNVs. RESULTS We propose a novel CNV kernel association test (CKAT) in this paper. To address the low prevalence, CNVs are first grouped into CNV regions (CNVR). Then, taking into account the multi-scale features of CNVs, we first design a single-CNV kernel which summarizes the similarity between two CNVs, and next aggregate the single-CNV kernel to a CNVR kernel which summarizes the similarity between two CNVRs. Finally, association between CNVR and disease-related traits is assessed by comparing the kernel-based similarity with the similarity in the trait using a score test for variance components in a random effect model. We illustrate the proposed CKAT using simulations and show that CKAT is more powerful than existing methods, while always being able to control the type I error. We also apply CKAT to a real dataset examining the association between CNV and autism spectrum disorders, which demonstrates the potential usefulness of the proposed method. AVAILABILITY AND IMPLEMENTATION An R package to implement the proposed CKAT method is available at http://works.bepress.com/debashis_ghosh/ CONTACTS: xzhan@fhcrc.org or debashis.ghosh@ucdenver.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiang Zhan
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Santhosh Girirajan
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA 16802, USA
- Department of Anthropology, Pennsylvania State University, University Park, PA 16802, USA
- Ni Zhao
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Michael C Wu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
- Debashis Ghosh
- Department of Biostatistics and Informatics, University of Colorado, Aurora, CO 80045, USA
|
46
|
Zhang H, Wheeler W, Hyland PL, Yang Y, Shi J, Chatterjee N, Yu K. A Powerful Procedure for Pathway-Based Meta-analysis Using Summary Statistics Identifies 43 Pathways Associated with Type II Diabetes in European Populations. PLoS Genet 2016; 12:e1006122. [PMID: 27362418 PMCID: PMC4928884 DOI: 10.1371/journal.pgen.1006122] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 02/22/2016] [Accepted: 05/20/2016] [Indexed: 12/17/2022] Open
Abstract
Meta-analysis of multiple genome-wide association studies (GWAS) has become an effective approach for detecting single nucleotide polymorphism (SNP) associations with complex traits. However, it is difficult to integrate the readily accessible SNP-level summary statistics from a meta-analysis into more powerful multi-marker testing procedures, which generally require individual-level genetic data. We developed a general procedure called Summary based Adaptive Rank Truncated Product (sARTP) for conducting gene and pathway meta-analysis that uses only SNP-level summary statistics in combination with genotype correlation estimated from a panel of individual-level genetic data. We demonstrated the validity and power advantage of sARTP through empirical and simulated data. We conducted a comprehensive pathway-based meta-analysis with sARTP on type 2 diabetes (T2D) by integrating SNP-level summary statistics from two large studies consisting of 19,809 T2D cases and 111,181 controls with European ancestry. Among 4,713 candidate pathways from which genes in neighborhoods of 170 GWAS established T2D loci were excluded, we detected 43 T2D globally significant pathways (with Bonferroni corrected p-values < 0.05), which included the insulin signaling pathway and T2D pathway defined by KEGG, as well as the pathways defined according to specific gene expression patterns on pancreatic adenocarcinoma, hepatocellular carcinoma, and bladder carcinoma. Using summary data from 8 eastern Asian T2D GWAS with 6,952 cases and 11,865 controls, we showed 7 out of the 43 pathways identified in European populations remained to be significant in eastern Asians at the false discovery rate of 0.1. We created an R package and a web-based tool for sARTP with the capability to analyze pathways with thousands of genes and tens of thousands of SNPs.
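As a worked reading of the Bonferroni criterion quoted in this abstract, the relation below spells out the per-pathway threshold implied by 4,713 candidate pathways. The symbols are introduced here for illustration only: m is the number of pathways tested, p_k the unadjusted p-value of pathway k, and p̃_k its Bonferroni-corrected value.

```latex
\tilde{p}_k = \min\bigl(1,\; m\,p_k\bigr), \qquad m = 4713,
\qquad
\tilde{p}_k < 0.05 \iff p_k < \frac{0.05}{4713} \approx 1.06 \times 10^{-5}.
```

In other words, a pathway counts as globally significant under this criterion only if its unadjusted p-value falls below roughly one in 94,000.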
Affiliation(s)
- Han Zhang
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- William Wheeler
- Information Management Services Inc., Calverton, Maryland, United States of America
- Paula L. Hyland
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Yifan Yang
- Department of Statistics, University of Kentucky, Lexington, Kentucky, United States of America
- Jianxin Shi
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Nilanjan Chatterjee
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, United States of America
- Department of Oncology, School of Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America
- * E-mail: (NC); (KY)
- Kai Yu
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- * E-mail: (NC); (KY)
|
47
|
Kwak IY, Pan W. Adaptive gene- and pathway-trait association testing with GWAS summary statistics. Bioinformatics 2016; 32:1178-84. [PMID: 26656570 PMCID: PMC5860182 DOI: 10.1093/bioinformatics/btv719] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 08/06/2015] [Revised: 11/24/2015] [Accepted: 11/29/2015] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND Gene- and pathway-based analyses offer a useful alternative and complement to the usual single SNP-based analysis for GWAS. On the other hand, most existing gene- and pathway-based tests are not highly adaptive, and/or require the availability of individual-level genotype and phenotype data. It would be desirable to have highly adaptive tests applicable to summary statistics for single SNPs. This has become increasingly important given the popularity of large-scale meta-analyses of multiple GWASs and the practical availability of either single GWAS or meta-analyzed GWAS summary statistics for single SNPs. RESULTS We extend two adaptive tests for gene- and pathway-level association with a univariate trait to the case with GWAS summary statistics without individual-level genotype and phenotype data. We use the WTCCC GWAS data to evaluate and compare the proposed methods and several existing methods. We further illustrate their applications to a meta-analyzed dataset to identify genes and pathways associated with blood pressure, demonstrating the potential usefulness of the proposed methods. The methods are implemented in R package aSPU, freely and publicly available. AVAILABILITY AND IMPLEMENTATION https://cran.r-project.org/web/packages/aSPU/ CONTACT: weip@biostat.umn.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
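To make the idea of an adaptive test on summary statistics concrete, the sketch below implements a simulation-based sum-of-powered-scores test in the general spirit of aSPU. It is an illustration under stated assumptions, not the code of the aSPU R package: the function name `aspu_summary`, the set of powers, and the number of simulations are hypothetical choices, and the SNP Z-scores are assumed to be multivariate normal under the null with correlation equal to the supplied LD matrix.

```python
import numpy as np

def aspu_summary(z, ld, powers=(1, 2, 4, 8, np.inf), n_sim=2000, seed=0):
    """Adaptive sum of powered scores from SNP Z-scores (illustrative sketch).

    z  : length-p sequence of SNP Z-statistics for one gene
    ld : p x p SNP correlation matrix, assumed to be the null correlation of z
    Returns (per-power p-values, adaptive p-value).
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)

    def spu(zmat):
        # One column per power; the infinite power is taken as max |z|.
        cols = [np.max(np.abs(zmat), axis=1) if np.isinf(g)
                else np.sum(zmat ** g, axis=1) for g in powers]
        return np.abs(np.column_stack(cols))

    t_obs = spu(z[None, :])[0]                        # shape (n_powers,)
    z_null = rng.multivariate_normal(np.zeros(z.size), ld, size=n_sim)
    t_null = spu(z_null)                              # shape (n_sim, n_powers)

    # Monte Carlo p-value of the observed statistic for each power.
    p_obs = (1 + np.sum(t_null >= t_obs, axis=0)) / (n_sim + 1)

    # P-value of every null replicate relative to the other replicates,
    # obtained from descending ranks (rank 1 = largest statistic).
    rank_desc = n_sim - np.argsort(np.argsort(t_null, axis=0), axis=0)
    min_p_null = (rank_desc / n_sim).min(axis=1)

    # Adaptive step: compare the smallest observed p-value with the null
    # distribution of smallest p-values, reusing the same simulations.
    p_adaptive = (1 + np.sum(min_p_null <= p_obs.min())) / (n_sim + 1)
    return p_obs, p_adaptive

# Toy usage with three uncorrelated SNPs (made-up numbers).
p_by_power, p_gene = aspu_summary([2.5, 1.8, -0.3], np.eye(3))
```

Small powers favour dense signals spread over many SNPs, while large powers favour sparse signals, and taking the minimum p-value over powers is what lets the test adapt. Because the resolution of a Monte Carlo p-value is limited by `n_sim`, genome-wide use would need far more simulations than this toy default.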
Affiliation(s)
- Il-Youp Kwak
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
|
48
|
Powerful and Adaptive Testing for Multi-trait and Multi-SNP Associations with GWAS and Sequencing Data. Genetics 2016; 203:715-31. [PMID: 27075728 DOI: 10.1534/genetics.115.186502] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 12/22/2015] [Accepted: 04/02/2016] [Indexed: 11/18/2022] Open
Abstract
Testing for genetic association with multiple traits has become increasingly important, not only because of its potential to boost statistical power, but also for its direct relevance to applications. For example, there is accumulating evidence showing that some complex neurodegenerative and psychiatric diseases like Alzheimer's disease are due to disrupted brain networks, for which it would be natural to identify genetic variants associated with a disrupted brain network, represented as a set of multiple traits, one for each of multiple brain regions of interest. In spite of its promise, testing for multivariate trait associations is challenging: if not appropriately used, its power can be much lower than testing on each univariate trait separately (with a proper control for multiple testing). Furthermore, differing from most existing methods for single-SNP-multiple-trait associations, we consider SNP set-based association testing to decipher complicated joint effects of multiple SNPs on multiple traits. Because the power of a test critically depends on several unknown factors such as the proportions of associated SNPs and of traits, we propose a highly adaptive test at both the SNP and trait levels, giving higher weights to those likely associated SNPs and traits, to yield high power across a wide spectrum of situations. We illuminate relationships among the proposed and some existing tests, showing that the proposed test covers several existing tests as special cases. We compare the performance of the new test with that of several existing tests, using both simulated and real data. The methods were applied to structural magnetic resonance imaging data drawn from the Alzheimer's Disease Neuroimaging Initiative to identify genes associated with gray matter atrophy in the human brain default mode network (DMN). 
For genome-wide association studies (GWAS), genes AMOTL1 on chromosome 11 and APOE on chromosome 19 were discovered by the new test to be significantly associated with the DMN. Notably, gene AMOTL1 was not detected by single SNP-based analyses. To our knowledge, AMOTL1 has not been highlighted in other Alzheimer's disease studies before, although it was indicated to be related to cognitive impairment. The proposed method is also applicable to rare variants in sequencing data and can be extended to pathway analysis.
|
49
|
Huang J, Wang K, Wei P, Liu X, Liu X, Tan K, Boerwinkle E, Potash JB, Han S. FLAGS: A Flexible and Adaptive Association Test for Gene Sets Using Summary Statistics. Genetics 2016; 202:919-29. [PMID: 26773050 PMCID: PMC4788129 DOI: 10.1534/genetics.115.185009] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 11/18/2015] [Accepted: 01/13/2016] [Indexed: 01/06/2023] Open
Abstract
Genome-wide association studies (GWAS) have been widely used for identifying common variants associated with complex diseases. Despite remarkable success in uncovering many risk variants and providing novel insights into disease biology, genetic variants identified to date fail to explain the vast majority of the heritability for most complex diseases. One explanation is that there are still a large number of common variants that remain to be discovered, but their effect sizes are generally too small to be detected individually. Accordingly, gene set analysis of GWAS, which examines a group of functionally related genes, has been proposed as a complementary approach to single-marker analysis. Here, we propose a FLexible and Adaptive test for Gene Sets (FLAGS), using summary statistics. Extensive simulations showed that this method has an appropriate type I error rate and outperforms existing methods with increased power. As a proof of principle, through real data analyses of Crohn's disease GWAS data and bipolar disorder GWAS meta-analysis results, we demonstrated the superior performance of FLAGS over several state-of-the-art association tests for gene sets. Our method allows for the more powerful application of gene set analysis to complex diseases, which will have broad use given that GWAS summary results are increasingly publicly available.
Affiliation(s)
- Jianfei Huang
- Department of Psychiatry, University of Iowa, Iowa City, Iowa 52242
- Kai Wang
- Department of Biostatistics, University of Iowa, Iowa City, Iowa 52242
- Peng Wei
- Department of Biostatistics, University of Texas School of Public Health, Houston, Texas 77225
- Xiangtao Liu
- Department of Psychiatry, University of Iowa, Iowa City, Iowa 52242
- Xiaoming Liu
- Human Genetics Center, University of Texas Health Science Center, Houston, Texas 77030
- Kai Tan
- Department of Internal Medicine, University of Iowa, Iowa City, Iowa 52242
- Interdisciplinary Graduate Program in Genetics, University of Iowa, Iowa City, Iowa 52242
- Eric Boerwinkle
- Human Genetics Center, University of Texas Health Science Center, Houston, Texas 77030
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030
- James B Potash
- Department of Psychiatry, University of Iowa, Iowa City, Iowa 52242
- Shizhong Han
- Department of Psychiatry, University of Iowa, Iowa City, Iowa 52242
- Interdisciplinary Graduate Program in Genetics, University of Iowa, Iowa City, Iowa 52242
|
50
|
Cunha MLR, Meijers JCM, Middeldorp S. Introduction to the analysis of next generation sequencing data and its application to venous thromboembolism. Thromb Haemost 2015; 114:920-32. [PMID: 26446408 DOI: 10.1160/th15-05-0411] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 05/15/2015] [Accepted: 08/26/2015] [Indexed: 12/13/2022]
Abstract
Despite knowledge of various inherited risk factors associated with venous thromboembolism (VTE), no definite cause can be found in about 50% of patients. The application of data-driven searches such as GWAS has not been able to identify genetic variants with implications for clinical care, and unexplained heritability remains. In the past years, the development of several so-called next generation sequencing (NGS) platforms is offering the possibility of generating fast, inexpensive and accurate genomic information. However, so far their application to VTE has been very limited. Here we review basic concepts of NGS data analysis and explore the application of NGS technology to VTE. We provide both computational and biological viewpoints to discuss potentials and challenges of NGS-based studies.
Affiliation(s)
- Marisa L R Cunha
- Marisa L. R. Cunha, Department of Experimental Vascular Medicine, Academic Medical Center, Meibergdreef 9, 1105 AZ Amsterdam, The Netherlands, Tel.: +31 20 5662824, Fax: +31 20 6968833, E-mail:
|