1. Zhang X, Wang L, Zhao J, Zhao H. Knockoff procedure improves causal gene identifications in conditional transcriptome-wide association studies. bioRxiv 2025:2025.02.05.636660. [PMID: 39974930 PMCID: PMC11838583 DOI: 10.1101/2025.02.05.636660] [Indexed: 02/21/2025]
Abstract
Transcriptome-wide association studies (TWASs) have been developed to nominate candidate genes associated with complex traits by integrating genome-wide association studies (GWASs) with expression quantitative trait loci (eQTL) data. However, most existing TWAS methods evaluate the marginal association between a single gene and the trait of interest without accounting for other genes within the same genomic region or the same gene from different tissues. Additionally, false-positive gene-trait pairs can arise due to correlations with the direct effects of genetic variants. In this study, we introduce TWASKnockoff, a new knockoff-based framework for detecting causal gene-tissue pairs using GWAS summary statistics and eQTL data. Unlike marginal testing in traditional TWAS methods, TWASKnockoff examines the conditional independence for each gene-trait pair, considering both correlations in cis-predicted expression across genes and correlations between gene expression levels and genetic variants. TWASKnockoff estimates the theoretical correlation matrix for all genetic elements (cis-predicted expression across genes and genotypes for genetic variants) by averaging estimations from parametric bootstrap samples and then performs knockoff-based inference to detect causal gene-trait pairs while controlling the false discovery rate (FDR). Through empirical simulations and an application to type 2 diabetes (T2D) data, we demonstrate that TWASKnockoff achieves superior FDR control and improves the average power in detecting causal gene-trait pairs at a fixed FDR level.
2. Leung YY, Lee WP, Kuzma AB, Nicaretta H, Valladares O, Gangadharan P, Qu L, Zhao Y, Ren Y, Cheng PL, Kuksa PP, Wang H, White H, Katanic Z, Bass L, Saravanan N, Greenfest-Allen E, Kirsch M, Cantwell L, Iqbal T, Wheeler NR, Farrell JJ, Zhu C, Turner SL, Gunasekaran TI, Mena PR, Jin J, Carter L, Zhang X, Vardarajan BN, Toga A, Cuccaro M, Hohman TJ, Bush WS, Naj AC, Martin E, Dalgard C, Kunkle BW, Farrer LA, Mayeux RP, Haines JL, Pericak-Vance MA, Schellenberg GD, Wang LS. Alzheimer's Disease Sequencing Project Release 4 Whole Genome Sequencing Dataset. medRxiv 2024:2024.12.03.24317000. [PMID: 39677464 PMCID: PMC11643159 DOI: 10.1101/2024.12.03.24317000] [Indexed: 12/17/2024]
Abstract
The Alzheimer's Disease Sequencing Project (ADSP) is a national initiative to understand the genetic architecture of Alzheimer's Disease and Related Dementias (AD/ADRD) by sequencing whole genomes of affected participants and age-matched cognitive controls from diverse populations. The Genome Center for Alzheimer's Disease (GCAD) processed whole-genome sequencing data from 36,361 ADSP participants, including 35,014 genetically unique participants of which 45% are from non-European ancestry, across 17 cohorts in 14 countries in this fourth release (R4). This sequencing effort identified 387 million bi-allelic variants, 42 million short insertions/deletions, and 2.2 million structural variants. Annotations and quality control data are available for all variants and samples. Additionally, detailed phenotypes from 15,927 participants across 10 domains are also provided. A linkage disequilibrium panel was created using unrelated AD cases and controls. Researchers can access and analyze the genetic data via NIAGADS Data Sharing Service, the VariXam tool, or NIAGADS GenomicsDB.
Affiliation(s)
- Yuk Yee Leung
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Wan-Ping Lee
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Amanda B Kuzma
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Heather Nicaretta
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Otto Valladares
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Prabhakaran Gangadharan
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Liming Qu
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Yi Zhao
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Youli Ren
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Po-Liang Cheng
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Pavel P Kuksa
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Hui Wang
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Heather White
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Zivadin Katanic
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Lauren Bass
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Naveen Saravanan
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Emily Greenfest-Allen
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Maureen Kirsch
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Laura Cantwell
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Taha Iqbal
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Nicholas R Wheeler
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA
  - Department of Genetics and Genome Sciences, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
- John J. Farrell
  - Department of Medicine, Biostatistics & Bioinformatics, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
- Congcong Zhu
  - Department of Medicine, Biostatistics & Bioinformatics, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
- Shannon L Turner
  - Department of Neurology, Vanderbilt University Medical Center, Nashville, TN, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
- Tamil I Gunasekaran
  - Columbia University Irving Medical Center, New York, NY, USA
  - Gertrude H. Sergievsky Center, Taub Institute for Research on the Aging Brain, Departments of Neurology, Psychiatry, and Epidemiology, College of Physicians and Surgeons, Columbia University, New York, NY, USA
- Pedro R Mena
  - Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
- Jimmy Jin
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Luke Carter
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Xiaoling Zhang
  - Department of Medicine, Biostatistics & Bioinformatics, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
- Badri N Vardarajan
  - Columbia University Irving Medical Center, New York, NY, USA
  - Gertrude H. Sergievsky Center, Taub Institute for Research on the Aging Brain, Departments of Neurology, Psychiatry, and Epidemiology, College of Physicians and Surgeons, Columbia University, New York, NY, USA
- Arthur Toga
  - Laboratory of Neuro Imaging, USC Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California
- Michael Cuccaro
  - Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
- Timothy J Hohman
  - Department of Neurology, Vanderbilt University Medical Center, Nashville, TN, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
- William S Bush
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA
  - Department of Genetics and Genome Sciences, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
- Adam C Naj
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Eden Martin
  - Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
- Clifton Dalgard
  - Department of Anatomy, Physiology and Genetics, School of Medicine, Uniformed Services University of the Health Sciences, Bethesda, MD, USA
- Brian W Kunkle
  - Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
- Lindsay A Farrer
  - Department of Medicine, Biostatistics & Bioinformatics, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
- Richard P Mayeux
  - Columbia University Irving Medical Center, New York, NY, USA
  - Gertrude H. Sergievsky Center, Taub Institute for Research on the Aging Brain, Departments of Neurology, Psychiatry, and Epidemiology, College of Physicians and Surgeons, Columbia University, New York, NY, USA
- Jonathan L Haines
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA
  - Department of Genetics and Genome Sciences, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
- Margaret A Pericak-Vance
  - Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
- Gerard D Schellenberg
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
- Li-San Wang
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
3. Ma S, Wang F, Border R, Buxbaum J, Zaitlen N, Ionita-Laza I. Local genetic correlation via knockoffs reduces confounding due to cross-trait assortative mating. Am J Hum Genet 2024; 111:2839-2848. [PMID: 39547235 PMCID: PMC11639086 DOI: 10.1016/j.ajhg.2024.10.012] [Received: 07/08/2024] [Revised: 10/18/2024] [Accepted: 10/21/2024] [Indexed: 11/17/2024]
Abstract
Local genetic correlation analysis is an important tool for identifying genetic loci with shared biology across traits. Recently, Border et al. have shown that the results of these analyses are confounded by cross-trait assortative mating (xAM), leading to many false-positive findings. Here, we describe LAVA-Knock, a local genetic correlation method that builds off an existing genetic correlation method, LAVA, and augments it by generating synthetic data in a way that preserves local and long-range linkage disequilibrium (LD), allowing us to reduce the confounding induced by xAM. We show in simulations based on a realistic xAM model and in genome-wide association study (GWAS) applications for 630 trait pairs that LAVA-Knock can greatly reduce the bias due to xAM relative to LAVA. Furthermore, we show a significant positive correlation between the reduction in local genetic correlations and estimates in the literature of cross-mate phenotype correlations; in particular, pairs of traits that are known to have high cross-mate phenotype correlation values have a significantly higher reduction in the number of local genetic correlations compared with other trait pairs. A few representative examples include education and intelligence, education and alcohol consumption, and attention-deficit hyperactivity disorder and depression. These results suggest that LAVA-Knock can reduce confounding due to both short-range LD and long-range LD induced by xAM.
Affiliation(s)
- Shiyang Ma
  - Clinical Research Institute, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China; School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Fan Wang
  - Department of Biostatistics, Columbia University, New York, NY 10032, USA
- Richard Border
  - Department of Neurology, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Joseph Buxbaum
  - Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Noah Zaitlen
  - Department of Biostatistics, Columbia University, New York, NY 10032, USA
- Iuliana Ionita-Laza
  - Department of Biostatistics, Columbia University, New York, NY 10032, USA; Department of Statistics, Lund University, Lund, Sweden
4. Chu BB, Gu J, Chen Z, Morrison T, Candès E, He Z, Sabatti C. Second-order group knockoffs with applications to genome-wide association studies. Bioinformatics 2024; 40:btae580. [PMID: 39340798 PMCID: PMC11639161 DOI: 10.1093/bioinformatics/btae580] [Received: 03/08/2024] [Revised: 08/15/2024] [Accepted: 09/24/2024] [Indexed: 09/30/2024]
Abstract
MOTIVATION: Conditional testing via the knockoff framework allows one to identify, among a large number of possible explanatory variables, those that carry unique information about an outcome of interest and also provides a false discovery rate guarantee on the selection. This approach is particularly well suited to the analysis of genome-wide association studies (GWAS), which have the goal of identifying genetic variants that influence traits of medical relevance.
RESULTS: While conditional testing can be both more powerful and precise than traditional GWAS analysis methods, its vanilla implementation encounters a difficulty common to all multivariate analysis methods: it is challenging to distinguish among multiple, highly correlated regressors. This impasse can be overcome by shifting the object of inference from single variables to groups of correlated variables. To achieve this, it is necessary to construct "group knockoffs." While successful examples are already documented in the literature, this paper substantially expands the set of algorithms and software for group knockoffs. We focus in particular on second-order knockoffs, for which we describe correlation matrix approximations that are appropriate for GWAS data and that result in considerable computational savings. We illustrate the effectiveness of the proposed methods with simulations and with the analysis of albuminuria data from the UK Biobank.
AVAILABILITY AND IMPLEMENTATION: The described algorithms are implemented in an open-source Julia package Knockoffs.jl. R and Python wrappers are available as knockoffsr and knockoffspy packages.
Affiliation(s)
- Benjamin B Chu
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, 94305, USA
- Jiaqi Gu
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94035, USA
- Zhaomeng Chen
  - Department of Statistics, Stanford University, Stanford, CA, 94035, USA
- Tim Morrison
  - Department of Statistics, Stanford University, Stanford, CA, 94035, USA
- Emmanuel Candès
  - Department of Statistics, Stanford University, Stanford, CA, 94035, USA
  - Department of Mathematics, Stanford University, Stanford, CA, 94035, USA
- Zihuai He
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, 94305, USA
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94035, USA
  - Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, 94035, USA
- Chiara Sabatti
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, 94305, USA
  - Department of Statistics, Stanford University, Stanford, CA, 94035, USA
5. Wang A, Tian P, Zhang YD. TWAS-GKF: a novel method for causal gene identification in transcriptome-wide association studies with knockoff inference. Bioinformatics 2024; 40:btae502. [PMID: 39189955 PMCID: PMC11361808 DOI: 10.1093/bioinformatics/btae502] [Received: 02/26/2024] [Revised: 07/02/2024] [Accepted: 08/24/2024] [Indexed: 08/28/2024]
Abstract
MOTIVATION: Transcriptome-wide association study (TWAS) aims to identify trait-associated genes regulated by significant variants to explore the underlying biological mechanisms at a tissue-specific level. Despite the advancement of current TWAS methods to cover diverse traits, traditional approaches still face two main challenges: (i) the lack of methods that can guarantee finite-sample false discovery rate (FDR) control in identifying trait-associated genes; and (ii) the requirement for individual-level data, which is often inaccessible.
RESULTS: To address this challenge, we propose a powerful knockoff inference method termed TWAS-GKF to identify candidate trait-associated genes with a guaranteed finite-sample FDR control. TWAS-GKF introduces the main idea of Ghostknockoff inference to generate knockoff variables using only summary statistics instead of individual-level data. In extensive studies, we demonstrate that TWAS-GKF successfully controls the finite-sample FDR under a pre-specified FDR level across all settings. We further apply TWAS-GKF to identify genes in brain cerebellum tissue from the Genotype-Tissue Expression (GTEx) v8 project associated with schizophrenia (SCZ) from the Psychiatric Genomics Consortium (PGC), and genes in liver tissue related to low-density lipoprotein cholesterol (LDL-C) from the UK Biobank, respectively. The results reveal that the majority of the identified genes are validated by Open Targets Validation Platform.
AVAILABILITY AND IMPLEMENTATION: The R package TWAS.GKF is publicly available at https://github.com/AnqiWang2021/TWAS.GKF.
Affiliation(s)
- Anqi Wang
  - Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, 999077, China
- Peixin Tian
  - Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, 999077, China
- Yan Dora Zhang
  - Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, 999077, China
6. Yang Y, Wang Q, Wang C, Buxbaum J, Ionita-Laza I. KnockoffHybrid: A knockoff framework for hybrid analysis of trio and population designs in genome-wide association studies. Am J Hum Genet 2024; 111:1448-1461. [PMID: 38821058 PMCID: PMC11267528 DOI: 10.1016/j.ajhg.2024.05.003] [Received: 09/10/2023] [Revised: 05/02/2024] [Accepted: 05/06/2024] [Indexed: 06/02/2024]
Abstract
Both trio and population designs are popular study designs for identifying risk genetic variants in genome-wide association studies (GWASs). The trio design, as a family-based design, is robust to confounding due to population structure, whereas the population design is often more powerful due to larger sample sizes. Here, we propose KnockoffHybrid, a knockoff-based statistical method for hybrid analysis of both the trio and population designs. KnockoffHybrid provides a unified framework that brings together the advantages of both designs and produces powerful hybrid analysis while controlling the false discovery rate (FDR) in the presence of linkage disequilibrium and population structure. Furthermore, KnockoffHybrid has the flexibility to leverage different types of summary statistics for hybrid analyses, including expression quantitative trait loci (eQTL) and GWAS summary statistics. We demonstrate in simulations that KnockoffHybrid offers power gains over non-hybrid methods for the trio and population designs with the same number of cases while controlling the FDR with complex correlation among variants and population structure among subjects. In hybrid analyses of three trio cohorts for autism spectrum disorders (ASDs) from the Autism Speaks MSSNG, Autism Sequencing Consortium, and Autism Genome Project with GWAS summary statistics from the iPSYCH project and eQTL summary statistics from the MetaBrain project, KnockoffHybrid outperforms conventional methods by replicating several known risk genes for ASDs and identifying additional associations with variants in other genes, including the PRAME family genes involved in axon guidance and which may act as common targets for human speech/language evolution and related disorders.
Affiliation(s)
- Yi Yang
  - Department of Biostatistics, City University of Hong Kong, Hong Kong SAR, China; School of Data Science, City University of Hong Kong, Hong Kong SAR, China
- Qi Wang
  - School of Data Science, City University of Hong Kong, Hong Kong SAR, China
- Chen Wang
  - Department of Biostatistics, Columbia University, New York, NY 10032, USA
- Joseph Buxbaum
  - Departments of Psychiatry, Neuroscience, and Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Iuliana Ionita-Laza
  - Department of Biostatistics, Columbia University, New York, NY 10032, USA; Department of Statistics, Lund University, Lund, Sweden
7. Yu CX, Gu J, Chen Z, He Z. Summary statistics knockoffs inference with family-wise error rate control. Biometrics 2024; 80:ujae082. [PMID: 39222026 PMCID: PMC11367731 DOI: 10.1093/biomtc/ujae082] [Received: 11/11/2023] [Revised: 07/29/2024] [Accepted: 08/12/2024] [Indexed: 09/04/2024]
Abstract
Testing multiple hypotheses of conditional independence with provable error rate control is a fundamental problem with various applications. To infer conditional independence with family-wise error rate (FWER) control when only summary statistics of marginal dependence are accessible, we adopt GhostKnockoff to directly generate knockoff copies of summary statistics and propose a new filter to select features conditionally dependent on the response. In addition, we develop a computationally efficient algorithm to greatly reduce the computational cost of knockoff copies generation without sacrificing power and FWER control. Experiments on simulated data and a real dataset of Alzheimer's disease genetics demonstrate the advantage of the proposed method over existing alternatives in both statistical power and computational efficiency.
Affiliation(s)
- Catherine Xinrui Yu
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong, 999077, China
- Jiaqi Gu
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, California, 94304, United States
- Zhaomeng Chen
  - Department of Statistics, Stanford University, Stanford, California, 94305, United States
- Zihuai He
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, California, 94304, United States
  - Department of Medicine (Biomedical Informatics Research), Stanford University, Stanford, California, 94304, United States
8. He Z, Chu B, Yang J, Gu J, Chen Z, Liu L, Morrison T, Belloy ME, Qi X, Hejazi N, Mathur M, Le Guen Y, Tang H, Hastie T, Ionita-Laza I, Sabatti C, Candès E. Beyond guilty by association at scale: searching for causal variants on the basis of genome-wide summary statistics. bioRxiv 2024:2024.02.28.582621. [PMID: 38464202 PMCID: PMC10925326 DOI: 10.1101/2024.02.28.582621] [Indexed: 03/12/2024]
Abstract
Understanding the causal genetic architecture of complex phenotypes is essential for future research into disease mechanisms and potential therapies. Here, we present a novel framework for genome-wide detection of sets of variants that carry non-redundant information on the phenotypes and are therefore more likely to be causal in a biological sense. Crucially, our framework requires only summary statistics obtained from standard genome-wide marginal association testing. The described approach, implemented in open-source software, is also computationally efficient, requiring less than 15 minutes on a single CPU to perform genome-wide analysis. Through extensive genome-wide simulation studies, we show that the method can substantially outperform usual two-stage marginal association testing and fine-mapping procedures in precision and recall. In applications to a meta-analysis of ten large-scale genetic studies of Alzheimer's disease (AD), we identified 82 loci associated with AD, including 37 additional loci missed by conventional GWAS pipeline. The identified putative causal variants achieve state-of-the-art agreement with massively parallel reporter assays and CRISPR-Cas9 experiments. Additionally, we applied the method to a retrospective analysis of 67 large-scale GWAS summary statistics since 2013 for a variety of phenotypes. Results reveal the method's capacity to robustly discover additional loci for polygenic traits and pinpoint potential causal variants underpinning each locus beyond conventional GWAS pipeline, contributing to a deeper understanding of complex genetic architectures in post-GWAS analyses.
Affiliation(s)
- Zihuai He
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA 94305, USA
  - Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA 94305, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Benjamin Chu
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- James Yang
  - Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Jiaqi Gu
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA 94305, USA
  - Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA 94305, USA
- Zhaomeng Chen
  - Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Linxi Liu
  - Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Tim Morrison
  - Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Michael E. Belloy
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA 94305, USA
- Xinran Qi
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA 94305, USA
  - Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA 94305, USA
- Nima Hejazi
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Maya Mathur
  - Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA 94305, USA
  - Department of Pediatrics, Stanford University, Stanford, CA 94305, USA
- Yann Le Guen
  - Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA 94305, USA
- Hua Tang
  - Department of Pediatrics, Stanford University, Stanford, CA 94305, USA
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Trevor Hastie
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
  - Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Iuliana Ionita-Laza
  - Department of Biostatistics, Columbia University Mailman School of Public Health, New York, NY 10032, USA
- Chiara Sabatti
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
  - Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Emmanuel Candès
  - Department of Statistics, Stanford University, Stanford, CA 94305, USA
  - Department of Mathematics, Stanford University, Stanford, CA 94305, USA
|
9
|
Chen Z, He Z, Chu BB, Gu J, Morrison T, Sabatti C, Candès E. Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression. arXiv 2024; arXiv:2402.12724v1. [PMID: 38463500 PMCID: PMC10925382] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each dependent variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs (He et al., 2022) and introduce variable selection methods based on penalized regression that achieve false discovery rate (FDR) control. We report empirical results from extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer's disease and observe a significant improvement in power.
Affiliation(s)
- Zihuai He
  - Department of Neurology and Neurological Sciences, Stanford University
  - Department of Medicine (Biomedical Informatics Research), Stanford University
- Benjamin B Chu
  - Department of Biomedical Data Science, Stanford University
- Jiaqi Gu
  - Department of Neurology and Neurological Sciences, Stanford University
- Chiara Sabatti
  - Department of Statistics, Stanford University
  - Department of Biomedical Data Science, Stanford University
- Emmanuel Candès
  - Department of Statistics, Stanford University
  - Department of Mathematics, Stanford University
|
10
|
Cui R, Elzur RA, Kanai M, Ulirsch JC, Weissbrod O, Daly MJ, Neale BM, Fan Z, Finucane HK. Improving fine-mapping by modeling infinitesimal effects. Nat Genet 2024; 56:162-169. [PMID: 38036779 PMCID: PMC11056999 DOI: 10.1038/s41588-023-01597-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/24/2022] [Accepted: 10/26/2023] [Indexed: 12/02/2023]
Abstract
Fine-mapping aims to identify causal genetic variants for phenotypes. Bayesian fine-mapping algorithms (for example, SuSiE, FINEMAP, ABF and COJO-ABF) are widely used, but assessing posterior probability calibration remains challenging in real data, where model misspecification probably exists, and true causal variants are unknown. We introduce replication failure rate (RFR), a metric to assess fine-mapping consistency by downsampling. SuSiE, FINEMAP and COJO-ABF show high RFR, indicating potential overconfidence in their output. Simulations reveal that nonsparse genetic architecture can lead to miscalibration, while imputation noise, nonuniform distribution of causal variants and quality control filters have minimal impact. Here we present SuSiE-inf and FINEMAP-inf, fine-mapping methods modeling infinitesimal effects alongside fewer larger causal effects. Our methods show improved calibration, RFR and functional enrichment, competitive recall and computational efficiency. Notably, using our methods' posterior effect sizes substantially increases polygenic risk score accuracy over SuSiE and FINEMAP. Our work improves causal variant identification for complex traits, a fundamental goal of human genetics.
Affiliation(s)
- Ran Cui
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Roy A Elzur
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Masahiro Kanai
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
  - Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita, Japan
- Jacob C Ulirsch
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Program in Biological and Biomedical Sciences, Harvard Medical School, Boston, MA, USA
- Omer Weissbrod
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Mark J Daly
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- Benjamin M Neale
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Zhou Fan
  - Department of Statistics and Data Science, Yale University, New Haven, CT, USA
- Hilary K Finucane
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
|