1
Kernfeld E, Keener R, Cahan P, Battle A. Transcriptome data are insufficient to control false discoveries in regulatory network inference. Cell Syst 2024; 15:709-724.e13. [PMID: 39173585 DOI: 10.1016/j.cels.2024.07.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Revised: 05/31/2024] [Accepted: 07/22/2024] [Indexed: 08/24/2024]
Abstract
Inference of causal transcriptional regulatory networks (TRNs) from transcriptomic data suffers notoriously from false positives. Approaches to control the false discovery rate (FDR), for example, via permutation, bootstrapping, or multivariate Gaussian distributions, suffer from several complications: difficulty in distinguishing direct from indirect regulation, nonlinear effects, and causal structure inference requiring "causal sufficiency," meaning experiments that are free of any unmeasured, confounding variables. Here, we use a recently developed statistical framework, model-X knockoffs, to control the FDR while accounting for indirect effects, nonlinear dose-response, and user-provided covariates. We adjust the procedure to estimate the FDR correctly even when measured against incomplete gold standards. However, benchmarking against chromatin immunoprecipitation (ChIP) and other gold standards reveals higher observed than reported FDR. This indicates that unmeasured confounding is a major driver of FDR in TRN inference. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Eric Kernfeld
- Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Wyman Park Building, Suite 400 West, Baltimore, MD 21218, USA
- Rebecca Keener
- Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Wyman Park Building, Suite 400 West, Baltimore, MD 21218, USA
- Patrick Cahan
- Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Wyman Park Building, Suite 400 West, Baltimore, MD 21218, USA; Institute for Cell Engineering, Johns Hopkins Medicine, Baltimore, MD, USA; Department of Molecular Biology and Genetics, Johns Hopkins University, Baltimore, MD, USA.
- Alexis Battle
- Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Wyman Park Building, Suite 400 West, Baltimore, MD 21218, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Department of Genetic Medicine, Johns Hopkins Medicine, Baltimore, MD, USA; Malone Center for Engineering and Healthcare, Johns Hopkins University, Baltimore, MD, USA; Data Science and AI Institute, Johns Hopkins University, Baltimore, MD, USA.
2
Yu CX, Gu J, Chen Z, He Z. Summary statistics knockoffs inference with family-wise error rate control. Biometrics 2024; 80:ujae082. [PMID: 39222026 PMCID: PMC11367731 DOI: 10.1093/biomtc/ujae082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2023] [Revised: 07/29/2024] [Accepted: 08/12/2024] [Indexed: 09/04/2024]
Abstract
Testing multiple hypotheses of conditional independence with provable error rate control is a fundamental problem with various applications. To infer conditional independence with family-wise error rate (FWER) control when only summary statistics of marginal dependence are accessible, we adopt GhostKnockoff to directly generate knockoff copies of summary statistics and propose a new filter to select features conditionally dependent on the response. In addition, we develop a computationally efficient algorithm that greatly reduces the cost of generating knockoff copies without sacrificing power or FWER control. Experiments on simulated data and a real dataset of Alzheimer's disease genetics demonstrate the advantage of the proposed method over existing alternatives in both statistical power and computational efficiency.
Affiliation(s)
- Catherine Xinrui Yu
- Department of Statistics, The Chinese University of Hong Kong, Hong Kong, 999077, China
- Jiaqi Gu
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, California, 94304, United States
- Zhaomeng Chen
- Department of Statistics, Stanford University, Stanford, California, 94305, United States
- Zihuai He
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, California, 94304, United States
- Department of Medicine (Biomedical Informatics Research), Stanford University, Stanford, California, 94304, United States
3
He Z, Chu B, Yang J, Gu J, Chen Z, Liu L, Morrison T, Belloy ME, Qi X, Hejazi N, Mathur M, Le Guen Y, Tang H, Hastie T, Ionita-Laza I, Sabatti C, Candès E. Beyond guilty by association at scale: searching for causal variants on the basis of genome-wide summary statistics. bioRxiv 2024:2024.02.28.582621. [PMID: 38464202 PMCID: PMC10925326 DOI: 10.1101/2024.02.28.582621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Understanding the causal genetic architecture of complex phenotypes is essential for future research into disease mechanisms and potential therapies. Here, we present a novel framework for genome-wide detection of sets of variants that carry non-redundant information on the phenotypes and are therefore more likely to be causal in a biological sense. Crucially, our framework requires only summary statistics obtained from standard genome-wide marginal association testing. The described approach, implemented in open-source software, is also computationally efficient, requiring less than 15 minutes on a single CPU to perform genome-wide analysis. Through extensive genome-wide simulation studies, we show that the method can substantially outperform the usual two-stage marginal association testing and fine-mapping procedures in precision and recall. In applications to a meta-analysis of ten large-scale genetic studies of Alzheimer's disease (AD), we identified 82 loci associated with AD, including 37 additional loci missed by the conventional GWAS pipeline. The identified putative causal variants achieve state-of-the-art agreement with massively parallel reporter assays and CRISPR-Cas9 experiments. Additionally, we applied the method to a retrospective analysis of 67 large-scale GWAS summary statistics since 2013 for a variety of phenotypes. Results reveal the method's capacity to robustly discover additional loci for polygenic traits and pinpoint potential causal variants underpinning each locus beyond the conventional GWAS pipeline, contributing to a deeper understanding of complex genetic architectures in post-GWAS analyses.
Affiliation(s)
- Zihuai He
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA 94305, USA
- Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, 94305, USA
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Benjamin Chu
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- James Yang
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Jiaqi Gu
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA 94305, USA
- Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, 94305, USA
- Zhaomeng Chen
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Linxi Liu
- Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Tim Morrison
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Michael E. Belloy
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA 94305, USA
- Xinran Qi
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA 94305, USA
- Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, 94305, USA
- Nima Hejazi
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Maya Mathur
- Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, 94305, USA
- Department of Pediatrics, Stanford University, Stanford, CA 94305, USA
- Yann Le Guen
- Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, 94305, USA
- Hua Tang
- Department of Pediatrics, Stanford University, Stanford, CA 94305, USA
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Trevor Hastie
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Iuliana Ionita-Laza
- Department of Biostatistics, Columbia University Mailman School of Public Health, New York, NY 10032, USA
- Chiara Sabatti
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Emmanuel Candès
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Department of Mathematics, Stanford University, Stanford, CA 94305, USA
4
Cheng X, Wang H. A generic model-free feature screening procedure for ultra-high dimensional data with categorical response. Comput Methods Programs Biomed 2023; 229:107269. [PMID: 36463676 DOI: 10.1016/j.cmpb.2022.107269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 11/22/2022] [Accepted: 11/23/2022] [Indexed: 06/17/2023]
Abstract
BACKGROUND AND OBJECTIVE: Identifying active features from ultra-high dimensional data is one of the primary and vital tasks in statistical learning and biological discovery. METHODS: In this paper, we develop a generic concordance index screening (CI-SIS) procedure to handle ultra-high dimensional data with categorical response. The proposed procedure is model-free and nonparametric, based on the concordance index measure. It enjoys both sure screening and ranking consistency properties under some relatively weak assumptions. We investigate the flexibility of this procedure by considering some commonly encountered challenging settings in biomedical studies, such as category-adaptive data and extremely unbalanced response distributions. A data-driven threshold selection procedure via knockoff features is also presented. RESULTS: On the real lung dataset, our method achieves a lower prediction error, with a mean error of 0.107 with linear discriminant analysis (LDA) and 0.117 with random forest (RF). In addition, we obtain an accuracy improvement of 3% with LDA and 5% with RF compared to the runner-up method. On the more challenging real SRBCT (small round blue cell tumours) dataset, CI-SIS brings a performance improvement of at least 8% over all other competing methods. CONCLUSION: Experimental results show that the proposed method can efficiently identify genes that are associated with certain types of diseases. Therefore, the features that survive screening (after irrelevant features are filtered out) can help doctors make precise diagnoses and refine treatments for patients.
Affiliation(s)
- Xuewei Cheng
- School of Mathematics and Statistics, Central South University, Changsha, China; Department of Statistics and Data Science, National University of Singapore, Singapore.
- Hong Wang
- School of Mathematics and Statistics, Central South University, Changsha, China.
5
He Z, Liu L, Belloy ME, Le Guen Y, Sossin A, Liu X, Qi X, Ma S, Gyawali PK, Wyss-Coray T, Tang H, Sabatti C, Candès E, Greicius MD, Ionita-Laza I. GhostKnockoff inference empowers identification of putative causal variants in genome-wide association studies. Nat Commun 2022; 13:7209. [PMID: 36418338 PMCID: PMC9684164 DOI: 10.1038/s41467-022-34932-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2021] [Accepted: 11/09/2022] [Indexed: 11/27/2022] Open
Abstract
Recent advances in genome sequencing and imputation technologies provide an exciting opportunity to comprehensively study the contribution of genetic variants to complex phenotypes. However, our ability to translate genetic discoveries into mechanistic insights remains limited at this point. In this paper, we propose an efficient knockoff-based method, GhostKnockoff, for genome-wide association studies (GWAS) that leads to improved power and ability to prioritize putative causal variants relative to conventional GWAS approaches. The method requires only Z-scores from conventional GWAS and hence can be easily applied to enhance existing and future studies. The method can also be applied to meta-analysis of multiple GWAS allowing for arbitrary sample overlap. We demonstrate its performance using empirical simulations and two applications: (1) a meta-analysis for Alzheimer's disease comprising nine overlapping large-scale GWAS, whole-exome and whole-genome sequencing studies and (2) analysis of 1403 binary phenotypes from the UK Biobank data in 408,961 samples of European ancestry. Our results demonstrate that GhostKnockoff can identify putatively functional variants with weaker statistical effects that are missed by conventional association tests.
Affiliation(s)
- Zihuai He
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA.
- Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, 94305, USA.
- Linxi Liu
- Department of Statistics, University of Pittsburgh, Pittsburgh, PA, 15260, USA
- Michael E Belloy
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA
- Yann Le Guen
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA
- Institut du Cerveau - Paris Brain Institute - ICM, Paris, 75013, France
- Aaron Sossin
- Department of Biomedical Data Science, Stanford University, Stanford, CA, 94305, USA
- Xiaoxia Liu
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA
- Xinran Qi
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA
- Shiyang Ma
- Department of Biostatistics, Columbia University, New York, NY, 10032, USA
- Prashnna K Gyawali
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA
- Tony Wyss-Coray
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA
- Hua Tang
- Department of Genetics, Stanford University, Stanford, CA, 94305, USA
- Chiara Sabatti
- Department of Biomedical Data Science, Stanford University, Stanford, CA, 94305, USA
- Emmanuel Candès
- Department of Statistics, Stanford University, Stanford, CA, 94305, USA
- Department of Mathematics, Stanford University, Stanford, CA, 94305, USA
- Michael D Greicius
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA
6
Tian P, Hu Y, Liu Z, Zhang YD. Grace-AKO: a novel and stable knockoff filter for variable selection incorporating gene network structures. BMC Bioinformatics 2022; 23:478. [DOI: 10.1186/s12859-022-05016-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Accepted: 10/29/2022] [Indexed: 11/16/2022] Open
Abstract
Motivation
Variable selection is a common statistical approach to identifying genes associated with clinical outcomes of scientific interest. There are thousands of genes in genomic studies, while only a limited number of individual samples are available. Therefore, it is important to develop a method to identify genes associated with outcomes of interest that can control finite-sample false discovery rate (FDR) in high-dimensional data settings.
Results
This article proposes a novel method named Grace-AKO for graph-constrained estimation (Grace), which incorporates aggregation of multiple knockoffs (AKO) with the network-constrained penalty. Grace-AKO can control FDR in finite-sample settings and improve model stability simultaneously. Simulation studies show that Grace-AKO has better performance in finite-sample FDR control than the original Grace model. We apply Grace-AKO to the prostate cancer data in The Cancer Genome Atlas program by incorporating prostate-specific antigen (PSA) pathways in the Kyoto Encyclopedia of Genes and Genomes as the prior information. Grace-AKO finally identifies 47 candidate genes associated with PSA level, and more than 75% of the detected genes can be validated.
7
Loper JH, Lei L, Fithian W, Tansey W. Smoothed Nested Testing on Directed Acyclic Graphs. Biometrika 2022; 109:457-471. [PMID: 38694183 PMCID: PMC11061840 DOI: 10.1093/biomet/asab041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2024] Open
Abstract
We consider the problem of multiple hypothesis testing when there is a logical nested structure to the hypotheses. When one hypothesis is nested inside another, the outer hypothesis must be false if the inner hypothesis is false. We model the nested structure as a directed acyclic graph, including chain and tree graphs as special cases. Each node in the graph is a hypothesis and rejecting a node requires also rejecting all of its ancestors. We propose a general framework for adjusting node-level test statistics using the known logical constraints. Within this framework, we study a smoothing procedure that combines each node with all of its descendants to form a more powerful statistic. We prove a broad class of smoothing strategies can be used with existing selection procedures to control the familywise error rate, false discovery exceedance rate, or false discovery rate, so long as the original test statistics are independent under the null. When the null statistics are not independent but are derived from positively-correlated normal observations, we prove control for all three error rates when the smoothing method is arithmetic averaging of the observations. Simulations and an application to a real biology dataset demonstrate that smoothing leads to substantial power gains.
Affiliation(s)
- J. H. Loper
- Department of Neuroscience, Columbia University, 716 Jerome L. Greene Building, New York, New York 10025, U.S.A
- L. Lei
- Department of Statistics, Stanford University, Sequoia Hall, Palo Alto, California 94305, U.S.A
- W. Fithian
- Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720, U.S.A
- W. Tansey
- Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, 321 E 61st St., New York, New York 10065, U.S.A
8
Zhou J, Li Y, Zheng Z, Li D. Reproducible learning in large-scale graphical models. J Multivariate Anal 2022. [DOI: 10.1016/j.jmva.2021.104934] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
9
Affiliation(s)
- Buyu Lin
- Department of Statistics, Harvard University
- Xin Xing
- Department of Statistics, Virginia Tech
- Jun S. Liu
- Department of Statistics, Harvard University
10
Mary D, Roquain E. Semi-supervised multiple testing. Electron J Stat 2022. [DOI: 10.1214/22-ejs2050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- David Mary
- Université Côte d’Azur, Observatoire de la Côte d’Azur, CNRS, Laboratoire Lagrange, Bd de l’Observatoire, CS 34229, 06304, Nice cedex 4, France
- Etienne Roquain
- Laboratoire de Probabilités, Statistique et Modélisation, Sorbonne Université, Université de Paris & CNRS, 4, place Jussieu, 75005 Paris, France
11
Wang W, Janson L. A High-Dimensional Power Analysis of the Conditional Randomization Test and Knockoffs. Biometrika 2021. [DOI: 10.1093/biomet/asab052] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022] Open
Abstract
Summary
In many scientific problems, researchers try to relate a response variable Y to a set of potential explanatory variables X = (X1,…,Xp), and start by trying to identify variables that contribute to this relationship. In statistical terms, this goal can be posed as trying to identify the Xj’s upon which Y is conditionally dependent. Sometimes it is of value to simultaneously test for each j, which is more commonly known as variable selection. The conditional randomization test, CRT, and model-X knockoffs are two recently proposed methods that respectively perform conditional independence testing and variable selection by, for each Xj, computing any test statistic on the data and assessing that test statistic’s significance by comparing it to test statistics computed on synthetic variables generated using knowledge of X’s distribution. Our main contribution is to analyse their power in a high-dimensional linear model where the ratio of the dimension p to the sample size n converges to a positive constant. We give explicit expressions for the asymptotic power of the CRT, variable selection with CRT p-values, and model-X knockoffs, each with a test statistic based on either the marginal covariance, the least squares coefficient, or the lasso. One useful application of our analysis is the direct theoretical comparison of the asymptotic powers of variable selection with CRT p-values and model-X knockoffs; in the instances with independent covariates that we consider, the CRT provably dominates knockoffs. We also analyse the power gain from using unlabelled data in the CRT when limited knowledge of X’s distribution is available, and the power of the CRT when samples are collected retrospectively.
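The comparison step described in this abstract (a test statistic on the real Xj measured against the same statistic on synthetic draws of Xj) can be illustrated with a minimal sketch. This is not the authors' code. It assumes the simplest setting the abstract mentions, independent covariates, so that the conditional law of Xj given the other covariates is just its marginal, taken here to be N(0, 1). The function names `abs_covariance`, `sample_standard_normal`, and `crt_pvalue` are hypothetical, and the statistic is the absolute marginal covariance.

```python
import random


def abs_covariance(x, y):
    """Absolute sample covariance between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return abs(sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n)


def sample_standard_normal(rng, n):
    """Assumed known law of X_j: i.i.d. N(0, 1), independent of other covariates."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def crt_pvalue(xj, y, sample_xj, statistic=abs_covariance, n_draws=200, seed=0):
    """Conditional randomization test p-value for one covariate.

    Redraws X_j from its assumed distribution n_draws times, recomputes the
    statistic against the fixed response y, and counts how often a synthetic
    statistic is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = statistic(xj, y)
    exceed = 0
    for _ in range(n_draws):
        synthetic = sample_xj(rng, len(xj))
        if statistic(synthetic, y) >= observed:
            exceed += 1
    # The +1 terms keep the p-value valid in finite samples.
    return (1 + exceed) / (1 + n_draws)


if __name__ == "__main__":
    rng = random.Random(1)
    n = 200
    x_signal = sample_standard_normal(rng, n)  # drives the response
    x_null = sample_standard_normal(rng, n)    # unrelated to the response
    y = [2.0 * a + rng.gauss(0.0, 1.0) for a in x_signal]
    print("relevant covariate:", crt_pvalue(x_signal, y, sample_standard_normal))
    print("irrelevant covariate:", crt_pvalue(x_null, y, sample_standard_normal))
```

With correlated covariates, `sample_xj` would instead have to draw from the conditional distribution of Xj given the remaining columns, which is where knowledge of X's distribution enters.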
Affiliation(s)
- Wenshuo Wang
- Department of Statistics, Harvard University, One Oxford Street, Cambridge, Massachusetts 02138, U.S.A
- Lucas Janson
- Department of Statistics, Harvard University, One Oxford Street, Cambridge, Massachusetts 02138, U.S.A
12
Abstract
We present a comprehensive statistical framework to analyze data from genome-wide association studies of polygenic traits, producing interpretable findings while controlling the false discovery rate. In contrast with standard approaches, our method can leverage sophisticated multivariate algorithms but makes no parametric assumptions about the unknown relation between genotypes and phenotype. Instead, we recognize that genotypes can be considered as a random sample from an appropriate model, encapsulating our knowledge of genetic inheritance and human populations. This allows the generation of imperfect copies (knockoffs) of these variables that serve as ideal negative controls, correcting for linkage disequilibrium and accounting for unknown population structure, which may be due to diverse ancestries or familial relatedness. The validity and effectiveness of our method are demonstrated by extensive simulations and by applications to the UK Biobank data. These analyses confirm our method is powerful relative to state-of-the-art alternatives, while comparisons with other studies validate most of our discoveries. Finally, fast software is made available for researchers to analyze Biobank-scale datasets.
13
Sesia M, Bates S, Candès E, Marchini J, Sabatti C. False discovery rate control in genome-wide association studies with population structure. Proc Natl Acad Sci U S A 2021; 118:e2105841118. [PMID: 34580220 PMCID: PMC8501795 DOI: 10.1073/pnas.2105841118] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Accepted: 08/18/2021] [Indexed: 12/25/2022] Open
Abstract
We present a comprehensive statistical framework to analyze data from genome-wide association studies of polygenic traits, producing interpretable findings while controlling the false discovery rate. In contrast with standard approaches, our method can leverage sophisticated multivariate algorithms but makes no parametric assumptions about the unknown relation between genotypes and phenotype. Instead, we recognize that genotypes can be considered as a random sample from an appropriate model, encapsulating our knowledge of genetic inheritance and human populations. This allows the generation of imperfect copies (knockoffs) of these variables that serve as ideal negative controls, correcting for linkage disequilibrium and accounting for unknown population structure, which may be due to diverse ancestries or familial relatedness. The validity and effectiveness of our method are demonstrated by extensive simulations and by applications to the UK Biobank data. These analyses confirm our method is powerful relative to state-of-the-art alternatives, while comparisons with other studies validate most of our discoveries. Finally, fast software is made available for researchers to analyze Biobank-scale datasets.
Affiliation(s)
- Matteo Sesia
- Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA 90089;
- Stephen Bates
- Department of Statistics, University of California, Berkeley, CA 94720
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720
- Emmanuel Candès
- Department of Statistics, Stanford University, Stanford, CA 94305;
- Department of Mathematics, Stanford University, Stanford, CA 94305
- Chiara Sabatti
- Department of Statistics, Stanford University, Stanford, CA 94305
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA 94305
14
Chia C, Sesia M, Ho CS, Jeffrey SS, Dionne J, Candes EJ, Howe RT. Interpretable Classification of Bacterial Raman Spectra with Knockoff Wavelets. IEEE J Biomed Health Inform 2021; 26:740-748. [PMID: 34232897 DOI: 10.1109/jbhi.2021.3094873] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/07/2022]
Abstract
Deep neural networks and other machine learning models are widely applied to biomedical signal data because they can detect complex patterns and compute accurate predictions. However, the difficulty of interpreting such models is a limitation, especially for applications involving high-stakes decisions, including the identification of bacterial infections. This paper considers fast Raman spectroscopy data and demonstrates that a logistic regression model with carefully selected features achieves accuracy comparable to that of neural networks, while being much simpler and more transparent. Our analysis leverages wavelet features with intuitive chemical interpretations, and performs controlled variable selection with knockoffs to ensure the predictors are relevant and non-redundant. Although we focus on a particular data set, the proposed approach is broadly applicable to other types of signal data for which interpretability may be important.
15
Li J, Maathuis MH. GGM knockoff filter: False discovery rate control for Gaussian graphical models. J R Stat Soc Series B Stat Methodol 2021. [DOI: 10.1111/rssb.12430] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 01/23/2023]
Affiliation(s)
- Jinzhou Li
- Seminar für Statistik, ETH Zürich, Zürich, Switzerland
16
Zhu G, Zhao T. Deep-gKnock: Nonlinear group-feature selection with deep neural networks. Neural Netw 2021; 135:139-147. [PMID: 33385830 DOI: 10.1016/j.neunet.2020.12.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/08/2020] [Revised: 11/26/2020] [Accepted: 12/02/2020] [Indexed: 01/21/2023]
Abstract
Feature selection is central to contemporary high-dimensional data analysis. Group structure among features arises naturally in various scientific problems. Many methods have been proposed to incorporate the group structure information into feature selection. However, these methods are normally restricted to a linear regression setting. To relax the linear constraint, we design a new Deep Neural Network (DNN) architecture and integrate it with the recently proposed knockoff technique to perform nonlinear group-feature selection with controlled group-wise False Discovery Rate (gFDR). Experimental results on high-dimensional synthetic data demonstrate that our method achieves the highest power and accurate gFDR control compared with state-of-the-art methods. The performance of Deep-gKnock is especially superior in the following five situations: (1) nonlinear relationships; (2) dimension p greater than sample size n; (3) high between-group correlation; (4) high within-group correlation; (5) a large number of associated groups. Deep-gKnock is also demonstrated to be robust to misspecification of the feature distribution and to changes in the network architecture. Moreover, Deep-gKnock achieves scientifically meaningful group-feature selection results for cutting-edge real-world datasets.
Affiliation(s)
- Guangyu Zhu
- Department of Computer Science and Statistics, University of Rhode Island, United States of America.
- Tingting Zhao
- Department of Electrical and Computer Engineering, Northeastern University, United States of America
17
Katsevich E, Ramdas A. Simultaneous high-probability bounds on the false discovery proportion in structured, regression and online settings. Ann Stat 2020. [DOI: 10.1214/19-aos1938] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]
18
Sesia M, Katsevich E, Bates S, Candès E, Sabatti C. Multi-resolution localization of causal variants across the genome. Nat Commun 2020; 11:1093. [PMID: 32107378 PMCID: PMC7046731 DOI: 10.1038/s41467-020-14791-2] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 06/17/2019] [Accepted: 02/01/2020] [Indexed: 01/07/2023] Open
Abstract
In the statistical analysis of genome-wide association data, it is challenging to precisely localize the variants that affect complex traits, due to linkage disequilibrium, and to maximize power while limiting spurious findings. Here we report on KnockoffZoom: a flexible method that localizes causal variants at multiple resolutions by testing the conditional associations of genetic segments of decreasing width, while provably controlling the false discovery rate. Our method utilizes artificial genotypes as negative controls and is equally valid for quantitative and binary phenotypes, without requiring any assumptions about their genetic architectures. Instead, we rely on well-established genetic models of linkage disequilibrium. We demonstrate that our method can detect more associations than mixed effects models and achieve fine-mapping precision, at comparable computational cost. Lastly, we apply KnockoffZoom to data from 350k subjects in the UK Biobank and report many new findings.
Affiliation(s)
- Matteo Sesia
- Department of Statistics, Stanford University, Stanford, CA, 94305, USA
| | - Eugene Katsevich
- Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
| | - Stephen Bates
- Department of Statistics, Stanford University, Stanford, CA, 94305, USA
| | - Emmanuel Candès
- Departments of Mathematics and of Statistics, Stanford University, Stanford, CA, 94305, USA.
| | - Chiara Sabatti
- Departments of Biomedical Data Science and of Statistics, Stanford University, Stanford, CA, 94305, USA.
| |
|
19
|
Ramdas AK, Barber RF, Wainwright MJ, Jordan MI. A unified treatment of multiple testing with prior knowledge using the p-filter. Ann Stat 2019. [DOI: 10.1214/18-aos1765] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 11/19/2022]
|
20
|
Katsevich E, Sabatti C. Multilayer knockoff filter: Controlled variable selection at multiple resolutions. Ann Appl Stat 2019; 13:1-33. [PMID: 31687060 PMCID: PMC6827557 DOI: 10.1214/18-aoas1185] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Indexed: 11/19/2022]
Abstract
We tackle the problem of selecting from among a large number of variables those that are "important" for an outcome. We consider situations where groups of variables are also of interest. For example, each variable might be a genetic polymorphism, and we might want to study how a trait depends on variability in genes, segments of DNA that typically contain multiple such polymorphisms. In this context, to discover that a variable is relevant for the outcome implies discovering that the larger entity it represents is also important. To guarantee meaningful results with high chance of replicability, we suggest controlling the rate of false discoveries for findings at the level of individual variables and at the level of groups. Building on the knockoff construction of Barber and Candès [Ann. Statist. 43 (2015) 2055-2085] and the multilayer testing framework of Barber and Ramdas [J. Roy. Statist. Soc. Ser. B 79 (2017) 1247-1268], we introduce the multilayer knockoff filter (MKF). We prove that MKF simultaneously controls the FDR at each resolution and use simulations to show that it incurs little power loss compared to methods that provide guarantees only for the discoveries of individual variables. We apply MKF to analyze a genetic dataset and find that it successfully reduces the number of false gene discoveries without a significant reduction in power.
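As a rough illustration of the building block this abstract refers to, the sketch below implements the single-layer "knockoff+" selection rule of Barber and Candès (2015), not the multilayer filter itself. It assumes the per-variable knockoff statistics W_j have already been computed elsewhere; the function names `knockoff_plus_threshold` and `knockoff_select` are hypothetical and chosen only for this example.

```python
import math
from typing import List, Sequence


def knockoff_plus_threshold(w: Sequence[float], q: float) -> float:
    """Return the knockoff+ threshold for statistics w at target FDR q.

    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns math.inf when no candidate t satisfies the bound.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        positives = sum(1 for x in w if x >= t)   # discoveries at this threshold
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return math.inf


def knockoff_select(w: Sequence[float], q: float) -> List[int]:
    """Indices of variables whose statistic meets the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


if __name__ == "__main__":
    # Made-up statistics for illustration only (not data from any paper).
    w_example = [5.0, 4.0, 3.5, 3.0, -1.0, 2.5, 0.5, -0.2, 2.0, 1.5]
    print(knockoff_select(w_example, q=0.25))
```

Large positive W_j suggest that a variable beats its knockoff copy, while negative W_j act as a built-in estimate of how many false discoveries slipped through. The multilayer filter described above applies a related idea at several resolutions simultaneously, which this sketch does not attempt.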
Affiliation(s)
- Eugene Katsevich
- Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305
| | - Chiara Sabatti
- Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305
| |
|