101
|
Fan J, Fan Y, Han X, Lv J. Asymptotic Theory of Eigenvectors for Random Matrices with Diverging Spikes. J Am Stat Assoc 2022; 117:996-1009. [PMID: 36060554 PMCID: PMC9438751 DOI: 10.1080/01621459.2020.1840990] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 01/03/2023]
Abstract
Characterizing the asymptotic distributions of eigenvectors for large random matrices poses important challenges yet can provide useful insights into a range of statistical applications. To this end, in this paper we introduce a general framework of asymptotic theory of eigenvectors (ATE) for large spiked random matrices with diverging spikes and heterogeneous variances, and establish the asymptotic properties of the spiked eigenvectors and eigenvalues for the scenario of the generalized Wigner matrix noise. Under some mild regularity conditions, we provide the asymptotic expansions for the spiked eigenvalues and show that they are asymptotically normal after some normalization. For the spiked eigenvectors, we establish asymptotic expansions for the general linear combination and further show that it is asymptotically normal after some normalization, where the weight vector can be arbitrary. We also provide a more general asymptotic theory for the spiked eigenvectors using the bilinear form. Simulation studies verify the validity of our new theoretical results. Our family of models encompasses many popularly used ones such as the stochastic block models with or without overlapping communities for network analysis and the topic models for text analysis, and our general theory can be exploited for statistical inference in these large-scale applications.
Affiliation(s)
- Xiao Han
- University of Southern California
|
102
|
Mary D, Roquain E. Semi-supervised multiple testing. Electron J Stat 2022. [DOI: 10.1214/22-ejs2050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- David Mary
- Université Côte d’Azur, Observatoire de la Côte d’Azur, CNRS, Laboratoire Lagrange, Bd de l’Observatoire, CS 34229, 06304, Nice cedex 4, France
- Etienne Roquain
- Laboratoire de Probabilités, Statistique et Modélisation, Sorbonne Université, Université de Paris & CNRS, 4, place Jussieu, 75005 Paris, France
|
103
|
OUP accepted manuscript. Biostatistics 2022; 23:1039-1055. [DOI: 10.1093/biostatistics/kxac001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2021] [Revised: 11/12/2021] [Accepted: 12/04/2021] [Indexed: 11/13/2022]
|
104
|
Abraham K, Castillo I, Roquain É. Empirical Bayes cumulative ℓ-value multiple testing procedure for sparse sequences. Electron J Stat 2022. [DOI: 10.1214/22-ejs1979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Kweku Abraham
- University of Cambridge, Statistical Laboratory, Wilberforce Road, Cambridge CB3 0WB, UK
- Ismaël Castillo
- Université de Paris and Sorbonne Université, CNRS, Laboratoire de Probabilités, Statistique et Modélisation, F-75013 Paris, France
- Étienne Roquain
- Université de Paris and Sorbonne Université, CNRS, Laboratoire de Probabilités, Statistique et Modélisation, F-75013 Paris, France
|
105
|
Sarkar SK, Tang CY. Adjusting the Benjamini-Hochberg method for controlling the false discovery rate in knockoff-assisted variable selection. Biometrika 2021. [DOI: 10.1093/biomet/asab066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
Summary
We consider the knockoff-based multiple testing setup of Barber & Candès (2015) for variable selection in multiple regression. The method of Benjamini & Hochberg (1995) and an adaptive version of it are adjusted to this setup, transforming them into valid p-value-based, false discovery rate controlling methods that do not rely on specifying the correlation structure of the explanatory variables. Simulations and real data applications show that our proposed methods are powerful competitors of the false discovery rate controlling method in Barber & Candès (2015).
Affiliation(s)
- Sanat K Sarkar
- Department of Statistical Science, Temple University, 1810 Liacouras Walk, Philadelphia, Pennsylvania 19122-6083, U.S.A
- Cheng Yong Tang
- Department of Statistical Science, Temple University, 1810 Liacouras Walk, Philadelphia, Pennsylvania 19122-6083, U.S.A
|
106
|
Zhao Q, Small DS, Ertefaie A. Selective inference for effect modification via the lasso. J R Stat Soc Series B Stat Methodol 2021; 84:382-413. [DOI: 10.1111/rssb.12483] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/25/2022]
Affiliation(s)
- Qingyuan Zhao
- Department of Pure Mathematics and Mathematical Statistics University of Cambridge Cambridge UK
- Dylan S. Small
- Department of Statistics and Data Science University of Pennsylvania Philadelphia Pennsylvania USA
- Ashkan Ertefaie
- Department of Biostatistics and Computational Biology University of Rochester Rochester New York USA
|
107
|
Chen H, Ren H, Yao F, Zou C. Data-driven selection of the number of change-points via error rate control. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1999820] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/19/2022]
Affiliation(s)
- Hui Chen
- School of Statistics and Data Science, Nankai University, China
- Haojie Ren
- School of Mathematical Sciences, Shanghai Jiao Tong University, China
- Fang Yao
- School of Mathematical Sciences, Peking University, China
- Changliang Zou
- School of Statistics and Data Science, Nankai University, China
|
108
|
Dong R, Zhou J, Zheng Z. Controlling the false discovery rate for latent factors via unit-rank deflation. Stat Probab Lett 2021. [DOI: 10.1016/j.spl.2021.109178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
109
|
Wang G, Zou C, Qiu P. Data-Driven Determination of the Number of Jumps in Regression Curves. Technometrics 2021. [DOI: 10.1080/00401706.2021.1978551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Affiliation(s)
- Guanghui Wang
- KLATASDS-MOE, Academy of Statistics and Interdisciplinary Sciences, East China Normal University, Shanghai, China
- Changliang Zou
- School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin, China
- Peihua Qiu
- Department of Biostatistics, University of Florida, Gainesville, FL
|
110
|
Jiang W, Bogdan M, Josse J, Majewski S, Miasojedow B, Ročková V. Adaptive Bayesian SLOPE: Model Selection With Incomplete Data. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2021.1963263] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/20/2022]
Affiliation(s)
- Wei Jiang
- Inria XPOP and CMAP, École Polytechnique, Palaiseau, France
- Małgorzata Bogdan
- Faculty of Mathematics and Computer Science, University of Wroclaw, Wroclaw, Poland
- Department of Statistics, Lund University, Lund, Sweden
- Julie Josse
- Inria XPOP and CMAP, École Polytechnique, Palaiseau, France
- Błażej Miasojedow
- Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland
|
111
|
Ge X, Chen YE, Song D, McDermott M, Woyshner K, Manousopoulou A, Wang N, Li W, Wang LD, Li JJ. Clipper: p-value-free FDR control on high-throughput data from two conditions. Genome Biol 2021; 22:288. [PMID: 34635147 PMCID: PMC8504070 DOI: 10.1186/s13059-021-02506-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 03/16/2021] [Accepted: 09/21/2021] [Indexed: 12/12/2022]
Abstract
High-throughput biological data analysis commonly involves identifying features such as genes, genomic regions, and proteins, whose values differ between two conditions, from numerous features measured simultaneously. The most widely used criterion to ensure the analysis reliability is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.
Affiliation(s)
- Xinzhou Ge
- Department of Statistics, University of California, Los Angeles, 90095, CA, USA
- Yiling Elaine Chen
- Department of Statistics, University of California, Los Angeles, 90095, CA, USA
- Dongyuan Song
- Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
- MeiLu McDermott
- Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- The Quantitative and Computational Biology section, University of Southern California, Los Angeles, 90089, CA, USA
- Kyla Woyshner
- Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- Antigoni Manousopoulou
- Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- Ning Wang
- Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
- Wei Li
- Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, 92697, CA, USA
- Leo D Wang
- Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, 90095, CA, USA
- Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
- Department of Human Genetics, University of California, Los Angeles, 90095, CA, USA
- Department of Computational Medicine, University of California, Los Angeles, 90095, CA, USA
- Department of Biostatistics, University of California, Los Angeles, 90095, CA, USA
|
112
|
Abstract
We present a comprehensive statistical framework to analyze data from genome-wide association studies of polygenic traits, producing interpretable findings while controlling the false discovery rate. In contrast with standard approaches, our method can leverage sophisticated multivariate algorithms but makes no parametric assumptions about the unknown relation between genotypes and phenotype. Instead, we recognize that genotypes can be considered as a random sample from an appropriate model, encapsulating our knowledge of genetic inheritance and human populations. This allows the generation of imperfect copies (knockoffs) of these variables that serve as ideal negative controls, correcting for linkage disequilibrium and accounting for unknown population structure, which may be due to diverse ancestries or familial relatedness. The validity and effectiveness of our method are demonstrated by extensive simulations and by applications to the UK Biobank data. These analyses confirm our method is powerful relative to state-of-the-art alternatives, while comparisons with other studies validate most of our discoveries. Finally, fast software is made available for researchers to analyze Biobank-scale datasets.
|
113
|
Sesia M, Bates S, Candès E, Marchini J, Sabatti C. False discovery rate control in genome-wide association studies with population structure. Proc Natl Acad Sci U S A 2021; 118:e2105841118. [PMID: 34580220 PMCID: PMC8501795 DOI: 10.1073/pnas.2105841118] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Accepted: 08/18/2021] [Indexed: 12/25/2022]
Abstract
We present a comprehensive statistical framework to analyze data from genome-wide association studies of polygenic traits, producing interpretable findings while controlling the false discovery rate. In contrast with standard approaches, our method can leverage sophisticated multivariate algorithms but makes no parametric assumptions about the unknown relation between genotypes and phenotype. Instead, we recognize that genotypes can be considered as a random sample from an appropriate model, encapsulating our knowledge of genetic inheritance and human populations. This allows the generation of imperfect copies (knockoffs) of these variables that serve as ideal negative controls, correcting for linkage disequilibrium and accounting for unknown population structure, which may be due to diverse ancestries or familial relatedness. The validity and effectiveness of our method are demonstrated by extensive simulations and by applications to the UK Biobank data. These analyses confirm our method is powerful relative to state-of-the-art alternatives, while comparisons with other studies validate most of our discoveries. Finally, fast software is made available for researchers to analyze Biobank-scale datasets.
Affiliation(s)
- Matteo Sesia
- Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA 90089
- Stephen Bates
- Department of Statistics, University of California, Berkeley, CA 94720
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720
- Emmanuel Candès
- Department of Statistics, Stanford University, Stanford, CA 94305
- Department of Mathematics, Stanford University, Stanford, CA 94305
- Chiara Sabatti
- Department of Statistics, Stanford University, Stanford, CA 94305
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA 94305
|
114
|
Hutchinson A, Reales G, Willis T, Wallace C. Leveraging auxiliary data from arbitrary distributions to boost GWAS discovery with Flexible cFDR. PLoS Genet 2021; 17:e1009853. [PMID: 34669738 PMCID: PMC8559959 DOI: 10.1371/journal.pgen.1009853] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/01/2021] [Revised: 11/01/2021] [Accepted: 09/30/2021] [Indexed: 12/15/2022]
Abstract
Genome-wide association studies (GWAS) have identified thousands of genetic variants that are associated with complex traits. However, a stringent significance threshold is required to identify robust genetic associations. Leveraging relevant auxiliary covariates has the potential to boost statistical power to exceed the significance threshold. Particularly, abundant pleiotropy and the non-random distribution of SNPs across various functional categories suggest that leveraging GWAS test statistics from related traits and/or functional genomic data may boost GWAS discovery. While type 1 error rate control has become standard in GWAS, control of the false discovery rate can be a more powerful approach. The conditional false discovery rate (cFDR) extends the standard FDR framework by conditioning on auxiliary data to call significant associations, but current implementations are restricted to auxiliary data satisfying specific parametric distributions, typically GWAS p-values for related traits. We relax these distributional assumptions, enabling an extension of the cFDR framework that supports auxiliary covariates from arbitrary continuous distributions ("Flexible cFDR"). Our method can be applied iteratively, thereby supporting multi-dimensional covariate data. Through simulations we show that Flexible cFDR increases sensitivity whilst controlling FDR after one or several iterations. We further demonstrate its practical potential through application to an asthma GWAS, leveraging various functional genomic data to find additional genetic associations for asthma, which we validate in the larger, independent, UK Biobank data resource.
Affiliation(s)
- Anna Hutchinson
- MRC Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom
- Guillermo Reales
- Cambridge Institute of Therapeutic Immunology and Infectious Disease (CITIID), University of Cambridge, Cambridge, United Kingdom
- Department of Medicine, University of Cambridge, Cambridge, United Kingdom
- Thomas Willis
- MRC Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom
- Chris Wallace
- MRC Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom
- Cambridge Institute of Therapeutic Immunology and Infectious Disease (CITIID), University of Cambridge, Cambridge, United Kingdom
- Department of Medicine, University of Cambridge, Cambridge, United Kingdom
|
115
|
Affiliation(s)
- Zhimei Ren
- Department of Statistics, University of Chicago, Chicago, IL
- Yuting Wei
- Statistics & Data Science Department, University of Pennsylvania, Philadelphia, PA
- Emmanuel Candès
- Department of Mathematics, Department of Statistics, Stanford University, Stanford, CA
|
116
|
Chen D, Tashman K, Palmer DS, Neale B, Roeder K, Bloemendal A, Churchhouse C, Ke ZT. A data harmonization pipeline to leverage external controls and boost power in GWAS. Hum Mol Genet 2021; 31:481-489. [PMID: 34508597 PMCID: PMC8825237 DOI: 10.1093/hmg/ddab261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2021] [Revised: 09/02/2021] [Accepted: 09/03/2021] [Indexed: 11/12/2022]
Abstract
The use of external controls in genome-wide association study (GWAS) can significantly increase the size and diversity of the control sample, enabling high-resolution ancestry matching and enhancing the power to detect association signals. However, the aggregation of controls from multiple sources is challenging due to batch effects, difficulty in identifying genotyping errors, and the use of different genotyping platforms. These obstacles have impeded the use of external controls in GWAS and can lead to spurious results if not carefully addressed. We propose a unified data harmonization pipeline that includes an iterative approach to quality control (QC) and imputation, implemented before and after merging cohorts and arrays. We apply this harmonization pipeline to aggregate 27 517 European control samples from 16 collections within dbGaP. We leverage these harmonized controls to conduct a GWAS of Crohn's disease. We demonstrate a boost in power over using the cohort samples alone, and that our procedure results in summary statistics free of any significant batch effects. This harmonization pipeline for aggregating genotype data from multiple sources can also serve other applications where individual level genotypes, rather than summary statistics, are required.
Affiliation(s)
- Danfeng Chen
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, 08544, New Jersey, United States
- Katherine Tashman
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States
- Duncan S Palmer
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States
- Benjamin Neale
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, 02142, Massachusetts, United States
- Kathryn Roeder
- Department of Statistics, Carnegie Mellon University, Pittsburgh, 15213, Pennsylvania, United States
- Alex Bloemendal
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States
- Claire Churchhouse
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, 02114, Massachusetts, United States
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, United States
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, 02142, Massachusetts, United States
- Zheng Tracy Ke
- Department of Statistics, Harvard University, Cambridge, 02138, Massachusetts, United States
|
117
|
Zhu Z, Fan Y, Kong Y, Lv J, Sun F. DeepLINK: Deep learning inference using knockoffs with applications to genomics. Proc Natl Acad Sci U S A 2021; 118:e2104683118. [PMID: 34480002 PMCID: PMC8433583 DOI: 10.1073/pnas.2104683118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2021] [Accepted: 07/16/2021] [Indexed: 11/18/2022]
Abstract
We propose a deep learning-based knockoffs inference framework, DeepLINK, that guarantees the false discovery rate (FDR) control in high-dimensional settings. DeepLINK is applicable to a broad class of covariate distributions described by the possibly nonlinear latent factor models. It consists of two major parts: an autoencoder network for the knockoff variable construction and a multilayer perceptron network for feature selection with the FDR control. The empirical performance of DeepLINK is investigated through extensive simulation studies, where it is shown to achieve FDR control in feature selection with both high selection power and high prediction accuracy. We also apply DeepLINK to three real data applications to demonstrate its practical utility.
Affiliation(s)
- Zifan Zhu
- Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA 90089
- Yingying Fan
- Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
- Yinfei Kong
- Department of Information Systems and Decision Sciences, California State University, Fullerton, CA 92831
- Jinchi Lv
- Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
- Fengzhu Sun
- Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA 90089
|
118
|
Ebrahimpoor M, Goeman JJ. Inflated false discovery rate due to volcano plots: problem and solutions. Brief Bioinform 2021; 22:bbab053. [PMID: 33758907 PMCID: PMC8425469 DOI: 10.1093/bib/bbab053] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/24/2020] [Revised: 02/01/2021] [Indexed: 12/13/2022]
Abstract
MOTIVATION: Volcano plots are used to select the most interesting discoveries when too many discoveries remain after application of Benjamini-Hochberg's procedure (BH). The volcano plot suggests a double filtering procedure that selects features with both small adjusted P-value and large estimated effect size. Despite its popularity, this type of selection overlooks the fact that BH does not guarantee error control over filtered subsets of discoveries. Therefore the selected subset of features may include an inflated number of false discoveries. RESULTS: In this paper, we illustrate the substantially inflated type I error rate of volcano plot selection with simulation experiments and RNA-seq data. In particular, we show that the feature with the largest estimated effect is a very likely false positive result. Next, we investigate two alternative approaches for multiple testing with double filtering that do not inflate the false discovery rate. Our procedure is implemented in an interactive web application and is publicly available.
Affiliation(s)
- Mitra Ebrahimpoor
- Medical statistics, Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, 2333 ZA, The Netherlands
- Jelle J Goeman
- Medical statistics, Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, 2333 ZA, The Netherlands
|
119
|
Srinivasan A, Xue L, Zhan X. Compositional knockoff filter for high-dimensional regression analysis of microbiome data. Biometrics 2021; 77:984-995. [PMID: 32683674 PMCID: PMC7831267 DOI: 10.1111/biom.13336] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 07/23/2019] [Revised: 06/29/2020] [Accepted: 07/09/2020] [Indexed: 01/10/2023]
Abstract
A critical task in microbiome data analysis is to explore the association between a scalar response of interest and a large number of microbial taxa that are summarized as compositional data at different taxonomic levels. Motivated by fine-mapping of the microbiome, we propose a two-step compositional knockoff filter to provide the effective finite-sample false discovery rate (FDR) control in high-dimensional linear log-contrast regression analysis of microbiome compositional data. In the first step, we propose a new compositional screening procedure to remove insignificant microbial taxa while retaining the essential sum-to-zero constraint. In the second step, we extend the knockoff filter to identify the significant microbial taxa in the sparse regression model for compositional data. Thereby, a subset of the microbes is selected from the high-dimensional microbial taxa as related to the response under a prespecified FDR threshold. We study the theoretical properties of the proposed two-step procedure, including both sure screening and effective false discovery control. We demonstrate these properties in numerical simulation studies to compare our methods to some existing ones and show power gain of the new method while controlling the nominal FDR. The potential usefulness of the proposed method is also illustrated with application to an inflammatory bowel disease data set to identify microbial taxa that influence host gene expressions.
Affiliation(s)
- Arun Srinivasan
- Department of Statistics, Pennsylvania State University, University Park, PA 16802, U.S.A
- Lingzhou Xue
- Department of Statistics, Pennsylvania State University, University Park, PA 16802, U.S.A
- Xiang Zhan
- Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033, U.S.A
|
120
|
Freijeiro‐González L, Febrero‐Bande M, González‐Manteiga W. A Critical Review of LASSO and Its Derivatives for Variable Selection Under Dependence Among Covariates. Int Stat Rev 2021. [DOI: 10.1111/insr.12469] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/27/2022]
Affiliation(s)
- Laura Freijeiro‐González
- Department of Statistics Mathematical Analysis and Optimization; Santiago de Compostela University Santiago de Compostela Spain
- Manuel Febrero‐Bande
- Department of Statistics Mathematical Analysis and Optimization; Santiago de Compostela University Santiago de Compostela Spain
- Wenceslao González‐Manteiga
- Department of Statistics Mathematical Analysis and Optimization; Santiago de Compostela University Santiago de Compostela Spain
|
121
|
Du L, Guo X, Sun W, Zou C. False Discovery Rate Control Under General Dependence By Symmetrized Data Aggregation. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1945459] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Affiliation(s)
- Lilun Du
- Department of ISOM, Hong Kong University of Science and Technology, ISOM, Kowloon, Hong Kong
- Xu Guo
- Department of Mathematical Statistics, Beijing Normal University, Beijing, China
- Wenguang Sun
- Data Sciences and Operations, University of Southern California, Los Angeles, CA
- Changliang Zou
- Department of Statistics and Data Sciences, Nankai University, Tianjin, China
|
122
|
Ignatiadis N, Huber W. Covariate powered cross‐weighted multiple testing. J R Stat Soc Series B Stat Methodol 2021. [DOI: 10.1111/rssb.12411] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 11/30/2022]
Affiliation(s)
- Wolfgang Huber
- European Molecular Biology Laboratory Heidelberg Germany
|
123
|
Abstract
AbstractWe propose the conditional predictive impact (CPI), a consistent and unbiased estimator of the association between one or several features and a given outcome, conditional on a reduced feature set. Building on the knockoff framework of Candès et al. (J R Stat Soc Ser B 80:551–577, 2018), we develop a novel testing procedure that works in conjunction with any valid knockoff sampler, supervised learning algorithm, and loss function. The CPI can be efficiently computed for high-dimensional data without any sparsity constraints. We demonstrate convergence criteria for the CPI and develop statistical inference procedures for evaluating its magnitude, significance, and precision. These tests aid in feature and model selection, extending traditional frequentist and Bayesian techniques to general supervised learning tasks. The CPI may also be applied in causal discovery to identify underlying multivariate graph structures. We test our method using various algorithms, including linear regression, neural networks, random forests, and support vector machines. Empirical results show that the CPI compares favorably to alternative variable importance measures and other nonparametric tests of conditional independence on a diverse array of real and synthetic datasets. Simulations confirm that our inference procedures successfully control Type I error with competitive power in a range of settings. Our method has been implemented in an package, , which can be downloaded from https://github.com/dswatson/cpi.
|
124
|
Sechidis K, Kormaksson M, Ohlssen D. Using knockoffs for controlled predictive biomarker identification. Stat Med 2021; 40:5453-5473. [PMID: 34328655 DOI: 10.1002/sim.9134] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/21/2020] [Revised: 03/18/2021] [Accepted: 06/22/2021] [Indexed: 12/20/2022]
Abstract
One of the key challenges of personalized medicine is to identify which patients will respond positively to a given treatment. The area of subgroup identification focuses on this challenge, that is, identifying groups of patients that experience desirable characteristics, such as an enhanced treatment effect. A crucial first step toward subgroup identification is to identify the baseline variables (eg, biomarkers) that influence the treatment effect, which are known as predictive variables. Many subgroup discovery algorithms return importance scores that capture the variables' predictive strength. However, a major limitation of these scores is that they do not answer the core question: "Which variables are actually predictive?" With our work we answer this question by using the knockoff framework, which is a general framework for controlling the false discovery rate when performing prognostic variable selection. In contrast, our work is the first that uses knockoffs for predictive variable selection. We introduce two novel knockoff filters: one parametric, building on variable importance scores derived from a penalized linear regression model, and one non-parametric, building on causal forest variable importance scores. We conduct extensive simulations to validate the performance of the proposed methodology, and we also apply the proposed methods to data from a randomized clinical trial.
Affiliation(s)
| | - Matthias Kormaksson
- Advanced Methodology and Data Science, Novartis Pharmaceuticals Corporation, East Hanover, New Jersey, USA
| | - David Ohlssen
- Advanced Methodology and Data Science, Novartis Pharmaceuticals Corporation, East Hanover, New Jersey, USA
|
125
|
Generative Adversarial Network-Based Scheme for Diagnosing Faults in Cyber-Physical Power Systems. Sensors 2021; 21:s21155173. [PMID: 34372410 PMCID: PMC8348776 DOI: 10.3390/s21155173] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/30/2021] [Revised: 07/25/2021] [Accepted: 07/27/2021] [Indexed: 11/17/2022]
Abstract
This paper presents a novel diagnostic framework for distributed power systems that uses generative adversarial networks to generate artificial knockoffs in the power grid. The proposed framework makes use of raw data measurements, including voltage, frequency, and phase angle, that are collected from each bus in the cyber-physical power system. The collected measurements are first fed into a feature selection module, where multiple state-of-the-art techniques are used to extract the most informative features from the initial set of available features. The selected features are inputs to a knockoff generation module, where generative adversarial networks are employed to generate the corresponding knockoffs of the selected features. The generated knockoffs are then fed into a classification module, in which two different classification models are used for fault diagnosis. Multiple experiments have been designed to investigate the effect of noise, fault resistance value, and sampling rate on the performance of the proposed framework. The effectiveness of the proposed framework is validated through a comprehensive study on the IEEE 118-bus system.
|
126
|
Bin Masud S, Jenkins C, Hussey E, Elkin-Frankston S, Mach P, Dhummakupt E, Aeron S. Utilizing machine learning with knockoff filtering to extract significant metabolites in Crohn's disease with a publicly available untargeted metabolomics dataset. PLoS One 2021; 16:e0255240. [PMID: 34324558 PMCID: PMC8320926 DOI: 10.1371/journal.pone.0255240] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/17/2021] [Accepted: 07/12/2021] [Indexed: 12/26/2022] Open
Abstract
Metabolomic data processing pipelines have been improving in recent years, allowing for greater feature extraction and identification. Lately, machine learning and robust statistical techniques to control false discoveries are being incorporated into metabolomic data analysis. In this paper, we introduce one such recently developed technique, called aggregate knockoff filtering, to untargeted metabolomic analysis. When applied to a publicly available dataset, aggregate knockoff filtering combined with typical p-value filtering increases the number of significantly changing metabolites by 25% compared with conventional untargeted metabolomic data processing. With this method, features that would normally not be extracted under standard processing are brought to researchers' attention for further analysis.
Affiliation(s)
- Shoaib Bin Masud
- Department of Electrical and Computer Engineering, Tufts University, Medford, MA, United States of America
| | - Conor Jenkins
- DEVCOM Chemical Biological Center, Aberdeen Proving Ground, Aberdeen, MD, United States of America
| | - Erika Hussey
- DEVCOM Soldier Center, Natick, MA, United States of America
| | | | - Phillip Mach
- DEVCOM Chemical Biological Center, Aberdeen Proving Ground, Aberdeen, MD, United States of America
| | - Elizabeth Dhummakupt
- DEVCOM Chemical Biological Center, Aberdeen Proving Ground, Aberdeen, MD, United States of America
| | - Shuchin Aeron
- Department of Electrical and Computer Engineering, Tufts University, Medford, MA, United States of America
|
127
|
Distribution-dependent feature selection for deep neural networks. Appl Intell 2021. [DOI: 10.1007/s10489-021-02663-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
128
|
Liu M, Katsevich E, Janson L, Ramdas A. Fast and powerful conditional randomization testing via distillation. Biometrika 2021. [DOI: 10.1093/biomet/asab039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/22/2022] Open
Abstract
Summary
We consider the problem of conditional independence testing: given a response $Y$ and covariates $(X,Z)$, we test the null hypothesis that $Y {\perp\!\!\!\perp} X \mid Z$. The conditional randomization test was recently proposed as a way to use distributional information about $X\mid Z$ to exactly and nonasymptotically control Type-I error using any test statistic in any dimensionality without assuming anything about $Y\mid (X,Z)$. This flexibility, in principle, allows one to derive powerful test statistics from complex prediction algorithms while maintaining statistical validity. Yet the direct use of such advanced test statistics in the conditional randomization test is prohibitively computationally expensive, especially with multiple testing, due to the requirement to recompute the test statistic many times on resampled data. We propose the distilled conditional randomization test, a novel approach to using state-of-the-art machine learning algorithms in the conditional randomization test while drastically reducing the number of times those algorithms need to be run, thereby taking advantage of their power and the conditional randomization test's statistical guarantees without suffering the usual computational expense. In addition to distillation, we propose a number of other tricks, like screening and recycling computations, to further speed up the conditional randomization test without sacrificing its high power and exact validity. Indeed, we show in simulations that all our proposals combined lead to a test that has similar power to the most powerful existing conditional randomization test implementations, but requires orders of magnitude less computation, making it a practical tool even for large datasets. We demonstrate these benefits on a breast cancer dataset by identifying biomarkers related to cancer stage.
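To make the resampling cost described in this abstract concrete, here is a minimal Python sketch of the plain conditional randomization test, not the distilled variant the authors propose. It assumes the law of $X \mid Z$ is known and can be sampled. The function names (`crt_p_value`, `abs_covariance`, `simulate`, `draw_x`) and the toy Gaussian model are illustrative assumptions and do not come from the paper.

```python
import random
import statistics


def abs_covariance(x, y, z):
    """Toy test statistic: absolute sample covariance of x and y (z is unused)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x))


def crt_p_value(x, y, z, sample_x_given_z, statistic, n_resamples=200, seed=0):
    """Plain CRT p-value for H0: Y is independent of X given Z.

    sample_x_given_z(z_i, rng) must draw from the (assumed known) law of X | Z = z_i.
    """
    rng = random.Random(seed)
    observed = statistic(x, y, z)
    exceed = 0
    for _ in range(n_resamples):
        # Redraw X from X | Z with Y and Z held fixed, then recompute the statistic.
        x_new = [sample_x_given_z(zi, rng) for zi in z]
        if statistic(x_new, y, z) >= observed:
            exceed += 1
    # The +1 terms count the observed data as one of the draws.
    return (1 + exceed) / (1 + n_resamples)


def simulate(n, beta, seed):
    """Hypothetical toy data: Z ~ N(0,1), X = Z + noise, Y = beta*X + Z + noise."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [beta * xi + zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return x, y, z


def draw_x(zi, rng):
    """Sampler matching the toy model: X | Z = z is N(z, 1)."""
    return zi + rng.gauss(0, 1)


x, y, z = simulate(n=500, beta=3.0, seed=1)
print(crt_p_value(x, y, z, draw_x, abs_covariance))
```

Each resample calls `statistic` once, so a statistic that fits an expensive learner would be refit `n_resamples` times per hypothesis; that repeated refitting is the cost the abstract says the distilled test is meant to reduce.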
Affiliation(s)
- Molei Liu
- Department of Biostatistics, Harvard Chan School of Public Health, 677 Huntington Avenue, Boston, Massachusetts 02115, U.S.A
| | - Eugene Katsevich
- Department of Statistics and Data Science, Wharton School of the University of Pennsylvania, 265 South 37th Street, Philadelphia, Pennsylvania 19104, U.S.A
| | - Lucas Janson
- Department of Statistics, Harvard University, One Oxford Street, Cambridge, Massachusetts 02138, U.S.A
| | - Aaditya Ramdas
- Department of Statistics & Data Science, Carnegie Mellon University, 132H Baker Hall, Pittsburgh, Pennsylvania 15213, U.S.A
|
129
|
Chia C, Sesia M, Ho CS, Jeffrey SS, Dionne J, Candes EJ, Howe RT. Interpretable Classification of Bacterial Raman Spectra with Knockoff Wavelets. IEEE J Biomed Health Inform 2021; 26:740-748. [PMID: 34232897 DOI: 10.1109/jbhi.2021.3094873] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/07/2022]
Abstract
Deep neural networks and other machine learning models are widely applied to biomedical signal data because they can detect complex patterns and compute accurate predictions. However, the difficulty of interpreting such models is a limitation, especially for applications involving high-stakes decisions, including the identification of bacterial infections. This paper considers fast Raman spectroscopy data and demonstrates that a logistic regression model with carefully selected features achieves accuracy comparable to that of neural networks, while being much simpler and more transparent. Our analysis leverages wavelet features with intuitive chemical interpretations, and performs controlled variable selection with knockoffs to ensure the predictors are relevant and non-redundant. Although we focus on a particular data set, the proposed approach is broadly applicable to other types of signal data for which interpretability may be important.
|
130
|
Liu Y, Ročková V. Variable Selection Via Thompson Sampling. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1928514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
Affiliation(s)
- Yi Liu
- Department of Statistics, University of Chicago, Chicago, IL
|
131
|
Liu Y, Ročková V, Wang Y. Variable selection with ABC Bayesian forests. J R Stat Soc Series B Stat Methodol 2021. [DOI: 10.1111/rssb.12423] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Affiliation(s)
- Yi Liu
- Department of Statistics, University of Chicago, Chicago, USA
| | | | - Yuexi Wang
- Booth School of Business, University of Chicago, Chicago, USA
|
132
|
Li J, Maathuis MH. GGM knockoff filter: False discovery rate control for Gaussian graphical models. J R Stat Soc Series B Stat Methodol 2021. [DOI: 10.1111/rssb.12430] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 01/23/2023]
Affiliation(s)
- Jinzhou Li
- Seminar für Statistik, ETH Zürich, Zürich, Switzerland
|
133
|
Affiliation(s)
| | - Zhigen Zhao
- Department of Statistical Science, Temple University, Philadelphia, PA
| | - Jun S. Liu
- Department of Statistics, Harvard University, Cambridge, MA
|
134
|
Jiang T, Li Y, Motsinger-Reif AA. Knockoff boosted tree for model-free variable selection. Bioinformatics 2021; 37:976-983. [PMID: 32966559 DOI: 10.1093/bioinformatics/btaa770] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/04/2020] [Revised: 08/17/2020] [Accepted: 09/09/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: The recently proposed knockoff filter is a general framework for controlling the false discovery rate (FDR) when performing variable selection. This powerful new approach generates a 'knockoff' of each variable tested for exact FDR control. Imitation variables that mimic the correlation structure found within the original variables serve as negative controls for statistical inference. Current applications of knockoff methods use linear regression models and conduct variable selection only for variables existing in model functions. Here, we extend the use of knockoffs for machine learning with boosted trees, which are successful and widely used in problems where no prior knowledge of model function is required. However, currently available importance scores in tree models are insufficient for variable selection with FDR control.
RESULTS: We propose a novel strategy for conducting variable selection without prior model topology knowledge using the knockoff method with boosted tree models. We extend the current knockoff method to model-free variable selection through the use of tree-based models. Additionally, we propose and evaluate two new sampling methods for generating knockoffs, namely the sparse covariance and principal component knockoff methods. We test and compare these methods with the original knockoff method regarding their ability to control type I errors and power. In simulation tests, we compare the properties and performance of importance test statistics of tree models. The results include different combinations of knockoffs and importance test statistics. We consider scenarios that include main-effect, interaction, exponential and second-order models while assuming the true model structures are unknown. We apply our algorithm for tumor purity estimation and tumor classification using Cancer Genome Atlas (TCGA) gene expression data. Our results show improved discrimination between difficult-to-discriminate cancer types.
AVAILABILITY AND IMPLEMENTATION: The proposed algorithm is included in the KOBT package, which is available at https://cran.r-project.org/web/packages/KOBT/index.html.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
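The selection step that the knockoff filter uses to control the FDR can be sketched in a few lines. The Python sketch below applies the standard data-dependent knockoff+ threshold to a vector of importance statistics W, where a large positive W_j means the original variable beat its knockoff. It is not the KOBT implementation; the function names and the example W values are hypothetical and chosen only for illustration.

```python
def knockoff_threshold(w, q=0.1, offset=1):
    """Smallest t > 0 with (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    offset=1 gives the knockoff+ rule; returns infinity if no t qualifies.
    """
    candidates = sorted({abs(v) for v in w if v != 0})
    for t in candidates:
        negatives = sum(1 for v in w if v <= -t)  # stand-in for false discoveries
        positives = sum(1 for v in w if v >= t)   # variables that would be selected
        if (offset + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(w, q=0.1, offset=1):
    """Indices of variables whose statistic reaches the threshold."""
    t = knockoff_threshold(w, q, offset)
    return [j for j, v in enumerate(w) if v >= t]


# Hypothetical importance statistics (e.g. original-minus-knockoff tree importance).
w = [6.0, 5.0, 4.0, 3.0, 2.5, 2.0, -1.0, 0.5, -0.2, 0.1]
print(knockoff_threshold(w, q=0.2))  # 2.0
print(knockoff_select(w, q=0.2))     # [0, 1, 2, 3, 4, 5]
```

Because negative statistics act as the estimate of false discoveries, the rule only needs the signs and magnitudes of W; any importance score, including tree-based ones, can be plugged in as long as swapping a variable with its knockoff flips the sign of its statistic.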
Affiliation(s)
- Tao Jiang
- Department of Statistics, Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695, USA
| | - Yuanyuan Li
- Biostatistics & Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, NC 27709, USA
| | - Alison A Motsinger-Reif
- Biostatistics & Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, NC 27709, USA
|
135
|
He Z, Liu L, Wang C, Le Guen Y, Lee J, Gogarten S, Lu F, Montgomery S, Tang H, Silverman EK, Cho MH, Greicius M, Ionita-Laza I. Identification of putative causal loci in whole-genome sequencing data via knockoff statistics. Nat Commun 2021; 12:3152. [PMID: 34035245 PMCID: PMC8149672 DOI: 10.1038/s41467-021-22889-4] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 07/26/2020] [Accepted: 03/26/2021] [Indexed: 02/04/2023] Open
Abstract
The analysis of whole-genome sequencing studies is challenging due to the large number of rare variants in noncoding regions and the lack of natural units for testing. We propose a statistical method to detect and localize rare and common risk variants in whole-genome sequencing studies based on a recently developed knockoff framework. It can (1) prioritize causal variants over associations due to linkage disequilibrium, thereby improving interpretability; (2) help distinguish the signal due to rare variants from shadow effects of significant common variants nearby; (3) integrate multiple knockoffs for improved power, stability, and reproducibility; and (4) flexibly incorporate state-of-the-art and future association tests to achieve the benefits proposed here. In applications to whole-genome sequencing data from the Alzheimer's Disease Sequencing Project (ADSP) and COPDGene samples from the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program, we show that our method, compared with conventional association tests, can lead to substantially more discoveries.
Affiliation(s)
- Zihuai He
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, USA.
- Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, USA.
| | - Linxi Liu
- Department of Statistics, Columbia University, New York, NY, USA
| | - Chen Wang
- Department of Biostatistics, Columbia University, New York, NY, USA
| | - Yann Le Guen
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, USA
| | - Justin Lee
- Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, USA
| | | | - Fred Lu
- Department of Statistics, Stanford University, Stanford, CA, USA
| | - Stephen Montgomery
- Department of Genetics, Stanford University, Stanford, CA, USA
- Department of Pathology, Stanford University, Stanford, CA, USA
| | - Hua Tang
- Department of Statistics, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Edwin K Silverman
- Channing Division of Network Medicine and Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Michael H Cho
- Channing Division of Network Medicine and Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Michael Greicius
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, USA
|
136
|
Song Z, Li J. Variable selection with false discovery rate control in deep neural networks. Nat Mach Intell 2021. [DOI: 10.1038/s42256-021-00308-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/19/2023]
|
137
|
Xue C, Zhang T, Xiao D. Output-Related and -Unrelated Fault Monitoring with an Improvement Prototype Knockoff Filter and Feature Selection Based on Laplacian Eigen Maps and Sparse Regression. ACS Omega 2021; 6:10828-10839. [PMID: 34056237 PMCID: PMC8153765 DOI: 10.1021/acsomega.1c00506] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/28/2021] [Accepted: 04/06/2021] [Indexed: 06/12/2023]
Abstract
In the process industry, output-related fault monitoring is an important step in ensuring product quality and improving economic benefits. To distinguish the influence of input variables on the output more accurately, this paper introduces a fault-unrelated block partition subalgorithm into the prototype knockoff filter (PKF) algorithm. The improved PKF algorithm divides the input data into three blocks: a fault-unrelated block, an output-related block, and an output-unrelated block. Removing the data of the fault-unrelated block can greatly reduce the difficulty of fault monitoring. For the output-unrelated block, this paper proposes a feature selection algorithm based on Laplacian eigenmaps and sparse regression. The algorithm can detect faults caused by variables with a small contribution to variance, and its descent property is proved theoretically. The output-related block is monitored by the Broyden-Fletcher-Goldfarb-Shanno method. Finally, the effectiveness of the proposed fault detection method is verified on the well-known Tennessee Eastman process data.
Affiliation(s)
- Cuiping Xue
- College of Science, Northeastern University, Shenyang 110819, China
| | - Tie Zhang
- College of Science, Northeastern University, Shenyang 110819, China
| | - Dong Xiao
- College of Information Science and Engineering and Liaoning Key Laboratory of Intelligent Diagnosis and Safety for Metallurgical Industry, Northeastern University, Shenyang 110819, China
|
138
|
Kormaksson M, Kelly LJ, Zhu X, Haemmerle S, Pricop L, Ohlssen D. Sequential knockoffs for continuous and categorical predictors: With application to a large psoriatic arthritis clinical trial pool. Stat Med 2021; 40:3313-3328. [PMID: 33899260 DOI: 10.1002/sim.8955] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/20/2020] [Revised: 02/22/2021] [Accepted: 03/01/2021] [Indexed: 01/10/2023]
Abstract
Knockoffs provide a general framework for controlling the false discovery rate when performing variable selection. Much of the knockoffs literature focuses on theoretical challenges, and we recognize a need for bringing some of the current ideas into practice. In this paper we propose a sequential algorithm for generating knockoffs when the underlying data consist of both continuous and categorical (factor) variables. Further, we present a heuristic multiple knockoffs approach that offers a practical assessment of how robust the knockoff selection process is for a given dataset. We conduct extensive simulations to validate the performance of the proposed methodology. Finally, we demonstrate the utility of the methods on a large clinical data pool of more than 2000 patients with psoriatic arthritis evaluated in four clinical trials with an IL-17A inhibitor, secukinumab (Cosentyx), where we determine prognostic factors of a well-established clinical outcome. The analyses presented in this paper could provide a wide range of applications to commonly encountered datasets in medical practice and other fields where variable selection is of particular interest.
Affiliation(s)
| | | | - Xuan Zhu
- Novartis Pharmaceuticals Corporation, East Hanover, New Jersey, USA
| | | | - Luminita Pricop
- Novartis Pharmaceuticals Corporation, East Hanover, New Jersey, USA
| | - David Ohlssen
- Novartis Pharmaceuticals Corporation, East Hanover, New Jersey, USA
|
139
|
Deb N, Saha S, Guntuboyina A, Sen B. Two-Component Mixture Model in the Presence of Covariates. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1888739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
Affiliation(s)
- Nabarun Deb
- Department of Statistics, Columbia University, New York, NY
|
140
|
Seiler C, Ferreira AM, Kronstad LM, Simpson LJ, Le Gars M, Vendrame E, Blish CA, Holmes S. CytoGLMM: conditional differential analysis for flow and mass cytometry experiments. BMC Bioinformatics 2021; 22:137. [PMID: 33752595 PMCID: PMC7983283 DOI: 10.1186/s12859-021-04067-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 12/09/2020] [Accepted: 03/03/2021] [Indexed: 11/10/2022] Open
Abstract
Background: Flow and mass cytometry are important modern immunology tools for measuring expression levels of multiple proteins on single cells. The goal is to better understand the mechanisms of responses on a single cell basis by studying differential expression of proteins. Most current data analysis tools compare expressions across many computationally discovered cell types. Our goal is to focus on just one cell type. Our narrower field of application allows us to define a more specific statistical model with easier to control statistical guarantees.
Results: Differential analysis of marker expressions can be difficult due to marker correlations and inter-subject heterogeneity, particularly for studies of human immunology. We address these challenges with two multiple regression strategies: a bootstrapped generalized linear model and a generalized linear mixed model. On simulated datasets, we compare the robustness towards marker correlations and heterogeneity of both strategies. For paired experiments, we find that both strategies maintain the target false discovery rate under medium correlations and that mixed models are statistically more powerful under the correct model specification. For unpaired experiments, our results indicate that much larger patient sample sizes are required to detect differences. We illustrate the CytoGLMM R package and workflow for both strategies on a pregnancy dataset.
Conclusion: Our approach to finding differential proteins in flow and mass cytometry data reduces biases arising from marker correlations and safeguards against false discoveries induced by patient heterogeneity.
Affiliation(s)
- Christof Seiler
- Department of Data Science and Knowledge Engineering, Maastricht University, Maastricht, The Netherlands
- Mathematics Centre Maastricht, Maastricht University, Maastricht, The Netherlands
- Department of Statistics, Stanford University, Stanford, USA
| | | | - Lisa M Kronstad
- Immunology Program, Stanford University School of Medicine, Stanford, USA
- Department of Medicine, Stanford University School of Medicine, Stanford, USA
- Department of Microbiology and Immunology, Midwestern University, Downers Grove, USA
| | - Laura J Simpson
- Immunology Program, Stanford University School of Medicine, Stanford, USA
- Department of Medicine, Stanford University School of Medicine, Stanford, USA
| | - Mathieu Le Gars
- Immunology Program, Stanford University School of Medicine, Stanford, USA
- Department of Medicine, Stanford University School of Medicine, Stanford, USA
| | - Elena Vendrame
- Immunology Program, Stanford University School of Medicine, Stanford, USA
- Department of Medicine, Stanford University School of Medicine, Stanford, USA
| | - Catherine A Blish
- Immunology Program, Stanford University School of Medicine, Stanford, USA
- Department of Medicine, Stanford University School of Medicine, Stanford, USA
- Chan Zuckerberg Biohub, San Francisco, USA
| | - Susan Holmes
- Department of Statistics, Stanford University, Stanford, USA
|
141
|
Decoding with confidence: Statistical control on decoder maps. Neuroimage 2021; 234:117921. [PMID: 33722670 DOI: 10.1016/j.neuroimage.2021.117921] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/30/2020] [Revised: 02/17/2021] [Accepted: 02/21/2021] [Indexed: 11/22/2022] Open
Abstract
In brain imaging, decoding is widely used to infer relationships between brain and cognition, or to craft brain-imaging biomarkers of pathologies. Yet, standard decoding procedures do not come with statistical guarantees, and thus do not give confidence bounds to interpret the pattern maps that they produce. Indeed, in whole-brain decoding settings, the number of explanatory variables is much greater than the number of samples, hence classical statistical inference methodology cannot be applied. Specifically, the standard practice that consists in thresholding decoding maps is not a correct inference procedure. We contribute a new statistical-testing framework for this type of inference. To overcome the statistical inefficiency of voxel-level control, we generalize the Family Wise Error Rate (FWER) to account for a spatial tolerance δ, introducing the δ-Family Wise Error Rate (δ-FWER). Then, we present a decoding procedure that can control the δ-FWER: the Ensemble of Clustered Desparsified Lasso (EnCluDL), a procedure for multivariate statistical inference on high-dimensional structured data. We evaluate the statistical properties of EnCluDL with a thorough empirical study, along with three alternative procedures including decoder map thresholding. We show that EnCluDL exhibits the best recovery properties while ensuring the expected statistical control.
|
142
|
Descloux P, Sardy S. Model Selection With Lasso-Zero: Adding Straw to the Haystack to Better Find Needles. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2020.1869026] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
Affiliation(s)
| | - Sylvain Sardy
- Department of Mathematics, University of Geneva, Geneva, Switzerland
|
143
|
Aggregating Knockoffs for False Discovery Rate Control with an Application to Gut Microbiome Data. Entropy 2021; 23:e23020230. [PMID: 33669462 PMCID: PMC7920469 DOI: 10.3390/e23020230] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/20/2021] [Accepted: 02/11/2021] [Indexed: 12/31/2022]
Abstract
Recent discoveries suggest that our gut microbiome plays an important role in our health and wellbeing. However, the gut microbiome data are intricate; for example, the microbial diversity in the gut makes the data high-dimensional. While there are dedicated high-dimensional methods, such as the lasso estimator, they always come with the risk of false discoveries. Knockoffs are a recent approach to control the number of false discoveries. In this paper, we show that knockoffs can be aggregated to increase power while retaining sharp control over the false discoveries. We support our method both in theory and simulations, and we show that it can lead to new discoveries on microbiome data from the American Gut Project. In particular, our results indicate that several phyla that have been overlooked so far are associated with obesity.
|
144
|
Carpentier A, Delattre S, Roquain E, Verzelen N. Estimating minimum effect with outlier selection. Ann Stat 2021. [DOI: 10.1214/20-aos1956] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
|
145
|
Demirkaya E, Feng Y, Basu P, Lv J. Large-scale model selection in misspecified generalized linear models. Biometrika 2021. [DOI: 10.1093/biomet/asab005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022] Open
Abstract
Summary
Model selection is crucial both to high-dimensional learning and to inference for contemporary big data applications in pinpointing the best set of covariates among a sequence of candidate interpretable models. Most existing work implicitly assumes that the models are correctly specified or have fixed dimensionality, yet both model misspecification and high dimensionality are prevalent in practice. In this paper, we exploit the framework of model selection principles under the misspecified generalized linear models presented in Lv & Liu (2014), and investigate the asymptotic expansion of the posterior model probability in the setting of high-dimensional misspecified models. With a natural choice of prior probabilities that encourages interpretability and incorporates the Kullback–Leibler divergence, we suggest using the high-dimensional generalized Bayesian information criterion with prior probability for large-scale model selection with misspecification. Our new information criterion characterizes the impacts of both model misspecification and high dimensionality on model selection. We further establish the consistency of covariance contrast matrix estimation and the model selection consistency of the new information criterion in ultrahigh dimensions under some mild regularity conditions. Our numerical studies demonstrate that the proposed method enjoys improved model selection consistency over its main competitors.
|
146
|
Zhu G, Zhao T. Deep-gKnock: Nonlinear group-feature selection with deep neural networks. Neural Netw 2021; 135:139-147. [PMID: 33385830 DOI: 10.1016/j.neunet.2020.12.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/08/2020] [Revised: 11/26/2020] [Accepted: 12/02/2020] [Indexed: 01/21/2023]
Abstract
Feature selection is central to contemporary high-dimensional data analysis. Group structure among features arises naturally in various scientific problems. Many methods have been proposed to incorporate the group structure information into feature selection. However, these methods are normally restricted to a linear regression setting. To relax the linear constraint, we design a new Deep Neural Network (DNN) architecture and integrate it with the recently proposed knockoff technique to perform nonlinear group-feature selection with controlled group-wise False Discovery Rate (gFDR). Experimental results on high-dimensional synthetic data demonstrate that our method achieves the highest power and accurate gFDR control compared with state-of-the-art methods. The performance of Deep-gKnock is especially superior in the following five situations: (1) nonlinear relationships; (2) dimension p greater than sample size n; (3) high between-group correlation; (4) high within-group correlation; (5) a large number of associated groups. Deep-gKnock is also shown to be robust to misspecification of the feature distribution and to changes in the network architecture. Moreover, Deep-gKnock achieves scientifically meaningful group-feature selection results for cutting-edge real-world datasets.
Affiliation(s)
- Guangyu Zhu
  - Department of Computer Science and Statistics, University of Rhode Island, United States of America
- Tingting Zhao
  - Department of Electrical and Computer Engineering, Northeastern University, United States of America
|
147
|
Schultheiss C, Renaux C, Bühlmann P. Multicarving for high-dimensional post-selection inference. Electron J Stat 2021. [DOI: 10.1214/21-ejs1825] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
|
148
|
Yang S, Wen J, Eckert ST, Wang Y, Liu DJ, Wu R, Li R, Zhan X. Prioritizing genetic variants in GWAS with lasso using permutation-assisted tuning. Bioinformatics 2020; 36:3811-3817. [PMID: 32246825 DOI: 10.1093/bioinformatics/btaa229] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/30/2019] [Revised: 02/19/2020] [Accepted: 03/31/2020] [Indexed: 01/13/2023]
Abstract
MOTIVATION: Large-scale genome-wide association studies (GWAS) have resulted in the identification of a wide range of genetic variants related to a host of complex traits and disorders. Despite their success, the individual single-nucleotide polymorphism (SNP) analysis approach adopted in most current GWAS can be limited in that it is usually biologically too simple to elucidate a comprehensive genetic architecture of phenotypes and statistically underpowered due to the heavy multiple-testing correction burden. On the other hand, multiple-SNP analyses (e.g. gene-based or region-based SNP-set analysis) are usually more powerful for examining the joint effects of a set of SNPs on the phenotype of interest. However, current multiple-SNP approaches can only draw an overall conclusion at the SNP-set level and do not directly indicate which SNPs in the SNP-set are driving the overall genotype-phenotype association. RESULTS: In this article, we propose a new permutation-assisted tuning procedure in lasso (plasso) to identify phenotype-associated SNPs in a joint multiple-SNP regression model in GWAS. The tuning parameter of lasso determines the amount of shrinkage and is essential to the performance of variable selection. In the proposed plasso procedure, we first generate permutations as pseudo-SNPs that are not associated with the phenotype. Then, the lasso tuning parameter is delicately chosen to separate true signal SNPs from non-informative pseudo-SNPs. We illustrate plasso using simulations to demonstrate its superior performance over existing methods, and an application of plasso to a real GWAS dataset yields new insights into the genetic control of complex traits. AVAILABILITY AND IMPLEMENTATION: R code to implement the proposed methodology is available at https://github.com/xyz5074/plasso. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Songshan Yang
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802
- Jiawei Wen
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802
- Scott T Eckert
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033
- Yaqun Wang
  - Department of Biostatistics, Rutgers University, New Brunswick, NJ 08901, USA
- Dajiang J Liu
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033
- Rongling Wu
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033
- Runze Li
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802
- Xiang Zhan
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033
|
149
|
Tansey W, Wang Y, Rabadan R, Blei DM. Double Empirical Bayes Testing. Int Stat Rev 2020; 88:S91-S113. [PMID: 35356801 PMCID: PMC8963776 DOI: 10.1111/insr.12430] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/19/2020] [Accepted: 10/20/2020] [Indexed: 12/18/2022]
Abstract
Analyzing data from large-scale, multi-experiment studies requires scientists both to analyze each experiment and to assess the results as a whole. In this article, we develop double empirical Bayes testing (DEBT), an empirical Bayes method for analyzing multi-experiment studies when many covariates are gathered per experiment. DEBT is a two-stage method: in the first stage, it reports which experiments yielded significant outcomes; in the second stage, it hypothesizes which covariates drive the experimental significance. In both of its stages, DEBT builds on Efron (2008), which lays out an elegant empirical Bayes approach to testing. DEBT enhances this framework by learning a series of black box predictive models to boost power and control the false discovery rate (FDR). In Stage 1, it uses a deep neural network prior to report which experiments yielded significant outcomes. In Stage 2, it uses an empirical Bayes version of the knockoff filter (Candes et al., 2018) to select covariates that have significant predictive power for Stage-1 significance. In both simulated and real data, DEBT increases the proportion of discovered significant outcomes and selects more features when signals are weak. In a real study of cancer cell lines, DEBT selects a robust set of biologically plausible genomic drivers of drug sensitivity and resistance in cancer.
Affiliation(s)
- Wesley Tansey
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Yixin Wang
  - Department of Statistics, Columbia University, New York, NY, USA
- Raul Rabadan
  - Department of Systems Biology, Columbia University Medical Center, New York, NY, USA
- David M. Blei
  - Department of Statistics, Columbia University, New York, NY, USA
  - Department of Computer Science, Columbia University, New York, NY, USA
|
150
|
Katsevich E, Ramdas A. Simultaneous high-probability bounds on the false discovery proportion in structured, regression and online settings. Ann Stat 2020. [DOI: 10.1214/19-aos1938] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]
|