1
Zimmermann MR, Baillie M, Kormaksson M, Ohlssen D, Sechidis K. All that Glitters Is not Gold: Type-I Error Controlled Variable Selection from Clinical Trial Data. Clin Pharmacol Ther 2024; 115:774-785. [PMID: 38419357 DOI: 10.1002/cpt.3211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Accepted: 02/02/2024] [Indexed: 03/02/2024]
Abstract
Clinical trials are primarily conducted to estimate causal effects, but the data collected can also be invaluable for additional research, such as identifying prognostic measures of disease or biomarkers that predict treatment efficacy. However, these exploratory settings are prone to false discoveries (type-I errors) due to the multiple comparisons they entail. Unfortunately, many methods fail to address this issue, in part because the algorithms used are generally designed to optimize predictions and often only provide the measures used for variable selection, such as machine learning model importance scores, as a byproduct. To address the resulting unclear uncertainty in the selection sets, the knockoff framework offers a model-agnostic, robust approach to variable selection with guaranteed type-I error control. Here, we review the knockoff framework in the setting of clinical data, highlighting main considerations using simulation studies. We also extend the framework by introducing a novel knockoff generation method that addresses two main limitations of previously suggested methods relevant for clinical development settings. With this new method, we empirically obtain tighter bounds on type-I error control and gain an order of magnitude in computational efficiency in mixed data settings. We demonstrate comparable selections to those of the competing method for identifying prognostic biomarkers for C-reactive protein levels in patients with psoriatic arthritis in four clinical trials. Our work increases access to the knockoff framework for variable selection from clinical trial data. Hereby, this paper helps to address the current replicability crisis which can result in unnecessary research efforts, increased patient burden, and avoidable costs.
Affiliation(s)
- David Ohlssen
- Novartis Pharmaceuticals Corporation, East Hanover, New Jersey, USA
2
Luo D, Ebadi A, Emery K, He Y, Noble WS, Keich U. Competition-based control of the false discovery proportion. Biometrics 2023; 79:3472-3484. [PMID: 36652258 DOI: 10.1111/biom.13830] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Revised: 10/12/2022] [Accepted: 01/02/2023] [Indexed: 01/19/2023]
Abstract
Recently, Barber and Candès laid the theoretical foundation for a general framework for false discovery rate (FDR) control based on the notion of "knockoffs." A closely related FDR control methodology has long been employed in the analysis of mass spectrometry data, referred to there as "target-decoy competition" (TDC). However, any approach that aims to control the FDR, which is defined as the expected value of the false discovery proportion (FDP), suffers from a problem. Specifically, even when successfully controlling the FDR at level α, the FDP in the list of discoveries can significantly exceed α. We offer FDP-SD, a new procedure that rigorously controls the FDP in the knockoff/TDC competition setup by guaranteeing that the FDP is bounded by α at a desired confidence level. Compared with the recently published framework of Katsevich and Ramdas, FDP-SD generally delivers more power and often substantially so in simulated and real data.
Affiliation(s)
- Dong Luo
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
- Arya Ebadi
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
- Kristen Emery
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
- Yilun He
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
- Uri Keich
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
3
Dai C, Lin B, Xing X, Liu JS. A Scale-free Approach for False Discovery Rate Control in Generalized Linear Models. J Am Stat Assoc 2023. [DOI: 10.1080/01621459.2023.2165930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2023]
Affiliation(s)
- Buyu Lin
- Department of Statistics, Harvard University
- Xin Xing
- Department of Statistics, Virginia Tech
- Jun S. Liu
- Department of Statistics, Harvard University
4
Yuan P, Feng S, Li G. Revisiting feature selection for linear models with FDR and power guarantees. J Korean Stat Soc 2022. [DOI: 10.1007/s42952-022-00179-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
5
Affiliation(s)
- Xu Guo
- School of Statistics, Beijing Normal University, Beijing, China
- Haojie Ren
- School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China
- Changliang Zou
- School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin, China
- Runze Li
- Department of Statistics, The Pennsylvania State University, University Park, PA
6
Sarkar SK, Tang CY. Adjusting the Benjamini-Hochberg method for controlling the false discovery rate in knockoff-assisted variable selection. Biometrika 2021. [DOI: 10.1093/biomet/asab066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
Summary
We consider the knockoff-based multiple testing setup of Barber & Candès (2015) for variable selection in multiple regression. The method of Benjamini & Hochberg (1995) and an adaptive version of it are adjusted to this setup, transforming them to valid p-value based, false discovery rate controlling methods that do not rely on specifying the correlation structure of the explanatory variables. Simulations and real data applications show that our proposed methods are powerful competitors of the false discovery rate controlling method in Barber & Candès (2015).
Affiliation(s)
- Sanat K Sarkar
- Department of Statistical Science, Temple University, 1810 Liacouras Walk, Philadelphia, Pennsylvania 19122-6083, U.S.A
- Cheng Yong Tang
- Department of Statistical Science, Temple University, 1810 Liacouras Walk, Philadelphia, Pennsylvania 19122-6083, U.S.A
7
Affiliation(s)
- Zhimei Ren
- Department of Statistics, University of Chicago, Chicago, IL
- Yuting Wei
- Statistics & Data Science Department, University of Pennsylvania, Philadelphia, PA
- Emmanuel Candès
- Department of Mathematics, Department of Statistics, Stanford University, Stanford, CA
8
Sechidis K, Kormaksson M, Ohlssen D. Using knockoffs for controlled predictive biomarker identification. Stat Med 2021; 40:5453-5473. [PMID: 34328655 DOI: 10.1002/sim.9134] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/21/2020] [Revised: 03/18/2021] [Accepted: 06/22/2021] [Indexed: 12/20/2022]
Abstract
One of the key challenges of personalized medicine is to identify which patients will respond positively to a given treatment. The area of subgroup identification focuses on this challenge, that is, identifying groups of patients that experience desirable characteristics, such as an enhanced treatment effect. A crucial first step towards the subgroup identification is to identify the baseline variables (eg, biomarkers) that influence the treatment effect, which are known as predictive variables. Many subgroup discovery algorithms return importance scores that capture the variables' predictive strength. However, a major limitation of these scores is that they do not answer the core question: "Which variables are actually predictive?" With our work we answer this question by using the knockoff framework, which is a general framework for controlling the false discovery rate when performing prognostic variable selection. In contrast, our work is the first that uses knockoffs for predictive variable selection. We introduce two novel knockoff filters: one parametric, building on variable importance scores derived from a penalized linear regression model, and one non-parametric, building on causal forest variable importance scores. We conduct extensive simulations to validate performance of the proposed methodology and we also apply the proposed methods to data from a randomized clinical trial.
Affiliation(s)
- Matthias Kormaksson
- Advanced Methodology and Data Science, Novartis Pharmaceuticals Corporation, East Hanover, New Jersey, USA
- David Ohlssen
- Advanced Methodology and Data Science, Novartis Pharmaceuticals Corporation, East Hanover, New Jersey, USA
9
Liu M, Katsevich E, Janson L, Ramdas A. Fast and powerful conditional randomization testing via distillation. Biometrika 2021. [DOI: 10.1093/biomet/asab039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/22/2022] Open
Abstract
Summary
We consider the problem of conditional independence testing: given a response $Y$ and covariates $(X,Z)$, we test the null hypothesis that $Y {\perp\!\!\!\perp} X \mid Z$. The conditional randomization test was recently proposed as a way to use distributional information about $X\mid Z$ to exactly and nonasymptotically control Type-I error using any test statistic in any dimensionality without assuming anything about $Y\mid (X,Z)$. This flexibility, in principle, allows one to derive powerful test statistics from complex prediction algorithms while maintaining statistical validity. Yet the direct use of such advanced test statistics in the conditional randomization test is prohibitively computationally expensive, especially with multiple testing, due to the requirement to recompute the test statistic many times on resampled data. We propose the distilled conditional randomization test, a novel approach to using state-of-the-art machine learning algorithms in the conditional randomization test while drastically reducing the number of times those algorithms need to be run, thereby taking advantage of their power and the conditional randomization test’s statistical guarantees without suffering the usual computational expense. In addition to distillation, we propose a number of other tricks, like screening and recycling computations, to further speed up the conditional randomization test without sacrificing its high power and exact validity. Indeed, we show in simulations that all our proposals combined lead to a test that has similar power to most powerful existing conditional randomization test implementations, but requires orders of magnitude less computation, making it a practical tool even for large datasets. We demonstrate these benefits on a breast cancer dataset by identifying biomarkers related to cancer stage.
Affiliation(s)
- Molei Liu
- Department of Biostatistics, Harvard Chan School of Public Health, 677 Huntington Avenue, Boston, Massachusetts 02115, U.S.A
- Eugene Katsevich
- Department of Statistics and Data Science, Wharton School of the University of Pennsylvania, 265 South 37th Street, Philadelphia, Pennsylvania 19104, U.S.A
- Lucas Janson
- Department of Statistics, Harvard University, One Oxford Street, Cambridge, Massachusetts 02138, U.S.A
- Aaditya Ramdas
- Department of Statistics & Data Science, Carnegie Mellon University, 132H Baker Hall, Pittsburgh, Pennsylvania 15213, U.S.A
10
Li J, Maathuis MH. GGM knockoff filter: False discovery rate control for Gaussian graphical models. J R Stat Soc Series B Stat Methodol 2021. [DOI: 10.1111/rssb.12430] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 01/23/2023]
Affiliation(s)
- Jinzhou Li
- Seminar für Statistik, ETH Zürich, Zürich, Switzerland
11
He Z, Liu L, Wang C, Le Guen Y, Lee J, Gogarten S, Lu F, Montgomery S, Tang H, Silverman EK, Cho MH, Greicius M, Ionita-Laza I. Identification of putative causal loci in whole-genome sequencing data via knockoff statistics. Nat Commun 2021; 12:3152. [PMID: 34035245 PMCID: PMC8149672 DOI: 10.1038/s41467-021-22889-4] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 07/26/2020] [Accepted: 03/26/2021] [Indexed: 02/04/2023] Open
Abstract
The analysis of whole-genome sequencing studies is challenging due to the large number of rare variants in noncoding regions and the lack of natural units for testing. We propose a statistical method to detect and localize rare and common risk variants in whole-genome sequencing studies based on a recently developed knockoff framework. It can (1) prioritize causal variants over associations due to linkage disequilibrium thereby improving interpretability; (2) help distinguish the signal due to rare variants from shadow effects of significant common variants nearby; (3) integrate multiple knockoffs for improved power, stability, and reproducibility; and (4) flexibly incorporate state-of-the-art and future association tests to achieve the benefits proposed here. In applications to whole-genome sequencing data from the Alzheimer's Disease Sequencing Project (ADSP) and COPDGene samples from NHLBI Trans-Omics for Precision Medicine (TOPMed) Program we show that our method compared with conventional association tests can lead to substantially more discoveries.
Affiliation(s)
- Zihuai He
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, USA
- Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, USA
- Linxi Liu
- Department of Statistics, Columbia University, New York, NY, USA
- Chen Wang
- Department of Biostatistics, Columbia University, New York, NY, USA
- Yann Le Guen
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, USA
- Justin Lee
- Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, USA
- Fred Lu
- Department of Statistics, Stanford University, Stanford, CA, USA
- Stephen Montgomery
- Department of Genetics, Stanford University, Stanford, CA, USA
- Department of Pathology, Stanford University, Stanford, CA, USA
- Hua Tang
- Department of Statistics, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
- Edwin K Silverman
- Channing Division of Network Medicine and Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Michael H Cho
- Channing Division of Network Medicine and Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Michael Greicius
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, USA
12
Goeman JJ, Hemerik J, Solari A. Only closed testing procedures are admissible for controlling false discovery proportions. Ann Stat 2021. [DOI: 10.1214/20-aos1999] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022]
Affiliation(s)
- Jelle J. Goeman
- Department of Biomedical Data Sciences, Leiden University Medical Center
- Jesse Hemerik
- Oslo Centre for Biostatistics and Epidemiology, University of Oslo, and Biometris, Wageningen University & Research
- Aldo Solari
- Department of Economics, Management and Statistics, University of Milano-Bicocca
13
Decoding with confidence: Statistical control on decoder maps. Neuroimage 2021; 234:117921. [PMID: 33722670 DOI: 10.1016/j.neuroimage.2021.117921] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/30/2020] [Revised: 02/17/2021] [Accepted: 02/21/2021] [Indexed: 11/22/2022] Open
Abstract
In brain imaging, decoding is widely used to infer relationships between brain and cognition, or to craft brain-imaging biomarkers of pathologies. Yet, standard decoding procedures do not come with statistical guarantees, and thus do not give confidence bounds to interpret the pattern maps that they produce. Indeed, in whole-brain decoding settings, the number of explanatory variables is much greater than the number of samples, hence classical statistical inference methodology cannot be applied. Specifically, the standard practice that consists in thresholding decoding maps is not a correct inference procedure. We contribute a new statistical-testing framework for this type of inference. To overcome the statistical inefficiency of voxel-level control, we generalize the Family Wise Error Rate (FWER) to account for a spatial tolerance δ, introducing the δ-Family Wise Error Rate (δ-FWER). Then, we present a decoding procedure that can control the δ-FWER: the Ensemble of Clustered Desparsified Lasso (EnCluDL), a procedure for multivariate statistical inference on high-dimensional structured data. We evaluate the statistical properties of EnCluDL with a thorough empirical study, along with three alternative procedures including decoder map thresholding. We show that EnCluDL exhibits the best recovery properties while ensuring the expected statistical control.
14
15
Javanmard A, Lee JD. A flexible framework for hypothesis testing in high dimensions. J R Stat Soc Series B Stat Methodol 2020. [DOI: 10.1111/rssb.12373] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/28/2022]
16
Tardivel PJC, Servien R, Concordet D. Simple expressions of the LASSO and SLOPE estimators in low-dimension. STATISTICS-ABINGDON 2020. [DOI: 10.1080/02331888.2020.1720019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/25/2022]
Affiliation(s)
- Rémi Servien
- INTHERES, Université de Toulouse, INRA, ENVT, Toulouse, France
17
18
Rosenblatt JD, Ritov Y, Goeman JJ. Discussion of ‘Gene hunting with hidden Markov model knockoffs’. Biometrika 2019. [DOI: 10.1093/biomet/asy062] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022] Open
Affiliation(s)
- Jonathan D Rosenblatt
- Department of Industrial Engineering and Management, Ben Gurion University of the Negev, Beer Sheva 84105, Israel
- Ya’acov Ritov
- Department of Statistics, University of Michigan, 1085 South University, Ann Arbor, Michigan, U.S.A
- Jelle J Goeman
- Department of Biomedical Data Sciences, Leiden University Medical Center, Albinusdreef 2, ZA Leiden, The Netherlands
19
20
Su WJ. When is the first spurious variable selected by sequential regression procedures? Biometrika 2018. [DOI: 10.1093/biomet/asy032] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022] Open
Affiliation(s)
- Weijie J Su
- Department of Statistics, University of Pennsylvania, 472 John M. Huntsman Hall, 3730 Walnut Street, Philadelphia, Pennsylvania 19104, U.S.A
21
Candès E, Fan Y, Janson L, Lv J. Panning for gold: ‘model‐X’ knockoffs for high dimensional controlled variable selection. J R Stat Soc Series B Stat Methodol 2018. [DOI: 10.1111/rssb.12265] [Citation(s) in RCA: 189] [Impact Index Per Article: 31.5] [Indexed: 12/31/2022]
Affiliation(s)
- Yingying Fan
- University of Southern California, Los Angeles, USA
- Jinchi Lv
- University of Southern California, Los Angeles, USA