1
Hédou J, Marić I, Bellan G, Einhaus J, Gaudillière DK, Ladant FX, Verdonk F, Stelzer IA, Feyaerts D, Tsai AS, Ganio EA, Sabayev M, Gillard J, Amar J, Cambriel A, Oskotsky TT, Roldan A, Golob JL, Sirota M, Bonham TA, Sato M, Diop M, Durand X, Angst MS, Stevenson DK, Aghaeepour N, Montanari A, Gaudillière B. Discovery of sparse, reliable omic biomarkers with Stabl. Nat Biotechnol 2024; 42:1581-1593. [PMID: 38168992 PMCID: PMC11217152 DOI: 10.1038/s41587-023-02033-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 02/20/2023] [Accepted: 10/16/2023] [Indexed: 01/05/2024]
Abstract
Adoption of high-content omic technologies in clinical studies, coupled with computational methods, has yielded an abundance of candidate biomarkers. However, translating such findings into bona fide clinical biomarkers remains challenging. To facilitate this process, we introduce Stabl, a general machine learning method that identifies a sparse, reliable set of biomarkers by integrating noise injection and a data-driven signal-to-noise threshold into multivariable predictive modeling. Evaluation of Stabl on synthetic datasets and five independent clinical studies demonstrates improved biomarker sparsity and reliability compared to commonly used sparsity-promoting regularization methods while maintaining predictive performance; it distills datasets containing 1,400-35,000 features down to 4-34 candidate biomarkers. Stabl extends to multi-omic integration tasks, enabling biological interpretation of complex predictive models, as it hones in on a shortlist of proteomic, metabolomic and cytometric events predicting labor onset, microbial biomarkers of pre-term birth and a pre-operative immune signature of post-surgical infections. Stabl is available at https://github.com/gregbellan/Stabl.
Affiliation(s)
- Julien Hédou
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Ivana Marić
  - Department of Pediatrics, Stanford University, Stanford, CA, USA
- Grégoire Bellan
  - Télécom Paris, Institut Polytechnique de Paris, Paris, France
- Jakob Einhaus
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
  - Department of Pathology and Neuropathology, University Hospital and Comprehensive Cancer Center Tübingen, Tübingen, Germany
- Dyani K Gaudillière
  - Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University, Stanford, CA, USA
- Franck Verdonk
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
  - Sorbonne University, GRC 29, AP-HP, DMU DREAM, Department of Anesthesiology and Intensive Care, Hôpital Saint-Antoine, Assistance Publique-Hôpitaux de Paris, Paris, France
- Ina A Stelzer
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
  - Department of Pathology, University of California San Diego, La Jolla, CA, USA
- Dorien Feyaerts
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Amy S Tsai
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Edward A Ganio
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Maximilian Sabayev
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Joshua Gillard
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
  - Department of Medical BioSciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Jonas Amar
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Amelie Cambriel
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Tomiko T Oskotsky
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Alennie Roldan
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Jonathan L Golob
  - Department of Medicine, University of Michigan Medical School, Ann Arbor, MI, USA
- Marina Sirota
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Thomas A Bonham
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Masaki Sato
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Maïgane Diop
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Xavier Durand
  - École Polytechnique, Institut Polytechnique de Paris, Paris, France
- Martin S Angst
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Nima Aghaeepour
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
  - Department of Pediatrics, Stanford University, Stanford, CA, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Andrea Montanari
  - Department of Statistics, Stanford University, Stanford, CA, USA
  - Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Brice Gaudillière
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
  - Department of Pediatrics, Stanford University, Stanford, CA, USA
2
Yu CX, Gu J, Chen Z, He Z. Summary statistics knockoffs inference with family-wise error rate control. Biometrics 2024; 80:ujae082. [PMID: 39222026 PMCID: PMC11367731 DOI: 10.1093/biomtc/ujae082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2023] [Revised: 07/29/2024] [Accepted: 08/12/2024] [Indexed: 09/04/2024]
Abstract
Testing multiple hypotheses of conditional independence with provable error rate control is a fundamental problem with various applications. To infer conditional independence with family-wise error rate (FWER) control when only summary statistics of marginal dependence are accessible, we adopt GhostKnockoff to directly generate knockoff copies of summary statistics and propose a new filter to select features conditionally dependent on the response. In addition, we develop a computationally efficient algorithm to greatly reduce the computational cost of knockoff copies generation without sacrificing power and FWER control. Experiments on simulated data and a real dataset of Alzheimer's disease genetics demonstrate the advantage of the proposed method over existing alternatives in both statistical power and computational efficiency.
Affiliation(s)
- Catherine Xinrui Yu
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong, 999077, China
- Jiaqi Gu
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, California, 94304, United States
- Zhaomeng Chen
  - Department of Statistics, Stanford University, Stanford, California, 94305, United States
- Zihuai He
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, California, 94304, United States
  - Department of Medicine (Biomedical Informatics Research), Stanford University, Stanford, California, 94304, United States
3
Zimmermann MR, Baillie M, Kormaksson M, Ohlssen D, Sechidis K. All that Glitters Is not Gold: Type-I Error Controlled Variable Selection from Clinical Trial Data. Clin Pharmacol Ther 2024; 115:774-785. [PMID: 38419357 DOI: 10.1002/cpt.3211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Accepted: 02/02/2024] [Indexed: 03/02/2024]
Abstract
Clinical trials are primarily conducted to estimate causal effects, but the data collected can also be invaluable for additional research, such as identifying prognostic measures of disease or biomarkers that predict treatment efficacy. However, these exploratory settings are prone to false discoveries (type-I errors) due to the multiple comparisons they entail. Unfortunately, many methods fail to address this issue, in part because the algorithms used are generally designed to optimize predictions and often only provide the measures used for variable selection, such as machine learning model importance scores, as a byproduct. To address the resulting unclear uncertainty in the selection sets, the knockoff framework offers a model-agnostic, robust approach to variable selection with guaranteed type-I error control. Here, we review the knockoff framework in the setting of clinical data, highlighting main considerations using simulation studies. We also extend the framework by introducing a novel knockoff generation method that addresses two main limitations of previously suggested methods relevant for clinical development settings. With this new method, we empirically obtain tighter bounds on type-I error control and gain an order of magnitude in computational efficiency in mixed data settings. We demonstrate comparable selections to those of the competing method for identifying prognostic biomarkers for C-reactive protein levels in patients with psoriatic arthritis in four clinical trials. Our work increases access to the knockoff framework for variable selection from clinical trial data. Hereby, this paper helps to address the current replicability crisis which can result in unnecessary research efforts, increased patient burden, and avoidable costs.
Affiliation(s)
- David Ohlssen
  - Novartis Pharmaceuticals Corporation, East Hanover, New Jersey, USA
4
Huang SH, Shedden K, Chang HW. Inference for the dimension of a regression relationship using pseudo-covariates. Biometrics 2023; 79:2394-2403. [PMID: 36511353 DOI: 10.1111/biom.13812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2022] [Accepted: 11/25/2022] [Indexed: 12/14/2022]
Abstract
In data analysis using dimension reduction methods, the main goal is to summarize how the response is related to the covariates through a few linear combinations. One key issue is to determine the number of independent, relevant covariate combinations, which is the dimension of the sufficient dimension reduction (SDR) subspace. In this work, we propose an easily-applied approach to conduct inference for the dimension of the SDR subspace, based on augmentation of the covariate set with simulated pseudo-covariates. Applying the partitioning principle to the possible dimensions, we use rigorous sequential testing to select the dimensionality, by comparing the strength of the signal arising from the actual covariates to that appearing to arise from the pseudo-covariates. We show that under a "uniform direction" condition, our approach can be used in conjunction with several popular SDR methods, including sliced inverse regression. In these settings, the test statistic asymptotically follows a beta distribution and therefore is easily calibrated. Moreover, the family-wise type I error rate of our sequential testing is rigorously controlled. Simulation studies and an analysis of newborn anthropometric data demonstrate the robustness of the proposed approach, and indicate that the power is comparable to or greater than the alternatives.
Affiliation(s)
- Shih-Hao Huang
  - Department of Mathematics, National Central University, Taoyuan, Taiwan
- Kerby Shedden
  - Department of Statistics, University of Michigan, Ann Arbor, Michigan, USA
- Hsin-Wen Chang
  - Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
6
Li S, Sesia M, Romano Y, Candès E, Sabatti C. Searching for robust associations with a multi-environment knockoff filter. Biometrika 2022; 109:611-629. [PMID: 38633763 PMCID: PMC11022501 DOI: 10.1093/biomet/asab055] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 04/19/2024]
Abstract
This paper develops a method based on model-X knockoffs to find conditional associations that are consistent across environments, controlling the false discovery rate. The motivation for this problem is that large data sets may contain numerous associations that are statistically significant and yet misleading, as they are induced by confounders or sampling imperfections. However, associations replicated under different conditions may be more interesting. In fact, consistency sometimes provably leads to valid causal inferences even if conditional associations do not. While the proposed method is widely applicable, this paper highlights its relevance to genome-wide association studies, in which robustness across populations with diverse ancestries mitigates confounding due to unmeasured variants. The effectiveness of this approach is demonstrated by simulations and applications to the UK Biobank data.
Affiliation(s)
- S Li
  - Department of Statistics, Stanford University, Stanford, California 94305, USA
- M Sesia
  - Department of Data Sciences and Operations, University of Southern California, Los Angeles, California 90089, USA
- Y Romano
  - Departments of Electrical Engineering and of Computer Science, Technion, Haifa, Israel
- E Candès
  - Department of Statistics, Stanford University, Stanford, California 94305, USA
- C Sabatti
  - Department of Statistics, Stanford University, Stanford, California 94305, USA