1
Hédou J, Marić I, Bellan G, Einhaus J, Gaudillière DK, Ladant FX, Verdonk F, Stelzer IA, Feyaerts D, Tsai AS, Ganio EA, Sabayev M, Gillard J, Amar J, Cambriel A, Oskotsky TT, Roldan A, Golob JL, Sirota M, Bonham TA, Sato M, Diop M, Durand X, Angst MS, Stevenson DK, Aghaeepour N, Montanari A, Gaudillière B. Discovery of sparse, reliable omic biomarkers with Stabl. Nat Biotechnol 2024; 42:1581-1593. [PMID: 38168992 PMCID: PMC11217152 DOI: 10.1038/s41587-023-02033-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 02/20/2023] [Accepted: 10/16/2023] [Indexed: 01/05/2024]
Abstract
Adoption of high-content omic technologies in clinical studies, coupled with computational methods, has yielded an abundance of candidate biomarkers. However, translating such findings into bona fide clinical biomarkers remains challenging. To facilitate this process, we introduce Stabl, a general machine learning method that identifies a sparse, reliable set of biomarkers by integrating noise injection and a data-driven signal-to-noise threshold into multivariable predictive modeling. Evaluation of Stabl on synthetic datasets and five independent clinical studies demonstrates improved biomarker sparsity and reliability compared to commonly used sparsity-promoting regularization methods while maintaining predictive performance; it distills datasets containing 1,400-35,000 features down to 4-34 candidate biomarkers. Stabl extends to multi-omic integration tasks, enabling biological interpretation of complex predictive models, as it homes in on a shortlist of proteomic, metabolomic and cytometric events predicting labor onset, microbial biomarkers of pre-term birth and a pre-operative immune signature of post-surgical infections. Stabl is available at https://github.com/gregbellan/Stabl.
Affiliation(s)
- Julien Hédou
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Ivana Marić
  - Department of Pediatrics, Stanford University, Stanford, CA, USA
- Grégoire Bellan
  - Télécom Paris, Institut Polytechnique de Paris, Paris, France
- Jakob Einhaus
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
  - Department of Pathology and Neuropathology, University Hospital and Comprehensive Cancer Center Tübingen, Tübingen, Germany
- Dyani K Gaudillière
  - Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University, Stanford, CA, USA
- Franck Verdonk
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
  - Sorbonne University, GRC 29, AP-HP, DMU DREAM, Department of Anesthesiology and Intensive Care, Hôpital Saint-Antoine, Assistance Publique-Hôpitaux de Paris, Paris, France
- Ina A Stelzer
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
  - Department of Pathology, University of California San Diego, La Jolla, CA, USA
- Dorien Feyaerts
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Amy S Tsai
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Edward A Ganio
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Maximilian Sabayev
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Joshua Gillard
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
  - Department of Medical BioSciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Jonas Amar
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Amelie Cambriel
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Tomiko T Oskotsky
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Alennie Roldan
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Jonathan L Golob
  - Department of Medicine, University of Michigan Medical School, Ann Arbor, MI, USA
- Marina Sirota
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Thomas A Bonham
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Masaki Sato
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Maïgane Diop
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Xavier Durand
  - École Polytechnique, Institut Polytechnique de Paris, Paris, France
- Martin S Angst
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Nima Aghaeepour
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
  - Department of Pediatrics, Stanford University, Stanford, CA, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Andrea Montanari
  - Department of Statistics, Stanford University, Stanford, CA, USA
  - Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Brice Gaudillière
  - Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
  - Department of Pediatrics, Stanford University, Stanford, CA, USA
2
Sun X, Fu Y. Local false discovery rate estimation with competition-based procedures for variable selection. Stat Med 2024; 43:61-88. [PMID: 37927105 DOI: 10.1002/sim.9942] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2022] [Revised: 08/23/2023] [Accepted: 09/29/2023] [Indexed: 11/07/2023]
Abstract
Multiple hypothesis testing has been widely applied to problems dealing with high-dimensional data, for example, the selection of important variables or features from a large number of candidates while controlling the error rate. The most prevalent measure of error rate used in multiple hypothesis testing is the false discovery rate (FDR). In recent years, the local false discovery rate (fdr) has drawn much attention, due to its advantage of assessing the confidence of individual hypotheses. However, most methods estimate fdr through $P$-values or statistics with known null distributions, which are sometimes unavailable or unreliable. Adopting the innovative methodology of competition-based procedures, for example, the knockoff filter, this paper proposes a new approach, named TDfdr, to fdr estimation, which is free of $P$-values or known null distributions. Extensive simulation studies demonstrate that TDfdr can accurately estimate the fdr with two competition-based procedures. We applied the TDfdr method to two real biomedical tasks. One is to identify significantly differentially expressed proteins related to the COVID-19 disease, and the other is to detect mutations in the genotypes of HIV-1 that are associated with drug resistance. Higher discovery power was observed compared to existing popular methods.
Affiliation(s)
- Xiaoya Sun
  - CEMS, NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
- Yan Fu
  - CEMS, NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
3
Nii Adoquaye Acquaye FL, Kertesz-Farkas A, Noble WS. Efficient Indexing of Peptides for Database Search Using Tide. J Proteome Res 2023; 22:577-584. [PMID: 36633229 DOI: 10.1021/acs.jproteome.2c00617] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/13/2023]
Abstract
The first step in the analysis of protein tandem mass spectrometry data typically involves searching the observed spectra against a protein database. During database search, the search engine must digest the proteins in the database into peptides, subject to digestion rules that are under user control. The choice of these digestion parameters, as well as the selection of post-translational modifications (PTMs), can dramatically affect the size of the search space and hence the statistical power of the search. The Tide search engine separates the creation of the peptide index from the database search step, thereby saving time by allowing a peptide index to be reused in multiple searches. Here we describe an improved implementation of the indexing component of Tide that consumes about four times fewer resources (CPU and RAM) than the previous version and can generate arbitrarily large peptide databases, limited only by the amount of available disk space. We use this improved implementation to explore the relationship between database size and the parameters controlling digestion and PTMs, as well as between database size and statistical power. Our results can help guide practitioners in the proper selection of these important parameters.
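To make the link between digestion parameters and search-space size concrete, here is a minimal Python sketch of an in silico tryptic digest. It is not Tide's indexing code: the function names, the simplified cleavage rule (cut after K or R unless the next residue is P), the default length limits, and the toy sequence are all assumptions chosen for illustration.

```python
def cleavage_fragments(protein):
    """Split a sequence after K or R, except when the next residue is P.

    This is a common simplification of trypsin specificity (an assumption
    here, not a rule taken from the abstract above).
    """
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])  # C-terminal remainder
    return fragments


def tryptic_digest(protein, missed_cleavages=0, min_len=6, max_len=50):
    """Return the set of peptides allowed under the given digestion settings."""
    fragments = cleavage_fragments(protein)
    peptides = set()
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional adjacent fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


if __name__ == "__main__":
    toy_protein = "AAAKGGGRCCCKPDDDREEE"  # hypothetical sequence
    for allowed in (0, 1, 2):
        peptides = tryptic_digest(toy_protein, missed_cleavages=allowed, min_len=1)
        print(allowed, "missed cleavages:", len(peptides), "peptides")
```

Even on this toy sequence, each additional allowed missed cleavage enlarges the peptide set, which is the kind of search-space growth the abstract relates to statistical power.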
Affiliation(s)
- Frank Lawrence Nii Adoquaye Acquaye
  - Department of Data Analysis and Artificial Intelligence and Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University, Moscow 109028, Russia
- Attila Kertesz-Farkas
  - Department of Data Analysis and Artificial Intelligence and Laboratory on AI for Computational Biology, Faculty of Computer Science, HSE University, Moscow 109028, Russia
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States
4
Etourneau L, Burger T. Challenging Targets or Describing Mismatches? A Comment on Common Decoy Distribution by Madej et al. J Proteome Res 2022; 21:2840-2845. [PMID: 36305797 DOI: 10.1021/acs.jproteome.2c00279] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 01/19/2023]
Abstract
In their recent article, Madej et al. (Madej, D.; Wu, L.; Lam, H. Common Decoy Distributions Simplify False Discovery Rate Estimation in Shotgun Proteomics. J. Proteome Res. 2022, 21 (2), 339-348) proposed an original way to solve the recurrent issue of controlling for the false discovery rate (FDR) in peptide-spectrum-match (PSM) validation. Briefly, they proposed to derive a single precise distribution of decoy matches termed the Common Decoy Distribution (CDD) and to use it to control for FDR during a target-only search. Conceptually, this approach is appealing as it takes the best of two worlds, i.e., decoy-based approaches (which leverage a large-scale collection of empirical mismatches) and decoy-free approaches (which are not subject to the randomness of decoy generation while sparing an additional database search). Interestingly, CDD also corresponds to a middle-of-the-road approach in statistics with respect to the two main families of FDR control procedures: Although historically based on estimating the false-positive distribution, FDR control has recently been demonstrated to be possible thanks to competition between the original variables (in proteomics, target sequences) and their fictional counterparts (in proteomics, decoys). Discriminating between these two theoretical trends is of prime importance for computational proteomics. In addition to highlighting why proteomics was a source of inspiration for theoretical biostatistics, it provides practical insights into the improvements that can be made to FDR control methods used in proteomics, including CDD.
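The competition-based family of FDR procedures mentioned above can be illustrated with a short Python sketch of target-decoy competition. This is a generic textbook-style formulation and not the CDD method or the authors' code: the function name, the "+1" correction in the estimate, the treatment of ties as decoy wins, and the toy scores are assumptions made for illustration.

```python
def tdc_cutoff(target_scores, decoy_scores, alpha=0.01):
    """Lowest score cutoff whose estimated FDR is at most alpha, else None.

    target_scores[i] and decoy_scores[i] are the best target and decoy
    match scores for spectrum i (higher is better).
    """
    # Competition: each spectrum keeps only the better of its two matches.
    # Ties are counted as decoy wins, which is the conservative choice.
    winners = [(max(t, d), t > d) for t, d in zip(target_scores, decoy_scores)]

    best = None
    # Scan candidate cutoffs from strictest to most permissive. The estimate
    # is not monotone in the cutoff, so keep the last cutoff that passes.
    for cutoff in sorted({score for score, _ in winners}, reverse=True):
        n_targets = sum(1 for s, is_target in winners if is_target and s >= cutoff)
        n_decoys = sum(1 for s, is_target in winners if not is_target and s >= cutoff)
        # Decoy wins above the cutoff stand in for false target wins.
        fdr_estimate = (n_decoys + 1) / max(n_targets, 1)
        if fdr_estimate <= alpha:
            best = cutoff
    return best


if __name__ == "__main__":
    # Hypothetical scores for ten spectra.
    targets = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
    decoys = [1, 1, 1, 1, 1, 1, 5, 4, 1, 2]
    print(tdc_cutoff(targets, decoys, alpha=0.25))
```

No null distribution or P-value is needed here: the decoy wins themselves provide the estimate of false discoveries, which is the property that distinguishes competition-based control from procedures built on an estimated false-positive distribution.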
Affiliation(s)
- Lucas Etourneau
  - Univ. Grenoble Alpes, CNRS, CEA, Inserm, ProFI, FR2048 Grenoble, France
- Thomas Burger
  - Univ. Grenoble Alpes, CNRS, CEA, Inserm, ProFI, FR2048 Grenoble, France