1
Ji J, Hou Z, He Y, Liu L, Xue F, Chen H, Yuan Z. Differential network knockoff filter with application to brain connectivity analysis. Stat Med 2024. [PMID: 38922944 DOI: 10.1002/sim.10155] [Received: 05/14/2023] [Revised: 04/30/2024] [Accepted: 06/10/2024] [Indexed: 06/28/2024]
Abstract
Brain functional connectivity can typically be represented as a brain functional network, where nodes represent regions of interest (ROIs) and edges represent their connections. Studying group differences in brain functional connectivity can help identify brain regions and recover the brain functional network linked to neurodegenerative diseases. This process, known as differential network analysis, focuses on the differences between estimated precision matrices for two groups. Current methods struggle with individual heterogeneity in measuring brain connectivity, false discovery rate (FDR) control, and accounting for confounding factors, resulting in biased estimates and diminished power. To address these issues, we present a two-stage FDR-controlled feature selection method for differential network analysis using functional magnetic resonance imaging (fMRI) data. First, we create individual brain connectivity measures using a high-dimensional precision matrix estimation technique. Next, we devise a penalized logistic regression model that employs individual brain connectivity data and integrates a new knockoff filter for FDR control when detecting significant differential edges. Through extensive simulations, we showcase the superiority of our approach compared to other methods. Additionally, we apply our technique to fMRI data to identify differential edges between Alzheimer's disease and control groups. Our results are consistent with prior experimental studies, emphasizing the practical applicability of our method.
Affiliation(s)
- Jiadong Ji
  - Institute for Financial Studies, Shandong University, Jinan, Shandong, China
- Zhendong Hou
  - Institute for Financial Studies, Shandong University, Jinan, Shandong, China
- Yong He
  - Institute for Financial Studies, Shandong University, Jinan, Shandong, China
- Lei Liu
  - Division of Biostatistics, Washington University in St. Louis, St. Louis, Missouri, USA
- Fuzhong Xue
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, China
- Hao Chen
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, China
- Zhongshang Yuan
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, China
2
VanderDoes J, Marceaux C, Yokote K, Asselin-Labat ML, Rice G, Hywood JD. Using random forests to uncover the predictive power of distance-varying cell interactions in tumor microenvironments. PLoS Comput Biol 2024; 20:e1011361. [PMID: 38875302 PMCID: PMC11210873 DOI: 10.1371/journal.pcbi.1011361] [Received: 07/18/2023] [Revised: 06/27/2024] [Accepted: 05/31/2024] [Indexed: 06/16/2024] Open
Abstract
Tumor microenvironments (TMEs) contain vast amounts of information on a patient's cancer through their cellular composition and the spatial distribution of tumor cells and immune cell populations. Exploring variations in TMEs between patient groups, as well as determining the extent to which this information can predict outcomes such as patient survival or treatment success with emerging immunotherapies, is of great interest. Moreover, in the face of a large number of cell interactions to consider, we often wish to identify specific interactions that are useful in making such predictions. We present an approach to achieve these goals based on summarizing spatial relationships in the TME using spatial K functions, and then applying functional data analysis and random forest models to both predict outcomes of interest and identify important spatial relationships. This approach is shown in simulation experiments to be effective at identifying important spatial interactions while also controlling the false discovery rate. We further used the proposed approach to interrogate two real data sets of Multiplexed Ion Beam Images of TMEs in triple negative breast cancer and lung cancer patients. The proposed methods are publicly available in a companion R package, funkycells.
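To make the idea of a spatial K function concrete, the following is a minimal sketch of a naive cross-type K estimate between two cell populations, assuming point coordinates in a rectangular image of known area and no edge correction. It is not the funkycells implementation; the function name `cross_k`, its parameters, and the toy coordinates are hypothetical and chosen only for illustration.

```python
import math

def cross_k(points_a, points_b, radii, area):
    """Naive cross-type K estimate without edge correction (illustrative only).

    For each radius r, returns area * (#pairs (a, b) with distance <= r)
    divided by (number of type-A cells * number of type-B cells).
    """
    n_pairs = len(points_a) * len(points_b)
    # All pairwise distances between type-A and type-B cells.
    dists = [math.dist(a, b) for a in points_a for b in points_b]
    return [area * sum(d <= r for d in dists) / n_pairs for r in radii]

# Toy example: two tumor cells and two immune cells in an image of area 4.
tumor = [(0, 0), (1, 0)]
immune = [(0, 1), (3, 0)]
k_values = cross_k(tumor, immune, radii=[1, 2, 3], area=4.0)
print(k_values)
```

Evaluating the estimate over a grid of radii yields one curve per pair of cell types; curves of this kind are what a functional data analysis step would then take as input.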
Affiliation(s)
- Jeremy VanderDoes
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada
- Claire Marceaux
  - Personalised Oncology Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
  - Department of Medical Biology, The University of Melbourne, Parkville, Australia
- Kenta Yokote
  - Personalised Oncology Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
- Marie-Liesse Asselin-Labat
  - Personalised Oncology Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
  - Department of Medical Biology, The University of Melbourne, Parkville, Australia
- Gregory Rice
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada
- Jack D. Hywood
  - Department of Anatomical Pathology, Royal Melbourne Hospital, Parkville, Australia
3
Marić I, Stevenson DK, Aghaeepour N, Gaudillière B, Wong RJ, Angst MS. Predicting Preterm Birth Using Proteomics. Clin Perinatol 2024; 51:391-409. [PMID: 38705648 PMCID: PMC11186213 DOI: 10.1016/j.clp.2024.02.011] [Indexed: 05/07/2024]
Abstract
The complexity of preterm birth (PTB), both spontaneous and medically indicated, and its various etiologies and associated risk factors pose a significant challenge for developing tools to accurately predict risk. This review focuses on the discovery of proteomics signatures that might be useful for predicting spontaneous PTB or preeclampsia, which often results in PTB. We describe methods for proteomics analyses, proteomics biomarker candidates that have so far been identified, obstacles for discovering biomarkers that are sufficiently accurate for clinical use, and the derivation of composite signatures including clinical parameters to increase predictive power.
Affiliation(s)
- Ivana Marić
  - Division of Neonatal and Developmental Medicine, Department of Pediatrics, Stanford University School of Medicine, 453 Quarry Road, Palo Alto, CA 94304, USA
- David K Stevenson
  - Division of Neonatal and Developmental Medicine, Department of Pediatrics, Stanford University School of Medicine, 453 Quarry Road, Palo Alto, CA 94304, USA
- Nima Aghaeepour
  - Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Grant Building, Office 276A, 300 Pasteur Drive, Stanford, CA 94305-5117, USA; Division of Neonatal and Developmental Medicine, Department of Pediatrics, Stanford University School of Medicine, 300 Pasteur Drive, Grant S280, Stanford, CA 94305, USA
- Brice Gaudillière
  - Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Grant Building, Office 276A, 300 Pasteur Drive, Stanford, CA 94305-5117, USA; Division of Neonatal and Developmental Medicine, Department of Pediatrics, Stanford University School of Medicine, 300 Pasteur Drive, Grant S280, Stanford, CA 94305, USA
- Ronald J Wong
  - Division of Neonatal and Developmental Medicine, Department of Pediatrics, Stanford University School of Medicine, 453 Quarry Road, Palo Alto, CA 94304, USA
- Martin S Angst
  - Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Grant Building, Office 276A, 300 Pasteur Drive, Stanford, CA 94305-5117, USA
4
Hlongwane R, Ramaboa KKKM, Mongwe W. Enhancing credit scoring accuracy with a comprehensive evaluation of alternative data. PLoS One 2024; 19:e0303566. [PMID: 38771812 PMCID: PMC11108212 DOI: 10.1371/journal.pone.0303566] [Received: 03/11/2024] [Accepted: 04/27/2024] [Indexed: 05/23/2024] Open
Abstract
This study explores the potential of utilizing alternative data sources to enhance the accuracy of credit scoring models, compared to relying solely on traditional data sources, such as credit bureau data. A comprehensive dataset from the Home Credit Group's home loan portfolio is analysed. The research examines the impact of incorporating alternative predictors that are typically overlooked, such as an applicant's social network default status, regional economic ratings, and local population characteristics. The modelling approach applies the model-X knockoffs framework for systematic variable selection. By including these alternative data sources, the credit scoring models demonstrate improved predictive performance, achieving an area under the curve metric of 0.79360 on the Kaggle Home Credit default risk competition dataset, outperforming models that relied solely on traditional data sources, such as credit bureau data. The findings highlight the significance of leveraging diverse, non-traditional data sources to augment credit risk assessment capabilities and overall model accuracy.
Affiliation(s)
- Rivalani Hlongwane
  - Graduate School of Business, University of Cape Town, Cape Town, South Africa
- Wilson Mongwe
  - Electrical and Electronic Engineering, University of Johannesburg, Johannesburg, South Africa
5
Rahimikollu J, Xiao H, Rosengart A, Rosen ABI, Tabib T, Zdinak PM, He K, Bing X, Bunea F, Wegkamp M, Poholek AC, Joglekar AV, Lafyatis RA, Das J. SLIDE: Significant Latent Factor Interaction Discovery and Exploration across biological domains. Nat Methods 2024; 21:835-845. [PMID: 38374265 DOI: 10.1038/s41592-024-02175-z] [Received: 11/23/2022] [Accepted: 01/09/2024] [Indexed: 02/21/2024]
Abstract
Modern multiomic technologies can generate deep multiscale profiles. However, differences in data modalities, multicollinearity of the data, and large numbers of irrelevant features make analyses and integration of high-dimensional omic datasets challenging. Here we present Significant Latent Factor Interaction Discovery and Exploration (SLIDE), a first-in-class interpretable machine learning technique for identifying significant interacting latent factors underlying outcomes of interest from high-dimensional omic datasets. SLIDE makes no assumptions regarding data-generating mechanisms, comes with theoretical guarantees regarding identifiability of the latent factors/corresponding inference, and has rigorous false discovery rate control. Using SLIDE on single-cell and spatial omic datasets, we uncovered significant interacting latent factors underlying a range of molecular, cellular and organismal phenotypes. SLIDE outperforms/performs at least as well as a wide range of state-of-the-art approaches, including other latent factor approaches. More importantly, it provides biological inference beyond prediction that other methods do not afford. Thus, SLIDE is a versatile engine for biological discovery from modern multiomic datasets.
Affiliation(s)
- Javad Rahimikollu
  - Center for Systems Immunology, Departments of Immunology and Computational & Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
  - Joint CMU-Pitt PhD Program in Computational Biology, Pittsburgh, PA, USA
- Hanxi Xiao
  - Center for Systems Immunology, Departments of Immunology and Computational & Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
  - Joint CMU-Pitt PhD Program in Computational Biology, Pittsburgh, PA, USA
- AnnaElaine Rosengart
  - Center for Systems Immunology, Departments of Immunology and Computational & Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Aaron B I Rosen
  - Center for Systems Immunology, Departments of Immunology and Computational & Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
  - Joint CMU-Pitt PhD Program in Computational Biology, Pittsburgh, PA, USA
- Tracy Tabib
  - Division of Rheumatology and Clinical Immunology, Department of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Paul M Zdinak
  - Center for Systems Immunology, Departments of Immunology and Computational & Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Kun He
  - Department of Pediatrics, University of Pittsburgh, Pittsburgh, PA, USA
- Xin Bing
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Florentina Bunea
  - Department of Statistics and Data Science, Cornell University, Ithaca, NY, USA
- Marten Wegkamp
  - Department of Statistics and Data Science, Cornell University, Ithaca, NY, USA
  - Department of Mathematics, Cornell University, Ithaca, NY, USA
- Amanda C Poholek
  - Department of Pediatrics, University of Pittsburgh, Pittsburgh, PA, USA
- Alok V Joglekar
  - Center for Systems Immunology, Departments of Immunology and Computational & Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Robert A Lafyatis
  - Division of Rheumatology and Clinical Immunology, Department of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Jishnu Das
  - Center for Systems Immunology, Departments of Immunology and Computational & Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
6
Wang Y, Fu Y, Sun X. Knockoffs-SPR: Clean Sample Selection in Learning With Noisy Labels. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024; 46:3242-3256. [PMID: 38039178 DOI: 10.1109/tpami.2023.3338268] [Indexed: 12/03/2023]
Abstract
A noisy training set usually degrades the generalization and robustness of neural networks. In this article, we propose a novel, theoretically guaranteed clean sample selection framework for learning with noisy labels. Specifically, we first present a Scalable Penalized Regression (SPR) method to model the linear relation between network features and one-hot labels. In SPR, the clean data are identified by the zero mean-shift parameters solved in the regression model. We theoretically show that SPR can recover clean data under some conditions. Under general scenarios, the conditions may no longer be satisfied, and some noisy data are falsely selected as clean data. To solve this problem, we propose a data-adaptive method for Scalable Penalized Regression with Knockoff filters (Knockoffs-SPR), which provably controls the False-Selection-Rate (FSR) in the selected clean data. To improve efficiency, we further present a split algorithm that divides the whole training set into small pieces that can be solved in parallel, making the framework scalable to large datasets. While Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline, we further combine it with a semi-supervised algorithm to exploit the support of noisy data as unlabeled data. Experimental results on several benchmark datasets and real-world noisy datasets show the effectiveness of our framework and validate the theoretical results of Knockoffs-SPR.
7
Uncovering hidden states driving biological outcomes using machine learning. Nat Methods 2024; 21:758-759. [PMID: 38374270 DOI: 10.1038/s41592-024-02176-y] [Indexed: 02/21/2024]
8
Freestone J, Noble WS, Keich U. Analysis of Tandem Mass Spectrometry Data with CONGA: Combining Open and Narrow Searches with Group-Wise Analysis. J Proteome Res 2024. [PMID: 38652578 DOI: 10.1021/acs.jproteome.3c00399] [Indexed: 04/25/2024]
Abstract
Searching tandem mass spectrometry proteomics data against a database is a well-established method for assigning peptide sequences to observed spectra but typically cannot identify peptides harboring unexpected post-translational modifications (PTMs). Open modification searching aims to address this problem by allowing a spectrum to match a peptide even if the spectrum's precursor mass differs from the peptide mass. However, expanding the search space in this way can lead to a loss of statistical power to detect peptides. We therefore developed a method, called CONGA (combining open and narrow searches with group-wise analysis), that takes into account results from both types of searches (a traditional "narrow window" search and an open modification search) while carrying out rigorous false discovery rate control. The result is an algorithm that provides the best of both worlds: the ability to detect unexpected PTMs without a concomitant loss of power to detect unmodified peptides.
Affiliation(s)
- Jack Freestone
  - School of Mathematics and Statistics F07, University of Sydney, NSW 2006, Australia
- William S Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States
- Uri Keich
  - School of Mathematics and Statistics F07, University of Sydney, NSW 2006, Australia
9
Zhuang Y, Dyas A, Meguid RA, Henderson WG, Bronsert M, Madsen H, Colborn KL. Preoperative Prediction of Postoperative Infections Using Machine Learning and Electronic Health Record Data. Ann Surg 2024; 279:720-726. [PMID: 37753703 DOI: 10.1097/sla.0000000000006106] [Indexed: 09/28/2023]
Abstract
OBJECTIVE: To estimate preoperative risk of postoperative infections using structured electronic health record (EHR) data.
BACKGROUND: Surveillance and reporting of postoperative infections is primarily done through costly, labor-intensive manual chart reviews on a small sample of patients. Automated methods using statistical models applied to postoperative EHR data have shown promise to augment manual review as they can cover all operations in a timely manner. However, there are no specific models for risk-adjusting infectious complication rates using EHR data.
METHODS: Preoperative EHR data from 30,639 patients (2013-2019) were linked to the American College of Surgeons National Surgical Quality Improvement Program preoperative data and postoperative infection outcomes data from 5 hospitals in the University of Colorado Health System. EHR data included diagnoses, procedures, operative variables, patient characteristics, and medications. Lasso and the knockoff filter were used to perform controlled variable selection. Outcomes included surgical site infection, urinary tract infection, sepsis/septic shock, and pneumonia up to 30 days postoperatively.
RESULTS: Among >15,000 candidate predictors, 7 were chosen for the surgical site infection model and 6 for each of the urinary tract infection, sepsis, and pneumonia models. Important variables included preoperative presence of the specific outcome, wound classification, comorbidities, and American Society of Anesthesiologists physical status classification. The area under the receiver operating characteristic curve for each model ranged from 0.73 to 0.89.
CONCLUSIONS: Parsimonious preoperative models for predicting postoperative infection risk using EHR data were developed and showed comparable performance to existing American College of Surgeons National Surgical Quality Improvement Program risk models that use manual chart review. These models can be used to estimate risk-adjusted postoperative infection rates applied to large volumes of EHR data in a timely manner.
Affiliation(s)
- Yaxu Zhuang
  - Department of Surgery, Surgical Outcomes and Applied Research Program, University of Colorado Anschutz Medical Campus
  - Department of Biostatistics and Informatics, Colorado School of Public Health
- Adam Dyas
  - Department of Surgery, Surgical Outcomes and Applied Research Program, University of Colorado Anschutz Medical Campus
  - Department of Surgery, School of Medicine, University of Colorado Anschutz Medical Campus
- Robert A Meguid
  - Department of Surgery, Surgical Outcomes and Applied Research Program, University of Colorado Anschutz Medical Campus
  - Department of Surgery, School of Medicine, University of Colorado Anschutz Medical Campus
  - Adult and Child Consortium for Health Outcomes Research and Delivery Science, University of Colorado Anschutz Medical Campus, Aurora, CO
- William G Henderson
  - Department of Surgery, Surgical Outcomes and Applied Research Program, University of Colorado Anschutz Medical Campus
- Michael Bronsert
  - Department of Surgery, Surgical Outcomes and Applied Research Program, University of Colorado Anschutz Medical Campus
  - Adult and Child Consortium for Health Outcomes Research and Delivery Science, University of Colorado Anschutz Medical Campus, Aurora, CO
- Helen Madsen
  - Department of Surgery, Surgical Outcomes and Applied Research Program, University of Colorado Anschutz Medical Campus
  - Department of Surgery, School of Medicine, University of Colorado Anschutz Medical Campus
- Kathryn L Colborn
  - Department of Surgery, Surgical Outcomes and Applied Research Program, University of Colorado Anschutz Medical Campus
  - Department of Biostatistics and Informatics, Colorado School of Public Health
  - Department of Surgery, School of Medicine, University of Colorado Anschutz Medical Campus
  - Adult and Child Consortium for Health Outcomes Research and Delivery Science, University of Colorado Anschutz Medical Campus, Aurora, CO
10
Zimmermann MR, Baillie M, Kormaksson M, Ohlssen D, Sechidis K. All that Glitters Is not Gold: Type-I Error Controlled Variable Selection from Clinical Trial Data. Clin Pharmacol Ther 2024; 115:774-785. [PMID: 38419357 DOI: 10.1002/cpt.3211] [Received: 09/14/2023] [Accepted: 02/02/2024] [Indexed: 03/02/2024]
Abstract
Clinical trials are primarily conducted to estimate causal effects, but the data collected can also be invaluable for additional research, such as identifying prognostic measures of disease or biomarkers that predict treatment efficacy. However, these exploratory settings are prone to false discoveries (type-I errors) due to the multiple comparisons they entail. Unfortunately, many methods fail to address this issue, in part because the algorithms used are generally designed to optimize predictions and often only provide the measures used for variable selection, such as machine learning model importance scores, as a byproduct. To address the resulting uncertainty in the selection sets, the knockoff framework offers a model-agnostic, robust approach to variable selection with guaranteed type-I error control. Here, we review the knockoff framework in the setting of clinical data, highlighting the main considerations using simulation studies. We also extend the framework by introducing a novel knockoff generation method that addresses two main limitations of previously suggested methods relevant for clinical development settings. With this new method, we empirically obtain tighter bounds on type-I error control and gain an order of magnitude in computational efficiency in mixed data settings. We demonstrate comparable selections to those of the competing method for identifying prognostic biomarkers for C-reactive protein levels in patients with psoriatic arthritis in four clinical trials. Our work increases access to the knockoff framework for variable selection from clinical trial data. In doing so, this paper helps to address the current replicability crisis, which can result in unnecessary research efforts, increased patient burden, and avoidable costs.
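As a rough illustration of the selection step that knockoff-based procedures share, the sketch below computes the knockoff+ data-dependent threshold from a vector of feature statistics W_j (large positive values favor the original feature over its knockoff copy) and selects the features that reach it. The function names, the example statistics, and the target level q are illustrative assumptions; generating the knockoff copies and computing W_j, which is where the methods listed here differ, is not shown.

```python
def knockoff_plus_threshold(w, q):
    """Smallest t among the nonzero |w_j| such that
    (1 + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}) <= q.
    Returns infinity when no candidate threshold qualifies."""
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)
        positives = sum(1 for x in w if x >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")

def knockoff_select(w, q):
    """Indices of features whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]

# Hypothetical feature statistics, for illustration only.
w_stats = [5, 4, 3.5, 3, 2.5, 2, -1, 1.5, 0.5, -0.2]
selected = knockoff_select(w_stats, q=0.2)
print(selected)
```

The count of large negative statistics acts as an estimate of the number of false positives among the large positive ones, which is why no p-values are needed.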
Affiliation(s)
- David Ohlssen
  - Novartis Pharmaceuticals Corporation, East Hanover, New Jersey, USA
11
Lin A, See D, Fondrie WE, Keich U, Noble WS. Target-decoy false discovery rate estimation using Crema. Proteomics 2024; 24:e2300084. [PMID: 38380501 DOI: 10.1002/pmic.202300084] [Received: 06/13/2023] [Revised: 01/06/2024] [Accepted: 01/16/2024] [Indexed: 02/22/2024]
Abstract
Assigning statistical confidence estimates to discoveries produced by a tandem mass spectrometry proteomics experiment is critical to enabling principled interpretation of the results and assessing the cost/benefit ratio of experimental follow-up. The most common technique for computing such estimates is to use target-decoy competition (TDC), in which observed spectra are searched against a database of real (target) peptides and a database of shuffled or reversed (decoy) peptides. TDC procedures for estimating the false discovery rate (FDR) at a given score threshold have been developed for application at the level of spectra, peptides, or proteins. Although these techniques are relatively straightforward to implement, it is common in the literature to skip over the implementation details or even to make mistakes in how the TDC procedures are applied in practice. Here we present Crema, an open-source Python tool that implements several TDC methods of spectrum-, peptide- and protein-level FDR estimation. Crema is compatible with a variety of existing database search tools and provides a straightforward way to obtain robust FDR estimates.
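A minimal sketch of the target-decoy competition estimate described above, assuming one best target score and one best decoy score per spectrum and the commonly used (decoys + 1) / targets estimate at a score threshold. It is not Crema's API; the function names, the toy scores, and the tie-breaking rule (ties counted as decoy wins) are assumptions made for illustration.

```python
def compete(target_scores, decoy_scores):
    """Per spectrum, keep the better of its best target and best decoy score.

    Returns a list of (score, is_decoy) pairs. Ties are counted as decoy
    wins here, a conservative simplification assumed for this sketch.
    """
    winners = []
    for t, d in zip(target_scores, decoy_scores):
        winners.append((t, False) if t > d else (d, True))
    return winners

def tdc_fdr(winners, threshold):
    """Estimated FDR among target wins scoring at or above the threshold."""
    targets = sum(1 for s, is_decoy in winners if s >= threshold and not is_decoy)
    decoys = sum(1 for s, is_decoy in winners if s >= threshold and is_decoy)
    return min(1.0, (decoys + 1) / max(targets, 1))

# Toy scores for six spectra (higher is better), for illustration only.
target_scores = [10, 9, 8, 7, 6, 2]
decoy_scores = [1, 2, 3, 1, 2, 5]
winners = compete(target_scores, decoy_scores)
print(tdc_fdr(winners, threshold=6), tdc_fdr(winners, threshold=4))
```

Lowering the threshold admits the one decoy win, so the estimate rises; the same counting can be applied at the peptide or protein level once the competition is defined at that level.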
Affiliation(s)
- Andy Lin
  - Chemical and Biological Signatures, Pacific Northwest National Laboratory, Seattle, Washington, USA
- Donavan See
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, USA
- Uri Keich
  - School of Mathematics and Statistics, University of Sydney, Sydney, Australia
- William Stafford Noble
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, USA
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
12
Lac L, Leung CK, Hu P. Computational frameworks integrating deep learning and statistical models in mining multimodal omics data. J Biomed Inform 2024; 152:104629. [PMID: 38552994 DOI: 10.1016/j.jbi.2024.104629] [Received: 01/03/2024] [Revised: 02/26/2024] [Accepted: 03/25/2024] [Indexed: 04/04/2024]
Abstract
BACKGROUND: In health research, multimodal omics data analysis is widely used to address important clinical and biological questions. Traditional statistical methods rely on strong distributional assumptions. Statistical methods such as hypothesis testing and differential expression analysis are commonly used in omics analysis. Deep learning, on the other hand, is an advanced computer science technique that is powerful in mining high-dimensional omics data for prediction tasks. Recently, integrative frameworks or methods have been developed for omics studies that combine statistical models and deep learning algorithms.
METHODS AND RESULTS: The aim of these integrative frameworks is to combine the strengths of both statistical methods and deep learning algorithms to improve prediction accuracy while also providing interpretability and explainability. This review discusses the current state-of-the-art integrative frameworks, their limitations, and potential future directions in survival and time-to-event longitudinal analysis, dimension reduction and clustering, regression and classification, feature selection, and causal and transfer learning.
Affiliation(s)
- Leann Lac
  - Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada; Department of Statistics, University of Manitoba, Winnipeg, Manitoba, Canada
- Carson K Leung
  - Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada
- Pingzhao Hu
  - Department of Computer Science, University of Manitoba, Winnipeg, Manitoba, Canada; Department of Biochemistry, Western University, London, Ontario, Canada; Department of Computer Science, Western University, London, Ontario, Canada; Department of Oncology, Western University, London, Ontario, Canada; Department of Epidemiology and Biostatistics, Western University, London, Ontario, Canada; The Children's Health Research Institute, Lawson Health Research Institute, London, Ontario, Canada
13
Burger T. Fudging the volcano-plot without dredging the data. Nat Commun 2024; 15:1392. [PMID: 38360828 PMCID: PMC10869345 DOI: 10.1038/s41467-024-45834-7] [Received: 12/21/2023] [Accepted: 02/02/2024] [Indexed: 02/17/2024] Open
Affiliation(s)
- Thomas Burger
  - Univ. Grenoble Alpes, INSERM, CEA, UA13 BGE, CNRS, CEA, FR2048 ProFI, 38000, Grenoble, France
14
Williamson BD, Huang Y. Flexible variable selection in the presence of missing data. Int J Biostat 2024; 0:ijb-2023-0059. [PMID: 38348882 DOI: 10.1515/ijb-2023-0059] [Received: 06/02/2023] [Accepted: 11/21/2023] [Indexed: 05/22/2024]
Abstract
In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose a nonparametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithm that achieve control of commonly used error rates. Through simulations, we show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed method to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.
Affiliation(s)
- Brian D Williamson
- Biostatistics Division, Kaiser Permanente Washington Health Research Institute, Seattle, USA
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Center, Seattle, USA
- Department of Biostatistics, University of Washington, Seattle, USA
| | - Ying Huang
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Center, Seattle, USA
- Department of Biostatistics, University of Washington, Seattle, USA
15
Sun X, Fu Y. Local false discovery rate estimation with competition-based procedures for variable selection. Stat Med 2024; 43:61-88. [PMID: 37927105 DOI: 10.1002/sim.9942] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2022] [Revised: 08/23/2023] [Accepted: 09/29/2023] [Indexed: 11/07/2023]
Abstract
Multiple hypothesis testing has been widely applied to problems dealing with high-dimensional data, for example, the selection of important variables or features from a large number of candidates while controlling the error rate. The most prevalent measure of error rate used in multiple hypothesis testing is the false discovery rate (FDR). In recent years, the local false discovery rate (fdr) has drawn much attention, due to its advantage of assessing the confidence of individual hypotheses. However, most methods estimate fdr through P-values or statistics with known null distributions, which are sometimes unavailable or unreliable. Adopting the innovative methodology of competition-based procedures, for example, the knockoff filter, this paper proposes a new approach, named TDfdr, to fdr estimation, which is free of P-values or known null distributions. Extensive simulation studies demonstrate that TDfdr can accurately estimate the fdr with two competition-based procedures. We applied the TDfdr method to two real biomedical tasks. One is to identify significantly differentially expressed proteins related to the COVID-19 disease, and the other is to detect mutations in the genotypes of HIV-1 that are associated with drug resistance. Higher discovery power was observed compared to existing popular methods.
Affiliation(s)
- Xiaoya Sun
- CEMS, NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Yan Fu
- CEMS, NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
16
Hédou J, Marić I, Bellan G, Einhaus J, Gaudillière DK, Ladant FX, Verdonk F, Stelzer IA, Feyaerts D, Tsai AS, Ganio EA, Sabayev M, Gillard J, Amar J, Cambriel A, Oskotsky TT, Roldan A, Golob JL, Sirota M, Bonham TA, Sato M, Diop M, Durand X, Angst MS, Stevenson DK, Aghaeepour N, Montanari A, Gaudillière B. Discovery of sparse, reliable omic biomarkers with Stabl. Nat Biotechnol 2024:10.1038/s41587-023-02033-x. [PMID: 38168992 PMCID: PMC11217152 DOI: 10.1038/s41587-023-02033-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Accepted: 10/16/2023] [Indexed: 01/05/2024]
Abstract
Adoption of high-content omic technologies in clinical studies, coupled with computational methods, has yielded an abundance of candidate biomarkers. However, translating such findings into bona fide clinical biomarkers remains challenging. To facilitate this process, we introduce Stabl, a general machine learning method that identifies a sparse, reliable set of biomarkers by integrating noise injection and a data-driven signal-to-noise threshold into multivariable predictive modeling. Evaluation of Stabl on synthetic datasets and five independent clinical studies demonstrates improved biomarker sparsity and reliability compared to commonly used sparsity-promoting regularization methods while maintaining predictive performance; it distills datasets containing 1,400-35,000 features down to 4-34 candidate biomarkers. Stabl extends to multi-omic integration tasks, enabling biological interpretation of complex predictive models, as it hones in on a shortlist of proteomic, metabolomic and cytometric events predicting labor onset, microbial biomarkers of pre-term birth and a pre-operative immune signature of post-surgical infections. Stabl is available at https://github.com/gregbellan/Stabl .
Affiliation(s)
- Julien Hédou
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
| | - Ivana Marić
- Department of Pediatrics, Stanford University, Stanford, CA, USA
| | - Grégoire Bellan
- Télécom Paris, Institut Polytechnique de Paris, Paris, France
| | - Jakob Einhaus
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Department of Pathology and Neuropathology, University Hospital and Comprehensive Cancer Center Tübingen, Tübingen, Germany
| | - Dyani K Gaudillière
- Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University, Stanford, CA, USA
| | | | - Franck Verdonk
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Sorbonne University, GRC 29, AP-HP, DMU DREAM, Department of Anesthesiology and Intensive Care, Hôpital Saint-Antoine, Assistance Publique-Hôpitaux de Paris, Paris, France
| | - Ina A Stelzer
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Department of Pathology, University of California San Diego, La Jolla, CA, USA
| | - Dorien Feyaerts
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
| | - Amy S Tsai
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
| | - Edward A Ganio
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
| | - Maximilian Sabayev
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
| | - Joshua Gillard
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Department of Medical BioSciences, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Jonas Amar
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
| | - Amelie Cambriel
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
| | - Tomiko T Oskotsky
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
| | - Alennie Roldan
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
| | - Jonathan L Golob
- Department of Medicine, University of Michigan Medical School, Ann Arbor, MI, USA
| | - Marina Sirota
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
| | - Thomas A Bonham
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
| | - Masaki Sato
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
| | - Maïgane Diop
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
| | - Xavier Durand
- École Polytechnique, Institut Polytechnique de Paris, Paris, France
| | - Martin S Angst
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
| | | | - Nima Aghaeepour
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA
- Department of Pediatrics, Stanford University, Stanford, CA, USA
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
| | - Andrea Montanari
- Department of Statistics, Stanford University, Stanford, CA, USA
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
| | - Brice Gaudillière
- Department of Anesthesiology, Perioperative & Pain Medicine, Stanford University, Stanford, CA, USA.
- Department of Pediatrics, Stanford University, Stanford, CA, USA.
17
Luo D, Ebadi A, Emery K, He Y, Noble WS, Keich U. Competition-based control of the false discovery proportion. Biometrics 2023; 79:3472-3484. [PMID: 36652258 DOI: 10.1111/biom.13830] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Revised: 10/12/2022] [Accepted: 01/02/2023] [Indexed: 01/19/2023]
Abstract
Recently, Barber and Candès laid the theoretical foundation for a general framework for false discovery rate (FDR) control based on the notion of "knockoffs." A closely related FDR control methodology has long been employed in the analysis of mass spectrometry data, referred to there as "target-decoy competition" (TDC). However, any approach that aims to control the FDR, which is defined as the expected value of the false discovery proportion (FDP), suffers from a problem. Specifically, even when successfully controlling the FDR at level α, the FDP in the list of discoveries can significantly exceed α. We offer FDP-SD, a new procedure that rigorously controls the FDP in the knockoff/TDC competition setup by guaranteeing that the FDP is bounded by α at a desired confidence level. Compared with the recently published framework of Katsevich and Ramdas, FDP-SD generally delivers more power and often substantially so in simulated and real data.
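The competition setup mentioned in this abstract can be illustrated with a minimal sketch of the standard knockoff+ (equivalently, target-decoy) selection threshold attributed to Barber and Candès. This is not the FDP-SD procedure proposed in the paper: it only controls the FDR in expectation, which is exactly the limitation the abstract describes. The function names and the example scores are hypothetical, and the scores `w` are assumed to be positive when the real feature beats its knockoff and negative otherwise.

```python
def knockoff_plus_threshold(w, alpha):
    """Return the smallest t such that the estimated FDP
    (1 + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}) is at most alpha.
    Returns infinity when no threshold qualifies."""
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        decoy_wins = sum(1 for x in w if x <= -t)   # knockoff/decoy won the competition
        target_wins = sum(1 for x in w if x >= t)   # real feature/target won
        if (1 + decoy_wins) / max(1, target_wins) <= alpha:
            return t
    return float("inf")


def select(w, alpha):
    """Indices of features whose score is at or above the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, alpha)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical competition scores for ten features.
scores = [5, 4, 3, 2.5, -1, 2, 1.5, -0.5, 0.2, 6]
selected = select(scores, alpha=0.2)
```

With these illustrative scores the threshold is 1.5 and seven features are selected. The estimated FDP at that threshold is 1/7, but the realized FDP in any single run may be higher, which is the gap that FDP-controlling procedures aim to close.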
Affiliation(s)
- Dong Luo
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
| | - Arya Ebadi
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
| | - Kristen Emery
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
| | - Yilun He
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
| | | | - Uri Keich
- School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
18
Dai R, Zheng C. False discovery rate-controlled multiple testing for union null hypotheses: a knockoff-based approach. Biometrics 2023; 79:3497-3509. [PMID: 36854821 PMCID: PMC10460825 DOI: 10.1111/biom.13848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2021] [Accepted: 02/17/2023] [Indexed: 03/02/2023]
Abstract
False discovery rate (FDR) controlling procedures provide important statistical guarantees for replicability in signal identification based on multiple hypotheses testing. In many fields of study, FDR controlling procedures are used in high-dimensional (HD) analyses to discover features that are truly associated with the outcome. In some recent applications, data on the same set of candidate features are independently collected in multiple different studies. For example, gene expression data are collected at different facilities and with different cohorts, to identify the genetic biomarkers of multiple types of cancers. These studies provide us with opportunities to identify signals by considering information from different sources (with potential heterogeneity) jointly. This paper is about how to provide FDR control guarantees for the tests of union null hypotheses of conditional independence. We present a knockoff-based variable selection method (Simultaneous knockoffs) to identify mutual signals from multiple independent datasets, providing exact FDR control guarantees under finite sample settings. This method can work with very general model settings and test statistics. We demonstrate the performance of this method with extensive numerical studies and two real-data examples.
Affiliation(s)
- Ran Dai
- Department of Biostatistics, University of Nebraska Medical Center, Omaha, Nebraska, U.S.A
19
Luo Y, Guo X. Inference on tree-structured subgroups with subgroup size and subgroup effect relationship in clinical trials. Stat Med 2023; 42:5039-5053. [PMID: 37732390 DOI: 10.1002/sim.9900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2023] [Revised: 08/18/2023] [Accepted: 09/01/2023] [Indexed: 09/22/2023]
Abstract
When multiple candidate subgroups are considered in clinical trials, we often need to make statistical inference on the subgroups simultaneously. Classical multiple testing procedures might not lead to an interpretable and efficient inference on the subgroups as they often fail to take subgroup size and subgroup effect relationship into account. In this paper, built on the selective traversed accumulation rules (STAR), we propose a data-adaptive and interactive multiple testing procedure for subgroups which can take subgroup size and subgroup effect relationship into account under prespecified tree structure. The proposed method is easy-to-implement and can lead to a more interpretable and efficient inference on prespecified tree-structured subgroups. Possible accommodations to post hoc identified tree-structure subgroups are also discussed in the paper. We demonstrate the merit of our proposed method by re-analyzing the panitumumab trial with the proposed method.
Affiliation(s)
- Yuanhui Luo
- Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong, People's Republic of China
| | - Xinzhou Guo
- Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong, People's Republic of China
20
Huang SH, Shedden K, Chang HW. Inference for the dimension of a regression relationship using pseudo-covariates. Biometrics 2023; 79:2394-2403. [PMID: 36511353 DOI: 10.1111/biom.13812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2022] [Accepted: 11/25/2022] [Indexed: 12/14/2022]
Abstract
In data analysis using dimension reduction methods, the main goal is to summarize how the response is related to the covariates through a few linear combinations. One key issue is to determine the number of independent, relevant covariate combinations, which is the dimension of the sufficient dimension reduction (SDR) subspace. In this work, we propose an easily applied approach to conduct inference for the dimension of the SDR subspace, based on augmentation of the covariate set with simulated pseudo-covariates. Applying the partitioning principle to the possible dimensions, we use rigorous sequential testing to select the dimensionality, by comparing the strength of the signal arising from the actual covariates to that appearing to arise from the pseudo-covariates. We show that under a "uniform direction" condition, our approach can be used in conjunction with several popular SDR methods, including sliced inverse regression. In these settings, the test statistic asymptotically follows a beta distribution and therefore is easily calibrated. Moreover, the family-wise type I error rate of our sequential testing is rigorously controlled. Simulation studies and an analysis of newborn anthropometric data demonstrate the robustness of the proposed approach, and indicate that the power is comparable to or greater than the alternatives.
Affiliation(s)
- Shih-Hao Huang
- Department of Mathematics, National Central University, Taoyuan, Taiwan
| | - Kerby Shedden
- Department of Statistics, University of Michigan, Ann Arbor, Michigan, USA
| | - Hsin-Wen Chang
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
21
Burger T. Controlling for false discoveries subsequently to large scale one-way ANOVA testing in proteomics: Practical considerations. Proteomics 2023; 23:e2200406. [PMID: 37357151 DOI: 10.1002/pmic.202200406] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/06/2022] [Revised: 05/30/2023] [Accepted: 05/31/2023] [Indexed: 06/27/2023]
Abstract
In discovery proteomics, as well as many other "omic" approaches, the possibility to test for the differential abundance of hundreds (or of thousands) of features simultaneously is appealing, despite requiring specific statistical safeguards, among which controlling for the false discovery rate (FDR) has become standard. Moreover, when more than two biological conditions or group treatments are considered, it has become customary to rely on the one-way analysis of variance (ANOVA) framework, where a first global differential abundance landscape provided by an omnibus test can be subsequently refined using various post-hoc tests (PHTs). However, the interactions between the FDR control procedures and the PHTs are complex, because both correspond to different types of multiple test corrections (MTCs). This article surveys various ways to orchestrate them in a data processing workflow and discusses their pros and cons.
Affiliation(s)
- Thomas Burger
- Univ. Grenoble Alpes, CNRS, CEA, INSERM, ProFI, EDyP, Grenoble, France
22
Li S, Yao Y, Zhang CH. Comments on "A Scale-Free Approach for False Discovery Rate Control in Generalized Linear Models". J Am Stat Assoc 2023; 118:1586-1589. [PMID: 38404948 PMCID: PMC10888134 DOI: 10.1080/01621459.2023.2224412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Accepted: 06/02/2023] [Indexed: 02/27/2024]
Affiliation(s)
- Sai Li
- Associate Professor, Institute of Statistics and Big Data, Renmin University of China, China
| | - Yisha Yao
- Postdoctoral Associate, Department of Biostatistics, Yale University, New Haven, Connecticut
| | - Cun-Hui Zhang
- Distinguished Professor, Department of Statistics, Rutgers University, Piscataway, NJ
23
Song D, Li K, Ge X, Li JJ. ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping. bioRxiv 2023:2023.07.21.550107. [PMID: 37546812 PMCID: PMC10401959 DOI: 10.1101/2023.07.21.550107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is used to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as "double dipping": the same data is used to define both cell clusters and DE genes, leading to false-positive DE genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE test for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality. The core idea of ClusterDE is to generate real-data-based synthetic null data with only one cluster, as a counterfactual in contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulation and real data analysis, we show that ClusterDE has not only solid FDR control but also the ability to find cell-type marker genes that are biologically meaningful. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Besides scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis.
Affiliation(s)
- Dongyuan Song
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA 90095-7246
| | - Kexin Li
- Department of Statistics, University of California, Los Angeles, CA 90095-1554
| | - Xinzhou Ge
- Department of Statistics, University of California, Los Angeles, CA 90095-1554
| | - Jingyi Jessica Li
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA 90095-7246
- Department of Statistics, University of California, Los Angeles, CA 90095-1554
- Department of Human Genetics, University of California, Los Angeles, CA 90095-7088
- Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766
- Department of Biostatistics, University of California, Los Angeles, CA 90095-1772
- Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA 02138
24
Sun N, Akay LA, Murdock MH, Park Y, Galiana-Melendez F, Bubnys A, Galani K, Mathys H, Jiang X, Ng AP, Bennett DA, Tsai LH, Kellis M. Single-nucleus multiregion transcriptomic analysis of brain vasculature in Alzheimer's disease. Nat Neurosci 2023; 26:970-982. [PMID: 37264161 PMCID: PMC10464935 DOI: 10.1038/s41593-023-01334-3] [Citation(s) in RCA: 26] [Impact Index Per Article: 26.0] [Received: 02/15/2022] [Accepted: 04/17/2023] [Indexed: 06/03/2023]
Abstract
Cerebrovascular dysregulation is a hallmark of Alzheimer's disease (AD), but the changes that occur in specific cell types have not been fully characterized. Here, we profile single-nucleus transcriptomes in the human cerebrovasculature in six brain regions from 220 individuals with AD and 208 age-matched controls. We annotate 22,514 cerebrovascular cells, including 11 subtypes of endothelial, pericyte, smooth muscle, perivascular fibroblast and ependymal cells. We identify 2,676 differentially expressed genes in AD, including downregulation of PDGFRB in pericytes, and of ABCB1 and ATP10A in endothelial cells, and validate the downregulation of SLC6A1 and upregulation of APOD, INSR and COL4A1 in postmortem AD brain tissues. We detect vasculature, glial and neuronal coexpressed gene modules, suggesting coordinated neurovascular unit dysregulation in AD. Integration with AD genetics reveals 125 AD differentially expressed genes directly linked to AD-associated genetic variants. Lastly, we show that APOE4 genotype-associated differences are significantly enriched among AD-associated genes in capillary and venule endothelial cells, as well as subsets of pericytes and fibroblasts.
Affiliation(s)
- Na Sun
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Leyla Anne Akay
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Mitchell H Murdock
- Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Yongjin Park
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Pathology and Laboratory Medicine, Department of Statistics, University of British Columbia, Vancouver, British Columbia, Canada
- Department of Molecular Oncology, BC Cancer, Vancouver, British Columbia, Canada
| | - Fabiola Galiana-Melendez
- Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Adele Bubnys
- Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Kyriaki Galani
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Hansruedi Mathys
- Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Neurobiology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
| | - Xueqiao Jiang
- Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Ayesha P Ng
- Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - David A Bennett
- Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, USA
| | - Li-Huei Tsai
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA.
| | - Manolis Kellis
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
25
Wai Tsang K, Tsung F, Xu Z. Knockoff procedure for false discovery rate control in high-dimensional data streams. J Appl Stat 2023; 50:2970-2983. [PMID: 37808615 PMCID: PMC10557548 DOI: 10.1080/02664763.2023.2200496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Accepted: 04/03/2023] [Indexed: 10/10/2023]
Abstract
Motivated by applications to root-cause identification of faults in high-dimensional data streams that may have very limited samples after faults are detected, we consider multiple testing in models for multivariate statistical process control (SPC). With quick fault detection, it can be assumed that only a small portion of the data streams are out of control (OC). It is a long-standing problem to identify those OC data streams while controlling the number of false discoveries. It is challenging due to the limited number of OC samples after the termination of the process when faults are detected. Although several false discovery rate (FDR) controlling methods have been proposed, people may prefer other methods for quick detection. With a recently developed method called Knockoff filtering, we propose a knockoff procedure that can combine with other fault detection methods in the sense that the knockoff procedure does not change the stopping time, but may identify another set of faults to control FDR. A theorem for the FDR control of the proposed procedure is provided. Simulation studies show that the proposed procedure can control FDR while maintaining high power. We also illustrate the performance in an application to semiconductor manufacturing processes that motivated this development.
Affiliation(s)
- Ka Wai Tsang
- School of Data Science, The Chinese University of Hong Kong, Shenzhen, Guangdong 518172, People's Republic of China
| | - Fugee Tsung
- Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology, Hong Kong
| | - Zhihao Xu
- Department of Statistics, University of Michigan, Ann Arbor, MI, USA
26
Chen S, Li Z, Liu L, Wen Y. The systematic comparison between Gaussian mirror and Model-X knockoff models. Sci Rep 2023; 13:5478. [PMID: 37015993 PMCID: PMC10073103 DOI: 10.1038/s41598-023-32605-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2022] [Accepted: 03/30/2023] [Indexed: 04/06/2023]
Abstract
While the high-dimensional biological data have provided unprecedented data resources for the identification of biomarkers, consensus is still lacking on how to best analyze them. The recently developed Gaussian mirror (GM) and Model-X (MX) knockoff-based methods have much related model assumptions, which makes them appealing for the detection of new biomarkers. However, there are no guidelines for their practical use. In this research, we systematically compared the performance of MX-based and GM methods, where the impacts of the distribution of explanatory variables, their relatedness and the signal-to-noise ratio were evaluated. MX with knockoff generated using the second-order approximates (MX-SO) has the best performance as compared to other MX-based methods. MX-SO and GM have similar levels of power and computational speed under most of the simulations, but GM is more robust in the control of false discovery rate (FDR). In particular, MX-SO can only control the FDR well when there are weak correlations among explanatory variables and the sample size is at least moderate. On the contrary, GM can have the desired FDR as long as explanatory variables are not highly correlated. We further used GM and MX-based methods to detect biomarkers that are associated with the Alzheimer's disease-related PET-imaging trait and the Parkinson's disease-related T-tau of cerebrospinal fluid. We found that MX-based and GM methods are both powerful for the analysis of big biological data. Although genes selected from MX-based methods are more similar as compared to those from the GM method, both MX-based and GM methods can identify the well-known disease-associated genes for each disease. While MX-based methods can have a slightly higher power than that of the GM method, it is less robust, especially for data with small sample sizes, unknown distributions, and high correlations.
Affiliation(s)
- Shuai Chen
- Department of Health Statistics, School of Public Health, Shanxi Medical University, No 56 Xinjian South Road, Yingze District, Taiyuan, Shanxi Province, China
| | - Ziqi Li
- Department of Health Statistics, School of Public Health, Shanxi Medical University, No 56 Xinjian South Road, Yingze District, Taiyuan, Shanxi Province, China
| | - Long Liu
- Department of Health Statistics, School of Public Health, Shanxi Medical University, No 56 Xinjian South Road, Yingze District, Taiyuan, Shanxi Province, China.
| | - Yalu Wen
- Department of Health Statistics, School of Public Health, Shanxi Medical University, No 56 Xinjian South Road, Yingze District, Taiyuan, Shanxi Province, China.
- Department of Statistics, University of Auckland, 38 Princes Street, Auckland Central, Auckland, New Zealand, 1010.
27
Chu BB, Ko S, Zhou JJ, Jensen A, Zhou H, Sinsheimer JS, Lange K. Multivariate genome-wide association analysis by iterative hard thresholding. Bioinformatics 2023; 39:btad193. [PMID: 37067496 PMCID: PMC10133532 DOI: 10.1093/bioinformatics/btad193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 04/07/2023] [Accepted: 04/13/2023] [Indexed: 04/18/2023]
Abstract
MOTIVATION In a genome-wide association study, analyzing multiple correlated traits simultaneously is potentially superior to analyzing the traits one by one. Standard methods for multivariate genome-wide association study operate marker-by-marker and are computationally intensive. RESULTS We present a sparsity constrained regression algorithm for multivariate genome-wide association study based on iterative hard thresholding and implement it in a convenient Julia package MendelIHT.jl. In simulation studies with up to 100 quantitative traits, iterative hard thresholding exhibits similar true positive rates, smaller false positive rates, and faster execution times than GEMMA's linear mixed models and mv-PLINK's canonical correlation analysis. On UK Biobank data with 470 228 variants, MendelIHT completed a three-trait joint analysis (n=185 656) in 20 h and an 18-trait joint analysis (n=104 264) in 53 h with an 80 GB memory footprint. In short, MendelIHT enables geneticists to fit a single regression model that simultaneously considers the effect of all SNPs and dozens of traits. AVAILABILITY AND IMPLEMENTATION Software, documentation, and scripts to reproduce our results are available from https://github.com/OpenMendel/MendelIHT.jl.
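As a rough illustration of the iterative hard thresholding idea named in this abstract, the following is a minimal single-response least-squares sketch in Python with NumPy. It is not the MendelIHT.jl implementation (which is written in Julia and handles multiple traits); the function names `hard_threshold` and `iht`, the fixed step size, and the iteration count are assumptions chosen for the example.

```python
import numpy as np

def hard_threshold(beta, k):
    """Keep the k largest-magnitude entries of beta and zero out the rest."""
    out = np.zeros_like(beta)
    if k > 0:
        keep = np.argsort(np.abs(beta))[-k:]
        out[keep] = beta[keep]
    return out

def iht(X, y, k, n_iter=200):
    """Projected gradient descent for least squares under a k-sparsity constraint.

    Illustrative sketch only: a fixed step of 1 / ||X||_2^2 is assumed,
    which is a conservative choice rather than a tuned one.
    """
    n, p = X.shape
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # inverse of largest eigenvalue of X^T X
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)          # gradient of 0.5 * ||y - X beta||^2
        beta = hard_threshold(beta - step * grad, k)
    return beta
```

Each iteration takes a gradient step on the full model and then projects back onto the set of k-sparse vectors, which is how a single regression can consider all predictors jointly while still returning a sparse fit.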
Affiliation(s)
- Benjamin B Chu
- Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
| | - Seyoon Ko
- Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
- Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
| | - Jin J Zhou
- Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
- Department of Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
| | - Aubrey Jensen
- Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
| | - Hua Zhou
- Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
- Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
| | - Janet S Sinsheimer
- Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
- Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
- Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
| | - Kenneth Lange
- Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
- Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
- Department of Statistics at UCLA, Los Angeles, CA 90095-1554, United States
|
28
|
Rajchert A, Keich U. Controlling the false discovery rate via competition: Is the +1 needed? Stat Probab Lett 2023. [DOI: 10.1016/j.spl.2023.109819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2023]
|
29
|
Päll T, Luidalepp H, Tenson T, Maiväli Ü. A field-wide assessment of differential expression profiling by high-throughput sequencing reveals widespread bias. PLoS Biol 2023; 21:e3002007. [PMID: 36862747 PMCID: PMC10013925 DOI: 10.1371/journal.pbio.3002007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2022] [Revised: 03/14/2023] [Accepted: 01/20/2023] [Indexed: 03/03/2023] Open
Abstract
We assess inferential quality in the field of differential expression profiling by high-throughput sequencing (HT-seq) based on analysis of datasets submitted from 2008 to 2020 to the NCBI GEO data repository. We take advantage of the parallel differential expression testing over thousands of genes, whereby each experiment leads to a large set of p-values, the distribution of which can indicate the validity of assumptions behind the test. From a well-behaved p-value set, π0, the fraction of genes that are not differentially expressed, can be estimated. We found that only 25% of experiments resulted in theoretically expected p-value histogram shapes, although there is a marked improvement over time. Uniform p-value histogram shapes, indicative of <100 actual effects, were extremely few. Furthermore, although many HT-seq workflows assume that most genes are not differentially expressed, 37% of experiments have π0-s of less than 0.5, as if most genes changed their expression level. Most HT-seq experiments have very small sample sizes and are expected to be underpowered. Nevertheless, the estimated π0-s do not have the expected association with N, suggesting widespread problems with controlling the false discovery rate (FDR) in experiments. Both the fractions of different p-value histogram types and the π0 values are strongly associated with the differential expression analysis program used by the original authors. While we could double the proportion of theoretically expected p-value distributions by removing low-count features from the analysis, this treatment did not remove the association with the analysis program. Taken together, our results indicate widespread bias in the differential expression profiling field and the unreliability of statistical methods used to analyze HT-seq data.
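To make the π0 idea concrete, here is a minimal Python sketch of one common estimator (a Storey-type estimator with a single tuning value λ). The abstract does not say which estimator the authors used, so the choice of estimator, the function name `estimate_pi0`, and the default `lam=0.5` are assumptions for illustration.

```python
import numpy as np

def estimate_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the fraction of true null hypotheses.

    Under the null, p-values are uniform on [0, 1], so the share of
    p-values above lam, rescaled by 1 / (1 - lam), approximates pi0.
    The result is capped at 1 because pi0 is a proportion.
    """
    p = np.asarray(pvalues, dtype=float)
    pi0 = np.mean(p > lam) / (1.0 - lam)
    return min(1.0, float(pi0))
```

The estimator only makes sense when the right-hand part of the p-value histogram is roughly flat, which is why the shape of the histogram matters before π0 is interpreted.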
Affiliation(s)
- Taavi Päll
- Institute of Biomedicine and Translational Medicine, University of Tartu, Tartu, Estonia
| | | | - Tanel Tenson
- Institute of Technology, University of Tartu, Tartu, Estonia
| | - Ülo Maiväli
- Institute of Technology, University of Tartu, Tartu, Estonia
|
30
|
Liu Z, Zheng J, Pan Y. Doubly robust estimation for non‐probability samples with modified intertwined probabilistic factors decoupling. Stat Anal Data Min 2023. [DOI: 10.1002/sam.11614] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2023]
Affiliation(s)
- Zhan Liu
- Hubei Key Laboratory of Applied Mathematics, School of Mathematics and Statistics Hubei University Wuhan China
| | - Junbo Zheng
- Hubei Key Laboratory of Applied Mathematics, School of Mathematics and Statistics Hubei University Wuhan China
| | - Yingli Pan
- Hubei Key Laboratory of Applied Mathematics, School of Mathematics and Statistics Hubei University Wuhan China
|
31
|
Liu YY. Controlling the human microbiome. Cell Syst 2023; 14:135-159. [PMID: 36796332 PMCID: PMC9942095 DOI: 10.1016/j.cels.2022.12.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2022] [Revised: 10/18/2022] [Accepted: 12/21/2022] [Indexed: 02/17/2023]
Abstract
We coexist with a vast number of microbes that live in and on our bodies. Those microbes and their genes are collectively known as the human microbiome, which plays important roles in human physiology and diseases. We have acquired extensive knowledge of the organismal compositions and metabolic functions of the human microbiome. However, the ultimate proof of our understanding of the human microbiome is reflected in our ability to manipulate it for health benefits. To facilitate the rational design of microbiome-based therapies, there are many fundamental questions to be addressed at the systems level. Indeed, we need a deep understanding of the ecological dynamics associated with such a complex ecosystem before we rationally design control strategies. In light of this, this review discusses progress from various fields, e.g., community ecology, network science, and control theory, that are helping us make progress toward the ultimate goal of controlling the human microbiome.
Affiliation(s)
- Yang-Yu Liu
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA; Center for Artificial Intelligence and Modeling, The Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Champaign, IL 61801, USA.
|
32
|
Learning to increase the power of conditional randomization tests. Mach Learn 2023. [DOI: 10.1007/s10994-023-06302-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
|
33
|
Wang X, Pennello G, deSouza NM, Huang EP, Buckler AJ, Barnhart HX, Delfino JG, Raunig DL, Wang L, Guimaraes AR, Hall TJ, Obuchowski NA. Multiparametric Data-driven Imaging Markers: Guidelines for Development, Application and Reporting of Model Outputs in Radiomics. Acad Radiol 2023; 30:215-229. [PMID: 36411153 PMCID: PMC9825652 DOI: 10.1016/j.acra.2022.10.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 06/02/2022] [Revised: 09/21/2022] [Accepted: 10/01/2022] [Indexed: 11/19/2022]
Abstract
This paper is the fifth in a five-part series on statistical methodology for performance assessment of multi-parametric quantitative imaging biomarkers (mpQIBs) for radiomic analysis. Radiomics is the process of extracting visually imperceptible features from radiographic medical images using data-driven algorithms. We refer to the radiomic features as data-driven imaging markers (DIMs), which are quantitative measures discovered under a data-driven framework from images beyond visual recognition but evident as patterns of disease processes irrespective of whether or not ground truth exists for the true value of the DIM. This paper aims to set guidelines on how to build machine learning models using DIMs in radiomics and to apply and report them appropriately. We provide a list of recommendations, named RANDAM (an abbreviation of "Radiomic ANalysis and DAta Modeling"), for analysis, modeling, and reporting in a radiomic study to make machine learning analyses in radiomics more reproducible. RANDAM contains five main components to use in reporting radiomics studies: design, data preparation, data analysis and modeling, reporting, and material availability. Real case studies in lung cancer research are presented along with simulation studies to compare different feature selection methods and several validation strategies.
Affiliation(s)
- Xiaofeng Wang
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, 9500 Euclid Ave/JJN3, Cleveland, OH 44195.
| | - Gene Pennello
- Center for Devices and Radiological Health, US Food and Drug Administration Division of Imaging, Diagnostic and Software Reliability, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, Maryland
| | - Nandita M deSouza
- Division of Radiotherapy and Imaging, The Institute of Cancer Research and Royal Marsden Hospital, London, United Kingdom; European Imaging Biomarkers Alliance, European Society of Radiology, London, UK
| | - Erich P Huang
- Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Bethesda, Maryland
| | | | - Huiman X Barnhart
- Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina
| | - Jana G Delfino
- Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, Maryland
| | - David L Raunig
- Data Science Institute, Statistical and Quantitative Sciences, Takeda Pharmaceuticals America Inc, Lexington, Massachusetts
| | - Lu Wang
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, 9500 Euclid Ave/JJN3, Cleveland, OH 44195
| | - Alexander R Guimaraes
- Department of Diagnostic Radiology, Oregon Health & Sciences University, Portland, Oregon
| | - Timothy J Hall
- Department of Medical Physics, University of Wisconsin, Madison, Wisconsin
| | - Nancy A Obuchowski
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, 9500 Euclid Ave/JJN3, Cleveland, OH 44195
|
34
|
Colborn KL, Zhuang Y, Dyas AR, Henderson WG, Madsen HJ, Bronsert MR, Matheny ME, Lambert-Kerzner A, Myers QWO, Meguid RA. Development and validation of models for detection of postoperative infections using structured electronic health records data and machine learning. Surgery 2023; 173:464-471. [PMID: 36470694 PMCID: PMC10204069 DOI: 10.1016/j.surg.2022.10.026] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 07/05/2022] [Revised: 10/18/2022] [Accepted: 10/26/2022] [Indexed: 12/04/2022]
Abstract
BACKGROUND Postoperative infections constitute more than half of all postoperative complications. Surveillance of these complications is primarily done through manual chart review, which is time consuming, expensive, and typically only covers 10% to 15% of all operations. Automated surveillance would permit the timely evaluation of and reporting of all operations. METHODS The goal of this study was to develop and validate parsimonious, interpretable models for conducting surveillance of postoperative infections using structured electronic health records data. This was a retrospective study using 30,639 unique operations from 5 major hospitals between 2013 and 2019. Structured electronic health records data were linked to postoperative outcomes data from the American College of Surgeons National Surgical Quality Improvement Program. Predictors from the electronic health records included diagnoses, procedures, and medications. Infectious complications included surgical site infection, urinary tract infection, sepsis, and pneumonia within 30 days of surgery. The knockoff filter, a penalized regression technique that controls type I error, was applied for variable selection. Models were validated in a chronological held-out dataset. RESULTS Seven percent of patients experienced at least one type of postoperative infection. Models selected contained between 4 and 8 variables and achieved >0.91 area under the receiver operating characteristic curve, >81% specificity, >87% sensitivity, >99% negative predictive value, and 10% to 15% positive predictive value in a held-out test dataset. CONCLUSION Surveillance and reporting of postoperative infection rates can be implemented for all operations with high accuracy using electronic health records data and simple linear regression models.
Affiliation(s)
- Kathryn L Colborn
- Department of Surgery, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO; Surgical Outcomes and Applied Research Program, Department of Surgery, University of Colorado Anschutz Medical Campus, Aurora, CO; Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO; Adult and Child Consortium for Health Outcomes Research and Delivery Science, University of Colorado Anschutz Medical Campus, Aurora, CO.
| | - Yaxu Zhuang
- Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO
| | - Adam R Dyas
- Department of Surgery, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO; Surgical Outcomes and Applied Research Program, Department of Surgery, University of Colorado Anschutz Medical Campus, Aurora, CO
| | - William G Henderson
- Surgical Outcomes and Applied Research Program, Department of Surgery, University of Colorado Anschutz Medical Campus, Aurora, CO; Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO
| | - Helen J Madsen
- Department of Surgery, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO; Surgical Outcomes and Applied Research Program, Department of Surgery, University of Colorado Anschutz Medical Campus, Aurora, CO
| | - Michael R Bronsert
- Surgical Outcomes and Applied Research Program, Department of Surgery, University of Colorado Anschutz Medical Campus, Aurora, CO; Adult and Child Consortium for Health Outcomes Research and Delivery Science, University of Colorado Anschutz Medical Campus, Aurora, CO
| | - Michael E Matheny
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN; Division of General Internal Medicine, Vanderbilt University Medical Center, Nashville, TN
| | - Anne Lambert-Kerzner
- Surgical Outcomes and Applied Research Program, Department of Surgery, University of Colorado Anschutz Medical Campus, Aurora, CO
| | - Quintin W O Myers
- Department of Surgery, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO; Surgical Outcomes and Applied Research Program, Department of Surgery, University of Colorado Anschutz Medical Campus, Aurora, CO
| | - Robert A Meguid
- Department of Surgery, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO; Surgical Outcomes and Applied Research Program, Department of Surgery, University of Colorado Anschutz Medical Campus, Aurora, CO; Adult and Child Consortium for Health Outcomes Research and Delivery Science, University of Colorado Anschutz Medical Campus, Aurora, CO
|
35
|
Cui J, Wang G, Zou C, Wang Z. Change-point testing for parallel data sets with FDR control. Comput Stat Data Anal 2023. [DOI: 10.1016/j.csda.2023.107705] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
|
36
|
Explaining classifiers with measures of statistical association. Comput Stat Data Anal 2023. [DOI: 10.1016/j.csda.2023.107701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
|
37
|
Cheng X, Wang H. A generic model-free feature screening procedure for ultra-high dimensional data with categorical response. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2023; 229:107269. [PMID: 36463676 DOI: 10.1016/j.cmpb.2022.107269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 11/22/2022] [Accepted: 11/23/2022] [Indexed: 06/17/2023]
Abstract
BACKGROUND AND OBJECTIVE Identifying active features from ultra-high dimensional data is one of the primary and vital tasks in statistical learning and biological discovery. METHODS In this paper, we develop a generic concordance index screening (CI-SIS) procedure to handle ultra-high dimensional data with categorical response. The proposed procedure is model-free and nonparametric, based on the concordance index measure. It enjoys both sure screening and ranking consistency properties under some relatively weak assumptions. We investigate the flexibility of this procedure by considering some commonly encountered challenging settings in biomedical studies, such as category-adaptive data and extremely unbalanced response distributions. A data-driven threshold selection procedure via knockoff features is also presented. RESULTS On the real lung dataset, our method achieves a lower prediction error, with a mean error of 0.107 with linear discriminant analysis (LDA) and 0.117 with random forest (RF). In addition, we obtain an accuracy improvement of 3% with LDA and 5% with RF compared to the runner-up method. On the more challenging real SRBCT (small round blue cell tumours) dataset, CI-SIS brings about a marked performance improvement, which is at least 8% higher than all other competing methods. CONCLUSION Experimental results show that the proposed method can efficiently identify genes that are associated with certain types of diseases. Therefore, the surviving features (those remaining after irrelevant features are filtered out) selected by our procedure can help doctors make precise diagnoses and refine the treatment of patients.
Affiliation(s)
- Xuewei Cheng
- School of Mathematics and Statistics, Central South University, Changsha, China; Department of Statistics and Data Science, National University of Singapore, Singapore.
| | - Hong Wang
- School of Mathematics and Statistics, Central South University, Changsha, China.
|
38
|
Somanchi S, Abbasi A, Kelley K, Dobolyi D, Yuan TT. Examining User Heterogeneity in Digital Experiments. ACM T INFORM SYST 2023. [DOI: 10.1145/3578931] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/13/2023]
Abstract
Digital experiments are routinely used to test the value of a treatment relative to a status quo control setting — for instance, a new search relevance algorithm for a website or a new results layout for a mobile app. As digital experiments have become increasingly pervasive in organizations and a wide variety of research areas, their growth has prompted a new set of challenges for experimentation platforms. One challenge is that experiments often focus on the average treatment effect (ATE) without explicitly considering differences across major sub-groups — heterogeneous treatment effect (HTE). This is especially problematic because ATEs have decreased in many organizations as the more obvious benefits have already been realized. However, questions abound regarding the pervasiveness of user HTEs and how best to detect them. We propose a framework for detecting and analyzing user HTEs in digital experiments. Our framework combines an array of user characteristics with double machine learning. Analysis of 27 real-world experiments spanning 1.76 billion sessions and simulated data demonstrates the effectiveness of our detection method relative to existing techniques. We also find that transaction, demographic, engagement, satisfaction, and lifecycle characteristics exhibit statistically significant HTEs in 10% to 20% of our real-world experiments, underscoring the importance of considering user heterogeneity when analyzing experiment results, otherwise personalized features and experiences cannot happen, thus reducing effectiveness. In terms of the number of experiments and user sessions, we are not aware of any study that has examined user HTEs at this scale. Our findings have important implications for information retrieval, user modeling, platforms, and digital experience contexts, in which online experiments are often used to evaluate the effectiveness of design artifacts.
|
39
|
Spooner A, Mohammadi G, Sachdev PS, Brodaty H, Sowmya A. Ensemble feature selection with data-driven thresholding for Alzheimer's disease biomarker discovery. BMC Bioinformatics 2023; 24:9. [PMID: 36624372 PMCID: PMC9830744 DOI: 10.1186/s12859-022-05132-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2022] [Accepted: 12/30/2022] [Indexed: 01/11/2023] Open
Abstract
BACKGROUND Feature selection is often used to identify the important features in a dataset but can produce unstable results when applied to high-dimensional data. The stability of feature selection can be improved with the use of feature selection ensembles, which aggregate the results of multiple base feature selectors. However, a threshold must be applied to the final aggregated feature set to separate the relevant features from the redundant ones. A fixed threshold, which is typically used, offers no guarantee that the final set of selected features contains only relevant features. This work examines a selection of data-driven thresholds to automatically identify the relevant features in an ensemble feature selector and evaluates their predictive accuracy and stability. Ensemble feature selection with data-driven thresholding is applied to two real-world studies of Alzheimer's disease. Alzheimer's disease is a progressive neurodegenerative disease with no known cure, that begins at least 2-3 decades before overt symptoms appear, presenting an opportunity for researchers to identify early biomarkers that might identify patients at risk of developing Alzheimer's disease. RESULTS The ensemble feature selectors, combined with data-driven thresholds, produced more stable results, on the whole, than the equivalent individual feature selectors, showing an improvement in stability of up to 34%. The most successful data-driven thresholds were the robust rank aggregation threshold and the threshold algorithm threshold from the field of information retrieval. The features identified by applying these methods to datasets from Alzheimer's disease studies reflect current findings in the AD literature. CONCLUSIONS Data-driven thresholds applied to ensemble feature selectors provide more stable, and therefore more reproducible, selections of features than individual feature selectors, without loss of performance. 
The use of a data-driven threshold eliminates the need to choose a fixed threshold a-priori and can select a more meaningful set of features. A reliable and compact set of features can produce more interpretable models by identifying the factors that are important in understanding a disease.
Affiliation(s)
- Annette Spooner
- School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
| | - Gelareh Mohammadi
- School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
| | - Perminder S. Sachdev
- Centre for Healthy Brain Ageing (CHeBA), Discipline of Psychiatry & Mental Health, School of Clinical Medicine, University of New South Wales, Sydney, Australia
| | - Henry Brodaty
- Centre for Healthy Brain Ageing (CHeBA), Discipline of Psychiatry & Mental Health, School of Clinical Medicine, University of New South Wales, Sydney, Australia
| | - Arcot Sowmya
- School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
|
40
|
Dai C, Lin B, Xing X, Liu JS. A Scale-free Approach for False Discovery Rate Control in Generalized Linear Models. J Am Stat Assoc 2023. [DOI: 10.1080/01621459.2023.2165930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2023]
Affiliation(s)
| | - Buyu Lin
- Department of Statistics, Harvard University
| | - Xin Xing
- Department of Statistics, Virginia Tech
| | - Jun S. Liu
- Department of Statistics, Harvard University
|
41
|
Etourneau L, Varoquaux N, Burger T. Unveiling the Links Between Peptide Identification and Differential Analysis FDR Controls by Means of a Practical Introduction to Knockoff Filters. Methods Mol Biol 2023; 2426:1-24. [PMID: 36308682 DOI: 10.1007/978-1-0716-1967-4_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
In proteomic differential analysis, FDR control is often performed through a multiple test correction (i.e., the adjustment of the original p-values). In this protocol, we apply a recent alternative method based on so-called knockoff filters. It shares interesting conceptual similarities with the target-decoy competition procedure, classically used in proteomics for FDR control at peptide identification. To provide practitioners with a unified understanding of FDR control in proteomics, we apply the knockoff procedure to real and simulated quantitative datasets. Leveraging these comparisons, we propose to adapt the knockoff procedure to better fit the specificities of quantitative proteomic data (mainly very few samples). The performance of the knockoff procedure is compared with that of the classical Benjamini-Hochberg procedure, thereby shedding new light on the strengths and weaknesses of target-decoy competition.
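The competition-based selection rule that knockoff filters and target-decoy competition have in common can be sketched as follows, in Python with NumPy. This is a generic illustration of the standard knockoff threshold, not the adapted procedure proposed in the chapter; the names `knockoff_threshold` and `knockoff_select` are hypothetical, and the statistics `W` are assumed to be already computed so that large positive values favour the original feature and negative values favour its knockoff (or decoy).

```python
import numpy as np

def knockoff_threshold(W, q=0.1, offset=1):
    """Smallest t with (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    offset=1 gives the conservative 'knockoff+' rule; offset=0 drops the +1.
    Returns np.inf when no threshold meets the target level q.
    """
    W = np.asarray(W, dtype=float)
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

def knockoff_select(W, q=0.1, offset=1):
    """Indices of features whose statistic reaches the data-driven threshold."""
    t = knockoff_threshold(W, q, offset)
    return np.flatnonzero(np.asarray(W, dtype=float) >= t)
```

The count of large negative statistics plays the role of decoy wins: it estimates how many of the large positive statistics are false discoveries. With very few features or samples the `+1` in the numerator can make every threshold fail, which is one reason the small-sample setting is singled out in the abstract.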
Affiliation(s)
- Lucas Etourneau
- Univ. Grenoble Alpes, CEA, INSERM, BioSanté U1292, Grenoble, France.
- Univ. Grenoble Alpes, CNRS, TIMC, Grenoble, France.
| | | | - Thomas Burger
- Univ. Grenoble Alpes, CNRS, CEA, INSERM, Grenoble, France
|
42
|
Hasam S, Emery K, Noble WS, Keich U. A Pipeline for Peptide Detection Using Multiple Decoys. Methods Mol Biol 2023; 2426:25-34. [PMID: 36308683 DOI: 10.1007/978-1-0716-1967-4_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Target-decoy competition has been commonly used for over a decade to control the false discovery rate when analyzing tandem mass spectrometry (MS/MS) data. We recently developed a framework that uses multiple decoys to increase the number of detected peptides in MS/MS data. Here, we present a pipeline of Apache licensed, open-source software that allows the user to readily take advantage of our framework.
Affiliation(s)
| | | | | | - Uri Keich
- University of Sydney, Sydney, NSW, Australia.
|
43
|
Park M, Lee J, Baek C. Controlling the false discovery rate in sparse VHAR models using knockoffs. KOREAN JOURNAL OF APPLIED STATISTICS 2022. [DOI: 10.5351/kjas.2022.35.6.685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
Affiliation(s)
- Minsu Park
- Department of Statistics, Sungkyunkwan University
| | - Jaewon Lee
- Department of Statistics, Sungkyunkwan University
|
44
|
Vreš D, Robnik-Šikonja M. Preventing deception with explanation methods using focused sampling. Data Min Knowl Discov 2022. [DOI: 10.1007/s10618-022-00900-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/13/2022]
|
45
|
Etourneau L, Burger T. Challenging Targets or Describing Mismatches? A Comment on Common Decoy Distribution by Madej et al. J Proteome Res 2022; 21:2840-2845. [PMID: 36305797 DOI: 10.1021/acs.jproteome.2c00279] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023]
Abstract
In their recent article, Madej et al. (Madej, D.; Wu, L.; Lam, H.Common Decoy Distributions Simplify False Discovery Rate Estimation in Shotgun Proteomics. J. Proteome Res.2022, 21 (2), 339-348) proposed an original way to solve the recurrent issue of controlling for the false discovery rate (FDR) in peptide-spectrum-match (PSM) validation. Briefly, they proposed to derive a single precise distribution of decoy matches termed the Common Decoy Distribution (CDD) and to use it to control for FDR during a target-only search. Conceptually, this approach is appealing as it takes the best of two worlds, i.e., decoy-based approaches (which leverage a large-scale collection of empirical mismatches) and decoy-free approaches (which are not subject to the randomness of decoy generation while sparing an additional database search). Interestingly, CDD also corresponds to a middle-of-the-road approach in statistics with respect to the two main families of FDR control procedures: Although historically based on estimating the false-positive distribution, FDR control has recently been demonstrated to be possible thanks to competition between the original variables (in proteomics, target sequences) and their fictional counterparts (in proteomics, decoys). Discriminating between these two theoretical trends is of prime importance for computational proteomics. In addition to highlighting why proteomics was a source of inspiration for theoretical biostatistics, it provides practical insights into the improvements that can be made to FDR control methods used in proteomics, including CDD.
Affiliation(s)
- Lucas Etourneau
- Univ. Grenoble Alpes, CNRS, CEA, Inserm, ProFI, FR2048, Grenoble, France
- Thomas Burger
- Univ. Grenoble Alpes, CNRS, CEA, Inserm, ProFI, FR2048, Grenoble, France

46
Fithian W, Lei L. Conditional calibration for false discovery rate control under dependence. Ann Stat 2022. [DOI: 10.1214/21-aos2137] [Indexed: 12/24/2022]
Affiliation(s)
- William Fithian
- Department of Statistics, University of California, Berkeley
- Lihua Lei
- Department of Statistics, Stanford University

47
Yu G, Witten D, Bien J. Controlling costs: Feature selection on a budget. Stat (Int Stat Inst) 2022; 11:e427. [PMID: 38250253 PMCID: PMC10798788 DOI: 10.1002/sta4.427] [Received: 08/05/2021] [Accepted: 09/28/2021] [Indexed: 11/10/2022]
Abstract
The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost-conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.
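The weighted false discovery proportion described in this abstract, the fraction of selected feature cost spent on unimportant features, can be made concrete with a small sketch. The function name `weighted_fdp`, the dictionary of costs, and the example feature names are hypothetical. The code only evaluates the quantity on a known ground truth and does not implement the cheap knockoffs procedure or its bound.

```python
def weighted_fdp(selected, costs, important):
    """Fraction of the selected features' total cost that is wasted.

    selected  : iterable of selected feature names
    costs     : dict mapping feature name -> non-negative cost
    important : set of truly important feature names (known only in simulation)
    """
    selected = list(selected)
    total_cost = sum(costs[j] for j in selected)
    if total_cost == 0:
        return 0.0  # nothing spent, so nothing wasted (convention assumed here)
    wasted_cost = sum(costs[j] for j in selected if j not in important)
    return wasted_cost / total_cost


if __name__ == "__main__":
    # Hypothetical example: one expensive unimportant feature is selected.
    costs = {"age": 1.0, "blood_panel": 4.0, "mri_scan": 5.0}
    selected = ["age", "blood_panel", "mri_scan"]
    important = {"age", "blood_panel"}
    print(weighted_fdp(selected, costs, important))  # 0.5
```

In this example the unweighted false discovery proportion is 1/3, while the weighted version is 0.5 because the wrongly selected feature carries half of the total cost.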
Affiliation(s)
- Guo Yu
- Department of Statistics and Applied Probability, University of California Santa Barbara, Santa Barbara, California, USA
- Daniela Witten
- Department of Statistics and Biostatistics, University of Washington, Seattle, Washington, USA
- Jacob Bien
- Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, California, USA

48
Dyas AR, Zhuang Y, Meguid RA, Henderson WG, Madsen HJ, Bronsert MR, Colborn KL. Development and validation of a model for surveillance of postoperative bleeding complications using structured electronic health records data. Surgery 2022; 172:1728-1732. [PMID: 36150923 PMCID: PMC10204070 DOI: 10.1016/j.surg.2022.08.021] [Received: 02/23/2022] [Revised: 07/01/2022] [Accepted: 08/22/2022] [Indexed: 01/28/2023]
Abstract
BACKGROUND: Postoperative bleeding complications surveillance is done primarily through manual chart review. The purpose of this study was to develop and validate a detection model for postoperative bleeding complications using structured electronic health records data.
METHODS: Patients who underwent operations at 1 of 5 hospitals within our local health system between 2013 and 2019 and whose complications were reported by the American College of Surgeons National Surgical Quality Improvement Program were included. Electronic health records data were linked to American College of Surgeons National Surgical Quality Improvement Program data using personal health identifiers. Electronic health records predictors included diagnosis codes mapped to PheCodes, procedure names, and medications within 30 days after surgery. We defined bleeding events as the transfusion of red blood cell components within 30 days after surgery. The knockoff filter and the lasso were used to develop a model in a training set of operations from January 2013 to March 2017. Performance of each model was tested in a held-out data set of patients who underwent operations from March 2017 to October 2019.
RESULTS: A total of 30,639 patients were included; 1,112 patients (3.6%) had a bleeding event. Eight predictor variables were selected by the knockoff filter. When applied to the test set, specificity was 94%, sensitivity was 94%, area under the curve was 0.97, and accuracy was 93%. Calibration was consistent in lower predicted risk patients, whereas the model slightly overpredicted risk in high-risk patients.
CONCLUSION: We created a parsimonious, accurate model for identifying patients with bleeding complications. This model can be used to augment manual chart review for surveillance and reporting of perioperative bleeding complications, enabling inclusion of all surgeries in quality improvement efforts.
Affiliation(s)
- Adam R Dyas
- Department of Surgery, University of Colorado School of Medicine, Aurora, CO; Surgical Outcomes and Applied Research Program, University of Colorado School of Medicine, Aurora, CO
- Yaxu Zhuang
- Department of Surgery, University of Colorado School of Medicine, Aurora, CO; Surgical Outcomes and Applied Research Program, University of Colorado School of Medicine, Aurora, CO; Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO
- Robert A Meguid
- Department of Surgery, University of Colorado School of Medicine, Aurora, CO; Surgical Outcomes and Applied Research Program, University of Colorado School of Medicine, Aurora, CO; Adult and Child Center for Health Outcomes Research and Delivery Science, University of Colorado School of Medicine, Aurora, CO
- William G Henderson
- Surgical Outcomes and Applied Research Program, University of Colorado School of Medicine, Aurora, CO; Adult and Child Center for Health Outcomes Research and Delivery Science, University of Colorado School of Medicine, Aurora, CO; Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO
- Helen J Madsen
- Department of Surgery, University of Colorado School of Medicine, Aurora, CO; Surgical Outcomes and Applied Research Program, University of Colorado School of Medicine, Aurora, CO
- Michael R Bronsert
- Surgical Outcomes and Applied Research Program, University of Colorado School of Medicine, Aurora, CO; Adult and Child Center for Health Outcomes Research and Delivery Science, University of Colorado School of Medicine, Aurora, CO
- Kathryn L Colborn
- Department of Surgery, University of Colorado School of Medicine, Aurora, CO; Surgical Outcomes and Applied Research Program, University of Colorado School of Medicine, Aurora, CO; Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO

49
He Z, Liu L, Belloy ME, Le Guen Y, Sossin A, Liu X, Qi X, Ma S, Gyawali PK, Wyss-Coray T, Tang H, Sabatti C, Candès E, Greicius MD, Ionita-Laza I. GhostKnockoff inference empowers identification of putative causal variants in genome-wide association studies. Nat Commun 2022; 13:7209. [PMID: 36418338 PMCID: PMC9684164 DOI: 10.1038/s41467-022-34932-z] [Received: 11/26/2021] [Accepted: 11/09/2022] [Indexed: 11/27/2022]
Abstract
Recent advances in genome sequencing and imputation technologies provide an exciting opportunity to comprehensively study the contribution of genetic variants to complex phenotypes. However, our ability to translate genetic discoveries into mechanistic insights remains limited at this point. In this paper, we propose an efficient knockoff-based method, GhostKnockoff, for genome-wide association studies (GWAS) that leads to improved power and ability to prioritize putative causal variants relative to conventional GWAS approaches. The method requires only Z-scores from conventional GWAS and hence can be easily applied to enhance existing and future studies. The method can also be applied to meta-analysis of multiple GWAS allowing for arbitrary sample overlap. We demonstrate its performance using empirical simulations and two applications: (1) a meta-analysis for Alzheimer's disease comprising nine overlapping large-scale GWAS, whole-exome and whole-genome sequencing studies and (2) analysis of 1403 binary phenotypes from the UK Biobank data in 408,961 samples of European ancestry. Our results demonstrate that GhostKnockoff can identify putatively functional variants with weaker statistical effects that are missed by conventional association tests.
Affiliation(s)
- Zihuai He
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA; Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, 94305, USA
- Linxi Liu
- Department of Statistics, University of Pittsburgh, Pittsburgh, PA, 15260, USA
- Michael E Belloy
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA
- Yann Le Guen
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA; Institut du Cerveau - Paris Brain Institute - ICM, Paris, 75013, France
- Aaron Sossin
- Department of Biomedical Data Science, Stanford University, Stanford, CA, 94305, USA
- Xiaoxia Liu
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA
- Xinran Qi
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA
- Shiyang Ma
- Department of Biostatistics, Columbia University, New York, NY, 10032, USA
- Prashnna K Gyawali
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA
- Tony Wyss-Coray
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA
- Hua Tang
- Department of Genetics, Stanford University, Stanford, CA, 94305, USA
- Chiara Sabatti
- Department of Biomedical Data Science, Stanford University, Stanford, CA, 94305, USA
- Emmanuel Candès
- Department of Statistics, Stanford University, Stanford, CA, 94305, USA; Department of Mathematics, Stanford University, Stanford, CA, 94305, USA
- Michael D Greicius
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94305, USA

50
Transfer Learning in Genome-Wide Association Studies with Knockoffs. SANKHYA B 2022. [DOI: 10.1007/s13571-022-00297-y] [Indexed: 11/16/2022]
Abstract
This paper presents and compares alternative transfer learning methods that can increase the power of conditional testing via knockoffs by leveraging prior information in external data sets collected from different populations or measuring related outcomes. The relevance of this methodology is explored in particular within the context of genome-wide association studies, where it can help address the pressing need for principled ways to suitably account for, and efficiently learn from, the genetic variation associated with diverse ancestries. Finally, we apply these methods to analyze several phenotypes in the UK Biobank data set, demonstrating that transfer learning helps knockoffs discover more associations in the data collected from minority populations, potentially opening the way to the development of more accurate polygenic risk scores.