1
Fan K, Subedi S, Yang G, Lu X, Ren J, Wu C. Is Seeing Believing? A Practitioner's Perspective on High-Dimensional Statistical Inference in Cancer Genomics Studies. Entropy (Basel, Switzerland) 2024; 26:794. [PMID: 39330127 PMCID: PMC11430850 DOI: 10.3390/e26090794] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2024] [Revised: 08/23/2024] [Accepted: 09/06/2024] [Indexed: 09/28/2024]
Abstract
Variable selection methods have been extensively developed for and applied to cancer genomics data to identify important omics features associated with complex disease traits, including cancer outcomes. However, the reliability and reproducibility of the findings are in question if valid inferential procedures are not available to quantify the uncertainty of the findings. In this article, we provide a gentle but systematic review of high-dimensional frequentist and Bayesian inferential tools under sparse models which can yield uncertainty quantification measures, including confidence (or Bayesian credible) intervals, p values and false discovery rates (FDR). Connections in high-dimensional inferences between the two realms have been fully exploited under the "unpenalized loss function + penalty term" formulation for regularization methods and the "likelihood function × shrinkage prior" framework for regularized Bayesian analysis. In particular, we advocate for robust Bayesian variable selection in cancer genomics studies due to its ability to accommodate disease heterogeneity in the form of heavy-tailed errors and structured sparsity while providing valid statistical inference. The numerical results show that robust Bayesian analysis incorporating exact sparsity has yielded not only superior estimation and identification results but also valid Bayesian credible intervals under nominal coverage probabilities compared with alternative methods, especially in the presence of heavy-tailed model errors and outliers.
Affiliation(s)
- Kun Fan
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Srijana Subedi
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Gongshun Yang
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Xi Lu
  - Department of Pharmaceutical Health Outcomes and Policy, College of Pharmacy, University of Houston, Houston, TX 77204, USA
- Jie Ren
  - Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
2
Ji J, Hou Z, He Y, Liu L, Xue F, Chen H, Yuan Z. Differential network knockoff filter with application to brain connectivity analysis. Stat Med 2024; 43:3830-3861. [PMID: 38922944 DOI: 10.1002/sim.10155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2023] [Revised: 04/30/2024] [Accepted: 06/10/2024] [Indexed: 06/28/2024]
Abstract
Brain functional connectivity can typically be represented as a brain functional network, where nodes represent regions of interest (ROIs) and edges symbolize their connections. Studying group differences in brain functional connectivity can help identify brain regions and recover the brain functional network linked to neurodegenerative diseases. This process, known as differential network analysis, focuses on the differences between estimated precision matrices for two groups. Current methods struggle with individual heterogeneity in measuring brain connectivity, false discovery rate (FDR) control, and accounting for confounding factors, resulting in biased estimates and diminished power. To address these issues, we present a two-stage FDR-controlled feature selection method for differential network analysis using functional magnetic resonance imaging (fMRI) data. First, we create individual brain connectivity measures using a high-dimensional precision matrix estimation technique. Next, we devise a penalized logistic regression model that employs individual brain connectivity data and integrates a new knockoff filter for FDR control when detecting significant differential edges. Through extensive simulations, we showcase the superiority of our approach compared to other methods. Additionally, we apply our technique to fMRI data to identify differential edges between Alzheimer's disease and control groups. Our results are consistent with prior experimental studies, emphasizing the practical applicability of our method.
Affiliation(s)
- Jiadong Ji
  - Institute for Financial Studies, Shandong University, Jinan, Shandong, China
- Zhendong Hou
  - Institute for Financial Studies, Shandong University, Jinan, Shandong, China
- Yong He
  - Institute for Financial Studies, Shandong University, Jinan, Shandong, China
- Lei Liu
  - Division of Biostatistics, Washington University in St. Louis, St. Louis, Missouri, USA
- Fuzhong Xue
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, China
- Hao Chen
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, China
- Zhongshang Yuan
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, Shandong, China
3
Yu CX, Gu J, Chen Z, He Z. Summary statistics knockoffs inference with family-wise error rate control. Biometrics 2024; 80:ujae082. [PMID: 39222026 PMCID: PMC11367731 DOI: 10.1093/biomtc/ujae082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2023] [Revised: 07/29/2024] [Accepted: 08/12/2024] [Indexed: 09/04/2024]
Abstract
Testing multiple hypotheses of conditional independence with provable error rate control is a fundamental problem with various applications. To infer conditional independence with family-wise error rate (FWER) control when only summary statistics of marginal dependence are accessible, we adopt GhostKnockoff to directly generate knockoff copies of summary statistics and propose a new filter to select features conditionally dependent on the response. In addition, we develop a computationally efficient algorithm to greatly reduce the computational cost of knockoff copies generation without sacrificing power and FWER control. Experiments on simulated data and a real dataset of Alzheimer's disease genetics demonstrate the advantage of the proposed method over existing alternatives in both statistical power and computational efficiency.
Affiliation(s)
- Catherine Xinrui Yu
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong, 999077, China
- Jiaqi Gu
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, California, 94304, United States
- Zhaomeng Chen
  - Department of Statistics, Stanford University, Stanford, California, 94305, United States
- Zihuai He
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, California, 94304, United States
  - Department of Medicine (Biomedical Informatics Research), Stanford University, Stanford, California, 94304, United States
4
Wang Y, Fu Y, Sun X. Knockoffs-SPR: Clean Sample Selection in Learning With Noisy Labels. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024; 46:3242-3256. [PMID: 38039178 DOI: 10.1109/tpami.2023.3338268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2023]
Abstract
A noisy training set usually leads to the degradation of the generalization and robustness of neural networks. In this article, we propose a novel theoretically guaranteed clean sample selection framework for learning with noisy labels. Specifically, we first present a Scalable Penalized Regression (SPR) method, to model the linear relation between network features and one-hot labels. In SPR, the clean data are identified by the zero mean-shift parameters solved in the regression model. We theoretically show that SPR can recover clean data under some conditions. Under general scenarios, the conditions may be no longer satisfied; and some noisy data are falsely selected as clean data. To solve this problem, we propose a data-adaptive method for Scalable Penalized Regression with Knockoff filters (Knockoffs-SPR), which is provable to control the False-Selection-Rate (FSR) in the selected clean data. To improve the efficiency, we further present a split algorithm that divides the whole training set into small pieces that can be solved in parallel to make the framework scalable to large datasets. While Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline, we further combine it with a semi-supervised algorithm to exploit the support of noisy data as unlabeled data. Experimental results on several benchmark datasets and real-world noisy datasets show the effectiveness of our framework and validate the theoretical results of Knockoffs-SPR.
5
Staerk C, Byrd A, Mayr A. Recent Methodological Trends in Epidemiology: No Need for Data-Driven Variable Selection? Am J Epidemiol 2024; 193:370-376. [PMID: 37771042 DOI: 10.1093/aje/kwad193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Revised: 08/02/2023] [Accepted: 09/27/2023] [Indexed: 09/30/2023]
Abstract
Variable selection in regression models is a particularly important issue in epidemiology, where one usually encounters observational studies. In contrast to randomized trials or experiments, confounding is often not controlled by the study design, but has to be accounted for by suitable statistical methods. For instance, when risk factors should be identified with unconfounded effect estimates, multivariable regression techniques can help to adjust for confounders. We investigated the current practice of variable selection in 4 major epidemiologic journals in 2019 and found that the majority of articles used subject-matter knowledge to determine a priori the set of included variables. In comparison with previous reviews from 2008 and 2015, fewer articles applied data-driven variable selection. Furthermore, for most articles the main aim of analysis was hypothesis-driven effect estimation in rather low-dimensional data situations (i.e., large sample size compared with the number of variables). Based on our results, we discuss the role of data-driven variable selection in epidemiology.
6
Dai R, Zheng C. False discovery rate-controlled multiple testing for union null hypotheses: a knockoff-based approach. Biometrics 2023; 79:3497-3509. [PMID: 36854821 PMCID: PMC10460825 DOI: 10.1111/biom.13848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2021] [Accepted: 02/17/2023] [Indexed: 03/02/2023]
Abstract
False discovery rate (FDR) controlling procedures provide important statistical guarantees for replicability in signal identification based on multiple hypotheses testing. In many fields of study, FDR controlling procedures are used in high-dimensional (HD) analyses to discover features that are truly associated with the outcome. In some recent applications, data on the same set of candidate features are independently collected in multiple different studies. For example, gene expression data are collected at different facilities and with different cohorts, to identify the genetic biomarkers of multiple types of cancers. These studies provide us with opportunities to identify signals by considering information from different sources (with potential heterogeneity) jointly. This paper is about how to provide FDR control guarantees for the tests of union null hypotheses of conditional independence. We present a knockoff-based variable selection method (Simultaneous knockoffs) to identify mutual signals from multiple independent datasets, providing exact FDR control guarantees under finite sample settings. This method can work with very general model settings and test statistics. We demonstrate the performance of this method with extensive numerical studies and two real-data examples.
Affiliation(s)
- Ran Dai
  - Department of Biostatistics, University of Nebraska Medical Center, Omaha, Nebraska, U.S.A
7
Wai Tsang K, Tsung F, Xu Z. Knockoff procedure for false discovery rate control in high-dimensional data streams. J Appl Stat 2023; 50:2970-2983. [PMID: 37808615 PMCID: PMC10557548 DOI: 10.1080/02664763.2023.2200496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Accepted: 04/03/2023] [Indexed: 10/10/2023]
Abstract
Motivated by applications to root-cause identification of faults in high-dimensional data streams that may have very limited samples after faults are detected, we consider multiple testing in models for multivariate statistical process control (SPC). With quick fault detection, only a small portion of the data streams can be assumed to be out-of-control (OC). It is a long-standing problem to identify those OC data streams while controlling the number of false discoveries. The problem is challenging because few OC samples are available once the process is terminated upon fault detection. Although several false discovery rate (FDR) controlling methods have been proposed, practitioners may prefer other methods for quick detection. Building on the recently developed knockoff filter, we propose a knockoff procedure that can be combined with other fault detection methods: it does not change the stopping time, but may identify a different set of faults in order to control the FDR. A theorem for the FDR control of the proposed procedure is provided. Simulation studies show that the proposed procedure can control FDR while maintaining high power. We also illustrate the performance in an application to semiconductor manufacturing processes that motivated this development.
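For readers unfamiliar with knockoff filtering, the selection step of the standard knockoff+ filter can be sketched as follows. This is only an illustration of the generic data-dependent threshold from the knockoff literature, not the procedure proposed in this article. It assumes that a vector of feature statistics `w` has already been computed (large positive values suggest a real signal, and signs are symmetric under the null); the function names and the toy numbers are hypothetical.

```python
import math


def knockoff_plus_threshold(w, q):
    """Return the knockoff+ threshold for statistics w at target FDR level q.

    The threshold is the smallest t among the nonzero |w_j| such that
    (1 + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}) <= q.
    Returns math.inf when no candidate satisfies the bound.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_neg = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        n_pos = sum(1 for x in w if x >= t)   # features that would be selected
        if (1 + n_neg) / max(1, n_pos) <= q:
            return t
    return math.inf


def knockoff_select(w, q):
    """Indices of features whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Toy statistics (hypothetical): mostly large positive values plus a few negatives.
w = [6, 5, 4.5, 4, 3.5, 3, -0.5, 2.5, 2, 1.5, -1.0, 0.3]
selected = knockoff_select(w, q=0.2)
```

The count of large negative statistics acts as an estimate of the number of false positives among the large positive ones, which is why no p-values are needed. How the statistics `w` are built (for example from fitted coefficients of original and knockoff features) is left out of this sketch.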
Affiliation(s)
- Ka Wai Tsang
  - School of Data Science, The Chinese University of Hong Kong, Shenzhen, Guangdong 518172, People's Republic of China
- Fugee Tsung
  - Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology, Hong Kong
- Zhihao Xu
  - Department of Statistics, University of Michigan, Ann Arbor, MI, USA
8
Zhao Y, Sun L. A stable and adaptive polygenic signal detection method based on repeated sample splitting. Can J Stat 2023. [DOI: 10.1002/cjs.11768] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/03/2023]
9
Cheng X, Wang H. A generic model-free feature screening procedure for ultra-high dimensional data with categorical response. Computer Methods and Programs in Biomedicine 2023; 229:107269. [PMID: 36463676 DOI: 10.1016/j.cmpb.2022.107269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 11/22/2022] [Accepted: 11/23/2022] [Indexed: 06/17/2023]
Abstract
BACKGROUND AND OBJECTIVE: Identifying active features from ultra-high dimensional data is one of the primary and vital tasks in statistical learning and biological discovery. METHODS: In this paper, we develop a generic concordance index screening (CI-SIS) procedure to handle ultra-high dimensional data with categorical response. The proposed procedure is model-free and nonparametric, based on the concordance index measure. It enjoys both sure screening and ranking consistency properties under some relatively weak assumptions. We investigate the flexibility of this procedure by considering some commonly encountered challenging settings in biomedical studies, such as category-adaptive data and extremely unbalanced response distributions. A data-driven threshold selection procedure via knockoff features is also presented. RESULTS: On the real lung dataset, our method achieves a lower prediction error, with a mean error of 0.107 with linear discriminant analysis (LDA) and 0.117 with random forest (RF). In addition, we obtain an accuracy improvement of 3% with LDA and 5% with RF compared to the runner-up method. On the more challenging real SRBCT (small round blue cell tumours) dataset, CI-SIS brings a performance improvement of at least 8% over all other competing methods. CONCLUSION: Experimental results show that the proposed method can efficiently identify genes that are associated with certain types of diseases. Therefore, the surviving features selected by our procedure (those left after filtering out irrelevant features) can help doctors make precise diagnoses and refine the treatment of patients.
Affiliation(s)
- Xuewei Cheng
  - School of Mathematics and Statistics, Central South University, Changsha, China
  - Department of Statistics and Data Science, National University of Singapore, Singapore
- Hong Wang
  - School of Mathematics and Statistics, Central South University, Changsha, China
10
Leung D, Sun W. ZAP: Z-value adaptive procedures for false discovery rate control with side information. J R Stat Soc Series B Stat Methodol 2022. [DOI: 10.1111/rssb.12557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
Affiliation(s)
- Dennis Leung
  - School of Mathematics and Statistics, University of Melbourne, Parkville, Victoria, Australia
- Wenguang Sun
  - Center for Data Science, Zhejiang University, Hangzhou, China
11
Feature screening and FDR control with knockoff features for ultrahigh-dimensional right-censored data. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
12
Monti GS, Filzmoser P. A robust knockoff filter for sparse regression analysis of microbiome compositional data. Comput Stat 2022. [DOI: 10.1007/s00180-022-01268-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Abstract
Microbiome data analysis often relies on the identification of a subset of potential biomarkers associated with a clinical outcome of interest. Robust ZeroSum regression, an elastic-net penalized compositional regression built on the least trimmed squares estimator, is a variable selection procedure that copes with the high dimensionality of these data and their compositional nature while guaranteeing robustness against outliers. The need to discover “true” effects and to improve clinical research quality and reproducibility has motivated us to propose a two-step robust compositional knockoff filter procedure, which selects the set of relevant biomarkers, that is, those among the many measured features that have a nonzero effect on the response, while controlling the expected fraction of false positives. We demonstrate the effectiveness of our proposal in an extensive simulation study, and illustrate its usefulness in an application to intestinal microbiome analysis.
13
Yuan P, Feng S, Li G. Revisiting feature selection for linear models with FDR and power guarantees. J Korean Stat Soc 2022. [DOI: 10.1007/s42952-022-00179-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
14
Hyde R, O'Grady L, Green M. Stability selection for mixed effect models with large numbers of predictor variables: A simulation study. Prev Vet Med 2022; 206:105714. [PMID: 35843027 DOI: 10.1016/j.prevetmed.2022.105714] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/28/2022] [Revised: 07/08/2022] [Accepted: 07/10/2022] [Indexed: 10/17/2022]
Abstract
Covariate selection when the number of available variables is large relative to the number of observations is problematic in epidemiology and remains the focus of continued research. Whilst a variety of statistical methods have been developed to attempt to overcome this issue, at present very few methods are available for wide data that include a clustered outcome. The purpose of this research was to make an empirical evaluation of a new method for covariate selection in wide data settings when the dependent variable is clustered. We used 3300 simulated datasets with a variety of defined structures and known sets of true predictor variables to conduct an empirical evaluation of a mixed model stability selection procedure. Comparison was made with an alternative method based on regularisation using the least absolute shrinkage and selection operator (Lasso) penalty. Model performance was assessed using several metrics including the true positive rate (proportion of true covariates selected in a final model) and false discovery rate (proportion of variables selected in a final model that were non-true (false) variables). For stability selection, the false discovery rate was consistently low, generally remaining ≤ 0.02 indicating that on average fewer than 1 in 50 of the variables selected in a final model were false variables. This was in contrast to the Lasso-based method in which the false discovery rate was between 0.59 and 0.72, indicating that generally more than 60% of variables selected in a final model were false variables. In contrast however, the Lasso method attained higher true positive rates than stability selection, although both methods achieved good results. For the Lasso method, true positive rates remained ≥ 0.93 whereas for stability selection the true positive rate was 0.73-0.97. Our results suggest both methods may be of value for covariate selection with high dimensional data with a clustered outcome. 
When high specificity is needed for identification of true covariates, stability selection appeared to offer the better solution, although with a slight loss of sensitivity. Conversely when high sensitivity is needed, the Lasso approach may be useful, even if accompanied by a substantial loss of specificity. Overall, the results indicated the loss of sensitivity when employing stability selection is relatively small compared to the loss of specificity when using the Lasso and therefore stability selection may provide the better option for the analyst when evaluating data of this type.
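The two performance metrics defined in this abstract can be computed for a single fitted model as in the minimal sketch below. The function name and the covariate labels are hypothetical, and the per-model quantity computed here is strictly the false discovery proportion; averaging it over simulated datasets gives the false discovery rate the abstract reports.

```python
def selection_metrics(selected, truth):
    """Return (true positive rate, false discovery proportion) for one model.

    selected: covariates chosen in the final model
    truth:    covariates that truly affect the outcome
    """
    selected, truth = set(selected), set(truth)
    true_pos = len(selected & truth)
    # Proportion of true covariates that were selected.
    tpr = true_pos / len(truth) if truth else 0.0
    # Proportion of selected covariates that are not true covariates.
    fdp = (len(selected) - true_pos) / len(selected) if selected else 0.0
    return tpr, fdp


# Hypothetical example: four true covariates, four selected, one of them false.
tpr, fdp = selection_metrics(
    selected=["x1", "x2", "x3", "x9"],
    truth=["x1", "x2", "x3", "x4"],
)
```

In this toy case three of four true covariates are recovered and one of four selections is false, so both the sensitivity loss and the false discovery proportion are 0.25.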
Affiliation(s)
- Robert Hyde
  - School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, United Kingdom
- Luke O'Grady
  - School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, United Kingdom
- Martin Green
  - School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington Campus, Leicestershire, United Kingdom
15
Tong Z, Cai Z, Yang S, Li R. Model-Free Conditional Feature Screening with FDR Control. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2063130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Zhaoxue Tong
  - Pennsylvania State University, University Park, PA
- Runze Li
  - Pennsylvania State University, University Park, PA
16
Zhou J, Li Y, Zheng Z, Li D. Reproducible learning in large-scale graphical models. J Multivariate Anal 2022. [DOI: 10.1016/j.jmva.2021.104934] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
17
Null-free False Discovery Rate Control Using Decoy Permutations. Acta Mathematicae Applicatae Sinica, English Series 2022; 38:235-253. [PMID: 35431377 PMCID: PMC8994022 DOI: 10.1007/s10255-022-1077-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/29/2020] [Accepted: 01/27/2022] [Indexed: 11/09/2022]
Abstract
The traditional approaches to false discovery rate (FDR) control in multiple hypothesis testing are usually based on the null distribution of a test statistic. However, all types of null distributions, including the theoretical, permutation-based and empirical ones, have some inherent drawbacks. For example, the theoretical null might fail because of improper assumptions on the sample distribution. Here, we propose a null distribution-free approach to FDR control for multiple hypothesis testing in the case-control study. This approach, named target-decoy procedure, simply builds on the ordering of tests by some statistic or score, the null distribution of which is not required to be known. Competitive decoy tests are constructed from permutations of original samples and are used to estimate the false target discoveries. We prove that this approach controls the FDR when the score function is symmetric and the scores are independent between different tests. Simulation demonstrates that it is more stable and powerful than two popular traditional approaches, even in the existence of dependency. Evaluation is also made on two real datasets, including an arabidopsis genomics dataset and a COVID-19 proteomics dataset.
18
Affiliation(s)
- Buyu Lin
  - Department of Statistics, Harvard University
- Xin Xing
  - Department of Statistics, Virginia Tech
- Jun S. Liu
  - Department of Statistics, Harvard University
19
Dai X, Lyu X, Li L. Kernel Knockoffs Selection for Nonparametric Additive Models. J Am Stat Assoc 2022; 118:2158-2170. [PMID: 38143786 PMCID: PMC10746135 DOI: 10.1080/01621459.2022.2039671] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/02/2021] [Accepted: 01/07/2022] [Indexed: 12/17/2022]
Abstract
Thanks to its fine balance between model flexibility and interpretability, the nonparametric additive model has been widely used, and variable selection for this type of model has been frequently studied. However, none of the existing solutions can control the false discovery rate (FDR) unless the sample size tends to infinity. The knockoff framework is a recent proposal that can address this issue, but few knockoff solutions are directly applicable to nonparametric models. In this article, we propose a novel kernel knockoffs selection procedure for the nonparametric additive model. We integrate three key components: the knockoffs, the subsampling for stability, and the random feature mapping for nonparametric function approximation. We show that the proposed method is guaranteed to control the FDR for any sample size, and achieves a power that approaches one as the sample size tends to infinity. We demonstrate the efficacy of our method through intensive simulations and comparisons with the alternative solutions. Our proposal thus makes useful contributions to the methodology of nonparametric variable selection, FDR-based inference, as well as knockoffs.
Affiliation(s)
- Lexin Li
  - University of California, Berkeley
20
Reproducible feature selection in high-dimensional accelerated failure time models. Stat Probab Lett 2022. [DOI: 10.1016/j.spl.2021.109275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
21
Affiliation(s)
- Xu Guo
  - School of Statistics, Beijing Normal University, Beijing, China
- Haojie Ren
  - School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China
- Changliang Zou
  - School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin, China
- Runze Li
  - Department of Statistics, The Pennsylvania State University, University Park, PA
22
Luo W. Determine the number of clusters by data augmentation. Electron J Stat 2022. [DOI: 10.1214/22-ejs2032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Wei Luo
  - Center for Data Science, Zhejiang University, 866 Yuhangtang Road, Hangzhou, China
23
Mary D, Roquain E. Semi-supervised multiple testing. Electron J Stat 2022. [DOI: 10.1214/22-ejs2050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- David Mary
  - Université Côte d’Azur, Observatoire de la Côte d’Azur, CNRS, Laboratoire Lagrange, Bd de l’Observatoire, CS 34229, 06304, Nice cedex 4, France
- Etienne Roquain
  - Laboratoire de Probabilités, Statistique et Modélisation, Sorbonne Université, Université de Paris & CNRS, 4, place Jussieu, 75005 Paris, France
24
Abraham K, Castillo I, Roquain É. Empirical Bayes cumulative ℓ-value multiple testing procedure for sparse sequences. Electron J Stat 2022. [DOI: 10.1214/22-ejs1979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Kweku Abraham
  - University of Cambridge, Statistical Laboratory, Wilberforce Road, Cambridge CB3 0WB, UK
- Ismaël Castillo
  - Université de Paris and Sorbonne Université, CNRS, Laboratoire de Probabilités, Statistique et Modélisation, F-75013 Paris, France
- Étienne Roquain
  - Université de Paris and Sorbonne Université, CNRS, Laboratoire de Probabilités, Statistique et Modélisation, F-75013 Paris, France
25
Ge X, Chen YE, Song D, McDermott M, Woyshner K, Manousopoulou A, Wang N, Li W, Wang LD, Li JJ. Clipper: p-value-free FDR control on high-throughput data from two conditions. Genome Biol 2021; 22:288. [PMID: 34635147 PMCID: PMC8504070 DOI: 10.1186/s13059-021-02506-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 03/16/2021] [Accepted: 09/21/2021] [Indexed: 12/12/2022]
Abstract
High-throughput biological data analysis commonly involves identifying features such as genes, genomic regions, and proteins, whose values differ between two conditions, from numerous features measured simultaneously. The most widely used criterion to ensure the analysis reliability is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.
Affiliation(s)
- Xinzhou Ge
  - Department of Statistics, University of California, Los Angeles, 90095, CA, USA
- Yiling Elaine Chen
  - Department of Statistics, University of California, Los Angeles, 90095, CA, USA
- Dongyuan Song
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
- MeiLu McDermott
  - Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
  - The Quantitative and Computational Biology section, University of Southern California, Los Angeles, 90089, CA, USA
- Kyla Woyshner
  - Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- Antigoni Manousopoulou
  - Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- Ning Wang
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
- Wei Li
  - Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, 92697, CA, USA
- Leo D Wang
  - Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- Jingyi Jessica Li
  - Department of Statistics, University of California, Los Angeles, 90095, CA, USA
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, 90095, CA, USA
  - Department of Computational Medicine, University of California, Los Angeles, 90095, CA, USA
  - Department of Biostatistics, University of California, Los Angeles, 90095, CA, USA
|
26
|
Hallou A, Yevick HG, Dumitrascu B, Uhlmann V. Deep learning for bioimage analysis in developmental biology. Development 2021; 148:dev199616. [PMID: 34490888 PMCID: PMC8451066 DOI: 10.1242/dev.199616] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Indexed: 12/16/2022]
Abstract
Deep learning has transformed the way large and complex image datasets can be processed, reshaping what is possible in bioimage analysis. As the complexity and size of bioimage data continues to grow, this new analysis paradigm is becoming increasingly ubiquitous. In this Review, we begin by introducing the concepts needed for beginners to understand deep learning. We then review how deep learning has impacted bioimage analysis and explore the open-source resources available to integrate it into a research project. Finally, we discuss the future of deep learning applied to cell and developmental biology. We analyze how state-of-the-art methodologies have the potential to transform our understanding of biological systems through new image-based analysis and modelling that integrate multimodal inputs in space and time.
Affiliation(s)
- Adrien Hallou
  - Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge, CB3 0HE, UK
  - Wellcome Trust/Cancer Research UK Gurdon Institute, University of Cambridge, Cambridge, CB2 1QN, UK
  - Wellcome Trust/Medical Research Council Stem Cell Institute, University of Cambridge, Cambridge, CB2 1QR, UK
- Hannah G. Yevick
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, 02142, USA
- Bianca Dumitrascu
  - Computer Laboratory, Cambridge, University of Cambridge, Cambridge, CB3 0FD, UK
- Virginie Uhlmann
  - European Bioinformatics Institute, European Molecular Biology Laboratory, Cambridge, CB10 1SD, UK
|
27
|
Affiliation(s)
- Zhimei Ren
  - Department of Statistics, University of Chicago, Chicago, IL
- Yuting Wei
  - Statistics & Data Science Department, University of Pennsylvania, Philadelphia, PA
- Emmanuel Candès
  - Department of Mathematics, Department of Statistics, Stanford University, Stanford, CA
|
28
|
Srinivasan A, Xue L, Zhan X. Compositional knockoff filter for high-dimensional regression analysis of microbiome data. Biometrics 2021; 77:984-995. [PMID: 32683674 PMCID: PMC7831267 DOI: 10.1111/biom.13336] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 07/23/2019] [Revised: 06/29/2020] [Accepted: 07/09/2020] [Indexed: 01/10/2023]
Abstract
A critical task in microbiome data analysis is to explore the association between a scalar response of interest and a large number of microbial taxa that are summarized as compositional data at different taxonomic levels. Motivated by fine-mapping of the microbiome, we propose a two-step compositional knockoff filter to provide the effective finite-sample false discovery rate (FDR) control in high-dimensional linear log-contrast regression analysis of microbiome compositional data. In the first step, we propose a new compositional screening procedure to remove insignificant microbial taxa while retaining the essential sum-to-zero constraint. In the second step, we extend the knockoff filter to identify the significant microbial taxa in the sparse regression model for compositional data. Thereby, a subset of the microbes is selected from the high-dimensional microbial taxa as related to the response under a prespecified FDR threshold. We study the theoretical properties of the proposed two-step procedure, including both sure screening and effective false discovery control. We demonstrate these properties in numerical simulation studies to compare our methods to some existing ones and show power gain of the new method while controlling the nominal FDR. The potential usefulness of the proposed method is also illustrated with application to an inflammatory bowel disease data set to identify microbial taxa that influence host gene expressions.
Affiliation(s)
- Arun Srinivasan
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802, U.S.A.
- Lingzhou Xue
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802, U.S.A.
- Xiang Zhan
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033, U.S.A.
|
29
|
Freijeiro‐González L, Febrero‐Bande M, González‐Manteiga W. A Critical Review of LASSO and Its Derivatives for Variable Selection Under Dependence Among Covariates. Int Stat Rev 2021. [DOI: 10.1111/insr.12469] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/27/2022]
Affiliation(s)
- Laura Freijeiro‐González
  - Department of Statistics, Mathematical Analysis and Optimization, Santiago de Compostela University, Santiago de Compostela, Spain
- Manuel Febrero‐Bande
  - Department of Statistics, Mathematical Analysis and Optimization, Santiago de Compostela University, Santiago de Compostela, Spain
- Wenceslao González‐Manteiga
  - Department of Statistics, Mathematical Analysis and Optimization, Santiago de Compostela University, Santiago de Compostela, Spain
|
30
|
Du L, Guo X, Sun W, Zou C. False Discovery Rate Control Under General Dependence By Symmetrized Data Aggregation. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1945459] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Affiliation(s)
- Lilun Du
  - Department of ISOM, Hong Kong University of Science and Technology, ISOM, Kowloon, Hong Kong
- Xu Guo
  - Department of Mathematical Statistics, Beijing Normal University, Beijing, China
- Wenguang Sun
  - Data Sciences and Operations, University of Southern California, Los Angeles, CA
- Changliang Zou
  - Department of Statistics and Data Sciences, Nankai University, Tianjin, China
|
31
|
Distribution-dependent feature selection for deep neural networks. APPL INTELL 2021. [DOI: 10.1007/s10489-021-02663-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
32
|
Li J, Maathuis MH. GGM knockoff filter: False discovery rate control for Gaussian graphical models. J R Stat Soc Series B Stat Methodol 2021. [DOI: 10.1111/rssb.12430] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 01/23/2023]
Affiliation(s)
- Jinzhou Li
  - Seminar für Statistik, ETH Zürich, Zürich, Switzerland
|
33
|
Armstrong MJ, Paulson HL, Maixner SM, Fields JA, Lunde AM, Boeve BF, Manning C, Galvin JE, Taylor AS, Li Z. Protocol for an observational cohort study identifying factors predicting accurately end of life in dementia with Lewy bodies and promoting quality end-of-life experiences: the PACE-DLB study. BMJ Open 2021; 11:e047554. [PMID: 34039578 PMCID: PMC8160156 DOI: 10.1136/bmjopen-2020-047554] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/04/2022]
Abstract
INTRODUCTION Dementia with Lewy bodies (DLB) is one of the most common degenerative dementias. Despite the fact that most individuals with DLB die from complications of the disease, little is known regarding what factors predict impending end of life or are associated with a quality end of life. METHODS AND ANALYSIS This is a multisite longitudinal cohort study. Participants are being recruited from five academic centres providing subspecialty DLB care and volunteers through the Lewy Body Dementia Association (not receiving specialty care). Dyads must be US residents, include individuals with a clinical diagnosis of DLB and at least moderate-to-severe dementia and include the primary caregiver, who must pass a brief cognitive screen. The first dyad was enrolled 25 February 2021; recruitment is ongoing. Dyads will attend study visits every 6 months through the end of life or 3 years. Study visits will occur in-person or virtually. Measures include demographics, DLB characteristics, caregiver considerations, quality of life and satisfaction with end-of-life experiences. For dyads where the individual with DLB dies, the caregiver will complete a final study visit 3 months after the death to assess grief, recovery and quality of the end-of-life experience. Terminal trend models will be employed to identify significant predictors of approaching end of life (death in the next 6 months). Similar models will assess caregiver factors (eg, grief, satisfaction with end-of-life experience) after the death of the individual with DLB. A qualitative descriptive analysis approach will evaluate interview transcripts regarding end-of-life experiences. ETHICS AND DISSEMINATION This study was approved by the University of Florida institutional review board (IRB202001438) and is listed on clinicaltrials.gov (NCT04829656). Data sharing follows National Institutes of Health policies. Study results will be disseminated via traditional scientific strategies (conferences, publications) and through collaborating with the Lewy Body Dementia Association, National Institute on Aging and other partnerships.
Affiliation(s)
- Melissa J Armstrong
  - Neurology, University of Florida College of Medicine, Gainesville, Florida, USA
- Susan M Maixner
  - Psychiatry, University of Michigan, Ann Arbor, Michigan, USA
- Julie A Fields
  - Psychiatry and Psychology, Mayo Clinic Rochester, Rochester, Minnesota, USA
- Angela M Lunde
  - Psychiatry and Psychology, Mayo Clinic Rochester, Rochester, Minnesota, USA
- Carol Manning
  - Neurology, University of Virginia, Charlottesville, Virginia, USA
- James E Galvin
  - Neurology, University of Miami Miller School of Medicine, Miami, Florida, USA
- Zhigang Li
  - Biostatistics, University of Florida College of Medicine, Gainesville, Florida, USA
|
34
|
Xue C, Zhang T, Xiao D. Output-Related and -Unrelated Fault Monitoring with an Improvement Prototype Knockoff Filter and Feature Selection Based on Laplacian Eigen Maps and Sparse Regression. ACS OMEGA 2021; 6:10828-10839. [PMID: 34056237 PMCID: PMC8153765 DOI: 10.1021/acsomega.1c00506] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/28/2021] [Accepted: 04/06/2021] [Indexed: 06/12/2023]
Abstract
In the process industry, fault monitoring related to output is an important step to ensure product quality and improve economic benefits. In order to distinguish the influence of input variables on the output more accurately, this paper introduces a subalgorithm of fault-unrelated block partition into the prototype knockoff filter (PKF) algorithm for its improvement. The improved PKF algorithm can divide the input data into three blocks: fault-unrelated block, output-related block, and output-unrelated block. Removing the data of fault-unrelated blocks can greatly reduce the difficulty of fault monitoring. This paper proposes a feature selection based on the Laplacian Eigen maps and sparse regression algorithm for output-unrelated blocks. The algorithm has the ability to detect faults caused by variables with small contribution to variance and proves the descent of the algorithm from a theoretical point of view. The output relation block is monitored by the Broyden-Fletcher-Goldfarb-Shanno method. Finally, the effectiveness of the proposed fault detection method is verified by the recognized Eastman process data in Tennessee.
Affiliation(s)
- Cuiping Xue
  - College of Science, Northeastern University, Shenyang 110819, China
- Tie Zhang
  - College of Science, Northeastern University, Shenyang 110819, China
- Dong Xiao
  - College of Information Science and Engineering and Liaoning Key Laboratory of Intelligent Diagnosis and Safety for Metallurgical Industry, Northeastern University, Shenyang 110819, China
|
35
|
Kormaksson M, Kelly LJ, Zhu X, Haemmerle S, Pricop L, Ohlssen D. Sequential knockoffs for continuous and categorical predictors: With application to a large psoriatic arthritis clinical trial pool. Stat Med 2021; 40:3313-3328. [PMID: 33899260 DOI: 10.1002/sim.8955] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/20/2020] [Revised: 02/22/2021] [Accepted: 03/01/2021] [Indexed: 01/10/2023]
Abstract
Knockoffs provide a general framework for controlling the false discovery rate when performing variable selection. Much of the Knockoffs literature focuses on theoretical challenges and we recognize a need for bringing some of the current ideas into practice. In this paper we propose a sequential algorithm for generating knockoffs when underlying data consists of both continuous and categorical (factor) variables. Further, we present a heuristic multiple knockoffs approach that offers a practical assessment of how robust the knockoff selection process is for a given dataset. We conduct extensive simulations to validate performance of the proposed methodology. Finally, we demonstrate the utility of the methods on a large clinical data pool of more than 2000 patients with psoriatic arthritis evaluated in four clinical trials with an IL-17A inhibitor, secukinumab (Cosentyx), where we determine prognostic factors of a well established clinical outcome. The analyses presented in this paper could provide a wide range of applications to commonly encountered datasets in medical practice and other fields where variable selection is of particular interest.
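As background for the knockoff entries in this list, the selection step of a knockoff filter can be sketched as follows. This is a minimal illustration of the standard knockoff+ threshold rule, not the sequential knockoff-generation algorithm proposed in the paper above. The function name `knockoff_plus_select` and the example statistics are hypothetical, and the feature statistics `W` are assumed to have been computed already (large positive values favour the original feature over its knockoff).

```python
def knockoff_plus_select(W, q=0.1):
    """Select features with the knockoff+ threshold at target FDR level q.

    W holds one statistic per feature; the sign of W[j] is assumed to be
    symmetric around zero for null features, as knockoff constructions require.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        n_neg = sum(1 for w in W if w <= -t)  # proxy for false discoveries
        n_pos = sum(1 for w in W if w >= t)   # discoveries at threshold t
        # Conservative estimate of the false discovery proportion at t.
        if (1 + n_neg) / max(1, n_pos) <= q:
            return [j for j, w in enumerate(W) if w >= t]
    return []  # no threshold meets the target level


# Hypothetical statistics: six strong positive signals and four weak ones.
W = [5.1, 4.2, 3.9, 3.5, 3.0, 2.8, -0.4, 0.3, -0.2, 0.1]
selected = knockoff_plus_select(W, q=0.2)
```

The `1 +` in the numerator is what distinguishes knockoff+ from the plain knockoff threshold; it makes the rule more conservative, so a small number of features may yield no selection at all for a strict `q`.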
Affiliation(s)
- Xuan Zhu
  - Novartis Pharmaceuticals Corporation, East Hanover, New Jersey, USA
- Luminita Pricop
  - Novartis Pharmaceuticals Corporation, East Hanover, New Jersey, USA
- David Ohlssen
  - Novartis Pharmaceuticals Corporation, East Hanover, New Jersey, USA
|
36
|
Shi C, Li L. Testing Mediation Effects Using Logic of Boolean Matrices. J Am Stat Assoc 2021; 117:2014-2027. [PMID: 36945327 PMCID: PMC10027382 DOI: 10.1080/01621459.2021.1895177] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/02/2020] [Revised: 02/17/2021] [Accepted: 02/19/2021] [Indexed: 10/22/2022]
Abstract
A central question in high-dimensional mediation analysis is to infer the significance of individual mediators. The main challenge is that the total number of potential paths that go through any mediator is super-exponential in the number of mediators. Most existing mediation inference solutions either explicitly impose that the mediators are conditionally independent given the exposure, or ignore any potential directed paths among the mediators. In this article, we propose a novel hypothesis testing procedure to evaluate individual mediation effects, while taking into account potential interactions among the mediators. Our proposal thus fills a crucial gap, and greatly extends the scope of existing mediation tests. Our key idea is to construct the test statistic using the logic of Boolean matrices, which enables us to establish the proper limiting distribution under the null hypothesis. We further employ screening, data splitting, and decorrelated estimation to reduce the bias and increase the power of the test. We show that our test can control both the size and false discovery rate asymptotically, and the power of the test approaches one, while allowing the number of mediators to diverge to infinity with the sample size. We demonstrate the efficacy of the method through simulations and a neuroimaging study of Alzheimer's disease. A Python implementation of the proposed procedure is available at https://github.com/callmespring/LOGAN.
Affiliation(s)
- Chengchun Shi
  - London School of Economics and Political Science and University of California at Berkeley
- Lexin Li
  - London School of Economics and Political Science and University of California at Berkeley
|
37
|
Schultheiss C, Renaux C, Bühlmann P. Multicarving for high-dimensional post-selection inference. Electron J Stat 2021. [DOI: 10.1214/21-ejs1825] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
|
38
|
Chen F, He X, Wang J. Learning sparse conditional distribution: An efficient kernel-based approach. Electron J Stat 2021. [DOI: 10.1214/21-ejs1824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Fang Chen
  - School of Statistics and Management, Shanghai University of Finance and Economics
- Xin He
  - School of Statistics and Management, Shanghai University of Finance and Economics
- Junhui Wang
  - School of Data Science, City University of Hong Kong
|
39
|
Gyllenberg D, McKeague IW, Sourander A, Brown AS. Robust data-driven identification of risk factors and their interactions: A simulation and a study of parental and demographic risk factors for schizophrenia. Int J Methods Psychiatr Res 2020; 29:1-11. [PMID: 32520440 PMCID: PMC7723216 DOI: 10.1002/mpr.1834] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/09/2019] [Revised: 03/12/2020] [Accepted: 04/29/2020] [Indexed: 01/01/2023]
Abstract
OBJECTIVES Few interactions between risk factors for schizophrenia have been replicated, but fitting all such interactions is difficult due to high-dimensionality. Our aims are to examine significant main and interaction effects for schizophrenia and the performance of our approach using simulated data. METHODS We apply the machine learning technique elastic net to a high-dimensional logistic regression model to produce a sparse set of predictors, and then assess the significance of odds ratios (OR) with Bonferroni-corrected p-values and confidence intervals (CI). We introduce a simulation model that resembles a Finnish nested case-control study of schizophrenia which uses national registers to identify cases (n = 1,468) and controls (n = 2,975). The predictors include nine sociodemographic factors and all interactions (31 predictors). RESULTS In the simulation, interactions with OR = 3 and prevalence = 4% were identified with <5% false positive rate and ≥80% power. None of the studied interactions were significantly associated with schizophrenia, but main effects of parental psychosis (OR = 5.2, CI 2.9-9.7; p < .001), urbanicity (1.3, 1.1-1.7; p = .001), and paternal age ≥35 (1.3, 1.004-1.6; p = .04) were significant. CONCLUSIONS We have provided an analytic pipeline for data-driven identification of main and interaction effects in case-control data. We identified highly replicated main effects for schizophrenia, but no interactions.
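The Bonferroni step mentioned in the METHODS above can be illustrated with a short sketch. It assumes the correction is applied across the predictors retained by the sparse model, which the abstract does not state explicitly; the function name `bonferroni_adjust` and the p-values are hypothetical and are not taken from the study.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p-values: min(1, m * p) for m tests."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


# Hypothetical raw p-values for four retained predictors (not study data).
raw = [0.0004, 0.012, 0.03, 0.20]
adjusted = bonferroni_adjust(raw)

# A predictor is declared significant if its adjusted p-value is at most 0.05.
significant = [p <= 0.05 for p in adjusted]
```

Matching confidence intervals would use the per-test level 0.05 / m rather than 0.05, so that intervals and adjusted p-values lead to the same decisions.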
Affiliation(s)
- David Gyllenberg
  - Department of Child Psychiatry, University of Turku, Turku, Finland
  - Department of Adolescent Psychiatry, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland
  - Welfare Department, National Institute for Health and Welfare, Helsinki, Finland
- Ian W McKeague
  - Department of Biostatistics, Columbia University Mailman School of Public Health, New York, New York, USA
- Andre Sourander
  - Department of Child Psychiatry, University of Turku, Turku, Finland
  - Department of Child Psychiatry, Turku University Central Hospital, Turku, Finland
  - Department of Psychiatry, College of Physicians and Surgeons of Columbia University and New York State Psychiatric Institute, New York, New York, USA
- Alan S Brown
  - Department of Psychiatry, College of Physicians and Surgeons of Columbia University and New York State Psychiatric Institute, New York, New York, USA
  - Department of Epidemiology, Columbia University Mailman School of Public Health, New York, New York, USA
|
40
|
Abstract
Summary
In many dimension reduction problems in statistics and machine learning, such as in principal component analysis, canonical correlation analysis, independent component analysis and sufficient dimension reduction, it is important to determine the dimension of the reduced predictor, which often amounts to estimating the rank of a matrix. This problem is called order determination. In this article, we propose a novel and highly effective order-determination method based on the idea of predictor augmentation. We show that if the predictor is augmented by an artificially generated random vector, then the parts of the eigenvectors of the matrix induced by the augmentation display a pattern that reveals information about the order to be determined. This information, when combined with the information provided by the eigenvalues of the matrix, greatly enhances the accuracy of order determination.
Affiliation(s)
- Wei Luo
  - Center for Data Science, Zhejiang University, 866 Yuhangtang Road, Hangzhou 310058, China
- Bing Li
  - Department of Statistics, The Pennsylvania State University, 326 Thomas Building, University Park, Pennsylvania 16802, U.S.A.
|
41
|
Liu W, Ke Y, Liu J, Li R. Model-Free Feature Screening and FDR Control With Knockoff Features. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1783274] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/24/2022]
Affiliation(s)
- Wanjun Liu
  - Department of Statistics, The Pennsylvania State University, University Park, PA
- Yuan Ke
  - Department of Statistics, University of Georgia, Athens, GA
- Jingyuan Liu
  - MOE Key Laboratory of Econometrics, Department of Statistics, School of Economics, Wang Yanan Institute for Studies in Economics, and Fujian Key Lab of Statistics, Xiamen University, Xiamen, China
- Runze Li
  - Department of Statistics, The Pennsylvania State University, University Park, PA
|
42
|
Gégout-Petit A, Gueudin-Muller A, Karmann C. The revisited knockoffs method for variable selection in L1-penalized regressions. COMMUN STAT-SIMUL C 2020. [DOI: 10.1080/03610918.2020.1775850] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
Affiliation(s)
- Anne Gégout-Petit
  - Université de Lorraine, CNRS, Inria, IECL, Nancy, France
  - Inria BIGS team
- Clémence Karmann
  - Université de Lorraine, CNRS, Inria, IECL, Nancy, France
  - Inria BIGS team
|
43
|
|
44
|
Tian Z, Liang K, Li P. A powerful procedure that controls the false discovery rate with directional information. Biometrics 2020; 77:212-222. [PMID: 32277471 DOI: 10.1111/biom.13277] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/04/2019] [Revised: 02/14/2020] [Accepted: 03/23/2020] [Indexed: 11/28/2022]
Abstract
In many multiple testing applications in genetics, the signs of the test statistics provide useful directional information, such as whether genes are potentially up- or down-regulated between two experimental conditions. However, most existing procedures that control the false discovery rate (FDR) are P-value based and ignore such directional information. We introduce a novel procedure, the signed-knockoff procedure, to utilize the directional information and control the FDR in finite samples. We demonstrate the power advantage of our procedure through simulation studies and two real applications.
Affiliation(s)
- Zhaoyang Tian
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Kun Liang
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Pengfei Li
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
|
45
|
Zhao SD, Nguyen YT. Nonparametric false discovery rate control for identifying simultaneous signals. Electron J Stat 2020. [DOI: 10.1214/19-ejs1663] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
|