151
Yang S, Wen J, Eckert ST, Wang Y, Liu DJ, Wu R, Li R, Zhan X. Prioritizing genetic variants in GWAS with lasso using permutation-assisted tuning. Bioinformatics 2020; 36:3811-3817. [PMID: 32246825 DOI: 10.1093/bioinformatics/btaa229] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/30/2019] [Revised: 02/19/2020] [Accepted: 03/31/2020] [Indexed: 01/13/2023] Open
Abstract
MOTIVATION Large-scale genome-wide association studies (GWAS) have identified a wide range of genetic variants related to a host of complex traits and disorders. Despite their success, the individual single-nucleotide polymorphism (SNP) analysis approach adopted in most current GWAS is limited: it is usually too biologically simplistic to elucidate the comprehensive genetic architecture of phenotypes, and it is statistically underpowered owing to the heavy multiple-testing correction burden. Multiple-SNP analyses (e.g. gene-based or region-based SNP-set analysis), on the other hand, are usually more powerful for examining the joint effects of a set of SNPs on the phenotype of interest. However, current multiple-SNP approaches can only draw an overall conclusion at the SNP-set level and do not directly indicate which SNPs in the set are driving the overall genotype-phenotype association. RESULTS In this article, we propose a new permutation-assisted tuning procedure for the lasso (plasso) to identify phenotype-associated SNPs in a joint multiple-SNP regression model in GWAS. The tuning parameter of the lasso determines the amount of shrinkage and is essential to the performance of variable selection. In the proposed plasso procedure, we first generate permutations as pseudo-SNPs that are not associated with the phenotype. Then, the lasso tuning parameter is chosen to separate true signal SNPs from non-informative pseudo-SNPs. We use simulations to demonstrate the superior performance of plasso over existing methods, and applying plasso to a real GWAS dataset yields new insights into the genetic control of complex traits. AVAILABILITY AND IMPLEMENTATION R code implementing the proposed methodology is available at https://github.com/xyz5074/plasso. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Songshan Yang
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802
- Jiawei Wen
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802
- Scott T Eckert
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033
- Yaqun Wang
  - Department of Biostatistics, Rutgers University, New Brunswick, NJ 08901, USA
- Dajiang J Liu
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033
- Rongling Wu
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033
- Runze Li
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802
- Xiang Zhan
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033
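The permutation idea in the plasso abstract above can be illustrated with a minimal sketch. It is not the authors' R implementation: the stopping rule (walk down a grid of penalties and keep the last one at which no permuted pseudo-variable has a nonzero coefficient), the grid itself, and the names `lasso_cd` and `permutation_tuned_lasso` are assumptions made for illustration, and the lasso is fitted with a plain coordinate-descent loop on a centered response.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - Xb; b starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # correlation of column j with the partial residual
            rho = sum(X[i][j] * (resid[i] + X[i][j] * b[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return b


def permutation_tuned_lasso(X, y, n_lambdas=12, seed=0):
    """Hypothetical tuning rule: smallest grid penalty with no pseudo-variable selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    # Pseudo-variables: each column with its rows shuffled, so any link to y is broken.
    pseudo = []
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        rng.shuffle(col)
        pseudo.append(col)
    X_aug = [list(X[i]) + [pseudo[j][i] for j in range(p)] for i in range(n)]
    # At lam_max every coefficient is exactly zero.
    lam_max = max(abs(sum(X_aug[i][j] * y[i] for i in range(n))) / n
                  for j in range(2 * p))
    best = None
    for k in range(n_lambdas):
        lam = lam_max * 0.01 ** (k / (n_lambdas - 1))  # log-spaced, decreasing
        coef = lasso_cd(X_aug, y, lam)
        if any(c != 0.0 for c in coef[p:]):
            break  # a pseudo-variable entered the model: stop
        best = (lam, coef[:p])
    return best
```

Calling `permutation_tuned_lasso(X, y)` with `X` as a list of rows and `y` as a centered list of floats returns the chosen penalty and the coefficients of the original variables; nonzero entries are the selected variables under this assumed rule.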
152
Tansey W, Wang Y, Rabadan R, Blei DM. Double Empirical Bayes Testing. Int Stat Rev 2020; 88:S91-S113. [PMID: 35356801 PMCID: PMC8963776 DOI: 10.1111/insr.12430] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/19/2020] [Accepted: 10/20/2020] [Indexed: 12/18/2022]
Abstract
Analyzing data from large-scale, multi-experiment studies requires scientists both to analyze each experiment and to assess the results as a whole. In this article, we develop double empirical Bayes testing (DEBT), an empirical Bayes method for analyzing multi-experiment studies when many covariates are gathered per experiment. DEBT is a two-stage method: in the first stage, it reports which experiments yielded significant outcomes; in the second stage, it hypothesizes which covariates drive the experimental significance. In both of its stages, DEBT builds on Efron (2008), which lays out an elegant empirical Bayes approach to testing. DEBT enhances this framework by learning a series of black-box predictive models to boost power and control the false discovery rate (FDR). In Stage 1, it uses a deep neural network prior to report which experiments yielded significant outcomes. In Stage 2, it uses an empirical Bayes version of the knockoff filter (Candes et al., 2018) to select covariates that have significant predictive power for Stage-1 significance. In both simulated and real data, DEBT increases the proportion of discovered significant outcomes and selects more features when signals are weak. In a real study of cancer cell lines, DEBT selects a robust set of biologically plausible genomic drivers of drug sensitivity and resistance in cancer.
Affiliation(s)
- Wesley Tansey
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Yixin Wang
  - Department of Statistics, Columbia University, New York, NY, USA
- Raul Rabadan
  - Department of Systems Biology, Columbia University Medical Center, New York, NY, USA
- David M. Blei
  - Department of Statistics, Columbia University, New York, NY, USA
  - Department of Computer Science, Columbia University, New York, NY, USA
153
Katsevich E, Ramdas A. Simultaneous high-probability bounds on the false discovery proportion in structured, regression and online settings. Ann Stat 2020. [DOI: 10.1214/19-aos1938] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]
154
Gyllenberg D, McKeague IW, Sourander A, Brown AS. Robust data-driven identification of risk factors and their interactions: A simulation and a study of parental and demographic risk factors for schizophrenia. Int J Methods Psychiatr Res 2020; 29:1-11. [PMID: 32520440 PMCID: PMC7723216 DOI: 10.1002/mpr.1834] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/09/2019] [Revised: 03/12/2020] [Accepted: 04/29/2020] [Indexed: 01/01/2023] Open
Abstract
OBJECTIVES Few interactions between risk factors for schizophrenia have been replicated, but fitting all such interactions is difficult due to high-dimensionality. Our aims are to examine significant main and interaction effects for schizophrenia and the performance of our approach using simulated data. METHODS We apply the machine learning technique elastic net to a high-dimensional logistic regression model to produce a sparse set of predictors, and then assess the significance of odds ratios (OR) with Bonferroni-corrected p-values and confidence intervals (CI). We introduce a simulation model that resembles a Finnish nested case-control study of schizophrenia which uses national registers to identify cases (n = 1,468) and controls (n = 2,975). The predictors include nine sociodemographic factors and all interactions (31 predictors). RESULTS In the simulation, interactions with OR = 3 and prevalence = 4% were identified with <5% false positive rate and ≥80% power. None of the studied interactions were significantly associated with schizophrenia, but main effects of parental psychosis (OR = 5.2, CI 2.9-9.7; p < .001), urbanicity (1.3, 1.1-1.7; p = .001), and paternal age ≥35 (1.3, 1.004-1.6; p = .04) were significant. CONCLUSIONS We have provided an analytic pipeline for data-driven identification of main and interaction effects in case-control data. We identified highly replicated main effects for schizophrenia, but no interactions.
Affiliation(s)
- David Gyllenberg
  - Department of Child Psychiatry, University of Turku, Turku, Finland
  - Department of Adolescent Psychiatry, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland
  - Welfare Department, National Institute for Health and Welfare, Helsinki, Finland
- Ian W McKeague
  - Department of Biostatistics, Columbia University Mailman School of Public Health, New York, New York, USA
- Andre Sourander
  - Department of Child Psychiatry, University of Turku, Turku, Finland
  - Department of Child Psychiatry, Turku University Central Hospital, Turku, Finland
  - Department of Psychiatry, College of Physicians and Surgeons of Columbia University and New York State Psychiatric Institute, New York, New York, USA
- Alan S Brown
  - Department of Psychiatry, College of Physicians and Surgeons of Columbia University and New York State Psychiatric Institute, New York, New York, USA
  - Department of Epidemiology, Columbia University Mailman School of Public Health, New York, New York, USA
155
Couté Y, Bruley C, Burger T. Beyond Target-Decoy Competition: Stable Validation of Peptide and Protein Identifications in Mass Spectrometry-Based Discovery Proteomics. Anal Chem 2020; 92:14898-14906. [PMID: 32970414 DOI: 10.1021/acs.analchem.0c00328] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Indexed: 12/17/2022]
Abstract
In bottom-up discovery proteomics, target-decoy competition (TDC) is the most popular method for false discovery rate (FDR) control. Despite unquestionable statistical foundations, this method has drawbacks, including its hitherto unknown intrinsic lack of stability vis-à-vis practical conditions of application. Although some consequences of this instability have already been empirically described, they may have been misinterpreted. This article provides evidence that TDC has become less reliable as the accuracy of modern mass spectrometers improved. We therefore propose to replace TDC by a totally different method to control the FDR at the spectrum, peptide, and protein levels, while benefiting from the theoretical guarantees of the Benjamini-Hochberg framework. As this method is simpler to use, faster to compute, and more stable than TDC, we argue that it is better adapted to the standardization and throughput constraints of current proteomic platforms.
Affiliation(s)
- Yohann Couté
  - Université Grenoble Alpes, CNRS, CEA, INSERM, IRIG, BGE, F-38000 Grenoble, France
- Christophe Bruley
  - Université Grenoble Alpes, CNRS, CEA, INSERM, IRIG, BGE, F-38000 Grenoble, France
- Thomas Burger
  - Université Grenoble Alpes, CNRS, CEA, INSERM, IRIG, BGE, F-38000 Grenoble, France
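The Benjamini-Hochberg framework named in the abstract above rests on a simple step-up rule. The sketch below is the generic procedure applied to a plain list of p-values; it is not the authors' proteomics pipeline, and how spectrum, peptide, or protein scores would be converted to p-values is left outside the example. The function name `benjamini_hochberg` is chosen here for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices rejected by the Benjamini-Hochberg step-up rule.

    With the p-values sorted as p_(1) <= ... <= p_(m), find the largest k
    with p_(k) <= alpha * k / m and reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k_max = rank  # keep the largest rank that passes
    return sorted(order[:k_max])
```

Because the rule keeps the largest passing rank, a p-value can be rejected even when it misses its own threshold, provided a larger p-value further down the sorted list passes.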
157
Li JJ, Tong X. Statistical Hypothesis Testing versus Machine Learning Binary Classification: Distinctions and Guidelines. Patterns (New York, N.Y.) 2020; 1:100115. [PMID: 33073257 PMCID: PMC7546185 DOI: 10.1016/j.patter.2020.100115] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 11/25/2022]
Abstract
Making binary decisions is a common data analytical task in scientific research and industrial applications. In data sciences, there are two related but distinct strategies: hypothesis testing and binary classification. In practice, how to choose between these two strategies can be unclear and rather confusing. Here, we summarize key distinctions between these two strategies in three aspects and list five practical guidelines for data analysts to choose the appropriate strategy for specific analysis needs. We demonstrate the use of those guidelines in a cancer driver gene prediction example.
Affiliation(s)
- Jingyi Jessica Li
  - Department of Statistics, University of California, Los Angeles, CA 90095-1554, USA
- Xin Tong
  - Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, CA 90089, USA
158
Abstract
Summary
In many dimension reduction problems in statistics and machine learning, such as in principal component analysis, canonical correlation analysis, independent component analysis and sufficient dimension reduction, it is important to determine the dimension of the reduced predictor, which often amounts to estimating the rank of a matrix. This problem is called order determination. In this article, we propose a novel and highly effective order-determination method based on the idea of predictor augmentation. We show that if the predictor is augmented by an artificially generated random vector, then the parts of the eigenvectors of the matrix induced by the augmentation display a pattern that reveals information about the order to be determined. This information, when combined with the information provided by the eigenvalues of the matrix, greatly enhances the accuracy of order determination.
Affiliation(s)
- Wei Luo
  - Center for Data Science, Zhejiang University, 866 Yuhangtang Road, Hangzhou 310058, China
- Bing Li
  - Department of Statistics, The Pennsylvania State University, 326 Thomas Building, University Park, Pennsylvania 16802, U.S.A.
159
Abstract
This paper concerns statistical inference for longitudinal data with ultrahigh-dimensional covariates. We first study the problem of constructing confidence intervals and hypothesis tests for a low-dimensional parameter of interest. The major challenge is how to construct a powerful test statistic in the presence of high-dimensional nuisance parameters and sophisticated within-subject correlation of longitudinal data. To deal with this challenge, we propose a new quadratic decorrelated inference function approach, which simultaneously removes the impact of nuisance parameters and incorporates the correlation to enhance the efficiency of the estimation procedure. When the parameter of interest is of fixed dimension, we prove that the proposed estimator is asymptotically normal and attains the semiparametric information bound, based on which we can construct an optimal Wald test statistic. We further extend this result and establish the limiting distribution of the estimator when the dimension of the parameter of interest grows with the sample size at a polynomial rate. Finally, we study how to control the false discovery rate (FDR) when a vector of high-dimensional regression parameters is of interest. We prove that applying Storey's (2002) procedure to the proposed test statistics for each regression parameter controls the FDR asymptotically in longitudinal data. We conduct simulation studies to assess the finite-sample performance of the proposed procedures. Our simulation results indicate that the newly proposed procedure can control both the Type I error for testing a low-dimensional parameter of interest and the FDR in the multiple testing problem. We also apply the proposed procedure to a real data example.
Affiliation(s)
- Ethan X Fang
  - Department of Statistics, the Pennsylvania State University, University Park, PA 16802-2111, USA
- Yang Ning
  - Department of Statistics and Data Science, Cornell University, Ithaca, NY 14850, USA
- Runze Li
  - Department of Statistics, the Pennsylvania State University, University Park, PA 16802-2111, USA
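Storey's (2002) procedure, which the abstract above applies to its test statistics, can be sketched for a plain list of p-values. The estimators below are the standard point estimates: the null proportion is estimated as #{p_i > lambda} / (m (1 - lambda)), and the FDR at threshold t as pi0 * m * t / max(R(t), 1). The tuning value `lam=0.5`, the truncation of the null proportion at 1, the restriction of candidate thresholds to observed p-values, and all function names are assumptions made for illustration; the decorrelated test statistics of the paper are not reproduced.

```python
def storey_pi0(pvalues, lam=0.5):
    """Estimate the proportion of true nulls from p-values above lam."""
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def storey_fdr(pvalues, t, lam=0.5):
    """Estimated FDR when rejecting every p-value <= t."""
    m = len(pvalues)
    rejections = sum(1 for p in pvalues if p <= t)
    return storey_pi0(pvalues, lam) * m * t / max(rejections, 1)


def storey_reject(pvalues, alpha=0.05, lam=0.5):
    """Reject at the largest observed threshold whose estimated FDR is <= alpha."""
    passing = [t for t in sorted(set(pvalues))
               if storey_fdr(pvalues, t, lam) <= alpha]
    if not passing:
        return []
    t_star = passing[-1]
    return [i for i, p in enumerate(pvalues) if p <= t_star]
```

When the estimated null proportion is below 1, this rule is less conservative than Benjamini-Hochberg at the same nominal level, which is the usual motivation for the adaptive estimate.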
162
Lei L, Bickel PJ. An assumption-free exact test for fixed-design linear models with exchangeable errors. Biometrika 2020. [DOI: 10.1093/biomet/asaa079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022] Open
Abstract
Summary
We propose the cyclic permutation test to test general linear hypotheses for linear models. The test is nonrandomized and valid in finite samples with exact Type I error $\alpha$ for an arbitrary fixed design matrix and arbitrary exchangeable errors, whenever $1 / \alpha$ is an integer and $n / p \geqslant 1 / \alpha - 1$, where $n$ is the sample size and $p$ is the number of parameters. The test involves applying the marginal rank test to $1 / \alpha$ linear statistics of the outcome vector, where the coefficient vectors are determined by solving a linear system such that the joint distribution of the linear statistics is invariant with respect to a nonstandard cyclic permutation group under the null hypothesis. The power can be further enhanced by solving a secondary nonlinear travelling salesman problem, for which the genetic algorithm can find a reasonably good solution. Extensive simulation studies show that the cyclic permutation test has comparable power to existing tests. When testing for a single contrast of coefficients, an exact confidence interval can be obtained by inverting the test.
Affiliation(s)
- Lihua Lei
  - Department of Statistics, Stanford University, 202 Sequoia Hall, 390 Serra Mall, Stanford, California 94305, U.S.A.
- Peter J Bickel
  - Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720, U.S.A.
163
Bates S, Sesia M, Sabatti C, Candès E. Causal inference in genetic trio studies. Proc Natl Acad Sci U S A 2020; 117:24117-24126. [PMID: 32948695 PMCID: PMC7533659 DOI: 10.1073/pnas.2007743117] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 04/23/2020] [Accepted: 08/11/2020] [Indexed: 12/26/2022] Open
Abstract
We introduce a method to draw causal inferences (inferences immune to all possible confounding) from genetic data that include parents and offspring. Causal conclusions are possible with these data because the natural randomness in meiosis can be viewed as a high-dimensional randomized experiment. We make this observation actionable by developing a conditional independence test that identifies regions of the genome containing distinct causal variants. The proposed digital twin test compares an observed offspring to carefully constructed synthetic offspring from the same parents to determine statistical significance, and it can leverage any black-box multivariate model and additional nontrio genetic data to increase power. Crucially, our inferences are based only on a well-established mathematical model of recombination and make no assumptions about the relationship between the genotypes and phenotypes. We compare our method to the widely used transmission disequilibrium test and demonstrate enhanced power and localization.
Affiliation(s)
- Stephen Bates
  - Department of Statistics, Stanford University, Stanford, CA 94305
- Matteo Sesia
  - Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
- Chiara Sabatti
  - Department of Statistics, Stanford University, Stanford, CA 94305
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305
- Emmanuel Candès
  - Department of Statistics, Stanford University, Stanford, CA 94305
  - Department of Mathematics, Stanford University, Stanford, CA 94305
164
Guo J, Jin M, Chen Y, Liu J. An embedded gene selection method using knockoffs optimizing neural network. BMC Bioinformatics 2020; 21:414. [PMID: 32962627 PMCID: PMC7510330 DOI: 10.1186/s12859-020-03717-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/03/2020] [Accepted: 08/19/2020] [Indexed: 11/30/2022] Open
Abstract
Background Gene selection refers to finding a small subset of discriminant genes from gene expression profiles. How to effectively select genes that affect specific phenotypic traits is an important research problem in biology. Neural networks fit nonlinear data well and can capture features automatically and flexibly. In this work, we propose an embedded gene selection method using a neural network. The important genes are obtained by calculating the weight coefficients after training is completed. To address the black-box nature of neural networks and make the training results more interpretable, we use the idea of knockoffs to construct knockoff feature genes for the original feature genes. This method not only makes the feature genes compete with one another, but also makes each feature gene compete with its knockoff feature gene. This approach can help to select the key genes that affect the decision-making of neural networks. Results We use the maize carotenoids, tocopherol methyltransferase, raffinose family oligosaccharides and human breast cancer datasets for verification and analysis. Conclusions The experimental results demonstrate that the knockoffs optimizing neural network method detects relevant genes better than other existing algorithms, especially when processing nonlinear gene expression and phenotype data.
Affiliation(s)
- Juncheng Guo
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
  - Institute of Information Engineering, Chinese Academy of Sciences, Beijing, 10049, China
  - School of Cyber Security, University of Chinese Academy of Sciences, Beijing, 10049, China
- Min Jin
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Yuanyuan Chen
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Jianxiao Liu
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
165
Wu W, Yin X. Pseudo estimation and variable selection in regression. J Stat Plan Inference 2020. [DOI: 10.1016/j.jspi.2020.01.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
166
Huang Y, Jin W, Yu Z, Li B. Supervised feature selection through Deep Neural Networks with pairwise connected structure. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106202] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 11/16/2022]
167
Lei L, Ramdas A, Fithian W. A general interactive framework for false discovery rate control under structural constraints. Biometrika 2020. [DOI: 10.1093/biomet/asaa064] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022] Open
Abstract
Summary
We propose a general framework based on selectively traversed accumulation rules for interactive multiple testing with generic structural constraints on the rejection set. It combines accumulation tests from ordered multiple testing with data-carving ideas from post-selection inference, allowing highly flexible adaptation to generic structural information. Our procedure defines an interactive protocol for gradually pruning a candidate rejection set, beginning with the set of all hypotheses and shrinking the set with each step. By restricting the information at each step via a technique we call masking, our protocol enables interaction while controlling the false discovery rate in finite samples for any data-adaptive update rule that the analyst may choose. We suggest update rules for a variety of applications with complex structural constraints, demonstrate that selectively traversed accumulation rules perform well in problems ranging from convex region detection to false discovery rate control on directed acyclic graphs, and show how to extend the framework to regression problems where knockoff statistics are available in lieu of $p$-values.
Affiliation(s)
- Lihua Lei
  - Department of Statistics, Stanford University, 202 Sequoia Hall, 390 Serra Mall, Stanford, California 94305, U.S.A.
- Aaditya Ramdas
  - Department of Statistics and Data Science, Carnegie Mellon University, 132H Baker Hall, Pittsburgh, Pennsylvania 15213, U.S.A.
- William Fithian
  - Department of Statistics, University of California, Berkeley, 301 Evans Hall, Berkeley, California 94720, U.S.A.
168
Liu W, Ke Y, Liu J, Li R. Model-Free Feature Screening and FDR Control With Knockoff Features. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1783274] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/24/2022]
Affiliation(s)
- Wanjun Liu
  - Department of Statistics, The Pennsylvania State University, University Park, PA
- Yuan Ke
  - Department of Statistics, University of Georgia, Athens, GA
- Jingyuan Liu
  - MOE Key Laboratory of Econometrics, Department of Statistics, School of Economics, Wang Yanan Institute for Studies in Economics, and Fujian Key Lab of Statistics, Xiamen University, Xiamen, China
- Runze Li
  - Department of Statistics, The Pennsylvania State University, University Park, PA
169
Gégout-Petit A, Gueudin-Muller A, Karmann C. The revisited knockoffs method for variable selection in L1-penalized regressions. Commun Stat Simul Comput 2020. [DOI: 10.1080/03610918.2020.1775850] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
Affiliation(s)
- Anne Gégout-Petit
  - Université de Lorraine, CNRS, Inria, IECL, Nancy, France
  - Inria BIGS team
- Clémence Karmann
  - Université de Lorraine, CNRS, Inria, IECL, Nancy, France
  - Inria BIGS team
170
Wang G, Sarkar A, Carbonetto P, Stephens M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J R Stat Soc Series B Stat Methodol 2020; 82:1273-1300. [DOI: 10.1111/rssb.12388] [Citation(s) in RCA: 176] [Impact Index Per Article: 44.0] [Indexed: 12/19/2022]
172
Lian J, Shi Y, Zhang Y, Jia W, Fan X, Zheng Y. Revealing False Positive Features in Epileptic EEG Identification. Int J Neural Syst 2020; 30:2050017. [DOI: 10.1142/s0129065720500173] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 11/18/2022]
Abstract
Feature selection plays a vital role in the detection and discrimination of epileptic seizures in electroencephalogram (EEG) signals. State-of-the-art EEG classification techniques commonly entail the extraction of multiple features that are then fed into classifiers. For some techniques, feature selection strategies have been used to reduce the dimensionality of the entire feature space. However, most of these approaches focus on the performance of classifiers while neglecting the association between the feature and the EEG activity itself. To strengthen the relationship between the feature subset and the epileptic EEG task while retaining promising classification accuracy, we propose a machine learning-based pipeline using a novel feature selection algorithm built upon a knockoff filter. First, a number of temporal, spectral, and spatial features are extracted from the raw EEG signals. Second, the proposed feature selection algorithm is used to obtain the optimal subgroup of features. Afterwards, three classifiers, namely $k$-nearest neighbor (KNN), random forest (RF) and support vector machine (SVM), are used. The experimental results on the Bonn dataset demonstrate that the proposed approach outperforms the state-of-the-art techniques, with accuracy as high as 99.93% for normal and interictal EEG discrimination and 98.95% for interictal and ictal EEG classification. Meanwhile, it achieves satisfactory sensitivity (95.67% on average), specificity (98.83% on average), and accuracy (98.89% on average) on the Freiburg dataset.
Affiliation(s)
- Jian Lian
  - School of Information Science and Engineering, Shandong Normal University, Jinan 250358, P. R. China
  - Department of Electrical Engineering and Information Technology, Shandong University of Science and Technology, Jinan 250031, P. R. China
- Yunfeng Shi
  - School of Information Science and Engineering, Shandong Normal University, Jinan 250358, P. R. China
- Yan Zhang
  - Department of Electrical Engineering and Information Technology, Shandong University of Science and Technology, Jinan 250031, P. R. China
- Weikuan Jia
  - School of Information Science and Engineering, Shandong Normal University, Jinan 250358, P. R. China
- Xiaojun Fan
  - Antai College of Economics and Management, Shanghai Jiaotong University, Shanghai 200240, P. R. China
- Yuanjie Zheng
  - School of Information Science and Engineering, Shandong Normal University, Key Lab of Intelligent Computing and Information Security in Universities of Shandong, Institute of Life Sciences, Shandong Provincial Key Laboratory for Distributed Computer Software and Novel Technologies, and Key Lab of Intelligent Information Processing, Jinan 250358, P. R. China
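The knockoff filter that the entry above builds on ends with a data-dependent threshold on feature statistics. The sketch below shows the standard knockoff+ selection step for a given vector of statistics `W`, where a large positive `W[j]` means feature j beats its knockoff copy. It is a generic illustration rather than the authors' EEG algorithm: constructing the knockoff features and computing `W` are assumed to have been done elsewhere, and the function name is hypothetical.

```python
def knockoff_plus_select(W, q=0.1):
    """Select features with the knockoff+ threshold at target FDR level q.

    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    """
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(1 for w in W if w <= -t)  # proxy count of false positives
        positives = sum(1 for w in W if w >= t)
        if (1 + negatives) / max(positives, 1) <= q:
            return [j for j, w in enumerate(W) if w >= t]
    return []  # no threshold meets the target level
```

The count of large negative statistics stands in for the unknown number of false positives among the large positive ones, which is why the sign symmetry of null statistics matters for the guarantee.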
173
Javanmard A, Lee JD. A flexible framework for hypothesis testing in high dimensions. J R Stat Soc Series B Stat Methodol 2020. [DOI: 10.1111/rssb.12373] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/28/2022]
174
Tian Z, Liang K, Li P. A powerful procedure that controls the false discovery rate with directional information. Biometrics 2020; 77:212-222. [PMID: 32277471 DOI: 10.1111/biom.13277] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/04/2019] [Revised: 02/14/2020] [Accepted: 03/23/2020] [Indexed: 11/28/2022]
Abstract
In many multiple testing applications in genetics, the signs of the test statistics provide useful directional information, such as whether genes are potentially up- or down-regulated between two experimental conditions. However, most existing procedures that control the false discovery rate (FDR) are P-value based and ignore such directional information. We introduce a novel procedure, the signed-knockoff procedure, to utilize the directional information and control the FDR in finite samples. We demonstrate the power advantage of our procedure through simulation studies and two real applications.
Affiliation(s)
- Zhaoyang Tian
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Kun Liang
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Pengfei Li
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
175
Chen S, Arias-Castro E. On the power of some sequential multiple testing procedures. Ann Inst Stat Math 2020. [DOI: 10.1007/s10463-020-00752-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
176
|
Fu GH, Wu YJ, Zong MJ, Pan J. Hellinger distance-based stable sparse feature selection for high-dimensional class-imbalanced data. BMC Bioinformatics 2020; 21:121. [PMID: 32293252 PMCID: PMC7092448 DOI: 10.1186/s12859-020-3411-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/05/2019] [Accepted: 02/12/2020] [Indexed: 11/11/2022] Open
Abstract
Background: Feature selection in class-imbalance learning has gained increasing attention in recent years due to the massive growth of high-dimensional class-imbalanced data across many scientific fields. In addition to reducing model complexity and discovering key biomarkers, feature selection is also an effective method of combating the overlapping that may arise in such data and become a crucial factor in determining classification performance. However, ordinary feature selection techniques for classification cannot simply be applied to class-imbalanced data without adjustment. Thus, more efficient feature selection techniques must be developed for complicated class-imbalanced data, especially in the context of high dimensionality. Results: We proposed an algorithm called sssHD to achieve stable sparse feature selection and applied it to complicated class-imbalanced data. sssHD is based on the Hellinger distance (HD) coupled with sparse regularization techniques. We stated that the Hellinger distance is not only class-insensitive but also translation-invariant. Simulation results indicate that the HD-based selection algorithm is effective in recognizing key features and controlling false discoveries in class-imbalance learning. Five gene expression datasets are also employed to test the performance of the sssHD algorithm, and a comparison with several existing selection procedures is performed. The results show that sssHD is highly competitive in terms of five assessment metrics. In addition, sssHD presents limited differences between performing and not performing re-balance preprocessing. Conclusions: sssHD is a simple, practical feature selection method for high-dimensional class-imbalanced data and can serve as an alternative for performing feature selection on such data. sssHD can be easily extended by connecting it with different re-balance preprocessing, different sparse regularization structures, and different classifiers. As such, the algorithm is extremely general and has a wide range of applicability.
Affiliation(s)
- Guang-Hui Fu
- School of Science, Kunming University of Science and Technology, Kunming, 650500, People's Republic of China
- Yuan-Jiao Wu
- School of Science, Kunming University of Science and Technology, Kunming, 650500, People's Republic of China
- Min-Jie Zong
- School of Science, Kunming University of Science and Technology, Kunming, 650500, People's Republic of China
- Jianxin Pan
- School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK
|
177
|
Affiliation(s)
- Stephen Bates
- Department of Statistics, Stanford University, Stanford, CA
- Emmanuel Candès
- Department of Mathematics and Statistics, Stanford University, Stanford, CA
- Lucas Janson
- Department of Statistics, Harvard University, Cambridge, MA
- Wenshuo Wang
- Department of Statistics, Harvard University, Cambridge, MA
|
178
|
Sesia M, Katsevich E, Bates S, Candès E, Sabatti C. Multi-resolution localization of causal variants across the genome. Nat Commun 2020; 11:1093. [PMID: 32107378 PMCID: PMC7046731 DOI: 10.1038/s41467-020-14791-2] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 06/17/2019] [Accepted: 02/01/2020] [Indexed: 01/07/2023] Open
Abstract
In the statistical analysis of genome-wide association data, it is challenging to precisely localize the variants that affect complex traits, due to linkage disequilibrium, and to maximize power while limiting spurious findings. Here we report on KnockoffZoom: a flexible method that localizes causal variants at multiple resolutions by testing the conditional associations of genetic segments of decreasing width, while provably controlling the false discovery rate. Our method utilizes artificial genotypes as negative controls and is equally valid for quantitative and binary phenotypes, without requiring any assumptions about their genetic architectures. Instead, we rely on well-established genetic models of linkage disequilibrium. We demonstrate that our method can detect more associations than mixed effects models and achieve fine-mapping precision, at comparable computational cost. Lastly, we apply KnockoffZoom to data from 350k subjects in the UK Biobank and report many new findings.
Affiliation(s)
- Matteo Sesia
- Department of Statistics, Stanford University, Stanford, CA, 94305, USA
- Eugene Katsevich
- Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
- Stephen Bates
- Department of Statistics, Stanford University, Stanford, CA, 94305, USA
- Emmanuel Candès
- Departments of Mathematics and of Statistics, Stanford University, Stanford, CA, 94305, USA
- Chiara Sabatti
- Departments of Biomedical Data Science and of Statistics, Stanford University, Stanford, CA, 94305, USA
|
179
|
Hu S, Huo D, Yu Z, Chen Y, Liu J, Liu L, Wu X, Zhang Y. ncHMR detector: a computational framework to systematically reveal non-classical functions of histone modification regulators. Genome Biol 2020; 21:48. [PMID: 32093739 PMCID: PMC7038559 DOI: 10.1186/s13059-020-01953-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/03/2019] [Accepted: 02/06/2020] [Indexed: 01/02/2023] Open
Abstract
Recently, several non-classical functions of histone modification regulators (HMRs), independent of their known histone modification substrates and products, have been reported to be essential for specific cellular processes. However, there is no framework designed for identifying such functions systematically. Here, we develop ncHMR detector, the first computational framework to predict non-classical functions and cofactors of a given HMR, based on ChIP-seq data mining. We apply ncHMR detector in ChIP-seq data-rich cell types and predict non-classical functions of HMRs. Finally, we experimentally reveal that the predicted non-classical function of CBX7 is biologically significant for the maintenance of pluripotency.
Affiliation(s)
- Shengen Hu
- Institute for Regenerative Medicine, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, 200092 China
- Dawei Huo
- Department of Cell Biology, Tianjin Medical University, 2011 Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Qixiangtai Road 22, Tianjin, China
- Department of Neurosurgery, Tianjin Medical University General Hospital, Tianjin, China
- Zhaowei Yu
- Institute for Regenerative Medicine, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, 200092 China
- Yujie Chen
- Institute for Regenerative Medicine, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, 200092 China
- Jing Liu
- Institute for Regenerative Medicine, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, 200092 China
- Present address: Key Laboratory of Forensic Genetics, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Beijing, China
- Lin Liu
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA USA
- Xudong Wu
- Department of Cell Biology, Tianjin Medical University, 2011 Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Qixiangtai Road 22, Tianjin, China
- Department of Neurosurgery, Tianjin Medical University General Hospital, Tianjin, China
- State Key Laboratory of Experimental Hematology, Institute of Hematology and Blood Diseases Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Tianjin, 300020 China
- Yong Zhang
- Institute for Regenerative Medicine, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, 200092 China
|
180
|
Fu H, Archer KJ. High-dimensional variable selection for ordinal outcomes with error control. Brief Bioinform 2020; 22:334-345. [PMID: 32031572 DOI: 10.1093/bib/bbaa007] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/23/2019] [Revised: 01/06/2020] [Indexed: 12/24/2022] Open
Abstract
Many high-throughput genomic applications involve a large set of potential covariates and a response which is frequently measured on an ordinal scale, and it is crucial to identify which variables are truly associated with the response. Effectively controlling the false discovery rate (FDR) without sacrificing power has been a major challenge in variable selection research. This study reviews two existing variable selection frameworks, model-X knockoffs and a modified version of reference distribution variable selection (RDVS), both of which utilize artificial variables as benchmarks for decision making. Model-X knockoffs constructs a 'knockoff' variable for each covariate to mimic the covariance structure, while RDVS generates only one null variable and forms a reference distribution by performing multiple runs of model fitting. Herein, we describe how different importance measures for ordinal responses can be constructed that fit into these two selection frameworks, using either penalized regression or machine learning techniques. We compared these measures in terms of the FDR and power using simulated data. Moreover, we applied these two frameworks to high-throughput methylation data for identifying features associated with the progression from normal liver tissue to hepatocellular carcinoma to further compare and contrast their performances.
|
181
|
Tardivel PJC, Servien R, Concordet D. Simple expressions of the LASSO and SLOPE estimators in low-dimension. STATISTICS-ABINGDON 2020. [DOI: 10.1080/02331888.2020.1720019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/25/2022]
Affiliation(s)
- Rémi Servien
- INTHERES, Université de Toulouse, INRA, ENVT, Toulouse, France
|
182
|
Affiliation(s)
- Ying Liu
- Mental Health Data Science, Department of Psychiatry, Columbia University Irving Medical Center, New York, NY 10032, USA
- Cheng Zheng
- Joseph J. Zilber School of Public Health, University of Wisconsin‐Milwaukee, Milwaukee, WI 53211, USA
|
183
|
Duan B, Ramdas A, Balakrishnan S, Wasserman L. Interactive martingale tests for the global null. Electron J Stat 2020. [DOI: 10.1214/20-ejs1790] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
|
184
|
Gao RX, Wang L, Helu M, Teti R. Big data analytics for smart factories of the future. CIRP ANNALS ... MANUFACTURING TECHNOLOGY 2020; 9:10.1016/j.cirp.2020.05.002. [PMID: 39391704 PMCID: PMC11465481 DOI: 10.1016/j.cirp.2020.05.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 10/12/2024]
Abstract
Continued advancement of sensors has led to an ever-increasing amount of data of various physical natures being acquired from production lines. As rich information relevant to the machines and processes is embedded within these "big data", how to effectively and efficiently discover patterns in the big data to enhance productivity and economy has become both a challenge and an opportunity. This paper discusses essential elements of and promising solutions enabled by data science that are critical to processing data of high volume, velocity, variety, and low veracity, towards the creation of added value in smart factories of the future.
Affiliation(s)
- Robert X Gao
- Department of Mechanical and Aerospace Engineering, Case Western Reserve University, Cleveland, OH, USA
- Lihui Wang
- Department of Production Engineering, KTH Royal Institute of Technology, Stockholm, Sweden
- Moneer Helu
- Engineering Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Roberto Teti
- Department of Chemical, Materials and Industrial Production Engineering, University of Naples Federico II, Naples, Italy
|
185
|
Zhao SD, Nguyen YT. Nonparametric false discovery rate control for identifying simultaneous signals. Electron J Stat 2020. [DOI: 10.1214/19-ejs1663] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
|
186
|
Affiliation(s)
- Nan Bi
- Department of Statistics, Stanford University
- Lucy Xia
- Department of Statistics, Stanford University
|
187
|
Rinaldo A, Wasserman L, G’Sell M. Bootstrapping and sample splitting for high-dimensional, assumption-lean inference. Ann Stat 2019. [DOI: 10.1214/18-aos1784] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Indexed: 11/19/2022]
|
188
|
Affiliation(s)
- Yaniv Romano
- Department of Statistics, Stanford University, Stanford, CA
- Matteo Sesia
- Department of Statistics, Stanford University, Stanford, CA
|
189
|
|
190
|
Fan Y, Lv J, Sharifvaghefi M, Uematsu Y. IPAD: Stable Interpretable Forecasting with Knockoffs Inference. J Am Stat Assoc 2019; 115:1822-1834. [PMID: 33716359 PMCID: PMC7954402 DOI: 10.1080/01621459.2019.1654878] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 02/11/2019] [Revised: 06/07/2019] [Accepted: 08/03/2019] [Indexed: 10/26/2022]
Abstract
Interpretability and stability are two important features that are desired in many contemporary big data applications arising in statistics, economics, and finance. While the former is enjoyed to some extent by many existing forecasting approaches, the latter in the sense of controlling the fraction of wrongly discovered features which can enhance greatly the interpretability is still largely underdeveloped. To this end, in this paper we exploit the general framework of model-X knockoffs introduced recently in Candès, Fan, Janson and Lv (2018), which is nonconventional for reproducible large-scale inference in that the framework is completely free of the use of p-values for significance testing, and suggest a new method of intertwined probabilistic factors decoupling (IPAD) for stable interpretable forecasting with knockoffs inference in high-dimensional models. The recipe of the method is constructing the knockoff variables by assuming a latent factor model that is exploited widely in economics and finance for the association structure of covariates. Our method and work are distinct from the existing literature in that we estimate the covariate distribution from data instead of assuming that it is known when constructing the knockoff variables, our procedure does not require any sample splitting, we provide theoretical justifications on the asymptotic false discovery rate control, and the theory for the power analysis is also established. Several simulation examples and the real data analysis further demonstrate that the newly suggested method has appealing finite-sample performance with desired interpretability and stability compared to some popularly used forecasting methods.
|
191
|
Read DF, Cook K, Lu YY, Le Roch KG, Noble WS. Predicting gene expression in the human malaria parasite Plasmodium falciparum using histone modification, nucleosome positioning, and 3D localization features. PLoS Comput Biol 2019; 15:e1007329. [PMID: 31509524 PMCID: PMC6756558 DOI: 10.1371/journal.pcbi.1007329] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 02/11/2019] [Revised: 09/23/2019] [Accepted: 08/12/2019] [Indexed: 12/02/2022] Open
Abstract
Empirical evidence suggests that the malaria parasite Plasmodium falciparum employs a broad range of mechanisms to regulate gene transcription throughout the organism's complex life cycle. To better understand this regulatory machinery, we assembled a rich collection of genomic and epigenomic data sets, including information about transcription factor (TF) binding motifs, patterns of covalent histone modifications, nucleosome occupancy, GC content, and global 3D genome architecture. We used these data to train machine learning models to discriminate between high-expression and low-expression genes, focusing on three distinct stages of the red blood cell phase of the Plasmodium life cycle. Our results highlight the importance of histone modifications and 3D chromatin architecture in Plasmodium transcriptional regulation and suggest that AP2 transcription factors may play a limited regulatory role, perhaps operating in conjunction with epigenetic factors.
Affiliation(s)
- David F. Read
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Kate Cook
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Yang Y. Lu
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Karine G. Le Roch
- Department of Molecular, Cell and Systems Biology, University of California, Riverside, California, United States of America
- William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
|
192
|
False Discovery Rate Control in Cancer Biomarker Selection Using Knockoffs. Cancers (Basel) 2019; 11:cancers11060744. [PMID: 31146393 PMCID: PMC6628039 DOI: 10.3390/cancers11060744] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 05/10/2019] [Accepted: 05/23/2019] [Indexed: 11/17/2022] Open
Abstract
The discovery of biomarkers that are informative for cancer risk assessment, diagnosis, prognosis and treatment predictions is crucial. Recent advances in high-throughput genomics make it plausible to select biomarkers from the vast number of human genes in an unbiased manner. Yet, control of false discoveries is challenging given the large number of genes versus the relatively small number of patients in a typical cancer study. To ensure that most of the discoveries are true, we employ a knockoff procedure to control false discoveries. Our method is general and flexible, accommodating arbitrary covariate distributions, linear and nonlinear associations, and survival models. In simulations, our method compares favorably to the alternatives; its utility in identifying important genes in real clinical applications is demonstrated by the identification of seven genes associated with Breslow thickness in skin cutaneous melanoma patients.
|
193
|
Jeng XJ, Chen X. Predictor ranking and false discovery proportion control in high-dimensional regression. J MULTIVARIATE ANAL 2019. [DOI: 10.1016/j.jmva.2018.12.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022]
|
194
|
Banerjee K, Zhao N, Srinivasan A, Xue L, Hicks SD, Middleton FA, Wu R, Zhan X. An Adaptive Multivariate Two-Sample Test With Application to Microbiome Differential Abundance Analysis. Front Genet 2019; 10:350. [PMID: 31068967 PMCID: PMC6491633 DOI: 10.3389/fgene.2019.00350] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 01/08/2019] [Accepted: 04/01/2019] [Indexed: 01/21/2023] Open
Abstract
Differential abundance analysis is a crucial task in many microbiome studies, where the central goal is to identify microbiome taxa associated with certain biological or clinical conditions. There are two different modes of microbiome differential abundance analysis: the individual-based univariate differential abundance analysis and the group-based multivariate differential abundance analysis. The univariate analysis identifies differentially abundant microbiome taxa subject to multiple-testing correction under certain statistical error measurements such as false discovery rate, which is typically complicated by the high dimensionality of taxa and the complex correlation structure among taxa. The multivariate analysis evaluates the overall shift in the abundance of microbiome composition between two conditions, which provides useful preliminary differential information for the necessity of follow-up validation studies. In this paper, we present a novel Adaptive multivariate two-sample test for Microbiome Differential Analysis (AMDA) to examine whether the composition of a taxa-set is different between two conditions. Our simulation studies and real data applications demonstrated that the AMDA test was often more powerful than several competing methods while preserving the correct type I error rate. A free implementation of our AMDA method in R software is available at https://github.com/xyz5074/AMDA.
Affiliation(s)
- Kalins Banerjee
- Department of Public Health Sciences, Pennsylvania State University, Hershey, PA, United States
- Ni Zhao
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, United States
- Arun Srinivasan
- Department of Statistics, Pennsylvania State University, University Park, PA, United States
- Lingzhou Xue
- Department of Statistics, Pennsylvania State University, University Park, PA, United States
- Steven D. Hicks
- Department of Pediatrics, Pennsylvania State University, Hershey, PA, United States
- Frank A. Middleton
- Department of Neuroscience, State University of New York Upstate Medical University, Syracuse, NY, United States
- Rongling Wu
- Department of Public Health Sciences, Pennsylvania State University, Hershey, PA, United States
- Xiang Zhan
- Department of Public Health Sciences, Pennsylvania State University, Hershey, PA, United States. Correspondence: Xiang Zhan
|
195
|
Fan Y, Demirkaya E, Li G, Lv J. RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs. J Am Stat Assoc 2019; 115:362-379. [PMID: 32742045 PMCID: PMC7394464 DOI: 10.1080/01621459.2018.1546589] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 07/16/2018] [Revised: 10/09/2018] [Accepted: 10/31/2018] [Indexed: 10/27/2022]
Abstract
Power and reproducibility are key to enabling refined scientific discoveries in contemporary big data applications with general high-dimensional nonlinear models. In this paper, we provide theoretical foundations on the power and robustness of the model-X knockoffs procedure introduced recently in Candès, Fan, Janson and Lv (2018) in the high-dimensional setting when the covariate distribution is characterized by a Gaussian graphical model. We establish that under mild regularity conditions, the power of the oracle knockoffs procedure with known covariate distribution in high-dimensional linear models is asymptotically one as the sample size goes to infinity. When moving away from the ideal case, we suggest a modified model-X knockoffs method called graphical nonlinear knockoffs (RANK) to accommodate the unknown covariate distribution. We provide theoretical justifications on the robustness of our modified procedure by showing that the false discovery rate (FDR) is asymptotically controlled at the target level and the power is asymptotically one with the estimated covariate distribution. To the best of our knowledge, this is the first formal theoretical result on the power of the knockoffs procedure. Simulation results demonstrate that compared to existing approaches, our method performs competitively in both FDR control and power. A real data set is analyzed to further assess the performance of the suggested knockoffs procedure.
|
196
|
|
197
|
Barber RF, Candès E. On the construction of knockoffs in case–control studies. Stat (Int Stat Inst) 2019. [DOI: 10.1002/sta4.225] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/09/2022]
Affiliation(s)
- Rina Foygel Barber
- Department of Statistics, University of Chicago, Chicago, Illinois 60637‐5418
- Emmanuel Candès
- Departments of Mathematics and Statistics, Stanford University, Stanford, California 94305
|
198
|
Cai TT, Sun W, Wang W. Covariate‐assisted ranking and screening for large‐scale two‐sample inference. J R Stat Soc Series B Stat Methodol 2019. [DOI: 10.1111/rssb.12304] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Indexed: 01/17/2023]
Affiliation(s)
- Wenguang Sun
- University of Southern California, Los Angeles, USA
- Weinan Wang
- University of Southern California, Los Angeles, USA
|
199
|
Jewell SW, Witten DM. Discussion of 'Gene hunting with hidden Markov model knockoffs'. Biometrika 2019; 106:23-26. [PMID: 30799876 PMCID: PMC6373413 DOI: 10.1093/biomet/asy061] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/12/2018] [Indexed: 11/14/2022] Open
Affiliation(s)
- S W Jewell
- Department of Statistics, University of Washington, Seattle, Washington, U.S.A
- D M Witten
- Departments of Statistics and Biostatistics, University of Washington, Seattle, Washington, U.S.A
|
200
|
Katsevich E, Sabatti C. Multilayer knockoff filter: Controlled variable selection at multiple resolutions. Ann Appl Stat 2019; 13:1-33. [PMID: 31687060 PMCID: PMC6827557 DOI: 10.1214/18-aoas1185] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Indexed: 11/19/2022]
Abstract
We tackle the problem of selecting from among a large number of variables those that are "important" for an outcome. We consider situations where groups of variables are also of interest. For example, each variable might be a genetic polymorphism, and we might want to study how a trait depends on variability in genes, segments of DNA that typically contain multiple such polymorphisms. In this context, to discover that a variable is relevant for the outcome implies discovering that the larger entity it represents is also important. To guarantee meaningful results with high chance of replicability, we suggest controlling the rate of false discoveries for findings at the level of individual variables and at the level of groups. Building on the knockoff construction of Barber and Candès [Ann. Statist. 43 (2015) 2055-2085] and the multilayer testing framework of Barber and Ramdas [J. Roy. Statist. Soc. Ser. B 79 (2017) 1247-1268], we introduce the multilayer knockoff filter (MKF). We prove that MKF simultaneously controls the FDR at each resolution and use simulations to show that it incurs little power loss compared to methods that provide guarantees only for the discoveries of individual variables. We apply MKF to analyze a genetic dataset and find that it successfully reduces the number of false gene discoveries without a significant reduction in power.
Affiliation(s)
- Eugene Katsevich
- Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305
- Chiara Sabatti
- Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305
|