51
Fan J, Ma C, Wang K. Comment on “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression”. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1837138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
Affiliation(s)
- Jianqing Fan
  - Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ
- Cong Ma
  - Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA
- Kaizheng Wang
  - Department of Industrial Engineering and Operations Research, Columbia University, New York, NY

52
Wang L, Peng B, Bradic J, Li R, Wu Y. Rejoinder to “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression”. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1843865] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
Affiliation(s)
- Lan Wang
  - Department of Management Science, University of Miami, Coral Gables, FL
- Bo Peng
  - Adobe Systems, Inc., San Jose, CA
- Jelena Bradic
  - Department of Mathematics, Halicioglu Data Science Institute, University of California at San Diego, La Jolla, CA
- Runze Li
  - Department of Statistics, Pennsylvania State University, University Park, PA
- Yunan Wu
  - School of Statistics, University of Minnesota, Minneapolis, MN

53
Li X, Shojaie A. Discussion of “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression”. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1837139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
Affiliation(s)
- Xiudi Li
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
- Ali Shojaie
  - Department of Biostatistics, University of Washington, Seattle, WA, USA

54
Lin L, Drton M, Shojaie A. Statistical significance in high-dimensional linear mixed models. FODS '20: Proceedings of the 2020 ACM-IMS Foundations of Data Science Conference, October 19-20, 2020, Virtual Event, USA. ACM-IMS Foundations of Data Science Conference (2020: Online) 2020; 2020:171-181. [PMID: 35497571 PMCID: PMC9053448 DOI: 10.1145/3412815.3416883] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/14/2023]
Abstract
This paper concerns the development of an inferential framework for high-dimensional linear mixed effect models. These are suitable models, for instance, when we have n repeated measurements for M subjects. We consider a scenario where the number of fixed effects p is large (and may be larger than M), but the number of random effects q is small. Our framework is inspired by a recent line of work that proposes de-biasing penalized estimators to perform inference for high-dimensional linear models with fixed effects only. In particular, we demonstrate how to correct a 'naive' ridge estimator in extension of work by Bühlmann (2013) to build asymptotically valid confidence intervals for mixed effect models. We validate our theoretical results with numerical experiments, in which we show our method outperforms those that fail to account for correlation induced by the random effects. For a practical demonstration we consider a riboflavin production dataset that exhibits group structure, and show that conclusions drawn using our method are consistent with those obtained on a similar dataset without group structure.
Affiliation(s)
- Lina Lin
  - Department of Statistics, University of Washington
- Mathias Drton
  - Department of Mathematics, Technical University of Munich
- Ali Shojaie
  - Department of Biostatistics, University of Washington

55
Qiu Y, Zhou XH. Estimating c-level partial correlation graphs with application to brain imaging. Biostatistics 2020; 21:641-658. [PMID: 30596883 DOI: 10.1093/biostatistics/kxy076] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 02/09/2018] [Revised: 09/20/2018] [Accepted: 11/11/2018] [Indexed: 11/13/2022] Open
Abstract
Alzheimer's disease (AD) is a chronic neurodegenerative disease that changes the functional connectivity of the brain. The alteration of the strong connections between different brain regions is of particular interest to researchers. In this article, we use partial correlations to model the brain connectivity network and propose a data-driven procedure to recover a $c$-level partial correlation graph based on PET data, which is the graph of the absolute partial correlations larger than a pre-specified constant $c$. The proposed procedure is adaptive to the "large p, small n" scenario commonly seen in whole brain studies, and it incorporates the variation of the estimated partial correlations, which results in higher power compared to the existing methods. A case study on the FDG-PET images from AD and normal control (NC) subjects discovers new brain regions, Sup Frontal and Mid Frontal in the frontal lobe, which have different brain functional connectivity between AD and NC.
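To make the definition in this abstract concrete, the sketch below builds a $c$-level graph from a known precision (inverse covariance) matrix by thresholding absolute partial correlations at $c$. It is only an illustration of the definition: the function names and the toy matrix are assumptions, and the plain thresholding shown here does not account for the variation of estimated partial correlations that the abstract says the proposed procedure incorporates.

```python
import math


def partial_correlations(precision):
    """Convert a precision matrix into a matrix of partial correlations.

    Uses the standard identity rho_ij = -omega_ij / sqrt(omega_ii * omega_jj).
    """
    p = len(precision)
    rho = [[1.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i != j:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                rho[i][j] = -precision[i][j] / scale
    return rho


def c_level_edges(precision, c):
    """Return the pairs (i, j), i < j, whose absolute partial correlation exceeds c."""
    rho = partial_correlations(precision)
    p = len(rho)
    return [(i, j) for i in range(p) for j in range(i + 1, p) if abs(rho[i][j]) > c]


# Hypothetical 3-variable precision matrix, chosen only for illustration.
omega = [
    [2.0, -1.0, 0.2],
    [-1.0, 2.0, 0.0],
    [0.2, 0.0, 1.0],
]
edges = c_level_edges(omega, c=0.3)
print(edges)
```

In a "large p, small n" setting the precision matrix is not known and has to be estimated, which is where a data-driven procedure such as the one in the cited article comes in.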
Affiliation(s)
- Yumou Qiu
  - Department of Statistics, Iowa State University, 2438 Osborn Dr., Ames, Iowa, USA
- Xiao-Hua Zhou
  - Beijing International Center for Mathematical Research, Peking University, No. 5 Yiheyuan Rd., Haidian District, Beijing, P. R. China

56
Wu J, Zheng Z, Li Y, Zhang Y. Scalable interpretable learning for multi-response error-in-variables regression. J Multivariate Anal 2020. [DOI: 10.1016/j.jmva.2020.104644] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]

57
Comparing six shrinkage estimators with large sample theory and asymptotically optimal prediction intervals. Stat Pap (Berl) 2020. [DOI: 10.1007/s00362-020-01193-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 10/24/2022]

58
Regression Models for Compositional Data: General Log-Contrast Formulations, Proximal Optimization, and Microbiome Data Applications. Statistics in Biosciences 2020. [DOI: 10.1007/s12561-020-09283-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 10/24/2022]
Abstract
Compositional data sets are ubiquitous in science, including geology, ecology, and microbiology. In microbiome research, compositional data primarily arise from high-throughput sequence-based profiling experiments. These data comprise microbial compositions in their natural habitat and are often paired with covariate measurements that characterize physicochemical habitat properties or the physiology of the host. Inferring parsimonious statistical associations between microbial compositions and habitat- or host-specific covariate data is an important step in exploratory data analysis. A standard statistical model linking compositional covariates to continuous outcomes is the linear log-contrast model. This model describes the response as a linear combination of log-ratios of the original compositions and has been extended to the high-dimensional setting via regularization. In this contribution, we propose a general convex optimization model for linear log-contrast regression which includes many previous proposals as special cases. We introduce a proximal algorithm that solves the resulting constrained optimization problem exactly with rigorous convergence guarantees. We illustrate the versatility of our approach by investigating the performance of several model instances on soil and gut microbiome data analysis tasks.

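For readers unfamiliar with the linear log-contrast model named in the abstract of entry 58, the block below writes it out in its usual textbook form. The notation is an assumption made here for illustration (response $y_i$, composition $x_{i1}, \dots, x_{ip}$ with positive parts, coefficients $\beta_j$, noise $\varepsilon_i$, tuning parameter $\lambda$, intercept omitted), and the $\ell_1$-penalized problem is shown as one example of a regularized instance, not as the general model proposed in the article.

```latex
% Linear log-contrast model with a zero-sum constraint on the coefficients
y_i = \sum_{j=1}^{p} \beta_j \log x_{ij} + \varepsilon_i ,
\qquad \text{subject to } \sum_{j=1}^{p} \beta_j = 0 .

% Equivalent log-ratio form, using component p as the reference
y_i = \sum_{j=1}^{p-1} \beta_j \log \frac{x_{ij}}{x_{ip}} + \varepsilon_i ,
\qquad \beta_p = - \sum_{j=1}^{p-1} \beta_j .

% One sparse, high-dimensional instance: l1-penalized least squares
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p}
  \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} \beta_j \log x_{ij} \Bigr)^2
  + \lambda \lVert \beta \rVert_1
\quad \text{subject to } \sum_{j=1}^{p} \beta_j = 0 .
```

The two forms agree because expanding the log-ratios gives a coefficient of $-\sum_{j<p} \beta_j$ on $\log x_{ip}$, which is exactly $\beta_p$ under the zero-sum constraint.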
59
Javanmard A, Lee JD. A flexible framework for hypothesis testing in high dimensions. J R Stat Soc Series B Stat Methodol 2020. [DOI: 10.1111/rssb.12373] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/28/2022]

60
Janková J, Shah RD, Bühlmann P, Samworth RJ. Goodness-of-fit testing in high dimensional generalized linear models. J R Stat Soc Series B Stat Methodol 2020. [DOI: 10.1111/rssb.12371] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 11/28/2022]
Affiliation(s)
- Peter Bühlmann
  - Eidgenössische Technische Hochschule Zürich; Switzerland

61
Zhou RR, Wang L, Zhao SD. Estimation and inference for the indirect effect in high-dimensional linear mediation models. Biometrika 2020; 107:573-589. [PMID: 32831353 DOI: 10.1093/biomet/asaa016] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 12/14/2017] [Indexed: 12/19/2022] Open
Abstract
Mediation analysis is difficult when the number of potential mediators is larger than the sample size. In this paper we propose new inference procedures for the indirect effect in the presence of high-dimensional mediators for linear mediation models. We develop methods for both incomplete mediation, where a direct effect may exist, and complete mediation, where the direct effect is known to be absent. We prove consistency and asymptotic normality of our indirect effect estimators. Under complete mediation, where the indirect effect is equivalent to the total effect, we further prove that our approach gives a more powerful test compared to directly testing for the total effect. We confirm our theoretical results in simulations, as well as in an integrative analysis of gene expression and genotype data from a pharmacogenomic study of drug response. We present a novel analysis of gene sets to understand the molecular mechanisms of drug response, and also identify a genome-wide significant noncoding genetic variant that cannot be detected using standard analysis methods.
Affiliation(s)
- Ruixuan Rachel Zhou
  - Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright Street, Champaign, Illinois 61820, U.S.A
- Liewei Wang
  - Division of Clinical Pharmacology, Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, 200 First St. SW, Rochester, Minnesota 55905, U.S.A
- Sihai Dave Zhao
  - Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright Street, Champaign, Illinois 61820, U.S.A

62
Zhu Y. Covariate-adjusted Gaussian graphical model estimation with false discovery rate control. Commun Stat-Theor M 2020. [DOI: 10.1080/03610926.2020.1752385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Affiliation(s)
- Yunlong Zhu
  - School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, People’s Republic of China

63
Sun Q, Zhang H. Targeted Inference Involving High-Dimensional Data Using Nuisance Penalized Regression. J Am Stat Assoc 2020; 116:1472-1486. [PMID: 34538987 PMCID: PMC8447956 DOI: 10.1080/01621459.2020.1737079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2018] [Revised: 02/05/2020] [Accepted: 02/22/2020] [Indexed: 10/24/2022]
Abstract
Analysis of high dimensional data has received considerable and increasing attention in statistics. In practice, we may not be interested in every variable that is observed. Instead, often some of the variables are of particular interest, and the remaining variables are nuisance. To this end, we propose the nuisance penalized regression which does not penalize the parameters of interest. When the coherence between interest parameters and nuisance parameters is negligible, we show that the resulting estimator can be directly used for inference without any correction. When the coherence is not negligible, we propose an iterative procedure to further refine the estimate of interest parameters, based on which we propose a modified profile likelihood based statistic for hypothesis testing. The utilities of our general results are demonstrated in three specific examples. Numerical studies lend further support to our method.
Affiliation(s)
- Qiang Sun
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario M5S 3G3, Canada
- Heping Zhang
  - Department of Biostatistics, Yale University School of Public Health, 300 George Street Suite 523, New Haven, CT 06511, USA

64
Tian X. Prediction error after model search. Ann Stat 2020. [DOI: 10.1214/19-aos1818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]

65
Amato U, Antoniadis A, De Feis I, Gijbels I. Penalised robust estimators for sparse and high-dimensional linear models. Stat Method Appl-Ger 2020. [DOI: 10.1007/s10260-020-00511-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/28/2022]

66
Abstract
Summary
We propose a novel estimator of error variance and establish its asymptotic properties based on ridge regression and random matrix theory. The proposed estimator is valid under both low- and high-dimensional models, and performs well not only in nonsparse cases, but also in sparse ones. The finite-sample performance of the proposed method is assessed through an intensive numerical study, which indicates that the method is promising compared with its competitors in many interesting scenarios.
Affiliation(s)
- X Liu
  - School of Statistics and Management, Shanghai University of Finance and Economics, 777 Guoding Road, Shanghai 200433, China
- S Zheng
  - School of Mathematics & Statistics, Northeast Normal University, 5268 Renmin Street, Changchun 130024, China
- X Feng
  - School of Statistics and Management, Shanghai University of Finance and Economics, 777 Guoding Road, Shanghai 200433, China

68
Abstract
This paper considers inference in a linear regression model with outliers in which the number of outliers can grow with sample size while their proportion goes to 0. We propose a square-root lasso ℓ1-norm penalized estimator. We derive rates of convergence and establish asymptotic normality. Our estimator has the same asymptotic variance as the OLS estimator in the standard linear model. This enables us to build tests and confidence sets in the usual and simple manner. The proposed procedure is also computationally advantageous: it amounts to solving a convex optimization program. Overall, the suggested approach offers a practical robust alternative to the ordinary least squares estimator.

69
Combettes PL, Müller CL. Perspective maximum likelihood-type estimation via proximal decomposition. Electron J Stat 2020. [DOI: 10.1214/19-ejs1662] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 11/19/2022]

71
Gao Y, Yang H, Fang R, Zhang Y, Goode EL, Cui Y. Testing Mediation Effects in High-Dimensional Epigenetic Studies. Front Genet 2019; 10:1195. [PMID: 31824577 PMCID: PMC6883258 DOI: 10.3389/fgene.2019.01195] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 04/10/2019] [Accepted: 10/29/2019] [Indexed: 12/24/2022] Open
Abstract
Mediation analysis has been a powerful tool to identify factors mediating the association between exposure variables and outcomes. It has been applied to various genomic applications with the hope to gain novel insights into the underlying mechanism of various diseases. Given the high-dimensional nature of epigenetic data, recent effort on epigenetic mediation analysis is to first reduce the data dimension by applying high-dimensional variable selection techniques, then conducting testing in a low dimensional setup. In this paper, we propose to assess the mediation effect by adopting a high-dimensional testing procedure which can produce unbiased estimates of the regression coefficients and can properly handle correlations between variables. When the data dimension is ultra-high, we first reduce the data dimension from ultra-high to high by adopting a sure independence screening (SIS) method. We apply the method to two high-dimensional epigenetic studies: one is to assess how DNA methylations mediate the association between alcohol consumption and epithelial ovarian cancer (EOC) status; the other one is to assess how methylation signatures mediate the association between childhood maltreatment and post-traumatic stress disorder (PTSD) in adulthood. We compare the performance of the method with its counterpart via simulation studies. Our method can be applied to other high-dimensional mediation studies where high-dimensional mediation variables are collected.
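The screening step mentioned in this abstract can be pictured with a small sketch. The code below ranks candidate variables by the absolute value of their marginal Pearson correlation with a response and keeps the top `d`, which is the basic idea behind sure independence screening. The function names, the toy data, and the choice of Pearson correlation as the marginal utility are assumptions made for illustration; the article's actual screening statistic and cutoff are not given in the abstract.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sis_screen(columns, response, d):
    """Return the (sorted) indices of the d columns most correlated with the response."""
    scores = [abs(pearson(col, response)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])


# Toy data, invented for illustration: 5 observations, 4 candidate variables.
response = [1.0, 2.0, 3.0, 4.0, 5.0]
columns = [
    [1.1, 1.9, 3.2, 3.9, 5.1],  # strongly positively related to the response
    [2.0, 1.0, 2.0, 1.0, 2.0],  # uncorrelated with the response
    [5.0, 4.1, 2.9, 2.0, 1.2],  # strongly negatively related to the response
    [0.0, 1.0, 0.0, 1.0, 0.0],  # uncorrelated with the response
]
kept = sis_screen(columns, response, d=2)
print(kept)
```

After screening, only the retained variables would be passed to the high-dimensional testing procedure, which is what reduces the problem from ultra-high to high dimension.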
Affiliation(s)
- Yuzhao Gao
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Haitao Yang
  - Division of Health Statistics, School of Public Health, Hebei Medical University, Shijiazhuang, China
- Ruiling Fang
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Yanbo Zhang
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Ellen L Goode
  - Department of Health Sciences Research, College of Medicine, Mayo Clinic, Rochester, MN, United States
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI, United States

72
Bellec PC. Localized Gaussian width of $M$-convex hulls with applications to Lasso and convex aggregation. Bernoulli 2019. [DOI: 10.3150/18-bej1078] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]

73
He Y, Zhang L, Ji J, Zhang X. Robust feature screening for elliptical copula regression model. J Multivariate Anal 2019. [DOI: 10.1016/j.jmva.2019.05.003] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 02/08/2023]

76
Abstract
Summary
The lasso has been studied extensively as a tool for estimating the coefficient vector in the high-dimensional linear model; however, considerably less is known about estimating the error variance in this context. In this paper, we propose the natural lasso estimator for the error variance, which maximizes a penalized likelihood objective. A key aspect of the natural lasso is that the likelihood is expressed in terms of the natural parameterization of the multi-parameter exponential family of a Gaussian with unknown mean and variance. The result is a remarkably simple estimator of the error variance with provably good performance in terms of mean squared error. These theoretical results do not require placing any assumptions on the design matrix or the true regression coefficients. We also propose a companion estimator, called the organic lasso, which theoretically does not require tuning of the regularization parameter. Both estimators do well empirically compared to pre-existing methods, especially in settings where successful recovery of the true support of the coefficient vector is hard. Finally, we show that existing methods can do well under fewer assumptions than previously known, thus providing a fuller story about the problem of estimating the error variance in high-dimensional linear models.
Affiliation(s)
- Guo Yu
  - Department of Statistics, University of Washington, Box 354322, Seattle, Washington 98105, USA
- Jacob Bien
  - Data Sciences and Operations, Marshall School of Business, University of Southern California, 3670 Trousdale Pkwy, Los Angeles, California 90089, U.S.A.

77
Abstract
Summary
Consider a high-dimensional linear regression problem, where the number of covariates is larger than the number of observations and the interest is in estimating the conditional variance of the response variable given the covariates. A conditional and an unconditional framework are considered, where conditioning is with respect to the covariates, which are ancillary to the parameter of interest. In recent papers, a consistent estimator was developed in the unconditional framework when the marginal distribution of the covariates is normal with known mean and variance. In the present work, a certain Bayesian hypothesis test is formulated under the conditional framework, and it is shown that the Bayes risk is a constant. This implies that no consistent estimator exists in the conditional framework. However, when the marginal distribution of the covariates is normal, the conditional error of the above consistent estimator converges to zero, with probability converging to one. It follows that even in the conditional setting, information about the marginal distribution of an ancillary statistic may have a significant impact on statistical inference. The practical implication in the context of high-dimensional regression models is that additional observations where only the covariates are given are potentially very useful and should not be ignored. This finding is most relevant to semi-supervised learning problems where covariate information is easy to obtain.
Affiliation(s)
- D Azriel
  - Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology, Technion City, Haifa 3200003, Israel

79
Li X, Wu D, Cui Y, Liu B, Walter H, Schumann G, Li C, Jiang T. Reliable heritability estimation using sparse regularization in ultrahigh dimensional genome-wide association studies. BMC Bioinformatics 2019; 20:219. [PMID: 31039742 PMCID: PMC6492418 DOI: 10.1186/s12859-019-2792-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/22/2018] [Accepted: 04/02/2019] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND: Data from genome-wide association studies (GWASs) have been used to estimate the heritability of human complex traits in recent years. Existing methods are based on the linear mixed model, with the assumption that the genetic effects are random variables, which is opposite to the fixed effect assumption embedded in the framework of quantitative genetics theory. Moreover, heritability estimators provided by existing methods may have large standard errors, which calls for the development of reliable and accurate methods to estimate heritability. RESULTS: In this paper, we first investigate the influences of the fixed and random effect assumptions on heritability estimation, and prove theoretically that these two assumptions are equivalent under mild conditions. Second, we propose a two-stage strategy by first performing sparse regularization via cross-validated elastic net, and then applying variance estimation methods to construct reliable heritability estimates. Results on both simulated data and real data show that our strategy achieves a considerable reduction in the standard error while preserving the accuracy. CONCLUSIONS: The proposed strategy allows for a reliable and accurate heritability estimation using GWAS data. It shows that reliable estimates can still be obtained with even a relatively restricted sample size, and should be especially useful for large-scale heritability analyses in the genomics era.
Affiliation(s)
- Xin Li
  - School of Mathematical Sciences, Zhejiang University, 38 Zheda Road, Hangzhou, 310027 China
- Dongya Wu
  - Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - University of Chinese Academy of Sciences, 19 Yuquan Road, Beijing, 100049 China
- Yue Cui
  - Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
- Bing Liu
  - Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
- Henrik Walter
  - Department of Psychiatry and Psychotherapy, Campus Charité Mitte, Charité, Universitätsmedizin Berlin, Berlin, Germany
- Gunter Schumann
  - Centre for Population Neuroscience and Stratified Medicine (PONS) and MRC-SGDP Centre, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London, United Kingdom
- Chong Li
  - School of Mathematical Sciences, Zhejiang University, 38 Zheda Road, Hangzhou, 310027 China
- Tianzi Jiang
  - Brainnetome Center, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, 95 East Zhongguancun Road, Beijing, 100190 China
  - The Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, 4 Section 2 North Jianshe Road, Chengdu, 610054 China
  - The Queensland Brain Institute, University of Queensland, Brisbane, QLD 4072 Australia
  - University of Chinese Academy of Sciences, 19 Yuquan Road, Beijing, 100049 China

80
Affiliation(s)
- Zhao Ren
  - Department of Statistics, University of Pittsburgh, Pittsburgh, PA
- Yongjian Kang
  - Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA
- Yingying Fan
  - Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA
- Jinchi Lv
  - Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA

81
Wang Z, Xue L. Variance estimation for sparse ultra-high dimensional varying coefficient models. Commun Stat-Theor M 2019. [DOI: 10.1080/03610926.2018.1429627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Zhaoliang Wang
  - College of Applied Sciences, Beijing University of Technology, Beijing, China
  - School of Mathematics and Information Science, Henan Polytechnic University, Jiaozuo, China
- Liugen Xue
  - College of Applied Sciences, Beijing University of Technology, Beijing, China

83
Zheng L, Raskutti G. Testing for high-dimensional network parameters in auto-regressive models. Electron J Stat 2019. [DOI: 10.1214/19-ejs1646] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]

85
Luu TD, Fadili J, Chesneau C. Sharp oracle inequalities for low-complexity priors. Ann I Stat Math 2018. [DOI: 10.1007/s10463-018-0693-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]

86
Affiliation(s)
- Yinchu Zhu
  - Department of Mathematics, Rady School of Management, University of California at San Diego, La Jolla, CA
- Jelena Bradic
  - Department of Mathematics, Rady School of Management, University of California at San Diego, La Jolla, CA

88
Fujimori K. The Dantzig selector for a linear model of diffusion processes. Statistical Inference for Stochastic Processes 2018. [DOI: 10.1007/s11203-018-9191-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 11/24/2022]

89
Dalalyan AS, Grappin E, Paris Q. On the exponentially weighted aggregate with the Laplace prior. Ann Stat 2018. [DOI: 10.1214/17-aos1626] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/19/2022]

90
Zheng Z, Li Y, Yu C, Li G. Balanced estimation for high-dimensional measurement error models. Comput Stat Data Anal 2018. [DOI: 10.1016/j.csda.2018.04.009] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/25/2022]

91
Tian X, Loftus JR, Taylor JE. Selective inference with unknown variance via the square-root lasso. Biometrika 2018. [DOI: 10.1093/biomet/asy045] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022] Open
Affiliation(s)
- Xiaoying Tian
  - Farallon Capital Management LLC, One Maritime Plaza, 21st Floor, San Francisco, California, USA
- Joshua R Loftus
  - Department of Information, Operations, and Management Sciences, New York University, 44 West Fourth Street, New York, New York, USA
- Jonathan E Taylor
  - Department of Statistics, Stanford University, Sequoia Hall, 390 Serra Mall, Stanford, California, USA

92
Bai R, Ghosh M. High-dimensional multivariate posterior consistency under global–local shrinkage priors. J Multivariate Anal 2018. [DOI: 10.1016/j.jmva.2018.04.010] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]

93
Chen Z, Jiang Y. A two-stage sequential conditional selection approach to sparse high-dimensional multivariate regression models. Ann I Stat Math 2018. [DOI: 10.1007/s10463-018-0686-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]

95
Gossmann A, Cao S, Brzyski D, Zhao LJ, Deng HW, Wang YP. A Sparse Regression Method for Group-Wise Feature Selection with False Discovery Rate Control. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:1066-1078. [PMID: 29990279 PMCID: PMC6326365 DOI: 10.1109/tcbb.2017.2780106] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/08/2023]
Abstract
The method of Sorted L-One Penalized Estimation, or SLOPE, is a sparse regression method recently introduced by Bogdan et al. [1]. It can be used to identify significant predictor variables in a linear model that may have more unknown parameters than observations. When the correlations between predictor variables are small, the SLOPE method is shown to successfully control the false discovery rate (the expected proportion of the irrelevant among all selected predictors) at a user-specified level. However, the requirement for nearly uncorrelated predictors is too restrictive for genomic data, as demonstrated in our recent study [2] by an application of SLOPE to realistic simulated DNA sequence data. A possible solution is to divide the predictor variables into nearly uncorrelated groups, and to modify the procedure to select entire groups with an overall significant group effect, rather than individual predictors. Following this motivation, we extend SLOPE in the spirit of Group LASSO to Group SLOPE, a method that can handle group structures between the predictor variables, which are ubiquitous in real genomic data. Our theoretical results show that Group SLOPE controls the group-wise false discovery rate (gFDR) when groups are orthogonal to each other. For use in non-orthogonal settings, we propose two types of Monte Carlo based heuristics, which lead to gFDR control with Group SLOPE in simulations based on real SNP data. As an illustration of the merits of this method, an application of Group SLOPE to a dataset from the Framingham Heart Study results in the identification of some known DNA sequence regions associated with bone health, as well as some new candidate regions. The novel methods are implemented in the R package grpSLOPEMC, which is publicly available at https://github.com/agisga/grpSLOPEMC.
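The abstract above names the sorted L1 penalty and its group-wise extension without writing them out. The following is a minimal Python sketch of how such a penalty can be evaluated; it is not the grpSLOPEMC implementation. The function names (`bh_lambdas`, `sorted_l1`, `group_sorted_l1`) are hypothetical, the Gaussian-quantile sequence is the Benjamini-Hochberg-type choice commonly associated with SLOPE and is an assumption here, and the group version uses plain unweighted Euclidean group norms as a simplification.

```python
from math import sqrt
from statistics import NormalDist


def bh_lambdas(p, q):
    """Assumed BH-type sequence: lambda_i = Phi^{-1}(1 - q*i/(2p)), i = 1..p.

    The sequence is decreasing in i, so larger coefficients are penalized more.
    """
    nd = NormalDist()
    return [nd.inv_cdf(1 - q * i / (2 * p)) for i in range(1, p + 1)]


def sorted_l1(values, lambdas):
    """Sorted L1 norm: the i-th largest magnitude is multiplied by lambdas[i]."""
    mags = sorted((abs(v) for v in values), reverse=True)
    return sum(lam * m for lam, m in zip(lambdas, mags))


def group_sorted_l1(beta, groups, lambdas):
    """Sorted L1 norm applied to the Euclidean norms of coefficient groups.

    `groups` is a list of index lists; group weights are omitted (simplification).
    """
    norms = [sqrt(sum(beta[j] ** 2 for j in idx)) for idx in groups]
    return sorted_l1(norms, lambdas)


# Magnitudes sorted as 3, 2, 1 and paired with lambdas 3, 2, 1.
print(sorted_l1([1, -3, 2], [3, 2, 1]))  # 3*3 + 2*2 + 1*1 = 14
```

Applying the penalty to group norms rather than to individual coefficients is what lets an entire group enter or leave the model together, which matches the group-wise selection the abstract describes.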
|
96
|
Homrighausen D, McDonald DJ. A study on tuning parameter selection for the high-dimensional lasso. J STAT COMPUT SIM 2018. [DOI: 10.1080/00949655.2018.1491575] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/28/2022]
|
97
|
Xu C, Fang J, Shen H, Wang YP, Deng HW. EPS-LASSO: test for high-dimensional regression under extreme phenotype sampling of continuous traits. Bioinformatics 2018; 34:1996-2003. [PMID: 29385408 PMCID: PMC6454442 DOI: 10.1093/bioinformatics/bty042] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 10/09/2017] [Revised: 01/20/2018] [Accepted: 01/24/2018] [Indexed: 01/19/2023] Open
Abstract
MOTIVATION: Extreme phenotype sampling (EPS) is a broadly used design to identify candidate genetic factors contributing to the variation of quantitative traits. By enriching the signals in extreme phenotypic samples, EPS can boost the association power compared to random sampling. Most existing statistical methods for EPS examine the genetic factors individually, although many quantitative traits have multiple genetic factors underlying their variation. It is desirable to model the joint effects of genetic factors, which may increase the power and identify novel quantitative trait loci under EPS. The joint analysis of genetic data in high-dimensional situations requires specialized techniques, e.g. the least absolute shrinkage and selection operator (LASSO). Although there is extensive research on and application of LASSO, statistical inference and testing for the sparse model under EPS remain unknown. RESULTS: We propose a novel sparse model (EPS-LASSO) with a hypothesis test for high-dimensional regression under EPS based on a decorrelated score function. A comprehensive simulation shows that EPS-LASSO outperforms existing methods with stable type I error and FDR control. EPS-LASSO can provide consistent power for both low- and high-dimensional situations compared with the other methods dealing with high-dimensional situations. The power of EPS-LASSO is close to that of other low-dimensional methods when the causal effect sizes are small and is superior when the effects are large. Applying EPS-LASSO to a transcriptome-wide gene expression study for obesity reveals 10 significant body mass index associated genes. Our results indicate that EPS-LASSO is an effective method for EPS data analysis, which can account for correlated predictors. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/xu1912/EPSLASSO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chao Xu
  - Center of Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA
  - Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA, USA
- Jian Fang
  - Center of Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA
  - Department of Biomedical Engineering, Tulane University, New Orleans, LA, USA
- Hui Shen
  - Center of Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA
  - Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA, USA
- Yu-Ping Wang
  - Center of Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA
  - Department of Biomedical Engineering, Tulane University, New Orleans, LA, USA
- Hong-Wen Deng
  - Center of Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA
  - Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA, USA
  - Laboratory of Molecular and Statistical Genetics, College of Life Sciences, Hunan Normal University, Changsha, Hunan, China
|
98
|
Bien J, Gaynanova I, Lederer J, Müller CL. Prediction error bounds for linear regression with the TREX. TEST-SPAIN 2018. [DOI: 10.1007/s11749-018-0584-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/17/2022]
|
99
|
|
100
|
Randolph TW, Zhao S, Copeland W, Hullar M, Shojaie A. Kernel-penalized regression for analysis of microbiome data. Ann Appl Stat 2018; 12:540-566. [PMID: 30224943 PMCID: PMC6138053 DOI: 10.1214/17-aoas1102] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Indexed: 11/19/2022]
Abstract
The analysis of human microbiome data is often based on dimension-reduced graphical displays and clusterings derived from vectors of microbial abundances in each sample. Common to these ordination methods is the use of biologically motivated definitions of similarity. Principal coordinate analysis, in particular, is often performed using ecologically defined distances, allowing analyses to incorporate context-dependent, non-Euclidean structure. In this paper, we go beyond dimension-reduced ordination methods and describe a framework of high-dimensional regression models that extends these distance-based methods. In particular, we use kernel-based methods to show how to incorporate a variety of extrinsic information, such as phylogeny, into penalized regression models that estimate taxon-specific associations with a phenotype or clinical outcome. Further, we show how this regression framework can be used to address the compositional nature of multivariate predictors comprised of relative abundances; that is, vectors whose entries sum to a constant. We illustrate this approach with several simulations using data from two recent studies on gut and vaginal microbiomes. We conclude with an application to our own data, where we also incorporate a significance test for the estimated coefficients that represent associations between microbial abundance and percent fat.
|