101
|
Affiliation(s)
- Tao Wang
  - School of Mathematical Sciences, Nankai University, Tianjin City, People's Republic of China
  - School of Mathematics and Statistics, Kashgar University, Kashgar City, People's Republic of China
- Lin Zheng
  - School of Mathematical Sciences, Nankai University, Tianjin City, People's Republic of China
- Zhonghua Li
  - Institute of Statistics and LPMC, Nankai University, Tianjin City, People's Republic of China
- Haiyang Liu
  - Department of Aviation Material Management, Air Force Logistics College, Xuzhou City, People's Republic of China
|
102
|
Chu W, Li R, Reimherr M. Feature screening for time-varying coefficient models with ultrahigh dimensional longitudinal data. Ann Appl Stat 2016; 10:596-617. [PMID: 27630755 DOI: 10.1214/16-aoas912] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Indexed: 11/19/2022]
Abstract
Motivated by an empirical analysis of the Childhood Asthma Management Project, CAMP, we introduce a new screening procedure for varying coefficient models with ultrahigh dimensional longitudinal predictor variables. The performance of the proposed procedure is investigated via Monte Carlo simulation. Numerical comparisons indicate that it outperforms existing ones substantially, resulting in significant improvements in explained variability and prediction error. Applying these methods to CAMP, we are able to find a number of potentially important genetic mutations related to lung function, several of which exhibit interesting nonlinear patterns around puberty.
Affiliation(s)
- Wanghuan Chu
  - Department of Statistics, Pennsylvania State University, State College, PA, 16801, USA
- Runze Li
  - Department of Statistics and the Methodology Center, Pennsylvania State University, State College, PA, 16801, USA
- Matthew Reimherr
  - Department of Statistics, Pennsylvania State University, State College, PA, 16801, USA
|
103
|
Hong HG, Wang L, He X. A data‐driven approach to conditional screening of high‐dimensional variables. Stat (Int Stat Inst) 2016. [DOI: 10.1002/sta4.115] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/07/2022]
Affiliation(s)
- Hyokyoung G. Hong
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
- Lan Wang
  - School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
- Xuming He
  - Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
|
104
|
|
105
|
Lv S, He X, Wang J. A unified penalized method for sparse additive quantile models: an RKHS approach. Ann Inst Stat Math 2016. [DOI: 10.1007/s10463-016-0566-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/25/2022]
|
106
|
Liu J. Feature screening and variable selection for partially linear models with ultrahigh-dimensional longitudinal data. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2015.09.122] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 11/24/2022]
|
107
|
Ni L, Fang F. Entropy-based model-free feature screening for ultrahigh-dimensional multiclass classification. J Nonparametr Stat 2016. [DOI: 10.1080/10485252.2016.1167206] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 10/22/2022]
|
108
|
|
109
|
|
110
|
Zhang J, Zhang R, Lu Z. Quantile-adaptive variable screening in ultra-high dimensional varying coefficient models. J Appl Stat 2016. [DOI: 10.1080/02664763.2015.1072141] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
|
111
|
Li J, Zheng Q, Peng L, Huang Z. Survival impact index and ultrahigh-dimensional model-free screening with survival outcomes. Biometrics 2016; 72:1145-1154. [PMID: 26910137 DOI: 10.1111/biom.12499] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 09/01/2015] [Revised: 01/01/2016] [Accepted: 01/01/2016] [Indexed: 02/03/2023]
Abstract
Motivated by ultrahigh-dimensional biomarkers screening studies, we propose a model-free screening approach tailored to censored lifetime outcomes. Our proposal is built upon the introduction of a new measure, survival impact index (SII). By its design, SII sensibly captures the overall influence of a covariate on the outcome distribution, and can be estimated with familiar nonparametric procedures that do not require smoothing and are readily adaptable to handle lifetime outcomes under various censoring and truncation mechanisms. We provide large sample distributional results that facilitate the inference on SII in classical multivariate settings. More importantly, we investigate SII as an effective screener for ultrahigh-dimensional data, not relying on rigid regression model assumptions for real applications. We establish the sure screening property of the proposed SII-based screener. Extensive numerical studies are carried out to assess the performance of our method compared with other existing screening methods. A lung cancer microarray data is analyzed to demonstrate the practical utility of our proposals.
Affiliation(s)
- Jialiang Li
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, Singapore
  - Duke-NUS Graduate Medical School, Singapore
  - Singapore Eye Research Institute, Singapore
- Qi Zheng
  - School of Public Health and Information Sciences, University of Louisville, Louisville, KY 40202
- Limin Peng
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, 30321, U.S.A
- Zhipeng Huang
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, Singapore
  - Singapore Eye Research Institute, Singapore
  - McDermott Center for Human Growth and Development, UT Southwestern Medical Center, Dallas, Texas 75390, U.S.A
|
112
|
Sherwood B, Wang L. Partially linear additive quantile regression in ultra-high dimension. Ann Stat 2016. [DOI: 10.1214/15-aos1367] [Citation(s) in RCA: 61] [Impact Index Per Article: 7.6] [Indexed: 11/19/2022]
|
113
|
|
114
|
Liu J, Zhong W, Li R. A selective overview of feature screening for ultrahigh-dimensional data. Sci China Math 2015; 58:2033-2054. [PMID: 26779257 PMCID: PMC4711389 DOI: 10.1007/s11425-015-5062-9] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Indexed: 05/21/2023]
Abstract
High-dimensional data have frequently been collected in many scientific areas including genomewide association study, biomedical imaging, tomography, tumor classifications, and finance. Analysis of high-dimensional data poses many challenges for statisticians. Feature selection and variable selection are fundamental for high-dimensional data analysis. The sparsity principle, which assumes that only a small number of predictors contribute to the response, is frequently adopted and deemed useful in the analysis of high-dimensional data. Following this general principle, a large number of variable selection approaches via penalized least squares or likelihood have been developed in the recent literature to estimate a sparse model and select significant variables simultaneously. While the penalized variable selection methods have been successfully applied in many high-dimensional analyses, modern applications in areas such as genomics and proteomics push the dimensionality of data to an even larger scale, where the dimension of data may grow exponentially with the sample size. This has been called ultrahigh-dimensional data in the literature. This work aims to present a selective overview of feature screening procedures for ultrahigh-dimensional data. We focus on insights into how to construct marginal utilities for feature screening on specific models and motivation for the need of model-free feature screening procedures.
Affiliation(s)
- Jingyuan Liu
  - Department of Statistics, School of Economics, Xiamen University, Xiamen 361005, China
  - Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen 361005, China
  - Fujian Key Laboratory of Statistical Science, Xiamen University, Xiamen 361005, China
- Wei Zhong
  - Department of Statistics, School of Economics, Xiamen University, Xiamen 361005, China
  - Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen 361005, China
  - Fujian Key Laboratory of Statistical Science, Xiamen University, Xiamen 361005, China
- Runze Li
  - Department of Statistics and The Methodology Center, Pennsylvania State University, University Park, PA 16802-2111, USA
|
115
|
DiRienzo AG. Parsimonious covariate selection with censored outcomes. Biometrics 2015; 72:452-62. [PMID: 26410381 DOI: 10.1111/biom.12420] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/01/2014] [Revised: 08/01/2015] [Accepted: 08/01/2015] [Indexed: 11/30/2022]
Abstract
A new objective methodology is proposed to select the parsimonious set of important covariates that are associated with a censored outcome variable Y; the method simplifies to accommodate uncensored outcomes. Covariate selection proceeds in an iterated forward manner and is controlled by the pre-chosen upper bound for the number of covariates to be selected and the global false selection rate and level. A sequence of working regression models for the event (Y≤y) given a covariate set is fit among subjects not censored before y and the corresponding process (through y) of conditional prediction error estimated; the direction and magnitude of covariate effects can arbitrarily change with y. The newly proposed adequacy measure for the covariate set is the slope coefficient resulting from a regression (with no intercept) between the baseline prediction error process for the intercept-only model and that process corresponding to the covariate set. Under quite general conditions on the censoring variable, the methods are shown to asymptotically control the false selection rate at the nominal level while consistently ranking covariate sets which permits recruitment of all important covariates from those available with probability tending to 1. A simulation study confirms these analytical results and compares the proposed methods to recent competitors. Two real data illustrations are provided.
Affiliation(s)
- Albert Gregory DiRienzo
  - Department of Epidemiology and Biostatistics, University at Albany, SUNY, Rensselaer, New York, 12144, U.S.A
|
116
|
|
117
|
Cui H, Li R, Zhong W. Model-Free Feature Screening for Ultrahigh Dimensional Discriminant Analysis. J Am Stat Assoc 2015; 110:630-641. [PMID: 26392643 DOI: 10.1080/01621459.2014.920256] [Citation(s) in RCA: 103] [Impact Index Per Article: 11.4] [Indexed: 10/25/2022]
Abstract
This work is concerned with marginal sure independence feature screening for ultra-high dimensional discriminant analysis. The response variable is categorical in discriminant analysis. This enables us to use conditional distribution function to construct a new index for feature screening. In this paper, we propose a marginal feature screening procedure based on empirical conditional distribution function. We establish the sure screening and ranking consistency properties for the proposed procedure without assuming any moment condition on the predictors. The proposed procedure enjoys several appealing merits. First, it is model-free in that its implementation does not require specification of a regression model. Second, it is robust to heavy-tailed distributions of predictors and the presence of potential outliers. Third, it allows the categorical response having a diverging number of classes in the order of O(n^κ) with some κ ≥ 0. We assess the finite sample property of the proposed procedure by Monte Carlo simulation studies and numerical comparison. We further illustrate the proposed methodology by empirical analyses of two real-life data sets.
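The abstract above does not spell out the screening index, so the following is only a hedged Python sketch of one index of this type, not the authors' code. It assumes the index is a class-probability-weighted mean squared distance between each class-conditional empirical distribution function and the pooled empirical distribution function, evaluated at the sample points. The names `mv_index` and `mv_screen`, the cutoff `d`, and the simulated data are hypothetical.

```python
import numpy as np

def mv_index(x, y):
    """Weighted squared distance between class-conditional and pooled ECDFs.

    Assumed form (an illustration, not taken from the abstract):
    sum over classes r of p_r * mean_j [F_r(x_j) - F(x_j)]^2.
    """
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y).ravel()
    n = x.size
    # Pooled empirical CDF evaluated at every sample point
    F = np.searchsorted(np.sort(x), x, side="right") / n
    total = 0.0
    for label in np.unique(y):
        xr = np.sort(x[y == label])
        p_r = xr.size / n  # estimated class probability
        # Class-conditional empirical CDF at every sample point
        F_r = np.searchsorted(xr, x, side="right") / xr.size
        total += p_r * np.mean((F_r - F) ** 2)
    return total

def mv_screen(X, y, d):
    """Rank the columns of X by mv_index and keep the d largest."""
    scores = np.array([mv_index(X[:, k], y) for k in range(X.shape[1])])
    order = np.argsort(-scores, kind="stable")
    return order[:d], scores

# Simulated example: only column 0 depends on the class label.
rng = np.random.default_rng(0)
y_sim = rng.integers(0, 3, size=150)
X_sim = rng.standard_normal((150, 200))
X_sim[:, 0] += 2.0 * y_sim
kept_mv, scores_mv = mv_screen(X_sim, y_sim, d=10)
```

Because the index uses only ranks of the predictor within and across classes, it needs no moment conditions, which is consistent with the robustness to heavy tails that the abstract claims.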
Affiliation(s)
- Hengjian Cui
  - Capital Normal University, The Pennsylvania State University and Xiamen University
- Runze Li
  - Capital Normal University, The Pennsylvania State University and Xiamen University
- Wei Zhong
  - Capital Normal University, The Pennsylvania State University and Xiamen University
|
118
|
Wu Y, Yin G. Conditional quantile screening in ultrahigh-dimensional heterogeneous data. Biometrika 2015. [DOI: 10.1093/biomet/asu068] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Indexed: 11/13/2022] Open
|
119
|
Zhao SD, Li Y. Score test variable screening. Biometrics 2014; 70:862-71. [PMID: 25124197 PMCID: PMC4427573 DOI: 10.1111/biom.12209] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 04/01/2013] [Revised: 05/01/2014] [Accepted: 06/01/2014] [Indexed: 11/27/2022]
Abstract
Variable screening has emerged as a crucial first step in the analysis of high-throughput data, but existing procedures can be computationally cumbersome, difficult to justify theoretically, or inapplicable to certain types of analyses. Motivated by a high-dimensional censored quantile regression problem in multiple myeloma genomics, this article makes three contributions. First, we establish a score test-based screening framework, which is widely applicable, extremely computationally efficient, and relatively simple to justify. Secondly, we propose a resampling-based procedure for selecting the number of variables to retain after screening according to the principle of reproducibility. Finally, we propose a new iterative score test screening method which is closely related to sparse regression. In simulations we apply our methods to four different regression models and show that they can outperform existing procedures. We also apply score test screening to an analysis of gene expression data from multiple myeloma patients using a censored quantile regression model to identify high-risk genes.
Affiliation(s)
- Sihai Dave Zhao
  - Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, Illinois 61820, U.S.A
|
120
|
Song R, Lu W, Ma S, Jeng XJ. Censored Rank Independence Screening for High-dimensional Survival Data. Biometrika 2014; 101:799-814. [PMID: 25663709 DOI: 10.1093/biomet/asu047] [Citation(s) in RCA: 79] [Impact Index Per Article: 7.9] [Indexed: 01/30/2023] Open
Abstract
In modern statistical applications, the dimension of covariates can be much larger than the sample size. In the context of linear models, correlation screening (Fan and Lv, 2008) has been shown to reduce the dimension of such data effectively while achieving the sure screening property, i.e., all of the active variables can be retained with high probability. However, screening based on the Pearson correlation does not perform well when applied to contaminated covariates and/or censored outcomes. In this paper, we study censored rank independence screening of high-dimensional survival data. The proposed method is robust to predictors that contain outliers, works for a general class of survival models, and enjoys the sure screening property. Simulations and an analysis of real data demonstrate that the proposed method performs competitively on survival data sets of moderate size and high-dimensional predictors, even when these are contaminated.
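For orientation, the correlation screening that the abstract attributes to Fan and Lv (2008) can be sketched in a few lines of Python. This is a minimal illustration, not code from either paper: it ranks covariates by absolute marginal Pearson correlation with the response and keeps the top d. The function name `correlation_screen`, the cutoff `d`, and the simulated data are assumptions made for the example, and the sketch handles neither censoring nor contaminated covariates, which are the cases the censored rank method targets.

```python
import numpy as np

def correlation_screen(X, y, d):
    """Keep the d columns of X with the largest |Pearson correlation| with y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns have zero variance; their correlation is set to 0
    corr = np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)
    order = np.argsort(-np.abs(corr), kind="stable")
    return order[:d], corr

# Simulated example with p much larger than n; columns 0 and 1 are active.
rng = np.random.default_rng(0)
n, p = 100, 1000
X_lin = rng.standard_normal((n, p))
y_lin = 3.0 * X_lin[:, 0] - 2.0 * X_lin[:, 1] + rng.standard_normal(n)
kept_lin, corr_lin = correlation_screen(X_lin, y_lin, d=20)
```

The sure screening property mentioned in the abstract corresponds to the active columns appearing in `kept_lin` with high probability; the choice of `d` here is arbitrary.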
Affiliation(s)
- Rui Song
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA
- Wenbin Lu
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA
- Shuangge Ma
  - Division of Biostatistics, School of Public Health, Yale University, New Haven, Connecticut 06510, USA
- X Jessie Jeng
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA
|
121
|
Shao X, Zhang J. Martingale Difference Correlation and Its Use in High-Dimensional Variable Screening. J Am Stat Assoc 2014. [DOI: 10.1080/01621459.2014.887012] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Indexed: 10/25/2022]
|
122
|
|
123
|
Lee ER, Noh H, Park BU. Model Selection via Bayesian Information Criterion for Quantile Regression Models. J Am Stat Assoc 2014. [DOI: 10.1080/01621459.2013.836975] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.0] [Indexed: 10/26/2022]
|
124
|
Liu J, Li R, Wu R. Feature Selection for Varying Coefficient Models With Ultrahigh Dimensional Covariates. J Am Stat Assoc 2014; 109:266-274. [PMID: 24678135 PMCID: PMC3963210 DOI: 10.1080/01621459.2013.850086] [Citation(s) in RCA: 97] [Impact Index Per Article: 9.7] [Indexed: 10/25/2022]
Abstract
This paper is concerned with feature screening and variable selection for varying coefficient models with ultrahigh dimensional covariates. We propose a new feature screening procedure for these models based on conditional correlation coefficient. We systematically study the theoretical properties of the proposed procedure, and establish their sure screening property and the ranking consistency. To enhance the finite sample performance of the proposed procedure, we further develop an iterative feature screening procedure. Monte Carlo simulation studies were conducted to examine the performance of the proposed procedures. In practice, we advocate a two-stage approach for varying coefficient models. The two stage approach consists of (a) reducing the ultrahigh dimensionality by using the proposed procedure and (b) applying regularization methods for dimension-reduced varying coefficient models to make statistical inferences on the coefficient functions. We illustrate the proposed two-stage approach by a real data example.
Affiliation(s)
- Jingyuan Liu
  - Assistant Professor of Wang Yanan Institute for Studies in Economics and Department of Statistics and Fujian Key Laboratory of Statistical Science, Xiamen University, China
- Runze Li
  - Distinguished Professor, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111
- Rongling Wu
  - Professor, Department of Public Health Sciences, Penn State Hershey College of Medicine, Hershey, PA 17033
|
125
|
Huang D, Li R, Wang H. Feature Screening for Ultrahigh Dimensional Categorical Data with Applications. J Bus Econ Stat 2014; 32:237-244. [PMID: 25328278 PMCID: PMC4197855 DOI: 10.1080/07350015.2013.863158] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Indexed: 05/12/2023]
Abstract
Ultrahigh dimensional data with both categorical responses and categorical covariates are frequently encountered in the analysis of big data, for which feature screening has become an indispensable statistical tool. We propose a Pearson chi-square based feature screening procedure for categorical response with ultrahigh dimensional categorical covariates. The proposed procedure can be directly applied for detection of important interaction effects. We further show that the proposed procedure possesses screening consistency property in the terminology of Fan and Lv (2008). We investigate the finite sample performance of the proposed procedure by Monte Carlo simulation studies, and illustrate the proposed method by two empirical datasets.
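A rough Python sketch of chi-square based screening for categorical data follows; it is an illustration under stated assumptions rather than the authors' procedure. It computes the ordinary Pearson chi-square statistic of each covariate's contingency table with the response and keeps the covariates with the largest statistics. The names `chi2_statistic` and `chi2_screen`, the cutoff `d`, and the simulated data are hypothetical, and the sketch assumes all covariates have the same number of levels, since raw statistics with different degrees of freedom are not on a common scale.

```python
import numpy as np

def chi2_statistic(x, y):
    """Pearson chi-square statistic of the contingency table of x against y."""
    _, xi = np.unique(np.asarray(x).ravel(), return_inverse=True)
    _, yi = np.unique(np.asarray(y).ravel(), return_inverse=True)
    table = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(table, (xi, yi), 1.0)  # observed cell counts
    # Expected counts under independence; every row and column sum is >= 1
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())

def chi2_screen(X, y, d):
    """Rank the columns of X by chi-square statistic and keep the d largest."""
    stats = np.array([chi2_statistic(X[:, k], y) for k in range(X.shape[1])])
    order = np.argsort(-stats, kind="stable")
    return order[:d], stats

# Simulated example: the response copies column 0 about 80% of the time.
rng = np.random.default_rng(0)
X_cat = rng.integers(0, 3, size=(200, 50))
noise = rng.integers(0, 3, size=200)
y_cat = np.where(rng.random(200) < 0.8, X_cat[:, 0], noise)
kept_cat, stats_cat = chi2_screen(X_cat, y_cat, d=5)
```

Interaction effects, which the abstract says the procedure can detect, could be screened in the same way by passing a combined label for a pair of covariates as `x`; that extension is not shown here.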
Affiliation(s)
- Runze Li
  - Peking University & Pennsylvania State University
|