1. High-dimensional variable screening through kernel-based conditional mean dependence. J Stat Plan Inference 2023. [DOI: 10.1016/j.jspi.2022.10.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
2. Xiong W, Pan H, Wang J, Tian M. An efficient model-free approach to interaction screening for high dimensional data. Stat Med 2023; 42:1583-1605. [PMID: 36857779 DOI: 10.1002/sim.9688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2022] [Revised: 12/02/2022] [Accepted: 02/06/2023] [Indexed: 03/03/2023]
Abstract
An innovative model-free interaction screening procedure, MCVIS, is proposed for high-dimensional data analysis. Specifically, we adopt the MCV index to quantify the importance of an interaction effect among predictors. The proposed method is fully nonparametric and can select interactions even when the signal of the parental main effects is weak. The MCVIS procedure has several distinctive features: (i) it works with discrete, categorical, and continuous covariates; (ii) it handles both categorical and continuous responses, including missing responses; (iii) it is robust to heavy-tailed distributions and thus accommodates the heterogeneity typically caused by high dimensionality; (iv) it enjoys the sure screening and ranking consistency properties and therefore achieves dimension reduction without information loss. Computational feasibility is also a top concern in high-dimensional data analysis; by transforming the MCV into several variants, the MCVIS procedure becomes simple and fast to implement. Extensive numerical experiments and comparisons confirm the effectiveness and wide applicability of the MCVIS procedure. We further illustrate the methodology with an empirical study of two real datasets. Supplementary materials are available online.
Affiliation(s)
- Wei Xiong
  - School of Statistics, University of International Business and Economics, Beijing, China
- Han Pan
  - School of Mathematical Sciences, Peking University, Beijing, China
- Jianrong Wang
  - School of Statistics, University of International Business and Economics, Beijing, China
- Maozai Tian
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
3. Li T, Yu J, Meng C. Scalable model-free feature screening via sliced-Wasserstein dependency. J Comput Graph Stat 2023. [DOI: 10.1080/10618600.2023.2183213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2023]
Affiliation(s)
- Tao Li
  - Center for Applied Statistics, Institute of Statistics and Big Data, Renmin University of China
- Jun Yu
  - School of Mathematics and Statistics, Beijing Institute of Technology
- Cheng Meng
  - Center for Applied Statistics, Institute of Statistics and Big Data, Renmin University of China
4. Zhong W, Qian C, Liu W, Zhu L, Li R. Feature Screening for Interval-Valued Response with Application to Study Association between Posted Salary and Required Skills. J Am Stat Assoc 2023; 118:805-817. [PMID: 37448462 PMCID: PMC10338024 DOI: 10.1080/01621459.2022.2152342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2021] [Revised: 11/17/2022] [Accepted: 11/22/2022] [Indexed: 12/05/2022]
Abstract
It is important to quantify the differences in returns to skills using online job advertisement data, which have attracted great interest in both labor economics and statistics. In this paper, we study the relationship between the posted salary and the job requirements in online labor markets. There are two challenges. First, the posted salary is always presented in an interval-valued form, for example, 5k-10k yuan per month. Simply taking the mid-point or the lower bound as a substitute for the salary may result in biased estimators. Second, the number of potential skill words generated as predictors from the job advertisements by word segmentation is very large, and many of them may not contribute to the salary. To this end, we propose a new feature screening method, Absolute Distribution Difference Sure Independence Screening (ADD-SIS), to select important skill words for the interval-valued response. The marginal utility for feature screening is based on the difference of estimated distribution functions via nonparametric maximum likelihood estimation, which makes full use of the interval information. It is model-free and robust to outliers. Numerical simulations show that the new method, which uses the interval information, selects important predictors more efficiently than methods based only on single points of the intervals. In the real data application, we study the text of job advertisements for data scientists and data analysts on a major Chinese online job posting website and explore the skill words that are important for the salary. We find that skill words such as optimization, long short-term memory (LSTM), convolutional neural networks (CNN), and collaborative filtering are positively correlated with the salary, while words such as Excel, Office, and data collection may contribute negatively to the salary.
5. Craig SJ, Kenney AM, Lin J, Paul IM, Birch LL, Savage JS, Marini ME, Chiaromonte F, Reimherr ML, Makova KD. Constructing a polygenic risk score for childhood obesity using functional data analysis. Econometrics and Statistics 2023; 25:66-86. [PMID: 36620476 PMCID: PMC9813976 DOI: 10.1016/j.ecosta.2021.10.014] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 06/17/2023]
Abstract
Obesity is a highly heritable condition that affects increasing numbers of adults and, concerningly, of children. However, only a small fraction of its heritability has been attributed to specific genetic variants. These variants are traditionally ascertained from genome-wide association studies (GWAS), which utilize samples with tens or hundreds of thousands of individuals for whom a single summary measurement (e.g., BMI) is collected. An alternative approach is to focus on a smaller, more deeply characterized sample in conjunction with advanced statistical models that leverage longitudinal phenotypes. Novel functional data analysis (FDA) techniques are used to capitalize on longitudinal growth information from a cohort of children between birth and three years of age. In an ultra-high dimensional setting, hundreds of thousands of single nucleotide polymorphisms (SNPs) are screened, and selected SNPs are used to construct two polygenic risk scores (PRS) for childhood obesity using a weighting approach that incorporates the dynamic and joint nature of SNP effects. These scores are significantly higher in children with (vs. without) rapid infant weight gain, a predictor of obesity later in life. Using two independent cohorts, it is shown that the genetic variants identified in very young children are also informative in older children and in adults, consistent with early childhood obesity being predictive of obesity later in life. In contrast, PRSs based on SNPs identified by adult obesity GWAS are not predictive of weight gain in the cohort of young children. This provides an example of a successful application of FDA to GWAS. This application is complemented with simulations establishing that a deeply characterized sample can be just as, if not more, effective than a comparable study with a cross-sectional response.
Overall, it is demonstrated that a deep, statistically sophisticated characterization of a longitudinal phenotype can provide increased statistical power to studies with relatively small sample sizes; and shows how FDA approaches can be used as an alternative to the traditional GWAS.
Affiliation(s)
- Sarah J.C. Craig
  - Department of Biology, Penn State University, University Park
  - Center for Medical Genomics, Penn State University, University Park, PA
- Ana M. Kenney
  - Department of Statistics, Penn State University, University Park, PA
- Junli Lin
  - Department of Statistics, Penn State University, University Park, PA
- Ian M. Paul
  - Center for Medical Genomics, Penn State University, University Park, PA
  - Department of Pediatrics, Penn State College of Medicine, Hershey, PA
- Leann L. Birch
  - Department of Foods and Nutrition, University of Georgia, Athens, GA
- Jennifer S. Savage
  - Department of Nutritional Sciences, Penn State University, University Park, PA
  - Center for Childhood Obesity Research, Penn State University, University Park, PA
- Michele E. Marini
  - Center for Childhood Obesity Research, Penn State University, University Park, PA
- Francesca Chiaromonte
  - Center for Medical Genomics, Penn State University, University Park, PA
  - Department of Statistics, Penn State University, University Park, PA
  - EMbeDS, Sant’Anna School of Advanced Studies, Piazza Martiri della Libertà, Pisa, Italy
- Matthew L. Reimherr
  - Center for Medical Genomics, Penn State University, University Park, PA
  - Department of Statistics, Penn State University, University Park, PA
- Kateryna D. Makova
  - Department of Biology, Penn State University, University Park
  - Center for Medical Genomics, Penn State University, University Park, PA
6. Li L, Ke C, Yin X, Yu Z. Generalized martingale difference divergence: Detecting conditional mean independence with applications in variable screening. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107618] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
7. Cai Z, Lei J, Roeder K. Model-free prediction test with application to genomics data. Proc Natl Acad Sci U S A 2022; 119:e2205518119. [PMID: 35969737 PMCID: PMC9407618 DOI: 10.1073/pnas.2205518119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2022] [Accepted: 07/20/2022] [Indexed: 11/18/2022] [Open Access]
Abstract
Testing the significance of predictors in a regression model is one of the most important topics in statistics. This problem is especially difficult without any parametric assumptions on the data. This paper aims to test the null hypothesis that given confounding variables Z, X does not significantly contribute to the prediction of Y under the model-free setting, where X and Z are possibly high dimensional. We propose a general framework that first fits nonparametric machine learning regression algorithms on [Formula: see text] and [Formula: see text], then compares the prediction power of the two models. The proposed method allows us to leverage the strength of the most powerful regression algorithms developed in the modern machine learning community. The P value for the test can be easily obtained by permutation. In simulations, we find that the proposed method is more powerful compared to existing methods. The proposed method allows us to draw biologically meaningful conclusions from two gene expression data analyses without strong distributional assumptions: 1) testing the prediction power of sequencing RNA for the proteins in cellular indexing of transcriptomes and epitopes by sequencing data and 2) identification of spatially variable genes in spatially resolved transcriptomics data.
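The general idea in this abstract (fit one predictive model without X and one with X, compare their prediction error, and calibrate the difference by permutation) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: ordinary least squares stands in for the "nonparametric machine learning regression algorithms", the function names `holdout_mse` and `permutation_prediction_test` are hypothetical, and permuting the rows of X is a simplification that is only appropriate when X is independent of Z. The abstract does not describe the authors' actual permutation scheme, so this should not be read as their procedure.

```python
import numpy as np


def holdout_mse(features, y, train, test):
    """Fit least squares (with intercept) on the training rows; return test MSE."""
    a_train = np.column_stack([np.ones(train.size), features[train]])
    a_test = np.column_stack([np.ones(test.size), features[test]])
    coef, *_ = np.linalg.lstsq(a_train, y[train], rcond=None)
    return float(np.mean((y[test] - a_test @ coef) ** 2))


def permutation_prediction_test(x, z, y, n_perm=200, seed=0):
    """Compare hold-out error of a model on Z with a model on (X, Z).

    x and z are 2-D arrays with one row per observation. The rows of x are
    permuted to build a reference distribution for the gain in prediction
    error (an illustrative assumption, valid only if X is independent of Z).
    """
    rng = np.random.default_rng(seed)
    n = y.shape[0]
    idx = rng.permutation(n)
    train, test = idx[: n // 2], idx[n // 2:]

    # Error of the reduced model does not depend on X, so compute it once.
    reduced = holdout_mse(z, y, train, test)

    def gain(x_mat):
        full = holdout_mse(np.column_stack([x_mat, z]), y, train, test)
        return reduced - full

    observed = gain(x)
    null = np.array([gain(x[rng.permutation(n)]) for _ in range(n_perm)])
    # The +1 terms keep the permutation p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, float(p_value)
```

Any regression routine with fit and predict steps could replace the least-squares helper; the sample split keeps the error comparison out of sample.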
Affiliation(s)
- Zhanrui Cai
  - Department of Statistics, Iowa State University, Ames, IA 50011
- Jing Lei
  - Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213
- Kathryn Roeder
  - Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213
8. Khan MHR, Akhter M. Ranking based variable selection for censored data using AFT models. Commun Stat-Simul C 2022. [DOI: 10.1080/03610918.2022.2092639] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Affiliation(s)
- Marzan Akhter
  - Institute of Statistical Research and Training, University of Dhaka, Dhaka, Bangladesh
9. Ma W, Xiao J, Yang Y, Ye F. Model-free feature screening for ultrahigh dimensional data via a Pearson chi-square based index. J Stat Comput Sim 2022. [DOI: 10.1080/00949655.2022.2062358] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Weidong Ma
  - Department of Mathematical Sciences, Tsinghua University, Beijing, People's Republic of China
- Jingsong Xiao
  - Department of Mathematical Sciences, Tsinghua University, Beijing, People's Republic of China
- Ying Yang
  - Department of Mathematical Sciences, Tsinghua University, Beijing, People's Republic of China
- Fei Ye
  - School of Statistics, Capital University of Economics and Business, Beijing, People's Republic of China
10. Yu C, Guo W, Song X, Cui H. Feature screening with latent responses. Biometrics 2022. [PMID: 35246841 DOI: 10.1111/biom.13658] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2021] [Accepted: 02/25/2022] [Indexed: 11/30/2022]
Abstract
A novel feature screening method is proposed to examine the correlation between latent responses and potential predictors in ultrahigh dimensional data analysis. First, a confirmatory factor analysis (CFA) model is used to characterize latent responses through multiple observed variables. The expectation-maximization algorithm is employed to estimate the parameters in the CFA model. Second, R-Vector (RV) correlation is used to measure the dependence between the multivariate latent responses and covariates of interest. Third, a feature screening procedure is proposed on the basis of an unbiased estimator of the RV coefficient. The sure screening property of the proposed screening procedure is established under certain mild conditions. Monte Carlo simulations are conducted to assess the finite sample performance of the feature screening procedure. The proposed method is applied to an investigation of the relationship between psychological well-being and the human genome.
Affiliation(s)
- Congran Yu
  - School of Mathematical Sciences, Capital Normal University, Beijing, China
- Wenwen Guo
  - School of Mathematical Sciences, Capital Normal University, Beijing, China
- Xinyuan Song
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong, China
- Hengjian Cui
  - School of Mathematical Sciences, Capital Normal University, Beijing, China
11. Liu J, Si Y, Niu Y, Zhang R. Projection quantile correlation and its use in high-dimensional grouped variable screening. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2021.107369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
12. Li R, Xu K, Zhou Y, Zhu L. Testing the effects of high-dimensional covariates via aggregating cumulative covariances. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2044334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Runze Li
  - The Pennsylvania State University
13.
Abstract
Summary
We consider Fréchet sufficient dimension reduction with responses being complex random objects in a metric space and high dimension Euclidean predictors. We propose a novel approach, called the weighted inverse regression ensemble method for linear Fréchet sufficient dimension reduction. The method is further generalized as a new operator defined on reproducing kernel Hilbert spaces for nonlinear Fréchet sufficient dimension reduction. We provide theoretical guarantees for the new method via asymptotic analysis. Intensive simulation studies verify the performance of our proposals, and we apply our methods to analyse the handwritten digits data and the real-world affective faces data to demonstrate its use in real applications.
Affiliation(s)
- Chao Ying
  - Key Laboratory of Advanced Theory and Application in Statistics and Data Science - MOE, School of Statistics, East China Normal University, Shanghai 200241, China
- Zhou Yu
  - Key Laboratory of Advanced Theory and Application in Statistics and Data Science - MOE, School of Statistics, East China Normal University, Shanghai 200241, China
14. Nandy D, Chiaromonte F, Li R. Covariate Information Number for Feature Screening in Ultrahigh-Dimensional Supervised Problems. J Am Stat Assoc 2022; 117:1516-1529. [PMID: 36172297 PMCID: PMC9512254 DOI: 10.1080/01621459.2020.1864380] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 01/03/2023]
Abstract
Contemporary high-throughput experimental and surveying techniques give rise to ultrahigh-dimensional supervised problems with sparse signals; that is, a limited number of observations (n), each with a very large number of covariates (p >> n), only a small share of which is truly associated with the response. In these settings, major concerns on computational burden, algorithmic stability, and statistical accuracy call for substantially reducing the feature space by eliminating redundant covariates before the use of any sophisticated statistical analysis. Along the lines of Sure Independence Screening (Fan and Lv, 2008) and other model- and correlation-based feature screening methods, we propose a model-free procedure called Covariate Information Number - Sure Independence Screening (CIS). CIS uses a marginal utility connected to the notion of the traditional Fisher Information, possesses the sure screening property, and is applicable to any type of response (features) with continuous features (response). Simulations and an application to transcriptomic data on rats reveal the comparative strengths of CIS over some popular feature screening methods.
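For readers unfamiliar with the screening step that this abstract builds on, the sketch below illustrates correlation-based marginal screening in the spirit of Sure Independence Screening: rank each covariate by a marginal utility and keep the top d. It is not the CIS procedure, whose Fisher-information-based utility is not specified in the abstract. The function name `sis_screen` is hypothetical, and the default d = floor(n / log n) is one commonly used screening size, adopted here as an assumption.

```python
import numpy as np


def sis_screen(X, y, d=None):
    """Rank features by |marginal Pearson correlation| with y; keep the top d.

    X is an (n, p) array and y an (n,) array. Returns the indices of the
    retained features (most relevant first) and the utility of every feature.
    """
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))  # assumed default screening size
    d = min(d, p)

    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom = np.where(denom > 0, denom, np.inf)  # constant columns get utility 0
    utility = np.abs(Xc.T @ yc) / denom

    keep = np.argsort(utility)[::-1][:d]
    return keep, utility
```

A model-free method would replace the Pearson correlation with a different marginal utility; the ranking and thresholding steps stay the same.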
Affiliation(s)
- Debmalya Nandy (corresponding author)
  - Department of Biostatistics & Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Francesca Chiaromonte
  - Department of Statistics, Penn State University, University Park, PA 16802, USA
  - Institute of Economics and EMbeDS, Sant’Anna School of Advanced Studies, Piazza Martiri della Libertà 33, Pisa 56127, Italy
- Runze Li
  - Department of Statistics, Penn State University, University Park, PA 16802, USA
15. Xu K, Zhou Y. Projection-averaging-based cumulative covariance and its use in goodness-of-fit testing for single-index models. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2021.107301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
16. Menvouta EJ, Serneels S, Verdonck T. Sparse dimension reduction based on energy and ball statistics. Adv Data Anal Classi 2021. [DOI: 10.1007/s11634-021-00470-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
17. Wang P, Yin X, Yuan Q, Kryscio R. Feature filter for estimating central mean subspace and its sparse solution. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2021.107285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
18. Tian Y, Feng Y. RaSE: A Variable Screening Framework via Random Subspace Ensembles. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1938084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
Affiliation(s)
- Ye Tian
  - Department of Statistics, Columbia University, New York
- Yang Feng
  - Department of Biostatistics, School of Global Public Health, New York University, New York
19. Lai T, Zhang Z, Wang Y. A kernel-based measure for conditional mean dependence. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2021.107246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
20. Gao L, Fan Y, Lv J, Shao QM. Asymptotic distributions of high-dimensional distance correlation inference. Ann Stat 2021; 49:1999-2020. [PMID: 34621096 PMCID: PMC8491772 DOI: 10.1214/20-aos2024] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022]
Abstract
Distance correlation has become an increasingly popular tool for detecting the nonlinear dependence between a pair of potentially high-dimensional random vectors. Most existing works have explored its asymptotic distributions under the null hypothesis of independence between the two random vectors when only the sample size or the dimensionality diverges. Yet its asymptotic null distribution for the more realistic setting when both sample size and dimensionality diverge in the full range remains largely underdeveloped. In this paper, we fill such a gap and develop central limit theorems and associated rates of convergence for a rescaled test statistic based on the bias-corrected distance correlation in high dimensions under some mild regularity conditions and the null hypothesis. Our new theoretical results reveal an interesting phenomenon of blessing of dimensionality for high-dimensional distance correlation inference in the sense that the accuracy of normal approximation can increase with dimensionality. Moreover, we provide a general theory on the power analysis under the alternative hypothesis of dependence, and further justify the capability of the rescaled distance correlation in capturing the pure nonlinear dependency under moderately high dimensionality for a certain type of alternative hypothesis. The theoretical results and finite-sample performance of the rescaled statistic are illustrated with several simulation examples and a blockchain application.
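As a concrete reference point for the "bias-corrected distance correlation" mentioned in this abstract, the following sketch computes the sample statistic from U-centered pairwise distance matrices, the construction commonly used for this estimator. The function names are hypothetical, Euclidean distance is assumed, and the rescaling and limit theory developed in the paper are not reproduced. Note that the quantity returned is on the squared-correlation scale and can be slightly negative under independence.

```python
import numpy as np


def _u_centered_distances(x):
    """Pairwise Euclidean distances of the rows of x, U-centered."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = x.shape[0]
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
    row = d.sum(axis=1, keepdims=True) / (n - 2)
    col = d.sum(axis=0, keepdims=True) / (n - 2)
    total = d.sum() / ((n - 1) * (n - 2))
    u = d - row - col + total
    np.fill_diagonal(u, 0.0)  # U-centering sets the diagonal to zero
    return u


def bias_corrected_dcor(x, y):
    """Bias-corrected (squared) distance correlation between samples x and y."""
    a = _u_centered_distances(x)
    b = _u_centered_distances(y)
    n = a.shape[0]
    if n < 4:
        raise ValueError("need at least 4 observations")
    scale = n * (n - 3)
    dcov_xy = (a * b).sum() / scale  # unbiased squared distance covariance
    dvar_x = (a * a).sum() / scale
    dvar_y = (b * b).sum() / scale
    return float(dcov_xy / np.sqrt(dvar_x * dvar_y))
```

The pairwise distance array costs memory quadratic in the sample size, so this form is only meant for small illustrative samples.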
Affiliation(s)
- Lan Gao
  - Data Sciences and Operations Department, Marshall School of Business, University of Southern California
- Yingying Fan
  - Data Sciences and Operations Department, Marshall School of Business, University of Southern California
- Jinchi Lv
  - Data Sciences and Operations Department, Marshall School of Business, University of Southern California
- Qi-Man Shao
  - Department of Statistics and Data Science, Southern University of Science and Technology
  - Department of Statistics, The Chinese University of Hong Kong
21. Zhang J, Liu Y. Model-free slice screening for ultrahigh-dimensional survival data. J Appl Stat 2021; 48:1755-1774. [DOI: 10.1080/02664763.2020.1772734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Affiliation(s)
- Jing Zhang
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, Hubei, People's Republic of China
- Yanyan Liu
  - School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei, People's Republic of China
22. Zhong W, Wang J, Chen X. Censored mean variance sure independence screening for ultrahigh dimensional survival data. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2021.107206] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/31/2023]
23.
Affiliation(s)
- Jingke Zhou
  - School of Mathematics and Statistics, Hubei University of Arts and Science, Xiangyang, People's Republic of China
- Lixing Zhu
  - Department of Mathematics, Hong Kong Baptist University, Hong Kong, Hong Kong
  - School of Statistics, Beijing Normal University, Beijing, People's Republic of China
24. Li C, Chen D, Xiong S. Linear screening for high‐dimensional computer experiments. Stat (Int Stat Inst) 2021. [DOI: 10.1002/sta4.320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Affiliation(s)
- Chunya Li
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
  - NCMIS, KLSC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- Daijun Chen
  - Nuance Communications Inc., Chengdu 610094, China
- Shifeng Xiong
  - NCMIS, KLSC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
25. Xu K, Zhou Y. Maximum-type tests for high-dimensional regression coefficients using Wilcoxon scores. J Stat Plan Inference 2021. [DOI: 10.1016/j.jspi.2020.06.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
26. Dong Y. A brief review of linear sufficient dimension reduction through optimization. J Stat Plan Inference 2021. [DOI: 10.1016/j.jspi.2020.06.006] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 10/24/2022]
27. Chakraborty S, Zhang X. A new framework for distance and kernel-based metrics in high dimensions. Electron J Stat 2021. [DOI: 10.1214/21-ejs1889] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
28. Zhu C, Zhang X, Yao S, Shao X. Distance-based and RKHS-based dependence metrics in high dimension. Ann Stat 2020. [DOI: 10.1214/19-aos1934] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]
29. Xu K, Shen Z, Huang X, Cheng Q. Projection correlation between scalar and vector variables and its use in feature screening with multi-response data. J Stat Comput Sim 2020. [DOI: 10.1080/00949655.2020.1753057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Affiliation(s)
- Kai Xu
  - School of Mathematics and Statistics, Anhui Normal University, Wuhu, People's Republic of China
- Zhiling Shen
  - School of Mathematics and Statistics, Anhui Normal University, Wuhu, People's Republic of China
- Xudong Huang
  - School of Mathematics and Statistics, Anhui Normal University, Wuhu, People's Republic of China
- Qing Cheng
  - Center for Quantitative Medicine, Duke-NUS Medical School, National University of Singapore, Singapore, Singapore
30. Martingale-difference-divergence-based tests for goodness-of-fit in quantile models. J Stat Plan Inference 2020. [DOI: 10.1016/j.jspi.2019.10.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
31. Zhang X, Lee CE, Shao X. Envelopes in multivariate regression models with nonlinearity and heteroscedasticity. Biometrika 2020. [DOI: 10.1093/biomet/asaa036] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022] [Open Access]
Abstract
Summary
Envelopes have been proposed in recent years as a nascent methodology for sufficient dimension reduction and efficient parameter estimation in multivariate linear models. We extend the classical definition of envelopes in Cook et al. (2010) to incorporate a nonlinear conditional mean function and a heteroscedastic error. Given any two random vectors ${X}\in\mathbb{R}^{p}$ and ${Y}\in\mathbb{R}^{r}$, we propose two new model-free envelopes, called the martingale difference divergence envelope and the central mean envelope, and study their relationships to the standard envelope in the context of response reduction in multivariate linear models. The martingale difference divergence envelope effectively captures the nonlinearity in the conditional mean without imposing any parametric structure or requiring any tuning in estimation. Heteroscedasticity, or nonconstant conditional covariance of ${Y}\mid{X}$, is further detected by the central mean envelope based on a slicing scheme for the data. We reveal the nested structure of different envelopes: (i) the central mean envelope contains the martingale difference divergence envelope, with equality when ${Y}\mid{X}$ has a constant conditional covariance; and (ii) the martingale difference divergence envelope contains the standard envelope, with equality when ${Y}\mid{X}$ has a linear conditional mean. We develop an estimation procedure that first obtains the martingale difference divergence envelope and then estimates the additional envelope components in the central mean envelope. We establish consistency in envelope estimation of the martingale difference divergence envelope and central mean envelope without stringent model assumptions. Simulations and real-data analysis demonstrate the advantages of the martingale difference divergence envelope and the central mean envelope over the standard envelope in dimension reduction.
Affiliation(s)
- X Zhang
  - Department of Statistics, Florida State University, 117 N.Woodward Ave., Tallahassee, Florida 32306, U.S.A
- C E Lee
  - Department of Business Analytics and Statistics, University of Tennessee, Knoxville, 916 Volunteer Blvd, Knoxville, Tennessee 37996, U.S.A
- X Shao
  - Department of Statistics, University of Illinois at Urbana Champaign, 725 South Wright St, Champaign, Illinois 61820, U.S.A
32. Estimating the Growing Stem Volume of Chinese Pine and Larch Plantations based on Fused Optical Data Using an Improved Variable Screening Method and Stacking Algorithm. Remote Sensing 2020. [DOI: 10.3390/rs12050871] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 12/11/2022]
Abstract
Accurately estimating growing stem volume (GSV) is very important for forest resource management. The GSV estimation is affected by remote sensing images, variable selection methods, and estimation algorithms. Optical images have been widely used for modeling key attributes of forest stands, including GSV and aboveground biomass (AGB), because of their easy availability, large coverage and related mature data processing and analysis technologies. However, the low data saturation level and the difficulty of selecting feature variables from optical images often impede the improvement of estimation accuracy. In this research, two GaoFen-2 (GF-2) images, a Landsat 8 image, and fused images created by integrating GF-2 bands with the Landsat multispectral image using the Gram–Schmidt method were first used to derive various feature variables and obtain various datasets or data scenarios. A DC-FSCK approach that integrates feature variable screening and a combination optimization procedure based on the distance correlation coefficient and k-nearest neighbors (kNN) algorithm was proposed and compared with the stepwise regression analysis (SRA) and random forest (RF) for feature variable selection. The DC-FSCK considers the self-correlation and combination effect among feature variables so that the selected variables can improve the accuracy and saturation level of GSV estimation. To validate the proposed approach, six estimation algorithms were examined and compared, including Multiple Linear Regression (MLR), kNN, Support Vector Regression (SVR), RF, eXtreme Gradient Boosting (XGBoost) and Stacking. The results showed that compared with GF-2 and Landsat 8 images, overall, the fused image (Red_Landsat) of GF-2 red band with Landsat 8 multispectral image improved the GSV estimation accuracy of Chinese pine and larch plantations. The Red_Landsat image also performed better than other fused images (Pan_Landsat, Blue_Landsat, Green_Landsat and Nir_Landsat). 
For most of the combinations of the datasets and estimation models, the proposed variable selection method DC-FSCK led to more accurate GSV estimates compared with SRA and RF. In addition, in most of the combinations obtained by the datasets and variable selection methods, the Stacking algorithm performed better than other estimation models. More importantly, the combination of the fused image Red_Landsat with the DC-FSCK and Stacking algorithm led to the best performance of GSV estimation with the greatest adjusted coefficients of determination, 0.8127 and 0.6047, and the smallest relative root mean square errors of 17.1% and 20.7% for Chinese pine and larch, respectively. This study provided new insights on how to choose suitable optical images, variable selection methods and optimal modeling algorithms for the GSV estimation of Chinese pine and larch plantations.
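The distance-correlation ranking that the abstract above names as one ingredient of DC-FSCK can be illustrated with a minimal sketch. This is not the authors' DC-FSCK procedure: the kNN-based combination optimization is omitted, the function names `distance_correlation` and `rank_features` are hypothetical, and the statistic used is the standard V-statistic form of sample distance correlation.

```python
import numpy as np


def _double_centered_distances(x):
    """Pairwise Euclidean distances with row, column, and grand means removed."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation (V-statistic form); the value lies in [0, 1]."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = (a * b).mean()          # squared sample distance covariance
    dvar_x = (a * a).mean()         # squared sample distance variance of x
    dvar_y = (b * b).mean()         # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def rank_features(features, response):
    """Rank the columns of `features` by distance correlation with `response`.

    Returns the column indices in decreasing order of score, and the scores.
    """
    features = np.asarray(features, dtype=float)
    scores = np.array([
        distance_correlation(features[:, j], response)
        for j in range(features.shape[1])
    ])
    order = np.argsort(-scores)
    return order, scores
```

Under these assumptions, `rank_features(X, gsv)` would order candidate image-derived variables by their marginal dependence on the measured GSV; any subsequent selection of a variable combination is a separate step not shown here.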
|
33
|
Abstract
Summary
We propose a new nonparametric conditional mean independence test for a response variable $Y$ and a predictor variable $X$ where either or both can be function-valued. Our test is built on a new metric, the so-called functional martingale difference divergence, which fully characterizes the conditional mean dependence of $Y$ given $X$ and extends the martingale difference divergence proposed by Shao & Zhang (2014). We define an unbiased estimator of functional martingale difference divergence by using a $\mathcal{U}$-centring approach, and we obtain its limiting null distribution under mild assumptions. Since the limiting null distribution is not pivotal, we use the wild bootstrap method to estimate the critical value and show the consistency of the bootstrap test. Our test can detect the local alternative which approaches the null at the rate of $n^{-1/2}$ with a nontrivial power, where $n$ is the sample size. Unlike the three tests developed by Kokoszka et al. (2008), Lei (2014) and Patilea et al. (2016), our test does not require a finite-dimensional projection or assume a linear model, and it does not involve any tuning parameters. Promising finite-sample performance is demonstrated via simulations, and a real-data illustration is used to compare our test with existing ones.
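For orientation, the scalar martingale difference divergence of Shao & Zhang (2014), which the functional version above extends, admits a simple sample analogue. The sketch below is a plug-in (V-statistic) estimate for a scalar response and a vector predictor, based on the population form $-E[(Y-\mu_Y)(Y'-\mu_Y)\,\|X-X'\|]$; it is not the $\mathcal{U}$-centred unbiased estimator or the functional metric of the paper, and the function name `mdd_squared` is hypothetical.

```python
import numpy as np


def mdd_squared(x, y):
    """Plug-in estimate of the squared martingale difference divergence of y given x.

    x: array of shape (n,) or (n, p); y: array of shape (n,).
    Computes -(1/n^2) * sum_{k,l} (y_k - ybar) (y_l - ybar) ||x_k - x_l||.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    # Pairwise Euclidean distances between predictor observations.
    dist = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
    yc = y - y.mean()  # centred response
    return float(-(np.outer(yc, yc) * dist).mean())
```

With `x` on a symmetric grid and `y = x**2`, the Pearson correlation is essentially zero while `mdd_squared(x, y)` is clearly positive, which reflects the fact that the measure targets dependence of the conditional mean rather than linear association.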
Affiliation(s)
- C E Lee
  - Department of Business Analytics and Statistics, University of Tennessee, Knoxville, 916 Volunteer Blvd, Knoxville, Tennessee 37996, USA
- X Zhang
  - Department of Statistics, Texas A&M University, 155 Ireland St, College Station, Texas 77843, USA
- X Shao
  - Department of Statistics, University of Illinois at Urbana-Champaign, 725 South Wright St, Champaign, Illinois 61820, USA
|
34
|
|
35
|
Abstract
Feature screening plays an important role in the analysis of ultrahigh dimensional data. Due to complicated model structure and high noise level, existing screening methods often suffer from model misspecification and the presence of outliers. To address these issues, we introduce a new metric named cumulative divergence (CD), and develop a CD-based forward screening procedure. This forward screening method is model-free and resistant to the presence of outliers in the response. It also incorporates the joint effects among covariates into the screening process. With a data-driven threshold, the new method can automatically determine the number of features that should be retained after screening. These merits make the CD-based screening very appealing in practice. Under certain regularity conditions, we show that the proposed method possesses sure screening property. The performance of our proposal is illustrated through simulations and a real data example.
Affiliation(s)
- Tingyou Zhou
  - School of Data Sciences, Zhejiang University of Finance and Economics, Hangzhou, P. R. China
- Liping Zhu
  - Institute of Statistics and Big Data and Center for Applied Statistics, Renmin University of China, Beijing, P. R. China
- Chen Xu
  - Department of Mathematics and Statistics, University of Ottawa, Ottawa, Canada
- Runze Li
  - Department of Statistics and The Methodology Center, The Pennsylvania State University at University Park, U.S.A.
|
36
|
Kong E, Xia Y, Zhong W. Composite Coefficient of Determination and Its Application in Ultrahigh Dimensional Variable Screening. J Am Stat Assoc 2019. [DOI: 10.1080/01621459.2018.1514305] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 01/13/2023]
Affiliation(s)
- Efang Kong
  - School of Mathematical Sciences, University of Electronic Science and Technology of China, China
- Yingcun Xia
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore
- Wei Zhong
  - Wang Yanan Institute for Studies in Economics (WISE), and Department of Statistics, School of Economics, Xiamen University, China
|
37
|
Pan J, Zhang S, Zhou Y. Variable screening for ultrahigh dimensional censored quantile regression. J STAT COMPUT SIM 2018. [DOI: 10.1080/00949655.2018.1554068] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/27/2022]
Affiliation(s)
- Jing Pan
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shucong Zhang
  - School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, China
- Yong Zhou
  - Key Laboratory of Advanced Theory and Application in Statistics and Data Science, MOE, and Institute of Statistics and Interdisciplinary Sciences and School of Statistics, East China Normal University, Shanghai, China
|
38
|
Zhu L, Zhang Y, Xu K. Measuring and testing for interval quantile dependence. Ann Stat 2018. [DOI: 10.1214/17-aos1635] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
39
|
Li X, Ma X, Zhang J. Conditional quantile correlation screening procedure for ultrahigh-dimensional varying coefficient models. J Stat Plan Inference 2018. [DOI: 10.1016/j.jspi.2017.12.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
|
40
|
Generalizing distance covariance to measure and test multivariate mutual dependence via complete and incomplete V-statistics. J MULTIVARIATE ANAL 2018. [DOI: 10.1016/j.jmva.2018.08.006] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 11/23/2022]
|
41
|
Edelmann D, Fokianos K, Pitsillou M. An Updated Literature Review of Distance Correlation and Its Applications to Time Series. Int Stat Rev 2018. [DOI: 10.1111/insr.12294] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Indexed: 12/19/2022]
Affiliation(s)
- Dominic Edelmann
  - Department of Biostatistics, German Cancer Research Center, Heidelberg, Germany
- Maria Pitsillou
  - Department of Mathematics & Statistics, University of Cyprus, Nicosia, Cyprus
|
42
|
Pan W, Wang X, Xiao W, Zhu H. A Generic Sure Independence Screening Procedure. J Am Stat Assoc 2018; 114:928-937. [PMID: 31692981 PMCID: PMC6831100 DOI: 10.1080/01621459.2018.1462709] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 10/01/2016] [Accepted: 03/01/2018] [Indexed: 02/07/2023]
Abstract
Extracting important features from ultra-high dimensional data is one of the primary tasks in statistical learning, information theory, precision medicine and biological discovery. Many of the sure independent screening methods developed to meet these needs are suitable for special models under some assumptions. With the availability of more data types and possible models, a model-free generic screening procedure with fewer and less restrictive assumptions is desirable. In this paper, we propose a generic nonparametric sure independence screening procedure, called BCor-SIS, on the basis of a recently developed universal dependence measure: Ball correlation. We show that the proposed procedure has strong screening consistency even when the dimensionality is an exponential order of the sample size without imposing sub-exponential moment assumptions on the data. We investigate the flexibility of this procedure by considering three commonly encountered challenging settings in biological discovery or precision medicine: iterative BCor-SIS, interaction pursuit, and survival outcomes. We use simulation studies and real data analyses to illustrate the versatility and practicability of our BCor-SIS method.
Affiliation(s)
- Wenliang Pan
  - Department of Statistical Science, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, P. R. China
- Xueqin Wang
  - Department of Statistical Science, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, P. R. China; and Zhongshan School of Medicine, Sun Yat-Sen University, Guangzhou, 510080, China; and Xinhua College, Sun Yat-Sen University, Guangzhou, 510520, China
- Weinan Xiao
  - Department of Statistical Science, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, P. R. China
- Hongtu Zhu
  - Department of Biostatistics, The University of Texas, MD Anderson Cancer Center, Houston, Texas, 77030
|
43
|
Niu Y, Zhang R, Liu J, Li H. Nonparametric independence screening for ultra-high-dimensional longitudinal data under additive models. J Nonparametr Stat 2018. [DOI: 10.1080/10485252.2018.1497797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]
Affiliation(s)
- Yong Niu
  - School of Finance and Statistics, East China Normal University, Shanghai, People's Republic of China
  - Department of Mathematics and Physics, Hefei University, Hefei, People's Republic of China
- Riquan Zhang
  - School of Finance and Statistics, East China Normal University, Shanghai, People's Republic of China
- Jicai Liu
  - College of Mathematics and Sciences, Shanghai Normal University, Shanghai, People's Republic of China
- Huapeng Li
  - School of Finance and Statistics, East China Normal University, Shanghai, People's Republic of China
  - School of Mathematics and Computer Sciences, Shanxi Datong University, Datong, People's Republic of China
|
44
|
Zhang S, Zhou Y. Variable screening for ultrahigh dimensional heterogeneous data via conditional quantile correlations. J MULTIVARIATE ANAL 2018. [DOI: 10.1016/j.jmva.2017.11.005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/11/2022]
|
45
|
Chen X, Chen X, Wang H. Robust feature screening for ultra-high dimensional right censored data via distance correlation. Comput Stat Data Anal 2018. [DOI: 10.1016/j.csda.2017.10.004] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Indexed: 11/15/2022]
|
46
|
Zhang X, Yao S, Shao X. Conditional mean and quantile dependence testing in high dimension. Ann Stat 2018. [DOI: 10.1214/17-aos1548] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022]
|
47
|
|
48
|
Fan GL, Jiang ZQ, Wang JF. Empirical likelihood for high-dimensional partially linear model with martingale difference errors. COMMUN STAT-THEOR M 2017. [DOI: 10.1080/03610926.2016.1260739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Affiliation(s)
- Guo-Liang Fan
  - Institute of Statistics and Big Data, Renmin University of China, Beijing, China
  - School of Mathematics and Physics, Anhui Polytechnic University, Wuhu, China
- Zhi-Qiang Jiang
  - School of Mathematics & Physics, Anhui Polytechnic University, Wuhu, China
- Jiang-Feng Wang
  - School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou, China
|
49
|
Wang HJ, McKeague IW, Qian M. Testing for Marginal Linear Effects in Quantile Regression. J R Stat Soc Series B Stat Methodol 2017; 80:433-452. [PMID: 29576736 DOI: 10.1111/rssb.12258] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 11/30/2022]
Abstract
This paper develops a new marginal testing procedure to detect the presence of significant predictors associated with the conditional quantiles of a scalar response. The idea is to fit the marginal quantile regression on each predictor one at a time, and then base the test on the t-statistics associated with the most predictive predictors. A resampling method is devised to calibrate this test statistic, which has non-regular limiting behavior due to the selection of the most predictive variables. Asymptotic validity of the procedure is established in a general quantile regression setting in which the marginal quantile regression models can be misspecified. Even though a fixed dimension is assumed to derive the asymptotic results, the proposed test is applicable and computationally feasible for large-dimensional predictors. The method is more flexible than existing marginal screening test methods based on mean regression, and has the added advantage of being robust against outliers in the response. The approach is illustrated using an application to an HIV drug resistance dataset.
Affiliation(s)
- Huixia Judy Wang
  - Associate Professor, Department of Statistics, George Washington University, Washington, District of Columbia 20052, USA
- Ian W McKeague
  - Professor, Department of Biostatistics, Columbia University, New York, NY 20032, USA
- Min Qian
  - Assistant Professor, Department of Biostatistics, Columbia University, New York, NY 20032, USA
|
50
|
|