1. Guo X, Chen Y, Tang CY. Information criteria for latent factor models: a study on factor pervasiveness and adaptivity. Journal of Econometrics 2023; 233:237-250. PMID: 36938506; PMCID: PMC10022528; DOI: 10.1016/j.jeconom.2022.03.005.
Abstract
We study information criteria extensively under general conditions for high-dimensional latent factor models. Upon carefully analyzing the estimation errors of the principal component analysis method, we establish theoretical results on the estimation accuracy of the latent factor scores, incorporating the impact of possibly weak factor pervasiveness; our analysis does not require the same factor strength for all the leading factors. To estimate the number of latent factors, we propose a new penalty specification with a two-fold consideration: i) being adaptive to the strength of the factor pervasiveness, and ii) favoring more parsimonious models. Our theory establishes the validity of the proposed approach under general conditions. Additionally, we construct examples demonstrating that when the factor strength is too weak, scenarios exist in which no information criterion can consistently identify the latent factors. We illustrate the performance of the proposed adaptive information criteria with extensive numerical examples, including simulations and a real data analysis.
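The kind of criterion this abstract discusses can be sketched numerically. Below is a toy Bai-and-Ng-style information criterion for selecting the number of factors from PCA eigenvalues; the penalty g(n, p) used here is a common textbook choice and stands in for, rather than reproduces, the adaptive penalty the paper proposes.

```python
import numpy as np

# Illustrative sketch (not the authors' exact criterion): choose the number
# of latent factors by minimizing IC(k) = log(V(k)) + k * g(n, p), where
# V(k) is the average residual variance after removing k principal components.
rng = np.random.default_rng(0)
n, p, r = 200, 100, 3                      # sample size, dimension, true number of factors
F = rng.standard_normal((n, r))            # latent factor scores
L = rng.standard_normal((p, r)) * 3.0      # strong (pervasive) loadings
X = F @ L.T + rng.standard_normal((n, p))  # observed data

# Eigen-decomposition of the sample covariance gives the PCA solution.
eigvals = np.linalg.eigvalsh(X.T @ X / n)[::-1]  # sorted descending
total = eigvals.sum()

def ic(k):
    v_k = (total - eigvals[:k].sum()) / p  # residual variance after k factors
    penalty = k * (n + p) / (n * p) * np.log(n * p / (n + p))
    return np.log(v_k) + penalty

k_hat = min(range(1, 11), key=ic)
print("selected number of factors:", k_hat)
```

With strong loadings as above, the criterion recovers the true number of factors; the paper's point is precisely that such fixed penalties can fail when factor strength weakens, motivating the adaptive alternative.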
Affiliation(s)
- Xiao Guo: International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, Anhui 230026, People’s Republic of China
- Yu Chen: International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, Anhui 230026, People’s Republic of China
- Cheng Yong Tang: Department of Statistical Science, Temple University, 1810 Liacouras Walk, Philadelphia, Pennsylvania 19122-6083, U.S.A.
2. James-Stein for the leading eigenvector. Proc Natl Acad Sci U S A 2023; 120:e2207046120. PMID: 36603029; PMCID: PMC9926287; DOI: 10.1073/pnas.2207046120.
Abstract
Recent research identifies and corrects bias, such as excess dispersion, in the leading sample eigenvector of a factor-based covariance matrix estimated from a high-dimension low sample size (HL) data set. We show that eigenvector bias can have a substantial impact on variance-minimizing optimization in the HL regime, while bias in estimated eigenvalues may have little effect. We describe a data-driven eigenvector shrinkage estimator in the HL regime called "James-Stein for eigenvectors" (JSE) and its close relationship with the James-Stein (JS) estimator for a collection of averages. We show, both theoretically and with numerical experiments, that, for certain variance-minimizing problems of practical importance, efforts to correct eigenvalues have little value in comparison to the JSE correction of the leading eigenvector. When certain extra information is present, JSE is a consistent estimator of the leading eigenvector.
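As background for the analogy this abstract draws, the following sketch shows the classic positive-part James-Stein estimator for a collection of averages, shrinking toward the grand mean; it illustrates the JS principle only and is not the authors' eigenvector (JSE) formula.

```python
import numpy as np

# Classic positive-part James-Stein for p noisy averages with known noise
# variance sigma2: shrink toward the grand mean. Dominates the raw averages
# in total squared error when the means are clustered.
rng = np.random.default_rng(1)
p, sigma2 = 50, 1.0
theta = rng.normal(0.0, 0.5, size=p)                   # true means, clustered near 0
x = theta + rng.normal(0.0, np.sqrt(sigma2), size=p)   # one noisy average each

grand = x.mean()
resid = x - grand
shrink = max(0.0, 1.0 - (p - 3) * sigma2 / np.sum(resid**2))  # positive-part JS factor
js = grand + shrink * resid

err_raw = np.sum((x - theta) ** 2)
err_js = np.sum((js - theta) ** 2)
print(err_raw, err_js)
```

The JSE estimator in the paper applies this style of data-driven shrinkage to the leading sample eigenvector in the HL regime, where its excess dispersion plays the role of the noise in the averages.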
3. Guerra-Urzola R, Van Deun K, Vera JC, Sijtsma K. A Guide for Sparse PCA: Model Comparison and Applications. Psychometrika 2021; 86:893-919. PMID: 34185214; PMCID: PMC8636462; DOI: 10.1007/s11336-021-09773-2.
Abstract
PCA is a popular tool for exploring and summarizing multivariate data, especially data consisting of many variables. PCA results, however, are often not simple to interpret, as the components are linear combinations of the variables. To address this issue, numerous methods have been proposed to sparsify the coefficients in the components, including rotation-thresholding methods and, more recently, PCA methods subject to sparsity-inducing penalties or constraints. Here, we offer guidelines on how to choose among the different sparse PCA methods. The current literature lacks clear guidance on the properties and performance of the different sparse PCA methods, often relying on the misconception that the equivalence of the formulations for ordinary PCA also holds for sparse PCA. To guide potential users of sparse PCA methods, we first discuss several popular sparse PCA methods in terms of whether the sparseness is imposed on the loadings or on the weights, the assumed model, and the optimization criterion used to impose sparseness. Second, using an extensive simulation study, we assess each of these methods by means of performance measures such as squared relative error, misidentification rate, and percentage of explained variance, for several data-generating models and conditions for the population model. Finally, two examples using empirical data are considered.
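One family the guide covers, sparsity imposed directly on the loadings, can be sketched with a thresholded power iteration in the spirit of penalized rank-1 approximation; the threshold `lam` below is an ad hoc tuning value for illustration, not one from the paper.

```python
import numpy as np

# Minimal sparse-PCA sketch: alternate a power step on the sample covariance
# with soft-thresholding of the loading vector, then renormalize.
def soft(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pc1(X, lam, iters=200):
    """First sparse loading vector of the column-centered data matrix X."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / len(Xc)
    v = np.linalg.eigh(S)[1][:, -1]        # ordinary PC1 as warm start
    for _ in range(iters):
        v = soft(S @ v, lam)
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(2)
n, p = 300, 20
truth = np.zeros(p); truth[:4] = 0.5       # sparse true direction: 4 active variables
X = rng.standard_normal((n, 1)) * 4.0 @ truth[None, :] + rng.standard_normal((n, p))
v = sparse_pc1(X, lam=0.5)
print("nonzero loadings:", np.count_nonzero(v))
```

Ordinary PC1 here would have all 20 loadings nonzero; the thresholded iterate concentrates on the few truly active variables, which is the interpretability gain the guide evaluates across methods.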
Affiliation(s)
- Rosember Guerra-Urzola: Department of Methodology and Statistics, Tilburg University, Prof. Cobbenhagenlaan 225, Simon Building, Room S 820, 5037 DB Tilburg, The Netherlands
- Katrijn Van Deun: Department of Methodology and Statistics, Tilburg University, Tilburg, The Netherlands
- Juan C. Vera: Department of Econometrics and OR, Tilburg University, Tilburg, The Netherlands
4
Abstract
This paper studies model selection consistency for high-dimensional sparse regression when the data exhibit both cross-sectional and serial dependency. Most commonly used model selection methods fail to consistently recover the true model when the covariates are highly correlated. Motivated by econometric and financial studies, we consider the case where covariate dependence can be reduced through a factor model, and propose a consistent strategy named Factor-Adjusted Regularized Model Selection (FarmSelect). By learning the latent factors and idiosyncratic components and using both of them as predictors, FarmSelect transforms the problem from model selection with highly correlated covariates to one with weakly correlated covariates via lifting. Model selection consistency, as well as optimal rates of convergence, are obtained under mild conditions. Numerical studies demonstrate strong finite-sample performance in terms of both model selection and out-of-sample prediction. Moreover, our method is flexible in the sense that it pays no price in the weakly correlated and uncorrelated cases. Our method is applicable to a wide range of high-dimensional sparse regression problems. An R package, FarmSelect, is also provided for implementation.
Affiliation(s)
- Yuan Ke: Department of Statistics, University of Georgia, USA
5
Abstract
Principal component analysis (PCA) is fundamental to statistical machine learning. It extracts latent principal factors that capture the most variation in the data. When data are stored across multiple machines, however, communication cost can prohibit computing PCA in a central location, and distributed algorithms for PCA are thus needed. This paper proposes and studies a distributed PCA algorithm: each node machine computes the top K eigenvectors and transmits them to the central server; the central server then aggregates the information from all the node machines and conducts a PCA based on the aggregated information. We investigate the bias and variance of the resulting distributed estimator of the top K eigenvectors. In particular, we show that for distributions with symmetric innovations, the empirical top eigenspaces are unbiased and hence the distributed PCA is "unbiased". We derive the rate of convergence for distributed PCA estimators, which depends explicitly on the effective rank of the covariance matrix, the eigen-gap, and the number of machines. We show that when the number of machines is not unreasonably large, the distributed PCA performs as well as whole-sample PCA, even without full access to the whole data. The theoretical results are verified by an extensive simulation study. We also extend our analysis to the heterogeneous case, where the population covariance matrices differ across local machines but share similar top eigenstructures.
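The one-round scheme described above can be sketched directly: each node sends its top-K local eigenvectors, and the server averages the projection matrices V V^T and takes the top-K eigenvectors of the average. The spiked-covariance setup below is an illustrative assumption, not the paper's simulation design.

```python
import numpy as np

rng = np.random.default_rng(4)
p, K, m, n_local = 30, 2, 10, 200
# Common spiked covariance: two strong directions plus isotropic noise.
U = np.linalg.qr(rng.standard_normal((p, K)))[0]
factor_sd = np.array([5.0, 3.0])

def local_top_k(n):
    """One node: draw local data, return top-K sample eigenvectors."""
    X = rng.standard_normal((n, K)) * factor_sd @ U.T + rng.standard_normal((n, p))
    S = X.T @ X / n
    return np.linalg.eigh(S)[1][:, -K:]

agg = np.zeros((p, p))
for _ in range(m):
    V = local_top_k(n_local)
    agg += V @ V.T / m                     # server averages local projections
V_dist = np.linalg.eigh(agg)[1][:, -K:]    # server-side top-K eigenvectors

# Alignment with the true eigenspace (1 = perfect, up to rotation).
align = np.linalg.norm(V_dist.T @ U) ** 2 / K
print("eigenspace alignment:", align)
```

Only K vectors of length p travel from each node, yet the aggregated eigenspace is close to the truth, which is the communication-versus-accuracy trade-off the paper quantifies.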
Affiliation(s)
- Jianqing Fan: Department of Operations Research and Financial Engineering, Princeton University
- Dong Wang: Department of Operations Research and Financial Engineering, Princeton University
- Kaizheng Wang: Department of Operations Research and Financial Engineering, Princeton University
- Ziwei Zhu: Department of Operations Research and Financial Engineering, Princeton University
6. Fan J, Ke Y, Sun Q, Zhou WX. FarmTest: Factor-adjusted robust multiple testing with approximate false discovery control. J Am Stat Assoc 2019; 114:1880-1893. PMID: 33033420; PMCID: PMC7539891; DOI: 10.1080/01621459.2018.1527700.
Abstract
Large-scale multiple testing with correlated and heavy-tailed data arises in a wide range of research areas, from genomics and medical imaging to finance. Conventional methods for estimating the false discovery proportion (FDP) often ignore the effect of heavy-tailedness and the dependence structure among test statistics, and thus may lead to inefficient or even inconsistent estimation. Also, the commonly imposed joint normality assumption is arguably too stringent for many applications. To address these challenges, in this paper we propose a Factor-Adjusted Robust Multiple Testing (FarmTest) procedure for large-scale simultaneous inference with control of the false discovery proportion. We demonstrate that robust factor adjustments are extremely important for both controlling the FDP and improving power. We identify general conditions under which the proposed method produces a consistent estimate of the FDP. As a byproduct of independent interest, we establish an exponential-type deviation inequality for a robust U-type covariance estimator under the spectral norm. Extensive numerical experiments demonstrate the advantage of the proposed method over several state-of-the-art methods, especially when the data are generated from heavy-tailed distributions. The proposed procedures are implemented in the R package FarmTest.
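The factor-adjustment step at the heart of the procedure can be illustrated with a toy example (the robust Huber-type estimation and the FDP machinery of the actual FarmTest procedure are omitted here): remove an estimated common factor from each variable before forming test statistics, which restores approximate independence across tests and sharpens them.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 500
f = rng.standard_normal(n); f -= f.mean()      # latent factor (centered for a clean toy)
b = rng.uniform(0.5, 1.5, size=p)              # factor loadings
mu = np.zeros(p); mu[:25] = 1.0                # 25 true signals among 500 tests
X = mu + np.outer(f, b) + rng.standard_normal((n, p))

# Estimate the factor as the leading principal component of the centered
# data, then regress it out of every variable.
Xc = X - X.mean(axis=0)
u, s, vt = np.linalg.svd(Xc, full_matrices=False)
f_hat = u[:, 0] * s[0]                         # estimated factor scores
b_hat = Xc.T @ f_hat / (f_hat @ f_hat)         # estimated loadings
X_adj = X - np.outer(f_hat, b_hat)             # factor-adjusted data

t_adj = X_adj.mean(axis=0) * np.sqrt(n) / X_adj.std(axis=0, ddof=1)
detected = int(np.sum(np.abs(t_adj[:25]) > 3))
false_pos = int(np.sum(np.abs(t_adj[25:]) > 3))
print("signals detected at |t|>3:", detected)
print("false positives at |t|>3:", false_pos)
```

Without the adjustment the factor inflates every variable's variance and correlates the test statistics; after it, the t-statistics behave approximately like independent standard normals under the null.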
Affiliation(s)
- Jianqing Fan: Honorary Professor, School of Data Science, Fudan University, Shanghai, China, and Frederick L. Moore '18 Professor of Finance, Department of Operations Research and Financial Engineering, Princeton University, NJ 08544
- Yuan Ke: Assistant Professor, Department of Statistics, University of Georgia, Athens, GA 30602
- Qiang Sun: Assistant Professor, Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3, Canada
- Wen-Xin Zhou: Assistant Professor, Department of Mathematics, University of California, San Diego, La Jolla, CA 92093
7. Johnstone IM, Paul D. PCA in High Dimensions: An Orientation. Proceedings of the IEEE 2018; 106:1277-1292. PMID: 30287970; PMCID: PMC6167023; DOI: 10.1109/jproc.2018.2846730.
Abstract
When the data are high dimensional, widely used multivariate statistical methods such as principal component analysis can behave in unexpected ways. In settings where the dimension of the observations is comparable to the sample size, upward bias in sample eigenvalues and inconsistency of sample eigenvectors are among the most notable phenomena that appear. These phenomena, and the limiting behavior of the rescaled extreme sample eigenvalues, have recently been investigated in detail under the spiked covariance model. The behavior of the bulk of the sample eigenvalues under weak distributional assumptions on the observations has been described. These results have been exploited to develop new estimation and hypothesis testing methods for the population covariance matrix. Furthermore, partly in response to these phenomena, alternative classes of estimation procedures have been developed by exploiting sparsity of the eigenvectors or the covariance matrix. This paper gives an orientation to these areas.
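The upward bias of sample eigenvalues described above is easy to see numerically: for i.i.d. standard normal data with true covariance I (every population eigenvalue equals 1), the largest sample eigenvalue concentrates near the Marchenko-Pastur bulk edge (1 + sqrt(p/n))^2 when p is comparable to n.

```python
import numpy as np

# Null model: no spike at all, yet the top sample eigenvalue sits far above 1.
rng = np.random.default_rng(6)
n, p = 400, 200
X = rng.standard_normal((n, p))
top = np.linalg.eigvalsh(X.T @ X / n)[-1]   # largest sample eigenvalue
predicted = (1 + np.sqrt(p / n)) ** 2       # Marchenko-Pastur bulk edge
print(top, predicted)
```

With p/n = 1/2 the top sample eigenvalue comes out near 2.9 rather than 1, which is the kind of phenomenon the spiked covariance model was developed to describe.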
Affiliation(s)
- Iain M. Johnstone: Department of Statistics, Stanford University, Stanford, CA 94305
- Debashis Paul: Department of Statistics, University of California, Davis
8. Aoshima M, Shen D, Shen H, Yata K, Zhou YH, Marron JS. A survey of high dimension low sample size asymptotics. Aust N Z J Stat 2018; 60:4-19. PMID: 30197552; PMCID: PMC6124695; DOI: 10.1111/anzs.12212.
Abstract
Peter Hall's work illuminated many aspects of statistical thought, some of which are very well known, including the bootstrap and smoothing. However, he also explored many other lesser known aspects of mathematical statistics. This is a survey of one of those areas: high dimension low sample size asymptotics, initiated by a seminal paper in 2005. An interesting characteristic of that first paper, and of many of the following papers, is that they contain deep and insightful concepts which are frequently surprising and counter-intuitive, yet have mathematical underpinnings which tend to be direct and not difficult to prove.
Affiliation(s)
- Makoto Aoshima: Institute of Mathematics, University of Tsukuba, Ibaraki 305-8571, Japan
- Dan Shen: Interdisciplinary Data Sciences Consortium, Department of Mathematics & Statistics, University of South Florida, FL 33620, USA
- Haipeng Shen: Innovation and Information Management, Faculty of Business and Economics, University of Hong Kong, Hong Kong
- Kazuyoshi Yata: Institute of Mathematics, University of Tsukuba, Ibaraki 305-8571, Japan
- Yi-Hui Zhou: Bioinformatics Research Center, Departments of Biological Sciences, North Carolina State University, USA
- J. S. Marron: Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27514, USA