1
Guo X, Li R, Liu J, Zeng M. Statistical inference for linear mediation models with high-dimensional mediators and application to studying stock reaction to COVID-19 pandemic. JOURNAL OF ECONOMETRICS 2023; 235:166-179. [PMID: 36568314 PMCID: PMC9759674 DOI: 10.1016/j.jeconom.2022.03.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/02/2021] [Revised: 01/31/2022] [Accepted: 03/04/2022] [Indexed: 06/17/2023]
Abstract
Mediation analysis draws increasing attention in many research areas such as economics, finance and social sciences. In this paper, we propose new statistical inference procedures for high dimensional mediation models, in which both the outcome model and the mediator model are linear with high dimensional mediators. Traditional procedures for mediation analysis cannot be used to make statistical inference for high dimensional linear mediation models due to high-dimensionality of the mediators. We propose an estimation procedure for the indirect effects of the models via a partially penalized least squares method, and further establish its theoretical properties. We further develop a partially penalized Wald test on the indirect effects, and prove that the proposed test has a $\chi^2$ limiting null distribution. We also propose an $F$-type test for direct effects and show that the proposed test asymptotically follows a $\chi^2$-distribution under the null hypothesis and a noncentral $\chi^2$-distribution under local alternatives. Monte Carlo simulations are conducted to examine the finite sample performance of the proposed tests and compare their performance with existing ones. We further apply the newly proposed statistical inference procedures to study stock reaction to the COVID-19 pandemic via an empirical analysis of the mediation effects of financial metrics that bridge a company's sector and its stock return.
Affiliation(s)
- Xu Guo
- School of Statistics, Beijing Normal University, Beijing, 100875, China
- Runze Li
- Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA
- Jingyuan Liu
- MOE Key Laboratory of Econometrics, Department of Statistics, School of Economics, Wang Yanan Institute for Studies in Economics and Fujian Key Lab of Statistics, Xiamen University, Xiamen, 361000, China
- Mudong Zeng
- Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA
2
Chen J, Li Q, Chen HY. Testing generalized linear models with high-dimensional nuisance parameter. Biometrika 2023; 110:83-99. [PMID: 36816791 PMCID: PMC9933885 DOI: 10.1093/biomet/asac021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
Generalized linear models often have high-dimensional nuisance parameters, as seen in applications such as testing gene-environment interactions or gene-gene interactions. In these scenarios, it is essential to test the significance of a high-dimensional sub-vector of the model's coefficients. Although some existing methods can tackle this problem, they often rely on the bootstrap to approximate the asymptotic distribution of the test statistic, and thus are computationally expensive. Here, we propose a computationally efficient test with a closed-form limiting distribution, which allows the parameter being tested to be either sparse or dense. We show that under certain regularity conditions, the type I error of the proposed method is asymptotically correct, and we establish its power under high-dimensional alternatives. Extensive simulations demonstrate the good performance of the proposed test and its robustness when certain sparsity assumptions are violated. We also apply the proposed method to Chinese famine sample data in order to show its performance when testing the significance of gene-environment interactions.
Affiliation(s)
- Jinsong Chen
- College of Applied Health Sciences, University of Illinois at Chicago, 1919 W Taylor St, Chicago, Illinois 60612, U.S.A
- Quefeng Li
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, U.S.A
- Hua Yun Chen
- School of Public Health, University of Illinois at Chicago, 2121 W Taylor St, Chicago, Illinois 60612, U.S.A
3
Jiang F, Zhou Y, Liu J, Ma Y. On high-dimensional Poisson models with measurement error: Hypothesis testing for nonlinear nonconvex optimization. Ann Stat 2023; 51:233-259. [PMID: 37602147 PMCID: PMC10438917 DOI: 10.1214/22-aos2248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/22/2023]
Abstract
We study estimation and testing in the Poisson regression model with noisy high dimensional covariates, which has wide applications in analyzing noisy big data. Correcting for the estimation bias due to the covariate noise leads to a non-convex target function to minimize. Treating the high dimensional issue further leads us to augment an amenable penalty term to the target function. We propose to estimate the regression parameter through minimizing the penalized target function. We derive the L1 and L2 convergence rates of the estimator and prove the variable selection consistency. We further establish the asymptotic normality of any subset of the parameters, where the subset can have infinitely many components as long as its cardinality grows sufficiently slowly. We develop Wald and score tests based on the asymptotic normality of the estimator, which permits testing of linear functions of the members of the subset. We examine the finite sample performance of the proposed tests by extensive simulation. Finally, the proposed method is successfully applied to the Alzheimer's Disease Neuroimaging Initiative study, which motivated this work initially.
Affiliation(s)
- Fei Jiang
- Department of Epidemiology and Biostatistics, The University of California, San Francisco
- Yeqing Zhou
- School of Mathematical Sciences, Tongji University
- Yanyuan Ma
- Department of Statistics, Pennsylvania State University
4
Fan J, Lou Z, Yu M. Are Latent Factor Regression and Sparse Regression Adequate? J Am Stat Assoc 2023. [DOI: 10.1080/01621459.2023.2169700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023]
Affiliation(s)
- Jianqing Fan
- Frederick L. Moore ’18 Professor of Finance, Professor of Statistics, and Professor of Operations Research and Financial Engineering at Princeton University
- Zhipeng Lou
- Department of Operations Research and Financial Engineering, Princeton University
- Mengxin Yu
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA
5
Li C, Shen X, Pan W. Inference for a Large Directed Acyclic Graph with Unspecified Interventions. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2023; 24:73. [PMID: 37701522 PMCID: PMC10497226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/14/2023]
Abstract
Statistical inference of directed relations given some unspecified interventions (i.e., the intervention targets are unknown) is challenging. In this article, we test hypothesized directed relations with unspecified interventions. First, we derive conditions to yield an identifiable model. Unlike classical inference, testing directed relations requires identifying the ancestors and relevant interventions of hypothesis-specific primary variables. To this end, we propose a peeling algorithm based on nodewise regressions to establish a topological order of primary variables. Moreover, we prove that the peeling algorithm yields a consistent estimator in low-order polynomial time. Second, we propose a likelihood ratio test integrated with a data perturbation scheme to account for the uncertainty of identifying ancestors and interventions. Also, we show that the distribution of a data perturbation test statistic converges to the target distribution. Numerical examples demonstrate the utility and effectiveness of the proposed methods, including an application to infer gene regulatory networks. The R implementation is available at https://github.com/chunlinli/intdag.
Affiliation(s)
- Chunlin Li
- School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
- Xiaotong Shen
- School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
- Wei Pan
- Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA
6
Yu X, Li D, Xue L. Fisher’s combined probability test for high-dimensional covariance matrices. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2126781] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
Affiliation(s)
- Xiufan Yu
- Department of Applied and Computational Mathematics and Statistics, University of Notre Dame
- Danning Li
- KLAS and School of Mathematics & Statistics, Northeast Normal University
- Lingzhou Xue
- Department of Statistics, Pennsylvania State University
7
Heterogeneous Overdispersed Count Data Regressions via Double-Penalized Estimations. MATHEMATICS 2022. [DOI: 10.3390/math10101700] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
Recently, the high-dimensional negative binomial regression (NBR) for count data has been widely used in many scientific fields. However, most studies assume that the dispersion parameter is constant, which may not hold in practice. This paper studies variable selection and dispersion estimation for heterogeneous NBR models, which model the dispersion parameter as a function. Specifically, we propose a double regression and apply a double ℓ1-penalty to both regressions. Under the restricted eigenvalue conditions, we prove the oracle inequalities for the lasso estimators of two partial regression coefficients for the first time, using concentration inequalities of empirical processes. Furthermore, the consistency and convergence rates of the estimators, derived from the oracle inequalities, provide theoretical guarantees for further statistical inference. Finally, both simulations and a real data analysis demonstrate that the new methods are effective.
8
Huang Y, Li C, Li R, Yang S. An overview of tests on high-dimensional means. J MULTIVARIATE ANAL 2022. [DOI: 10.1016/j.jmva.2021.104813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
9
Abstract
Multimodal data, where different types of data are collected from the same subjects, are fast emerging in a large variety of scientific applications. Factor analysis is commonly used in integrative analysis of multimodal data, and is particularly useful to overcome the curse of high dimensionality and high correlations. However, there is little work on statistical inference for factor analysis based supervised modeling of multimodal data. In this article, we consider an integrative linear regression model that is built upon the latent factors extracted from multimodal data. We address three important questions: how to infer the significance of one data modality given the other modalities in the model; how to infer the significance of a combination of variables from one modality or across different modalities; and how to quantify the contribution, measured by the goodness-of-fit, of one data modality given the others. When answering each question, we explicitly characterize both the benefit and the extra cost of factor analysis. Those questions, to our knowledge, have not yet been addressed despite wide use of factor analysis in integrative multimodal analysis, and our proposal bridges an important gap. We study the empirical performance of our methods through simulations, and further illustrate with a multimodal neuroimaging analysis.
10
Liang M, Choi YG, Ning Y, Smith MA, Zhao YQ. Estimation and inference on high-dimensional individualized treatment rule in observational data using split-and-pooled de-correlated score. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2022; 23:262. [PMID: 38098839 PMCID: PMC10720606] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2023]
Abstract
With the increasing adoption of electronic health records, there is an increasing interest in developing individualized treatment rules, which recommend treatments according to patients' characteristics, from large observational data. However, there is a lack of valid inference procedures for such rules developed from this type of data in the presence of high-dimensional covariates. In this work, we develop a penalized doubly robust method to estimate the optimal individualized treatment rule from high-dimensional data. We propose a split-and-pooled de-correlated score to construct hypothesis tests and confidence intervals. Our proposal adopts data splitting to conquer the slow convergence rate of nuisance parameter estimations, such as non-parametric methods for outcome regression or propensity models. We establish the limiting distributions of the split-and-pooled de-correlated score test and the corresponding one-step estimator in the high-dimensional setting. Simulation and real data analysis are conducted to demonstrate the superiority of the proposed method.
Affiliation(s)
- Muxuan Liang
- Department of Biostatistics, University of Florida, Gainesville, Florida 32611, USA
- Young-Geun Choi
- Department of Statistics, Sookmyung Women's University, Seoul 04310, Korea
- Yang Ning
- Department of Statistics and Data Science, Cornell University, Ithaca, New York 14853, USA
- Maureen A Smith
- Departments of Population Health and Family Medicine, University of Wisconsin-Madison, Madison, Wisconsin 53706, USA
- Ying-Qi Zhao
- Public Health Sciences Divisions, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, USA
11
Abstract
A central question in high-dimensional mediation analysis is to infer the significance of individual mediators. The main challenge is that the total number of potential paths that go through any mediator is super-exponential in the number of mediators. Most existing mediation inference solutions either explicitly impose that the mediators are conditionally independent given the exposure, or ignore any potential directed paths among the mediators. In this article, we propose a novel hypothesis testing procedure to evaluate individual mediation effects, while taking into account potential interactions among the mediators. Our proposal thus fills a crucial gap, and greatly extends the scope of existing mediation tests. Our key idea is to construct the test statistic using the logic of Boolean matrices, which enables us to establish the proper limiting distribution under the null hypothesis. We further employ screening, data splitting, and decorrelated estimation to reduce the bias and increase the power of the test. We show that our test can control both the size and false discovery rate asymptotically, and the power of the test approaches one, while allowing the number of mediators to diverge to infinity with the sample size. We demonstrate the efficacy of the method through simulations and a neuroimaging study of Alzheimer's disease. A Python implementation of the proposed procedure is available at https://github.com/callmespring/LOGAN.
Affiliation(s)
- Chengchun Shi
- London School of Economics and Political Science and University of California at Berkeley
- Lexin Li
- London School of Economics and Political Science and University of California at Berkeley
12
Nordman A, Friberg M, Forsell Y. Is There a Dose-Response Relationship between Acute Physical Activity and Sleep Length? A Longitudinal Study with Children and Adolescents Living in Sweden. CHILDREN-BASEL 2021; 8:children8090808. [PMID: 34572240 PMCID: PMC8471754 DOI: 10.3390/children8090808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2021] [Revised: 09/13/2021] [Accepted: 09/14/2021] [Indexed: 11/16/2022]
Abstract
A decline in physical activity (PA) and sleep in children and adolescents has been observed during the previous decades. PA could benefit sleep, but the findings are mixed. The aim of the present study was to examine if there is a dose-response relationship between time spent in acute moderate and vigorous physical activity (MVPA) and sleep length in children and adolescents. Additional aims were to examine if the sleep length is higher for children and adolescents who spend at least an average of 60 min in MVPA/day and to study differences between sex and school years. The study population consists of 262 participants in school year 5 (aged 11 years), 7 (aged 13 years), and 9 (aged 15 years). Accelerometers measured MVPA while sleep diaries measured sleep length. A linear regression and a longitudinal mixed-effects linear regression were conducted to study the primary aim. The secondary aims were studied with linear regressions. Included confounders were sex, school year, school stress, screen time, menstruation onset, family household economy, and health status. A stratified regression for sex and school year was conducted. The linear regression showed no statistically significant findings in the crude or adjusted model. The stratified linear regression found a significant positive association for girls but a negative association for school year 5. No associations were found in the longitudinal regression or when comparing sleep length for participants that did and did not spend an average of at least 60 min in MVPA/day. A dose-response relationship was found in the stratified linear regression, implying a possible weak association. The statistically non-significant differences between participants that did and did not spend an average of at least 60 min in MVPA/day imply that spending an average of at least 60 min in MVPA/day may not be associated with a higher mean sleep length.
Affiliation(s)
- Alexandra Nordman
- Department of Global Public Health, Karolinska Institutet, 17177 Solna, Sweden
- Marita Friberg
- Department of Living Conditions and Lifestyles, Public Health Agency of Sweden, 17165 Solna, Sweden
- Yvonne Forsell
- Department of Global Public Health, Karolinska Institutet, 17177 Solna, Sweden
- Correspondence: ; Tel.: +46-709-460-991
13
Tan F, Zhu L. Integrated conditional moment test and beyond: when the number of covariates is divergent. Biometrika 2021. [DOI: 10.1093/biomet/asab009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022] Open
Abstract
Summary
The classical integrated conditional moment test is a promising method for model checking and its basic idea has been applied to develop several variants. However, in diverging-dimension scenarios, the integrated conditional moment test may break down and has completely different limiting properties from the fixed-dimension case. Furthermore, the related wild bootstrap approximation can also be invalid. To extend this classical test to diverging dimension settings, we propose a projected adaptive-to-model version of the integrated conditional moment test. We study the asymptotic properties of the new test under both the null and alternative hypotheses to examine if it maintains significance level, and its sensitivity to the global and local alternatives that are distinct from the null at the rate $n^{-1/2}$. The corresponding wild bootstrap approximation can still work for the new test in diverging-dimension scenarios. We also derive the consistency and asymptotically linear representation of the least squares estimator when the parameter diverges at the fastest possible known rate in the literature. Numerical studies show that the new test can greatly enhance the performance of the integrated conditional moment test in high-dimensional cases. We also apply the test to a real dataset for illustration.
14
Wu C, Xu G, Shen X, Pan W. A Regularization-Based Adaptive Test for High-Dimensional Generalized Linear Models. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2020; 21:128. [PMID: 32802002 PMCID: PMC7425805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
In spite of its urgent importance in the era of big data, testing high-dimensional parameters in generalized linear models (GLMs) in the presence of high-dimensional nuisance parameters has been largely under-studied, especially with regard to constructing powerful tests for general (and unknown) alternatives. Most existing tests are powerful only against certain alternatives and may yield incorrect Type I error rates under high-dimensional nuisance parameter situations. In this paper, we propose the adaptive interaction sum of powered score (aiSPU) test in the framework of penalized regression with a non-convex penalty, called truncated Lasso penalty (TLP), which can maintain correct Type I error rates while yielding high statistical power across a wide range of alternatives. To calculate its p-values analytically, we derive its asymptotic null distribution. Via simulations, its superior finite-sample performance is demonstrated over several representative existing methods. In addition, we apply it and other representative tests to an Alzheimer's Disease Neuroimaging Initiative (ADNI) data set, detecting possible gene-gender interactions for Alzheimer's disease. We also make the R package "aispu", which implements the proposed test, available on GitHub.
Affiliation(s)
- Chong Wu
- Department of Statistics, Florida State University, FL, USA
- Gongjun Xu
- Department of Statistics, University of Michigan, MI, USA
- Xiaotong Shen
- School of Statistics, University of Minnesota, MN, USA
- Wei Pan
- Division of Biostatistics, University of Minnesota, MN, USA