1. Gong T, Dong Y, Chen H, Dong B, Li C. Markov Subsampling Based on Huber Criterion. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:2250-2262. [PMID: 35834451] [DOI: 10.1109/tnnls.2022.3189069]
Abstract
Subsampling is an important technique for tackling the computational challenges brought by big data. Many subsampling procedures fall within the framework of importance sampling, which assigns high sampling probabilities to the samples that appear to have big impacts. When the noise level is high, those sampling procedures tend to pick many outliers and thus often do not perform satisfactorily in practice. To tackle this issue, we design a new Markov subsampling strategy based on the Huber criterion (HMS) to construct an informative subset from the noisy full data; the constructed subset then serves as refined working data for efficient processing. HMS is built upon a Metropolis-Hastings procedure, where the inclusion probability of each sampling unit is determined using the Huber criterion to prevent overscoring the outliers. Under mild conditions, we show that the estimator based on the subsamples selected by HMS is statistically consistent with a sub-Gaussian deviation bound. The promising performance of HMS is demonstrated by extensive studies on large-scale simulations and real data examples.
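For orientation, a minimal sketch of the idea behind this entry: a Huber criterion scoring sampling units inside a Metropolis-Hastings-style accept/reject step. The threshold, temperature parameter, and acceptance rule below are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def huber_loss(r, delta=1.345):
    """Huber criterion: quadratic for small residuals, linear for large ones,
    so outliers are not over-scored the way squared error would score them."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

def acceptance_prob(loss_candidate, loss_current, tau=1.0):
    """Metropolis-Hastings-style ratio: a candidate unit with lower Huber
    loss is more likely to be included in the subsample."""
    return min(1.0, float(np.exp((loss_current - loss_candidate) / tau)))
```

The key contrast with squared-error scoring is the linear tail of the loss: an outlier's score grows only linearly in its residual, so it cannot dominate the inclusion probabilities.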
2. Liu S, Zhang Y, Golm GT, Liu G(F), Yang S. Robust analyses for longitudinal clinical trials with missing and non-normal continuous outcomes. Statistical Theory and Related Fields 2023; 8:1-14. [PMID: 38800501] [PMCID: PMC11115336] [DOI: 10.1080/24754269.2023.2261351]
Abstract
Missing data are unavoidable in longitudinal clinical trials, and outcomes are not always normally distributed. In the presence of outliers or heavy-tailed distributions, the conventional multiple imputation with the mixed model with repeated measures analysis of the average treatment effect (ATE), based on the multivariate normal assumption, may produce bias and power loss. Control-based imputation (CBI) is an approach for evaluating the treatment effect under the assumption that participants in both the test and control groups with missing outcome data have a similar outcome profile to those with an identical history in the control group. We develop a robust framework to handle non-normal outcomes under CBI without imposing any parametric modeling assumptions. Under the proposed framework, sequential weighted robust regressions are applied to protect the constructed imputation model against non-normality in the covariates and the response variables. Accompanied by the subsequent mean imputation and robust model analysis, the resulting ATE estimator has good theoretical properties in terms of consistency and asymptotic normality. Moreover, our proposed method guarantees the analysis-model robustness of the ATE estimation in the sense that its asymptotic results remain intact even when the analysis model is misspecified. The superiority of the proposed robust method is demonstrated by comprehensive simulation studies and an AIDS clinical trial data application.
Affiliation(s)
- Siyi Liu
- Department of Statistics, North Carolina State University, Raleigh, NC, USA
- Shu Yang
- Department of Statistics, North Carolina State University, Raleigh, NC, USA

3. Chen X, Meyer MC. Penalized unimodal spline density estimation with application to M-estimation. Journal of Statistical Planning and Inference 2023. [DOI: 10.1016/j.jspi.2022.10.005]
4. Robust variable selection and estimation via adaptive elastic net S-estimators for linear regression. Computational Statistics & Data Analysis 2023. [DOI: 10.1016/j.csda.2023.107730]
5. Wang S, Xie C, Kang X. A novel robust estimation for high-dimensional precision matrices. Statistics in Medicine 2023; 42:656-675. [PMID: 36563324] [DOI: 10.1002/sim.9636]
Abstract
In this paper we propose a new robust estimator of precision matrices for high-dimensional data when the number of variables is larger than the sample size. Different from the existing methods in the literature, the proposed model combines the technique of the modified Cholesky decomposition (MCD) with robust generalized M-estimators. The MCD reparameterizes a precision matrix and transforms its estimation into solving a series of linear regressions, in which commonly used robust techniques can be conveniently incorporated. Additionally, the proposed method adopts the model averaging idea to address the ordering issue in the MCD approach, resulting in an accurate estimation of precision matrices. Simulations and real data analysis are conducted to illustrate the merits of the proposed estimator.
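A sketch of the MCD reparameterization described in this abstract: each variable is regressed on its predecessors, the negated coefficients fill a unit lower-triangular T, and the residual variances fill a diagonal D, giving Omega = T' D^{-1} T. Plain least squares stands in here for the paper's robust generalized M-estimators, which would be swapped in regression by regression.

```python
import numpy as np

def mcd_precision(X):
    """Modified Cholesky decomposition sketch: precision matrix from a
    sequence of regressions of each centered column on its predecessors."""
    X = X - X.mean(axis=0)                 # center each column
    n, p = X.shape
    T = np.eye(p)
    d = np.empty(p)
    d[0] = X[:, 0].var()
    for j in range(1, p):
        phi, *_ = np.linalg.lstsq(X[:, :j], X[:, j], rcond=None)
        T[j, :j] = -phi
        d[j] = (X[:, j] - X[:, :j] @ phi).var()
    return T.T @ np.diag(1.0 / d) @ T
```

With ordinary least squares in every step this reproduces the inverse of the (biased) sample covariance exactly; the robustness in the paper comes from replacing each regression and residual-scale estimate with a robust counterpart.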
Affiliation(s)
- Shaoxin Wang
- School of Statistics and Data Science, Qufu Normal University, Qufu, China
- Chaoping Xie
- College of Economics and Management, Nanjing Agricultural University, Nanjing, China
- Xiaoning Kang
- Institute of Supply Chain Analytics and International Business College, Dongbei University of Finance and Economics, Dalian, China

6. Fan J, Lou Z, Yu M. Are Latent Factor Regression and Sparse Regression Adequate? Journal of the American Statistical Association 2023; 119:1076-1088. [PMID: 39268549] [PMCID: PMC11390100] [DOI: 10.1080/01621459.2023.2169700]
Abstract
We propose the Factor Augmented (sparse linear) Regression Model (FARM), which not only admits both latent factor regression and sparse linear regression as special cases but also bridges dimension reduction and sparse regression. We provide theoretical guarantees for the estimation of our model under sub-Gaussian and heavy-tailed noises (with bounded (1+ϑ)-th moment, for all ϑ > 0), respectively. In addition, existing works on supervised learning often assume that latent factor regression or sparse linear regression is the true underlying model without justifying its adequacy. To fill this important gap in high-dimensional inference, we also leverage our model as the alternative model to test the sufficiency of the latent factor regression and sparse linear regression models. To accomplish these goals, we propose the Factor-Adjusted deBiased Test (FabTest) and a two-stage ANOVA-type test, respectively. We also conduct large-scale numerical experiments, on both synthetic and FRED macroeconomics data, to corroborate the theoretical properties of our methods. Numerical results illustrate the robustness and effectiveness of our model relative to latent factor regression and sparse linear regression models.
Affiliation(s)
- Jianqing Fan
- Frederick L. Moore '18 Professor of Finance, Professor of Statistics, and Professor of Operations Research and Financial Engineering, Princeton University
- Zhipeng Lou
- Department of Operations Research and Financial Engineering, Princeton University
- Mengxin Yu
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA

7. Zhang X, Wang Y, Zhu L, Chen H, Li H, Wu L. Robust variable structure discovery based on tilted empirical risk minimization. Applied Intelligence 2023. [DOI: 10.1007/s10489-022-04409-z]
8. Yang S, Ling N. Robust projected principal component analysis for large-dimensional semiparametric factor modeling. Journal of Multivariate Analysis 2023. [DOI: 10.1016/j.jmva.2023.105155]
9. Sun Y, Fang X. Sparse calibration based on adaptive lasso penalty for computer models. Communications in Statistics - Simulation and Computation 2022. [DOI: 10.1080/03610918.2022.2155311]
Affiliation(s)
- Yang Sun
- School of Mathematical Sciences, Peking University, Beijing, China
- Xiangzhong Fang
- School of Mathematical Sciences, Peking University, Beijing, China

10. Pandhare SC, Ramanathan TV. The robust desparsified lasso and the focused information criterion for high-dimensional generalized linear models. Statistics 2022. [DOI: 10.1080/02331888.2022.2154769]
Affiliation(s)
- S. C. Pandhare
- Department of Statistics, Savitribai Phule Pune University, Pune, Maharashtra, India
- T. V. Ramanathan
- Department of Statistics, Savitribai Phule Pune University, Pune, Maharashtra, India

11. Wang Y, Karunamuni RJ. High-dimensional robust regression with L-loss functions. Computational Statistics & Data Analysis 2022. [DOI: 10.1016/j.csda.2022.107567]
12. Zuo Y. Non-asymptotic analysis and inference for an outlyingness induced winsorized mean. Statistical Papers 2022. [DOI: 10.1007/s00362-022-01353-5]
13. Speller J, Staerk C, Mayr A. Robust statistical boosting with quantile-based adaptive loss functions. The International Journal of Biostatistics 2022. [PMID: 35950232] [DOI: 10.1515/ijb-2021-0127]
Abstract
We combine robust loss functions with statistical boosting algorithms in an adaptive way to perform variable selection and predictive modelling for potentially high-dimensional biomedical data. To achieve robustness against outliers in the outcome variable (vertical outliers), we consider different composite robust loss functions together with base-learners for linear regression. For composite loss functions, such as the Huber loss and the Bisquare loss, a threshold parameter has to be specified that controls the robustness. In the context of boosting algorithms, we propose an approach that adapts the threshold parameter of composite robust losses in each iteration to the current sizes of residuals, based on a fixed quantile level. We compared the performance of our approach to classical M-regression, boosting with standard loss functions, and the lasso regarding prediction accuracy and variable selection in different simulated settings: the adaptive Huber and Bisquare losses led to a better performance when the outcome contained outliers or was affected by specific types of corruption. For non-corrupted data, our approach yielded a similar performance to boosting with the efficient L2 loss or the lasso. Also in the analysis of skewed KRT19 protein expression data based on gene expression measurements from human cancer cell lines (NCI-60 cell line panel), boosting with the new adaptive loss functions performed favourably compared to standard loss functions or competing robust approaches regarding prediction accuracy, and resulted in very sparse models.
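The quantile-based adaptation described above can be sketched as follows; the function names and the 80% quantile level are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def adaptive_delta(residuals, q=0.8):
    """Re-set the Huber threshold each boosting iteration at a fixed
    quantile of the current absolute residuals."""
    return np.quantile(np.abs(residuals), q)

def huber_negative_gradient(residuals, delta):
    """Working response for the next base-learner: the residual itself when
    small, a clipped value when it exceeds the threshold."""
    return np.where(np.abs(residuals) <= delta,
                    residuals, delta * np.sign(residuals))
```

Because the threshold is a residual quantile rather than a fixed constant, the loss stays on the scale of the current fit as boosting proceeds.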
Affiliation(s)
- Jan Speller
- Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany
- Christian Staerk
- Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany
- Andreas Mayr
- Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany

14.

15. Robust parameter estimation of regression models under weakened moment assumptions. Statistics & Probability Letters 2022. [DOI: 10.1016/j.spl.2022.109678]
16. Pan Y, Xu K, Wei S, Wang X, Liu Z. Efficient distributed optimization for large-scale high-dimensional sparse penalized Huber regression. Communications in Statistics - Simulation and Computation 2022. [DOI: 10.1080/03610918.2022.2098331]
Affiliation(s)
- Yingli Pan
- Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, Hubei University, Wuhan, China
- Kaidong Xu
- Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, Hubei University, Wuhan, China
- Sha Wei
- Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, Hubei University, Wuhan, China
- Xiaojuan Wang
- Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, Hubei University, Wuhan, China
- Zhan Liu
- Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, Hubei University, Wuhan, China

17. Liu Y, Pi P, Luo S. A semi-parametric approach to feature selection in high-dimensional linear regression models. Computational Statistics 2022. [DOI: 10.1007/s00180-022-01254-z]
18. Zhou J, Claeskens G. Automatic bias correction for testing in high dimensional linear models. Statistica Neerlandica 2022. [DOI: 10.1111/stan.12274]
Affiliation(s)
- Jing Zhou
- ORStat and Leuven Statistics Research Center, KU Leuven, Naamsestraat 69, Leuven, Belgium
- Gerda Claeskens
- ORStat and Leuven Statistics Research Center, KU Leuven, Naamsestraat 69, Leuven, Belgium

19. Luo J, Sun Q, Zhou WX. Distributed adaptive Huber regression. Computational Statistics & Data Analysis 2022. [DOI: 10.1016/j.csda.2021.107419]
20. Tan KM, Sun Q, Witten D. Sparse Reduced Rank Huber Regression in High Dimensions. Journal of the American Statistical Association 2022; 118:2383-2393. [PMID: 38283734] [PMCID: PMC10812838] [DOI: 10.1080/01621459.2022.2050243]
Abstract
We propose a sparse reduced rank Huber regression for analyzing large and complex high-dimensional data with heavy-tailed random noise. The proposed method is based on a convex relaxation of a rank- and sparsity-constrained nonconvex optimization problem, which is then solved using a block coordinate descent and an alternating direction method of multipliers algorithm. We establish nonasymptotic estimation error bounds under both Frobenius and nuclear norms in the high-dimensional setting. This is a major contribution over existing results in reduced rank regression, which mainly focus on rank selection and prediction consistency. Our theoretical results quantify the tradeoff between heavy-tailedness of the random noise and statistical bias. For random noise with bounded (1+δ)-th moment with δ ∈ (0, 1), the rate of convergence is a function of δ and is slower than the sub-Gaussian-type deviation bounds; for random noise with bounded second moment, we obtain a rate of convergence as if sub-Gaussian noise were assumed. We illustrate the performance of the proposed method via extensive numerical studies and a data application. Supplementary materials for this article are available online.
Affiliation(s)
- Kean Ming Tan
- Department of Statistics, University of Michigan, Ann Arbor, MI
- Qiang Sun
- Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
- Daniela Witten
- Departments of Statistics and Biostatistics, University of Washington, Seattle, WA

21. Statistical Methods with Applications in Data Mining: A Review of the Most Recent Works. Mathematics 2022. [DOI: 10.3390/math10060993]
Abstract
The importance of statistical methods for finding patterns and trends in otherwise unstructured and complex large sets of data has grown over the past decade, as the amount of data produced keeps growing exponentially and the knowledge obtained from understanding data makes it possible to take quick and informed decisions that save time and provide a competitive advantage. For this reason, we have seen considerable advances over the past few years in statistical methods for data mining. This paper is a comprehensive and systematic review of these recent developments in the area of data mining.
22. Liang W, Wu Y, Ma X. Robust sparse precision matrix estimation for high-dimensional compositional data. Statistics & Probability Letters 2022. [DOI: 10.1016/j.spl.2022.109379]
23. Maximum Correntropy Criterion with Distributed Method. Mathematics 2022. [DOI: 10.3390/math10030304]
Abstract
The Maximum Correntropy Criterion (MCC) has recently triggered enormous research activity in the engineering and machine learning communities, since it is robust when faced with heavy-tailed noise or outliers in practice. This work is interested in distributed MCC algorithms, based on a divide-and-conquer strategy, which can deal with big data efficiently. By establishing minimax optimal error bounds, our results show that the averaged output function of this distributed algorithm can achieve convergence rates comparable to those of the algorithm processing the total data on one single machine.
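A minimal sketch of the criterion and the divide-and-conquer averaging this entry describes; the kernel bandwidth and the plain averaging of local estimates are illustrative assumptions.

```python
import numpy as np

def correntropy(residuals, sigma=1.0):
    """Gaussian-kernel similarity between predictions and targets; maximizing
    it down-weights large residuals, hence the robustness to outliers."""
    return float(np.mean(np.exp(-residuals**2 / (2.0 * sigma**2))))

def distributed_average(local_estimates):
    """Divide-and-conquer step: average the outputs fitted on each machine."""
    return np.mean(np.asarray(local_estimates), axis=0)
```

Note the contrast with squared error: a single huge residual contributes essentially zero to the correntropy objective instead of dominating it.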
24. Mathieu T. Concentration study of M-estimators using the influence function. Electronic Journal of Statistics 2022. [DOI: 10.1214/22-ejs2030]
25. Ren S, Mai Q. The robust nearest shrunken centroids classifier for high-dimensional heavy-tailed data. Electronic Journal of Statistics 2022. [DOI: 10.1214/22-ejs2022]
Affiliation(s)
- Shaokang Ren
- Department of Statistics, Florida State University, Tallahassee, Florida 32306, U.S.A.
- Qing Mai
- Department of Statistics, Florida State University, Tallahassee, Florida 32306, U.S.A.

26. Pezoulas VC, Exarchos TP, Tzioufas AG, Fotiadis DI. Multiple additive regression trees with hybrid loss for classification tasks across heterogeneous clinical data in distributed environments: a case study. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2021; 2021:1670-1673. [PMID: 34891606] [DOI: 10.1109/embc46164.2021.9629912]
Abstract
Multiple additive regression trees (MART) have been widely used in the literature for various classification tasks. However, the overfitting effects of MART across heterogeneous and highly imbalanced big data structures within distributed environments have not yet been investigated. In this work, we utilize distributed MART with a hybrid loss to resolve overfitting effects during the training of disease classification models in a case study with 10 heterogeneous and distributed clinical datasets. Lexical and semantic analysis methods were utilized to match heterogeneous terminologies with 80% overlap. Data augmentation was used to resolve class imbalance, yielding virtual data with goodness of fit 0.01 and correlation difference 0.02. Our results highlight the favorable performance of the proposed distributed MART on the augmented data, with an average increase of 7.3% in accuracy, 6.8% in sensitivity, and 10.4% in specificity, for a specific loss function topology.
27. Madrid Padilla OH, Chatterjee S. Risk bounds for quantile trend filtering. Biometrika 2021. [DOI: 10.1093/biomet/asab045]
Abstract
We study quantile trend filtering, a recently proposed method for nonparametric quantile regression, with the goal of generalizing existing risk bounds for the usual trend-filtering estimators that perform mean regression. We study both the penalized and the constrained versions, of order $r \geqslant 1$, of univariate quantile trend filtering. Our results show that both the constrained and the penalized versions of order $r \geqslant 1$ attain the minimax rate up to logarithmic factors, when the $(r-1)$th discrete derivative of the true vector of quantiles belongs to the class of bounded-variation signals. Moreover, we show that if the true vector of quantiles is a discrete spline with a few polynomial pieces, then both versions attain a near-parametric rate of convergence. Corresponding results for the usual trend-filtering estimators are known to hold only when the errors are sub-Gaussian. In contrast, our risk bounds are shown to hold under minimal assumptions on the error variables. In particular, no moment assumptions are needed and our results hold under heavy-tailed errors. Our proof techniques are general, and thus can potentially be used to study other nonparametric quantile regression methods. To illustrate this generality, we employ our proof techniques to obtain new results for multivariate quantile total-variation denoising and high-dimensional quantile linear regression.
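The check (pinball) loss underlying quantile trend filtering can be written down directly; this is the standard textbook form, not code from the paper.

```python
import numpy as np

def pinball_loss(residuals, tau):
    """Quantile check loss at level tau: risk bounds for estimators built on
    it need no moment assumptions on the errors, unlike squared-error
    trend filtering."""
    r = np.asarray(residuals, dtype=float)
    return float(np.mean(np.where(r >= 0, tau * r, (tau - 1.0) * r)))
```

Because the loss grows only linearly in the residual, heavy-tailed errors do not blow up the empirical risk, which is what lets the paper drop all moment assumptions.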
Affiliation(s)
- Oscar Hernan Madrid Padilla
- Department of Statistics, University of California, Los Angeles, 520 Portola Plaza, Los Angeles, California 90095, U.S.A.
- Sabyasachi Chatterjee
- Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright St. M/C 374, Champaign, Illinois 61820, U.S.A.

28. Zhang Y, Bradic J. High-dimensional semi-supervised learning: in search of optimal inference of the mean. Biometrika 2021. [DOI: 10.1093/biomet/asab042]
Abstract
A fundamental challenge in semi-supervised learning lies in the observed data's disproportionate size when compared with the size of the data collected with missing outcomes. An implicit understanding is that the dataset with missing outcomes, being significantly larger, ought to improve estimation and inference. However, it is unclear to what extent this is correct. We illustrate one clear benefit: root-n inference of the outcome's mean is possible while only requiring a consistent estimation of the outcome, possibly at a rate slower than root-n. This is achieved by a novel k-fold cross-fitted, doubly robust estimator. We discuss both linear and nonlinear outcomes. Such an estimator is particularly suited for models that naturally do not admit root-n consistency, such as high-dimensional, nonparametric, or semiparametric models. We apply our methods to heterogeneous treatment effects.
Affiliation(s)
- Yuqian Zhang
- Department of Mathematics, University of California San Diego, 9500 Gilman Drive, La Jolla, California 92093-0112, U.S.A.
- Jelena Bradic
- Department of Mathematics, University of California San Diego, 9500 Gilman Drive, La Jolla, California 92093-0112, U.S.A.

29. Hu Z, Zhou Y, Tong T. Meta-Analyzing Multiple Omics Data With Robust Variable Selection. Frontiers in Genetics 2021; 12:656826. [PMID: 34290735] [PMCID: PMC8288516] [DOI: 10.3389/fgene.2021.656826]
Abstract
High-throughput omics data are becoming more and more popular in various areas of science. Given that many publicly available datasets address the same questions, researchers have applied meta-analysis to synthesize multiple datasets to achieve more reliable results for model estimation and prediction. Due to the high dimensionality of omics data, it is also desirable to incorporate variable selection into meta-analysis. Existing meta-analyzing variable selection methods are often sensitive to the presence of outliers, and may lead to missed detections of relevant covariates, especially for lasso-type penalties. In this paper, we develop a robust variable selection algorithm for meta-analyzing high-dimensional datasets based on logistic regression. We first search for an outlier-free subset from each dataset by borrowing information across the datasets, with repeated use of the least trimmed squares estimates for the logistic model, together with a hierarchical bi-level variable selection technique. We then add a reweighting step to further improve efficiency after obtaining a reliable non-outlier subset. Simulation studies and real data analysis show that our new method can provide more reliable results than the existing meta-analysis methods in the presence of outliers.
Affiliation(s)
- Zongliang Hu
- College of Mathematics and Statistics, Shenzhen University, Shenzhen, China
- Yan Zhou
- College of Mathematics and Statistics, Shenzhen University, Shenzhen, China
- Tiejun Tong
- Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong

30. Fan J, Wang W, Zhu Z. A shrinkage principle for heavy-tailed data: high-dimensional robust low-rank matrix recovery. Annals of Statistics 2021; 49:1239-1266. [PMID: 34556893] [PMCID: PMC8457508] [DOI: 10.1214/20-aos1980]
Abstract
This paper introduces a simple principle for robust statistical inference via appropriate shrinkage on the data. This widens the scope of high-dimensional techniques, reducing the distributional conditions from sub-exponential or sub-Gaussian to more relaxed bounded second or fourth moment. As an illustration of this principle, we focus on robust estimation of the low-rank matrix Θ* from the trace regression model Y = Tr(Θ*⊤ X) + ϵ. It encompasses four popular problems: sparse linear model, compressed sensing, matrix completion, and multi-task learning. We propose to apply the penalized least-squares approach to the appropriately truncated or shrunk data. Under only a bounded (2+δ)-th moment condition on the response, the proposed robust methodology yields an estimator that possesses the same statistical error rates as the previous literature with sub-Gaussian errors. For the sparse linear model and multi-task regression, we further allow the design to have only bounded fourth moment and obtain the same statistical rates. As a byproduct, we give a robust covariance estimator with a concentration inequality and optimal rate of convergence in terms of the spectral norm, when the samples only have bounded fourth moments. This result is of independent interest and importance. We reveal that under high dimensions, the sample covariance matrix is not optimal, whereas our proposed robust covariance estimator can achieve optimality. Extensive simulations are carried out to support the theories.
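The shrinkage principle itself is essentially one line: truncate the data before running the usual penalized least squares. The threshold choice below is an illustrative assumption; in the paper it is calibrated to the moment conditions and grows with the sample size.

```python
import numpy as np

def shrink(y, tau):
    """Winsorize each response at level tau; penalized least squares is then
    run on the shrunk data instead of the raw heavy-tailed data."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.minimum(np.abs(y), tau)
```

Values inside [-tau, tau] pass through unchanged, so for light-tailed data the procedure reduces to ordinary penalized least squares.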
Affiliation(s)
- Jianqing Fan
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544
- Weichen Wang
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544
- Ziwei Zhu
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544

31. He Y, Liu P, Zhang X, Zhou W. Robust covariance estimation for high-dimensional compositional data with application to microbial communities analysis. Statistics in Medicine 2021; 40:3499-3515. [PMID: 33840134] [DOI: 10.1002/sim.8979]
Abstract
Microbial communities analysis is drawing growing attention due to the rapid development of high-throughput sequencing techniques. The observed data have the following typical characteristics: they are high-dimensional, compositional (lying in a simplex), and may even be leptokurtic and highly skewed due to the existence of overly abundant taxa, which makes conventional correlation analysis infeasible for studying the co-occurrence and co-exclusion relationships between microbial taxa. In this article, we address the challenges of covariance estimation for this kind of data. Assuming the basis covariance matrix lies in a well-recognized class of sparse covariance matrices, we adopt a proxy matrix known in the literature as the centered log-ratio covariance matrix. We construct a median-of-means (MOM) estimator for the centered log-ratio covariance matrix and propose a thresholding procedure that is adaptive to the variability of individual entries. By imposing a much weaker finite fourth moment condition compared with the sub-Gaussianity condition in the literature, we derive the optimal rate of convergence under the spectral norm. In addition, we also provide theoretical guarantees on support recovery. The adaptive thresholding procedure of the MOM estimator is easy to implement and gains robustness when outliers or heavy-tailedness exist. Thorough simulation studies are conducted to show the advantages of the proposed procedure over some state-of-the-art methods. At last, we apply the proposed method to analyze a microbiome dataset from the human gut.
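The median-of-means building block can be sketched as follows; the block count and the random permutation are illustrative choices, not the paper's tuning.

```python
import numpy as np

def median_of_means(x, k=5, seed=0):
    """Split the sample into k blocks at random, average within blocks, and
    take the median of the block means; a few outliers can corrupt at most
    a few blocks, leaving the median intact."""
    x = np.asarray(x, dtype=float)
    idx = np.random.default_rng(seed).permutation(x.size)
    blocks = np.array_split(x[idx], k)
    return float(np.median([b.mean() for b in blocks]))
```

In the paper this scalar estimator is applied entrywise to the centered log-ratio covariance matrix before the adaptive thresholding step.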
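The pipeline the abstract describes — clr-transform the compositions, form a median-of-means covariance estimate, then threshold the entries adaptively — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the block count, the MAD-based entrywise scale, and the soft-threshold rule are all simplifying assumptions.

```python
import numpy as np

def clr(X):
    # centered log-ratio transform of strictly positive compositional rows
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)

def mom_covariance(Z, n_blocks=5, seed=None):
    # median-of-means covariance: split the sample into blocks, compute a
    # covariance estimate per block, take the entrywise median across blocks
    rng = np.random.default_rng(seed)
    n, p = Z.shape
    blocks = np.array_split(rng.permutation(n), n_blocks)
    Zc = Z - Z.mean(axis=0)  # full-sample centering (a simplification)
    C = np.stack([Zc[b].T @ Zc[b] / len(b) for b in blocks])
    S = np.median(C, axis=0)
    # crude entrywise scale (median absolute deviation across blocks),
    # standing in for the paper's entrywise variability estimates
    scale = np.median(np.abs(C - S), axis=0)
    return S, scale

def adaptive_threshold(S, scale, tau=1.0):
    # soft-threshold off-diagonal entries with entrywise levels tau * scale;
    # the diagonal is left untouched
    T = np.sign(S) * np.maximum(np.abs(S) - tau * scale, 0.0)
    np.fill_diagonal(T, np.diag(S))
    return T

# toy usage on simulated compositional data
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(8), size=400)  # 400 compositions in the simplex
S, scale = mom_covariance(clr(X), n_blocks=5, seed=0)
T = adaptive_threshold(S, scale, tau=1.0)
```

The entrywise median across blocks is what buys robustness: a few outlying samples can corrupt at most a few block estimates, which the median then discards.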
Affiliation(s)
- Yong He
- Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, Shandong, China
- Pengfei Liu
- School of Mathematics and Statistics and Research Institute of Mathematical Sciences, Jiangsu Normal University, Xuzhou, Jiangsu, China
- Wang Zhou
- Department of Statistics and Applied Probability, National University of Singapore, Singapore
|
32
|
Chen P, Jin X, Li X, Xu L. A generalized Catoni’s M-estimator under finite α-th moment assumption with α∈(1,2). Electron J Stat 2021. [DOI: 10.1214/21-ejs1911] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Affiliation(s)
- Peng Chen
- Department of Mathematics, College of Science, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu, China
- Xinghu Jin
- Department of Mathematics, Faculty of Science and Technology, University of Macau, Av. Padre Tomás Pereira, Taipa, Macau, China
- Xiang Li
- Department of Mathematics, Faculty of Science and Technology, University of Macau, Av. Padre Tomás Pereira, Taipa, Macau, China
- Lihu Xu
- Department of Mathematics, Faculty of Science and Technology, University of Macau, Av. Padre Tomás Pereira, Taipa, Macau, China
|
33
|
Affiliation(s)
- Xiaoou Pan
- Department of Mathematics, University of California, San Diego, La Jolla, CA 92093, USA
- Qiang Sun
- Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3, Canada
- Wen-Xin Zhou
- Department of Mathematics, University of California, San Diego, La Jolla, CA 92093, USA
|
34
|
Fan J, Ma C, Wang K. Comment on “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression”. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1837138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
Affiliation(s)
- Jianqing Fan
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ
- Cong Ma
- Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA
- Kaizheng Wang
- Department of Industrial Engineering and Operations Research, Columbia University, New York, NY
|
35
|
Ehsan MA, Shahirinia A, Zhang N, Oladunni T. Investigation of Data Size Variability in Wind Speed Prediction Using AI Algorithms. CYBERNETICS AND SYSTEMS 2020; 52:105-126. [PMID: 38500540 PMCID: PMC10947156 DOI: 10.1080/01969722.2020.1827796] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/20/2024]
Abstract
Electricity generation from burning fossil fuels is one of the major contributors to global warming. Renewable energy sources are a viable alternative for producing electrical energy and reducing emissions from the power industry. They have unlocked opportunities for consumers to produce electricity locally and use it on-site, which reduces dependency on centralized generation. Despite their widespread availability, one of the major challenges is to understand their characteristics in a more informative way. Wind energy is highly dependent on the intermittent wind speed profile. This paper proposes wind speed prediction, which simplifies wind farm planning and feasibility studies. Twelve artificial intelligence algorithms were used to predict wind speed from collected meteorological parameters. Model performances were compared to determine wind speed prediction accuracy for different sizes of data set. The results show that the most effective algorithm varies with the data size.
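The data-size study the abstract describes — fitting predictors on progressively larger training sets and comparing their accuracy on held-out data — can be mimicked with a toy experiment. The AR(1) synthetic series, the one-step linear predictor, and the persistence baseline below are illustrative assumptions, not the paper's twelve algorithms or its meteorological data.

```python
import numpy as np

def make_wind_series(n, seed=None):
    # synthetic AR(1) stand-in for an intermittent wind speed profile (m/s)
    rng = np.random.default_rng(seed)
    v = np.empty(n)
    v[0] = 6.0
    for t in range(1, n):
        v[t] = 0.9 * v[t - 1] + 0.6 + rng.normal(scale=1.0)
    return np.clip(v, 0.0, None)  # wind speed cannot be negative

def fit_linear(x, y):
    # least-squares fit of the one-step model y ~ a*x + b
    A = np.column_stack([x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

series = make_wind_series(2000, seed=0)
test_x, test_y = series[1500:-1], series[1501:]

# vary the training-set size and compare against a persistence baseline
for n_train in (50, 200, 1500):
    tr = series[:n_train]
    a, b = fit_linear(tr[:-1], tr[1:])
    print(n_train, rmse(test_y, a * test_x + b), rmse(test_y, test_x))
```

The point of the loop mirrors the paper's finding: which predictor looks best, and by how much, depends on how much training data it was given.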
Affiliation(s)
- M. A. Ehsan
- Department of Electrical and Computer Engineering, University of the District of Columbia, Washington, DC, USA
- Amir Shahirinia
- Department of Electrical and Computer Engineering, University of the District of Columbia, Washington, DC, USA
- Nian Zhang
- Department of Electrical and Computer Engineering, University of the District of Columbia, Washington, DC, USA
- Timothy Oladunni
- Department of Computer Science & Information Technology, University of the District of Columbia, Washington, DC, USA
|
36
|
Ke Y, Minsker S, Ren Z, Sun Q, Zhou WX. User-Friendly Covariance Estimation for Heavy-Tailed Distributions. Stat Sci 2019. [DOI: 10.1214/19-sts711] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|