1
Sherwood B, Price BS. On the Use of Minimum Penalties in Statistical Learning. J Comput Graph Stat 2023; 33:138-151. [PMID: 38706715] [PMCID: PMC11065433] [DOI: 10.1080/10618600.2023.2210174]
Abstract
Modern multivariate machine learning and statistical methodologies estimate parameters of interest while leveraging prior knowledge of the association between outcome variables. Methods that do allow for estimation of such relationships typically do so through an error covariance matrix in multivariate regression, which does not generalize to other types of models. In this article we propose the MinPen framework to simultaneously estimate the regression coefficients of the multivariate regression model and the relationships between outcome variables using common assumptions. The MinPen framework utilizes a novel penalty based on the minimum function to simultaneously detect and exploit relationships between responses. An iterative algorithm is proposed as a solution to the resulting non-convex optimization problem. Theoretical results, including high-dimensional convergence rates, model selection consistency, and a framework for post-selection inference, are provided. We extend the proposed MinPen framework to other exponential family loss functions, with a specific focus on multiple binomial responses. Tuning parameter selection is also addressed. Finally, simulations and two data examples are presented to show the finite-sample properties of this framework. Supplemental material providing proofs, additional simulations, code, and data sets is available online.
Affiliation(s)
- Bradley S. Price
- Management Information Systems Department, West Virginia University
2
Li N, Zhu W. A Bayesian approach for subgroup analysis. Biom J 2023; 65:e2200231. [PMID: 36908004] [DOI: 10.1002/bimj.202200231]
Abstract
Several penalization approaches have been developed to identify homogeneous subgroups based on a regression model with subject-specific intercepts in subgroup analysis. These methods typically apply concave penalty functions to pairwise comparisons of the intercepts, so that subjects with similar intercept values are assigned to the same group, a procedure closely analogous to the penalization approaches for variable selection. Since Bayesian methods are commonly used in variable selection, it is worth considering corresponding approaches to subgroup analysis in the Bayesian framework. In this paper, a Bayesian hierarchical model with appropriate prior structures is developed for the pairwise differences of intercepts in a regression model with subject-specific intercepts, which can automatically detect and identify homogeneous subgroups. A Gibbs sampling algorithm is also provided to select the hyperparameter and estimate the intercepts and covariate coefficients simultaneously; for large sample sizes, it is computationally more efficient for pairwise comparisons than the time-consuming parameter-estimation procedures of the penalization methods (e.g., the alternating direction method of multipliers). The effectiveness and usefulness of the proposed Bayesian method are evaluated through simulation studies and an analysis of the Cleveland Heart Disease dataset.
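The mechanism shared with the penalization approaches — assigning subjects to the same subgroup when their intercept estimates are close — can be sketched independently of the Bayesian machinery. The following is a minimal, hypothetical illustration (the function name, the tolerance, and the union-find merging are mine, not the paper's sampler): given intercept estimates, it merges any pair of subjects whose pairwise difference falls below a tolerance and returns subgroup labels.

```python
# Sketch: group subjects whose estimated intercepts differ by less than a
# threshold, via union-find over all pairwise comparisons. Illustrates the
# general pairwise-difference idea only, not the paper's Gibbs sampler.

def subgroups_from_intercepts(intercepts, tol):
    n = len(intercepts)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for i in range(n):
        for j in range(i + 1, n):
            if abs(intercepts[i] - intercepts[j]) < tol:
                union(i, j)

    # relabel roots as consecutive subgroup ids
    labels, ids = {}, []
    for i in range(n):
        r = find(i)
        labels.setdefault(r, len(labels))
        ids.append(labels[r])
    return ids

print(subgroups_from_intercepts([0.1, 0.12, 2.0, 2.05, -1.0], tol=0.2))
# → [0, 0, 1, 1, 2]
```

In the penalized (and Bayesian) methods this grouping emerges from shrinking pairwise intercept differences rather than from a hard threshold, but the output has the same form: a subgroup label per subject.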
Affiliation(s)
- Nan Li
- Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun, China
- Wensheng Zhu
- Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun, China
3
Huang X, Xu J, Zhou Y. Efficient algorithms for survival data with multiple outcomes using the frailty model. Stat Methods Med Res 2023; 32:118-132. [PMID: 36317365] [DOI: 10.1177/09622802221133554]
Abstract
Survival data with multiple outcomes are frequently encountered in biomedical investigations. An illustrative example comes from the Alzheimer's Disease Neuroimaging Initiative study, where cognitively normal subjects may clinically progress to mild cognitive impairment and/or Alzheimer's disease dementia. The transition time from normal cognition to mild cognitive impairment and that from mild cognitive impairment to Alzheimer's disease are expected to be correlated within subjects, and the dependence is often accommodated by frailty (random effects). Estimation in the frailty model unavoidably involves multiple integrations, which may be intractable and hence lead to severe computational challenges, especially in the presence of high-dimensional covariates. In this paper, we propose efficient minorization-maximization algorithms in the frailty model for survival data with multiple outcomes. The alternating direction method of multipliers is further incorporated for simultaneous variable selection and homogeneity pursuit via regularization and fusion. Extensive simulation studies are conducted to assess the performance of the proposed algorithms. An application to the Alzheimer's Disease Neuroimaging Initiative data is also provided to illustrate their practical utility.
Affiliation(s)
- Xifen Huang
- School of Mathematics, Yunnan Normal University, Kunming, China
- Jinfeng Xu
- Department of Biostatistics, City University of Hong Kong, Hong Kong
- Yunpeng Zhou
- Department of Statistics & Actuarial Science, University of Hong Kong, Hong Kong
4
Pan Y, Zhao X, Wei S, Liu Z. High-dimensional expectile regression incorporating graphical structure among predictors. J Stat Comput Simul 2022. [DOI: 10.1080/00949655.2022.2099861]
Affiliation(s)
- Yingli Pan
- Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, Hubei University, Wuhan, People's Republic of China
- Xiaoluo Zhao
- Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, Hubei University, Wuhan, People's Republic of China
- Sha Wei
- Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, Hubei University, Wuhan, People's Republic of China
- Zhan Liu
- Hubei Key Laboratory of Applied Mathematics, Faculty of Mathematics and Statistics, Hubei University, Wuhan, People's Republic of China
5
Li X, Wang Y, Ruiz R. A Survey on Sparse Learning Models for Feature Selection. IEEE Trans Cybern 2022; 52:1642-1660. [PMID: 32386172] [DOI: 10.1109/tcyb.2020.2982445]
Abstract
Feature selection is important in both machine learning and pattern recognition. Successfully selecting informative features can significantly increase learning accuracy and improve result comprehensibility. Various methods have been proposed to identify informative features from high-dimensional data by removing redundant and irrelevant features to improve classification accuracy. In this article, we systematically survey existing sparse learning models for feature selection from the perspectives of individual sparse feature selection and group sparse feature selection, and analyze the differences and connections among various sparse learning models. Promising research directions and topics on sparse learning models are analyzed.
6
Sürer Ö, Apley DW, Malthouse EC. Coefficient tree regression: fast, accurate and interpretable predictive modeling. Mach Learn 2021. [DOI: 10.1007/s10994-021-06091-7]
7
Weber M, Striaukas J, Schumacher M, Binder H. Regularized regression when covariates are linked on a network: the 3CoSE algorithm. J Appl Stat 2021; 50:535-554. [PMID: 36819080] [PMCID: PMC9930759] [DOI: 10.1080/02664763.2021.1982878]
Abstract
Covariates in regressions may be linked to each other on a network. Knowledge of the network structure can be incorporated into regularized regression settings via a network penalty term. However, when it is unknown whether the connection signs in the network are positive (connected covariates reinforce each other) or negative (connected covariates repress each other), the connection signs have to be estimated jointly with the covariate coefficients. This can be done with an algorithm that iterates a connection-sign estimation step and a covariate-coefficient estimation step. We develop such an algorithm, called 3CoSE, and present detailed simulation results and an application to forecasting event times. The algorithm performs well in a variety of settings. We also briefly describe the publicly available R package developed for this purpose.
Affiliation(s)
- Matthias Weber
- School of Finance, University of St. Gallen, St. Gallen, Switzerland
- Jonas Striaukas
- F.R.S.-FNRS, Université Catholique de Louvain, Louvain-la-Neuve, Belgium
- Martin Schumacher
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg im Breisgau, Germany
- Harald Binder
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg im Breisgau, Germany
8
Sürer Ö, Apley DW, Malthouse EC. Coefficient tree regression for generalized linear models. Stat Anal Data Min 2021. [DOI: 10.1002/sam.11534]
Affiliation(s)
- Özge Sürer
- Industrial Engineering and Management Sciences Department, Northwestern University, Evanston, Illinois, USA
- Daniel W. Apley
- Industrial Engineering and Management Sciences Department, Northwestern University, Evanston, Illinois, USA
- Edward C. Malthouse
- Industrial Engineering and Management Sciences Department, Northwestern University, Evanston, Illinois, USA
9
Wang W, Zhu Z. Group structure detection for a high-dimensional panel data model. Can J Stat 2021. [DOI: 10.1002/cjs.11646]
Affiliation(s)
- Wu Wang
- Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
- Zhongyi Zhu
- Department of Statistics, Fudan University, Shanghai, China
10
High-dimensional sign-constrained feature selection and grouping. Ann Inst Stat Math 2020. [DOI: 10.1007/s10463-020-00766-z]
11
Yue M, Huang L. A new approach of subgroup identification for high-dimensional longitudinal data. J Stat Comput Simul 2020. [DOI: 10.1080/00949655.2020.1764555]
Affiliation(s)
- Mu Yue
- Engineering Systems and Design (ESD), Singapore University of Technology and Design, Singapore, Singapore
- Lei Huang
- School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu, People's Republic of China
12
Chi EC, Gaines BR, Sun WW, Zhou H, Yang J. Provable Convex Co-clustering of Tensors. J Mach Learn Res 2020; 21:214. [PMID: 33312074] [PMCID: PMC7731944]
Abstract
Cluster analysis is a fundamental tool for pattern discovery in complex heterogeneous data. Prevalent clustering methods mainly focus on vector or matrix-variate data and are not applicable to general-order tensors, which arise frequently in modern scientific and business applications. Moreover, there is a gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of their non-convex formulations. In this work, we bridge this gap by developing a provable convex formulation of tensor co-clustering. Our convex co-clustering (CoCo) estimator enjoys stability guarantees, and its computational and storage costs are polynomial in the size of the data. We further establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising "blessing of dimensionality" phenomenon that does not exist in vector or matrix-variate cluster analysis. Our theoretical findings are supported by extensive simulation studies. Finally, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to improve advertising effectiveness.
Affiliation(s)
- Eric C Chi
- Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Brian R Gaines
- Advanced Analytics R&D, SAS Institute Inc., Cary, NC 27513, USA
- Will Wei Sun
- Krannert School of Management, Purdue University, West Lafayette, IN 47907, USA
- Hua Zhou
- Department of Biostatistics, University of California, Los Angeles, CA 90095, USA
- Jian Yang
- Advertising Sciences, Yahoo Research, Sunnyvale, CA 94089, USA
13
14
Li J, Yue M, Zhang W. Subgroup identification via homogeneity pursuit for dense longitudinal/spatial data. Stat Med 2019; 38:3256-3271. [PMID: 31066095] [DOI: 10.1002/sim.8192]
Abstract
In the clinical trial community, it is usually not easy to find a treatment that benefits all patients, since the reaction to treatment may differ substantially across patient subgroups. The heterogeneity of treatment effect plays an essential role in personalized medicine. To facilitate the development of tailored therapies and improve treatment efficacy, it is important to identify subgroups that exhibit different treatment effects. We consider a very general framework for subgroup identification via the homogeneity pursuit methods usually employed in econometric time series analysis. The change-point detection algorithm in our procedure is most suitable for analyzing dense longitudinal or spatial data, which are increasingly common in biomedical studies. We demonstrate that our proposed method is fast and accurate through extensive numerical studies. In particular, our method is illustrated by analyzing a diffusion tensor imaging data set.
Affiliation(s)
- Jialiang Li
- Department of Statistics and Applied Probability, National University of Singapore, Singapore; Duke-NUS Medical School, Singapore; Singapore Eye Research Institute, Singapore
- Mu Yue
- School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Wenyang Zhang
- Department of Mathematics, University of York, York, UK
15
Shi X, Xing F, Guo Z, Su H, Liu F, Yang L. Structured orthogonal matching pursuit for feature selection. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.12.030]
16
Tian S, Wang C, Wang B. Incorporating Pathway Information into Feature Selection towards Better Performed Gene Signatures. Biomed Res Int 2019; 2019:2497509. [PMID: 31073522] [PMCID: PMC6470448] [DOI: 10.1155/2019/2497509]
Abstract
To analyze gene expression data with sophisticated grouping structures and to extract hidden patterns from such data, feature selection is of critical importance. It is well known that genes do not function in isolation but rather work together within various metabolic, regulatory, and signaling pathways. If the biological knowledge contained within these pathways is taken into account, the resulting method is a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart, in which no biological knowledge is considered. In this article, pathway-based feature selection methods are first divided into three major categories, namely, pathway-level selection, bilevel selection, and pathway-guided gene selection. With bilevel selection methods regarded as a special case of the pathway-guided gene selection process, we discuss pathway-guided gene selection methods in detail, along with the importance of penalization in such methods. Finally, we point out potential uses of pathway-guided gene selection in one active research avenue, namely the analysis of longitudinal gene expression data. We believe this article provides valuable insights for computational biologists and biostatisticians so that they can make biology more computable.
Affiliation(s)
- Suyan Tian
- Division of Clinical Research, The First Hospital of Jilin University, 71 Xinmin Street, Changchun, Jilin 130021, China
- Chi Wang
- Department of Biostatistics, Markey Cancer Center, The University of Kentucky, 800 Rose St., Lexington, KY 40536, USA
- Bing Wang
- School of Life Science, Jilin University, 2699 Qianjin Street, Changchun, Jilin 130012, China
17
Liu J, Yu G, Liu Y. Graph-based sparse linear discriminant analysis for high-dimensional classification. J Multivar Anal 2018; 171:250-269. [PMID: 31983784] [DOI: 10.1016/j.jmva.2018.12.007]
Abstract
Linear discriminant analysis (LDA) is a well-known classification technique that has enjoyed great success in practical applications. Despite its effectiveness for traditional low-dimensional problems, extensions of LDA are necessary in order to classify high-dimensional data. Many variants of LDA have been proposed in the literature. However, most of these methods do not fully incorporate the structure information among predictors when such information is available. In this paper, we introduce a new high-dimensional LDA technique, namely graph-based sparse LDA (GSLDA), that utilizes the graph structure among the features. In particular, we use the regularized regression formulation for penalized LDA techniques, and propose to impose a structure-based sparse penalty on the discriminant vector β. The graph structure can be either given or estimated from the training data. Moreover, we explore the relationship between the within-class feature structure and the overall feature structure. Based on this relationship, we further propose a variant of our proposed GSLDA to effectively utilize unlabeled data, which can be abundant in the semi-supervised learning setting. With the new regularization, we can obtain a sparse estimate of β and more accurate and interpretable classifiers than many existing methods. Both the selection consistency of the β estimate and the convergence rate of the classifier are established, and the resulting classifier has an asymptotic Bayes error rate. Finally, we demonstrate the competitive performance of the proposed GSLDA on both simulated and real data studies.
Affiliation(s)
- Jianyu Liu
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599, USA
- Guan Yu
- Department of Biostatistics, University at Buffalo, Buffalo, NY 14214, USA
- Yufeng Liu
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599, USA; Department of Genetics, Department of Biostatistics, and Carolina Center for Genome Sciences, University of North Carolina, Chapel Hill, NC 27599, USA
18
19
Gui J, Sun Z, Ji S, Tao D, Tan T. Feature Selection Based on Structured Sparsity: A Comprehensive Study. IEEE Trans Neural Netw Learn Syst 2017; 28:1490-1507. [PMID: 28287983] [DOI: 10.1109/tnnls.2016.2551724]
Abstract
Feature selection (FS) is an important component of many pattern recognition tasks. In these tasks, one is often confronted with very high-dimensional data. FS algorithms are designed to identify the relevant feature subset from the original features, which can facilitate subsequent analysis, such as clustering and classification. Structured sparsity-inducing feature selection (SSFS) methods have been widely studied in the last few years, and a number of algorithms have been proposed. However, there is no comprehensive study concerning the connections between different SSFS methods, and how they have evolved. In this paper, we attempt to provide a survey on various SSFS methods, including their motivations and mathematical representations. We then explore the relationship among different formulations and propose a taxonomy to elucidate their evolution. We group the existing SSFS methods into two categories, i.e., vector-based feature selection (feature selection based on the lasso) and matrix-based feature selection (feature selection based on the ℓ_{r,p}-norm). Furthermore, FS has been combined with other machine learning algorithms for specific applications, such as multitask learning, multilabel learning, multiview learning, classification, and clustering. This paper not only compares the differences and commonalities of these methods based on regression and regularization strategies, but also provides useful guidelines for practitioners in related fields on how to perform feature selection.
20
Affiliation(s)
- Yunzhang Zhu
- Department of Statistics, The Ohio State University, Columbus, Ohio
21
Abstract
With the abundance of high-dimensional data in various disciplines, sparse regularized techniques are very popular these days. In this paper, we make use of the structure information among predictors to improve sparse regression models. Typically, such structure information can be modeled by the connectivity of an undirected graph using all predictors as nodes of the graph. Most existing methods use this undirected graph edge-by-edge to encourage the regression coefficients of corresponding connected predictors to be similar. However, such methods do not directly utilize the neighborhood information of the graph. Furthermore, the more edges there are in the predictor graph, the more complicated the corresponding regularization term becomes. In this paper, we incorporate the graph information node-by-node, instead of edge-by-edge as in most existing methods. Our proposed method is very general, and it includes adaptive Lasso, group Lasso, and ridge regression as special cases. Both theoretical and numerical studies demonstrate the effectiveness of the proposed method for simultaneous estimation, prediction, and model selection.
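As a loose sketch of the node-by-node idea (hypothetical; not necessarily the authors' exact penalty), one can penalize each coefficient toward the mean of its graph neighbors' coefficients, so that a node's whole neighborhood, rather than each individual edge, drives the shrinkage; an isolated node reduces to plain ridge-style shrinkage toward zero:

```python
# Sketch: a node-wise graph penalty that shrinks each coefficient toward the
# mean of its neighbors' coefficients. Hypothetical illustration of using
# neighborhood (rather than edge-by-edge) information.

def nodewise_penalty(beta, neighbors, lam):
    """neighbors[i] is the list of graph neighbors of predictor i."""
    total = 0.0
    for i, nbrs in enumerate(neighbors):
        if nbrs:
            mean_nbr = sum(beta[j] for j in nbrs) / len(nbrs)
        else:
            mean_nbr = 0.0  # isolated node: shrink toward zero, as in ridge
        total += (beta[i] - mean_nbr) ** 2
    return lam * total

# Two connected predictors with equal coefficients incur no penalty;
# the isolated third predictor contributes lam * beta^2 = 2.0 * 0.25.
print(nodewise_penalty([1.0, 1.0, 0.5], [[1], [0], []], lam=2.0))
# → 0.5
```

This term would be added to a squared-error loss and minimized jointly over beta; the point of the sketch is only that the regularizer is indexed by nodes, so its complexity does not grow with the number of edges.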
Affiliation(s)
- Guan Yu
- Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, NC 27599
- Yufeng Liu
- Department of Statistics and Operations Research, Carolina Center for Genome Science, Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599
22
23
Abstract
This paper explores the homogeneity of coefficients in high-dimensional regression, which extends the sparsity concept and is more general and suitable for many applications. Homogeneity arises when regression coefficients corresponding to neighboring geographical regions or a similar cluster of covariates are expected to be approximately the same. Sparsity corresponds to a special case of homogeneity with a large cluster at the known atom zero. In this article, we propose a new method, called clustering algorithm in regression via data-driven segmentation (CARDS), to explore homogeneity. New mathematical results are provided on the gain that can be achieved by exploring homogeneity. Statistical properties of two versions of CARDS are analyzed. In particular, the asymptotic normality of our proposed CARDS estimator is established, which reveals better estimation accuracy for homogeneous parameters than that without homogeneity exploration. When our methods are combined with sparsity exploration, further efficiency can be achieved beyond the exploration of sparsity alone. This provides additional insights into the power of exploring low-dimensional structures in high-dimensional regression: homogeneity and sparsity. Our results also shed light on the properties of the fused Lasso. The newly developed method is further illustrated by simulation studies and applications to real data. Supplementary materials for this article are available online.
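The homogeneity idea — many coefficients sharing a small number of common values — can be illustrated with a deliberately simplified, hypothetical segmentation step (not the actual CARDS estimator): order preliminary estimates, cut wherever the gap between adjacent ordered values exceeds a threshold, and fuse each resulting segment to its mean.

```python
# Sketch of data-driven segmentation for homogeneity pursuit (hypothetical,
# simplified; not the actual CARDS estimator). Assumes at least one coefficient.

def fuse_by_segmentation(beta, delta):
    order = sorted(range(len(beta)), key=lambda i: beta[i])
    seg, segments = [order[0]], []
    for i in order[1:]:
        if beta[i] - beta[seg[-1]] > delta:  # large gap: start a new cluster
            segments.append(seg)
            seg = [i]
        else:
            seg.append(i)
    segments.append(seg)

    fused = [0.0] * len(beta)
    for seg in segments:
        m = sum(beta[i] for i in seg) / len(seg)  # fuse segment to its mean
        for i in seg:
            fused[i] = m
    return fused

print(fuse_by_segmentation([1.0, 1.25, 3.0, 0.75, 3.5], delta=0.5))
# → [1.0, 1.0, 3.25, 1.0, 3.25]
```

Averaging within a cluster is where the estimation gain comes from: each fused value is estimated from all observations in its cluster rather than from a single coefficient.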
Affiliation(s)
- Tracy Ke
- Department of Operations Research and Financial Engineering, Princeton University
- Jianqing Fan
- Department of Operations Research and Financial Engineering, Princeton University
- Yichao Wu
- Department of Statistics, North Carolina State University
24
Abstract
Gaussian graphical models are useful for analyzing and visualizing conditional dependence relationships between interacting units. Motivated by network analysis under different experimental conditions, such as gene networks for disparate cancer subtypes, we model structural changes over multiple networks with possible heterogeneities. In particular, we estimate multiple precision matrices describing dependencies among interacting units through maximum penalized likelihood. Of particular interest are homogeneous groups of similar entries across these matrices and their zero entries, referred to as clustering and sparseness structures, respectively. A non-convex method is proposed to seek a sparse representation for each matrix and identify clusters of the entries across the matrices. Computationally, we develop an efficient method based on difference-of-convex programming, the augmented Lagrangian method, and block-wise coordinate descent, which is scalable to hundreds of graphs with thousands of nodes through a simple necessary and sufficient partition rule that divides nodes into smaller disjoint subproblems, excluding zero-coefficient nodes, for arbitrary graphs with convex relaxation. Theoretically, a finite-sample error bound is derived for the proposed method to reconstruct the clustering and sparseness structures. This leads to consistent reconstruction of these two structures simultaneously, permitting the number of unknown parameters to be exponential in the sample size, and yielding the optimal performance of the oracle estimator as if the true structures were given a priori. Simulation studies suggest that the method enjoys the benefit of pursuing these two disparate kinds of structures and compares favorably against its convex counterpart in the accuracy of structure pursuit and parameter estimation.
Affiliation(s)
- Yunzhang Zhu
- School of Statistics, University of Minnesota, Minneapolis, MN 55455
- Xiaotong Shen
- School of Statistics, University of Minnesota, Minneapolis, MN 55455
- Wei Pan
- Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455
25
Maldonado S, Weber R, Famili F. Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2014.07.015]
26
Kim S, Pan W, Shen X. Penalized regression approaches to testing for quantitative trait-rare variant association. Front Genet 2014; 5:121. [PMID: 24860593] [PMCID: PMC4026747] [DOI: 10.3389/fgene.2014.00121]
Abstract
In statistical data analysis, penalized regression is considered an attractive approach for its ability to perform simultaneous variable selection and parameter estimation. Although penalized regression methods have shown many advantages in variable selection and outcome prediction over other approaches for high-dimensional data, there is a relative paucity of literature on their applications to hypothesis testing, e.g., in genetic association analysis. In this study, we apply several new penalized regression methods with a novel penalty, called the truncated L1-penalty (TLP) (Shen et al., 2012), for either variable selection, or both variable selection and parameter grouping, in a data-adaptive way to test for association between a quantitative trait and a group of rare variants. The performance of the new methods is compared with some existing tests, including some recently proposed global tests and penalized regression-based methods, via simulations and an application to the real sequence data of Genetic Analysis Workshop 17 (GAW17). Although our proposed penalized methods can improve over some existing penalized methods, often they do not outperform some existing global association tests. Some possible problems with utilizing penalized regression methods in genetic hypothesis testing are discussed. Given the capability of penalized regression in selecting causal variants and its sometimes promising performance, further studies are warranted.
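The TLP building block itself is simple to state: on a single coefficient b it is J_tau(b) = min(|b|, tau), behaving like the lasso near zero but capping the penalty once |b| exceeds tau, which reduces bias on large effects. A minimal sketch (function names are mine; the paper applies TLP to coefficients, and to pairwise differences for grouping, inside a full regression fit):

```python
# Sketch: the truncated L1-penalty (TLP) of Shen et al. (2012) on a single
# coefficient is min(|b|, tau). Illustrative only; not the full test procedure.

def tlp(b, tau):
    return min(abs(b), tau)

def tlp_penalty(beta, tau, lam):
    # total penalty over a coefficient vector, scaled by lam
    return lam * sum(tlp(b, tau) for b in beta)

# Small coefficients are penalized like L1; large ones are capped at tau.
print([tlp(b, 1.0) for b in (0.2, -0.5, 3.0)])       # → [0.2, 0.5, 1.0]
print(round(tlp_penalty([0.2, -0.5, 3.0], 1.0, 2.0), 10))  # → 3.4
```

The cap is what makes the penalty non-convex, and it is also what allows data-adaptive grouping when TLP is applied to differences b_i - b_j: once two coefficients are far apart, increasing their gap costs nothing more.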
Affiliation(s)
- Sunkyung Kim
- Division of Biostatistics, School of Public Health, University of Minnesota Minneapolis, MN, USA
| | - Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota Minneapolis, MN, USA
| | - Xiaotong Shen
- School of Statistics, University of Minnesota Minneapolis, MN, USA
| |
Collapse
|
27
|
Ye J, Liu J. Sparse Methods for Biomedical Data. SIGKDD EXPLORATIONS : NEWSLETTER OF THE SPECIAL INTEREST GROUP (SIG) ON KNOWLEDGE DISCOVERY & DATA MINING 2012; 14:4-15. [PMID: 24076585 PMCID: PMC3783968 DOI: 10.1145/2408736.2408739] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
Abstract
Following recent technological revolutions, the investigation of massive biomedical data of growing scale, diversity, and complexity has taken center stage in modern data analysis. Although complex, the underlying representations of many biomedical data are often sparse. For example, for a disease such as leukemia, even though humans have tens of thousands of genes, only a few genes are relevant to the disease; a gene network is sparse since a regulatory pathway involves only a small number of genes; many biomedical signals are sparse or compressible in the sense that they have concise representations when expressed in a proper basis. Therefore, finding sparse representations is fundamentally important for scientific discovery. Sparse methods based on the ℓ1 norm have attracted a great amount of research effort in the past decade due to the norm's sparsity-inducing property, convenient convexity, and strong theoretical guarantees. They have achieved great success in applications such as biomarker selection, biological network construction, and magnetic resonance imaging. In this paper, we review state-of-the-art sparse methods and their applications to biomedical data.
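The ℓ1 machinery these methods share can be sketched with the soft-thresholding operator and proximal gradient descent (ISTA); the data, noise level, and λ below are illustrative, not from the paper:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the l1 norm: shrinks toward zero and
    sets small entries exactly to zero (the sparsity-inducing step)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """ISTA for min_b 0.5*||y - X b||^2 + lam*||b||_1."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)
        b = soft_threshold(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
true_b = np.zeros(20)
true_b[:3] = [3.0, -2.0, 1.5]              # only 3 of 20 features are relevant
y = X @ true_b + 0.1 * rng.standard_normal(100)
b_hat = ista(X, y, lam=5.0)
print(np.nonzero(b_hat)[0])                # the l1 penalty zeroes out the rest
```

This is the "few genes are relevant" setting in miniature: the convexity of the ℓ1 penalty makes the problem tractable, and the thresholding step performs the variable selection.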
Collapse
Affiliation(s)
- Jieping Ye
- Arizona State University Tempe, AZ 85287
| | - Jun Liu
- Siemens Corporate Research Princeton, NJ 08540
| |
Collapse
|
28
|
Yang S, Yuan L, Lai YC, Shen X, Wonka P, Ye J. Feature Grouping and Selection Over an Undirected Graph. KDD : PROCEEDINGS. INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING 2012:922-930. [PMID: 24014201 PMCID: PMC3763852 DOI: 10.1145/2339530.2339675] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
Abstract
High-dimensional regression/classification continues to be an important and challenging problem, especially when features are highly correlated. Feature selection, combined with additional structural information on the features, has been considered promising for improving regression/classification performance. Graph-guided fused lasso (GFlasso) has recently been proposed to facilitate feature selection and graph-structure exploitation when features exhibit certain graph structures. However, the GFlasso formulation relies on pairwise sample correlations to perform feature grouping, which can introduce additional estimation bias. In this paper, we propose three new feature grouping and selection methods to resolve this issue. The first method employs a convex function to penalize the pairwise l∞ norm of connected regression/classification coefficients, achieving simultaneous feature grouping and selection. The second method improves on the first by using a non-convex function to reduce the estimation bias. The third extends the second by using a truncated l1 regularization to further reduce the estimation bias. The proposed methods combine feature grouping and feature selection to enhance estimation accuracy. We employ the alternating direction method of multipliers (ADMM) and difference-of-convex (DC) programming to solve the proposed formulations. Our experimental results on synthetic data and two real datasets demonstrate the effectiveness of the proposed methods.
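A toy sketch of the penalty term in the first (convex) method, summing max(|b_i|, |b_j|) over the edges of a feature graph; the edge set and coefficient vectors here are illustrative, and the full formulation of course adds a loss term and tuning parameters:

```python
import numpy as np

def pairwise_linf_penalty(beta, edges):
    """For each edge (i, j) of the feature graph, penalize
    max(|beta_i|, |beta_j|). Connected coefficients are pushed toward
    equal magnitude (grouping) and toward zero jointly (selection)."""
    return sum(max(abs(beta[i]), abs(beta[j])) for i, j in edges)

edges = [(0, 1), (1, 2)]                   # a path graph over three features
equal = np.array([1.0, 1.0, 1.0])
unequal = np.array([2.0, 1.0, 0.0])        # same l1 norm as `equal`
print(pairwise_linf_penalty(equal, edges))    # → 2.0
print(pairwise_linf_penalty(unequal, edges))  # → 3.0
```

Both vectors have the same l1 norm, yet the unequal one pays more: unlike a plain lasso penalty, the pairwise l∞ term favors solutions where connected coefficients share a common magnitude, which is the grouping effect the paper exploits.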
Collapse
Affiliation(s)
- Sen Yang
- Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA
| | - Lei Yuan
- Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA
- Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA
| | - Ying-Cheng Lai
- Electrical Engineering, Arizona State University, Tempe, AZ 85287, USA
| | - Xiaotong Shen
- School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
| | - Peter Wonka
- Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA
| | - Jieping Ye
- Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA
- Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA
| |
Collapse
|