1. Buch G, Schulz A, Schmidtmann I, Strauch K, Wild PS. Interpretability of bi-level variable selection methods. Biom J 2024; 66:e2300063. [PMID: 38519877 DOI: 10.1002/bimj.202300063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Revised: 01/31/2024] [Accepted: 02/07/2024] [Indexed: 03/25/2024]
Abstract
Variable selection is usually performed to increase interpretability, as sparser models are easier to understand than full models. However, a focus on sparsity is not always suitable, for example, when features are related due to contextual similarities or high correlations. Here, it may be more appropriate to identify groups and their predictive members, a task that can be accomplished with bi-level selection procedures. To investigate whether such techniques lead to increased interpretability, group exponential LASSO (GEL), sparse group LASSO (SGL), composite minimax concave penalty (cMCP), and, as a reference method, the least absolute shrinkage and selection operator (LASSO) were used to select predictors in time-to-event, regression, and classification tasks in bootstrap samples from a cohort of 1001 patients. Different groupings based on prior knowledge, correlation structure, and random assignment were compared in terms of selection relevance, group consistency, and collinearity tolerance. The results show that bi-level selection methods are superior to LASSO in all criteria. The cMCP demonstrated superiority in selection relevance, while SGL performed best in group consistency. GEL showed the best all-round performance: the approach jointly selected correlated and content-related predictors while maintaining high selection relevance. This method seems recommendable when variables are grouped and interpretation is of primary interest.
Affiliation(s)
- Gregor Buch
  - Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  - German Center for Cardiovascular Research (DZHK), Mainz, Germany
- Andreas Schulz
  - Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
- Irene Schmidtmann
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
- Konstantin Strauch
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
- Philipp S Wild
  - Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  - German Center for Cardiovascular Research (DZHK), Mainz, Germany
  - Clinical Epidemiology and Systems Medicine, Center for Thrombosis and Hemostasis, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  - Institute of Molecular Biology (IMB), Mainz, Germany
2. Buch G, Schulz A, Schmidtmann I, Strauch K, Wild PS. A systematic review and evaluation of statistical methods for group variable selection. Stat Med 2023; 42:331-352. [PMID: 36546512 DOI: 10.1002/sim.9620] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2021] [Revised: 10/27/2022] [Accepted: 11/22/2022] [Indexed: 12/24/2022]
Abstract
This review condenses the knowledge on variable selection methods implemented in R and appropriate for datasets with grouped features. The focus is on regularized regressions identified through a systematic review of the literature, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. A total of 14 methods are discussed, most of which use penalty terms to perform group variable selection. Depending on how the methods account for the group structure, they can be classified into knowledge-driven and data-driven approaches. The former comprise group-level and bi-level selection methods, while two-step approaches and collinearity-tolerant methods constitute the latter. The identified methods are briefly explained and their performance compared in a simulation study. This comparison demonstrated that group-level selection methods, such as the group minimax concave penalty, are superior to other methods in selecting relevant variable groups but are inferior in identifying important individual variables in scenarios where not all variables in the groups are predictive. The latter task is better achieved by bi-level selection methods such as group bridge. Two-step and collinearity-tolerant approaches, such as elastic net and ordered homogeneity pursuit least absolute shrinkage and selection operator, are inferior to knowledge-driven methods but provide results without requiring prior knowledge. Possible applications in proteomics are considered, leading to suggestions on which method to use depending on existing prior knowledge and research question.
Affiliation(s)
- Gregor Buch
  - Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  - German Center for Cardiovascular Research (DZHK), partner site Rhine-Main, Mainz, Germany
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
- Andreas Schulz
  - Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
- Irene Schmidtmann
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
- Konstantin Strauch
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
- Philipp S Wild
  - Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  - German Center for Cardiovascular Research (DZHK), partner site Rhine-Main, Mainz, Germany
  - Clinical Epidemiology and Systems Medicine, Center for Thrombosis and Hemostasis, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  - Institute of Molecular Biology (IMB), Mainz, Germany
3. Ouhourane M, Yang Y, Benedet AL, Oualkacha K. Group penalized quantile regression. STAT METHOD APPL-GER 2022. [DOI: 10.1007/s10260-021-00580-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
4. Group variable selection via ℓp,0 regularization and application to optimal scoring. Neural Netw 2019; 118:220-234. [DOI: 10.1016/j.neunet.2019.05.011] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/28/2018] [Revised: 05/12/2019] [Accepted: 05/19/2019] [Indexed: 11/22/2022]
5. Xiu Y, Shen W, Wang Z, Liu S, Wang J. Multiple graph regularized graph transduction via greedy gradient Max-Cut. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2017.09.054] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 11/29/2022]
6. Fu Z, Parikh CR, Zhou B. Penalized variable selection in competing risks regression. LIFETIME DATA ANALYSIS 2017; 23:353-376. [PMID: 27016934 DOI: 10.1007/s10985-016-9362-3] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Received: 05/19/2015] [Accepted: 03/12/2016] [Indexed: 06/05/2023]
Abstract
Penalized variable selection methods have been extensively studied for standard time-to-event data. Such methods cannot be directly applied when subjects are at risk of multiple mutually exclusive events, known as competing risks. The proportional subdistribution hazard (PSH) model proposed by Fine and Gray (J Am Stat Assoc 94:496-509, 1999) has become a popular semi-parametric model for time-to-event data with competing risks. It allows for direct assessment of covariate effects on the cumulative incidence function. In this paper, we propose a general penalized variable selection strategy that simultaneously handles variable selection and parameter estimation in the PSH model. We rigorously establish the asymptotic properties of the proposed penalized estimators and modify the coordinate descent algorithm for implementation. Simulation studies are conducted to demonstrate the good performance of the proposed method. Data from deceased donor kidney transplants from the United Network of Organ Sharing illustrate the utility of the proposed method.
Affiliation(s)
- Zhixuan Fu
  - Biostatistics Department, Yale University, 60 College Street, New Haven, CT, 06510, USA
- Chirag R Parikh
  - Section of Nephrology, Department of Internal Medicine, Yale University, 60 Temple Street, Suite 6C, New Haven, CT, 06510, USA
- Bingqing Zhou
  - Biostatistics Department, Yale University, 60 College Street, New Haven, CT, 06510, USA
  - Novartis AG, 1 Health Plaza, East Hanover, NJ, USA
7. Lee S, Pawitan Y, Lee Y. A random-effect model approach for group variable selection. Comput Stat Data Anal 2015. [DOI: 10.1016/j.csda.2015.02.020] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/16/2022]
10. Lin D, Zhang J, Li J, Calhoun VD, Deng HW, Wang YP. Group sparse canonical correlation analysis for genomic data integration. BMC Bioinformatics 2013; 14:245. [PMID: 23937249 PMCID: PMC3751310 DOI: 10.1186/1471-2105-14-245] [Citation(s) in RCA: 64] [Impact Index Per Article: 5.8] [Received: 04/01/2013] [Accepted: 08/08/2013] [Indexed: 11/30/2022]
Abstract
BACKGROUND The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNP), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors as well as their influence on complex diseases. It is challenging to explore the relationship between these different types of genomic data sets. In this paper, we focus on a multivariate statistical method, canonical correlation analysis (CCA), for this problem. The conventional CCA method does not work effectively if the number of data samples is significantly smaller than the number of biomarkers, which is a typical case for genomic data (e.g., SNPs). Sparse CCA (sCCA) methods were introduced to overcome this difficulty, mostly using penalization with the l-1 norm (CCA-l1) or a combination of the l-1 and l-2 norms (CCA-elastic net). However, they overlook the structural or group effects within genomic data, which often exist and are important (e.g., SNPs spanning a gene interact and work together as a group). RESULTS We propose a new group sparse CCA method (CCA-sparse group) along with an effective numerical algorithm to study the mutual relationship between two different types of genomic data (i.e., SNP and gene expression). We then extend the model to a more general formulation that can include the existing sCCA models. We apply the model to feature/variable selection from two data sets and compare our group sparse CCA method with existing sCCA methods on both simulated data and two real datasets (human gliomas data and NCI60 data). We use a graphical representation of the samples with a pair of canonical variates to demonstrate the discriminating characteristic of the selected features. Pathway analysis is further performed for biological interpretation of those features.
CONCLUSIONS The CCA-sparse group method incorporates group effects of features into the correlation analysis while simultaneously performing individual feature selection. It outperforms the two sCCA methods (CCA-l1 and CCA-group) by identifying the correlated features with more true positives while controlling total discordance at a lower level on the simulated data, even if the group effect does not exist or there are irrelevant features grouped with truly correlated features. Compared with our proposed CCA-sparse group model, CCA-l1 tends to select fewer truly correlated features, while CCA-group tends to select more redundant features.
Affiliation(s)
- Dongdong Lin
  - Biomedical Engineering Department, Tulane University, New Orleans, LA, USA
  - Center of Genomics and Bioinformatics, Tulane University, New Orleans, LA, USA
- Jigang Zhang
  - Center of Genomics and Bioinformatics, Tulane University, New Orleans, LA, USA
  - Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, USA
- Jingyao Li
  - Biomedical Engineering Department, Tulane University, New Orleans, LA, USA
  - Center of Genomics and Bioinformatics, Tulane University, New Orleans, LA, USA
- Vince D Calhoun
  - The Mind Research Network, Albuquerque, NM, 87131, USA
  - Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM, 87131, USA
- Hong-Wen Deng
  - Center of Genomics and Bioinformatics, Tulane University, New Orleans, LA, USA
  - Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, USA
- Yu-Ping Wang
  - Biomedical Engineering Department, Tulane University, New Orleans, LA, USA
  - Center of Genomics and Bioinformatics, Tulane University, New Orleans, LA, USA
  - Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, USA