1. Liu Y, Ren J, Ma S, Wu C. The spike-and-slab quantile LASSO for robust variable selection in cancer genomics studies. Stat Med 2024; 43:4928-4983. [PMID: 39260448 PMCID: PMC11585335 DOI: 10.1002/sim.10196] [Received: 09/05/2023] [Revised: 05/28/2024] [Accepted: 07/31/2024] [Indexed: 09/13/2024]
Abstract
Data irregularity in cancer genomics studies has been widely observed in the form of outliers and heavy-tailed distributions in the complex traits. In the past decade, robust variable selection methods have emerged as powerful alternatives to the nonrobust ones to identify important genes associated with heterogeneous disease traits and build superior predictive models. In this study, to keep the remarkable features of the quantile LASSO and fully Bayesian regularized quantile regression while overcoming their disadvantage in the analysis of high-dimensional genomics data, we propose the spike-and-slab quantile LASSO through a fully Bayesian spike-and-slab formulation under the robust likelihood by adopting the asymmetric Laplace distribution (ALD). The proposed robust method has inherited the prominent properties of selective shrinkage and self-adaptivity to the sparsity pattern from the spike-and-slab LASSO (Ročková and George, J Am Stat Assoc, 2018, 113(521): 431-444). Furthermore, the spike-and-slab quantile LASSO has a computational advantage in locating the posterior modes via soft-thresholding rule guided Expectation-Maximization (EM) steps in the coordinate descent framework, a phenomenon rarely observed for robust regularization with nondifferentiable loss functions. We have conducted comprehensive simulation studies with a variety of heavy-tailed errors in both homogeneous and heterogeneous model settings to demonstrate the superiority of the spike-and-slab quantile LASSO over its competing methods. The advantage of the proposed method has been further demonstrated in case studies of the lung adenocarcinoma (LUAD) and skin cutaneous melanoma (SKCM) data from The Cancer Genome Atlas (TCGA).
Affiliation(s)
- Yuwen Liu: Department of Statistics, Kansas State University, Manhattan, KS
- Jie Ren: Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, IN
- Shuangge Ma: Department of Biostatistics, Yale University, New Haven, CT
- Cen Wu: Department of Statistics, Kansas State University, Manhattan, KS
2. Fan K, Subedi S, Yang G, Lu X, Ren J, Wu C. Is Seeing Believing? A Practitioner's Perspective on High-Dimensional Statistical Inference in Cancer Genomics Studies. Entropy (Basel, Switzerland) 2024; 26:794. [PMID: 39330127 PMCID: PMC11430850 DOI: 10.3390/e26090794] [Received: 07/24/2024] [Revised: 08/23/2024] [Accepted: 09/06/2024] [Indexed: 09/28/2024]
Abstract
Variable selection methods have been extensively developed for and applied to cancer genomics data to identify important omics features associated with complex disease traits, including cancer outcomes. However, the reliability and reproducibility of the findings are in question if valid inferential procedures are not available to quantify the uncertainty of the findings. In this article, we provide a gentle but systematic review of high-dimensional frequentist and Bayesian inferential tools under sparse models which can yield uncertainty quantification measures, including confidence (or Bayesian credible) intervals, p values and false discovery rates (FDR). Connections in high-dimensional inferences between the two realms have been fully exploited under the "unpenalized loss function + penalty term" formulation for regularization methods and the "likelihood function × shrinkage prior" framework for regularized Bayesian analysis. In particular, we advocate for robust Bayesian variable selection in cancer genomics studies due to its ability to accommodate disease heterogeneity in the form of heavy-tailed errors and structured sparsity while providing valid statistical inference. The numerical results show that robust Bayesian analysis incorporating exact sparsity has yielded not only superior estimation and identification results but also valid Bayesian credible intervals under nominal coverage probabilities compared with alternative methods, especially in the presence of heavy-tailed model errors and outliers.
Affiliation(s)
- Kun Fan: Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Srijana Subedi: Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Gongshun Yang: Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Xi Lu: Department of Pharmaceutical Health Outcomes and Policy, College of Pharmacy, University of Houston, Houston, TX 77204, USA
- Jie Ren: Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Cen Wu: Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
3. Wu M, Wang F, Ge Y, Ma S, Li Y. Bi-level structured functional analysis for genome-wide association studies. Biometrics 2023; 79:3359-3373. [PMID: 37098961 DOI: 10.1111/biom.13871] [Received: 07/25/2022] [Accepted: 04/19/2023] [Indexed: 04/27/2023]
Abstract
Genome-wide association studies (GWAS) have led to great successes in identifying genotype-phenotype associations for complex human diseases. In such studies, the high dimensionality of single nucleotide polymorphisms (SNPs) often makes analysis difficult. Functional analysis, which interprets SNPs densely distributed in a chromosomal region as a continuous process rather than discrete observations, has emerged as a promising avenue for overcoming the high dimensionality challenges. However, the majority of the existing functional studies continue to be individual SNP based and are unable to sufficiently account for the intricate underpinning structures of SNP data. SNPs are often found in groups (e.g., genes or pathways) and have a natural group structure. Additionally, these SNP groups can be highly correlated with coordinated biological functions and interact in a network. Motivated by these unique characteristics of SNP data, we develop a novel bi-level structured functional analysis method and investigate disease-associated genetic variants at the SNP level and SNP group level simultaneously. The penalization technique is adopted for bi-level selection and also to accommodate the group-level network structure. Both the estimation and selection consistency properties are rigorously established. The superiority of the proposed method over alternatives is shown through extensive simulation studies. A type 2 diabetes SNP data application yields some biologically intriguing results.
Affiliation(s)
- Mengyun Wu: School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Fan Wang: Center for Applied Statistics, School of Statistics, and Statistical Consulting Center, Renmin University of China, Beijing, China
- Yeheng Ge: School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma: Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
- Yang Li: Center for Applied Statistics, School of Statistics, and Statistical Consulting Center, Renmin University of China, Beijing, China
4. Zhou F, Liu Y, Ren J, Wang W, Wu C. Springer: An R package for bi-level variable selection of high-dimensional longitudinal data. Front Genet 2023; 14:1088223. [PMID: 37091810 PMCID: PMC10117642 DOI: 10.3389/fgene.2023.1088223] [Received: 11/03/2022] [Accepted: 02/28/2023] [Indexed: 04/09/2023]
Abstract
In high-dimensional data analysis, bi-level (or sparse group) variable selection can simultaneously conduct penalization on the group level and within groups, and has been developed for continuous, binary, and survival responses in the literature. Zhou et al. (2022) (PMID: 35766061) have further extended it to longitudinal responses by proposing a quadratic inference function (QIF)-based penalization method in gene-environment interaction studies. This study introduces "springer," an R package implementing the bi-level variable selection within the QIF framework developed in Zhou et al. (2022). In addition, the R package "springer" also implements the generalized estimating equation-based sparse group penalization method. Alternative methods focusing only on the group level or individual level are also provided by the package. In this study, we systematically introduce the longitudinal penalization methods implemented in the "springer" package. We demonstrate the usage of the core and supporting functions, followed by numerical examples and discussions. R package "springer" is available at https://cran.r-project.org/package=springer.
Affiliation(s)
- Fei Zhou: Department of Statistics, Kansas State University, Manhattan, KS, United States
- Yuwen Liu: Department of Statistics, Kansas State University, Manhattan, KS, United States
- Jie Ren: Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, IN, United States
- Weiqun Wang: Department of Food, Nutrition, Dietetics and Health, Kansas State University, Manhattan, KS, United States
- Cen Wu: Department of Statistics, Kansas State University, Manhattan, KS, United States
5. Zhou F, Lu X, Ren J, Fan K, Ma S, Wu C. Sparse group variable selection for gene-environment interactions in the longitudinal study. Genet Epidemiol 2022; 46:317-340. [PMID: 35766061 DOI: 10.1002/gepi.22461] [Received: 07/19/2021] [Revised: 01/31/2022] [Accepted: 03/15/2022] [Indexed: 11/06/2022]
Abstract
Penalized variable selection for high-dimensional longitudinal data has received much attention as it can account for the correlation among repeated measurements while providing additional and essential information for improved identification and prediction performance. Despite the success, in longitudinal studies, the potential of penalization methods is far from fully understood for accommodating structured sparsity. In this article, we develop a sparse group penalization method to conduct the bi-level gene-environment (G × E) interaction study under the repeatedly measured phenotype. Within the quadratic inference function framework, the proposed method can achieve simultaneous identification of main and interaction effects on both the group and individual levels. Simulation studies have shown that the proposed method outperforms major competitors. In the case study of asthma data from the Childhood Asthma Management Program, we conduct a G × E study by using high-dimensional single nucleotide polymorphism data as genetic factors and the longitudinal trait, forced expiratory volume in 1 s, as the phenotype. Our method leads to improved prediction and identification of main and interaction effects with important implications.
Affiliation(s)
- Fei Zhou: Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
- Xi Lu: Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
- Jie Ren: Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, Indiana, 46202, USA
- Kun Fan: Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
- Shuangge Ma: Department of Biostatistics, Yale University, New Haven, Connecticut, 06520, USA
- Cen Wu: Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA