1
Liu Y, Ren J, Ma S, Wu C. The spike-and-slab quantile LASSO for robust variable selection in cancer genomics studies. Stat Med 2024; 43:4928-4983. [PMID: 39260448 DOI: 10.1002/sim.10196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Revised: 05/28/2024] [Accepted: 07/31/2024] [Indexed: 09/13/2024]
Abstract
Data irregularity in cancer genomics studies has been widely observed in the form of outliers and heavy-tailed distributions in the complex traits. In the past decade, robust variable selection methods have emerged as powerful alternatives to the nonrobust ones to identify important genes associated with heterogeneous disease traits and build superior predictive models. In this study, to keep the remarkable features of the quantile LASSO and fully Bayesian regularized quantile regression while overcoming their disadvantage in the analysis of high-dimensional genomics data, we propose the spike-and-slab quantile LASSO through a fully Bayesian spike-and-slab formulation under the robust likelihood by adopting the asymmetric Laplace distribution (ALD). The proposed robust method has inherited the prominent properties of selective shrinkage and self-adaptivity to the sparsity pattern from the spike-and-slab LASSO (Ročková and George, J Am Stat Assoc, 2018, 113(521): 431-444). Furthermore, the spike-and-slab quantile LASSO has a computational advantage to locate the posterior modes via soft-thresholding rule guided Expectation-Maximization (EM) steps in the coordinate descent framework, a phenomenon rarely observed for robust regularization with nondifferentiable loss functions. We have conducted comprehensive simulation studies with a variety of heavy-tailed errors in both homogeneous and heterogeneous model settings to demonstrate the superiority of the spike-and-slab quantile LASSO over its competing methods. The advantage of the proposed method has been further demonstrated in case studies of the lung adenocarcinomas (LUAD) and skin cutaneous melanoma (SKCM) data from The Cancer Genome Atlas (TCGA).
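The soft-thresholding rule mentioned in this abstract is, in its generic form, the operator S(z, lambda) = sign(z) * max(|z| - lambda, 0) that appears in coordinate descent updates for LASSO-type penalties. The Python sketch below shows only that generic operator. The function name and the example values are illustrative assumptions, and this is not the authors' EM algorithm.

```python
import math


def soft_threshold(z: float, lam: float) -> float:
    """Generic soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    # Shrink the magnitude by lam, floor it at zero, then restore the sign.
    return math.copysign(max(abs(z) - lam, 0.0), z)


# Values larger than lam in magnitude are shrunk toward zero;
# values within [-lam, lam] are set exactly to zero (this is what induces sparsity).
shrunk_pos = soft_threshold(3.0, 1.0)
shrunk_neg = soft_threshold(-3.0, 1.0)
zeroed = soft_threshold(0.5, 1.0)
```

In a coordinate descent pass, an update of this kind would be applied to one coefficient at a time while the others are held fixed.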
Affiliation(s)
- Yuwen Liu
  - Department of Statistics, Kansas State University, Manhattan, Kansas, USA
- Jie Ren
  - Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, Kansas, USA
2
Fan K, Subedi S, Yang G, Lu X, Ren J, Wu C. Is Seeing Believing? A Practitioner's Perspective on High-Dimensional Statistical Inference in Cancer Genomics Studies. ENTROPY (BASEL, SWITZERLAND) 2024; 26:794. [PMID: 39330127 PMCID: PMC11430850 DOI: 10.3390/e26090794] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2024] [Revised: 08/23/2024] [Accepted: 09/06/2024] [Indexed: 09/28/2024]
Abstract
Variable selection methods have been extensively developed for and applied to cancer genomics data to identify important omics features associated with complex disease traits, including cancer outcomes. However, the reliability and reproducibility of the findings are in question if valid inferential procedures are not available to quantify the uncertainty of the findings. In this article, we provide a gentle but systematic review of high-dimensional frequentist and Bayesian inferential tools under sparse models which can yield uncertainty quantification measures, including confidence (or Bayesian credible) intervals, p values and false discovery rates (FDR). Connections in high-dimensional inferences between the two realms have been fully exploited under the "unpenalized loss function + penalty term" formulation for regularization methods and the "likelihood function × shrinkage prior" framework for regularized Bayesian analysis. In particular, we advocate for robust Bayesian variable selection in cancer genomics studies due to its ability to accommodate disease heterogeneity in the form of heavy-tailed errors and structured sparsity while providing valid statistical inference. The numerical results show that robust Bayesian analysis incorporating exact sparsity has yielded not only superior estimation and identification results but also valid Bayesian credible intervals under nominal coverage probabilities compared with alternative methods, especially in the presence of heavy-tailed model errors and outliers.
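The link between the "unpenalized loss function + penalty term" and "likelihood function × shrinkage prior" formulations can be illustrated with the standard LASSO case. The display below is a sketch under stated assumptions: a Gaussian likelihood with known variance \sigma^2 and independent Laplace priors with rate \lambda, neither of which is taken from the article itself.

```latex
\begin{aligned}
\pi(\beta \mid y) &\propto
  \exp\!\Big(-\tfrac{1}{2\sigma^{2}}\lVert y - X\beta\rVert_2^{2}\Big)
  \times \prod_{j=1}^{p} \tfrac{\lambda}{2}\exp\!\big(-\lambda\lvert\beta_j\rvert\big), \\
-\log \pi(\beta \mid y) &=
  \tfrac{1}{2\sigma^{2}}\lVert y - X\beta\rVert_2^{2}
  + \lambda\lVert\beta\rVert_1 + \text{const}.
\end{aligned}
```

Under these assumptions the posterior mode coincides with the minimizer of a squared-error loss plus an L1 penalty, which is the sense in which a shrinkage prior plays the role of a penalty term.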
Affiliation(s)
- Kun Fan
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Srijana Subedi
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Gongshun Yang
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Xi Lu
  - Department of Pharmaceutical Health Outcomes and Policy, College of Pharmacy, University of Houston, Houston, TX 77204, USA
- Jie Ren
  - Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
3
Wang F, Jia K, Li Y. Integrative deep learning with prior assisted feature selection. Stat Med 2024; 43:3792-3814. [PMID: 38923006 DOI: 10.1002/sim.10148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Revised: 04/23/2024] [Accepted: 06/07/2024] [Indexed: 06/28/2024]
Abstract
Integrative analysis has emerged as a prominent tool in biomedical research, offering a solution to the "small n and large p" challenge. Leveraging the powerful capabilities of deep learning in extracting complex relationships between genes and diseases, our objective in this study is to incorporate deep learning into the framework of integrative analysis. Recognizing the redundancy within candidate features, we introduce a dedicated feature selection layer in the proposed integrative deep learning method. To further improve the performance of feature selection, an ensemble learning method draws on the rich body of previous research to identify "prior information". This leads to the proposed prior assisted integrative deep learning (PANDA) method. We demonstrate the superiority of the PANDA method through a series of simulation studies, showing its clear advantages over competing approaches in both feature selection and outcome prediction. Finally, a skin cutaneous melanoma (SKCM) dataset is extensively analyzed by the PANDA method to show its practical application.
Affiliation(s)
- Feifei Wang
  - Center for Applied Statistics, Renmin University of China, Beijing, China
  - School of Statistics, Renmin University of China, Beijing, China
- Ke Jia
  - School of Statistics, Renmin University of China, Beijing, China
- Yang Li
  - Center for Applied Statistics, Renmin University of China, Beijing, China
  - School of Statistics, Renmin University of China, Beijing, China
4
Zhou F, Ren J, Ma S, Wu C. The Bayesian Regularized Quantile Varying Coefficient Model. Comput Stat Data Anal 2023; 187:107808. [PMID: 38746689 PMCID: PMC11090482 DOI: 10.1016/j.csda.2023.107808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/19/2024]
Abstract
The quantile varying coefficient (VC) model can flexibly capture dynamical patterns of regression coefficients. In addition, due to the quantile check loss function, it is robust against outliers and heavy-tailed distributions of the response variable, and can provide a more comprehensive picture of modeling via exploring the conditional quantiles of the response variable. Although extensive studies have been conducted to examine variable selection for the high-dimensional quantile varying coefficient models, the Bayesian analysis has been rarely developed. The Bayesian regularized quantile varying coefficient model has been proposed to incorporate robustness against data heterogeneity while accommodating the non-linear interactions between the effect modifier and predictors. Selecting important varying coefficients can be achieved through Bayesian variable selection. Incorporating the multivariate spike-and-slab priors further improves performance by inducing exact sparsity. The Gibbs sampler has been derived to conduct efficient posterior inference of the sparse Bayesian quantile VC model through Markov chain Monte Carlo (MCMC). The merit of the proposed model in selection and estimation accuracy over the alternatives has been systematically investigated in simulation under specific quantile levels and multiple heavy-tailed model errors. In the case study, the proposed model leads to identification of biologically sensible markers in a non-linear gene-environment interaction study using the NHS data.
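The quantile check loss referred to in this abstract is conventionally written rho_tau(u) = u * (tau - 1{u < 0}). The Python sketch below illustrates why it is robust to outliers in the response, using a constant fit on a toy sample. The function names and the data values are hypothetical and chosen only for illustration; this is not the authors' varying coefficient model.

```python
def check_loss(u: float, tau: float) -> float:
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_check_loss(y: list[float], c: float, tau: float) -> float:
    """Sum of check losses of the residuals y_i - c for a constant fit c."""
    return sum(check_loss(yi - c, tau) for yi in y)


# Toy response with one gross outlier (hypothetical values).
y = [1.0, 2.0, 3.0, 4.0, 100.0]

# Among the observed values, the sample median minimizes the tau = 0.5 loss,
# so the outlier at 100 does not pull the fit the way it pulls the mean (22).
best = min(y, key=lambda c: total_check_loss(y, c, 0.5))
```

Changing `tau` targets other conditional quantiles, which is how quantile regression gives a fuller picture of the response distribution than a mean fit.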
Affiliation(s)
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, KS
- Jie Ren
  - Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS
5
Ren J, Zhou F, Li X, Ma S, Jiang Y, Wu C. Robust Bayesian variable selection for gene-environment interactions. Biometrics 2023; 79:684-694. [PMID: 35394058 PMCID: PMC11086965 DOI: 10.1111/biom.13670] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 06/15/2020] [Revised: 03/23/2022] [Accepted: 03/28/2022] [Indexed: 11/30/2022]
Abstract
Gene-environment (G×E) interactions have important implications for elucidating the etiology of complex diseases beyond the main genetic and environmental effects. Outliers and data contamination in disease phenotypes of G×E studies have been commonly encountered, leading to the development of a broad spectrum of robust regularization methods. Nevertheless, within the Bayesian framework, the issue has not been addressed in existing studies. We develop a fully Bayesian robust variable selection method for G×E interaction studies. The proposed Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. In particular, for the robust sparse group selection, the spike-and-slab priors have been imposed on both individual and group levels to identify important main and interaction effects robustly. An efficient Gibbs sampler has been developed to facilitate fast computation. Extensive simulation studies, analysis of diabetes data with single-nucleotide polymorphism measurements from the Nurses' Health Study, and The Cancer Genome Atlas melanoma data with gene expression measurements demonstrate the superior performance of the proposed method over multiple competing alternatives.
Affiliation(s)
- Jie Ren
  - Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, Kansas, USA
- Xiaoxi Li
  - Department of Statistics, Kansas State University, Manhattan, Kansas, USA
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut, USA
- Yu Jiang
  - Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, Tennessee, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, Kansas, USA
6
Zhou F, Lu X, Ren J, Fan K, Ma S, Wu C. Sparse group variable selection for gene-environment interactions in the longitudinal study. Genet Epidemiol 2022; 46:317-340. [PMID: 35766061 DOI: 10.1002/gepi.22461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2021] [Revised: 01/31/2022] [Accepted: 03/15/2022] [Indexed: 11/06/2022]
Abstract
Penalized variable selection for high-dimensional longitudinal data has received much attention as it can account for the correlation among repeated measurements while providing additional and essential information for improved identification and prediction performance. Despite the success, in longitudinal studies, the potential of penalization methods is far from fully understood for accommodating structured sparsity. In this article, we develop a sparse group penalization method to conduct the bi-level gene-environment (G × E) interaction study under the repeatedly measured phenotype. Within the quadratic inference function framework, the proposed method can achieve simultaneous identification of main and interaction effects on both the group and individual levels. Simulation studies have shown that the proposed method outperforms major competitors. In the case study of asthma data from the Childhood Asthma Management Program, we conduct G × E study by using high-dimensional single nucleotide polymorphism data as genetic factors and the longitudinal trait, forced expiratory volume in 1 s, as the phenotype. Our method leads to improved prediction and identification of main and interaction effects with important implications.
Affiliation(s)
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
- Xi Lu
  - Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
- Jie Ren
  - Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, Indiana, 46202, USA
- Kun Fan
  - Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut, 06520, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
7
Wang H, Zhang J, Klump KL, Alexandra Burt S, Cui Y. Multivariate partial linear varying coefficients model for gene-environment interactions with multiple longitudinal traits. Stat Med 2022; 41:3643-3660. [PMID: 35582816 PMCID: PMC9308731 DOI: 10.1002/sim.9440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2021] [Revised: 04/26/2022] [Accepted: 05/05/2022] [Indexed: 11/13/2022]
Abstract
Correlated phenotypes often share common genetic determinants. Thus, a multi-trait analysis can potentially increase association power and help in understanding pleiotropic effects. When multiple traits are jointly measured over time, the correlation information between multivariate longitudinal responses can help to gain power in association analysis, and the longitudinal traits can provide insights on the dynamic gene effect over time. In this work, we propose a multivariate partially linear varying coefficients model to identify genetic variants with their effects potentially modified by environmental factors. We derive a testing framework to jointly test the association of genetic factors and illustrate it with a bivariate phenotypic trait, while taking the time varying genetic effects into account. We extend the quadratic inference functions to deal with the longitudinal correlations and use penalized splines for the approximation of nonparametric coefficient functions. Theoretical results such as consistency and asymptotic normality of the estimates are established. The performance of the testing procedure is evaluated through Monte Carlo simulation studies. The utility of the method is demonstrated with a real data set from the Twin Study of Hormones and Behavior across the menstrual cycle project, in which single nucleotide polymorphisms associated with emotional eating behavior are identified.
Affiliation(s)
- Honglang Wang
  - Department of Mathematical Sciences, Indiana University-Purdue University Indianapolis, Indianapolis, Indiana, USA
- Jingyi Zhang
  - Department of Statistics and Probability, Michigan State University, East Lansing, Michigan, USA
  - Amazon Lab126, Sunnyvale, California, USA
- Kelly L Klump
  - Department of Psychology, Michigan State University, East Lansing, Michigan, USA
- Sybil Alexandra Burt
  - Department of Psychology, Michigan State University, East Lansing, Michigan, USA
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, East Lansing, Michigan, USA
8
Zhou F, Ren J, Liu Y, Li X, Wang W, Wu C. Interep: An R Package for High-Dimensional Interaction Analysis of the Repeated Measurement Data. Genes (Basel) 2022; 13:544. [PMID: 35328097 PMCID: PMC8950762 DOI: 10.3390/genes13030544] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/09/2022] [Revised: 03/12/2022] [Accepted: 03/13/2022] [Indexed: 02/05/2023] Open
Abstract
We introduce interep, an R package for interaction analysis of repeated measurement data with high-dimensional main and interaction effects. In G × E interaction studies, the forms of environmental factors play a critical role in determining how structured sparsity should be imposed in the high-dimensional scenario to identify important effects. Zhou et al. (2019) (PMID: 31816972) proposed a longitudinal penalization method to select main and interaction effects corresponding to the individual and group structure, respectively, which requires a mixture of individual and group level penalties. The R package interep implements generalized estimating equation (GEE)-based penalization methods with this sparsity assumption. Moreover, alternative methods have also been implemented in the package. These alternative methods merely select effects on an individual level and ignore the group-level interaction structure. In this software article, we first introduce the statistical methodology corresponding to the penalized GEE methods implemented in the package. Next, we present the usage of the core and supporting functions, which is followed by a simulation example with R codes and annotations. The R package interep is available at The Comprehensive R Archive Network (CRAN).
Affiliation(s)
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Jie Ren
  - Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Yuwen Liu
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Xiaoxi Li
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Weiqun Wang
  - Department of Food, Nutrition, Dietetics and Health, Kansas State University, Manhattan, KS 66506, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
9
Pośpiech E, Karłowska-Pik J, Kukla-Bartoszek M, Woźniak A, Boroń M, Zubańska M, Jarosz A, Bronikowska A, Grzybowski T, Płoski R, Spólnicka M, Branicki W. Overlapping association signals in the genetics of hair-related phenotypes in humans and their relevance to predictive DNA analysis. Forensic Sci Int Genet 2022; 59:102693. [DOI: 10.1016/j.fsigen.2022.102693] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/08/2021] [Revised: 02/25/2022] [Accepted: 03/22/2022] [Indexed: 01/02/2023]
10
Lu X, Fan K, Ren J, Wu C. Identifying Gene-Environment Interactions With Robust Marginal Bayesian Variable Selection. Front Genet 2021; 12:667074. [PMID: 34956304 PMCID: PMC8693717 DOI: 10.3389/fgene.2021.667074] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/11/2021] [Accepted: 07/13/2021] [Indexed: 01/02/2023] Open
Abstract
In high-throughput genetics studies, an important aim is to identify gene–environment interactions associated with the clinical outcomes. Recently, multiple marginal penalization methods have been developed and shown to be effective in G×E studies. However, within the Bayesian framework, marginal variable selection has not received much attention. In this study, we propose a novel marginal Bayesian variable selection method for G×E studies. In particular, our marginal Bayesian method is robust to data contamination and outliers in the outcome variables. With the incorporation of spike-and-slab priors, we have implemented the Gibbs sampler based on Markov Chain Monte Carlo (MCMC). The proposed method outperforms a number of alternatives in extensive simulation studies. The utility of the marginal robust Bayesian variable selection method has been further demonstrated in the case studies using data from the Nurses' Health Study (NHS). Some of the identified main and interaction effects from the real data analysis have important biological implications.
Affiliation(s)
- Xi Lu
  - Department of Statistics, Kansas State University, Manhattan, KS, United States
- Kun Fan
  - Department of Statistics, Kansas State University, Manhattan, KS, United States
- Jie Ren
  - Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, United States
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS, United States
11
Zhou F, Ren J, Lu X, Ma S, Wu C. Gene-Environment Interaction: A Variable Selection Perspective. Methods Mol Biol 2021; 2212:191-223. [PMID: 33733358 DOI: 10.1007/978-1-0716-0947-7_13] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 01/31/2023]
Abstract
Gene-environment interactions have important implications for elucidating the genetic basis of complex diseases beyond the joint function of multiple genetic factors and their interactions (or epistasis). In the past, G × E interactions have been mainly conducted within the framework of genetic association studies. The high dimensionality of G × E interactions, due to the complicated form of environmental effects and the presence of a large number of genetic factors including gene expressions and SNPs, has motivated the recent development of penalized variable selection methods for dissecting G × E interactions, which has been ignored in the majority of published reviews on genetic interaction studies. In this article, we first survey existing studies on both gene-environment and gene-gene interactions. Then, after a brief introduction to the variable selection methods, we review penalization and relevant variable selection methods in marginal and joint paradigms, respectively, under a variety of conceptual models. Discussions on strengths and limitations, as well as computational aspects of the variable selection methods tailored for G × E studies, have also been provided.
Affiliation(s)
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
- Jie Ren
  - Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA
- Xi Lu
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
- Shuangge Ma
  - Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
12
Li Y, Wang F, Wu M, Ma S. Integrative functional linear model for genome-wide association studies with multiple traits. Biostatistics 2020; 23:574-590. [PMID: 33040145 DOI: 10.1093/biostatistics/kxaa043] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 02/20/2019] [Revised: 06/30/2020] [Accepted: 09/12/2020] [Indexed: 11/14/2022] Open
Abstract
In recent biomedical research, genome-wide association studies (GWAS) have demonstrated great success in investigating the genetic architecture of human diseases. For many complex diseases, multiple correlated traits have been collected. However, most of the existing GWAS are still limited because they analyze each trait separately without considering their correlations and suffer from a lack of sufficient information. Moreover, the high dimensionality of single nucleotide polymorphism (SNP) data still poses tremendous challenges to statistical methods, in both theoretical and practical aspects. In this article, we innovatively propose an integrative functional linear model for GWAS with multiple traits. This study is the first to approximate SNPs as functional objects in a joint model of multiple traits with penalization techniques. It effectively accommodates the high dimensionality of SNPs and correlations among multiple traits to facilitate information borrowing. Our extensive simulation studies demonstrate the satisfactory performance of the proposed method in the identification and estimation of disease-associated genetic variants, compared to four alternatives. The analysis of type 2 diabetes data leads to biologically meaningful findings with good prediction accuracy and selection stability.
Affiliation(s)
- Yang Li
  - Center for Applied Statistics, School of Statistics, and Statistical Consulting Center, Renmin University of China, Beijing 100872, China
- Fan Wang
  - Center for Applied Statistics, School of Statistics, and Statistical Consulting Center, Renmin University of China, Beijing 100872, China
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven 06520, USA
13
Lai P, Wang F, Zhu T, Zhang Q. Model identification and selection for single-index varying-coefficient models. ANN I STAT MATH 2020. [DOI: 10.1007/s10463-020-00757-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
14
Ren J, Zhou F, Li X, Chen Q, Zhang H, Ma S, Jiang Y, Wu C. Semiparametric Bayesian variable selection for gene-environment interactions. Stat Med 2020; 39:617-638. [PMID: 31863500 PMCID: PMC7467082 DOI: 10.1002/sim.8434] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 05/31/2019] [Revised: 09/26/2019] [Accepted: 11/02/2019] [Indexed: 11/06/2022]
Abstract
Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. Study of gene-environment (G×E) interactions is important for elucidating the disease etiology. Existing Bayesian methods for G×E interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. Many studies have shown the advantages of penalization methods in detecting G×E interactions in "large p, small n" settings. However, Bayesian variable selection, which can provide fresh insight into G×E study, has not been widely examined. We propose a novel and powerful semiparametric Bayesian variable selection model that can investigate linear and nonlinear G×E interactions simultaneously. Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from main-effects-only case within the Bayesian framework. Spike-and-slab priors are incorporated on both individual and group levels to identify the sparse main and interaction effects. The proposed method conducts Bayesian variable selection more efficiently than existing methods. Simulation shows that the proposed model outperforms competing alternatives in terms of both identification and prediction. The proposed Bayesian method leads to the identification of main and interaction effects with important implications in a high-throughput profiling study with high-dimensional SNP data.
Affiliation(s)
- Jie Ren
  - Department of Statistics, Kansas State University, Manhattan, Kansas
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, Kansas
- Xiaoxi Li
  - Department of Statistics, Kansas State University, Manhattan, Kansas
- Qi Chen
  - Department of Pharmacology, Toxicology and Therapeutics, University of Kansas Medical Center, Kansas City, Kansas
- Hongmei Zhang
  - Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, Tennessee
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut
- Yu Jiang
  - Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, Tennessee
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, Kansas
15
Penalized Variable Selection for Lipid-Environment Interactions in a Longitudinal Lipidomics Study. Genes (Basel) 2019; 10:genes10121002. [PMID: 31816972 PMCID: PMC6947406 DOI: 10.3390/genes10121002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/07/2019] [Accepted: 11/26/2019] [Indexed: 12/20/2022] Open
Abstract
Lipid species are critical components of eukaryotic membranes. They play key roles in many biological processes such as signal transduction, cell homeostasis, and energy storage. Investigations of lipid-environment interactions, in addition to the lipid and environment main effects, have important implications in understanding the lipid metabolism and related changes in phenotype. In this study, we developed a novel penalized variable selection method to identify important lipid-environment interactions in a longitudinal lipidomics study. An efficient Newton-Raphson based algorithm was proposed within the generalized estimating equation (GEE) framework. We conducted extensive simulation studies to demonstrate the superior performance of our method over alternatives, in terms of both identification accuracy and prediction performance. As weight control via dietary calorie restriction and exercise has been demonstrated to prevent cancer in a variety of studies, analysis of the high-dimensional lipid datasets collected using 60 mice from the skin cancer prevention study identified meaningful markers that provide fresh insight into the underlying mechanism of cancer preventive effects.
16
Zhang S, Xue Y, Zhang Q, Ma C, Wu M, Ma S. Identification of gene-environment interactions with marginal penalization. Genet Epidemiol 2019; 44:159-196. [PMID: 31724772 DOI: 10.1002/gepi.22270] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/18/2019] [Revised: 10/05/2019] [Accepted: 10/25/2019] [Indexed: 12/29/2022]
Abstract
Gene-environment (G-E) interaction analysis has been extensively conducted for complex diseases. In marginal analysis, the common practice is to conduct likelihood-based (and other "standard") estimation with each marginal model, and then select significant G-E interactions and main effects based on p values and multiple comparisons adjustment. One limitation of this approach is that the identification results often do not respect the "main effects, interactions" hierarchy, which has been stressed in recent G-E interaction analyses. Some recent efforts tackle this problem, but with very complex formulations. Another limitation of the common practice is that it may not perform well when regularization is needed, for example, because of "non-normal" distributions. In this article, we propose a marginal penalization approach which adopts a novel penalty to directly tackle the aforementioned problems. The proposed approach has a framework more coherent with that of the recently developed joint analysis methods and an intuitive formulation, and can be effectively realized. In simulation, it outperforms the popular significance-based analysis and simple penalization-based alternatives. Promising findings are made in the analysis of single-nucleotide polymorphism data and gene expression data.
Affiliation(s)
- Sanguo Zhang
- School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, China
- Yuan Xue
- School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, China; Department of Biostatistics, Yale University, New Haven, Connecticut
- Qingzhao Zhang
- Department of Statistics, School of Economics, Xiamen University, Xiamen, China
- Chenjin Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut; School of Statistics, Renmin University, Beijing, China
- Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut
17
Wu M, Zhang Q, Ma S. Structured gene-environment interaction analysis. Biometrics 2019; 76:23-35. [PMID: 31424088 DOI: 10.1111/biom.13139] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 10/18/2018] [Accepted: 08/06/2019] [Indexed: 01/03/2023]
Abstract
For the etiology, progression, and treatment of complex diseases, gene-environment (G-E) interactions have important implications beyond the main G and E effects. G-E interaction analysis can be more challenging with higher dimensionality and need for accommodating the "main effects, interactions" hierarchy. In recent literature, an array of novel methods, many of which are based on the penalization technique, have been developed. In most of these studies, however, the structures of G measurements, for example, the adjacency structure of single nucleotide polymorphisms (SNPs; attributable to their physical adjacency on the chromosomes) and the network structure of gene expressions (attributable to their coordinated biological functions and correlated measurements) have not been well accommodated. In this study, we develop structured G-E interaction analysis, where such structures are accommodated using penalization for both the main G effects and interactions. Penalization is also applied for regularized estimation and selection. The proposed structured interaction analysis can be effectively realized. It is shown to have consistency properties under high-dimensional settings. Simulations and analysis of GENEVA diabetes data with SNP measurements and TCGA melanoma data with gene expression measurements demonstrate its competitive practical performance.
Affiliation(s)
- Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China; Department of Biostatistics, Yale University, New Haven, Connecticut
- Qingzhao Zhang
- School of Economics and Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
- Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut
18
Li Y, Li R, Lin C, Qin Y, Ma S. Penalized integrative semiparametric interaction analysis for multiple genetic datasets. Stat Med 2019; 38:3221-3242. [PMID: 30993736 DOI: 10.1002/sim.8172] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 11/09/2018] [Revised: 02/08/2019] [Accepted: 03/27/2019] [Indexed: 12/19/2022]
Abstract
In this article, we consider a semiparametric additive partially linear interaction model for the integrative analysis of multiple genetic datasets. The goals are to identify important genetic predictors and gene-gene interactions and, at the same time, to estimate the nonparametric functions that describe the environmental effects. To find the similarities and differences of the genetic effects across different datasets, we impose a group structure on the regression coefficient matrix under the homogeneity assumption, i.e., models for different datasets share the same sparsity structure, but the coefficients may differ across datasets. We develop an iterative approach to estimate the parameters of main effects, interactions, and nonparametric functions, where a reparametrization of interaction parameters is implemented to meet the strong hierarchy assumption. We demonstrate the advantages of the proposed method in identification, estimation, and prediction in a series of numerical studies. We also apply the proposed method to the Skin Cutaneous Melanoma data and the lung cancer data from The Cancer Genome Atlas.
Affiliation(s)
- Yang Li
- Center for Applied Statistics, Renmin University of China, Beijing, China; School of Statistics, Renmin University of China, Beijing, China; Statistical Consulting Center, Renmin University of China, Beijing, China
- Rong Li
- School of Statistics, Renmin University of China, Beijing, China; Statistical Consulting Center, Renmin University of China, Beijing, China
- Cunjie Lin
- Center for Applied Statistics, Renmin University of China, Beijing, China; School of Statistics, Renmin University of China, Beijing, China; Statistical Consulting Center, Renmin University of China, Beijing, China
- Yichen Qin
- Department of Operations, Business Analytics and Information Systems, University of Cincinnati, Cincinnati, Ohio
- Shuangge Ma
- School of Statistics, Renmin University of China, Beijing, China; Department of Biostatistics, Yale University, New Haven, Connecticut
19
Wu C, Zhou F, Ren J, Li X, Jiang Y, Ma S. A Selective Review of Multi-Level Omics Data Integration Using Variable Selection. High Throughput 2019; 8:E4. [PMID: 30669303 PMCID: PMC6473252 DOI: 10.3390/ht8010004] [Citation(s) in RCA: 114] [Impact Index Per Article: 22.8] [Received: 11/22/2018] [Revised: 12/24/2018] [Accepted: 01/10/2019] [Indexed: 01/02/2023] Open
Abstract
High-throughput technologies have been used to generate a large amount of omics data. In the past, single-level analysis has been extensively conducted where the omics measurements at different levels, including mRNA, microRNA, CNV and DNA methylation, are analyzed separately. As the molecular complexity of disease etiology exists at all different levels, integrative analysis offers an effective way to borrow strength across multi-level omics data and can be more powerful than single level analysis. In this article, we focus on reviewing existing multi-omics integration studies by paying special attention to variable selection methods. We first summarize published reviews on integrating multi-level omics data. Next, after a brief overview on variable selection methods, we review existing supervised, semi-supervised and unsupervised integrative analyses within parallel and hierarchical integration studies, respectively. The strength and limitations of the methods are discussed in detail. No existing integration method can dominate the rest. The computation aspects are also investigated. The review concludes with possible limitations and future directions for multi-level omics data integration.
Affiliation(s)
- Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Jie Ren
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Xiaoxi Li
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Yu Jiang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN 38152, USA
- Shuangge Ma
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT 06510, USA
20
Ren J, He T, Li Y, Liu S, Du Y, Jiang Y, Wu C. Network-based regularization for high dimensional SNP data in the case-control study of Type 2 diabetes. BMC Genet 2017; 18:44. [PMID: 28511641 PMCID: PMC5434559 DOI: 10.1186/s12863-017-0495-5] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 11/05/2016] [Accepted: 03/25/2017] [Indexed: 12/02/2022] Open
Abstract
Background: Over the past decades, the prevalence of type 2 diabetes mellitus (T2D) has been steadily increasing around the world. Despite large efforts devoted to better understanding the genetic basis of the disease, the identified susceptibility loci can only account for a small portion of the T2D heritability. Some of the existing approaches proposed for the high dimensional genetic data from the T2D case–control study are limited by analyzing only a small number of SNPs at a time from a large pool of SNPs, by ignoring the correlations among SNPs, and by adopting inefficient selection techniques. Methods: We propose a network constrained regularization method to select important SNPs by taking the linkage disequilibrium into account. To accommodate the case–control study, an iteratively reweighted least squares algorithm has been developed within the coordinate descent framework, where the regularized logistic loss function is optimized with respect to one parameter at a time, cycling iteratively through all the parameters until convergence. Results: In this article, a novel approach is developed to identify important SNPs more effectively through incorporating the interconnections among them in the regularized selection. A coordinate descent based iteratively reweighted least squares (IRLS) algorithm has been proposed. Conclusions: Both the simulation study and the analysis of the Nurses' Health Study, a case–control study of type 2 diabetes data with high dimensional SNP measurements, demonstrate the advantage of the network based approach over the competing alternatives.
Affiliation(s)
- Jie Ren
- Department of Statistics, Kansas State University, 1116 Mid-Campus Drive N., 66506, Manhattan, KS, USA
- Tao He
- Department of Mathematics, San Francisco State University, San Francisco, CA, USA
- Ye Li
- Department of Biostatistics, Yale University, New Haven, CT, USA
- Sai Liu
- Division of Nephrology, School of Medicine, Stanford University, Palo Alto, CA, USA
- Yinhao Du
- Department of Statistics, Kansas State University, 1116 Mid-Campus Drive N., 66506, Manhattan, KS, USA
- Yu Jiang
- Division of Epidemiology, Biostatistics, and Environmental Health, School of Public Health, University of Memphis, Memphis, TN, USA
- Cen Wu
- Department of Statistics, Kansas State University, 1116 Mid-Campus Drive N., 66506, Manhattan, KS, USA
21
Profile forward regression screening for ultra-high dimensional semiparametric varying coefficient partially linear models. J MULTIVARIATE ANAL 2017. [DOI: 10.1016/j.jmva.2016.12.006] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/24/2022]