1
|
Zhou F, Ren J, Ma S, Wu C. The Bayesian Regularized Quantile Varying Coefficient Model. Comput Stat Data Anal 2023; 187:107808. [PMID: 38746689 PMCID: PMC11090482 DOI: 10.1016/j.csda.2023.107808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/19/2024]
Abstract
The quantile varying coefficient (VC) model can flexibly capture dynamical patterns of regression coefficients. In addition, due to the quantile check loss function, it is robust against outliers and heavy-tailed distributions of the response variable, and can provide a more comprehensive picture of modeling via exploring the conditional quantiles of the response variable. Although extensive studies have been conducted to examine variable selection for the high-dimensional quantile varying coefficient models, the Bayesian analysis has been rarely developed. The Bayesian regularized quantile varying coefficient model has been proposed to incorporate robustness against data heterogeneity while accommodating the non-linear interactions between the effect modifier and predictors. Selecting important varying coefficients can be achieved through Bayesian variable selection. Incorporating the multivariate spike-and-slab priors further improves performance by inducing exact sparsity. The Gibbs sampler has been derived to conduct efficient posterior inference of the sparse Bayesian quantile VC model through Markov chain Monte Carlo (MCMC). The merit of the proposed model in selection and estimation accuracy over the alternatives has been systematically investigated in simulation under specific quantile levels and multiple heavy-tailed model errors. In the case study, the proposed model leads to identification of biologically sensible markers in a non-linear gene-environment interaction study using the NHS data.
Collapse
Affiliation(s)
- Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, KS
| | - Jie Ren
- Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, CT
| | - Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS
| |
Collapse
|
2
|
Ren J, Zhou F, Li X, Ma S, Jiang Y, Wu C. Robust Bayesian variable selection for gene-environment interactions. Biometrics 2023; 79:684-694. [PMID: 35394058 PMCID: PMC11086965 DOI: 10.1111/biom.13670] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2020] [Revised: 03/23/2022] [Accepted: 03/28/2022] [Indexed: 11/30/2022]
Abstract
Gene-environment (G× E) interactions have important implications to elucidate the etiology of complex diseases beyond the main genetic and environmental effects. Outliers and data contamination in disease phenotypes of G× E studies have been commonly encountered, leading to the development of a broad spectrum of robust regularization methods. Nevertheless, within the Bayesian framework, the issue has not been taken care of in existing studies. We develop a fully Bayesian robust variable selection method for G× E interaction studies. The proposed Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. In particular, for the robust sparse group selection, the spike-and-slab priors have been imposed on both individual and group levels to identify important main and interaction effects robustly. An efficient Gibbs sampler has been developed to facilitate fast computation. Extensive simulation studies, analysis of diabetes data with single-nucleotide polymorphism measurements from the Nurses' Health Study, and The Cancer Genome Atlas melanoma data with gene expression measurements demonstrate the superior performance of the proposed method over multiple competing alternatives.
Collapse
Affiliation(s)
- Jie Ren
- Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, Indiana, USA
| | - Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, Kansas, USA
| | - Xiaoxi Li
- Department of Statistics, Kansas State University, Manhattan, Kansas, USA
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut, USA
| | - Yu Jiang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, Tennessee, USA
| | - Cen Wu
- Department of Statistics, Kansas State University, Manhattan, Kansas, USA
| |
Collapse
|
3
|
Xiong W, Chen Y, Ma S. Unified model-free interaction screening via CV-entropy filter. Comput Stat Data Anal 2023; 180:107684. [PMID: 36910335 PMCID: PMC9997997 DOI: 10.1016/j.csda.2022.107684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]
Abstract
For many practical high-dimensional problems, interactions have been increasingly found to play important roles beyond main effects. A representative example is gene-gene interaction. Joint analysis, which analyzes all interactions and main effects in a single model, can be seriously challenged by high dimensionality. For high-dimensional data analysis in general, marginal screening has been established as effective for reducing computational cost, increasing stability, and improving estimation/selection performance. Most of the existing marginal screening methods are designed for the analysis of main effects only. The existing screening methods for interaction analysis are often limited by making stringent model assumptions, lacking robustness, and/or requiring predictors to be continuous (and hence lacking flexibility). A unified marginal screening approach tailored to interaction analysis is developed, which can be applied to regression, classification, and survival analysis. Predictors are allowed to be continuous and discrete. The proposed approach is built on Coefficient of Variation (CV) filters based on information entropy. Statistical properties are rigorously established. It is shown that the CV filters are almost insensitive to the distribution tails of predictors, correlation structure among predictors, and sparsity level of signals. An efficient two-stage algorithm is developed to make the proposed approach scalable to ultrahigh-dimensional data. Simulations and the analysis of TCGA LUAD data further establish the practical superiority of the proposed approach.
Collapse
Affiliation(s)
- Wei Xiong
- School of Statistics, University of International Business and Economics, Beijing 100872, PR China
| | - Yaxian Chen
- Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong
| | - Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, USA
| |
Collapse
|
4
|
Zhou F, Lu X, Ren J, Fan K, Ma S, Wu C. Sparse group variable selection for gene-environment interactions in the longitudinal study. Genet Epidemiol 2022; 46:317-340. [PMID: 35766061 DOI: 10.1002/gepi.22461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2021] [Revised: 01/31/2022] [Accepted: 03/15/2022] [Indexed: 11/06/2022]
Abstract
Penalized variable selection for high-dimensional longitudinal data has received much attention as it can account for the correlation among repeated measurements while providing additional and essential information for improved identification and prediction performance. Despite the success, in longitudinal studies, the potential of penalization methods is far from fully understood for accommodating structured sparsity. In this article, we develop a sparse group penalization method to conduct the bi-level gene-environment (G × $\times $ E) interaction study under the repeatedly measured phenotype. Within the quadratic inference function framework, the proposed method can achieve simultaneous identification of main and interaction effects on both the group and individual levels. Simulation studies have shown that the proposed method outperforms major competitors. In the case study of asthma data from the Childhood Asthma Management Program, we conduct G × $\times $ E study by using high-dimensional single nucleotide polymorphism data as genetic factors and the longitudinal trait, forced expiratory volume in 1 s, as the phenotype. Our method leads to improved prediction and identification of main and interaction effects with important implications.
Collapse
Affiliation(s)
- Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
| | - Xi Lu
- Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
| | - Jie Ren
- Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, Indiana, 46202, USA
| | - Kun Fan
- Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut, 06520, USA
| | - Cen Wu
- Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
| |
Collapse
|
5
|
Luo Z, Zhang Y, Sun Y. A Penalization Method for Estimating Heterogeneous Covariate Effects in Cancer Genomic Data. Genes (Basel) 2022; 13:genes13040702. [PMID: 35456506 PMCID: PMC9025588 DOI: 10.3390/genes13040702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2022] [Revised: 04/08/2022] [Accepted: 04/09/2022] [Indexed: 11/16/2022] Open
Abstract
In high-throughput profiling studies, extensive efforts have been devoted to searching for the biomarkers associated with the development and progression of complex diseases. The heterogeneity of covariate effects associated with the outcomes across subjects has been noted in the literature. In this paper, we consider a scenario where the effects of covariates change smoothly across subjects, which are ordered by a known auxiliary variable. To this end, we develop a penalization-based approach, which applies a penalization technique to simultaneously select important covariates and estimate their unique effects on the outcome variables of each subject. We demonstrate that, under the appropriate conditions, our method shows selection and estimation consistency. Additional simulations demonstrate its superiority compared to several competing methods. Furthermore, applying the proposed approach to two The Cancer Genome Atlas datasets leads to better prediction performance and higher selection stability.
Collapse
Affiliation(s)
- Ziye Luo
- School of Statistics, Renmin University of China, No. 59 Zhongguancun Street, Beijing 100872, China; (Z.L.); (Y.Z.)
| | - Yuzhao Zhang
- School of Statistics, Renmin University of China, No. 59 Zhongguancun Street, Beijing 100872, China; (Z.L.); (Y.Z.)
| | - Yifan Sun
- Center for Applied Statistics, School of Statistics, Renmin University of China, No. 59 Zhongguancun Street, Beijing 100872, China
- Correspondence:
| |
Collapse
|
6
|
Zhou F, Ren J, Liu Y, Li X, Wang W, Wu C. Interep: An R Package for High-Dimensional Interaction Analysis of the Repeated Measurement Data. Genes (Basel) 2022; 13:genes13030544. [PMID: 35328097 PMCID: PMC8950762 DOI: 10.3390/genes13030544] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2022] [Revised: 03/12/2022] [Accepted: 03/13/2022] [Indexed: 02/05/2023] Open
Abstract
We introduce interep, an R package for interaction analysis of repeated measurement data with high-dimensional main and interaction effects. In G × E interaction studies, the forms of environmental factors play a critical role in determining how structured sparsity should be imposed in the high-dimensional scenario to identify important effects. Zhou et al. (2019) (PMID: 31816972) proposed a longitudinal penalization method to select main and interaction effects corresponding to the individual and group structure, respectively, which requires a mixture of individual and group level penalties. The R package interep implements generalized estimating equation (GEE)-based penalization methods with this sparsity assumption. Moreover, alternative methods have also been implemented in the package. These alternative methods merely select effects on an individual level and ignore the group-level interaction structure. In this software article, we first introduce the statistical methodology corresponding to the penalized GEE methods implemented in the package. Next, we present the usage of the core and supporting functions, which is followed by a simulation example with R codes and annotations. The R package interep is available at The Comprehensive R Archive Network (CRAN).
Collapse
Affiliation(s)
- Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA; (F.Z.); (Y.L.); (X.L.)
| | - Jie Ren
- Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA;
| | - Yuwen Liu
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA; (F.Z.); (Y.L.); (X.L.)
| | - Xiaoxi Li
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA; (F.Z.); (Y.L.); (X.L.)
| | - Weiqun Wang
- Department of Food, Nutrition, Dietetics and Health, Kansas State University, Manhattan, KS 66506, USA;
| | - Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA; (F.Z.); (Y.L.); (X.L.)
- Correspondence: ; Tel.: +1-7855322231
| |
Collapse
|
7
|
Ren M, Zhang S, Ma S, Zhang Q. Gene-environment interaction identification via penalized robust divergence. Biom J 2022; 64:461-480. [PMID: 34725857 PMCID: PMC9386692 DOI: 10.1002/bimj.202000157] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2020] [Revised: 06/01/2021] [Accepted: 08/23/2021] [Indexed: 12/11/2022]
Abstract
In high-throughput cancer studies, gene-environment interactions associated with outcomes have important implications. Some commonly adopted identification methods do not respect the "main effect, interaction" hierarchical structure. In addition, they can be challenged by data contamination and/or long-tailed distributions, which are not uncommon. In this article, robust methods based on γ$\gamma$ -divergence and density power divergence are proposed to accommodate contaminated data/long-tailed distributions. A hierarchical sparse group penalty is adopted for regularized estimation and selection and can identify important gene-environment interactions and respect the "main effect, interaction" hierarchical structure. The proposed methods are implemented using an effective group coordinate descent algorithm. Simulation shows that when contamination occurs, the proposed methods can significantly outperform the existing alternatives with more accurate identification. The proposed approach is applied to the analysis of The Cancer Genome Atlas (TCGA) triple-negative breast cancer data and Gene Environment Association Studies (GENEVA) Type 2 Diabetes data.
Collapse
Affiliation(s)
- Mingyang Ren
- School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, P. R. China,Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, P. R. China
| | - Sanguo Zhang
- School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, P. R. China,Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, P. R. China
| | - Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
| | - Qingzhao Zhang
- Department of Statistics and Data Science, School of Economics, Wang Yanan Institute for Studies in Economics, Fujian Key Lab of Statistics, Xiamen University, Fujian, P. R. China
| |
Collapse
|
8
|
Lu X, Fan K, Ren J, Wu C. Identifying Gene-Environment Interactions With Robust Marginal Bayesian Variable Selection. Front Genet 2021; 12:667074. [PMID: 34956304 PMCID: PMC8693717 DOI: 10.3389/fgene.2021.667074] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2021] [Accepted: 07/13/2021] [Indexed: 01/02/2023] Open
Abstract
In high-throughput genetics studies, an important aim is to identify gene–environment interactions associated with the clinical outcomes. Recently, multiple marginal penalization methods have been developed and shown to be effective in G×E studies. However, within the Bayesian framework, marginal variable selection has not received much attention. In this study, we propose a novel marginal Bayesian variable selection method for G×E studies. In particular, our marginal Bayesian method is robust to data contamination and outliers in the outcome variables. With the incorporation of spike-and-slab priors, we have implemented the Gibbs sampler based on Markov Chain Monte Carlo (MCMC). The proposed method outperforms a number of alternatives in extensive simulation studies. The utility of the marginal robust Bayesian variable selection method has been further demonstrated in the case studies using data from the Nurse Health Study (NHS). Some of the identified main and interaction effects from the real data analysis have important biological implications.
Collapse
Affiliation(s)
- Xi Lu
- Department of Statistics, Kansas State University, Manhattan, KS, United States
| | - Kun Fan
- Department of Statistics, Kansas State University, Manhattan, KS, United States
| | - Jie Ren
- Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, United States
| | - Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS, United States
| |
Collapse
|
9
|
Du Y, Fan K, Lu X, Wu C. Integrating Multi–Omics Data for Gene-Environment Interactions. BIOTECH 2021; 10:biotech10010003. [PMID: 35822775 PMCID: PMC9245467 DOI: 10.3390/biotech10010003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2020] [Revised: 01/22/2021] [Accepted: 01/22/2021] [Indexed: 01/05/2023] Open
Abstract
Gene-environment (G×E) interaction is critical for understanding the genetic basis of complex disease beyond genetic and environment main effects. In addition to existing tools for interaction studies, penalized variable selection emerges as a promising alternative for dissecting G×E interactions. Despite the success, variable selection is limited in terms of accounting for multidimensional measurements. Published variable selection methods cannot accommodate structured sparsity in the framework of integrating multiomics data for disease outcomes. In this paper, we have developed a novel variable selection method in order to integrate multi-omics measurements in G×E interaction studies. Extensive studies have already revealed that analyzing omics data across multi-platforms is not only sensible biologically, but also resulting in improved identification and prediction performance. Our integrative model can efficiently pinpoint important regulators of gene expressions through sparse dimensionality reduction, and link the disease outcomes to multiple effects in the integrative G×E studies through accommodating a sparse bi-level structure. The simulation studies show the integrative model leads to better identification of G×E interactions and regulators than alternative methods. In two G×E lung cancer studies with high dimensional multi-omics data, the integrative model leads to an improved prediction and findings with important biological implications.
Collapse
|
10
|
Zhou F, Ren J, Lu X, Ma S, Wu C. Gene-Environment Interaction: A Variable Selection Perspective. Methods Mol Biol 2021; 2212:191-223. [PMID: 33733358 DOI: 10.1007/978-1-0716-0947-7_13] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2023]
Abstract
Gene-environment interactions have important implications for elucidating the genetic basis of complex diseases beyond the joint function of multiple genetic factors and their interactions (or epistasis). In the past, G × E interactions have been mainly conducted within the framework of genetic association studies. The high dimensionality of G × E interactions, due to the complicated form of environmental effects and the presence of a large number of genetic factors including gene expressions and SNPs, has motivated the recent development of penalized variable selection methods for dissecting G × E interactions, which has been ignored in the majority of published reviews on genetic interaction studies. In this article, we first survey existing studies on both gene-environment and gene-gene interactions. Then, after a brief introduction to the variable selection methods, we review penalization and relevant variable selection methods in marginal and joint paradigms, respectively, under a variety of conceptual models. Discussions on strengths and limitations, as well as computational aspects of the variable selection methods tailored for G × E studies, have also been provided.
Collapse
Affiliation(s)
- Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, KS, USA
| | - Jie Ren
- Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA
| | - Xi Lu
- Department of Statistics, Kansas State University, Manhattan, KS, USA
| | - Shuangge Ma
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, USA
| | - Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS, USA.
| |
Collapse
|
11
|
Yao Z, Zhang J, Zou X. A general index for linear and nonlinear correlations for high dimensional genomic data. BMC Genomics 2020; 21:846. [PMID: 33256599 PMCID: PMC7706065 DOI: 10.1186/s12864-020-07246-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2020] [Accepted: 11/18/2020] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND With the advance of high throughput sequencing, high-dimensional data are generated. Detecting dependence/correlation between these datasets is becoming one of most important issues in multi-dimensional data integration and co-expression network construction. RNA-sequencing data is widely used to construct gene regulatory networks. Such networks could be more accurate when methylation data, copy number aberration data and other types of data are introduced. Consequently, a general index for detecting relationships between high-dimensional data is indispensable. RESULTS We proposed a Kernel-Based RV-coefficient, named KBRV, for testing both linear and nonlinear correlation between two matrices by introducing kernel functions into RV2 (the modified RV-coefficient). Permutation test and other validation methods were used on simulated data to test the significance and rationality of KBRV. In order to demonstrate the advantages of KBRV in constructing gene regulatory networks, we applied this index on real datasets (ovarian cancer datasets and exon-level RNA-Seq data in human myeloid differentiation) to illustrate its superiority over vector correlation. CONCLUSIONS We concluded that KBRV is an efficient index for detecting both linear and nonlinear relationships in high dimensional data. The correlation method for high dimensional data has possible applications in the construction of gene regulatory network.
Collapse
Affiliation(s)
- Zhihao Yao
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430072 China
- Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan, 430072 China
| | - Jing Zhang
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430072 China
- Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan, 430072 China
| | - Xiufen Zou
- School of Mathematics and Statistics, Wuhan University, Wuhan, 430072 China
- Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan, 430072 China
| |
Collapse
|
12
|
Lai P, Wang F, Zhu T, Zhang Q. Model identification and selection for single-index varying-coefficient models. ANN I STAT MATH 2020. [DOI: 10.1007/s10463-020-00757-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
|
13
|
Ren J, Zhou F, Li X, Chen Q, Zhang H, Ma S, Jiang Y, Wu C. Semiparametric Bayesian variable selection for gene-environment interactions. Stat Med 2020; 39:617-638. [PMID: 31863500 PMCID: PMC7467082 DOI: 10.1002/sim.8434] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2019] [Revised: 09/26/2019] [Accepted: 11/02/2019] [Indexed: 11/06/2022]
Abstract
Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. Study of gene-environment (G×E) interactions is important for elucidating the disease etiology. Existing Bayesian methods for G×E interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. Many studies have shown the advantages of penalization methods in detecting G×E interactions in "large p, small n" settings. However, Bayesian variable selection, which can provide fresh insight into G×E study, has not been widely examined. We propose a novel and powerful semiparametric Bayesian variable selection model that can investigate linear and nonlinear G×E interactions simultaneously. Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from main-effects-only case within the Bayesian framework. Spike-and-slab priors are incorporated on both individual and group levels to identify the sparse main and interaction effects. The proposed method conducts Bayesian variable selection more efficiently than existing methods. Simulation shows that the proposed model outperforms competing alternatives in terms of both identification and prediction. The proposed Bayesian method leads to the identification of main and interaction effects with important implications in a high-throughput profiling study with high-dimensional SNP data.
Collapse
Affiliation(s)
- Jie Ren
- Department of Statistics, Kansas State University, Manhattan, Kansas
| | - Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, Kansas
| | - Xiaoxi Li
- Department of Statistics, Kansas State University, Manhattan, Kansas
| | - Qi Chen
- Department of Pharmacology, Toxicology and Therapeutics, University of Kansas Medical Center, Kansas City, Kansas
| | - Hongmei Zhang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, Tennessee
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut
| | - Yu Jiang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, Tennessee
| | - Cen Wu
- Department of Statistics, Kansas State University, Manhattan, Kansas
| |
Collapse
|
14
|
Penalized Variable Selection for Lipid-Environment Interactions in a Longitudinal Lipidomics Study. Genes (Basel) 2019; 10:genes10121002. [PMID: 31816972 PMCID: PMC6947406 DOI: 10.3390/genes10121002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2019] [Accepted: 11/26/2019] [Indexed: 12/20/2022] Open
Abstract
Lipid species are critical components of eukaryotic membranes. They play key roles in many biological processes such as signal transduction, cell homeostasis, and energy storage. Investigations of lipid-environment interactions, in addition to the lipid and environment main effects, have important implications in understanding the lipid metabolism and related changes in phenotype. In this study, we developed a novel penalized variable selection method to identify important lipid-environment interactions in a longitudinal lipidomics study. An efficient Newton-Raphson based algorithm was proposed within the generalized estimating equation (GEE) framework. We conducted extensive simulation studies to demonstrate the superior performance of our method over alternatives, in terms of both identification accuracy and prediction performance. As weight control via dietary calorie restriction and exercise has been demonstrated to prevent cancer in a variety of studies, analysis of the high-dimensional lipid datasets collected using 60 mice from the skin cancer prevention study identified meaningful markers that provide fresh insight into the underlying mechanism of cancer preventive effects.
Collapse
|
15
|
Wu M, Ma S. Robust semiparametric gene-environment interaction analysis using sparse boosting. Stat Med 2019; 38:4625-4641. [PMID: 31359454 PMCID: PMC6736719 DOI: 10.1002/sim.8322] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2018] [Revised: 04/02/2019] [Accepted: 06/19/2019] [Indexed: 12/25/2022]
Abstract
For the pathogenesis of complex diseases, gene-environment (G-E) interactions have been shown to have important implications. G-E interaction analysis can be challenging with the need to jointly analyze a large number of main effects and interactions and to respect the "main effects, interactions" hierarchical constraint. Extensive methodological developments on G-E interaction analysis have been conducted in recent literature. Despite considerable successes, most of the existing studies are still limited as they cannot accommodate long-tailed distributions/data contamination, make the restricted assumption of linear effects, and cannot effectively accommodate missingness in E variables. To directly tackle these problems, a semiparametric model is assumed to accommodate nonlinear effects, and the Huber loss function and Qn estimator are adopted to accommodate long-tailed distributions/data contamination. A regression-based multiple imputation approach is developed to accommodate missingness in E variables. For model estimation and selection of relevant variables, we adopt an effective sparse boosting approach. The proposed approach is practically well motivated, has intuitive formulations, and can be effectively realized. In extensive simulations, it significantly outperforms multiple direct competitors. The analysis of The Cancer Genome Atlas data on stomach adenocarcinoma and cutaneous melanoma shows that the proposed approach makes sensible discoveries with satisfactory prediction and stability.
Collapse
Affiliation(s)
- Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Department of Biostatistics, Yale University, New Haven, CT, USA
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, CT, USA
| |
Collapse
|
16
|
Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med 2019; 38:2074-2102. [PMID: 30652356 PMCID: PMC6492164 DOI: 10.1002/sim.8086] [Citation(s) in RCA: 477] [Impact Index Per Article: 95.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2017] [Revised: 08/23/2018] [Accepted: 11/02/2018] [Indexed: 12/11/2022]
Abstract
Simulation studies are computer experiments that involve creating data by pseudo-random sampling. A key strength of simulation studies is the ability to understand the behavior of statistical methods because some "truth" (usually some parameter/s of interest) is known from the process of generating the data. This allows us to consider properties of methods, such as bias. While widely used, simulation studies are often poorly designed, analyzed, and reported. This tutorial outlines the rationale for using simulation studies and offers guidance for design, execution, analysis, reporting, and presentation. In particular, this tutorial provides a structured approach for planning and reporting simulation studies, which involves defining aims, data-generating mechanisms, estimands, methods, and performance measures ("ADEMP"); coherent terminology for simulation studies; guidance on coding simulation studies; a critical discussion of key performance measures and their estimation; guidance on structuring tabular and graphical presentation of results; and new graphical presentations. With a view to describing recent practice, we review 100 articles taken from Volume 34 of Statistics in Medicine, which included at least one simulation study and identify areas for improvement.
Collapse
Affiliation(s)
- Tim P. Morris
- London Hub for Trials Methodology ResearchMRC Clinical Trials Unit at UCLLondonUnited Kingdom
| | - Ian R. White
- London Hub for Trials Methodology ResearchMRC Clinical Trials Unit at UCLLondonUnited Kingdom
| | - Michael J. Crowther
- Biostatistics Research Group, Department of Health SciencesUniversity of LeicesterLeicesterUnited Kingdom
| |
Collapse
|
17
|
Li Y, Li R, Lin C, Qin Y, Ma S. Penalized integrative semiparametric interaction analysis for multiple genetic datasets. Stat Med 2019; 38:3221-3242. [PMID: 30993736 DOI: 10.1002/sim.8172] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2018] [Revised: 02/08/2019] [Accepted: 03/27/2019] [Indexed: 12/19/2022]
Abstract
In this article, we consider a semiparametric additive partially linear interaction model for the integrative analysis of multiple genetic datasets. The goals are to identify important genetic predictors and gene-gene interactions and to estimate the nonparametric functions that describe the environmental effects at the same time. To find the similarities and differences of the genetic effects across different datasets, we impose a group structure on the regression coefficients matrix under the homogeneity assumption, ie, models for different datasets share the same sparsity structure, but the coefficients may differ across datasets. We develop an iterative approach to estimate the parameters of main effects, interactions and nonparametric functions, where a reparametrization of interaction parameters is implemented to meet the strong hierarchy assumption. We demonstrate the advantages of the proposed method in identification, estimation, and prediction in a series of numerical studies. We also apply the proposed method to the Skin Cutaneous Melanoma data and the lung cancer data from the Cancer Genome Atlas.
Collapse
Affiliation(s)
- Yang Li
- Center for Applied Statistics, Renmin University of China, Beijing, China.,School of Statistics, Renmin University of China, Beijing, China.,Statistical Consulting Center, Renmin University of China, Beijing, China
| | - Rong Li
- School of Statistics, Renmin University of China, Beijing, China.,Statistical Consulting Center, Renmin University of China, Beijing, China
| | - Cunjie Lin
- Center for Applied Statistics, Renmin University of China, Beijing, China.,School of Statistics, Renmin University of China, Beijing, China.,Statistical Consulting Center, Renmin University of China, Beijing, China
| | - Yichen Qin
- Department of Operations, Business Analytics and Information Systems, University of Cincinnati, Cincinatti, Ohio
| | - Shuangge Ma
- School of Statistics, Renmin University of China, Beijing, China.,Department of Biostatistics, Yale University, New Haven, Connecticut
| |
Collapse
|
18
|
Ren J, Du Y, Li S, Ma S, Jiang Y, Wu C. Robust network-based regularization and variable selection for high-dimensional genomic data in cancer prognosis. Genet Epidemiol 2019; 43:276-291. [PMID: 30746793 PMCID: PMC6446588 DOI: 10.1002/gepi.22194] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2018] [Revised: 11/19/2018] [Accepted: 11/29/2018] [Indexed: 12/21/2022]
Abstract
In cancer genomic studies, an important objective is to identify prognostic markers associated with patients' survival. Network-based regularization has achieved success in variable selections for high-dimensional cancer genomic data, because of its ability to incorporate the correlations among genomic features. However, as survival time data usually follow skewed distributions, and are contaminated by outliers, network-constrained regularization that does not take the robustness into account leads to false identifications of network structure and biased estimation of patients' survival. In this study, we develop a novel robust network-based variable selection method under the accelerated failure time model. Extensive simulation studies show the advantage of the proposed method over the alternative methods. Two case studies of lung cancer datasets with high-dimensional gene expression measurements demonstrate that the proposed approach has identified markers with important implications.
Collapse
Affiliation(s)
- Jie Ren
- Department of Statistics, Kansas State University, Manhattan, KS
| | - Yinhao Du
- Department of Statistics, Kansas State University, Manhattan, KS
| | - Shaoyu Li
- Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, CT
| | - Yu Jiang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN
| | - Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS
| |
Collapse
|
19
|
Wu M, Ma S. Robust genetic interaction analysis. Brief Bioinform 2019; 20:624-637. [PMID: 29897421 PMCID: PMC6556899 DOI: 10.1093/bib/bby033] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2018] [Revised: 03/22/2018] [Indexed: 01/17/2023] Open
Abstract
For the risk, progression, and response to treatment of many complex diseases, it has been increasingly recognized that genetic interactions (including gene-gene and gene-environment interactions) play important roles beyond the main genetic and environmental effects. In practical genetic interaction analyses, model mis-specification and outliers/contaminations in response variables and covariates are not uncommon, and demand robust analysis methods. Compared with their nonrobust counterparts, robust genetic interaction analysis methods are significantly less popular but are gaining attention fast. In this article, we provide a comprehensive review of robust genetic interaction analysis methods, on their methodologies and applications, for both marginal and joint analysis, and for addressing model mis-specification as well as outliers/contaminations in response variables and covariates.
Collapse
Affiliation(s)
- Mengyun Wu
- Mengyun Wu and Shuangge Ma, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China and Yale School of Public Health, New Haven, CT 06520, USA
| | - Shuangge Ma
- Mengyun Wu and Shuangge Ma, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China and Yale School of Public Health, New Haven, CT 06520, USA
| |
Collapse
|
20
|
Wu C, Zhou F, Ren J, Li X, Jiang Y, Ma S. A Selective Review of Multi-Level Omics Data Integration Using Variable Selection. High Throughput 2019; 8:E4. [PMID: 30669303 PMCID: PMC6473252 DOI: 10.3390/ht8010004] [Citation(s) in RCA: 114] [Impact Index Per Article: 22.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2018] [Revised: 12/24/2018] [Accepted: 01/10/2019] [Indexed: 01/02/2023] Open
Abstract
High-throughput technologies have been used to generate a large amount of omics data. In the past, single-level analysis has been extensively conducted where the omics measurements at different levels, including mRNA, microRNA, CNV and DNA methylation, are analyzed separately. As the molecular complexity of disease etiology exists at all different levels, integrative analysis offers an effective way to borrow strength across multi-level omics data and can be more powerful than single level analysis. In this article, we focus on reviewing existing multi-omics integration studies by paying special attention to variable selection methods. We first summarize published reviews on integrating multi-level omics data. Next, after a brief overview on variable selection methods, we review existing supervised, semi-supervised and unsupervised integrative analyses within parallel and hierarchical integration studies, respectively. The strength and limitations of the methods are discussed in detail. No existing integration method can dominate the rest. The computation aspects are also investigated. The review concludes with possible limitations and future directions for multi-level omics data integration.
Collapse
Affiliation(s)
- Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
| | - Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
| | - Jie Ren
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
| | - Xiaoxi Li
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
| | - Yu Jiang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN 38152, USA.
| | - Shuangge Ma
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT 06510, USA.
| |
Collapse
|
21
|
Xu Y, Wu M, Ma S, Ahmed SE. Robust gene-environment interaction analysis using penalized trimmed regression. J STAT COMPUT SIM 2018; 88:3502-3528. [PMID: 30718937 PMCID: PMC6358205 DOI: 10.1080/00949655.2018.1523411] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2018] [Accepted: 09/09/2018] [Indexed: 12/25/2022]
Abstract
In biomedical and epidemiological studies, gene-environment (G-E) interactions have been shown to importantly contribute to the etiology and progression of many complex diseases. Most existing approaches for identifying G-E interactions are limited by the lack of robustness against outliers/contaminations in response and predictor spaces. In this study, we develop a novel robust G-E identification approach using the trimmed regression technique under joint modeling. A robust data-driven criterion and stability selection are adopted to determine the trimmed subset which is free from both vertical outliers and leverage points. An effective penalization approach is developed to identify important G-E interactions, respecting the "main effects, interactions" hierarchical structure. Extensive simulations demonstrate the better performance of the proposed approach compared to multiple alternatives. Interesting findings with superior prediction accuracy and stability are observed in the analysis of TCGA data on cutaneous melanoma and breast invasive carcinoma.
Collapse
Affiliation(s)
- Yaqing Xu
- Department of Biostatistics, Yale University, New Haven, CT, USA
| | - Mengyun Wu
- Department of Biostatistics, Yale University, New Haven, CT, USA
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, CT, USA
| | - Syed Ejaz Ahmed
- Department of Mathematics and Statistics, Brock University, Canada
| |
Collapse
|
22
|
Shim J, Hwang C, Jeong S, Sohn I. Semivarying coefficient least-squares support vector regression for analyzing high-dimensional gene-environmental data. J Appl Stat 2018. [DOI: 10.1080/02664763.2017.1371676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
Affiliation(s)
- Jooyong Shim
- Institute of Statistical Information, Department of Statistics, Inje University, Kyungnam, South Korea
| | - Changha Hwang
- Department of Applied Statistics, Dankook University, Gyeonggido, South Korea
| | - Sunjoo Jeong
- Department of Bioconvergent Science and Technology, Dankook University, Gyeonggido, South Korea
| | - Insuk Sohn
- Statistics and Data Center, Samsung Medical Center, Seoul, South Korea
| |
Collapse
|
23
|
Wu C, Jiang Y, Ren J, Cui Y, Ma S. Dissecting gene-environment interactions: A penalized robust approach accounting for hierarchical structures. Stat Med 2017; 37:437-456. [PMID: 29034484 DOI: 10.1002/sim.7518] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2016] [Revised: 07/30/2017] [Accepted: 09/07/2017] [Indexed: 12/26/2022]
Abstract
Identification of gene-environment (G × E) interactions associated with disease phenotypes has posed a great challenge in high-throughput cancer studies. The existing marginal identification methods have suffered from not being able to accommodate the joint effects of a large number of genetic variants, while some of the joint-effect methods have been limited by failing to respect the "main effects, interactions" hierarchy, by ignoring data contamination, and by using inefficient selection techniques under complex structural sparsity. In this article, we develop an effective penalization approach to identify important G × E interactions and main effects, which can account for the hierarchical structures of the 2 types of effects. Possible data contamination is accommodated by adopting the least absolute deviation loss function. The advantage of the proposed approach over the alternatives is convincingly demonstrated in both simulation and a case study on lung cancer prognosis with gene expression measurements and clinical covariates under the accelerated failure time model.
Collapse
Affiliation(s)
- Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
| | - Yu Jiang
- Division of Epidemiology, Biostatistics, and Environmental Health, University of Memphis, Memphis, TN 38111, USA
| | - Jie Ren
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
| | - Yuehua Cui
- Department of Statistics and Probability, Michigan State University, 619 Red Cedar Rd, East Lansing, MI 48824, USA
| | - Shuangge Ma
- Department of Biostatistics, Yale University, 60 College Street, New Haven, CT 06520, USA
| |
Collapse
|
24
|
Wu C, Ma S. A selective review of robust variable selection with applications in bioinformatics. Brief Bioinform 2015; 16:873-83. [PMID: 25479793 PMCID: PMC4570200 DOI: 10.1093/bib/bbu046] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2014] [Revised: 10/20/2014] [Indexed: 11/13/2022] Open
Abstract
A drastic amount of data have been and are being generated in bioinformatics studies. In the analysis of such data, the standard modeling approaches can be challenged by the heavy-tailed errors and outliers in response variables, the contamination in predictors (which may be caused by, for instance, technical problems in microarray gene expression studies), model mis-specification and others. Robust methods are needed to tackle these challenges. When there are a large number of predictors, variable selection can be as important as estimation. As a generic variable selection and regularization tool, penalization has been extensively adopted. In this article, we provide a selective review of robust penalized variable selection approaches especially designed for high-dimensional data from bioinformatics and biomedical studies. We discuss the robust loss functions, penalty functions and computational algorithms. The theoretical properties and implementation are also briefly examined. Application examples of the robust penalization approaches in representative bioinformatics and biomedical studies are also illustrated.
Collapse
|