1. Sun N, Chu J, He Q, Wang Y, Han Q, Yi N, Zhang R, Shen Y. BHAFT: Bayesian heredity-constrained accelerated failure time models for detecting gene-environment interactions in survival analysis. Stat Med 2024. [PMID: 38963094 DOI: 10.1002/sim.10145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2023] [Revised: 05/06/2024] [Accepted: 06/06/2024] [Indexed: 07/05/2024]
Abstract
In addition to considering the main effects, understanding gene-environment (G × E) interactions is imperative for determining the etiology of diseases and the factors that affect their prognosis. In the existing statistical framework for censored survival outcomes, there are several challenges in detecting G × E interactions, such as handling high-dimensional omics data, diverse environmental factors, and algorithmic complications in survival analysis. The effect heredity principle has been widely used in studies involving interaction identification because it incorporates the dependence of the main and interaction effects. However, Bayesian survival models that incorporate the assumption of this principle have not been developed. Therefore, we propose Bayesian heredity-constrained accelerated failure time (BHAFT) models for identifying main and interaction (M-I) effects with novel spike-and-slab or regularized horseshoe priors to incorporate the effect heredity principle. The R package rstan was used to fit the proposed models. Extensive simulations demonstrated that BHAFT models outperformed other existing models in terms of signal identification, coefficient estimation, and prognosis prediction. Biologically plausible G × E interactions associated with the prognosis of lung adenocarcinoma were identified using our proposed model. Notably, BHAFT models incorporating the effect heredity principle could identify both main and interaction effects, which are highly useful in exploring G × E interactions in high-dimensional survival analysis. The code and data used in our paper are available at https://github.com/SunNa-bayesian/BHAFT.
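The effect heredity principle referred to in this abstract restricts which interactions may enter a model given the active main effects. As a rough illustration only (this is not the BHAFT prior construction, which the abstract does not spell out), the sketch below lists the G × E pairs permitted under strong heredity (both parent main effects active) and weak heredity (at least one parent active). The function name and the gene and environment labels are hypothetical.

```python
from typing import Iterable, Set, Tuple


def allowed_interactions(
    active_g: Iterable[str],
    active_e: Iterable[str],
    all_g: Iterable[str],
    all_e: Iterable[str],
    strong: bool = True,
) -> Set[Tuple[str, str]]:
    """Return the G x E pairs permitted under the effect heredity principle.

    strong=True  : both parent main effects must be active.
    strong=False : weak heredity, at least one parent must be active.
    """
    g_on, e_on = set(active_g), set(active_e)
    g_all, e_all = list(all_g), list(all_e)
    pairs = set()
    for g in g_all:
        for e in e_all:
            if strong:
                permitted = g in g_on and e in e_on
            else:
                permitted = g in g_on or e in e_on
            if permitted:
                pairs.add((g, e))
    return pairs


# Hypothetical example: three genes, two environmental factors,
# with only G1 and smoking selected as main effects.
genes = ["G1", "G2", "G3"]
envs = ["smoking", "age"]
strong_pairs = allowed_interactions(["G1"], ["smoking"], genes, envs, strong=True)
weak_pairs = allowed_interactions(["G1"], ["smoking"], genes, envs, strong=False)
print(sorted(strong_pairs))
print(sorted(weak_pairs))
```

Under strong heredity only the pair whose two parents are both selected survives, whereas weak heredity also admits every pair that shares one selected parent.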
Affiliation(s)
- Na Sun
  - Department of Epidemiology and Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China
- Jiadong Chu
  - Department of Epidemiology and Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China
- Qida He
  - Department of Epidemiology and Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China
- Yu Wang
  - Department of Epidemiology and Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China
- Qiang Han
  - Department of Epidemiology and Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China
- Nengjun Yi
  - Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, Alabama, USA
- Ruyang Zhang
  - Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China
- Yueping Shen
  - Department of Epidemiology and Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China

2. Liang W, Zhang Q, Ma S. Hierarchical False Discovery Rate Control for High-dimensional Survival Analysis with Interactions. Comput Stat Data Anal 2024; 192:107906. [PMID: 38098875 PMCID: PMC10718515 DOI: 10.1016/j.csda.2023.107906] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2023]
Abstract
With the development of data collection techniques, analysis with a survival response and high-dimensional covariates has become routine. Here we consider an interaction model, which includes a set of low-dimensional covariates, a set of high-dimensional covariates, and their interactions. This model has been motivated by gene-environment (G-E) interaction analysis, where the E variables have a low dimension, and the G variables have a high dimension. For such a model, there has been extensive research on estimation and variable selection. Comparatively, inference studies with a valid false discovery rate (FDR) control have been very limited. The existing high-dimensional inference tools cannot be directly applied to interaction models, as interactions and main effects are not "equal". In this article, for high-dimensional survival analysis with interactions, we model survival using the Accelerated Failure Time (AFT) model and adopt a "weighted least squares + debiased Lasso" approach for estimation and selection. A hierarchical FDR control approach is developed for inference and respect of the "main effects, interactions" hierarchy. The asymptotic distribution properties of the debiased Lasso estimators are rigorously established. Simulation demonstrates the satisfactory performance of the proposed approach, and the analysis of a breast cancer dataset further establishes its practical utility.
Affiliation(s)
- Weijuan Liang
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
- Qingzhao Zhang
  - Department of Statistics and Data Science, School of Economics, The Wang Yanan Institute for Studies in Economics, and Fujian Key Lab of Statistics, Xiamen University, Xiamen, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA

3. Ren J, Zhou F, Li X, Ma S, Jiang Y, Wu C. Robust Bayesian variable selection for gene-environment interactions. Biometrics 2023; 79:684-694. [PMID: 35394058 PMCID: PMC11086965 DOI: 10.1111/biom.13670] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 06/15/2020] [Revised: 03/23/2022] [Accepted: 03/28/2022] [Indexed: 11/30/2022]
Abstract
Gene-environment (G × E) interactions have important implications to elucidate the etiology of complex diseases beyond the main genetic and environmental effects. Outliers and data contamination in disease phenotypes of G × E studies have been commonly encountered, leading to the development of a broad spectrum of robust regularization methods. Nevertheless, within the Bayesian framework, the issue has not been taken care of in existing studies. We develop a fully Bayesian robust variable selection method for G × E interaction studies. The proposed Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. In particular, for the robust sparse group selection, the spike-and-slab priors have been imposed on both individual and group levels to identify important main and interaction effects robustly. An efficient Gibbs sampler has been developed to facilitate fast computation. Extensive simulation studies, analysis of diabetes data with single-nucleotide polymorphism measurements from the Nurses' Health Study, and The Cancer Genome Atlas melanoma data with gene expression measurements demonstrate the superior performance of the proposed method over multiple competing alternatives.
Affiliation(s)
- Jie Ren
  - Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, Kansas, USA
- Xiaoxi Li
  - Department of Statistics, Kansas State University, Manhattan, Kansas, USA
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut, USA
- Yu Jiang
  - Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, Tennessee, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, Kansas, USA

4. Zhou F, Liu Y, Ren J, Wang W, Wu C. Springer: An R package for bi-level variable selection of high-dimensional longitudinal data. Front Genet 2023; 14:1088223. [PMID: 37091810 PMCID: PMC10117642 DOI: 10.3389/fgene.2023.1088223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2022] [Accepted: 02/28/2023] [Indexed: 04/09/2023]
Abstract
In high-dimensional data analysis, the bi-level (or the sparse group) variable selection can simultaneously conduct penalization on the group level and within groups, which has been developed for continuous, binary, and survival responses in the literature. Zhou et al. (2022) (PMID: 35766061) has further extended it under the longitudinal response by proposing a quadratic inference function-based penalization method in gene–environment interaction studies. This study introduces “springer,” an R package implementing the bi-level variable selection within the QIF framework developed in Zhou et al. (2022). In addition, R package “springer” has also implemented the generalized estimating equation-based sparse group penalization method. Alternative methods focusing only on the group level or individual level have also been provided by the package. In this study, we have systematically introduced the longitudinal penalization methods implemented in the “springer” package. We demonstrate the usage of the core and supporting functions, which is followed by the numerical examples and discussions. R package “springer” is available at https://cran.r-project.org/package=springer.
Affiliation(s)
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, KS, United States
- Yuwen Liu
  - Department of Statistics, Kansas State University, Manhattan, KS, United States
- Jie Ren
  - Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, IN, United States
- Weiqun Wang
  - Department of Food, Nutrition, Dietetics and Health, Kansas State University, Manhattan, KS, United States
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS, United States
  - *Correspondence: Cen Wu

5. Wu S, Xu Y, Zhang Q, Ma S. Gene-environment interaction analysis via deep learning. Genet Epidemiol 2023; 47:261-286. [PMID: 36807383 PMCID: PMC10244912 DOI: 10.1002/gepi.22518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Revised: 12/17/2022] [Accepted: 02/01/2023] [Indexed: 02/21/2023]
Abstract
Gene-environment (G-E) interaction analysis plays an important role in studying complex diseases. Extensive methodological research has been conducted on G-E interaction analysis, and the existing methods are mostly based on regression techniques. In many fields including biomedicine and omics, it has been increasingly recognized that deep learning may outperform regression with its unique flexibility (e.g., in accommodating unspecified nonlinear effects) and superior prediction performance. However, there has been a lack of development in deep learning for G-E interaction analysis. In this article, we fill this important knowledge gap and develop a new analysis approach based on deep neural network in conjunction with penalization. The proposed approach can simultaneously conduct model estimation and selection (of important main G effects and G-E interactions), while uniquely respecting the "main effects, interactions" variable selection hierarchy. Simulation shows that it has superior prediction and feature selection performance. The analysis of data on lung adenocarcinoma and skin cutaneous melanoma overall survival further establishes its practical utility. Overall, this study can advance G-E interaction analysis by delivering a powerful new analysis approach based on modern deep learning.
Affiliation(s)
- Shuni Wu
  - The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
- Yaqing Xu
  - Department of Epidemiology and Biostatistics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Qingzhao Zhang
  - The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
  - Department of Statistics and Data Science, School of Economics and Fujian Key Lab of Statistics, Xiamen University, Xiamen, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA

6. Liu M, Zhang Q, Ma S. A tree-based gene-environment interaction analysis with rare features. Stat Anal Data Min 2022; 15:648-674. [PMID: 38046814 PMCID: PMC10691867 DOI: 10.1002/sam.11578] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/14/2021] [Accepted: 02/14/2022] [Indexed: 01/20/2023]
Abstract
Gene-environment (G-E) interaction analysis plays a critical role in understanding and modeling complex diseases. Compared to main-effect-only analysis, it is more seriously challenged by higher dimensionality, weaker signals, and the unique "main effects, interactions" variable selection hierarchy. In joint G-E interaction analysis under which a large number of G factors are analysed in a single model, effort tailored to rare features (e.g., SNPs with low minor allele frequencies) has been limited. Existing investigations on rare features have been mostly focused on marginal analysis, where various data aggregation techniques have been developed, and hypothesis testings have been conducted to identify significant aggregated features. However, such techniques cannot be extended to joint G-E interaction analysis. In this study, building on a very recent tree-based data aggregation technique, which has been developed for main-effect-only analysis, we develop a new G-E interaction analysis approach tailored to rare features. The adopted data aggregation technique allows for more efficient information borrowing from neighboring rare features. Similar to some existing state-of-the-art ones, the proposed approach adopts penalization for variable selection, regularized estimation, and respect of the variable selection hierarchy. Simulation shows that it has more accurate identification of important interactions and main effects than several competing alternatives. In the analysis of NFBC1966 study, the proposed approach leads to findings different from the alternatives and with satisfactory prediction and stability performance.
Affiliation(s)
- Mengque Liu
  - School of Journalism and New Media, Xi’an Jiaotong University, Shanxi Xi’an, China
- Qingzhao Zhang
  - Department of Statistics and Data Science, School of Economics, Wang Yanan Institute for Studies in Economics, and Fujian Key Lab of Statistics, Xiamen University, Fujian Xiamen, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA

7. Zhou F, Lu X, Ren J, Fan K, Ma S, Wu C. Sparse group variable selection for gene-environment interactions in the longitudinal study. Genet Epidemiol 2022; 46:317-340. [PMID: 35766061 DOI: 10.1002/gepi.22461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2021] [Revised: 01/31/2022] [Accepted: 03/15/2022] [Indexed: 11/06/2022]
Abstract
Penalized variable selection for high-dimensional longitudinal data has received much attention as it can account for the correlation among repeated measurements while providing additional and essential information for improved identification and prediction performance. Despite the success, in longitudinal studies, the potential of penalization methods is far from fully understood for accommodating structured sparsity. In this article, we develop a sparse group penalization method to conduct the bi-level gene-environment (G × E) interaction study under the repeatedly measured phenotype. Within the quadratic inference function framework, the proposed method can achieve simultaneous identification of main and interaction effects on both the group and individual levels. Simulation studies have shown that the proposed method outperforms major competitors. In the case study of asthma data from the Childhood Asthma Management Program, we conduct G × E study by using high-dimensional single nucleotide polymorphism data as genetic factors and the longitudinal trait, forced expiratory volume in 1 s, as the phenotype. Our method leads to improved prediction and identification of main and interaction effects with important implications.
Affiliation(s)
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
- Xi Lu
  - Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
- Jie Ren
  - Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, Indiana, 46202, USA
- Kun Fan
  - Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut, 06520, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, Kansas, 66506, USA

8. Zhou F, Ren J, Liu Y, Li X, Wang W, Wu C. Interep: An R Package for High-Dimensional Interaction Analysis of the Repeated Measurement Data. Genes (Basel) 2022; 13:genes13030544. [PMID: 35328097 PMCID: PMC8950762 DOI: 10.3390/genes13030544] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/09/2022] [Revised: 03/12/2022] [Accepted: 03/13/2022] [Indexed: 02/05/2023]
Abstract
We introduce interep, an R package for interaction analysis of repeated measurement data with high-dimensional main and interaction effects. In G × E interaction studies, the forms of environmental factors play a critical role in determining how structured sparsity should be imposed in the high-dimensional scenario to identify important effects. Zhou et al. (2019) (PMID: 31816972) proposed a longitudinal penalization method to select main and interaction effects corresponding to the individual and group structure, respectively, which requires a mixture of individual and group level penalties. The R package interep implements generalized estimating equation (GEE)-based penalization methods with this sparsity assumption. Moreover, alternative methods have also been implemented in the package. These alternative methods merely select effects on an individual level and ignore the group-level interaction structure. In this software article, we first introduce the statistical methodology corresponding to the penalized GEE methods implemented in the package. Next, we present the usage of the core and supporting functions, which is followed by a simulation example with R codes and annotations. The R package interep is available at The Comprehensive R Archive Network (CRAN).
Affiliation(s)
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Jie Ren
  - Department of Biostatistics and Health Data Sciences, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Yuwen Liu
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Xiaoxi Li
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Weiqun Wang
  - Department of Food, Nutrition, Dietetics and Health, Kansas State University, Manhattan, KS 66506, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
  - Correspondence: Tel.: +1-7855322231

9. Ren M, Zhang S, Ma S, Zhang Q. Gene-environment interaction identification via penalized robust divergence. Biom J 2022; 64:461-480. [PMID: 34725857 PMCID: PMC9386692 DOI: 10.1002/bimj.202000157] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/24/2020] [Revised: 06/01/2021] [Accepted: 08/23/2021] [Indexed: 12/11/2022]
Abstract
In high-throughput cancer studies, gene-environment interactions associated with outcomes have important implications. Some commonly adopted identification methods do not respect the "main effect, interaction" hierarchical structure. In addition, they can be challenged by data contamination and/or long-tailed distributions, which are not uncommon. In this article, robust methods based on γ-divergence and density power divergence are proposed to accommodate contaminated data/long-tailed distributions. A hierarchical sparse group penalty is adopted for regularized estimation and selection and can identify important gene-environment interactions and respect the "main effect, interaction" hierarchical structure. The proposed methods are implemented using an effective group coordinate descent algorithm. Simulation shows that when contamination occurs, the proposed methods can significantly outperform the existing alternatives with more accurate identification. The proposed approach is applied to the analysis of The Cancer Genome Atlas (TCGA) triple-negative breast cancer data and Gene Environment Association Studies (GENEVA) Type 2 Diabetes data.
Affiliation(s)
- Mingyang Ren
  - School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, P. R. China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, P. R. China
- Sanguo Zhang
  - School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, P. R. China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, P. R. China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Qingzhao Zhang
  - Department of Statistics and Data Science, School of Economics, Wang Yanan Institute for Studies in Economics, Fujian Key Lab of Statistics, Xiamen University, Fujian, P. R. China

10. Zemlianskaia N, Gauderman WJ, Lewinger JP. A scalable hierarchical lasso for gene-environment interactions. J Comput Graph Stat 2022; 31:1091-1103. [PMID: 36793591 PMCID: PMC9928188 DOI: 10.1080/10618600.2022.2039161] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/19/2022]
Abstract
We describe a regularized regression model for the selection of gene-environment (G×E) interactions. The model focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure. We propose an efficient fitting algorithm and screening rules that can discard large numbers of irrelevant predictors with high accuracy. We present simulation results showing that the model outperforms existing joint selection methods for (G×E) interactions in terms of selection performance, scalability and speed, and provide a real data application. Our implementation is available in the gesso R package.
Affiliation(s)
- Natalia Zemlianskaia
  - Division of Biostatistics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA
- W. James Gauderman
  - Division of Biostatistics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA
- Juan Pablo Lewinger
  - Division of Biostatistics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA

11. Spirko-Burns L, Devarajan K. Supervised Dimension Reduction for Large-Scale "Omics" Data With Censored Survival Outcomes Under Possible Non-Proportional Hazards. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:2032-2044. [PMID: 31940547 DOI: 10.1109/tcbb.2020.2965934] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/10/2023]
Abstract
The past two decades have witnessed significant advances in high-throughput "omics" technologies such as genomics, proteomics, metabolomics, transcriptomics and radiomics. These technologies have enabled simultaneous measurement of the expression levels of tens of thousands of features from individual patient samples and have generated enormous amounts of data that require analysis and interpretation. One specific area of interest has been in studying the relationship between these features and patient outcomes, such as overall and recurrence-free survival, with the goal of developing a predictive "omics" profile. Large-scale studies often suffer from the presence of a large fraction of censored observations and potential time-varying effects of features, and methods for handling them have been lacking. In this paper, we propose supervised methods for feature selection and survival prediction that simultaneously deal with both issues. Our approach utilizes continuum power regression (CPR) - a framework that includes a variety of regression methods - in conjunction with the parametric or semi-parametric accelerated failure time (AFT) model. Both CPR and AFT fall within the linear models framework and, unlike black-box models, the proposed prognostic index has a simple yet useful interpretation. We demonstrate the utility of our methods using simulated and publicly available cancer genomics data.

12. Uncovering Evidence for Endocrine-Disrupting Chemicals That Elicit Differential Susceptibility through Gene-Environment Interactions. Toxics 2021; 9:toxics9040077. [PMID: 33917455 PMCID: PMC8067468 DOI: 10.3390/toxics9040077] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/03/2021] [Revised: 03/27/2021] [Accepted: 04/02/2021] [Indexed: 12/17/2022]
Abstract
Exposure to endocrine-disrupting chemicals (EDCs) is linked to myriad disorders, characterized by the disruption of the complex endocrine signaling pathways that govern development, physiology, and even behavior across the entire body. The mechanisms of endocrine disruption involve a complex system of pathways that communicate across the body to stimulate specific receptors that bind DNA and regulate the expression of a suite of genes. These mechanisms, including gene regulation, DNA binding, and protein binding, can be tied to differences in individual susceptibility across a genetically diverse population. In this review, we posit that EDCs causing such differential responses may be identified by looking for a signal of population variability after exposure. We begin by summarizing how the biology of EDCs has implications for genetically diverse populations. We then describe how gene-environment interactions (GxE) across the complex pathways of endocrine signaling could lead to differences in susceptibility. We survey examples in the literature of individual susceptibility differences to EDCs, pointing to a need for research in this area, especially regarding the exceedingly complex thyroid pathway. Following a discussion of experimental designs to better identify and study GxE across EDCs, we present a case study of a high-throughput screening signal of putative GxE within known endocrine disruptors. We conclude with a call for further, deeper analysis of the EDCs, particularly the thyroid disruptors, to identify if these chemicals participate in GxE leading to differences in susceptibility.

13. Du Y, Fan K, Lu X, Wu C. Integrating Multi-Omics Data for Gene-Environment Interactions. BioTech 2021; 10:biotech10010003. [PMID: 35822775 PMCID: PMC9245467 DOI: 10.3390/biotech10010003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/24/2020] [Revised: 01/22/2021] [Accepted: 01/22/2021] [Indexed: 01/05/2023]
Abstract
Gene-environment (G×E) interaction is critical for understanding the genetic basis of complex disease beyond genetic and environmental main effects. In addition to existing tools for interaction studies, penalized variable selection emerges as a promising alternative for dissecting G×E interactions. Despite the success, variable selection is limited in terms of accounting for multidimensional measurements. Published variable selection methods cannot accommodate structured sparsity in the framework of integrating multi-omics data for disease outcomes. In this paper, we have developed a novel variable selection method in order to integrate multi-omics measurements in G×E interaction studies. Extensive studies have already revealed that analyzing omics data across multiple platforms is not only biologically sensible but also results in improved identification and prediction performance. Our integrative model can efficiently pinpoint important regulators of gene expressions through sparse dimensionality reduction, and link the disease outcomes to multiple effects in the integrative G×E studies through accommodating a sparse bi-level structure. The simulation studies show the integrative model leads to better identification of G×E interactions and regulators than alternative methods. In two G×E lung cancer studies with high-dimensional multi-omics data, the integrative model leads to improved prediction and findings with important biological implications.

14. Zhou F, Ren J, Lu X, Ma S, Wu C. Gene-Environment Interaction: A Variable Selection Perspective. Methods Mol Biol 2021; 2212:191-223. [PMID: 33733358 DOI: 10.1007/978-1-0716-0947-7_13] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 01/31/2023]
Abstract
Gene-environment interactions have important implications for elucidating the genetic basis of complex diseases beyond the joint function of multiple genetic factors and their interactions (or epistasis). In the past, G × E interactions have been mainly conducted within the framework of genetic association studies. The high dimensionality of G × E interactions, due to the complicated form of environmental effects and the presence of a large number of genetic factors including gene expressions and SNPs, has motivated the recent development of penalized variable selection methods for dissecting G × E interactions, which has been ignored in the majority of published reviews on genetic interaction studies. In this article, we first survey existing studies on both gene-environment and gene-gene interactions. Then, after a brief introduction to the variable selection methods, we review penalization and relevant variable selection methods in marginal and joint paradigms, respectively, under a variety of conceptual models. Discussions on strengths and limitations, as well as computational aspects of the variable selection methods tailored for G × E studies, have also been provided.
Affiliation(s)
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
- Jie Ren
  - Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA
- Xi Lu
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
- Shuangge Ma
  - Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
15
Yao Z, Zhang J, Zou X. A general index for linear and nonlinear correlations for high dimensional genomic data. BMC Genomics 2020; 21:846. [PMID: 33256599 PMCID: PMC7706065 DOI: 10.1186/s12864-020-07246-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/24/2020] [Accepted: 11/18/2020] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND: With the advance of high-throughput sequencing, high-dimensional data are generated. Detecting dependence/correlation between these datasets is becoming one of the most important issues in multi-dimensional data integration and co-expression network construction. RNA-sequencing data are widely used to construct gene regulatory networks. Such networks could be more accurate when methylation data, copy number aberration data and other types of data are introduced. Consequently, a general index for detecting relationships between high-dimensional data is indispensable. RESULTS: We proposed a Kernel-Based RV-coefficient, named KBRV, for testing both linear and nonlinear correlation between two matrices by introducing kernel functions into RV2 (the modified RV-coefficient). Permutation tests and other validation methods were used on simulated data to test the significance and rationality of KBRV. To demonstrate the advantages of KBRV in constructing gene regulatory networks, we applied this index to real datasets (ovarian cancer datasets and exon-level RNA-Seq data in human myeloid differentiation) to illustrate its superiority over vector correlation. CONCLUSIONS: We concluded that KBRV is an efficient index for detecting both linear and nonlinear relationships in high-dimensional data. The correlation method for high-dimensional data has possible applications in the construction of gene regulatory networks.
Affiliation(s)
- Zhihao Yao
  - School of Mathematics and Statistics, Wuhan University, Wuhan, 430072 China
  - Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan, 430072 China
- Jing Zhang
  - School of Mathematics and Statistics, Wuhan University, Wuhan, 430072 China
  - Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan, 430072 China
- Xiufen Zou
  - School of Mathematics and Statistics, Wuhan University, Wuhan, 430072 China
  - Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan, 430072 China
16
Mbemi A, Khanna S, Njiki S, Yedjou CG, Tchounwou PB. Impact of Gene-Environment Interactions on Cancer Development. Int J Environ Res Public Health 2020; 17:E8089. [PMID: 33153024 PMCID: PMC7662361 DOI: 10.3390/ijerph17218089] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 10/02/2020] [Revised: 10/26/2020] [Accepted: 10/30/2020] [Indexed: 12/24/2022]
Abstract
Several epidemiological and experimental studies have demonstrated that many human diseases are not only caused by specific genetic and environmental factors but also by gene-environment interactions. Although it has been widely reported that genetic polymorphisms play a critical role in human susceptibility to cancer and other chronic disease conditions, many single nucleotide polymorphisms (SNPs) are caused by somatic mutations resulting from human exposure to environmental stressors. Scientific evidence suggests that the etiology of many chronic illnesses is caused by the joint effect between genetics and the environment. Research has also pointed out that the interactions of environmental factors with specific allelic variants highly modulate the susceptibility to diseases. Hence, many scientific discoveries on gene-environment interactions have elucidated the impact of their combined effect on the incidence and/or prevalence rate of human diseases. In this review, we provide an overview of the nature of gene-environment interactions, and discuss their role in human cancers, with special emphases on lung, colorectal, bladder, breast, ovarian, and prostate cancers.
Affiliation(s)
- Ariane Mbemi
  - NIH/NIMHD RCMI-Center for Health Disparities Research, Jackson State University, 1400 Lynch Street, Box 18750, Jackson, MS 39217, USA
  - Department of Biology, College of Science, Engineering and Technology, Jackson State University, 1400 Lynch Street, Box 18540, Jackson, MS 39217, USA
- Sunali Khanna
  - Department of Oral Medicine and Radiology, Nair Hospital Dental College, Municipal Corporation of Greater Mumbai, Mumbai 400 008, India
- Sylvianne Njiki
  - NIH/NIMHD RCMI-Center for Health Disparities Research, Jackson State University, 1400 Lynch Street, Box 18750, Jackson, MS 39217, USA
  - Department of Biology, College of Science, Engineering and Technology, Jackson State University, 1400 Lynch Street, Box 18540, Jackson, MS 39217, USA
- Clement G. Yedjou
  - Department of Biological Sciences, College of Science and Technology, Florida Agricultural and Mechanical University, 1610 S. Martin Luther King Blvd., Tallahassee, FL 32307, USA
- Paul B. Tchounwou
  - NIH/NIMHD RCMI-Center for Health Disparities Research, Jackson State University, 1400 Lynch Street, Box 18750, Jackson, MS 39217, USA
  - Department of Biology, College of Science, Engineering and Technology, Jackson State University, 1400 Lynch Street, Box 18540, Jackson, MS 39217, USA
17
Ren J, Zhou F, Li X, Chen Q, Zhang H, Ma S, Jiang Y, Wu C. Semiparametric Bayesian variable selection for gene-environment interactions. Stat Med 2020; 39:617-638. [PMID: 31863500 PMCID: PMC7467082 DOI: 10.1002/sim.8434] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 05/31/2019] [Revised: 09/26/2019] [Accepted: 11/02/2019] [Indexed: 11/06/2022]
Abstract
Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. Study of gene-environment (G×E) interactions is important for elucidating the disease etiology. Existing Bayesian methods for G×E interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. Many studies have shown the advantages of penalization methods in detecting G×E interactions in "large p, small n" settings. However, Bayesian variable selection, which can provide fresh insight into G×E study, has not been widely examined. We propose a novel and powerful semiparametric Bayesian variable selection model that can investigate linear and nonlinear G×E interactions simultaneously. Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from main-effects-only case within the Bayesian framework. Spike-and-slab priors are incorporated on both individual and group levels to identify the sparse main and interaction effects. The proposed method conducts Bayesian variable selection more efficiently than existing methods. Simulation shows that the proposed model outperforms competing alternatives in terms of both identification and prediction. The proposed Bayesian method leads to the identification of main and interaction effects with important implications in a high-throughput profiling study with high-dimensional SNP data.
Affiliation(s)
- Jie Ren
  - Department of Statistics, Kansas State University, Manhattan, Kansas
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, Kansas
- Xiaoxi Li
  - Department of Statistics, Kansas State University, Manhattan, Kansas
- Qi Chen
  - Department of Pharmacology, Toxicology and Therapeutics, University of Kansas Medical Center, Kansas City, Kansas
- Hongmei Zhang
  - Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, Tennessee
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut
- Yu Jiang
  - Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, Tennessee
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, Kansas
18
Penalized Variable Selection for Lipid-Environment Interactions in a Longitudinal Lipidomics Study. Genes (Basel) 2019; 10:genes10121002. [PMID: 31816972 PMCID: PMC6947406 DOI: 10.3390/genes10121002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/07/2019] [Accepted: 11/26/2019] [Indexed: 12/20/2022] Open
Abstract
Lipid species are critical components of eukaryotic membranes. They play key roles in many biological processes such as signal transduction, cell homeostasis, and energy storage. Investigations of lipid-environment interactions, in addition to the lipid and environment main effects, have important implications in understanding the lipid metabolism and related changes in phenotype. In this study, we developed a novel penalized variable selection method to identify important lipid-environment interactions in a longitudinal lipidomics study. An efficient Newton-Raphson based algorithm was proposed within the generalized estimating equation (GEE) framework. We conducted extensive simulation studies to demonstrate the superior performance of our method over alternatives, in terms of both identification accuracy and prediction performance. As weight control via dietary calorie restriction and exercise has been demonstrated to prevent cancer in a variety of studies, analysis of the high-dimensional lipid datasets collected using 60 mice from the skin cancer prevention study identified meaningful markers that provide fresh insight into the underlying mechanism of cancer preventive effects.
19
Zhang S, Xue Y, Zhang Q, Ma C, Wu M, Ma S. Identification of gene-environment interactions with marginal penalization. Genet Epidemiol 2019; 44:159-196. [PMID: 31724772 DOI: 10.1002/gepi.22270] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/18/2019] [Revised: 10/05/2019] [Accepted: 10/25/2019] [Indexed: 12/29/2022]
Abstract
Gene-environment (G-E) interaction analysis has been extensively conducted for complex diseases. In marginal analysis, the common practice is to conduct likelihood-based (and other "standard") estimation with each marginal model, and then select significant G-E interactions and main effects based on p values and multiple comparisons adjustment. One limitation of this approach is that the identification results often do not respect the "main effects, interactions" hierarchy, which has been stressed in recent G-E interaction analyses. Some recent efforts tackle this problem, but with very complex formulations. Another limitation of the common practice is that it may not perform well when regularization is needed, for example, because of "non-normal" distributions. In this article, we propose a marginal penalization approach which adopts a novel penalty to directly tackle the aforementioned problems. The proposed approach has a framework more coherent with that of the recently developed joint analysis methods and an intuitive formulation, and can be effectively realized. In simulation, it outperforms the popular significance-based analysis and simple penalization-based alternatives. Promising findings are made in the analysis of a single-nucleotide polymorphism dataset and a gene expression dataset.
Affiliation(s)
- Sanguo Zhang
  - School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, China
- Yuan Xue
  - School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, China
  - Department of Biostatistics, Yale University, New Haven, Connecticut
- Qingzhao Zhang
  - Department of Statistics, School of Economics, Xiamen University, Xiamen, China
- Chenjin Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut
  - School of Statistics, Renmin University, Beijing, China
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut
20
Lobach I, Kim I, Alekseyenko A, Lobach S, Zhang L. A Simple Approximation to Bias in Gene-Environment Interaction Estimates When a Case Might Not Be the Case. Front Genet 2019; 10:886. [PMID: 31681402 PMCID: PMC6812609 DOI: 10.3389/fgene.2019.00886] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 03/10/2019] [Accepted: 08/22/2019] [Indexed: 11/25/2022] Open
Abstract
Case–control genetic association studies are often used to examine the role of the genetic basis in complex diseases, such as cancer and neurodegenerative diseases. The role of the genetic basis might vary by nongenetic (environmental) measures, what is traditionally defined as gene–environment interactions (G×E). A commonly overlooked complication is that the set of clinically diagnosed cases might be contaminated by a subset with a nuisance pathologic state that presents with the same symptoms as the pathologic state of interest. The genetic basis of the pathologic state of interest might differ from that of the nuisance pathologic state. Often, frequencies of the pathologically defined states within the clinically diagnosed set of cases vary by the environment. We derive a simple and general approximation to bias in G×E parameter estimates when the presence of the nuisance pathologic state is ignored. We then perform extensive simulation studies to show that ignoring the presence of the nuisance pathologic state can result in substantial bias in G×E estimates and that the approximation we derived is reasonably accurate in finite samples. We demonstrate the applicability of the proposed approximation in a study of Alzheimer’s disease.
Affiliation(s)
- Iryna Lobach
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, United States
- Inyoung Kim
  - Department of Statistics, Virginia Tech University, Blacksburg, VA, United States
- Alexander Alekseyenko
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, United States
- Siarhei Lobach
  - Applied Mathematics and Computer Science Department, Belarusian State University, Minsk, Belarus
- Li Zhang
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, United States
  - Department of Medicine, University of California, San Francisco, San Francisco, CA, United States
  - Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, CA, United States
21
Wu M, Ma S. Robust semiparametric gene-environment interaction analysis using sparse boosting. Stat Med 2019; 38:4625-4641. [PMID: 31359454 PMCID: PMC6736719 DOI: 10.1002/sim.8322] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/02/2018] [Revised: 04/02/2019] [Accepted: 06/19/2019] [Indexed: 12/25/2022]
Abstract
For the pathogenesis of complex diseases, gene-environment (G-E) interactions have been shown to have important implications. G-E interaction analysis can be challenging with the need to jointly analyze a large number of main effects and interactions and to respect the "main effects, interactions" hierarchical constraint. Extensive methodological developments on G-E interaction analysis have been conducted in recent literature. Despite considerable successes, most of the existing studies are still limited as they cannot accommodate long-tailed distributions/data contamination, make the restricted assumption of linear effects, and cannot effectively accommodate missingness in E variables. To directly tackle these problems, a semiparametric model is assumed to accommodate nonlinear effects, and the Huber loss function and Qn estimator are adopted to accommodate long-tailed distributions/data contamination. A regression-based multiple imputation approach is developed to accommodate missingness in E variables. For model estimation and selection of relevant variables, we adopt an effective sparse boosting approach. The proposed approach is practically well motivated, has intuitive formulations, and can be effectively realized. In extensive simulations, it significantly outperforms multiple direct competitors. The analysis of The Cancer Genome Atlas data on stomach adenocarcinoma and cutaneous melanoma shows that the proposed approach makes sensible discoveries with satisfactory prediction and stability.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
  - Department of Biostatistics, Yale University, New Haven, CT, USA
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT, USA
22
Wu M, Zhang Q, Ma S. Structured gene-environment interaction analysis. Biometrics 2019; 76:23-35. [PMID: 31424088 DOI: 10.1111/biom.13139] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 10/18/2018] [Accepted: 08/06/2019] [Indexed: 01/03/2023]
Abstract
For the etiology, progression, and treatment of complex diseases, gene-environment (G-E) interactions have important implications beyond the main G and E effects. G-E interaction analysis can be more challenging with higher dimensionality and need for accommodating the "main effects, interactions" hierarchy. In recent literature, an array of novel methods, many of which are based on the penalization technique, have been developed. In most of these studies, however, the structures of G measurements, for example, the adjacency structure of single nucleotide polymorphisms (SNPs; attributable to their physical adjacency on the chromosomes) and the network structure of gene expressions (attributable to their coordinated biological functions and correlated measurements) have not been well accommodated. In this study, we develop structured G-E interaction analysis, where such structures are accommodated using penalization for both the main G effects and interactions. Penalization is also applied for regularized estimation and selection. The proposed structured interaction analysis can be effectively realized. It is shown to have consistency properties under high-dimensional settings. Simulations and analysis of GENEVA diabetes data with SNP measurements and TCGA melanoma data with gene expression measurements demonstrate its competitive practical performance.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
  - Department of Biostatistics, Yale University, New Haven, Connecticut
- Qingzhao Zhang
  - School of Economics and Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut
23
Ren J, Du Y, Li S, Ma S, Jiang Y, Wu C. Robust network-based regularization and variable selection for high-dimensional genomic data in cancer prognosis. Genet Epidemiol 2019; 43:276-291. [PMID: 30746793 PMCID: PMC6446588 DOI: 10.1002/gepi.22194] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 08/05/2018] [Revised: 11/19/2018] [Accepted: 11/29/2018] [Indexed: 12/21/2022]
Abstract
In cancer genomic studies, an important objective is to identify prognostic markers associated with patients' survival. Network-based regularization has achieved success in variable selections for high-dimensional cancer genomic data, because of its ability to incorporate the correlations among genomic features. However, as survival time data usually follow skewed distributions, and are contaminated by outliers, network-constrained regularization that does not take the robustness into account leads to false identifications of network structure and biased estimation of patients' survival. In this study, we develop a novel robust network-based variable selection method under the accelerated failure time model. Extensive simulation studies show the advantage of the proposed method over the alternative methods. Two case studies of lung cancer datasets with high-dimensional gene expression measurements demonstrate that the proposed approach has identified markers with important implications.
Affiliation(s)
- Jie Ren
  - Department of Statistics, Kansas State University, Manhattan, KS
- Yinhao Du
  - Department of Statistics, Kansas State University, Manhattan, KS
- Shaoyu Li
  - Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT
- Yu Jiang
  - Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS
24
Wu M, Ma S. Robust genetic interaction analysis. Brief Bioinform 2019; 20:624-637. [PMID: 29897421 PMCID: PMC6556899 DOI: 10.1093/bib/bby033] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 02/06/2018] [Revised: 03/22/2018] [Indexed: 01/17/2023] Open
Abstract
For the risk, progression, and response to treatment of many complex diseases, it has been increasingly recognized that genetic interactions (including gene-gene and gene-environment interactions) play important roles beyond the main genetic and environmental effects. In practical genetic interaction analyses, model mis-specification and outliers/contaminations in response variables and covariates are not uncommon, and demand robust analysis methods. Compared with their nonrobust counterparts, robust genetic interaction analysis methods are significantly less popular but are gaining attention fast. In this article, we provide a comprehensive review of robust genetic interaction analysis methods, on their methodologies and applications, for both marginal and joint analysis, and for addressing model mis-specification as well as outliers/contaminations in response variables and covariates.
Affiliation(s)
- Mengyun Wu
  - Mengyun Wu and Shuangge Ma, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China and Yale School of Public Health, New Haven, CT 06520, USA
- Shuangge Ma
  - Mengyun Wu and Shuangge Ma, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China and Yale School of Public Health, New Haven, CT 06520, USA
25
Wu C, Zhou F, Ren J, Li X, Jiang Y, Ma S. A Selective Review of Multi-Level Omics Data Integration Using Variable Selection. High Throughput 2019; 8:E4. [PMID: 30669303 PMCID: PMC6473252 DOI: 10.3390/ht8010004] [Citation(s) in RCA: 114] [Impact Index Per Article: 22.8] [Received: 11/22/2018] [Revised: 12/24/2018] [Accepted: 01/10/2019] [Indexed: 01/02/2023] Open
Abstract
High-throughput technologies have been used to generate a large amount of omics data. In the past, single-level analysis has been extensively conducted where the omics measurements at different levels, including mRNA, microRNA, CNV and DNA methylation, are analyzed separately. As the molecular complexity of disease etiology exists at all different levels, integrative analysis offers an effective way to borrow strength across multi-level omics data and can be more powerful than single level analysis. In this article, we focus on reviewing existing multi-omics integration studies by paying special attention to variable selection methods. We first summarize published reviews on integrating multi-level omics data. Next, after a brief overview on variable selection methods, we review existing supervised, semi-supervised and unsupervised integrative analyses within parallel and hierarchical integration studies, respectively. The strength and limitations of the methods are discussed in detail. No existing integration method can dominate the rest. The computation aspects are also investigated. The review concludes with possible limitations and future directions for multi-level omics data integration.
Affiliation(s)
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Jie Ren
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Xiaoxi Li
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Yu Jiang
  - Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN 38152, USA
- Shuangge Ma
  - Department of Biostatistics, School of Public Health, Yale University, New Haven, CT 06510, USA
26
Xu Y, Wu M, Ma S, Ahmed SE. Robust gene-environment interaction analysis using penalized trimmed regression. J Stat Comput Simul 2018; 88:3502-3528. [PMID: 30718937 PMCID: PMC6358205 DOI: 10.1080/00949655.2018.1523411] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/10/2018] [Accepted: 09/09/2018] [Indexed: 12/25/2022]
Abstract
In biomedical and epidemiological studies, gene-environment (G-E) interactions have been shown to importantly contribute to the etiology and progression of many complex diseases. Most existing approaches for identifying G-E interactions are limited by the lack of robustness against outliers/contaminations in response and predictor spaces. In this study, we develop a novel robust G-E identification approach using the trimmed regression technique under joint modeling. A robust data-driven criterion and stability selection are adopted to determine the trimmed subset which is free from both vertical outliers and leverage points. An effective penalization approach is developed to identify important G-E interactions, respecting the "main effects, interactions" hierarchical structure. Extensive simulations demonstrate the better performance of the proposed approach compared to multiple alternatives. Interesting findings with superior prediction accuracy and stability are observed in the analysis of TCGA data on cutaneous melanoma and breast invasive carcinoma.
Affiliation(s)
- Yaqing Xu
  - Department of Biostatistics, Yale University, New Haven, CT, USA
- Mengyun Wu
  - Department of Biostatistics, Yale University, New Haven, CT, USA
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT, USA
- Syed Ejaz Ahmed
  - Department of Mathematics and Statistics, Brock University, Canada
27
Xu Y, Wu M, Zhang Q, Ma S. Robust identification of gene-environment interactions for prognosis using a quantile partial correlation approach. Genomics 2018; 111:1115-1123. [PMID: 30009922 DOI: 10.1016/j.ygeno.2018.07.006] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 04/03/2018] [Revised: 06/23/2018] [Accepted: 07/05/2018] [Indexed: 10/28/2022]
Abstract
Gene-environment (G-E) interactions have important implications for the etiology and progression of many complex diseases. Compared to continuous markers and categorical disease status, prognosis has been less investigated, with the additional challenges brought by the unique characteristics of survival outcomes. Most of the existing G-E interaction approaches for prognosis data share the limitation that they cannot accommodate long-tailed or contaminated outcomes. In this study, for prognosis data, we develop a robust G-E interaction identification approach using the censored quantile partial correlation (CQPCorr) technique. The proposed approach is built on the quantile regression technique (and hence has a solid statistical basis), uses weights to easily accommodate censoring, and adopts partial correlation to identify important interactions while properly controlling for the main genetic and environmental effects. In simulation, it outperforms multiple competitors with more accurate identification. In the analysis of TCGA data on lung cancer and melanoma, it yields biologically sensible findings that differ from those of the alternatives.
Affiliation(s)
- Yaqing Xu
  - Department of Biostatistics, Yale University, United States
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, China
  - Department of Biostatistics, Yale University, United States
- Qingzhao Zhang
  - School of Economics and Wang Yanan Institute for Studies in Economics, Xiamen University, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, United States