1
Zhang Y, Muller S. Robust variable selection methods with Cox model-a selective practical benchmark study. Brief Bioinform 2024; 25:bbae508. [PMID: 39400113 PMCID: PMC11472364 DOI: 10.1093/bib/bbae508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2024] [Revised: 09/01/2024] [Accepted: 09/30/2024] [Indexed: 10/15/2024] Open
Abstract
With the advancement of biological and medical techniques, we can now obtain large amounts of high-dimensional omics data with censored survival information. This presents challenges in method development across various domains, particularly in variable selection. Given the inherently skewed distribution of the survival time outcome variable, robust variable selection methods offer potential solutions. Recently, there has been a focus on extending robust variable selection methods from linear regression models to survival models. However, despite these developments, robust methods are currently rarely used in practical applications, possibly due to a limited appreciation of their overall good performance. To address this gap, we conduct a selective review comparing the variable selection performance of twelve robust and non-robust penalised Cox models. Our study reveals the intricate relationship among covariates, survival outcomes, and modeling approaches, demonstrating how subtle variations can significantly impact the performance of methods considered. Based on our empirical research, we recommend the use of robust Cox models for variable selection in practice based on their superior performance in presence of outliers while maintaining good efficiency and accuracy when there are no outliers. This study provides valuable insights for method development and application, contributing to a better understanding of the relationship between correlated covariates and censored outcomes.
Affiliation(s)
- Yunwei Zhang
  - School of Mathematics, Statistics, Chemistry and Physics, Murdoch University, 90 South St, Murdoch WA 6150, Australia
  - School of Mathematical and Physical Sciences, Macquarie University, 12 Wally's Walk, Macquarie Park NSW 2109, Australia
  - School of Mathematics and Statistics, The University of Sydney, F07 Eastern Ave, Camperdown NSW 2050, Australia
- Samuel Muller
  - School of Mathematical and Physical Sciences, Macquarie University, 12 Wally's Walk, Macquarie Park NSW 2109, Australia
  - School of Mathematics and Statistics, The University of Sydney, F07 Eastern Ave, Camperdown NSW 2050, Australia
2
Xiong W, Chen Y, Ma S. Unified model-free interaction screening via CV-entropy filter. Comput Stat Data Anal 2023; 180:107684. [PMID: 36910335 PMCID: PMC9997997 DOI: 10.1016/j.csda.2022.107684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/29/2022]
Abstract
For many practical high-dimensional problems, interactions have been increasingly found to play important roles beyond main effects. A representative example is gene-gene interaction. Joint analysis, which analyzes all interactions and main effects in a single model, can be seriously challenged by high dimensionality. For high-dimensional data analysis in general, marginal screening has been established as effective for reducing computational cost, increasing stability, and improving estimation/selection performance. Most of the existing marginal screening methods are designed for the analysis of main effects only. The existing screening methods for interaction analysis are often limited by making stringent model assumptions, lacking robustness, and/or requiring predictors to be continuous (and hence lacking flexibility). A unified marginal screening approach tailored to interaction analysis is developed, which can be applied to regression, classification, and survival analysis. Predictors are allowed to be continuous and discrete. The proposed approach is built on Coefficient of Variation (CV) filters based on information entropy. Statistical properties are rigorously established. It is shown that the CV filters are almost insensitive to the distribution tails of predictors, correlation structure among predictors, and sparsity level of signals. An efficient two-stage algorithm is developed to make the proposed approach scalable to ultrahigh-dimensional data. Simulations and the analysis of TCGA LUAD data further establish the practical superiority of the proposed approach.
Affiliation(s)
- Wei Xiong
  - School of Statistics, University of International Business and Economics, Beijing 100872, PR China
- Yaxian Chen
  - Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, USA
3
Zhang X, Liu Y. Sparse Laplacian shrinkage for nonparametric transformation survival model. Commun Stat Theory Methods 2022. [DOI: 10.1080/03610926.2022.2042025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Affiliation(s)
- Xiao Zhang
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Yiming Liu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
4
Ren M, Zhang S, Ma S, Zhang Q. Gene-environment interaction identification via penalized robust divergence. Biom J 2022; 64:461-480. [PMID: 34725857 PMCID: PMC9386692 DOI: 10.1002/bimj.202000157] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/24/2020] [Revised: 06/01/2021] [Accepted: 08/23/2021] [Indexed: 12/11/2022]
Abstract
In high-throughput cancer studies, gene-environment interactions associated with outcomes have important implications. Some commonly adopted identification methods do not respect the "main effect, interaction" hierarchical structure. In addition, they can be challenged by data contamination and/or long-tailed distributions, which are not uncommon. In this article, robust methods based on $\gamma$-divergence and density power divergence are proposed to accommodate contaminated data/long-tailed distributions. A hierarchical sparse group penalty is adopted for regularized estimation and selection and can identify important gene-environment interactions and respect the "main effect, interaction" hierarchical structure. The proposed methods are implemented using an effective group coordinate descent algorithm. Simulation shows that when contamination occurs, the proposed methods can significantly outperform the existing alternatives with more accurate identification. The proposed approach is applied to the analysis of The Cancer Genome Atlas (TCGA) triple-negative breast cancer data and Gene Environment Association Studies (GENEVA) Type 2 Diabetes data.
Affiliation(s)
- Mingyang Ren
  - School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, P. R. China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, P. R. China
- Sanguo Zhang
  - School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, P. R. China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, P. R. China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Qingzhao Zhang
  - Department of Statistics and Data Science, School of Economics, Wang Yanan Institute for Studies in Economics, Fujian Key Lab of Statistics, Xiamen University, Fujian, P. R. China
5
Lu X, Fan K, Ren J, Wu C. Identifying Gene-Environment Interactions With Robust Marginal Bayesian Variable Selection. Front Genet 2021; 12:667074. [PMID: 34956304 PMCID: PMC8693717 DOI: 10.3389/fgene.2021.667074] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/11/2021] [Accepted: 07/13/2021] [Indexed: 01/02/2023] Open
Abstract
In high-throughput genetics studies, an important aim is to identify gene–environment interactions associated with the clinical outcomes. Recently, multiple marginal penalization methods have been developed and shown to be effective in G×E studies. However, within the Bayesian framework, marginal variable selection has not received much attention. In this study, we propose a novel marginal Bayesian variable selection method for G×E studies. In particular, our marginal Bayesian method is robust to data contamination and outliers in the outcome variables. With the incorporation of spike-and-slab priors, we have implemented the Gibbs sampler based on Markov Chain Monte Carlo (MCMC). The proposed method outperforms a number of alternatives in extensive simulation studies. The utility of the marginal robust Bayesian variable selection method has been further demonstrated in the case studies using data from the Nurse Health Study (NHS). Some of the identified main and interaction effects from the real data analysis have important biological implications.
Affiliation(s)
- Xi Lu
  - Department of Statistics, Kansas State University, Manhattan, KS, United States
- Kun Fan
  - Department of Statistics, Kansas State University, Manhattan, KS, United States
- Jie Ren
  - Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, United States
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS, United States
6
Zhou F, Ren J, Lu X, Ma S, Wu C. Gene-Environment Interaction: A Variable Selection Perspective. Methods Mol Biol 2021; 2212:191-223. [PMID: 33733358 DOI: 10.1007/978-1-0716-0947-7_13] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 01/31/2023]
Abstract
Gene-environment interactions have important implications for elucidating the genetic basis of complex diseases beyond the joint function of multiple genetic factors and their interactions (or epistasis). In the past, G × E interactions have been mainly conducted within the framework of genetic association studies. The high dimensionality of G × E interactions, due to the complicated form of environmental effects and the presence of a large number of genetic factors including gene expressions and SNPs, has motivated the recent development of penalized variable selection methods for dissecting G × E interactions, which has been ignored in the majority of published reviews on genetic interaction studies. In this article, we first survey existing studies on both gene-environment and gene-gene interactions. Then, after a brief introduction to the variable selection methods, we review penalization and relevant variable selection methods in marginal and joint paradigms, respectively, under a variety of conceptual models. Discussions on strengths and limitations, as well as computational aspects of the variable selection methods tailored for G × E studies, have also been provided.
Affiliation(s)
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
- Jie Ren
  - Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA
- Xi Lu
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
- Shuangge Ma
  - Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
7
Li Y, Wang F, Wu M, Ma S. Integrative functional linear model for genome-wide association studies with multiple traits. Biostatistics 2020; 23:574-590. [PMID: 33040145 DOI: 10.1093/biostatistics/kxaa043] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 02/20/2019] [Revised: 06/30/2020] [Accepted: 09/12/2020] [Indexed: 11/14/2022] Open
Abstract
In recent biomedical research, genome-wide association studies (GWAS) have demonstrated great success in investigating the genetic architecture of human diseases. For many complex diseases, multiple correlated traits have been collected. However, most of the existing GWAS are still limited because they analyze each trait separately without considering their correlations and suffer from a lack of sufficient information. Moreover, the high dimensionality of single nucleotide polymorphism (SNP) data still poses tremendous challenges to statistical methods, in both theoretical and practical aspects. In this article, we innovatively propose an integrative functional linear model for GWAS with multiple traits. This study is the first to approximate SNPs as functional objects in a joint model of multiple traits with penalization techniques. It effectively accommodates the high dimensionality of SNPs and correlations among multiple traits to facilitate information borrowing. Our extensive simulation studies demonstrate the satisfactory performance of the proposed method in the identification and estimation of disease-associated genetic variants, compared to four alternatives. The analysis of type 2 diabetes data leads to biologically meaningful findings with good prediction accuracy and selection stability.
Affiliation(s)
- Yang Li
  - Center for Applied Statistics, School of Statistics, and Statistical Consulting Center, Renmin University of China, Beijing 100872, China
- Fan Wang
  - Center for Applied Statistics, School of Statistics, and Statistical Consulting Center, Renmin University of China, Beijing 100872, China
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven 06520, USA
8
Zhang S, Xue Y, Zhang Q, Ma C, Wu M, Ma S. Identification of gene-environment interactions with marginal penalization. Genet Epidemiol 2019; 44:159-196. [PMID: 31724772 DOI: 10.1002/gepi.22270] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/18/2019] [Revised: 10/05/2019] [Accepted: 10/25/2019] [Indexed: 12/29/2022]
Abstract
Gene-environment (G-E) interaction analysis has been extensively conducted for complex diseases. In marginal analysis, the common practice is to conduct likelihood-based (and other "standard") estimation with each marginal model, and then select significant G-E interactions and main effects based on p values and multiple comparisons adjustment. One limitation of this approach is that the identification results often do not respect the "main effects, interactions" hierarchy, which has been stressed in recent G-E interaction analyses. There is some recent effort tackling this problem, however, with very complex formulations. Another limitation of the common practice is that it may not perform well when regularization is needed, for example, because of "non-normal" distributions. In this article, we propose a marginal penalization approach which adopts a novel penalty to directly tackle the aforementioned problems. The proposed approach has a framework more coherent with that of the recently developed joint analysis methods and an intuitive formulation, and can be effectively realized. In simulation, it outperforms the popular significance-based analysis and simple penalization-based alternatives. Promising findings are made in the analysis of a single-nucleotide polymorphism and a gene expression data.
Affiliation(s)
- Sanguo Zhang
  - School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, China
- Yuan Xue
  - School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, China
  - Department of Biostatistics, Yale University, New Haven, Connecticut
- Qingzhao Zhang
  - Department of Statistics, School of Economics, Xiamen University, Xiamen, China
- Chenjin Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut
  - School of Statistics, Renmin University, Beijing, China
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut
9
Ren J, Du Y, Li S, Ma S, Jiang Y, Wu C. Robust network-based regularization and variable selection for high-dimensional genomic data in cancer prognosis. Genet Epidemiol 2019; 43:276-291. [PMID: 30746793 PMCID: PMC6446588 DOI: 10.1002/gepi.22194] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 08/05/2018] [Revised: 11/19/2018] [Accepted: 11/29/2018] [Indexed: 12/21/2022]
Abstract
In cancer genomic studies, an important objective is to identify prognostic markers associated with patients' survival. Network-based regularization has achieved success in variable selections for high-dimensional cancer genomic data, because of its ability to incorporate the correlations among genomic features. However, as survival time data usually follow skewed distributions, and are contaminated by outliers, network-constrained regularization that does not take the robustness into account leads to false identifications of network structure and biased estimation of patients' survival. In this study, we develop a novel robust network-based variable selection method under the accelerated failure time model. Extensive simulation studies show the advantage of the proposed method over the alternative methods. Two case studies of lung cancer datasets with high-dimensional gene expression measurements demonstrate that the proposed approach has identified markers with important implications.
Affiliation(s)
- Jie Ren
  - Department of Statistics, Kansas State University, Manhattan, KS
- Yinhao Du
  - Department of Statistics, Kansas State University, Manhattan, KS
- Shaoyu Li
  - Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT
- Yu Jiang
  - Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS
10
Wu M, Ma S. Robust genetic interaction analysis. Brief Bioinform 2019; 20:624-637. [PMID: 29897421 PMCID: PMC6556899 DOI: 10.1093/bib/bby033] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 02/06/2018] [Revised: 03/22/2018] [Indexed: 01/17/2023] Open
Abstract
For the risk, progression, and response to treatment of many complex diseases, it has been increasingly recognized that genetic interactions (including gene-gene and gene-environment interactions) play important roles beyond the main genetic and environmental effects. In practical genetic interaction analyses, model mis-specification and outliers/contaminations in response variables and covariates are not uncommon, and demand robust analysis methods. Compared with their nonrobust counterparts, robust genetic interaction analysis methods are significantly less popular but are gaining attention fast. In this article, we provide a comprehensive review of robust genetic interaction analysis methods, on their methodologies and applications, for both marginal and joint analysis, and for addressing model mis-specification as well as outliers/contaminations in response variables and covariates.
Affiliation(s)
- Mengyun Wu
  - Mengyun Wu and Shuangge Ma, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China and Yale School of Public Health, New Haven, CT 06520, USA
- Shuangge Ma
  - Mengyun Wu and Shuangge Ma, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China and Yale School of Public Health, New Haven, CT 06520, USA
11
Wang X, Xu Y, Ma S. Identifying gene-environment interactions incorporating prior information. Stat Med 2019; 38:1620-1633. [PMID: 30637789 DOI: 10.1002/sim.8064] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 07/16/2018] [Revised: 10/20/2018] [Accepted: 11/26/2018] [Indexed: 12/28/2022]
Abstract
For many complex diseases, gene-environment (G-E) interactions have independent contributions beyond the main G and E effects. Despite extensive effort, it still remains challenging to identify G-E interactions. With the long accumulation of experiments and data, for many biomedical problems of common interest, there are existing studies that can be relevant and informative for the identification of G-E interactions and/or main effects. In this study, our goal is to identify G-E interactions (as well as their corresponding main G effects) under a joint statistical modeling framework. Significantly advancing from the existing studies, a quasi-likelihood-based approach is developed to incorporate information mined from the existing literature. A penalization approach is adopted for identification and selection and respects the "main effects, interactions" hierarchical structure. Simulation shows that, when the existing information is of high quality, significant improvement can be observed. On the other hand, when the existing information is less informative, the proposed method still performs reasonably (and hence demonstrates a certain degree of "robustness"). The analysis of The Cancer Genome Atlas (TCGA) data on cutaneous melanoma and glioblastoma multiforme demonstrates the practical applicability of the proposed approach and also leads to sensible findings.
Affiliation(s)
- Xiaoyan Wang
  - College of Finance and Statistics, Hunan University, Changsha, China
  - Department of Biostatistics, Yale University, New Haven, Connecticut
- Yonghong Xu
  - School of Economics, Xiamen University, Xiamen, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut
12
Xu Y, Wu M, Ma S, Ahmed SE. Robust gene-environment interaction analysis using penalized trimmed regression. J Stat Comput Simul 2018; 88:3502-3528. [PMID: 30718937 PMCID: PMC6358205 DOI: 10.1080/00949655.2018.1523411] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/10/2018] [Accepted: 09/09/2018] [Indexed: 12/25/2022]
Abstract
In biomedical and epidemiological studies, gene-environment (G-E) interactions have been shown to importantly contribute to the etiology and progression of many complex diseases. Most existing approaches for identifying G-E interactions are limited by the lack of robustness against outliers/contaminations in response and predictor spaces. In this study, we develop a novel robust G-E identification approach using the trimmed regression technique under joint modeling. A robust data-driven criterion and stability selection are adopted to determine the trimmed subset which is free from both vertical outliers and leverage points. An effective penalization approach is developed to identify important G-E interactions, respecting the "main effects, interactions" hierarchical structure. Extensive simulations demonstrate the better performance of the proposed approach compared to multiple alternatives. Interesting findings with superior prediction accuracy and stability are observed in the analysis of TCGA data on cutaneous melanoma and breast invasive carcinoma.
Affiliation(s)
- Yaqing Xu
  - Department of Biostatistics, Yale University, New Haven, CT, USA
- Mengyun Wu
  - Department of Biostatistics, Yale University, New Haven, CT, USA
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT, USA
- Syed Ejaz Ahmed
  - Department of Mathematics and Statistics, Brock University, Canada
13
Xu Y, Wu M, Zhang Q, Ma S. Robust identification of gene-environment interactions for prognosis using a quantile partial correlation approach. Genomics 2018; 111:1115-1123. [PMID: 30009922 DOI: 10.1016/j.ygeno.2018.07.006] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 04/03/2018] [Revised: 06/23/2018] [Accepted: 07/05/2018] [Indexed: 10/28/2022]
Abstract
Gene-environment (G-E) interactions have important implications for the etiology and progression of many complex diseases. Compared to continuous markers and categorical disease status, prognosis has been less investigated, with the additional challenges brought by the unique characteristics of survival outcomes. Most of the existing G-E interaction approaches for prognosis data share the limitation that they cannot accommodate long-tailed or contaminated outcomes. In this study, for prognosis data, we develop a robust G-E interaction identification approach using the censored quantile partial correlation (CQPCorr) technique. The proposed approach is built on the quantile regression technique (and hence has a solid statistical basis), uses weights to easily accommodate censoring, and adopts partial correlation to identify important interactions while properly controlling for the main genetic and environmental effects. In simulation, it outperforms multiple competitors with more accurate identification. In the analysis of TCGA data on lung cancer and melanoma, biologically sensible findings different from using the alternatives are made.
Affiliation(s)
- Yaqing Xu
  - Department of Biostatistics, Yale University, United States
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, China
  - Department of Biostatistics, Yale University, United States
- Qingzhao Zhang
  - School of Economics and Wang Yanan Institute for Studies in Economics, Xiamen University, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, United States
14
Wu C, Jiang Y, Ren J, Cui Y, Ma S. Dissecting gene-environment interactions: A penalized robust approach accounting for hierarchical structures. Stat Med 2017; 37:437-456. [PMID: 29034484 DOI: 10.1002/sim.7518] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 12/06/2016] [Revised: 07/30/2017] [Accepted: 09/07/2017] [Indexed: 12/26/2022]
Abstract
Identification of gene-environment (G × E) interactions associated with disease phenotypes has posed a great challenge in high-throughput cancer studies. The existing marginal identification methods have suffered from not being able to accommodate the joint effects of a large number of genetic variants, while some of the joint-effect methods have been limited by failing to respect the "main effects, interactions" hierarchy, by ignoring data contamination, and by using inefficient selection techniques under complex structural sparsity. In this article, we develop an effective penalization approach to identify important G × E interactions and main effects, which can account for the hierarchical structures of the 2 types of effects. Possible data contamination is accommodated by adopting the least absolute deviation loss function. The advantage of the proposed approach over the alternatives is convincingly demonstrated in both simulation and a case study on lung cancer prognosis with gene expression measurements and clinical covariates under the accelerated failure time model.
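As a rough illustration of why a least absolute deviation (LAD) loss tolerates contamination, the sketch below compares it with a squared-error loss on a toy intercept-only problem. It is not the penalized AFT estimator described in the abstract; the data values and function names are assumptions made up for this example.

```python
from statistics import mean, median

def squared_loss(data, center):
    """Sum of squared residuals around a candidate location estimate."""
    return sum((y - center) ** 2 for y in data)

def lad_loss(data, center):
    """Sum of absolute residuals (least absolute deviation loss)."""
    return sum(abs(y - center) for y in data)

# Hypothetical toy outcome: the last observation is replaced by a gross outlier.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
contaminated = clean[:-1] + [500.0]

# For a location-only model, the mean minimizes the squared loss
# and the median minimizes the LAD loss.
mean_fit = mean(contaminated)      # pulled far away by the outlier
median_fit = median(contaminated)  # unchanged from the clean data

print("mean fit:", mean_fit)
print("median fit:", median_fit)
print("LAD loss at median:", lad_loss(contaminated, median_fit))
print("LAD loss at mean:", lad_loss(contaminated, mean_fit))
```

The single outlier moves the squared-loss fit far from the bulk of the data, while the LAD fit stays where it was on the clean sample.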
Affiliation(s)
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Yu Jiang
  - Division of Epidemiology, Biostatistics, and Environmental Health, University of Memphis, Memphis, TN 38111, USA
- Jie Ren
  - Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, 619 Red Cedar Rd, East Lansing, MI 48824, USA
- Shuangge Ma
  - Department of Biostatistics, Yale University, 60 College Street, New Haven, CT 06520, USA
15
Chai H, Zhang Q, Jiang Y, Wang G, Zhang S, Ahmed SE, Ma S. Identifying gene-environment interactions for prognosis using a robust approach. Econometrics and Statistics 2017; 4:105-120. [PMID: 31157309 PMCID: PMC6541416 DOI: 10.1016/j.ecosta.2016.10.004] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 06/09/2023]
Abstract
For many complex diseases, prognosis is of essential importance. It has been shown that, beyond the main effects of genetic (G) and environmental (E) risk factors, gene-environment (G × E) interactions also play a critical role. In practical data analysis, part of the prognosis outcome data can have a distribution different from that of the rest of the data because of contamination or a mixture of subtypes. Literature has shown that data contamination as well as a mixture of distributions, if not properly accounted for, can lead to severely biased model estimation. In this study, we describe prognosis using an accelerated failure time (AFT) model. An exponential squared loss is proposed to accommodate data contamination or a mixture of distributions. A penalization approach is adopted for regularized estimation and marker selection. The proposed method is realized using an effective coordinate descent (CD) and minorization maximization (MM) algorithm. The estimation and identification consistency properties are rigorously established. Simulation shows that without contamination or mixture, the proposed method has performance comparable to or better than the nonrobust alternative. However, with contamination or mixture, it outperforms the nonrobust alternative and, under certain scenarios, is superior to the robust method based on quantile regression. The proposed method is applied to the analysis of TCGA (The Cancer Genome Atlas) lung cancer data. It identifies interactions different from those using the alternatives. The identified markers have important implications and satisfactory stability.
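The abstract does not state the formula for the exponential squared loss. A form commonly used in the robust regression literature is $1 - \exp(-r^2/\gamma)$ for a residual $r$ and tuning parameter $\gamma > 0$; the sketch below assumes that form, and both the function name and the default $\gamma$ are illustrative choices, not taken from the paper.

```python
import math

def exp_squared_loss(residual, gamma=1.0):
    """Assumed form of the exponential squared loss: 1 - exp(-r^2 / gamma).

    The loss is bounded above by 1, so a single huge residual cannot
    dominate the objective the way it does under a squared-error loss.
    """
    return 1.0 - math.exp(-(residual ** 2) / gamma)

# Compare with the unbounded squared loss on small and large residuals.
for r in (0.1, 1.0, 10.0, 100.0):
    print(f"r={r:6.1f}  squared={r ** 2:10.2f}  exp_squared={exp_squared_loss(r):.6f}")
```

For small residuals the loss behaves approximately like $r^2/\gamma$, while for large residuals it flattens out near 1, which is what limits the influence of contaminated observations. Smaller values of $\gamma$ make the loss saturate sooner.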
Collapse
Affiliation(s)
- Hao Chai
  - Department of Biostatistics, Yale University, United States
- Qingzhao Zhang
  - School of Economics and Wang Yanan Institute for Studies in Economics, Xiamen University, China
- Yu Jiang
  - School of Public Health, University of Memphis, United States
- Guohua Wang
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, China
- Sanguo Zhang
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, China
- Syed Ejaz Ahmed
  - Department of Mathematics and Statistics, Brock University, Canada
- Shuangge Ma
  - Department of Biostatistics, Yale University, United States
|
16
|
Han F, Ji H, Ji Z, Wang H. A provable smoothing approach for high dimensional generalized regression with applications in genomics. Electron J Stat 2017. [DOI: 10.1214/17-ejs1352] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 11/19/2022]
|
17
|
Wu C, Ma S. A selective review of robust variable selection with applications in bioinformatics. Brief Bioinform 2015; 16:873-83. [PMID: 25479793 PMCID: PMC4570200 DOI: 10.1093/bib/bbu046] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.8] [Received: 09/01/2014] [Revised: 10/20/2014] [Indexed: 11/13/2022] Open
Abstract
A drastic amount of data have been and are being generated in bioinformatics studies. In the analysis of such data, the standard modeling approaches can be challenged by the heavy-tailed errors and outliers in response variables, the contamination in predictors (which may be caused by, for instance, technical problems in microarray gene expression studies), model mis-specification and others. Robust methods are needed to tackle these challenges. When there are a large number of predictors, variable selection can be as important as estimation. As a generic variable selection and regularization tool, penalization has been extensively adopted. In this article, we provide a selective review of robust penalized variable selection approaches especially designed for high-dimensional data from bioinformatics and biomedical studies. We discuss the robust loss functions, penalty functions and computational algorithms. The theoretical properties and implementation are also briefly examined. Application examples of the robust penalization approaches in representative bioinformatics and biomedical studies are also illustrated.
|
18
|
Wu C, Shi X, Cui Y, Ma S. A penalized robust semiparametric approach for gene-environment interactions. Stat Med 2015; 34:4016-30. [PMID: 26239060 DOI: 10.1002/sim.6609] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 12/02/2014] [Revised: 06/28/2015] [Accepted: 07/06/2015] [Indexed: 11/09/2022]
Abstract
In genetic and genomic studies, gene-environment (G×E) interactions have important implications. Some of the existing G×E interaction methods are limited by analyzing a small number of G factors at a time, by assuming linear effects of E factors, by assuming no data contamination, and by adopting ineffective selection techniques. In this study, we propose a new approach for identifying important G×E interactions. It jointly models the effects of all E and G factors and their interactions. A partially linear varying coefficient model is adopted to accommodate possible nonlinear effects of E factors. A rank-based loss function is used to accommodate possible data contamination. Penalization, which has been extensively used with high-dimensional data, is adopted for selection. The proposed penalized estimation approach can automatically determine if a G factor has an interaction with an E factor, main effect but not interaction, or no effect at all. The proposed approach can be effectively realized using a coordinate descent algorithm. Simulation shows that it has satisfactory performance and outperforms several competing alternatives. The proposed approach is used to analyze a lung cancer study with gene expression measurements and clinical variables.
Affiliation(s)
- Cen Wu
  - Department of Biostatistics, School of Public Health, Yale University, 60 College Street, New Haven, CT, 06520, U.S.A.
  - Department of Statistics, Kansas State University, 1116 Mid-Campus Drive N., Manhattan, KS, 66506, U.S.A.
- Xingjie Shi
  - Department of Statistics, Nanjing University of Finance and Economics, Nanjing, China
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, 619 Red Cedar Rd, East Lansing, MI, 48824, U.S.A.
- Shuangge Ma
  - Department of Biostatistics, School of Public Health, Yale University, 60 College Street, New Haven, CT, 06520, U.S.A.
  - VA Cooperative Studies Program Coordinating Center, West Haven, CT, 06516, U.S.A.
|
19
|
Zhao Q, Shi X, Huang J, Liu J, Li Y, Ma S. Integrative Analysis of "-Omics" Data Using Penalty Functions. Wiley Interdisciplinary Reviews: Computational Statistics 2015; 7:99-108. [PMID: 25691921 PMCID: PMC4327914 DOI: 10.1002/wics.1322] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Indexed: 11/09/2022]
Abstract
In the analysis of omics data, integrative analysis provides an effective way of pooling information across multiple datasets or multiple correlated responses, and can be more effective than single-dataset (response) analysis. Multiple families of integrative analysis methods have been proposed in the literature. The current review focuses on the penalization methods. Special attention is paid to sparse meta-analysis methods that pool summary statistics across datasets, and integrative analysis methods that pool raw data across datasets. We discuss their formulation and rationale. Beyond "standard" penalized selection, we also review contrasted penalization and Laplacian penalization which accommodate finer data structures. The computational aspects, including computational algorithms and tuning parameter selection, are examined. This review concludes with possible limitations and extensions.
Affiliation(s)
- Qing Zhao
  - Department of Biostatistics, School of Public Health, Yale University
- Xingjie Shi
  - Department of Biostatistics, School of Public Health, Yale University
  - School of Statistics and Management, Shanghai University of Finance and Economics
- Jian Huang
  - Department of Statistics and Actuarial Science, University of Iowa
- Jin Liu
  - Division of Epidemiology and Biostatistics, UIC School of Public Health
- Yang Li
  - School of Statistics, Center for Applied Statistics, Renmin University of China
- Shuangge Ma
  - Department of Biostatistics, School of Public Health, Yale University
  - School of Statistics, Capital University of Economics and Business
|
20
|
Wu C, Cui Y, Ma S. Integrative analysis of gene-environment interactions under a multi-response partially linear varying coefficient model. Stat Med 2014; 33:4988-98. [PMID: 25146388 PMCID: PMC4225006 DOI: 10.1002/sim.6287] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 02/20/2014] [Revised: 06/10/2014] [Accepted: 07/28/2014] [Indexed: 12/29/2022]
Abstract
Consider the integrative analysis of genetic data with multiple correlated response variables. The goal is to identify important gene-environment (G × E) interactions along with main gene and environment effects that are associated with the responses. The homogeneity and heterogeneity models can be adopted to describe the genetic basis of multiple responses. To accommodate possible nonlinear effects of some environment effects, a multi-response partially linear varying coefficient model is assumed. Penalization is adopted for marker selection. The proposed penalization method can select genetic variants with G × E interactions, no G × E interactions, and no main effects simultaneously. It adopts different penalties to accommodate the homogeneity and heterogeneity models. The proposed method can be effectively computed using a coordinate descent algorithm. Simulation study and the analysis of Health Professionals Follow-up Study, which has two correlated continuous traits, SNP measurements and multiple environment effects, show superior performance of the proposed method over its competitors.
Affiliation(s)
- Cen Wu
  - Department of Biostatistics, School of Public Health, Yale University, 60 College Street, New Haven, CT, 06520, U.S.A.
|