1
Shen J, Wang S, Sun H, Huang J, Bai L, Wang X, Dong Y, Tang Z. A novel non-negative Bayesian stacking modeling method for Cancer survival prediction using high-dimensional omics data. BMC Med Res Methodol 2024; 24:105. [PMID: 38702624 PMCID: PMC11067084 DOI: 10.1186/s12874-024-02232-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2023] [Accepted: 04/23/2024] [Indexed: 05/06/2024] Open
Abstract
BACKGROUND Survival prediction using high-dimensional molecular data is a hot topic in the field of genomics and precision medicine, especially for cancer studies. Because carcinogenesis has a pathway-based pathogenesis, models that use such group structures mimic disease progression and prognosis more closely. Many approaches can be used to integrate group information; however, most of them are single-model methods, which may lead to unstable predictions. METHODS We introduced a novel survival stacking method that uses group structure information to improve the robustness of cancer survival prediction in the context of high-dimensional omics data. With a super learner, survival stacking combines the predictions from multiple sub-models that are independently trained using the features in pre-grouped biological pathways. In addition to a non-negative linear combination of sub-models, we extended the super learner to a non-negative Bayesian hierarchical generalized linear model and an artificial neural network (ANN). We compared the proposed modeling strategy with the widely used survival penalized method Lasso Cox and several group penalized methods, e.g., group Lasso Cox, via a simulation study and a real-world data application. RESULTS The proposed survival stacking method showed superior and robust performance in terms of discrimination compared with single-model methods in the case of high-noise simulated data and real-world data. The non-negative Bayesian stacking method can identify important biological signal pathways and genes that are associated with the prognosis of cancer. CONCLUSIONS This study proposed a novel survival stacking strategy incorporating biological group information into cancer prognosis models. Additionally, this study extended the super learner to a non-negative Bayesian model and an ANN, enriching the combination of sub-models. The proposed Bayesian stacking strategy exhibited favorable properties in the prediction and interpretation of complex survival data, which may aid in discovering cancer targets.
Affiliation(s)
- Junjie Shen
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
- Shuo Wang
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, 79085, Freiburg, Germany
- Hao Sun
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
- Jie Huang
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
- Lu Bai
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
- Xichao Wang
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
- Yongfei Dong
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China
- Zaixiang Tang
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Major Chronic Non-communicable Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, 215123, People's Republic of China.
2
Shen J, Wang S, Dong Y, Sun H, Wang X, Tang Z. A non-negative spike-and-slab lasso generalized linear stacking prediction modeling method for high-dimensional omics data. BMC Bioinformatics 2024; 25:119. [PMID: 38509499 PMCID: PMC10953151 DOI: 10.1186/s12859-024-05741-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2023] [Accepted: 03/11/2024] [Indexed: 03/22/2024] Open
Abstract
BACKGROUND High-dimensional omics data are increasingly utilized in clinical and public health research for disease risk prediction. Many sparse methods have been proposed that use prior knowledge, e.g., biological group structure information, to guide the model-building process. However, these methods are still based on a single model, often leading to overconfident inferences and inferior generalization. RESULTS We proposed a novel stacking strategy based on a non-negative spike-and-slab Lasso (nsslasso) generalized linear model (GLM) for disease risk prediction in the context of high-dimensional omics data. Briefly, we used prior biological knowledge to segment omics data into a set of sub-data. Each sub-model was trained separately using the features from the group via a proper base learner. Then, the predictions of sub-models were ensembled by a super learner using nsslasso GLM. The proposed method was compared to several competitors, such as the Lasso, grlasso, and gsslasso, using simulated data and two open-access breast cancer datasets. As a result, the proposed method showed robustly superior prediction performance to the optimal single-model method in high-noise simulated data and real-world data. Furthermore, compared to the traditional stacking method, the proposed nsslasso stacking method can efficiently handle redundant sub-models and identify important sub-models. CONCLUSIONS The proposed nsslasso method demonstrated favorable predictive accuracy, stability, and biological interpretability. Additionally, the proposed method can also be used to detect new biomarkers and key group structures.
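The super-learner step described in this abstract (combining sub-model predictions with non-negative weights) can be illustrated with a minimal sketch. It is not the authors' nsslasso GLM: as a stand-in, it solves a plain non-negative least-squares problem by projected gradient descent. The function name `nonneg_stack_weights`, the squared-error loss, and the toy data are all assumptions made for illustration.

```python
def nonneg_stack_weights(Z, y, n_iter=5000):
    """Non-negative stacking weights (illustrative stand-in, not nsslasso).

    Z: list of rows; Z[i][k] is the prediction of sub-model k for sample i
       (ideally an out-of-fold prediction).
    y: list of observed outcomes.
    Minimises 0.5 * ||Z w - y||^2 subject to w >= 0 by projected gradient descent.
    """
    k = len(Z[0])
    # The squared Frobenius norm bounds the largest eigenvalue of Z^T Z,
    # so 1 / frob is a safe (conservative) step size.
    frob = sum(z * z for row in Z for z in row)
    if frob == 0:
        return [0.0] * k
    step = 1.0 / frob
    w = [0.0] * k
    for _ in range(n_iter):
        resid = [sum(wj * zj for wj, zj in zip(w, row)) - yi
                 for row, yi in zip(Z, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, Z)) for j in range(k)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, wj - step * gj) for wj, gj in zip(w, grad)]
    return w


# Toy data: two sub-models, four samples (hypothetical values).
Z_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
weights = nonneg_stack_weights(Z_demo, [1.0, -1.0, 0.0, 1.0])
print(weights)
```

In the toy call, an unconstrained least-squares fit would give the second sub-model a negative weight; the projection step keeps it at zero instead, which is the behaviour a non-negative super learner relies on to discard unhelpful sub-models.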
Affiliation(s)
- Junjie Shen
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China
- Shuo Wang
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, 79085, Freiburg, Germany
- Yongfei Dong
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China
- Hao Sun
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China
- Xichao Wang
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China
- Zaixiang Tang
- Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University, No. 199 Renai Road, Suzhou, 215123, Jiangsu, People's Republic of China.
3
Li W, Chang C, Kundu S, Long Q. Accounting for network noise in graph-guided Bayesian modeling of structured high-dimensional data. Biometrics 2024; 80:ujae012. [PMID: 38483282 PMCID: PMC10938547 DOI: 10.1093/biomtc/ujae012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2022] [Revised: 12/31/2023] [Accepted: 02/14/2024] [Indexed: 03/17/2024]
Abstract
There is a growing body of literature on knowledge-guided statistical learning methods for analysis of structured high-dimensional data (such as genomic and transcriptomic data) that can incorporate knowledge of underlying networks derived from functional genomics and functional proteomics. These methods have been shown to improve variable selection and prediction accuracy and yield more interpretable results. However, these methods typically use graphs extracted from existing databases or rely on subject matter expertise, which are known to be incomplete and may contain false edges. To address this gap, we propose a graph-guided Bayesian modeling framework to account for network noise in regression models involving structured high-dimensional predictors. Specifically, we use 2 sources of network information, including the noisy graph extracted from existing databases and the estimated graph from observed predictors in the dataset at hand, to inform the model for the true underlying network via a latent scale modeling framework. This model is coupled with the Bayesian regression model with structured high-dimensional predictors involving an adaptive structured shrinkage prior. We develop an efficient Markov chain Monte Carlo algorithm for posterior sampling. We demonstrate the advantages of our method over existing methods in simulations, and through analyses of a genomics dataset and another proteomics dataset for Alzheimer's disease.
Affiliation(s)
- Wenrui Li
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, PA 19104, United States
- Changgee Chang
- Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN 46202, United States
- Suprateek Kundu
- Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Qi Long
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, PA 19104, United States
4
Wang JH, Wang KH, Chen YH. Overlapping group screening for detection of gene-environment interactions with application to TCGA high-dimensional survival genomic data. BMC Bioinformatics 2022; 23:202. [PMID: 35637439 PMCID: PMC9150322 DOI: 10.1186/s12859-022-04750-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2022] [Accepted: 05/25/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In the context of biomedical and epidemiological research, gene-environment (G-E) interaction is of great significance to the etiology and progression of many complex diseases. In high-dimensional genetic data, two general models, marginal and joint models, are proposed to identify important interaction factors. Most existing approaches for identifying G-E interactions are limited owing to the lack of robustness to outliers/contamination in response and predictor data. In particular, right-censored survival outcomes make the associated feature screening even more challenging. In this article, we utilize the overlapping group screening (OGS) approach to select important G-E interactions related to clinical survival outcomes by incorporating the gene pathway information under a joint modeling framework. RESULTS Simulation studies under various scenarios are carried out to compare the performance of our proposed method with some commonly used methods. In the real data applications, we use our proposed method to identify G-E interactions related to the clinical survival outcomes of patients with head and neck squamous cell carcinoma and esophageal carcinoma in The Cancer Genome Atlas clinical survival genetic data, and further establish corresponding survival prediction models. Both simulation and real data studies show that our method performs well and outperforms existing methods in G-E interaction selection, effect estimation, and survival prediction accuracy. CONCLUSIONS The OGS approach is useful for selecting important environmental factors, genes and G-E interactions in the ultra-high dimensional feature space. The prediction ability of OGS with the Lasso penalty is better than that of existing methods. The same idea of the OGS approach can apply to other outcome models, such as the proportional odds survival time model, the logistic regression model for binary outcomes, and the multinomial logistic regression model for multi-class outcomes.
Affiliation(s)
- Jie-Huei Wang
- Department of Statistics, Feng Chia University, Seatwen, Taichung, 40724, Taiwan.
- Kang-Hsin Wang
- Department of Statistics, Feng Chia University, Seatwen, Taichung, 40724, Taiwan
- Yi-Hau Chen
- Institute of Statistical Science, Academia Sinica, Nankang, Taipei, 11529, Taiwan
5
Ko S, Li GX, Choi H, Won JH. Computationally scalable regression modeling for ultrahigh-dimensional omics data with ParProx. Brief Bioinform 2021; 22:bbab256. [PMID: 34254998 PMCID: PMC8575036 DOI: 10.1093/bib/bbab256] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/07/2021] [Revised: 06/15/2021] [Accepted: 06/17/2021] [Indexed: 12/20/2022] Open
Abstract
Statistical analysis of ultrahigh-dimensional omics scale data has long depended on univariate hypothesis testing. With growing data features and samples, the obvious next step is to establish multivariable association analysis as a routine method to describe genotype-phenotype association. Here we present ParProx, a state-of-the-art implementation to optimize overlapping and non-overlapping group lasso regression models for time-to-event and classification analysis, with selection of variables grouped by biological priors. ParProx enables multivariable model fitting for ultrahigh-dimensional data within an architecture for parallel or distributed computing via latent variable group representation. It thereby aims to produce interpretable regression models consistent with known biological relationships among independent variables, a property often explored post hoc, not during model estimation. Simulation studies clearly demonstrate the scalability of ParProx with graphics processing units in comparison to existing implementations. We illustrate the tool using three different omics data sets featuring moderate to large numbers of variables, where we use genomic regions and biological pathways as variable groups, rendering the selected independent variables directly interpretable with respect to those groups. ParProx is applicable to a wide range of studies using ultrahigh-dimensional omics data, from genome-wide association analysis to multi-omics studies where model estimation is computationally intractable with existing implementation.
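For readers unfamiliar with how group lasso models are typically optimized, the following is a minimal sketch of the standard proximal (block soft-thresholding) step for non-overlapping groups. It is not taken from ParProx, whose internals are not described in this abstract; the function name `group_soft_threshold`, the `sqrt(group size)` weighting, and the toy numbers are assumptions for illustration.

```python
from math import sqrt


def group_soft_threshold(beta, groups, step, lam):
    """Proximal step for the non-overlapping group lasso penalty.

    Assumed penalty: lam * sum_g sqrt(|g|) * ||beta_g||_2 (a common convention).
    beta:   list of coefficients after a gradient step.
    groups: list of index lists; each coefficient belongs to exactly one group.
    step:   gradient step size used in the surrounding proximal gradient loop.
    """
    out = list(beta)
    for idx in groups:
        norm = sqrt(sum(beta[j] ** 2 for j in idx))
        thresh = step * lam * sqrt(len(idx))
        # Shrink the whole group towards zero; drop it if its norm is below the threshold.
        scale = max(0.0, 1.0 - thresh / norm) if norm > 0 else 0.0
        for j in idx:
            out[j] = scale * beta[j]
    return out


# Toy example: the first group is kept (shrunk), the second is zeroed out.
shrunk = group_soft_threshold([3.0, 4.0, 0.1, -0.1], [[0, 1], [2, 3]], step=1.0, lam=1.0)
print(shrunk)
```

Because the threshold acts on the norm of a whole group, variables enter or leave the model together, which is what makes the selected groups (for example pathways or genomic regions) directly interpretable.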
Affiliation(s)
- Seyoon Ko
- Department of Statistics, Seoul National University, Republic of Korea
- Ginny X Li
- Department of Medicine, National University of Singapore, Singapore
- Hyungwon Choi
- Department of Medicine, National University of Singapore, Singapore
- Joong-Ho Won
- Department of Statistics, Seoul National University, Republic of Korea
6
Münch MM, Peeters CFW, Van Der Vaart AW, Van De Wiel MA. Adaptive group-regularized logistic elastic net regression. Biostatistics 2021; 22:723-737. [PMID: 31886488 PMCID: PMC8596493 DOI: 10.1093/biostatistics/kxz062] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/16/2019] [Revised: 12/04/2019] [Accepted: 12/05/2019] [Indexed: 12/27/2022] Open
Abstract
In high-dimensional data settings, additional information on the features is often available. Examples of such external information in omics research are: (i) p-values from a previous study and (ii) omics annotation. The inclusion of this information in the analysis may enhance classification performance and feature selection but is not straightforward. We propose a group-regularized (logistic) elastic net regression method, where each penalty parameter corresponds to a group of features based on the external information. The method, termed gren, makes use of the Bayesian formulation of logistic elastic net regression to estimate both the model and penalty parameters in an approximate empirical–variational Bayes framework. Simulations and applications to three cancer genomics studies and one Alzheimer metabolomics study show that, if the partitioning of the features is informative, classification performance and feature selection are indeed enhanced.
Affiliation(s)
- Magnus M Münch
- Department of Epidemiology & Biostatistics, Amsterdam Public Health Research Institute, Amsterdam University Medical Centers, PO Box 7057, 1007 MB Amsterdam, The Netherlands and Mathematical Institute, Leiden University, PO Box 9512, 2300 RA Leiden, The Netherlands
- Carel F W Peeters
- Department of Epidemiology & Biostatistics, Amsterdam Public Health Research Institute, Amsterdam University Medical Centers, PO Box 7057, 1007 MB Amsterdam, The Netherlands
- Aad W Van Der Vaart
- Mathematical Institute, Leiden University, PO Box 9512, 2300 RA Leiden, The Netherlands
- Mark A Van De Wiel
- Department of Epidemiology & Biostatistics, Amsterdam Public Health Research Institute, Amsterdam University Medical Centers, PO Box 7057, 1007 MB Amsterdam, The Netherlands and MRC Biostatistics Unit, University of Cambridge, Cambridge CB2 0SR, UK
7
Affiliation(s)
- Jingyi K. Tay
- Department of Statistics, Stanford University, Stanford, CA, USA
- Jerome Friedman
- Department of Statistics, Stanford University, Stanford, CA, USA
- Robert Tibshirani
- Department of Statistics, Stanford University, Stanford, CA, USA
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
8
Joshi N, Nguyen C, Ivanova A. Multi-stage adaptive enrichment trial design with subgroup estimation. J Biopharm Stat 2020; 30:1038-1049. [PMID: 33073685 DOI: 10.1080/10543406.2020.1832109] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 01/02/2023]
Abstract
We consider the problem of estimating the best subgroup and testing for treatment effect in a clinical trial. We define the best subgroup as the subgroup that maximizes a utility function that reflects the trade-off between the subgroup size and the treatment effect. For moderate effect sizes and sample sizes, simpler methods for subgroup estimation worked better than more complex tree-based regression approaches. We propose a three-stage design with a weighted inverse normal combination test to test the hypothesis of no treatment effect across the three stages.
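The weighted inverse normal combination test mentioned in this abstract has a standard textbook form: stage-wise one-sided p-values are converted to z-scores and combined with prespecified weights whose squares sum to one. The sketch below shows that generic form only; the function name `inverse_normal_combination`, the example weights, and the example p-values are assumptions and are not taken from the paper's design.

```python
from math import sqrt
from statistics import NormalDist


def inverse_normal_combination(p_values, weights):
    """Combine one-sided stage-wise p-values with the weighted inverse normal method.

    p_values: one-sided p-values, each strictly between 0 and 1.
    weights:  prespecified positive weights; they are rescaled here so that
              their squares sum to one.
    Returns the combined z-statistic and the combined one-sided p-value.
    """
    std_normal = NormalDist()
    scale = sqrt(sum(w * w for w in weights))
    z = sum((w / scale) * std_normal.inv_cdf(1.0 - p)
            for w, p in zip(weights, p_values))
    return z, 1.0 - std_normal.cdf(z)


# Hypothetical three-stage example with equal weights.
z_stat, p_combined = inverse_normal_combination([0.04, 0.10, 0.03], [1.0, 1.0, 1.0])
print(z_stat, p_combined)
```

The weights must be fixed before the data are observed (for example from planned stage sizes); under that condition the combined statistic is standard normal under the null hypothesis even if later stages are adapted based on earlier ones.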
Affiliation(s)
- Neha Joshi
- Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Crystal Nguyen
- Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Anastasia Ivanova
- Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
9
Zhou S, Zhou J, Zhang B. Overlapping group lasso for high-dimensional generalized linear models. COMMUN STAT-THEOR M 2019. [DOI: 10.1080/03610926.2018.1500604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
Affiliation(s)
- Shengbin Zhou
- Department of Statistics, Harbin Normal University, Harbin, China
- Jingke Zhou
- Department of Statistics, Harbin Normal University, Harbin, China
- Bo Zhang
- Department of Statistics, Harbin Normal University, Harbin, China
10
Tang Z, Lei S, Zhang X, Yi Z, Guo B, Chen JY, Shen Y, Yi N. Gsslasso Cox: a Bayesian hierarchical model for predicting survival and detecting associated genes by incorporating pathway information. BMC Bioinformatics 2019; 20:94. [PMID: 30813883 PMCID: PMC6391807 DOI: 10.1186/s12859-019-2656-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 05/21/2018] [Accepted: 01/28/2019] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Group structures among genes encoded in functional relationships or biological pathways are valuable and unique features in large-scale molecular data for survival analysis. However, most previous approaches for molecular data analysis ignore such group structures. It is desirable to develop powerful analytic methods for incorporating valuable pathway information for predicting disease survival outcomes and detecting associated genes. RESULTS We here propose a Bayesian hierarchical Cox survival model, called the group spike-and-slab lasso Cox (gsslasso Cox), for predicting disease survival outcomes and detecting associated genes by incorporating group structures of biological pathways. Our hierarchical model employs a novel prior on the coefficients of genes, i.e., the group spike-and-slab double-exponential distribution, to integrate group structures and to adaptively shrink the effects of genes. We have developed a fast and stable deterministic algorithm to fit the proposed models. We performed extensive simulation studies to assess the model fitting properties and the prognostic performance of the proposed method, and also applied our method to analyze three cancer data sets. CONCLUSIONS Both the theoretical and empirical studies show that the proposed method can induce weaker shrinkage on predictors in an active pathway, thereby incorporating the biological similarity of genes within the same pathway into the hierarchical modeling. Compared with several existing methods, the proposed method can more accurately estimate gene effects and can better predict survival outcomes. For the three cancer data sets, the results show that the proposed method generates more powerful models for survival prediction and detecting associated genes. The method has been implemented in a freely available R package BhGLM at https://github.com/nyiuab/BhGLM.
Affiliation(s)
- Zaixiang Tang
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, University of Alabama at Birmingham, Suzhou, 215123 China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, 215123 China
- Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294-0022 USA
- Shufeng Lei
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, University of Alabama at Birmingham, Suzhou, 215123 China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, 215123 China
- Xinyan Zhang
- Department of Biostatistics, Jiann-Ping Hsu College of Public Health, Georgia Southern University, Statesboro, GA 30458 USA
- Zixuan Yi
- Eastern Virginia Medical School, Norfolk, VA 23507 USA
- Boyi Guo
- Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294-0022 USA
- Jake Y. Chen
- Informatics Institute, School of Medicine, University of Alabama at Birmingham, Birmingham, AL 35294 USA
- Yueping Shen
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, University of Alabama at Birmingham, Suzhou, 215123 China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, 215123 China
- Nengjun Yi
- Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294-0022 USA
11
Zhou S, Zhou J, Zhang B. High-dimensional generalized linear models incorporating graphical structure among predictors. Electron J Stat 2019. [DOI: 10.1214/19-ejs1601] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
12
Chowdhury S, Chatterjee S, Mallick H, Banerjee P, Garai B. Group regularization for zero-inflated poisson regression models with an application to insurance ratemaking. J Appl Stat 2018. [DOI: 10.1080/02664763.2018.1555232] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 10/27/2022]
Affiliation(s)
- Shrabanti Chowdhury
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Himel Mallick
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
13
Chang C, Kundu S, Long Q. Scalable Bayesian variable selection for structured high-dimensional data. Biometrics 2018; 74:1372-1382. [PMID: 29738602 PMCID: PMC6222001 DOI: 10.1111/biom.12882] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 05/01/2017] [Revised: 02/01/2018] [Accepted: 02/01/2018] [Indexed: 12/30/2022]
Abstract
Variable selection for structured covariates lying on an underlying known graph is a problem motivated by practical applications, and has been a topic of increasing interest. However, most of the existing methods may not be scalable to high-dimensional settings involving tens of thousands of variables lying on known pathways, as is the case in genomics studies. We propose an adaptive Bayesian shrinkage approach which incorporates prior network information by smoothing the shrinkage parameters for connected variables in the graph, so that the corresponding coefficients have a similar degree of shrinkage. We fit our model via a computationally efficient expectation maximization algorithm which is scalable to high-dimensional settings (p ∼ 100,000). Theoretical properties for fixed as well as increasing dimensions are established, even when the number of variables increases faster than the sample size. We demonstrate the advantages of our approach in terms of variable selection, prediction, and computational scalability via a simulation study, and apply the method to a cancer genomics study.
Affiliation(s)
- Changgee Chang
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, U.S.A
- Suprateek Kundu
- Department of Biostatistics, Emory University, Atlanta, Georgia, U.S.A
- Qi Long
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, U.S.A
14
Lee S, Lee Y, Pawitan Y. Sparse pathway-based prediction models for high-throughput molecular data. Comput Stat Data Anal 2018. [DOI: 10.1016/j.csda.2018.04.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
15
Wang JH, Chen YH. Overlapping group screening for detection of gene-gene interactions: application to gene expression profiles with survival trait. BMC Bioinformatics 2018; 19:335. [PMID: 30241463 PMCID: PMC6150983 DOI: 10.1186/s12859-018-2372-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/29/2018] [Accepted: 09/12/2018] [Indexed: 01/29/2023] Open
Abstract
Background The development of a disease is a complex process that may result from joint effects of multiple genes. In this article, we propose the overlapping group screening (OGS) approach to determining active genes and gene-gene interactions incorporating prior pathway information. The OGS method is developed to overcome two challenges in genome-wide data analysis: the number of genes and gene-gene interactions is far greater than the sample size, and the pathways generally overlap with one another. The OGS method is further proposed for patients’ survival prediction based on gene expression data. Results Simulation studies demonstrate that the OGS approach performs well in identifying the true main and interaction effects, and that the survival prediction accuracy of OGS with the Lasso penalty is better than that of the ordinary Lasso method. In real data analysis, we identify several significant genes and/or epistasis interactions that are associated with clinical survival outcomes of diffuse large B-cell lymphoma (DLBCL) and non-small-cell lung cancer (NSCLC) by utilizing prior pathway information from the KEGG pathway and the GO biological process databases, respectively. Conclusions The OGS approach is useful for selecting important genes and epistasis interactions in the ultra-high dimensional feature space. The prediction ability of OGS with the Lasso penalty is better than that of existing methods. The OGS approach is generally applicable to various types of outcome data (quantitative, qualitative, censored event time data) and regression models (e.g. linear, logistic, and Cox’s regression models).
Affiliation(s)
- Jie-Huei Wang
  - Institute of Statistical Science, Academia Sinica, Nankang, Taipei, Taiwan
- Yi-Hau Chen
  - Institute of Statistical Science, Academia Sinica, Nankang, Taipei, Taiwan
|
16
|
Svoboda M, Mungenast F, Gleiss A, Vergote I, Vanderstichele A, Sehouli J, Braicu E, Mahner S, Jäger W, Mechtcheriakova D, Cacsire-Tong D, Zeillinger R, Thalhammer T, Pils D. Clinical Significance of Organic Anion Transporting Polypeptide Gene Expression in High-Grade Serous Ovarian Cancer. Front Pharmacol 2018; 9:842. [PMID: 30131693 PMCID: PMC6090214 DOI: 10.3389/fphar.2018.00842] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 05/05/2018] [Accepted: 07/13/2018] [Indexed: 12/31/2022] Open
Abstract
High-grade serous ovarian cancer (HGSOC) is considered the most deadly and frequently occurring type of ovarian cancer and is associated with various molecular compositions and growth patterns. Evaluating the mRNA expression pattern of the organic anion transporters (OATPs) encoded by SLCO genes may allow for improved stratification of HGSOC patients for targeted intervention. The expression of SLCO mRNA and genes coding for putative functionally related ABC-efflux pumps, enzymes, pregnane-X-receptor, ESR1 and ESR2 (coding for estrogen receptors ERα and ERβ) and HER-2 were assessed using RT-qPCR. The expression levels were assessed in a cohort of 135 HGSOC patients to elucidate the independent impact of the expression pattern on the overall survival (OS). For identification of putative regulatory networks, Graphical Gaussian Models were constructed from the expression data with a tuning parameter K varying between meaningful borders (Pils et al., 2012; Auer et al., 2015, 2017; Kurman and Shih Ie, 2016; Karam et al., 2017; Labidi-Galy et al., 2017; Salomon-Perzynski et al., 2017; Sukhbaatar et al., 2017). The final value used (K = 4) was determined by maximizing the proportion of explained variation of the corresponding LASSO Cox regression model for OS. The following two networks of directly correlated genes were identified: (i) SLCO2B1 with ABCC3 implicated in estrogen homeostasis; and (ii) two ABC-efflux pumps in the immune regulation (ABCB2/ABCB3) with ABCC3 and HER-2. Combining LASSO Cox regression and univariate Cox regression analyses, SLCO5A1 coding for OATP5A1, an estrogen metabolite transporter located in the cytoplasm and plasma membranes of ovarian cancer cells, was identified as a significant and independent prognostic factor for OS (HR = 0.68, CI 0.49-0.93; p = 0.031). Furthermore, results indicated a benefit for patients with high expression, which added 5.1% to the 12.8% proportion of explained variation (PEV) attributable to clinicopathological parameters of known prognostic significance (FIGO stage, age and residual tumor after debulking). Additionally, overlap with previously described signatures that indicated a more favorable prognosis for ovarian cancer patients was shown for SLCO5A1, the network ABCB2/ABCB3/ABCC4/HER2 as well as ESR1. Furthermore, expression of SLCO2A1 and PGDH, which are important for PGE2 degradation, was associated with the non-miliary peritoneal tumor spreading. In conclusion, the present findings suggested that SLCOs and the related molecules identified as potential biomarkers in HGSOC may be useful for the development of novel therapeutic strategies.
Affiliation(s)
- Martin Svoboda
  - Department of Pathophysiology and Allergy Research, Center for Pathophysiology, Infectiology and Immunology, Medical University of Vienna, Vienna, Austria
- Felicitas Mungenast
  - Department of Pathophysiology and Allergy Research, Center for Pathophysiology, Infectiology and Immunology, Medical University of Vienna, Vienna, Austria
- Andreas Gleiss
  - Institute of Clinical Biometrics, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria
- Ignace Vergote
  - Division of Gynaecological Oncology, Department of Gynaecology and Obstetrics, Leuven Cancer Institute, University Hospital Leuven, Katholieke Universiteit Leuven, Leuven, Belgium
- Adriaan Vanderstichele
  - Division of Gynaecological Oncology, Department of Gynaecology and Obstetrics, Leuven Cancer Institute, University Hospital Leuven, Katholieke Universiteit Leuven, Leuven, Belgium
- Jalid Sehouli
  - Department of Gynecology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Berlin Institute of Health, Humboldt-Universität zu Berlin, Berlin, Germany
- Elena Braicu
  - Department of Gynecology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Berlin Institute of Health, Humboldt-Universität zu Berlin, Berlin, Germany
- Sven Mahner
  - Department of Gynecology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Walter Jäger
  - Department of Clinical Pharmacy and Diagnostics, University of Vienna, Vienna, Austria
- Diana Mechtcheriakova
  - Department of Pathophysiology and Allergy Research, Center for Pathophysiology, Infectiology and Immunology, Medical University of Vienna, Vienna, Austria
- Dan Cacsire-Tong
  - Translational Gynecology Group, Department of Obstetrics and Gynaecology, Comprehensive Cancer Center, Medical University of Vienna, Vienna, Austria
- Robert Zeillinger
  - Molecular Oncology Group, Department of Obstetrics and Gynaecology, Comprehensive Cancer Center, Medical University of Vienna, Vienna, Austria
- Theresia Thalhammer
  - Department of Pathophysiology and Allergy Research, Center for Pathophysiology, Infectiology and Immunology, Medical University of Vienna, Vienna, Austria
- Dietmar Pils
  - Institute of Clinical Biometrics, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Vienna, Austria
  - Department of Surgery, Medical University of Vienna, Vienna, Austria
|
17
|
Chatterjee S, Chowdhury S, Mallick H, Banerjee P, Garai B. Group regularization for zero-inflated negative binomial regression models with an application to health care demand in Germany. Stat Med 2018; 37:3012-3026. [PMID: 29900575 DOI: 10.1002/sim.7804] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 01/19/2018] [Revised: 03/21/2018] [Accepted: 04/12/2018] [Indexed: 11/10/2022]
Abstract
In many biomedical applications, covariates are naturally grouped, with variables in the same group being systematically related or statistically correlated. Under such settings, variable selection must be conducted at both group and individual variable levels. Motivated by the widespread availability of zero-inflated count outcomes and grouped covariates in many practical applications, we consider group regularization for zero-inflated negative binomial regression models. Using a least squares approximation of the mixture likelihood and a variety of group-wise penalties on the coefficients, we propose a unified algorithm (Gooogle: Group Regularization for Zero-inflated Count Regression Models) to efficiently compute the entire regularization path of the estimators. We investigate the finite sample performance of these methods through extensive simulation experiments and the analysis of a German health care demand dataset. Finally, we derive theoretical properties of these methods under reasonable assumptions, which further provides deeper insight into the asymptotic behavior of these approaches. The open source software implementation of this method is publicly available at: https://github.com/himelmallick/Gooogle.
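As a rough illustration of the ingredients named in this abstract, the sketch below writes out a zero-inflated negative binomial probability mass function and a group lasso penalty. It is not code from the Gooogle package: the function names, the mean/dispersion parameterization (`mu`, `theta`), the zero-inflation probability `pi`, and the choice of the group lasso with a square-root group-size weight are all assumptions made here for illustration, and the abstract's least squares approximation step is not reproduced.

```python
import math


def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu and dispersion theta
    (variance mu + mu**2 / theta). Parameterization assumed for illustration."""
    log_p = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, theta, pi):
    """Zero-inflated NB: a structural zero with probability pi,
    otherwise a draw from the negative binomial component."""
    base = (1.0 - pi) * nb_pmf(y, mu, theta)
    return pi + base if y == 0 else base


def group_lasso_penalty(beta, groups, lam):
    """Group lasso penalty lam * sum_g sqrt(|g|) * ||beta_g||_2.
    `groups` is a list of index lists; one of several possible group-wise penalties."""
    total = 0.0
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        total += math.sqrt(len(idx)) * norm
    return lam * total
```

Because the penalty acts on the Euclidean norm of each coefficient group, a group whose coefficients are all zero contributes nothing, which is what allows selection at the group level.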
Affiliation(s)
- Saptarshi Chatterjee
  - Division of Statistics, Department of Mathematical Sciences, Northern Illinois University, DeKalb, IL, 60115, USA
- Shrabanti Chowdhury
  - Center for Molecular Medicine and Genetics, School of Medicine, Wayne State University, Detroit, MI, 48202, USA
- Himel Mallick
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Broti Garai
  - Monsanto Company, Chesterfield, MO, 63017, USA
|
18
|
Tang Z, Shen Y, Li Y, Zhang X, Wen J, Qian C, Zhuang W, Shi X, Yi N. Group spike-and-slab lasso generalized linear models for disease prediction and associated genes detection by incorporating pathway information. Bioinformatics 2018; 34:901-910. [PMID: 29077795 PMCID: PMC5860634 DOI: 10.1093/bioinformatics/btx684] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 06/15/2017] [Revised: 10/05/2017] [Accepted: 10/24/2017] [Indexed: 01/10/2023] Open
Abstract
MOTIVATION Large-scale molecular data have been increasingly used as an important resource for prognostic prediction of diseases and detection of associated genes. However, standard approaches for omics data analysis ignore the group structure among genes encoded in functional relationships or pathway information. RESULTS We propose new Bayesian hierarchical generalized linear models, called group spike-and-slab lasso GLMs, for predicting disease outcomes and detecting associated genes by incorporating large-scale molecular data and group structures. The proposed model employs a mixture double-exponential prior for coefficients that induces a self-adaptive amount of shrinkage on different coefficients. The group information is incorporated into the model by setting group-specific parameters. We have developed a fast and stable deterministic algorithm to fit the proposed hierarchical GLMs, which can perform variable selection within groups. We assess the performance of the proposed method on several simulated scenarios, by varying the overlap among groups, group size, number of non-null groups, and the correlation within group. Compared with existing methods, the proposed method provides not only more accurate estimates of the parameters but also better prediction. We further demonstrate the application of the proposed procedure on three cancer datasets by utilizing pathway structures of genes. Our results show that the proposed method generates powerful models for predicting disease outcomes and detecting associated genes. AVAILABILITY AND IMPLEMENTATION The methods have been implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/). CONTACT nyi@uab.edu.
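One common way to write a mixture double-exponential (spike-and-slab lasso) prior with group-specific parameters is sketched below. The notation is assumed here for illustration and is not quoted from the abstract: $\gamma_j$ is a latent inclusion indicator for coefficient $\beta_j$, $s_0 < s_1$ are the spike and slab scales, and $\theta_g$ is an inclusion probability shared by the coefficients in group $g$.

```latex
\begin{align*}
\beta_j \mid \gamma_j &\sim \mathrm{DE}(0, S_j), \qquad
  S_j = (1-\gamma_j)\, s_0 + \gamma_j\, s_1, \qquad 0 < s_0 < s_1, \\
p(\beta_j \mid S_j) &= \frac{1}{2 S_j} \exp\!\left(-\frac{|\beta_j|}{S_j}\right), \\
\gamma_j \mid \theta_g &\sim \mathrm{Bernoulli}(\theta_g), \qquad j \in \text{group } g .
\end{align*}
```

Under this form, a coefficient with $\gamma_j = 0$ receives the small scale $s_0$ and is shrunk strongly toward zero, while $\gamma_j = 1$ gives the larger scale $s_1$ and weaker shrinkage, which is the sense in which the amount of shrinkage adapts per coefficient.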
Affiliation(s)
- Zaixiang Tang
  - Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China
  - Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China
  - Center for Genetic Epidemiology and Genomics, Medical College of Soochow University, Suzhou, China
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Yueping Shen
  - Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, China
  - Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou, China
- Yan Li
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Xinyan Zhang
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Jia Wen
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, USA
- Chen’ao Qian
  - Department of Bioinformatics, School of Biology & Basic Medical Science, Soochow University, Suzhou, China
- Wenzhuo Zhuang
  - Department of Cell Biology, School of Biology & Basic Medical Science, Soochow University, Suzhou, China
- Xinghua Shi
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, USA
- Nengjun Yi
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
|