1
|
Nizeyimana P, Lee KE, Kim I. Bayesian pathway selection. J Korean Stat Soc 2023. [DOI: 10.1007/s42952-022-00201-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2023]
|
2
|
Carpenter CM, Zhang W, Gillenwater L, Severn C, Ghosh T, Bowler R, Kechris K, Ghosh D. PaIRKAT: A pathway integrated regression-based kernel association test with applications to metabolomics and COPD phenotypes. PLoS Comput Biol 2021; 17:e1008986. [PMID: 34679079 PMCID: PMC8565741 DOI: 10.1371/journal.pcbi.1008986] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/23/2021] [Revised: 11/03/2021] [Accepted: 10/13/2021] [Indexed: 02/02/2023]
Abstract
High-throughput data such as metabolomics, genomics, transcriptomics, and proteomics have become familiar data types within the "-omics" family. For this work, we focus on subsets that interact with one another and represent these "pathways" as graphs. Observed pathways often have disjoint components, i.e., nodes or sets of nodes (metabolites, etc.) not connected to any other within the pathway, which notably lessens testing power. In this paper we propose the Pathway Integrated Regression-based Kernel Association Test (PaIRKAT), a new kernel machine regression method for incorporating known pathway information into the semi-parametric kernel regression framework. This work extends previous kernel machine approaches. This paper also contributes an application of a graph kernel regularization method for overcoming disconnected pathways. By incorporating a regularized or "smoothed" graph into a score test, PaIRKAT can provide more powerful tests for associations between biological pathways and phenotypes of interest and will be helpful in identifying novel pathways for targeted clinical research. We evaluate this method through several simulation studies and an application to real metabolomics data from the COPDGene study. Our simulation studies illustrate the robustness of this method to incorrect and incomplete pathway knowledge, and the real data analysis shows meaningful improvements of testing power in pathways. PaIRKAT was developed for application to metabolomic pathway data, but the techniques are easily generalizable to other data sources with a graph-like structure.
Affiliation(s)
- Charlie M. Carpenter
  - Department of Biostatistics and Informatics, University of Colorado Denver, Anschutz Medical Campus, Denver, Colorado, United States of America
- Weiming Zhang
  - Syneos Health, Morrisville, North Carolina, United States of America
- Lucas Gillenwater
  - Computational Bioscience Program, University of Colorado Denver, Anschutz Medical Campus, Denver, Colorado, United States of America
- Cameron Severn
  - Department of Biostatistics and Informatics, University of Colorado Denver, Anschutz Medical Campus, Denver, Colorado, United States of America
- Tusharkanti Ghosh
  - Department of Biostatistics and Informatics, University of Colorado Denver, Anschutz Medical Campus, Denver, Colorado, United States of America
- Russell Bowler
  - Department of Medicine, National Jewish Health, Denver; University of Colorado Denver, Anschutz Medical Campus, Denver, Colorado, United States of America
- Katerina Kechris
  - Department of Biostatistics and Informatics, University of Colorado Denver, Anschutz Medical Campus, Denver, Colorado, United States of America
- Debashis Ghosh
  - Department of Biostatistics and Informatics, University of Colorado Denver, Anschutz Medical Campus, Denver, Colorado, United States of America
|
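The PaIRKAT abstract in entry 2 describes feeding a regularized ("smoothed") pathway graph into a kernel score test so that disconnected nodes do not weaken the test. A minimal sketch of that idea follows. It assumes a normalized graph Laplacian and the regularization (I + tau L)^{-1}, which is one common choice and not necessarily the one used in the paper. The function names, the `tau` parameter, and the unscaled statistic are hypothetical, and the p-value calculation is omitted.

```python
import numpy as np

def regularized_graph_kernel(Z, A, tau=1.0):
    """Linear kernel on pathway data smoothed by a regularized graph Laplacian.

    Z   : (n, p) data matrix, one column per pathway node (e.g. metabolite).
    A   : (p, p) symmetric adjacency matrix of the pathway graph.
    tau : smoothing strength (hypothetical tuning parameter).
    """
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    # D^{-1/2}, with 0 for isolated nodes to avoid dividing by zero.
    d_inv_sqrt = np.zeros_like(deg)
    mask = deg > 0
    d_inv_sqrt[mask] = deg[mask] ** -0.5
    p = A.shape[0]
    # Normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
    L = np.eye(p) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # (I + tau L) is positive definite even when the graph is disconnected.
    R = np.linalg.inv(np.eye(p) + tau * L)
    return Z @ R @ Z.T

def score_statistic(y, X, K):
    """Unscaled score-type statistic r' K r, with r the residuals of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ K @ r)
```

Larger values of the statistic indicate that subjects with similar smoothed pathway profiles also have similar residual phenotypes. Turning the statistic into a p-value requires its null distribution, which this sketch does not attempt.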
3
|
Zhang L, Kim I. Finite mixtures of semiparametric Bayesian survival kernel machine regressions: Application to breast cancer gene pathway subgroup analysis. J R Stat Soc Ser C Appl Stat 2020. [DOI: 10.1111/rssc.12457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Affiliation(s)
- Lin Zhang
  - Department of Statistics, Virginia Tech, Blacksburg, VA, USA
- Inyoung Kim
  - Department of Statistics, Virginia Tech, Blacksburg, VA, USA
|
4
|
Xu Y, Kim I, Carroll RJ. A hybrid omnibus test for generalized semiparametric single-index models with high-dimensional covariate sets. Biometrics 2019; 75:757-767. [PMID: 30859553 DOI: 10.1111/biom.13054] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/14/2017] [Accepted: 02/26/2019] [Indexed: 11/27/2022]
Abstract
Numerous statistical methods have been developed for analyzing high-dimensional data. These methods often focus on variable selection but are of limited use for testing with high-dimensional data, and they often require explicit likelihood functions. In this article, we propose a "hybrid omnibus test" for testing with high-dimensional data under much weaker requirements. Our hybrid omnibus test is developed under a semiparametric framework where a likelihood function is no longer necessary. Our test is a version of a frequentist-Bayesian hybrid score-type test for a generalized partially linear single-index model, whose link function depends on a set of variables through a generalized partially linear single index. We propose an efficient score based on estimating equations, define local tests, and then construct our hybrid omnibus test using local tests. We compare our approach with an empirical-likelihood ratio test and Bayesian inference based on Bayes factors, using simulation studies. Our simulation results suggest that our approach outperforms the others in terms of type I error, power, and computational cost in both the low- and high-dimensional cases. The advantage of our approach is demonstrated by applying it to genetic pathway data for type II diabetes mellitus.
Affiliation(s)
- Yangyi Xu
  - Department of Statistics, Virginia Tech, Blacksburg, Virginia
- Inyoung Kim
  - Department of Statistics, Virginia Tech, Blacksburg, Virginia
- Raymond J Carroll
  - Department of Statistics, Texas A&M University, 3143 TAMU, College Station, Texas; School of Mathematical and Physical Sciences, University of Technology, Sydney, Broadway, NSW, Australia
|
5
|
Zhang L, Kim I. Semiparametric Bayesian kernel survival model for evaluating pathway effects. Stat Methods Med Res 2018; 28:3301-3317. [PMID: 30289021 DOI: 10.1177/0962280218797360] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 01/07/2023]
Abstract
Massive amounts of high-dimensional data have been accumulated over the past two decades, which has cultivated increasing interest in identifying gene pathways related to certain biological processes. In particular, since pathway-based analysis can detect subtle changes in differentially expressed genes that could be missed by gene-based analysis, detecting the gene pathways that regulate certain diseases can provide new strategies for medical procedures and new targets for drug discovery. Limited work has been carried out, primarily in regression settings, to study the effects of pathways on survival outcomes. Motivated by a breast cancer gene-pathway data set, which exhibits the "small n, large p" characteristic, we propose a semiparametric Bayesian kernel survival model (s-BKSurv) to study the effects of both clinical covariates and gene expression levels within a pathway on survival time. We model the unknown high-dimensional functions of pathways via a Gaussian kernel machine to allow for the possibility that genes within the same pathway interact with each other. To address the multiple comparisons problem under a full Bayesian setting, we propose a similarity-dependent procedure based on Bayes factors to control the family-wise error rate. We demonstrate the superior performance of our approach under various simulation settings and on pathway data.
Affiliation(s)
- Lin Zhang
  - Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA
- Inyoung Kim
  - Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA
|
6
|
Sun P, Kim I, Lee KA. Dual-semiparametric regression using weighted Dirichlet process mixture. Comput Stat Data Anal 2018. [DOI: 10.1016/j.csda.2017.08.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
7
|
Bayesian Semiparametric Model for Pathway-Based Analysis with Zero-Inflated Clinical Outcomes. Journal of Agricultural, Biological and Environmental Statistics 2016. [DOI: 10.1007/s13253-016-0264-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
|
8
|
Pang H, Kim I, Zhao H. Random Effects Model for Multiple Pathway Analysis with Applications to Type II Diabetes Microarray Data. Statistics in Biosciences 2015; 7:167-186. [PMID: 26640601 PMCID: PMC4666561 DOI: 10.1007/s12561-014-9109-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 01/21/2023]
Abstract
Close to three percent of the world's population suffer from diabetes. Despite the range of treatment options available for diabetes patients, not all patients benefit from them. Investigating how different pathways correlate with phenotypes of interest may help unravel novel drug targets and discover a possible cure. Many pathway-based methods have been developed to incorporate biological knowledge into the study of microarray data. Most of these methods can only analyze individual pathways and cannot deal with two or more pathways in a model-based framework. This represents a serious limitation because, like genes, individual pathways do not work in isolation, and joint modeling may enable researchers to uncover patterns not seen in individual pathway-based analysis. In this paper, we propose a random effects model to analyze two or more pathways. We also derive score test statistics for the significance of pathway effects. We apply our method to a microarray study of Type II diabetes. Our method may elucidate how pathways crosstalk with each other and facilitate the investigation of pathway crosstalk. Further hypotheses on the biological mechanisms underlying the disease and traits of interest may be generated and tested based on this method.
Affiliation(s)
- Herbert Pang
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina 27705, U.S.A. Tel.: +919-681-5011, Fax: +919-668-5888
- Inyoung Kim
  - Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, Virginia 24061, U.S.A. Tel.: +540-231-5366, Fax: +540-231-3863
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, and Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520, U.S.A. Tel.: +203-785-6271, Fax: +203-785-6912
|
9
|
Pang H, Zhao H. Stratified pathway analysis to identify gene sets associated with oral contraceptive use and breast cancer. Cancer Inform 2014; 13:73-8. [PMID: 25574128 PMCID: PMC4263464 DOI: 10.4137/cin.s13973] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 06/15/2014] [Revised: 08/15/2014] [Accepted: 08/19/2014] [Indexed: 01/02/2023]
Abstract
Cancer biomarker discovery can facilitate drug development, improve staging of patients, and predict patient prognosis. Because cancer is the result of many interacting genes, analysis based on a set of genes with related biological functions or pathways may be more informative than single gene-based analysis for cancer biomarker discovery. The relevant pathways thus identified may help characterize different aspects of molecular phenotypes related to the tumor. Although it is well known that cancer patients may respond to the same treatment differently because of clinical variables and variation of molecular phenotypes, this patient heterogeneity has not been explicitly considered in pathway analysis in the literature. We hypothesize that combining pathway and patient clinical information can more effectively identify relevant pathways pertinent to specific patient subgroups, leading to better diagnosis and treatment. In this article, we propose to perform stratified pathway analysis based on clinical information from patients. In contrast to analysis using all the patients, this more focused analysis has the potential to reveal subgroup-specific pathways that may lead to more biological insights into disease etiology and treatment response. As an illustration, the power of our approach is demonstrated through its application to a breast cancer dataset in which the patients are stratified according to their oral contraceptive use.
Affiliation(s)
- Herbert Pang
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA; School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
|
10
|
An asymptotically minimax kernel machine. Stat Probab Lett 2014. [DOI: 10.1016/j.spl.2014.08.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
|
11
|
Wu MC, Maity A, Lee S, Simmons EM, Harmon QE, Lin X, Engel SM, Molldrem JJ, Armistead PM. Kernel machine SNP-set testing under multiple candidate kernels. Genet Epidemiol 2013; 37:267-75. [PMID: 23471868 PMCID: PMC3769109 DOI: 10.1002/gepi.21715] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.7] [Received: 10/04/2012] [Revised: 01/15/2013] [Accepted: 02/05/2013] [Indexed: 11/10/2022]
Abstract
Joint testing for the cumulative effect of multiple single-nucleotide polymorphisms grouped on the basis of prior biological knowledge has become a popular and powerful strategy for the analysis of large-scale genetic association studies. The kernel machine (KM)-testing framework is a useful approach that has been proposed for testing associations between multiple genetic variants and many different types of complex traits by comparing pairwise similarity in phenotype between subjects to pairwise similarity in genotype, with similarity in genotype defined via a kernel function. An advantage of the KM framework is its flexibility: choosing different kernel functions allows for different assumptions concerning the underlying model and can allow for improved power. In practice, it is difficult to know which kernel to use a priori because this depends on the unknown underlying trait architecture and selecting the kernel which gives the lowest P-value can lead to inflated type I error. Therefore, we propose practical strategies for KM testing when multiple candidate kernels are present based on constructing composite kernels and based on efficient perturbation procedures. We demonstrate through simulations and real data applications that the procedures protect the type I error rate and can lead to substantially improved power over poor choices of kernels and only modest differences in power vs. using the best candidate kernel.
Affiliation(s)
- Michael C Wu
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599-7420, USA.
|
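The abstract of entry 11 mentions constructing composite kernels when several candidate kernels are available. A minimal sketch of one simple construction follows: scale each candidate kernel by its trace so that no single kernel dominates, then average. This trace-normalized average is an assumption made for illustration, and the function names and the `rho` bandwidth are hypothetical. The perturbation procedures mentioned in the abstract are not reproduced here.

```python
import numpy as np

def linear_kernel(G):
    """Linear kernel: inner products between rows (subjects) of G."""
    return G @ G.T

def gaussian_kernel(G, rho):
    """Gaussian kernel exp(-||g_i - g_j||^2 / rho) between rows of G."""
    sq = np.sum(G ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * G @ G.T
    # Clip tiny negative distances caused by floating-point rounding.
    return np.exp(-np.maximum(d2, 0.0) / rho)

def composite_kernel(kernels):
    """Average of candidate kernels, each scaled to unit trace."""
    return sum(K / np.trace(K) for K in kernels) / len(kernels)
```

Because each scaled kernel is positive semidefinite with unit trace, the composite is also positive semidefinite with unit trace, so it can be used wherever a single kernel matrix is expected.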
12
|
Kim I, Pang H, Zhao H. Statistical properties on semiparametric regression for evaluating pathway effects. J Stat Plan Inference 2013; 143:745-763. [PMID: 24014933 PMCID: PMC3763850 DOI: 10.1016/j.jspi.2012.09.009] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/20/2022]
Abstract
Most statistical methods for microarray data analysis consider one gene at a time, and they may miss subtle changes at the single gene level. This limitation may be overcome by considering a set of genes simultaneously where the gene sets are derived from prior biological knowledge. We define a pathway as a predefined set of genes that serve a particular cellular or physiological function. Limited work has been done in regression settings to study the effects of clinical covariates and expression levels of genes in a pathway on a continuous clinical outcome. A semiparametric regression approach for identifying pathways related to a continuous outcome was proposed by Liu et al. (2007), who demonstrated the connection between a least squares kernel machine for a nonparametric pathway effect and restricted maximum likelihood (REML) for variance components. However, the asymptotic properties of semiparametric regression for identifying pathways have never been studied. In this paper, we study the asymptotic properties of the parameter estimates in semiparametric regression and compare Liu et al.'s REML with our REML obtained from a profile likelihood. We prove that both approaches provide consistent estimators, have [Formula: see text] convergence rate under regularity conditions, and have either an asymptotically normal distribution or a mixture of normal distributions. However, the estimators based on our REML obtained from a profile likelihood have a theoretically smaller mean squared error than those of Liu et al.'s REML. A simulation study supports this theoretical result. A profile restricted likelihood ratio test is also provided for the non-standard testing problem. We apply our approach to a type II diabetes data set (Mootha et al., 2003).
Affiliation(s)
- Inyoung Kim
  - Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA
- Herbert Pang
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27705, USA
- Hongyu Zhao
  - Division of Biostatistics, Yale School of Public Health, New Haven, CT 06520, USA
  - Department of Genetics, Yale University School of Medicine, New Haven, CT 06520, USA
|
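The abstract of entry 12 refers to the connection, attributed there to Liu et al. (2007), between a least squares kernel machine and REML for variance components. That connection can be written out as below. The notation (y, x, z, h, K, lambda, tau, sigma^2) is chosen here for illustration and does not appear in the listing.

```latex
\begin{align}
  % Semiparametric model: parametric clinical covariates x_i,
  % nonparametric pathway effect h of gene expressions z_i.
  y_i &= x_i^{\top}\beta + h(z_i) + \varepsilon_i,
      \qquad \varepsilon_i \sim N(0, \sigma^2), \\
  % Least squares kernel machine: penalized fit in the RKHS generated by K.
  (\hat\beta, \hat h) &= \arg\min_{\beta,\; h \in \mathcal{H}_K}
      \sum_{i=1}^{n} \bigl\{ y_i - x_i^{\top}\beta - h(z_i) \bigr\}^2
      + \lambda \, \lVert h \rVert_{\mathcal{H}_K}^2, \\
  % Equivalent linear mixed model: h is a random effect with covariance tau K.
  y &= X\beta + h + \varepsilon,
      \qquad h \sim N(0, \tau K),
      \qquad \tau = \sigma^2 / \lambda .
\end{align}
```

Under this representation, the absence of a pathway effect corresponds to tau = 0, which lies on the boundary of the parameter space. This is consistent with the abstract's description of a non-standard testing problem and of limiting distributions that may be mixtures.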