1
Rong Y, Zhao SD, Zheng X, Li Y. Kernel Cox partially linear regression: Building predictive models for cancer patients' survival. Stat Med 2024; 43:1-15. [PMID: 37875428 DOI: 10.1002/sim.9938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 09/30/2023] [Accepted: 10/03/2023] [Indexed: 10/26/2023]
Abstract
Wide heterogeneity exists in cancer patients' survival, ranging from a few months to several decades. To predict clinical outcomes accurately, it is vital to build a predictive model that relates patients' molecular profiles to their survival. Because the relationships between survival and high-dimensional molecular predictors are complex, it is challenging to conduct nonparametric modeling and remove irrelevant predictors simultaneously. In this article, we build a kernel Cox proportional hazards semi-parametric model and propose a novel regularized garrotized kernel machine (RegGKM) method to fit the model. We use the kernel machine method to describe the complex relationship between survival and predictors, while automatically removing irrelevant parametric and nonparametric predictors through a LASSO penalty. An efficient high-dimensional algorithm is developed for the proposed method. Comparison with other competing methods in simulation shows that the proposed method always has better predictive accuracy. We apply this method to analyze a multiple myeloma dataset and predict the patients' death burden based on their gene expressions. Our results can help classify patients into groups with different death risks, facilitating treatment for better clinical outcomes.
Affiliation(s)
- Yaohua Rong
  - Faculty of Science, Beijing University of Technology, Beijing, China
- Sihai Dave Zhao
  - Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, Illinois, USA
- Xia Zheng
  - Faculty of Science, Beijing University of Technology, Beijing, China
- Yi Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
2
Blumhagen RZ, Schwartz DA, Langefeld CD, Fingerlin TE. Identification of Influential Variants in Significant Aggregate Rare Variant Tests. Hum Hered 2021; 85:1-13. [PMID: 33567433 PMCID: PMC8353006 DOI: 10.1159/000513290] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/04/2020] [Accepted: 11/19/2020] [Indexed: 12/17/2022] Open
Abstract
INTRODUCTION Studies that examine the role of rare variants in both simple and complex disease are increasingly common. Though the usual approach of testing rare variants in aggregate sets is more powerful than testing individual variants, it is of interest to identify the variants that are plausible drivers of the association. We present a novel method for prioritization of rare variants after a significant aggregate test by quantifying the influence of the variant on the aggregate test of association. METHODS In addition to providing a measure used to rank variants, we use outlier detection methods to present the computationally efficient Rare Variant Influential Filtering Tool (RIFT) to identify a subset of variants that influence the disease association. We evaluated several outlier detection methods that vary based on the underlying variance measure: interquartile range (Tukey fences), median absolute deviation, and SD. We performed 1,000 simulations for 50 regions of size 3 kb and compared the true and false positive rates. We compared RIFT using the inner Tukey fences to 2 existing methods: adaptive combination of p values (ADA) and a Bayesian hierarchical model (BeviMed). Finally, we applied this method to data from our targeted resequencing study in idiopathic pulmonary fibrosis (IPF). RESULTS All outlier detection methods showed higher sensitivity to detect uncommon variants (0.001 < minor allele frequency, MAF < 0.03) compared to very rare variants (MAF < 0.001). For uncommon variants, RIFT had a lower median false positive rate compared to the ADA. ADA and RIFT had significantly higher true positive rates than that observed for BeviMed. When applied to 2 regions found previously associated with IPF including 100 rare variants, we identified 6 polymorphisms with the greatest evidence for influencing the association with IPF.
DISCUSSION In summary, RIFT has a high true positive rate while maintaining a low false positive rate for identifying polymorphisms influencing rare variant association tests. This work provides an approach to obtain greater resolution of the rare variant signals within significant aggregate sets; this information can provide an objective measure to prioritize variants for follow-up experimental studies and insight into the biological pathways involved.
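The outlier-detection step described in the abstract can be illustrated with a minimal sketch of inner Tukey fences. This is not the RIFT implementation: the function names, the per-variant influence scores, and the fence multiplier `k=1.5` are assumptions made for illustration, and the scores are invented numbers standing in for whatever influence measure is computed for each variant.

```python
from statistics import quantiles


def tukey_inner_fences(values, k=1.5):
    """Return (lower, upper) inner Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_influential(scores, k=1.5):
    """Return the sorted names of variants whose score lies outside the fences."""
    lower, upper = tukey_inner_fences(list(scores.values()), k)
    return sorted(name for name, s in scores.items() if s < lower or s > upper)


# Hypothetical influence scores, one per variant (illustrative values only).
scores = {
    "v1": 0.10, "v2": -0.20, "v3": 0.05, "v4": 0.00,
    "v5": 0.15, "v6": -0.10, "v7": 4.20, "v8": 0.20,
}
flagged = flag_influential(scores)
print(flagged)
```

Swapping the interquartile range for the median absolute deviation or the SD, the other two variance measures named in the abstract, would only change how the fence width is computed.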
Affiliation(s)
- Rachel Z Blumhagen
  - Center for Genes, Environment and Health, National Jewish Health, Denver, Colorado, USA
  - Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, Colorado, USA
- David A Schwartz
  - School of Medicine, University of Colorado, Aurora, Colorado, USA
- Carl D Langefeld
  - Department of Biostatistics and Data Science, Wake Forest School of Medicine, Winston-Salem, North Carolina, USA
  - Comprehensive Cancer Center, Wake Forest Baptist Medical Center, Winston-Salem, North Carolina, USA
  - Center for Precision Medicine, Wake Forest School of Medicine, Winston-Salem, North Carolina, USA
- Tasha E Fingerlin
  - Center for Genes, Environment and Health, National Jewish Health, Denver, Colorado, USA
  - Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, Colorado, USA
  - School of Medicine, University of Colorado, Aurora, Colorado, USA
3
Yang S, Wen J, Eckert ST, Wang Y, Liu DJ, Wu R, Li R, Zhan X. Prioritizing genetic variants in GWAS with lasso using permutation-assisted tuning. Bioinformatics 2020; 36:3811-3817. [PMID: 32246825 DOI: 10.1093/bioinformatics/btaa229] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/30/2019] [Revised: 02/19/2020] [Accepted: 03/31/2020] [Indexed: 01/13/2023] Open
Abstract
MOTIVATION Large-scale genome-wide association studies (GWAS) have resulted in the identification of a wide range of genetic variants related to a host of complex traits and disorders. Despite their success, the individual single-nucleotide polymorphism (SNP) analysis approach adopted in most current GWAS can be limited in that it is usually biologically too simple to elucidate a comprehensive genetic architecture of phenotypes and statistically underpowered due to the heavy multiple-testing correction burden. On the other hand, multiple-SNP analyses (e.g. gene-based or region-based SNP-set analysis) are usually more powerful to examine the joint effects of a set of SNPs on the phenotype of interest. However, current multiple-SNP approaches can only draw an overall conclusion at the SNP-set level and do not directly inform which SNPs in the SNP-set are driving the overall genotype-phenotype association. RESULTS In this article, we propose a new permutation-assisted tuning procedure in lasso (plasso) to identify phenotype-associated SNPs in a joint multiple-SNP regression model in GWAS. The tuning parameter of lasso determines the amount of shrinkage and is essential to the performance of variable selection. In the proposed plasso procedure, we first generate permutations as pseudo-SNPs that are not associated with the phenotype. Then, the lasso tuning parameter is delicately chosen to separate true signal SNPs and non-informative pseudo-SNPs. We illustrate plasso using simulations to demonstrate its superior performance over existing methods, and application of plasso to a real GWAS dataset yields additional insights into the genetic control of complex traits. AVAILABILITY AND IMPLEMENTATION R code to implement the proposed methodology is available at https://github.com/xyz5074/plasso. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
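The idea of tuning lasso against permuted pseudo-SNPs can be sketched as follows. This is an illustration under stated assumptions rather than the authors' R code: the selection rule used here (take the smallest λ on a grid at which no pseudo-SNP has a nonzero coefficient) is one plausible reading of "separating" signal SNPs from pseudo-SNPs, and the function names, the λ grid, the single shared row permutation, and the simulated data are all hypothetical.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def center(values):
    """Subtract the mean so that no intercept term is needed."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def lasso_cd(cols, y, lam, n_sweeps=100):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1 (columns centered)."""
    n = len(y)
    beta = [0.0] * len(cols)
    resid = list(y)
    norms = [sum(v * v for v in c) / n for c in cols]
    for _ in range(n_sweeps):
        for j, c in enumerate(cols):
            if norms[j] == 0.0:
                continue
            rho = sum(a * r for a, r in zip(c, resid)) / n + norms[j] * beta[j]
            new = soft_threshold(rho, lam) / norms[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * a for r, a in zip(resid, c)]
                beta[j] = new
    return beta


def plasso_select(cols, y, lam_grid, rng):
    """Return real SNP indices selected at the smallest lambda with no pseudo-SNP."""
    p = len(cols)
    order = list(range(len(y)))
    rng.shuffle(order)  # assumed scheme: one row permutation shared by all SNPs
    pseudo = [[c[i] for i in order] for c in cols]
    design = [center(c) for c in cols + pseudo]
    yc = center(y)
    chosen = []
    for lam in sorted(lam_grid, reverse=True):
        beta = lasso_cd(design, yc, lam)
        if any(b != 0.0 for b in beta[p:]):
            break  # a pseudo-SNP entered; keep the selection from the previous lambda
        chosen = [j for j in range(p) if beta[j] != 0.0]
    return chosen


# Simulated toy data: SNPs 0 and 1 carry signal, SNPs 2 to 5 are null.
rng = random.Random(0)
n, p = 120, 6
cols = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n)] for _ in range(p)]
y = [2.0 * cols[0][i] - 2.0 * cols[1][i] + rng.gauss(0.0, 0.5) for i in range(n)]
selected = plasso_select(cols, y, [1.0, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05], rng)
print(selected)
```

The pseudo-SNPs act as negative controls: any penalty small enough to admit one of them is also small enough to admit noise among the real SNPs, which is why the sketch stops just before that point.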
Affiliation(s)
- Songshan Yang
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802
- Jiawei Wen
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802
- Scott T Eckert
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033
- Yaqun Wang
  - Department of Biostatistics, Rutgers University, New Brunswick, NJ 08901, USA
- Dajiang J Liu
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033
- Rongling Wu
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033
- Runze Li
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802
- Xiang Zhan
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033
4
Posner DC, Lin H, Meigs JB, Kolaczyk ED, Dupuis J. Convex combination sequence kernel association test for rare-variant studies. Genet Epidemiol 2020; 44:352-367. [PMID: 32100372 DOI: 10.1002/gepi.22287] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/11/2019] [Revised: 12/17/2019] [Accepted: 01/27/2020] [Indexed: 02/06/2023]
Abstract
We propose a novel variant set test for rare-variant association studies, which leverages multiple single-nucleotide variant (SNV) annotations. Our approach optimizes a convex combination of different sequence kernel association test (SKAT) statistics, where each statistic is constructed from a different annotation and combination weights are optimized through a multiple kernel learning algorithm. The combination test statistic is evaluated empirically through data splitting. In simulations, we find our method preserves type I error at $\alpha = 2.5 \times 10^{-6}$ and has greater power than SKAT(-O) when SNV weights are not misspecified and sample sizes are large ($N \geq 5{,}000$). We utilize our method in the Framingham Heart Study (FHS) to identify SNV sets associated with fasting glucose. While we are unable to detect any genome-wide significant associations between fasting glucose and 4-kb windows of rare variants ($p < 10^{-7}$) in 6,419 FHS participants, our method identifies suggestive associations between fasting glucose and rare variants near ROCK2 ($p = 2.1 \times 10^{-5}$) and within CPLX1 ($p = 5.3 \times 10^{-5}$). These two genes were previously reported to be involved in obesity-mediated insulin resistance and glucose-induced insulin secretion by pancreatic beta-cells, respectively. These findings will need to be replicated in other cohorts and validated by functional genomic studies.
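A convex combination of SKAT-type statistics can be written out in a few lines. The sketch below assumes the common quadratic form Q = r' G diag(w) G' r, where r holds null-model residuals, G is the genotype matrix, and w holds nonnegative per-variant weights derived from one annotation. The combination weights `eta` are fixed by hand here, whereas the abstract describes learning them through multiple kernel learning; all names and numbers are hypothetical.

```python
def skat_q(resid, G, w):
    """SKAT-type statistic Q = r' G diag(w) G' r for residuals r and genotypes G."""
    q = 0.0
    for j, wj in enumerate(w):
        score = sum(G[i][j] * resid[i] for i in range(len(resid)))
        q += wj * score * score
    return q


def combined_q(resid, G, weight_sets, eta):
    """Convex combination sum_m eta_m * Q_m with eta nonnegative and summing to 1."""
    if any(e < 0 for e in eta) or abs(sum(eta) - 1.0) > 1e-9:
        raise ValueError("eta must be nonnegative and sum to 1")
    return sum(e * skat_q(resid, G, w) for e, w in zip(eta, weight_sets))


# Toy inputs: 4 subjects, 3 variants, 2 annotation-based weight vectors.
resid = [0.5, -0.5, 1.0, -1.0]
G = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 2]]
weight_sets = [[1.0, 1.0, 1.0], [4.0, 0.0, 1.0]]
eta = [0.25, 0.75]

q_combo = combined_q(resid, G, weight_sets, eta)

# Because Q is linear in w, the combination equals a single statistic
# computed with the blended per-variant weights.
blended = [sum(e * w[j] for e, w in zip(eta, weight_sets)) for j in range(3)]
q_blended = skat_q(resid, G, blended)
print(q_combo, q_blended)
```

The linearity shown at the end is why a convex combination of kernels stays a valid kernel test: the combined statistic is itself a quadratic form with nonnegative weights.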
Affiliation(s)
- Daniel C Posner
  - Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts
- Honghuang Lin
  - National Heart Lung and Blood Institute's, Boston University's Framingham Heart Study, Framingham, Massachusetts
  - Section of Computational Biomedicine, Department of Medicine, Boston University School of Medicine, Boston, Massachusetts
- James B Meigs
  - Division of General Internal Medicine, Department of Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Eric D Kolaczyk
  - Department of Mathematics and Statistics, Boston University, Boston, Massachusetts
- Josée Dupuis
  - Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts
  - National Heart Lung and Blood Institute's, Boston University's Framingham Heart Study, Framingham, Massachusetts
5
Rong Y, Zhao SD, Zhu J, Yuan W, Cheng W, Li Y. More accurate semiparametric regression in pharmacogenomics. Statistics and Its Interface 2019; 11:573-580. [PMID: 30815051 DOI: 10.4310/sii.2018.v11.n4.a2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 02/05/2023]
Abstract
A key step in pharmacogenomic studies is the development of accurate prediction models for drug response based on individuals' genomic information. Recent interest has centered on semiparametric models based on kernel machine regression, which can flexibly model the complex relationships between gene expression and drug response. However, performance suffers if irrelevant covariates are unknowingly included when training the model. We propose a new semiparametric regression procedure, based on a novel penalized garrotized kernel machine (PGKM), which can better adapt to the presence of irrelevant covariates while still allowing for a complex nonlinear model and gene-gene interactions. We study the performance of our approach in simulations and in a pharmacogenomic study of the renal carcinoma drug temsirolimus. Our method predicts plasma concentration of temsirolimus as well as standard kernel machine regression when no irrelevant covariates are included in training, but has much higher prediction accuracy when the truly important covariates are not known in advance. Supplemental materials, including R code used in this manuscript, are available online.
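To make the notion of a garrotized kernel concrete, the sketch below assumes the Gaussian form K(x, z; δ) = exp(-Σ_j δ_j (x_j - z_j)²) with δ_j ≥ 0, which is a common way of writing such kernels. Whether it matches the exact parameterization in this paper is an assumption, and the function name and values are hypothetical. Under this form a covariate whose δ_j is shrunk to zero by the penalty no longer affects the kernel, and hence drops out of the fitted function.

```python
import math


def garrotized_gaussian_kernel(x, z, delta):
    """K(x, z; delta) = exp(-sum_j delta_j * (x_j - z_j)^2), with delta_j >= 0."""
    if any(d < 0 for d in delta):
        raise ValueError("garrote parameters must be nonnegative")
    return math.exp(-sum(d * (a - b) ** 2 for d, a, b in zip(delta, x, z)))


# Hypothetical garrote parameters: the third covariate has been shrunk to zero.
delta = [0.5, 1.0, 0.0]
x = [1.0, 2.0, 3.0]
z = [1.5, 2.0, -7.0]

k_xz = garrotized_gaussian_kernel(x, z, delta)

# Changing the third coordinate leaves the kernel value untouched.
z_alt = [1.5, 2.0, 100.0]
k_xz_alt = garrotized_gaussian_kernel(x, z_alt, delta)
print(k_xz, k_xz_alt)
```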
Affiliation(s)
- Yaohua Rong
  - College of Applied Sciences, Beijing University of Technology, #100 Pingleyuan, Beijing, China
- Sihai Dave Zhao
  - Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, U.S.A
- Ji Zhu
  - Department of Statistics, University of Michigan, Ann Arbor, U.S.A
- Wei Yuan
  - School of Statistics, Renmin University of China, Beijing, China
- Weihu Cheng
  - College of Applied Sciences, Beijing University of Technology, #100 Pingleyuan, Beijing, China
- Yi Li
  - West China Hospital at Chengdu, China, Department of Biostatistics, University of Michigan, Ann Arbor, U.S.A
6
Larson NB, Chen J, Schaid DJ. A review of kernel methods for genetic association studies. Genet Epidemiol 2019; 43:122-136. [PMID: 30604442 DOI: 10.1002/gepi.22180] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 08/27/2018] [Revised: 11/09/2018] [Accepted: 11/26/2018] [Indexed: 12/17/2022]
Abstract
Evaluating the association of multiple genetic variants with a trait of interest by use of kernel-based methods has made a significant impact on how genetic association analyses are conducted. An advantage of kernel methods is that they tend to be robust when the genetic variants have effects that are a mixture of positive and negative effects, as well as when there is a small fraction of causal variants. Another advantage is that kernel methods fit within the framework of mixed models, providing flexible ways to adjust for additional covariates that influence traits. Herein, we review the basic ideas behind the use of kernel methods for genetic association analysis as well as recent methodological advancements for different types of traits, multivariate traits, pedigree data, and longitudinal data. Finally, we discuss opportunities for future research.
Affiliation(s)
- Nicholas B Larson
  - Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
- Jun Chen
  - Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
- Daniel J Schaid
  - Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
7
Cologne J, Loo L, Shvetsov YB, Misumi M, Lin P, Haiman CA, Wilkens LR, Le Marchand L. Stepwise approach to SNP-set analysis illustrated with the Metabochip and colorectal cancer in Japanese Americans of the Multiethnic Cohort. BMC Genomics 2018; 19:524. [PMID: 29986644 PMCID: PMC6038257 DOI: 10.1186/s12864-018-4910-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/16/2017] [Accepted: 06/29/2018] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND Common variants have explained less than the amount of heritability expected for complex diseases, which has led to interest in less-common variants and more powerful approaches to the analysis of whole-genome scans. Because of low frequency (low statistical power), less-common variants are best analyzed using SNP-set methods such as gene-set or pathway-based analyses. However, there is as yet no clear consensus regarding how to focus in on potential risk variants following set-based analyses. We used a stepwise, telescoping approach to analyze common- and rare-variant data from the Illumina Metabochip array to assess genomic association with colorectal cancer (CRC) in the Japanese sub-population of the Multiethnic Cohort (676 cases, 7180 controls). We started with pathway analysis of SNPs that are in genes and pathways having known mechanistic roles in colorectal cancer, then focused on genes within the pathways that evidenced association with CRC, and finally assessed individual SNPs within the genes that evidenced association. Pathway SNPs downloaded from the dbSNP database were cross-matched with Metabochip SNPs and analyzed using the logistic kernel machine regression approach (logistic SNP-set kernel-machine association test, or sequence kernel association test; SKAT) and related methods. RESULTS The TGF-β and WNT pathways were associated with all CRC, and the WNT pathway was associated with colon cancer. Individual genes demonstrating the strongest associations were TGFBR2 in the TGF-β pathway and SMAD7 (which is involved in both the TGF-β and WNT pathways). As partial validation of our approach, a known CRC risk variant in SMAD7 (in both the TGF-β and WNT pathways: rs11874392) was associated with CRC risk in our data. We also detected two novel candidate CRC risk variants (rs13075948 and rs17025857) in TGFBR2, a gene known to be associated with CRC risk. 
CONCLUSIONS A stepwise, telescoping approach identified some potentially novel risk variants associated with colorectal cancer, so it may be a useful method for following up on results of set-based SNP analyses. Further work is required to assess the statistical characteristics of the approach, and additional applications should aid in better clarifying its utility.
Affiliation(s)
- John Cologne
  - Department of Statistics, Radiation Effects Research Foundation, Hiroshima, 732-0815, Japan
- Lenora Loo
  - Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
- Yurii B Shvetsov
  - Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
- Munechika Misumi
  - Department of Statistics, Radiation Effects Research Foundation, Hiroshima, 732-0815, Japan
- Philip Lin
  - Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
- Christopher A Haiman
  - Department of Preventive Medicine and Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90033, USA
- Lynne R Wilkens
  - Biostatistics and Informatics Shared Resource, University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
- Loïc Le Marchand
  - Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
8
Leveraging human genetic and adverse outcome pathway (AOP) data to inform susceptibility in human health risk assessment. Mamm Genome 2018; 29:190-204. [DOI: 10.1007/s00335-018-9738-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 10/04/2017] [Accepted: 01/31/2018] [Indexed: 12/19/2022]
9
Sun J, Oualkacha K, Forgetta V, Zheng HF, Richards JB, Evans DS, Orwoll E, Greenwood CMT. Exome-wide rare variant analyses of two bone mineral density phenotypes: the challenges of analyzing rare genetic variation. Sci Rep 2018; 8:220. [PMID: 29317680 PMCID: PMC5760616 DOI: 10.1038/s41598-017-18385-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/02/2017] [Accepted: 12/11/2017] [Indexed: 11/08/2022] Open
Abstract
Performance of a recently developed test for association between multivariate phenotypes and sets of genetic variants (MURAT) is demonstrated using measures of bone mineral density (BMD). By combining individual-level whole genome sequenced data from the UK10K study, and imputed genome-wide genetic data on individuals from the Study of Osteoporotic Fractures (SOF) and the Osteoporotic Fractures in Men Study (MrOS), a data set of 8810 individuals was assembled; tests of association were performed between autosomal gene-sets of genetic variants and BMD measured at lumbar spine and femoral neck. Distributions of p-values obtained from analyses of a single BMD phenotype are compared to those from the multivariate tests, across several region definitions and variant weightings. There is evidence of increased power with the multivariate test, although no new loci for BMD were identified. Among 17 genes highlighted either because there were significant p-values in region-based association tests or because they were in well-known BMD genes, 4 windows in 2 genes as well as 6 single SNPs in one of these genes showed association at genome-wide significant thresholds with the multivariate phenotype test but not with the single-phenotype test, Sequence Kernel Association Test (SKAT).
Affiliation(s)
- Jianping Sun
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- Karim Oualkacha
  - Département de mathématiques, Université du Québec à Montréal, Montreal, QC, Canada
- Vincenzo Forgetta
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- Hou-Feng Zheng
  - Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Westlake University, Hangzhou, Zhejiang, China
  - Institute of Aging Research and the Affiliated Hospital, School of Medicine, Hangzhou Normal University, Hangzhou, Zhejiang, China
- J Brent Richards
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
  - Department of Human Genetics, McGill University, Montreal, QC, Canada
- Daniel S Evans
  - California Pacific Medical Center Research Institute, San Francisco, CA, USA
- Eric Orwoll
  - Department of Medicine, Bone and Mineral Unit, Oregon Health and Science University, Portland, OR, USA
- Celia M T Greenwood
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
  - Department of Human Genetics, McGill University, Montreal, QC, Canada
  - Department of Oncology, McGill University, Montreal, QC, Canada
10
Powerful Genetic Association Analysis for Common or Rare Variants with High-Dimensional Structured Traits. Genetics 2017. [PMID: 28642271 DOI: 10.1534/genetics.116.199646] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Indexed: 01/03/2023] Open
Abstract
Many genetic association studies collect a wide range of complex traits. As these traits may be correlated and share a common genetic mechanism, joint analysis can be statistically more powerful and biologically more meaningful. However, most existing tests for multiple traits cannot be used for high-dimensional and possibly structured traits, such as network-structured transcriptomic pathway expressions. To overcome potential limitations, in this article we propose the dual kernel-based association test (DKAT) for testing the association between multiple traits and multiple genetic variants, both common and rare. In DKAT, two individual kernels are used to describe the phenotypic and genotypic similarity, respectively, between pairwise subjects. Using kernels allows for capturing structure while accommodating dimensionality. Then, the association between traits and genetic variants is summarized by a coefficient which measures the association between two kernel matrices. Finally, DKAT evaluates the hypothesis of nonassociation with an analytical P-value calculation without any computationally expensive resampling procedures. By collapsing information in both traits and genetic variants using kernels, the proposed DKAT is shown to have a correct type-I error rate and higher power than other existing methods in both simulation studies and application to a study of genetic regulation of pathway gene expressions.
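A coefficient that measures association between two kernel matrices can be sketched as a normalized trace of the product of the centered kernels (the RV coefficient). Treating this as the form of the DKAT coefficient is an assumption, since the abstract does not spell the statistic out. The linear kernels, function names, and toy data below are hypothetical, and the analytical P-value calculation mentioned in the abstract is not reproduced.

```python
import math


def linear_kernel(X):
    """Pairwise similarity K[i][j] = <x_i, x_j> between subjects (rows of X)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def center_kernel(K):
    """Double-center K, i.e. compute H K H with H = I - (1/n) * ones."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + grand for j in range(n)] for i in range(n)]


def trace_product(A, B):
    """tr(A B) for symmetric matrices, computed as the elementwise inner product."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def kernel_association(K, L):
    """Normalized association tr(Kc Lc) / sqrt(tr(Kc Kc) * tr(Lc Lc))."""
    Kc, Lc = center_kernel(K), center_kernel(L)
    return trace_product(Kc, Lc) / math.sqrt(trace_product(Kc, Kc) * trace_product(Lc, Lc))


# Toy data: 4 subjects with 2 traits and 3 variants (illustrative values only).
traits = [[1.0, 0.2], [0.8, 0.1], [-1.1, 0.4], [-0.9, -0.3]]
genotypes = [[2, 0, 1], [1, 0, 1], [0, 1, 0], [0, 2, 0]]

K_trait = linear_kernel(traits)
K_geno = linear_kernel(genotypes)
coef = kernel_association(K_trait, K_geno)
print(coef)
```

Because both inputs are n-by-n similarity matrices, the same function applies no matter how many traits or variants were used to build each kernel, which is the dimensionality argument the abstract makes.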