1
Wendel B, Heidenreich M, Budde M, Heilbronner M, Oraki Kohshour M, Papiol S, Falkai P, Schulze TG, Heilbronner U, Bickeböller H. Kalpra: A kernel approach for longitudinal pathway regression analysis integrating network information with an application to the longitudinal PsyCourse Study. Front Genet 2022; 13:1015885. [PMID: 36561312 PMCID: PMC9767414 DOI: 10.3389/fgene.2022.1015885] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2022] [Accepted: 11/24/2022] [Indexed: 12/12/2022] Open
Abstract
A popular approach to reduce the high dimensionality resulting from genome-wide association studies is to analyze a whole pathway in a single test for association with a phenotype. Kernel machine regression (KMR) is a highly flexible pathway analysis approach. Initially, KMR was developed to analyze a simple phenotype with just one measurement per individual. Recently, however, the investigation into the influence of genomic factors in the development of disease-related phenotypes across time (trajectories) has gained in importance. Thus, novel statistical approaches for KMR are required that can analyze longitudinal data, i.e., several measurements at specific time points per individual. For longitudinal pathway analysis, we extend KMR to long-KMR using the estimation equivalence of KMR and linear mixed models. We include additional random effects to correct for the dependence structure. Moreover, within long-KMR we created a topology-based pathway analysis by combining this approach with a kernel including network information of the pathway. Most importantly, long-KMR not only allows for the investigation of the main genetic effect adjusting for time dependencies within an individual, but also allows testing for the association of the pathway with the longitudinal course of the phenotype by testing the genetic time-interaction effect. The approach is implemented as an R package, kalpra. Our simulation study demonstrates that the power of long-KMR exceeded that of another KMR method previously developed to analyze longitudinal data, while maintaining (slightly conservatively) the type I error. The network kernel improved the performance of long-KMR compared to the linear kernel. Considering different pathway densities, the power of the network kernel decreased with increasing pathway density. We applied long-KMR to cognitive data on executive function (Trail Making Test, part B) from the PsyCourse Study and 17 candidate pathways selected from Reactome. We identified seven nominally significant pathways.
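As a rough illustration of the linear kernel mentioned in the abstract, the sketch below builds a linear kernel from a toy genotype matrix and computes the standard cross-sectional variance-component score statistic Q = r'Kr, where r are residuals under a null model. It is not the long-KMR procedure and not the kalpra API: the data, the intercept-only null model, and all function names are assumptions made for illustration only.

```python
# Minimal sketch with assumed toy data and hypothetical helper names.
# It shows a linear kernel and the cross-sectional score statistic only;
# it does not implement long-KMR or any function of the kalpra package.

def linear_kernel(G):
    """Return K with K[i][k] = sum_j G[i][j] * G[k][j] for an n x p matrix G."""
    n = len(G)
    return [[sum(a * b for a, b in zip(G[i], G[k])) for k in range(n)]
            for i in range(n)]

def null_residuals(y):
    """Residuals under an intercept-only null model (no covariates assumed)."""
    mean = sum(y) / len(y)
    return [v - mean for v in y]

def score_statistic(K, r):
    """Quadratic form Q = r' K r."""
    n = len(r)
    return sum(r[i] * K[i][k] * r[k] for i in range(n) for k in range(n))

# Toy data: 4 individuals, 3 SNPs coded as minor-allele counts (assumed values).
G = [[0, 1, 2],
     [1, 0, 0],
     [2, 1, 0],
     [0, 0, 1]]
y = [1.2, 0.4, 2.1, 0.3]

K = linear_kernel(G)
r = null_residuals(y)
Q = score_statistic(K, r)

# With an unweighted linear kernel, Q equals the sum of squared per-SNP scores.
per_snp = [sum(G[i][j] * r[i] for i in range(len(y))) for j in range(len(G[0]))]
Q_check = sum(s * s for s in per_snp)
```

The identity between `Q` and `Q_check` is why a linear kernel treats SNPs as contributing additively; a network kernel would instead weight pairs of SNPs or genes by pathway topology, which this sketch does not attempt.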
Affiliation(s)
- Bernadette Wendel
  - Department of Genetic Epidemiology, University Medical Center Göttingen, Georg-August-University Göttingen, Göttingen, Germany (Correspondence: Bernadette Wendel)
- Markus Heidenreich
  - Department of Genetic Epidemiology, University Medical Center Göttingen, Georg-August-University Göttingen, Göttingen, Germany
- Monika Budde
  - Institute of Psychiatric Phenomics and Genomics (IPPG), University Hospital, LMU Munich, Munich, Germany
- Maria Heilbronner
  - Institute of Psychiatric Phenomics and Genomics (IPPG), University Hospital, LMU Munich, Munich, Germany
- Mojtaba Oraki Kohshour
  - Institute of Psychiatric Phenomics and Genomics (IPPG), University Hospital, LMU Munich, Munich, Germany
- Sergi Papiol
  - Institute of Psychiatric Phenomics and Genomics (IPPG), University Hospital, LMU Munich, Munich, Germany
  - Department of Psychiatry and Psychotherapy, University Hospital, LMU Munich, Munich, Germany
- Peter Falkai
  - Department of Psychiatry and Psychotherapy, University Hospital, LMU Munich, Munich, Germany
- Thomas G. Schulze
  - Institute of Psychiatric Phenomics and Genomics (IPPG), University Hospital, LMU Munich, Munich, Germany
  - Department of Psychiatry and Behavioral Sciences, SUNY Upstate Medical University, Syracuse, NY, United States
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, United States
- Urs Heilbronner
  - Institute of Psychiatric Phenomics and Genomics (IPPG), University Hospital, LMU Munich, Munich, Germany
- Heike Bickeböller
  - Department of Genetic Epidemiology, University Medical Center Göttingen, Georg-August-University Göttingen, Göttingen, Germany
2
Sugasawa S, Noma H. Efficient testing and effect size estimation for set-based genetic association inference via semiparametric multilevel mixture modeling. Biom J 2022; 64:1142-1152. [PMID: 35543501 DOI: 10.1002/bimj.202100234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2021] [Revised: 01/25/2022] [Accepted: 04/03/2022] [Indexed: 11/10/2022]
Abstract
In genetic association studies, rare variants with extremely low allele frequencies play a crucial role in complex traits. Therefore, set-based testing methods that jointly assess the effects of groups of single nucleotide polymorphisms (SNPs) were developed to increase the power of association tests. However, this power is still insufficient, and precise estimation of the effect sizes of individual SNPs is largely impossible. In this article, we provide an efficient set-based statistical inference framework that addresses both of these important issues simultaneously using an empirical Bayes method with semiparametric multilevel mixture modeling. We propose to utilize a hierarchical model that incorporates variations in set-specific effects and to apply the optimal discovery procedure (ODP), which achieves the largest overall power in multiple significance testing. In addition, we provide an optimal "set-based" estimator of the empirical distribution of effect sizes. The efficiency of the proposed methods is demonstrated through application to a genome-wide association study of coronary artery disease and through simulation studies. The results demonstrated numerous rare variants with large effect sizes for coronary artery disease, and the number of significant sets detected by the ODP was much greater than the number identified by existing methods.
Affiliation(s)
- Shonosuke Sugasawa
  - Center for Spatial Information Science, The University of Tokyo, Chiba, Japan
  - Research Center for Medical and Health Data Science, The Institute of Statistical Mathematics, Tokyo, Japan
- Hisashi Noma
  - Research Center for Medical and Health Data Science, The Institute of Statistical Mathematics, Tokyo, Japan
  - Department of Data Science, The Institute of Statistical Mathematics, Tokyo, Japan
3
Zhao Y, Sun L. On set‐based association tests: Insights from a regression using summary statistics. Can J Stat 2020. [DOI: 10.1002/cjs.11584] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/09/2022]
Affiliation(s)
- Yanyan Zhao
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario M5S 3G3, Canada
- Lei Sun
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario M5S 3G3, Canada
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario M5T 3M7, Canada
4
Fore R, Boehme J, Li K, Westra J, Tintle N. Multi-Set Testing Strategies Show Good Behavior When Applied to Very Large Sets of Rare Variants. Front Genet 2020; 11:591606. [PMID: 33240333 PMCID: PMC7680887 DOI: 10.3389/fgene.2020.591606] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/05/2020] [Accepted: 10/05/2020] [Indexed: 12/22/2022] Open
Abstract
Gene-based tests of association (e.g., variance components and burden tests) are now common practice for analyses attempting to elucidate the contribution of rare genetic variants to common disease. As sequencing datasets continue to grow in size, the number of variants within each set (e.g., gene) being tested is also continuing to grow. Pathway-based methods have been used to allow for the initial aggregation of gene-based statistical evidence and then the subsequent aggregation of evidence across the pathway. This “multi-set” approach (first gene-based test, followed by pathway-based) lacks thorough exploration in regard to evaluating genotype–phenotype associations in the age of large, sequenced datasets. In particular, we ask whether there are statistical and biological characteristics that make the multi-set approach optimal vs. simply doing all gene-based tests. In this paper, we provide an intuitive framework for evaluating these questions and use simulated data to affirm this intuition. A real data application is provided demonstrating how our insights manifest themselves in practice. Ultimately, we find that when initial subsets are biologically informative (e.g., tending to aggregate causal genetic variants within one or more subsets, often genes), multi-set strategies can improve statistical power, with particular gains in cases where causal variants are aggregated in subsets with fewer variants overall (high proportion of causal variants in the subset). However, we find that there is little advantage when the sets are non-informative (similar proportion of causal variants in the subsets). Our application to real data further demonstrates this intuition. In practice, we recommend wider use of pathway-based methods and further exploration of optimal ways of aggregating variants into subsets based on emerging biological evidence of the genetic architecture of complex disease.
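The two-stage "multi-set" idea described in the abstract (gene-level tests first, then aggregation across the pathway) can be sketched as follows. The abstract does not say how evidence is aggregated, so Fisher's combination method is used here purely as an assumed stand-in; the gene names and p-values are hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the upper tail has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    term, total = 1.0, 0.0
    for i in range(k):
        total += term              # adds (X/2)^i / i!
        term *= half_x / (i + 1)
    return math.exp(-half_x) * total

# Stage 1 (assumed done elsewhere): one p-value per gene in the pathway.
gene_p = {"GENE_A": 0.04, "GENE_B": 0.20, "GENE_C": 0.01}  # hypothetical values

# Stage 2: aggregate the gene-level evidence into one pathway-level p-value.
pathway_p = fisher_combined_p(list(gene_p.values()))
```

Fisher's method assumes the gene-level p-values are independent, which may not hold for genes in the same pathway, so this is a sketch of the aggregation step rather than a recommended analysis.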
Affiliation(s)
- Ruby Fore
  - Department of Biostatistics, Brown University, Providence, RI, United States
- Jaden Boehme
  - Department of Mathematics, Oregon State University, Corvallis, OR, United States
- Kevin Li
  - Department of Mathematics, School of Arts and Sciences, Columbia University, New York, NY, United States
- Jason Westra
  - Department of Mathematics and Statistics, Dordt University, Sioux Center, IA, United States
- Nathan Tintle
  - Department of Mathematics and Statistics, Dordt University, Sioux Center, IA, United States
5
Wen Y, Lu Q. An optimal kernel-based multivariate U-statistic to test for associations with multiple phenotypes. Biostatistics 2020; 23:705-720. [PMID: 33108446 DOI: 10.1093/biostatistics/kxaa049] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/10/2020] [Revised: 09/24/2020] [Accepted: 10/03/2020] [Indexed: 11/13/2022] Open
Abstract
Set-based analysis that jointly considers multiple predictors in a group has been broadly conducted for association tests. However, the power of such tests can be sensitive to the distribution of phenotypes and to the underlying relationships between predictors and outcomes. Moreover, most set-based methods are designed for single-trait analysis, making it hard to explore pleiotropic effects and borrow information when multiple phenotypes are available. Here, we propose a kernel-based multivariate U-statistic (KMU) that is robust and powerful in testing the association between a set of predictors and multiple outcomes. We employed a rank-based kernel function for the outcomes, which makes our method robust to various outcome distributions. Rather than selecting a single kernel, our test statistic is built on multiple kernels selected in a data-driven manner, and is thus capable of capturing various complex relationships between predictors and outcomes. The asymptotic properties of our test statistic have been developed. Through simulations, we have demonstrated that KMU has controlled type I error and higher power than its counterparts. We further showed its practical utility by analyzing whole-genome sequencing data from the Alzheimer's Disease Neuroimaging Initiative study, where novel genes were detected to be associated with imaging phenotypes.
Affiliation(s)
- Y Wen
  - Department of Statistics, University of Auckland, Auckland, New Zealand
- Qing Lu
  - Department of Biostatistics, College of Public Health, University of Florida, Gainesville, FL, USA
6
He T, Li S, Zhong PS, Cui Y. An optimal kernel-based U-statistic method for quantitative gene-set association analysis. Genet Epidemiol 2018; 43:137-149. [DOI: 10.1002/gepi.22170] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/04/2018] [Revised: 08/19/2018] [Accepted: 09/26/2018] [Indexed: 11/09/2022]
Affiliation(s)
- Tao He
  - Department of Mathematics, San Francisco State University, San Francisco, California
- Shaoyu Li
  - Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, North Carolina
- Ping-Shou Zhong
  - Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, Illinois
- Yuehua Cui
  - Department of Statistics & Probability, Michigan State University, East Lansing, Michigan
  - School of Public Health, Zhengzhou University, Zhengzhou, China
7
Cologne J, Loo L, Shvetsov YB, Misumi M, Lin P, Haiman CA, Wilkens LR, Le Marchand L. Stepwise approach to SNP-set analysis illustrated with the Metabochip and colorectal cancer in Japanese Americans of the Multiethnic Cohort. BMC Genomics 2018; 19:524. [PMID: 29986644 PMCID: PMC6038257 DOI: 10.1186/s12864-018-4910-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/16/2017] [Accepted: 06/29/2018] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND Common variants have explained less than the amount of heritability expected for complex diseases, which has led to interest in less-common variants and more powerful approaches to the analysis of whole-genome scans. Because of low frequency (low statistical power), less-common variants are best analyzed using SNP-set methods such as gene-set or pathway-based analyses. However, there is as yet no clear consensus regarding how to focus in on potential risk variants following set-based analyses. We used a stepwise, telescoping approach to analyze common- and rare-variant data from the Illumina Metabochip array to assess genomic association with colorectal cancer (CRC) in the Japanese sub-population of the Multiethnic Cohort (676 cases, 7180 controls). We started with pathway analysis of SNPs that are in genes and pathways having known mechanistic roles in colorectal cancer, then focused on genes within the pathways that evidenced association with CRC, and finally assessed individual SNPs within the genes that evidenced association. Pathway SNPs downloaded from the dbSNP database were cross-matched with Metabochip SNPs and analyzed using the logistic kernel machine regression approach (logistic SNP-set kernel-machine association test, or sequence kernel association test; SKAT) and related methods.
RESULTS The TGF-β and WNT pathways were associated with all CRC, and the WNT pathway was associated with colon cancer. Individual genes demonstrating the strongest associations were TGFBR2 in the TGF-β pathway and SMAD7 (which is involved in both the TGF-β and WNT pathways). As partial validation of our approach, a known CRC risk variant in SMAD7 (in both the TGF-β and WNT pathways: rs11874392) was associated with CRC risk in our data. We also detected two novel candidate CRC risk variants (rs13075948 and rs17025857) in TGFBR2, a gene known to be associated with CRC risk.
CONCLUSIONS A stepwise, telescoping approach identified some potentially novel risk variants associated with colorectal cancer, so it may be a useful method for following up on results of set-based SNP analyses. Further work is required to assess the statistical characteristics of the approach, and additional applications should aid in better clarifying its utility.
Affiliation(s)
- John Cologne
  - Department of Statistics, Radiation Effects Research Foundation, Hiroshima, 732-0815, Japan
- Lenora Loo
  - Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
- Yurii B Shvetsov
  - Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
- Munechika Misumi
  - Department of Statistics, Radiation Effects Research Foundation, Hiroshima, 732-0815, Japan
- Philip Lin
  - Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
- Christopher A Haiman
  - Department of Preventive Medicine and Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90033, USA
- Lynne R Wilkens
  - Biostatistics and Informatics Shared Resource, University of Hawaii Cancer Center, Honolulu, HI, 96813, USA
- Loïc Le Marchand
  - Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, 96813, USA