1. Bass AJ, Cutler DJ, Epstein MP. A powerful framework for differential co-expression analysis of general risk factors. bioRxiv 2024:2024.11.29.626006. [PMID: 39677786 PMCID: PMC11642831 DOI: 10.1101/2024.11.29.626006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2024]
Abstract
Differential co-expression analysis (DCA) aims to identify genes in a pathway whose shared expression depends on a risk factor. While DCA provides insights into the biological activity of diseases, existing methods are limited to categorical risk factors and/or suffer from bias due to batch and variance-specific effects. We propose a new framework, Kernel-based Differential Co-expression Analysis (KDCA), that harnesses correlation patterns between genes in a pathway to detect differential co-expression arising from general (i.e., continuous, discrete, or categorical) risk factors. Using various simulated pathway architectures, we find that KDCA accounts for common sources of bias to control the type I error rate while substantially increasing the power compared to the standard eigengene approach. We then applied KDCA to The Cancer Genome Atlas thyroid data set and found several differentially co-expressed pathways by age of diagnosis and BRAF mutation status that were undetected by the eigengene method. Collectively, our results demonstrate that KDCA is a powerful testing framework that expands DCA applications in expression studies.
Affiliation(s)
- Andrew J. Bass
  - Department of Medicine, University of Cambridge, Cambridge, CB2 0QQ, UK
- David J. Cutler
  - Department of Medicine, University of Cambridge, Cambridge, CB2 0QQ, UK
2. He M, Zhao N. A Mixed Effect Similarity Matrix Regression Model (SMRmix) for Integrating Multiple Microbiome Datasets at Community Level. bioRxiv 2024:2024.03.10.584315. [PMID: 38559012 PMCID: PMC10979838 DOI: 10.1101/2024.03.10.584315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
BACKGROUND: Recent studies have highlighted the importance of the human microbiota in health and disease. However, in many areas of research, individual microbiome studies often offer inconsistent results due to limited sample sizes and heterogeneity in study populations and experimental procedures. This inconsistency underscores the necessity for integrative analysis of multiple microbiome datasets. Despite the critical need, statistical methods that incorporate multiple microbiome datasets and account for study heterogeneity are not available in the literature.
METHODS: In this paper, we develop a mixed effect similarity matrix regression (SMRmix) approach for identifying community-level microbiome shifts between outcomes. SMRmix has a close connection with the microbiome kernel association test, one of the most popular approaches for such a task, which is only applicable to a single study. SMRmix enables researchers to consolidate findings from diverse microbiome studies.
RESULTS: Via extensive simulations, we show that SMRmix has well-controlled type I error and higher power than some potential competitors. We applied SMRmix to two real-world datasets. The first, from the HIV-reanalysis consortium, integrated data from 17 studies on gut dysbiosis in HIV. Our analysis confirmed consistent associations between the gut microbiome and HIV infection as well as MSM (men who have sex with men) status, demonstrating greater power than competing methods. The second dataset involved 11 studies on the gut microbiome in colorectal cancer; analysis with SMRmix confirmed significant dysbiosis in affected individuals compared to healthy controls.
CONCLUSION: SMRmix enables the integration of multiple studies while effectively managing study heterogeneity, and provides a powerful tool for uncovering consistent associations between diseases and community-level microbiome data.
3. Xu P, Liang S, Hahn A, Zhao V, Lo WT, Haller BC, Sobkowiak B, Chitwood MH, Colijn C, Cohen T, Rhee KY, Messer PW, Wells MT, Clark AG, Kim J. e3SIM: epidemiological-ecological-evolutionary simulation framework for genomic epidemiology. bioRxiv 2024:2024.06.29.601123. [PMID: 39005464 PMCID: PMC11244936 DOI: 10.1101/2024.06.29.601123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Infectious disease dynamics are driven by the complex interplay of epidemiological, ecological, and evolutionary processes. Accurately modeling these interactions is crucial for understanding pathogen spread and informing public health strategies. However, existing simulators often fail to capture the dynamic interplay between these processes, resulting in oversimplified models that do not fully reflect real-world complexities in which the pathogen's genetic evolution dynamically influences disease transmission. We introduce the epidemiological-ecological-evolutionary simulator (e3SIM), an open-source framework that concurrently models the transmission dynamics and molecular evolution of pathogens within a host population while integrating environmental factors. Using an agent-based, discrete-generation, forward-in-time approach, e3SIM incorporates compartmental models, host-population contact networks, and quantitative-trait models for pathogens. This integration allows for realistic simulations of disease spread and pathogen evolution. Key features include a modular and scalable design, flexibility in modeling various epidemiological and population-genetic complexities, incorporation of time-varying environmental factors, and a user-friendly graphical interface. We demonstrate e3SIM's capabilities through simulations of realistic outbreak scenarios with SARS-CoV-2 and Mycobacterium tuberculosis, illustrating its flexibility for studying the genomic epidemiology of diverse pathogen types.
Affiliation(s)
- Peiyu Xu
  - Department of Molecular Biology & Genetics, Cornell University, Ithaca, NY, USA
- Shenni Liang
  - Department of Computational Science, Cornell University, Ithaca, NY, USA
- Andrew Hahn
  - Department of Computational Science, Cornell University, Ithaca, NY, USA
- Vivian Zhao
  - Department of Computational Science, Cornell University, Ithaca, NY, USA
- Wai Tung ‘Jack’ Lo
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Benjamin C. Haller
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Benjamin Sobkowiak
  - Department of Epidemiology of Microbial Disease, Yale School of Public Health, New Haven, CT, USA
- Melanie H. Chitwood
  - Department of Epidemiology of Microbial Disease, Yale School of Public Health, New Haven, CT, USA
- Caroline Colijn
  - Department of Mathematics, Simon Fraser University, Burnaby, BC, Canada
- Ted Cohen
  - Department of Epidemiology of Microbial Disease, Yale School of Public Health, New Haven, CT, USA
- Kyu Y. Rhee
  - Department of Medicine, Weill Cornell Medicine, New York, NY, USA
- Philipp W. Messer
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Martin T. Wells
  - Department of Statistics and Data Science, Cornell University, Ithaca, NY, USA
- Andrew G. Clark
  - Department of Molecular Biology & Genetics, Cornell University, Ithaca, NY, USA
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
- Jaehee Kim
  - Department of Computational Biology, Cornell University, Ithaca, NY, USA
4. Cao R, Schladt DP, Dorr C, Matas AJ, Oetting WS, Jacobson PA, Israni A, Chen J, Guan W. Polygenic risk score for acute rejection based on donor-recipient non-HLA genotype mismatch. PLoS One 2024; 19:e0303446. [PMID: 38820342 PMCID: PMC11142483 DOI: 10.1371/journal.pone.0303446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Accepted: 04/24/2024] [Indexed: 06/02/2024] [Open Access]
Abstract
BACKGROUND: Acute rejection (AR) after kidney transplantation is an important allograft complication. To reduce the risk of post-transplant AR, determination of kidney transplant donor-recipient mismatching focuses on blood type and human leukocyte antigens (HLA), while it remains unclear whether non-HLA genetic mismatching is related to post-transplant complications.
METHODS: We carried out a genome-wide scan (HLA and non-HLA regions) on AR with a large kidney transplant cohort of 784 living donor-recipient pairs of European ancestry. An AR polygenic risk score (PRS) was constructed with the non-HLA single nucleotide polymorphisms (SNPs) filtered by independence (r² < 0.2) and P-value (< 1×10⁻³) criteria. The PRS was validated in an independent cohort of 352 living donor-recipient pairs.
RESULTS: The genome-wide scan identified one significant SNP, rs6749137 (HR = 2.49, P = 2.15×10⁻⁸). A total of 1,307 non-HLA SNPs passed clumping plus thresholding, and the resulting PRS was significantly associated with AR in the validation cohort (HR = 1.54, 95% CI = (1.07, 2.22), P = 0.019). Further pathway analysis attributed the PRS genes into 13 categories, and the over-representation test identified 42 significant biological processes, the most significant of which was cell morphogenesis (GO:0000902), with a 4.08-fold enrichment relative to the Homo sapiens reference (FDR-adjusted P = 8.6×10⁻⁴).
CONCLUSIONS: Our results show the importance of donor-recipient mismatching in non-HLA regions. Additional work will be needed to understand the role of SNPs included in the PRS and to further improve donor-recipient genetic matching algorithms.
Trial registry: Deterioration of Kidney Allograft Function Genomics (NCT00270712) and Genomics of Kidney Transplantation (NCT01714440) are registered on ClinicalTrials.gov.
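The clumping-plus-thresholding construction mentioned in this abstract can be illustrated with a small sketch. This is not the authors' pipeline: the function names (`r_squared`, `clump_and_threshold`, `polygenic_score`), the greedy ordering by P-value, the per-SNP weights, and the toy data are all assumptions made for illustration. Only the two cutoffs (r² < 0.2 and P < 1×10⁻³) come from the abstract.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two per-sample SNP vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation information
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def clump_and_threshold(genotypes, pvalues, p_max=1e-3, r2_max=0.2):
    """Greedy sketch: keep SNPs with P < p_max, most significant first,
    dropping any SNP whose r^2 with an already kept SNP is >= r2_max."""
    candidates = sorted(
        (snp for snp in pvalues if pvalues[snp] < p_max),
        key=lambda snp: pvalues[snp],
    )
    kept = []
    for snp in candidates:
        if all(r_squared(genotypes[snp], genotypes[k]) < r2_max for k in kept):
            kept.append(snp)
    return kept


def polygenic_score(sample_values, weights):
    """Weighted sum of one sample's per-SNP values over the selected SNPs."""
    return sum(weights[snp] * sample_values[snp] for snp in weights)


# Hypothetical toy data: four SNPs measured on six samples.
genotypes = {
    "snpA": [0, 1, 2, 1, 0, 2],
    "snpB": [0, 1, 2, 1, 0, 2],  # identical to snpA, so r^2 = 1
    "snpC": [2, 0, 1, 1, 0, 2],  # weakly correlated with snpA
    "snpD": [0, 0, 1, 1, 2, 2],
}
pvalues = {"snpA": 1e-5, "snpB": 5e-4, "snpC": 2e-4, "snpD": 0.03}

selected = clump_and_threshold(genotypes, pvalues)
score = polygenic_score({"snpA": 2, "snpC": 1}, {"snpA": 0.5, "snpC": -0.2})
```

In this toy run `snpD` fails the P-value cutoff and `snpB` is dropped because it duplicates the more significant `snpA`, so only `snpA` and `snpC` contribute to the score. In a real analysis the weights would come from the association model and the per-sample values from the study's own mismatch coding, neither of which is specified in the abstract.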
Affiliation(s)
- Rui Cao
  - Division of Biostatistics and Health Data Science, School of Public Health, University of Minnesota, Minneapolis, Minnesota, United States of America
- David P. Schladt
  - Hennepin Healthcare Research Institute, Minneapolis, Minnesota, United States of America
- Casey Dorr
  - Hennepin Healthcare Research Institute, Minneapolis, Minnesota, United States of America
  - Department of Medicine, University of Minnesota Medical School, Minneapolis, Minnesota, United States of America
- Arthur J. Matas
  - Department of Surgery, University of Minnesota Medical School, Minneapolis, Minnesota, United States of America
- William S. Oetting
  - Department of Experimental and Clinical Pharmacology, College of Pharmacy, University of Minnesota, Minneapolis, Minnesota, United States of America
- Pamala A. Jacobson
  - Department of Experimental and Clinical Pharmacology, College of Pharmacy, University of Minnesota, Minneapolis, Minnesota, United States of America
- Ajay Israni
  - Hennepin Healthcare Research Institute, Minneapolis, Minnesota, United States of America
  - Department of Medicine, University of Minnesota Medical School, Minneapolis, Minnesota, United States of America
- Jinbo Chen
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Weihua Guan
  - Division of Biostatistics and Health Data Science, School of Public Health, University of Minnesota, Minneapolis, Minnesota, United States of America
5. Bass AJ, Bian S, Wingo AP, Wingo TS, Cutler DJ, Epstein MP. Identifying latent genetic interactions in genome-wide association studies using multiple traits. Genome Med 2024; 16:62. [PMID: 38664839 PMCID: PMC11044415 DOI: 10.1186/s13073-024-01329-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Accepted: 04/02/2024] [Indexed: 04/28/2024] [Open Access]
Abstract
The "missing" heritability of complex traits may be partly explained by genetic variants interacting with other genes or environments that are difficult to specify, observe, and detect. We propose a new kernel-based method called Latent Interaction Testing (LIT) to screen for genetic interactions that leverages pleiotropy from multiple related traits without requiring the interacting variable to be specified or observed. Using simulated data, we demonstrate that LIT increases power to detect latent genetic interactions compared to univariate methods. We then apply LIT to obesity-related traits in the UK Biobank and detect variants with interactive effects near known obesity-related genes. LIT is available as an R package (https://CRAN.R-project.org/package=lit).
Affiliation(s)
- Andrew J Bass
  - Department of Human Genetics, Emory University, Atlanta, GA, 30322, USA
- Shijia Bian
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, 30322, USA
- Aliza P Wingo
  - Department of Psychiatry, Emory University, Atlanta, GA, 30322, USA
- Thomas S Wingo
  - Department of Human Genetics, Emory University, Atlanta, GA, 30322, USA
  - Department of Neurology, Emory University, Atlanta, GA, 30322, USA
- David J Cutler
  - Department of Human Genetics, Emory University, Atlanta, GA, 30322, USA
- Michael P Epstein
  - Department of Human Genetics, Emory University, Atlanta, GA, 30322, USA
6. Das Adhikari S, Cui Y, Wang J. BayesKAT: Bayesian optimal kernel-based test for genetic association studies reveals joint genetic effects in complex diseases. Brief Bioinform 2024; 25:bbae182. [PMID: 38653490 PMCID: PMC11036342 DOI: 10.1093/bib/bbae182] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Revised: 03/10/2024] [Accepted: 04/05/2024] [Indexed: 04/25/2024] [Open Access]
Abstract
Genome-wide association study (GWAS) methods have identified individual single-nucleotide polymorphisms (SNPs) significantly associated with specific phenotypes. Nonetheless, many complex diseases are polygenic and are controlled by multiple genetic variants that are usually non-linearly dependent. These genetic variants have weak marginal effects and remain undetected in GWAS analysis. Kernel-based tests (KBT), which evaluate the joint effect of a group of genetic variants, are therefore critical for complex disease analysis. However, the choice of kernel function in KBT can significantly influence the type I error control and power, and selecting the optimal kernel remains a statistically challenging task. A few existing methods suffer from inflated type I errors, limited scalability, inferior power or ambiguous conclusions. Here, we present a new Bayesian framework, BayesKAT (https://github.com/wangjr03/BayesKAT), which overcomes these kernel specification issues by selecting the optimal composite kernel adaptively from the data while testing genetic associations simultaneously. Furthermore, BayesKAT implements a scalable computational strategy to boost its applicability, especially for high-dimensional cases where other methods become less effective. Based on a series of performance comparisons using both simulated and real large-scale genetics data, BayesKAT outperforms the available methods in detecting complex group-level associations and controlling type I errors simultaneously. Applied to a variety of groups of functionally related genetic variants based on biological pathways, co-expression gene modules and protein complexes, BayesKAT deciphers the complex genetic basis and provides mechanistic insights into human diseases.
Affiliation(s)
- Sikta Das Adhikari
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
- Jianrong Wang
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
7. Chen H, Naseri A, Zhi D. FiMAP: A fast identity-by-descent mapping test for biobank-scale cohorts. PLoS Genet 2023; 19:e1011057. [PMID: 38039339 PMCID: PMC10718418 DOI: 10.1371/journal.pgen.1011057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Revised: 12/13/2023] [Accepted: 11/07/2023] [Indexed: 12/03/2023] [Open Access]
Abstract
Although genome-wide association studies (GWAS) have identified tens of thousands of genetic loci, the genetic architecture is still not fully understood for many complex traits. Most GWAS and sequencing association studies have focused on single nucleotide polymorphisms or copy number variations, including common and rare genetic variants. However, phased haplotype information is often ignored in GWAS or variant set tests for rare variants. Here we leverage the identity-by-descent (IBD) segments inferred from a random projection-based IBD detection algorithm in the mapping of genetic associations with complex traits, to develop a computationally efficient statistical test for IBD mapping in biobank-scale cohorts. We used sparse linear algebra and random matrix algorithms to speed up the computation, and a genome-wide IBD mapping scan of more than 400,000 samples finished within a few hours. Simulation studies showed that our new method had well-controlled type I error rates under the null hypothesis of no genetic association in large biobank-scale cohorts, and outperformed traditional GWAS single-variant tests when the causal variants were untyped and rare, or in the presence of haplotype effects. We also applied our method to IBD mapping of six anthropometric traits using the UK Biobank data and identified a total of 3,442 associations, 2,131 (62%) of which remained significant after conditioning on suggestive tag variants in the ± 3 centimorgan flanking regions from GWAS.
Affiliation(s)
- Han Chen
  - Human Genetics Center, Department of Epidemiology, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- Ardalan Naseri
  - Center for Artificial Intelligence and Genome Informatics, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- Degui Zhi
  - Center for Artificial Intelligence and Genome Informatics, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
8. Das Adhikari S, Cui Y, Wang J. BayesKAT: Bayesian Optimal Kernel-based Test for genetic association studies reveals joint genetic effects in complex diseases. bioRxiv 2023:2023.10.18.562824. [PMID: 37905124 PMCID: PMC10614916 DOI: 10.1101/2023.10.18.562824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2023]
Abstract
GWAS methods have identified individual SNPs significantly associated with specific phenotypes. Nonetheless, many complex diseases are polygenic and are controlled by multiple genetic variants that are usually non-linearly dependent. These genetic variants have weak marginal effects and remain undetected in GWAS analysis. Kernel-based tests (KBT), which evaluate the joint effect of a group of genetic variants, are therefore critical for complex disease analysis. However, the choice of kernel function in KBT can significantly influence the type I error control and power, and selecting the optimal kernel remains a statistically challenging task. A few existing methods suffer from inflated type I errors, limited scalability, inferior power, or ambiguous conclusions. Here, we present a new Bayesian framework, BayesKAT (https://github.com/wangjr03/BayesKAT), which overcomes these kernel specification issues by selecting the optimal composite kernel adaptively from the data while testing genetic associations simultaneously. Furthermore, BayesKAT implements a scalable computational strategy to boost its applicability, especially for high-dimensional cases where other methods become less effective. Based on a series of performance comparisons using both simulated and real large-scale genetics data, BayesKAT outperforms the available methods in detecting complex group-level associations and controlling type I errors simultaneously. Applied to a variety of groups of functionally related genetic variants based on biological pathways, co-expression gene modules, and protein complexes, BayesKAT deciphers the complex genetic basis and provides mechanistic insights into human diseases.
9. Bass AJ, Bian S, Wingo AP, Wingo TS, Cutler DJ, Epstein MP. Identifying latent genetic interactions in genome-wide association studies using multiple traits. bioRxiv 2023:2023.09.11.557155. [PMID: 37745553 PMCID: PMC10515795 DOI: 10.1101/2023.09.11.557155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
Genome-wide association studies of complex traits frequently find that SNP-based estimates of heritability are considerably smaller than estimates from classic family-based studies. This 'missing' heritability may be partly explained by genetic variants interacting with other genes or environments that are difficult to specify, observe, and detect. To circumvent these challenges, we propose a new method to detect genetic interactions that leverages pleiotropy from multiple related traits without requiring the interacting variable to be specified or observed. Our approach, Latent Interaction Testing (LIT), uses the observation that correlated traits with shared latent genetic interactions have trait variance and covariance patterns that differ by genotype. LIT examines the relationship between trait variance/covariance patterns and genotype using a flexible kernel-based framework that is computationally scalable for biobank-sized datasets with a large number of traits. We first use simulated data to demonstrate that LIT substantially increases power to detect latent genetic interactions compared to a trait-by-trait univariate method. We then apply LIT to four obesity-related traits in the UK Biobank and detect genetic variants with interactive effects near known obesity-related genes. Overall, we show that LIT, implemented in the R package lit, uses shared information across traits to improve detection of latent genetic interactions compared to standard approaches.
Affiliation(s)
- Andrew J. Bass
  - Department of Human Genetics, Emory University, Atlanta, GA 30322, USA
- Shijia Bian
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA
- Aliza P. Wingo
  - Department of Psychiatry, Emory University, Atlanta, GA 30322, USA
- Thomas S. Wingo
  - Department of Human Genetics, Emory University, Atlanta, GA 30322, USA
  - Department of Neurology, Emory University, Atlanta, GA 30322, USA
- David J. Cutler
  - Department of Human Genetics, Emory University, Atlanta, GA 30322, USA
10. Wang J, Zhou F, Li C, Yin N, Liu H, Zhuang B, Huang Q, Wen Y. Gene Association Analysis of Quantitative Trait Based on Functional Linear Regression Model with Local Sparse Estimator. Genes (Basel) 2023; 14:834. [PMID: 37107592 PMCID: PMC10137544 DOI: 10.3390/genes14040834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2023] [Revised: 03/27/2023] [Accepted: 03/28/2023] [Indexed: 04/03/2023] [Open Access]
Abstract
Functional linear regression models have been widely used in the gene association analysis of complex traits. These models retain all the genetic information in the data and take full advantage of spatial information in genetic variation data, which leads to high detection power. However, not all of the significant association signals identified by these high-power methods are real causal SNPs, because noise is easily mistaken for significant association signals, leading to false associations. In this paper, a sparse functional data association test (SFDAT) for gene region association analysis is developed based on a functional linear regression model with local sparse estimation. The evaluation indicators CSR and DL are defined to evaluate the feasibility and performance of the proposed method alongside other indicators. Simulation studies show that: (1) SFDAT performs well under both linkage equilibrium and linkage disequilibrium simulation; (2) SFDAT performs successfully for gene regions (including common variants, low-frequency variants, rare variants and mixed variants); (3) with power and type I error rates comparable to OLS and Smooth, SFDAT has a better ability to handle the zero regions. The Oryza sativa data set is analyzed by SFDAT. It is shown that SFDAT can better perform gene association analysis and eliminate false positives in gene localization. This study showed that SFDAT can lower the interference caused by noise while maintaining high power. SFDAT provides a new method for the association analysis between gene regions and phenotypic quantitative traits.
Affiliation(s)
- Jingyu Wang
  - College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  - Institute of Statistics and Application, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Fujie Zhou
  - College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  - Institute of Statistics and Application, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Cheng Li
  - College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  - Institute of Statistics and Application, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Ning Yin
  - College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  - Institute of Statistics and Application, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Huiming Liu
  - College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  - Institute of Statistics and Application, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Binxian Zhuang
  - College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  - Institute of Statistics and Application, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Qingyu Huang
  - College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  - Institute of Statistics and Application, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Yongxian Wen
  - College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  - Institute of Statistics and Application, Fujian Agriculture and Forestry University, Fuzhou 350002, China
11. Misawa K. Genotype Value Decomposition: Simple Methods for the Computation of Kernel Statistics. Advanced Genetics (Hoboken, N.J.) 2022; 3:2100066. [PMID: 36620199 PMCID: PMC9744480 DOI: 10.1002/ggn2.202100066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2021] [Indexed: 01/11/2023]
Abstract
Recent advances in sequencing technologies enable genome-wide analyses for thousands of individuals. The sequence kernel association test (SKAT) is a widely used method to test for associations between a phenotype and a set of rare variants. As the sample size of human genetics studies increases, the computational time required to calculate a kernel is becoming more and more problematic. In this study, a new method to obtain kernel statistics without calculating a kernel matrix is proposed. Simple methods are given for the computation of two kernel statistics, namely, one based on a genetic relationship matrix (GRM) and one based on an identity by state (IBS) matrix. By using this method, the kernel statistics can be calculated with vector operations rather than matrix operations. The proposed method enables one to conduct SKAT for large samples in human genetics.
Affiliation(s)
- Kazuharu Misawa
  - Department of Human Genetics, Yokohama City University Graduate School of Medicine, 3-9 Fukuura, Kanazawa-ku, Yokohama 236-0004, Japan
12. Shao Z, Wang T, Qiao J, Zhang Y, Huang S, Zeng P. A comprehensive comparison of multilocus association methods with summary statistics in genome-wide association studies. BMC Bioinformatics 2022; 23:359. [PMID: 36042399 PMCID: PMC9429742 DOI: 10.1186/s12859-022-04897-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 05/27/2022] [Accepted: 08/22/2022] [Indexed: 02/07/2023] [Open Access]
Abstract
BACKGROUND: Multilocus analysis on a set of single nucleotide polymorphisms (SNPs) pre-assigned within a gene constitutes a valuable complement to single-marker analysis by aggregating data on complex traits in a biologically meaningful way. However, despite the existence of a wide variety of SNP-set methods, few comprehensive comparison studies have been previously performed to evaluate the effectiveness of these methods.
RESULTS: We herein sought to fill this knowledge gap by conducting a comprehensive empirical comparison of 22 commonly used summary-statistics-based SNP-set methods. We showed that only seven methods could effectively control the type I error, and that these well-calibrated approaches had varying power performance under the simulation scenarios. Overall, we confirmed that the burden test was generally underpowered and that score-based variance component tests (e.g., sequence kernel association test) were much more powerful under the polygenic genetic architecture in both common and rare variant association analyses. We further revealed that two linkage-disequilibrium-free P value combination methods (the harmonic mean P value method and the aggregated Cauchy association test) behaved very well under the sparse genetic architecture in simulations and real-data applications to common and rare variant association analyses as well as in expression quantitative trait loci weighted integrative analysis. We also assessed the scalability of these approaches by recording computational time and found that all these methods can scale to biobank-scale data, although some might be relatively slow.
CONCLUSION: In conclusion, we hope that our findings can offer important guidance on how to choose appropriate multilocus association analysis methods in the post-GWAS era. All the SNP-set methods are implemented in the R package called MCA, which is freely available at https://github.com/biostatpzeng/.
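The two P value combination rules named in this abstract have simple closed forms, sketched below in their unweighted versions. This is an illustration rather than the MCA package's implementation: the function names `harmonic_mean_p` and `acat` and the example inputs are assumptions, equal weights are assumed throughout, and the raw harmonic mean shown here is only an approximate combined P value (the published method applies further calibration). Both functions assume every input lies strictly between 0 and 1.

```python
from math import atan, pi, tan


def harmonic_mean_p(pvalues):
    """Unweighted harmonic mean of the input P values."""
    return len(pvalues) / sum(1.0 / p for p in pvalues)


def acat(pvalues):
    """Equal-weight Cauchy combination: map each P value to a Cauchy
    variate, average them, and map the average back to a P value."""
    t = sum(tan((0.5 - p) * pi) for p in pvalues) / len(pvalues)
    return 0.5 - atan(t) / pi


# Hypothetical per-SNP P values for one gene.
example = [0.04, 0.20, 0.75]
combined_hmp = harmonic_mean_p(example)
combined_acat = acat(example)
```

Neither rule takes a correlation or linkage disequilibrium matrix as input, which is the sense in which the abstract calls them linkage-disequilibrium-free. With a single input, `acat` returns that P value unchanged, which is a convenient sanity check.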
Affiliation(s)
- Zhonghe Shao
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Ting Wang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Jiahao Qiao
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Yuchen Zhang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Shuiping Huang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Center for Medical Statistics and Data Analysis, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Key Laboratory of Human Genetics and Environmental Medicine, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Key Laboratory of Environment and Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Engineering Research Innovation Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Ping Zeng
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Center for Medical Statistics and Data Analysis, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Key Laboratory of Human Genetics and Environmental Medicine, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Key Laboratory of Environment and Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Engineering Research Innovation Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
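The two linkage-disequilibrium-free P value combination rules named in the abstract of this entry can be sketched in a few lines. This is a minimal illustration and not code from the MCA package: the function names `acat` and `harmonic_mean_p` are hypothetical, equal weights are assumed by default, and the harmonic mean returned here is the raw statistic rather than a calibrated P value.

```python
import math


def _normalised_weights(n, weights):
    """Return weights that sum to 1 (equal weights if none are given)."""
    if weights is None:
        return [1.0 / n] * n
    total = sum(weights)
    return [w / total for w in weights]


def acat(p_values, weights=None):
    """Cauchy combination: map each P value to a Cauchy variate and average."""
    if not all(0.0 < p < 1.0 for p in p_values):
        raise ValueError("P values must lie strictly between 0 and 1")
    w = _normalised_weights(len(p_values), weights)
    t = sum(wi * math.tan((0.5 - p) * math.pi) for wi, p in zip(w, p_values))
    # Upper tail of the standard Cauchy distribution evaluated at t.
    return 0.5 - math.atan(t) / math.pi


def harmonic_mean_p(p_values, weights=None):
    """Weighted harmonic mean of the P values (raw, uncalibrated statistic)."""
    if not all(0.0 < p <= 1.0 for p in p_values):
        raise ValueError("P values must lie in (0, 1]")
    w = _normalised_weights(len(p_values), weights)
    return 1.0 / sum(wi / p for wi, p in zip(w, p_values))
```

Neither rule needs the correlation between SNPs as input, which is what "linkage-disequilibrium-free" refers to. When all inputs are equal, the Cauchy combination returns that common value.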
13
Li S, Li S, Su S, Zhang H, Shen J, Wen Y. Gene Region Association Analysis of Longitudinal Quantitative Traits Based on a Function-On-Function Regression Model. Front Genet 2022; 13:781740. [PMID: 35265102 PMCID: PMC8899465 DOI: 10.3389/fgene.2022.781740] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2021] [Accepted: 01/04/2022] [Indexed: 11/13/2022]
Abstract
During growth and development, the expression of genes that control quantitative traits turns on or off over time. Studies of longitudinal traits are of great significance in revealing the genetic mechanism of biological development. With the development of ultra-high-density sequencing technology, association analysis poses tremendous challenges for statistical methods. In this paper, a longitudinal functional data association test (LFDAT) method is proposed based on the function-on-function regression model. LFDAT can simultaneously treat phenotypic traits and marker information as continuum variables and analyze the association between longitudinal quantitative traits and gene regions. Simulation studies showed that: 1) LFDAT performs well in both linkage equilibrium and linkage disequilibrium simulations, 2) LFDAT has better performance for gene regions (including common variants, low-frequency variants, rare variants, and mixtures), and 3) LFDAT can accurately identify gene switching in the growth and development stage. Longitudinal data on the projected shoot area of Oryza sativa were analyzed with LFDAT, which showed the advantage of fast computation. Further, an association analysis was conducted between longitudinal traits and gene regions by integrating the micro effects of multiple related variants and using the information of the entire gene region. LFDAT provides a feasible method for studying the formation and expression of longitudinal traits.
Affiliation(s)
- Shijing Li
  - College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou, China
  - Institute of Statistics and Application, Fujian Agriculture and Forestry University, Fuzhou, China
- Shiqin Li
  - School of Life Science and Technology, ShanghaiTech University, Shanghai, China
- Shaoqiang Su
  - College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou, China
- Hui Zhang
  - College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou, China
  - Institute of Statistics and Application, Fujian Agriculture and Forestry University, Fuzhou, China
- Jiayu Shen
  - College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou, China
  - Institute of Statistics and Application, Fujian Agriculture and Forestry University, Fuzhou, China
- Yongxian Wen
  - College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou, China
  - Institute of Statistics and Application, Fujian Agriculture and Forestry University, Fuzhou, China
14
Rudra P, Baxter R, Hsieh EWY, Ghosh D. Compositional Data Analysis using Kernels in mass cytometry data. BIOINFORMATICS ADVANCES 2022; 2:vbac003. [PMID: 35224501 PMCID: PMC8867823 DOI: 10.1093/bioadv/vbac003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/23/2021] [Revised: 12/06/2021] [Accepted: 01/12/2022] [Indexed: 01/27/2023]
Abstract
MOTIVATION Cell-type abundance data arising from mass cytometry experiments are compositional in nature. Classical association tests do not apply to the compositional data due to their non-Euclidean nature. Existing methods for analysis of cell type abundance data suffer from several limitations for high-dimensional mass cytometry data, especially when the sample size is small. RESULTS We proposed a new multivariate statistical learning methodology, Compositional Data Analysis using Kernels (CODAK), based on the kernel distance covariance (KDC) framework to test the association of the cell type compositions with important predictors (categorical or continuous) such as disease status. CODAK scales well for high-dimensional data and provides satisfactory performance for small sample sizes (n < 25). We conducted simulation studies to compare the performance of the method with existing methods of analyzing cell type abundance data from mass cytometry studies. The method is also applied to a high-dimensional dataset containing different subgroups of populations including Systemic Lupus Erythematosus (SLE) patients and healthy control subjects. AVAILABILITY AND IMPLEMENTATION CODAK is implemented using R. The codes and the data used in this manuscript are available on the web at http://github.com/GhoshLab/CODAK/. CONTACT prudra@okstate.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Pratyaydipta Rudra
  - Department of Statistics, Oklahoma State University, Stillwater, OK 74078, USA
- Ryan Baxter
  - Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Elena W Y Hsieh
  - Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
  - Department of Pediatrics, Section of Allergy and Immunology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Debashis Ghosh
  - Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
15
Qu J, Cui Y. Gene set analysis with graph-embedded kernel association test. Bioinformatics 2021; 38:1560-1567. [PMID: 34935928 PMCID: PMC8896609 DOI: 10.1093/bioinformatics/btab851] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2021] [Revised: 11/20/2021] [Accepted: 12/16/2021] [Indexed: 02/03/2023]
Abstract
MOTIVATION The kernel-based association test (KAT) has been a popular approach to evaluate the association of expressions of a gene set (e.g. pathway) with a phenotypic trait. KATs rely on kernel functions, which capture the sample similarity across multiple features, to capture potential linear or non-linear relationships among features in a gene set. When calculating the kernel functions, no network graphical information about the features is considered. Because genes in a functional group (e.g. a pathway) are generally not independent owing to regulatory interactions, incorporating regulatory network (or graph) information can potentially increase the power of KAT. In this work, we propose a graph-embedded kernel association test, termed gKAT. gKAT incorporates prior pathway knowledge into hypothesis testing when constructing the kernel function. RESULTS We apply a diffusion kernel to capture any graph structures in a gene set, then incorporate such information to build a kernel function for subsequent association testing. We illustrate the geometric meaning of the approach. Through extensive simulation studies, we show that the proposed gKAT algorithm can improve testing power compared to the one without considering graph structures. Application to a real dataset further demonstrates the utility of the method. AVAILABILITY AND IMPLEMENTATION The R code used for the analysis can be accessed at https://github.com/JialinQu/gKAT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jialin Qu
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
- Yuehua Cui
  - To whom correspondence should be addressed.
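The abstract of this entry mentions applying a diffusion kernel to capture graph structure in a gene set. A minimal sketch of the standard graph diffusion kernel, K = exp(-beta * L) with L the graph Laplacian, is given here. It is not the gKAT code: the helper names, the three-gene path graph, and the value of `beta` are hypothetical, and the way gKAT combines this kernel with expression data is not shown. The matrix exponential is approximated with a scaled Taylor series followed by repeated squaring, which is adequate for small graphs.

```python
def laplacian(n_nodes, edges):
    """Unweighted graph Laplacian L = D - A for an undirected graph."""
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
        lap[i][i] += 1.0
        lap[j][j] += 1.0
    return lap


def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(lap, beta, squarings=10, terms=12):
    """Approximate exp(-beta * lap) by Taylor series plus repeated squaring."""
    n = len(lap)
    scale = beta / (2 ** squarings)
    small = [[-scale * lap[i][j] for j in range(n)] for i in range(n)]
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]
    for m in range(1, terms + 1):
        # term holds small^m / m! after this update.
        term = [[x / m for x in row] for row in matmul(term, small)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        kernel = matmul(kernel, kernel)
    return kernel


# Hypothetical pathway: three genes connected in a path 0 - 1 - 2.
K = diffusion_kernel(laplacian(3, [(0, 1), (1, 2)]), beta=0.5)
```

In the resulting matrix, genes that are adjacent in the graph receive a larger similarity than genes two steps apart, and larger `beta` spreads similarity further along the graph.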
16
Arthur VL, Li Z, Cao R, Oetting WS, Israni AK, Jacobson PA, Ritchie MD, Guan W, Chen J. A Multi-Marker Test for Analyzing Paired Genetic Data in Transplantation. Front Genet 2021; 12:745773. [PMID: 34721531 PMCID: PMC8548646 DOI: 10.3389/fgene.2021.745773] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/22/2021] [Accepted: 09/23/2021] [Indexed: 12/02/2022]
Abstract
Emerging evidence suggests that donor/recipient matching in non-HLA (human leukocyte antigen) regions of the genome may impact transplant outcomes and recognizing these matching effects may increase the power of transplant genetics studies. Most available matching scores account for either single-nucleotide polymorphism (SNP) matching only or sum these SNP matching scores across multiple gene-coding regions, which makes it challenging to interpret the association findings. We propose a multi-marker Joint Score Test (JST) to jointly test for association between recipient genotype SNP effects and a gene-based matching score with transplant outcomes. This method utilizes Eigen decomposition as a dimension reduction technique to potentially increase statistical power by decreasing the degrees of freedom for the test. In addition, JST allows for the matching effect and the recipient genotype effect to follow different biological mechanisms, which is not the case for other multi-marker methods. Extensive simulation studies show that JST is competitive when compared with existing methods, such as the sequence kernel association test (SKAT), especially under scenarios where associated SNPs are in low linkage disequilibrium with non-associated SNPs or in gene regions containing a large number of SNPs. Applying the method to paired donor/recipient genetic data from kidney transplant studies yields various gene regions that are potentially associated with incidence of acute rejection after transplant.
Affiliation(s)
- Victoria L. Arthur
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, United States
- Zhengbang Li
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, United States
  - Departments of Statistics, Central China Normal University, Wuhan, China
- Rui Cao
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, United States
- William S. Oetting
  - Department of Experimental and Clinical Pharmacology, College of Pharmacy, University of Minnesota, Minneapolis, MN, United States
- Ajay K. Israni
  - Minneapolis Medical Research Foundation, Minneapolis, MN, United States
  - Department of Medicine, Hennepin County Medical Center, Minneapolis, MN, United States
  - Department of Epidemiology and Community Health, University of Minnesota, Minneapolis, MN, United States
- Pamala A. Jacobson
  - Department of Experimental and Clinical Pharmacology, College of Pharmacy, University of Minnesota, Minneapolis, MN, United States
- Marylyn D. Ritchie
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Weihua Guan
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, United States
- Jinbo Chen
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, United States
17
Shao Z, Wang T, Zhang M, Jiang Z, Huang S, Zeng P. IUSMMT: Survival mediation analysis of gene expression with multiple DNA methylation exposures and its application to cancers of TCGA. PLoS Comput Biol 2021; 17:e1009250. [PMID: 34464378 PMCID: PMC8437300 DOI: 10.1371/journal.pcbi.1009250] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/13/2021] [Revised: 09/13/2021] [Accepted: 07/06/2021] [Indexed: 02/07/2023]
Abstract
Effective and powerful survival mediation models are currently lacking. To partly fill this knowledge gap, we focus on mediation analysis that includes multiple DNA methylations acting as exposures, one gene expression as the mediator and one survival time as the outcome. We proposed IUSMMT (intersection-union survival mixture-adjusted mediation test) to effectively examine the existence of a mediation effect by fitting an empirical three-component mixture null distribution. With extensive simulation studies, we demonstrated the advantage of IUSMMT over existing methods. We applied IUSMMT to ten TCGA cancers and identified multiple genes that exhibited mediating effects. We further revealed that most of the identified regions, in which genes behaved as active mediators, were cancer type-specific and exhibited a full mediation from DNA methylation CpG sites to the survival risk of various types of cancers. Overall, IUSMMT represents an effective and powerful alternative for survival mediation analysis; our results also provide new insights into the functional role of DNA methylation and gene expression in cancer progression/prognosis and demonstrate potential therapeutic targets for future clinical practice.
Affiliation(s)
- Zhonghe Shao
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Ting Wang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Meng Zhang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Zhou Jiang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Shuiping Huang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, China
  - Center for Medical Statistics and Data Analysis, Xuzhou Medical University, Xuzhou, Jiangsu, China
  - Key Laboratory of Human Genetics and Environmental Medicine, Xuzhou Medical University, Xuzhou, Jiangsu, China
- Ping Zeng
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, China
  - Center for Medical Statistics and Data Analysis, Xuzhou Medical University, Xuzhou, Jiangsu, China
  - Key Laboratory of Human Genetics and Environmental Medicine, Xuzhou Medical University, Xuzhou, Jiangsu, China
18
Gong M, Liu P, Sciurba FC, Stojanov P, Tao D, Tseng GC, Zhang K, Batmanghelich K. Unpaired data empowers association tests. Bioinformatics 2021; 37:785-792. [PMID: 33070196 PMCID: PMC8098021 DOI: 10.1093/bioinformatics/btaa886] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2020] [Revised: 09/07/2020] [Accepted: 10/05/2020] [Indexed: 11/25/2022]
Abstract
Motivation There is growing interest in the biomedical research community to incorporate retrospective data, available in healthcare systems, to shed light on associations between different biomarkers. Understanding the association between various types of biomedical data, such as genetic, blood biomarkers, imaging, etc. can provide a holistic understanding of human diseases. To formally test a hypothesized association between two types of data in Electronic Health Records (EHRs), one requires a substantial sample size with both data modalities to achieve reasonable power. Current association test methods only allow using data from individuals who have both data modalities. Hence, researchers cannot take advantage of much larger EHR samples that include individuals with at least one of the data types, which limits the power of the association test. Results We present a new method called the Semi-paired Association Test (SAT) that makes use of both paired and unpaired data. In contrast to classical approaches, incorporating unpaired data allows SAT to produce better control of false discovery and to improve the power of the association test. We study the properties of the new test theoretically and empirically, through a series of simulations and by applying our method to real studies in the context of Chronic Obstructive Pulmonary Disease. We are able to identify an association between the high-dimensional characterization of Computed Tomography chest images and several blood biomarkers as well as the expression of dozens of genes involved in the immune system. Availability and implementation Code is available at https://github.com/batmanlab/Semi-paired-Association-Test. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mingming Gong
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
  - Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 15213, USA
  - School of Mathematics and Statistics, The University of Melbourne, Melbourne, VIC 3010, Australia
- Peng Liu
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
- Frank C Sciurba
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
- Petar Stojanov
  - Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Dacheng Tao
  - Australia School of Computer Science, The University of Sydney, Sydney, NSW 2006, Australia
- George C Tseng
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
- Kun Zhang
  - Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Kayhan Batmanghelich
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
19
Zeng P, Shao Z, Zhou X. Statistical methods for mediation analysis in the era of high-throughput genomics: Current successes and future challenges. Comput Struct Biotechnol J 2021; 19:3209-3224. [PMID: 34141140 PMCID: PMC8187160 DOI: 10.1016/j.csbj.2021.05.042] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 02/10/2021] [Revised: 05/21/2021] [Accepted: 05/21/2021] [Indexed: 12/12/2022]
Abstract
Mediation analysis investigates the intermediate mechanism through which an exposure exerts its influence on the outcome of interest. Mediation analysis is becoming increasingly popular in high-throughput genomics studies where a common goal is to identify molecular-level traits, such as gene expression or methylation, which actively mediate the genetic or environmental effects on the outcome. Mediation analysis in genomics studies is particularly challenging, however, owing to the large number of potential mediators measured in these studies as well as the composite null nature of the mediation effect hypothesis. Indeed, while the standard univariate and multivariate mediation methods have been well-established for analyzing one or multiple mediators, they are not well-suited for genomics studies with a large number of mediators and often yield conservative p-values and limited power. Consequently, over the past few years many new high-dimensional mediation methods have been developed for analyzing the large number of potential mediators collected in high-throughput genomics studies. In this work, we present a thorough review of these important recent methodological advances in high-dimensional mediation analysis. Specifically, we describe in detail more than ten high-dimensional mediation methods, focusing on their motivations, basic modeling ideas, specific modeling assumptions, practical successes, methodological limitations, as well as future directions. We hope our review will serve as useful guidance for statisticians and computational biologists who develop methods of high-dimensional mediation analysis as well as for analysts who apply mediation methods to high-throughput genomics studies.
Affiliation(s)
- Ping Zeng
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
  - Center for Medical Statistics and Data Analysis, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
- Zhonghe Shao
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
- Xiang Zhou
  - Department of Biostatistics, University of Michigan, Ann Arbor 48109, MI, USA
  - Center for Statistical Genetics, University of Michigan, Ann Arbor 48109, MI, USA
20
Family-based gene-environment interaction using sequence kernel association test (FGE-SKAT) for complex quantitative traits. Sci Rep 2021; 11:7431. [PMID: 33795796 PMCID: PMC8016937 DOI: 10.1038/s41598-021-86871-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2020] [Accepted: 03/22/2021] [Indexed: 11/30/2022]
Abstract
After the genome-wide association studies (GWAS) era, whole-genome sequencing is widely used to identify associations between complex traits and rare variants. A score-based variance-component test has been proposed to identify common and rare genetic variants associated with complex traits while quickly adjusting for covariates. Such a kernel score statistic allows for familial dependencies and adjusts for random confounding effects. However, the etiology of complex traits may involve the effects of genetic and environmental factors and the complex interactions between genes and the environment. Therefore, in this research, a novel method is proposed to detect gene and gene-environment interactions in a complex family-based association study with various correlated structures. We also developed an R function for the Fast Gene-Environment Sequence Kernel Association Test (FGE-SKAT), which is freely available as supplementary material for easy GWAS implementation to unveil such family-based joint effects. Simulation studies confirmed the validity of the new strategy and its superior statistical power. The FGE-SKAT was applied to the whole genome sequence data provided by Genetic Analysis Workshop 18 (GAW18) and discovered concordant and discordant regions compared with methods that do not consider gene-by-environment interactions.
21
Tang S, Buchman AS, De Jager PL, Bennett DA, Epstein MP, Yang J. Novel Variance-Component TWAS method for studying complex human diseases with applications to Alzheimer's dementia. PLoS Genet 2021; 17:e1009482. [PMID: 33798195 PMCID: PMC8046351 DOI: 10.1371/journal.pgen.1009482] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 06/26/2020] [Revised: 04/14/2021] [Accepted: 03/15/2021] [Indexed: 02/07/2023]
Abstract
Transcriptome-wide association studies (TWAS) have been widely used to integrate transcriptomic and genetic data to study complex human diseases. Within a test dataset lacking transcriptomic data, traditional two-stage TWAS methods first impute gene expression by creating a weighted sum that aggregates SNPs with their corresponding cis-eQTL effects on reference transcriptome. Traditional TWAS methods then employ a linear regression model to assess the association between imputed gene expression and test phenotype, thereby assuming the effect of a cis-eQTL SNP on test phenotype is a linear function of the eQTL's estimated effect on reference transcriptome. To increase TWAS robustness to this assumption, we propose a novel Variance-Component TWAS procedure (VC-TWAS) that assumes the effects of cis-eQTL SNPs on phenotype are random (with variance proportional to corresponding reference cis-eQTL effects) rather than fixed. VC-TWAS is applicable to both continuous and dichotomous phenotypes, as well as individual-level and summary-level GWAS data. Using simulated data, we show VC-TWAS is more powerful than traditional TWAS methods based on a two-stage Burden test, especially when eQTL genetic effects on test phenotype are no longer a linear function of their eQTL genetic effects on reference transcriptome. We further applied VC-TWAS to both individual-level (N = ~3.4K) and summary-level (N = ~54K) GWAS data to study Alzheimer's dementia (AD). With the individual-level data, we detected 13 significant risk genes including 6 known GWAS risk genes such as TOMM40 that were missed by traditional TWAS methods. With the summary-level data, we detected 57 significant risk genes considering only cis-SNPs and 71 significant genes considering both cis- and trans- SNPs, which also validated our findings with the individual-level GWAS data. Our VC-TWAS method is implemented in the TIGAR tool for public use.
Affiliation(s)
- Shizhen Tang
  - Center for Computational and Quantitative Genetics, Department of Human Genetics, Emory University School of Medicine, Atlanta, Georgia, United States of America
  - Department of Biostatistics and Bioinformatics, Emory University School of Public Health, Atlanta, Georgia, United States of America
- Aron S. Buchman
  - Rush Alzheimer’s Disease Center, Rush University Medical Center, Chicago, Illinois, United States of America
- Philip L. De Jager
  - Center for Translational and Computational Neuroimmunology, Department of Neurology and Taub Institute for Research on Alzheimer’s Disease and the Aging Brain, Columbia University Irving Medical Center, New York, New York, United States of America
- David A. Bennett
  - Rush Alzheimer’s Disease Center, Rush University Medical Center, Chicago, Illinois, United States of America
- Michael P. Epstein
  - Center for Computational and Quantitative Genetics, Department of Human Genetics, Emory University School of Medicine, Atlanta, Georgia, United States of America
- Jingjing Yang
  - Center for Computational and Quantitative Genetics, Department of Human Genetics, Emory University School of Medicine, Atlanta, Georgia, United States of America
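The contrast drawn in the abstract of this entry, between a burden-type TWAS statistic and a variance-component statistic, can be illustrated with per-SNP score statistics. The sketch below is not the TIGAR implementation: the function names and toy data are hypothetical, SKAT-style weighting by the squared eQTL weight is an assumption, and no P value is computed. With score S_j for SNP j and eQTL weight w_j, the burden form is (sum_j w_j S_j)^2 and the variance-component form is sum_j (w_j S_j)^2.

```python
def score_statistics(genotypes, residuals):
    """Per-SNP score S_j = sum over individuals of genotype * residual."""
    n_snps = len(genotypes[0])
    return [sum(row[j] * r for row, r in zip(genotypes, residuals))
            for j in range(n_snps)]


def burden_statistic(scores, weights):
    """Square of the weighted sum: signed contributions can cancel."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2


def variance_component_statistic(scores, weights):
    """Sum of squared weighted scores: contributions never cancel."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))


# Toy data (hypothetical): 4 individuals, 2 SNPs, phenotype residuals
# chosen so the two SNPs act in opposite directions on the phenotype.
genotypes = [[1, 0], [0, 1], [1, 0], [0, 1]]
residuals = [1.0, -1.0, 1.0, -1.0]
eqtl_weights = [1.0, 1.0]

scores = score_statistics(genotypes, residuals)
burden = burden_statistic(scores, eqtl_weights)
vc = variance_component_statistic(scores, eqtl_weights)
```

In this toy case the two eQTL weights share a sign while the phenotype effects do not, so the burden statistic is zero and the variance-component statistic is positive. This is the kind of departure from the linear assumption that the abstract describes.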
22
Lu H, Zhang J, Jiang Z, Zhang M, Wang T, Zhao H, Zeng P. Detection of Genetic Overlap Between Rheumatoid Arthritis and Systemic Lupus Erythematosus Using GWAS Summary Statistics. Front Genet 2021; 12:656545. [PMID: 33815486 PMCID: PMC8012913 DOI: 10.3389/fgene.2021.656545] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/22/2021] [Accepted: 03/01/2021] [Indexed: 01/04/2023]
Abstract
Background Clinical and epidemiological studies have suggested that systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA) are comorbidities and that common genetic etiologies can partly explain such coexistence. However, the shared genetic determinants underlying the two diseases remain largely unknown. Methods Our analysis relied on summary statistics available from genome-wide association studies of SLE (N = 23,210) and RA (N = 58,284). We first evaluated the genetic correlation between RA and SLE through linkage disequilibrium score regression (LDSC). Then, we performed a multiple-tissue eQTL (expression quantitative trait loci) weighted integrative analysis for each of the two diseases and aggregated association evidence across these tissues via the recently proposed harmonic mean P-value (HMP) combination strategy, which can produce a single well-calibrated P-value for correlated test statistics. Afterwards, we conducted the pleiotropy-informed association using conjunction conditional FDR (ccFDR) to identify potential pleiotropic genes associated with both RA and SLE. Results We found a significant positive genetic correlation (rg = 0.404, P = 6.01E-10) between RA and SLE via LDSC. Based on the multiple-tissue eQTL weighted integrative analysis and the HMP combination across various tissues, we discovered 14 potential pleiotropic genes by ccFDR, among which four were likely novel genes (i.e., INPP5B, OR5K2, RP11-2C24.5, and CTD-3105H18.4). The SNP effect sizes of these pleiotropic genes were typically positively dependent, with an average correlation of 0.579. Functionally, these genes were implicated in multiple autoimmune-relevant pathways such as inositol phosphate metabolic process, membrane and glucagon signaling pathway. Conclusion This study reveals common genetic components between RA and SLE and provides candidate associated loci for understanding the molecular mechanisms underlying the comorbidity of the two diseases.
Affiliation(s)
- Haojie Lu
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, China
- Jinhui Zhang
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, China
- Zhou Jiang
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, China
- Meng Zhang
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, China
- Ting Wang
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, China
  - Center for Medical Statistics and Data Analysis, School of Public Health, Xuzhou Medical University, Xuzhou, China
- Huashuo Zhao
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, China
  - Center for Medical Statistics and Data Analysis, School of Public Health, Xuzhou Medical University, Xuzhou, China
- Ping Zeng
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, China
  - Center for Medical Statistics and Data Analysis, School of Public Health, Xuzhou Medical University, Xuzhou, China
| |
|
23
|
Deng Y, Wu S, Fan H. Genome-wide pathway-based quantitative multiple phenotypes analysis. PLoS One 2020; 15:e0240910. [PMID: 33175855 PMCID: PMC7657528 DOI: 10.1371/journal.pone.0240910] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2020] [Accepted: 10/06/2020] [Indexed: 11/18/2022]
Abstract
For complex diseases, genome-wide pathway association studies have become increasingly promising. Currently, however, pathway-based association analyses mainly focus on a single phenotype, which may be insufficient to describe complex diseases and physiological processes. This work proposes a combination model to evaluate the association between a pathway and multiple phenotypes and to reduce the run time based on asymptotic results. For a single phenotype, we propose a semi-supervised maximum kernel-based U-statistics (mSKU) method for pathway-based association analysis. For multiple phenotypes, we propose the Fisher combination function with dependent phenotypes (FC) to transform the p-values between the pathway and each marginal phenotype individually to achieve pathway-based multiple-phenotype analysis. With real data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study and the Human Liver Cohort (HLC) study, the FC-mSKU method allows us to specify which pathways are specific to a single phenotype or contribute to common genetic constructions of multiple phenotypes. If we focus only on single-phenotype tests, we may miss some findings for etiology studies. Through extensive simulation studies, the FC-mSKU method demonstrates its advantages compared with its counterparts.
Affiliation(s)
- Yamin Deng
- Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China.,Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
| | - Shiman Wu
- Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China
| | - Huifang Fan
- Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China
| |
|
24
|
Arthur VL, Guan W, Loza BL, Keating B, Chen J. Joint testing of donor and recipient genetic matching scores and recipient genotype has robust power for finding genes associated with transplant outcomes. Genet Epidemiol 2020; 44:893-907. [PMID: 32783273 PMCID: PMC7658035 DOI: 10.1002/gepi.22349] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 03/27/2020] [Revised: 07/09/2020] [Accepted: 07/31/2020] [Indexed: 01/05/2023]
Abstract
Genetic matching between transplant donor and recipient pairs has traditionally focused on the human leukocyte antigen (HLA) regions of the genome, but recent studies suggest that matching for non-HLA regions may be important as well. We assess four genetic matching scores for use in association analyses of transplant outcomes. These scores describe genetic ancestry distance using identity-by-state, or genetic incompatibility or mismatch of the two genomes and therefore may reflect different underlying biological mechanisms for donor and recipient genes to influence transplant outcomes. Our simulation studies show that jointly testing these scores with the recipient genotype is a powerful method for preliminary screening and discovery of transplant outcome related single nucleotide polymorphisms (SNPs) and gene regions. Following these joint tests with marginal testing of the recipient genotype and matching score separately can lead to further understanding of the biological mechanisms behind transplant outcomes. In addition, we present results of a liver transplant data analysis that shows joint testing can detect SNPs significantly associated with acute rejection in liver transplant.
Affiliation(s)
- Victoria L Arthur
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Weihua Guan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, USA
| | - Bao-li Loza
- Department of Surgery, Hospital of the University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Brendan Keating
- Department of Surgery, Hospital of the University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Jinbo Chen
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| |
|
25
|
Statistical Method Based on Bayes-Type Empirical Score Test for Assessing Genetic Association with Multilocus Genotype Data. Int J Genomics 2020; 2020:4708152. [PMID: 32455126 PMCID: PMC7229558 DOI: 10.1155/2020/4708152] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 12/11/2019] [Accepted: 04/21/2020] [Indexed: 12/20/2022]
Abstract
Simultaneous testing of multiple genetic variants for association is widely recognized as a valuable complementary approach to single-marker tests. As such, principal component regression (PCR) has been found to have competitive power. We focus on exploring a robust test for an unknown genetic mode of all SNPs, an unknown Hardy-Weinberg equilibrium (HWE) in a population, and a large number of all SNPs. First, we propose a new global test by means of the use of codominant codes for all markers and PCR. The new global test is built on an empirical Bayes-type score statistic for testing marginal associations with each single marker. The new global test gains power by robustly exploiting the Hardy-Weinberg equilibrium in the control population and effectively using linkage disequilibrium among test markers. The new global test reduces to PCR when the genotype for each marker is coded as the number of minor alleles. This connection lends insight into the power of the new global test relative to PCR and some other popular multimarker test methods. Second, we propose a robust test method based on the new global test and the ordinary PCR test built on a prospective score statistic for testing marginal associations with each single marker when the genotype for each marker is coded as the number of minor alleles by taking the minimum p value of these two tests. Finally, through extensive simulation studies and analysis of the association between pancreatic cancer and some genes of interest, we show that the proposed robust test method has desirable power and can often identify association signals that may be missed by existing methods.
|
26
|
Deng Y, He T, Fang R, Li S, Cao H, Cui Y. Genome-Wide Gene-Based Multi-Trait Analysis. Front Genet 2020; 11:437. [PMID: 32508874 PMCID: PMC7248273 DOI: 10.3389/fgene.2020.00437] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 02/24/2020] [Accepted: 04/08/2020] [Indexed: 11/29/2022]
Abstract
Genome-wide association studies focusing on a single phenotype have been broadly conducted to identify genetic variants associated with a complex disease. The commonly applied single-variant analysis is limited by failing to consider the complex interactions between variants, which motivated the development of association analyses focusing on genes or gene sets. Moreover, when multiple correlated phenotypes are available, methods based on a multi-trait analysis can improve the association power. However, most currently available multi-trait analyses are single variant-based and thus have limited power when disease variants function as a group in a gene or a gene set. In this work, we propose a genome-wide gene-based multi-trait analysis method by considering genes as testing units. For a given phenotype, we adopt a rapid and powerful kernel-based testing method which can evaluate the joint effect of multiple variants within a gene. The joint effect, either linear or nonlinear, is captured through kernel functions. Given a series of candidate kernel functions, we propose an omnibus test strategy to integrate the test results based on different candidate kernels. A p-value combination method is then applied to integrate dependent p-values to assess the association between a gene and multiple correlated phenotypes. Simulation studies show reasonable type I error control and excellent power of the proposed method compared to its counterparts. We further show the utility of the method by applying it to two data sets, the Human Liver Cohort and the Alzheimer's Disease Neuroimaging Initiative data set, in which novel genes are identified. Our method has broad applications in other fields in which the interest is to evaluate the joint effect (linear or nonlinear) of a set of variants.
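To make the idea of capturing linear or nonlinear joint effects through kernel functions more concrete, the sketch below builds sample-by-sample similarity matrices from genotype vectors. The abstract does not name its candidate kernels, so the linear and Gaussian kernels, the function names, and the toy genotypes are all assumptions chosen for illustration.

```python
import math


def linear_kernel(x, y):
    """Inner product of two genotype vectors (captures additive effects)."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) similarity; `bandwidth` is a hypothetical tuning value."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def kernel_matrix(samples, kernel):
    """Pairwise similarity matrix for a list of samples."""
    return [[kernel(a, b) for b in samples] for a in samples]


# Toy data: three subjects, three variants coded as minor-allele counts.
genotypes = [[0, 1, 2], [0, 1, 2], [2, 0, 0]]
K_linear = kernel_matrix(genotypes, linear_kernel)
K_gaussian = kernel_matrix(genotypes, gaussian_kernel)
```

A kernel-based test would then compare such a similarity matrix with similarity in the phenotype; an omnibus strategy of the kind the abstract describes would repeat this for each candidate kernel and combine the results.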
Affiliation(s)
- Yamin Deng
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
| | - Tao He
- Department of Mathematics, San Francisco State University, San Francisco, CA, United States
| | - Ruiling Fang
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
| | - Shaoyu Li
- Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC, United States
| | - Hongyan Cao
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
| | - Yuehua Cui
- Department of Statistics and Probability, Michigan State University, East Lansing, MI, United States
| |
|
27
|
Shinohara RT, Shou H, Carone M, Schultz R, Tunc B, Parker D, Martin ML, Verma R. Distance-based analysis of variance for brain connectivity. Biometrics 2020; 76:257-269. [PMID: 31350904 PMCID: PMC7653688 DOI: 10.1111/biom.13123] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 03/14/2018] [Accepted: 07/12/2019] [Indexed: 01/07/2023]
Abstract
The field of neuroimaging dedicated to mapping connections in the brain is increasingly being recognized as key for understanding neurodevelopment and pathology. Networks of these connections are quantitatively represented using complex structures, including matrices, functions, and graphs, which require specialized statistical techniques for estimation and inference about developmental and disorder-related changes. Unfortunately, classical statistical testing procedures are not well suited to high-dimensional testing problems. In the context of global or regional tests for differences in neuroimaging data, traditional analysis of variance (ANOVA) is not directly applicable without first summarizing the data into univariate or low-dimensional features, a process that might mask the salient features of high-dimensional distributions. In this work, we consider a general framework for two-sample testing of complex structures by studying generalized within-group and between-group variances based on distances between complex and potentially high-dimensional observations. We derive an asymptotic approximation to the null distribution of the ANOVA test statistic, and conduct simulation studies with scalar and graph outcomes to study finite sample properties of the test. Finally, we apply our test to our motivating study of structural connectivity in autism spectrum disorder.
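The notion of within-group and between-group variances defined through pairwise distances can be sketched as follows. This is a generic pseudo-F statistic computed from a distance matrix, paired with a permutation p-value; the permutation step is swapped in for simplicity and is not the asymptotic approximation derived in the paper. All function names and the toy data are hypothetical.

```python
import random


def pseudo_f(dist, labels):
    """Distance-based pseudo-F: between-group over within-group variation."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares from all pairwise squared distances.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        pair_sum = sum(dist[i][j] ** 2 for a, i in enumerate(idx) for j in idx[a + 1:])
        ss_within += pair_sum / len(idx)
    ss_between = ss_total - ss_within
    return (ss_between / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permutation_p_value(dist, labels, n_perm=999, seed=0):
    """Share of label shufflings whose pseudo-F is at least the observed one."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy scalar outcomes; absolute difference serves as the distance.
values = [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8, 5.1]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
dist = [[abs(a - b) for b in values] for a in values]
f_stat = pseudo_f(dist, labels)
p_perm = permutation_p_value(dist, labels)
```

With scalar outcomes and absolute-difference distances the statistic coincides with the classical one-way ANOVA F, which is a convenient sanity check; for matrices or graphs, only the distance function needs to change.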
Affiliation(s)
- Russell T. Shinohara
- Department of Biostatistics, Epidemiology, and Informatics, Penn Statistics in Imaging and Visualization Center, University of Pennsylvania, Philadelphia, Pennsylvania
- Department of Radiology, Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
| | - Haochang Shou
- Department of Biostatistics, Epidemiology, and Informatics, Penn Statistics in Imaging and Visualization Center, University of Pennsylvania, Philadelphia, Pennsylvania
- Department of Radiology, Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
| | - Marco Carone
- Department of Biostatistics, University of Washington, Seattle, Washington
| | - Robert Schultz
- Center for Autism Research, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania
| | - Birkan Tunc
- Department of Radiology, Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
| | - Drew Parker
- Department of Radiology, Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
| | - Melissa Lynne Martin
- Department of Biostatistics, Epidemiology, and Informatics, Penn Statistics in Imaging and Visualization Center, University of Pennsylvania, Philadelphia, Pennsylvania
| | - Ragini Verma
- Department of Radiology, Center for Biomedical Image Computing and Analytics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
| |
|
28
|
Solis-Lemus CR, Fischer ST, Todor A, Liu C, Leslie EJ, Cutler DJ, Ghosh D, Epstein MP. Leveraging Family History in Case-Control Analyses of Rare Variation. Genetics 2020; 214:295-303. [PMID: 31843756 PMCID: PMC7017020 DOI: 10.1534/genetics.119.302846] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2019] [Accepted: 12/10/2019] [Indexed: 11/18/2022]
Abstract
Standard methods for case-control association studies of rare variation often treat disease outcome as a dichotomous phenotype. However, both theoretical and experimental studies have demonstrated that subjects with a family history of disease can be enriched for risk variation relative to subjects without such history. Assuming family history information is available, this observation motivates the idea of replacing the standard dichotomous outcome variable used in case-control studies with a more informative ordinal outcome variable that distinguishes controls (0), sporadic cases (1), and cases with a family history (2), with the expectation that we should observe an increasing number of risk variants with increasing category of the ordinal variable. To leverage this expectation, we propose a novel rare-variant association test that incorporates family history information based on our previous GAMuT framework for rare-variant association testing of multivariate phenotypes. We use simulated data to show that, when family history information is available, our new method outperforms standard rare-variant association methods, like burden and SKAT tests, that ignore family history. We further illustrate our method using a rare-variant study of cleft lip and palate.
Affiliation(s)
| | - S Taylor Fischer
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, 30329 Georgia
| | - Andrei Todor
- Department of Human Genetics, Emory University, Atlanta, 30030 Georgia
| | - Cuining Liu
- Department of Biostatistics and Informatics, University of Colorado, Aurora, 80045 Colorado
| | | | - David J Cutler
- Department of Human Genetics, Emory University, Atlanta, 30030 Georgia
| | - Debashis Ghosh
- Department of Biostatistics and Informatics, University of Colorado, Aurora, 80045 Colorado
| | - Michael P Epstein
- Department of Human Genetics, Emory University, Atlanta, 30030 Georgia
| |
|
29
|
Martinez K, Maity A, Yolken RH, Sullivan PF, Tzeng JY. Robust kernel association testing (RobKAT). Genet Epidemiol 2020; 44:272-282. [PMID: 31943371 DOI: 10.1002/gepi.22280] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/22/2019] [Revised: 12/18/2019] [Accepted: 12/23/2019] [Indexed: 12/25/2022]
Abstract
Testing the association between single-nucleotide polymorphism (SNP) effects and a response is often carried out through kernel machine methods based on least squares, such as the sequence kernel association test (SKAT). However, these least-squares procedures are designed for a normally distributed conditional response, which may not apply. Other robust procedures such as the quantile regression kernel machine (QRKM) restrict the choice of the loss function and only allow inference on conditional quantiles. We propose a general and robust kernel association test that allows a flexible choice of the loss function, makes no distributional assumptions, and has SKAT and QRKM as special cases. We evaluate our proposed robust association test (RobKAT) across various data distributions through a simulation study. When errors are normally distributed, RobKAT controls type I error and shows comparable power with SKAT. In all other distributional settings investigated, our robust test has similar or greater power than SKAT. Finally, we apply our robust testing method to data from the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) clinical trial to detect associations between selected genes including the major histocompatibility complex (MHC) region on chromosome six and neurotropic herpesvirus antibody levels in schizophrenia patients. RobKAT detected significant association with four SNP sets (HST1H2BJ, MHC, POM12L2, and SLC17A1), three of which were undetected by SKAT.
Affiliation(s)
- Kara Martinez
- Department of Statistics, North Carolina State University, Raleigh, North Carolina
| | - Arnab Maity
- Department of Statistics, North Carolina State University, Raleigh, North Carolina
| | - Robert H Yolken
- Stanley Neurovirology Laboratory, Johns Hopkins School of Medicine, Baltimore, Maryland
| | - Patrick F Sullivan
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| | - Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina.,Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina.,Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan.,Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
| |
|
30
|
Aksman LM, Scelsi MA, Marquand AF, Alexander DC, Ourselin S, Altmann A. Modeling longitudinal imaging biomarkers with parametric Bayesian multi-task learning. Hum Brain Mapp 2019; 40:3982-4000. [PMID: 31168892 PMCID: PMC6679792 DOI: 10.1002/hbm.24682] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 10/12/2018] [Revised: 05/03/2019] [Accepted: 05/19/2019] [Indexed: 01/09/2023]
Abstract
Longitudinal imaging biomarkers are invaluable for understanding the course of neurodegeneration, promising the ability to track disease progression and to detect disease earlier than cross-sectional biomarkers. To properly realize their potential, biomarker trajectory models must be robust to both under-sampling and measurement errors and should be able to integrate multi-modal information to improve trajectory inference and prediction. Here we present a parametric Bayesian multi-task learning based approach to modeling univariate trajectories across subjects that addresses these criteria. Our approach learns multiple subjects' trajectories within a single model that allows for different types of information sharing, that is, coupling, across subjects. It optimizes a combination of uncoupled, fully coupled and kernel coupled models. Kernel-based coupling allows linking subjects' trajectories based on one or more biomarker measures. We demonstrate this using Alzheimer's Disease Neuroimaging Initiative (ADNI) data, where we model longitudinal trajectories of MRI-derived cortical volumes in neurodegeneration, with coupling based on APOE genotype, cerebrospinal fluid (CSF) and amyloid PET-based biomarkers. In addition to detecting established disease effects, we detect disease related changes within the insula that have not received much attention within the literature. Due to its sensitivity in detecting disease effects, its competitive predictive performance and its ability to learn the optimal parameter covariance from data rather than choosing a specific set of random and fixed effects a priori, we propose that our model can be used in place of or in addition to linear mixed effects models when modeling biomarker trajectories. A software implementation of the method is publicly available.
Affiliation(s)
- Leon M. Aksman
- Centre for Medical Image Computing, University College London, London, UK
| | - Marzia A. Scelsi
- Centre for Medical Image Computing, University College London, London, UK
| | - Andre F. Marquand
- Donders Centre for Cognitive Neuroimaging, Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
| | | | - Sebastien Ourselin
- Centre for Medical Image Computing, University College London, London, UK
- School of Biomedical Engineering and Imaging Sciences, St Thomas' Hospital, King's College London, London, UK
| | - Andre Altmann
- Centre for Medical Image Computing, University College London, London, UK
| | | |
|
31
|
Zhao Y, Zhu H, Lu Z, Knickmeyer RC, Zou F. Structured Genome-Wide Association Studies with Bayesian Hierarchical Variable Selection. Genetics 2019; 212:397-415. [PMID: 31010934 PMCID: PMC6553832 DOI: 10.1534/genetics.119.301906] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/07/2019] [Accepted: 04/08/2019] [Indexed: 02/04/2023]
Abstract
It is becoming increasingly important to use genome-wide association studies (GWAS) to select genetic information associated with qualitative or quantitative traits. Currently, the discovery of biological association among SNPs motivates various strategies to construct SNP-sets along the genome and to incorporate such set information into the selection procedure for higher selection power, while facilitating more biologically meaningful results. The aim of this paper is to propose a novel Bayesian framework for hierarchical variable selection at both the SNP-set (group) level and the SNP (within-group) level. We overcome a key limitation of the existing posterior updating schemes in most Bayesian variable selection methods by proposing a novel sampling scheme to explicitly accommodate the ultrahigh-dimensionality of genetic data. Specifically, by constructing an auxiliary variable selection model at the SNP-set level, the new procedure utilizes the posterior samples of the auxiliary model to subsequently guide the posterior inference for the targeted hierarchical selection model. We apply the proposed method to a variety of simulation studies and show that our method is computationally efficient and achieves substantially better performance than competing approaches in both SNP-set and SNP selection. Applying the method to the Alzheimer's Disease Neuroimaging Initiative (ADNI) data, we identify biologically meaningful genetic factors under several neuroimaging volumetric phenotypes. Our method is general and can readily be applied to a wide range of biomedical studies.
Affiliation(s)
- Yize Zhao
- Department of Healthcare Policy and Research, Cornell University Weill Cornell, New York, New York 10065
| | - Hongtu Zhu
- Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599
| | - Zhaohua Lu
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, Tennessee 38105
| | - Rebecca C Knickmeyer
- Department of Pediatrics and Human Development, Michigan State University, East Lansing, Michigan 48824
| | - Fei Zou
- Department of Biostatistics, University of Florida, Gainesville, Florida 32611
| |
|
32
|
Holleman AM, Broadaway KA, Duncan R, Todor A, Almli LM, Bradley B, Ressler KJ, Ghosh D, Mulle JG, Epstein MP. Powerful and Efficient Strategies for Genetic Association Testing of Symptom and Questionnaire Data in Psychiatric Genetic Studies. Sci Rep 2019; 9:7523. [PMID: 31101869 PMCID: PMC6525248 DOI: 10.1038/s41598-019-44046-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 12/10/2018] [Accepted: 05/01/2019] [Indexed: 11/09/2022]
Abstract
Genetic studies of psychiatric disorders often deal with phenotypes that are not directly measurable. Instead, researchers rely on multivariate symptom data from questionnaires and surveys like the PTSD Symptom Scale (PSS) and Beck Depression Inventory (BDI) to indirectly assess a latent phenotype of interest. Researchers subsequently collapse such multivariate questionnaire data into a univariate outcome to represent a surrogate for the latent phenotype. However, when a causal variant is only associated with a subset of collapsed symptoms, the effect will be challenging to detect using the univariate outcome. We describe a more powerful strategy for genetic association testing in this situation that jointly analyzes the original multivariate symptom data using a statistical framework that compares similarity in multivariate symptom-scale data from questionnaires to similarity in common genetic variants across a gene. We use simulated data to demonstrate that this strategy provides substantially increased power over standard approaches that collapse questionnaire data into a single surrogate outcome. We also illustrate our approach using GWAS data from the Grady Trauma Project and identify genes associated with BDI not identified using standard univariate techniques. The approach is computationally efficient, scales to genome-wide studies, and is applicable to correlated symptom data of arbitrary dimension.
Affiliation(s)
- Aaron M Holleman
- Department of Epidemiology, Emory University, Atlanta, GA, USA
- Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA, USA
| | | | - Richard Duncan
- Department of Human Genetics, Emory University, Atlanta, GA, USA
| | - Andrei Todor
- Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA, USA
- Department of Human Genetics, Emory University, Atlanta, GA, USA
| | - Lynn M Almli
- Department of Psychiatry and Behavioral Sciences, Emory University, Atlanta, GA, USA
| | - Bekh Bradley
- Department of Psychiatry and Behavioral Sciences, Emory University, Atlanta, GA, USA
- Clinical Psychologist, Mental Health Service Line, Department of Veterans Affairs Medical Center, Atlanta, GA, USA
| | - Kerry J Ressler
- Department of Psychiatry, McLean Hospital, Harvard Medical School, Belmont, MA, USA
| | - Debashis Ghosh
- Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, USA
| | - Jennifer G Mulle
- Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA, USA
- Department of Human Genetics, Emory University, Atlanta, GA, USA
| | - Michael P Epstein
- Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA, USA.
- Department of Human Genetics, Emory University, Atlanta, GA, USA.
| |
|
33
|
Yan Q, Liu N, Forno E, Canino G, Celedón JC, Chen W. An integrative association method for omics data based on a modified Fisher's method with application to childhood asthma. PLoS Genet 2019; 15:e1008142. [PMID: 31063461 PMCID: PMC6524814 DOI: 10.1371/journal.pgen.1008142] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/18/2019] [Revised: 05/17/2019] [Accepted: 04/16/2019] [Indexed: 02/07/2023]
Abstract
The development of high-throughput biotechnologies allows the collection of omics data to study the biological mechanisms underlying complex diseases at different levels, such as genomics, epigenomics, and transcriptomics. However, each technology is designed to collect a specific type of omics data. Thus, the association between a disease and one type of omics data is usually tested individually, but this strategy is suboptimal. To better articulate biological processes and increase the consistency of variant identification, omics data from various platforms need to be integrated. In this report, we introduce an approach that uses a modified Fisher's method (denoted as Omnibus-Fisher) to combine separate p-values of association testing for a trait and SNPs, DNA methylation markers, and RNA sequencing, calculated by kernel machine regression, into an overall gene-level p-value that accounts for correlation between omics data. To consider all possible disease models, we extend Omnibus-Fisher to an optimal test by using perturbations. In our simulations, the usual Fisher's method has inflated type I error rates when directly applied to correlated omics data. In contrast, Omnibus-Fisher preserves the expected type I error rates. Moreover, Omnibus-Fisher has increased power compared to its optimal version when the true disease model involves all types of omics data. On the other hand, the optimal Omnibus-Fisher is more powerful than its regular version when only one type of data is causal. Finally, we illustrate our proposed method by analyzing whole-genome genotyping, DNA methylation data, and RNA sequencing data from a study of childhood asthma in Puerto Ricans.
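For orientation, the usual Fisher's method that this abstract contrasts with Omnibus-Fisher can be sketched in a few lines. The code assumes independent p-values, which is exactly the assumption the abstract reports as failing for correlated omics data; the Omnibus-Fisher correction itself is not reproduced here. The function name and example p-values are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Standard Fisher combination, valid only for independent p-values.

    Returns (statistic, combined p-value), with the statistic referred to
    a chi-square distribution with 2k degrees of freedom.
    """
    k = len(p_values)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("need p-values in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Closed-form chi-square survival function for even degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half / i
        series += term
    return statistic, math.exp(-half) * series


# Hypothetical gene-level p-values from SNP, methylation, and RNA-seq tests.
statistic, combined_p = fisher_combined_p([0.04, 0.20, 0.01])
```

When the inputs are positively correlated, the reference chi-square distribution is too narrow, which is one way to see why the unmodified method can produce inflated type I error rates.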
Affiliation(s)
- Qi Yan
- Division of Pediatric Pulmonary Medicine, UPMC Children’s Hospital of Pittsburgh, University of Pittsburgh, Pittsburgh, PA
| | - Nianjun Liu
- Department of Epidemiology and Biostatistics, School of Public Health, Indiana University Bloomington, Bloomington, IN
| | - Erick Forno
- Division of Pediatric Pulmonary Medicine, UPMC Children’s Hospital of Pittsburgh, University of Pittsburgh, Pittsburgh, PA
| | - Glorisa Canino
- Behavioral Sciences Research Institute, University of Puerto Rico, San Juan, PR
| | - Juan C. Celedón
- Division of Pediatric Pulmonary Medicine, UPMC Children’s Hospital of Pittsburgh, University of Pittsburgh, Pittsburgh, PA
| | - Wei Chen
- Division of Pediatric Pulmonary Medicine, UPMC Children’s Hospital of Pittsburgh, University of Pittsburgh, Pittsburgh, PA
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA
- Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, PA
| |
|
34
|
Svishcheva GR. A generalized model for combining dependent SNP-level summary statistics and its extensions to statistics of other levels. Sci Rep 2019; 9:5461. [PMID: 30940856 PMCID: PMC6445108 DOI: 10.1038/s41598-019-41827-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 07/16/2018] [Accepted: 03/06/2019] [Indexed: 11/12/2022]
Abstract
Here I propose a fundamentally new flexible model to reveal the association between a trait and a set of genetic variants in a genomic region/gene. This model was developed for the situation when original individual-level phenotype and genotype data are not available, but the researcher possesses the results of statistical analyses conducted on these data (namely, SNP-level summary Z score statistics and SNP-by-SNP correlations). The new model was analytically derived from the classical multiple linear regression model applied for the region-based association analysis of individual-level phenotype and genotype data by using the linear compression of data, where the SNP-by-SNP correlations are among the explanatory variables, and the summary Z score statistics are categorized as the response variables. I analytically show that the regional association analysis methods developed within the framework of the classical multiple linear regression model with additive effects of genetic variants can be reformulated in terms of the new model without the loss of information. The results obtained from the regional association analysis utilizing the classical model and those derived using the proposed model are identical when SNP-by-SNP correlations and SNP-level statistics are estimated from the same genetic data.
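One way to see why Z scores can act as the response and SNP-by-SNP correlations as the explanatory variables is the sketch below. It assumes n individuals, a genotype matrix X with standardized columns, a centered trait y, and a residual variance σ² treated as known; these symbols are illustrative assumptions and not necessarily the author's notation or exact derivation.

```latex
\begin{aligned}
y &= X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n),\\
z &= \frac{X^\top y}{\sigma\sqrt{n}}, \qquad R = \frac{X^\top X}{n},\\
z &= \frac{\sqrt{n}}{\sigma}\, R\,\beta + e, \qquad
e = \frac{X^\top \varepsilon}{\sigma\sqrt{n}} \sim N(0, R).
\end{aligned}
```

Under these assumptions the last line is itself a linear model in which z is the response, R is the design matrix, and the error covariance is again R, so a regional test can be written in terms of z and R alone.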
Affiliation(s)
- Gulnara R Svishcheva
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, 630090, Russia; Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, 119991, Russia.
| |
|
35
|
Zhao N, Zhang H, Clark JJ, Maity A, Wu MC. Composite kernel machine regression based on likelihood ratio test for joint testing of genetic and gene–environment interaction effect. Biometrics 2019; 75:625-637. [DOI: 10.1111/biom.13003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 02/08/2018] [Accepted: 10/09/2018] [Indexed: 12/17/2022]
Affiliation(s)
- Ni Zhao
- Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland
| | - Haoyu Zhang
- Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland
| | - Jennifer J. Clark
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| | - Arnab Maity
- Department of Statistics, North Carolina State University, Raleigh, North Carolina
| | - Michael C. Wu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
| |
|
36
|
Shao F, Wang Y, Zhao Y, Yang S. Identifying and exploiting gene-pathway interactions from RNA-seq data for binary phenotype. BMC Genet 2019; 20:36. [PMID: 30890140 PMCID: PMC6423879 DOI: 10.1186/s12863-019-0739-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/24/2018] [Accepted: 03/12/2019] [Indexed: 11/29/2022]
Abstract
Background RNA sequencing (RNA-seq) technology has identified multiple differentially expressed (DE) genes associated with complex disease; however, these genes explain only a modest part of the variance. The omnigenic model assumes that disease may be driven by genes with indirect relevance to disease and be propagated by functional pathways. Here, we focus on identifying the interactions between external genes and functional pathways, referred to as gene-pathway interactions (GPIs). Specifically, relying on the relationship between the garrote kernel machine (GKM) and the variance component test, and on permutations for the empirical distributions of score statistics, we propose an efficient analysis procedure, Permutation based gEne-pAthway interaction identification in binary phenotype (PEA). Results Various simulations show that PEA has well-calibrated type I error rates and higher power than the traditional likelihood ratio test (LRT). In addition, we apply gene set enrichment algorithms and PEA to identify GPIs from a pan-cancer data set (GSE68086). These GPIs and genes possibly further illustrate the potential etiology of cancers, most of which are identified, and some external genes and significant pathways are consistent with previous studies. Conclusions PEA is an efficient tool for identifying GPIs from RNA-seq data. It can be further extended to identify the interactions between one variable and one functional set of other omics data for binary phenotypes.
Affiliation(s)
- Fang Shao
- Department of Biostatistics, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing, Jiangsu, People's Republic of China
| | - Yaqi Wang
- Department of Pharmacy Informatics, School of Science, China Pharmaceutical University, 24 Tongjia Xiang, Nanjing, Jiangsu, People's Republic of China
| | - Yang Zhao
- Department of Biostatistics, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing, Jiangsu, People's Republic of China
| | - Sheng Yang
- Department of Biostatistics, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing, Jiangsu, People's Republic of China.
| |
|
37
|
Marceau West R, Lu W, Rotroff DM, Kuenemann MA, Chang SM, Wu MC, Wagner MJ, Buse JB, Motsinger-Reif AA, Fourches D, Tzeng JY. Identifying individual risk rare variants using protein structure guided local tests (POINT). PLoS Comput Biol 2019; 15:e1006722. [PMID: 30779729 PMCID: PMC6396946 DOI: 10.1371/journal.pcbi.1006722] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 06/28/2018] [Revised: 03/01/2019] [Accepted: 12/17/2018] [Indexed: 01/08/2023]
Abstract
Rare variants are of increasing interest to genetic association studies because of their etiological contributions to human complex diseases. Due to the rarity of the mutant events, rare variants are routinely analyzed on an aggregate level. While aggregation analyses improve the detection of global-level signal, they are not able to pinpoint causal variants within a variant set. To perform inference on a localized level, additional information, e.g., biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the 3-dimensional protein space in a data-adaptive fashion. Besides merely providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization. We assess the selection performance of POINT using simulations and illustrate how it can be used to prioritize individual rare variants in PCSK9, ANGPTL4 and CETP in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data.
Affiliation(s)
- Rachel Marceau West
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
| | - Wenbin Lu
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
| | - Daniel M. Rotroff
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio, United States of America
| | - Melaine A. Kuenemann
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
| | - Sheng-Mao Chang
- Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
| | - Michael C. Wu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - Michael J. Wagner
- Center for Pharmacogenomics and Individualized Therapy, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - John B. Buse
- Department of Medicine, University of North Carolina School of Medicine, Chapel Hill, North Carolina, United States of America
| | - Alison A. Motsinger-Reif
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
| | - Denis Fourches
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Department of Chemistry, North Carolina State University, Raleigh, North Carolina, United States of America
| | - Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
| |
|
38
|
Saad M, Wijsman EM. Association score testing for rare variants and binary traits in family data with shared controls. Brief Bioinform 2019; 20:245-253. [PMID: 28968627 DOI: 10.1093/bib/bbx107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2017] [Indexed: 11/12/2022]
Abstract
Genome-wide association studies have been an important approach used to localize trait loci, with primary focus on common variants. The multiple rare variant-common disease hypothesis may explain the missing heritability remaining after accounting for identified common variants. Advances in sequencing technologies with their decreasing costs, coupled with methodological advances in the context of association studies in large samples, now make the study of rare variants at a genome-wide scale feasible. The resurgence of family-based association designs because of their advantage in studying rare variants has also stimulated more methods development, mainly based on linear mixed models (LMMs). Other tests such as score tests can have advantages over the LMMs, but to date have mainly been proposed for single-marker association tests. In this article, we extend several score tests ($\chi^2_{\mathrm{corrected}}$, $W_{\mathrm{QLS}}$, and SKAT) to the multiple variant association framework. We evaluate and compare their statistical performance relative to the LMM. Moreover, we show that three tests can be cast as the difference between marker allele frequencies (AFs) estimated in each of the groups of affected and unaffected subjects. We show that these tests are flexible, as they can be based on related, unrelated or both related and unrelated subjects. They also make feasible an increasingly common design that only sequences a subset of affected subjects (related or unrelated) and uses for comparison publicly available AFs estimated in a group of healthy subjects. Finally, we show the great impact of linkage disequilibrium on the performance of all these tests.
Affiliation(s)
- Mohamad Saad
- Department of Biostatistics, University of Washington, Seattle, USA; Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, USA; Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
| | - Ellen M Wijsman
- Department of Biostatistics, University of Washington, Seattle, USA; Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, USA
| |
|
39
|
Larson NB, Chen J, Schaid DJ. A review of kernel methods for genetic association studies. Genet Epidemiol 2019; 43:122-136. [PMID: 30604442 DOI: 10.1002/gepi.22180] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 08/27/2018] [Revised: 11/09/2018] [Accepted: 11/26/2018] [Indexed: 12/17/2022]
Abstract
Evaluating the association of multiple genetic variants with a trait of interest by use of kernel-based methods has made a significant impact on how genetic association analyses are conducted. An advantage of kernel methods is that they tend to be robust when the genetic variants have effects that are a mixture of positive and negative effects, as well as when there is a small fraction of causal variants. Another advantage is that kernel methods fit within the framework of mixed models, providing flexible ways to adjust for additional covariates that influence traits. Herein, we review the basic ideas behind the use of kernel methods for genetic association analysis as well as recent methodological advancements for different types of traits, multivariate traits, pedigree data, and longitudinal data. Finally, we discuss opportunities for future research.
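The robustness to mixed effect directions mentioned above can be illustrated with a minimal kernel score statistic. The sketch assumes a quantitative trait, no covariates, and a weighted linear kernel; the function names, weights, and toy data are hypothetical and are not taken from the review.

```python
def weighted_linear_kernel(genotypes, weights=None):
    """K[a][b] = sum_j w_j * g_aj * g_bj for subjects a and b."""
    n_variants = len(genotypes[0])
    w = weights if weights is not None else [1.0] * n_variants
    return [
        [sum(w[j] * ga[j] * gb[j] for j in range(n_variants)) for gb in genotypes]
        for ga in genotypes
    ]

def kernel_score_statistic(trait, kernel):
    """Q = r' K r, where r is the trait centered at its mean (no covariates)."""
    mean = sum(trait) / len(trait)
    r = [y - mean for y in trait]
    n = len(r)
    return sum(r[a] * kernel[a][b] * r[b] for a in range(n) for b in range(n))

# Toy data: four subjects, two variants coded as minor-allele counts.
genotypes = [[0, 1], [1, 0], [2, 1], [0, 0]]
trait = [1.0, 2.0, 3.0, 0.0]
q = kernel_score_statistic(trait, weighted_linear_kernel(genotypes))
print(q)
```

With a linear kernel, Q equals the weighted sum of squared per-variant score contributions, so variants with opposite effect directions add to the statistic instead of cancelling as they can in a burden-style sum.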
Affiliation(s)
- Nicholas B Larson
- Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
| | - Jun Chen
- Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
| | - Daniel J Schaid
- Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
| |
|
40
|
Mariette J, Villa-Vialaneix N. Unsupervised multiple kernel learning for heterogeneous data integration. Bioinformatics 2019; 34:1009-1015. [PMID: 29077792 DOI: 10.1093/bioinformatics/btx682] [Citation(s) in RCA: 59] [Impact Index Per Article: 9.8] [Received: 05/18/2017] [Accepted: 10/24/2017] [Indexed: 11/14/2022]
Abstract
Motivation Recent high-throughput sequencing advances have expanded the breadth of available omics datasets, and the integrated analysis of multiple datasets obtained on the same samples has yielded important insights in a wide range of applications. However, the integration of various sources of information remains a challenge for systems biology because the datasets produced are often of heterogeneous types, which calls for generic methods that take their different specificities into account. Results We propose a multiple kernel framework that integrates multiple datasets of various types into a single exploratory analysis. Several solutions are provided to learn either a consensus meta-kernel or a meta-kernel that preserves the original topology of the datasets. We applied our framework to analyse two public multi-omics datasets. First, the multiple metagenomic datasets collected during the TARA Oceans expedition were explored to demonstrate that our method is able to retrieve previous findings in a single kernel PCA as well as to provide a new image of the sample structures when a larger number of datasets are included in the analysis. To perform this analysis, a generic procedure is also proposed to improve the interpretability of the kernel PCA with regard to the original data. Second, the multi-omics breast cancer datasets provided by The Cancer Genome Atlas are analysed using a kernel Self-Organizing Map with both single and multi-omics strategies. The comparison of these two approaches demonstrates the benefit of our integration method to improve the representation of the studied biological system. Availability and implementation Proposed methods are available in the R package mixKernel, released on CRAN. It is fully compatible with the mixOmics package and a tutorial describing the approach can be found on the mixOmics web site http://mixomics.org/mixkernel/. Contact jerome.mariette@inra.fr or nathalie.villa-vialaneix@inra.fr.
Affiliation(s)
- Jérôme Mariette
- MIAT, Université de Toulouse, INRA, 31326 Castanet-Tolosan, France
| | | |
|
41
|
Javanrouh N, Soltanian AR, Tapak L, Azizi F, Ott J, Daneshpour MS. A novel association of rs13334070 in the RPGRIP1L gene with adiposity factors discovered by joint linkage and linkage disequilibrium analysis in Iranian pedigrees: Tehran Cardiometabolic Genetic Study (TCGS). Genet Epidemiol 2018; 43:342-351. [PMID: 30597647 DOI: 10.1002/gepi.22179] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 08/04/2018] [Revised: 10/15/2018] [Accepted: 11/26/2018] [Indexed: 02/01/2023]
Abstract
Understanding the genetic and metabolic bases of obesity is helpful in planning and developing health strategies. Therefore, the first family-based joint linkage and linkage disequilibrium study was conducted in Iranian pedigrees to assess the relationship between obesity and single-nucleotide polymorphisms (SNPs) located in the 16q12.2 region. In the present study, a total of 13,344 individuals were included, of whom 12,502 individuals were within 3,109 pedigrees and 842 were unrelated singletons. To investigate the relationship between obesity and genetic variants, a joint model of linkage and linkage disequilibrium was applied. Moreover, a sequence kernel association test (SKAT) was used to evaluate the association of the SNP set with body size and lipid profile measurements. The joint model showed that rs13334070, in the intron 4 of the RPGRIP1L gene, has a significant association with obesity. According to the 4-gamete rule, which is a procedure for constructing SNP sets by considering recombination occurrence between SNPs, this polymorphism has a high correlation with six nearby SNPs that make an SNP set. SKAT showed that this SNP set has a significant association with body size factors, but almost no association with most of the lipid profile measurements. In conclusion, from the result of this study, it might be reasonable to consider RPGRIP1L as an important gene whose variations could be associated with obesity risk factors.
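The 4-gamete rule mentioned above can be sketched in a few lines. The sketch assumes phased, biallelic haplotypes coded 0/1 and a simple left-to-right grouping of adjacent SNPs; the function names and toy haplotypes are hypothetical, and the study's actual software and settings are not stated in the abstract.

```python
def shows_all_four_gametes(haplotypes, i, j):
    """True if all four allele combinations (00, 01, 10, 11) occur at SNPs i and j."""
    observed = {(h[i], h[j]) for h in haplotypes}
    return len(observed) == 4

def four_gamete_snp_sets(haplotypes):
    """Group adjacent SNPs; start a new set when recombination evidence appears."""
    n_snps = len(haplotypes[0])
    snp_sets, start = [], 0
    for snp in range(1, n_snps):
        # Compare the new SNP with every SNP already in the current set.
        if any(shows_all_four_gametes(haplotypes, earlier, snp)
               for earlier in range(start, snp)):
            snp_sets.append(list(range(start, snp)))
            start = snp
    snp_sets.append(list(range(start, n_snps)))
    return snp_sets

# Toy haplotypes over three SNPs: SNP 0 and SNP 1 never recombine, SNP 2 does.
haplotypes = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(four_gamete_snp_sets(haplotypes))  # prints [[0, 1], [2]]
```

Observing all four gametes for a SNP pair is taken as evidence of a historical recombination (or recurrent mutation) between them, which is why the third SNP is split into its own set here.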
Affiliation(s)
- Niloufar Javanrouh
- Department of Biostatistics and Epidemiology, Modeling of Non-Communicable Diseases Research Center, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran; Department of Cellular and Molecular, Cellular and Molecular Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| | - Ali R Soltanian
- Department of Biostatistics and Epidemiology, Modeling of Non-Communicable Diseases Research Center, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
| | - Leili Tapak
- Department of Biostatistics and Epidemiology, Modeling of Non-Communicable Diseases Research Center, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
| | - Fereidoun Azizi
- Department of Thyroid, Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| | - Jurg Ott
- Department of Statistical Genomics Methodology, Laboratory of Statistical Genetics, Rockefeller University, New York, New York
| | - Maryam S Daneshpour
- Department of Cellular and Molecular, Cellular and Molecular Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| |
|
42
|
Robust Rare-Variant Association Tests for Quantitative Traits in General Pedigrees. STATISTICS IN BIOSCIENCES 2018; 10:491-505. [DOI: 10.1007/s12561-017-9197-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
43
|
Schweiger R, Fisher E, Weissbrod O, Rahmani E, Müller-Nurasyid M, Kunze S, Gieger C, Waldenberger M, Rosset S, Halperin E. Detecting heritable phenotypes without a model using fast permutation testing for heritability and set-tests. Nat Commun 2018; 9:4919. [PMID: 30464216 PMCID: PMC6249264 DOI: 10.1038/s41467-018-07276-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 11/07/2017] [Accepted: 10/26/2018] [Indexed: 01/08/2023]
Abstract
Testing for association between a set of genetic markers and a phenotype is a fundamental task in genetic studies. Standard approaches for heritability and set testing strongly rely on parametric models that make specific assumptions regarding phenotypic variability. Here, we show that resulting p-values may be inflated by up to 15 orders of magnitude, in a heritability study of methylation measurements, and in a heritability and expression quantitative trait loci analysis of gene expression profiles. We propose FEATHER, a method for fast permutation-based testing of marker sets and of heritability, which properly controls for false-positive results. FEATHER eliminated 47% of methylation sites found to be heritable by the parametric test, suggesting a substantial inflation of false-positive findings by alternative methods. Our approach can rapidly identify heritable phenotypes out of millions of phenotypes acquired via high-throughput technologies, does not suffer from model misspecification and is highly efficient. Standard approaches for heritability and set testing in statistical genetics rely on parametric models that might not hold in reality and give inflated p-values. Here, the authors develop a fast method for permutation-based testing of marker sets and of heritability that does not suffer from model misspecification.
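The idea of a permutation-based set test can be sketched as follows. This is the naive version that recomputes the statistic for every shuffle, not FEATHER's fast algorithm; it assumes exchangeable subjects, no covariates, and a quadratic-form statistic, and all names and data are hypothetical.

```python
import random

def quadratic_form(trait, kernel):
    """Set-test statistic r' K r with r the mean-centered trait."""
    mean = sum(trait) / len(trait)
    r = [y - mean for y in trait]
    n = len(r)
    return sum(r[a] * kernel[a][b] * r[b] for a in range(n) for b in range(n))

def permutation_pvalue(trait, kernel, n_perm=999, seed=0):
    """Empirical p-value: share of shuffled traits with a statistic >= observed."""
    rng = random.Random(seed)
    observed = quadratic_form(trait, kernel)
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any trait-marker association
        if quadratic_form(shuffled, kernel) >= observed:
            hits += 1
    # The +1 terms count the observed data as one permutation, so p is never 0.
    return (hits + 1) / (n_perm + 1)

# Toy data: one marker, six subjects, linear kernel built from the marker.
markers = [[0], [0], [1], [1], [2], [2]]
kernel = [[ga[0] * gb[0] for gb in markers] for ga in markers]
trait = [0.1, -0.2, 1.0, 1.3, 2.1, 1.9]
p_value = permutation_pvalue(trait, kernel)
print(p_value)
```

Because the reference distribution comes from the shuffled data themselves, the p-value does not rely on a parametric model for the phenotype, which is the property the abstract emphasizes; the cost is one statistic evaluation per permutation, the bottleneck a fast method has to remove.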
Affiliation(s)
- Regev Schweiger
- Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, 6997801, Israel.
| | - Eyal Fisher
- School of Mathematical Sciences, Department of Statistics, Tel Aviv University, Tel Aviv, 69978, Israel
| | - Omer Weissbrod
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, 02115, MA, USA
| | - Elior Rahmani
- Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, 6997801, Israel
| | - Martina Müller-Nurasyid
- Institute of Genetic Epidemiology, Helmholtz Zentrum München-German Research Center for Environmental Health, Neuherberg, 85764, Germany; Department of Medicine I, Ludwig-Maximilians-Universität, Munich, 80539, Germany; DZHK (German Centre for Cardiovascular Research), partner site Munich Heart Alliance, Munich, 80636, Germany
| | - Sonja Kunze
- Institute of Epidemiology II, Helmholtz Zentrum München - German Research Center for Environmental Health, 85764, Neuherberg, Germany; Research Unit of Molecular Epidemiology, Helmholtz Zentrum München-German Research Center for Environmental Health, 85764, Neuherberg, Germany
| | - Christian Gieger
- Institute of Epidemiology II, Helmholtz Zentrum München - German Research Center for Environmental Health, 85764, Neuherberg, Germany; Research Unit of Molecular Epidemiology, Helmholtz Zentrum München-German Research Center for Environmental Health, 85764, Neuherberg, Germany
| | - Melanie Waldenberger
- DZHK (German Centre for Cardiovascular Research), partner site Munich Heart Alliance, Munich, 80636, Germany; Institute of Epidemiology II, Helmholtz Zentrum München - German Research Center for Environmental Health, 85764, Neuherberg, Germany; Research Unit of Molecular Epidemiology, Helmholtz Zentrum München-German Research Center for Environmental Health, 85764, Neuherberg, Germany
| | - Saharon Rosset
- School of Mathematical Sciences, Department of Statistics, Tel Aviv University, Tel Aviv, 69978, Israel
| | - Eran Halperin
- Los Angeles, University of California Los Angeles, Los Angeles, 90095, CA, USA; Department of Anesthesiology and Perioperative Medicine, University of California, Los Angeles, 90095, CA, USA
| |
|
44
|
He T, Li S, Zhong PS, Cui Y. An optimal kernel-based U-statistic method for quantitative gene-set association analysis. Genet Epidemiol 2018; 43:137-149. [DOI: 10.1002/gepi.22170] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 05/04/2018] [Revised: 08/19/2018] [Accepted: 09/26/2018] [Indexed: 11/09/2022]
Affiliation(s)
- Tao He
- Department of Mathematics; San Francisco State University; San Francisco California
| | - Shaoyu Li
- Department of Mathematics and Statistics; University of North Carolina at Charlotte; Charlotte North Carolina
| | - Ping-Shou Zhong
- Department of Mathematics, Statistics, and Computer Science; University of Illinois at Chicago; Chicago Illinois
| | - Yuehua Cui
- Department of Statistics & Probability; Michigan State University; East Lansing Michigan
- School of Public Health, Zhengzhou University; Zhengzhou China
| |
|
45
|
Zhang W, Chen Z, Liu A, Buck Louis GM. A weighted kernel machine regression approach to environmental pollutants and infertility. Stat Med 2018; 38:809-827. [PMID: 30328128 DOI: 10.1002/sim.8003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/19/2018] [Revised: 08/17/2018] [Accepted: 09/19/2018] [Indexed: 11/09/2022]
Abstract
In epidemiological studies of environmental pollutants in relation to human infertility, it is common that concentrations of a large number of exposures are collected in both male and female partners. Such a couple-based study poses some new challenges in statistical analysis, especially when the effect of the totality of these chemical mixtures is of interest, because these exposures may have complex nonlinear and nonadditive relationships with the infertility outcome. Kernel machine regression, as a nonparametric regression method, can be applied to model such effects, while accounting for the highly correlated structure within and across exposures. However, it does not consider the partner-specific structure in these study data, which may lead to suboptimal estimation for the effects of environmental exposures. To overcome this limitation, we developed a weighted kernel machine regression method (wKRM) to model the joint effect of partner-specific exposures, in which a linear weight procedure is used to combine the female and male partners' exposure concentrations. The proposed wKRM is not only able to reduce the number of analyzed exposures but also provide an overall importance index of female and male partners' exposures in the risk of infertility. Simulation studies demonstrate good performance of the wKRM in both estimating the joint effects of exposures and fitting the infertility outcome. Application of the proposed method to a prospective infertility study suggests that the male partner's exposure to polychlorinated biphenyls might contribute more toward infertility than the female partner's.
Affiliation(s)
- Wei Zhang
- Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland
| | - Zhen Chen
- Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland
| | - Aiyi Liu
- Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland
| | - Germaine M Buck Louis
- Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland
| |
|
46
|
Wu X, Guan T, Liu DJ, León Novelo LG, Bandyopadhyay D. ADAPTIVE-WEIGHT BURDEN TEST FOR ASSOCIATIONS BETWEEN QUANTITATIVE TRAITS AND GENOTYPE DATA WITH COMPLEX CORRELATIONS. Ann Appl Stat 2018; 12:1558-1582. [PMID: 30214655 DOI: 10.1214/17-aoas1121] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 01/06/2023]
Abstract
High-throughput sequencing has often been used to screen samples from pedigrees or with population structure, producing genotype data with complex correlations rendered from both familial relation and linkage disequilibrium. With such data, it is critical to account for these genotypic correlations when assessing the contribution of variants by gene or pathway. Recognizing the limitations of existing association testing methods, we propose Adaptive-weight Burden Test (ABT), a retrospective, mixed-model test for genetic association of quantitative traits on genotype data with complex correlations. This method makes full use of genotypic correlations across both samples and variants, and adopts "data-driven" weights to improve power. We derive the ABT statistic and its explicit distribution under the null hypothesis, and demonstrate through simulation studies that it is generally more powerful than the fixed-weight burden test and family-based SKAT in various scenarios, controlling for the type I error rate. Further investigation reveals the connection of ABT with kernel tests, as well as the adaptability of its weights to the direction of genetic effects. The application of ABT is illustrated by a whole genome analysis of genes with common and rare variants associated with fasting glucose from the NHLBI "Grand Opportunity" Exome Sequencing Project.
Affiliation(s)
- Xiaowei Wu
- Department of Statistics, Virginia Tech, 250 Drillfield Drive, MC0439, Blacksburg, VA 24061, USA
| | - Ting Guan
- Department of Statistics, Virginia Tech, 250 Drillfield Drive, MC0439, Blacksburg, VA 24061, USA
| | - Dajiang J Liu
- Department of Public Health Sciences, Hershey Institute of Personalized Medicine, Pennsylvania State University College of Medicine, Hershey, PA 17033, USA
| | - Luis G León Novelo
- Department of Biostatistics, School of Public Health, University of Texas Health Science Center, Houston, TX 77030, USA
| | | |
|
47
|
Alam MA, Lin HY, Deng HW, Calhoun VD, Wang YP. A kernel machine method for detecting higher order interactions in multimodal datasets: Application to schizophrenia. J Neurosci Methods 2018; 309:161-174. [PMID: 30184473 DOI: 10.1016/j.jneumeth.2018.08.027] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 03/23/2018] [Revised: 08/12/2018] [Accepted: 08/30/2018] [Indexed: 12/20/2022]
Abstract
BACKGROUND Technological advances are enabling us to collect multimodal datasets at increasing depth and resolution with decreasing labor. Understanding complex interactions among multimodal datasets, however, is challenging. NEW METHOD In this study, we tested the interaction effect of multimodal datasets using a novel method called the kernel machine for detecting higher order interactions among biologically relevant multimodal data. Using a semiparametric method on a reproducing kernel Hilbert space, we formulated the proposed method as a standard mixed-effects linear model and derived a score-based variance component statistic to test higher order interactions between multimodal datasets. RESULTS The method was evaluated using extensive numerical simulation and real data from the Mind Clinical Imaging Consortium with both schizophrenia patients and healthy controls. Our method identified 13 triplets that included 6 gene-derived SNPs, 10 ROIs, and 6 gene-specific DNA methylations that are correlated with the changes in hippocampal volume, suggesting that these triplets may be important for explaining schizophrenia-related neurodegeneration. COMPARISON WITH EXISTING METHOD(S) The performance of the proposed method is compared with the following methods: tests based on only the first or the first few principal components followed by multiple regression, full principal component analysis regression, and the sequence kernel association test. CONCLUSIONS With strong evidence (p-value ≤0.000001), the triplet (MAGI2, CRBLCrus1.L, FBXO28) is a significant biomarker for schizophrenia patients. This novel method is applicable to the study of other disease processes, where multimodal data analysis is a common task.
Affiliation(s)
- Md Ashad Alam
  - Department of Biomedical Engineering, Tulane University, New Orleans, LA 70118, USA
- Hui-Yi Lin
  - Biostatistics Program, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, LA 70112, USA
- Hong-Wen Deng
  - Center for Bioinformatics and Genomics, Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA 70112, USA
- Vince D Calhoun
  - Department of Electrical and Computer Engineering, The University of New Mexico, Albuquerque, NM 87131, USA
- Yu-Ping Wang
  - Department of Biomedical Engineering, Tulane University, New Orleans, LA 70118, USA
|
48
|
Fischer ST, Jiang Y, Broadaway KA, Conneely KN, Epstein MP. Powerful and robust cross-phenotype association test for case-parent trios. Genet Epidemiol 2018; 42:447-458. [PMID: 29460449 PMCID: PMC6013339 DOI: 10.1002/gepi.22116] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 03/03/2017] [Revised: 01/05/2018] [Accepted: 01/08/2018] [Indexed: 12/17/2022]
Abstract
There has been increasing interest in identifying genes within the human genome that influence multiple diverse phenotypes. In the presence of pleiotropy, joint testing of these phenotypes is not only biologically meaningful but also statistically more powerful than univariate analysis of each separate phenotype accounting for multiple testing. Although many cross-phenotype association tests exist, the majority of such methods assume samples composed of unrelated subjects and therefore are not applicable to family-based designs, including the valuable case-parent trio design. In this paper, we describe a robust gene-based association test of multiple phenotypes collected in a case-parent trio study. Our method is based on the kernel distance covariance (KDC) method, where we first construct a similarity matrix for multiple phenotypes and a similarity matrix for genetic variants in a gene; we then test the dependency between the two similarity matrices. The method is applicable to either common variants or rare variants in a gene, and resulting tests from the method are by design robust to confounding due to population stratification. We evaluated our method through simulation studies and observed that the method is substantially more powerful than standard univariate testing of each separate phenotype. We also applied our method to phenotypic and genotypic data collected in case-parent trios as part of the Genetics of Kidneys in Diabetes (GoKinD) study and identified a genome-wide significant gene demonstrating cross-phenotype effects that was not identified using standard univariate approaches.
Affiliation(s)
- S. Taylor Fischer
  - Department of Human Genetics and Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA
- Yunxuan Jiang
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA
- K. Alaine Broadaway
  - Department of Human Genetics and Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA
- Karen N. Conneely
  - Department of Human Genetics and Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA
- Michael P. Epstein
  - Department of Human Genetics and Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA
|
49
|
Rudra P, Broadaway KA, Ware EB, Jhun MA, Bielak LF, Zhao W, Smith JA, Peyser PA, Kardia SL, Epstein MP, Ghosh D. Testing cross-phenotype effects of rare variants in longitudinal studies of complex traits. Genet Epidemiol 2018; 42:320-332. [PMID: 29601641 PMCID: PMC5980726 DOI: 10.1002/gepi.22121] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 06/04/2017] [Revised: 01/19/2018] [Accepted: 02/19/2018] [Indexed: 01/09/2023]
Abstract
Many gene mapping studies of complex traits have identified genes or variants that influence multiple phenotypes. With the advent of next-generation sequencing technology, there has been substantial interest in identifying rare variants in genes that possess cross-phenotype effects. In the presence of such effects, modeling both the phenotypes and rare variants collectively using multivariate models can achieve higher statistical power compared to univariate methods that either model each phenotype separately or perform separate tests for each variant. Several studies collect phenotypic data over time, and using such longitudinal data can further increase the power to detect genetic associations. Although rare-variant approaches exist for testing cross-phenotype effects at a single time point, there is no analogous method for performing such analyses using longitudinal outcomes. To fill this important gap, we propose an extension of the Gene Association with Multiple Traits (GAMuT) test, a method for cross-phenotype analysis of rare variants using a framework based on the distance covariance. The approach allows for both binary and continuous phenotypes and can also adjust for covariates. Our simple adjustment to the GAMuT test allows it to handle longitudinal data and to gain power by exploiting temporal correlation. The approach is computationally efficient and applicable on a genome-wide scale due to the use of a closed-form test whose significance can be evaluated analytically. We use simulated data to demonstrate that our method has favorable power over competing approaches and also apply our approach to exome chip data from the Genetic Epidemiology Network of Arteriopathy.
Affiliation(s)
- Pratyaydipta Rudra
  - Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO
- Erin B. Ware
  - Department of Epidemiology, University of Michigan, Ann Arbor, MI
  - Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI
- Min A. Jhun
  - Department of Epidemiology, University of Michigan, Ann Arbor, MI
- Wei Zhao
  - Department of Epidemiology, University of Michigan, Ann Arbor, MI
- Debashis Ghosh
  - Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO
|
50
|
Abstract
Background Glioma accounts for 80% of malignant brain tumors, but its etiologic determinants remain elusive. Despite genetic susceptibility loci identified by genome-wide association study (GWAS), the agnostic approach leaves open the possibility that other susceptibility genes remain to be discovered. Here we conduct a gene-centric integrative GWAS (iGWAS) of glioma risk that combines transcriptomics and genetics. Methods We synthesized a brain transcriptomics dataset (n = 354), a GWAS dataset (n = 4203), and an advanced glioma tumor transcriptomic dataset (n = 483) to conduct an iGWAS. Using the expression quantitative trait loci (eQTL) dataset, we built models to predict gene expression for the GWAS data, based on eQTL genotypes. With the predicted gene expression, iGWAS analyses were performed using a novel statistical method. Gene signature risk score was constructed using a penalized logistic regression model. Results A total of 30527 transcripts were analyzed using the iGWAS approach. Four novel glioma susceptibility genes were identified with internal and external validation, including DRD5 (P = 3.0 × 10⁻⁷⁹), WDR1 (P = 8.4 × 10⁻⁷⁷), NOMO1 (P = 1.3 × 10⁻²⁵), and PDXDC1 (P = 8.3 × 10⁻²⁴). The genotype-predicted transcription pattern between cases and controls is consistent with that between tumor and its matched normal tissue. The genotype-based 4-gene signature improved the classification between glioma cases and controls based on age, gender, and population stratification, with area under the receiver operating characteristic curve increasing from 0.77 to 0.85 (P = 8.1 × 10⁻²³). Conclusion A new genotype-based gene signature of glioma was identified using a novel iGWAS approach, which integrates multiplatform genomic data as well as different genetic association studies.
Affiliation(s)
- Yen-Tsung Huang
  - Institute of Statistical Science, Academia Sinica, Taipei, Taiwan; Department of Epidemiology; Department of Biostatistics, Brown University, Providence, Rhode Island; Department of Public Health and Community Medicine, Tufts University, Boston, Massachusetts
- Yi Zhang
  - Institute of Statistical Science, Academia Sinica, Taipei, Taiwan; Department of Epidemiology; Department of Biostatistics, Brown University, Providence, Rhode Island; Department of Public Health and Community Medicine, Tufts University, Boston, Massachusetts
- Zhijin Wu
  - Institute of Statistical Science, Academia Sinica, Taipei, Taiwan; Department of Epidemiology; Department of Biostatistics, Brown University, Providence, Rhode Island; Department of Public Health and Community Medicine, Tufts University, Boston, Massachusetts
- Dominique S Michaud
  - Institute of Statistical Science, Academia Sinica, Taipei, Taiwan; Department of Epidemiology; Department of Biostatistics, Brown University, Providence, Rhode Island; Department of Public Health and Community Medicine, Tufts University, Boston, Massachusetts
|