1
Sun W, Jon K, Zhu W. Multiple phenotype association tests based on sliced inverse regression. BMC Bioinformatics 2024; 25:144. [PMID: 38575890 PMCID: PMC10996256 DOI: 10.1186/s12859-024-05731-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2023] [Accepted: 03/05/2024] [Indexed: 04/06/2024] Open
Abstract
BACKGROUND Joint analysis of multiple phenotypes in studies of biological systems such as Genome-Wide Association Studies is critical to revealing the functional interactions between various traits and genetic variants, but growth of data in dimensionality has become a very challenging problem in the widespread use of joint analysis. To handle the excessiveness of variables, we consider the sliced inverse regression (SIR) method. Specifically, we propose a novel SIR-based association test that is robust and powerful in testing the association between multiple predictors and multiple outcomes. RESULTS We conduct simulation studies in both low- and high-dimensional settings with various numbers of Single-Nucleotide Polymorphisms and consider the correlation structure of traits. Simulation results show that the proposed method outperforms the existing methods. We also successfully apply our method to the genetic association study of the ADNI dataset. Both the simulation studies and real data analysis show that the SIR-based association test is valid and achieves a higher efficiency compared with its competitors. CONCLUSION Several scenarios with low- and high-dimensional responses and genotypes are considered in this paper. Our SIR-based method controls the estimated type I error at the pre-specified level α.
Affiliation(s)
- Wenyuan Sun
- Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun, 130024, Jilin, China
- Department of Mathematics, College of Science, Yanbian University, Yanji, 133002, Jilin, China
- Kyongson Jon
- Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun, 130024, Jilin, China
- Faculty of Mathematics, Kim Il Sung University, Pyongyang, 999093, Democratic People's Republic of Korea
- Wensheng Zhu
- Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun, 130024, Jilin, China.
- School of Mathematical Sciences, Harbin Normal University, Harbin, 150025, Heilongjiang, China.
2
Boutry S, Helaers R, Lenaerts T, Vikkula M. Rare variant association on unrelated individuals in case-control studies using aggregation tests: existing methods and current limitations. Brief Bioinform 2023; 24:bbad412. [PMID: 37974506 DOI: 10.1093/bib/bbad412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Revised: 10/14/2023] [Accepted: 10/28/2023] [Indexed: 11/19/2023] Open
Abstract
Over the past years, progress made in next-generation sequencing technologies and bioinformatics has sparked a surge in association studies. In particular, genome-wide association studies (GWASs) have demonstrated their effectiveness in identifying disease associations with common genetic variants. Yet, rare variants can contribute to additional disease risk or trait heterogeneity. Because GWASs are underpowered for detecting association with such variants, numerous statistical methods have been recently proposed. Aggregation tests collapse multiple rare variants within a genetic region (e.g. gene, gene set, genomic loci) to test for association. An increasing number of studies using such methods have successfully identified trait-associated rare variants and led to a better understanding of the underlying disease mechanism. In this review, we compare existing aggregation tests, their statistical features and scope of application, splitting them into the five classical classes: burden, adaptive burden, variance-component, omnibus and other. Finally, we describe some limitations of current aggregation tests, highlighting potential directions for further investigations.
Affiliation(s)
- Simon Boutry
- Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, 1050 Brussels, Belgium
- Raphaël Helaers
- Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
- Tom Lenaerts
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, 1050 Brussels, Belgium
- Machine Learning Group, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Artificial Intelligence laboratory, Vrije Universiteit Brussel, 1050 Brussels, Belgium
- Miikka Vikkula
- Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
- WELBIO department, WEL Research Institute, avenue Pasteur, 6, 1300 Wavre, Belgium
3
Sun H, Wang Y, Xiao Z, Huang X, Wang H, He T, Jiang X. multiMiAT: an optimal microbiome-based association test for multicategory phenotypes. Brief Bioinform 2023; 24:7005163. [PMID: 36702753 DOI: 10.1093/bib/bbad012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2022] [Revised: 12/31/2022] [Accepted: 01/03/2023] [Indexed: 01/28/2023] Open
Abstract
Microbes can affect the metabolism and immunity of human body incessantly, and the dysbiosis of human microbiome drives not only the occurrence but also the progression of disease (i.e. multiple statuses of disease). Recently, microbiome-based association tests have been widely developed to detect the association between the microbiome and host phenotype. However, the existing methods have not achieved satisfactory performance in testing the association between the microbiome and ordinal/nominal multicategory phenotypes (e.g. disease severity and tumor subtype). In this paper, we propose an optimal microbiome-based association test for multicategory phenotypes, namely, multiMiAT. Specifically, under the multinomial logit model framework, we first introduce a microbiome regression-based kernel association test for multicategory phenotypes (multiMiRKAT). As a data-driven optimal test, multiMiAT then integrates multiMiRKAT, score test and MiRKAT-MC to maintain excellent performance in diverse association patterns. Massive simulation experiments prove the success of our method. Furthermore, multiMiAT is also applied to real microbiome data experiments to detect the association between the gut microbiome and clinical statuses of colorectal cancer as well as for diverse statuses of Clostridium difficile infections.
Affiliation(s)
- Han Sun
- Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan 430079, China
- School of Computer Science, Central China Normal University, Wuhan 430079, China
- School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China
- Yue Wang
- Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan 430079, China
- School of Computer Science, Central China Normal University, Wuhan 430079, China
- Zhen Xiao
- Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan 430079, China
- School of Computer Science, Central China Normal University, Wuhan 430079, China
- School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China
- Xiaoyun Huang
- Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan 430079, China
- School of Computer Science, Central China Normal University, Wuhan 430079, China
- Collaborative & Innovative Center for Educational Technology, Central China Normal University, Wuhan 430079, China
- Haodong Wang
- Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan 430079, China
- School of Computer Science, Central China Normal University, Wuhan 430079, China
- Tingting He
- Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan 430079, China
- School of Computer Science, Central China Normal University, Wuhan 430079, China
- National Language Resources Monitoring & Research Center for Network Media, Central China Normal University, Wuhan 430079, China
- Xingpeng Jiang
- Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan 430079, China
- School of Computer Science, Central China Normal University, Wuhan 430079, China
- National Language Resources Monitoring & Research Center for Network Media, Central China Normal University, Wuhan 430079, China
4
Xie H, Cao X, Zhang S, Sha Q. Joint analysis of multiple phenotypes for extremely unbalanced case-control association studies. Genet Epidemiol 2023; 47:185-197. [PMID: 36691904 DOI: 10.1002/gepi.22513] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2022] [Revised: 11/16/2022] [Accepted: 01/11/2023] [Indexed: 01/25/2023]
Abstract
In genome-wide association studies (GWAS) for thousands of phenotypes in biobanks, most binary phenotypes have substantially fewer cases than controls. Many widely used approaches for joint analysis of multiple phenotypes produce inflated type I error rates for such extremely unbalanced case-control phenotypes. In this research, we develop a method to jointly analyze multiple unbalanced case-control phenotypes to circumvent this issue. We first group multiple phenotypes into different clusters based on a hierarchical clustering method, then we merge phenotypes in each cluster into a single phenotype. In each cluster, we use the saddlepoint approximation to estimate the p value of an association test between the merged phenotype and a single nucleotide polymorphism (SNP) which eliminates the issue of inflated type I error rate of the test for extremely unbalanced case-control phenotypes. Finally, we use the Cauchy combination method to obtain an integrated p value for all clusters to test the association between multiple phenotypes and a SNP. We use extensive simulation studies to evaluate the performance of the proposed approach. The results show that the proposed approach can control type I error rate very well and is more powerful than other available methods. We also apply the proposed approach to phenotypes in category IX (diseases of the circulatory system) in the UK Biobank. We find that the proposed approach can identify more significant SNPs than the other viable methods we compared with.
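The final step described in this abstract, combining per-cluster p values with the Cauchy combination method, can be illustrated with a minimal sketch. The function name `cauchy_combine`, the equal default weights, and the example p values are assumptions made for illustration only; this is not the authors' code, and the clustering and saddlepoint steps are not reproduced here.

```python
import math

def cauchy_combine(p_values, weights=None):
    """Combine p-values with the Cauchy combination rule (illustrative sketch)."""
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p < 1.0) for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0 / k] * k  # assumption: equal weight per cluster
    total = sum(weights)
    # Map each p-value to a standard Cauchy variate and take the weighted mean.
    stat = sum(w * math.tan((0.5 - p) * math.pi)
               for w, p in zip(weights, p_values)) / total
    # Upper-tail probability of a standard Cauchy distribution at `stat`.
    return 0.5 - math.atan(stat) / math.pi

# Hypothetical per-cluster p-values, not taken from the paper.
cluster_p = [0.01, 0.5]
combined = cauchy_combine(cluster_p)
print(f"combined p-value: {combined:.4f}")
```

A single p value is returned unchanged by this rule, which gives a quick sanity check of the transformation.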
Affiliation(s)
- Hongjing Xie
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, USA
- Xuewei Cao
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, USA
- Shuanglin Zhang
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, USA
- Qiuying Sha
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, USA
5
Liu Y, Sun W, Hsu L, He Q. Statistical inference for high-dimensional pathway analysis with multiple responses. Comput Stat Data Anal 2022; 169. [PMID: 35125572 PMCID: PMC8813039 DOI: 10.1016/j.csda.2021.107418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Abstract
Pathway analysis, i.e., grouping analysis, has important applications in genomic studies. Existing pathway analysis approaches are mostly focused on a single response and are not suitable for analyzing complex diseases that are often related with multiple response variables. Although a handful of approaches have been developed for multiple responses, these methods are mainly designed for pathways with a moderate number of features. A multi-response pathway analysis approach that is able to conduct statistical inference when the dimension is potentially higher than sample size is introduced. Asymptotical properties of the test statistic are established and theoretical investigation of the statistical power is conducted. Simulation studies and real data analysis show that the proposed approach performs well in identifying important pathways that influence multiple expression quantitative trait loci (eQTL).
6
Rudra P, Baxter R, Hsieh EWY, Ghosh D. Compositional Data Analysis using Kernels in mass cytometry data. Bioinformatics Advances 2022; 2:vbac003. [PMID: 35224501 PMCID: PMC8867823 DOI: 10.1093/bioadv/vbac003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/23/2021] [Revised: 12/06/2021] [Accepted: 01/12/2022] [Indexed: 01/27/2023]
Abstract
MOTIVATION Cell-type abundance data arising from mass cytometry experiments are compositional in nature. Classical association tests do not apply to the compositional data due to their non-Euclidean nature. Existing methods for analysis of cell type abundance data suffer from several limitations for high-dimensional mass cytometry data, especially when the sample size is small. RESULTS We proposed a new multivariate statistical learning methodology, Compositional Data Analysis using Kernels (CODAK), based on the kernel distance covariance (KDC) framework to test the association of the cell type compositions with important predictors (categorical or continuous) such as disease status. CODAK scales well for high-dimensional data and provides satisfactory performance for small sample sizes (n < 25). We conducted simulation studies to compare the performance of the method with existing methods of analyzing cell type abundance data from mass cytometry studies. The method is also applied to a high-dimensional dataset containing different subgroups of populations including Systemic Lupus Erythematosus (SLE) patients and healthy control subjects. AVAILABILITY AND IMPLEMENTATION CODAK is implemented using R. The codes and the data used in this manuscript are available on the web at http://github.com/GhoshLab/CODAK/. CONTACT prudra@okstate.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Pratyaydipta Rudra
- Department of Statistics, Oklahoma State University, Stillwater, OK 74078, USA
- Ryan Baxter
- Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Elena W Y Hsieh
- Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Department of Pediatrics, Section of Allergy and Immunology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Debashis Ghosh
- Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
7
Liu X, Cong X, Li G, Maas K, Chen K. Multivariate log-contrast regression with sub-compositional predictors: Testing the association between preterm infants' gut microbiome and neurobehavioral outcomes. Stat Med 2021; 41:580-594. [PMID: 34897772 DOI: 10.1002/sim.9273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2020] [Revised: 09/25/2021] [Accepted: 11/15/2021] [Indexed: 11/10/2022]
Abstract
To link a clinical outcome with compositional predictors in microbiome analysis, the linear log-contrast model is a popular choice, and the inference procedure for assessing the significance of each covariate is also available. However, with the existence of multiple potentially interrelated outcomes and the information of the taxonomic hierarchy of bacteria, a multivariate analysis method that considers the group structure of compositional covariates and an accompanying group inference method are still lacking. Motivated by a study for identifying the microbes in the gut microbiome of preterm infants that impact their later neurobehavioral outcomes, we formulate a constrained integrative multi-view regression. The neurobehavioral scores form multivariate responses, the log-transformed sub-compositional microbiome data form multi-view feature matrices, and a set of linear constraints on their corresponding sub-coefficient matrices ensures the sub-compositional nature. We assume all the sub-coefficient matrices are possibly of low rank to enable joint selection and inference of sub-compositions/views. We propose a scaled composite nuclear norm penalization approach for model estimation and develop a hypothesis testing procedure through de-biasing to assess the significance of different views. Simulation studies confirm the effectiveness of the proposed procedure. We apply the method to the preterm infant study, and the identified microbes are mostly consistent with existing studies and biological understandings.
Affiliation(s)
- Xiaokang Liu
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Xiaomei Cong
- School of Nursing, University of Connecticut, Storrs, Connecticut, USA
- Gen Li
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Kendra Maas
- Microbial Analysis, Resources, and Services, University of Connecticut, Storrs, Connecticut, USA
- Kun Chen
- Department of Statistics, University of Connecticut, Storrs, Connecticut, USA
8
Kim J, Shen J, Wang A, Mehrotra DV, Ko S, Zhou JJ, Zhou H. VCSEL: Prioritizing SNP-set by penalized variance component selection. Ann Appl Stat 2021; 15:1652-1672. [DOI: 10.1214/21-aoas1491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Juhyun Kim
- Department of Biostatistics, University of California, Los Angeles
- Judong Shen
- Biostatistics and Research Decision Sciences, Merck & Co., Inc
- Anran Wang
- Biostatistics and Research Decision Sciences, Merck & Co., Inc
- Seyoon Ko
- Department of Biostatistics, University of California, Los Angeles
- Jin J. Zhou
- Department of Medicine, University of California, Los Angeles
- Hua Zhou
- Department of Biostatistics, University of California, Los Angeles
9
Tong L, Zhou Y, Guo Y, Ding H, Ji D. Quantitative trait locus mapping analysis of multiple traits when using genotype data with potential errors. PeerJ 2021; 9:e12187. [PMID: 34631317 PMCID: PMC8475548 DOI: 10.7717/peerj.12187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2021] [Accepted: 08/30/2021] [Indexed: 12/03/2022] Open
Abstract
Background Quantitative trait locus (QTL) analysis aims to locate and estimate the effects of the genes influencing quantitative traits and infer the relationship between gene variants and changes in phenotypic characteristics using statistical methods. Some methods have been developed to map QTLs of multiple traits in the case of no genotype error in a given dataset. However, practical genetic data that people use may contain some potential errors because of the limitations of biotechnology. Common genetic data correction methods can only reduce errors, but cannot calculate the degree of error. In this paper, we propose a QTL mapping strategy for multiple traits in the presence of genotype errors. Methods The additive effect, dominant effect, recombination rate, error rate, and other parameters of QTLs can be simultaneously obtained using this new method in the framework of multiple-interval mapping. Results Our simulation results show that the accuracy of parameter estimation can be improved by considering the errors of marker genotypes during the analysis of genetic data. Real data analysis also shows that the new method proposed in this paper can map the QTLs of multiple traits more accurately.
Affiliation(s)
- Liang Tong
- School of Science, Harbin University of Science and Technology, Harbin, P. R. China
- School of Information Engineering, Suihua University, Suihua, P. R. China
- Ying Zhou
- School of Mathematical Sciences, Heilongjiang University and Heilongjiang Provincial Key Laboratory of the Theory and Computation of Complex Systems, Harbin, P. R. China
- Yixing Guo
- Dalian University of Science and Technology, Dalian, P. R. China
- Hui Ding
- School of Information Engineering, Suihua University, Suihua, P. R. China
- Donghai Ji
- School of Science, Harbin University of Science and Technology, Harbin, P. R. China
10
He Q, Liu Y, Liu M, Wu MC, Hsu L. Random effect based tests for multinomial logistic regression in genetic association studies. Genet Epidemiol 2021; 45:736-740. [PMID: 34403161 DOI: 10.1002/gepi.22427] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/03/2021] [Revised: 07/31/2021] [Accepted: 08/01/2021] [Indexed: 11/11/2022]
Affiliation(s)
- Qianchuan He
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Yang Liu
- Department of Mathematics and Statistics, Wright State University, Dayton, Ohio, USA
- Meiling Liu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Michael C Wu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Li Hsu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
11
Pluta D, Shen T, Xue G, Chen C, Ombao H, Yu Z. Ridge-penalized adaptive Mantel test and its application in imaging genetics. Stat Med 2021; 40:5313-5332. [PMID: 34216035 DOI: 10.1002/sim.9127] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/20/2020] [Revised: 06/01/2021] [Accepted: 06/16/2021] [Indexed: 01/23/2023]
Abstract
We propose a ridge-penalized adaptive Mantel test (AdaMant) for evaluating the association of two high-dimensional sets of features. By introducing a ridge penalty, AdaMant tests the association across many metrics simultaneously. We demonstrate how ridge penalization bridges Euclidean and Mahalanobis distances and their corresponding linear models from the perspective of association measurement and testing. This result is not only theoretically interesting but also has important implications in penalized hypothesis testing, especially in high-dimensional settings such as imaging genetics. Applying the proposed method to an imaging genetic study of visual working memory in healthy adults, we identified interesting associations of brain connectivity (measured by electroencephalogram coherence) with selected genetic features.
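One way to see how a ridge penalty can bridge the Euclidean and Mahalanobis distances is the following worked limit. The notation (a sample covariance $\hat{\Sigma}$ assumed invertible, a penalty $\lambda > 0$, and feature vectors $x, y$) is introduced here for illustration and is not necessarily the parameterization used in the paper.

```latex
% Ridge-regularized squared distance (illustrative notation, assumed here)
d_\lambda^2(x, y) = (x - y)^\top \bigl(\hat{\Sigma} + \lambda I\bigr)^{-1} (x - y)

% Small-penalty limit: the Mahalanobis distance
\lim_{\lambda \to 0^{+}} d_\lambda^2(x, y) = (x - y)^\top \hat{\Sigma}^{-1} (x - y)

% Large-penalty limit: since (\hat{\Sigma} + \lambda I)^{-1}
%   = \lambda^{-1} (I + \hat{\Sigma}/\lambda)^{-1} \to \lambda^{-1} I,
% the rescaled distance tends to the squared Euclidean distance
\lim_{\lambda \to \infty} \lambda \, d_\lambda^2(x, y) = \lVert x - y \rVert_2^2
```

Intermediate values of $\lambda$ therefore interpolate between the two metrics, which is consistent with the abstract's description of testing across many metrics simultaneously.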
Affiliation(s)
- Dustin Pluta
- Department of Statistics, University of California, Irvine, Irvine, California, USA
- Tong Shen
- Department of Statistics, University of California, Irvine, Irvine, California, USA
- Gui Xue
- Center for Brain and Learning Science, Beijing Normal University, Beijing, China
- Chuansheng Chen
- Department of Psychology and Social Behavior, University of California, Irvine, Irvine, California, USA
- Hernando Ombao
- Statistics Program, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Zhaoxia Yu
- Department of Statistics, University of California, Irvine, Irvine, California, USA
12
Gong M, Liu P, Sciurba FC, Stojanov P, Tao D, Tseng GC, Zhang K, Batmanghelich K. Unpaired data empowers association tests. Bioinformatics 2021; 37:785-792. [PMID: 33070196 PMCID: PMC8098021 DOI: 10.1093/bioinformatics/btaa886] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2020] [Revised: 09/07/2020] [Accepted: 10/05/2020] [Indexed: 11/25/2022] Open
Abstract
Motivation There is growing interest in the biomedical research community to incorporate retrospective data, available in healthcare systems, to shed light on associations between different biomarkers. Understanding the association between various types of biomedical data, such as genetic, blood biomarkers, imaging, etc. can provide a holistic understanding of human diseases. To formally test a hypothesized association between two types of data in Electronic Health Records (EHRs), one requires a substantial sample size with both data modalities to achieve a reasonable power. Current association test methods only allow using data from individuals who have both data modalities. Hence, researchers cannot take advantage of much larger EHR samples that include individuals with at least one of the data types, which limits the power of the association test. Results We present a new method called the Semi-paired Association Test (SAT) that makes use of both paired and unpaired data. In contrast to classical approaches, incorporating unpaired data allows SAT to produce better control of false discovery and to improve the power of the association test. We study the properties of the new test theoretically and empirically, through a series of simulations and by applying our method to real studies in the context of Chronic Obstructive Pulmonary Disease. We are able to identify an association between the high-dimensional characterization of Computed Tomography chest images and several blood biomarkers as well as the expression of dozens of genes involved in the immune system. Availability and implementation Code is available at https://github.com/batmanlab/Semi-paired-Association-Test. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mingming Gong
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
- Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- School of Mathematics and Statistics, The University of Melbourne, Melbourne, VIC 3010, Australia
- Peng Liu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
- Frank C Sciurba
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
- Petar Stojanov
- Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Dacheng Tao
- Australia School of Computer Science, The University of Sydney, Sydney, NSW 2006, Australia
- George C Tseng
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
- Kun Zhang
- Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Kayhan Batmanghelich
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
13
Zhang B, Chiu CY, Yuan F, Sang T, Cook RJ, Wilson AF, Bailey-Wilson JE, Chew EY, Xiong M, Fan R. Gene-based analysis of bi-variate survival traits via functional regressions with applications to eye diseases. Genet Epidemiol 2021; 45:455-470. [PMID: 33645812 DOI: 10.1002/gepi.22381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2020] [Revised: 01/15/2021] [Accepted: 02/08/2021] [Indexed: 11/12/2022]
Abstract
Genetic studies of two related survival outcomes of a pleiotropic gene are commonly encountered but statistical models to analyze them are rarely developed. To analyze sequencing data, we propose mixed effect Cox proportional hazard models by functional regressions to perform gene-based joint association analysis of two survival traits motivated by our ongoing real studies. These models extend fixed effect Cox models of univariate survival traits by incorporating variations and correlation of multivariate survival traits into the models. The associations between genetic variants and two survival traits are tested by likelihood ratio test statistics. Extensive simulation studies suggest that type I error rates are well controlled and power performances are stable. The proposed models are applied to analyze bivariate survival traits of left and right eyes in the age-related macular degeneration progression.
Affiliation(s)
- Bingsong Zhang
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia, USA
- Chi-Yang Chiu
- Division of Biostatistics, Department of Preventive Medicine, University of Tennessee Health Science Center, Memphis, Tennessee, USA
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Baltimore, Maryland, USA
- Fang Yuan
- Department of Biochemistry and Molecular Biology, School of Basic Medicine, Kunming Medical University, Kunming, People's Republic of China
- Tian Sang
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia, USA
- School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai, China
- Richard J Cook
- Department of Statistics and Actuarial Science, Waterloo, Ontario, Canada
- Alexander F Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Baltimore, Maryland, USA
- Joan E Bailey-Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Baltimore, Maryland, USA
- Emily Y Chew
- Division of Epidemiology and Clinical Applications, National Eye Institute, NIH, Bethesda, Maryland, USA
- Momiao Xiong
- Human Genetics Center, University of Texas-Houston, Houston, Texas, USA
- Ruzong Fan
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia, USA
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Baltimore, Maryland, USA
14
Associating Multivariate Traits with Genetic Variants Using Collapsing and Kernel Methods with Pedigree- or Population-Based Studies. Computational and Mathematical Methods in Medicine 2021; 2021:8812282. [PMID: 33628328 PMCID: PMC7889379 DOI: 10.1155/2021/8812282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2020] [Revised: 01/02/2021] [Accepted: 01/08/2021] [Indexed: 11/18/2022]
Abstract
In genetic association analysis, several relevant phenotypes or multivariate traits with different types of components are usually collected to study complex or multifactorial diseases. Over the past few years, jointly testing for association between multivariate traits and multiple genetic variants has become more popular because it can increase statistical power to identify causal genes in pedigree- or population-based studies. However, most of the existing methods mainly focus on testing genetic variants associated with multiple continuous phenotypes. In this investigation, we develop a framework for identifying the pleiotropic effects of genetic variants on multivariate traits by using collapsing and kernel methods with pedigree- or population-structured data. The proposed framework is applicable to the burden test, the kernel test, and the omnibus test for autosomes and the X chromosome. The proposed multivariate trait association methods can accommodate continuous phenotypes or binary phenotypes and further can adjust for covariates. Simulation studies show that the performance of our methods is satisfactory with respect to the empirical type I error rates and power rates in comparison with the existing methods.
|
15
|
Deng Y, Wu S, Fan H. Genome-wide pathway-based quantitative multiple phenotypes analysis. PLoS One 2020; 15:e0240910. [PMID: 33175855 PMCID: PMC7657528 DOI: 10.1371/journal.pone.0240910] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2020] [Accepted: 10/06/2020] [Indexed: 11/18/2022] Open
Abstract
For complex diseases, genome-wide pathway association studies have become increasingly promising. Currently, however, pathway-based association analyses mainly focus on a single phenotype, which may be insufficient to describe complex diseases and physiological processes. This work proposes a combination model to evaluate the association between a pathway and multiple phenotypes and to reduce the run time based on asymptotic results. For a single phenotype, we propose a semi-supervised maximum kernel-based U-statistics (mSKU) method to assess the pathway-based association. For multiple phenotypes, we propose the Fisher combination function with dependent phenotypes (FC) to transform the p-values between the pathway and each marginal phenotype individually to achieve pathway-based multiple-phenotype analysis. With real data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study and the Human Liver Cohort (HLC) study, the FC-mSKU method allows us to specify which pathways are specific to a single phenotype or contribute to common genetic constructions of multiple phenotypes. If we focus only on single-phenotype tests, we may miss some findings for etiology studies. Through extensive simulation studies, the FC-mSKU method demonstrates its advantages compared with its counterparts.
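For orientation, the classical Fisher combination that the FC step builds on can be sketched in a few lines of Python. This is only a minimal sketch under the assumption that the per-phenotype p-values are independent; the adjustment for dependent phenotypes that the abstract describes is not reproduced here, and the function name `fisher_combine` is hypothetical.

```python
import math


def fisher_combine(p_values):
    """Classical Fisher combination, assuming INDEPENDENT p-values.

    Returns (statistic, combined_p). The statistic -2 * sum(log p_i) is
    chi-square with 2k degrees of freedom under the null hypothesis.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)

    # For even degrees of freedom 2k the chi-square survival function has a
    # closed form: exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = stat / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return stat, min(1.0, math.exp(-half) * total)


# Two marginal p-values of 0.05 combine to a smaller joint p-value.
stat, p = fisher_combine([0.05, 0.05])
print(f"statistic = {stat:.4f}, combined p = {p:.5f}")
```

With correlated phenotypes the chi-square reference distribution no longer holds, which is why a dependence-aware variant such as the one the abstract names is needed in practice.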
Affiliation(s)
- Yamin Deng
- Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Shiman Wu
- Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China
- Huifang Fan
- Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China
|
16
|
Gao C, Sha Q, Zhang S, Zhang K. MF-TOWmuT: Testing an optimally weighted combination of common and rare variants with multiple traits using family data. Genet Epidemiol 2020; 45:64-81. [PMID: 33047835 DOI: 10.1002/gepi.22355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2020] [Revised: 08/03/2020] [Accepted: 08/18/2020] [Indexed: 11/11/2022]
Abstract
With rapid advancements in sequencing technologies and the accumulation of electronic health records, a large number of genetic variants and multiple correlated human complex traits have become available in many genetic association studies. Thus, it becomes necessary and important to develop new methods that can jointly analyze the association between multiple genetic variants and multiple traits. Compared with methods that only use a single marker or trait, the joint analysis of multiple genetic variants and multiple traits is more powerful since such an analysis can fully incorporate the correlation structure of genetic variants and/or traits and their mutual dependence patterns. However, most existing methods that simultaneously analyze multiple genetic variants and multiple traits are only applicable to unrelated samples. We develop a new method called MF-TOWmuT to detect association of multiple phenotypes and multiple genetic variants in a genomic region with family samples. MF-TOWmuT is based on an optimally weighted combination of variants. Our method can be applied to both rare and common variants and both qualitative and quantitative traits. Our simulation results show that (1) the type I error of MF-TOWmuT is preserved; (2) MF-TOWmuT outperforms two existing methods, the Multiple Family-based Quasi-Likelihood Score Test and the Multivariate Family-based Rare Variant Association Test, in terms of power. We also illustrate the usefulness of MF-TOWmuT by analyzing genotypic and phenotypic data from the Genetics of Kidneys in Diabetes study. An R program is available at https://github.com/gaochengPRC/MF-TOWmuT.
Affiliation(s)
- Cheng Gao
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, USA
- Qiuying Sha
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, USA
- Shuanglin Zhang
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, USA
- Kui Zhang
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, USA
|
17
|
Guo B, Wu B. Integrate multiple traits to detect novel trait-gene association using GWAS summary data with an adaptive test approach. Bioinformatics 2020; 35:2251-2257. [PMID: 30476000 DOI: 10.1093/bioinformatics/bty961] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 06/01/2018] [Revised: 10/30/2018] [Accepted: 11/22/2018] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Genetics holds great promise for precision medicine by tailoring treatment to the individual patient based on their genetic profiles. Toward this goal, many large-scale genome-wide association studies (GWAS) have been performed in the last decade to identify genetic variants associated with various traits and diseases. They have successfully identified tens of thousands of disease-related variants. However, they have explained only a small proportion of the overall trait heritability for most traits and are of very limited clinical use. This is partly owing to the small effect sizes of most genetic variants, and the common practice of testing association between one trait and one genetic variant at a time in most GWAS, even when multiple related traits are often measured for each individual. Increasing evidence suggests that many genetic variants can influence multiple traits simultaneously, and we can gain more power by testing association of multiple traits simultaneously. It is appealing to develop novel multi-trait association test methods that need only GWAS summary data, since it is generally very hard to access the individual-level GWAS phenotype and genotype data. RESULTS Many existing GWAS summary data-based association test methods have relied on ad hoc approaches or crude Monte Carlo approximations. In this article, we develop rigorous statistical methods for efficient and powerful multi-trait association testing. We develop robust and efficient methods to accurately estimate the marginal trait correlation matrix using only GWAS summary data. We construct the principal component (PC)-based association test from the summary statistics. The PC-based test has optimal power when the underlying multi-trait signal can be captured by the first PC, and otherwise it will have suboptimal performance. We develop an adaptive test by optimally weighting the PC-based test and the omnibus chi-square test to achieve robust performance under various scenarios. We develop efficient numerical algorithms to compute the analytical P-values for all the proposed tests without the need for Monte Carlo sampling. We illustrate the utility of the proposed methods through application to the GWAS meta-analysis summary data for multiple lipids and glycemic traits. We identify multiple novel loci that were missed by individual trait-based association tests. AVAILABILITY AND IMPLEMENTATION All the proposed methods are implemented in an R package available at http://www.github.com/baolinwu/MTAR. The developed R programs are extremely efficient: it takes less than 2 min to compute the list of genome-wide significant single nucleotide polymorphisms (SNPs) for all proposed multi-trait tests for the lipids GWAS summary data with 2.5 million SNPs on a single Linux desktop. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
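The idea of a first-PC test on summary statistics can be illustrated with a small Python sketch. It assumes one common formulation, not necessarily the exact statistic of the MTAR package: given a vector of per-trait z-scores `z` for one SNP and a trait correlation matrix `R`, the z-scores are projected onto the leading eigenvector `u` of `R`, and `(u'z)^2 / lambda_1` is referred to a chi-square distribution with 1 degree of freedom. The names `first_pc` and `pc_test` are hypothetical, and the adaptive weighting with the omnibus test is not reproduced.

```python
import math


def first_pc(R, iters=500):
    """Leading eigenvalue/eigenvector of a symmetric PSD matrix (power iteration)."""
    k = len(R)
    # Asymmetric start vector, so it is unlikely to be orthogonal to the top PC.
    v = [1.0 / (i + 1) for i in range(k)]
    for _ in range(iters):
        w = [sum(R[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Rv = [sum(R[i][j] * v[j] for j in range(k)) for i in range(k)]
    lam = sum(v[i] * Rv[i] for i in range(k))  # Rayleigh quotient
    return lam, v


def pc_test(z, R):
    """First-PC statistic (u'z)^2 / lambda_1 and its chi-square(1) p-value."""
    lam, u = first_pc(R)
    proj = sum(ui * zi for ui, zi in zip(u, z))
    stat = proj * proj / lam
    # P(chi-square_1 > x) = erfc(sqrt(x / 2))
    return stat, math.erfc(math.sqrt(stat / 2.0))


R = [[1.0, 0.5], [0.5, 1.0]]
print(pc_test([2.0, 2.0], R))   # signal aligned with the first PC
print(pc_test([2.0, -2.0], R))  # signal orthogonal to the first PC
```

The second call shows the weakness the abstract points to: a signal orthogonal to the first PC yields a statistic near zero, which is the motivation for combining the PC-based test with an omnibus test.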
|
18
|
Deng Y, He T, Fang R, Li S, Cao H, Cui Y. Genome-Wide Gene-Based Multi-Trait Analysis. Front Genet 2020; 11:437. [PMID: 32508874 PMCID: PMC7248273 DOI: 10.3389/fgene.2020.00437] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 02/24/2020] [Accepted: 04/08/2020] [Indexed: 11/29/2022] Open
Abstract
Genome-wide association studies focusing on a single phenotype have been broadly conducted to identify genetic variants associated with a complex disease. The commonly applied single variant analysis is limited by failing to consider the complex interactions between variants, which motivated the development of association analyses focusing on genes or gene sets. Moreover, when multiple correlated phenotypes are available, methods based on a multi-trait analysis can improve the association power. However, most currently available multi-trait analyses are single variant-based analyses; thus, they have limited power when disease variants function as a group in a gene or a gene set. In this work, we propose a genome-wide gene-based multi-trait analysis method by considering genes as testing units. For a given phenotype, we adopt a rapid and powerful kernel-based testing method which can evaluate the joint effect of multiple variants within a gene. The joint effect, either linear or nonlinear, is captured through kernel functions. Given a series of candidate kernel functions, we propose an omnibus test strategy to integrate the test results based on different candidate kernels. A p-value combination method is then applied to integrate dependent p-values to assess the association between a gene and multiple correlated phenotypes. Simulation studies show reasonable type I error control and excellent power of the proposed method compared to its counterparts. We further show the utility of the method by applying it to two data sets, the Human Liver Cohort and the Alzheimer's Disease Neuroimaging Initiative data set, in which novel genes are identified. Our method has broad applications in other fields in which the interest is to evaluate the joint effect (linear or nonlinear) of a set of variants.
Affiliation(s)
- Yamin Deng
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Tao He
- Department of Mathematics, San Francisco State University, San Francisco, CA, United States
- Ruiling Fang
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Shaoyu Li
- Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC, United States
- Hongyan Cao
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Yuehua Cui
- Department of Statistics and Probability, Michigan State University, East Lansing, MI, United States
|
19
|
Guo B, Wu B. Powerful and efficient SNP-set association tests across multiple phenotypes using GWAS summary data. Bioinformatics 2020; 35:1366-1372. [PMID: 30239606 DOI: 10.1093/bioinformatics/bty811] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/03/2018] [Revised: 08/29/2018] [Accepted: 09/18/2018] [Indexed: 01/09/2023] Open
Abstract
MOTIVATION Many GWAS conducted in the past decade have identified tens of thousands of disease-related variants, which in total explained only part of the heritability for most traits. There remain many more genetic variants with small effect sizes to be discovered. This has motivated the development of sequencing studies with larger sample sizes and increased resolution of genotyped variants, e.g., the ongoing NHLBI Trans-Omics for Precision Medicine (TOPMed) whole genome sequencing project. An alternative approach is the development of novel and more powerful statistical methods. The current dominating approach in the field of GWAS analysis is the "single trait single variant" association test, despite the fact that most GWAS are conducted in deeply-phenotyped cohorts with many correlated traits measured. In this paper, we aim to develop rigorous methods that integrate multiple correlated traits and multiple variants to improve the power to detect novel variants. In recognition of the difficulty of accessing raw genotype and phenotype data due to privacy and logistic concerns, we develop methods that are applicable to publicly available GWAS summary data. RESULTS We build rigorous statistical models for GWAS summary statistics to motivate novel multi-trait SNP-set association tests, including the variance component test, the burden test and their adaptive test, and develop efficient numerical algorithms to quickly compute their analytical P-values. We implement the proposed methods in an open source R package. We conduct thorough simulation studies to verify that the proposed methods rigorously control type I errors at the genome-wide significance level, and further demonstrate their utility via comprehensive analysis of GWAS summary data for multiple lipid traits and glycemic traits. We identified many novel loci that were not detected by the individual trait-based GWAS analysis. AVAILABILITY AND IMPLEMENTATION We have implemented the proposed methods in an R package freely available at http://www.github.com/baolinwu/MSKAT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bin Guo
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Baolin Wu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
|
20
|
Martinez K, Maity A, Yolken RH, Sullivan PF, Tzeng JY. Robust kernel association testing (RobKAT). Genet Epidemiol 2020; 44:272-282. [PMID: 31943371 DOI: 10.1002/gepi.22280] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/22/2019] [Revised: 12/18/2019] [Accepted: 12/23/2019] [Indexed: 12/25/2022]
Abstract
Testing the association between single-nucleotide polymorphism (SNP) effects and a response is often carried out through kernel machine methods based on least squares, such as the sequence kernel association test (SKAT). However, these least-squares procedures are designed for a normally distributed conditional response, which may not apply. Other robust procedures such as the quantile regression kernel machine (QRKM) restrict the choice of the loss function and only allow inference on conditional quantiles. We propose a general and robust kernel association test that allows a flexible choice of the loss function, makes no distributional assumptions, and has SKAT and QRKM as special cases. We evaluate our proposed robust association test (RobKAT) across various data distributions through a simulation study. When errors are normally distributed, RobKAT controls type I error and shows comparable power with SKAT. In all other distributional settings investigated, our robust test has similar or greater power than SKAT. Finally, we apply our robust testing method to data from the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) clinical trial to detect associations between selected genes including the major histocompatibility complex (MHC) region on chromosome six and neurotropic herpesvirus antibody levels in schizophrenia patients. RobKAT detected significant association with four SNP sets (HST1H2BJ, MHC, POM12L2, and SLC17A1), three of which were undetected by SKAT.
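The least-squares kernel statistic that such methods start from can be sketched as follows. This is a simplified illustration, not the RobKAT procedure: it assumes a weighted linear kernel, no covariates (residuals are taken around the sample mean), and a permutation p-value as a stand-in for the analytic mixture-of-chi-squares p-value used by SKAT-type tests. The function names and the toy data are hypothetical.

```python
import random


def linear_kernel_stat(y, G, weights=None):
    """Score-type statistic Q = r' G W G' r with residuals r = y - mean(y).

    y: list of n responses; G: n x p genotype matrix (list of rows);
    weights: optional per-variant weights (diagonal of W), default all 1.
    """
    n, p = len(y), len(G[0])
    w = weights if weights is not None else [1.0] * p
    y_bar = sum(y) / n
    r = [yi - y_bar for yi in y]
    q = 0.0
    for j in range(p):
        s_j = sum(G[i][j] * r[i] for i in range(n))  # per-variant score
        q += w[j] * s_j * s_j
    return q


def permutation_p(y, G, n_perm=999, seed=0):
    """Permutation p-value: shuffle y to break any genotype association."""
    rng = random.Random(seed)
    observed = linear_kernel_stat(y, G)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if linear_kernel_stat(y_perm, G) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: 8 subjects, 2 variants coded as minor-allele counts.
y = [0.1, 0.4, 1.9, 2.2, 0.3, 2.5, 0.2, 1.8]
G = [[0, 0], [0, 1], [1, 0], [2, 0], [0, 0], [2, 1], [0, 1], [1, 0]]
print(linear_kernel_stat(y, G), permutation_p(y, G))
```

Because the statistic is built from squared residual scores, a few extreme responses can dominate it, which is the sensitivity that a robust loss function is meant to reduce.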
Affiliation(s)
- Kara Martinez
- Department of Statistics, North Carolina State University, Raleigh, North Carolina
- Arnab Maity
- Department of Statistics, North Carolina State University, Raleigh, North Carolina
- Robert H Yolken
- Stanley Neurovirology Laboratory, Johns Hopkins School of Medicine, Baltimore, Maryland
- Patrick F Sullivan
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
- Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
- Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
|
21
|
Konigorski S, Yilmaz YE, Janke J, Bergmann MM, Boeing H, Pischon T. Powerful rare variant association testing in a copula-based joint analysis of multiple phenotypes. Genet Epidemiol 2019; 44:26-40. [PMID: 31732979 DOI: 10.1002/gepi.22265] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 04/23/2019] [Revised: 08/13/2019] [Accepted: 09/16/2019] [Indexed: 12/16/2022]
Abstract
In genetic association studies of rare variants, the low power of association tests is one of the main challenges. In this study, we propose a new single-marker association test called C-JAMP (Copula-based Joint Analysis of Multiple Phenotypes), which is based on a joint model of multiple phenotypes given genetic markers and other covariates. We evaluated its performance and compared its empirical type I error and power with existing univariate and multivariate single-marker and multi-marker rare-variant tests in extensive simulation studies. C-JAMP yielded unbiased genetic effect estimates and valid type I errors with an adjusted test statistic. When strongly dependent traits were jointly analyzed, C-JAMP had the highest power in all scenarios except when a high percentage of variants were causal with moderate/small effect sizes. When traits with weak or moderate dependence were analyzed, whether C-JAMP or competing approaches had higher power depended on the effect size. When C-JAMP was applied with a misspecified copula function, it still achieved high power in some of the scenarios considered. In a real-data application, we analyzed sequencing data using C-JAMP and performed the first genome-wide association studies of high-molecular-weight and medium-molecular-weight adiponectin plasma concentrations. C-JAMP identified 20 rare variants with p-values smaller than $10^{-5}$, while all other tests resulted in the identification of fewer variants with higher p-values. In summary, the results indicate that C-JAMP is a powerful, flexible, and robust method for association studies, and we identified novel candidate markers for adiponectin. C-JAMP is implemented as an R package and freely available from https://cran.r-project.org/package=CJAMP.
Affiliation(s)
- Stefan Konigorski
- Molecular Epidemiology Research Group, Max Delbrück Center (MDC) for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Digital Health and Machine Learning Research Group, Hasso Plattner Institute for Digital Engineering, Potsdam, Germany
- Yildiz E Yilmaz
- Department of Mathematics and Statistics, Memorial University of Newfoundland, St. John's, NL, Canada
- Discipline of Genetics, Faculty of Medicine, Memorial University of Newfoundland, St. John's, NL, Canada
- Discipline of Medicine, Faculty of Medicine, Memorial University of Newfoundland, St. John's, NL, Canada
- Jürgen Janke
- Molecular Epidemiology Research Group, Max Delbrück Center (MDC) for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Manuela M Bergmann
- Department of Epidemiology, German Institute of Human Nutrition Potsdam-Rehbrücke (DIfE), Nuthetal, Germany
- Heiner Boeing
- Department of Epidemiology, German Institute of Human Nutrition Potsdam-Rehbrücke (DIfE), Nuthetal, Germany
- Tobias Pischon
- Molecular Epidemiology Research Group, Max Delbrück Center (MDC) for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Charité-Universitätsmedizin Berlin, Berlin, Germany
- DZHK (German Center for Cardiovascular Research), partner site Berlin, Berlin, Germany
|
22
|
Schaid DJ, Tong X, Batzler A, Sinnwell JP, Qing J, Biernacka JM. Multivariate generalized linear model for genetic pleiotropy. Biostatistics 2019; 20:111-128. [PMID: 29267957 DOI: 10.1093/biostatistics/kxx067] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 01/27/2017] [Accepted: 11/05/2017] [Indexed: 02/07/2023] Open
Abstract
When a single gene influences more than one trait, known as pleiotropy, it is important to detect pleiotropy to improve the biological understanding of a gene. This can lead to improved screening, diagnosis, and treatment of diseases. Yet, most current multivariate methods to evaluate pleiotropy test the null hypothesis that none of the traits are associated with a variant; departures from the null could be driven by just one associated trait. A formal test of pleiotropy should assume a null hypothesis that one or fewer traits are associated with a genetic variant. We recently developed statistical methods to analyze pleiotropy for quantitative traits having a multivariate normal distribution. We now extend this approach to traits that can be modeled by generalized linear models, such as analysis of binary, ordinal, or quantitative traits, or a mixture of these types of traits. Based on methods from estimating equations, we developed a new test for pleiotropy. We then extended the testing framework to a sequential approach to test the null hypothesis that $k+1$ traits are associated, given that the null of $k$ associated traits was rejected. This provides a testing framework to determine the number of traits associated with a genetic variant, as well as which traits, while accounting for correlations among the traits. By simulations, we illustrate the Type-I error rate and power of our new methods, describe how they are influenced by sample size, the number of traits, and the trait correlations, and apply the new methods to a genome-wide association study of multivariate traits measuring symptoms of major depression. Our new approach provides a quantitative assessment of pleiotropy, enhancing current analytic practice.
Affiliation(s)
- Daniel J Schaid
- Department of Health Sciences Research, Mayo Clinic, Harwick 775, 200 First ST SW, Rochester, MN, USA
- Xingwei Tong
- School of Statistics, Beijing Normal University, Beijing, China
- Anthony Batzler
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Jason P Sinnwell
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Jiang Qing
- School of Statistics, Beijing Normal University, Beijing, China
|
23
|
Masotti M, Guo B, Wu B. Pleiotropy informed adaptive association test of multiple traits using genome-wide association study summary data. Biometrics 2019; 75:1076-1085. [PMID: 31021400 DOI: 10.1111/biom.13076] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 11/07/2017] [Accepted: 04/16/2019] [Indexed: 12/17/2022]
Abstract
Genetic variants associated with disease outcomes can be used to develop personalized treatment. To reach this precision medicine goal, hundreds of large-scale genome-wide association studies (GWAS) have been conducted in the past decade to search for promising genetic variants associated with various traits. They have successfully identified tens of thousands of disease-related variants. However, in total these identified variants explain only part of the variation for most complex traits. There remain many genetic variants with small effect sizes to be discovered, which calls for the development of (a) GWAS with more samples and more comprehensively genotyped variants, for example, the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program is planning to conduct whole genome sequencing on over 100 000 individuals; and (b) novel and more powerful statistical analysis methods. The current dominating GWAS analysis approach is the "single trait" association test, despite the fact that many GWAS are conducted in deeply phenotyped cohorts including many correlated and well-characterized outcomes, which can help improve the power to detect novel variants if properly analyzed, as suggested by increasing evidence that pleiotropy, where a genetic variant affects multiple traits, is the norm in genome-phenome associations. We aim to develop pleiotropy informed powerful association test methods across multiple traits for GWAS. Since it is generally very hard to access individual-level GWAS phenotype and genotype data for those existing GWAS, due to privacy concerns and various logistical considerations, we develop rigorous statistical methods for pleiotropy informed adaptive multitrait association test methods that need only summary association statistics publicly available from most GWAS. We first develop a pleiotropy test, which has powerful performance for truly pleiotropic variants but is sensitive to the pleiotropy assumption. 
We then develop a pleiotropy informed adaptive test that has robust and powerful performance under various genetic models. We develop accurate and efficient numerical algorithms to compute the analytical P-value for the proposed adaptive test without the need of resampling or permutation. We illustrate the performance of proposed methods through application to joint association test of GWAS meta-analysis summary data for several glycemic traits. Our proposed adaptive test identified several novel loci missed by individual trait based GWAS meta-analysis. All the proposed methods are implemented in a publicly available R package.
Affiliation(s)
- Maria Masotti
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
- Bin Guo
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
- Baolin Wu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
|
24
|
Yan Q, Fang Z, Chen W. KMgene: a unified R package for gene-based association analysis for complex traits. Bioinformatics 2019; 34:2144-2146. [PMID: 29438558 PMCID: PMC6246171 DOI: 10.1093/bioinformatics/bty066] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 09/29/2017] [Accepted: 02/08/2018] [Indexed: 11/29/2022] Open
Abstract
SUMMARY In this report, we introduce an R package KMgene for performing gene-based association tests for familial, multivariate or longitudinal traits using kernel machine (KM) regression under a generalized linear mixed model framework. Extensive simulations were performed to evaluate the validity of the approaches implemented in KMgene. AVAILABILITY AND IMPLEMENTATION http://cran.r-project.org/web/packages/KMgene. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Qi Yan
- Division of Pulmonary Medicine, Allergy and Immunology, Department of Pediatrics, Children's Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, PA, USA
- Zhou Fang
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, USA
- Wei Chen
- Division of Pulmonary Medicine, Allergy and Immunology, Department of Pediatrics, Children's Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA, USA
|
25
|
Multivariate association test for rare variants controlling for cryptic and family relatedness. CAN J STAT 2019. [DOI: 10.1002/cjs.11475] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/07/2022]
|
26
|
Dutta D, Scott L, Boehnke M, Lee S. Multi-SKAT: General framework to test for rare-variant association with multiple phenotypes. Genet Epidemiol 2019; 43:4-23. [PMID: 30298564 PMCID: PMC6330125 DOI: 10.1002/gepi.22156] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.8] [Received: 05/03/2018] [Revised: 07/12/2018] [Accepted: 07/15/2018] [Indexed: 12/13/2022]
Abstract
In genetic association analysis, a joint test of multiple distinct phenotypes can increase power to identify sets of trait-associated variants within genes or regions of interest. Existing multiphenotype tests for rare variants make specific assumptions about the patterns of association with underlying causal variants, and the violation of these assumptions can reduce power to detect association. Here, we develop a general framework for testing pleiotropic effects of rare variants on multiple continuous phenotypes using multivariate kernel regression (Multi-SKAT). Multi-SKAT models effect sizes of variants on the phenotypes through a kernel matrix and performs a variance component test of association. We show that many existing tests are equivalent to specific choices of kernel matrices within the Multi-SKAT framework. To increase power of detecting association across tests with different kernel matrices, we developed a fast and accurate approximation of the significance of the minimum observed P value across tests. To account for related individuals, our framework uses random effects for the kinship matrix. Using simulated data and amino acid and exome-array data from the METabolic Syndrome In Men (METSIM) study, we show that Multi-SKAT can improve power over the single-phenotype SKAT-O test and existing multiple-phenotype tests, while maintaining the Type I error rate.
Affiliation(s)
- Diptavo Dutta
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, USA
- Laura Scott
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, USA
- Michael Boehnke
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, USA
- Seunggeun Lee
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, USA
|
27
|
Larson NB, Chen J, Schaid DJ. A review of kernel methods for genetic association studies. Genet Epidemiol 2019; 43:122-136. [PMID: 30604442 DOI: 10.1002/gepi.22180] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 08/27/2018] [Revised: 11/09/2018] [Accepted: 11/26/2018] [Indexed: 12/17/2022]
Abstract
Evaluating the association of multiple genetic variants with a trait of interest by use of kernel-based methods has made a significant impact on how genetic association analyses are conducted. An advantage of kernel methods is that they tend to be robust when the genetic variants have effects that are a mixture of positive and negative effects, as well as when there is a small fraction of causal variants. Another advantage is that kernel methods fit within the framework of mixed models, providing flexible ways to adjust for additional covariates that influence traits. Herein, we review the basic ideas behind the use of kernel methods for genetic association analysis as well as recent methodological advancements for different types of traits, multivariate traits, pedigree data, and longitudinal data. Finally, we discuss opportunities for future research.
Affiliation(s)
- Nicholas B Larson
  - Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
- Jun Chen
  - Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
- Daniel J Schaid
  - Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
|
28
|
Fischer ST, Jiang Y, Broadaway KA, Conneely KN, Epstein MP. Powerful and robust cross-phenotype association test for case-parent trios. Genet Epidemiol 2018; 42:447-458. [PMID: 29460449 PMCID: PMC6013339 DOI: 10.1002/gepi.22116] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 03/03/2017] [Revised: 01/05/2018] [Accepted: 01/08/2018] [Indexed: 12/17/2022]
Abstract
There has been increasing interest in identifying genes within the human genome that influence multiple diverse phenotypes. In the presence of pleiotropy, joint testing of these phenotypes is not only biologically meaningful but also statistically more powerful than univariate analysis of each separate phenotype accounting for multiple testing. Although many cross-phenotype association tests exist, the majority of such methods assume samples composed of unrelated subjects and therefore are not applicable to family-based designs, including the valuable case-parent trio design. In this paper, we describe a robust gene-based association test of multiple phenotypes collected in a case-parent trio study. Our method is based on the kernel distance covariance (KDC) method, where we first construct a similarity matrix for multiple phenotypes and a similarity matrix for genetic variants in a gene; we then test the dependency between the two similarity matrices. The method is applicable to either common variants or rare variants in a gene, and resulting tests from the method are by design robust to confounding due to population stratification. We evaluated our method through simulation studies and observed that the method is substantially more powerful than standard univariate testing of each separate phenotype. We also applied our method to phenotypic and genotypic data collected in case-parent trios as part of the Genetics of Kidneys in Diabetes (GoKinD) study and identified a genome-wide significant gene demonstrating cross-phenotype effects that was not identified using standard univariate approaches.
Affiliation(s)
- S. Taylor Fischer
  - Department of Human Genetics and Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA
- Yunxuan Jiang
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA
- K. Alaine Broadaway
  - Department of Human Genetics and Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA
- Karen N. Conneely
  - Department of Human Genetics and Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA
- Michael P. Epstein
  - Department of Human Genetics and Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA
|
29
|
Wei C, Lu Q. A generalized association test based on U statistics. Bioinformatics 2018; 33:1963-1971. [PMID: 28334117 DOI: 10.1093/bioinformatics/btx103] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 11/01/2016] [Accepted: 02/15/2017] [Indexed: 11/13/2022] Open
Abstract
Motivation: Second generation sequencing technologies are being increasingly used for genetic association studies, where the main research interest is to identify sets of genetic variants that contribute to various phenotypes. The phenotype can be univariate disease status, multivariate responses and even high-dimensional outcomes. Considering the genotype and phenotype as two complex objects, this also poses a general statistical problem of testing association between complex objects. Results: We here proposed a similarity-based test, generalized similarity U (GSU), that can test the association between complex objects. We first studied the theoretical properties of the test in a general setting and then focused on the application of the test to sequencing association studies. Based on theoretical analysis, we proposed to use Laplacian kernel-based similarity for GSU to boost power and enhance robustness. Through simulation, we found that GSU did have advantages over existing methods in terms of power and robustness. We further performed a whole genome sequencing (WGS) scan for Alzheimer's Disease Neuroimaging Initiative data, identifying three genes, APOE, APOC1 and TOMM40, associated with imaging phenotype. Availability and Implementation: We developed a C++ package for analysis of WGS data using GSU. The source codes can be downloaded at https://github.com/changshuaiwei/gsu. Contact: weichangshuai@gmail.com; qlu@epi.msu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Changshuai Wei
  - Department of Biostatistics and Epidemiology, University of North Texas Health Science Center, Fort Worth, TX 76107
- Qing Lu
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
|
30
|
Davenport CA, Maity A, Sullivan PF, Tzeng JY. A Powerful Test for SNP Effects on Multivariate Binary Outcomes using Kernel Machine Regression. Statistics in Biosciences 2018; 10:117-138. [PMID: 30420901 PMCID: PMC6226013 DOI: 10.1007/s12561-017-9189-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 01/05/2016] [Revised: 12/20/2016] [Accepted: 03/15/2017] [Indexed: 10/19/2022]
Abstract
Evaluating multiple binary outcomes is common in genetic studies of complex diseases. These outcomes are often correlated because they are collected from the same individual and they may share common marker effects. In this paper, we propose a procedure to test for the effect of a SNP-set on multiple, possibly correlated, binary responses. We develop a score-based test using a nonparametric modeling framework that jointly models the global effect of the marker set. We account for the nonlinear effects and potentially complicated interaction between markers using reproducing kernels. Our testing procedure only requires estimation under the null hypothesis, and we use multivariate generalized estimating equations (GEEs) to estimate the model components to account for the correlation among the outcomes. We evaluate the finite sample performance of our test via a simulation study and demonstrate our methods using the CATIE antibody study data and the CoLaus Study data.
Affiliation(s)
- Clemontina A Davenport
  - Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC 27707, USA
- Arnab Maity
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Patrick F Sullivan
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Jung-Ying Tzeng
  - Department of Statistics, Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695, USA
  - Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
  - Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
|
31
|
He Q, Liu Y, Peters U, Hsu L. Multivariate association analysis with somatic mutation data. Biometrics 2018; 74:176-184. [PMID: 28722765 PMCID: PMC5967890 DOI: 10.1111/biom.12745] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/01/2016] [Revised: 04/01/2017] [Accepted: 05/01/2017] [Indexed: 12/21/2022]
Abstract
Somatic mutations are the driving forces for tumor development, and recent advances in cancer genome sequencing have made it feasible to evaluate the association between somatic mutations and cancer-related traits in large sample sizes. However, despite increasingly large sample sizes, it remains challenging to conduct statistical analysis for somatic mutations, because the vast majority of somatic mutations occur at very low frequencies. Furthermore, cancer is a complex disease and it is often accompanied by multiple traits that reflect various aspects of cancer; how to combine the information of these traits to identify important somatic mutations poses additional challenges. In this article, we introduce a statistical approach, named SOMAT, for detecting somatic mutations associated with multiple cancer-related traits. Our approach provides a flexible framework for analyzing continuous traits, binary traits, or a mixture of both, and is statistically powerful and computationally efficient. In addition, we propose a data-adaptive procedure, which is grid-search free, for effectively combining test statistics to enhance statistical power. We conduct an extensive study and show that the proposed approach maintains correct type I error and is more powerful than existing approaches under the scenarios considered. We also apply our approach to an exome-sequencing study of liver tumor for illustration.
Affiliation(s)
- Qianchuan He
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, U.S.A
- Yang Liu
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, U.S.A
- Ulrike Peters
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, U.S.A
- Li Hsu
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, U.S.A
|
32
|
Maity A, Zhao J, Sullivan PF, Tzeng JY. Inference on phenotype-specific effects of genes using multivariate kernel machine regression. Genet Epidemiol 2018; 42:64-79. [PMID: 29314255 PMCID: PMC5768462 DOI: 10.1002/gepi.22096] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/02/2017] [Revised: 10/20/2017] [Accepted: 10/20/2017] [Indexed: 12/16/2022]
Abstract
We consider the problem of assessing the joint effect of a set of genetic markers on multiple, possibly correlated phenotypes of interest. We develop a kernel machine based multivariate regression framework, where the joint effect of the marker set on each of the phenotypes is modeled using prespecified kernel functions with unknown variance components. Unlike most existing methods that mainly focus on the global association between the marker set and the phenotype set, we develop estimation and testing procedures to study phenotype-specific associations. Specifically, we develop an estimation method based on the penalized likelihood approach to estimate phenotype-specific effects and their corresponding standard errors while accounting for possible correlation among the phenotypes. We develop testing procedures for the association of the marker set with any subset of phenotypes using a score-based variance components testing method. We assess the performance of our proposed methodology via a simulation study and demonstrate the utility of the proposed method using the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) data.
Affiliation(s)
- Arnab Maity
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Jing Zhao
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Patrick F Sullivan
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Jung-Ying Tzeng
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
  - Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
  - Department of Statistics, National Cheng-Kung University, Tainan City, Taiwan
|
33
|
Sun J, Oualkacha K, Forgetta V, Zheng HF, Richards JB, Evans DS, Orwoll E, Greenwood CMT. Exome-wide rare variant analyses of two bone mineral density phenotypes: the challenges of analyzing rare genetic variation. Sci Rep 2018; 8:220. [PMID: 29317680 PMCID: PMC5760616 DOI: 10.1038/s41598-017-18385-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/02/2017] [Accepted: 12/11/2017] [Indexed: 11/08/2022] Open
Abstract
Performance of a recently developed test for association between multivariate phenotypes and sets of genetic variants (MURAT) is demonstrated using measures of bone mineral density (BMD). By combining individual-level whole genome sequenced data from the UK10K study, and imputed genome-wide genetic data on individuals from the Study of Osteoporotic Fractures (SOF) and the Osteoporotic Fractures in Men Study (MrOS), a data set of 8810 individuals was assembled; tests of association were performed between autosomal gene-sets of genetic variants and BMD measured at lumbar spine and femoral neck. Distributions of p-values obtained from analyses of a single BMD phenotype are compared to those from the multivariate tests, across several region definitions and variant weightings. There is evidence of increased power with the multivariate test, although no new loci for BMD were identified. Among 17 genes highlighted either because there were significant p-values in region-based association tests or because they were in well-known BMD genes, 4 windows in 2 genes as well as 6 single SNPs in one of these genes showed association at genome-wide significant thresholds with the multivariate phenotype test but not with the single-phenotype test, Sequence Kernel Association Test (SKAT).
Affiliation(s)
- Jianping Sun
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- Karim Oualkacha
  - Département de mathématiques, Université du Québec à Montréal, Montreal, QC, Canada
- Vincenzo Forgetta
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- Hou-Feng Zheng
  - Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Westlake University, Hangzhou, Zhejiang, China
  - Institute of Aging Research and the Affiliated Hospital, School of Medicine, Hangzhou Normal University, Hangzhou, Zhejiang, China
- J Brent Richards
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
  - Department of Human Genetics, McGill University, Montreal, QC, Canada
- Daniel S Evans
  - California Pacific Medical Center Research Institute, San Francisco, CA, USA
- Eric Orwoll
  - Department of Medicine, Bone and Mineral Unit, Oregon Health and Science University, Portland, OR, USA
- Celia M T Greenwood
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
  - Department of Human Genetics, McGill University, Montreal, QC, Canada
  - Department of Oncology, McGill University, Montreal, QC, Canada
|
34
|
Zhao N, Zhan X, Huang YT, Almli LM, Smith A, Epstein MP, Conneely K, Wu MC. Kernel machine methods for integrative analysis of genome-wide methylation and genotyping studies. Genet Epidemiol 2017; 42:156-167. [DOI: 10.1002/gepi.22100] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 05/29/2017] [Revised: 09/26/2017] [Accepted: 10/27/2017] [Indexed: 12/22/2022]
Affiliation(s)
- Ni Zhao
  - Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland 21205, United States of America
- Xiang Zhan
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, Pennsylvania 17033, United States of America
- Yen-Tsung Huang
  - Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan
- Lynn M Almli
  - Department of Psychiatry and Behavioral Sciences, Emory University, Atlanta, Georgia 30322, United States of America
- Alicia Smith
  - Department of Gynecology and Obstetrics, Emory University, Atlanta, Georgia 30322, United States of America
- Michael P. Epstein
  - Department of Human Genetics, Emory University, Atlanta, Georgia 30322, United States of America
- Karen Conneely
  - Department of Human Genetics, Emory University, Atlanta, Georgia 30322, United States of America
- Michael C. Wu
  - Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, United States of America
|
35
|
Xu Z, Xu G, Pan W. Adaptive testing for association between two random vectors in moderate to high dimensions. Genet Epidemiol 2017; 41:599-609. [PMID: 28714590 PMCID: PMC5643233 DOI: 10.1002/gepi.22059] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 02/27/2017] [Revised: 04/26/2017] [Accepted: 05/17/2017] [Indexed: 01/09/2023]
Abstract
Testing for association between two random vectors is a common and important task in many fields; however, existing tests, such as Escoufier's RV test, are suitable only for low-dimensional data, not for high-dimensional data. In moderate to high dimensions, it is necessary to consider sparse signals, which are often expected with only a few, but not many, variables associated with each other. We generalize the RV test to moderate-to-high dimensions. The key idea is to data-adaptively weight each variable pair based on its empirical association. As a consequence, the proposed test is adaptive, alleviating the effects of noise accumulation in high-dimensional data, and thus maintaining the power for both dense and sparse alternative hypotheses. We show the connections between the proposed test and several existing tests, such as a generalized estimating equations-based adaptive test, multivariate kernel machine regression (KMR), and kernel distance methods. Furthermore, we modify the proposed adaptive test so that it can be powerful for nonlinear or nonmonotonic associations. We use both real data and simulated data to demonstrate the advantages and usefulness of the proposed new test. The new test is freely available in R package aSPC on CRAN at https://cran.r-project.org/web/packages/aSPC/index.html and https://github.com/jasonzyx/aSPC.
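As a point of reference for the abstract above, the following is a minimal Python sketch of the classical Escoufier RV coefficient with a permutation p-value. It illustrates only the low-dimensional baseline that the abstract says is being generalized, not the adaptive test implemented in aSPC. The function names `rv_coefficient` and `rv_permutation_pvalue`, and the choice of a permutation null, are assumptions made for illustration.

```python
import numpy as np

def rv_coefficient(X, Y):
    """Escoufier's RV coefficient between two data matrices with the same rows."""
    # Column-center both matrices so cross-products are proportional to covariances.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxy = Xc.T @ Yc          # cross-product matrix (p x q)
    Sxx = Xc.T @ Xc          # within-X cross-product matrix (p x p)
    Syy = Yc.T @ Yc          # within-Y cross-product matrix (q x q)
    num = np.trace(Sxy @ Sxy.T)
    den = np.sqrt(np.trace(Sxx @ Sxx) * np.trace(Syy @ Syy))
    return num / den

def rv_permutation_pvalue(X, Y, n_perm=999, seed=0):
    """Permutation p-value for the RV coefficient (illustrative null, an assumption)."""
    rng = np.random.default_rng(seed)
    observed = rv_coefficient(X, Y)
    exceed = 0
    for _ in range(n_perm):
        # Shuffling the rows of X breaks any row-wise link between X and Y.
        perm = rng.permutation(X.shape[0])
        if rv_coefficient(X[perm], Y) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

The coefficient lies between 0 and 1 and equals 1 when a matrix is compared with itself. Every variable pair contributes with equal weight here, which is the behavior the abstract describes as vulnerable to noise accumulation in high dimensions.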
Affiliation(s)
- Zhiyuan Xu
  - Division of Biostatistics, University of Minnesota
- Gongjun Xu
  - Department of Statistics, University of Michigan
- Wei Pan
  - Division of Biostatistics, University of Minnesota
|
36
|
Lin N, Zhu Y, Fan R, Xiong M. A quadratically regularized functional canonical correlation analysis for identifying the global structure of pleiotropy with NGS data. PLoS Comput Biol 2017; 13:e1005788. [PMID: 29040274 PMCID: PMC5659802 DOI: 10.1371/journal.pcbi.1005788] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/24/2016] [Revised: 10/27/2017] [Accepted: 09/21/2017] [Indexed: 01/12/2023] Open
Abstract
Investigating the pleiotropic effects of genetic variants can increase statistical power, provide important information to achieve deep understanding of the complex genetic structures of disease, and offer powerful tools for designing effective treatments with fewer side effects. However, the current multiple phenotype association analysis paradigm lacks breadth (number of phenotypes and genetic variants jointly analyzed at the same time) and depth (hierarchical structure of phenotype and genotypes). A key issue for high dimensional pleiotropic analysis is to effectively extract informative internal representation and features from high dimensional genotype and phenotype data. To explore correlation information of genetic variants, effectively reduce data dimensions, and overcome critical barriers in advancing the development of novel statistical methods and computational algorithms for genetic pleiotropic analysis, we proposed a new statistical method referred to as quadratically regularized functional CCA (QRFCCA) for association analysis, which combines three approaches: (1) quadratically regularized matrix factorization, (2) functional data analysis and (3) canonical correlation analysis (CCA). Large-scale simulations show that the QRFCCA has a much higher power than that of the ten competing statistics while retaining the appropriate type 1 error rates. To further evaluate performance, the QRFCCA and ten other statistics are applied to the whole genome sequencing dataset from the TwinsUK study. We identify a total of 79 genes with rare variants and 67 genes with common variants significantly associated with the 46 traits using QRFCCA. The results show that the QRFCCA substantially outperforms the ten other statistics.
Affiliation(s)
- Nan Lin
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States of America
- Yun Zhu
  - Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine, New Orleans, LA, United States of America
- Ruzong Fan
  - Biostatistics and Bioinformatics Branch (BBB), Division of Intramural Population Health Research (DIPHR), Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, MD, United States of America
- Momiao Xiong
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States of America
|
37
|
Zhang W, Yang L, Tang LL, Liu A, Mills JL, Sun Y, Li Q. GATE: an efficient procedure in study of pleiotropic genetic associations. BMC Genomics 2017; 18:552. [PMID: 28732532 PMCID: PMC5521155 DOI: 10.1186/s12864-017-3928-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/16/2017] [Accepted: 07/06/2017] [Indexed: 11/10/2022] Open
Abstract
Background: Association studies of human complex traits are well suited to identifying deleterious genetic markers. Compared to single-trait analyses, multiple-trait analyses can arguably make better use of the information on both traits and markers, and thus markedly improve the statistical power of association tests. Principal component analysis (PCA) is a well-known useful tool in multivariate analysis and can be applied to this task. Generally, PCA is first performed on all traits and then a certain number of top principal components (PCs) that explain most of the trait variation are selected to construct the test statistics. However, in some situations, using only these top PCs loses important evidence carried by the discarded PCs and thus compromises the test. Methods: To overcome this drawback while keeping the advantages of using the top PCs, we propose a group accumulated test evidence (GATE) procedure. By dividing the PCs, sorted in descending order of their eigenvalues, into a few groups, GATE integrates the information of traits at the group level. Results: Simulation studies demonstrate the superiority of the proposed approach over several existing methods in terms of statistical power. Sometimes, the increase in power can reach 25%. These methods are further illustrated using the Heterogeneous Stock Mice data, which were collected from a quantitative genome-wide association study. Conclusions: Overall, GATE provides a powerful test for pleiotropic genetic associations. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-3928-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Wei Zhang
  - Key Laboratory of Systems and Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
  - Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, USA
- Liu Yang
  - College of Geoscience and Surveying Engineering, China University of Mining and Technology, Beijing, China
- Larry L Tang
  - Department of Statistics, George Mason University, Fairfax, VA, USA
  - Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, MD, USA
- Aiyi Liu
  - Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
- James L Mills
  - Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
- Yuanchang Sun
  - Department of Mathematics and Statistics, Florida International University, Miami, FL, USA
- Qizhai Li
  - Key Laboratory of Systems and Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
|
38
|
Powerful Genetic Association Analysis for Common or Rare Variants with High-Dimensional Structured Traits. Genetics 2017. [PMID: 28642271 DOI: 10.1534/genetics.116.199646] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Indexed: 01/03/2023] Open
Abstract
Many genetic association studies collect a wide range of complex traits. As these traits may be correlated and share a common genetic mechanism, joint analysis can be statistically more powerful and biologically more meaningful. However, most existing tests for multiple traits cannot be used for high-dimensional and possibly structured traits, such as network-structured transcriptomic pathway expressions. To overcome potential limitations, in this article we propose the dual kernel-based association test (DKAT) for testing the association between multiple traits and multiple genetic variants, both common and rare. In DKAT, two individual kernels are used to describe the phenotypic and genotypic similarity, respectively, between pairwise subjects. Using kernels allows for capturing structure while accommodating dimensionality. Then, the association between traits and genetic variants is summarized by a coefficient which measures the association between two kernel matrices. Finally, DKAT evaluates the hypothesis of nonassociation with an analytical P-value calculation without any computationally expensive resampling procedures. By collapsing information in both traits and genetic variants using kernels, the proposed DKAT is shown to have a correct type-I error rate and higher power than other existing methods in both simulation studies and application to a study of genetic regulation of pathway gene expressions.
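The two-kernel idea in the abstract above can be sketched in a few lines of Python. The abstract does not give the formula for the association coefficient, so the sketch assumes a normalized trace of the product of two centered kernel matrices (a kernel analogue of the RV coefficient). The kernel choices, the `bandwidth` parameter and all function names are hypothetical, and the analytical P-value calculation is not reproduced.

```python
import numpy as np

def linear_kernel(Z):
    """Pairwise inner products between subjects (rows of Z)."""
    return Z @ Z.T

def gaussian_kernel(Z, bandwidth):
    """Gaussian similarity between subjects; bandwidth is a user-chosen scale."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative values from rounding
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def center_kernel(K):
    """Double-center an n x n kernel matrix: H K H with H = I - 11'/n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_association(K, L):
    """Normalized trace of the product of two centered kernels (assumed form)."""
    Kc = center_kernel(K)
    Lc = center_kernel(L)
    # For symmetric matrices, sum(A * B) equals trace(A @ B).
    return np.sum(Kc * Lc) / np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc))
```

A typical call would build one kernel from the genotype matrix and one from the trait matrix, for example `kernel_association(linear_kernel(G), gaussian_kernel(Y, bandwidth=2.0))`. Both matrices are n x n regardless of how many variants or traits are measured, which is how a kernel summary accommodates high-dimensional traits.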
|
39
|
Kwak IY, Pan W. Gene- and pathway-based association tests for multiple traits with GWAS summary statistics. Bioinformatics 2017; 33:64-71. [PMID: 27592708 PMCID: PMC5198520 DOI: 10.1093/bioinformatics/btw577] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 04/29/2016] [Revised: 08/08/2016] [Accepted: 08/29/2016] [Indexed: 11/15/2022] Open
Abstract
To identify novel genetic variants associated with complex traits and to shed new light on the underlying biology, in addition to the most popular single SNP-single trait association analysis, it would be useful to explore multiple correlated (intermediate) traits at the gene or pathway level by mining existing single GWAS or meta-analyzed GWAS data. For this purpose, we present an adaptive gene-based test and a pathway-based test for association analysis of multiple traits with GWAS summary statistics. The proposed tests are adaptive at both the SNP and trait levels; that is, they account for possibly varying association patterns (e.g. signal sparsity levels) across SNPs and traits, thus maintaining high power across a wide range of situations. Furthermore, the proposed methods are general: they can be applied to mixed types of traits, and to Z-statistics or P-values as summary statistics obtained from either a single GWAS or a meta-analysis of multiple GWAS. Our numerical studies with simulated and real data demonstrated the promising performance of the proposed methods. Availability and Implementation: The methods are implemented in R package aSPU, freely and publicly available at: https://cran.r-project.org/web/packages/aSPU/. Contact: weip@biostat.umn.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Il-Youp Kwak
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
|
40
|
Zhan X, Tong X, Zhao N, Maity A, Wu MC, Chen J. A small-sample multivariate kernel machine test for microbiome association studies. Genet Epidemiol 2016; 41:210-220. [DOI: 10.1002/gepi.22030] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 05/18/2016] [Revised: 10/04/2016] [Accepted: 10/22/2016] [Indexed: 12/28/2022]
Affiliation(s)
- Xiang Zhan
- Public Health Sciences; Fred Hutchinson Cancer Research Center; Seattle WA USA
- Xingwei Tong
- School of Mathematical Sciences; Beijing Normal University; Beijing China
- Ni Zhao
- Department of Biostatistics; Johns Hopkins University; Baltimore MD USA
- Arnab Maity
- Department of Statistics; North Carolina State University; Raleigh NC USA
- Michael C. Wu
- Public Health Sciences; Fred Hutchinson Cancer Research Center; Seattle WA USA
- Jun Chen
- Division of Biomedical Statistics and Informatics, Mayo Clinic; Rochester MN USA
|
41
|
Meta-analysis of quantitative pleiotropic traits for next-generation sequencing with multivariate functional linear models. Eur J Hum Genet 2016; 25:350-359. [PMID: 28000696 DOI: 10.1038/ejhg.2016.170] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/11/2016] [Revised: 07/26/2016] [Accepted: 09/27/2016] [Indexed: 11/09/2022] Open
Abstract
To analyze next-generation sequencing data, multivariate functional linear models are developed for a meta-analysis of multiple studies to connect genetic variant data to multiple quantitative traits, adjusting for covariates. The goal is to take advantage of both meta-analysis and pleiotropic analysis in order to improve power and to carry out a unified association analysis of multiple studies and multiple traits of complex disorders. Three types of approximate F-distributions based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks's Lambda are introduced to test for association between multiple quantitative traits and multiple genetic variants. Simulation analysis is performed to evaluate false-positive rates and power of the proposed tests. The proposed methods are applied to analyze lipid traits in eight European cohorts. It is shown that multivariate analysis is generally more advantageous than univariate analysis, and that a meta-analysis of multiple studies is more advantageous than analyzing the individual studies separately. The proposed models require individual observations. The value of the current paper can be seen in at least two respects: (a) the proposed methods can be applied to studies that have individual genotype data; (b) the proposed methods can be used as a criterion for future work that uses summary statistics to build test statistics to meta-analyze the data.
|
42
|
Chiu CY, Jung J, Wang Y, Weeks DE, Wilson AF, Bailey-Wilson JE, Amos CI, Mills JL, Boehnke M, Xiong M, Fan R. A comparison study of multivariate fixed models and Gene Association with Multiple Traits (GAMuT) for next-generation sequencing. Genet Epidemiol 2016; 41:18-34. [PMID: 27917525 DOI: 10.1002/gepi.22014] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/19/2016] [Revised: 09/01/2016] [Accepted: 09/19/2016] [Indexed: 01/23/2023]
Abstract
In this paper, extensive simulations are performed to compare two statistical methods to analyze multiple correlated quantitative phenotypes: (1) approximate F-distributed tests of multivariate functional linear models (MFLM) and additive models of multivariate analysis of variance (MANOVA), and (2) Gene Association with Multiple Traits (GAMuT) for association testing of high-dimensional genotype data. It is shown that approximate F-distributed tests of MFLM and MANOVA have higher power and are more appropriate for major gene association analysis (i.e., scenarios in which some genetic variants have relatively large effects on the phenotypes); GAMuT has higher power and is more appropriate for analyzing polygenic effects (i.e., effects from a large number of genetic variants each of which contributes a small amount to the phenotypes). MFLM and MANOVA are very flexible and can be used to perform association analysis for (i) rare variants, (ii) common variants, and (iii) a combination of rare and common variants. Although GAMuT was designed to analyze rare variants, it can be applied to analyze a combination of rare and common variants and it performs well when (1) the number of genetic variants is large and (2) each variant contributes a small amount to the phenotypes (i.e., polygenes). MFLM and MANOVA are fixed effect models that perform well for major gene association analysis. GAMuT can be viewed as an extension of sequence kernel association tests (SKAT). Both GAMuT and SKAT are more appropriate for analyzing polygenic effects and they perform well not only in the rare variant case, but also in the case of a combination of rare and common variants. Data analyses of European cohorts and the Trinity Students Study are presented to compare the performance of the two methods.
Affiliation(s)
- Chi-Yang Chiu
- Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, MD, USA
- Jeesun Jung
- Laboratory of Epidemiology and Biometry, National Institute on Alcohol Abuse and Alcoholism, NIH, Bethesda, MD, USA
- Yifan Wang
- Center for Drug Evaluation and Research, Food and Drug Administration, Silver Spring, MD, USA
- Daniel E Weeks
- Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA, USA
- Alexander F Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Bethesda, MD, USA
- Joan E Bailey-Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Bethesda, MD, USA
- Christopher I Amos
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Lebanon, NH, USA
- James L Mills
- Epidemiology Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, MD, USA
- Michael Boehnke
- Department of Biostatistics, School of Public Health, The University of Michigan, Ann Arbor, MI, USA
- Momiao Xiong
- Human Genetics Center, University of Texas-Houston, Houston, TX, USA
- Ruzong Fan
- Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, MD, USA
|
43
|
Sun J, Bhatnagar SR, Oualkacha K, Ciampi A, Greenwood CMT. Joint analysis of multiple blood pressure phenotypes in GAW19 data by using a multivariate rare-variant association test. BMC Proc 2016; 10:309-313. [PMID: 27980654 PMCID: PMC5133485 DOI: 10.1186/s12919-016-0048-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/23/2022] Open
Abstract
INTRODUCTION Large-scale sequencing studies often measure many related phenotypes in addition to the genetic variants. Joint analysis of multiple phenotypes in genetic association studies may increase power to detect disease-associated loci. METHODS We apply a recently developed multivariate rare-variant association test to the Genetic Analysis Workshop 19 data in order to test associations between genetic variants and multiple blood pressure phenotypes simultaneously. We also compare this multivariate test with a widely used univariate test that analyzes phenotypes separately. RESULTS The multivariate test identified 2 genetic variants that have been previously reported as associated with hypertension or coronary artery disease. In addition, our region-based analyses also show that the multivariate test tends to give smaller p values than the univariate test. CONCLUSIONS Hence, the multivariate test has potential to improve test power, especially when multiple phenotypes are correlated.
Affiliation(s)
- Jianping Sun
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC H3A 1A2 Canada
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC H3T 1E2 Canada
- Sahir R. Bhatnagar
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC H3A 1A2 Canada
- Karim Oualkacha
- Département de Mathématiques, Université du Québec à Montréal, Montréal, QC H2X 3Y7 Canada
- Antonio Ciampi
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC H3A 1A2 Canada
- Celia M. T. Greenwood
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC H3A 1A2 Canada
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC H3T 1E2 Canada
- Department of Oncology, McGill University, Montreal, QC H2W 1S6 Canada
- Department of Human Genetics, McGill University, Montreal, QC H3A 1B1 Canada
|
44
|
Statistical Methods for Testing Genetic Pleiotropy. Genetics 2016; 204:483-497. [PMID: 27527515 DOI: 10.1534/genetics.116.189308] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 03/16/2016] [Accepted: 08/11/2016] [Indexed: 12/28/2022] Open
Abstract
Genetic pleiotropy is when a single gene influences more than one trait. Detecting pleiotropy and understanding its causes can improve the biological understanding of a gene in multiple ways, yet current multivariate methods to evaluate pleiotropy test the null hypothesis that none of the traits are associated with a variant; departures from the null could be driven by just one associated trait. A formal test of pleiotropy should assume a null hypothesis that one or no traits are associated with a genetic variant. For the special case of two traits, one can construct this null hypothesis based on the intersection-union (IU) test, which rejects the null hypothesis only if the null hypotheses of no association for both traits are rejected. To allow for more than two traits, we developed a new likelihood-ratio test for pleiotropy. We then extended the testing framework to a sequential approach to test the null hypothesis that [Formula: see text] traits are associated, given that the null of k traits are associated was rejected. This provides a formal testing framework to determine the number of traits associated with a genetic variant, while accounting for correlations among the traits. By simulations, we illustrate the type I error rate and power of our new methods; describe how they are influenced by sample size, the number of traits, and the trait correlations; and apply the new methods to multivariate immune phenotypes in response to smallpox vaccination. Our new approach provides a quantitative assessment of pleiotropy, enhancing current analytic practice.
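For the two-trait special case described above, the intersection-union rule can be sketched in a few lines: the null of "at most one trait associated" is rejected only when both per-trait nulls are rejected, which is equivalent to comparing the larger of the two P-values with the significance level. The function names and the default level of 0.05 are illustrative assumptions; the likelihood-ratio and sequential tests for more than two traits are not shown.

```python
# Minimal sketch of the two-trait intersection-union (IU) rule.
# Function names and the default alpha are hypothetical.

def iu_pvalue(p_trait1, p_trait2):
    """IU P-value for two traits: the larger of the two per-trait P-values."""
    for p in (p_trait1, p_trait2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("P-values must lie in [0, 1]")
    return max(p_trait1, p_trait2)

def reject_pleiotropy_null(p_trait1, p_trait2, alpha=0.05):
    """Reject 'at most one trait associated' only if both traits are significant."""
    return iu_pvalue(p_trait1, p_trait2) <= alpha

# Both traits significant: the pleiotropy null is rejected.
both_significant = reject_pleiotropy_null(0.01, 0.03)
# Only one trait significant: the pleiotropy null is retained.
one_significant = reject_pleiotropy_null(0.01, 0.20)
```

Because each per-trait test is run at level alpha and both must reject, no multiplicity correction is applied to the individual P-values in this rule.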
|
45
|
Fan R, Chiu CY, Jung J, Weeks DE, Wilson AF, Bailey-Wilson JE, Amos CI, Chen Z, Mills JL, Xiong M. A Comparison Study of Fixed and Mixed Effect Models for Gene Level Association Studies of Complex Traits. Genet Epidemiol 2016; 40:702-721. [PMID: 27374056 DOI: 10.1002/gepi.21984] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 09/01/2015] [Revised: 03/08/2016] [Accepted: 04/26/2016] [Indexed: 12/22/2022]
Abstract
In association studies of complex traits, fixed-effect regression models are usually used to test for association between traits and major gene loci. In recent years, variance-component tests based on mixed models were developed for region-based genetic variant association tests. In the mixed models, the association is tested by a null hypothesis of zero variance via a sequence kernel association test (SKAT), its optimal unified test (SKAT-O), and a combined sum test of rare and common variant effect (SKAT-C). Although there are some comparison studies to evaluate the performance of mixed and fixed models, there is no systematic analysis to determine when the mixed models perform better and when the fixed models perform better. Here we evaluated, based on extensive simulations, the performance of the fixed and mixed model statistics, using genetic variants located in 3, 6, 9, 12, and 15 kb simulated regions. We compared the performance of three models: (i) mixed models that lead to SKAT, SKAT-O, and SKAT-C, (ii) traditional fixed-effect additive models, and (iii) fixed-effect functional regression models. To evaluate the type I error rates of the tests of fixed models, we generated genotype data by two methods: (i) using all variants, (ii) using only rare variants. We found that the fixed-effect tests accurately control or have low false positive rates. We performed simulation analyses to compare power for two scenarios: (i) all causal variants are rare, (ii) some causal variants are rare and some are common. Either one or both of the fixed-effect models performed better than or similar to the mixed models except when (1) the region sizes are 12 and 15 kb and (2) effect sizes are small. Therefore, the assumption of mixed models could be satisfied and SKAT/SKAT-O/SKAT-C could perform better if the number of causal variants is large and each causal variant contributes a small amount to the traits (i.e., polygenes). 
In major gene association studies, we argue that the fixed-effect models perform better than or similarly to mixed models in most cases because some variants should have relatively large effects on the traits. In practice, it makes sense to perform analysis by both the fixed and mixed effect models and to make a comparison, and this can be readily done using our R codes and the SKAT packages.
Affiliation(s)
- Ruzong Fan
- Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
- Chi-Yang Chiu
- Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
- Jeesun Jung
- Laboratory of Epidemiology and Biometry, National Institute on Alcohol Abuse and Alcoholism, National Institutes of Health, Bethesda, Maryland, United States of America
- Daniel E Weeks
- Departments of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Alexander F Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Joan E Bailey-Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Christopher I Amos
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Lebanon, New Hampshire, United States of America
- Zhen Chen
- Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
- James L Mills
- Epidemiology Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
- Momiao Xiong
- Human Genetics Center, University of Texas-Houston, Houston, Texas, United States of America
|
46
|
Singh P, Engel J, Jansen J, de Haan J, Buydens LMC. Dissimilarity based Partial Least Squares (DPLS) for genomic prediction from SNPs. BMC Genomics 2016; 17:324. [PMID: 27142305 PMCID: PMC4855361 DOI: 10.1186/s12864-016-2651-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/24/2015] [Accepted: 04/22/2016] [Indexed: 05/29/2023] Open
Abstract
BACKGROUND Genomic prediction (GP) allows breeders to select plants and animals based on their breeding potential for desirable traits, without lengthy and expensive field trials or progeny testing. We have proposed to use Dissimilarity-based Partial Least Squares (DPLS) for GP. As a case study, we use the DPLS approach to predict Bacterial wilt (BW) in tomatoes using SNPs as predictors. To assess its performance, the DPLS approach was compared with the Genomic Best-Linear Unbiased Prediction (GBLUP) and with single-SNP regression with SNP as a fixed effect. RESULTS Eight genomic distance measures were used to quantify relationships between the tomato accessions from the SNPs. Subsequently, each of these distance measures was used to predict the BW using the DPLS prediction model. The DPLS model was found to be robust to the choice of distance measures; similar prediction performances were obtained for each distance measure. DPLS greatly outperformed the single-SNP regression approach, showing that BW is a comprehensive trait dependent on several loci. Next, the performance of the DPLS model was compared to that of GBLUP. Although GBLUP and DPLS are conceptually very different, the prediction quality (PQ) of the DPLS models was similar to the prediction statistics obtained from GBLUP. A considerable advantage of DPLS is that the genotype-phenotype relationship can easily be visualized in a 2-D scatter plot. This so-called score-plot gives breeders insight for selecting candidates for their future breeding program. CONCLUSIONS DPLS is a highly appropriate method for GP. The model prediction performance was similar to the GBLUP and far better than the single-SNP approach. The proposed method can be used in combination with a wide range of genomic dissimilarity measures and genotype representations such as allele-count, haplotypes or allele-intensity values. 
Additionally, the data can be insightfully visualized by the DPLS model, allowing for selection of desirable candidates from the breeding experiments. In this study, we have assessed the DPLS performance on a single trait.
Affiliation(s)
- Priyanka Singh
- Department of Bioinformatics, Genetwister Technologies B.V., Wageningen, The Netherlands
- Radboud University Nijmegen, Institute for Molecules and Materials, Nijmegen, The Netherlands
- Jasper Engel
- Radboud University Nijmegen, Institute for Molecules and Materials, Nijmegen, The Netherlands
- Jeroen Jansen
- Radboud University Nijmegen, Institute for Molecules and Materials, Nijmegen, The Netherlands
- Jorn de Haan
- Department of Bioinformatics, Genetwister Technologies B.V., Wageningen, The Netherlands
|
47
|
Powerful and Adaptive Testing for Multi-trait and Multi-SNP Associations with GWAS and Sequencing Data. Genetics 2016; 203:715-31. [PMID: 27075728 DOI: 10.1534/genetics.115.186502] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 12/22/2015] [Accepted: 04/02/2016] [Indexed: 11/18/2022] Open
Abstract
Testing for genetic association with multiple traits has become increasingly important, not only because of its potential to boost statistical power, but also for its direct relevance to applications. For example, there is accumulating evidence showing that some complex neurodegenerative and psychiatric diseases like Alzheimer's disease are due to disrupted brain networks, for which it would be natural to identify genetic variants associated with a disrupted brain network, represented as a set of multiple traits, one for each of multiple brain regions of interest. In spite of its promise, testing for multivariate trait associations is challenging: if not appropriately used, its power can be much lower than testing on each univariate trait separately (with a proper control for multiple testing). Furthermore, differing from most existing methods for single-SNP-multiple-trait associations, we consider SNP set-based association testing to decipher complicated joint effects of multiple SNPs on multiple traits. Because the power of a test critically depends on several unknown factors such as the proportions of associated SNPs and of traits, we propose a highly adaptive test at both the SNP and trait levels, giving higher weights to those likely associated SNPs and traits, to yield high power across a wide spectrum of situations. We illuminate relationships among the proposed and some existing tests, showing that the proposed test covers several existing tests as special cases. We compare the performance of the new test with that of several existing tests, using both simulated and real data. The methods were applied to structural magnetic resonance imaging data drawn from the Alzheimer's Disease Neuroimaging Initiative to identify genes associated with gray matter atrophy in the human brain default mode network (DMN). 
For genome-wide association studies (GWAS), genes AMOTL1 on chromosome 11 and APOE on chromosome 19 were discovered by the new test to be significantly associated with the DMN. Notably, gene AMOTL1 was not detected by single SNP-based analyses. To our knowledge, AMOTL1 has not been highlighted in other Alzheimer's disease studies before, although it was indicated to be related to cognitive impairment. The proposed method is also applicable to rare variants in sequencing data and can be extended to pathway analysis.
|
48
|
A Statistical Approach for Testing Cross-Phenotype Effects of Rare Variants. Am J Hum Genet 2016; 98:525-540. [PMID: 26942286 DOI: 10.1016/j.ajhg.2016.01.017] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.6] [Received: 07/09/2015] [Accepted: 01/29/2016] [Indexed: 11/20/2022] Open
Abstract
Increasing empirical evidence suggests that many genetic variants influence multiple distinct phenotypes. When cross-phenotype effects exist, multivariate association methods that consider pleiotropy are often more powerful than univariate methods that model each phenotype separately. Although several statistical approaches exist for testing cross-phenotype effects for common variants, there is a lack of similar tests for gene-based analysis of rare variants. In order to fill this important gap, we introduce a statistical method for cross-phenotype analysis of rare variants using a nonparametric distance-covariance approach that compares similarity in multivariate phenotypes to similarity in rare-variant genotypes across a gene. The approach can accommodate both binary and continuous phenotypes and further can adjust for covariates. Our approach yields a closed-form test whose significance can be evaluated analytically, thereby improving computational efficiency and permitting application on a genome-wide scale. We use simulated data to demonstrate that our method, which we refer to as the Gene Association with Multiple Traits (GAMuT) test, provides increased power over competing approaches. We also illustrate our approach using exome-chip data from the Genetic Epidemiology Network of Arteriopathy.
|
49
|
A method for analyzing multiple continuous phenotypes in rare variant association studies allowing for flexible correlations in variant effects. Eur J Hum Genet 2016; 24:1344-51. [PMID: 26860061 PMCID: PMC4989219 DOI: 10.1038/ejhg.2016.8] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 03/27/2015] [Revised: 12/22/2015] [Accepted: 12/30/2015] [Indexed: 01/05/2023] Open
Abstract
For region-based sequencing data, power to detect genetic associations can be improved through analysis of multiple related phenotypes. With this motivation, we propose a novel test to detect association simultaneously between a set of rare variants, such as those obtained by sequencing in a small genomic region, and multiple continuous phenotypes. We allow arbitrary correlations among the phenotypes and build on a linear mixed model by assuming the effects of the variants follow a multivariate normal distribution with a zero mean and a specific covariance matrix structure. In order to account for the unknown correlation parameter in the covariance matrix of the variant effects, a data-adaptive variance component test based on score-type statistics is derived. As our approach can calculate the P-value analytically, the proposed test procedure is computationally efficient. Broad simulations and an application to the UK10K project show that our proposed multivariate test is generally more powerful than univariate tests, especially when there are pleiotropic effects or highly correlated phenotypes.
|
50
|
Wu B, Pankow JS. Sequence Kernel Association Test of Multiple Continuous Phenotypes. Genet Epidemiol 2016; 40:91-100. [PMID: 26782911 PMCID: PMC4724299 DOI: 10.1002/gepi.21945] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Received: 06/16/2015] [Revised: 10/28/2015] [Accepted: 11/01/2015] [Indexed: 01/12/2023]
Abstract
Genetic studies often collect multiple correlated traits, which could be analyzed jointly to increase power by aggregating multiple weak effects and provide additional insights into the etiology of complex human diseases. Existing methods for multiple trait association tests have primarily focused on common variants. There is a surprising dearth of published methods for testing the association of rare variants with multiple correlated traits. In this paper, we extend the commonly used sequence kernel association test (SKAT) for single-trait analysis to test for the joint association of rare variant sets with multiple traits. We investigate the performance of the proposed method through extensive simulation studies. We further illustrate its usefulness with application to the analysis of diabetes-related traits in the Atherosclerosis Risk in Communities (ARIC) Study. We identified an exome-wide significant rare variant set in the gene YAP1 worthy of further investigations.
Affiliation(s)
- Baolin Wu
- Division of Biostatistics, University of Minnesota
- James S. Pankow
- Division of Epidemiology and Community Health, School of Public Health, University of Minnesota
|