1
Rong Y, Zhao SD, Zheng X, Li Y. Kernel Cox partially linear regression: Building predictive models for cancer patients' survival. Stat Med 2024; 43:1-15. [PMID: 37875428 DOI: 10.1002/sim.9938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 09/30/2023] [Accepted: 10/03/2023] [Indexed: 10/26/2023]
Abstract
Wide heterogeneity exists in cancer patients' survival, ranging from a few months to several decades. To accurately predict clinical outcomes, it is vital to build a predictive model that relates the patients' molecular profiles to their survival. With complex relationships between survival and high-dimensional molecular predictors, it is challenging to conduct nonparametric modeling and remove irrelevant predictors simultaneously. In this article, we build a kernel Cox proportional hazards semi-parametric model and propose a novel regularized garrotized kernel machine (RegGKM) method to fit the model. We use the kernel machine method to describe the complex relationship between survival and predictors, while automatically removing irrelevant parametric and nonparametric predictors through a LASSO penalty. An efficient high-dimensional algorithm is developed for the proposed method. Comparison with other competing methods in simulation shows that the proposed method always has better predictive accuracy. We apply this method to analyze a multiple myeloma dataset and predict the patients' death burden based on their gene expressions. Our results can help classify patients into groups with different death risks, facilitating treatment for better clinical outcomes.
Affiliation(s)
- Yaohua Rong
  - Faculty of Science, Beijing University of Technology, Beijing, China
- Sihai Dave Zhao
  - Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, Illinois, USA
- Xia Zheng
  - Faculty of Science, Beijing University of Technology, Beijing, China
- Yi Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
2
Wendel B, Heidenreich M, Budde M, Heilbronner M, Oraki Kohshour M, Papiol S, Falkai P, Schulze TG, Heilbronner U, Bickeböller H. Kalpra: A kernel approach for longitudinal pathway regression analysis integrating network information with an application to the longitudinal PsyCourse Study. Front Genet 2022; 13:1015885. [PMID: 36561312 PMCID: PMC9767414 DOI: 10.3389/fgene.2022.1015885] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2022] [Accepted: 11/24/2022] [Indexed: 12/12/2022] Open
Abstract
A popular approach to reduce the high dimensionality resulting from genome-wide association studies is to analyze a whole pathway in a single test for association with a phenotype. Kernel machine regression (KMR) is a highly flexible pathway analysis approach. Initially, KMR was developed to analyze a simple phenotype with just one measurement per individual. Recently, however, the investigation into the influence of genomic factors in the development of disease-related phenotypes across time (trajectories) has gained in importance. Thus, novel statistical approaches for KMR analyzing longitudinal data, i.e., several measurements at specific time points per individual, are required. For longitudinal pathway analysis, we extend KMR to long-KMR using the estimation equivalence of KMR and linear mixed models. We include additional random effects to correct for the dependence structure. Moreover, within long-KMR we created a topology-based pathway analysis by combining this approach with a kernel including network information of the pathway. Most importantly, long-KMR not only allows for the investigation of the main genetic effect adjusting for time dependencies within an individual, but it also allows testing for the association of the pathway with the longitudinal course of the phenotype in the form of testing the genetic time-interaction effect. The approach is implemented as an R package, kalpra. Our simulation study demonstrates that the power of long-KMR exceeded that of another KMR method previously developed to analyze longitudinal data, while maintaining (slightly conservatively) the type I error. The network kernel improved the performance of long-KMR compared to the linear kernel. Considering different pathway densities, the power of the network kernel decreased with increasing pathway density. We applied long-KMR to cognitive data on executive function (Trail Making Test, part B) from the PsyCourse Study and 17 candidate pathways selected from Reactome. We identified seven nominally significant pathways.
Affiliation(s)
- Bernadette Wendel (corresponding author)
  - Department of Genetic Epidemiology, University Medical Center Göttingen, Georg-August-University Göttingen, Göttingen, Germany
- Markus Heidenreich
  - Department of Genetic Epidemiology, University Medical Center Göttingen, Georg-August-University Göttingen, Göttingen, Germany
- Monika Budde
  - Institute of Psychiatric Phenomics and Genomics (IPPG), University Hospital, LMU Munich, Munich, Germany
- Maria Heilbronner
  - Institute of Psychiatric Phenomics and Genomics (IPPG), University Hospital, LMU Munich, Munich, Germany
- Mojtaba Oraki Kohshour
  - Institute of Psychiatric Phenomics and Genomics (IPPG), University Hospital, LMU Munich, Munich, Germany
- Sergi Papiol
  - Institute of Psychiatric Phenomics and Genomics (IPPG), University Hospital, LMU Munich, Munich, Germany
  - Department of Psychiatry and Psychotherapy, University Hospital, LMU Munich, Munich, Germany
- Peter Falkai
  - Department of Psychiatry and Psychotherapy, University Hospital, LMU Munich, Munich, Germany
- Thomas G. Schulze
  - Institute of Psychiatric Phenomics and Genomics (IPPG), University Hospital, LMU Munich, Munich, Germany
  - Department of Psychiatry and Behavioral Sciences, SUNY Upstate Medical University, Syracuse, NY, United States
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, United States
- Urs Heilbronner
  - Institute of Psychiatric Phenomics and Genomics (IPPG), University Hospital, LMU Munich, Munich, Germany
- Heike Bickeböller
  - Department of Genetic Epidemiology, University Medical Center Göttingen, Georg-August-University Göttingen, Göttingen, Germany
3
Arthur VL, Li Z, Cao R, Oetting WS, Israni AK, Jacobson PA, Ritchie MD, Guan W, Chen J. A Multi-Marker Test for Analyzing Paired Genetic Data in Transplantation. Front Genet 2021; 12:745773. [PMID: 34721531 PMCID: PMC8548646 DOI: 10.3389/fgene.2021.745773] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/22/2021] [Accepted: 09/23/2021] [Indexed: 12/02/2022] Open
Abstract
Emerging evidence suggests that donor/recipient matching in non-HLA (human leukocyte antigen) regions of the genome may impact transplant outcomes and recognizing these matching effects may increase the power of transplant genetics studies. Most available matching scores account for either single-nucleotide polymorphism (SNP) matching only or sum these SNP matching scores across multiple gene-coding regions, which makes it challenging to interpret the association findings. We propose a multi-marker Joint Score Test (JST) to jointly test for association between recipient genotype SNP effects and a gene-based matching score with transplant outcomes. This method utilizes Eigen decomposition as a dimension reduction technique to potentially increase statistical power by decreasing the degrees of freedom for the test. In addition, JST allows for the matching effect and the recipient genotype effect to follow different biological mechanisms, which is not the case for other multi-marker methods. Extensive simulation studies show that JST is competitive when compared with existing methods, such as the sequence kernel association test (SKAT), especially under scenarios where associated SNPs are in low linkage disequilibrium with non-associated SNPs or in gene regions containing a large number of SNPs. Applying the method to paired donor/recipient genetic data from kidney transplant studies yields various gene regions that are potentially associated with incidence of acute rejection after transplant.
Affiliation(s)
- Victoria L. Arthur
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, United States
- Zhengbang Li
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, United States
  - Departments of Statistics, Central China Normal University, Wuhan, China
- Rui Cao
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, United States
- William S. Oetting
  - Department of Experimental and Clinical Pharmacology, College of Pharmacy, University of Minnesota, Minneapolis, MN, United States
- Ajay K. Israni
  - Minneapolis Medical Research Foundation, Minneapolis, MN, United States
  - Department of Medicine, Hennepin County Medical Center, Minneapolis, MN, United States
  - Department of Epidemiology and Community Health, University of Minnesota, Minneapolis, MN, United States
- Pamala A. Jacobson
  - Department of Experimental and Clinical Pharmacology, College of Pharmacy, University of Minnesota, Minneapolis, MN, United States
- Marylyn D. Ritchie
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Weihua Guan
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, United States
- Jinbo Chen
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, United States
4
Pluta D, Shen T, Xue G, Chen C, Ombao H, Yu Z. Ridge-penalized adaptive Mantel test and its application in imaging genetics. Stat Med 2021; 40:5313-5332. [PMID: 34216035 DOI: 10.1002/sim.9127] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/20/2020] [Revised: 06/01/2021] [Accepted: 06/16/2021] [Indexed: 01/23/2023]
Abstract
We propose a ridge-penalized adaptive Mantel test (AdaMant) for evaluating the association of two high-dimensional sets of features. By introducing a ridge penalty, AdaMant tests the association across many metrics simultaneously. We demonstrate how ridge penalization bridges Euclidean and Mahalanobis distances and their corresponding linear models from the perspective of association measurement and testing. This result is not only theoretically interesting but also has important implications in penalized hypothesis testing, especially in high-dimensional settings such as imaging genetics. Applying the proposed method to an imaging genetic study of visual working memory in healthy adults, we identified interesting associations of brain connectivity (measured by electroencephalogram coherence) with selected genetic features.
Affiliation(s)
- Dustin Pluta
  - Department of Statistics, University of California, Irvine, Irvine, California, USA
- Tong Shen
  - Department of Statistics, University of California, Irvine, Irvine, California, USA
- Gui Xue
  - Center for Brain and Learning Science, Beijing Normal University, Beijing, China
- Chuansheng Chen
  - Department of Psychology and Social Behavior, University of California, Irvine, Irvine, California, USA
- Hernando Ombao
  - Statistics Program, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Zhaoxia Yu
  - Department of Statistics, University of California, Irvine, Irvine, California, USA
5
Gong M, Liu P, Sciurba FC, Stojanov P, Tao D, Tseng GC, Zhang K, Batmanghelich K. Unpaired data empowers association tests. Bioinformatics 2021; 37:785-792. [PMID: 33070196 PMCID: PMC8098021 DOI: 10.1093/bioinformatics/btaa886] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2020] [Revised: 09/07/2020] [Accepted: 10/05/2020] [Indexed: 11/25/2022] Open
Abstract
Motivation: There is growing interest in the biomedical research community to incorporate retrospective data, available in healthcare systems, to shed light on associations between different biomarkers. Understanding the association between various types of biomedical data, such as genetic, blood biomarkers, imaging, etc. can provide a holistic understanding of human diseases. To formally test a hypothesized association between two types of data in Electronic Health Records (EHRs), one requires a substantial sample size with both data modalities to achieve a reasonable power. Current association test methods only allow using data from individuals who have both data modalities. Hence, researchers cannot take advantage of much larger EHR samples that include individuals with at least one of the data types, which limits the power of the association test. Results: We present a new method called the Semi-paired Association Test (SAT) that makes use of both paired and unpaired data. In contrast to classical approaches, incorporating unpaired data allows SAT to produce better control of false discovery and to improve the power of the association test. We study the properties of the new test theoretically and empirically, through a series of simulations and by applying our method on real studies in the context of Chronic Obstructive Pulmonary Disease. We are able to identify an association between the high-dimensional characterization of Computed Tomography chest images and several blood biomarkers as well as the expression of dozens of genes involved in the immune system. Availability and implementation: Code is available at https://github.com/batmanlab/Semi-paired-Association-Test. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mingming Gong
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
  - Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 15213, USA
  - School of Mathematics and Statistics, The University of Melbourne, Melbourne, VIC 3010, Australia
- Peng Liu
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
- Frank C Sciurba
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
- Petar Stojanov
  - Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Dacheng Tao
  - Australia School of Computer Science, The University of Sydney, Sydney, NSW 2006, Australia
- George C Tseng
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
- Kun Zhang
  - Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Kayhan Batmanghelich
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
6
Schaid DJ, Sinnwell JP, Larson NB, Chen J. Penalized variance components for association of multiple genes with traits. Genet Epidemiol 2021; 44:665-675. [PMID: 33463755 DOI: 10.1002/gepi.22340] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/03/2020] [Revised: 06/26/2020] [Accepted: 07/05/2020] [Indexed: 12/19/2022]
Abstract
Variance component models have gained popularity for genetic analyses, driven by their flexibility to simultaneously analyze multiple genetic variants in a gene by kernel statistics, and their ability to account for population stratification via genomic relationship matrices. For exploratory analyses with modest sample sizes and a potentially large number of variance components, it can be challenging to use standard maximum-likelihood or restricted maximum-likelihood methods to estimate variance components, because these iterative methods often fail to converge when likelihood surfaces are fairly flat, and standard-likelihood ratio statistical tests are not adequate. To overcome these limitations, we developed a penalized-likelihood model, whereby the penalty function follows the popular elastic-net approach, applying both L1 and L2 penalties to the variance components. By simulations, we demonstrate the potential gain in power by using both L1 and L2 penalties, and results from our simulations suggest that assigning 80% of the penalty parameter to the L1 penalty and 20% to the L2 penalty provides a reasonable balance between false-positive and false-negative results. Larger sample size improves the properties of our methods, at the cost of longer computation time. Application of our methods to a study of the influence of DNA methylation on levels of cortisol in reaction to stress testing shows how our method can be used to prioritize findings for further functional studies.
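The 80%/20% split between L1 and L2 penalties described in this abstract can be illustrated with a small sketch. The function name, the argument names, and the exact scaling of the L2 term are assumptions made for illustration; the abstract does not give the authors' formula, and some elastic-net software halves the L2 term.

```python
def elastic_net_penalty(variance_components, lam, alpha=0.8):
    """Illustrative elastic-net penalty: lam * (alpha * L1 + (1 - alpha) * L2).

    alpha=0.8 mirrors the 80% L1 / 20% L2 split mentioned in the abstract.
    The scaling of the L2 term is an assumption, not the authors' definition.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(v) for v in variance_components)  # sum of absolute values
    l2 = sum(v * v for v in variance_components)   # sum of squares
    return lam * (alpha * l1 + (1.0 - alpha) * l2)


if __name__ == "__main__":
    # Hypothetical variance components, for demonstration only.
    print(elastic_net_penalty([1.0, 2.0], lam=1.0))
```

Setting alpha to 1 gives a pure L1 (LASSO-type) penalty and alpha to 0 a pure L2 (ridge-type) penalty, so alpha controls how strongly small components are pushed toward exactly zero.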
Affiliation(s)
- Daniel J Schaid
  - Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
- Jason P Sinnwell
  - Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
- Nicholas B Larson
  - Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
- Jun Chen
  - Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
7
Deng Y, Wu S, Fan H. Genome-wide pathway-based quantitative multiple phenotypes analysis. PLoS One 2020; 15:e0240910. [PMID: 33175855 PMCID: PMC7657528 DOI: 10.1371/journal.pone.0240910] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2020] [Accepted: 10/06/2020] [Indexed: 11/18/2022] Open
Abstract
For complex diseases, genome-wide pathway association studies have become increasingly promising. Currently, however, pathway-based association analysis mainly focuses on a single phenotype, which may be insufficient to describe complex diseases and physiological processes. This work proposes a combination model to evaluate the association between a pathway and multiple phenotypes and to reduce the run time based on asymptotic results. For a single phenotype, we propose a semi-supervised maximum kernel-based U-statistics (mSKU) method to assess the pathway-based association analysis. For multiple phenotypes, we propose the Fisher combination function with dependent phenotypes (FC) to transform the p-values between the pathway and each marginal phenotype individually to achieve pathway-based multiple phenotypes analysis. With real data from the Alzheimer Disease Neuroimaging Initiative (ADNI) study and Human Liver Cohort (HLC) study, the FC-mSKU method allows us to specify which pathways are specific to a single phenotype or contribute to common genetic constructions of multiple phenotypes. If we only focus on single-phenotype tests, we may miss some findings for etiology studies. Through extensive simulation studies, the FC-mSKU method demonstrates its advantages compared with its counterparts.
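As background for the combination step, the classical Fisher method for independent p-values can be sketched as follows. This is not the authors' FC procedure, which accounts for dependence between phenotypes; the function name is hypothetical, and the independence assumption is exactly what the paper relaxes.

```python
import math


def fisher_combine(p_values):
    """Classical Fisher combination, valid for INDEPENDENT p-values only.

    Returns (statistic, combined_p), where statistic = -2 * sum(log p_i).
    Under the global null the statistic is chi-square with 2k degrees of
    freedom; for even degrees of freedom the tail probability has the
    closed form exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for j in range(1, k):
        term *= half / j  # builds (x/2)^j / j! incrementally
        total += term
    return statistic, math.exp(-half) * total


if __name__ == "__main__":
    # Hypothetical marginal p-values for two phenotypes.
    stat, p = fisher_combine([0.05, 0.05])
    print(stat, p)
```

With correlated phenotypes the chi-square reference distribution above no longer holds, which is why a dependence-aware version is needed in practice.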
Affiliation(s)
- Yamin Deng
  - Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Shiman Wu
  - Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China
- Huifang Fan
  - Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China
8
Yin T, König S. Genomic predictions of growth curves in Holstein dairy cattle based on parameter estimates from nonlinear models combined with different kernel functions. J Dairy Sci 2020; 103:7222-7237. [PMID: 32534925 DOI: 10.3168/jds.2019-18010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/04/2019] [Accepted: 04/06/2020] [Indexed: 11/19/2022]
Abstract
Availability of longitudinal body weight (BW) records allows the application of nonlinear models (NLINM) to predict phenotypic and genomic growth curves in dairy cattle. In this regard, we considered a data set including 31,722 BW records from 4,952 female Holstein cattle, during the period from birth (mo 0) to approximately age at first calving (mo 24). Parameters of the growth curves were estimated using 3 NLINM: the logistic (LOG), the Gompertz (GOM), and the Richards (RICH) functions. Residuals for the growth curve parameters from the NLINM applications were used as pseudo-phenotypes in the ongoing genomic analyses with different similarity matrices, including 2 genomic relationship matrices (G1 and G2), a combined pedigree and genomic relationship matrix (H), and 3 kernel matrices. The kernels were a weighted "alike by state" kernel function (K1), an exponential dissimilarity kernel (K2), and a Gaussian kernel (K3). On the basis of G1 and G2 matrices, genomic heritabilities for the growth curve parameters birth weight (W0), mature weight (Wm), and growth rate (k), and the shape parameter (m; only available from RICH) were moderate to large, in the range from 0.29 (m from RICH) to 0.46 (k from RICH). Fitting the similarity matrices based on kernel functions contributed to an increase of the ratio of the variance explained by the similarity matrix in relation to the total variance (compared with the heritability when modeling G1 or G2). Genetic correlations between W0, Wm, and k were always positive (>0.30), especially for the same growth curve parameters estimated from different NLINM (>0.90). The shape parameter m from RICH was negatively correlated with other growth curve parameters, from -0.29 to -0.95. In a next step, estimated genomic breeding values for growth curve parameters were input data for the respective NLINM, aiming to construct genomic growth curves. 
Prediction accuracies were correlations between genomic growth curves and genomic breeding values from random regression models for sires and female cattle. Considering all genotyped female cattle with pseudo-phenotypes, prediction accuracies were larger from RICH than from LOG and GOM. However, differences in prediction accuracies from the NLINM × similarity matrix combinations were quite small. Accordingly, in 5-fold cross-validations using heifer groups with masked phenotypes, very similar prediction accuracies across modeling approaches were identified. Especially for specific age months, genomic growth curve predictions were more accurate for sires than for female cattle, indicating that the relationships between animals in training and validation sets are more important than the selection of specific NLINM × similarity matrix combinations.
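To make the growth-curve parameters concrete, the sketch below evaluates a logistic and a Gompertz curve expressed in terms of birth weight (W0), mature weight (Wm), and growth rate (k). The parameterization is one common textbook form and is an assumption here; the abstract does not state the exact equations used, and the Richards function is omitted. The numeric values are hypothetical.

```python
import math


def logistic_weight(t, w0, wm, k):
    """Logistic growth: equals w0 at t = 0 and approaches wm as t grows."""
    return wm / (1.0 + ((wm - w0) / w0) * math.exp(-k * t))


def gompertz_weight(t, w0, wm, k):
    """Gompertz growth: equals w0 at t = 0 and approaches wm as t grows."""
    return wm * math.exp(-math.log(wm / w0) * math.exp(-k * t))


if __name__ == "__main__":
    # Hypothetical values: 40 kg at birth, 650 kg mature weight, k per month.
    for month in (0, 6, 12, 24):
        print(month,
              round(logistic_weight(month, 40.0, 650.0, 0.2), 1),
              round(gompertz_weight(month, 40.0, 650.0, 0.2), 1))
```

Both curves share the same start and asymptote but differ in where growth is fastest, which is one reason estimates of k from different models are not directly interchangeable.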
Affiliation(s)
- T Yin
  - Institute of Animal Breeding and Genetics, Justus-Liebig-University Gießen, 35390 Gießen, Germany
- S König
  - Institute of Animal Breeding and Genetics, Justus-Liebig-University Gießen, 35390 Gießen, Germany
9
Agarwal D, Zhang NR. Semblance: An empirical similarity kernel on probability spaces. Sci Adv 2019; 5:eaau9630. [PMID: 31840051 PMCID: PMC6892634 DOI: 10.1126/sciadv.aau9630] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/14/2018] [Accepted: 09/30/2019] [Indexed: 06/10/2023]
Abstract
In data science, determining proximity between observations is critical to many downstream analyses such as clustering, classification, and prediction. However, when the data's underlying probability distribution is unclear, the function used to compute similarity between data points is often arbitrarily chosen. Here, we present a novel definition of proximity, Semblance, that uses the empirical distribution of a feature to inform the pair-wise similarity between observations. The advantage of Semblance lies in its distribution-free formulation and its ability to place greater emphasis on proximity between observation pairs that fall at the outskirts of the data distribution, as opposed to those toward the center. Semblance is a valid Mercer kernel, allowing its principled use in kernel-based learning algorithms, and for any data modality. We demonstrate its consistently improved performance against conventional methods through simulations and real case studies from diverse applications in single-cell transcriptomics, image reconstruction, and financial forecasting.
Affiliation(s)
- Divyansh Agarwal
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Nancy R. Zhang
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
10
Shao F, Wang Y, Zhao Y, Yang S. Identifying and exploiting gene-pathway interactions from RNA-seq data for binary phenotype. BMC Genet 2019; 20:36. [PMID: 30890140 PMCID: PMC6423879 DOI: 10.1186/s12863-019-0739-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/24/2018] [Accepted: 03/12/2019] [Indexed: 11/29/2022] Open
Abstract
Background: RNA sequencing (RNA-seq) technology has identified multiple differentially expressed (DE) genes associated with complex disease; however, these genes explain only a modest part of the variance. The omnigenic model assumes that disease may be driven by genes with indirect relevance to disease and be propagated by functional pathways. Here, we focus on identifying the interactions between the external genes and functional pathways, referred to as gene-pathway interactions (GPIs). Specifically, relying on the relationship between the garrote kernel machine (GKM) and variance component test and permutations for the empirical distributions of score statistics, we propose an efficient analysis procedure, Permutation based gEne-pAthway interaction identification in binary phenotype (PEA). Results: Various simulations show that PEA has well-calibrated type I error rates and higher power than the traditional likelihood ratio test (LRT). In addition, we apply the gene set enrichment algorithms and PEA to identify the GPIs from a pan-cancer data set (GES68086). These GPIs and genes possibly further illustrate the potential etiology of cancers, most of which are identified and some external genes and significant pathways are consistent with previous studies. Conclusions: PEA is an efficient tool for identifying the GPIs from RNA-seq data. It can be further extended to identify the interactions between one variable and one functional set of other omics data for binary phenotypes.
Affiliation(s)
- Fang Shao
  - Department of Biostatistics, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing, Jiangsu, People's Republic of China
- Yaqi Wang
  - Department of Pharmacy Informatics, School of Science, China Pharmaceutical University, 24 Tongjia Xiang, Nanjing, Jiangsu, People's Republic of China
- Yang Zhao
  - Department of Biostatistics, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing, Jiangsu, People's Republic of China
- Sheng Yang
  - Department of Biostatistics, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing, Jiangsu, People's Republic of China
11
Larson NB, Chen J, Schaid DJ. A review of kernel methods for genetic association studies. Genet Epidemiol 2019; 43:122-136. [PMID: 30604442 DOI: 10.1002/gepi.22180] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 08/27/2018] [Revised: 11/09/2018] [Accepted: 11/26/2018] [Indexed: 12/17/2022]
Abstract
Evaluating the association of multiple genetic variants with a trait of interest by use of kernel-based methods has made a significant impact on how genetic association analyses are conducted. An advantage of kernel methods is that they tend to be robust when the genetic variants have effects that are a mixture of positive and negative effects, as well as when there is a small fraction of causal variants. Another advantage is that kernel methods fit within the framework of mixed models, providing flexible ways to adjust for additional covariates that influence traits. Herein, we review the basic ideas behind the use of kernel methods for genetic association analysis as well as recent methodological advancements for different types of traits, multivariate traits, pedigree data, and longitudinal data. Finally, we discuss opportunities for future research.
Affiliation(s)
- Nicholas B Larson
  - Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
- Jun Chen
  - Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
- Daniel J Schaid
  - Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
12
Budde M, Friedrichs S, Alliey-Rodriguez N, Ament S, Badner JA, Berrettini WH, Bloss CS, Byerley W, Cichon S, Comes AL, Coryell W, Craig DW, Degenhardt F, Edenberg HJ, Foroud T, Forstner AJ, Frank J, Gershon ES, Goes FS, Greenwood TA, Guo Y, Hipolito M, Hood L, Keating BJ, Koller DL, Lawson WB, Liu C, Mahon PB, McInnis MG, McMahon FJ, Meier SM, Mühleisen TW, Murray SS, Nievergelt CM, Nurnberger JI, Nwulia EA, Potash JB, Quarless D, Rice J, Roach JC, Scheftner WA, Schork NJ, Shekhtman T, Shilling PD, Smith EN, Streit F, Strohmaier J, Szelinger S, Treutlein J, Witt SH, Zandi PP, Zhang P, Zöllner S, Bickeböller H, Falkai PG, Kelsoe JR, Nöthen MM, Rietschel M, Schulze TG, Malzahn D. Efficient region-based test strategy uncovers genetic risk factors for functional outcome in bipolar disorder. Eur Neuropsychopharmacol 2019; 29:156-170. [PMID: 30503783 DOI: 10.1016/j.euroneuro.2018.10.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 07/12/2018] [Revised: 10/16/2018] [Accepted: 10/23/2018] [Indexed: 11/21/2022]
Abstract
Genome-wide association studies of case-control status have advanced the understanding of the genetic basis of psychiatric disorders. Further progress may be gained by increasing sample size but also by new analysis strategies that advance the exploitation of existing data, especially for clinically important quantitative phenotypes. The functionally-informed efficient region-based test strategy (FIERS) introduced herein uses prior knowledge on biological function and dependence of genotypes within a powerful statistical framework with improved sensitivity and specificity for detecting consistent genetic effects across studies. As proof of concept, FIERS was used for the first genome-wide single nucleotide polymorphism (SNP)-based investigation on bipolar disorder (BD) that focuses on an important aspect of disease course, the functional outcome. FIERS identified a significantly associated locus on chromosome 15 (hg38: chr15:48965004 - 49464789 bp) with consistent effect strength between two independent studies (GAIN/TGen: European Americans, BOMA: Germans; n = 1592 BD patients in total). Protective and risk haplotypes were found on the most strongly associated SNPs. They contain a CTCF binding site (rs586758); CTCF sites are known to regulate sets of genes within a chromatin domain. The rs586758 - rs2086256 - rs1904317 haplotype is located in the promoter flanking region of the COPS2 gene, close to microRNA4716, and the EID1, SHC4, DTWD1 genes as plausible biological candidates. While implication with BD is novel, COPS2, EID1, and SHC4 are known to be relevant for neuronal differentiation and function and DTWD1 for psychopharmacological side effects. The test strategy FIERS that enabled this discovery is equally applicable for tag SNPs and sequence data.
Affiliation(s)
- Monika Budde
- Institute of Psychiatric Phenomics and Genomics, University Hospital, LMU Munich, Nussbaumstr. 7, Munich 80336, Germany
| | - Stefanie Friedrichs
- Department of Genetic Epidemiology, University Medical Center Göttingen, Georg-August-University, Göttingen 37099, Germany
| | - Ney Alliey-Rodriguez
- Department of Psychiatry and Behavioral Neuroscience, University of Chicago, Chicago, IL 60637, United States
| | - Seth Ament
- Institute for Systems Biology, Seattle, WA 98109, United States
| | - Judith A Badner
- Department of Psychiatry, Rush University Medical Center, Chicago, IL 60612, United States
| | - Wade H Berrettini
- Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, United States
| | - Cinnamon S Bloss
- University of California San Diego, La Jolla, CA 92093, United States
| | - William Byerley
- Department of Psychiatry, University of California at San Francisco, San Francisco, CA 94103, United States
| | - Sven Cichon
- Human Genomics Research Group, Department of Biomedicine, University of Basel, Basel 4031, Switzerland; Institute of Medical Genetics and Pathology, University Hospital Basel, Basel 4031, Switzerland; Institute of Neuroscience and Medicine (INM-1), Research Centre Jülich, Jülich 52425, Germany
| | - Ashley L Comes
- Institute of Psychiatric Phenomics and Genomics, University Hospital, LMU Munich, Nussbaumstr. 7, Munich 80336, Germany; International Max Planck Research School for Translational Psychiatry, Max Planck Institute of Psychiatry, Munich 80804, Germany
| | - William Coryell
- University of Iowa Hospitals and Clinics, Iowa City, IA 52242, United States
| | - David W Craig
- The Translational Genomics Research Institute, Phoenix, AZ 85004, United States
| | - Franziska Degenhardt
- Institute of Human Genetics, School of Medicine & University Hospital Bonn, University of Bonn, Bonn 53127, Germany; Department of Genomics, Life & Brain Center, University of Bonn, Bonn 53127, Germany
| | - Howard J Edenberg
- Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN 46202, United States; Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, United States
| | - Tatiana Foroud
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, United States
| | - Andreas J Forstner
- Institute of Human Genetics, School of Medicine & University Hospital Bonn, University of Bonn, Bonn 53127, Germany; Department of Genomics, Life & Brain Center, University of Bonn, Bonn 53127, Germany; Human Genomics Research Group, Department of Biomedicine, University of Basel, Basel 4031, Switzerland; Department of Psychiatry (UPK), University of Basel, Basel 4012, Switzerland
| | - Josef Frank
- Department of Genetic Epidemiology in Psychiatry, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim 68159, Germany
| | - Elliot S Gershon
- Department of Psychiatry and Behavioral Neuroscience, University of Chicago, Chicago, IL 60637, United States
| | - Fernando S Goes
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21287, United States
| | - Tiffany A Greenwood
- Department of Psychiatry, University of California San Diego, San Diego, CA 92093, United States
| | - Yiran Guo
- Center for Applied Genomics, Children's Hospital of Philadelphia, Abramson Research Center, Philadelphia, PA 19104, United States; Beijing Genomics Institute at Shenzhen, Shenzhen 518083, China
| | - Maria Hipolito
- Department of Psychiatry and Behavioral Sciences, Howard University Hospital, Washington, DC 20060, United States
| | - Leroy Hood
- Institute for Systems Biology, Seattle, WA 98109, United States
| | - Brendan J Keating
- Cardiovascular Institute, University of Pennsylvania School of Medicine, Philadelphia, PA 19104-5159, United States; Institute for Translational Medicine and Therapeutics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104-5158, United States
| | - Daniel L Koller
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, United States
| | - William B Lawson
- Dell Medical School, University of Texas at Austin, Austin, TX 78723, United States
| | - Chunyu Liu
- SUNY Upstate Medical University, Syracuse, NY 13210, United States
| | - Pamela B Mahon
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21287, United States
| | - Melvin G McInnis
- Department of Psychiatry, University of Michigan, Ann Arbor, MI 48105, United States
| | - Francis J McMahon
- U.S. Department of Health & Human Services, Intramural Research Program, National Institute of Mental Health, National Institutes of Health, Bethesda, MD 20894, United States
| | - Sandra M Meier
- Department of Genetic Epidemiology in Psychiatry, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim 68159, Germany; National Centre for Register-Based Research, Aarhus University, Aarhus V 8210, Denmark
| | - Thomas W Mühleisen
- Institute of Neuroscience and Medicine (INM-1), Research Centre Jülich, Jülich 52425, Germany; Human Genomics Research Group, Department of Biomedicine, University of Basel, Basel 4031, Switzerland
| | - Sarah S Murray
- Scripps Genomic Medicine & The Scripps Translational Sciences Institute (STSI), La Jolla, CA 92037, United States; Department of Pathology, University of California San Diego, La Jolla, CA 92093, United States
| | - Caroline M Nievergelt
- Department of Psychiatry, University of California San Diego, San Diego, CA 92093, United States
| | - John I Nurnberger
- Department of Psychiatry, Indiana University School of Medicine, Indianapolis, IN 46202, United States
| | - Evaristus A Nwulia
- Department of Psychiatry and Behavioral Sciences, Howard University Hospital, Washington, DC 20060, United States
| | - James B Potash
- Department of Psychiatry, Carver College of Medicine, University of Iowa School of Medicine, Iowa City, IA 52242, United States
| | - Danjuma Quarless
- J. Craig Venter Institute, La Jolla, CA 92037, United States; University of California San Diego, La Jolla, CA 92093, United States
| | - John Rice
- Department of Psychiatry, Washington University School of Medicine in St. Louis, St. Louis, MO 63110, United States
| | - Jared C Roach
- Institute for Systems Biology, Seattle, WA 98109, United States
| | | | - Nicholas J Schork
- J. Craig Venter Institute, La Jolla, CA 92037, United States; The Translational Genomics Research Institute, Phoenix, AZ 85004, United States; University of California San Diego, La Jolla, CA 92093, United States
| | - Tatyana Shekhtman
- Department of Psychiatry, University of California San Diego, San Diego, CA 92093, United States
| | - Paul D Shilling
- Department of Psychiatry, University of California San Diego, San Diego, CA 92093, United States
| | - Erin N Smith
- Scripps Genomic Medicine & The Scripps Translational Sciences Institute (STSI), La Jolla, CA 92037, United States; Department of Pediatrics and Rady's Children's Hospital, School of Medicine, University of California San Diego, La Jolla, CA 92037, United States
| | - Fabian Streit
- Department of Genetic Epidemiology in Psychiatry, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim 68159, Germany
| | - Jana Strohmaier
- Department of Genetic Epidemiology in Psychiatry, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim 68159, Germany
| | - Szabolcs Szelinger
- The Translational Genomics Research Institute, Phoenix, AZ 85004, United States
| | - Jens Treutlein
- Department of Genetic Epidemiology in Psychiatry, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim 68159, Germany
| | - Stephanie H Witt
- Department of Genetic Epidemiology in Psychiatry, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim 68159, Germany
| | - Peter P Zandi
- Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, United States
| | - Peng Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, United States
| | - Sebastian Zöllner
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, United States; Department of Psychiatry, University of Michigan, Ann Arbor, MI 48105, United States
| | - Heike Bickeböller
- Department of Genetic Epidemiology, University Medical Center Göttingen, Georg-August-University, Göttingen 37099, Germany
| | - Peter G Falkai
- Department of Psychiatry and Psychotherapy, University Hospital, LMU Munich, Munich 80336, Germany
| | - John R Kelsoe
- Department of Psychiatry, University of California San Diego, San Diego, CA 92093, United States
| | - Markus M Nöthen
- Institute of Human Genetics, School of Medicine & University Hospital Bonn, University of Bonn, Bonn 53127, Germany; Department of Genomics, Life & Brain Center, University of Bonn, Bonn 53127, Germany
| | - Marcella Rietschel
- Department of Genetic Epidemiology in Psychiatry, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim 68159, Germany
| | - Thomas G Schulze
- Institute of Psychiatric Phenomics and Genomics, University Hospital, LMU Munich, Nussbaumstr. 7, Munich 80336, Germany; Department of Genetic Epidemiology in Psychiatry, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim 68159, Germany; Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21287, United States; U.S. Department of Health & Human Services, Intramural Research Program, National Institute of Mental Health, National Institutes of Health, Bethesda, MD 20894, United States.
| | - Dörthe Malzahn
- Department of Genetic Epidemiology, University Medical Center Göttingen, Georg-August-University, Göttingen 37099, Germany.
|
13
|
Yang H, Cao H, He T, Wang T, Cui Y. Multilevel heterogeneous omics data integration with kernel fusion. Brief Bioinform 2018; 21:156-170. [PMID: 30496340 DOI: 10.1093/bib/bby115] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 04/15/2018] [Revised: 10/25/2018] [Accepted: 10/26/2018] [Indexed: 01/26/2023] Open
Abstract
High-throughput omics data are now generated almost without limit. It is becoming increasingly important to integrate different omics data types to disentangle the molecular machinery of complex diseases, in the hope of better disease prevention and treatment. Since the relationships among different omics data features are typically unknown, a supervised learning model that assumes a particular distribution with a specific structure will not capture the underlying complex relationship between multiple features and a disease phenotype. In this work, we briefly reviewed methods for kernel fusion (KF) based on support vector machine and kernel partial least squares (KPLS) algorithms. We then proposed a fused KPLS (fKPLS) model for disease classification and prediction with multilevel omics data. The fused kernel can deal with effect heterogeneity, in which different omics data types may contribute differently to the trait of interest, with the aim of improving prediction performance. We proposed to optimize the kernel parameters and kernel weights with the genetic algorithm (GA). Extensive simulations and real data analysis demonstrate that the proposed GA-fKPLS model can substantially improve disease classification performance by integrating multiple omics data types. With properly defined fitness functions during GA optimization, the proposed KF method can be extended to other kernel-based analyses, such as kernel association analysis with common or rare variants.
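As a rough illustration of the kernel fusion idea described in this abstract, the sketch below builds one Gaussian kernel per omics block and combines them with non-negative weights. It is only a minimal sketch: the function names, the choice of Gaussian kernels, the bandwidths, the weights, and the simulated data blocks are all assumptions made for illustration, and the genetic-algorithm tuning and KPLS fitting of the cited paper are not reproduced.

```python
import numpy as np

def gaussian_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix for the rows of X (illustrative choice)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def fuse_kernels(kernels, weights):
    """Weighted sum of kernel matrices; weights are rescaled to sum to 1."""
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    w = w / w.sum()
    return sum(wi * K for wi, K in zip(w, kernels))

# Hypothetical data: 6 subjects measured on two omics blocks.
rng = np.random.default_rng(0)
expr = rng.normal(size=(6, 20))   # stand-in for an expression block
meth = rng.normal(size=(6, 15))   # stand-in for a methylation block

K_fused = fuse_kernels(
    [gaussian_kernel(expr, 0.05), gaussian_kernel(meth, 0.1)],
    weights=[2.0, 1.0],           # assumed weights; the paper tunes these
)
```

Because a non-negative combination of positive semidefinite matrices is itself positive semidefinite, the fused matrix can be passed to any kernel method in place of a single-block kernel.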
Affiliation(s)
- Haitao Yang
- Department of Epidemiology and Health Statistics, School of Public Health, and Hebei Province Key Laboratory of Environment and Human Health, Hebei Medical University, Shijiazhuang, PR China
| | - Hongyan Cao
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China
| | - Tao He
- Department of Mathematics, San Francisco State University, San Francisco, CA, USA
| | - Tong Wang
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China
| | - Yuehua Cui
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China; Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA
|
14
|
He T, Li S, Zhong PS, Cui Y. An optimal kernel-based U-statistic method for quantitative gene-set association analysis. Genet Epidemiol 2018; 43:137-149. [DOI: 10.1002/gepi.22170] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/04/2018] [Revised: 08/19/2018] [Accepted: 09/26/2018] [Indexed: 11/09/2022]
Affiliation(s)
- Tao He
- Department of Mathematics; San Francisco State University; San Francisco California
| | - Shaoyu Li
- Department of Mathematics and Statistics; University of North Carolina at Charlotte; Charlotte North Carolina
| | - Ping-Shou Zhong
- Department of Mathematics, Statistics, and Computer Science; University of Illinois at Chicago; Chicago Illinois
| | - Yuehua Cui
- Department of Statistics & Probability; Michigan State University; East Lansing Michigan
- School of Public Health, Zhengzhou University; Zhengzhou China
|
15
|
Yasmeen S, Burger P, Friedrichs S, Papiol S, Bickeböller H. Relating drug response to epigenetic and genetic markers using a region-based kernel score test. BMC Proc 2018; 12:47. [PMID: 30275895 PMCID: PMC6157113 DOI: 10.1186/s12919-018-0154-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022] Open
|
16
|
|
17
|
Reexamining Dis/Similarity-Based Tests for Rare-Variant Association with Case-Control Samples. Genetics 2018; 209:105-113. [PMID: 29545466 DOI: 10.1534/genetics.118.300769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2018] [Accepted: 03/02/2018] [Indexed: 11/18/2022] Open
Abstract
A properly designed distance-based measure can capture informative genetic differences among individuals with different phenotypes and can be used to detect variants responsible for the phenotypes. To detect associated variants, various tests have been designed to contrast genetic dissimilarity or similarity scores of certain subject groups in different ways, among which the most widely used strategy is to quantify the difference between the within-group genetic dissimilarity/similarity (i.e., case-case and control-control similarities) and the between-group dissimilarity/similarity (i.e., case-control similarities). While it has been noted that for common variants, the within-group and the between-group measures should all be included; in this work, we show that for rare variants, comparison based on the two within-group measures can more effectively quantify the genetic difference between cases and controls. The between-group measure tends to overlap with one of the two within-group measures for rare variants, although such overlap is not present for common variants. Consequently, a dissimilarity or similarity test that includes the between-group information tends to attenuate the association signals and leads to power loss. Based on these findings, we propose a dissimilarity test that compares the degree of SNP dissimilarity within cases to that within controls to better characterize the difference between two disease phenotypes. We provide the statistical properties, asymptotic distribution, and computation details for a small sample size of the proposed test. We use simulated and real sequence data to assess the performance of the proposed test, comparing it with other rare-variant methods including those similarity-based tests that use both within-group and between-group information. As similarity-based approaches serve as one of the dominating approaches in rare-variant analysis, our results provide some insight for the effective detection of rare variants.
|
18
|
Randolph TW, Zhao S, Copeland W, Hullar M, Shojaie A. KERNEL-PENALIZED REGRESSION FOR ANALYSIS OF MICROBIOME DATA. Ann Appl Stat 2018; 12:540-566. [PMID: 30224943 PMCID: PMC6138053 DOI: 10.1214/17-aoas1102] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Indexed: 11/19/2022]
Abstract
The analysis of human microbiome data is often based on dimension-reduced graphical displays and clusterings derived from vectors of microbial abundances in each sample. Common to these ordination methods is the use of biologically motivated definitions of similarity. Principal coordinate analysis, in particular, is often performed using ecologically defined distances, allowing analyses to incorporate context-dependent, non-Euclidean structure. In this paper, we go beyond dimension-reduced ordination methods and describe a framework of high-dimensional regression models that extends these distance-based methods. In particular, we use kernel-based methods to show how to incorporate a variety of extrinsic information, such as phylogeny, into penalized regression models that estimate taxon-specific associations with a phenotype or clinical outcome. Further, we show how this regression framework can be used to address the compositional nature of multivariate predictors comprised of relative abundances; that is, vectors whose entries sum to a constant. We illustrate this approach with several simulations using data from two recent studies on gut and vaginal microbiomes. We conclude with an application to our own data, where we also incorporate a significance test for the estimated coefficients that represent associations between microbial abundance and percent fat.
|
19
|
Zhu B, Song N, Shen R, Arora A, Machiela MJ, Song L, Landi MT, Ghosh D, Chatterjee N, Baladandayuthapani V, Zhao H. Integrating Clinical and Multiple Omics Data for Prognostic Assessment across Human Cancers. Sci Rep 2017; 7:16954. [PMID: 29209073 PMCID: PMC5717223 DOI: 10.1038/s41598-017-17031-8] [Citation(s) in RCA: 60] [Impact Index Per Article: 8.6] [Received: 08/17/2017] [Accepted: 11/20/2017] [Indexed: 02/06/2023] Open
Abstract
Multiple omic profiles have been generated for many cancer types; however, comprehensive assessment of their prognostic values across cancers is limited. We conducted a pan-cancer prognostic assessment and presented a multi-omic kernel machine learning method to systematically quantify the prognostic values of high-throughput genomic, epigenomic, and transcriptomic profiles individually, integratively, and in combination with clinical factors for 3,382 samples across 14 cancer types. We found that the prognostic performance varied substantially across cancer types. mRNA and miRNA expression profile frequently performed the best, followed by DNA methylation profile. Germline susceptibility variants displayed low prognostic performance consistently across cancer types. The integration of omic profiles with clinical variables can lead to substantially improved prognostic performance over the use of clinical variables alone in half of cancer types examined. Moreover, we showed that the kernel machine learning method consistently outperformed existing prognostic signatures, suggesting that including a large number of omic biomarkers may provide substantial improvement in prognostic assessment. Our study provides a comprehensive portrait of omic architecture for tumor prognosis across cancers, and highlights the prognostic value of genome-wide omic biomarker aggregation, which may facilitate refined prognostic assessment in the era of precision oncology.
Affiliation(s)
- Bin Zhu
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institute of Health, Bethesda, MD, 20892, USA.
| | - Nan Song
- NSABP Foundation, Pittsburgh, PA, 15212, USA
| | - Ronglai Shen
- Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY, 10021, USA
| | - Arshi Arora
- Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY, 10021, USA
| | - Mitchell J Machiela
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institute of Health, Bethesda, MD, 20892, USA
| | - Lei Song
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institute of Health, Bethesda, MD, 20892, USA
| | - Maria Teresa Landi
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institute of Health, Bethesda, MD, 20892, USA
| | - Debashis Ghosh
- Department of Biostatistics and Informatics, University of Colorado Denver, Aurora, CO, 80045, USA
| | - Nilanjan Chatterjee
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, 21205, USA; Department of Oncology, School of Medicine, Johns Hopkins University, Baltimore, MD, 21205, USA
| | - Veera Baladandayuthapani
- Department of Biostatistics, The University of Texas M. D. Anderson Cancer Center, Houston, TX, 77230, USA
| | - Hongyu Zhao
- Department of Biostatistics, Yale School of Public Health, New Haven, CT, 06520, USA
|
20
|
Xu Z, Wu C, Pan W. Imaging-wide association study: Integrating imaging endophenotypes in GWAS. Neuroimage 2017; 159:159-169. [PMID: 28736311 PMCID: PMC5671364 DOI: 10.1016/j.neuroimage.2017.07.036] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Received: 05/02/2017] [Revised: 06/22/2017] [Accepted: 07/18/2017] [Indexed: 10/19/2022] Open
Abstract
A new and powerful approach, called imaging-wide association study (IWAS), is proposed to integrate imaging endophenotypes with GWAS to boost statistical power and enhance biological interpretation for GWAS discoveries. IWAS extends the promising transcriptome-wide association study (TWAS) from using gene expression endophenotypes to using imaging and other endophenotypes with a much wider range of possible applications. As illustration, we use gray-matter volumes of several brain regions of interest (ROIs) drawn from the ADNI-1 structural MRI data as imaging endophenotypes, which are then applied to the individual-level GWAS data of ADNI-GO/2 and a large meta-analyzed GWAS summary statistics dataset (based on about 74,000 individuals), uncovering some novel genes significantly associated with Alzheimer's disease (AD). We also compare the performance of IWAS with TWAS, showing much larger numbers of significant AD-associated genes discovered by IWAS, presumably due to the stronger link between brain atrophy and AD than that between gene expression of normal individuals and the risk for AD. The proposed IWAS is general and can be applied to other imaging endophenotypes, and GWAS individual-level or summary association data.
Affiliation(s)
- Zhiyuan Xu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
| | - Chong Wu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
| | - Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA.
|
21
|
A Powerful Framework for Integrating eQTL and GWAS Summary Data. Genetics 2017; 207:893-902. [PMID: 28893853 DOI: 10.1534/genetics.117.300270] [Citation(s) in RCA: 53] [Impact Index Per Article: 7.6] [Received: 08/02/2017] [Accepted: 09/05/2017] [Indexed: 01/26/2023] Open
Abstract
Two new gene-based association analysis methods, called PrediXcan and TWAS for GWAS individual-level and summary data, respectively, were recently proposed to integrate GWAS with eQTL data, alleviating two common problems in GWAS by boosting statistical power and facilitating biological interpretation of GWAS discoveries. Based on a novel reformulation of PrediXcan and TWAS, we propose a more powerful gene-based association test to integrate single set or multiple sets of eQTL data with GWAS individual-level data or summary statistics. The proposed test was applied to several GWAS datasets, including two lipid summary association datasets based on [Formula: see text] and [Formula: see text] samples, respectively, and uncovered more known or novel trait-associated genes, showcasing much improved performance of our proposed method. The software implementing the proposed method is freely available as an R package.
|
22
|
Islam S, Anand S, Hamid J, Thabane L, Beyene J. Comparing the performance of linear and nonlinear principal components in the context of high-dimensional genomic data integration. Stat Appl Genet Mol Biol 2017; 16:199-216. [PMID: 28727569 DOI: 10.1515/sagmb-2016-0066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Linear principal component analysis (PCA) is a widely used approach to reduce the dimension of gene or miRNA expression data sets. This method relies on the linearity assumption, which often fails to capture the patterns and relationships inherent in the data. Thus, a nonlinear approach such as kernel PCA might be optimal. We develop a copula-based simulation algorithm that takes into account the degree of dependence and nonlinearity observed in these data sets. Using this algorithm, we conduct an extensive simulation to compare the performance of linear and kernel principal component analysis methods for data integration and death classification. We also compare these methods using a real data set with gene and miRNA expression of lung cancer patients. The first few kernel principal components perform poorly compared to the linear principal components in this setting. Reducing dimensions using linear PCA and a logistic regression model for classification seems to be adequate for this purpose. Integrating information from multiple data sets using either of these two approaches leads to improved classification accuracy for the outcome.
|
23
|
Powerful Genetic Association Analysis for Common or Rare Variants with High-Dimensional Structured Traits. Genetics 2017. [PMID: 28642271 DOI: 10.1534/genetics.116.199646] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Indexed: 01/03/2023] Open
Abstract
Many genetic association studies collect a wide range of complex traits. As these traits may be correlated and share a common genetic mechanism, joint analysis can be statistically more powerful and biologically more meaningful. However, most existing tests for multiple traits cannot be used for high-dimensional and possibly structured traits, such as network-structured transcriptomic pathway expressions. To overcome potential limitations, in this article we propose the dual kernel-based association test (DKAT) for testing the association between multiple traits and multiple genetic variants, both common and rare. In DKAT, two individual kernels are used to describe the phenotypic and genotypic similarity, respectively, between pairwise subjects. Using kernels allows for capturing structure while accommodating dimensionality. Then, the association between traits and genetic variants is summarized by a coefficient which measures the association between two kernel matrices. Finally, DKAT evaluates the hypothesis of nonassociation with an analytical P-value calculation without any computationally expensive resampling procedures. By collapsing information in both traits and genetic variants using kernels, the proposed DKAT is shown to have a correct type-I error rate and higher power than other existing methods in both simulation studies and application to a study of genetic regulation of pathway gene expressions.
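The abstract describes summarizing association by a coefficient computed from two kernel matrices. A minimal sketch of one standard coefficient of this kind, an RV-type normalized trace of the product of the centered kernels, is given below. The abstract does not spell out the formula, so the exact statistic, the linear kernels, the function names, and the simulated genotype and trait data are all assumptions for illustration; the analytical P-value calculation is not reproduced.

```python
import numpy as np

def center(K):
    """Double-center a kernel matrix: H K H with H = I - 11'/n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_association(K, L):
    """RV-type coefficient between two kernel matrices (assumed form)."""
    Kc, Lc = center(K), center(L)
    return np.trace(Kc @ Lc) / np.sqrt(np.trace(Kc @ Kc) * np.trace(Lc @ Lc))

# Hypothetical data: 8 subjects, 10 variants, 4 traits.
rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(8, 10)).astype(float)  # genotype dosages 0/1/2
Y = rng.normal(size=(8, 4))                         # trait values

K_geno = G @ G.T     # linear genotype kernel (illustrative choice)
L_trait = Y @ Y.T    # linear trait kernel (illustrative choice)
rho = kernel_association(K_geno, L_trait)
```

For positive semidefinite kernels this coefficient lies between 0 and 1, and it equals 1 when the two centered kernels are proportional.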
|
24
|
Malzahn D, Friedrichs S, Bickeböller H. Comparing strategies for combined testing of rare and common variants in whole sequence and genome-wide genotype data. BMC Proc 2016; 10:269-273. [PMID: 27980648 PMCID: PMC5133495 DOI: 10.1186/s12919-016-0042-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022] Open
Abstract
We used our extension of the kernel score test to family data to analyze real and simulated baseline systolic blood pressure in extended pedigrees. We compared the power for different kernels and for different weightings of genetic markers. Moreover, we compared the power of rare and common markers with 3 strategies for joint testing and on marker panels with different densities. Marker weights had much greater influence on power than the kernel chosen. Inverse minor allele frequency weights often increased power on common markers but could decrease power on rare markers. Furthermore, defining the gene region based on linkage disequilibrium blocks often yielded robust power of joint tests of rare and common markers.
Affiliation(s)
- Dörthe Malzahn
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Humboldtallee 32, 37073 Göttingen, Germany
| | - Stefanie Friedrichs
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Humboldtallee 32, 37073 Göttingen, Germany
| | - Heike Bickeböller
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Humboldtallee 32, 37073 Göttingen, Germany
|
25
|
Abstract
We give a short but detailed review of the methods used to deal with linear mixed models (restricted likelihood, AIREML algorithm, best linear unbiased predictors, etc.), with a few original points. Then we describe three common applications of the linear mixed model in contemporary human genetics: association testing (pathways analysis or rare variants association tests), genomic heritability estimates, and correction for population stratification in genome-wide association studies. We also consider the performance of best linear unbiased predictors for prediction in this context, through a simulation study for rare variants in a short genomic region, and through a short theoretical development for genome-wide data. For each of these applications, we discuss the relevance and the impact of modeling genetic effects as random effects.
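For readers unfamiliar with the quantities named in this abstract, the standard linear mixed model and its best linear unbiased predictor can be written as follows. The notation ($X$, $Z$, $K$, $\tau$, $\sigma^2$) is assumed here for illustration and is not taken from the cited review; the variance components are treated as known, whereas in practice they would be estimated, for example by restricted likelihood.

```latex
\begin{align*}
  y &= X\beta + Zu + \varepsilon,
    & u &\sim \mathcal{N}(0,\ \tau K),
    & \varepsilon &\sim \mathcal{N}(0,\ \sigma^2 I), \\
  V &= \operatorname{Var}(y) = \tau\, Z K Z^{\top} + \sigma^2 I, \\
  \hat{\beta} &= \left(X^{\top} V^{-1} X\right)^{-1} X^{\top} V^{-1} y, \\
  \hat{u} &= \tau\, K Z^{\top} V^{-1} \left(y - X\hat{\beta}\right).
\end{align*}
```

Here $\hat{\beta}$ is the generalized least squares estimate of the fixed effects and $\hat{u}$ is the best linear unbiased predictor of the random genetic effects, with $K$ playing the role of a genetic similarity (kernel) matrix.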
|
26
|
Xu HM, Xu LF, Hou TT, Luo LF, Chen GB, Sun XW, Lou XY. GMDR: Versatile Software for Detecting Gene-Gene and Gene-Environment Interactions Underlying Complex Traits. Curr Genomics 2016; 17:396-402. [PMID: 28479868 PMCID: PMC5320543 DOI: 10.2174/1389202917666160513102612] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.1] [Received: 01/15/2015] [Revised: 04/10/2015] [Accepted: 04/15/2015] [Indexed: 11/22/2022] Open
Abstract
Identification of multifactor gene-gene (G×G) and gene-environment (G×E) interactions underlying complex traits poses one of the great challenges to today’s genetic study. Development of the generalized multifactor dimensionality reduction (GMDR) method provides a practicable solution to problems in detection of interactions. To exploit the opportunities brought by the availability of diverse data, there is high demand for corresponding GMDR software that can handle a breadth of phenotypes, such as continuous, count, dichotomous, polytomous nominal, ordinal, survival and multivariate, and various kinds of study designs, such as unrelated case-control, family-based and pooled unrelated and family samples, and that also allows adjustment for covariates. We developed a versatile GMDR package to implement this series of GMDR analyses for various scenarios (e.g., unified analysis of unrelated and family samples) and large-scale (e.g., genome-wide) data. This package includes other desirable features such as data management and preprocessing. Permutation testing strategies are also built in to evaluate the threshold or empirical p values. In addition, its performance is scalable to the computational resources. The software is available at http://www.soph.uab.edu/ssg/software or http://ibi.zju.edu.cn/software.
Affiliation(s)
- Hai-Ming Xu
- Institute of Bioinformatics and Institute of Crop Science, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, P.R. China.,Research Center for Air Pollution and Health, Zhejiang University, Hangzhou, P.R. China
| | - Li-Feng Xu
- Institute of Computer Application Technology, College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, P.R. China
| | - Ting-Ting Hou
- Institute of Bioinformatics and Institute of Crop Science, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, P.R. China
| | - Lin-Feng Luo
- Institute of Computer Application Technology, College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, P.R. China
| | - Guo-Bo Chen
- Queensland Brain Institute, University of Queensland, St Lucia, Queensland, Australia
| | - Xi-Wei Sun
- Sir Run Run Shaw Hospital and Institute of Translational Medicine, School of Medicine, Zhejiang University, Hangzhou, P.R. China
| | - Xiang-Yang Lou
- Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, Louisiana, USA
|
27
|
Yang H, Li S, Cao H, Zhang C, Cui Y. Predicting disease trait with genomic data: a composite kernel approach. Brief Bioinform 2016; 18:591-601. [DOI: 10.1093/bib/bbw043] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 02/11/2016] [Indexed: 01/17/2023] Open
|
28
|
Friedrichs S, Malzahn D, Pugh EW, Almeida M, Liu XQ, Bailey JN. Filtering genetic variants and placing informative priors based on putative biological function. BMC Genet 2016; 17 Suppl 2:8. [PMID: 26866982 PMCID: PMC4895695 DOI: 10.1186/s12863-015-0313-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022] Open
Abstract
High-density genetic marker data, especially sequence data, imply an immense multiple testing burden. This can be ameliorated by filtering genetic variants, exploiting or accounting for correlations between variants, jointly testing variants, and by incorporating informative priors. Priors can be based on biological knowledge or predicted variant function, or even be used to integrate gene expression or other omics data. Based on Genetic Analysis Workshop (GAW) 19 data, this article discusses diversity and usefulness of functional variant scores provided, for example, by PolyPhen2, SIFT, or RegulomeDB annotations. Incorporating functional scores into variant filters or weights and adjusting the significance level for correlations between variants yielded significant associations with blood pressure traits in a large family study of Mexican Americans (GAW19 data set). Marker rs218966 in gene PHF14 and rs9836027 in MAP4 significantly associated with hypertension; additionally, rare variants in SNUPN significantly associated with systolic blood pressure. Variant weights strongly influenced the power of kernel methods and burden tests. Apart from variant weights in test statistics, prior weights may also be used when combining test statistics or to informatively weight p values while controlling false discovery rate (FDR). Indeed, power improved when gene expression data for FDR-controlled informative weighting of association test p values of genes was used. Finally, approaches exploiting variant correlations included identity-by-descent mapping and the optimal strategy for joint testing rare and common variants, which was observed to depend on linkage disequilibrium structure.
Affiliation(s)
- Stefanie Friedrichs
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Göttingen, Germany.
| | - Dörthe Malzahn
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Göttingen, Germany.
| | - Elizabeth W Pugh
- Center for Inherited Disease Research, Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.
| | - Marcio Almeida
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Brownsville, TX, USA.
| | - Xiao Qing Liu
- Department of Obstetrics, Gynecology, and Reproductive Sciences, Department of Biochemistry and Medical Genetics, Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada.
- Children's Hospital Research Institute of Manitoba, Winnipeg, MB, Canada.
| | - Julia N Bailey
- Department of Epidemiology, Fielding School of Public Health, University of California, Los Angeles, Los Angeles, CA, USA.
- Epilepsy Genetics/Genomics Laboratory, West Los Angeles Veterans Administration, Los Angeles, CA, USA.
|
29
|
Kanagawa M, Nishiyama Y, Gretton A, Fukumizu K. Filtering with State-Observation Examples via Kernel Monte Carlo Filter. Neural Comput 2015; 28:382-444. [PMID: 26654205 DOI: 10.1162/neco_a_00806] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 11/04/2022]
Abstract
This letter addresses the problem of filtering with a state-space model. Standard approaches for filtering assume that a probabilistic model for observations (i.e., the observation model) is given explicitly or at least parametrically. We consider a setting where this assumption is not satisfied; we assume that the knowledge of the observation model is provided only by examples of state-observation pairs. This setting is important and appears when state variables are defined as quantities that are very different from the observations. We propose the kernel Monte Carlo filter, a novel filtering method that is focused on this setting. Our approach is based on the framework of kernel mean embeddings, which enables nonparametric posterior inference using the state-observation examples. The proposed method represents state distributions as weighted samples, propagates these samples by sampling, estimates the state posteriors by kernel Bayes' rule, and resamples by kernel herding. In particular, the sampling and resampling procedures are novel in being expressed using kernel mean embeddings, so we theoretically analyze their behaviors. We reveal the following properties, which are similar to those of corresponding procedures in particle methods: the performance of sampling can degrade if the effective sample size of a weighted sample is small, and resampling improves the sampling performance by increasing the effective sample size. We first demonstrate these theoretical findings by synthetic experiments. Then we show the effectiveness of the proposed filter by artificial and real data experiments, which include vision-based mobile robot localization.
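The effective sample size mentioned in this abstract can be illustrated with a minimal sketch. It uses the familiar particle-filter definition, ESS = (Σw)² / Σw², which is an assumption here rather than the article's exact definition. The resampling step below is plain multinomial resampling, swapped in for the kernel herding procedure the authors describe. The function names and the `threshold` parameter are hypothetical.

```python
import random


def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum(w^2): n for uniform weights, 1 for a point mass."""
    total = sum(weights)
    sum_sq = sum(w * w for w in weights)
    if sum_sq == 0:
        raise ValueError("all weights are zero")
    return (total * total) / sum_sq


def resample_if_degenerate(particles, weights, threshold=0.5, seed=0):
    """Resample when ESS falls below threshold * n.

    Assumption: multinomial resampling with nonnegative weights stands in
    for kernel herding, which is what the article actually uses.
    """
    n = len(particles)
    if effective_sample_size(weights) >= threshold * n:
        # Weighted sample is still informative; leave it unchanged.
        return list(particles), list(weights)
    rng = random.Random(seed)
    drawn = rng.choices(particles, weights=weights, k=n)
    # After resampling every particle carries the same weight.
    return drawn, [1.0 / n] * n


if __name__ == "__main__":
    print(effective_sample_size([0.25, 0.25, 0.25, 0.25]))
    print(resample_if_degenerate(["a", "b", "c"], [1.0, 0.0, 0.0]))
```

A small ESS means a few samples carry nearly all the weight, which is the degradation the abstract refers to; resampling restores equal weights and thereby raises the ESS.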
Affiliation(s)
- Motonobu Kanagawa
- SOKENDAI (Graduate University for Advanced Studies), Tokyo 190-8562, Japan, and Institute of Statistical Mathematics, Tokyo 190-8562, Japan
| | - Yu Nishiyama
- University of Electro-Communications, Tokyo 182-8585, Japan
| | - Arthur Gretton
- Gatsby Computational Neuroscience Unit, University College London, London
| | - Kenji Fukumizu
- SOKENDAI (Graduate University for Advanced Studies), Tokyo 190-8562, Japan, and Institute of Statistical Mathematics, Tokyo 190-8562, Japan
|
30
|
Zhu N, Heinrich V, Dickhaus T, Hecht J, Robinson PN, Mundlos S, Kamphans T, Krawitz PM. Strategies to improve the performance of rare variant association studies by optimizing the selection of controls. Bioinformatics 2015; 31:3577-83. [PMID: 26249812 DOI: 10.1093/bioinformatics/btv457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2015] [Accepted: 07/30/2015] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: When analyzing a case group of patients with ultra-rare disorders the ethnicities are often diverse and the data quality might vary. The population substructure in the case group as well as the heterogeneous data quality can cause substantial inflation of test statistics and result in spurious associations in case-control studies if not properly adjusted for. Existing techniques to correct for confounding effects were especially developed for common variants and are not applicable to rare variants. RESULTS: We analyzed strategies to select suitable controls for cases that are based on similarity metrics that vary in their weighting schemes. We simulated different disease entities on real exome data and show that a similarity-based selection scheme can help to reduce false positive associations and to optimize the performance of the statistical tests. Especially when data quality as well as ethnicities vary a lot in the case group, a matching approach that puts more weight on rare variants shows the best performance. We reanalyzed collections of unrelated patients with Kabuki make-up syndrome, Hyperphosphatasia with Mental Retardation syndrome and Catel-Manzke syndrome for which the disease genes were recently described. We show that rare variant association tests are more sensitive and specific in identifying the disease gene than intersection filters and should thus be considered as a favorable approach in analyzing even small patient cohorts. AVAILABILITY AND IMPLEMENTATION: Datasets used in our analysis are available at ftp://ftp.1000genomes.ebi.ac.uk./vol1/ftp/ CONTACT: peter.krawitz@charite.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Na Zhu
- Institute of Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, 13353 Berlin, Germany
| | - Verena Heinrich
- Institute of Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, 13353 Berlin, Germany
| | - Thorsten Dickhaus
- Institute for Statistics, University of Bremen, 28344 Bremen, Germany
| | - Jochen Hecht
- Berlin-Brandenburg Center for Regenerative Therapies (BCRT), 13353 Berlin, Germany
| | - Peter N Robinson
- Institute of Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, 13353 Berlin, Germany
| | - Stefan Mundlos
- Institute of Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, 13353 Berlin, Germany, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany and
| | | | - Peter M Krawitz
- Institute of Medical Genetics and Human Genetics, Charité Universitätsmedizin Berlin, 13353 Berlin, Germany, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany and
|
31
|
Chen F, He J, Zhang J, Chen GK, Thomas V, Ambrosone CB, Bandera EV, Berndt SI, Bernstein L, Blot WJ, Cai Q, Carpten J, Casey G, Chanock SJ, Cheng I, Chu L, Deming SL, Driver WR, Goodman P, Hayes RB, Hennis AJM, Hsing AW, Hu JJ, Ingles SA, John EM, Kittles RA, Kolb S, Leske MC, Millikan RC, Monroe KR, Murphy A, Nemesure B, Neslund-Dudas C, Nyante S, Ostrander EA, Press MF, Rodriguez-Gil JL, Rybicki BA, Schumacher F, Stanford JL, Signorello LB, Strom SS, Stevens V, Van Den Berg D, Wang Z, Witte JS, Wu SY, Yamamura Y, Zheng W, Ziegler RG, Stram AH, Kolonel LN, Marchand LL, Henderson BE, Haiman CA, Stram DO. Methodological Considerations in Estimation of Phenotype Heritability Using Genome-Wide SNP Data, Illustrated by an Analysis of the Heritability of Height in a Large Sample of African Ancestry Adults. PLoS One 2015; 10:e0131106. [PMID: 26125186 PMCID: PMC4488332 DOI: 10.1371/journal.pone.0131106] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/10/2014] [Accepted: 05/28/2015] [Indexed: 01/02/2023] Open
Abstract
Height has an extremely polygenic pattern of inheritance. Genome-wide association studies (GWAS) have revealed hundreds of common variants that are associated with human height at genome-wide levels of significance. However, only a small fraction of phenotypic variation can be explained by the aggregate of these common variants. In a large study of African-American men and women (n = 14,419), we genotyped and analyzed 966,578 autosomal SNPs across the entire genome using a linear mixed model variance components approach implemented in the program GCTA (Yang et al Nat Genet 2010), and estimated an additive heritability of 44.7% (se: 3.7%) for this phenotype in a sample of evidently unrelated individuals. While this estimated value is similar to that given by Yang et al in their analyses, we remain concerned about two related issues: (1) whether in the complete absence of hidden relatedness, variance components methods have adequate power to estimate heritability when a very large number of SNPs are used in the analysis; and (2) whether estimation of heritability may be biased, in real studies, by low levels of residual hidden relatedness. We addressed the first question in a semi-analytic fashion by directly simulating the distribution of the score statistic for a test of zero heritability with and without low levels of relatedness. The second question was addressed by a very careful comparison of the behavior of estimated heritability for both observed (self-reported) height and simulated phenotypes compared to imputation R2 as a function of the number of SNPs used in the analysis. 
These simulations help to address the important question about whether today's GWAS SNPs will remain useful for imputing causal variants that are discovered using very large sample sizes in future studies of height, or whether the causal variants themselves will need to be genotyped de novo in order to build a prediction model that ultimately captures a large fraction of the variability of height, and by implication other complex phenotypes. Our overall conclusions are that when study sizes are quite large (5,000 or so) the additive heritability estimate for height is not apparently biased upwards using the linear mixed model; however there is evidence in our simulation that a very large number of causal variants (many thousands) each with very small effect on phenotypic variance will need to be discovered to fill the gap between the heritability explained by known versus unknown causal variants. We conclude that today's GWAS data will remain useful in the future for causal variant prediction, but that finding the causal variants that need to be predicted may be extremely laborious.
Affiliation(s)
- Fang Chen
- Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
| | - Jing He
- Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
| | - Jianqi Zhang
- Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
| | - Gary K. Chen
- Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
| | - Venetta Thomas
- Sylvester Comprehensive Cancer Center and Department of Epidemiology and Public Health, University of Miami Miller School of Medicine, Miami, FL, United States of America
| | - Christine B. Ambrosone
- Department of Cancer Prevention and Control, Roswell Park Cancer Institute, Buffalo, NY, United States of America
| | - Elisa V. Bandera
- The Cancer Institute of New Jersey, New Brunswick, NJ, United States of America
| | - Sonja I. Berndt
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States of America
| | - Leslie Bernstein
- Division of Cancer Etiology, Department of Population Science, Beckman Research Institute, City of Hope, CA, United States of America
| | - William J. Blot
- International Epidemiology Institute, Rockville, MD, United States of America
- Division of Epidemiology, Department of Medicine, Vanderbilt Epidemiology Center, Vanderbilt University and the Vanderbilt-Ingram Cancer Center, Nashville, TN, United States of America
| | - Qiuyin Cai
- Division of Epidemiology, Department of Medicine, Vanderbilt Epidemiology Center, Vanderbilt University and the Vanderbilt-Ingram Cancer Center, Nashville, TN, United States of America
| | - John Carpten
- The Translational Genomics Research Institute, Phoenix, AZ, United States of America
| | - Graham Casey
- Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
| | - Stephen J. Chanock
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States of America
| | - Iona Cheng
- Epidemiology Program, Cancer Research Center, University of Hawaii, Honolulu, HI, United States of America
| | - Lisa Chu
- Cancer Prevention Institute of California, Fremont, CA, United States of America
| | - Sandra L. Deming
- Division of Epidemiology, Department of Medicine, Vanderbilt Epidemiology Center, Vanderbilt University and the Vanderbilt-Ingram Cancer Center, Nashville, TN, United States of America
| | - W. Ryan Driver
- Epidemiology Research Program, American Cancer Society, Atlanta, GA, United States of America
| | - Phyllis Goodman
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, United States of America
| | - Richard B. Hayes
- Division of Epidemiology, Department of Environmental Medicine, New York University Langone Medical Center, New York, NY, United States of America
| | - Anselm J. M. Hennis
- Chronic Disease Research Centre and Faculty of Medical Sciences, University of the West Indies, Bridgetown, Barbados
- Department of Preventive Medicine, Stony Brook University, Stony Brook, NY, United States of America
| | - Ann W. Hsing
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States of America
| | - Jennifer J. Hu
- Sylvester Comprehensive Cancer Center and Department of Epidemiology and Public Health, University of Miami Miller School of Medicine, Miami, FL, United States of America
| | - Sue A. Ingles
- Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
| | - Esther M. John
- Cancer Prevention Institute of California, Fremont, CA, United States of America
- Division of Epidemiology, Department of Health Research & Policy, Stanford University School of Medicine and Stanford Cancer Institute, Stanford, CA, United States of America
| | - Rick A. Kittles
- Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States of America
| | - Suzanne Kolb
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, United States of America
| | - M. Cristina Leske
- Department of Preventive Medicine, Stony Brook University, Stony Brook, NY, United States of America
| | - Robert C. Millikan
- Department of Epidemiology, Gillings School of Global Public Health, and Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC, United States of America
| | - Kristine R. Monroe
- Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
| | - Adam Murphy
- Department of Urology, Northwestern University, Chicago, IL, United States of America
| | - Barbara Nemesure
- Department of Preventive Medicine, Stony Brook University, Stony Brook, NY, United States of America
| | - Christine Neslund-Dudas
- Department of Biostatistics and Research Epidemiology, Henry Ford Hospital, Detroit, MI, United States of America
| | - Sarah Nyante
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States of America
| | - Elaine A Ostrander
- Cancer Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, United States of America
| | - Michael F. Press
- Department of Pathology, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
| | - Jorge L. Rodriguez-Gil
- Sylvester Comprehensive Cancer Center and Department of Epidemiology and Public Health, University of Miami Miller School of Medicine, Miami, FL, United States of America
| | - Ben A. Rybicki
- Department of Biostatistics and Research Epidemiology, Henry Ford Hospital, Detroit, MI, United States of America
| | - Fredrick Schumacher
- Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
| | - Janet L. Stanford
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, United States of America
| | - Lisa B. Signorello
- Department of Epidemiology, Harvard School of Public Health, Boston, MA, United States of America
| | - Sara S. Strom
- Department of Epidemiology, The University of Texas M.D. Anderson Cancer Center, Houston, TX, United States of America
| | - Victoria Stevens
- Epidemiology Research Program, American Cancer Society, Atlanta, GA, United States of America
| | - David Van Den Berg
- Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
| | - Zhaoming Wang
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States of America
| | - John S. Witte
- Institute for Human Genetics, Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, United States of America
| | - Suh-Yuh Wu
- Department of Preventive Medicine, Stony Brook University, Stony Brook, NY, United States of America
| | - Yuko Yamamura
- Institute for Human Genetics, Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, United States of America
| | - Wei Zheng
- Division of Epidemiology, Department of Medicine, Vanderbilt Epidemiology Center, Vanderbilt University and the Vanderbilt-Ingram Cancer Center, Nashville, TN, United States of America
| | - Regina G. Ziegler
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States of America
| | - Alexander H. Stram
- Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
| | - Laurence N. Kolonel
- Epidemiology Program, Cancer Research Center, University of Hawaii, Honolulu, HI, United States of America
| | - Loïc Le Marchand
- Epidemiology Program, Cancer Research Center, University of Hawaii, Honolulu, HI, United States of America
| | - Brian E. Henderson
- Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
| | - Christopher A. Haiman
- Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
| | - Daniel O. Stram
- Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, United States of America
|
32
|
Pan W, Chen YM, Wei P. Testing for polygenic effects in genome-wide association studies. Genet Epidemiol 2015; 39:306-16. [PMID: 25847094 DOI: 10.1002/gepi.21899] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 12/22/2014] [Revised: 01/30/2015] [Accepted: 02/23/2015] [Indexed: 12/20/2022]
Abstract
To confirm associations with a large number of single nucleotide polymorphisms (SNPs), each with only a small effect size, as hypothesized in the polygenic theory for schizophrenia, the International Schizophrenia Consortium (2009, Nature 460:748-752) proposed a polygenic risk score (PRS) test and demonstrated its effectiveness when applied to psychiatric disorders. The basic idea of the PRS test is to use one half of the sample to select and up-weight those SNPs more likely to be associated, and then use the other half of the sample to test for aggregated effects of the selected SNPs. Intrigued by the novelty and increasing use of the PRS test, we aimed to evaluate and improve its performance for GWAS data. First, by an analysis of the PRS test, we point out its connection with the Sum test [Chapman and Whittaker, Genet Epidemiol 32:560-566; Pan, Genet Epidemiol 33:497-507]; given the known advantages and disadvantages of the Sum test, this connection motivated the development of several other polygenic tests, some of which may be more powerful than the PRS test under certain situations. Second, and more importantly, to overcome the low statistical efficiency of the data-splitting strategy adopted in the PRS test, we reformulate and thus modify the PRS test, obtaining several adaptive tests, which are closely related to the adaptive sum of powered score (SPU) test studied in the context of rare variant analysis [Pan et al., 2014, Genetics 197:1081-1095]. We use both simulated data and a real GWAS dataset of alcohol dependence to show dramatically improved power of the new tests over the PRS test; due to its superior performance and simplicity, we recommend the whole sample-based adaptive SPU test for polygenic testing. We hope to raise awareness of the limitations of the PRS test and of the potential power gain of the adaptive SPU test.
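The sum of powered score statistic named in this abstract can be sketched as follows. The sketch assumes the usual form from the SPU literature, SPU(γ) = Σ_j U_j^γ with score U_j = Σ_i (y_i − ȳ) x_ij under an intercept-only null model, with γ = 1 giving a Sum-type statistic. It is not code from the article, the function names are hypothetical, and the permutation p-value is a simple stand-in for whatever procedure the authors use.

```python
import random


def score_vector(y, X):
    """Per-SNP score U_j = sum_i (y_i - mean(y)) * x_ij (intercept-only null)."""
    n = len(y)
    y_bar = sum(y) / n
    resid = [yi - y_bar for yi in y]
    n_snps = len(X[0])
    return [sum(resid[i] * X[i][j] for i in range(n)) for j in range(n_snps)]


def spu(U, gamma):
    """Sum of powered scores: SPU(gamma) = sum_j U_j ** gamma."""
    return sum(u ** gamma for u in U)


def spu_permutation_pvalue(y, X, gamma, n_perm=200, seed=0):
    """Two-sided permutation p-value for SPU(gamma), using the whole sample."""
    rng = random.Random(seed)
    observed = abs(spu(score_vector(y, X), gamma))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any genotype-phenotype association
        if abs(spu(score_vector(y_perm, X), gamma)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    y = [0, 0, 1, 1]                      # toy phenotype
    X = [[0, 1], [0, 0], [1, 2], [2, 1]]  # toy genotypes: 4 subjects, 2 SNPs
    U = score_vector(y, X)
    print(U, spu(U, 1), spu(U, 2))
    print(spu_permutation_pvalue(y, X, gamma=2, n_perm=50))
```

Unlike the split-sample PRS test described above, every subject contributes to the statistic here. An adaptive version would combine the p-values across several values of γ, which this sketch leaves out.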
Affiliation(s)
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
|
33
|
Wang Z, Maity A, Hsiao CK, Voora D, Kaddurah-Daouk R, Tzeng JY. Module-based association analysis for omics data with network structure. PLoS One 2015; 10:e0122309. [PMID: 25822417 PMCID: PMC4378989 DOI: 10.1371/journal.pone.0122309] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 04/07/2014] [Accepted: 02/20/2015] [Indexed: 02/06/2023] Open
Abstract
Module-based analysis (MBA) aims to evaluate the effect of a group of biological elements sharing common features, such as SNPs in the same gene or metabolites in the same pathways, and has become an attractive alternative to traditional single bio-element approaches. Because bio-elements regulate and interact with each other as part of a network, incorporating network structure information can more precisely model the biological effects, enhance the ability to detect true associations, and facilitate our understanding of the underlying biological mechanisms. However, most MBA methods ignore the network structure information, which depicts the interaction and regulation relationships among basic functional units in biological systems. We construct the connectivity kernel and the topology kernel to capture the relationship among bio-elements in a module, and use a kernel machine framework to evaluate the joint effect of bio-elements. Our proposed kernel machine approach directly incorporates network structure so as to enhance the study efficiency; it can assess interactions among modules, account for covariates, and is computationally efficient. Through simulation studies and real data application, we demonstrate that the proposed network-based methods can have markedly better power than the approaches ignoring network information under a range of scenarios.
Affiliation(s)
- Zhi Wang
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, 27695, United States of America
| | - Arnab Maity
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27695, United States of America
| | - Chuhsing Kate Hsiao
- Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
| | - Deepak Voora
- Institute for Genome Sciences and Policy, Duke University, Durham, North Carolina, United States of America
| | - Rima Kaddurah-Daouk
- Department of Psychiatry and Behavioral Sciences, Duke University, Durham, North Carolina, United States of America
| | - Jung-Ying Tzeng
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, 27695, United States of America
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27695, United States of America
- Department of Statistics, National Cheng-Kung University, Taiwan, R.O.C
|
34
|
Wang X, Xing EP, Schaid DJ. Kernel methods for large-scale genomic data analysis. Brief Bioinform 2015; 16:183-92. [PMID: 25053743 PMCID: PMC4375394 DOI: 10.1093/bib/bbu024] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 03/31/2014] [Accepted: 05/20/2014] [Indexed: 11/12/2022] Open
Abstract
Machine learning, particularly kernel methods, has been demonstrated as a promising new tool to tackle the challenges imposed by today's explosive data growth in genomics. Kernel methods provide a practical and principled approach to learning how a large number of genetic variants are associated with complex phenotypes, helping to reveal the complexity of the relationship between genetic markers and the outcome of interest. In this review, we highlight the key role these methods may play in modern genomic data processing, especially with regard to integration with classical methods for gene prioritization, prediction and data fusion.
|
35
|
Abstract
The discovery and prioritization of heritable phenotypes is a computational challenge in a variety of settings, including neuroimaging genetics and analyses of the vast phenotypic repositories in electronic health record systems and population-based biobanks. Classical estimates of heritability require twin or pedigree data, which can be costly and difficult to acquire. Genome-wide complex trait analysis is an alternative tool to compute heritability estimates from unrelated individuals, using genome-wide data that are increasingly ubiquitous, but is computationally demanding and becomes difficult to apply in evaluating very large numbers of phenotypes. Here we present a fast and accurate statistical method for high-dimensional heritability analysis using genome-wide SNP data from unrelated individuals, termed massively expedited genome-wide heritability analysis (MEGHA) and accompanying nonparametric sampling techniques that enable flexible inferences for arbitrary statistics of interest. MEGHA produces estimates and significance measures of heritability with several orders of magnitude less computational time than existing methods, making heritability-based prioritization of millions of phenotypes based on data from unrelated individuals tractable for the first time to our knowledge. As a demonstration of application, we conducted heritability analyses on global and local morphometric measurements derived from brain structural MRI scans, using genome-wide SNP data from 1,320 unrelated young healthy adults of non-Hispanic European ancestry. We also computed surface maps of heritability for cortical thickness measures and empirically localized cortical regions where thickness measures were significantly heritable. Our analyses demonstrate the unique capability of MEGHA for large-scale heritability-based screening and high-dimensional heritability profile construction.
|
36
|
Pan W. Relationship between genomic distance-based regression and kernel machine regression for multi-marker association testing. Genet Epidemiol 2011; 35:211-6. [PMID: 21308765 DOI: 10.1002/gepi.20567] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 08/18/2010] [Revised: 11/21/2010] [Accepted: 01/04/2011] [Indexed: 11/10/2022]
Abstract
To detect genetic association with common and complex diseases, two powerful yet quite different multimarker association tests have been proposed, genomic distance-based regression (GDBR) (Wessel and Schork [2006] Am J Hum Genet 79:821–833) and kernel machine regression (KMR) (Kwee et al. [2008] Am J Hum Genet 82:386–397; Wu et al. [2010] Am J Hum Genet 86:929–942). GDBR is based on relating a multimarker similarity metric for a group of subjects to variation in their trait values, while KMR is based on nonparametric estimates of the effects of the multiple markers on the trait through a kernel function or kernel matrix. Since the two approaches are both powerful and general, but appear quite different, it is important to know their specific relationships. In this report, we show that, under the condition that there is no other covariate, there is a striking correspondence between the two approaches for a quantitative or a binary trait: if the same positive semi-definite matrix is used as the centered similarity matrix in GDBR and as the kernel matrix in KMR, the F-test statistic in GDBR and the score test statistic in KMR are equal (up to some ignorable constants). The result is based on the connections of both methods to linear or logistic (random-effects) regression models.
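The correspondence described in this abstract can be checked numerically. The following is a minimal sketch rather than the paper's derivation: it assumes a quantitative trait with no covariates, a centered linear kernel built from simulated genotypes, and the pseudo-F form tr(HGH) / tr((I-H)G(I-H)) with degrees-of-freedom constants omitted. The function names are hypothetical.

```python
import numpy as np

def kmr_score_stat(y, K):
    """Quadratic-form score statistic (y - ybar)' K (y - ybar), constants dropped."""
    yc = y - y.mean()
    return float(yc @ K @ yc)

def gdbr_pseudo_f(y, G):
    """Pseudo-F tr(HGH) / tr((I-H)G(I-H)), degrees-of-freedom factors dropped."""
    yc = y - y.mean()
    H = np.outer(yc, yc) / (yc @ yc)  # projection onto the centered trait
    R = np.eye(len(y)) - H
    return float(np.trace(H @ G @ H) / np.trace(R @ G @ R))

rng = np.random.default_rng(0)
n = 30
Z = rng.integers(0, 3, size=(n, 5)).astype(float)  # simulated genotypes (assumption)
C = np.eye(n) - np.ones((n, n)) / n                # centering matrix
G = C @ (Z @ Z.T) @ C                              # same PSD matrix used in both tests
y = rng.normal(size=n)

Q = kmr_score_stat(y, G)
F = gdbr_pseudo_f(y, G)
yc = y - y.mean()
s = Q / (yc @ yc)
# With tr(G) and the trait sum of squares fixed, F is a monotone function of Q.
print(np.isclose(F, s / (np.trace(G) - s)))
```

Because tr(HGH) reduces to Q divided by the centered trait sum of squares, the two statistics rank data sets identically when the same matrix is used in both roles.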
Affiliation(s)
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455–0392, USA.
| |
|
37
|
Ge T, Nichols TE, Ghosh D, Mormino EC, Smoller JW, Sabuncu MR. A kernel machine method for detecting effects of interaction between multidimensional variable sets: an imaging genetics application. Neuroimage 2015; 109:505-514. [PMID: 25600633 DOI: 10.1016/j.neuroimage.2015.01.029] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 06/11/2014] [Revised: 01/06/2015] [Accepted: 01/09/2015] [Indexed: 11/19/2022] Open
Abstract
Measurements derived from neuroimaging data can serve as markers of disease and/or healthy development, are largely heritable, and have been increasingly utilized as (intermediate) phenotypes in genetic association studies. To date, imaging genetic studies have mostly focused on discovering isolated genetic effects, typically ignoring potential interactions with non-genetic variables such as disease risk factors, environmental exposures, and epigenetic markers. However, identifying significant interaction effects is critical for revealing the true relationship between genetic and phenotypic variables, and shedding light on disease mechanisms. In this paper, we present a general kernel machine based method for detecting effects of the interaction between multidimensional variable sets. This method can model the joint and epistatic effect of a collection of single nucleotide polymorphisms (SNPs), accommodate multiple factors that potentially moderate genetic influences, and test for nonlinear interactions between sets of variables in a flexible framework. As a demonstration of application, we applied the method to the data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) to detect the effects of the interactions between candidate Alzheimer's disease (AD) risk genes and a collection of cardiovascular disease (CVD) risk factors, on hippocampal volume measurements derived from structural brain magnetic resonance imaging (MRI) scans. Our method identified that two genes, CR1 and EPHA1, demonstrate significant interactions with CVD risk factors on hippocampal volume, suggesting that CR1 and EPHA1 may play a role in influencing AD-related neurodegeneration in the presence of CVD risks.
Affiliation(s)
- Tian Ge
- Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital / Harvard Medical School, Charlestown, MA 02129, USA
- Psychiatric and Neurodevelopmental Genetics Unit, Center for Human Genetic Research, Massachusetts General Hospital, Boston, MA 02114, USA
| | - Thomas E Nichols
- Department of Statistics & Warwick Manufacturing Group, The University of Warwick, Coventry CV4 7AL, UK
| | - Debashis Ghosh
- Department of Statistics, The Pennsylvania State University, PA 16802, USA
| | - Elizabeth C Mormino
- Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
| | - Jordan W Smoller
- Psychiatric and Neurodevelopmental Genetics Unit, Center for Human Genetic Research, Massachusetts General Hospital, Boston, MA 02114, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02138, USA
| | - Mert R Sabuncu
- Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital / Harvard Medical School, Charlestown, MA 02129, USA
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| |
|
38
|
Assessing gene-environment interactions for common and rare variants with binary traits using gene-trait similarity regression. Genetics 2015; 199:695-710. [PMID: 25585620 DOI: 10.1534/genetics.114.171686] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Indexed: 11/18/2022] Open
Abstract
Accounting for gene-environment (G×E) interactions in complex trait association studies can facilitate our understanding of genetic heterogeneity under different environmental exposures, improve the ability to discover susceptible genes that exhibit little marginal effect, provide insight into the biological mechanisms of complex diseases, help to identify high-risk subgroups in the population, and uncover hidden heritability. However, significant G×E interactions can be difficult to find. The sample sizes required for sufficient power to detect association are much larger than those needed for genetic main effects, and interactions are sensitive to misspecification of the main-effects model. These issues are exacerbated when working with binary phenotypes and rare variants, which bear less information on association. In this work, we present a similarity-based regression method for evaluating G×E interactions for rare variants with binary traits. The proposed model aggregates the genetic and G×E information across markers, using genetic similarity, thus increasing the ability to detect G×E signals. The model has a random effects interpretation, which leads to robustness against main-effect misspecifications when evaluating G×E interactions. We construct score tests to examine G×E interactions and a computationally efficient EM algorithm to estimate the nuisance variance components. Using simulations and data applications, we show that the proposed method is a flexible and powerful tool to study the G×E effect in common or rare variant studies with binary traits.
|
39
|
Family L, Bensen JT, Troester MA, Wu MC, Anders CK, Olshan AF. Single-nucleotide polymorphisms in DNA bypass polymerase genes and association with breast cancer and breast cancer subtypes among African Americans and Whites. Breast Cancer Res Treat 2015; 149:181-90. [PMID: 25417172 PMCID: PMC4498665 DOI: 10.1007/s10549-014-3203-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 08/20/2014] [Accepted: 11/09/2014] [Indexed: 01/18/2023]
Abstract
DNA damage recognition and repair is a complex system of genes focused on maintaining genomic stability. Recently, there has been a focus on how breast cancer susceptibility relates to genetic variation in the DNA bypass polymerases pathway. Race-stratified and subtype-specific logistic regression models were used to estimate odds ratios (ORs) and 95 % confidence intervals (CIs) for the association between 22 single-nucleotide polymorphisms (SNPs) in seven bypass polymerase genes and breast cancer risk in the Carolina Breast Cancer Study, a population-based, case-control study (1,972 cases and 1,776 controls). We used SNP-set kernel association test (SKAT) to evaluate the multi-gene, multi-locus (combined) SNP effects within bypass polymerase genes. We found similar ORs for breast cancer with three POLQ SNPs (rs487848 AG/AA vs. GG; OR = 1.31, 95 % CI 1.03-1.68 for Whites and OR = 1.22, 95 % CI 1.00-1.49 for African Americans), (rs532411 CT/TT vs. CC; OR = 1.31, 95 % CI 1.02-1.66 for Whites and OR = 1.22, 95 % CI 1.00-1.48 for African Americans), and (rs3218634 CG/CC vs. GG; OR = 1.29, 95 % CI 1.02-1.65 for Whites). These three SNPs are in high linkage disequilibrium in both races. Tumor subtype analysis showed the same SNPs to be associated with increased risk of Luminal breast cancer. SKAT analysis showed no significant combined SNP effects. These results suggest that variants in the POLQ gene may be associated with the risk of Luminal breast cancer.
Affiliation(s)
- Leila Family
- Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC, USA
| | | | | | | | | | | |
|
40
|
Wang X, Epstein MP, Tzeng JY. Analysis of gene-gene interactions using gene-trait similarity regression. Hum Hered 2014; 78:17-26. [PMID: 24969398 DOI: 10.1159/000360161] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 11/20/2013] [Accepted: 01/30/2014] [Indexed: 12/14/2022] Open
Abstract
OBJECTIVE: Gene-gene interactions (G×G) are important to study because of their extensiveness in biological systems and their potential in explaining missing heritability of complex traits. In this work, we propose a new similarity-based test to assess G×G at the gene level, which permits the study of epistasis at biologically functional units with amplified interaction signals. METHODS: Under the framework of gene-trait similarity regression (SimReg), we propose a gene-based test for detecting G×G. SimReg uses a regression model to correlate trait similarity with genotypic similarity across a gene. Unlike existing gene-level methods based on leading principal components (PCs), SimReg summarizes all information on genotypic variation within a gene and can be used to assess the joint/interactive effects of two genes as well as the effect of one gene conditional on another. RESULTS: Using simulations and a real data application to the Warfarin study, we show that the SimReg G×G tests have satisfactory power and robustness under different genetic architectures when compared to existing gene-based interaction tests such as PC analysis or partial least squares. A genome-wide association study with approximately 20,000 genes may be completed on a parallel computing system in 2 weeks.
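The core idea of regressing trait similarity on genotypic similarity can be sketched as follows. This is an illustration under stated assumptions, not the authors' G×G test: trait similarity is taken to be the cross-product of mean-centered traits, the genotypic similarity matrix S is supplied by the caller, and a single no-intercept least-squares slope is fitted over all distinct pairs of subjects. The function name simreg_slope is hypothetical.

```python
import numpy as np

def simreg_slope(y, S):
    """No-intercept least-squares slope of trait similarity on genotypic similarity.

    Trait similarity for a pair (i, j) is assumed to be (y_i - ybar)(y_j - ybar);
    only distinct pairs i < j enter the regression.
    """
    yc = y - y.mean()
    Z = np.outer(yc, yc)                  # pairwise trait similarity
    iu = np.triu_indices(len(y), k=1)     # indices of distinct pairs
    z, s = Z[iu], S[iu]
    return float(s @ z / (s @ s))

# Simulated example: the first marker influences the trait (assumption).
rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(20, 4)).astype(float)
S = geno @ geno.T                          # simple cross-product similarity
y = geno[:, 0] + rng.normal(size=20)
print(simreg_slope(y, S))
```

A slope near zero would suggest that subjects with similar genotypes do not tend to have similar traits; a formal test would need the variance of the slope, which this sketch does not compute.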
Affiliation(s)
- Xin Wang
- Bioinformatics Research Center, North Carolina State University, Raleigh, N.C., USA
| | | | | |
|
41
|
Abstract
The kernel score statistic is a global covariance component test over a set of genetic markers. It provides a flexible modeling framework and does not collapse marker information. We generalize the kernel score statistic to allow for familial dependencies and to adjust for random confounder effects. With this extension, we adjust our analysis of real and simulated baseline systolic blood pressure for polygenic familial background. We find that the kernel score test gains appreciably in power through the use of sequencing compared to tag-single-nucleotide polymorphisms for very rare single nucleotide polymorphisms with <1% minor allele frequency.
Affiliation(s)
- Dörthe Malzahn
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Humboldtallee 32, 37073 Göttingen, Germany
| | - Stefanie Friedrichs
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Humboldtallee 32, 37073 Göttingen, Germany
| | - Albert Rosenberger
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Humboldtallee 32, 37073 Göttingen, Germany
| | - Heike Bickeböller
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Humboldtallee 32, 37073 Göttingen, Germany
| |
|
42
|
Tzeng JY, Lu W, Hsu FC. Gene-level pharmacogenetic analysis on survival outcomes using gene-trait similarity regression. Ann Appl Stat 2014; 8:1232-1255. [PMID: 25018788 DOI: 10.1214/14-aoas735] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 11/19/2022]
Abstract
Gene/pathway-based methods are drawing significant attention due to their usefulness in detecting rare and common variants that affect disease susceptibility. The biological mechanism of drug responses indicates that a gene-based analysis has even greater potential in pharmacogenetics. Motivated by a study from the Vitamin Intervention for Stroke Prevention (VISP) trial, we develop a gene-trait similarity regression for survival analysis to assess the effect of a gene or pathway on time-to-event outcomes. The similarity regression has a general framework that covers a range of survival models, such as the proportional hazards model and the proportional odds model. The inference procedure developed under the proportional hazards model is robust against model misspecification. We derive the equivalence between the similarity survival regression and a random effects model, which further unifies the current variance-component based methods. We demonstrate the effectiveness of the proposed method through simulation studies. In addition, we apply the method to the VISP trial data to identify the genes that exhibit an association with the risk of a recurrent stroke. The TCN2 gene was found to be associated with recurrent stroke risk in the low-dose arm. This gene may impact recurrent stroke risk in response to cofactor therapy.
Affiliation(s)
- Jung-Ying Tzeng
- North Carolina State University ; National Cheng-Kung University
| | | | | |
|
43
|
King CR, Nicolae DL. GWAS to Sequencing: Divergence in Study Design and Analysis. Genes (Basel) 2014; 5:460-76. [PMID: 24879455 PMCID: PMC4094943 DOI: 10.3390/genes5020460] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 12/27/2013] [Revised: 05/13/2014] [Accepted: 05/15/2014] [Indexed: 12/03/2022] Open
Abstract
The success of genome-wide association studies (GWAS) in uncovering genetic risk factors for complex traits has generated great promise for the complete data generated by sequencing. The bumpy transition from GWAS to whole-exome or whole-genome association studies (WGAS) based on sequencing investigations has highlighted important differences in analysis and interpretation. We show how the loss in power due to the allele frequency spectrum targeted by sequencing is difficult to compensate for with realistic effect sizes and point to study designs that may help. We discuss several issues in interpreting the results, including a special case of the winner's curse. Extrapolation and prediction using rare SNPs is complex, because of the selective ascertainment of SNPs in case-control studies and the low amount of information at each SNP, and naive procedures are biased under the alternative. We also discuss the challenges in tuning gene-based tests and accounting for multiple testing when genes have very different sets of SNPs. The examples we emphasize in this paper highlight the difficult road we must travel for a two-letter switch.
Affiliation(s)
| | - Dan L Nicolae
- Departments of Medicine, Statistics, and Human Genetics, University of Chicago, Chicago, IL 60637, USA.
| |
|
44
|
Vrieze SI, Feng S, Miller MB, Hicks BM, Pankratz N, Abecasis GR, Iacono WG, McGue M. Rare nonsynonymous exonic variants in addiction and behavioral disinhibition. Biol Psychiatry 2014; 75:783-9. [PMID: 24094508 PMCID: PMC3975816 DOI: 10.1016/j.biopsych.2013.08.027] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Received: 05/17/2013] [Revised: 08/02/2013] [Accepted: 08/26/2013] [Indexed: 10/26/2022]
Abstract
BACKGROUND: Substance use is heritable, but few common genetic variants have been associated with these behaviors. Rare nonsynonymous exonic variants can now be efficiently genotyped, allowing exome-wide association tests. We identified and tested 111,592 nonsynonymous exonic variants for association with behavioral disinhibition and the use/misuse of nicotine, alcohol, and illicit drugs. METHODS: Comprehensive genotyping of exonic variation combined with single-variant and gene-based tests of association was conducted in 7181 individuals; 172 candidate addiction genes were evaluated in greater detail. We also evaluated the aggregate effects of nonsynonymous variants on these phenotypes using Genome-wide Complex Trait Analysis. RESULTS: No variant or gene was significantly associated with any phenotype. No association was found for any of the 172 candidate genes, even at reduced significance thresholds. All nonsynonymous variants jointly accounted for 35% of the heritability in illicit drug use and, when combined with common variants from a genome-wide array, accounted for 84% of the heritability. CONCLUSIONS: Rare nonsynonymous variants may be important in the etiology of illicit drug use, but detection of individual variants will require very large samples.
Affiliation(s)
- Scott I Vrieze
- Center for Statistical Genetics (SIV, SF, GRA), Department of Biostatistics, University of Michigan, Ann Arbor, Michigan.
| | - Shuang Feng
- Center for Statistical Genetics (SIV, SF, GRA), Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
| | - Michael B Miller
- Department of Psychology (MBM, WGI, MM), University of Minnesota, Minneapolis, Minnesota
| | - Brian M Hicks
- Department of Psychiatry (BMH), University of Michigan, Ann Arbor, Michigan
| | - Nathan Pankratz
- Department of Laboratory Medicine and Pathology (NP), University of Minnesota, Minneapolis, Minnesota
| | - Gonçalo R Abecasis
- Center for Statistical Genetics (SIV, SF, GRA), Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
| | - William G Iacono
- Department of Psychology (MBM, WGI, MM), University of Minnesota, Minneapolis, Minnesota
| | - Matt McGue
- Department of Psychology (MBM, WGI, MM), University of Minnesota, Minneapolis, Minnesota
| |
|
45
|
Zeng P, Zhao Y, Zhang L, Huang S, Chen F. Rare variants detection with kernel machine learning based on likelihood ratio test. PLoS One 2014; 9:e93355. [PMID: 24675868 PMCID: PMC3968153 DOI: 10.1371/journal.pone.0093355] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 11/16/2013] [Accepted: 03/03/2014] [Indexed: 11/18/2022] Open
Abstract
This paper uses likelihood-based tests to detect rare variants associated with a continuous phenotype under the framework of kernel machine learning. Both the likelihood ratio test (LRT) and the restricted likelihood ratio test (ReLRT) are investigated. The relationship between kernel machine learning and the mixed effects model is discussed. Using the eigenvalue representation of the LRT and ReLRT, their exact finite-sample distributions are obtained by simulation. Numerical studies are performed to evaluate the performance of the proposed approaches in the contexts of the standard mixed effects model and kernel machine learning. The results show that the LRT and ReLRT control the type I error correctly at the given α level. The LRT and ReLRT consistently outperform the SKAT, regardless of the sample size and the proportion of negative causal rare variants, and suffer smaller power reductions than the SKAT when both positive and negative effects of rare variants are present. The LRT and ReLRT performed in the context of kernel machine learning have slightly higher power than those performed in the context of the standard mixed effects model. We use the Genetic Analysis Workshop 17 exome sequencing SNP data as an illustrative example. Some interesting results are observed from the analysis. We conclude with a discussion.
Affiliation(s)
- Ping Zeng
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China
- Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical College, Xuzhou, Jiangsu, China
| | - Yang Zhao
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China
| | - Liwei Zhang
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China
| | - Shuiping Huang
- Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical College, Xuzhou, Jiangsu, China
| | - Feng Chen
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China
| |
|
46
|
Røislien J, Samset E. A non-parametric permutation method for assessing agreement for distance matrix observations. Stat Med 2014; 33:319-29. [PMID: 23946159 DOI: 10.1002/sim.5927] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/05/2011] [Revised: 05/16/2013] [Accepted: 07/08/2013] [Indexed: 11/08/2022]
Abstract
Distance matrix data are occurring ever more frequently in medical research, particularly in fields such as genetics, DNA research, and image analysis. We propose a non-parametric permutation method for assessing agreement when the data under study are distance matrices. We apply agglomerative hierarchical clustering and accompanying dendrograms to visualize the internal structure of the matrix observations. The accompanying test is based on random permutations of the elements within individual matrix observations and the corresponding matrix mean of these permutations. We compare the within-matrix element sum of squares (WMESS) for the observed mean against the WMESS for the permutation means. The methodology is exemplified using simulations and real data from magnetic resonance imaging.
Affiliation(s)
- Jo Røislien
- Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Norway
| | | |
|
47
|
Larson NB, Jenkins GD, Larson MC, Vierkant RA, Sellers TA, Phelan CM, Schildkraut JM, Sutphen R, Pharoah PPD, Gayther SA, Wentzensen N, Goode EL, Fridley BL. Kernel canonical correlation analysis for assessing gene-gene interactions and application to ovarian cancer. Eur J Hum Genet 2014; 22:126-31. [PMID: 23591404 PMCID: PMC3865403 DOI: 10.1038/ejhg.2013.69] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 08/21/2012] [Revised: 01/11/2013] [Accepted: 01/16/2013] [Indexed: 01/24/2023] Open
Abstract
Although single-locus approaches have been widely applied to identify disease-associated single-nucleotide polymorphisms (SNPs), complex diseases are thought to be the product of multiple interactions between loci. This has led to the recent development of statistical methods for detecting statistical interactions between two loci. Canonical correlation analysis (CCA) has previously been proposed to detect gene-gene coassociation. However, this approach is limited to detecting linear relations and can only be applied when the number of observations exceeds the number of SNPs in a gene. This limitation is particularly important for next-generation sequencing, which could yield a large number of novel variants on a limited number of subjects. To overcome these limitations, we propose an approach to detect gene-gene interactions on the basis of a kernelized version of CCA (KCCA). Our simulation studies showed that KCCA controls the Type-I error, and is more powerful than leading gene-based approaches under a disease model with negligible marginal effects. To demonstrate the utility of our approach, we also applied KCCA to assess interactions between 200 genes in the NF-κB pathway in relation to ovarian cancer risk in 3869 cases and 3276 controls. We identified 13 significant gene pairs relevant to ovarian cancer risk (local false discovery rate <0.05). Finally, we discuss the advantages of KCCA in gene-gene interaction analysis and its future role in genetic association studies.
Affiliation(s)
- Nicholas B Larson
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Gregory D Jenkins
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Melissa C Larson
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Robert A Vierkant
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | | | | | | | - Rebecca Sutphen
- Department of Pediatrics, University of South Florida College of Medicine, Tampa, FL, USA
| | | | - Simon A Gayther
- Department of Preventative Medicine, University of Southern California, Los Angeles, CA, USA
| | - Nicolas Wentzensen
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
| | - Ovarian Cancer Association Consortium
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, USA
- Duke Comprehensive Cancer Center, Duke University, Durham, NC, USA
- Department of Pediatrics, University of South Florida College of Medicine, Tampa, FL, USA
- Department of Oncology, University of Cambridge, Cambridge, UK
- Department of Preventative Medicine, University of Southern California, Los Angeles, CA, USA
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
- Department of Biostatistics, University of Kansas Medical Center, Kansas City, KS, USA
| | - Ellen L Goode
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Brooke L Fridley
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Department of Biostatistics, University of Kansas Medical Center, Kansas City, KS, USA
| |
|
48
|
Thomas DC, Yang Z, Yang F. Two-phase and family-based designs for next-generation sequencing studies. Front Genet 2013; 4:276. [PMID: 24379824 PMCID: PMC3861783 DOI: 10.3389/fgene.2013.00276] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 09/28/2013] [Accepted: 11/19/2013] [Indexed: 12/21/2022] Open
Abstract
The cost of next-generation sequencing is now approaching that of early GWAS panels, but is still out of reach for large epidemiologic studies, and the millions of rare variants expected pose challenges for distinguishing causal from non-causal variants. We review two types of designs for sequencing studies: two-phase designs for targeted follow-up of genomewide association studies using unrelated individuals; and family-based designs exploiting co-segregation for prioritizing variants and genes. Two-phase designs subsample subjects for sequencing from a larger case-control study jointly on the basis of their disease and carrier status; the discovered variants are then tested for association in the parent study. The analysis combines the full sequence data from the substudy with the more limited SNP data from the main study. We discuss various methods for selecting this subset of variants and describe the expected yield of true positive associations in the context of an on-going study of second breast cancers following radiotherapy. While the sharing of variants within families means that family-based designs are less efficient for discovery than sequencing unrelated individuals, the ability to exploit co-segregation of variants with disease within families helps distinguish causal from non-causal ones. Furthermore, by enriching for family history, the yield of causal variants can be improved and use of identity-by-descent information improves imputation of genotypes for other family members. We compare the relative efficiency of these designs with those using unrelated individuals for discovering and prioritizing variants or genes for testing association in larger studies. While associations can be tested with single variants, power is low for rare ones. Recent generalizations of burden or kernel tests for gene-level associations to family-based data are appealing. These approaches are illustrated in the context of a family-based study of colorectal cancer.
Affiliation(s)
- Duncan C Thomas
- Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
- Zhao Yang
- Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
- Fan Yang
- Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
|
49
|
Qu L, Guennel T, Marshall SL. Linear score tests for variance components in linear mixed models and applications to genetic association studies. Biometrics 2013; 69:883-92. [PMID: 24328714 DOI: 10.1111/biom.12095] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Received: 09/01/2012] [Revised: 06/01/2013] [Accepted: 07/01/2013] [Indexed: 01/16/2023]
Abstract
Following the rapid development of genome-scale genotyping technologies, genetic association mapping has become a popular tool to detect genomic regions responsible for certain (disease) phenotypes, especially in early-phase pharmacogenomic studies with limited sample size. In response to such applications, a good association test needs to be (1) applicable to a wide range of possible genetic models, including, but not limited to, the presence of gene-by-environment or gene-by-gene interactions and non-linearity of a group of marker effects, (2) accurate in small samples, fast to compute on the genomic scale, and amenable to large scale multiple testing corrections, and (3) reasonably powerful to locate causal genomic regions. The kernel machine method represented in linear mixed models provides a viable solution by transforming the problem into testing the nullity of variance components. In this study, we consider score-based tests by choosing a statistic linear in the score function. When the model under the null hypothesis has only one error variance parameter, our test is exact in finite samples. When the null model has more than one variance parameter, we develop a new moment-based approximation that performs well in simulations. Through simulations and analysis of real data, we demonstrate that the new test possesses most of the aforementioned characteristics, especially when compared to existing quadratic score tests or restricted likelihood ratio tests.
Affiliation(s)
- Long Qu
- Department of Mathematics and Statistics, Wright State University, Dayton, Ohio 45435, U.S.A.
|
50
|
Hoffman GE. Correcting for population structure and kinship using the linear mixed model: theory and extensions. PLoS One 2013; 8:e75707. [PMID: 24204578 PMCID: PMC3810480 DOI: 10.1371/journal.pone.0075707] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Received: 02/11/2013] [Accepted: 08/20/2013] [Indexed: 01/20/2023] Open
Abstract
Population structure and kinship are widespread confounding factors in genome-wide association studies (GWAS). It has been standard practice to include principal components of the genotypes in a regression model in order to account for population structure. More recently, the linear mixed model (LMM) has emerged as a powerful method for simultaneously accounting for population structure and kinship. The statistical theory underlying the differences in empirical performance between modeling principal components as fixed versus random effects has not been thoroughly examined. We undertake an analysis to formalize the relationship between these widely used methods and elucidate the statistical properties of each. Moreover, we introduce a new statistic, effective degrees of freedom, that serves as a metric of model complexity and a novel low rank linear mixed model (LRLMM) to learn the dimensionality of the correction for population structure and kinship, and we assess its performance through simulations. A comparison of the results of LRLMM and a standard LMM analysis applied to GWAS data from the Multi-Ethnic Study of Atherosclerosis (MESA) illustrates how our theoretical results translate into empirical properties of the mixed model. Finally, the analysis demonstrates the ability of the LRLMM to substantially boost the strength of an association for HDL cholesterol in Europeans.
Affiliation(s)
- Gabriel E. Hoffman
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
|