1
Das Adhikari S, Cui Y, Wang J. BayesKAT: Bayesian optimal kernel-based test for genetic association studies reveals joint genetic effects in complex diseases. Brief Bioinform 2024; 25:bbae182. [PMID: 38653490 PMCID: PMC11036342 DOI: 10.1093/bib/bbae182] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Revised: 03/10/2024] [Accepted: 04/05/2024] [Indexed: 04/25/2024] Open
Abstract
Genome-wide association study (GWAS) methods have identified individual single-nucleotide polymorphisms (SNPs) significantly associated with specific phenotypes. Nonetheless, many complex diseases are polygenic and are controlled by multiple genetic variants that are usually non-linearly dependent. These genetic variants have weak marginal effects and remain undetected in GWAS analysis. Kernel-based tests (KBT), which evaluate the joint effect of a group of genetic variants, are therefore critical for complex disease analysis. However, the choice of kernel function in KBT can significantly influence type I error control and power, and selecting the optimal kernel remains a statistically challenging task. The few existing methods suffer from inflated type I errors, limited scalability, inferior power or ambiguous conclusions. Here, we present a new Bayesian framework, BayesKAT (https://github.com/wangjr03/BayesKAT), which overcomes these kernel specification issues by selecting the optimal composite kernel adaptively from the data while testing genetic associations simultaneously. Furthermore, BayesKAT implements a scalable computational strategy to boost its applicability, especially for high-dimensional cases where other methods become less effective. Based on a series of performance comparisons using both simulated and real large-scale genetics data, BayesKAT outperforms the available methods in detecting complex group-level associations and controlling type I errors simultaneously. Applied to a variety of groups of functionally related genetic variants based on biological pathways, co-expression gene modules and protein complexes, BayesKAT deciphers the complex genetic basis and provides mechanistic insights into human diseases.
Affiliation(s)
- Sikta Das Adhikari
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
- Jianrong Wang
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
2
Chen AA, Weinstein SM, Adebimpe A, Gur RC, Gur RE, Merikangas KR, Satterthwaite TD, Shinohara RT, Shou H. Similarity-based multimodal regression. Biostatistics 2023:kxad033. [PMID: 38058018 DOI: 10.1093/biostatistics/kxad033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2022] [Revised: 10/07/2023] [Accepted: 11/06/2023] [Indexed: 12/08/2023] Open
Abstract
To better understand complex human phenotypes, large-scale studies have increasingly collected multiple data modalities across domains such as imaging, mobile health, and physical activity. The properties of each data type often differ substantially and require either separate analyses or extensive processing to obtain comparable features for a combined analysis. Multimodal data fusion enables certain analyses on matrix-valued and vector-valued data, but it generally cannot integrate modalities of different dimensions and data structures. For a single data modality, multivariate distance matrix regression provides a distance-based framework for regression accommodating a wide range of data types. However, no distance-based method exists to handle multiple complementary types of data. We propose a novel distance-based regression model, which we refer to as Similarity-based Multimodal Regression (SiMMR), that enables simultaneous regression of multiple modalities through their distance profiles. We demonstrate through simulation, imaging studies, and longitudinal mobile health analyses that our proposed method can detect associations between clinical variables and multimodal data of differing properties and dimensionalities, even with modest sample sizes. We perform experiments to evaluate several different test statistics and provide recommendations for applying our method across a broad range of scenarios.
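As a rough illustration of the single-modality starting point named in the abstract, multivariate distance matrix regression, the sketch below computes the standard pseudo-F statistic from one distance matrix and a permutation P-value for it. It is not the SiMMR procedure itself, which combines several modalities. The function names (`gower_center`, `pseudo_f`, `permutation_p_value`), the use of NumPy, and the default of 999 permutations are assumptions made for illustration only.

```python
import numpy as np


def gower_center(D):
    """Gower-centre a distance matrix: G = -1/2 * C * D^2 * C, with C = I - 11'/n."""
    n = D.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * C @ (D ** 2) @ C


def pseudo_f(D, X):
    """Pseudo-F statistic for regressing a distance matrix D on predictors X.

    X holds one row per subject (1-D or 2-D); an intercept column is added here.
    """
    n = D.shape[0]
    X1 = np.column_stack([np.ones(n), X])   # design matrix with intercept
    m = X1.shape[1]                          # number of columns, intercept included
    H = X1 @ np.linalg.pinv(X1)              # hat (projection) matrix
    R = np.eye(n) - H                        # residual projection
    G = gower_center(D)
    explained = np.trace(H @ G @ H) / (m - 1)
    residual = np.trace(R @ G @ R) / (n - m)
    return explained / residual


def permutation_p_value(D, X, n_perm=999, seed=0):
    """Permutation P-value: shuffle the rows of X and recompute the pseudo-F."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    observed = pseudo_f(D, X)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(D.shape[0])
        if pseudo_f(D, X[perm]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

With a Euclidean distance on a single outcome variable, this pseudo-F reduces to the classical regression F statistic, which gives a convenient sanity check before moving to non-Euclidean distances.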
Affiliation(s)
- Andrew A Chen
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC 29425, USA
- Sarah M Weinstein
  - Department of Epidemiology and Biostatistics, Temple University College of Public Health, Philadelphia, PA 19122, USA
- Azeez Adebimpe
  - Penn Lifespan Informatics & Neuroimaging Center, Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, USA
- Ruben C Gur
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Lifespan Brain Institute Penn Medicine and CHOP, University of Pennsylvania, Philadelphia, PA 19104, USA
- Raquel E Gur
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Lifespan Brain Institute Penn Medicine and CHOP, University of Pennsylvania, Philadelphia, PA 19104, USA
- Kathleen R Merikangas
  - Genetic Epidemiology Research Branch, Intramural Research Program, National Institute of Mental Health, Bethesda, MD 20892, USA
- Theodore D Satterthwaite
  - Penn Lifespan Informatics & Neuroimaging Center, Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, USA
- Russell T Shinohara
  - Penn Statistics in Imaging and Visualization Center, Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Center for Biomedical Image Computing and Analytics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Haochang Shou
  - Penn Statistics in Imaging and Visualization Center, Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Center for Biomedical Image Computing and Analytics, University of Pennsylvania, Philadelphia, PA 19104, USA
3
Carpenter CM, Gillenwater L, Bowler R, Kechris K, Ghosh D. TreeKernel: interpretable kernel machine tests for interactions between -omics and clinical predictors with applications to metabolomics and COPD phenotypes. BMC Bioinformatics 2023; 24:398. [PMID: 37880571 PMCID: PMC10601228 DOI: 10.1186/s12859-023-05459-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2023] [Accepted: 08/30/2023] [Indexed: 10/27/2023] Open
Abstract
BACKGROUND In this paper, we are interested in interactions between a high-dimensional -omics dataset and clinical covariates. The goal is to evaluate the relationship between a phenotype of interest and a high-dimensional omics pathway, where the effect of the omics data depends on subjects' clinical covariates (age, sex, smoking status, etc.). For instance, metabolic pathways can vary greatly between sexes, which may also change the relationship between certain metabolic pathways and a clinical phenotype of interest. We propose partitioning the clinical covariate space and performing a kernel association test within those partitions. To illustrate this idea, we focus on hierarchical partitions of the clinical covariate space and kernel tests on metabolic pathways. RESULTS We see that our proposed method outperforms competing methods in most simulation scenarios. It can identify different relationships among clinical groups with higher power in most scenarios while maintaining a proper Type I error rate. The simulation studies also show robustness to the grouping structure within the clinical space. We also apply the method to the COPDGene study and find several clinically meaningful interactions between metabolic pathways, the clinical space, and lung function. CONCLUSION TreeKernel provides a simple and interpretable process for testing for relationships between high-dimensional omics data and clinical outcomes in the presence of interactions within clinical cohorts. The method is broadly applicable to many studies.
Affiliation(s)
- Charlie M Carpenter
  - Department of Biostatistics and Informatics, University of Colorado Denver, Anschutz Medical Campus, Denver, CO, USA
- Lucas Gillenwater
  - Computational Bioscience Program, University of Colorado Denver, Anschutz Medical Campus, Denver, CO, USA
- Russell Bowler
  - Department of Medicine, National Jewish Health, Denver, USA
  - University of Colorado Denver, Anschutz Medical Campus, Denver, CO, USA
- Katerina Kechris
  - Department of Biostatistics and Informatics, University of Colorado Denver, Anschutz Medical Campus, Denver, CO, USA
- Debashis Ghosh
  - Department of Biostatistics and Informatics, University of Colorado Denver, Anschutz Medical Campus, Denver, CO, USA
4
Feltman NR, Burkness EC, Ebbenga D, Hutchison WD, Smanski MJ. HUGE pipeline to measure temporal genetic variation in Drosophila suzukii populations for genetic biocontrol applications. Frontiers in Insect Science 2022; 2:981974. [PMID: 38468784 PMCID: PMC10926429 DOI: 10.3389/finsc.2022.981974] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/29/2022] [Accepted: 08/22/2022] [Indexed: 03/13/2024]
Abstract
Understanding the fine-scale genome sequence diversity that exists within natural populations is important for developing models of species migration, temporal stability, and range expansion. For invasive species, agricultural pests, and disease vectors, sequence diversity at specific loci in the genome can impact the efficacy of next-generation genetic biocontrol strategies. Here we describe a pipeline for haplotype-resolution genetic variant discovery and quantification from thousands of Spotted Wing Drosophila (Drosophila suzukii, SWD) isolated at two field sites in the North-Central United States (Minnesota) across two seasons. We observed highly similar single nucleotide polymorphism (SNP) frequencies at each genomic location at each field site and year. This supports the hypotheses that SWD overwinters in Minnesota, that Minnesota is repopulated annually from the same source populations, or that both occur. In addition, the stable genetic structure of SWD populations allows for the rational design of genetic biocontrol technologies for population suppression.
Affiliation(s)
- Nathan R. Feltman
  - Department of Biochemistry, Molecular Biology, and Biophysics, University of Minnesota, Saint Paul, MN, United States
  - Biotechnology Institute, University of Minnesota, Saint Paul, MN, United States
- Eric C. Burkness
  - Department of Entomology, University of Minnesota, Saint Paul, MN, United States
- Dominique N. Ebbenga
  - Department of Entomology, University of Minnesota, Saint Paul, MN, United States
- William D. Hutchison
  - Department of Entomology, University of Minnesota, Saint Paul, MN, United States
- Michael J. Smanski
  - Department of Biochemistry, Molecular Biology, and Biophysics, University of Minnesota, Saint Paul, MN, United States
  - Biotechnology Institute, University of Minnesota, Saint Paul, MN, United States
5
Hébert F, Causeur D, Emily M. Omnibus testing approach for gene-based gene-gene interaction. Stat Med 2022; 41:2854-2878. [PMID: 35338506 DOI: 10.1002/sim.9389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2020] [Revised: 03/03/2022] [Accepted: 03/04/2022] [Indexed: 11/07/2022]
Abstract
Genetic interaction is considered one of the main heritable components of complex traits. With the emergence of genome-wide association studies (GWAS), a collection of statistical methods dedicated to the identification of interactions at the SNP level has been proposed. More recently, gene-based gene-gene interaction testing has emerged as an attractive alternative, as it confers advantages in both statistical power and biological interpretation. Most gene-based interaction methods rely on a multidimensional modeling of the interaction and therefore lack robustness against the huge space of interaction patterns. In this paper, we study global testing approaches to address the issue of gene-based gene-gene interaction. Based on a logistic regression modeling framework, all SNP-SNP interaction tests are combined to produce a gene-level test for interaction. We propose an omnibus test that takes advantage of (1) the heterogeneity between existing global tests and (2) the complementarity between allele-based and genotype-based coding of SNPs. Through an extensive simulation study, it is demonstrated that the proposed omnibus test can detect with high power the most common genetic interaction models with one causal pair as well as more complex genetic models where more than one causal pair is involved. Moreover, the flexible approach is shown to be robust and to improve power compared to single global tests in replication studies. Furthermore, the application of our procedure to real datasets confirms the adaptability of our approach to replicate various gene-gene interactions.
Affiliation(s)
- Florian Hébert
  - Department of Statistics and Computer Science, Institut Agro, CNRS, IRMAR, Univ Rennes, F-35000, Rennes, France
- David Causeur
  - Department of Statistics and Computer Science, Institut Agro, CNRS, IRMAR, Univ Rennes, F-35000, Rennes, France
- Mathieu Emily
  - Department of Statistics and Computer Science, Institut Agro, CNRS, IRMAR, Univ Rennes, F-35000, Rennes, France
6
Liu JZ, Deng W, Lee J, Lin PID, Valeri L, Christiani DC, Bellinger DC, Wright RO, Mazumdar MM, Coull BA. A Cross-validated Ensemble Approach to Robust Hypothesis Testing of Continuous Nonlinear Interactions: Application to Nutrition-Environment Studies. J Am Stat Assoc 2021; 117:561-573. [PMID: 36310839 PMCID: PMC9611147 DOI: 10.1080/01621459.2021.1962889] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/21/2019] [Revised: 07/23/2021] [Accepted: 07/28/2021] [Indexed: 01/03/2023]
Abstract
Gene-environment and nutrition-environment studies often involve testing of high-dimensional interactions between two sets of variables, each having potentially complex nonlinear main effects on an outcome. Construction of a valid and powerful hypothesis test for such an interaction is challenging, due to the difficulty in constructing an efficient and unbiased estimator for the complex, nonlinear main effects. In this work we address this problem by proposing a Cross-validated Ensemble of Kernels (CVEK) that learns the space of appropriate functions for the main effects using a cross-validated ensemble approach. With a carefully chosen library of base kernels, CVEK flexibly estimates the form of the main-effect functions from the data, and encourages test power by guarding against over-fitting under the alternative. The method is motivated by a study on the interaction between metal exposures in utero and maternal nutrition on children's neurodevelopment in rural Bangladesh. The proposed tests identified evidence of an interaction between mineral and vitamin intake and arsenic and manganese exposures. Results suggest that the detrimental effects of these metals are most pronounced at low intake levels of the nutrients, which implies that nutritional interventions in pregnant women could mitigate the adverse impacts of in utero metal exposures on children's neurodevelopment.
Affiliation(s)
- Jeremiah Zhe Liu
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Wenying Deng
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Jane Lee
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Neurology, Boston Children’s Hospital, Boston, MA, USA
- Pi-i Debby Lin
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Linda Valeri
  - Department of Biostatistics, Columbia Mailman School of Public Health, New York, New York, USA
- David C. Christiani
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- David C. Bellinger
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Neurology, Boston Children’s Hospital, Boston, MA, USA
- Robert O. Wright
  - Department of Environmental Medicine and Public Health, Icahn School of Medicine, New York, NY, USA
- Maitreyi M. Mazumdar
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Neurology, Boston Children’s Hospital, Boston, MA, USA
- Brent A. Coull
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
7
Liu Y, Gao Y, Fang R, Cao H, Sa J, Wang J, Liu H, Wang T, Cui Y. Identifying complex gene-gene interactions: a mixed kernel omnibus testing approach. Brief Bioinform 2021; 22:6346804. [PMID: 34373892 DOI: 10.1093/bib/bbab305] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2021] [Revised: 07/01/2021] [Accepted: 07/17/2021] [Indexed: 11/12/2022] Open
Abstract
Genes do not function independently; rather, they interact with each other to fulfill their joint tasks. Identification of gene-gene interactions has been critically important in elucidating the molecular mechanisms responsible for the variation of a phenotype. Regression models are commonly used to model the interaction between two genes with a linear product term. The interaction effect of two genes can be linear or nonlinear, depending on the true nature of the data. When nonlinear interactions exist, the linear interaction model may not be able to detect such interactions; hence, it suffers from substantial power loss. Because the true interaction mechanism (linear or nonlinear) is generally unknown in practice, it is critical to develop statistical methods that are flexible enough to capture the underlying interaction mechanism without imposing a specific model assumption. In this study, we develop a mixed kernel function which combines both linear and Gaussian kernels with different weights to capture the linear or nonlinear interaction of two genes. Instead of optimizing the weight function, we propose a grid search strategy and use a Cauchy transformation of the P-values obtained under different weights to aggregate the P-values. We further extend the two-gene interaction model to a high-dimensional setup using a de-biased LASSO algorithm. Extensive simulation studies are conducted to verify the performance of the proposed method. Application to two case studies further demonstrates the utility of the model. Our method provides a flexible and computationally efficient tool for disentangling complex gene-gene interactions associated with complex traits.
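The weighting and aggregation steps described in this abstract can be sketched roughly as follows. The sketch assumes the mixed kernel is a convex combination, weight × linear + (1 − weight) × Gaussian, and uses the standard Cauchy combination formula for P-values; neither detail is confirmed by the abstract. All function names, the bandwidth `sigma`, and the weight grid are hypothetical, and the association test that turns each kernel matrix into a P-value is left as a user-supplied callable rather than implemented.

```python
import math


def linear_kernel(X):
    """Gram matrix of inner products between the rows of X (a list of lists)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def gaussian_kernel(X, sigma=1.0):
    """Gaussian kernel exp(-||xi - xj||^2 / (2 sigma^2)); sigma is an assumed bandwidth."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(xi, xj)) / (2.0 * sigma ** 2))
             for xj in X] for xi in X]


def mixed_kernel(X, weight, sigma=1.0):
    """Assumed form of the mixed kernel: weight * linear + (1 - weight) * Gaussian."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    k_lin = linear_kernel(X)
    k_gau = gaussian_kernel(X, sigma)
    return [[weight * l + (1.0 - weight) * g for l, g in zip(row_l, row_g)]
            for row_l, row_g in zip(k_lin, k_gau)]


def cauchy_combine(p_values, weights=None):
    """Cauchy combination of P-values, each strictly between 0 and 1."""
    if any(not 0.0 < p < 1.0 for p in p_values):
        raise ValueError("P-values must lie strictly between 0 and 1")
    if weights is None:
        weights = [1.0] * len(p_values)
    total = sum(weights)
    # Weighted sum of Cauchy-transformed P-values, with weights normalised to sum to 1.
    t = sum((w / total) * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    return 0.5 - math.atan(t) / math.pi


def omnibus_p_value(X, p_value_from_kernel, grid=(0.0, 0.25, 0.5, 0.75, 1.0), sigma=1.0):
    """Grid search over kernel weights, then aggregate the per-weight P-values.

    p_value_from_kernel is a hypothetical callable mapping a kernel matrix to a P-value.
    """
    p_values = [p_value_from_kernel(mixed_kernel(X, w, sigma)) for w in grid]
    return cauchy_combine(p_values)
```

Taking the per-kernel test as a callable keeps the sketch independent of any particular score test; a real analysis would plug in a kernel association test fitted to the phenotype.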
Affiliation(s)
- Yan Liu
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China
- Yuzhao Gao
  - School of Statistics, Shanxi University of Finance and Economics, Taiyuan, PR China
- Ruiling Fang
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China
- Hongyan Cao
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China
- Jian Sa
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China
- Jianrong Wang
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, USA
- Hongqi Liu
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China
- Tong Wang
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, PR China
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA
8
Alam MA, Qiu C, Shen H, Wang YP, Deng HW. A generalized kernel machine approach to identify higher-order composite effects in multi-view datasets, with application to adolescent brain development and osteoporosis. J Biomed Inform 2021; 120:103854. [PMID: 34237438 DOI: 10.1016/j.jbi.2021.103854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2020] [Revised: 05/28/2021] [Accepted: 06/28/2021] [Indexed: 10/20/2022]
Abstract
In recent years, the comprehensive study of complex disease with multi-view datasets (e.g., multi-omics and imaging scans) has been at the forefront of biomedical research. State-of-the-art biomedical technologies are enabling us to collect multi-view biomedical datasets for the study of complex diseases. While the different views of data tend to carry complementary information about disease, analyzing multi-view data with complex interactions to reach a deeper and more holistic understanding of biological systems is challenging. In this paper, we propose a novel generalized kernel machine approach to identify higher-order composite effects in multi-view biomedical datasets (GKMAHCE). This generalized semi-parametric (a mixed-effect linear model) approach includes the marginal and joint Hadamard product of features from different views of data. The proposed kernel machine approach considers multi-view data as predictor variables to allow a more thorough and comprehensive modeling of a complex trait. We applied the GKMAHCE approach to both synthesized datasets and real multi-view datasets from adolescent brain development and osteoporosis studies. Our experiments demonstrate that the proposed method can effectively identify higher-order composite effects and suggest that corresponding features (genes, regions of interest, and chemical taxonomies) function in a concerted effort. We show that the proposed method is more generalizable than existing ones. To promote reproducible research, the source code of the proposed method is available at.
Affiliation(s)
- Md Ashad Alam
  - Tulane Center for Biomedical Informatics and Genomics, Tulane University, New Orleans, LA 70112, USA; Division of Biomedical Informatics and Genomics, Deming Department of Medicine, Tulane University, New Orleans, LA 70112, USA
- Chuan Qiu
  - Tulane Center for Biomedical Informatics and Genomics, Tulane University, New Orleans, LA 70112, USA; Division of Biomedical Informatics and Genomics, Deming Department of Medicine, Tulane University, New Orleans, LA 70112, USA
- Hui Shen
  - Tulane Center for Biomedical Informatics and Genomics, Tulane University, New Orleans, LA 70112, USA; Division of Biomedical Informatics and Genomics, Deming Department of Medicine, Tulane University, New Orleans, LA 70112, USA
- Yu-Ping Wang
  - Tulane Center for Biomedical Informatics and Genomics, Tulane University, New Orleans, LA 70112, USA; Department of Biomedical Engineering, Tulane University, New Orleans, LA 70118, USA
- Hong-Wen Deng
  - Tulane Center for Biomedical Informatics and Genomics, Tulane University, New Orleans, LA 70112, USA; Division of Biomedical Informatics and Genomics, Deming Department of Medicine, Tulane University, New Orleans, LA 70112, USA
9
Distance-Based Analysis with Quantile Regression Models. Statistics in Biosciences 2021; 13:291-312. [DOI: 10.1007/s12561-021-09306-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
10
Ashad Alam M, Komori O, Deng HW, Calhoun VD, Wang YP. Robust kernel canonical correlation analysis to detect gene-gene co-associations: A case study in genetics. J Bioinform Comput Biol 2020; 17:1950028. [PMID: 31617462 DOI: 10.1142/s0219720019500288] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/12/2022]
Abstract
The kernel canonical correlation analysis based U-statistic (KCCU) is used to detect nonlinear gene-gene co-associations. Estimating the variance of the KCCU is, however, computationally intensive. In addition, kernel canonical correlation analysis (kernel CCA) is not robust to contaminated data. Using a robust kernel mean element and a robust kernel (cross)-covariance operator potentially enables the use of a robust kernel CCA, which is studied in this paper. We first propose an influence function-based estimator for the variance of the KCCU. We then present a non-parametric robust KCCU, which is designed for dealing with contaminated data. The robust KCCU is less sensitive to noise than the KCCU. We investigate the proposed method using both synthesized and real data from the Mind Clinical Imaging Consortium (MCIC). We show through simulation studies that the power of the proposed methods is a monotonically increasing function of sample size, and the robust test statistics bring incremental gains in power. To demonstrate the advantage of the robust kernel CCA, we study MCIC data among 22,442 candidate Schizophrenia genes for gene-gene co-associations. We select 768 genes with strong evidence for shedding light on gene-gene interaction networks for Schizophrenia. By performing gene ontology enrichment analysis, pathway analysis, gene-gene network and other studies, the proposed robust methods can find undiscovered genes in addition to significant gene pairs, and demonstrate superior performance over several current approaches.
Affiliation(s)
- Md Ashad Alam
  - Tulane Center of Bioinformatics and Genomics, Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA 70118, USA
- Osamu Komori
  - Department of Computer and Information Science, Seikei University, 3-3-1 Kichijojikitamachi, Musashino-shi, Tokyo 180-8633, Japan
- Hong-Wen Deng
  - Tulane Center of Bioinformatics and Genomics, Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA 70118, USA
- Vince D Calhoun
  - Tri-Institutional Center for Translational Research in Neuroimaging and Data Science, Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA 30302, USA
- Yu-Ping Wang
  - Department of Biomedical Engineering, Tulane University, New Orleans, LA 70118, USA
11
Deng Y, He T, Fang R, Li S, Cao H, Cui Y. Genome-Wide Gene-Based Multi-Trait Analysis. Front Genet 2020; 11:437. [PMID: 32508874 PMCID: PMC7248273 DOI: 10.3389/fgene.2020.00437] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/24/2020] [Accepted: 04/08/2020] [Indexed: 11/29/2022] Open
Abstract
Genome-wide association studies focusing on a single phenotype have been broadly conducted to identify genetic variants associated with a complex disease. The commonly applied single variant analysis is limited by failing to consider the complex interactions between variants, which motivated the development of association analyses focusing on genes or gene sets. Moreover, when multiple correlated phenotypes are available, methods based on a multi-trait analysis can improve the association power. However, most currently available multi-trait analyses are single variant-based analyses; thus, they have limited power when disease variants function as a group in a gene or a gene set. In this work, we propose a genome-wide gene-based multi-trait analysis method by considering genes as testing units. For a given phenotype, we adopt a rapid and powerful kernel-based testing method which can evaluate the joint effect of multiple variants within a gene. The joint effect, either linear or nonlinear, is captured through kernel functions. Given a series of candidate kernel functions, we propose an omnibus test strategy to integrate the test results based on different candidate kernels. A p-value combination method is then applied to integrate dependent p-values to assess the association between a gene and multiple correlated phenotypes. Simulation studies show reasonable type I error control and excellent power of the proposed method compared to its counterparts. We further show the utility of the method by applying it to two data sets: the Human Liver Cohort and the Alzheimer's Disease Neuroimaging Initiative data set, and novel genes are identified. Our method has broad applications in other fields in which the interest is to evaluate the joint effect (linear or nonlinear) of a set of variants.
Affiliation(s)
- Yamin Deng
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Tao He
  - Department of Mathematics, San Francisco State University, San Francisco, CA, United States
- Ruiling Fang
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Shaoyu Li
  - Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC, United States
- Hongyan Cao
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI, United States
12
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach. Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and future directions of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia
- Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States
|
13
|
Wen Y, Lu Q. Multikernel linear mixed model with adaptive lasso for complex phenotype prediction. Stat Med 2020; 39:1311-1327. [PMID: 31985088 DOI: 10.1002/sim.8477] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/21/2019] [Revised: 11/17/2019] [Accepted: 12/24/2019] [Indexed: 12/15/2022]
Abstract
Linear mixed models (LMMs) and their extensions have been widely used for high-dimensional genomic data analyses. While LMMs hold great promise for risk prediction research, the high dimensionality of the data and different effect sizes of genomic regions bring great analytical and computational challenges. In this work, we present a multikernel linear mixed model with adaptive lasso (KLMM-AL) to predict phenotypes using high-dimensional genomic data. We develop two algorithms for estimating parameters from our model and also establish the asymptotic properties of LMM with adaptive lasso when only one dependent observation is available. The proposed KLMM-AL can account for heterogeneous effect sizes from different genomic regions, capture both additive and nonadditive genetic effects, and adaptively and efficiently select predictive genomic regions and their corresponding effects. Through simulation studies, we demonstrate that KLMM-AL outperforms most existing methods. Moreover, KLMM-AL achieves high sensitivity and specificity in selecting predictive genomic regions. KLMM-AL is further illustrated by an application to the sequencing dataset obtained from the Alzheimer's Disease Neuroimaging Initiative.
Affiliation(s)
- Yalu Wen
- Department of Statistics, The University of Auckland, Auckland, New Zealand
- Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan
|
14
|
Chang YC, Wu JT, Hong MY, Tung YA, Hsieh PH, Yee SW, Giacomini KM, Oyang YJ, Chen CY. GenEpi: gene-based epistasis discovery using machine learning. BMC Bioinformatics 2020; 21:68. [PMID: 32093643 PMCID: PMC7041299 DOI: 10.1186/s12859-020-3368-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 07/25/2019] [Accepted: 01/14/2020] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND: Genome-wide association studies (GWAS) provide a powerful means to identify associations between genetic variants and phenotypes. However, GWAS techniques for detecting epistasis, the interactions between genetic variants associated with phenotypes, are still limited. We believe that developing an efficient and effective GWAS method to detect epistasis will be key to discovering sophisticated pathogenesis, which is especially important for complex diseases such as Alzheimer's disease (AD). RESULTS: In this regard, this study presents GenEpi, a computational package to uncover epistasis associated with phenotypes by the proposed machine learning approach. GenEpi identifies both within-gene and cross-gene epistasis through a two-stage modeling workflow. In both stages, GenEpi adopts two-element combinatorial encoding when producing features and constructs the prediction models by L1-regularized regression with stability selection. The simulated data showed that GenEpi outperforms other widely used methods in detecting the ground-truth epistasis. For real data, this study uses AD as an example to reveal the capability of GenEpi in finding disease-related variants and variant interactions that show both biological meaning and predictive power. CONCLUSIONS: The results on simulation data and AD demonstrated that GenEpi has the ability to detect the epistasis associated with phenotypes effectively and efficiently. The released package can be generalized to largely facilitate the studies of many complex diseases in the near future.
Affiliation(s)
- Yu-Chuan Chang
- Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, 10617, Taiwan
- Taiwan AI Labs, Taipei, 10351, Taiwan
- June-Tai Wu
- Department of Dermatology, National Taiwan University Hospital, Taipei, 10002, Taiwan
- Ming-Yi Hong
- Department of Biomechatronics Engineering, National Taiwan University, Taipei, 10617, Taiwan
- Yi-An Tung
- Taiwan AI Labs, Taipei, 10351, Taiwan
- Genome and Systems biology degree program, Academia Sinica and National Taiwan University, Taipei, 10617, Taiwan
- Ping-Han Hsieh
- Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, 10617, Taiwan
- Sook Wah Yee
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, 94158, California, USA
- Kathleen M Giacomini
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, 94158, California, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, 94143, California, USA
- Yen-Jen Oyang
- Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, 10617, Taiwan
- Chien-Yu Chen
- Taiwan AI Labs, Taipei, 10351, Taiwan
- Department of Biomechatronics Engineering, National Taiwan University, Taipei, 10617, Taiwan
|
15
|
Zhao M, Chen L, Qiao Z, Zhou J, Zhang T, Zhang W, Ke S, Zhao X, Qiu X, Song X, Zhao E, Pan H, Yang Y, Yang X. Association Between FoxO1, A2M, and TGF-β1, Environmental Factors, and Major Depressive Disorder. Front Psychiatry 2020; 11:675. [PMID: 32792993 PMCID: PMC7394695 DOI: 10.3389/fpsyt.2020.00675] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 01/01/2020] [Accepted: 06/29/2020] [Indexed: 01/14/2023] Open
Abstract
INTRODUCTION: Investigations of gene-environment (G×E) interactions in major depressive disorder (MDD) have been limited to hypothesis testing of candidate genes, while poly-gene-environmental causation has not been adequately addressed. To this end, the present study analyzed the association between three candidate genes, two environmental factors, and MDD using a hypothesis-free testing approach. METHODS: A logistic regression model was used to analyze interaction effects; a hierarchical regression model was used to evaluate the effects of different genotypes and the dose-response effects of the environment; genetic risk score (GRS) was used to estimate the cumulative contribution of genetic factors to MDD; and protein-protein interaction (PPI) analyses were carried out to evaluate the relationship between candidate genes and top MDD susceptibility genes. RESULTS: Allelic association analyses revealed significant effects of the interaction between the candidate genes Forkhead box (Fox)O1, α2-macroglobulin (A2M), and transforming growth factor (TGF)-β1 and the environment on MDD. Gene-gene (G×G) and gene-gene-environment (G×G×E) interactions in MDD were also included in the model. Hierarchical regression analysis showed that the effect of environmental factors on MDD was greater in homozygous than in heterozygous mutant genotypes of the FoxO1 and TGF-β1 genes; a dose-response effect between environment and MDD on genotypes was also included in this model. Haplotype analyses revealed significant global and individual effects of haplotypes on MDD in the whole sample as well as in subgroups. There was a significant association between GRS and MDD (P = 0.029) and a GRS and environment interaction effect on MDD (P = 0.009). Candidate and top susceptibility genes were connected in PPI networks. CONCLUSIONS: FoxO1, A2M, and TGF-β1 interact with environmental factors and with each other in MDD. Multi-factorial G×E interactions may be responsible for a higher explained variance and may be associated with causal factors and mechanisms that could inform new diagnostic and therapeutic strategies, which can contribute to the personalized medicine of MDD.
Affiliation(s)
- Mingzhe Zhao
- Psychology Department, Public Health Institute, Harbin Medical University, Harbin, China
- Lu Chen
- Department of Endocrinology, Peking Union Medical College Hospital, Beijing, China
- Zhengxue Qiao
- Psychology Department, Public Health Institute, Harbin Medical University, Harbin, China
- Jiawei Zhou
- Psychology Department, Public Health Institute, Harbin Medical University, Harbin, China
- Tianyu Zhang
- Psychology Department, Public Health Institute, Harbin Medical University, Harbin, China
- Wenxin Zhang
- Psychology Department, Public Health Institute, Harbin Medical University, Harbin, China
- Siyuan Ke
- Psychology Department, Public Health Institute, Harbin Medical University, Harbin, China
- Xiaoyun Zhao
- Psychology Department, Public Health Institute, Harbin Medical University, Harbin, China
- Xiaohui Qiu
- Psychology Department, Public Health Institute, Harbin Medical University, Harbin, China
- Xuejia Song
- Psychology Department, Public Health Institute, Harbin Medical University, Harbin, China
- Erying Zhao
- Psychology Department, Public Health Institute, Harbin Medical University, Harbin, China
- Hui Pan
- Department of Endocrinology, Peking Union Medical College Hospital, Beijing, China
- Yanjie Yang
- Psychology Department, Public Health Institute, Harbin Medical University, Harbin, China
- Xiuxian Yang
- Psychology Department, Public Health Institute, Harbin Medical University, Harbin, China
|
16
|
Wu M, Ma S. Robust genetic interaction analysis. Brief Bioinform 2019; 20:624-637. [PMID: 29897421 PMCID: PMC6556899 DOI: 10.1093/bib/bby033] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 02/06/2018] [Revised: 03/22/2018] [Indexed: 01/17/2023] Open
Abstract
For the risk, progression, and response to treatment of many complex diseases, it has been increasingly recognized that genetic interactions (including gene-gene and gene-environment interactions) play important roles beyond the main genetic and environmental effects. In practical genetic interaction analyses, model mis-specification and outliers/contaminations in response variables and covariates are not uncommon, and demand robust analysis methods. Compared with their nonrobust counterparts, robust genetic interaction analysis methods are significantly less popular but are gaining attention fast. In this article, we provide a comprehensive review of robust genetic interaction analysis methods, on their methodologies and applications, for both marginal and joint analysis, and for addressing model mis-specification as well as outliers/contaminations in response variables and covariates.
Affiliation(s)
- Mengyun Wu
- Mengyun Wu and Shuangge Ma, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China and Yale School of Public Health, New Haven, CT 06520, USA
- Shuangge Ma
- Mengyun Wu and Shuangge Ma, School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China and Yale School of Public Health, New Haven, CT 06520, USA
|
17
|
Shao F, Wang Y, Zhao Y, Yang S. Identifying and exploiting gene-pathway interactions from RNA-seq data for binary phenotype. BMC Genet 2019; 20:36. [PMID: 30890140 PMCID: PMC6423879 DOI: 10.1186/s12863-019-0739-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/24/2018] [Accepted: 03/12/2019] [Indexed: 11/29/2022] Open
Abstract
Background: RNA sequencing (RNA-seq) technology has identified multiple differentially expressed (DE) genes associated with complex disease; however, these genes explain only a modest part of the variance. The omnigenic model assumes that disease may be driven by genes with indirect relevance to disease and be propagated by functional pathways. Here, we focus on identifying the interactions between the external genes and functional pathways, referred to as gene-pathway interactions (GPIs). Specifically, relying on the relationship between the garrote kernel machine (GKM) and the variance component test, and on permutations for the empirical distributions of score statistics, we propose an efficient analysis procedure, Permutation based gEne-pAthway interaction identification in binary phenotype (PEA). Results: Various simulations show that PEA has well-calibrated type I error rates and higher power than the traditional likelihood ratio test (LRT). In addition, we apply gene set enrichment algorithms and PEA to identify the GPIs from a pan-cancer dataset (GES68086). These GPIs and genes, most of which are identified, possibly further illustrate the potential etiology of cancers, and some external genes and significant pathways are consistent with previous studies. Conclusions: PEA is an efficient tool for identifying the GPIs from RNA-seq data. It can be further extended to identify the interactions between one variable and one functional set of other omics data for binary phenotypes.
Affiliation(s)
- Fang Shao
- Department of Biostatistics, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing, Jiangsu, People's Republic of China
- Yaqi Wang
- Department of Pharmacy Informatics, School of Science, China Pharmaceutical University, 24 Tongjia Xiang, Nanjing, Jiangsu, People's Republic of China
- Yang Zhao
- Department of Biostatistics, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing, Jiangsu, People's Republic of China
- Sheng Yang
- Department of Biostatistics, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing, Jiangsu, People's Republic of China
|
18
|
Larson NB, Chen J, Schaid DJ. A review of kernel methods for genetic association studies. Genet Epidemiol 2019; 43:122-136. [PMID: 30604442 DOI: 10.1002/gepi.22180] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 08/27/2018] [Revised: 11/09/2018] [Accepted: 11/26/2018] [Indexed: 12/17/2022]
Abstract
Evaluating the association of multiple genetic variants with a trait of interest by use of kernel-based methods has made a significant impact on how genetic association analyses are conducted. An advantage of kernel methods is that they tend to be robust when the genetic variants have effects that are a mixture of positive and negative effects, as well as when there is a small fraction of causal variants. Another advantage is that kernel methods fit within the framework of mixed models, providing flexible ways to adjust for additional covariates that influence traits. Herein, we review the basic ideas behind the use of kernel methods for genetic association analysis as well as recent methodological advancements for different types of traits, multivariate traits, pedigree data, and longitudinal data. Finally, we discuss opportunities for future research.
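As an illustration of the kind of kernel-based set test this review covers, the following is a minimal sketch and not code from the paper. It assumes a linear kernel, a quantitative trait, a null model fitted by ordinary least squares, and a residual-permutation approximation to the null distribution; the function name `kernel_score_test` and the simulated data are hypothetical. Because the statistic is a quadratic form, effects of opposite sign do not cancel.

```python
import numpy as np

def kernel_score_test(y, G, covariates=None, n_perm=999, seed=0):
    """Kernel score statistic Q = r' K r with a permutation p-value.

    y: (n,) quantitative trait; G: (n, p) genotype matrix.
    The linear kernel K = G G' is an illustrative choice.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    # Null model: intercept plus optional covariates, fitted by least squares.
    Z = np.ones((n, 1)) if covariates is None else np.column_stack([np.ones(n), covariates])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta                      # residuals under the null
    K = G @ G.T                           # linear kernel between subjects
    q_obs = float(r @ K @ r)
    # Approximate the null distribution by permuting the residuals.
    q_perm = np.empty(n_perm)
    for b in range(n_perm):
        rp = rng.permutation(r)
        q_perm[b] = rp @ K @ rp
    p_value = (1 + np.sum(q_perm >= q_obs)) / (n_perm + 1)
    return q_obs, float(p_value)

# Toy data: two causal variants with opposite effect directions.
rng = np.random.default_rng(42)
G = rng.integers(0, 3, size=(150, 10)).astype(float)
y = G[:, 0] - G[:, 1] + rng.standard_normal(150)
q_obs, p_value = kernel_score_test(y, G)
print(f"Q = {q_obs:.1f}, permutation p = {p_value:.3f}")
```

In practice, analytic approximations to the null distribution are commonly used instead of permutations; the permutation step here only keeps the sketch short and self-contained.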
Affiliation(s)
- Nicholas B Larson
- Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
- Jun Chen
- Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
- Daniel J Schaid
- Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
|
19
|
Bhattacharya D, Bhattacharya S. A Bayesian semiparametric approach to learning about gene–gene interactions in case-control studies. J Appl Stat 2018. [DOI: 10.1080/02664763.2018.1444741] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
Affiliation(s)
- Durba Bhattacharya
- St. Xavier's College, Kolkata, India
- Interdisciplinary Statistical Research Unit, Indian Statistical Institute, Kolkata, India
- Sourabh Bhattacharya
- Interdisciplinary Statistical Research Unit, Indian Statistical Institute, Kolkata, India
|
20
|
He T, Li S, Zhong PS, Cui Y. An optimal kernel-based U-statistic method for quantitative gene-set association analysis. Genet Epidemiol 2018; 43:137-149. [DOI: 10.1002/gepi.22170] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/04/2018] [Revised: 08/19/2018] [Accepted: 09/26/2018] [Indexed: 11/09/2022]
Affiliation(s)
- Tao He
- Department of Mathematics; San Francisco State University; San Francisco California
- Shaoyu Li
- Department of Mathematics and Statistics; University of North Carolina at Charlotte; Charlotte North Carolina
- Ping-Shou Zhong
- Department of Mathematics, Statistics, and Computer Science; University of Illinois at Chicago; Chicago Illinois
- Yuehua Cui
- Department of Statistics & Probability; Michigan State University; East Lansing Michigan
- School of Public Health, Zhengzhou University; Zhengzhou China
|
21
|
Zhang W, Chen Z, Liu A, Buck Louis GM. A weighted kernel machine regression approach to environmental pollutants and infertility. Stat Med 2018; 38:809-827. [PMID: 30328128 DOI: 10.1002/sim.8003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/19/2018] [Revised: 08/17/2018] [Accepted: 09/19/2018] [Indexed: 11/09/2022]
Abstract
In epidemiological studies of environmental pollutants in relation to human infertility, it is common that concentrations of a large number of exposures are collected in both male and female partners. Such a couple-based study poses some new challenges in statistical analysis, especially when the effect of the totality of these chemical mixtures is of interest, because these exposures may have complex nonlinear and nonadditive relationships with the infertility outcome. Kernel machine regression, as a nonparametric regression method, can be applied to model such effects, while accounting for the highly correlated structure within and across exposures. However, it does not consider the partner-specific structure in these study data, which may lead to suboptimal estimation of the effects of environmental exposures. To overcome this limitation, we developed a weighted kernel machine regression method (wKRM) to model the joint effect of partner-specific exposures, in which a linear weight procedure is used to combine the female and male partners' exposure concentrations. The proposed wKRM not only reduces the number of analyzed exposures but also provides an overall importance index of female and male partners' exposures in the risk of infertility. Simulation studies demonstrate good performance of the wKRM in both estimating the joint effects of exposures and fitting the infertility outcome. Application of the proposed method to a prospective infertility study suggests that the male partner's exposure to polychlorinated biphenyls might contribute more toward infertility than the female partner's.
Affiliation(s)
- Wei Zhang
- Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland
- Zhen Chen
- Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland
- Aiyi Liu
- Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland
- Germaine M Buck Louis
- Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland
|
22
|
Alam MA, Lin HY, Deng HW, Calhoun VD, Wang YP. A kernel machine method for detecting higher order interactions in multimodal datasets: Application to schizophrenia. J Neurosci Methods 2018; 309:161-174. [PMID: 30184473 DOI: 10.1016/j.jneumeth.2018.08.027] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 03/23/2018] [Revised: 08/12/2018] [Accepted: 08/30/2018] [Indexed: 12/20/2022]
Abstract
BACKGROUND: Technological advances are enabling us to collect multimodal datasets at increasing depth and resolution with decreasing labor. Understanding complex interactions among multimodal datasets, however, is challenging. NEW METHOD: In this study, we tested the interaction effect of multimodal datasets using a novel method called the kernel machine for detecting higher order interactions among biologically relevant multimodal data. Using a semiparametric method on a reproducing kernel Hilbert space, we formulated the proposed method as a standard mixed-effects linear model and derived a score-based variance component statistic to test higher order interactions between multimodal datasets. RESULTS: The method was evaluated using extensive numerical simulation and real data from the Mind Clinical Imaging Consortium with both schizophrenia patients and healthy controls. Our method identified 13 triplets that included 6 gene-derived SNPs, 10 ROIs, and 6 gene-specific DNA methylations that are correlated with the changes in hippocampal volume, suggesting that these triplets may be important for explaining schizophrenia-related neurodegeneration. COMPARISON WITH EXISTING METHOD(S): The performance of the proposed method is compared with the following methods: tests based on only the first and the first few principal components followed by multiple regression, full principal component analysis regression, and the sequence kernel association test. CONCLUSIONS: With strong evidence (p-value ≤ 0.000001), the triplet (MAGI2, CRBLCrus1.L, FBXO28) is a significant biomarker for schizophrenia patients. This novel method is applicable to the study of other disease processes, where multimodal data analysis is a common task.
Affiliation(s)
- Md Ashad Alam
- Department of Biomedical Engineering, Tulane University, New Orleans, LA 70118, USA
- Hui-Yi Lin
- Biostatistics Program, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, LA 70112, USA
- Hong-Wen Deng
- Center for Bioinformatics and Genomics, Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA 70112, USA
- Vince D Calhoun
- Department of Electrical and Computer Engineering, The University of New Mexico, Albuquerque, NM 87131, USA
- Yu-Ping Wang
- Department of Biomedical Engineering, Tulane University, New Orleans, LA 70118, USA
|
23
|
Abstract
BACKGROUND: A large amount of research has been devoted to the detection and investigation of epistatic interactions in genome-wide association studies (GWASs). Most of the literature focuses on low-order interactions between single-nucleotide polymorphisms (SNPs) with significant main effects. RESULTS: In this paper we propose an original approach for detecting epistasis at the gene level, without systematically filtering on significant genes. We first compute interaction variables for each gene pair by finding its Eigen-Epistasis component, defined as the linear combination of Gene SNPs having the highest correlation with the phenotype. The selection of significant effects is done using a penalized regression method based on Group Lasso controlling the False Discovery Rate. CONCLUSION: The method is tested against two recent alternative proposals from the literature using synthetic data, and performs well in different settings. We demonstrate the power of our approach by detecting new gene-gene interactions on three genome-wide association studies.
|
24
|
He Z, Zhang M, Lee S, Smith JA, Kardia SLR, Diez Roux AV, Mukherjee B. Set-Based Tests for the Gene-Environment Interaction in Longitudinal Studies. J Am Stat Assoc 2016; 112:966-978. [PMID: 29780190 PMCID: PMC5954413 DOI: 10.1080/01621459.2016.1252266] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 02/01/2016] [Accepted: 10/01/2016] [Indexed: 01/09/2023]
Abstract
We propose a generalized score type test for set-based inference for gene-environment interaction with longitudinally measured quantitative traits. The test is robust to misspecification of within subject correlation structure and has enhanced power compared to existing alternatives. Unlike tests for marginal genetic association, set-based tests for gene-environment interaction face the challenges of a potentially misspecified and high-dimensional main effect model under the null hypothesis. We show that our proposed test is robust to main effect misspecification of environmental exposure and genetic factors under the gene-environment independence condition. When genetic and environmental factors are dependent, the method of sieves is further proposed to eliminate potential bias due to a misspecified main effect of a continuous environmental exposure. A weighted principal component analysis approach is developed to perform dimension reduction when the number of genetic variants in the set is large relative to the sample size. The methods are motivated by an example from the Multi-Ethnic Study of Atherosclerosis (MESA), investigating interaction between measures of neighborhood environment and genetic regions on longitudinal measures of blood pressure over a study period of about seven years with 4 exams.
Affiliation(s)
- Zihuai He
- Department of Biostatistics, University of Michigan
- Min Zhang
- Department of Biostatistics, University of Michigan
|
25
|
Kodama K, Saigo H. KDSNP: A kernel-based approach to detecting high-order SNP interactions. J Bioinform Comput Biol 2016; 14:1644003. [DOI: 10.1142/s0219720016440030] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]
Abstract
Despite the accumulation of quantitative trait loci (QTL) data in many complex human diseases, most current approaches that have attempted to relate genotype to phenotype have achieved limited success, and the genetic factors of many common diseases remain to be elucidated. One of the reasons that makes this problem complex is the existence of single nucleotide polymorphism (SNP) interaction, or epistasis. Due to the excessive amount of computation needed to search the combinatorial space, existing approaches cannot fully incorporate high-order SNP interactions into their models, but limit themselves to detecting only lower-order SNP interactions. We present an empirical approach based on ridge regression with polynomial kernels and a model selection technique for determining the true degree of epistasis among SNPs. Computer experiments on simulated data show the ability of the proposed method to correctly predict the number of interacting SNPs provided that the number of samples is large enough relative to the number of SNPs. For cases in which the number of available samples is limited, we propose a sliding window approach to ensure a sufficiently large sample/SNP ratio in each window. In computational experiments using heterogeneous stock mice data, our approach has successfully detected subregions that harbor known causal SNPs. Our analysis further suggests the existence of additional candidate causal SNPs interacting with each other in the neighborhood of the known causal gene. Software is available from https://github.com/HirotoSaigo/KDSNP.
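The general idea of choosing the interaction order by fitting ridge regression with polynomial kernels of increasing degree can be sketched as follows. This is an illustrative sketch only, not the KDSNP implementation: it assumes k-fold cross-validated prediction error as the model selection criterion, a fixed ridge penalty, and hypothetical helper names (`poly_kernel`, `cv_error`, `select_degree`) with simulated genotypes.

```python
import numpy as np

def poly_kernel(A, B, degree):
    """Polynomial kernel (1 + <a, b> / p)^degree between the rows of A and B."""
    p = A.shape[1]
    return (1.0 + A @ B.T / p) ** degree

def cv_error(X, y, degree, lam=0.1, n_folds=5, seed=0):
    """Mean squared prediction error of kernel ridge regression under k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        K_train = poly_kernel(X[train_idx], X[train_idx], degree)
        # Dual ridge solution: alpha = (K + lam * I)^{-1} y
        alpha = np.linalg.solve(K_train + lam * np.eye(len(train_idx)), y[train_idx])
        pred = poly_kernel(X[test_idx], X[train_idx], degree) @ alpha
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

def select_degree(X, y, degrees=(1, 2, 3)):
    """Return the kernel degree with the smallest CV error, plus all errors."""
    errs = {d: cv_error(X, y, d) for d in degrees}
    return min(errs, key=errs.get), errs

# Toy data: a pure pairwise interaction between two centered SNPs.
rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(200, 5)).astype(float)
X = G - G.mean(axis=0)
y = X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(200)
best, errs = select_degree(X, y)
print("selected degree:", best)
print("CV errors by degree:", errs)
```

A degree-1 kernel corresponds to a purely additive model, so on data generated from a pairwise interaction its cross-validated error should be clearly worse than that of the degree-2 kernel.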
Affiliation(s)
- Kento Kodama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka 820-8502, Fukuoka, Japan
- Hiroto Saigo
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka 820-8502, Fukuoka, Japan
|
26
|
A Nonlinear Model for Gene-Based Gene-Environment Interaction. Int J Mol Sci 2016; 17:ijms17060882. [PMID: 27271617 PMCID: PMC4926416 DOI: 10.3390/ijms17060882] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/08/2016] [Revised: 05/07/2016] [Accepted: 05/21/2016] [Indexed: 11/16/2022] Open
Abstract
A vast amount of literature has confirmed the role of gene-environment (G×E) interaction in the etiology of complex human diseases. Traditional methods are predominantly focused on the analysis of interaction between a single nucleotide polymorphism (SNP) and an environmental variable. Given that genes are the functional units, it is crucial to understand how gene effects (rather than single SNP effects) are influenced by an environmental variable to affect disease risk. Motivated by the increasing awareness of the power of gene-based association analysis over single variant based approach, in this work, we proposed a sparse principle component regression (sPCR) model to understand the gene-based G×E interaction effect on complex disease. We first extracted the sparse principal components for SNPs in a gene, then the effect of each principal component was modeled by a varying-coefficient (VC) model. The model can jointly model variants in a gene in which their effects are nonlinearly influenced by an environmental variable. In addition, the varying-coefficient sPCR (VC-sPCR) model has nice interpretation property since the sparsity on the principal component loadings can tell the relative importance of the corresponding SNPs in each component. We applied our method to a human birth weight dataset in Thai population. We analyzed 12,005 genes across 22 chromosomes and found one significant interaction effect using the Bonferroni correction method and one suggestive interaction. The model performance was further evaluated through simulation studies. Our model provides a system approach to evaluate gene-based G×E interaction.
|
27
|
Wen Y, He Z, Li M, Lu Q. Risk Prediction Modeling of Sequencing Data Using a Forward Random Field Method. Sci Rep 2016; 6:21120. [PMID: 26892725 PMCID: PMC4759688 DOI: 10.1038/srep21120] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 10/26/2015] [Accepted: 01/18/2016] [Indexed: 11/09/2022] Open
Abstract
With the advance in high-throughput sequencing technology, it is feasible to investigate the role of common and rare variants in disease risk prediction. While the new technology holds great promise to improve disease prediction, the massive amount of data and low frequency of rare variants pose great analytical challenges for risk prediction modeling. In this paper, we develop a forward random field method (FRF) for risk prediction modeling using sequencing data. In FRF, subjects' phenotypes are treated as stochastic realizations of a random field on a genetic space formed by subjects' genotypes, and an individual's phenotype can be predicted by adjacent subjects with similar genotypes. The FRF method allows for multiple similarity measures and candidate genes in the model, and adaptively chooses the optimal similarity measure and disease-associated genes to reflect the underlying disease model. It also avoids the specification of the threshold of rare variants and allows for different directions and magnitudes of genetic effects. Through simulations, we demonstrate that the FRF method attains accuracy higher than or comparable to that of commonly used support vector machine-based methods under various disease models. We further illustrate the FRF method with an application to the sequencing data obtained from the Dallas Heart Study.
Affiliation(s)
- Yalu Wen
  - Department of Statistics, University of Auckland, Auckland 1010, New Zealand
- Zihuai He
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, U.S.A
- Ming Li
  - Department of Epidemiology and Biostatistics, Indiana University at Bloomington, Bloomington, IN 47405, U.S.A
- Qing Lu
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, U.S.A
|
28
|
Song M. Jackknife-based gene-gene interaction tests for untyped SNPs. BMC Genet 2015; 16:85. [PMID: 26187382 PMCID: PMC4506584 DOI: 10.1186/s12863-015-0225-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2015] [Accepted: 06/10/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Testing gene-gene interaction in genome-wide association studies generally yields lower power than testing marginal association. Meta-analysis that combines different genotyping platforms is one method used to increase power when assessing gene-gene interactions, which requires a test for interaction on untyped SNPs. However, to date, formal statistical tests for gene-gene interaction on untyped SNPs have not been thoroughly addressed. The key concern for gene-gene interaction testing on untyped SNPs located on different chromosomes is that the pair of genes might not be independent and the current generation of imputation methods provides imputed genotypes at the marginal accuracy.
RESULTS: In this study we address this challenge and describe a novel method for testing gene-gene interaction on marginally imputed values of untyped SNPs. We show that our novel Wald-type test statistics for interactions with and without constraints in the interaction parameters follow the asymptotic distributions which are the same as those of the corresponding tests for typed SNPs. Through simulations, we show that the proposed tests properly control type I error and are more powerful than the extension of the classical dosage method to interaction tests. The increase in power results from a proper correction for the uncertainty in imputation through the variance estimator using the jackknife, a resampling technique. We apply the method to detect interactions between SNPs on chromosomes 5 and 15 on lung cancer data. The inclusion of the results at the untyped SNPs provides much more detailed information on the regions of interest.
CONCLUSIONS: As demonstrated by the simulation studies and real data analysis, our approaches outperform the application of the traditional dosage method to the detection of gene-gene interaction in terms of power while providing control of the type I error.
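The jackknife variance estimator mentioned in this abstract can be illustrated in its generic delete-one form. The sketch below is not the author's Wald-type interaction test for imputed SNPs; it only shows the resampling idea, and the function name `jackknife_variance` and the toy data are assumptions made for illustration.

```python
from statistics import mean


def jackknife_variance(data, estimator):
    """Delete-one jackknife estimate of the variance of estimator(data).

    `data` is a list of observations; `estimator` maps a list to a number.
    """
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    # Recompute the estimator with each observation left out in turn.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = mean(leave_one_out)
    # (n - 1) / n is the standard jackknife scaling factor.
    return (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)


toy = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical values, not from the paper
var_of_mean = jackknife_variance(toy, mean)
```

For the sample mean, the delete-one jackknife variance coincides with the usual s²/n, which makes this a convenient sanity check before applying the same routine to a less tractable statistic.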
Affiliation(s)
- Minsun Song
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD, USA.
|
29
|
Wang X, Xing EP, Schaid DJ. Kernel methods for large-scale genomic data analysis. Brief Bioinform 2015; 16:183-92. [PMID: 25053743 PMCID: PMC4375394 DOI: 10.1093/bib/bbu024] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 03/31/2014] [Accepted: 05/20/2014] [Indexed: 11/12/2022] Open
Abstract
Machine learning, particularly kernel methods, has been demonstrated as a promising new tool to tackle the challenges imposed by today's explosive data growth in genomics. Kernel methods provide a practical and principled approach to learning how a large number of genetic variants are associated with complex phenotypes, to help reveal the complexity in the relationship between the genetic markers and the outcome of interest. In this review, we highlight the potential key role they will have in modern genomic data processing, especially with regard to integration with classical methods for gene prioritizing, prediction and data fusion.
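As a concrete illustration of what a kernel over genetic variants looks like, the sketch below builds a subject-by-subject similarity matrix from genotypes coded as minor-allele counts (0, 1, 2). The linear and identity-by-state (IBS) kernels are common choices in the kernel association literature; they are not named in this abstract, and the function names and toy genotypes here are assumptions made for illustration.

```python
def linear_kernel(g1, g2):
    """Inner product of two genotype vectors (minor-allele counts 0/1/2)."""
    return sum(a * b for a, b in zip(g1, g2))


def ibs_kernel(g1, g2):
    """Identity-by-state similarity: average fraction of alleles shared."""
    if len(g1) != len(g2) or not g1:
        raise ValueError("genotype vectors must be non-empty and equal length")
    # Each SNP contributes 2, 1 or 0 shared alleles out of a possible 2.
    return sum(2 - abs(a - b) for a, b in zip(g1, g2)) / (2 * len(g1))


def kernel_matrix(genotypes, kernel):
    """Pairwise kernel values for a list of subjects' genotype vectors."""
    return [[kernel(gi, gj) for gj in genotypes] for gi in genotypes]


# Three hypothetical subjects genotyped at three SNPs.
G = [[0, 1, 2], [2, 1, 0], [0, 1, 1]]
K = kernel_matrix(G, ibs_kernel)
```

The resulting matrix is symmetric with ones on the diagonal for the IBS kernel; swapping in `linear_kernel` changes only the notion of similarity, not the surrounding code, which is the practical appeal of kernel-based analysis.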
|
30
|
Ge T, Nichols TE, Ghosh D, Mormino EC, Smoller JW, Sabuncu MR. A kernel machine method for detecting effects of interaction between multidimensional variable sets: an imaging genetics application. Neuroimage 2015; 109:505-514. [PMID: 25600633 DOI: 10.1016/j.neuroimage.2015.01.029] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 06/11/2014] [Revised: 01/06/2015] [Accepted: 01/09/2015] [Indexed: 11/19/2022] Open
Abstract
Measurements derived from neuroimaging data can serve as markers of disease and/or healthy development, are largely heritable, and have been increasingly utilized as (intermediate) phenotypes in genetic association studies. To date, imaging genetic studies have mostly focused on discovering isolated genetic effects, typically ignoring potential interactions with non-genetic variables such as disease risk factors, environmental exposures, and epigenetic markers. However, identifying significant interaction effects is critical for revealing the true relationship between genetic and phenotypic variables, and shedding light on disease mechanisms. In this paper, we present a general kernel machine based method for detecting effects of the interaction between multidimensional variable sets. This method can model the joint and epistatic effect of a collection of single nucleotide polymorphisms (SNPs), accommodate multiple factors that potentially moderate genetic influences, and test for nonlinear interactions between sets of variables in a flexible framework. As a demonstration of application, we applied the method to the data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) to detect the effects of the interactions between candidate Alzheimer's disease (AD) risk genes and a collection of cardiovascular disease (CVD) risk factors, on hippocampal volume measurements derived from structural brain magnetic resonance imaging (MRI) scans. Our method identified that two genes, CR1 and EPHA1, demonstrate significant interactions with CVD risk factors on hippocampal volume, suggesting that CR1 and EPHA1 may play a role in influencing AD-related neurodegeneration in the presence of CVD risks.
Affiliation(s)
- Tian Ge
  - Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital / Harvard Medical School, Charlestown, MA 02129, USA
  - Psychiatric and Neurodevelopmental Genetics Unit, Center for Human Genetic Research, Massachusetts General Hospital, Boston, MA 02114, USA
- Thomas E Nichols
  - Department of Statistics & Warwick Manufacturing Group, The University of Warwick, Coventry CV4 7AL, UK
- Debashis Ghosh
  - Department of Statistics, The Pennsylvania State University, PA 16802, USA
- Elizabeth C Mormino
  - Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
- Jordan W Smoller
  - Psychiatric and Neurodevelopmental Genetics Unit, Center for Human Genetic Research, Massachusetts General Hospital, Boston, MA 02114, USA
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02138, USA
- Mert R Sabuncu
  - Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital / Harvard Medical School, Charlestown, MA 02129, USA
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
|
31
|
Larson NB, Jenkins GD, Larson MC, Vierkant RA, Sellers TA, Phelan CM, Schildkraut JM, Sutphen R, Pharoah PPD, Gayther SA, Wentzensen N, Goode EL, Fridley BL. Kernel canonical correlation analysis for assessing gene-gene interactions and application to ovarian cancer. Eur J Hum Genet 2014; 22:126-31. [PMID: 23591404 PMCID: PMC3865403 DOI: 10.1038/ejhg.2013.69] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 08/21/2012] [Revised: 01/11/2013] [Accepted: 01/16/2013] [Indexed: 01/24/2023] Open
Abstract
Although single-locus approaches have been widely applied to identify disease-associated single-nucleotide polymorphisms (SNPs), complex diseases are thought to be the product of multiple interactions between loci. This has led to the recent development of statistical methods for detecting statistical interactions between two loci. Canonical correlation analysis (CCA) has previously been proposed to detect gene-gene coassociation. However, this approach is limited to detecting linear relations and can only be applied when the number of observations exceeds the number of SNPs in a gene. This limitation is particularly important for next-generation sequencing, which could yield a large number of novel variants on a limited number of subjects. To overcome these limitations, we propose an approach to detect gene-gene interactions on the basis of a kernelized version of CCA (KCCA). Our simulation studies showed that KCCA controls the Type-I error, and is more powerful than leading gene-based approaches under a disease model with negligible marginal effects. To demonstrate the utility of our approach, we also applied KCCA to assess interactions between 200 genes in the NF-κB pathway in relation to ovarian cancer risk in 3869 cases and 3276 controls. We identified 13 significant gene pairs relevant to ovarian cancer risk (local false discovery rate <0.05). Finally, we discuss the advantages of KCCA in gene-gene interaction analysis and its future role in genetic association studies.
Affiliation(s)
- Nicholas B Larson
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Gregory D Jenkins
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Melissa C Larson
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Robert A Vierkant
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Rebecca Sutphen
  - Department of Pediatrics, University of South Florida College of Medicine, Tampa, FL, USA
- Simon A Gayther
  - Department of Preventative Medicine, University of Southern California, Los Angeles, CA, USA
- Nicolas Wentzensen
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
- Ovarian Cancer Association Consortium
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
  - Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, USA
  - Duke Comprehensive Cancer Center, Duke University, Durham, NC, USA
  - Department of Pediatrics, University of South Florida College of Medicine, Tampa, FL, USA
  - Department of Oncology, University of Cambridge, Cambridge, UK
  - Department of Preventative Medicine, University of Southern California, Los Angeles, CA, USA
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
  - Department of Biostatistics, University of Kansas Medical Center, Kansas City, KS, USA
- Ellen L Goode
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Brooke L Fridley
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
  - Department of Biostatistics, University of Kansas Medical Center, Kansas City, KS, USA
|
32
|
Qu L, Guennel T, Marshall SL. Linear score tests for variance components in linear mixed models and applications to genetic association studies. Biometrics 2013; 69:883-92. [PMID: 24328714 DOI: 10.1111/biom.12095] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Received: 09/01/2012] [Revised: 06/01/2013] [Accepted: 07/01/2013] [Indexed: 01/16/2023]
Abstract
Following the rapid development of genome-scale genotyping technologies, genetic association mapping has become a popular tool to detect genomic regions responsible for certain (disease) phenotypes, especially in early-phase pharmacogenomic studies with limited sample size. In response to such applications, a good association test needs to be (1) applicable to a wide range of possible genetic models, including, but not limited to, the presence of gene-by-environment or gene-by-gene interactions and non-linearity of a group of marker effects, (2) accurate in small samples, fast to compute on the genomic scale, and amenable to large scale multiple testing corrections, and (3) reasonably powerful to locate causal genomic regions. The kernel machine method represented in linear mixed models provides a viable solution by transforming the problem into testing the nullity of variance components. In this study, we consider score-based tests by choosing a statistic linear in the score function. When the model under the null hypothesis has only one error variance parameter, our test is exact in finite samples. When the null model has more than one variance parameter, we develop a new moment-based approximation that performs well in simulations. Through simulations and analysis of real data, we demonstrate that the new test possesses most of the aforementioned characteristics, especially when compared to existing quadratic score tests or restricted likelihood ratio tests.
Affiliation(s)
- Long Qu
  - Department of Mathematics and Statistics, Wright State University, Dayton, Ohio 45435, U.S.A
|
33
|
Li F, Zhao J, Yuan Z, Zhang X, Ji J, Xue F. A powerful latent variable method for detecting and characterizing gene-based gene-gene interaction on multiple quantitative traits. BMC Genet 2013; 14:89. [PMID: 24059907 PMCID: PMC3848962 DOI: 10.1186/1471-2156-14-89] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 05/22/2013] [Accepted: 09/17/2013] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND: On thinking quantitatively of complex diseases, there are at least three statistical strategies for analyzing the gene-gene interaction: SNP by SNP interaction on a single trait, gene-gene (each can involve multiple SNPs) interaction on a single trait and gene-gene interaction on multiple traits. The third one is the most general in dissecting the genetic mechanism underlying complex diseases underpinning multiple quantitative traits. In this paper, we developed a novel statistic for this strategy through modifying the Partial Least Squares Path Modeling (PLSPM), called the mPLSPM statistic.
RESULTS: Simulation studies indicated that the mPLSPM statistic was powerful and outperformed the principal component analysis (PCA) based linear regression method. Application to real data in the EPIC-Norfolk GWAS sub-cohort showed suggestive interaction (γ) between the TMEM18 gene and the BDNF gene on two composite body shape scores (γ = 0.047 and γ = 0.058, with P = 0.021, P = 0.005), and BMI (γ = 0.043, P = 0.034). This suggested these scores (synthetically latent traits) were more suitable to capture the obesity related genetic interaction effect between genes compared to a single trait.
CONCLUSIONS: The proposed novel mPLSPM statistic is a valid and powerful gene-based method for detecting gene-gene interaction on multiple quantitative phenotypes.
Affiliation(s)
- Fangyu Li
  - Department of Epidemiology and Biostatistics, School of Public Health, Shandong University, Jinan 250012, China.
|
34
|
Larson NB, Schaid DJ. A kernel regression approach to gene-gene interaction detection for case-control studies. Genet Epidemiol 2013; 37:695-703. [PMID: 23868214 DOI: 10.1002/gepi.21749] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 01/18/2013] [Revised: 05/07/2013] [Accepted: 06/12/2013] [Indexed: 01/13/2023]
Abstract
Gene-gene interactions are increasingly being addressed as a potentially important contributor to the variability of complex traits. Consequently, attention has moved beyond single locus analysis of association to more complex genetic models. Although several single-marker approaches toward interaction analysis have been developed, such methods suffer from very high testing dimensionality and do not take advantage of existing information, notably the definition of genes as functional units. Here, we propose a comprehensive family of gene-level score tests for identifying genetic elements of disease risk, in particular pairwise gene-gene interactions. Using kernel machine methods, we devise score-based variance component tests under a generalized linear mixed model framework. We conducted simulations based upon coalescent genetic models to evaluate the performance of our approach under a variety of disease models. These simulations indicate that our methods are generally higher powered than alternative gene-level approaches and at worst competitive with exhaustive SNP-level (where SNP is single-nucleotide polymorphism) analyses. Furthermore, we observe that simulated epistatic effects resulted in significant marginal testing results for the involved genes regardless of whether or not true main effects were present. We detail the benefits of our methods and discuss potential genome-wide analysis strategies for gene-gene interaction analysis in a case-control study design.
Affiliation(s)
- Nicholas B Larson
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
|
35
|
Gene-based testing of interactions in association studies of quantitative traits. PLoS Genet 2013; 9:e1003321. [PMID: 23468652 PMCID: PMC3585009 DOI: 10.1371/journal.pgen.1003321] [Citation(s) in RCA: 71] [Impact Index Per Article: 6.5] [Received: 08/03/2012] [Accepted: 12/31/2012] [Indexed: 01/05/2023] Open
Abstract
Various methods have been developed for identifying gene–gene interactions in genome-wide association studies (GWAS). However, most methods focus on individual markers as the testing unit, and the large number of such tests drastically erodes statistical power. In this study, we propose novel interaction tests of quantitative traits that are gene-based and that confer advantage in both statistical power and biological interpretation. The framework of gene-based gene–gene interaction (GGG) tests combines marker-based interaction tests between all pairs of markers in two genes to produce a gene-level test for interaction between the two. The tests are based on an analytical formula we derive for the correlation between marker-based interaction tests due to linkage disequilibrium. We propose four GGG tests that extend the following P value combining methods: minimum P value, extended Simes procedure, truncated tail strength, and truncated P value product. Extensive simulations point to correct type I error rates of all tests and show that the two truncated tests are more powerful than the other tests in cases of markers involved in the underlying interaction not being directly genotyped and in cases of multiple underlying interactions. We applied our tests to pairs of genes that exhibit a protein–protein interaction to test for gene-level interactions underlying lipid levels using genotype data from the Atherosclerosis Risk in Communities study. We identified five novel interactions that are not evident from marker-based interaction testing and successfully replicated one of these interactions, between SMAD3 and NEDD9, in an independent sample from the Multi-Ethnic Study of Atherosclerosis. We conclude that our GGG tests show improved power to identify gene-level interactions in existing, as well as emerging, association studies.
Epistasis is likely to play a significant role in complex diseases or traits and is one of the many possible explanations for “missing heritability.” However, epistatic interactions have been difficult to detect in genome-wide association studies (GWAS) due to the limited power caused by the multiple-testing correction from the large number of tests conducted. Gene-based gene–gene interaction (GGG) tests might hold the key to relaxing the multiple-testing correction burden and increasing the power for identifying epistatic interactions in GWAS. Here, we developed GGG tests of quantitative traits by extending four P value combining methods and evaluated their type I error rates and power using extensive simulations. All four GGG tests are more powerful than a principal component-based test. We also applied our GGG tests to data from the Atherosclerosis Risk in Communities study and found five gene-level interactions associated with the levels of total cholesterol and high-density lipoprotein cholesterol (HDL-C). One interaction between SMAD3 and NEDD9 on HDL-C was further replicated in an independent sample from the Multi-Ethnic Study of Atherosclerosis.
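The idea of combining marker-pair P values into one gene-level P value can be sketched with the two simplest combining rules named above. This is a simplified illustration that assumes independent tests: it uses the classical Simes procedure and a Šidák-corrected minimum P value, whereas the GGG tests described in the abstract adjust for correlation induced by linkage disequilibrium using a formula not reproduced here. The function names and the toy P values are assumptions.

```python
def simes_p(pvalues):
    """Classical Simes combined P value: min over ranks of n * p_(i) / i."""
    if not pvalues:
        raise ValueError("need at least one P value")
    n = len(pvalues)
    ordered = sorted(pvalues)
    combined = min(n * p / rank for rank, p in enumerate(ordered, start=1))
    return min(1.0, combined)


def sidak_min_p(pvalues):
    """Minimum P value with a Sidak correction for n independent tests."""
    if not pvalues:
        raise ValueError("need at least one P value")
    return 1.0 - (1.0 - min(pvalues)) ** len(pvalues)


# Hypothetical marker-pair interaction P values for one pair of genes.
pair_pvalues = [0.01, 0.04, 0.03, 0.20]
gene_level_simes = simes_p(pair_pvalues)
gene_level_minp = sidak_min_p(pair_pvalues)
```

Both rules reduce many marker-level tests to a single test per gene pair, which is what relaxes the multiple-testing burden; with correlated markers the independence assumption makes these versions conservative or anti-conservative depending on the rule, which is why an effective number of tests is typically substituted for `n` in practice.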
|