1
Gaye A, Diongue AK, Komen LN, Diallo A, Sylla SN, Diarra M, Talla C, Loucoubar C. High-dimensional supervised classification in a context of non-independence of observations to identify the determining SNPs in a phenotype. Infect Dis Model 2023; 8:1079-1087. [PMID: 37727806 PMCID: PMC10505671 DOI: 10.1016/j.idm.2023.09.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2023] [Revised: 08/29/2023] [Accepted: 09/03/2023] [Indexed: 09/21/2023]
Abstract
This work addresses the problem of supervised classification for highly correlated high-dimensional data describing non-independent observations to identify SNPs related to a phenotype. We use a general penalized linear mixed model with a single random effect that performs simultaneous SNP selection and population structure adjustment in high-dimensional prediction models. Specifically, the model simultaneously selects variables and estimates their effects, taking into account correlations between individuals. Single nucleotide polymorphisms (SNPs) are a type of genetic variation and each SNP represents a difference in a single DNA building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct source population of an individual and can act in isolation or simultaneously to impact a phenotype. In this regard, the study of the contribution of genetics in infectious disease phenotypes is of great importance. In this study, we used uncorrelated variables from the construction of blocks of correlated variables done in a previous work to describe the most related observations of the dataset. The model was trained with 90% of the observations and tested with the remaining 10%. The best model obtained with the generalized information criterion (GIC) identified the SNP rs2493311, located on chromosome 1 in the gene PRDM16 (PR/SET domain 16), as the most decisive factor in malaria attacks.
Affiliation(s)
- Aboubacry Gaye
  - Laboratory for Studies and Research in Statistics and Development, Gaston Berger University of Saint Louis, Senegal
  - Epidemiology, Clinical Research and Data Science Unit, Institute Pasteur de Dakar, 220, Dakar, Senegal
- Abdou Ka Diongue
  - Laboratory for Studies and Research in Statistics and Development, Gaston Berger University of Saint Louis, Senegal
- Amadou Diallo
  - Epidemiology, Clinical Research and Data Science Unit, Institute Pasteur de Dakar, 220, Dakar, Senegal
- Seydou Nourou Sylla
  - Information and Communication Technologies for Development, Alioune Diop University of Bambey, Senegal
- Maryam Diarra
  - Epidemiology, Clinical Research and Data Science Unit, Institute Pasteur de Dakar, 220, Dakar, Senegal
- Cheikh Talla
  - Epidemiology, Clinical Research and Data Science Unit, Institute Pasteur de Dakar, 220, Dakar, Senegal
- Cheikh Loucoubar
  - Epidemiology, Clinical Research and Data Science Unit, Institute Pasteur de Dakar, 220, Dakar, Senegal
2
Dossa HRG, Bureau A, Maziade M, Lakhal-Chaieb L, Oualkacha K. A novel rare variants association test for binary traits in family-based designs via copulas. Stat Methods Med Res 2023; 32:2096-2122. [PMID: 37832140 PMCID: PMC10683345 DOI: 10.1177/09622802231197977] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2023]
Abstract
With the advent of cost-effective whole-genome sequencing technology, more sophisticated statistical methods for testing genetic association with both rare and common variants are being investigated to identify the genetic variation between individuals. Several methods that group variants, also called gene-based approaches, have been developed. For instance, advanced extensions of the sequence kernel association test, which is a widely used variant-set test, have been proposed for unrelated samples and extended for family data. Family data have been shown to be powerful when analyzing rare variants. However, most such methods capture familial relatedness using a random effect component within the generalized linear mixed model framework. Therefore, there is a need to develop unified and flexible methods to study the association between a set of genetic variants and a trait, especially for a binary outcome. Copulas are multivariate distribution functions with uniform margins on the [0, 1] interval and they provide suitable models to capture familial dependence structure. In this work, we propose a flexible family-based association test for both rare and common variants in the presence of binary traits. The method, termed novel rare variant association test (NRVAT), uses a marginal logistic model and a Gaussian copula. The latter is employed to model the dependence between relatives. An analytic score-type test is derived. Through simulations, we show that our method can achieve greater power than existing approaches. The proposed model is applied to investigate the association between schizophrenia and bipolar disorder in a family-based cohort consisting of 17 extended families from Eastern Quebec.
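The combination of a marginal logistic model with a Gaussian copula can be illustrated with a small simulation. The sketch below is not the NRVAT test itself; it only generates binary traits for a pair of relatives so that each trait keeps its logistic marginal probability while the pair is dependent through a latent Gaussian correlation. All function names, the coefficient values, and the correlation `rho` are hypothetical choices for illustration.

```python
import math
import random


def expit(x):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def std_normal_cdf(z):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulate_pair(p1, p2, rho, rng):
    """Draw binary traits for two relatives under a Gaussian copula.

    p1, p2 are the marginal probabilities; rho is the latent correlation
    (an illustrative stand-in for familial dependence).
    """
    g1 = rng.gauss(0.0, 1.0)
    g2 = rng.gauss(0.0, 1.0)
    z1 = g1
    z2 = rho * g1 + math.sqrt(1.0 - rho * rho) * g2  # corr(z1, z2) = rho
    u1 = std_normal_cdf(z1)  # uniform margin on [0, 1]
    u2 = std_normal_cdf(z2)
    # Thresholding the uniforms keeps P(Y = 1) equal to the marginal p.
    return int(u1 < p1), int(u2 < p2)


def simulate_pairs(n_pairs, beta0, beta1, g1, g2, rho, seed=0):
    """Simulate n_pairs relative pairs with logistic margins.

    g1 and g2 are the (hypothetical) variant counts of the two relatives.
    """
    rng = random.Random(seed)
    p1 = expit(beta0 + beta1 * g1)
    p2 = expit(beta0 + beta1 * g2)
    return [simulate_pair(p1, p2, rho, rng) for _ in range(n_pairs)]


def summarize(pairs):
    """Return (frequency of trait in relative 1, frequency both affected)."""
    n = len(pairs)
    marginal = sum(a for a, _ in pairs) / n
    both = sum(a * b for a, b in pairs) / n
    return marginal, both


if __name__ == "__main__":
    for rho in (0.0, 0.8):
        print(rho, summarize(simulate_pairs(20000, -1.0, 0.5, 0, 0, rho)))
```

Under these assumptions the marginal frequency stays close to `expit(beta0 + beta1 * g)` for any `rho`, while the frequency of both relatives being affected grows with `rho`, which is the sense in which the copula separates the marginal model from the dependence structure.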
Affiliation(s)
- Houssou R. G. Dossa
  - Département de Mathématiques, Université du Québec à Montréal (UQAM), Québec, Canada
- Alexandre Bureau
  - Département de Médecine Sociale et Préventive, Université Laval, Québec, Canada
  - Centre de Recherche CERVO, Quebec, Canada
- Michel Maziade
  - Centre de Recherche CERVO, Quebec, Canada
  - Département de Psychiatrie et Neuroscience, Université Laval, Québec, Canada
- Lajmi Lakhal-Chaieb
  - Département de Mathématiques et Statistique, Université Laval, Québec, Canada
- Karim Oualkacha
  - Département de Mathématiques, Université du Québec à Montréal (UQAM), Québec, Canada
3
Bhatnagar SR, Yang Y, Lu T, Schurr E, Loredo-Osti JC, Forest M, Oualkacha K, Greenwood CMT. Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. PLoS Genet 2020; 16:e1008766. [PMID: 32365090 PMCID: PMC7224575 DOI: 10.1371/journal.pgen.1008766] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/12/2019] [Revised: 05/14/2020] [Accepted: 04/08/2020] [Indexed: 12/23/2022]
Abstract
Complex traits are known to be influenced by a combination of environmental factors and rare and common genetic variants. However, detection of such multivariate associations can be compromised by low statistical power and confounding by population structure. Linear mixed effects models (LMM) can account for correlations due to relatedness but have not been applicable in high-dimensional (HD) settings where the number of fixed effect predictors greatly exceeds the number of samples. False positives or false negatives can result from two-stage approaches, where the residuals estimated from a null model adjusted for the subjects' relationship structure are subsequently used as the response in a standard penalized regression model. To overcome these challenges, we develop a general penalized LMM with a single random effect called ggmix for simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. We develop a blockwise coordinate descent algorithm with automatic tuning parameter selection which is highly scalable, computationally efficient and has theoretical guarantees of convergence. Through simulations and three real data examples, we show that ggmix leads to more parsimonious models compared to the two-stage approach or principal component adjustment with better prediction accuracy. Our method performs well even in the presence of highly correlated markers, and when the causal SNPs are included in the kinship matrix. ggmix can be used to construct polygenic risk scores and select instrumental variables in Mendelian randomization studies. Our algorithms are available in an R package available on CRAN (https://cran.r-project.org/package=ggmix).
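As a reading aid, a penalized LMM with a single random effect can be written in the generic form below. This is a sketch of one common parameterization, not a transcription of the paper's equations: the symbols $\Phi$ (a known kinship or relatedness matrix), $\eta$ (the share of variance attributed to the random effect), $\sigma^2$, and the lasso-type penalty with tuning parameter $\lambda$ are all notational assumptions.

```latex
% Response y (n x 1), SNP design matrix X (n x p, with p >> n),
% one random effect b capturing relatedness via a known matrix \Phi.
\begin{aligned}
  y &= X\beta + b + \varepsilon, \\
  b &\sim \mathcal{N}\!\left(0,\ \eta\,\sigma^{2}\,\Phi\right), \qquad
  \varepsilon \sim \mathcal{N}\!\left(0,\ (1-\eta)\,\sigma^{2} I_{n}\right), \\
  (\hat\beta, \hat\eta, \hat\sigma^{2})
    &= \arg\min_{\beta,\,\eta,\,\sigma^{2}}
       \; -\ell(\beta, \eta, \sigma^{2}) + \lambda \sum_{j=1}^{p} \lvert \beta_{j} \rvert .
\end{aligned}
```

Here $\ell$ is the Gaussian log-likelihood of $y$ with covariance $\sigma^{2}\left(\eta\Phi + (1-\eta) I_{n}\right)$. In this form the fixed effects and the variance components are estimated jointly, whereas a two-stage approach first removes the relatedness structure under a null model and only then applies a penalized regression to the residuals.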
Affiliation(s)
- Sahir R. Bhatnagar
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montréal, Québec, Canada
  - Department of Diagnostic Radiology, McGill University, Montréal, Québec, Canada
- Yi Yang
  - Department of Mathematics and Statistics, McGill University, Montréal, Québec, Canada
- Tianyuan Lu
  - Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
  - Lady Davis Institute, Jewish General Hospital, Montréal, Québec, Canada
- Erwin Schurr
  - Department of Medicine, McGill University, Montréal, Québec, Canada
- JC Loredo-Osti
  - Department of Mathematics and Statistics, Memorial University, St. John’s, Newfoundland and Labrador, Canada
- Marie Forest
  - École de Technologie Supérieure, Montréal, Québec, Canada
- Karim Oualkacha
  - Département de Mathématiques, Université du Québec à Montréal, Montréal, Québec, Canada
- Celia M. T. Greenwood
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montréal, Québec, Canada
  - Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
  - Lady Davis Institute, Jewish General Hospital, Montréal, Québec, Canada
  - Gerald Bronfman Department of Oncology, McGill University, Montréal, Québec, Canada
  - Department of Human Genetics, McGill University, Montréal, Québec, Canada
4
Hamazaki K, Iwata H. RAINBOW: Haplotype-based genome-wide association study using a novel SNP-set method. PLoS Comput Biol 2020; 16:e1007663. [PMID: 32059004 PMCID: PMC7046296 DOI: 10.1371/journal.pcbi.1007663] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 10/11/2019] [Revised: 02/27/2020] [Accepted: 01/18/2020] [Indexed: 11/18/2022]
Abstract
Difficulty in detecting rare variants is one of the problems in conventional genome-wide association studies (GWAS). The problem is closely related to the complex gene compositions comprising multiple alleles, such as haplotypes. Several single nucleotide polymorphism (SNP) set approaches have been proposed to solve this problem. These methods, however, have been rarely discussed in connection with haplotypes. In this study, we developed a novel SNP-set method named "RAINBOW" and applied the method to haplotype-based GWAS by regarding a haplotype block as a SNP-set. Combining haplotype block estimation and SNP-set GWAS, haplotype-based GWAS can be conducted without prior information of haplotypes. We prepared 100 datasets of simulated phenotypic data and real marker genotype data of Oryza sativa subsp. indica, and performed GWAS of the datasets. We compared the power of our method, the conventional single-SNP GWAS, the conventional haplotype-based GWAS, and the conventional SNP-set GWAS. Our proposed method was shown to be superior to these in three aspects: (1) controlling false positives; (2) in detecting causal variants without relying on the linkage disequilibrium if causal variants were genotyped in the dataset; and (3) it showed greater power than the other methods, i.e., it was able to detect causal variants that were not detected by the others, primarily when the causal variants were located very close to each other, and the directions of their effects were opposite. By using the SNP-set approach as in this study, we expect that detecting not only rare variants but also genes with complex mechanisms, such as genes with multiple causal variants, can be realized. RAINBOW was implemented as an R package named "RAINBOWR" and is available from CRAN (https://cran.r-project.org/web/packages/RAINBOWR/index.html) and GitHub (https://github.com/KosukeHamazaki/RAINBOWR).
Affiliation(s)
- Kosuke Hamazaki
  - Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan
- Hiroyoshi Iwata
  - Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan
5
Povysil G, Petrovski S, Hostyk J, Aggarwal V, Allen AS, Goldstein DB. Rare-variant collapsing analyses for complex traits: guidelines and applications. Nat Rev Genet 2019; 20:747-759. [PMID: 31605095 DOI: 10.1038/s41576-019-0177-4] [Citation(s) in RCA: 117] [Impact Index Per Article: 23.4] [Accepted: 09/06/2019] [Indexed: 12/11/2022]
Abstract
The first phase of genome-wide association studies (GWAS) assessed the role of common variation in human disease. Advances optimizing and economizing high-throughput sequencing have enabled a second phase of association studies that assess the contribution of rare variation to complex disease in all protein-coding genes. Unlike the early microarray-based studies, sequencing-based studies catalogue the full range of genetic variation, including the evolutionarily youngest forms. Although the experience with common variants helped establish relevant standards for genome-wide studies, the analysis of rare variation introduces several challenges that require novel analysis approaches.
Affiliation(s)
- Gundula Povysil
  - Institute for Genomic Medicine, Columbia University Irving Medical Center, Columbia University, New York, NY, USA
- Slavé Petrovski
  - Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
  - Department of Medicine, The University of Melbourne, Austin Health and Royal Melbourne Hospital, Melbourne, Victoria, Australia
- Joseph Hostyk
  - Institute for Genomic Medicine, Columbia University Irving Medical Center, Columbia University, New York, NY, USA
- Vimla Aggarwal
  - Institute for Genomic Medicine, Columbia University Irving Medical Center, Columbia University, New York, NY, USA
- Andrew S Allen
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
- David B Goldstein
  - Institute for Genomic Medicine, Columbia University Irving Medical Center, Columbia University, New York, NY, USA
6
Chiu CY, Yuan F, Zhang BS, Yuan A, Li X, Fang HB, Lange K, Weeks DE, Wilson AF, Bailey-Wilson JE, Musolf AM, Stambolian D, Lakhal-Chaieb ML, Cook RJ, McMahon FJ, Amos CI, Xiong M, Fan R. Linear mixed models for association analysis of quantitative traits with next-generation sequencing data. Genet Epidemiol 2019; 43:189-206. [PMID: 30537345 PMCID: PMC6375753 DOI: 10.1002/gepi.22177] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 06/13/2018] [Revised: 08/27/2018] [Accepted: 09/26/2018] [Indexed: 01/01/2023]
Abstract
We develop linear mixed models (LMMs) and functional linear mixed models (FLMMs) for gene-based tests of association between a quantitative trait and genetic variants on pedigrees. The effects of a major gene are modeled as a fixed effect, the contributions of polygenes are modeled as a random effect, and the correlations of pedigree members are modeled via inbreeding/kinship coefficients. F-statistics and χ² likelihood ratio test (LRT) statistics based on the LMMs and FLMMs are constructed to test for association. We show empirically that the F-distributed statistics provide a good control of the type I error rate. The F-test statistics of the LMMs have similar or higher power than the FLMMs, kernel-based famSKAT (family-based sequence kernel association test), and burden test famBT (family-based burden test). The F-statistics of the FLMMs perform well when analyzing a combination of rare and common variants. For small samples, the LRT statistics of the FLMMs control the type I error rate well at the nominal levels α = 0.01 and 0.05. For moderate/large samples, the LRT statistics of the FLMMs control the type I error rates well. The LRT statistics of the LMMs can lead to inflated type I error rates. The proposed models are useful in whole genome and whole exome association studies of complex traits.
Affiliation(s)
- Chi-Yang Chiu
  - Division of Biostatistics, Department of Preventive Medicine, University of Tennessee Health Science Center, Memphis, Tennessee
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
- Fang Yuan
  - Department of Biochemistry and Molecular Biology, School of Basic Medicine, Kunming Medical University, Kunming, Yunnan, China
- Bing-Song Zhang
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
- Ao Yuan
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
- Xin Li
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
- Hong-Bin Fang
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
- Kenneth Lange
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California
- Daniel E Weeks
  - Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania
  - Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania
- Alexander F Wilson
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
- Joan E Bailey-Wilson
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
- Anthony M Musolf
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
- Dwight Stambolian
  - Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania
- Richard J Cook
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Francis J McMahon
  - Human Genetics Branch and Genetic Basis of Mood and Anxiety Disorders Section, National Institute of Mental Health, NIH, Bethesda, Maryland
- Momiao Xiong
  - Human Genetics Center, University of Texas-Houston, Houston, Texas
- Ruzong Fan
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
  - Department of Biochemistry and Molecular Biology, School of Basic Medicine, Kunming Medical University, Kunming, Yunnan, China
7
Multivariate association test for rare variants controlling for cryptic and family relatedness. Can J Stat 2019. [DOI: 10.1002/cjs.11475] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/07/2022]
8
Chen H, Huffman JE, Brody JA, Wang C, Lee S, Li Z, Gogarten SM, Sofer T, Bielak LF, Bis JC, Blangero J, Bowler RP, Cade BE, Cho MH, Correa A, Curran JE, de Vries PS, Glahn DC, Guo X, Johnson AD, Kardia S, Kooperberg C, Lewis JP, Liu X, Mathias RA, Mitchell BD, O’Connell JR, Peyser PA, Post WS, Reiner AP, Rich SS, Rotter JI, Silverman EK, Smith JA, Vasan RS, Wilson JG, Yanek LR, Redline S, Smith NL, Boerwinkle E, Borecki IB, Cupples LA, Laurie CC, Morrison AC, Rice KM, Lin X. Efficient Variant Set Mixed Model Association Tests for Continuous and Binary Traits in Large-Scale Whole-Genome Sequencing Studies. Am J Hum Genet 2019; 104:260-274. [PMID: 30639324 DOI: 10.1016/j.ajhg.2018.12.012] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Received: 08/18/2018] [Accepted: 12/17/2018] [Indexed: 12/12/2022]
Abstract
With advances in whole-genome sequencing (WGS) technology, more advanced statistical methods for testing genetic association with rare variants are being developed. Methods in which variants are grouped for analysis are also known as variant-set, gene-based, and aggregate unit tests. The burden test and sequence kernel association test (SKAT) are two widely used variant-set tests, which were originally developed for samples of unrelated individuals and later have been extended to family data with known pedigree structures. However, computationally efficient and powerful variant-set tests are needed to make analyses tractable in large-scale WGS studies with complex study samples. In this paper, we propose the variant-set mixed model association tests (SMMAT) for continuous and binary traits using the generalized linear mixed model framework. These tests can be applied to large-scale WGS studies involving samples with population structure and relatedness, such as in the National Heart, Lung, and Blood Institute's Trans-Omics for Precision Medicine (TOPMed) program. SMMATs share the same null model for different variant sets, and a virtue of this null model, which includes covariates only, is that it needs to be fit only once for all tests in each genome-wide analysis. Simulation studies show that all the proposed SMMATs correctly control type I error rates for both continuous and binary traits in the presence of population structure and relatedness. We also illustrate our tests in a real data example of analysis of plasma fibrinogen levels in the TOPMed program (n = 23,763), using the Analysis Commons, a cloud-based computing platform.
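The difference between the burden test and SKAT mentioned in the abstract can be seen from their score statistics. The sketch below is a simplified illustration for unrelated individuals with an intercept-only null model; it does not implement SMMAT, which fits a generalized linear mixed model to handle relatedness, and it computes only the statistics, not their p-values. Function names, weights, and the toy data are hypothetical.

```python
def score_vector(genotypes, y):
    """Per-variant score S_j = sum_i G_ij * (y_i - mean(y)).

    genotypes: list of rows, one per individual, each a list of variant counts.
    y: list of trait values (0/1 or continuous).
    The null model here is intercept-only, an assumption made for brevity.
    """
    n = len(y)
    mu = sum(y) / n
    resid = [yi - mu for yi in y]
    n_variants = len(genotypes[0])
    return [
        sum(genotypes[i][j] * resid[i] for i in range(n))
        for j in range(n_variants)
    ]


def burden_stat(scores, weights):
    """Burden statistic: square of the weighted sum of scores.

    Effects in opposite directions can cancel inside the sum.
    """
    return sum(w * s for w, s in zip(weights, scores)) ** 2


def skat_stat(scores, weights):
    """SKAT-type statistic: weighted sum of squared scores.

    Squaring each score first means opposite directions do not cancel.
    """
    return sum((w * s) ** 2 for w, s in zip(weights, scores))


if __name__ == "__main__":
    # Toy data: variant 1 is carried only by cases, variant 2 only by controls.
    y = [1, 1, 0, 0]
    genotypes = [[1, 0], [1, 0], [0, 1], [0, 1]]
    weights = [1.0, 1.0]
    scores = score_vector(genotypes, y)
    print("scores:", scores)
    print("burden:", burden_stat(scores, weights))
    print("SKAT:  ", skat_stat(scores, weights))
```

In this toy example the two variants have opposite associations with the trait, so the burden statistic collapses to zero while the SKAT-type statistic stays positive, which is the usual argument for kernel tests when effect directions are mixed.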
Affiliation(s)
- Kenneth M Rice
  - Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
- Xihong Lin
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Department of Statistics, Harvard University, Cambridge, MA 02138, USA
9
Larson NB, Chen J, Schaid DJ. A review of kernel methods for genetic association studies. Genet Epidemiol 2019; 43:122-136. [PMID: 30604442 DOI: 10.1002/gepi.22180] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 08/27/2018] [Revised: 11/09/2018] [Accepted: 11/26/2018] [Indexed: 12/17/2022]
Abstract
Evaluating the association of multiple genetic variants with a trait of interest by use of kernel-based methods has made a significant impact on how genetic association analyses are conducted. An advantage of kernel methods is that they tend to be robust when the genetic variants have effects that are a mixture of positive and negative effects, as well as when there is a small fraction of causal variants. Another advantage is that kernel methods fit within the framework of mixed models, providing flexible ways to adjust for additional covariates that influence traits. Herein, we review the basic ideas behind the use of kernel methods for genetic association analysis as well as recent methodological advancements for different types of traits, multivariate traits, pedigree data, and longitudinal data. Finally, we discuss opportunities for future research.
Affiliation(s)
- Nicholas B Larson
  - Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
- Jun Chen
  - Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
- Daniel J Schaid
  - Department of Health Sciences Research, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
10
Fuady AM, Lent S, Sarnowski C, Tintle NL. Application of novel and existing methods to identify genes with evidence of epigenetic association: results from GAW20. BMC Genet 2018; 19:72. [PMID: 30255777 PMCID: PMC6157126 DOI: 10.1186/s12863-018-0647-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/15/2022]
Abstract
BACKGROUND The rise in popularity and accessibility of DNA methylation data to evaluate epigenetic associations with disease has led to numerous methodological questions. As part of GAW20, our working group of 8 research groups focused on gene searching methods. RESULTS Although the methods were varied, we identified 3 main themes within our group. First, many groups tackled the question of how best to use pedigree information in downstream analyses, finding that (a) the use of kinship matrices is common practice, (b) ascertainment corrections may be necessary, and (c) pedigree information may be useful for identifying parent-of-origin effects. Second, many groups also considered multimarker versus single-marker tests. Multimarker tests had modestly improved power versus single-marker methods on simulated data, and on real data identified additional associations that were not identified with single-marker methods, including identification of a gene with a strong biological interpretation. Finally, some of the groups explored methods to combine single-nucleotide polymorphism (SNP) and DNA methylation into a single association analysis. CONCLUSIONS A causal inference method showed promise at discovering new mechanisms of SNP activity; gene-based methods of summarizing SNP and DNA methylation data also showed promise. Even though numerous questions still remain in the analysis of DNA methylation data, our discussions at GAW20 suggest some emerging best practices.
Affiliation(s)
- Angga M. Fuady
  - Medical Statistics, Department of Biomedical Data Sciences, Leiden University Medical Center, Einthovenweg 20, 2333 ZC Leiden, Netherlands
- Samantha Lent
  - Department of Biostatistics, Boston University School of Public Health, 801 Massachusetts Avenue, Boston, MA 02118 USA
- Chloé Sarnowski
  - Department of Biostatistics, Boston University School of Public Health, 801 Massachusetts Avenue, Boston, MA 02118 USA
- Nathan L. Tintle
  - Department of Mathematics and Statistics, Dordt College, Sioux Center, IA 51250 USA
11
Zhao K, Jiang L, Klein K, Greenwood CMT, Oualkacha K. CpG-set association assessment of lipid concentration changes and DNA methylation. BMC Proc 2018; 12:30. [PMID: 30263044 PMCID: PMC6157033 DOI: 10.1186/s12919-018-0127-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
Abstract
Epigenome association studies that test a large number of methylation sites suffer from stringent multiple-testing corrections. This study's goals were to investigate region-based associations between DNA methylation sites and lipid-level changes in response to the treatment with fenofibrate in the GAW20 data and to investigate whether improvements in power could be obtained by taking into account correlations between DNA methylation at neighboring cytosine-phosphate-guanine (CpG) sites. To this end, we applied both a recently developed block-based data-dimension-reduction approach and a region-based variance-component (VC) linear mixed model to GAW20 data. We compared analyses of unrelated individuals with familial data. The region-based VC approach using unrelated (independent) individuals identified the gene LGALS9C as significantly associated with changes in triglycerides. However, univariate tests of individual CpG sites yielded no valid statistically significant results.
Affiliation(s)
- Kaiqiong Zhao
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, 1020 Pine Avenue West, Montreal, Quebec, H3A 1A2 Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, 3755 Côte Ste. Catherine, Montreal, Quebec, H3T 1E2 Canada
- Lai Jiang
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, 1020 Pine Avenue West, Montreal, Quebec, H3A 1A2 Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, 3755 Côte Ste. Catherine, Montreal, Quebec, H3T 1E2 Canada
- Kathleen Klein
  - Lady Davis Institute for Medical Research, Jewish General Hospital, 3755 Côte Ste. Catherine, Montreal, Quebec, H3T 1E2 Canada
- Celia M. T. Greenwood
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, 1020 Pine Avenue West, Montreal, Quebec, H3A 1A2 Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, 3755 Côte Ste. Catherine, Montreal, Quebec, H3T 1E2 Canada
  - Departments of Oncology and Human Genetics, McGill University, 3640 rue University, Montreal, Quebec, H3A 0C7 Canada
- Karim Oualkacha
  - Department of Mathematics, Université du Québec à Montréal, 201, Ave. President Kennedy, Montreal, Quebec, H2X 3Y7 Canada
12
Wang X, Zhang Z, Morris N, Cai T, Lee S, Wang C, Yu TW, Walsh CA, Lin X. Rare variant association test in family-based sequencing studies. Brief Bioinform 2018; 18:954-961. [PMID: 27677958 DOI: 10.1093/bib/bbw083] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/06/2016] [Indexed: 12/20/2022]
Abstract
The objective of this article is to introduce valid and robust methods for the analysis of rare variants for family-based exome chips, whole-exome sequencing or whole-genome sequencing data. Family-based designs provide unique opportunities to detect genetic variants that complement studies of unrelated individuals. Currently, limited methods and software tools have been developed to assist family-based association studies with rare variants, especially for analyzing binary traits. In this article, we address this gap by extending existing burden and kernel-based gene set association tests for population data to related samples, with a particular emphasis on binary phenotypes. The proposed approach blends the strengths of kernel machine methods and generalized estimating equations. Importantly, the efficient generalized kernel score test can be applied as a mega-analysis framework to combine studies with different designs. We illustrate the application of the proposed method using data from an exome sequencing study of autism. Methods discussed in this article are implemented in an R package 'gskat', which is available on CRAN and GitHub.
13
Novel Methods for Family-Based Genetic Studies. Methods Mol Biol 2018. [PMID: 29876895 DOI: 10.1007/978-1-4939-7868-7_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
The recent development of microarray and sequencing technology allows identification of disease susceptibility genes. Although the genome-wide association studies (GWAS) have successfully identified many genetic markers related to human diseases, the traditional statistical methods are not powerful to detect rare genetic markers. The rare genetic markers are usually grouped together and tested at the set level. One of such methods is the sequence kernel association test (SKAT), which has been commonly used in the rare genetic marker analysis. In recent publications, SKAT has been extended to be applicable for family-based rare variant analysis. Here, I present three published statistical approaches for family-based rare variant analysis for: 1. continuous traits, 2. binary traits, and 3. multiple correlated traits.
|
14
|
Abstract
Relatedness within a sample can be of ancient (population stratification) or recent (familial structure) origin, and can either be known (pedigree data) or unknown (cryptic relatedness). All of these forms of familial relatedness have the potential to confound the results of genome-wide association studies. This chapter reviews the major methods available to researchers to adjust for the biases introduced by relatedness and maximize power to detect associations. The advantages and disadvantages of different methods are presented with reference to elements of study design, population characteristics, and computational requirements.
Affiliation(s)
- Russell Thomson
- Centre for Research in Mathematics, School of Computing, Engineering and Mathematics, Western Sydney University, Parramatta, Australia
- Rebekah McWhirter
- Menzies Institute for Medical Research, University of Tasmania, Hobart, TAS, Australia
|
15
|
Abstract
While genome-wide association studies have been very successful in identifying associations of common genetic variants with many different traits, the rarer frequency spectrum of the genome has not yet been comprehensively explored. Technological developments increasingly lift restrictions to access rare genetic variation. Dense reference panels enable improved genotype imputation for rarer variants in studies using DNA microarrays. Moreover, the decreasing cost of next generation sequencing makes whole exome and genome sequencing increasingly affordable for large samples. Large-scale efforts based on sequencing, such as ExAC, 100,000 Genomes, and TopMed, are likely to significantly advance this field. The main challenge in evaluating complex trait associations of rare variants is statistical power. The choice of population should be considered carefully because allele frequencies and linkage disequilibrium structure differ between populations. Genetically isolated populations can have favorable genomic characteristics for the study of rare variants. One strategy to increase power is to assess the combined effect of multiple rare variants within a region, known as aggregate testing. A range of methods have been developed for this. Model performance depends on the genetic architecture of the region of interest.
Affiliation(s)
- Karoline Kuchenbaecker
- Wellcome Trust Sanger Institute, Cambridge, UK; University College London, London, UK
- Emil Vincent Rosenbaum Appel
- Novo Nordisk Foundation Center for Basic Metabolic Research, Section for Metabolic Genetics, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
|
16
|
Malzahn D, Friedrichs S, Bickeböller H. Comparing strategies for combined testing of rare and common variants in whole sequence and genome-wide genotype data. BMC Proc 2016; 10:269-273. [PMID: 27980648 PMCID: PMC5133495 DOI: 10.1186/s12919-016-0042-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022] Open
Abstract
We used our extension of the kernel score test to family data to analyze real and simulated baseline systolic blood pressure in extended pedigrees. We compared the power for different kernels and for different weightings of genetic markers. Moreover, we compared the power of rare and common markers with 3 strategies for joint testing and on marker panels with different densities. Marker weights had much greater influence on power than the kernel chosen. Inverse minor allele frequency weights often increased power on common markers but could decrease power on rare markers. Furthermore, defining the gene region based on linkage disequilibrium blocks often yielded robust power of joint tests of rare and common markers.
Affiliation(s)
- Dörthe Malzahn
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Humboldtallee 32, 37073 Göttingen, Germany
- Stefanie Friedrichs
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Humboldtallee 32, 37073 Göttingen, Germany
- Heike Bickeböller
- Department of Genetic Epidemiology, University Medical Center, Georg-August University Göttingen, Humboldtallee 32, 37073 Göttingen, Germany
|
17
|
Lin PL, Tsai WY, Chung RH. A combined association test for rare variants using family and case-control data. BMC Proc 2016; 10:215-219. [PMID: 27980639 PMCID: PMC5133518 DOI: 10.1186/s12919-016-0033-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022] Open
Abstract
Statistical association tests for rare variants can be classified as the burden approach and the sequence kernel association test (SKAT) approach. The burden and SKAT approaches, originally developed for case–control analysis, have also been extended to family-based tests. In the presence of both case–control and family data for a study, joint analysis for the combined data set can increase the statistical power. We extended the Combined Association in the Presence of Linkage (CAPL) test, using both case–control and family data for testing common variants, to rare variant association analysis. The burden and SKAT algorithms were applied to the CAPL test. We used simulations to verify that the CAPL tests incorporating the burden and SKAT algorithms have correct type I error rates. Power studies suggested that both tests have adequate power to identify rare variants associated with the disease. We applied the tests to the Genetic Analysis Workshop 19 data set using the combined family and case–control data for hypertension. The analysis identified several candidate genes for hypertension.
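As a rough illustration of the two families of tests named in this abstract, the sketch below computes a burden statistic and a SKAT-type statistic with a weighted linear kernel. It is a minimal sketch under simplifying assumptions that are not taken from the article: unrelated individuals, a continuous trait, an intercept-only null model, and user-supplied variant weights. It returns the raw statistics only (no p-values), and it does not reproduce the CAPL extension or any family-based correction. The function names are hypothetical.

```python
def score_per_variant(genotypes, phenotype):
    """Per-variant score S_j = sum_i g_ij * (y_i - mean(y)).

    genotypes: list of rows, one per individual, each a list of allele counts.
    phenotype: list of trait values, one per individual.
    Assumes an intercept-only null model (an illustrative simplification).
    """
    n = len(phenotype)
    mean_y = sum(phenotype) / n
    residuals = [y - mean_y for y in phenotype]
    n_variants = len(genotypes[0])
    return [
        sum(genotypes[i][j] * residuals[i] for i in range(n))
        for j in range(n_variants)
    ]


def burden_statistic(genotypes, phenotype, weights):
    """Burden-type statistic: collapse variants first, then square.

    Q_burden = (sum_j w_j * S_j) ** 2
    """
    scores = score_per_variant(genotypes, phenotype)
    return sum(w * s for w, s in zip(weights, scores)) ** 2


def skat_statistic(genotypes, phenotype, weights):
    """SKAT-type statistic with a weighted linear kernel.

    Q_SKAT = sum_j (w_j * S_j) ** 2
    """
    scores = score_per_variant(genotypes, phenotype)
    return sum((w * s) ** 2 for w, s in zip(weights, scores))


# Toy data (made up for illustration): 4 individuals, 2 variants.
toy_genotypes = [[1, 0], [0, 1], [0, 0], [1, 1]]
toy_phenotype = [2.0, 0.0, 1.0, 1.0]
toy_weights = [1.0, 1.0]

q_burden = burden_statistic(toy_genotypes, toy_phenotype, toy_weights)
q_skat = skat_statistic(toy_genotypes, toy_phenotype, toy_weights)
```

In this toy data the two variants have scores of opposite sign, so they cancel in the burden statistic, while the SKAT-type statistic sums their squares and stays positive.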
Affiliation(s)
- Peng-Lin Lin
- Department of Medical Science, National Tsing Hua University, Hsin-Chu, Taiwan
- Wei-Yun Tsai
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan, Miaoli Taiwan
- Ren-Hua Chung
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan, Miaoli Taiwan
|
18
|
Increasing Generality and Power of Rare-Variant Tests by Utilizing Extended Pedigrees. Am J Hum Genet 2016; 99:846-859. [PMID: 27666371 DOI: 10.1016/j.ajhg.2016.08.015] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 02/13/2016] [Accepted: 08/17/2016] [Indexed: 11/24/2022] Open
Abstract
Recently, multiple studies have performed whole-exome or whole-genome sequencing to identify groups of rare variants associated with complex traits and diseases. They have primarily utilized case-control study designs that often require thousands of individuals to reach acceptable statistical power. Family-based studies can be more powerful because a rare variant can be enriched in an extended pedigree and segregate with the phenotype. Although many methods have been proposed for using family data to discover rare variants involved in a disease, a majority of them focus on a specific pedigree structure and are designed to analyze either binary or continuously measured outcomes. In this article, we propose RareIBD, a general and powerful approach to identifying rare variants involved in disease susceptibility. Our method can be applied to large extended families of arbitrary structure, including pedigrees with only affected individuals. The method accommodates both binary and quantitative traits. A series of simulation experiments suggest that RareIBD is a powerful test that outperforms existing approaches. In addition, our method accounts for individuals in top generations, which are not usually genotyped in extended families. In contrast to available statistical tests, RareIBD generates accurate p values even when genetic data from these individuals are missing. We applied RareIBD, as well as other methods, to two extended family datasets generated by different genotyping technologies and representing different ethnicities. The analysis of real data confirmed that RareIBD is the only method that properly controls type I error.
|
19
|
Jeng XJ, Daye ZJ, Lu W, Tzeng JY. Rare Variants Association Analysis in Large-Scale Sequencing Studies at the Single Locus Level. PLoS Comput Biol 2016; 12:e1004993. [PMID: 27355347 PMCID: PMC4927097 DOI: 10.1371/journal.pcbi.1004993] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 10/20/2015] [Accepted: 05/21/2016] [Indexed: 11/24/2022] Open
Abstract
Genetic association analyses of rare variants in next-generation sequencing (NGS) studies are fundamentally challenging due to the presence of a very large number of candidate variants at extremely low minor allele frequencies. Recent developments often focus on pooling multiple variants to provide association analysis at the gene instead of the locus level. Nonetheless, pinpointing individual variants is a critical goal for genomic researches as such information can facilitate the precise delineation of molecular mechanisms and functions of genetic factors on diseases. Due to the extreme rarity of mutations and high-dimensionality, significances of causal variants cannot easily stand out from those of noncausal ones. Consequently, standard false-positive control procedures, such as the Bonferroni and false discovery rate (FDR), are often impractical to apply, as a majority of the causal variants can only be identified along with a few but unknown number of noncausal variants. To provide informative analysis of individual variants in large-scale sequencing studies, we propose the Adaptive False-Negative Control (AFNC) procedure that can include a large proportion of causal variants with high confidence by introducing a novel statistical inquiry to determine those variants that can be confidently dispatched as noncausal. The AFNC provides a general framework that can accommodate for a variety of models and significance tests. The procedure is computationally efficient and can adapt to the underlying proportion of causal variants and quality of significance rankings. Extensive simulation studies across a plethora of scenarios demonstrate that the AFNC is advantageous for identifying individual rare variants, whereas the Bonferroni and FDR are exceedingly over-conservative for rare variants association studies. In the analyses of the CoLaus dataset, AFNC has identified individual variants most responsible for gene-level significances. 
Moreover, single-variant results using the AFNC have been successfully applied to infer related genes with annotation information. Next-generation sequencing technologies have allowed genetic association studies of complex traits at the single base-pair resolution, where most genetic variants have extremely low mutation frequencies. These rare variants have been the focus of modern statistical-computational genomics due to their potential to explain missing disease heritability. The identification of individual rare variants associated with diseases can provide new biological insights and enable the precise delineation of disease mechanisms. However, due to the extreme rarity of mutations and large numbers of variants, significances of causative variants tend to be mixed inseparably with a few noncausative ones, and standard multiple testing procedures controlling for false positives fail to provide a meaningful way to include a large proportion of the causative variants. To address the challenge of detecting weak biological signals, we propose a novel statistical procedure, based on false-negative control, to provide a practical approach for variant inclusion in large-scale sequencing studies. By determining those variants that can be confidently dispatched as noncausative, the proposed procedure offers an objective selection of a modest number of potentially causative variants at the single-locus level. Results can be further prioritized or used to infer disease-associated genes with annotation information.
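For readers unfamiliar with the two baseline procedures this abstract compares against, the following is a minimal sketch of Bonferroni and Benjamini-Hochberg (FDR) rejection rules applied to a list of p-values. It does not implement the AFNC procedure itself, whose details are not given here. The function names and the example p-values are assumptions made for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (family-wise error control)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up rule for false discovery rate control.

    Finds the largest rank k such that p_(k) <= (k / m) * alpha and
    rejects the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject


# Made-up p-values for five variants.
example_p = [0.001, 0.012, 0.025, 0.045, 0.5]
bonf = bonferroni_reject(example_p)
bh = benjamini_hochberg_reject(example_p)
```

With these example values Bonferroni keeps only the first variant, whereas the Benjamini-Hochberg rule keeps the first three, which shows on a small scale why stricter family-wise control can discard weak signals.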
Affiliation(s)
- Xinge Jessie Jeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Zhongyin John Daye
- Epidemiology and Biostatistics, University of Arizona, Tucson, Arizona, United States of America
- Wenbin Lu
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
|
20
|
Yan Q, Weeks DE, Tiwari HK, Yi N, Zhang K, Gao G, Lin WY, Lou XY, Chen W, Liu N. Rare-Variant Kernel Machine Test for Longitudinal Data from Population and Family Samples. Hum Hered 2016; 80:126-38. [PMID: 27161037 DOI: 10.1159/000445057] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 09/26/2015] [Accepted: 02/24/2016] [Indexed: 01/12/2023] Open
Abstract
OBJECTIVE: The kernel machine (KM) test reportedly performs well in the set-based association test of rare variants. Many studies have been conducted to measure phenotypes at multiple time points, but the standard KM methodology has only been available for phenotypes at a single time point. In addition, family-based designs have been widely used in genetic association studies; therefore, the data analysis method used must appropriately handle familial relatedness. A rare-variant test does not currently exist for longitudinal data from family samples. Therefore, in this paper, we aim to introduce an association test for rare variants, which includes multiple longitudinal phenotype measurements for either population or family samples. METHODS: This approach uses KM regression based on the linear mixed model framework and is applicable to longitudinal data from either population (L-KM) or family samples (LF-KM). RESULTS: In our population-based simulation studies, L-KM has good control of Type I error rate and increased power in all the scenarios we considered compared with other competing methods. Conversely, in the family-based simulation studies, we found an inflated Type I error rate when L-KM was applied directly to the family samples, whereas LF-KM retained the desired Type I error rate and had the best power performance overall. Finally, we illustrate the utility of our proposed LF-KM approach by analyzing data from an association study between rare variants and blood pressure from the Genetic Analysis Workshop 18 (GAW18). CONCLUSION: We propose a method for rare-variant association testing in population and family samples using phenotypes measured at multiple time points for each subject. The proposed method has the best power performance compared to competing approaches in our simulation study.
Affiliation(s)
- Qi Yan
- Division of Pulmonary Medicine, Allergy and Immunology, Department of Pediatrics, Children's Hospital of Pittsburgh of UPMC, Pittsburgh, Pa., USA
|
21
|
Oualkacha K, Lakhal-Chaieb L, Greenwood CM. Software Application Profile: RVPedigree: a suite of family-based rare variant association tests for normally and non-normally distributed quantitative traits. Int J Epidemiol 2016; 45:402-7. [PMID: 27085080 DOI: 10.1093/ije/dyw047] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Accepted: 02/18/2016] [Indexed: 01/09/2023] Open
Abstract
MOTIVATION: RVPedigree (Rare Variant association tests in Pedigrees) implements a suite of programs facilitating genome-wide analysis of association between a quantitative trait and autosomal region-based genetic variation. The main features here are the ability to appropriately test for association of rare variants with non-normally distributed quantitative traits, and also to appropriately adjust for related individuals, either from families or from population structure and cryptic relatedness. IMPLEMENTATION: RVPedigree is available as an R package. GENERAL FEATURES: The package includes calculation of kinship matrices, various options for coping with non-normality, three different ways of estimating statistical significance incorporating triaging to enable efficient use of the most computationally-intensive calculations, and a parallelization option for genome-wide analysis. AVAILABILITY: The software is available from the Comprehensive R Archive Network [CRAN.R-project.org] under the name 'RVPedigree' and at [https://github.com/GreenwoodLab]. It has been published under General Public License (GPL) version 3 or newer.
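The kinship matrices mentioned in this abstract can be derived from a known pedigree with the standard recursive kinship formula. The sketch below is an illustration in Python only. It is not the RVPedigree API (RVPedigree is an R package), and the function name, input layout, and the decision to treat any individual with a missing parent as a founder are assumptions made here.

```python
def kinship_matrix(pedigree):
    """Kinship coefficients from a pedigree via the standard recursion.

    pedigree: list of (individual_id, father_id, mother_id) tuples, with
    parents listed before their children and None for unknown parents.
    Individuals with a missing parent are treated as unrelated, non-inbred
    founders (an illustrative simplification).

    Recursion, for a non-founder i with father f and mother m:
        phi(i, i) = 0.5 * (1 + phi(f, m))
        phi(i, j) = 0.5 * (phi(f, j) + phi(m, j))   for j listed before i
    """
    n = len(pedigree)
    index = {}
    phi = [[0.0] * n for _ in range(n)]
    for i, (ind, father, mother) in enumerate(pedigree):
        index[ind] = i
        if father is None or mother is None:
            phi[i][i] = 0.5  # founder: kinship with self is 1/2
            continue
        f, m = index[father], index[mother]
        for j in range(i):
            phi[i][j] = phi[j][i] = 0.5 * (phi[f][j] + phi[m][j])
        phi[i][i] = 0.5 * (1.0 + phi[f][m])
    return phi


# Hypothetical nuclear family: two founders and two full siblings.
family = [
    ("dad", None, None),
    ("mom", None, None),
    ("kid1", "dad", "mom"),
    ("kid2", "dad", "mom"),
]
phi = kinship_matrix(family)
```

For this family the parent-child and full-sibling kinship coefficients both come out at 0.25, which matches the textbook values for non-inbred pedigrees.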
Affiliation(s)
- Karim Oualkacha
- Département de mathématiques, Université du Québec à Montréal, QC, Canada
- Lajmi Lakhal-Chaieb
- Département de mathématiques et de statistique, Université Laval, Québec, QC, Canada
- Celia MT Greenwood
- Lady Davis Institute, Jewish General Hospital, Montréal, QC, Canada; Departments of Oncology, Epidemiology, Biostatistics & Occupational Health, and Human Genetics, McGill University, Montreal, QC, Canada; Ludmer Centre for Neuroinformatics and Mental Health, Montreal, QC, Canada
|
22
|
Belonogova NM, Svishcheva GR, Axenovich TI. FREGAT: an R package for region-based association analysis. Bioinformatics 2016; 32:2392-3. [PMID: 27153598 DOI: 10.1093/bioinformatics/btw160] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 11/09/2015] [Accepted: 03/20/2016] [Indexed: 11/14/2022]
Abstract
Several approaches to the region-based association analysis of quantitative traits have recently been developed and successively applied. However, no software package has been developed that implements all of these approaches for either independent or structured samples. Here we introduce FREGAT (Family REGional Association Tests), an R package that can handle family and population samples and implements a wide range of region-based association methods including burden tests, functional linear models, and kernel machine-based regression. FREGAT can be used in genome/exome-wide region-based association studies of quantitative traits and candidate gene analysis. FREGAT offers many useful options to empower its users and increase the effectiveness and applicability of region-based association analysis. AVAILABILITY AND IMPLEMENTATION: https://cran.r-project.org/web/packages/FREGAT/index.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Online. CONTACT: belon@bionet.nsc.ru.
Affiliation(s)
- Nadezhda M Belonogova
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk
- Gulnara R Svishcheva
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk; Vavilov Institute of General Genetics, the Russian Academy of Sciences, Moscow, Russia
- Tatiana I Axenovich
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk; Novosibirsk State University, Novosibirsk, Russia
|
23
|
Wu B, Pankow JS. On Sample Size and Power Calculation for Variant Set-Based Association Tests. Ann Hum Genet 2016; 80:136-43. [PMID: 26831402 PMCID: PMC4761288 DOI: 10.1111/ahg.12147] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 08/19/2015] [Accepted: 12/07/2015] [Indexed: 01/03/2023]
Abstract
Sample size and power calculations are an important part of designing new sequence-based association studies. The recently developed SEQPower and SPS programs adopted computationally intensive Monte Carlo simulations to empirically estimate power for a series of variant set association (VSA) test methods including the sequence kernel association test (SKAT). It is desirable to develop methods that can quickly and accurately compute power without intensive Monte Carlo simulations. We will show that the computed power for SKAT based on the existing analytical approach could be inflated especially for small significance levels, which are often of primary interest for large-scale whole genome and exome sequencing projects. We propose a new χ²-approximation-based approach to accurately and efficiently compute sample size and power. In addition, we propose and implement a more accurate "exact" method to compute power, which is more efficient than the Monte Carlo approach though generally involves more computations than the χ² approximation method. The exact approach could produce very accurate results and be used to verify alternative approximation approaches. We implement the proposed methods in publicly available R programs that can be readily adapted when planning sequencing projects.
Affiliation(s)
- Baolin Wu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- James S. Pankow
- Division of Epidemiology and Community Health, School of Public Health, University of Minnesota, Minneapolis, MN, USA
|
24
|
A method for analyzing multiple continuous phenotypes in rare variant association studies allowing for flexible correlations in variant effects. Eur J Hum Genet 2016; 24:1344-51. [PMID: 26860061 PMCID: PMC4989219 DOI: 10.1038/ejhg.2016.8] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 03/27/2015] [Revised: 12/22/2015] [Accepted: 12/30/2015] [Indexed: 01/05/2023] Open
Abstract
For region-based sequencing data, power to detect genetic associations can be improved through analysis of multiple related phenotypes. With this motivation, we propose a novel test to detect association simultaneously between a set of rare variants, such as those obtained by sequencing in a small genomic region, and multiple continuous phenotypes. We allow arbitrary correlations among the phenotypes and build on a linear mixed model by assuming the effects of the variants follow a multivariate normal distribution with a zero mean and a specific covariance matrix structure. In order to account for the unknown correlation parameter in the covariance matrix of the variant effects, a data-adaptive variance component test based on score-type statistics is derived. As our approach can calculate the P-value analytically, the proposed test procedure is computationally efficient. Broad simulations and an application to the UK10K project show that our proposed multivariate test is generally more powerful than univariate tests, especially when there are pleiotropic effects or highly correlated phenotypes.
|
25
|
Wu B, Guan W, Pankow JS. On Efficient and Accurate Calculation of Significance P-Values for Sequence Kernel Association Testing of Variant Set. Ann Hum Genet 2016; 80:123-35. [PMID: 26757198 DOI: 10.1111/ahg.12144] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 07/10/2015] [Accepted: 11/12/2015] [Indexed: 01/04/2023]
Abstract
The objective of this paper is to discuss and develop alternative computational methods to accurately and efficiently calculate significance P-values for the commonly used sequence kernel association test (SKAT) and adaptive sum of SKAT and burden test (SKAT-O) for variant set association. We show that the existing software can lead to either conservative or inflated type I errors. We develop alternative and efficient computational algorithms that quickly compute the SKAT P-value and have well-controlled type I errors. In addition, we derive an alternative and simplified formula for calculating the significance P-value of SKAT-O, which sheds light on the development of efficient and accurate numerical algorithms. We implement the proposed methods in the publicly available R package that can be readily used or adapted to large-scale sequencing studies. Given that more and more large-scale exome and whole genome sequencing or re-sequencing studies are being conducted, the proposed methods are practically very important. We conduct extensive numerical studies to investigate the performance of the proposed methods. We further illustrate their usefulness with application to associations between rare exonic variants and fasting glucose levels in the Atherosclerosis Risk in Communities (ARIC) study.
Affiliation(s)
- Baolin Wu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Weihua Guan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- James S Pankow
- Division of Epidemiology and Community Health, School of Public Health, University of Minnesota, Minneapolis, MN, USA
|
26
|
Associating Multivariate Quantitative Phenotypes with Genetic Variants in Family Samples with a Novel Kernel Machine Regression Method. Genetics 2015; 201:1329-39. [PMID: 26482791 DOI: 10.1534/genetics.115.178590] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 05/27/2015] [Accepted: 10/04/2015] [Indexed: 11/18/2022] Open
Abstract
The recent development of sequencing technology allows identification of association between the whole spectrum of genetic variants and complex diseases. Over the past few years, a number of association tests for rare variants have been developed. Jointly testing for association between genetic variants and multiple correlated phenotypes may increase the power to detect causal genes in family-based studies, but familial correlation needs to be appropriately handled to avoid an inflated type I error rate. Here we propose a novel approach for multivariate family data using kernel machine regression (denoted as MF-KM) that is based on a linear mixed-model framework and can be applied to a large range of studies with different types of traits. In our simulation studies, the usual kernel machine test has inflated type I error rates when applied directly to familial data, while our proposed MF-KM method preserves the expected type I error rates. Moreover, the MF-KM method has increased power compared to methods that either analyze each phenotype separately while considering family structure or use only unrelated founders from the families. Finally, we illustrate our proposed methodology by analyzing whole-genome genotyping data from a lung function study.
|
27
|
Lakhal-Chaieb L, Oualkacha K, Richards BJ, Greenwood CM. A rare variant association test in family-based designs and non-normal quantitative traits. Stat Med 2015; 35:905-21. [DOI: 10.1002/sim.6750] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 11/18/2014] [Revised: 09/04/2015] [Accepted: 09/05/2015] [Indexed: 12/13/2022]
Affiliation(s)
- Lajmi Lakhal-Chaieb
- Département de mathématiques et statistique; Université Laval; Québec G1V 0A6 Québec Canada
- Karim Oualkacha
- Département de mathématiques; Université du Québec à Montréal; Montreal Québec Canada
- Brent J. Richards
- Lady Davis Institute for Medical Research; Jewish General Hospital; Montreal Québec Canada
- Department of Epidemiology, Biostatistics and Occupational Health; McGill University; Montreal Québec Canada
- Department of Twin Research; King's College London; London U.K
- Celia M.T. Greenwood
- Lady Davis Institute for Medical Research; Jewish General Hospital; Montreal Québec Canada
- Department of Epidemiology, Biostatistics and Occupational Health; McGill University; Montreal Québec Canada
- Departments of Oncology and Human Genetics; McGill University; Montreal Québec Canada
|
28
|
Leclerc M, Simard J, Lakhal-Chaieb L. SNP Set Association Testing for Survival Outcomes in the Presence of Intrafamilial Correlation. Genet Epidemiol 2015; 39:406-14. [PMID: 26282997 DOI: 10.1002/gepi.21914] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/06/2015] [Revised: 06/04/2015] [Accepted: 06/17/2015] [Indexed: 11/06/2022]
Abstract
In this work, we propose a single nucleotide polymorphism (SNP) set association test for censored phenotypes in the presence of a family-based design. The proposed test is valid for both common and rare variants. A proportional hazards Cox model is specified for the marginal distribution of the trait and the familial dependence is modeled via a Gaussian copula. Censored values are treated as partially missing data and a multiple imputation procedure is proposed in order to compute the test statistics. The P-value is then deduced analytically. The finite-sample empirical properties of the proposed method are evaluated and compared to existing competitors by simulations and its use is illustrated using a breast cancer data set from the Consortium of Investigators of Modifiers of BRCA1 and BRCA2.
Affiliation(s)
- Martin Leclerc
- Département de mathématiques et de statistique, Université Laval, Québec, Canada
- Jacques Simard
- Department of Molecular Medicine, Canada Research Chair in Oncogenetics, Laval University & Genomics Centre, CHU de Québec Research Centre, Québec, Canada
- Lajmi Lakhal-Chaieb
- Département de mathématiques et de statistique, Université Laval, Québec, Canada
|
29
|
Svishcheva GR, Belonogova NM, Axenovich TI. Region-Based Association Test for Familial Data under Functional Linear Models. PLoS One 2015; 10:e0128999. [PMID: 26111046 PMCID: PMC4481467 DOI: 10.1371/journal.pone.0128999] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 10/27/2014] [Accepted: 05/04/2015] [Indexed: 12/22/2022] Open
Abstract
Region-based association analysis is a more powerful tool for gene mapping than testing of individual genetic variants, particularly for rare genetic variants. The most powerful methods for regional mapping are based on the functional data analysis approach, which assumes that the regional genome of an individual may be considered as a continuous stochastic function that contains information about both linkage and linkage disequilibrium. Here, we extend this powerful approach, earlier applied only to independent samples, to samples of related individuals. To this end, we additionally include a random polygenic effect in the functional linear model used for testing association between quantitative traits and multiple genetic variants in the region. We compare the statistical power of different methods using Genetic Analysis Workshop 17 mini-exome family data and a wide range of simulation scenarios. Our method increases the power of regional association analysis of quantitative traits compared with burden-based and kernel-based methods for the majority of the scenarios. In addition, we estimate the statistical power of our method using regions with a small number of genetic variants, and show that our method retains its advantage over burden-based and kernel-based methods in this case as well. The new method is implemented as the R-function 'famFLM' using two types of basis functions: the B-spline and Fourier bases. We compare the properties of the new method using models that differ from each other in the type of their function basis. The models based on the Fourier basis functions have an advantage in terms of speed and power over the models that use the B-spline basis functions and those that combine B-spline and Fourier basis functions. The 'famFLM' function is distributed under GPLv3 license and is freely available at http://mga.bionet.nsc.ru/soft/famFLM/.
Affiliation(s)
- Gulnara R. Svishcheva
  - Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Nadezhda M. Belonogova
  - Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Tatiana I. Axenovich
  - Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
  - Department of Natural Sciences, Novosibirsk State University, Novosibirsk, Russia
|
30
|
Yan Q, Tiwari HK, Yi N, Gao G, Zhang K, Lin WY, Lou XY, Cui X, Liu N. A Sequence Kernel Association Test for Dichotomous Traits in Family Samples under a Generalized Linear Mixed Model. Hum Hered 2015; 79:60-8. [PMID: 25791389 DOI: 10.1159/000375409] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 08/28/2014] [Accepted: 01/21/2015] [Indexed: 01/15/2023] Open
Abstract
OBJECTIVE The existing methods for identifying multiple rare variants underlying complex diseases in family samples are underpowered. Therefore, we aim to develop a new set-based method for an association study of dichotomous traits in family samples. METHODS We introduce a framework for testing the association of genetic variants with diseases in family samples based on a generalized linear mixed model. Our proposed method is based on a kernel machine regression and can be viewed as an extension of the sequence kernel association test (SKAT and famSKAT) for application to family data with dichotomous traits (F-SKAT). RESULTS Our simulation studies show that the original SKAT has inflated type I error rates when applied directly to family data. By contrast, our proposed F-SKAT has the correct type I error rate. Furthermore, in all of the considered scenarios, F-SKAT, which uses all family data, has higher power than both SKAT, which uses only unrelated individuals from the family data, and another method, which uses all family data. CONCLUSION We propose a set-based association test that can be used to analyze family data with dichotomous phenotypes while handling genetic variants with the same or opposite directions of effects as well as any types of family relationships.
Affiliation(s)
- Qi Yan
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, Ala., USA
|
31
|
Cao Y, Maxwell TJ, Wei P. A family-based joint test for mean and variance heterogeneity for quantitative traits. Ann Hum Genet 2015; 79:46-56. [PMID: 25393880 PMCID: PMC4275359 DOI: 10.1111/ahg.12089] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 06/21/2014] [Accepted: 09/22/2014] [Indexed: 01/26/2023]
Abstract
Traditional quantitative trait locus (QTL) analysis focuses on identifying loci associated with mean heterogeneity. Recent research has discovered loci associated with phenotype variance heterogeneity (vQTL), which is important in studying genetic association with complex traits, especially for identifying gene-gene and gene-environment interactions. While several tests have been proposed to detect vQTL for unrelated individuals, there are no tests for related individuals, commonly seen in family-based genetic studies. Here we introduce a likelihood ratio test (LRT) for identifying mean and variance heterogeneity simultaneously or for either effect alone, adjusting for covariates and family relatedness using a linear mixed effect model approach. The LRT statistic for normally distributed quantitative traits approximately follows χ² distributions. To correct for inflated Type I error for non-normally distributed quantitative traits, we propose a parametric bootstrap-based LRT that removes the best linear unbiased prediction (BLUP) of family random effect. Simulation studies show that our family-based test controls Type I error and has good power, while Type I error inflation is observed when family relatedness is ignored. We demonstrate the utility and efficiency gains of the proposed method using data from the Framingham Heart Study to detect loci associated with body mass index (BMI) variability.
Affiliation(s)
- Ying Cao
  - Division of Biostatistics, The University of Texas School of Public Health, Houston, Texas, USA
  - Human Genetics Center, The University of Texas School of Public Health, Houston, Texas, USA
- Taylor J. Maxwell
  - Computational Biology Institute, The George Washington University, Ashburn, Virginia, USA
- Peng Wei
  - Division of Biostatistics, The University of Texas School of Public Health, Houston, Texas, USA
  - Human Genetics Center, The University of Texas School of Public Health, Houston, Texas, USA
|
32
|
Lin WY. Adaptive combination of P-values for family-based association testing with sequence data. PLoS One 2014; 9:e115971. [PMID: 25541952 PMCID: PMC4277421 DOI: 10.1371/journal.pone.0115971] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 06/19/2014] [Accepted: 12/01/2014] [Indexed: 12/24/2022] Open
Abstract
Family-based study design will play a key role in identifying rare causal variants, because rare causal variants can be enriched in families with multiple affected subjects. Furthermore, different from population-based studies, family studies are robust to bias induced by population substructure. It is well known that rare causal variants are difficult to detect from single-locus tests. Therefore, burden tests and non-burden tests have been developed, by combining signals of multiple variants in a chromosomal region or a functional unit. This inevitably incorporates some neutral variants into the test statistics, which can dilute the power of statistical methods. To guard against the noise caused by neutral variants, we here propose an 'adaptive combination of P-values method' (abbreviated as 'ADA'). This method combines per-site P-values of variants that are more likely to be causal. Variants with large P-values (which are more likely to be neutral variants) are discarded from the combined statistic. In addition to performing extensive simulation studies, we applied these tests to the Genetic Analysis Workshop 17 data sets, where real sequence data were generated according to the 1000 Genomes Project. Compared with some existing methods, ADA is more robust to the inclusion of neutral variants. This is a merit especially when dichotomous traits are analyzed. However, there are some limitations for ADA. First, it is more computationally intensive. Second, pedigree structures and founders' sequence data are required for the permutation procedure. Third, unrelated controls cannot be included. We here show that, for family-based studies, the application of ADA is limited to dichotomous trait analyses with full pedigree information.
Affiliation(s)
- Wan-Yu Lin
  - Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
|
33
|
Santorico SA, Edwards KL. Challenges of linkage analysis in the era of whole-genome sequencing. Genet Epidemiol 2014; 38 Suppl 1:S92-6. [PMID: 25112196 DOI: 10.1002/gepi.21832] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
Abstract
Whole-genome sequencing (WGS) is becoming an affordable technology for the study of the genetics of complex traits. With any new technology, experimental designs and statistical methods, both old and new, must be evaluated. One design seeing a resurgence of interest is the use of families. Genetic Analysis Workshop 18 provided the opportunity to evaluate statistical methods applied to WGS data for family-based studies. We summarize the results of five contributions that used linkage in the context of WGS. The investigators took differing approaches, including assessment of false-positive rates in classic two-point linkage, the effects of heterogeneity on linkage and association tests, and the use of linkage to focus association tests. We describe the primary findings of each contribution and note challenges that are not new to those working in family designs or specific to WGS data; for example, choice of phenotype definition, covariate adjustment, and use of longitudinal data may produce different results, making comparisons challenging. We detail new issues brought about by WGS, such as the elevated genome-wide false-positive rate for classic two-point parametric linkage analysis, computational demands in multipoint calculations, and lack of clarity in how to best use linkage to focus association testing. Finally, we comment on when linkage may be helpful for WGS, highlighting where additional research is needed; for example, although linkage analysis has been successful in the study of rare variants of large effect, how to best use family information in the context of rare variants of moderate effect remains an open research question.
Affiliation(s)
- Stephanie A Santorico
  - Department of Mathematical and Statistical Sciences, University of Colorado, Denver, CO, USA
|
34
|
Aslibekyan S, Almeida M, Tintle N. Pathway analysis approaches for rare and common variants: insights from Genetic Analysis Workshop 18. Genet Epidemiol 2014; 38 Suppl 1:S86-91. [PMID: 25112195 DOI: 10.1002/gepi.21831] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 12/20/2022]
Abstract
Pathway analysis, broadly defined as a group of methods incorporating a priori biological information from public databases, has emerged as a promising approach for analyzing high-dimensional genomic data. As part of Genetic Analysis Workshop 18, seven research groups applied pathway analysis techniques to whole-genome sequence data from the San Antonio Family Study. Overall, the groups found that the potential of pathway analysis to improve detection of causal variants by lowering the multiple-testing burden and incorporating biologic insight remains largely unrealized. Specifically, there is a lack of best practices at each stage of the pathway approach: annotation, analysis, interpretation, and follow-up. Annotation of genetic variants is inconsistent across databases, incomplete, and biased toward known genes. At the analysis stage insufficient statistical power remains a major challenge. Analyses combining rare and common variants may have an inflated type I error rate and may not improve detection of causal genes. Inclusion of known causal genes may not improve statistical power, although the fraction of explained phenotypic variance may be a more appropriate metric. Interpretation of findings is further complicated by evidence in support of interactions between pathways and by the lack of consensus on how to best incorporate functional information. Finally, all presented approaches warranted follow-up studies, both to reduce the likelihood of false-positive findings and to identify specific causal variants within a given pathway. Despite the initial promise of pathway analysis for modeling biological complexity of disease phenotypes, many methodological challenges currently remain to be addressed.
Affiliation(s)
- Stella Aslibekyan
  - Department of Epidemiology, University of Alabama at Birmingham, Birmingham, Alabama, United States of America
|
35
|
Zhang Q, Wang L, Koboldt D, Borecki IB, Province MA. Adjusting family relatedness in data-driven burden test of rare variants. Genet Epidemiol 2014; 38:722-7. [PMID: 25169066 DOI: 10.1002/gepi.21848] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 02/28/2014] [Revised: 07/01/2014] [Accepted: 07/16/2014] [Indexed: 11/08/2022]
Abstract
Family data represent a rich resource for detecting association between rare variants (RVs) and human traits. However, most RV association analysis methods developed in recent years are data-driven burden tests which can adaptively learn weights from data but require permutation to evaluate significance, thus are not readily applicable to family data, because random permutation will destroy family structure. Direct application of these methods to family data may result in a significant inflation of false positives. To overcome this issue, we have developed a generalized, weighted sum mixed model (WSMM), and corresponding computational techniques that can incorporate family information into data-driven burden tests, and allow adaptive and efficient permutation test in family data. Using simulated and real datasets, we demonstrate that the WSMM method can be used to appropriately adjust for genetic relatedness among family members and has a good control for the inflation of false positives. We compare WSMM with a nondata-driven, family-based Sequence Kernel Association Test (famSKAT), showing that WSMM has significantly higher power in some cases. WSMM provides a generalized, flexible framework for adapting different data-driven burden tests to analyze data with any family structures, and it can be extended to binary and time-to-onset traits, with or without covariates.
Affiliation(s)
- Qunyuan Zhang
  - Division of Statistical Genomics, Washington University School of Medicine, St. Louis, Missouri, United States of America
|
36
|
Lippert C, Xiang J, Horta D, Widmer C, Kadie C, Heckerman D, Listgarten J. Greater power and computational efficiency for kernel-based association testing of sets of genetic variants. Bioinformatics 2014; 30:3206-14. [PMID: 25075117 PMCID: PMC4221116 DOI: 10.1093/bioinformatics/btu504] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Indexed: 11/12/2022]
Abstract
Motivation: Set-based variance component tests have been identified as a way to increase power in association studies by aggregating weak individual effects. However, the choice of test statistic has been largely ignored even though it may play an important role in obtaining optimal power. We compared a standard statistical test (a score test) with a recently developed likelihood ratio (LR) test. Further, when correction for hidden structure is needed, or gene–gene interactions are sought, state-of-the-art algorithms for both the score and LR tests can be computationally impractical. Thus we develop new computationally efficient methods. Results: After reviewing theoretical differences in performance between the score and LR tests, we find empirically on real data that the LR test generally has more power. In particular, on 15 of 17 real datasets, the LR test yielded at least as many associations as the score test (up to 23 more associations), whereas the score test yielded at most one more association than the LR test in the two remaining datasets. On synthetic data, we find that the LR test yielded up to 12% more associations, consistent with our results on real data, but also observe a regime of extremely small signal where the score test yielded up to 25% more associations than the LR test, consistent with theory. Finally, our computational speedups now enable (i) efficient LR testing when the background kernel is full rank, and (ii) efficient score testing when the background kernel changes with each test, as for gene–gene interaction tests. The latter yielded a factor of 2000 speedup on a cohort of size 13 500. Availability: Software available at http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/Fastlmm/. Contact: heckerma@microsoft.com. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Christoph Lippert
  - eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
- Jing Xiang
  - eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
- Danilo Horta
  - eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
- Christian Widmer
  - eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
- Carl Kadie
  - eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
- David Heckerman
  - eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
- Jennifer Listgarten
  - eScience Research Group, Microsoft Research, Los Angeles, CA, 90024 and eScience Research Group, Microsoft Research, Redmond, WA, 98052, USA
|
37
|
Jiang Y, Conneely KN, Epstein MP. Flexible and robust methods for rare-variant testing of quantitative traits in trios and nuclear families. Genet Epidemiol 2014; 38:542-51. [PMID: 25044337 DOI: 10.1002/gepi.21839] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 03/24/2014] [Revised: 05/21/2014] [Accepted: 05/29/2014] [Indexed: 11/07/2022]
Abstract
Most rare-variant association tests for complex traits are applicable only to population-based or case-control resequencing studies. There are fewer rare-variant association tests for family-based resequencing studies, which is unfortunate because pedigrees possess many attractive characteristics for such analyses. Family-based studies can be more powerful than their population-based counterparts due to increased genetic load and further enable the implementation of rare-variant association tests that, by design, are robust to confounding due to population stratification. With this in mind, we propose a rare-variant association test for quantitative traits in families; this test integrates the QTDT approach of Abecasis et al. into the kernel-based SNP association test KMFAM of Schifano et al. The resulting within-family test enjoys the many benefits of the kernel framework for rare-variant association testing, including rapid evaluation of P-values and preservation of power when a region harbors rare causal variation that acts in different directions on phenotype. Additionally, by design, this within-family test is robust to confounding due to population stratification. Although within-family association tests are generally less powerful than their counterparts that use all genetic information, we show that we can recover much of this power (although still ensuring robustness to population stratification) using a straightforward screening procedure. Our method accommodates covariates and allows for missing parental genotype data, and we have written software implementing the approach in R for public use.
Affiliation(s)
- Yunxuan Jiang
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia, United States of America
|
38
|
Abstract
The use of genetically isolated populations can empower next-generation association studies. In this review, we discuss the advantages of this approach and review study design and analytical considerations of genetic association studies focusing on isolates. We cite successful examples of using population isolates in association studies and outline potential ways forward.
|
39
|
Hu H, Roach JC, Coon H, Guthery SL, Voelkerding KV, Margraf RL, Durtschi JD, Tavtigian SV, Shankaracharya, Wu W, Scheet P, Wang S, Xing J, Glusman G, Hubley R, Li H, Garg V, Moore B, Hood L, Galas DJ, Srivastava D, Reese MG, Jorde LB, Yandell M, Huff CD. A unified test of linkage analysis and rare-variant association for analysis of pedigree sequence data. Nat Biotechnol 2014; 32:663-9. [PMID: 24837662 PMCID: PMC4157619 DOI: 10.1038/nbt.2895] [Citation(s) in RCA: 86] [Impact Index Per Article: 8.6] [Received: 10/17/2013] [Accepted: 04/04/2014] [Indexed: 01/02/2023]
Abstract
High-throughput sequencing of related individuals has become an important tool for studying human disease. However, owing to technical complexity and lack of available tools, most pedigree-based sequencing studies rely on an ad hoc combination of suboptimal analyses. Here we present pedigree-VAAST (pVAAST), a disease-gene identification tool designed for high-throughput sequence data in pedigrees. pVAAST uses a sequence-based model to perform variant and gene-based linkage analysis. Linkage information is then combined with functional prediction and rare variant case-control association information in a unified statistical framework. pVAAST outperformed linkage and rare-variant association tests in simulations and identified disease-causing genes from whole-genome sequence data in three human pedigrees with dominant, recessive and de novo inheritance patterns. The approach is robust to incomplete penetrance and locus heterogeneity and is applicable to a wide variety of genetic traits. pVAAST maintains high power across studies of monogenic, high-penetrance phenotypes in a single pedigree to highly polygenic, common phenotypes involving hundreds of pedigrees.
Affiliation(s)
- Hao Hu
  - Department of Epidemiology, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, USA
- Jared C Roach
  - Institute for Systems Biology, Seattle, Washington, USA
- Hilary Coon
  - Department of Psychiatry, University of Utah, Salt Lake City, Utah, USA
- Stephen L Guthery
  - Department of Pediatrics, University of Utah, Salt Lake City, Utah, USA
- Karl V Voelkerding
  - [1] Department of Pathology, University of Utah School of Medicine, Salt Lake City, Utah, USA. [2] ARUP Institute for Clinical and Experimental Pathology, ARUP Laboratories, Salt Lake City, Utah, USA
- Rebecca L Margraf
  - ARUP Institute for Clinical and Experimental Pathology, ARUP Laboratories, Salt Lake City, Utah, USA
- Jacob D Durtschi
  - ARUP Institute for Clinical and Experimental Pathology, ARUP Laboratories, Salt Lake City, Utah, USA
- Sean V Tavtigian
  - Department of Oncological Sciences, Huntsman Cancer Institute, University of Utah, Salt Lake City, Utah, USA
- Shankaracharya
  - Department of Epidemiology, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, USA
- Wilfred Wu
  - Department of Human Genetics and USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, Utah, USA
- Paul Scheet
  - Department of Epidemiology, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, USA
- Shuoguo Wang
  - Department of Genetics, Rutgers, the State University of New Jersey, Piscataway, New Jersey, USA
- Jinchuan Xing
  - Department of Genetics, Rutgers, the State University of New Jersey, Piscataway, New Jersey, USA
- Robert Hubley
  - Institute for Systems Biology, Seattle, Washington, USA
- Hong Li
  - Institute for Systems Biology, Seattle, Washington, USA
- Vidu Garg
  - [1] Department of Pediatrics, The Ohio State University, Columbus, Ohio, USA. [2] Center for Cardiovascular and Pulmonary Research, Research Institute at Nationwide Children's Hospital, Columbus, Ohio, USA
- Barry Moore
  - Department of Human Genetics and USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, Utah, USA
- Leroy Hood
  - Institute for Systems Biology, Seattle, Washington, USA
- David J Galas
  - [1] Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg. [2] Pacific Northwest Diabetes Research Institute, Seattle, Washington, USA
- Deepak Srivastava
  - Gladstone Institute of Cardiovascular Disease and University of California, San Francisco, San Francisco, California, USA
- Lynn B Jorde
  - Department of Human Genetics and USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, Utah, USA
- Mark Yandell
  - Department of Human Genetics and USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, Utah, USA
- Chad D Huff
  - Department of Epidemiology, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, USA
|
40
|
Dufresne L, Oualkacha K, Forgetta V, Greenwood CM. Pathway analysis for genetic association studies: to do, or not to do? That is the question. BMC Proc 2014; 8:S103. [PMID: 25519357 PMCID: PMC4144468 DOI: 10.1186/1753-6561-8-s1-s103] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/01/2023] Open
Abstract
In Genetic Analysis Workshop 18 data, we used a 3-stage approach to explore the benefits of pathway analysis in improving a model to predict 2 diastolic blood pressure phenotypes as a function of genetic variation. At stage 1, gene-based tests of association in family data of approximately 800 individuals found over 600 genes associated at p < 0.05 for each phenotype. At stage 2, networks and enriched pathways were estimated with Cytoscape for genes from stage 1, separately for the 2 phenotypes, then examining network overlap. This overlap identified 4 enriched pathways, and 3 of these pathways appear to interact, and are likely candidates for playing a role in hypertension. At stage 3, using 157 maximally unrelated individuals, partial least squares regression was used to find associations between diastolic blood pressure and single-nucleotide polymorphisms in genes highlighted by the pathway analyses. However, we saw no improvement in the adjusted cross-validated R². Although our pathway-motivated regressions did not improve prediction of diastolic blood pressure, merging gene networks did identify several plausible pathways for hypertension.
Affiliation(s)
- Line Dufresne
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, 1020 Pine Avenue West, Montreal, Quebec, H3A 1A2, Canada
- Karim Oualkacha
  - Département de Mathématiques, Université du Québec à Montréal, PK-5151, 201 avenue du Président-Kennedy, Montréal, QC H2X 3Y7, Canada
- Vincenzo Forgetta
  - Lady Davis Institute for Medical Research, Jewish General Hospital, 3755 Côte Ste. Catherine, Montreal, QC, H3T 1E2, Canada
- Celia Mt Greenwood
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, 1020 Pine Avenue West, Montreal, Quebec, H3A 1A2, Canada; Lady Davis Institute for Medical Research, Jewish General Hospital, 3755 Côte Ste. Catherine, Montreal, QC, H3T 1E2, Canada; Department of Oncology, McGill University, Montreal, QC, Canada
|
41
|
Svishcheva GR, Belonogova NM, Axenovich TI. FFBSKAT: fast family-based sequence kernel association test. PLoS One 2014; 9:e99407. [PMID: 24905468 PMCID: PMC4048315 DOI: 10.1371/journal.pone.0099407] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 02/16/2014] [Accepted: 05/14/2014] [Indexed: 11/28/2022] Open
Abstract
Kernel machine-based regression is an efficient approach to region-based association analysis aimed at identification of rare genetic variants. However, this method is computationally complex. The running time of kernel-based association analysis becomes especially long for samples with genetic (sub)structures, thus increasing the need to develop new and effective methods, algorithms, and software packages. We have developed a new R-package called fast family-based sequence kernel association test (FFBSKAT) for analysis of quantitative traits in samples of related individuals. This software implements a score-based variance component test to assess the association of a given set of single nucleotide polymorphisms with a continuous phenotype. We compared the performance of our software with that of two existing software packages for family-based sequence kernel association testing, namely, ASKAT and famSKAT, using the Genetic Analysis Workshop 17 family sample. Results demonstrate that FFBSKAT is several times faster than other available programs. In addition, the calculations of the three compared packages were similarly accurate. With respect to the available analysis modes, we combined the advantages of both ASKAT and famSKAT and added new options to empower FFBSKAT users. The FFBSKAT package is fast, user-friendly, and provides an easy-to-use method to perform whole-exome kernel machine-based regression association analysis of quantitative traits in samples of related individuals. The FFBSKAT package, along with its manual, is available for free download at http://mga.bionet.nsc.ru/soft/FFBSKAT/.
Collapse
Affiliation(s)
- Gulnara R. Svishcheva
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Nadezhda M. Belonogova
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Tatiana I. Axenovich
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Novosibirsk State University, Novosibirsk, Russia
42
He L, Sillanpää MJ, Ripatti S, Pitkäniemi J. Bayesian Latent Variable Collapsing Model for Detecting Rare Variant Interaction Effect in Twin Study. Genet Epidemiol 2014; 38:310-24. [DOI: 10.1002/gepi.21804] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/13/2013] [Revised: 02/28/2014] [Accepted: 02/28/2014] [Indexed: 12/12/2022]
Affiliation(s)
- Liang He
- Department of Public Health; Hjelt Institute; University of Helsinki; Finland
- Mikko J. Sillanpää
- Department of Mathematical Sciences; University of Oulu; Oulu Finland
- Department of Biology and Biocenter Oulu; University of Oulu; Oulu Finland
- Samuli Ripatti
- Department of Public Health; Hjelt Institute; University of Helsinki; Finland
- Institute for Molecular Medicine Finland FIMM; University of Helsinki; Finland
- Human Genetics; Wellcome Trust Sanger Institute; United Kingdom
- Janne Pitkäniemi
- Department of Public Health; Hjelt Institute; University of Helsinki; Finland
- Finnish Cancer Registry; Institute for Statistical and Epidemiological Cancer Research; Helsinki Finland
43
Sha Q, Zhang S. A novel test for testing the optimally weighted combination of rare and common variants based on data of parents and affected children. Genet Epidemiol 2014; 38:135-43. [PMID: 24382753 PMCID: PMC4162402 DOI: 10.1002/gepi.21787] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/23/2013] [Revised: 10/28/2013] [Accepted: 12/02/2013] [Indexed: 11/10/2022]
Abstract
With the development of sequencing technologies, the direct testing of rare variant associations has become possible. Many statistical methods for detecting associations between rare variants and complex diseases have recently been developed, most of which are population-based methods for unrelated individuals. A limitation of population-based methods is that spurious associations can occur when there is population structure. For rare variants, this problem can be more serious, both because the spectrum of rare variation can differ greatly across populations and because no methods currently exist to control for population stratification in population-based rare variant association analyses. A solution to the problem of population stratification is to use family-based association tests, which use family members to control for population stratification. In this article, we propose a novel test for Testing the Optimally Weighted combination of variants based on data of Parents and Affected Children (TOW-PAC). TOW-PAC is a family-based association test that tests the combined effect of rare and common variants in a genomic region and is robust to the directions of the effects of causal variants. Simulation studies confirm that, for rare variant associations, family-based association tests are robust to population stratification, whereas population-based association tests can be seriously confounded by it. The results of power comparisons show that the power of TOW-PAC increases as the number of affected children in each family increases, and that TOW-PAC based on multiple affected children per family is more powerful than TOW based on unrelated individuals.
Affiliation(s)
- Qiuying Sha
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, United States of America
44
Turkmen AS, Lin S. Blocking approach for identification of rare variants in family-based association studies. PLoS One 2014; 9:e86126. [PMID: 24465912 PMCID: PMC3900483 DOI: 10.1371/journal.pone.0086126] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 07/30/2013] [Accepted: 12/09/2013] [Indexed: 01/14/2023] Open
Abstract
With the advent of next-generation sequencing technology, rare variant association analysis is increasingly being conducted to identify genetic variants associated with complex traits. In recent years, significant effort has been devoted to developing powerful statistical methods to test such associations for population-based designs. However, there has been relatively little development for family-based designs, although family data have been shown to be more powerful for detecting rare variants. This study introduces a blocking approach that extends two popular family-based common variant association tests to rare variant association studies. Several options are considered for partitioning a genomic region (gene) into "independent" blocks: information from SNVs is aggregated within each block, and an overall test statistic for the entire genomic region is calculated by combining information across these blocks. The proposed methodology allows different variants to have different directions of effect (risk or protective), and no minor allele frequency threshold needs to be specified. We carried out a simulation to verify the validity of the method by showing that type I error is well under control when the underlying null hypothesis and the assumption of independence across blocks are satisfied. Further, data from the Genetic Analysis Workshop [Formula: see text] are utilized to illustrate the feasibility and performance of the proposed methodology in a realistic setting.
Affiliation(s)
- Asuman S Turkmen
- Statistics Department, The Ohio State University, Columbus, Ohio, United States of America
- Statistics Department, The Ohio State University, Newark, Ohio, United States of America
- Shili Lin
- Statistics Department, The Ohio State University, Columbus, Ohio, United States of America
45
Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER. The next-generation sequencing revolution and its impact on genomics. Cell 2013; 155:27-38. [PMID: 24074859 DOI: 10.1016/j.cell.2013.09.006] [Citation(s) in RCA: 613] [Impact Index Per Article: 55.7] [Received: 08/03/2013] [Indexed: 02/07/2023]
Abstract
Genomics is a relatively new scientific discipline, having DNA sequencing as its core technology. As technology has improved the cost and scale of genome characterization over sequencing's 40-year history, the scope of inquiry has commensurately broadened. Massively parallel sequencing has proven revolutionary, shifting the paradigm of genomics to address biological questions at a genome-wide scale. Sequencing now empowers clinical diagnostics and other aspects of medical care, including disease risk, therapeutic identification, and prenatal testing. This Review explores the current state of genomics in the massively parallel sequencing era.
Affiliation(s)
- Daniel C Koboldt
- The Genome Institute, School of Medicine, Washington University, St. Louis, MO 63108, USA
46
Saad M, Wijsman EM. Power of family-based association designs to detect rare variants in large pedigrees using imputed genotypes. Genet Epidemiol 2013; 38:1-9. [PMID: 24243664 DOI: 10.1002/gepi.21776] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 08/12/2013] [Revised: 09/30/2013] [Accepted: 10/15/2013] [Indexed: 01/09/2023]
Abstract
Recently, the "Common Disease-Multiple Rare Variants" hypothesis has received much attention, especially with the current availability of next-generation sequencing. Family-based designs are well suited for the discovery of rare variants, with large and carefully selected pedigrees enriching for multiple copies of such variants. However, sequencing a large number of samples is still prohibitively expensive. Here, we evaluate a cost-effective strategy (pseudosequencing) to detect association with rare variants in large pedigrees. This strategy consists of sequencing a small subset of subjects, genotyping the remaining sampled subjects on a set of sparse markers, and imputing the untyped markers in the remaining subjects conditional on the sequenced subjects and pedigree information. We used a recent pedigree imputation method (GIGI), which is able to efficiently handle large pedigrees and accurately impute rare variants. We used a burden test and a kernel association test, famWS and famSKAT, both of which account for family relationships; famSKAT additionally accounts for heterogeneity of allelic effects. We simulated pedigree sequence data and compared the power of association tests for pseudosequence data, a subset of sequence data used for imputation, and all subjects sequenced. We also compared, within the pseudosequence data, the power of association tests using best-guess genotypes and allelic dosages. Our results show that the pseudosequencing strategy considerably improves the power to detect association with rare variants. They also show that the use of allelic dosages results in much higher power than the use of best-guess genotypes in these family-based data. Moreover, famSKAT shows greater power than famWS in most of the scenarios we considered.
Affiliation(s)
- Mohamad Saad
- Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, Washington, United States of America
- Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
47