1. Chen W, Coombes BJ, Larson NB. Recent advances and challenges of rare variant association analysis in the biobank sequencing era. Front Genet 2022; 13:1014947. [PMID: 36276986; PMCID: PMC9582646; DOI: 10.3389/fgene.2022.1014947] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 08/09/2022] [Accepted: 09/22/2022] [Indexed: 12/04/2022] [Open Access]
Abstract
Causal variants for rare genetic diseases are often rare in the general population. Rare variants may also contribute to common complex traits and can have much larger per-allele effect sizes than common variants, although power to detect these associations can be limited. Sequencing costs have steadily declined with technological advancements, making it feasible to adopt whole-exome and whole-genome profiling for large biobank-scale sample sizes. These large amounts of sequencing data provide both opportunities and challenges for rare-variant association analysis. Herein, we review the basic concepts of rare-variant analysis methods, the current state-of-the-art methods that use variant annotations or external controls to improve statistical power, and particular challenges facing rare-variant analysis, such as accounting for population structure and extremely unbalanced case-control designs. We also review recent advances and challenges in rare-variant analysis for familial sequencing data and for more complex phenotypes such as survival data. Finally, we discuss other potential directions for further methodological investigation.
Affiliation(s)
- Wenan Chen
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, TN, United States
- *Correspondence: Wenan Chen; Brandon J. Coombes; Nicholas B. Larson
- Brandon J. Coombes
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States
- Nicholas B. Larson
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States
2. Fischer ST, Jiang Y, Broadaway KA, Conneely KN, Epstein MP. Powerful and robust cross-phenotype association test for case-parent trios. Genet Epidemiol 2018; 42:447-458. [PMID: 29460449; PMCID: PMC6013339; DOI: 10.1002/gepi.22116] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 03/03/2017] [Revised: 01/05/2018] [Accepted: 01/08/2018] [Indexed: 12/17/2022]
Abstract
There has been increasing interest in identifying genes within the human genome that influence multiple diverse phenotypes. In the presence of pleiotropy, joint testing of these phenotypes is not only biologically meaningful but also statistically more powerful than univariate analysis of each separate phenotype accounting for multiple testing. Although many cross-phenotype association tests exist, the majority of such methods assume samples composed of unrelated subjects and therefore are not applicable to family-based designs, including the valuable case-parent trio design. In this paper, we describe a robust gene-based association test of multiple phenotypes collected in a case-parent trio study. Our method is based on the kernel distance covariance (KDC) method, where we first construct a similarity matrix for multiple phenotypes and a similarity matrix for genetic variants in a gene; we then test the dependency between the two similarity matrices. The method is applicable to either common variants or rare variants in a gene, and resulting tests from the method are by design robust to confounding due to population stratification. We evaluated our method through simulation studies and observed that the method is substantially more powerful than standard univariate testing of each separate phenotype. We also applied our method to phenotypic and genotypic data collected in case-parent trios as part of the Genetics of Kidneys in Diabetes (GoKinD) study and identified a genome-wide significant gene demonstrating cross-phenotype effects that was not identified using standard univariate approaches.
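The two-similarity-matrix idea in this abstract can be illustrated with a small sketch. It is not the authors' trio-based method: it assumes unrelated subjects, uses linear kernels, computes an HSIC-style statistic (the trace of the product of the two centered similarity matrices), and obtains a p-value by permutation, so it does not reproduce the robustness to population stratification that the case-parent design provides. The function names `center` and `kernel_dependence_test` are hypothetical.

```python
import numpy as np


def center(kernel):
    """Double-center a similarity matrix: H K H with H = I - 11'/n."""
    n = kernel.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ kernel @ h


def kernel_dependence_test(phenotypes, genotypes, n_perm=999, seed=0):
    """Permutation test of dependence between two similarity matrices.

    phenotypes : (n_subjects, n_phenotypes) array, no constant columns
    genotypes  : (n_subjects, n_variants) array of allele counts
    Returns (statistic, permutation p-value).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(phenotypes, dtype=float)
    g = np.asarray(genotypes, dtype=float)
    n = y.shape[0]

    # Assumption: linear kernels on standardized phenotypes and raw genotypes.
    y = (y - y.mean(axis=0)) / y.std(axis=0)
    p_c = center(y @ y.T)
    g_c = center(g @ g.T)

    # trace(P_c G_c) / n; both matrices are symmetric, so an elementwise sum works.
    observed = np.sum(p_c * g_c) / n

    # Permute subject labels of the phenotype similarity matrix.
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(n)
        permuted = np.sum(p_c[np.ix_(idx, idx)] * g_c) / n
        exceed += permuted >= observed
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value
```

Because one statistic covers all phenotypes and all variants in the gene, a single test is performed per gene, which is where the power gain over separate univariate tests corrected for multiple testing would come from.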
Affiliation(s)
- S. Taylor Fischer
- Department of Human Genetics and Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA
- Yunxuan Jiang
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA
- K. Alaine Broadaway
- Department of Human Genetics and Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA
- Karen N. Conneely
- Department of Human Genetics and Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA
- Michael P. Epstein
- Department of Human Genetics and Center for Computational and Quantitative Genetics, Emory University, Atlanta, GA
3. Kaakinen M, Mägi R, Fischer K, Heikkinen J, Järvelin MR, Morris AP, Prokopenko I. A rare-variant test for high-dimensional data. Eur J Hum Genet 2017; 25:988-994. [PMID: 28537275; PMCID: PMC5513099; DOI: 10.1038/ejhg.2017.90] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 08/05/2016] [Revised: 02/17/2017] [Accepted: 03/28/2017] [Indexed: 12/22/2022] [Open Access]
Abstract
Genome-wide association studies have facilitated the discovery of thousands of loci for hundreds of phenotypes. However, the issue of missing heritability remains unsolved for most complex traits. Locus discovery could be enhanced both by improved power through multi-phenotype analysis (MPA) and by use of a wider allele frequency range, including rare variants (RVs). MPA methods for single-variant association have been proposed, but given their low power for RVs, more efficient approaches are required. We propose multi-phenotype analysis of rare variants (MARV), a burden test-based method for RVs extended to the joint analysis of multiple phenotypes through a powerful reverse regression technique. Specifically, MARV models the proportion of RVs at which minor alleles are carried by individuals within a genomic region as a linear combination of multiple phenotypes, which can be both binary and continuous, and the method directly accommodates genotyped and imputed data. The full model, including all phenotypes, is tested for association for discovery, and a more thorough dissection of the phenotype combinations for any set of RVs is also enabled. We show, via simulations, that the type I error rate is well controlled under various correlations between two continuous phenotypes, and that the method outperforms a univariate burden test in all considered scenarios. Application of MARV to 4876 individuals from the Northern Finland Birth Cohort 1966 for triglycerides and high- and low-density lipoprotein cholesterol highlights known loci with stronger signals of association than those observed in univariate RV analyses and suggests novel RV effects for these lipid traits.
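The reverse-regression idea described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the MARV software: the burden is taken as the proportion of rare variants at which an individual carries at least one minor allele, the model is fitted by ordinary least squares, and the full model is compared with an intercept-only model by an F-test. The function name `reverse_regression_burden` and the default frequency threshold of 0.01 are hypothetical choices.

```python
import numpy as np
from scipy import stats


def reverse_regression_burden(genotypes, phenotypes, maf_threshold=0.01):
    """Regress a rare-variant burden on several phenotypes (illustrative only).

    genotypes  : (n_individuals, n_variants) minor-allele counts in {0, 1, 2}
    phenotypes : (n_individuals, n_phenotypes) continuous or 0/1 values
    Returns (coefficients including intercept, F statistic, p-value).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    phenotypes = np.asarray(phenotypes, dtype=float)
    n, k = phenotypes.shape

    # Keep polymorphic variants whose minor-allele frequency is below the threshold.
    maf = genotypes.mean(axis=0) / 2.0
    rare = (maf > 0) & (maf < maf_threshold)
    if not rare.any():
        raise ValueError("no rare variants in this region")

    # Burden: proportion of rare variants at which a minor allele is carried.
    burden = (genotypes[:, rare] > 0).mean(axis=1)

    # Ordinary least squares of the burden on an intercept plus all phenotypes.
    design = np.column_stack([np.ones(n), phenotypes])
    beta, _, _, _ = np.linalg.lstsq(design, burden, rcond=None)
    rss_full = np.sum((burden - design @ beta) ** 2)
    rss_null = np.sum((burden - burden.mean()) ** 2)

    # F-test of the full model against the intercept-only model.
    df_model, df_resid = k, n - k - 1
    f_stat = ((rss_null - rss_full) / df_model) / (rss_full / df_resid)
    p_value = stats.f.sf(f_stat, df_model, df_resid)
    return beta, f_stat, p_value
```

Placing the genotype summary on the left-hand side is what lets binary and continuous phenotypes enter the same model as ordinary covariates.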
Affiliation(s)
- Marika Kaakinen
- Department of Genomics of Common Disease, Imperial College London, London, UK
- Reedik Mägi
- Estonian Genome Center, University of Tartu, Tartu, Estonia
- Krista Fischer
- Estonian Genome Center, University of Tartu, Tartu, Estonia
- Jani Heikkinen
- Department of Genomics of Common Disease, Imperial College London, London, UK
- Neuroepidemiology and Ageing (NEA) Research Unit, Imperial College London, London, UK
- Marjo-Riitta Järvelin
- Department of Epidemiology and Biostatistics, MRC-PHE Centre for Environment and Health, School of Public Health, Imperial College London, London, UK
- Center for Life Course Health Research, University of Oulu, Oulu, Finland
- Unit of Primary Care, Oulu University Hospital, Oulu, Finland
- Biocenter Oulu, University of Oulu, Oulu, Finland
- Andrew P Morris
- Department of Biostatistics, University of Liverpool, Liverpool, UK
- Inga Prokopenko
- Department of Genomics of Common Disease, Imperial College London, London, UK
4. Kaakinen M, Mägi R, Fischer K, Heikkinen J, Järvelin MR, Morris AP, Prokopenko I. MARV: a tool for genome-wide multi-phenotype analysis of rare variants. BMC Bioinformatics 2017; 18:110. [PMID: 28209135; PMCID: PMC5311849; DOI: 10.1186/s12859-017-1530-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 06/21/2016] [Accepted: 02/06/2017] [Indexed: 11/10/2022] [Open Access]
Abstract
BACKGROUND: Genome-wide association studies have enabled identification of thousands of loci for hundreds of traits. Yet, for most human traits a substantial part of the estimated heritability is unexplained. This and recent advances in technology to produce high-dimensional data cost-effectively have led to method development beyond standard common variant analysis, including single-phenotype rare variant and multi-phenotype common variant analysis, with the latter increasing power for locus discovery and providing suggestions of pleiotropic effects. However, there are currently no optimal methods and tools for the combined analysis of rare variants and multiple phenotypes.
RESULTS: We propose a user-friendly software tool MARV for Multi-phenotype Analysis of Rare Variants. The tool is based on a method that collapses rare variants within a genomic region and models the proportion of minor alleles in the rare variants on a linear combination of multiple phenotypes. MARV provides analyses of all phenotype combinations within one run and calculates the Bayesian Information Criterion to facilitate model selection. The running time increases with the size of the genetic data while the number of phenotypes to analyse has little effect both on running time and required memory. We illustrate the use of MARV with analysis of triglycerides (TG), fasting insulin (FI) and waist-to-hip ratio (WHR) in 4,721 individuals from the Northern Finland Birth Cohort 1966. The analysis suggests novel multi-phenotype effects for these metabolic traits at APOA5 and ZNF259, and at ZNF259 provides stronger support for association ($P_{\mathrm{TG+FI}} = 1.8 \times 10^{-9}$) than observed in single phenotype rare variant analyses ($P_{\mathrm{TG}} = 6.5 \times 10^{-8}$ and $P_{\mathrm{FI}} = 0.27$).
CONCLUSIONS: MARV is a computationally efficient, flexible and user-friendly software tool allowing rapid identification of rare variant effects on multiple phenotypes, thus paving the way for novel discoveries and insights into biology of complex traits.
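The "all phenotype combinations plus BIC" step mentioned in this abstract can be sketched as below. This is an assumption-laden illustration rather than MARV's own code: it takes a precomputed per-individual burden score, fits an ordinary least squares model for every non-empty phenotype subset, and uses the Gaussian-likelihood form of the BIC up to an additive constant, n ln(RSS/n) + p ln(n). The function name `bic_for_subsets` is hypothetical, and the exact BIC definition used by the tool is not stated in the abstract.

```python
from itertools import combinations

import numpy as np


def bic_for_subsets(burden, phenotypes, names):
    """Fit burden ~ phenotype subset for every non-empty subset; return BICs.

    burden     : (n,) per-individual rare-variant burden score
    phenotypes : (n, m) phenotype matrix
    names      : list of m phenotype labels, e.g. ["TG", "FI", "WHR"]
    Returns a dict mapping "TG+FI"-style labels to BIC values (lower is better).
    """
    burden = np.asarray(burden, dtype=float)
    phenotypes = np.asarray(phenotypes, dtype=float)
    n = burden.shape[0]
    results = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(range(len(names)), size):
            # Intercept plus the phenotypes in this subset.
            design = np.column_stack([np.ones(n), phenotypes[:, list(subset)]])
            beta, _, _, _ = np.linalg.lstsq(design, burden, rcond=None)
            rss = np.sum((burden - design @ beta) ** 2)
            n_params = design.shape[1]
            # Gaussian BIC up to a constant shared by all models.
            bic = n * np.log(rss / n) + n_params * np.log(n)
            results["+".join(names[i] for i in subset)] = bic
    return results
```

With m phenotypes there are 2^m - 1 candidate models, so this exhaustive loop stays cheap for a handful of traits such as TG, FI and WHR.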
Affiliation(s)
- Marika Kaakinen
- Department of Genomics of Common Disease, Imperial College London, London, W12 0NN UK
- Reedik Mägi
- Estonian Genome Center, University of Tartu, Tartu, 51010 Estonia
- Krista Fischer
- Estonian Genome Center, University of Tartu, Tartu, 51010 Estonia
- Jani Heikkinen
- Department of Genomics of Common Disease, Imperial College London, London, W12 0NN UK
- Neuroepidemiology and Ageing (NEA) Research Unit, Imperial College London, London, W6 8RP UK
- Marjo-Riitta Järvelin
- Department of Epidemiology and Biostatistics, MRC-PHE Centre for Environment and Health, School of Public Health, Imperial College London, London, W2 1PG UK
- Center for Life Course Health Research, University of Oulu, 90014 Oulu, Finland
- Unit of Primary Care, Oulu University Hospital, 90220 Oulu, Finland
- Biocenter Oulu, University of Oulu, 90014 Oulu, Finland
- Andrew P. Morris
- Department of Biostatistics, University of Liverpool, Liverpool, L69 3BX UK
- Inga Prokopenko
- Department of Genomics of Common Disease, Imperial College London, London, W12 0NN UK
5. A method for analyzing multiple continuous phenotypes in rare variant association studies allowing for flexible correlations in variant effects. Eur J Hum Genet 2016; 24:1344-51. [PMID: 26860061; PMCID: PMC4989219; DOI: 10.1038/ejhg.2016.8] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 03/27/2015] [Revised: 12/22/2015] [Accepted: 12/30/2015] [Indexed: 01/05/2023] [Open Access]
Abstract
For region-based sequencing data, power to detect genetic associations can be improved through analysis of multiple related phenotypes. With this motivation, we propose a novel test to detect association simultaneously between a set of rare variants, such as those obtained by sequencing in a small genomic region, and multiple continuous phenotypes. We allow arbitrary correlations among the phenotypes and build on a linear mixed model by assuming the effects of the variants follow a multivariate normal distribution with a zero mean and a specific covariance matrix structure. In order to account for the unknown correlation parameter in the covariance matrix of the variant effects, a data-adaptive variance component test based on score-type statistics is derived. As our approach can calculate the P-value analytically, the proposed test procedure is computationally efficient. Broad simulations and an application to the UK10K project show that our proposed multivariate test is generally more powerful than univariate tests, especially when there are pleiotropic effects or highly correlated phenotypes.
6. Melton PE, Pankratz N. Joint analyses of disease and correlated quantitative phenotypes using next-generation sequencing data. Genet Epidemiol 2012; 35 Suppl 1:S67-73. [PMID: 22128062; DOI: 10.1002/gepi.20653] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
Abstract
The joint analysis of multiple disease phenotypes aims to increase statistical power and potentially identify pleiotropic genes involved in the biological development of common chronic diseases. As next-generation sequencing data become more common, it will be important to consider ways to maximize the ability to detect rare variants within the human genome. The two exome sequence data sets provided for analysis at Genetic Analysis Workshop 17 (GAW17) offered three quantitative phenotypes related to disease status in 200 simulated replicates for both families and unrelated individuals. Participants in Group 10 addressed the challenges and potential uses of next-generation sequencing data to identify causal variants through a broad range of statistical methods. These methods included investigating multiple phenotypes either through data reduction or joint methods, using family or unrelated individuals, and reducing the dimensionality inherent in these data. Most of the research teams regarded the use of multiple phenotypes as a means of increasing analytical power and as a way to clarify the biology of complex disease. Three major observations were gleaned from these Group 10 contributions. First, family and unrelated case-control samples are suited to finding different types of variants. In addition, collapsing either phenotypes or genotypes can reduce the dimensionality of the data and alleviate some of the problems of multiple testing. Finally, we were able to demonstrate in certain cases that performing a joint analysis of disease status and a quantitative trait can improve statistical power.
Affiliation(s)
- Phillip E Melton
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, Texas, USA