1
Marceau West R, Lu W, Rotroff DM, Kuenemann MA, Chang SM, Wu MC, Wagner MJ, Buse JB, Motsinger-Reif AA, Fourches D, Tzeng JY. Identifying individual risk rare variants using protein structure guided local tests (POINT). PLoS Comput Biol 2019; 15:e1006722. [PMID: 30779729 PMCID: PMC6396946 DOI: 10.1371/journal.pcbi.1006722] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 06/28/2018] [Revised: 03/01/2019] [Accepted: 12/17/2018] [Indexed: 01/08/2023]
Abstract
Rare variants are of increasing interest to genetic association studies because of their etiological contributions to human complex diseases. Due to the rarity of the mutant events, rare variants are routinely analyzed on an aggregate level. While aggregation analyses improve the detection of global-level signal, they are not able to pinpoint causal variants within a variant set. To perform inference on a localized level, additional information, e.g., biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the 3-dimensional protein space in a data-adaptive fashion. Besides merely providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization. We assess the selection performance of POINT using simulations and illustrate how it can be used to prioritize individual rare variants in PCSK9, ANGPTL4 and CETP in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data.
Affiliation(s)
- Rachel Marceau West
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Wenbin Lu
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Daniel M. Rotroff
  - Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio, United States of America
- Melaine A. Kuenemann
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Sheng-Mao Chang
  - Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
- Michael C. Wu
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Michael J. Wagner
  - Center for Pharmacogenomics and Individualized Therapy, University of North Carolina, Chapel Hill, North Carolina, United States of America
- John B. Buse
  - Department of Medicine, University of North Carolina School of Medicine, Chapel Hill, North Carolina, United States of America
- Alison A. Motsinger-Reif
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Denis Fourches
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
  - Department of Chemistry, North Carolina State University, Raleigh, North Carolina, United States of America
- Jung-Ying Tzeng
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
  - Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
  - Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
2
Shin S, Keleş S. Annotation Regression for Genome-Wide Association Studies with an Application to Psychiatric Genomic Consortium Data. STATISTICS IN BIOSCIENCES 2017; 9:50-72. [PMID: 28781711 PMCID: PMC5542423 DOI: 10.1007/s12561-016-9154-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/21/2015] [Revised: 05/09/2016] [Accepted: 06/20/2016] [Indexed: 10/21/2022]
Abstract
Although genome-wide association studies (GWAS) have been successful at finding thousands of disease-associated genetic variants (GVs), identifying causal variants and elucidating the mechanisms by which genotypes influence phenotypes are critical open questions. A key challenge is that a large percentage of disease-associated GVs are potential regulatory variants located in noncoding regions, making them difficult to interpret. Recent research efforts focus on going beyond annotating GVs by integrating functional annotation data with GWAS to prioritize GVs. However, the applicability of these approaches is challenged by the high dimensionality and heterogeneity of functional annotation data. Furthermore, existing methods often assume global associations of GVs with annotation data. This strong assumption is susceptible to violations for GVs involved in many complex diseases. To address these issues, we develop a general regression framework, named Annotation Regression for GWAS (ARoG). ARoG is based on a finite mixture of linear regressions model where GWAS association measures are viewed as responses and functional annotations as predictors. This mixture framework addresses heterogeneity of effects of GVs by grouping them into clusters and high dimensionality of the functional annotations by enabling annotation selection within each cluster. ARoG further employs permutation testing to evaluate the significance of selected annotations. Computational experiments indicate that ARoG can discover distinct associations between disease risk and functional annotations. Application of ARoG to autism and schizophrenia data from the Psychiatric Genomics Consortium led to identification of GVs that significantly affect interactions of several transcription factors with DNA as potential mechanisms contributing to these disorders.
Affiliation(s)
- Sunyoung Shin
  - Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, USA
- Sündüz Keleş
  - Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, USA
3
Jeng XJ, Daye ZJ, Lu W, Tzeng JY. Rare Variants Association Analysis in Large-Scale Sequencing Studies at the Single Locus Level. PLoS Comput Biol 2016; 12:e1004993. [PMID: 27355347 PMCID: PMC4927097 DOI: 10.1371/journal.pcbi.1004993] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 10/20/2015] [Accepted: 05/21/2016] [Indexed: 11/24/2022]
Abstract
Genetic association analyses of rare variants in next-generation sequencing (NGS) studies are fundamentally challenging due to the presence of a very large number of candidate variants at extremely low minor allele frequencies. Recent developments often focus on pooling multiple variants to provide association analysis at the gene instead of the locus level. Nonetheless, pinpointing individual variants is a critical goal for genomic research as such information can facilitate the precise delineation of molecular mechanisms and functions of genetic factors on diseases. Due to the extreme rarity of mutations and high-dimensionality, significances of causal variants cannot easily stand out from those of noncausal ones. Consequently, standard false-positive control procedures, such as the Bonferroni and false discovery rate (FDR), are often impractical to apply, as a majority of the causal variants can only be identified along with a few but unknown number of noncausal variants. To provide informative analysis of individual variants in large-scale sequencing studies, we propose the Adaptive False-Negative Control (AFNC) procedure that can include a large proportion of causal variants with high confidence by introducing a novel statistical inquiry to determine those variants that can be confidently dispatched as noncausal. The AFNC provides a general framework that can accommodate a variety of models and significance tests. The procedure is computationally efficient and can adapt to the underlying proportion of causal variants and quality of significance rankings. Extensive simulation studies across a plethora of scenarios demonstrate that the AFNC is advantageous for identifying individual rare variants, whereas the Bonferroni and FDR are exceedingly over-conservative for rare variants association studies. In the analyses of the CoLaus dataset, AFNC has identified individual variants most responsible for gene-level significances. Moreover, single-variant results using the AFNC have been successfully applied to infer related genes with annotation information.

Next-generation sequencing technologies have allowed genetic association studies of complex traits at the single base-pair resolution, where most genetic variants have extremely low mutation frequencies. These rare variants have been the focus of modern statistical-computational genomics due to their potential to explain missing disease heritability. The identification of individual rare variants associated with diseases can provide new biological insights and enable the precise delineation of disease mechanisms. However, due to the extreme rarity of mutations and large numbers of variants, significances of causative variants tend to be mixed inseparably with a few noncausative ones, and standard multiple testing procedures controlling for false positives fail to provide a meaningful way to include a large proportion of the causative variants. To address the challenge of detecting weak biological signals, we propose a novel statistical procedure, based on false-negative control, to provide a practical approach for variant inclusion in large-scale sequencing studies. By determining those variants that can be confidently dispatched as noncausative, the proposed procedure offers an objective selection of a modest number of potentially causative variants at the single-locus level. Results can be further prioritized or used to infer disease-associated genes with annotation information.
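The two false-positive control procedures this abstract compares against, Bonferroni and FDR, can be illustrated with a minimal sketch. The Python below implements the standard Bonferroni correction and the standard Benjamini-Hochberg step-up procedure only; it is not the AFNC procedure, whose details the abstract does not give. The function names and the p-values are hypothetical, chosen purely for illustration.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0 for each p-value at or below alpha / m (family-wise error control)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (false discovery rate control)."""
    m = len(pvals)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= alpha * k / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    # Reject the k smallest p-values.
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


# Invented p-values for eight hypothetical variants (not from any dataset).
pvals = [0.001, 0.008, 0.012, 0.03, 0.2, 0.5, 0.7, 0.9]

print(sum(bonferroni(pvals)))          # 1 variant passes Bonferroni
print(sum(benjamini_hochberg(pvals)))  # 3 variants pass Benjamini-Hochberg
```

On these toy values Bonferroni keeps one variant and Benjamini-Hochberg keeps three; both select by limiting false positives, which is the criterion the abstract argues is too strict for rare variants.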
Affiliation(s)
- Xinge Jessie Jeng
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Zhongyin John Daye
  - Epidemiology and Biostatistics, University of Arizona, Tucson, Arizona, United States of America
- Wenbin Lu
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Jung-Ying Tzeng
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
  - Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
4
van Parijs FRD, Ruttink T, Boerjan W, Haesaert G, Byrne SL, Asp T, Roldán-Ruiz I, Muylle H. Clade classification of monolignol biosynthesis gene family members reveals target genes to decrease lignin in Lolium perenne. PLANT BIOLOGY (STUTTGART, GERMANY) 2015; 17:877-92. [PMID: 25683375 DOI: 10.1111/plb.12316] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/02/2014] [Accepted: 01/19/2015] [Indexed: 05/08/2023]
Abstract
In monocots, lignin content has a strong impact on the digestibility of the cell wall fraction. Engineering lignin biosynthesis requires a profound knowledge of the role of paralogues in the multigene families that constitute the monolignol biosynthesis pathway. We applied a bioinformatics approach for genome-wide identification of candidate genes in Lolium perenne that are likely to be involved in the biosynthesis of monolignols. More specifically, we performed functional subtyping of phylogenetic clades in four multigene families: 4CL, COMT, CAD and CCR. Essential residues were considered for functional clade delineation within these families. This classification was complemented with previously published experimental evidence on gene expression, gene function and enzymatic activity in closely related crops and model species. This allowed us to assign functions to newly identified L. perenne genes, and to assess functional redundancy among paralogues. We found that two 4CL paralogues, two COMT paralogues, three CCR paralogues and one CAD gene are prime targets for genetic studies to engineer developmentally regulated lignin in this species. Based on the delineation of sequence conservation between paralogues and a first analysis of allelic diversity, we discuss possibilities to further study the roles of these paralogues in lignin biosynthesis, including expression analysis, reverse genetics and forward genetics, such as association mapping. We propose criteria to prioritise paralogues within multigene families and certain SNPs within these genes for developing genotyping assays or increasing power in association mapping studies. Although L. perenne was the target of the analyses presented here, this functional subtyping of phylogenetic clades represents a valuable tool for studies investigating monolignol biosynthesis genes in other monocot species.
Affiliation(s)
- F R D van Parijs
  - Plant Sciences Unit - Growth and Development, Institute for Agricultural and Fisheries Research (ILVO), Melle, Belgium
- T Ruttink
  - Plant Sciences Unit - Growth and Development, Institute for Agricultural and Fisheries Research (ILVO), Melle, Belgium
- W Boerjan
  - Department of Plant Systems Biology, VIB, Gent, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Gent, Belgium
- G Haesaert
  - Faculty Bioscience Engineering, Department of Applied Biosciences, Ghent University, Gent, Belgium
- S L Byrne
  - Department of Molecular Biology and Genetics, Research Centre Flakkebjerg, Aarhus University, Slagelse, Denmark
- T Asp
  - Department of Molecular Biology and Genetics, Research Centre Flakkebjerg, Aarhus University, Slagelse, Denmark
- I Roldán-Ruiz
  - Plant Sciences Unit - Growth and Development, Institute for Agricultural and Fisheries Research (ILVO), Melle, Belgium
- H Muylle
  - Plant Sciences Unit - Growth and Development, Institute for Agricultural and Fisheries Research (ILVO), Melle, Belgium
5
Moser G, Lee SH, Hayes BJ, Goddard ME, Wray NR, Visscher PM. Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model. PLoS Genet 2015; 11:e1004969. [PMID: 25849665 PMCID: PMC4388571 DOI: 10.1371/journal.pgen.1004969] [Citation(s) in RCA: 248] [Impact Index Per Article: 24.8] [Received: 08/07/2014] [Accepted: 12/19/2014] [Indexed: 12/23/2022]
Abstract
Gene discovery, estimation of heritability captured by SNP arrays, inference on genetic architecture and prediction analyses of complex traits are usually performed using different statistical models and methods, leading to inefficiency and loss of power. Here we use a Bayesian mixture model that simultaneously allows variant discovery, estimation of genetic variance explained by all variants and prediction of unobserved phenotypes in new samples. We apply the method to simulated data of quantitative traits and Wellcome Trust Case Control Consortium (WTCCC) data on disease and show that it provides accurate estimates of SNP-based heritability, produces unbiased estimators of risk in new samples, and that it can estimate genetic architecture by partitioning variation across hundreds to thousands of SNPs. We estimated that, depending on the trait, 2,633 to 9,411 SNPs explain all of the SNP-based heritability in the WTCCC diseases. The majority of those SNPs (>96%) had small effects, confirming a substantial polygenic component to common diseases. The proportion of the SNP-based variance explained by large effects (each SNP explaining 1% of the variance) varied markedly between diseases, ranging from almost zero for bipolar disorder to 72% for type 1 diabetes. Prediction analyses demonstrate that for diseases with major loci, such as type 1 diabetes and rheumatoid arthritis, Bayesian methods outperform profile scoring or mixed model approaches.

Most genome-wide association studies performed to date have focused on testing individual genetic markers for associations with phenotype. Recently, methods that analyse the joint effects of multiple markers on genetic variation have provided further insights into the genetic basis of complex human traits. In addition, there is increasing interest in using genotype data for genetic risk prediction of disease. Often disparate analytical methods are used for each of these tasks. We propose a flexible novel approach that simultaneously performs identification of susceptibility loci, inference on the genetic architecture and provides polygenic risk prediction in the same statistical model. We illustrate the broad applicability of the approach by considering both simulated and real data. In the analysis of seven common diseases we show large differences in the proportion of genetic variation due to loci with different effect sizes and differences in prediction accuracy between complex traits. These findings are important for future studies and the understanding of the complex genetic architecture of common diseases.
Affiliation(s)
- Gerhard Moser
  - Queensland Brain Institute, University of Queensland, Brisbane, Australia
- Sang Hong Lee
  - Queensland Brain Institute, University of Queensland, Brisbane, Australia
- Ben J. Hayes
  - Department of Primary Industries, Biosciences Research Division, Bundoora, Australia
  - Dairy Futures Cooperative Research Centre, Bundoora, Australia
- Michael E. Goddard
  - Department of Primary Industries, Biosciences Research Division, Bundoora, Australia
  - Faculty of Land and Food Resources, University of Melbourne, Melbourne, Australia
- Naomi R. Wray
  - Queensland Brain Institute, University of Queensland, Brisbane, Australia
- Peter M. Visscher
  - Queensland Brain Institute, University of Queensland, Brisbane, Australia
  - University of Queensland Diamantina Institute, University of Queensland, Translational Research Institute (TRI), Brisbane, Australia
6
Ionita-Laza I, Capanu M, De Rubeis S, McCallum K, Buxbaum JD. Identification of rare causal variants in sequence-based studies: methods and applications to VPS13B, a gene involved in Cohen syndrome and autism. PLoS Genet 2014; 10:e1004729. [PMID: 25502226 PMCID: PMC4263785 DOI: 10.1371/journal.pgen.1004729] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Received: 05/15/2014] [Accepted: 09/02/2014] [Indexed: 11/18/2022]
Abstract
Pinpointing the small number of causal variants among the abundant naturally occurring genetic variation is a difficult challenge, but a crucial one for understanding precise molecular mechanisms of disease and follow-up functional studies. We propose and investigate two complementary statistical approaches for identification of rare causal variants in sequencing studies: a backward elimination procedure based on groupwise association tests, and a hierarchical approach that can integrate sequencing data with diverse functional and evolutionary conservation annotations for individual variants. Using simulations, we show that incorporation of multiple bioinformatic predictors of deleteriousness, such as PolyPhen-2, SIFT and GERP++ scores, can improve the power to discover truly causal variants. As proof of principle, we apply the proposed methods to VPS13B, a gene mutated in the rare neurodevelopmental disorder called Cohen syndrome, and recently reported with recessive variants in autism. We identify a small set of promising candidates for causal variants, including two loss-of-function variants and a rare, homozygous probably-damaging variant that could contribute to autism risk.
Affiliation(s)
- Iuliana Ionita-Laza
  - Department of Biostatistics, Columbia University, New York, New York, United States of America
- Marinela Capanu
  - Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
- Silvia De Rubeis
  - Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
  - Departments of Psychiatry, Mount Sinai School of Medicine, New York, New York, United States of America
- Kenneth McCallum
  - Department of Biostatistics, Columbia University, New York, New York, United States of America
- Joseph D. Buxbaum
  - Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
  - Departments of Psychiatry, Mount Sinai School of Medicine, New York, New York, United States of America
  - Departments of Genetics and Genomic Sciences, and Neuroscience, and Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
  - Mindich Child Health and Development Institute, Mount Sinai School of Medicine, New York, New York, United States of America
7
Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. Identifying causal variants at loci with multiple signals of association. Genetics 2014; 198:497-508. [PMID: 25104515 PMCID: PMC4196608 DOI: 10.1534/genetics.114.167908] [Citation(s) in RCA: 294] [Impact Index Per Article: 26.7] [Received: 05/29/2014] [Accepted: 07/18/2014] [Indexed: 12/22/2022]
Abstract
Although genome-wide association studies have successfully identified thousands of risk loci for complex traits, only a handful of the biologically causal variants, responsible for association at these loci, have been successfully identified. Current statistical methods for identifying causal variants at risk loci either use the strength of the association signal in an iterative conditioning framework or estimate probabilities for variants to be causal. A main drawback of existing methods is that they rely on the simplifying assumption of a single causal variant at each risk locus, which is typically invalid at many risk loci. In this work, we propose a new statistical framework that allows for the possibility of an arbitrary number of causal variants when estimating the posterior probability of a variant being causal. A direct benefit of our approach is that we predict a set of variants for each locus that under reasonable assumptions will contain all of the true causal variants with a high confidence level (e.g., 95%) even when the locus contains multiple causal variants. We use simulations to show that our approach provides 20-50% improvement in our ability to identify the causal variants compared to the existing methods at loci harboring multiple causal variants. We validate our approach using empirical data from an expression QTL study of CHI3L2 to identify new causal variants that affect gene expression at this locus. CAVIAR is publicly available online at http://genetics.cs.ucla.edu/caviar/.
Affiliation(s)
- Farhad Hormozdiari
  - Department of Computer Science, University of California, Los Angeles, California 90095
- Emrah Kostem
  - Department of Computer Science, University of California, Los Angeles, California 90095
- Eun Yong Kang
  - Department of Computer Science, University of California, Los Angeles, California 90095
- Bogdan Pasaniuc
  - Department of Human Genetics, University of California, Los Angeles, California 90095
  - Department of Pathology and Laboratory Medicine, University of California, Los Angeles, California 90095
- Eleazar Eskin
  - Department of Computer Science, University of California, Los Angeles, California 90095
  - Department of Human Genetics, University of California, Los Angeles, California 90095