1
Hsieh AR, Luo YL, Bao BY, Chou TC. Comparative analysis of genetic risk scores for predicting biochemical recurrence in prostate cancer patients after radical prostatectomy. BMC Urol 2024; 24:136. [PMID: 38956663 PMCID: PMC11218119 DOI: 10.1186/s12894-024-01524-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Accepted: 06/25/2024] [Indexed: 07/04/2024] Open
Abstract
BACKGROUND In recent years, genome-wide association studies (GWAS) have identified risk variants related to complex diseases, but most genetic variants have only a small impact on phenotypes. To address this, methods that can use variants with low genetic effects, such as the genetic risk score (GRS), have been developed to predict disease risk. METHODS As the GRS model with the greatest predictive power for complex diseases has not been determined, our study used simulation data and prostate cancer data to explore the disease prediction power of three GRS models: the simple count genetic risk score (SC-GRS), the direct logistic regression genetic risk score (DL-GRS), and the explained variance weighted GRS based on directed logistic regression (EVDL-GRS). RESULTS AND CONCLUSIONS We used 26 SNPs to establish GRS models to predict the risk of biochemical recurrence (BCR) after radical prostatectomy. Combining clinical variables such as age at diagnosis, body mass index, prostate-specific antigen, Gleason score, pathologic T stage, and surgical margin with GRS models gave better predictive power for BCR. The results of simulation data (statistical power = 0.707) and prostate cancer data (area under curve = 0.8462) show that DL-GRS has the best prediction performance. rs455192 was the most relevant locus for BCR (p = 2.496 × 10⁻⁶) in our study.
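As a rough illustration of the difference between an unweighted and a weighted genetic risk score, the sketch below computes a simple allele count and a generic log-odds-weighted sum. The function names, genotypes, and odds ratios are hypothetical, and the weighted score is a common textbook form; it is not necessarily the DL-GRS or EVDL-GRS definition used in the paper.

```python
import math


def simple_count_grs(genotypes):
    """Unweighted score: total number of risk alleles (each genotype is 0, 1, or 2)."""
    return sum(genotypes)


def weighted_grs(genotypes, odds_ratios):
    """Weighted score: risk-allele counts weighted by per-SNP log odds ratios.

    Assumption: the weights come from some prior association analysis;
    this is a generic form, not the paper's exact model.
    """
    if len(genotypes) != len(odds_ratios):
        raise ValueError("genotypes and odds_ratios must have the same length")
    return sum(g * math.log(o) for g, o in zip(genotypes, odds_ratios))


# Hypothetical individual with four SNPs and made-up odds ratios.
genotypes = [0, 1, 2, 1]
odds_ratios = [1.10, 1.25, 0.90, 1.40]

print(simple_count_grs(genotypes))
print(round(weighted_grs(genotypes, odds_ratios), 4))
```

The simple count treats every risk allele as equally important, whereas the weighted form lets alleles with larger estimated effects contribute more to the score.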
Affiliation(s)
- Ai-Ru Hsieh
- Department of Statistics, Tamkang University, New Taipei City, 251301, Taiwan.
- Yi-Ling Luo
- Department of Public Health, College of Public Health, China Medical University, Taichung, 40402, Taiwan
- Bo-Ying Bao
- School of Pharmacy, China Medical University, Taichung, 406040, Taiwan
- Department of Nursing, Asia University, Taichung, 41354, Taiwan
- Tzu-Chieh Chou
- Department of Public Health, College of Public Health, China Medical University, Taichung, 40402, Taiwan
- Department of Health Risk Management, College of Public Health, China Medical University, Taichung, 40402, Taiwan
2
Hershberger RE, Cowan J, Jordan E, Kinnamon DD. The Complex and Diverse Genetic Architecture of Dilated Cardiomyopathy. Circ Res 2021; 128:1514-1532. [PMID: 33983834 DOI: 10.1161/circresaha.121.318157] [Citation(s) in RCA: 49] [Impact Index Per Article: 12.3] [Indexed: 11/16/2022]
Abstract
Our insight into the diverse and complex nature of dilated cardiomyopathy (DCM) genetic architecture continues to evolve rapidly. The foundations of DCM genetics rest on marked locus and allelic heterogeneity. While DCM exhibits a Mendelian, monogenic architecture in some families, preliminary data from our studies and others suggests that at least 20% to 30% of DCM may have an oligogenic basis, meaning that multiple rare variants from different, unlinked loci, determine the DCM phenotype. It is also likely that low-frequency and common genetic variation contribute to DCM complexity, but neither has been examined within a rare variant context. Other types of genetic variation are also likely relevant for DCM, along with gene-by-environment interaction, now established for alcohol- and chemotherapy-related DCM. Collectively, this suggests that the genetic architecture of DCM is broader in scope and more complex than previously understood. All of this elevates the impact of DCM genetics research, as greater insight into the causes of DCM can lead to interventions to mitigate or even prevent it and thus avoid the morbid and mortal scourge of human heart failure.
Affiliation(s)
- Ray E Hershberger
- Divisions of Cardiovascular Medicine (R.E.H.), The Ohio State University Wexner Medical Center, Columbus
- Human Genetics (R.E.H., J.C., E.J., D.D.K.), The Ohio State University Wexner Medical Center, Columbus
- Department of Internal Medicine and the Davis Heart and Lung Research Institute (R.E.H., J.C., E.J., D.D.K.), The Ohio State University Wexner Medical Center, Columbus
- Jason Cowan
- Human Genetics (R.E.H., J.C., E.J., D.D.K.), The Ohio State University Wexner Medical Center, Columbus
- Department of Internal Medicine and the Davis Heart and Lung Research Institute (R.E.H., J.C., E.J., D.D.K.), The Ohio State University Wexner Medical Center, Columbus
- Elizabeth Jordan
- Human Genetics (R.E.H., J.C., E.J., D.D.K.), The Ohio State University Wexner Medical Center, Columbus
- Department of Internal Medicine and the Davis Heart and Lung Research Institute (R.E.H., J.C., E.J., D.D.K.), The Ohio State University Wexner Medical Center, Columbus
- Daniel D Kinnamon
- Human Genetics (R.E.H., J.C., E.J., D.D.K.), The Ohio State University Wexner Medical Center, Columbus
- Department of Internal Medicine and the Davis Heart and Lung Research Institute (R.E.H., J.C., E.J., D.D.K.), The Ohio State University Wexner Medical Center, Columbus
3
Novel directions in data pre-processing and genome-wide association study (GWAS) methodologies to overcome ongoing challenges. INFORMATICS IN MEDICINE UNLOCKED 2021. [DOI: 10.1016/j.imu.2021.100586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open
4
Zhang J, Wu B, Sha Q, Zhang S, Wang X. A general statistic to test an optimally weighted combination of common and/or rare variants. Genet Epidemiol 2019; 43:966-979. [PMID: 31498476 DOI: 10.1002/gepi.22255] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/30/2019] [Revised: 06/17/2019] [Accepted: 07/30/2019] [Indexed: 11/10/2022]
Abstract
Both genome-wide association study and next-generation sequencing data analyses are widely employed to identify disease susceptible common and/or rare genetic variants. Rare variants generally have large effects though they are hard to detect due to their low frequencies. Currently, many existing statistical methods for rare variants association studies employ a weighted combination scheme, which usually puts subjective weights or suboptimal weights based on some ad hoc assumptions (e.g., ignoring dependence between rare variants). In this study, we analytically derived optimal weights for both common and rare variants and proposed a general and novel approach to test association between an optimally weighted combination of variants (G-TOW) in a gene or pathway for a continuous or dichotomous trait while easily adjusting for covariates. Results of the simulation studies show that G-TOW has properly controlled type I error rates and it is the most powerful test among the methods we compared when testing effects of either both rare and common variants or rare variants only. We also illustrate the effectiveness of G-TOW using the Genetic Analysis Workshop 17 (GAW17) data. Additionally, we applied G-TOW and other competitive methods to test disease-associated genes in real data of schizophrenia. The G-TOW has successfully verified genes FYN and VPS39 which are associated with schizophrenia reported in existing publications. Both of these genes are missed by the weighted sum statistic and the sequence kernel association test. Simulation study and real data analysis indicate that G-TOW is a powerful test.
Affiliation(s)
- Jianjun Zhang
- Department of Mathematics, University of North Texas, Denton, Texas
- Baolin Wu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
- Qiuying Sha
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan
- Shuanglin Zhang
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan
- Xuexia Wang
- Department of Mathematics, University of North Texas, Denton, Texas
5
Marceau West R, Lu W, Rotroff DM, Kuenemann MA, Chang SM, Wu MC, Wagner MJ, Buse JB, Motsinger-Reif AA, Fourches D, Tzeng JY. Identifying individual risk rare variants using protein structure guided local tests (POINT). PLoS Comput Biol 2019; 15:e1006722. [PMID: 30779729 PMCID: PMC6396946 DOI: 10.1371/journal.pcbi.1006722] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 06/28/2018] [Revised: 03/01/2019] [Accepted: 12/17/2018] [Indexed: 01/08/2023] Open
Abstract
Rare variants are of increasing interest to genetic association studies because of their etiological contributions to human complex diseases. Due to the rarity of the mutant events, rare variants are routinely analyzed on an aggregate level. While aggregation analyses improve the detection of global-level signal, they are not able to pinpoint causal variants within a variant set. To perform inference on a localized level, additional information, e.g., biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the 3-dimensional protein space in a data-adaptive fashion. Besides merely providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization. We assess the selection performance of POINT using simulations and illustrate how it can be used to prioritize individual rare variants in PCSK9, ANGPTL4 and CETP in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data.
Affiliation(s)
- Rachel Marceau West
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Wenbin Lu
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Daniel M. Rotroff
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio, United States of America
- Melaine A. Kuenemann
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Sheng-Mao Chang
- Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
- Michael C. Wu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Michael J. Wagner
- Center for Pharmacogenomics and Individualized Therapy, University of North Carolina, Chapel Hill, North Carolina, United States of America
- John B. Buse
- Department of Medicine, University of North Carolina School of Medicine, Chapel Hill, North Carolina, United States of America
- Alison A. Motsinger-Reif
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Denis Fourches
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Department of Chemistry, North Carolina State University, Raleigh, North Carolina, United States of America
- Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
6
Konigorski S, Yilmaz YE, Pischon T. Comparison of single-marker and multi-marker tests in rare variant association studies of quantitative traits. PLoS One 2017; 12:e0178504. [PMID: 28562689 PMCID: PMC5451057 DOI: 10.1371/journal.pone.0178504] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 01/11/2017] [Accepted: 05/15/2017] [Indexed: 11/19/2022] Open
Abstract
In genetic association studies of rare variants, low statistical power and potential violations of established estimator properties are among the main challenges of association tests. Multi-marker tests (MMTs) have been proposed to target these challenges, but any comparison with single-marker tests (SMTs) has to consider that their aim is to identify causal genomic regions instead of variants. Valid power comparisons have been performed for the analysis of binary traits indicating that MMTs have higher power, but there is a lack of conclusive studies for quantitative traits. The aim of our study was therefore to fairly compare SMTs and MMTs in their empirical power to identify the same causal loci associated with a quantitative trait. The results of extensive simulation studies indicate that previous results for binary traits cannot be generalized. First, we show that for the analysis of quantitative traits, conventional estimation methods and test statistics of single-marker approaches have valid properties yielding association tests with valid type I error, even when investigating singletons or doubletons. Furthermore, SMTs lead to more powerful association tests for identifying causal genes than MMTs when the effect sizes of causal variants are large, and less powerful tests when causal variants have small effect sizes. For moderate effect sizes, whether SMTs or MMTs have higher power depends on the sample size and percentage of causal SNVs. For a more complete picture, we also compare the power in studies of quantitative and binary traits, and the power to identify causal genes with the power to identify causal rare variants. In a genetic association analysis of systolic blood pressure in the Genetic Analysis Workshop 19 data, SMTs yielded smaller p-values compared to MMTs for most of the investigated blood pressure genes, and were least influenced by the definition of gene regions.
Affiliation(s)
- Stefan Konigorski
- Molecular Epidemiology Research Group, Max Delbrück Center (MDC) for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Yildiz E. Yilmaz
- Department of Mathematics and Statistics, Memorial University of Newfoundland, St. John’s, Newfoundland and Labrador, Canada
- Discipline of Genetics, Faculty of Medicine, Memorial University of Newfoundland, St. John’s, Newfoundland and Labrador, Canada
- Discipline of Medicine, Faculty of Medicine, Memorial University of Newfoundland, St. John’s, Newfoundland and Labrador, Canada
- Tobias Pischon
- Molecular Epidemiology Research Group, Max Delbrück Center (MDC) for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Charité Universitätsmedizin Berlin, Berlin, Germany
- DZHK (German Center for Cardiovascular Research), Berlin, Germany
7
Turkmen A, Lin S. Are rare variants really independent? Genet Epidemiol 2017; 41:363-371. [DOI: 10.1002/gepi.22039] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 05/13/2016] [Revised: 11/02/2016] [Accepted: 12/26/2016] [Indexed: 11/10/2022]
Affiliation(s)
- Asuman Turkmen
- Department of Statistics, The Ohio State University, Columbus, Ohio, United States of America
- Department of Statistics, The Ohio State University at Newark, Newark, Ohio, United States of America
- Shili Lin
- Department of Statistics, The Ohio State University, Columbus, Ohio, United States of America
8
Shin JH, Yi R, Bull SB. Identification of low frequency and rare variants for hypertension using sparse-data methods. BMC Proc 2016; 10:389-395. [PMID: 27980667 PMCID: PMC5133522 DOI: 10.1186/s12919-016-0061-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022] Open
Abstract
Availability of genomic sequence data provides opportunities to study the role of low-frequency and rare variants in the etiology of complex disease. In this study, we conduct association analyses of hypertension status in the cohort of 1943 unrelated Mexican Americans provided by Genetic Analysis Workshop 19, focusing on exonic variants in MAP4 on chromosome 3. Our primary interest is to compare the performance of standard and sparse-data approaches for single-variant tests and variant-collapsing tests for sets of rare and low-frequency variants. We analyze both the real and the simulated phenotypes.
Affiliation(s)
- Ji-Hyung Shin
- Lunenfeld-Tanenbaum Research Institute, Sinai Health System, University of Toronto, Toronto, ON M5T 3L9 Canada
- Ruiyang Yi
- Lunenfeld-Tanenbaum Research Institute, Sinai Health System, University of Toronto, Toronto, ON M5T 3L9 Canada
- Shelley B Bull
- Lunenfeld-Tanenbaum Research Institute, Sinai Health System, University of Toronto, Toronto, ON M5T 3L9 Canada
9
Block-based association tests for rare variants using Kullback–Leibler divergence. J Hum Genet 2016; 61:965-975. [DOI: 10.1038/jhg.2016.90] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2015] [Revised: 05/03/2016] [Accepted: 06/17/2016] [Indexed: 11/09/2022]
10
Jeng XJ, Daye ZJ, Lu W, Tzeng JY. Rare Variants Association Analysis in Large-Scale Sequencing Studies at the Single Locus Level. PLoS Comput Biol 2016; 12:e1004993. [PMID: 27355347 PMCID: PMC4927097 DOI: 10.1371/journal.pcbi.1004993] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 10/20/2015] [Accepted: 05/21/2016] [Indexed: 11/24/2022] Open
Abstract
Genetic association analyses of rare variants in next-generation sequencing (NGS) studies are fundamentally challenging due to the presence of a very large number of candidate variants at extremely low minor allele frequencies. Recent developments often focus on pooling multiple variants to provide association analysis at the gene instead of the locus level. Nonetheless, pinpointing individual variants is a critical goal for genomic research, as such information can facilitate the precise delineation of molecular mechanisms and functions of genetic factors on diseases. Due to the extreme rarity of mutations and high dimensionality, significances of causal variants cannot easily stand out from those of noncausal ones. Consequently, standard false-positive control procedures, such as the Bonferroni and false discovery rate (FDR), are often impractical to apply, as a majority of the causal variants can only be identified along with a small but unknown number of noncausal variants. To provide informative analysis of individual variants in large-scale sequencing studies, we propose the Adaptive False-Negative Control (AFNC) procedure that can include a large proportion of causal variants with high confidence by introducing a novel statistical inquiry to determine those variants that can be confidently dispatched as noncausal. The AFNC provides a general framework that can accommodate a variety of models and significance tests. The procedure is computationally efficient and can adapt to the underlying proportion of causal variants and quality of significance rankings. Extensive simulation studies across a plethora of scenarios demonstrate that the AFNC is advantageous for identifying individual rare variants, whereas the Bonferroni and FDR are exceedingly over-conservative for rare variants association studies. In the analyses of the CoLaus dataset, AFNC has identified individual variants most responsible for gene-level significances. Moreover, single-variant results using the AFNC have been successfully applied to infer related genes with annotation information.
Next-generation sequencing technologies have allowed genetic association studies of complex traits at the single base-pair resolution, where most genetic variants have extremely low mutation frequencies. These rare variants have been the focus of modern statistical-computational genomics due to their potential to explain missing disease heritability. The identification of individual rare variants associated with diseases can provide new biological insights and enable the precise delineation of disease mechanisms. However, due to the extreme rarity of mutations and large numbers of variants, significances of causative variants tend to be mixed inseparably with a few noncausative ones, and standard multiple testing procedures controlling for false positives fail to provide a meaningful way to include a large proportion of the causative variants. To address the challenge of detecting weak biological signals, we propose a novel statistical procedure, based on false-negative control, to provide a practical approach for variant inclusion in large-scale sequencing studies. By determining those variants that can be confidently dispatched as noncausative, the proposed procedure offers an objective selection of a modest number of potentially causative variants at the single-locus level. Results can be further prioritized or used to infer disease-associated genes with annotation information.
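For context on the two standard false-positive control procedures that the abstract contrasts with AFNC, here is a minimal sketch of the Bonferroni correction and the Benjamini-Hochberg FDR step-up procedure. The function names and p-values are hypothetical, and AFNC itself is not implemented here, since the abstract does not give its details.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for false discovery rate control."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * alpha / m.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            largest_k = rank
    # Reject the hypotheses with the largest_k smallest p-values.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical single-variant p-values.
p_values = [0.001, 0.012, 0.03, 0.2]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

With these made-up values, Bonferroni rejects fewer hypotheses than Benjamini-Hochberg, which reflects the general ordering in stringency between the two procedures.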
Affiliation(s)
- Xinge Jessie Jeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Zhongyin John Daye
- Epidemiology and Biostatistics, University of Arizona, Tucson, Arizona, United States of America
- Wenbin Lu
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
11
He L, Pitkäniemi J, Sarin AP, Salomaa V, Sillanpää MJ, Ripatti S. Hierarchical Bayesian model for rare variant association analysis integrating genotype uncertainty in human sequence data. Genet Epidemiol 2014; 39:89-100. [PMID: 25395270 DOI: 10.1002/gepi.21871] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 07/09/2014] [Revised: 09/18/2014] [Accepted: 10/03/2014] [Indexed: 11/08/2022]
Abstract
Next-generation sequencing (NGS) has led to the study of rare genetic variants, which possibly explain the missing heritability for complex diseases. Most existing methods for rare variant (RV) association detection do not account for the common presence of sequencing errors in NGS data. The errors can largely affect the power and perturb the accuracy of association tests due to rare observations of minor alleles. We developed a hierarchical Bayesian approach to estimate the association between RVs and complex diseases. Our integrated framework combines the misclassification probability with shrinkage-based Bayesian variable selection. It allows for flexibility in handling neutral and protective RVs with measurement error, and is robust enough for detecting causal RVs with a wide spectrum of minor allele frequency (MAF). Imputation uncertainty and MAF are incorporated into the integrated framework to achieve the optimal statistical power. We demonstrate that sequencing error does significantly affect the findings, and our proposed model can take advantage of it to improve statistical power in both simulated and real data. We further show that our model outperforms existing methods, such as sequence kernel association test (SKAT). Finally, we illustrate the behavior of the proposed method using a Finnish low-density lipoprotein cholesterol study, and show that it identifies an RV known as FH North Karelia in LDLR gene with three carriers in 1,155 individuals, which is missed by both SKAT and Granvil.
Affiliation(s)
- Liang He
- Department of Public Health, Hjelt Institute, University of Helsinki, Helsinki, Finland
12
Vsevolozhskaya OA, Zaykin DV, Greenwood MC, Wei C, Lu Q. Functional analysis of variance for association studies. PLoS One 2014; 9:e105074. [PMID: 25244256 PMCID: PMC4171465 DOI: 10.1371/journal.pone.0105074] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 02/28/2014] [Accepted: 07/18/2014] [Indexed: 12/21/2022] Open
Abstract
While progress has been made in identifying common genetic variants associated with human diseases, for most common complex diseases, the identified genetic variants only account for a small proportion of heritability. Challenges remain in finding additional unknown genetic variants predisposing to complex diseases. With the advance in next-generation sequencing technologies, sequencing studies have become commonplace in genetic research. The ongoing exome-sequencing and whole-genome-sequencing studies generate a massive amount of sequencing variants and allow researchers to comprehensively investigate their role in human diseases. The discovery of new disease-associated variants can be enhanced by utilizing powerful and computationally efficient statistical methods. In this paper, we propose a functional analysis of variance (FANOVA) method for testing an association of sequence variants in a genomic region with a qualitative trait. The FANOVA has a number of advantages: (1) it tests for a joint effect of gene variants, including both common and rare; (2) it fully utilizes linkage disequilibrium and genetic position information; and (3) it allows for either protective or risk-increasing causal variants. Through simulations, we show that FANOVA outperforms two popular methods, SKAT and a previously proposed method based on functional linear models (FLM), especially if the sample size of a study is small and/or sequence variants have low to moderate effects. We conduct an empirical study by applying three methods (FANOVA, SKAT and FLM) to sequencing data from the Dallas Heart Study. While SKAT and FLM respectively detected ANGPTL4 and ANGPTL3 associated with obesity, FANOVA was able to identify both genes associated with obesity.
Affiliation(s)
- Olga A. Vsevolozhskaya
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, United States of America
- Dmitri V. Zaykin
- National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, North Carolina, United States of America
- Mark C. Greenwood
- Department of Mathematical Sciences, Montana State University, Bozeman, Montana, United States of America
- Changshuai Wei
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, United States of America
- Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, United States of America
13
Lin YC, Hsieh AR, Hsiao CL, Wu SJ, Wang HM, Lian IB, Fann CSJ. Identifying rare and common disease associated variants in genomic data using Parkinson's disease as a model. J Biomed Sci 2014; 21:88. [PMID: 25175702 PMCID: PMC4428531 DOI: 10.1186/s12929-014-0088-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 07/25/2014] [Accepted: 08/21/2014] [Indexed: 01/06/2023] Open
Abstract
BACKGROUND Genome-wide association studies have been successful in identifying common genetic variants for human diseases. However, much of the heritable variation associated with diseases such as Parkinson's disease remains unknown suggesting that many more risk loci are yet to be identified. Rare variants have become important in disease association studies for explaining missing heritability. Methods for detecting this type of association require prior knowledge on candidate genes and combining variants within the region. These methods may suffer from power loss in situations with many neutral variants or causal variants with opposite effects. RESULTS We propose a method capable of scanning genetic variants to identify the region most likely harbouring disease gene with rare and/or common causal variants. Our method assigns a score at each individual variant based on our scoring system. It uses aggregate scores to identify the region with disease association. We evaluate performance by simulation based on 1000 Genomes sequencing data and compare with three commonly used methods. We use a Parkinson's disease case-control dataset as a model to demonstrate the application of our method. Our method has better power than CMC and WSS and similar power to SKAT-O with well-controlled type I error under simulation based on 1000 Genomes sequencing data. In real data analysis, we confirm the association of α-synuclein gene (SNCA) with Parkinson's disease (p = 0.005). We further identify association with hyaluronan synthase 2 (HAS2, p = 0.028) and kringle containing transmembrane protein 1 (KREMEN1, p = 0.006). KREMEN1 is associated with Wnt signalling pathway which has been shown to play an important role for neurodegeneration in Parkinson's disease. CONCLUSIONS Our method is time efficient and less sensitive to inclusion of neutral variants and direction effect of causal variants. It can narrow down a genomic region or a chromosome to a disease associated region. Using Parkinson's disease as a model, our method not only confirms association for a known gene but also identifies two genes previously found by other studies. In spite of many existing methods, we conclude that our method serves as an efficient alternative for exploring genomic data containing both rare and common variants.
Affiliation(s)
- Ying-Chao Lin
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan.
- Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei, Taiwan.
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan.
- Ai-Ru Hsieh
- Graduate Institute of Biostatistics, China Medical University, Taichung, Taiwan.
- Ching-Lin Hsiao
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan.
- Shang-Jung Wu
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan.
- Hui-Min Wang
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan.
- Ie-Bin Lian
- Graduate Institute of Statistics and Information Science, National Changhua University of Education, Changhua, Taiwan.
- Cathy S J Fann
- Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei, Taiwan.
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan.
- Institute of Public Health, National Yang-Ming University, Taipei, Taiwan.
14
|
Sha Q, Zhang S. A rare variant association test based on combinations of single-variant tests. Genet Epidemiol 2014; 38:494-501. [PMID: 25065727 DOI: 10.1002/gepi.21834] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 02/17/2014] [Revised: 04/17/2014] [Accepted: 05/19/2014] [Indexed: 01/22/2023]
Abstract
Next-generation sequencing technologies make direct testing of rare variant associations possible. However, the development of powerful statistical methods for rare variant association studies is still underway. Most existing methods are burden and quadratic tests. Recent studies show that the performance of both burden and quadratic tests depends strongly upon the underlying assumptions and that no test demonstrates consistently acceptable power. Thus, combined tests that combine information from the burden and quadratic tests have been proposed recently. However, results from recent studies (including this study) show that there exist tests that can outperform both burden and quadratic tests. In this article, we propose three classes of tests that include tests outperforming both burden and quadratic tests. Then, we propose the optimal combination of single-variant tests (OCST) by combining information from tests of the three classes. We use extensive simulation studies to compare the performance of OCST with that of burden, quadratic and optimal single-variant tests. Our results show that OCST either is the most powerful test or has similar power to the most powerful test. We also compare the performance of OCST with that of the two existing combined tests. Our results show that OCST has better power than the two combined tests.
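The contrast between burden and quadratic tests drawn in this abstract can be illustrated with a minimal sketch. It is not the OCST procedure: it uses unweighted, unstandardised score statistics, and the function names and toy data are hypothetical. It only shows why variants with opposite effects can cancel in a burden-type statistic but not in a quadratic one.

```python
def score_statistics(genotype_rows, labels):
    """Per-variant score U_j = sum_i (y_i - mean(y)) * g_ij.

    genotype_rows: one list of minor-allele counts (0/1/2) per individual.
    labels: case status per individual (1 = case, 0 = control).
    """
    mean_y = sum(labels) / len(labels)
    n_variants = len(genotype_rows[0])
    return [
        sum((y - mean_y) * row[j] for row, y in zip(genotype_rows, labels))
        for j in range(n_variants)
    ]


def burden_statistic(scores):
    """Square of the summed scores: effects of opposite sign can cancel."""
    return sum(scores) ** 2


def quadratic_statistic(scores):
    """Sum of squared scores: insensitive to the direction of each effect."""
    return sum(u * u for u in scores)


# Toy data (assumed): variant 0 occurs only in cases, variant 1 only in controls.
rows = [[1, 0], [1, 0], [0, 1], [0, 1]]
status = [1, 1, 0, 0]
scores = score_statistics(rows, status)
print(scores, burden_statistic(scores), quadratic_statistic(scores))
```

In this toy example the two per-variant scores are equal in size and opposite in sign, so the burden-type statistic is zero while the quadratic statistic is not. Real tests standardise the scores by their variance and usually apply frequency-based weights.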
Affiliation(s)
- Qiuying Sha
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, United States of America
|
15
|
Courtenay MD, Cade W, Schwartz SG, Kovach JL, Agarwal A, Wang G, Haines JL, Pericak-Vance MA, Scott WK. Set-based joint test of interaction between SNPs in the VEGF pathway and exogenous estrogen finds association with age-related macular degeneration. Invest Ophthalmol Vis Sci 2014; 55:IOVS-14-14494. [PMID: 25015356 PMCID: PMC4126792 DOI: 10.1167/iovs.14-14494] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 04/01/2014] [Accepted: 06/27/2014] [Indexed: 11/24/2022] Open
Abstract
Purpose: Age-Related Macular Degeneration (AMD) is the leading cause of irreversible visual loss in developed countries. Its etiology includes genetic and environmental factors. Although VEGFA variants are associated with AMD, the joint action of variants within the VEGF pathway and their interaction with non-genetic factors has not been investigated. Methods: Affymetrix 6.0 chipsets were used to genotype 668,238 SNPs in 1,207 AMD cases and 686 controls. Environmental exposures were collected by questionnaire. A set-based test was conducted using the chi-square statistic at each SNP derived from Kraft's 2df joint test. Pathway and gene-based test statistics were calculated as the mean of all independent SNP statistics. Phenotype labels were permuted 10,000 times to generate an empirical p-value. Results: While a main effect of the VEGF pathway was not identified, the pathway was associated with neovascular AMD in women when accounting for birth control pill (BCP) use (P = 0.017). Analysis of VEGF's subpathways found that SNPs in the Proliferation subpathway were associated with neovascular AMD (P = 0.029) when accounting for BCP use. Nominally significant genes within this subpathway were also observed. Stratification by BCP use revealed novel significant genetic effects in women who had taken BCPs. Conclusions: These results illustrate that some AMD genetic risk factors may only be revealed when considering complex relationships among risk factors. This shows the utility of exploring pathways of previously associated genes to find novel effects. It also demonstrates the importance of incorporating environmental exposures in tests of genetic association at the SNP, gene, or pathway level.
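The set-based procedure in the Methods (a per-SNP statistic, averaged over the set, with an empirical p-value from permuted phenotype labels) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-SNP statistic here is a plain 1-df trend chi-square (N times the squared genotype-phenotype correlation) swapped in for Kraft's 2df joint test, and all function names are hypothetical.

```python
import random


def trend_chi2(genotypes, labels):
    """1-df trend statistic: N * r^2 between genotype (0/1/2) and status (0/1)."""
    n = len(labels)
    mean_g = sum(genotypes) / n
    mean_y = sum(labels) / n
    cov = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, labels))
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    if var_g == 0 or var_y == 0:
        return 0.0  # monomorphic SNP or a single phenotype class
    return n * cov * cov / (var_g * var_y)


def set_statistic(snp_matrix, labels):
    """Mean of the per-SNP statistics; snp_matrix holds one genotype list per SNP."""
    return sum(trend_chi2(snp, labels) for snp in snp_matrix) / len(snp_matrix)


def empirical_p(snp_matrix, labels, n_perm=10000, seed=0):
    """Permute phenotype labels and count set statistics at least as large as observed."""
    rng = random.Random(seed)
    observed = set_statistic(snp_matrix, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if set_statistic(snp_matrix, shuffled) >= observed:
            hits += 1
    # Adding 1 to numerator and denominator keeps the estimate away from zero.
    return (hits + 1) / (n_perm + 1)
```

Permuting the labels while leaving the genotype matrix intact preserves the correlation among SNPs in the set, which is why a simple mean of per-SNP statistics can still be calibrated empirically.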
Affiliation(s)
- Monique D Courtenay
- Human Genetics and Genomics, University of Miami Miller School of Medicine, 1501 NW 10th Ave, Miami, FL, 33136, United States
- William Cade
- John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, 1501 NW 10 Ave, BRB-314 (M860), Miami, Florida, 33136, United States
- Stephen G Schwartz
- Ophthalmology, Bascom Palmer Eye Institute, Retina Center of Naples, 311 9th Street North, Naples, Florida, 34102, United States of America
- Jaclyn L Kovach
- Bascom Palmer Eye Institute, University of Miami Miller School of Medicine, 311 9th St N, Naples, FL, 34102, United States of America
- Anita Agarwal
- VEI, Vanderbilt University, 2311 Pierce Avenue, Nashville, Tennessee, 37232-8808, United States of America
- Gaofeng Wang
- Human Genetics, University of Miami Miller School of Medicine, 1501 NW 10th Avenue; BRB 525, Miami, Florida, 33136, United States
- Jonathan L Haines
- Department of Epidemiology & Biostatistics, Case Western Reserve University, 2-529 Wolstein Research Building, 2103 Cornell Road, Cleveland, Ohio, 44106, United States
- Margaret A Pericak-Vance
- John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, 1501 NW 10th Avenue, BRB-314 (M860), Miami, Florida, 33136, United States of America
- William K Scott
- Dr. John T. Macdonald Foundation Department of Human Genetics, John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, 1501 NW 10 Ave., Biomedical Research Building (BRB) # 414, Miami, Florida, 33136, United States
|
16
|
Lohmueller KE. The impact of population demography and selection on the genetic architecture of complex traits. PLoS Genet 2014; 10:e1004379. [PMID: 24875776 PMCID: PMC4038606 DOI: 10.1371/journal.pgen.1004379] [Citation(s) in RCA: 98] [Impact Index Per Article: 8.9] [Received: 06/21/2013] [Accepted: 03/28/2014] [Indexed: 02/06/2023] Open
Abstract
Population genetic studies have found evidence for dramatic population growth in recent human history. It is unclear how this recent population growth, combined with the effects of negative natural selection, has affected patterns of deleterious variation, as well as the number, frequency, and effect sizes of mutations that contribute risk to complex traits. Because researchers are performing exome sequencing studies aimed at uncovering the role of low-frequency variants in the risk of complex traits, this topic is of critical importance. Here I use simulations under population genetic models where a proportion of the heritability of the trait is accounted for by mutations in a subset of the exome. I show that recent population growth increases the proportion of nonsynonymous variants segregating in the population, but does not affect the genetic load relative to a population that did not expand. Under a model where a mutation's effect on a trait is correlated with its effect on fitness, rare variants explain a greater portion of the additive genetic variance of the trait in a population that has recently expanded than in a population that did not recently expand. Further, when using a single-marker test, for a given false-positive rate and sample size, recent population growth decreases the expected number of significant associations with the trait relative to the number detected in a population that did not expand. However, in a model where there is no correlation between a mutation's effect on fitness and the effect on the trait, common variants account for much of the additive genetic variance, regardless of demography. Moreover, here demography does not affect the number of significant associations detected. These findings suggest recent population history may be an important factor influencing the power of association tests and in accounting for the missing heritability of certain complex traits.
Affiliation(s)
- Kirk E Lohmueller
- Department of Ecology and Evolutionary Biology, Interdepartmental Program in Bioinformatics, University of California, Los Angeles, California, United States of America
|
17
|
Kinnamon DD, Martin ER. Valid Monte Carlo permutation tests for genetic case-control studies with missing genotypes. Genet Epidemiol 2014; 38:325-44. [PMID: 24723341 PMCID: PMC6391735 DOI: 10.1002/gepi.21805] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/22/2013] [Revised: 12/30/2013] [Accepted: 02/28/2014] [Indexed: 02/04/2023]
Abstract
Monte Carlo permutation tests can be formally constructed by choosing a set of permutations of individual indices and a real-valued test statistic measuring the association between genotypes and affection status. In this paper, we develop a rigorous theoretical framework for verifying the validity of these tests when there are missing genotypes. We begin by specifying a nonparametric probability model for the observed genotype data in a genetic case-control study with unrelated subjects. Under this model and some minimal assumptions about the test statistic, we establish that the resulting Monte Carlo permutation test is exact level α if (1) the chosen set of permutations of individual indices is a group under composition and (2) the distribution of the observed genotype score matrix under the null hypothesis does not change if the assignment of individuals to rows is shuffled according to an arbitrary permutation in this set. We apply these conditions to show that frequently used Monte Carlo permutation tests based on the set of all permutations of individual indices are guaranteed to be exact level α only for missing data processes satisfying a rather restrictive additional assumption. However, if the missing data process depends on covariates that are all identified and recorded, we also show that Monte Carlo permutation tests based on the set of permutations within strata of individuals with identical covariate values are exact level α. Our theoretical results are verified and supplemented by simulations for a variety of missing data processes and test statistics.
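The final result above, permuting only within strata of individuals who share identical covariate values, can be illustrated with a short sketch. It assumes a single recorded, discrete covariate and encodes missing genotypes as `None`; the function names and the toy test statistic are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def permute_within_strata(labels, strata, rng):
    """Return a copy of labels shuffled only among individuals in the same stratum."""
    by_stratum = defaultdict(list)
    for idx, stratum in enumerate(strata):
        by_stratum[stratum].append(idx)
    permuted = list(labels)
    for indices in by_stratum.values():
        values = [labels[i] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            permuted[i] = value
    return permuted


def mean_genotype_difference(genotypes, labels):
    """Toy statistic: |mean genotype in cases - mean in controls|, skipping None."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1 and g is not None]
    controls = [g for g, y in zip(genotypes, labels) if y == 0 and g is not None]
    if not cases or not controls:
        return 0.0
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def stratified_permutation_p(genotypes, labels, strata, n_perm=999, seed=0):
    """Monte Carlo p-value using label permutations restricted to each stratum."""
    rng = random.Random(seed)
    observed = mean_genotype_difference(genotypes, labels)
    hits = sum(
        mean_genotype_difference(
            genotypes, permute_within_strata(labels, strata, rng)
        ) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```

Shuffling labels within a stratum is equivalent to shuffling genotype rows within that stratum, and it keeps the number of cases and controls in each stratum fixed across permutations.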
Affiliation(s)
- Daniel D. Kinnamon
- Division of Human Genetics, Department of Internal Medicine, The Ohio State University Wexner Medical Center, Columbus, OH, USA
- Dr. John T. Macdonald Foundation Department of Human Genetics, University of Miami Miller School of Medicine, Miami, FL, USA
- Eden R. Martin
- Dr. John T. Macdonald Foundation Department of Human Genetics, University of Miami Miller School of Medicine, Miami, FL, USA
|
18
|
Yang T, Deng HW, Niu T. Critical assessment of coalescent simulators in modeling recombination hotspots in genomic sequences. BMC Bioinformatics 2014; 15:3. [PMID: 24387001 PMCID: PMC3890628 DOI: 10.1186/1471-2105-15-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 09/09/2013] [Accepted: 12/30/2013] [Indexed: 12/04/2022] Open
Abstract
Background Coalescent simulation is pivotal for understanding population evolutionary models and demographic histories, as well as for developing novel analytical methods for genetic association studies for DNA sequence data. A plethora of coalescent simulators have been developed, but selecting the most appropriate program remains challenging. Results We extensively compared the performance of five widely used coalescent simulators (Hudson’s ms, msHOT, MaCS, Simcoal2, and fastsimcoal) to provide a practical guide considering three crucial factors: 1) speed, 2) scalability and 3) recombination hotspot position and intensity accuracy. Although ms represents a popular standard coalescent simulator, it lacks the ability to simulate sequences with recombination hotspots. An extended program, msHOT, has compensated for the deficiency of ms by incorporating recombination hotspots and gene conversion events at arbitrarily chosen locations and intensities, but remains limited in simulating long stretches of DNA sequences. Simcoal2, based on a discrete generation-by-generation approach, could simulate more complex demographic scenarios, but runs comparatively slowly. MaCS and fastsimcoal, both built on fast, modified sequential Markov coalescent algorithms to approximate the standard coalescent, are much more efficient whilst keeping salient features of msHOT and Simcoal2, respectively. Our simulations demonstrate that they are more advantageous than other programs for a spectrum of evolutionary models. To validate recombination hotspots, LDhat 2.2 rhomap package, sequenceLDhot and Haploview were compared for hotspot detection, and sequenceLDhot exhibited the best performance based on both real and simulated data. Conclusions While ms remains an excellent choice for general coalescent simulations of DNA sequences, MaCS and fastsimcoal are much more scalable and flexible in simulating a variety of demographic events under different recombination hotspot models. Furthermore, sequenceLDhot appears to give the best performance in detecting and validating cross-over hotspots.
Affiliation(s)
- Tianhua Niu
- Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, Tulane University School of Public Health and Tropical Medicine, 1440 Canal Street, Suite 2001, New Orleans, LA 70112, USA.
|
19
|
|
20
|
Chung RH, Shih CC. SeqSIMLA: a sequence and phenotype simulation tool for complex disease studies. BMC Bioinformatics 2013; 14:199. [PMID: 23782512 PMCID: PMC3693898 DOI: 10.1186/1471-2105-14-199] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 03/29/2013] [Accepted: 06/14/2013] [Indexed: 11/22/2022] Open
Abstract
Background Association studies based on next-generation sequencing (NGS) technology have become popular, and statistical association tests for NGS data have been developed rapidly. A flexible tool for simulating sequence data in either unrelated case–control or family samples with different disease and quantitative trait models would be useful for evaluating the statistical power for planning a study design and for comparing power among statistical methods based on NGS data. Results We developed a simulation tool, SeqSIMLA, which can simulate sequence data with user-specified disease and quantitative trait models. We implemented two disease models, in which the user can flexibly specify the number of disease loci, effect sizes or population attributable risk, disease prevalence, and risk or protective loci. We also implemented a quantitative trait model, in which the user can specify the number of quantitative trait loci (QTL), proportions of variance explained by the QTL, and genetic models. We compiled recombination rates from the HapMap project so that genomic structures similar to the real data can be simulated. Conclusions SeqSIMLA can efficiently simulate sequence data with disease or quantitative trait models specified by the user. SeqSIMLA will be very useful for evaluating statistical properties for new study designs and new statistical methods using NGS. SeqSIMLA can be downloaded for free at http://seqsimla.sourceforge.net.
Affiliation(s)
- Ren-Hua Chung
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan, Miaoli, Taiwan.
|
21
|
Long N, Dickson SP, Maia JM, Kim HS, Zhu Q, Allen AS. Leveraging prior information to detect causal variants via multi-variant regression. PLoS Comput Biol 2013; 9:e1003093. [PMID: 23762022 PMCID: PMC3675126 DOI: 10.1371/journal.pcbi.1003093] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 10/23/2012] [Accepted: 04/29/2013] [Indexed: 01/03/2023] Open
Abstract
Although many methods are available to test sequence variants for association with complex diseases and traits, methods that specifically seek to identify causal variants are less developed. Here we develop and evaluate a Bayesian hierarchical regression method that incorporates prior information on the likelihood of variant causality through weighting of variant effects. By simulation studies using both simulated and real sequence variants, we compared a standard single variant test for analyzing variant-disease association with the proposed method using different weighting schemes. We found that by leveraging linkage disequilibrium of variants with known GWAS signals and sequence conservation (phastCons), the proposed method provides a powerful approach for detecting causal variants while controlling false positives. The decline in DNA sequencing cost permits the interrogation of potentially all variants across the entire allele frequency spectrum for their associations with complex human diseases and traits. However, the identification of causal variants remains challenging. Existing single variant tests do not distinguish between causal association and association induced by linkage disequilibrium and tend to be underpowered for rare or low-frequency variants, whereas variant grouping methods do not identify individual causal variants. We propose a novel Bayesian hierarchical regression approach that estimates effects of multiple variants on a disease trait simultaneously and incorporates prior information on the likelihood of causality. By simulation, we show that by combining linkage disequilibrium with known genome wide association signals and functional conservation, the proposed method, the first of its kind, is powerful to correctly detect causal variants.
Affiliation(s)
- Nanye Long
- Center for Human Genome Variation, Duke University School of Medicine, Durham, North Carolina, United States of America.
|
22
|
Valsesia A, Macé A, Jacquemont S, Beckmann JS, Kutalik Z. The Growing Importance of CNVs: New Insights for Detection and Clinical Interpretation. Front Genet 2013; 4:92. [PMID: 23750167 PMCID: PMC3667386 DOI: 10.3389/fgene.2013.00092] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.6] [Received: 02/10/2013] [Accepted: 05/04/2013] [Indexed: 02/03/2023] Open
Abstract
Differences between genomes can be due to single nucleotide variants, translocations, inversions, and copy number variants (CNVs, gain or loss of DNA). The latter can range from sub-microscopic events to complete chromosomal aneuploidies. Small CNVs are often benign but those larger than 500 kb are strongly associated with morbid consequences such as developmental disorders and cancer. Detecting CNVs within and between populations is essential to better understand the plasticity of our genome and to elucidate its possible contribution to disease. Hence there is a need for better-tailored and more robust tools for the detection and genome-wide analyses of CNVs. While a link between a given CNV and a disease may have often been established, the relative CNV contribution to disease progression and impact on drug response is not necessarily understood. In this review we discuss the progress, challenges, and limitations that occur at different stages of CNV analysis from the detection (using DNA microarrays and next-generation sequencing) and identification of recurrent CNVs to the association with phenotypes. We emphasize the importance of germline CNVs and propose strategies to aid clinicians to better interpret structural variations and assess their clinical implications.
Affiliation(s)
- Armand Valsesia
- Genetics Core, Nestlé Institute of Health Sciences Lausanne, Switzerland
|
23
|
Kim W, Londono D, Zhou L, Xing J, Nato AQ, Musolf A, Matise TC, Finch SJ, Gordon D. Single-variant and multi-variant trend tests for genetic association with next-generation sequencing that are robust to sequencing error. Hum Hered 2013; 74:172-83. [PMID: 23594495 DOI: 10.1159/000346824] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022] Open
Abstract
As with any new technology, next-generation sequencing (NGS) has potential advantages and potential challenges. One advantage is the identification of multiple causal variants for disease that might otherwise be missed by SNP-chip technology. Potential challenges are misclassification error (as with any emerging technology) and power loss due to multiple testing. Here, we develop an extension of the linear trend test for association that incorporates differential misclassification error and may be applied to any number of SNPs. We call the statistic the linear trend test allowing for error, applied to NGS, or LTTae,NGS. This statistic allows for differential misclassification. The observed data are phenotypes for unrelated cases and controls, coverage, and the number of putative causal variants for every individual at all SNPs. We simulate data considering multiple factors (disease mode of inheritance, genotype relative risk, causal variant frequency, sequence error rate in cases, sequence error rate in controls, number of loci, and others) and evaluate type I error rate and power for each vector of factor settings. We compare our results with two recently published NGS statistics. Also, we create a fictitious disease model based on downloaded 1000 Genomes data for 5 SNPs and 388 individuals, and apply our statistic to those data. We find that the LTTae,NGS maintains the correct type I error rate in all simulations (differential and non-differential error), while the other statistics show large inflation in type I error for lower coverage. Power is approximately the same for all three statistics in the presence of non-differential error. Application of our statistic to the 1000 Genomes data suggests that, for the data downloaded, there is a 1.5% sequence misclassification rate over all SNPs. Finally, application of the multi-variant form of LTTae,NGS shows high power for a number of simulation settings, although it can have lower power than the corresponding single-variant simulation results, most probably due to our specification of multi-variant SNP correlation values. In conclusion, our LTTae,NGS addresses two key challenges with NGS disease studies: first, it allows for differential misclassification when computing the statistic; and second, it addresses the multiple-testing issue in that there is a multi-variant form of the statistic that has only one degree of freedom and provides a single p value, no matter how many loci.
Affiliation(s)
- Wonkuk Kim
- Department of Mathematics and Statistics, University of South Florida, Tampa, FL, USA
|
24
|
Abstract
In genomic research, phenotype transformations are commonly used as a straightforward way to reach normality of the model outcome. Many researchers still believe this to be necessary for proper inference. Using regression simulations, we show that phenotype transformations are typically not needed and, when applied to phenotypes with heteroscedasticity, result in inflated Type I error rates. We further explain that it is important to address the combination of rare variant genotypes and heteroscedasticity. Incorrectly estimated parameter variability or an incorrect choice of the distribution of the underlying test statistic leads to spurious detection of associations. We conclude that it is a combination of heteroscedasticity, minor allele frequency, sample size, and to a much lesser extent the error distribution, that matters for proper statistical inference.
Affiliation(s)
- Petra Bůžková
- Department of Biostatistics, University of Washington, Seattle, WA, USA.
|
25
|
Nuytemans K, Bademci G, Inchausti V, Dressen A, Kinnamon DD, Mehta A, Wang L, Züchner S, Beecham GW, Martin ER, Scott WK, Vance JM. Whole exome sequencing of rare variants in EIF4G1 and VPS35 in Parkinson disease. Neurology 2013; 80:982-9. [PMID: 23408866 DOI: 10.1212/wnl.0b013e31828727d4] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Indexed: 11/15/2022] Open
Abstract
OBJECTIVE Recently, vacuolar protein sorting 35 (VPS35) and eukaryotic translation initiation factor 4 gamma 1 (EIF4G1) have been identified as 2 causal Parkinson disease (PD) genes. We used whole exome sequencing for rapid, parallel analysis of variations in these 2 genes. METHODS We performed whole exome sequencing in 213 patients with PD and 272 control individuals. Those rare variants (RVs) with <5% frequency in the exome variant server database and our own control data were considered for analysis. We performed joint gene-based tests for association using RVASSOC and SKAT (Sequence Kernel Association Test) as well as single-variant test statistics. RESULTS We identified 3 novel VPS35 variations that changed the coded amino acid (nonsynonymous) in 3 cases. Two variations were in multiplex families and neither segregated with PD. In EIF4G1, we identified 11 (9 nonsynonymous and 2 small indels) RVs including the reported pathogenic mutation p.R1205H, which segregated in all affected members of a large family, but also in 1 unaffected 86-year-old family member. Two additional RVs were found in isolated patients only. Whereas initial association studies suggested an association (p = 0.04) with all RVs in EIF4G1, subsequent testing in a second dataset for the driving variant (p.F1461) suggested no association between RVs in the gene and PD. CONCLUSIONS We confirm that the specific EIF4G1 variation p.R1205H seems to be a strong PD risk factor, but is nonpenetrant in at least one 86-year-old. A few other select RVs in both genes could not be ruled out as causal. However, there was no evidence for an overall contribution of genetic variability in VPS35 or EIF4G1 to PD development in our dataset.
Affiliation(s)
- Karen Nuytemans
- John P. Hussman Institute for Human Genomics, Miller School of Medicine, University of Miami, FL, USA
|
26
|
Chen YC, Carter H, Parla J, Kramer M, Goes FS, Pirooznia M, Zandi PP, McCombie WR, Potash JB, Karchin R. A hybrid likelihood model for sequence-based disease association studies. PLoS Genet 2013; 9:e1003224. [PMID: 23358228 PMCID: PMC3554549 DOI: 10.1371/journal.pgen.1003224] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 07/11/2012] [Accepted: 11/21/2012] [Indexed: 11/18/2022] Open
Abstract
In the past few years, case-control studies of common diseases have shifted their focus from single genes to whole exomes. New sequencing technologies now routinely detect hundreds of thousands of sequence variants in a single study, many of which are rare or even novel. The limitation of classical single-marker association analysis for rare variants has been a challenge in such studies. A new generation of statistical methods for case-control association studies has been developed to meet this challenge. A common approach to association analysis of rare variants is to use burden-style collapsing methods, which combine rare variant data within individuals across or within genes. Here, we propose a new hybrid likelihood model that combines a burden test with a test of the position distribution of variants. In extensive simulations and on empirical data from the Dallas Heart Study, the new model demonstrates consistently good power, in particular when applied to a gene set (e.g., multiple candidate genes with shared biological function or pathway), when rare variants cluster in key functional regions of a gene, and when protective variants are present. When applied to data from an ongoing sequencing study of bipolar disorder (191 cases, 107 controls), the model identifies seven gene sets with nominal p-values below 0.05, of which one MAPK signaling pathway (KEGG) reaches trend-level significance after correcting for multiple testing. Inexpensive, high-throughput sequencing has transformed the field of case-control association studies. For the first time, it may be possible to identify the genetic underpinnings of complex diseases, by sequencing the DNA of hundreds (even thousands) of cases and controls and comparing patterns of DNA sequence variation. However, complex diseases are likely to be caused by many variants, some of which are very rare. Taken one at a time, the association between variant and disease phenotype may not be detectable by current statistical methods. One strategy is to identify regions where important variants occur by “collapsing” variants into groups. Here, we present a new collapsing approach, capable of detecting subtle genetic differences between cases and controls. We show, in extensive simulations and using a benchmark set of genes involved in human triglyceride levels, that the approach is potentially more powerful than existing methods. We apply the new method to an ongoing sequencing study of bipolar cases and controls and identify a set of genes found in neuronal synapses, which may be implicated in bipolar disorder.
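The burden-style "collapsing" idea mentioned in this abstract can be sketched in a few lines. This is a generic carrier-indicator collapse followed by a one-sided Fisher exact test, not the hybrid likelihood model proposed by the authors; the MAF threshold, the function names, and the data layout are assumptions made for illustration.

```python
from math import comb


def collapse_carriers(genotype_rows, mafs, maf_threshold=0.01):
    """1 if an individual carries a minor allele at any variant with MAF below the threshold."""
    rare = [j for j, maf in enumerate(mafs) if maf < maf_threshold]
    return [int(any(row[j] > 0 for j in rare)) for row in genotype_rows]


def fisher_upper_tail(a, b, c, d):
    """One-sided Fisher exact p-value, P(X >= a), for the 2x2 table
    [[a, b], [c, d]] = [[case carriers, case non-carriers],
                        [control carriers, control non-carriers]]."""
    n_cases = a + b
    n_carriers = a + c
    total = a + b + c + d
    upper = min(n_cases, n_carriers)
    favourable = sum(
        comb(n_cases, k) * comb(total - n_cases, n_carriers - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(total, n_carriers)


def burden_test(genotype_rows, labels, mafs, maf_threshold=0.01):
    """Collapse rare variants per individual, then test for carrier enrichment in cases."""
    carriers = collapse_carriers(genotype_rows, mafs, maf_threshold)
    a = sum(1 for c, y in zip(carriers, labels) if c == 1 and y == 1)
    b = sum(1 for c, y in zip(carriers, labels) if c == 0 and y == 1)
    c_ctrl = sum(1 for c, y in zip(carriers, labels) if c == 1 and y == 0)
    d = sum(1 for c, y in zip(carriers, labels) if c == 0 and y == 0)
    return fisher_upper_tail(a, b, c_ctrl, d)
```

A one-sided carrier test of this kind loses power when protective variants are present, which is one of the settings in which the abstract reports good power for the hybrid model.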
Affiliation(s)
- Yun-Ching Chen
- Department of Biomedical Engineering and Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America
- Hannah Carter
- Department of Biomedical Engineering and Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America
- Jennifer Parla
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Melissa Kramer
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Fernando S. Goes
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, Maryland, United States of America
- Mehdi Pirooznia
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, Maryland, United States of America
- Peter P. Zandi
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, Maryland, United States of America
- W. Richard McCombie
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- James B. Potash
- Department of Psychiatry, University of Iowa, Iowa City, Iowa, United States of America
- Rachel Karchin
- Department of Biomedical Engineering and Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America
|
27
|
Hadjixenofontos A, Schmidt MA, Whitehead PL, Konidari I, Hedges DJ, Wright HH, Abramson RK, Menon R, Williams SM, Cuccaro ML, Haines JL, Gilbert JR, Pericak-Vance MA, Martin ER, McCauley JL. Evaluating mitochondrial DNA variation in autism spectrum disorders. Ann Hum Genet 2012; 77:9-21. [PMID: 23130936 DOI: 10.1111/j.1469-1809.2012.00736.x] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2012] [Accepted: 09/07/2012] [Indexed: 11/28/2022]
Abstract
Despite increasing speculation that oxidative stress and abnormal energy metabolism may play a role in Autism Spectrum Disorders (ASD), and the observation that patients with mitochondrial defects have symptoms consistent with ASD, there are no comprehensive published studies examining the role of mitochondrial variation in autism. We therefore examined the role of mitochondrial DNA (mtDNA) variation in ASD risk using a multi-phase approach. In phase 1, we examined 132 mtDNA single-nucleotide polymorphisms (SNPs) genotyped as part of our genome-wide association studies of ASD. In phase 2, we genotyped the major European mitochondrial haplogroup-defining variants within an expanded set of autism probands and controls. Finally, in phase 3, we resequenced the entire mtDNA in a subset of our Caucasian samples (∼400 proband-father pairs). In each phase we tested whether mitochondrial variation showed evidence of association with ASD. Despite a thorough interrogation of mtDNA variation, we found no evidence to suggest a major role for mtDNA variation in ASD susceptibility. Accordingly, while there may be attractive biological hints suggesting a role for mitochondria in ASD, our data indicate that mtDNA variation is not a major contributing factor to the development of ASD.
Affiliation(s)
- Athena Hadjixenofontos
- John P. Hussman Institute for Human Genomics, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
|
28
|
Abstract
This unit provides an overview of the design and analysis of population-based case-control studies of genetic risk factors for complex disease. Considerations specific to genetic studies are emphasized. The unit reviews basic study designs differentiating case-control studies from others, presents different genetic association strategies (candidate gene, genome-wide association, and high-throughput sequencing), introduces basic methods of statistical analysis for case-control data and approaches to combining case-control studies, and discusses measures of association and impact. Admixed populations, controlling for confounding (including population stratification), consideration of multiple loci and environmental risk factors, and complementary analyses of haplotypes, genes, and pathways are briefly discussed. Readers are referred to basic texts on epidemiology for more details on general conduct of case-control studies.
Affiliation(s)
- Dana B Hancock
- Research Triangle Institute International, Research Triangle Park, North Carolina, USA
|
29
|
Epstein M, Duncan R, Jiang Y, Conneely K, Allen A, Satten G. A permutation procedure to correct for confounders in case-control studies, including tests of rare variation. Am J Hum Genet 2012; 91:215-23. [PMID: 22818855 DOI: 10.1016/j.ajhg.2012.06.004] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2012] [Revised: 05/03/2012] [Accepted: 06/05/2012] [Indexed: 01/30/2023] Open
Abstract
Many case-control tests of rare variation are implemented in statistical frameworks that make correction for confounders like population stratification difficult. Simple permutation of disease status is unacceptable for resolving this issue because the replicate data sets do not have the same confounding as the original data set. These limitations make it difficult to apply rare-variant tests to samples in which confounding most likely exists, e.g., samples collected from admixed populations. To enable the use of such rare-variant methods in structured samples, as well as to facilitate permutation tests for any situation in which case-control tests require adjustment for confounding covariates, we propose to establish the significance of a rare-variant test via a modified permutation procedure. Our procedure uses Fisher's noncentral hypergeometric distribution to generate permuted data sets with the same structure present in the actual data set such that inference is valid in the presence of confounding factors. We use simulated sequence data based on coalescent models to show that our permutation strategy corrects for confounding due to population stratification that, if ignored, would otherwise inflate the size of a rare-variant test. We further illustrate the approach by using sequence data from the Dallas Heart Study of energy metabolism traits. Researchers can implement our permutation approach by using the R package BiasedUrn.
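One way to picture the modified permutation step is the sketch below. It assumes that each subject already has a fitted probability of disease given the confounders (for example from a logistic regression, which is not shown), and it draws permuted case-control labels with a fixed number of cases so that subjects with higher fitted odds are more often labelled cases. Independent Bernoulli draws conditioned on their total follow Fisher's noncentral hypergeometric distribution with one subject per category; the rejection sampler used here is a simple stand-in chosen for clarity, not the algorithm in the BiasedUrn package, and `permute_status` and the toy probabilities are hypothetical.

```python
import random


def permute_status(disease_prob, n_cases, rng, max_tries=100_000):
    """Draw one permuted case/control vector that preserves confounding.

    disease_prob: assumed fitted P(case | confounders) for each subject.
    n_cases: number of cases in the real data, held fixed in every replicate.
    Uses rejection sampling: draw independent Bernoulli labels and keep the
    first draw whose total equals n_cases.
    """
    for _ in range(max_tries):
        draw = [1 if rng.random() < p else 0 for p in disease_prob]
        if sum(draw) == n_cases:
            return draw
    raise RuntimeError("no draw matched n_cases; check the fitted probabilities")


# Toy example (hypothetical): two subpopulations with different disease risk.
rng = random.Random(1)
probs = [0.9] * 5 + [0.1] * 5
replicates = [permute_status(probs, 5, rng) for _ in range(200)]
high_risk_cases = sum(sum(r[:5]) for r in replicates)
low_risk_cases = sum(sum(r[5:]) for r in replicates)
```

A rare-variant test statistic would then be recomputed on each replicate to build its null distribution. Rejection sampling becomes slow when the target number of cases is unlikely under the fitted probabilities, which is one reason to prefer a dedicated sampler in practice.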
|