1
|
Boutry S, Helaers R, Lenaerts T, Vikkula M. Rare variant association on unrelated individuals in case-control studies using aggregation tests: existing methods and current limitations. Brief Bioinform 2023; 24:bbad412. [PMID: 37974506 DOI: 10.1093/bib/bbad412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Revised: 10/14/2023] [Accepted: 10/28/2023] [Indexed: 11/19/2023] Open
Abstract
Over the past years, progress in next-generation sequencing technologies and bioinformatics has sparked a surge in association studies. In particular, genome-wide association studies (GWASs) have demonstrated their effectiveness in identifying disease associations with common genetic variants. Yet rare variants can contribute to additional disease risk or trait heterogeneity. Because GWASs are underpowered for detecting associations with such variants, numerous statistical methods have recently been proposed. Aggregation tests collapse multiple rare variants within a genetic region (e.g. gene, gene set, genomic locus) to test for association. An increasing number of studies using such methods have successfully identified trait-associated rare variants and led to a better understanding of the underlying disease mechanisms. In this review, we compare existing aggregation tests, their statistical features and scope of application, splitting them into the five classical classes: burden, adaptive burden, variance-component, omnibus and other. Finally, we describe some limitations of current aggregation tests, highlighting potential directions for further investigation.
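To make the idea of collapsing concrete, the following is a minimal sketch of the simplest kind of burden test, assuming a carrier/non-carrier collapse followed by a one-sided Fisher exact test. It is not a method taken from the review itself; the function names (`collapse_carriers`, `fisher_one_sided`, `collapsing_burden_test`) and the toy genotype matrix are hypothetical and chosen only for illustration.

```python
from math import comb


def collapse_carriers(genotypes):
    """Collapse a region: 1 if an individual carries any rare allele, else 0."""
    return [1 if any(count > 0 for count in row) else 0 for row in genotypes]


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for carrier enrichment among cases.

    2x2 table: a = case carriers, b = case non-carriers,
               c = control carriers, d = control non-carriers.
    """
    n_cases, n_controls, n_carriers = a + b, c + d, a + c
    total_tables = comb(n_cases + n_controls, n_carriers)
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    tail = sum(
        comb(n_cases, k) * comb(n_controls, n_carriers - k)
        for k in range(a, min(n_cases, n_carriers) + 1)
    )
    return tail / total_tables


def collapsing_burden_test(genotypes, is_case):
    """Return the 2x2 carrier table and its one-sided p-value."""
    carriers = collapse_carriers(genotypes)
    a = sum(1 for x, y in zip(carriers, is_case) if x and y)
    b = sum(1 for x, y in zip(carriers, is_case) if not x and y)
    c = sum(1 for x, y in zip(carriers, is_case) if x and not y)
    d = sum(1 for x, y in zip(carriers, is_case) if not x and not y)
    return (a, b, c, d), fisher_one_sided(a, b, c, d)


# Toy data (hypothetical): rows are individuals, columns are rare variants,
# entries are minor-allele counts.
toy_genotypes = [
    [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0],  # cases
    [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0],  # controls
]
toy_is_case = [True] * 4 + [False] * 4
toy_table, toy_p = collapsing_burden_test(toy_genotypes, toy_is_case)
```

A collapse of this kind assumes that all rare variants in the region act in the same direction, which is the situation in which burden-type tests are usually expected to do well; the other classes listed in the abstract relax that assumption in different ways.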
Affiliation(s)
- Simon Boutry
- Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, 1050 Brussels, Belgium
| | - Raphaël Helaers
- Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
| | - Tom Lenaerts
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, 1050 Brussels, Belgium
- Machine Learning Group, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Artificial Intelligence laboratory, Vrije Universiteit Brussel, 1050 Brussels, Belgium
| | - Miikka Vikkula
- Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
- WELBIO department, WEL Research Institute, avenue Pasteur, 6, 1300 Wavre, Belgium
| |
|
2
|
Boutry S, Helaers R, Lenaerts T, Vikkula M. Excalibur: A new ensemble method based on an optimal combination of aggregation tests for rare-variant association testing for sequencing data. PLoS Comput Biol 2023; 19:e1011488. [PMID: 37708232 PMCID: PMC10522036 DOI: 10.1371/journal.pcbi.1011488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Revised: 09/26/2023] [Accepted: 09/04/2023] [Indexed: 09/16/2023] Open
Abstract
The development of high-throughput next-generation sequencing technologies and large-scale genetic association studies has produced numerous advances in the biostatistics field. Various aggregation tests, i.e. statistical methods that analyze associations of a trait with multiple markers within a genomic region, have produced a variety of novel discoveries. Notwithstanding their usefulness, no single test fits all needs, and each suffers from specific drawbacks. Selecting the right aggregation test, while considering an unknown underlying genetic model of the disease, remains an important challenge. Here we propose a new ensemble method, called Excalibur, based on an optimal combination of 36 aggregation tests, created after an in-depth study of the limitations of each test and their impact on the quality of the results. Our findings demonstrate the ability of our method to control type I error and illustrate that it offers the best average power across all scenarios. The proposed method allows for novel advances in whole-exome/genome sequencing association studies; it can handle a wide range of association models and provides researchers with an optimal aggregation analysis for the genetic regions of interest.
Affiliation(s)
- Simon Boutry
- Human Molecular Genetics, de Duve Institute, University of Louvain, Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, Brussels, Belgium
| | - Raphaël Helaers
- Human Molecular Genetics, de Duve Institute, University of Louvain, Brussels, Belgium
| | - Tom Lenaerts
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, Brussels, Belgium
- Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium
- Artificial Intelligence laboratory, Vrije Universiteit Brussel, Brussels, Belgium
| | - Miikka Vikkula
- Human Molecular Genetics, de Duve Institute, University of Louvain, Brussels, Belgium
- WELBIO department, WEL Research Institute, Wavre, Belgium
| |
|
3
|
Chen X, Zhang H, Liu M, Deng HW, Wu Z. Simultaneous detection of novel genes and SNPs by adaptive p-value combination. Front Genet 2022; 13:1009428. [DOI: 10.3389/fgene.2022.1009428] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2022] [Accepted: 11/03/2022] [Indexed: 11/18/2022] Open
Abstract
Combining SNP p-values from GWAS summary data is a promising strategy for detecting novel genetic factors. Existing statistical methods for the p-value-based SNP-set testing confront two challenges. First, the statistical power of different methods depends on unknown patterns of genetic effects that could drastically vary over different SNP sets. Second, they do not identify which SNPs primarily contribute to the global association of the whole set. We propose a new signal-adaptive analysis pipeline to address these challenges using the omnibus thresholding Fisher’s method (oTFisher). The oTFisher remains robustly powerful over various patterns of genetic effects. Its adaptive thresholding can be applied to estimate important SNPs contributing to the overall significance of the given SNP set. We develop efficient calculation algorithms to control the type I error rate, which accounts for the linkage disequilibrium among SNPs. Extensive simulations show that the oTFisher has robustly high power and provides a higher balanced accuracy in screening SNPs than the traditional Bonferroni and FDR procedures. We applied the oTFisher to study the genetic association of genes and haplotype blocks of the bone density-related traits using the summary data of the Genetic Factors for Osteoporosis Consortium. The oTFisher identified more novel and literature-reported genetic factors than existing p-value combination methods. Relevant computation has been implemented into the R package TFisher to support similar data analysis.
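As a rough illustration of thresholded p-value combination, the sketch below assumes the statistic takes the form W(τ1, τ2) = Σ −2·log(p_i/τ2) over the p-values with p_i ≤ τ1, which reduces to Fisher's method when τ1 = τ2 = 1. The null distribution is approximated by Monte Carlo under the simplifying assumption of independent p-values, whereas the abstract describes algorithms that account for linkage disequilibrium. The names `tfisher_stat` and `monte_carlo_p_value` are hypothetical and are not the API of the TFisher R package; the omnibus step over several thresholds is not shown.

```python
import math
import random


def tfisher_stat(p_values, tau1, tau2):
    """Thresholded Fisher statistic: sum of -2*log(p/tau2) over p <= tau1."""
    return sum(-2.0 * math.log(p / tau2) for p in p_values if p <= tau1)


def monte_carlo_p_value(p_values, tau1, tau2, n_sim=2000, seed=0):
    """Empirical p-value, ASSUMING independent uniform p-values under the null."""
    rng = random.Random(seed)
    observed = tfisher_stat(p_values, tau1, tau2)
    n = len(p_values)
    exceed = 0
    for _ in range(n_sim):
        # 1 - random() lies in (0, 1], which avoids log(0).
        null_p = [1.0 - rng.random() for _ in range(n)]
        if tfisher_stat(null_p, tau1, tau2) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_sim + 1)
```

With τ1 = τ2 = 0.05, for example, only SNPs with p ≤ 0.05 contribute, and each contribution is rescaled so that it is non-negative; the SNPs that pass the threshold are the natural candidates for the "important SNPs" the abstract refers to.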
|
4
|
Testing the equality of multivariate means when $p>n$ by combining the Hotelling and Simes tests. TEST-SPAIN 2022. [DOI: 10.1007/s11749-021-00781-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
5
|
Xiao J, Zhou Y, He S, Ren WL. An Efficient Score Test Integrated with Empirical Bayes for Genome-Wide Association Studies. Front Genet 2021; 12:742752. [PMID: 34659362 PMCID: PMC8517403 DOI: 10.3389/fgene.2021.742752] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/16/2021] [Accepted: 09/13/2021] [Indexed: 11/30/2022] Open
Abstract
Many methods used in multi-locus genome-wide association studies (GWAS) have been developed to improve statistical power. However, most existing multi-locus methods are no faster than single-locus methods. To address this concern, we propose a fast score test integrated with Empirical Bayes (ScoreEB) for multi-locus GWAS. First, a score test is conducted for each single nucleotide polymorphism (SNP) under a linear mixed model (LMM) framework, taking into account genetic relatedness and population structure. Then, all potentially associated SNPs are selected with a less stringent criterion. Finally, Empirical Bayes in a multi-locus model is performed on all selected SNPs to identify the true quantitative trait nucleotides (QTNs). ScoreEB adopts a strategy similar to that of the multi-locus random-SNP-effect mixed linear model (mrMLM) and fast multi-locus random-SNP-effect EMMA (FASTmrEMMA); the only difference is that we use the score test to select all the potentially associated markers. Monte Carlo simulation studies demonstrate that ScoreEB significantly improved computational efficiency compared with the popular methods mrMLM, FASTmrEMMA, iterative modified-sure independence screening EM-Bayesian lasso (ISIS EM-BLASSO), hybrid of restricted and penalized maximum likelihood (HRePML) and genome-wide efficient mixed model association (GEMMA). In addition, ScoreEB remained accurate in QTN effect estimation and effectively controlled the false positive rate. Subsequently, ScoreEB was applied to re-analyze quantitative traits in plants and animals. The results show that ScoreEB can not only detect previously reported genes but also mine new genes.
Affiliation(s)
- Jing Xiao
- Department of Epidemiology and Medical Statistics, School of Public Health, Nantong University, Nantong, China
| | - Yang Zhou
- Department of Epidemiology and Medical Statistics, School of Public Health, Nantong University, Nantong, China
| | - Shu He
- Department of Epidemiology and Medical Statistics, School of Public Health, Nantong University, Nantong, China
| | - Wen-Long Ren
- Department of Epidemiology and Medical Statistics, School of Public Health, Nantong University, Nantong, China
| |
|
6
|
PM2RA: A Framework for Detecting and Quantifying Relationship Alterations in Microbial Community. Genomics Proteomics & Bioinformatics 2021; 19:154-167. [PMID: 33581337 PMCID: PMC8498968 DOI: 10.1016/j.gpb.2020.07.005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/21/2020] [Revised: 06/28/2020] [Accepted: 08/09/2020] [Indexed: 11/21/2022]
Abstract
The dysbiosis of gut microbiota is associated with the pathogenesis of human diseases. However, observing shifts in the microbe abundance cannot fully reveal underlying perturbations. Examining the relationship alterations (RAs) in the microbiome between health and disease statuses provides additional hints about the pathogenesis of human diseases, but no methods were designed to detect and quantify the RAs between different conditions directly. Here, we present profile monitoring for microbial relationship alteration (PM2RA), an analysis framework to identify and quantify the microbial RAs. The performance of PM2RA was evaluated with synthetic data, and it showed higher specificity and sensitivity than the co-occurrence-based methods. Analyses of real microbial datasets showed that PM2RA was robust for quantifying microbial RAs across different datasets in several diseases. By applying PM2RA, we identified several novel or previously reported microbes implicated in multiple diseases. PM2RA is now implemented as a web-based application available at http://www.pm2ra-xingyinliulab.cn/.
|
7
|
Xue Y, Ding J, Wang J, Zhang S, Pan D. Two-phase SSU and SKAT in genetic association studies. J Genet 2020. [DOI: 10.1007/s12041-019-1166-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
|
8
|
Statistical Method Based on Bayes-Type Empirical Score Test for Assessing Genetic Association with Multilocus Genotype Data. Int J Genomics 2020; 2020:4708152. [PMID: 32455126 PMCID: PMC7229558 DOI: 10.1155/2020/4708152] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/11/2019] [Accepted: 04/21/2020] [Indexed: 12/20/2022] Open
Abstract
Simultaneous testing of multiple genetic variants for association is widely recognized as a valuable complementary approach to single-marker tests. As such, principal component regression (PCR) has been found to have competitive power. We focus on exploring a robust test for an unknown genetic mode of all SNPs, an unknown Hardy-Weinberg equilibrium (HWE) in a population, and a large number of all SNPs. First, we propose a new global test by means of the use of codominant codes for all markers and PCR. The new global test is built on an empirical Bayes-type score statistic for testing marginal associations with each single marker. The new global test gains power by robustly exploiting the Hardy-Weinberg equilibrium in the control population and effectively using linkage disequilibrium among test markers. The new global test reduces to PCR when the genotype for each marker is coded as the number of minor alleles. This connection lends insight into the power of the new global test relative to PCR and some other popular multimarker test methods. Second, we propose a robust test method based on the new global test and the ordinary PCR test built on a prospective score statistic for testing marginal associations with each single marker when the genotype for each marker is coded as the number of minor alleles by taking the minimum p value of these two tests. Finally, through extensive simulation studies and analysis of the association between pancreatic cancer and some genes of interest, we show that the proposed robust test method has desirable power and can often identify association signals that may be missed by existing methods.
|
9
|
Xue Y, Ding J, Wang J, Zhang S, Pan D. Two-phase SSU and SKAT in genetic association studies. J Genet 2020; 99:9. [PMID: 32089528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
The sum of squared score (SSU) test and the sequence kernel association test (SKAT) are two good alternative tests for genetic association studies in case-control data. Both SSU and SKAT are derived by assuming a dose-response model between the risk of disease and genotypes. However, in practice, the real genetic mode of inheritance is impossible to know. Thus, these two tests might lose power substantially, as shown in simulation results, when the genetic model is misspecified. Here, to make both tests suitable in broad situations, we propose two-phase SSU (tpSSU) and two-phase SKAT (tpSKAT), where the Hardy-Weinberg equilibrium test is adopted to choose the genetic model in the first phase and the SSU and SKAT are constructed corresponding to the selected genetic model in the second phase. We found that both tpSSU and tpSKAT outperformed the original SSU and SKAT in most of our simulation scenarios. By applying tpSSU and tpSKAT to type 2 diabetes data, we successfully identified some genes that have direct effects on obesity. In addition, we detected the significant chromosomal region 10q21.22 in the GAW16 rheumatoid arthritis dataset, with P < 10⁻⁶. These findings suggest that tpSSU and tpSKAT can be effective in identifying genetic variants for complex diseases in case-control association studies.
Affiliation(s)
- Yuan Xue
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, People's Republic of China.
| | | | | | | | | |
|
10
|
Rytova AI, Khlebus EY, Shevtsov AE, Kutsenko VA, Shcherbakova NV, Zharikova AA, Ershova AI, Kiseleva AV, Boytsov SA, Yarovaya EB, Meshkov AN. Modern probabilistic and statistical approaches to search for nucleotide sequence options associated with integrated diseases. RUSS J GENET+ 2017. [DOI: 10.1134/s1022795417100088] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/23/2022]
|
11
|
Postula M, Janicki PK, Rosiak M, Eyileten C, Zaremba M, Kaplon-Cieslicka A, Sugino S, Kosior DA, Opolski G, Filipiak KJ, Mirowska-Guzel D. Targeted deep resequencing of ALOX5 and ALOX5AP in patients with diabetes and association of rare variants with leukotriene pathways. Exp Ther Med 2016; 12:415-421. [PMID: 27347071 PMCID: PMC4906979 DOI: 10.3892/etm.2016.3334] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 03/29/2015] [Accepted: 02/11/2016] [Indexed: 02/07/2023] Open
Abstract
The aim of the present study was to investigate a possible association between the accumulation of rare coding variants in the genes for arachidonate 5-lipoxygenase (ALOX5) and ALOX5-activating protein (ALOX5AP), and the corresponding production of leukotrienes (LTs) in patients with type 2 diabetes mellitus (T2DM) receiving acetylsalicylic acid therapy. Twenty exons and corresponding introns of the selected genes were resequenced in 303 DNA samples from patients with T2DM using pooled polymerase chain reaction amplification and next-generation sequencing on an Illumina HiSeq 2000 sequencing system. The observed non-synonymous variants were further confirmed by individual genotyping of DNA samples comprising all individuals from the original discovery pools. The association between the investigated phenotypes, based on LTB4 and LTE4 concentrations, and the accumulation of rare missense variants (genetic burden) in the investigated genes was evaluated using statistical collapsing tests. A total of 10 exonic variants were identified for each resequenced gene, including 5 missense and 5 synonymous variants. The rare missense variants did not exhibit statistically significant differences in the accumulation pattern between the patients with low and high LT concentrations. As the present study only included patients with T2DM, it is unclear whether the absence of an observed association between the accumulation of rare missense variants in the investigated genes and LT production is specific to diabetic populations or may also apply to other populations.
Affiliation(s)
- MAREK POSTULA
- Department of Experimental and Clinical Pharmacology, Medical University of Warsaw, Center for Preclinical Research and Technology CEPT, Warsaw 02-097, Poland
- Perioperative Genomics Laboratory, Penn State University, College of Medicine, Hershey, PA 17033, USA
| | - PIOTR KAZIMIERZ JANICKI
- Perioperative Genomics Laboratory, Penn State University, College of Medicine, Hershey, PA 17033, USA
| | - MAREK ROSIAK
- Department of Experimental and Clinical Pharmacology, Medical University of Warsaw, Center for Preclinical Research and Technology CEPT, Warsaw 02-097, Poland
- Department of Cardiology and Hypertension, Central Clinical Hospital, The Ministry of the Interior, Warsaw 02-507, Poland
| | - CEREN EYILETEN
- Department of Experimental and Clinical Pharmacology, Medical University of Warsaw, Center for Preclinical Research and Technology CEPT, Warsaw 02-097, Poland
| | - MAŁGORZATA ZAREMBA
- Department of Experimental and Clinical Pharmacology, Medical University of Warsaw, Center for Preclinical Research and Technology CEPT, Warsaw 02-097, Poland
| | | | - SHIGEKAZU SUGINO
- Perioperative Genomics Laboratory, Penn State University, College of Medicine, Hershey, PA 17033, USA
| | - DARIUSZ ARTUR KOSIOR
- Department of Cardiology and Hypertension, Central Clinical Hospital, The Ministry of the Interior, Warsaw 02-507, Poland
- Department of Applied Physiology, Mossakowski Medical Research Centre, Polish Academy of Sciences, Warsaw 02-106, Poland
| | - GRZEGORZ OPOLSKI
- Department of Cardiology, Medical University of Warsaw, Warsaw 02-091, Poland
| | | | - DAGMARA MIROWSKA-GUZEL
- Department of Experimental and Clinical Pharmacology, Medical University of Warsaw, Center for Preclinical Research and Technology CEPT, Warsaw 02-097, Poland
| |
|
12
|
Power Calculation of Multi-step Combined Principal Components with Applications to Genetic Association Studies. Sci Rep 2016; 6:26243. [PMID: 27189724 PMCID: PMC4870571 DOI: 10.1038/srep26243] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/12/2016] [Accepted: 04/28/2016] [Indexed: 12/03/2022] Open
Abstract
Principal component analysis (PCA) is a useful tool to identify important linear combinations of correlated variables in multivariate analysis and has been applied to detect association between genetic variants and human complex diseases of interest. How to choose an adequate number of principal components (PCs) to represent the original system in an optimal way is a key issue for PCA. Note that traditional PCA, which uses only a few top PCs while discarding the others, might lose substantial power in genetic association studies if all the PCs contain non-ignorable signals. In order to make full use of the information from all PCs, Aschard and colleagues recently proposed a multi-step combined PCs method (named mCPC), which performs well especially when several traits are highly correlated. However, the power superiority of mCPC has only been illustrated by simulation, while its theoretical power performance has not yet been studied. In this work, we investigate theoretical properties of mCPC and further propose a novel and efficient strategy to combine PCs. Extensive simulation results confirm that the proposed method is more robust than existing procedures. A real data application to detect the association between gene TRAF1-C5 and rheumatoid arthritis further shows good performance of the proposed procedure.
|
13
|
Xu Z, Pan W. Binomial Mixture Model Based Association Testing to Account for Genetic Heterogeneity for GWAS. Genet Epidemiol 2016; 40:202-9. [PMID: 26916514 PMCID: PMC4814320 DOI: 10.1002/gepi.21954] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/01/2015] [Revised: 11/20/2015] [Accepted: 12/14/2015] [Indexed: 11/09/2022]
Abstract
Genome-wide association studies (GWAS) have confirmed the ubiquitous existence of genetic heterogeneity for common disease: multiple common genetic variants have been identified to be associated, while many more are yet expected to be uncovered. However, the single SNP (single-nucleotide polymorphism) based trend test (or its variants) that has been dominantly used in GWAS is based on contrasting the allele frequency difference between the case and control groups, completely ignoring possible genetic heterogeneity. In spite of the widely accepted notion of genetic heterogeneity, we are not aware of any previous attempt to apply genetic heterogeneity motivated methods in GWAS. Here, to explicitly account for unknown genetic heterogeneity, we applied a mixture model based single-SNP test to the Wellcome Trust Case Control Consortium (WTCCC) GWAS data with traits of Crohn's disease, bipolar disease, coronary artery disease, and type 2 diabetes, identifying much larger numbers of significant SNPs and risk loci for each trait than those of the popular trend test, demonstrating potential power gain of the mixture model based test.
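For reference, the single-SNP trend test that the abstract uses as its baseline can be sketched as a Cochran-Armitage trend test with additive weights (0, 1, 2). This is only the standard comparator, not the mixture model based test proposed in the paper; the function name `cochran_armitage_trend` and the argument layout are assumptions made for illustration.

```python
import math


def cochran_armitage_trend(case_counts, control_counts, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for one SNP.

    case_counts / control_counts: genotype counts ordered by number of
    minor alleles (0, 1, 2). Returns (chi-square with 1 df, two-sided p-value).
    """
    totals = [r + s for r, s in zip(case_counts, control_counts)]
    n_cases, n_controls = sum(case_counts), sum(control_counts)
    n = n_cases + n_controls
    weighted_cases = sum(t * r for t, r in zip(weights, case_counts))
    weighted_total = sum(t * m for t, m in zip(weights, totals))
    weighted_sq_total = sum(t * t * m for t, m in zip(weights, totals))
    numerator = n * (n * weighted_cases - n_cases * weighted_total) ** 2
    denominator = n_cases * n_controls * (n * weighted_sq_total - weighted_total ** 2)
    if denominator == 0:
        raise ValueError("trend test undefined: no genotype or phenotype variation")
    chi_square = numerator / denominator
    # For 1 df, P(chi2 > x) = erfc(sqrt(x / 2)).
    return chi_square, math.erfc(math.sqrt(chi_square / 2.0))
```

Because the statistic contrasts the weighted genotype distribution of all cases against all controls, it implicitly treats every case as drawn from the same risk distribution, which is the limitation the abstract points to when it says the trend test ignores genetic heterogeneity.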
Affiliation(s)
- Zhiyuan Xu
- Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Wei Pan
- Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, United States of America
| |
|
14
|
Doroz R, Porwik P, Orczyk T. Dynamic signature verification method based on association of features with similarity measures. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2015.07.026] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Indexed: 11/28/2022]
|
15
|
Postula M, Janicki PK, Eyileten C, Rosiak M, Kaplon-Cieslicka A, Sugino S, Wilimski R, Kosior DA, Opolski G, Filipiak KJ, Mirowska-Guzel D. Next-generation re-sequencing of genes involved in increased platelet reactivity in diabetic patients on acetylsalicylic acid. Platelets 2015; 27:357-64. [PMID: 26599574 DOI: 10.3109/09537104.2015.1109071] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 01/10/2023]
Abstract
The objective of this study was to investigate whether rare missense genetic variants in several genes related to platelet function and acetylsalicylic acid (ASA) response are associated with platelet reactivity in patients with type 2 diabetes (T2D) on ASA therapy. Fifty-eight exons and corresponding introns of eight selected genes, including PTGS1, PTGS2, TBXAS1, PTGIS, ADRA2A, ADRA2B, TBXA2R, and P2RY1, were re-sequenced in 230 DNA samples from T2D patients using pooled PCR amplification and next-generation sequencing on an Illumina HiSeq2000. The observed non-synonymous variants were confirmed by individual genotyping of 384 DNA samples comprising the individuals from the original discovery pools and an additional verification cohort of 154 ASA-treated T2DM patients. The association between the investigated phenotypes (ASA-induced changes in platelet reactivity by PFA-100, VerifyNow and serum thromboxane B2 level [sTxB2]) and the accumulation of rare missense variants (genetic burden) in the investigated genes was tested using statistical collapsing tests. We identified a total of 35 exonic variants, including 3 common missense variants, 15 rare missense variants, and 17 synonymous variants in the 8 investigated genes. The rare missense variants exhibited a statistically significant difference in the accumulation pattern between groups of patients with increased and normal platelet reactivity based on the PFA-100 assay. Our study suggests that the genetic burden of rare functional variants in eight genes may contribute to differences in platelet reactivity measured with the PFA-100 assay in T2DM patients treated with ASA.
Affiliation(s)
- Marek Postula
- Department of Experimental and Clinical Pharmacology, Medical University of Warsaw, Center for Preclinical Research and Technology CEPT, Warsaw, Poland
- Perioperative Genomics Laboratory, Penn State College of Medicine, Hershey, PA, USA
| | - Piotr K Janicki
- Perioperative Genomics Laboratory, Penn State College of Medicine, Hershey, PA, USA
| | - Ceren Eyileten
- Department of Experimental and Clinical Pharmacology, Medical University of Warsaw, Center for Preclinical Research and Technology CEPT, Warsaw, Poland
| | - Marek Rosiak
- Department of Experimental and Clinical Pharmacology, Medical University of Warsaw, Center for Preclinical Research and Technology CEPT, Warsaw, Poland
- Department of Cardiology and Hypertension, Central Clinical Hospital, The Ministry of the Interior, Warsaw, Poland
| | | | - Shigekazu Sugino
- Perioperative Genomics Laboratory, Penn State College of Medicine, Hershey, PA, USA
| | - Radosław Wilimski
- Department of Cardiac Surgery, Medical University of Warsaw, Warsaw, Poland
| | - Dariusz A Kosior
- Department of Cardiology and Hypertension, Central Clinical Hospital, The Ministry of the Interior, Warsaw, Poland
- Department of Applied Physiology, Mossakowski Medical Research Centre, Polish Academy of Sciences, Warsaw, Poland
| | - Grzegorz Opolski
- Department of Cardiology, Medical University of Warsaw, Warsaw, Poland
| | | | - Dagmara Mirowska-Guzel
- Department of Experimental and Clinical Pharmacology, Medical University of Warsaw, Center for Preclinical Research and Technology CEPT, Warsaw, Poland
| |
|
16
|
Kim W. Transmission Disequilibrium Tests Based on Read Counts for Low-Coverage Next-Generation Sequence Data. Hum Hered 2015; 80:36-49. [PMID: 26278553 DOI: 10.1159/000434645] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/06/2015] [Accepted: 05/30/2015] [Indexed: 11/19/2022] Open
Abstract
The purpose of this paper is to introduce new statistical methods for case-parent trio association studies based on the read counts that can be obtained from next-generation sequencing (NGS) experiments. This work focuses on the inclusion of low-coverage data into the case-parent trio design without genotype classification or imputation. Two different approaches are considered: (1) a likelihood-based approach implementing a 15-component parametric mixture model and (2) a model-free approach that applies non-parametric statistical methods to the ratios of the read counts to coverage. Simulation studies are conducted to evaluate the performance of the proposed tests. In addition, the non-centrality parameters of the mixture likelihood-based tests are derived to determine sample sizes and coverage for an NGS experimental design. As an example, the sample sizes needed to maintain specified powers in a published adolescent idiopathic scoliosis (AIS) study are presented. The simulation results show that the tests using the genotypes classified by the maximum Bayesian posterior probability have significantly inflated type I error rates for low-coverage data. The tests using the posterior probabilities instead of the classified genotypes show lower power than the proposed tests. Generally, power for the likelihood-based approach is higher than that for the non-parametric ratio-based approach. For the AIS example, approximately 654 trios with 4× coverage are necessary to maintain 90% power when detecting an association of odds ratio 2 at a locus with a minor allele frequency of 0.35 at the level of significance α = 5 × 10⁻⁸. By comparison, approximately 416 trios with 25× coverage are required to maintain the same power with the same settings. The R and C source codes to calculate the proposed test statistics, the sample sizes and power can be obtained by contacting the author (wkim@cau.ac.kr).
Affiliation(s)
- Wonkuk Kim
- Department of Applied Statistics, Chung-Ang University, Seoul, South Korea
| |
|
17
|
Upadhyayula SM, Mutheneni SR, Chenna S, Parasaram V, Kadiri MR. Climate drivers on malaria transmission in Arunachal Pradesh, India. PLoS One 2015; 10:e0119514. [PMID: 25803481 PMCID: PMC4372434 DOI: 10.1371/journal.pone.0119514] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 06/02/2014] [Accepted: 01/23/2015] [Indexed: 01/19/2023] Open
Abstract
The present study was conducted during the years 2006 to 2012 and provides information on the prevalence of malaria and its regulation with respect to various climatic factors in East Siang district of Arunachal Pradesh, India. Correlation analysis, Principal Component Analysis and Hotelling's T² statistics models are adopted to understand the effect of weather variables on malaria transmission. The epidemiological study shows that malaria is mostly caused by the parasite Plasmodium vivax, followed by Plasmodium falciparum. It is noted that the intensity of malaria cases declined gradually from 2006 to 2012. Malaria transmission was higher during the rainy season than during the summer and winter seasons. Further, the data analysis with Principal Component Analysis and Hotelling's T² statistic revealed that climatic variables such as temperature and rainfall are the most influential factors for the high rate of malaria transmission in East Siang district of Arunachal Pradesh.
Affiliation(s)
- Suryanaryana Murty Upadhyayula
- Biology Division, Council of Scientific and Industrial Research-Indian Institute of Chemical Technology, Hyderabad-500 607, India
- Srinivasa Rao Mutheneni
- Biology Division, Council of Scientific and Industrial Research-Indian Institute of Chemical Technology, Hyderabad-500 607, India
- Sumana Chenna
- Chemical Engineering Sciences, Council of Scientific and Industrial Research-Indian Institute of Chemical Technology, Hyderabad-500 607, India
- Vaideesh Parasaram
- Chemical Engineering Sciences, Council of Scientific and Industrial Research-Indian Institute of Chemical Technology, Hyderabad-500 607, India
- Madhusudhan Rao Kadiri
- Biology Division, Council of Scientific and Industrial Research-Indian Institute of Chemical Technology, Hyderabad-500 607, India
18
Garner C. Confounded by sequencing depth in association studies of rare alleles. Genet Epidemiol 2011; 35:261-8. [PMID: 21328616 DOI: 10.1002/gepi.20574] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 11/01/2010] [Accepted: 01/12/2011] [Indexed: 11/12/2022]
Abstract
Next-generation DNA sequencing technologies are facilitating large-scale association studies of rare genetic variants. The depth of the sequence read coverage is an important experimental variable in the next-generation technologies and it is a major determinant of the quality of genotype calls generated from sequence data. When case and control samples are sequenced separately or in different proportions across batches, they are unlikely to be matched on sequencing read depth and a differential misclassification of genotypes can result, causing confounding and an increased false-positive rate. Data from Pilot Study 3 of the 1000 Genomes project was used to demonstrate that a difference between the mean sequencing read depth of case and control samples can result in false-positive association for rare and uncommon variants, even when the mean coverage depth exceeds 30× in both groups. The degree of the confounding and inflation in the false-positive rate depended on the extent to which the mean depth was different in the case and control groups. A logistic regression model was used to test for association between case-control status and the cumulative number of alleles in a collapsed set of rare and uncommon variants. Including each individual's mean sequence read depth across the variant sites in the logistic regression model nearly eliminated the confounding effect and the inflated false-positive rate. Furthermore, accounting for the potential error by modeling the probability of the heterozygote genotype calls in the regression analysis had a relatively minor but beneficial effect on the statistical results.
Affiliation(s)
- Chad Garner
- Department of Epidemiology, University of California, Irvine, CA 92697-3905, USA.
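The collapsing step and depth adjustment described in the abstract above can be sketched as follows. This is a minimal illustration rather than the author's code: it assumes genotypes are coded as 0/1/2 counts of the minor allele, and the function names, the frequency cutoff and the toy data are all hypothetical. It only builds the per-individual predictors (intercept, collapsed allele count, mean read depth) that a logistic regression of case-control status would then take as input.

```python
from typing import List, Sequence


def allele_frequencies(genotypes: Sequence[Sequence[int]]) -> List[float]:
    """Frequency of the counted allele at each variant (rows = individuals)."""
    n_chromosomes = 2 * len(genotypes)
    n_variants = len(genotypes[0])
    return [
        sum(row[j] for row in genotypes) / n_chromosomes
        for j in range(n_variants)
    ]


def collapsed_allele_counts(
    genotypes: Sequence[Sequence[int]], max_freq: float = 0.05
) -> List[int]:
    """Per-individual sum of alleles over variants at or below max_freq."""
    freqs = allele_frequencies(genotypes)
    # Keep variants that are observed at least once and are rare enough.
    kept = [j for j, f in enumerate(freqs) if 0.0 < f <= max_freq]
    return [sum(row[j] for j in kept) for row in genotypes]


def design_rows(
    genotypes: Sequence[Sequence[int]],
    mean_depths: Sequence[float],
    max_freq: float = 0.05,
) -> List[List[float]]:
    """Rows of [intercept, collapsed allele count, mean read depth]."""
    counts = collapsed_allele_counts(genotypes, max_freq)
    return [[1.0, float(c), float(d)] for c, d in zip(counts, mean_depths)]


# Hypothetical toy data: 4 individuals, 3 variants, one mean depth each.
toy_genotypes = [[0, 1, 1], [1, 0, 1], [0, 0, 0], [0, 0, 1]]
toy_depths = [31.0, 28.5, 40.2, 35.0]
rows = design_rows(toy_genotypes, toy_depths, max_freq=0.25)
```

Including the depth column alongside the collapsed count is what lets the regression separate a depth-driven difference in called alleles from a genuine case-control difference; the fitting step itself is left to whichever logistic regression routine is in use.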
19
Li Z, Yuan A, Han G, Gao G, Li Q. Rank-based tests for identifying multiple genetic variants associated with quantitative traits. Ann Hum Genet 2015; 78:306-10. [PMID: 24942081 DOI: 10.1111/ahg.12067] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/25/2022]
Abstract
We consider the analysis of multiple genetic variants within a gene or a region that are expected to confer risks to human complex diseases with quantitative traits, where the trait values do not follow the normal distribution even after some transformations. We rank the phenotypic values, calculate a score to measure the trend effect of a particular allele for each marker, and then construct three statistics based on the quadratic frameworks of Hotelling's T², the summation of squared univariate statistics and the inverse of the square root weighted statistics to combine the scores for different marker loci. Simulation results show that the above three test statistics can control the type I error rate well and are more robust than standard tests constructed based on linear regression. Application to GAW16 data for rheumatoid arthritis successfully detects the association between the HLA-DRB1 gene and anticyclic citrullinated protein measure, while the standard methods based on normal assumption cannot detect this association.
20
Abstract
Family data and rare variants are two key features of whole genome sequencing analysis for hunting the missing heritability of common human diseases. Recently, Zhu and Xiong proposed the generalized T² tests that combine rare variant analysis and family data analysis. In similar fashion, we developed the extended T² tests for longitudinal whole genome sequencing data for family-based association studies. The new methods simultaneously incorporate three correlation sources: from linkage disequilibrium, from pedigree structure, and from the repeated measures of covariates. We assess and compare these methods using the simulated data from Genetic Analysis Workshop 18. We show that, in general, the extended T² tests incorporating longitudinal repeated measures have higher power than the single-time-point T² tests in detecting hypertension-associated genome segments.
Affiliation(s)
- Yiwei Liu
- Department of Mathematical Sciences, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609-2280, USA
- Jing Xuan
- Department of Mathematical Sciences, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609-2280, USA
- Zheyang Wu
- Department of Mathematical Sciences, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609-2280, USA
21
Chen H, Malzahn D, Balliu B, Li C, Bailey JN. Testing genetic association with rare and common variants in family data. Genet Epidemiol 2014; 38 Suppl 1:S37-43. [PMID: 25112186 DOI: 10.1002/gepi.21823] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 12/25/2022]
Abstract
With the advance of next-generation sequencing technologies in recent years, rare genetic variant data have now become available for genetic epidemiology studies. For family samples, however, only a few statistical methods for association analysis of rare genetic variants have been developed. Rare variant approaches are of great interest, particularly for family data, because samples enriched for trait-relevant variants can be ascertained and rare variants are putatively enriched through segregation. To facilitate the evaluation of existing and new rare variant testing approaches for analyzing family data, Genetic Analysis Workshop 18 (GAW18) provided genotype and next-generation sequencing data and longitudinal blood pressure traits from extended pedigrees of Mexican American families from the San Antonio Family Study. Our GAW18 group members analyzed real and simulated phenotype data from GAW18 by using generalized linear mixed-effects models or principal components to adjust for familial correlation or by testing binary traits using a correction factor for familial effects. With one exception, approaches dealt with the extended pedigrees in their original state using information based on the kinship matrix or alternative genetic similarity measures. For simulated data our group demonstrated that the family-based kernel machine score test is superior in power to family-based single-marker or burden tests, except in a few specific scenarios. For real data three contributions identified significant associations. They substantially reduced the number of tests before performing the association analysis. We conclude from our real data analyses that further development of strategies for targeted testing or more focused screening of genetic variants is strongly desirable.
Affiliation(s)
- Han Chen
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, United States of America
22
The k-NN classifier and self-adaptive Hotelling data reduction technique in handwritten signatures recognition. Pattern Anal Appl 2014. [DOI: 10.1007/s10044-014-0419-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 10/24/2022]
23
Abstract
Although many genetic factors have been successfully identified for human diseases in genome-wide association studies, genes discovered to date only account for a small proportion of overall genetic contributions to many complex traits. Association studies have difficulty in detecting the remaining true genetic variants that are either common variants with weak allelic effects, or rare variants that have strong allelic effects but are weakly associated at the population level. In this work, we applied a goodness-of-fit test for detecting sets of common and rare variants associated with quantitative or binary traits by using whole genome sequencing data. This test has been proved optimal for detecting weak and sparse signals in the literature, which fits the requirements for targeting the genetic components of missing heritability. Furthermore, this p value-combining method allows one to incorporate different data and/or research results for meta-analysis. The method was used to simultaneously analyse the whole genome sequencing and genome-wide association studies data of Genetic Analysis Workshop 18 for detecting true genetic variants. The results show that goodness-of-fit test is comparable or better than the influential sequence kernel association test in many cases.
Affiliation(s)
- Li Yang
- Department of Mathematical Sciences, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609-2280, USA
- Jing Xuan
- Department of Mathematical Sciences, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609-2280, USA
- Zheyang Wu
- Department of Mathematical Sciences, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609-2280, USA
24
Abstract
Genome-wide association studies are very powerful in determining the genetic variants affecting complex diseases. Most of the available methods are very useful in detecting association between common variants and complex diseases. Recently, methods to detect rare variants in association with complex diseases have been developed with the increasingly available sequencing data from next-generation sequencing. In this paper, we evaluate and compare several of these recent methods for performing statistical association using whole genome sequencing data in pedigrees. Specifically, functional principal component analysis (FPCA), extended combined multivariate and collapsing (CMC) method for families, a generalized T² method, and chi-square minimum approach were compared by analyzing all the genetic variants, common and rare, of both the real data set and the simulated data set provided as part of Genetic Analysis Workshop 18.
Affiliation(s)
- George Mathew
- Department of Mathematics, Missouri State University, 901 South National Avenue, Springfield, Missouri 65897, USA
- Varghese George
- Department of Biostatistics & Epidemiology, Georgia Regents University, 1469 Laney Walker Boulevard, Augusta, Georgia 30912-4900, USA
- Hongyan Xu
- Department of Biostatistics & Epidemiology, Georgia Regents University, 1469 Laney Walker Boulevard, Augusta, Georgia 30912-4900, USA
25
Hu J, Tzeng JY. Integrative gene set analysis of multi-platform data with sample heterogeneity. Bioinformatics 2014; 30:1501-7. [PMID: 24489370 DOI: 10.1093/bioinformatics/btu060] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 12/22/2022]
Abstract
MOTIVATION Gene set analysis is a popular method for large-scale genomic studies. Because genes that have common biological features are analyzed jointly, gene set analysis often achieves better power and generates more biologically informative results. With the advancement of technologies, genomic studies with multi-platform data have become increasingly common. Several strategies have been proposed that integrate genomic data from multiple platforms to perform gene set analysis. To evaluate the performances of existing integrative gene set methods under various scenarios, we conduct a comparative simulation analysis based on The Cancer Genome Atlas breast cancer dataset. RESULTS We find that existing methods for gene set analysis are less effective when sample heterogeneity exists. To address this issue, we develop three methods for multi-platform genomic data with heterogeneity: two non-parametric methods, multi-platform Mann-Whitney statistics and multi-platform outlier robust T-statistics, and a parametric method, multi-platform likelihood ratio statistics. Using simulations, we show that the proposed multi-platform Mann-Whitney statistics method has higher power for heterogeneous samples and comparable performance for homogeneous samples when compared with the existing methods. Our real data applications to two datasets of The Cancer Genome Atlas also suggest that the proposed methods are able to identify novel pathways that are missed by other strategies. AVAILABILITY AND IMPLEMENTATION http://www4.stat.ncsu.edu/~jytzeng/Software/Multiplatform_gene_set_analysis/
Affiliation(s)
- Jun Hu
- Bioinformatics Research Center, North Carolina State University, Ricks Hall, 1 Lampe Dr., Raleigh, NC 27607, USA, Division of Bioinformatics, Omicsoft Inc., 200 Cascade Pointe Lane, Suite 101, Cary, NC 27513, USA, Department of Statistics, North Carolina State University, Ricks Hall, 1 Lampe Dr., Raleigh, NC 27607, USA and Department of Statistics, National Cheng-Kung University, No.1, University Road, Tainan 701, Taiwan
- Jung-Ying Tzeng
- Bioinformatics Research Center, North Carolina State University, Ricks Hall, 1 Lampe Dr., Raleigh, NC 27607, USA, Division of Bioinformatics, Omicsoft Inc., 200 Cascade Pointe Lane, Suite 101, Cary, NC 27513, USA, Department of Statistics, North Carolina State University, Ricks Hall, 1 Lampe Dr., Raleigh, NC 27607, USA and Department of Statistics, National Cheng-Kung University, No.1, University Road, Tainan 701, Taiwan
26
Taub MA, Schwender HR, Younkin SG, Louis TA, Ruczinski I. On multi-marker tests for association in case-control studies. Front Genet 2013; 4:252. [PMID: 24379823 PMCID: PMC3863805 DOI: 10.3389/fgene.2013.00252] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/16/2013] [Accepted: 11/07/2013] [Indexed: 11/13/2022] Open
Abstract
Genome-wide association studies (GWAs) have identified thousands of DNA loci associated with a variety of traits. Statistical inference is almost always based on single marker hypothesis tests of association and the respective p-values with Bonferroni correction. Since commercially available genomic arrays interrogate hundreds of thousands or even millions of loci simultaneously, many causal yet undetected loci are believed to exist because the conditional power to achieve a genome-wide significance level can be low, in particular for markers with small effect sizes and low minor allele frequencies and in studies with modest sample size. However, the correlation between neighboring markers in the human genome due to linkage disequilibrium (LD), which results in correlated marker test statistics, can be incorporated into multi-marker hypothesis tests, thereby increasing power to detect association. Herein, we establish a theoretical benchmark by quantifying the maximum power achievable for multi-marker tests of association in case-control studies, achievable only when the causal marker is known. Using the fact that genotype correlations within an LD block translate into an asymptotically multivariate normal distribution for score test statistics, we develop a set of weights for the markers that maximize the non-centrality parameter, and assess the relative loss of power for other approaches. We find that the method of Conneely and Boehnke (2007) based on the maximum absolute test statistic observed in an LD block is a practical and powerful method in a variety of settings. We also explore the effect on the power that prior biological or functional knowledge used to narrow down the locus of the causal marker can have, and conclude that this prior knowledge has to be very strong and specific for the power to approach the maximum achievable level, or even beat the power observed for methods such as the one proposed by Conneely and Boehnke (2007).
Affiliation(s)
- Margaret A Taub
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Holger R Schwender
- Mathematical Institute, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Samuel G Younkin
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Thomas A Louis
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Ingo Ruczinski
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
27
Wang J, Zhao Z, Cao Z, Yang A, Zhang J. A probabilistic method for identifying rare variants underlying complex traits. BMC Genomics 2013; 14 Suppl 1:S11. [PMID: 23369113 PMCID: PMC3549819 DOI: 10.1186/1471-2164-14-s1-s11] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Identifying the genetic variants that contribute to disease susceptibilities is important both for developing methodologies and for studying complex diseases in molecular biology. It has been demonstrated that the spectrum of minor allelic frequencies (MAFs) of risk genetic variants ranges from common to rare. Although association studies are shifting to incorporate rare variants (RVs) affecting complex traits, existing approaches do not show a high degree of success, and more efforts should be considered. RESULTS In this article, we focus on detecting associations between multiple rare variants and traits. Similar to RareCover, a widely used approach, we assume that variants located close to each other tend to have similar impacts on traits. Therefore, we introduce elevated regions and background regions, where the elevated regions are considered to have a higher chance of harboring causal variants. We propose a hidden Markov random field (HMRF) model to select a set of rare variants that potentially underlie the phenotype, and then, a statistical test is applied. Thus, the association analysis can be achieved without pre-selection by experts. In our model, each variant has two hidden states that represent the causal/non-causal status and the region status. In addition, two Bayesian processes are used to compare and estimate the genotype, phenotype and model parameters. We compare our approach to the three current methods using different types of datasets, and though these are simulation experiments, our approach has higher statistical power than the other methods. The software package, RareProb and the simulation datasets are available at: http://www.engr.uconn.edu/~jiw09003.
Affiliation(s)
- Jiayin Wang
- Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, PR China.
28
Shugart YY, Zhu Y, Guo W, Xiong M. Weighted pedigree-based statistics for testing the association of rare variants. BMC Genomics 2012; 13:667. [PMID: 23176082 PMCID: PMC3827928 DOI: 10.1186/1471-2164-13-667] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 07/22/2012] [Accepted: 11/12/2012] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND With the advent of next-generation sequencing (NGS) technologies, researchers are now generating a deluge of data on high dimensional genomic variations, whose analysis is likely to reveal rare variants involved in the complex etiology of disease. Standing in the way of such discoveries, however, is the fact that statistics for rare variants are currently designed for use with population-based data. In this paper, we introduce a pedigree-based statistic specifically designed to test for rare variants in family-based data. The additional power of pedigree-based statistics stems from the fact that while rare variants related to diseases or traits of interest occur only infrequently in populations, in families with multiple affected individuals, such variants are enriched. Note that while the proposed statistic can be applied with and without statistical weighting, our simulations show that its power increases when weighting schemes (WSS and VT) are applied. RESULTS Our working hypothesis was that, since rare variants are concentrated in families with multiple affected individuals, pedigree-based statistics should detect rare variants more powerfully than population-based statistics. To evaluate how well our new pedigree-based statistics perform in association studies, we develop a general framework for sequence-based association studies capable of handling data from pedigrees of various types and also from unrelated individuals. In short, we developed a procedure for transforming population-based statistics into tests for family-based associations. Furthermore, we modify two existing tests, the weighted sum-square test and the variable-threshold test, and apply both to our family-based collapsing methods. We demonstrate that the new family-based tests are more powerful than the corresponding population-based tests and that they generate a reasonable type I error rate. To demonstrate feasibility, we apply the newly developed tests to a pedigree-based GWAS data set from the Framingham Heart Study (FHS). FHS-GWAS data contain approximately 5000 uncommon variants with frequencies less than 0.05. Potential association findings in these data demonstrate the feasibility of the software PB-STAR (note, PB-STAR is now freely available to the public). CONCLUSION Our tests show that when analyzing for rare variants, a pedigree-based design is more powerful than a population-based case-control design. We further demonstrate that a pedigree-based statistic's power to detect rare variants increases in direct relation to the proportion of affected individuals within the pedigree.
Affiliation(s)
- Yin Yao Shugart
- Unit of Statistical Genomics, Division of Intramural Division Program, National Institute of Mental Health, National Institute of Health, Bethesda, MD, USA
- Yun Zhu
- Division of Biostatistics, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Wei Guo
- Unit of Statistical Genomics, Division of Intramural Division Program, National Institute of Mental Health, National Institute of Health, Bethesda, MD, USA
- Momiao Xiong
- Division of Biostatistics, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Human Genetics Center, The University of Texas Health Science Center at Houston, P.O. Box 20186, Houston, TX 77225, USA
29
Brisbin A, Jenkins GD, Ellsworth KA, Wang L, Fridley BL. Localization of association signal from risk and protective variants in sequencing studies. Front Genet 2012; 3:173. [PMID: 22973297 PMCID: PMC3434438 DOI: 10.3389/fgene.2012.00173] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 05/15/2012] [Accepted: 08/19/2012] [Indexed: 11/13/2022] Open
Abstract
Aggregating information across multiple variants in a gene or region can improve power for rare variant association testing. Power is maximized when the aggregation region contains many causal variants and few neutral variants. In this paper, we present a method for the localization of the association signal in a region using a sliding-window based approach to rare variant association testing in a region. We first introduce a novel method for analysis of rare variants, the Difference in Minor Allele Frequency test (DMAF), which allows combined analysis of common and rare variants, and makes no assumptions about the direction of effects. In whole-region analyses of simulated data with risk and protective variants, DMAF and other methods which pool data across individuals were found to outperform methods which pool data across variants. We then implement a sliding-window version of DMAF, using a step-down permutation approach to control type I error with the testing of multiple windows. In simulations, the sliding-window DMAF improved power to detect a causal sub-region, compared to applying DMAF to the whole region. Sliding-window DMAF was also effective in localizing the causal sub-region. We also applied the DMAF sliding-window approach to test for an association between response to the drug gemcitabine and variants in the gene FKBP5 sequenced in 91 lymphoblastoid cell lines derived from white non-Hispanic individuals. The application of the sliding-window test procedure detected an association in a sub-region spanning an exon and two introns, when rare and common variants were analyzed together.
Affiliation(s)
- Abra Brisbin
- Department of Health Sciences Research, Mayo Clinic Rochester, MN, USA
30
Brand OJ, Gough SCL. Immunogenetic mechanisms leading to thyroid autoimmunity: recent advances in identifying susceptibility genes and regions. Curr Genomics 2012; 12:526-41. [PMID: 22654554 PMCID: PMC3271307 DOI: 10.2174/138920211798120790] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 07/05/2011] [Revised: 08/25/2011] [Accepted: 08/27/2011] [Indexed: 02/06/2023] Open
Abstract
The autoimmune thyroid diseases (AITD) include Graves’ disease (GD) and Hashimoto’s thyroiditis (HT), which are characterised by a breakdown in immune tolerance to thyroid antigens. Unravelling the genetic architecture of AITD is vital to a better understanding of AITD pathogenesis, required to advance therapeutic options in both disease management and prevention. The early whole-genome linkage and candidate gene association studies provided the first evidence that the HLA region and CTLA-4 represented AITD risk loci. Recent improvements in high-throughput genotyping technologies, collection of larger disease cohorts and cataloguing of genome-scale variation have facilitated genome-wide association studies and more thorough screening of candidate gene regions. This has allowed identification of many novel AITD risk genes and more detailed association mapping. The growing number of confirmed AITD susceptibility loci implicates a number of putative disease mechanisms, most of which are tightly linked with aspects of immune system function. The unprecedented advances in genetic study will allow future studies to identify further novel disease risk genes and to identify aetiological variants within specific gene regions, which will undoubtedly lead to a better understanding of AITD patho-physiology.
Affiliation(s)
- Oliver J Brand
- Oxford Centre for Diabetes Endocrinology and Metabolism (OCDEM), Oxford, UK
31
Smoothed functional principal component analysis for testing association of the entire allelic spectrum of genetic variation. Eur J Hum Genet 2012; 21:217-24. [PMID: 22781089 DOI: 10.1038/ejhg.2012.141] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 11/08/2022] Open
Abstract
Faster and cheaper next-generation sequencing technologies will generate unprecedentedly massive and highly dimensional genetic variation data that allow nearly complete evaluation of genetic variation including both common and rare variants. There are two types of association tests: variant-by-variant tests and group tests. The variant-by-variant test is designed to test the association of common variants, while the group test is suitable to collectively test the association of multiple rare variants. We propose here a smoothed functional principal component analysis (SFPCA) statistic as a general approach for testing association of the entire allelic spectrum of genetic variation (both common and rare variants), which utilizes the merits of both variant-by-variant analysis and group tests. By intensive simulations, we demonstrate that the SFPCA statistic has the correct type 1 error rates and much higher power than the existing methods to detect association of (1) common variants, (2) rare variants, (3) both common and rare variants and (4) variants with opposite directions of effects. To further evaluate its performance, the SFPCA statistic is applied to ANGPTL4 sequence and six continuous phenotypes data from the Dallas Heart Study as an example for testing association of rare variants and a GWAS of schizophrenia data as an example for testing association of common variants. The results show that the SFPCA statistic has much smaller P-values than many existing statistics in both real data analysis examples.
|
32
|
Wang X, Morris NJ, Schaid DJ, Elston RC. Power of single- vs. multi-marker tests of association. Genet Epidemiol 2012; 36:480-7. [PMID: 22648939 PMCID: PMC3708310 DOI: 10.1002/gepi.21642] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2011] [Revised: 03/23/2012] [Accepted: 04/23/2012] [Indexed: 01/15/2023]
Abstract
Current genome-wide association studies still heavily rely on a single-marker strategy, in which each single nucleotide polymorphism (SNP) is tested individually for association with a phenotype. Although methods and software packages that consider multimarker models have become available, they have been slow to become widely adopted and their efficacy in real data analysis is often questioned. Based on conducting extensive simulations, here we endeavor to provide more insights into the performance of simple multimarker association tests as compared to single-marker tests. The results reveal the power advantage as well as disadvantage of the two- vs. the single-marker test. Power differentials depend on the correlation structure among tag SNPs, as well as that between tag SNPs and causal variants. A two-marker test has relatively better performance than single-marker tests when the correlation of the two adjacent markers is high. However, using HapMap data, two-marker tests tended to have a greater chance of being less powerful than single-marker tests, due to constraints on the number of actual possible haplotypes in the HapMap data. Yet, the average power difference was small whenever the one-marker test is more powerful, while there were many situations where the two-marker test can be much more powerful. These findings can be useful to guide analyses of future studies.
Affiliation(s)
- Xuefeng Wang
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio
- Nathan J. Morris
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio
- Daniel J. Schaid
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
- Robert C. Elston
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio
|
33
|
Zhu Y, Xiong M. Family-based association studies for next-generation sequencing. Am J Hum Genet 2012; 90:1028-45. [PMID: 22682329 DOI: 10.1016/j.ajhg.2012.04.022] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2011] [Revised: 04/19/2012] [Accepted: 04/28/2012] [Indexed: 12/31/2022] Open
Abstract
An individual's disease risk is determined by the compounded action of both common variants, inherited from remote ancestors, that segregated within the population and rare variants, inherited from recent ancestors, that segregated mainly within pedigrees. Next-generation sequencing (NGS) technologies generate high-dimensional data that allow a nearly complete evaluation of genetic variation. Despite their promise, NGS technologies also suffer from remarkable limitations: high error rates, enrichment of rare variants, and a large proportion of missing values, as well as the fact that most current analytical methods are designed for population-based association studies. To meet the analytical challenges raised by NGS, we propose a general framework for sequence-based association studies that can use various types of family and unrelated-individual data sampled from any population structure and a universal procedure that can transform any population-based association test statistic for use in family-based association tests. We develop family-based functional principal-component analysis (FPCA) with or without smoothing, a generalized T², combined multivariate and collapsing (CMC) method, and single-marker association test statistics. Through intensive simulations, we demonstrate that the family-based smoothed FPCA (SFPCA) has the correct type I error rates and much more power to detect association of (1) common variants, (2) rare variants, (3) both common and rare variants, and (4) variants with opposite directions of effect than other population-based or family-based association analysis methods. The proposed statistics are applied to two data sets with pedigree structures. The results show that the smoothed FPCA has a much smaller p value than other statistics.
|
34
|
Pongpanich M, Neely ML, Tzeng JY. On the Aggregation of Multimarker Information for Marker-Set and Sequencing Data Analysis: Genotype Collapsing vs. Similarity Collapsing. Front Genet 2012; 2:110. [PMID: 22303404 PMCID: PMC3266618 DOI: 10.3389/fgene.2011.00110] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2011] [Accepted: 12/25/2011] [Indexed: 12/12/2022] Open
Abstract
Methods that collapse information across genetic markers when searching for association signals are gaining momentum in the literature. Although originally developed to achieve a better balance between retaining information and controlling degrees of freedom when performing multimarker association analysis, these methods have recently been proven to be a powerful tool for identifying rare variants that contribute to complex phenotypes. The information among markers can be collapsed at the genotype level, which focuses on the mean of genetic information, or the similarity level, which focuses on the variance of genetic information. The aim of this work is to understand the strengths and weaknesses of these two collapsing strategies. Our results show that neither collapsing strategy outperforms the other across all simulated scenarios. Two factors that dominate the performance of these strategies are the signal-to-noise ratio and the underlying genetic architecture of the causal variants. Genotype collapsing is more sensitive to the marker set being contaminated by noise loci than similarity collapsing. In addition, genotype collapsing performs best when the genetic architecture of the causal variants is not complex (e.g., causal loci with similar effects and similar frequencies). Similarity collapsing is more robust as the complexity of the genetic architecture increases and outperforms genotype collapsing when the genetic architecture of the marker set becomes more sophisticated (e.g., causal loci with various effect sizes or frequencies and potential non-linear or interactive effects). Because the underlying genetic architecture is not known a priori, we also considered a two-stage analysis that combines the two top-performing methods from different collapsing strategies. We find that it is reasonably robust across all simulated scenarios.
Affiliation(s)
- Monnat Pongpanich
- Bioinformatics Research Center, North Carolina State University Raleigh, NC, USA
|
35
|
Abstract
Genome-wide association studies have been firmly established in investigations of the associations between common genetic variants and complex traits or diseases. However, a large portion of complex traits and diseases cannot be explained well by common variants. Detecting rare functional variants has become both a trend and a necessity. Because rare variants have such a small minor allele frequency (e.g., <0.05), detecting functional rare variants is challenging. Group iterative sure independence screening (ISIS), a fast group selection tool, was developed to select important genes and the single-nucleotide polymorphisms within them. The performance of the group ISIS and group penalization methods is compared for detecting important genes in the Genetic Analysis Workshop 17 data. The results suggest that the group ISIS is an efficient tool to discover genes and single-nucleotide polymorphisms associated with phenotypes.
Affiliation(s)
- Yue S Niu
- Interdisciplinary Program in Statistics, The University of Arizona, Tucson, AZ 85721, USA.
|
36
|
Gao Q, He Y, Yuan Z, Zhao J, Zhang B, Xue F. Gene- or region-based association study via kernel principal component analysis. BMC Genet 2011; 12:75. [PMID: 21871061 PMCID: PMC3176196 DOI: 10.1186/1471-2156-12-75] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2011] [Accepted: 08/26/2011] [Indexed: 11/12/2022] Open
Abstract
Background In genetic association studies, especially in GWAS, gene- or region-based methods have become more popular for detecting the association between multiple SNPs and diseases (or traits). Kernel principal component analysis combined with logistic regression test (KPCA-LRT) has been successfully used in classifying gene expression data. Nevertheless, the purpose of an association study is to detect the correlation between genetic variations and disease rather than to classify the sample, and the genomic data is categorical rather than numerical. Recently, although the kernel-based logistic regression model in association study has been proposed by projecting the nonlinear original SNPs data into a linear feature space, it is still impacted by multicollinearity between the projections, which may lead to loss of power. We, therefore, proposed a KPCA-LRT model to avoid the multicollinearity. Results Simulation results showed that KPCA-LRT was always more powerful than principal component analysis combined with logistic regression test (PCA-LRT) at different sample sizes, different significance levels and different relative risks, especially at the genewide level (1E-5) and lower relative risks (RR = 1.2, 1.3). Application to the four gene regions of rheumatoid arthritis (RA) data from Genetic Analysis Workshop 16 (GAW16) indicated that KPCA-LRT had better performance than single-locus test and PCA-LRT. Conclusions KPCA-LRT is a valid and powerful gene- or region-based method for the analysis of GWAS data sets, especially under lower relative risks and lower significance levels.
Affiliation(s)
- Qingsong Gao
- Department of Epidemiology and Health Statistics, School of Public Health, Shandong University, Jinan 250012, China
|
37
|
Basu S, Pan W. Comparison of statistical tests for disease association with rare variants. Genet Epidemiol 2011; 35:606-19. [PMID: 21769936 DOI: 10.1002/gepi.20609] [Citation(s) in RCA: 188] [Impact Index Per Article: 14.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2010] [Revised: 03/23/2011] [Accepted: 06/03/2011] [Indexed: 01/31/2023]
Abstract
In anticipation of the availability of next-generation sequencing data, there is increasing interest in investigating association between complex traits and rare variants (RVs). In contrast to association studies for common variants (CVs), due to the low frequencies of RVs, common wisdom suggests that existing statistical tests for CVs might not work, motivating the recent development of several new tests for analyzing RVs, most of which are based on the idea of pooling/collapsing RVs. However, there is a lack of evaluations of, and thus guidance on the use of, existing tests. Here we provide a comprehensive comparison of various statistical tests using simulated data. We consider both independent and correlated rare mutations, and representative tests for both CVs and RVs. As expected, if there are no or few non-causal (i.e. neutral or non-associated) RVs in a locus of interest while the effects of causal RVs on the trait are all (or mostly) in the same direction (i.e. either protective or deleterious, but not both), then the simple pooled association tests (without selecting RVs and their association directions) and a new test called kernel-based adaptive clustering (KBAC) perform similarly and are most powerful; KBAC is more robust than simple pooled association tests in the presence of non-causal RVs; however, as the number of non-causal CVs increases and/or in the presence of opposite association directions, the winners are two methods originally proposed for CVs and a new test called C-alpha test proposed for RVs, each of which can be regarded as testing on a variance component in a random-effects model. Interestingly, several methods based on sequential model selection (i.e. selecting causal RVs and their association directions), including two new methods proposed here, perform robustly and often have statistical power between those of the above two classes.
Affiliation(s)
- Saonli Basu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455-0392, USA
|
38
|
Pan W, Shen X. Adaptive tests for association analysis of rare variants. Genet Epidemiol 2011; 35:381-8. [PMID: 21520272 DOI: 10.1002/gepi.20586] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2011] [Revised: 03/03/2011] [Accepted: 03/21/2011] [Indexed: 01/30/2023]
Abstract
In anticipation of the availability of next-generation sequencing data, there has been increasing interest in association analysis of rare variants (RVs). Owing to the extremely low frequency of a RV, single variant-based analysis and many existing tests developed for common variants may not be suitable. Hence, it is of interest to develop powerful statistical tests to assess association between complex traits and RVs with sequence data. Recently, a pooled association test based on variable thresholds (VT) was proposed and shown to be more powerful than some existing tests (Price et al. [2010] Am J Hum Genet 86:832-838). In this study, we generalize the VT test of Price et al. in several aspects. We propose a general class of adaptive tests that covers the VT test of Price et al. as a special case. In particular, we show that some of our proposed adaptive tests may substantially improve the power over the pooled association tests, including the VT test of Price et al., especially so in the presence of many neutral RVs and/or of causal RVs with opposite association directions, in which cases most of the existing pooled association tests suffer from significant loss of power. Our proposed tests are also general and flexible with the ability to incorporate weights on RVs and to adjust for covariates.
Affiliation(s)
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455–0392, USA.
|
39
|
Sha Q, Zhang Z, Zhang S. An improved score test for genetic association studies. Genet Epidemiol 2011; 35:350-9. [DOI: 10.1002/gepi.20583] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2010] [Revised: 02/16/2011] [Accepted: 03/01/2011] [Indexed: 11/06/2022]
|
40
|
Han F, Pan W. Powerful multi-marker association tests: unifying genomic distance-based regression and logistic regression. Genet Epidemiol 2011; 34:680-8. [PMID: 20976795 DOI: 10.1002/gepi.20529] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
To detect genetic association with common and complex diseases, many statistical tests have been proposed for candidate gene or genome-wide association studies with the case-control design. Due to linkage disequilibrium (LD), multi-marker association tests can gain power over single-marker tests with a Bonferroni multiple testing adjustment. Among many existing multi-marker association tests, most aim to detect only one of many possible aspects in distributional differences between the genotypes of cases and controls, such as allele frequency differences, while a few new ones aim to target two or three aspects, all of which can be implemented in logistic regression. In contrast to logistic regression, a genomic distance-based regression (GDBR) approach aims to detect some high-order genotypic differences between cases and controls. A recent study has confirmed the high power of GDBR tests. At this moment, the popular logistic regression and the emerging GDBR approaches are completely unrelated; for example, one has to choose between the two. In this article, we reformulate GDBR as logistic regression, opening an avenue to constructing other powerful tests while overcoming some limitations of GDBR. For example, asymptotic distributions can replace time-consuming permutations for deriving P-values, and covariates, including gene-gene interactions, can be easily incorporated. Importantly, this reformulation facilitates combining GDBR with other existing methods in a unified framework of logistic regression. In particular, we show that Fisher's P-value combining method can boost statistical power by incorporating information from allele frequencies, Hardy-Weinberg disequilibrium, LD patterns, and other higher-order interactions among multi-markers as captured by GDBR.
Affiliation(s)
- Fang Han
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455–0392, USA
|
41
|
Hussman JP, Chung RH, Griswold AJ, Jaworski JM, Salyakina D, Ma D, Konidari I, Whitehead PL, Vance JM, Martin ER, Cuccaro ML, Gilbert JR, Haines JL, Pericak-Vance MA. A noise-reduction GWAS analysis implicates altered regulation of neurite outgrowth and guidance in autism. Mol Autism 2011; 2:1. [PMID: 21247446 PMCID: PMC3035032 DOI: 10.1186/2040-2392-2-1] [Citation(s) in RCA: 130] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2010] [Accepted: 01/19/2011] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Genome-wide Association Studies (GWAS) have proved invaluable for the identification of disease susceptibility genes. However, the prioritization of candidate genes and regions for follow-up studies often proves difficult due to false-positive associations caused by statistical noise and multiple-testing. In order to address this issue, we propose the novel GWAS noise reduction (GWAS-NR) method as a way to increase the power to detect true associations in GWAS, particularly in complex diseases such as autism. METHODS GWAS-NR utilizes a linear filter to identify genomic regions demonstrating correlation among association signals in multiple datasets. We used computer simulations to assess the ability of GWAS-NR to detect association against the commonly used joint analysis and Fisher's methods. Furthermore, we applied GWAS-NR to a family-based autism GWAS of 597 families and a second existing autism GWAS of 696 families from the Autism Genetic Resource Exchange (AGRE) to arrive at a compendium of autism candidate genes. These genes were manually annotated and classified by a literature review and functional grouping in order to reveal biological pathways which might contribute to autism aetiology. RESULTS Computer simulations indicate that GWAS-NR achieves a significantly higher classification rate for true positive association signals than either the joint analysis or Fisher's methods and that it can also achieve this when there is imperfect marker overlap across datasets or when the closest disease-related polymorphism is not directly typed. In two autism datasets, GWAS-NR analysis resulted in 1535 significant linkage disequilibrium (LD) blocks overlapping 431 unique reference sequencing (RefSeq) genes. Moreover, we identified the nearest RefSeq gene to the non-gene overlapping LD blocks, producing a final candidate set of 860 genes. Functional categorization of these implicated genes indicates that a significant proportion of them cooperate in a coherent pathway that regulates the directional protrusion of axons and dendrites to their appropriate synaptic targets. CONCLUSIONS As statistical noise is likely to particularly affect studies of complex disorders, where genetic heterogeneity or interaction between genes may confound the ability to detect association, GWAS-NR offers a powerful method for prioritizing regions for follow-up studies. Applying this method to autism datasets, GWAS-NR analysis indicates that a large subset of genes involved in the outgrowth and guidance of axons and dendrites is implicated in the aetiology of autism.
Affiliation(s)
- Ren-Hua Chung
- John P. Hussman Institute for Human Genomics, University of Miami, 1501 NW 10th Avenue, Miami, FL 33136, USA
- Anthony J Griswold
- John P. Hussman Institute for Human Genomics, University of Miami, 1501 NW 10th Avenue, Miami, FL 33136, USA
- James M Jaworski
- John P. Hussman Institute for Human Genomics, University of Miami, 1501 NW 10th Avenue, Miami, FL 33136, USA
- Daria Salyakina
- John P. Hussman Institute for Human Genomics, University of Miami, 1501 NW 10th Avenue, Miami, FL 33136, USA
- Deqiong Ma
- John P. Hussman Institute for Human Genomics, University of Miami, 1501 NW 10th Avenue, Miami, FL 33136, USA
- Ioanna Konidari
- John P. Hussman Institute for Human Genomics, University of Miami, 1501 NW 10th Avenue, Miami, FL 33136, USA
- Patrice L Whitehead
- John P. Hussman Institute for Human Genomics, University of Miami, 1501 NW 10th Avenue, Miami, FL 33136, USA
- Jeffery M Vance
- John P. Hussman Institute for Human Genomics, University of Miami, 1501 NW 10th Avenue, Miami, FL 33136, USA
- Eden R Martin
- John P. Hussman Institute for Human Genomics, University of Miami, 1501 NW 10th Avenue, Miami, FL 33136, USA
- Michael L Cuccaro
- John P. Hussman Institute for Human Genomics, University of Miami, 1501 NW 10th Avenue, Miami, FL 33136, USA
- John R Gilbert
- John P. Hussman Institute for Human Genomics, University of Miami, 1501 NW 10th Avenue, Miami, FL 33136, USA
- Jonathan L Haines
- Vanderbilt Center for Human Genetics Research, Vanderbilt University, Nashville, TN, USA
- Margaret A Pericak-Vance
- John P. Hussman Institute for Human Genomics, University of Miami, 1501 NW 10th Avenue, Miami, FL 33136, USA
|
42
|
Affiliation(s)
- Jennifer Asimit
- Wellcome Trust Sanger Institute, Hinxton CB10 1SA, United Kingdom;
|
43
|
Dong H, Luo L, Hong S, Siu H, Xiao Y, Jin L, Chen R, Xiong M. Integrated analysis of mutations, miRNA and mRNA expression in glioblastoma. BMC SYSTEMS BIOLOGY 2010; 4:163. [PMID: 21114830 PMCID: PMC3002314 DOI: 10.1186/1752-0509-4-163] [Citation(s) in RCA: 72] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/05/2010] [Accepted: 11/29/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND Glioblastoma arises from complex interactions between a variety of genetic alterations and environmental perturbations. Little attention has been paid to understanding how genetic variations, altered gene expression and microRNA (miRNA) expression are integrated into networks which act together to alter regulation and finally lead to the emergence of complex phenotypes and glioblastoma. RESULTS We identified association of somatic mutations in 14 genes with glioblastoma, of which 8 genes are newly identified, and association of loss of heterozygosity (LOH) in 11 genes with glioblastoma, of which 9 genes are newly discovered. By gene coexpression network analysis, we identified 15 genes essential to the function of the network, most of which are cancer related genes. We also constructed miRNA coexpression networks and found 19 important miRNAs of which 3 were significantly related to glioblastoma patients' survival. We identified 3,953 predicted miRNA-mRNA pairs, of which 14 were previously verified by experiments in other groups. Using pathway enrichment analysis we also found that the genes in the target network of the top 19 important miRNAs were mainly involved in cancer related signaling pathways, synaptic transmission and nervous systems processes. Finally, we developed new methods to decipher the pathway connecting mutations, expression information and glioblastoma. We identified 4 cis-expression quantitative trait loci (eQTL): TP53, EGFR, NF1 and PIK3C2G; 262 trans-eQTL and 26 trans-miRNA eQTL for somatic mutation; 2 cis-eQTL: NRAP and EGFR; 409 trans-eQTL and 27 trans-miRNA eQTL for loss of heterozygosity (LOH) mutation. CONCLUSIONS Our results demonstrate that integrated analysis of multi-dimensional data has the potential to unravel the mechanism of tumor initiation and progression.
Affiliation(s)
- Hua Dong
- State Key Laboratory of Genetic Engineering and MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai, 200433, China
|
44
|
Bansal V, Libiger O, Torkamani A, Schork NJ. Statistical analysis strategies for association studies involving rare variants. Nat Rev Genet 2010; 11:773-85. [PMID: 20940738 PMCID: PMC3743540 DOI: 10.1038/nrg2867] [Citation(s) in RCA: 381] [Impact Index Per Article: 27.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
The limitations of genome-wide association (GWA) studies that focus on the phenotypic influence of common genetic variants have motivated human geneticists to consider the contribution of rare variants to phenotypic expression. The increasing availability of high-throughput sequencing technologies has enabled studies of rare variants but these methods will not be sufficient for their success as appropriate analytical methods are also needed. We consider data analysis approaches to testing associations between a phenotype and collections of rare variants in a defined genomic region or set of regions. Ultimately, although a wide variety of analytical approaches exist, more work is needed to refine them and determine their properties and power in different contexts.
Affiliation(s)
- Vikas Bansal
- The Scripps Translational Science Institute, 3344 North Torrey Pines Court, Suite 300, La Jolla, California 92037, USA
|
45
|
Wang T, Lin CY, Rohan TE, Ye K. Resequencing of pooled DNA for detecting disease associations with rare variants. Genet Epidemiol 2010; 34:492-501. [PMID: 20578089 DOI: 10.1002/gepi.20502] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
A combination of common and rare variants is thought to contribute to genetic susceptibility to complex diseases. Recently, next-generation sequencers have greatly lowered sequencing costs, providing an opportunity to identify rare disease variants in large genetic epidemiology studies. At present, it is still expensive and time consuming to resequence large number of individual genomes. However, given that next-generation sequencing technology can provide accurate estimates of allele frequencies from pooled DNA samples, it is possible to detect associations of rare variants using pooled DNA sequencing. Current statistical approaches to the analysis of associations with rare variants are not designed for use with pooled next-generation sequencing data. Hence, they may not be optimal in terms of both validity and power. Therefore, we propose here a new statistical procedure to analyze the output of pooled sequencing data. The test statistic can be computed rapidly, making it feasible to test the association of a large number of variants with disease. By simulation, we compare this approach to Fisher's exact test based either on pooled or individual genotypic data. Our results demonstrate that the proposed method provides good control of the Type I error rate, while yielding substantially higher power than Fisher's exact test using pooled genotypic data for testing rare variants, and has similar or higher power than that of Fisher's exact test using individual genotypic data. Our results also provide guidelines on how various parameters of the pooled sequencing design affect the efficiency of detecting associations.
Affiliation(s)
- Tao Wang
- Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, New York 10461, USA.
|
46
|
Zhang Z, Niu A, Sha Q. Identification of interacting genes in genome-wide association studies using a model-based two-stage approach. Ann Hum Genet 2010; 74:406-15. [PMID: 20636464 DOI: 10.1111/j.1469-1809.2010.00594.x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
In this paper, we propose a two-stage approach based on 17 biologically plausible models to search for two-locus combinations that have significant joint effects on the disease status in genome-wide association (GWA) studies. In the two-stage analyses, we only test two-locus joint effects of SNPs that show modest marginal effects. We use simulation studies to compare the power of our two-stage analysis with a single-marker analysis and a two-stage analysis by using a full model. We find that for most plausible interaction effects, our two-stage analysis can dramatically increase the power to identify two-locus joint effects compared to a single-marker analysis and a two-stage analysis based on the full model. We also compare two-stage methods with one-stage methods. Our simulation results indicate that two-stage methods are more powerful than one-stage methods. We applied our two-stage approach to a GWA study for identifying genetic factors that might be relevant in the pathogenesis of sporadic Amyotrophic Lateral Sclerosis (ALS). Our proposed two-stage approach found that two SNPs have significant joint effect on sporadic ALS while the single-marker analysis and the two-stage analysis based on the full model did not find any significant results.
Affiliation(s)
- Zhaogong Zhang
- Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, USA
|
47
|
Schwender H, Ruczinski I, Ickstadt K. Testing SNPs and sets of SNPs for importance in association studies. Biostatistics 2010; 12:18-32. [PMID: 20601626 DOI: 10.1093/biostatistics/kxq042] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
A major goal of genetic association studies concerned with single nucleotide polymorphisms (SNPs) is the detection of SNPs exhibiting an impact on the risk of developing a disease. Typically, this problem is approached by testing each of the SNPs individually. This, however, can lead to an inaccurate measurement of the influence of the SNPs on the disease risk, in particular, if SNPs only show an effect when interacting with other SNPs, as the multivariate structure of the data is ignored. In this article, we propose a testing procedure based on logic regression that takes this structure into account and therefore enables a more appropriate quantification of importance and ranking of the SNPs than marginal testing. Since even SNP interactions often exhibit only a moderate effect on the disease risk, it can be helpful to also consider sets of SNPs (e.g. SNPs belonging to the same gene or pathway) to borrow strength across these SNP sets and to identify those genes or pathways comprising SNPs that are most consistently associated with the response. We show how the proposed procedure can be adapted for testing SNP sets, and how it can be applied to blocks of SNPs in linkage disequilibrium (LD) to overcome problems caused by LD.
Affiliation(s)
- Holger Schwender
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA.
48
Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered 2010; 70:42-54. [PMID: 20413981 PMCID: PMC2912645 DOI: 10.1159/000288704] [Citation(s) in RCA: 241] [Impact Index Per Article: 17.2] [Received: 11/09/2009] [Accepted: 02/05/2010] [Indexed: 12/14/2022]
Abstract
Since associations between complex diseases and common variants are typically weak, and approaches to genotyping rare variants (e.g. by next-generation resequencing) multiply, there is an urgent demand to develop powerful association tests that are able to detect disease associations with both common and rare variants. In this article we present such a test. It is based on data-adaptive modifications to a so-called Sum test originally proposed for common variants, which aims to strike a balance between utilizing information on multiple markers in linkage disequilibrium and reducing the cost of large degrees of freedom or of multiple testing adjustment. When applied to multiple common or rare variants in a candidate region, the proposed test is easy to use with 1 degree of freedom and without the need for multiple testing adjustment. We show that the proposed test has high power across a wide range of scenarios with either common or rare variants, or both. In particular, in some situations the proposed test performs better than several commonly used methods.
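To make the "1 degree of freedom" property concrete, here is a minimal sketch of a plain Sum test: each individual's minor-allele counts across the region are added into one score, and that single score is tested against case-control status with a score test. This covers only the basic Sum test; the data-adaptive modifications proposed in the article are not reproduced, and the function name and input layout are assumptions for this example.

```python
import math


def sum_test(genotype_matrix, status):
    """Plain Sum test with 1 degree of freedom.

    genotype_matrix: one row per individual, one minor-allele count
                     (0, 1 or 2) per variant in the region.
    status:          0/1 disease status per individual.
    Returns (chi-square statistic, p-value).
    """
    scores = [sum(row) for row in genotype_matrix]  # collapse the region
    n = len(status)
    y_bar = sum(status) / n
    s_bar = sum(scores) / n
    u = sum((y - y_bar) * s for s, y in zip(scores, status))
    v = y_bar * (1 - y_bar) * sum((s - s_bar) ** 2 for s in scores)
    if v == 0:
        return 0.0, 1.0  # no variation: the test is uninformative
    stat = u * u / v
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))
```

Since the whole region contributes a single test, no multiple-testing adjustment across variants is needed, although power drops when variants in the region act in opposite directions.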
Affiliation(s)
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minn., USA
49
Kim S, Morris NJ, Won S, Elston RC. Single-marker and two-marker association tests for unphased case-control genotype data, with a power comparison. Genet Epidemiol 2010; 34:67-77. [PMID: 19557751 DOI: 10.1002/gepi.20436] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 01/09/2023]
Abstract
In case-control single nucleotide polymorphism (SNP) data, the allele frequency, Hardy-Weinberg disequilibrium, and linkage disequilibrium (LD) contrast tests are three distinct sources of information about genetic association. While all three tests are typically developed in a retrospective context, we show that prospective logistic regression models may be developed that correspond conceptually to the retrospective tests. This approach provides a flexible framework for conducting a systematic series of association analyses using unphased genotype data and any number of covariates. For a single-stage study, two single-marker tests and four two-marker tests are discussed. The true association models are derived, and they allow us to understand why a model with only a linear term will generally fit well for a SNP in weak LD with a causal SNP, whatever the disease model, but not for a SNP in high LD with a non-additive disease SNP. We investigate the power of the association tests using real LD parameters from chromosome 11 in the HapMap CEU population data. Among the single-marker tests, the allelic test has on average the most power in the case of an additive disease, but for dominant, recessive, and heterozygote disadvantage diseases, the genotypic test has the most power. Among the four two-marker tests, the Allelic-LD contrast test, which incorporates linear terms for two markers and their interaction term, provides the most reliable power overall for the cases studied. Therefore, our result supports incorporating an interaction term as well as linear terms in multi-marker tests.
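For orientation, a prospective logistic model with linear terms for two markers and their interaction can be written as below. The notation is assumed for this illustration and is not taken from the article: $D$ is disease status, $g_1$ and $g_2$ are the minor-allele counts at the two markers, and the $\beta$ terms are regression coefficients. Which coefficients each of the four two-marker tests constrains is not stated in the abstract and is not specified here.

```latex
\operatorname{logit} P(D = 1 \mid g_1, g_2)
  = \beta_0 + \beta_1 g_1 + \beta_2 g_2 + \beta_{12}\, g_1 g_2
```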
Affiliation(s)
- Sulgi Kim
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH 44106-7281, USA
50
Yu K, Li Q, Bergen AW, Pfeiffer RM, Rosenberg PS, Caporaso N, Kraft P, Chatterjee N. Pathway analysis by adaptive combination of P-values. Genet Epidemiol 2010; 33:700-9. [PMID: 19333968 DOI: 10.1002/gepi.20422] [Citation(s) in RCA: 222] [Impact Index Per Article: 15.9] [Indexed: 11/08/2022]
Abstract
It is increasingly recognized that pathway analyses (joint tests of association between the outcome and a group of single nucleotide polymorphisms (SNPs) within a biological pathway) could potentially complement single-SNP analysis and provide additional insights into the genetic architecture of complex diseases. Building upon existing P-value combining methods, we propose a class of highly flexible pathway analysis approaches based on an adaptive rank truncated product statistic that can effectively combine evidence of associations over different SNPs and genes within a pathway. The statistical significance of the pathway-level test statistics is evaluated using a highly efficient permutation algorithm that remains computationally feasible irrespective of the size of the pathway and complexity of the underlying test statistics for summarizing SNP- and gene-level associations. We demonstrate through simulation studies that a gene-based analysis that treats the underlying genes, as opposed to the underlying SNPs, as the basic units for hypothesis testing, is a very robust and powerful approach to pathway-based association testing. We also illustrate the advantage of the proposed methods using a study of the association between the nicotinic receptor pathway and cigarette smoking behaviors.
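The adaptive rank truncated product idea can be sketched as follows: for several candidate truncation points, combine the k smallest p-values, convert each combined statistic to a p-value using permutations, take the minimum over k, and then calibrate that minimum against the same set of permutations. This is a simplified single-layer sketch under stated assumptions, not the authors' implementation: the function name and input layout are hypothetical, and the permutation p-values are assumed to have been computed beforehand by re-running the SNP- or gene-level tests on phenotype-permuted data.

```python
import math


def artp_pvalue(observed, permuted, truncation_points):
    """Adaptive rank truncated product p-value (simplified sketch).

    observed:          list of p-values (each > 0) from the real data.
    permuted:          list of lists, one list of p-values per permutation.
    truncation_points: candidate numbers k of smallest p-values to combine.
    """
    datasets = [observed] + list(permuted)  # index 0 is the real data
    n = len(datasets)

    def truncated_log_products(pvalues):
        ordered = sorted(pvalues)
        # Log of the product of the k smallest p-values, for each k.
        return [sum(math.log(p) for p in ordered[:k]) for k in truncation_points]

    stats = [truncated_log_products(p) for p in datasets]

    # For each dataset, the smallest permutation p-value over all k.
    min_p = [1.0] * n
    for j in range(len(truncation_points)):
        column = [row[j] for row in stats]
        for b in range(n):
            p_k = sum(1 for w in column if w <= column[b]) / n
            min_p[b] = min(min_p[b], p_k)

    # Calibrate the observed minimum against the permutation minima.
    return sum(1 for m in min_p if m <= min_p[0]) / n
```

Reusing one set of permutations for both steps avoids a nested permutation loop; the smallest attainable p-value is 1 / (number of permutations + 1).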
Affiliation(s)
- Kai Yu
- Division of Cancer Epidemiology and Genetics, NCI, Rockville, Maryland 20892, USA.
|