1
Bedere N, Bovenhuis H. Characterizing a region on BTA11 affecting β-lactoglobulin content of milk using high-density genotyping and haplotype grouping. BMC Genet 2017; 18:17. [PMID: 28222684 PMCID: PMC5320657 DOI: 10.1186/s12863-017-0483-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 09/09/2016] [Accepted: 02/11/2017] [Indexed: 11/13/2022]
Abstract
Background Milk β-lactoglobulin (β-LG) content is of interest as it is associated with nutritional and manufacturing properties. It is known that milk β-LG content is strongly affected by genetic factors. In cattle, most of the genetic differences are associated with a chromosomal region on BTA11, which contains the β-LG gene. The aim of this study was to characterize this region using 777 k SNP data (BovineHDbeadChip) and perform a haplotype-based association study. A statistical approach was developed to build haplotypes that capture the genetic variation associated with this genomic region. Results The SNP with the most significant effect on β-lactoglobulin content was one of the 2 causal mutations responsible for the β-lactoglobulin protein variants A/B. Haplotypes based on 2 to 5 selected lead SNP were clustered in groups with different effects on β-lactoglobulin content. Four different groups were identified suggesting that β-lactoglobulin variant A and B can be further refined in A1, A2, B1 and B2. Conclusions This study showed that β-lactoglobulin protein variants A/B do not explain all genetic variation associated with the tail part of BTA11 but this region contains more than one mutation with an effect on β-lactoglobulin content. These findings can be used for selection of cows with higher cheese yield, which is desirable for the dairy industry.
Affiliation(s)
- Nicolas Bedere
- Present address: PEGASE, Agrocampus Ouest, INRA, 35590, Saint-Gilles, France
- Henk Bovenhuis
- Animal Breeding and Genomics Centre, Wageningen University, P.O. Box 338, 6700 AH Wageningen, The Netherlands.
2
3
Thair SA, Topchiy E, Boyd JH, Cirstea M, Wang C, Nakada TA, Fjell CD, Wurfel M, Russell JA, Walley KR. TNFAIP2 Inhibits Early TNFα-Induced NF-κB Signaling and Decreases Survival in Septic Shock Patients. J Innate Immun 2015; 8:57-66. [PMID: 26347487 DOI: 10.1159/000437330] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 10/03/2014] [Accepted: 07/01/2015] [Indexed: 12/24/2022]
Abstract
During septic shock, tumor necrosis factor alpha (TNFα) is an early response gene and induces a plethora of genes and signaling pathways. To identify robust signals in genes reliably upregulated by TNFα, we first measured microarray gene expression in vitro and searched methodologically comparable, publicly available data sets to identify concordant signals. Using tag single-nucleotide polymorphisms in the genes common to all data sets, we identified a genetic variant of the TNFAIP2 gene, rs8126, associated with decreased 28-day survival and increased organ dysfunction in an adult cohort in the Vasopressin and Septic Shock Trial. Similar to this cohort, we found that an association with rs8126 and increased organ dysfunction is replicated in a second cohort of septic shock patients in the St. Paul's Hospital Intensive Care Unit. We found that TNFAIP2 inhibits NF-κB activity, impacting the downstream cytokine interleukin (IL)-8. The minor G allele of TNFAIP2 rs8126 resulted in greater TNFAIP2 expression, decreased IL-8 production and was associated with decreased survival in patients experiencing septic shock. These data suggest that TNFAIP2 is a novel inhibitor of NF-κB that acts as an autoinhibitor of the TNFα response during septic shock.
Affiliation(s)
- Simone A Thair
- Department of Emergency and Surgery, Stanford University School of Medicine, Stanford, Calif., USA
4
Fine Mapping Causal Variants with an Approximate Bayesian Method Using Marginal Test Statistics. Genetics 2015; 200:719-36. [PMID: 25948564 DOI: 10.1534/genetics.115.176107] [Citation(s) in RCA: 146] [Impact Index Per Article: 16.2] [Received: 03/06/2015] [Accepted: 05/04/2015] [Indexed: 01/08/2023]
Abstract
Two recently developed fine-mapping methods, CAVIAR and PAINTOR, demonstrate better performance over other fine-mapping methods. They also have the advantage of using only the marginal test statistics and the correlation among SNPs. Both methods leverage the fact that the marginal test statistics asymptotically follow a multivariate normal distribution and are likelihood based. However, their relationship with Bayesian fine mapping, such as BIMBAM, is not clear. In this study, we first show that CAVIAR and BIMBAM are actually approximately equivalent to each other. This leads to a fine-mapping method using marginal test statistics in the Bayesian framework, which we call CAVIAR Bayes factor (CAVIARBF). Another advantage of the Bayesian framework is that it can answer both association and fine-mapping questions. We also used simulations to compare CAVIARBF with other methods under different numbers of causal variants. The results showed that both CAVIARBF and BIMBAM have better performance than PAINTOR and other methods. Compared to BIMBAM, CAVIARBF has the advantage of using only marginal test statistics and takes about one-quarter to one-fifth of the running time. We applied different methods on two independent cohorts of the same phenotype. Results showed that CAVIARBF, BIMBAM, and PAINTOR selected the same top 3 SNPs; however, CAVIARBF and BIMBAM had better consistency in selecting the top 10 ranked SNPs between the two cohorts. Software is available at https://bitbucket.org/Wenan/caviarbf.
5
Burkett KM, Greenwood CMT, McNeney B, Graham J. Gene genealogies for genetic association mapping, with application to Crohn's disease. Front Genet 2013; 4:260. [PMID: 24348515 PMCID: PMC3845011 DOI: 10.3389/fgene.2013.00260] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 08/23/2013] [Accepted: 11/12/2013] [Indexed: 11/30/2022]
Abstract
A gene genealogy describes relationships among haplotypes sampled from a population. Knowledge of the gene genealogy for a set of haplotypes is useful for estimation of population genetic parameters and it also has potential application in finding disease-predisposing genetic variants. As the true gene genealogy is unknown, Markov chain Monte Carlo (MCMC) approaches have been used to sample genealogies conditional on data at multiple genetic markers. We previously implemented an MCMC algorithm to sample from an approximation to the distribution of the gene genealogy conditional on haplotype data. Our approach samples ancestral trees, recombination and mutation rates at a genomic focal point. In this work, we describe how our sampler can be used to find disease-predisposing genetic variants in samples of cases and controls. We use a tree-based association statistic that quantifies the degree to which case haplotypes are more closely related to each other around the focal point than control haplotypes, without relying on a disease model. As the ancestral tree is a latent variable, so is the tree-based association statistic. We show how the sampler can be used to estimate the posterior distribution of the latent test statistic and corresponding latent p-values, which together comprise a fuzzy p-value. We illustrate the approach on a publicly-available dataset from a study of Crohn's disease that consists of genotypes at multiple SNP markers in a small genomic region. We estimate the posterior distribution of the tree-based association statistic and the recombination rate at multiple focal points in the region. Reassuringly, the posterior mean recombination rates estimated at the different focal points are consistent with previously published estimates. The tree-based association approach finds multiple sub-regions where the case haplotypes are more genetically related than the control haplotypes, and that there may be one or multiple disease-predisposing loci.
Affiliation(s)
- Kelly M Burkett
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada; Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada
- Celia M T Greenwood
- Department of Oncology, Department of Epidemiology, Biostatistics and Occupational Health, and Division of Cancer Epidemiology, McGill University, Montreal, QC, Canada; Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- Brad McNeney
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada
- Jinko Graham
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada
6
Fine-scale mapping of disease susceptibility locus with Bayesian partition model. Genes Genomics 2012. [DOI: 10.1007/s13258-011-0220-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]
7
Wason JMS, Dudbridge F. Comparison of multimarker logistic regression models, with application to a genomewide scan of schizophrenia. BMC Genet 2010; 11:80. [PMID: 20828390 PMCID: PMC2949738 DOI: 10.1186/1471-2156-11-80] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 04/15/2010] [Accepted: 09/09/2010] [Indexed: 11/29/2022]
Abstract
Background Genome-wide association studies (GWAS) are a widely used study design for detecting genetic causes of complex diseases. Current studies provide good coverage of common causal SNPs, but not rare ones. A popular method to detect rare causal variants is haplotype testing. A disadvantage of this approach is that many parameters are estimated simultaneously, which can mean a loss of power and slower fitting to large datasets. Haplotype testing effectively tests both the allele frequencies and the linkage disequilibrium (LD) structure of the data. LD has previously been shown to be mostly attributable to LD between adjacent SNPs. We propose a generalised linear model (GLM) which models the effects of each SNP in a region as well as the statistical interactions between adjacent pairs. This is compared to two other commonly used multimarker GLMs: one with a main-effect parameter for each SNP; one with a parameter for each haplotype. Results We show the haplotype model has higher power for rare untyped causal SNPs, the main-effects model has higher power for common untyped causal SNPs, and the proposed model generally has power in between the two others. We show that the relative power of the three methods is dependent on the number of marker haplotypes the causal allele is present on, which depends on the age of the mutation. Except in the case of a common causal variant in high LD with markers, all three multimarker models are superior in power to single-SNP tests. Including the adjacent statistical interactions results in lower inflation in test statistics when a realistic level of population stratification is present in a dataset. Using the multimarker models, we analyse data from the Molecular Genetics of Schizophrenia study. The multimarker models find potential associations that are not found by single-SNP tests. 
However, multimarker models also require stricter control of data quality since biases can have a larger inflationary effect on multimarker test statistics than on single-SNP test statistics. Conclusions Analysing a GWAS with multimarker models can yield candidate regions which may contain rare untyped causal variants. This is useful for increasing prior odds of association in future whole-genome sequence analyses.
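As a rough illustration of the model structure described in this abstract (a main effect for each SNP plus statistical interactions between adjacent pairs), the following minimal sketch builds the corresponding design-matrix rows. The function name `adjacent_interaction_design` and the coding of an interaction as the product of two allele counts are assumptions made for illustration only; the abstract does not state the coding the authors used.

```python
def adjacent_interaction_design(genotypes):
    """Return design rows: one column per SNP, then one per adjacent SNP pair.

    `genotypes` is a list of rows of minor-allele counts (0, 1 or 2).
    Coding the interaction as a product of allele counts is an assumption.
    """
    design = []
    for row in genotypes:
        main_effects = list(row)
        # Interaction terms only for neighbouring SNPs (i, i + 1).
        adjacent = [row[i] * row[i + 1] for i in range(len(row) - 1)]
        design.append(main_effects + adjacent)
    return design


# Two individuals typed at three SNPs: 3 main-effect + 2 interaction columns.
genotypes = [[0, 1, 2], [1, 1, 0]]
design = adjacent_interaction_design(genotypes)
```

With m SNPs this gives 2m - 1 columns, which sits between the m columns of a main-effects model and the potentially much larger number of parameters in a haplotype model.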
Affiliation(s)
- James M S Wason
- MRC Biostatistics Unit, Institute of Public Health, Cambridge CB2 0SR, UK.
8
Zhang L, Li W, Song L, Chen L. A towards-multidimensional screening approach to predict candidate genes of rheumatoid arthritis based on SNP, structural and functional annotations. BMC Med Genomics 2010; 3:38. [PMID: 20727150 PMCID: PMC2939610 DOI: 10.1186/1755-8794-3-38] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 12/03/2009] [Accepted: 08/20/2010] [Indexed: 11/20/2022]
Abstract
Background According to the Genetic Analysis Workshops (GAW), hundreds of thousands of SNPs have been tested for association with rheumatoid arthritis. Traditional genome-wide association studies (GWAS) have been developed to identify susceptibility genes using a "most significant SNPs/genes" model. However, many minor- or modest-risk genes are likely to be missed after adjustment for multiple testing. This screening process uses a strict selection of statistical thresholds that aim to identify susceptibility genes based only on a statistical model, without considering multi-dimensional biological similarities in sequence arrangement, crystal structure, or functional categories/biological pathways between candidate and known disease genes. Methods Multidimensional screening approaches combined with traditional statistical genetics methods can consider multiple biological backgrounds of genetic mutation, structural, and functional annotations. Here we introduce a newly developed multidimensional screening approach for rheumatoid arthritis candidate genes that considers all SNPs with nominal evidence of Bayesian association (BFLn > 0), and structural and functional similarities of corresponding genes or proteins. Results Our multidimensional screening approach extracted all risk genes (BFLn > 0) by odds ratios of hypothesis H1 to H0, and determined whether a particular group of genes shared underlying biological similarities with known disease genes. Using this method, we found 6614 risk SNPs in our Bayesian screen result set. Finally, we identified 146 likely causal genes for rheumatoid arthritis, including CD4, FGFR1, and KDR, which have been reported as high risk factors by recent studies. We note that 790 (96.1%) of the genes identified by GWAS could not easily be classified into related functional categories or biological processes associated with the disease, while our candidate genes shared underlying biological similarities (e.g. were in the same pathway or GO term) and contributed to disease etiology, but where common variations in each of these genes make modest contributions to disease risk. We also found 6141 risk SNPs that were too minor to be detected by conventional approaches, and associations between 58 candidate genes and rheumatoid arthritis were verified by literature retrieved from the NCBI PubMed module. Conclusions Our proposed approach to the analysis of GAW16 data for rheumatoid arthritis was based on an underlying biological similarities-based method applied to candidate and known disease genes. Application of our method could identify likely causal candidate disease genes of rheumatoid arthritis, and could yield biological insights that are not detected when focusing only on genes that give the strongest evidence by multiple testing. We hope that our proposed method complements the "most significant SNPs/genes" model, and provides additional insights into the pathogenesis of rheumatoid arthritis and other diseases, when searching datasets for hundreds of genetic variants.
Affiliation(s)
- Liangcai Zhang
- Department of Biophysics, College of Bioinformatics Science and Technology; Harbin Medical University, Harbin, Hei Longjiang Province, China
9
Nonyane BAS, Whittaker JC. A variance components factor model for genetic association studies: a Bayesian analysis. Genet Epidemiol 2010; 34:529-36. [PMID: 20718044 DOI: 10.1002/gepi.20503] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Studies of gene-trait associations for complex diseases often involve multiple traits that may vary by genotype groups or patterns. Such traits are usually manifestations of lower-dimensional latent factors or disease syndromes. We illustrate the use of a variance components factor (VCF) model to model the association between multiple traits and genotype groups as well as any other existing patient-level covariates. This model characterizes the correlations between traits as underlying latent factors that can be used in clinical decision-making. We apply it within the Bayesian framework and provide a straightforward implementation using the WinBUGS software. The VCF model is illustrated with simulated data and an example that comprises changes in plasma lipid measurements of patients who were treated with statins to lower low-density lipoprotein cholesterol, and polymorphisms from the apolipoprotein-E gene. The simulation shows that this model clearly characterizes existing multiple trait manifestations across genotype groups where individuals' group assignments are fully observed or can be deduced from the observed data. It also allows one to investigate covariate by genotype group interactions that may explain the variability in the traits. The flexibility to characterize such multiple trait manifestations makes the VCF model more desirable than the univariate variance components model, which is applied to each trait separately. The Bayesian framework offers a flexible approach that allows one to incorporate prior information.
Affiliation(s)
- B A S Nonyane
- Department of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, UK.
10
Ghosh S. Genome-wide association analyses of quantitative traits: the GAW16 experience. Genet Epidemiol 2010; 33 Suppl 1:S13-8. [PMID: 19924711 DOI: 10.1002/gepi.20466] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/31/2022]
Abstract
The group that formed on the theme of genome-wide association analyses of quantitative traits (Group 2) in the Genetic Analysis Workshop 16 comprised eight sets of investigators. Three data sets were available: one on autoantibodies related to rheumatoid arthritis provided by the North American Rheumatoid Arthritis Consortium; the second on anthropometric, lipid, and biochemical measures provided by the Framingham Heart Study (FHS); and the third a simulated data set modeled after FHS. The different investigators in the group addressed a large set of statistical challenges and applied a wide spectrum of association methods in analyzing quantitative traits at the genome-wide level. While some previously reported genes were validated, some novel chromosomal regions provided significant evidence of association in multiple contributions in the group. In this report, we discuss the different strategies explored by the different investigators with the common goal of improving the power to detect association.
Affiliation(s)
- Saurabh Ghosh
- Human Genetics Unit, Indian Statistical Institute, Kolkata, India.
11
Schulz A, Fischer C, Chang-Claude J, Beckmann L. Entropy-supported marker selection and Mantel statistics for haplotype sharing analysis. Genet Epidemiol 2010; 34:354-63. [DOI: 10.1002/gepi.20491] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/11/2022]
12
The diverse applications of cladistic analysis of molecular evolution, with special reference to nested clade analysis. Int J Mol Sci 2010; 11:124-39. [PMID: 20162005 PMCID: PMC2820993 DOI: 10.3390/ijms11010124] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 11/25/2009] [Revised: 01/06/2010] [Accepted: 01/06/2010] [Indexed: 11/17/2022]
Abstract
The genetic variation found in small regions of the genomes of many species can be arranged into haplotype trees that reflect the evolutionary genealogy of the DNA lineages found in that region and the accumulation of mutations on those lineages. This review demonstrates some of the many ways in which clades (branches) of haplotype trees have been applied in recent years, including the study of genotype/phenotype associations at candidate loci and in genome-wide association studies, the phylogeographic history of species, human evolution, the conservation of endangered species, and the identification of species.
13
Abstract
We describe a fast hierarchical Bayesian method for mapping quantitative trait loci by haplotype-based association, applicable when haplotypes are not observed directly but are inferred from multiple marker genotypes. The method avoids the use of a Monte Carlo Markov chain by employing priors for which the likelihood factorizes completely. It is parameterized by a single hyperparameter, the fraction of variance explained by the quantitative trait locus, compared to the frequentist fixed-effects model, which requires a parameter for the phenotypic effect of each combination of haplotypes; nevertheless it still provides estimates of haplotype effects. We use simulation to show that the method matches the power of the frequentist regression model and, when the haplotypes are inferred, exceeds it for small QTL effect sizes. The Bayesian estimates of the haplotype effects are more accurate than the frequentist estimates, for both known and inferred haplotypes, which indicates that this advantage is independent of the effect of uncertainty in haplotype inference and will hold in comparison with frequentist methods in general. We apply the method to data from a panel of recombinant inbred lines of Arabidopsis thaliana, descended from 19 inbred founders.
14
Buil A, Martinez-Perez A, Perera-Lluna A, Rib L, Caminal P, Soria JM. A new gene-based association test for genome-wide association studies. BMC Proc 2009; 3 Suppl 7:S130. [PMID: 20017997 PMCID: PMC2795904 DOI: 10.1186/1753-6561-3-s7-s130] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Indexed: 01/28/2023]
Abstract
Genome-wide association studies are widely used today to discover genetic factors that modify the risk of complex diseases. Usually, these methods work in a SNP-by-SNP fashion. We present a gene-based test that can be applied in the context of genome-wide association studies. We compare both strategies, SNP-based and gene-based, in a sample of cases and controls for rheumatoid arthritis. We obtained different results using each strategy. The SNP-based test found the PTPN22 gene while the gene-based test found the PHF19-TRAF1-C5 region. That suggests that no single strategy performs better than another in all cases and that a certain underlying genetic architecture can be delineated more easily with one strategy rather than with another.
Affiliation(s)
- Alfonso Buil
- Unitat de Genomica de Malalties Complexes, Institut de Recerca de l'Hospital de la Santa Creu i Sant Pau, Barcelona, 08025, Spain.
15
Zhang M, Lin Y, Wang L, Pungpapong V, Fleet JC, Zhang D. Case-control genome-wide association study of rheumatoid arthritis from Genetic Analysis Workshop 16 using penalized orthogonal-components regression-linear discriminant analysis. BMC Proc 2009; 3 Suppl 7:S17. [PMID: 20018006 PMCID: PMC2795913 DOI: 10.1186/1753-6561-3-s7-s17] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 02/25/2023]
Abstract
Currently, genome-wide association studies (GWAS) are conducted by collecting a massive number of SNPs (i.e., large p) for a relatively small number of individuals (i.e., small n) and associations are made between clinical phenotypes and genetic variation one single-nucleotide polymorphism (SNP) at a time. Univariate association approaches like this ignore the linkage disequilibrium between SNPs in regions of low recombination. This results in a low reliability of candidate gene identification. Here we propose to improve the case-control GWAS approach by implementing linear discriminant analysis (LDA) through a penalized orthogonal-components regression (POCRE), a newly developed variable selection method for large p small n data. The proposed POCRE-LDA method was applied to the Genetic Analysis Workshop 16 case-control data for rheumatoid arthritis (RA). In addition to the two regions on chromosomes 6 and 9 previously associated with RA by GWAS, we identified SNPs on chromosomes 10 and 18 as potential candidates for further investigation.
Affiliation(s)
- Min Zhang
- Department of Statistics, Purdue University, 150 North University Street, West Lafayette, IN 47907, USA.
16
Lin Y, Zhang M, Wang L, Pungpapong V, Fleet JC, Zhang D. Simultaneous genome-wide association studies of anti-cyclic citrullinated peptide in rheumatoid arthritis using penalized orthogonal-components regression. BMC Proc 2009; 3 Suppl 7:S20. [PMID: 20018010 PMCID: PMC2795917 DOI: 10.1186/1753-6561-3-s7-s20] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022]
Abstract
Genome-wide associations between single-nucleotide polymorphisms and clinical traits were simultaneously conducted using penalized orthogonal-components regression. This method was developed to identify the genetic variants controlling phenotypes from a massive number of candidate variants. By investigating the association between all single-nucleotide polymorphisms to the phenotype of antibodies against cyclic citrullinated peptide using the rheumatoid arthritis data provided by Genetic Analysis Workshop 16, we identified genetic regions which may contribute to the pathogenesis of rheumatoid arthritis. Bioinformatic analysis of these genomic regions showed most of them harbor protein-coding gene(s).
Affiliation(s)
- Yanzhu Lin
- Department of Statistics, Purdue University, West Lafayette, Indiana 47907, USA.
17
Igo RP, Li J, Goddard KAB. Association mapping by generalized linear regression with density-based haplotype clustering. Genet Epidemiol 2009; 33:16-26. [PMID: 18561202 DOI: 10.1002/gepi.20352] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 01/04/2023]
Abstract
Haplotypes of closely linked single-nucleotide polymorphisms (SNPs) potentially offer greater power than individual SNPs to detect association between genetic variants and disease. We present a novel approach for association mapping in which density-based clustering of haplotypes reduces the dimensionality of the general linear model (GLM)-based score test of association implemented in the HaploStats software (Schaid et al. [2002] Am. J. Hum. Genet. 70:425-434). A flexible haplotype similarity score, a generalization of previously used measures, forms the basis for grouping haplotypes of probable recent common ancestry. All haplotypes within a cluster are assigned the same regression coefficient within the GLM, and evidence for association is assessed with a score statistic. The approach is applicable to both binary and continuous trait data, and does not require prior phase information. Results of simulation studies demonstrated that clustering enhanced the power of the score test to detect association, under a variety of conditions, while preserving valid Type-I error. Improvement in performance was most dramatic in the presence of extreme haplotype diversity, while a slight improvement was observed even at low diversity. Our method also offers, for binary traits, a slight advantage in power over a similar approach based on an evolutionary model (Tzeng et al. [2006] Am. J. Hum. Genet. 78:231-242).
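The dimension reduction described in this abstract, in which all haplotypes in a cluster share one regression coefficient, amounts to summing haplotype dosage columns within each cluster before fitting the model. The sketch below shows only that collapsing step, under stated assumptions: the function name `collapse_to_clusters` is hypothetical, the cluster labels are taken as given rather than derived from the authors' density-based clustering, and haplotype counts are assumed to be known rather than inferred from unphased genotypes.

```python
def collapse_to_clusters(hap_dosage, cluster_of, n_clusters):
    """Sum per-haplotype copy counts into per-cluster copy counts.

    hap_dosage[i][h] is the number of copies of haplotype h carried by
    individual i; cluster_of[h] is the (assumed, precomputed) cluster label
    of haplotype h. One regression coefficient per output column then
    applies to every haplotype in that cluster.
    """
    collapsed = []
    for row in hap_dosage:
        totals = [0] * n_clusters
        for hap_index, count in enumerate(row):
            totals[cluster_of[hap_index]] += count
        collapsed.append(totals)
    return collapsed


# Four haplotypes grouped into two clusters; each diploid row sums to 2.
hap_dosage = [[1, 1, 0, 0], [0, 0, 2, 0], [1, 0, 0, 1]]
cluster_of = [0, 0, 1, 1]
cluster_dosage = collapse_to_clusters(hap_dosage, cluster_of, 2)
```

Here four haplotype columns reduce to two cluster columns, so the regression estimates two effects instead of four.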
Affiliation(s)
- Robert P Igo
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio, USA
18
Pei YF, Zhang L, Liu J, Deng HW. Multivariate association test using haplotype trend regression. Ann Hum Genet 2009; 73:456-64. [PMID: 19489754 DOI: 10.1111/j.1469-1809.2009.00527.x] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 02/04/2023]
Abstract
Genetic association analyses with haplotypes may be more powerful than analyses with single markers, under certain conditions. Furthermore, simultaneously considering multiple correlated traits may make use of additional information that would not be considered when analyzing individual traits. In this study, we propose a haplotype based test of association for multivariate quantitative traits in unrelated samples. Specifically, we extend a population based haplotype trend regression (HTR) approach to multivariate scenarios. We mainly focused on bivariate HTR, and the simulation results showed that the proposed method had correct pre-specified type-I error rates. The power of the proposed method was largely influenced by the size and source of correlation between variables, being greatest when correlation of a specific gene was opposite in sign to the residual correlation.
Affiliation(s)
- Yu-Fang Pei
- The Key Laboratory of Biomedical Information Engineering of Ministry of Education, Institute of Molecular Genetics, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, P. R. China
19
Hartman L, Hössjer O, Humphreys K. Utilizing identity-by-descent probabilities for genetic fine-mapping in population based samples, via spatial smoothing of haplotype effects. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.05.029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
20
Su SY, White J, Balding DJ, Coin LJM. Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions. BMC Bioinformatics 2008; 9:513. [PMID: 19046436 PMCID: PMC2647950 DOI: 10.1186/1471-2105-9-513] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Received: 08/26/2008] [Accepted: 12/01/2008] [Indexed: 12/18/2022]
Abstract
Background The power of haplotype-based methods for association studies, identification of regions under selection, and ancestral inference, is well-established for diploid organisms. For polyploids, however, the difficulty of determining phase has limited such approaches. Polyploidy is common in plants and is also observed in animals. Partial polyploidy is sometimes observed in humans (e.g. trisomy 21; Down's syndrome), and it arises more frequently in some human tissues. Local changes in ploidy, known as copy number variations (CNV), arise throughout the genome. Here we present a method, implemented in the software polyHap, for the inference of haplotype phase and missing observations from polyploid genotypes. PolyHap allows each individual to have a different ploidy, but ploidy cannot vary over the genomic region analysed. It employs a hidden Markov model (HMM) and a sampling algorithm to infer haplotypes jointly in multiple individuals and to obtain a measure of uncertainty in its inferences. Results In the simulation study, we combine real haplotype data to create artificial diploid, triploid, and tetraploid genotypes, and use these to demonstrate that polyHap performs well, in terms of both switch error rate in recovering phase and imputation error rate for missing genotypes. To our knowledge, there is no comparable software for phasing a large, densely genotyped region of chromosome from triploids and tetraploids, while for diploids we found polyHap to be more accurate than fastPhase. We also compare the results of polyHap to SATlotyper on an experimentally haplotyped tetraploid dataset of 12 SNPs, and show that polyHap is more accurate. Conclusion With the availability of large SNP data in polyploids and CNV regions, we believe that polyHap, our proposed method for inferring haplotypic phase from genotype data, will be useful in enabling researchers analysing such data to exploit the power of haplotype-based analyses.
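The switch error rate used above to evaluate phasing can be sketched in Python for the simplest case. This is a diploid-only illustration, not polyHap's polyploid procedure. It assumes the inferred haplotypes reproduce the true genotype at every site, and the function name is hypothetical.

```python
def switch_error_rate(true_haps, inferred_haps):
    """Switch error rate for one diploid individual.

    Each argument is a pair of equal-length allele sequences (0/1).
    Only heterozygous sites carry phase information.
    """
    t0, t1 = true_haps
    i0, i1 = inferred_haps
    het = [k for k in range(len(t0)) if t0[k] != t1[k]]
    # For each heterozygous site, record whether inferred haplotype 0
    # carries the allele of true haplotype 0 (True) or true haplotype 1 (False).
    orientation = [i0[k] == t0[k] for k in het]
    # A switch occurs whenever the orientation flips between neighbouring sites.
    switches = sum(a != b for a, b in zip(orientation, orientation[1:]))
    opportunities = len(het) - 1
    return switches / opportunities if opportunities > 0 else 0.0
```

Swapping the labels of the two inferred haplotypes does not count as an error, because only changes in orientation between consecutive heterozygous sites are counted.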
Affiliation(s)
- Shu-Yi Su
- Department of Epidemiology and Public Health, Imperial College, London, W2 1PG, UK.
|
21
|
Hosking FJ, Sterne JAC, Smith GD, Green PJ. Inference from genome-wide association studies using a novel Markov model. Genet Epidemiol 2008; 32:497-504. [PMID: 18383184 DOI: 10.1002/gepi.20322] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 12/19/2022]
Abstract
In this paper we propose a Bayesian modeling approach to the analysis of genome-wide association studies based on single nucleotide polymorphism (SNP) data. Our latent seed model combines various aspects of k-means clustering, hidden Markov models (HMMs) and logistic regression into a fully Bayesian model. It is fitted using the Markov chain Monte Carlo stochastic simulation method, with Metropolis-Hastings update steps. The approach is flexible, both in allowing different types of genetic models, and because it can be easily extended while remaining computationally feasible due to the use of fast algorithms for HMMs. It allows for inference primarily on the location of the causal locus and also on other parameters of interest. The latent seed model is used here to analyze three data sets, using both synthetic and real disease phenotypes with real SNP data, and shows promising results. Our method is able to correctly identify the causal locus in examples where single SNP analysis is both successful and unsuccessful at identifying the causal SNP.
Affiliation(s)
- Fay J Hosking
- Department of Mathematics, University of Bristol, Bristol, UK.
|
22
|
Ding Z, Mailund T, Song YS. Efficient whole-genome association mapping using local phylogenies for unphased genotype data. Bioinformatics 2008; 24:2215-21. [PMID: 18667442 PMCID: PMC2553438 DOI: 10.1093/bioinformatics/btn406] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 05/01/2008] [Revised: 07/25/2008] [Accepted: 07/29/2008] [Indexed: 01/09/2023] Open
Abstract
MOTIVATION Recent advances in genotyping technology have made data acquisition for whole-genome association studies cost effective, and a current active area of research is developing efficient methods to analyze such large-scale datasets. Most sophisticated association mapping methods that are currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods and inferring the phase via computational approaches is time-consuming, taking days to phase a single chromosome. RESULTS In this article, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently found linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological datasets. AVAILABILITY The software described in this article is available at http://www.daimi.au.dk/~mailund/Blossoc and distributed under the GNU General Public License.
Affiliation(s)
- Zhihong Ding
- Department of Computer Science, University of California, Davis, USA
|
23
|
Abo R, Knight S, Wong J, Cox A, Camp NJ. hapConstructor: automatic construction and testing of haplotypes in a Monte Carlo framework. Bioinformatics 2008; 24:2105-7. [PMID: 18653522 PMCID: PMC2530882 DOI: 10.1093/bioinformatics/btn359] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 11/14/2022] Open
Abstract
Summary: Haplotypes carry important information that can direct investigators towards underlying susceptibility variants, and hence multiple tagging single nucleotide polymorphisms (tSNPs) are usually studied in candidate gene association studies. However, it is often unknown which SNPs should be included in haplotype analyses, or which tests should be performed for maximum power. We have developed a program, hapConstructor, which automatically builds multi-locus SNP sets to test for association in a case-control framework. The multi-SNP sets considered need not be contiguous; they are built based on significance. An important feature is that the missing data imputation is carried out based on the full data, for maximal information and consistency. HapConstructor is implemented in a Monte Carlo framework and naturally extends to allow for significance testing and false discovery rates that account for the construction process and to related individuals. HapConstructor is a useful tool for exploring multi-locus associations in candidate genes and regions. Availability: http://www-genepi.med.utah.edu/Genie Contact: ryan.abo@hsc.utah.edu
Affiliation(s)
- Ryan Abo
- Department of Biomedical Informatics, University of Utah, UT, USA.
|
24
|
Tachmazidou I, Verzilli CJ, De Iorio M. Genetic association mapping via evolution-based clustering of haplotypes. PLoS Genet 2008; 3:e111. [PMID: 17616979 PMCID: PMC1913101 DOI: 10.1371/journal.pgen.0030111] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 11/21/2006] [Accepted: 05/21/2007] [Indexed: 11/19/2022] Open
Abstract
Multilocus analysis of single nucleotide polymorphism haplotypes is a promising approach to dissecting the genetic basis of complex diseases. We propose a coalescent-based model for association mapping that potentially increases the power to detect disease-susceptibility variants in genetic association studies. The approach uses Bayesian partition modelling to cluster haplotypes with similar disease risks by exploiting evolutionary information. We focus on candidate gene regions with densely spaced markers and model chromosomal segments in high linkage disequilibrium therein assuming a perfect phylogeny. To make this assumption more realistic, we split the chromosomal region of interest into sub-regions or windows of high linkage disequilibrium. The haplotype space is then partitioned into disjoint clusters, within which the phenotype-haplotype association is assumed to be the same. For example, in case-control studies, we expect chromosomal segments bearing the causal variant on a common ancestral background to be more frequent among cases than controls, giving rise to two separate haplotype clusters. The novelty of our approach arises from the fact that the distance used for clustering haplotypes has an evolutionary interpretation, as haplotypes are clustered according to the time to their most recent common ancestor. Our approach is fully Bayesian and we develop a Markov Chain Monte Carlo algorithm to sample efficiently over the space of possible partitions. We compare the proposed approach to both single-marker analyses and recently proposed multi-marker methods and show that the Bayesian partition modelling performs similarly in localizing the causal allele while yielding lower false-positive rates. Also, the method is computationally quicker than other multi-marker approaches. 
We present an application to real genotype data from the CYP2D6 gene region, which has a confirmed role in drug metabolism, where we succeed in mapping the location of the susceptibility variant within a small error.
Affiliation(s)
- Ioanna Tachmazidou
- Department of Epidemiology and Public Health, Imperial College London, United Kingdom.
|
25
|
Teo YY. Common statistical issues in genome-wide association studies: a review on power, data quality control, genotype calling and population structure. Curr Opin Lipidol 2008; 19:133-43. [PMID: 18388693 DOI: 10.1097/mol.0b013e3282f5dd77] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.6] [Indexed: 01/12/2023]
Abstract
PURPOSE OF REVIEW Genetic association studies which survey the entire genome have become a common design for uncovering the genetic basis of common diseases, including lipid-related traits. Such studies have identified several novel loci which influence blood lipids. The present review highlights the statistical challenges associated with such large-scale genetic studies and discusses the available methodological strategies for handling these issues. RECENT FINDINGS The successful analysis of genome-wide data assayed on commercial genotyping arrays depends on careful exploration of the data. Unaccounted sample failures, genotyping errors and population structure can introduce misleading signals that mimic genuine association. Careful interpretation of useful summary statistics and graphical data displays can minimize the extent of false associations that need to be followed up in replication or fine-mapping experiments. SUMMARY Recently published genome-wide studies are beginning to yield valuable insights into the importance of well designed methodological and statistical techniques for sensible interpretation of the plethora of genetic data generated.
Affiliation(s)
- Yik Y Teo
- Wellcome Trust Centre for Human Genetics, University of Oxford, UK.
|
26
|
Su SY, Balding DJ, Coin LJM. Disease association tests by inferring ancestral haplotypes using a hidden Markov model. Bioinformatics 2008; 24:972-8. [PMID: 18296746 DOI: 10.1093/bioinformatics/btn071] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 01/18/2023]
Abstract
MOTIVATION Most genome-wide association studies rely on single nucleotide polymorphism (SNP) analyses to identify causal loci. The increased stringency required for genome-wide analyses (with per-SNP significance threshold typically about 10⁻⁷) means that many real signals will be missed. Thus it is still highly relevant to develop methods with improved power at low type I error. Haplotype-based methods provide a promising approach; however, they suffer from statistical problems such as abundance of rare haplotypes and ambiguity in defining haplotype block boundaries. RESULTS We have developed an ancestral haplotype clustering (AncesHC) association method which addresses many of these problems. It can be applied to biallelic or multiallelic markers typed in haploid, diploid or multiploid organisms, and also handles missing genotypes. Our model is free from the assumption of a rigid block structure but recognizes a block-like structure if it exists in the data. We employ a Hidden Markov Model (HMM) to cluster the haplotypes into groups of predicted common ancestral origin. We then test each cluster for association with disease by comparing the numbers of cases and controls with 0, 1 and 2 chromosomes in the cluster. We demonstrate the power of this approach by simulation of case-control status under a range of disease models for 1500 outcrossed mice originating from eight inbred lines. Our results suggest that AncesHC has substantially more power than single-SNP analyses to detect disease association, and is also more powerful than the cladistic haplotype clustering method CLADHC. AVAILABILITY The software can be downloaded from http://www.imperial.ac.uk/medicine/people/l.coin.
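The per-cluster comparison of cases and controls carrying 0, 1 or 2 chromosomes can be illustrated with a 2x3 contingency table. The abstract does not say which test statistic AncesHC uses, so the Pearson chi-square below is an assumption, as is the function name. The sketch also assumes both groups are non-empty.

```python
def cluster_association_chi2(case_counts, control_counts):
    """Pearson chi-square for a 2x3 table of cases/controls carrying
    0, 1 or 2 chromosomes in a haplotype cluster. Returns (chi2, df)."""
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    used_cols = 0
    for j, col_total in enumerate(col_totals):
        if col_total == 0:
            continue  # an empty category contributes nothing
        used_cols += 1
        for i, row_total in enumerate(row_totals):
            expected = row_total * col_total / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2, used_cols - 1

# Example: 60 cases and 60 controls split by copies of the cluster carried.
stat, df = cluster_association_chi2([10, 20, 30], [30, 20, 10])
```

With these counts every expected cell is 20, so the statistic is 20 on 2 degrees of freedom.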
Affiliation(s)
- Shu-Yi Su
- Department of Epidemiology and Public Health, Imperial College, London W2 1PG, UK
|
27
|
Abstract
Association methods based on linkage disequilibrium (LD) offer a promising approach for detecting genetic variations that are responsible for complex human diseases. Although methods based on individual single nucleotide polymorphisms (SNPs) may lead to significant findings, methods based on haplotypes comprising multiple SNPs on the same inherited chromosome may provide additional power for mapping disease genes and also provide insight on factors influencing the dependency among genetic markers. Such insights may provide information essential for understanding human evolution and also for identifying cis-interactions between two or more causal variants. Because obtaining haplotype information directly from experiments can be cost prohibitive in most studies, especially in large scale studies, haplotype analysis presents many unique challenges. In this chapter, we focus on two main issues: haplotype inference and haplotype-association analysis. We first provide a detailed review of methods for haplotype inference using unrelated individuals as well as related individuals from pedigrees. We then cover a number of statistical methods that employ haplotype information in association analysis. In addition, we discuss the advantages and limitations of different methods.
Affiliation(s)
- Nianjun Liu
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
|
28
|
Won S, Sinha R, Luo Y. Fine-scale linkage disequilibrium mapping: a comparison of coalescent-based and haplotype-clustering-based methods. BMC Proc 2007; 1 Suppl 1:S133. [PMID: 18466476 PMCID: PMC2367559 DOI: 10.1186/1753-6561-1-s1-s133] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022] Open
Abstract
Among the various linkage-disequilibrium (LD) fine-mapping methods, two broad classes have received considerable development recently: those based on coalescent theory and those based on haplotype clustering. Using Genetic Analysis Workshop 15 Problem 3 simulated data, the ability of these two classes to localize the causal variation was compared. Our results suggest that a haplotype-clustering-based approach performs favorably while requiring much less computation than coalescent-based approaches. Further, we observe that 1) when marker density is low, a set of equally spaced single-nucleotide polymorphisms (SNPs) provides better localization than a set of tagging SNPs of equal number; 2) denser sets of SNPs generally lead to better localization, but the benefit diminishes beyond a certain density; 3) larger sample size may do more harm than good when poor selection of markers results in biased LD patterns around the disease locus. These results are explained by how well the set of selected markers jointly approximates the expected LD pattern around a disease locus.
Affiliation(s)
- Sungho Won
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Wolstein Research Building, 10900 Euclid Avenue, Cleveland, Ohio 44106, USA.
|
29
|
Abstract
Multi-locus association analyses, including haplotype-based analyses, can sometimes provide greater power than single-locus analyses for detecting disease susceptibility loci. This potential gain, however, can be compromised by the large number of degrees of freedom caused by irrelevant markers. Exhaustive search for the optimal set of markers might be possible for a small number of markers, yet it is computationally inefficient. In this paper, we present a sequential haplotype scan method to search for combinations of adjacent markers that are jointly associated with disease status. When evaluating each marker, we add markers close to it in a sequential manner: a marker is added if its contribution to the haplotype association with disease is warranted, conditional on current haplotypes. This conditional evaluation is based on the well-known Mantel-Haenszel statistic. We propose two permutation based methods to evaluate the growing haplotypes: a haplotype method for the combined markers, and a summary method that sums conditional statistics. We compared our proposed methods, the single-locus method, and a sliding window method using simulated data. We also applied our sequential haplotype scan algorithm to experimental data for CYP2D6. The results indicate that the sequential scan procedure can identify a set of adjacent markers whose haplotypes might have strong genetic effects or be in linkage disequilibrium with disease predisposing variants. As a result, our methods can achieve greater power than the single-locus method, yet are much more computationally efficient than sliding window methods.
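The conditional evaluation described above relies on the Mantel-Haenszel statistic. The Python sketch below computes the standard one-degree-of-freedom form without continuity correction across a set of 2x2 tables. Treating each current haplotype as one stratum is an assumed reading of "conditional on current haplotypes", and the function name and table layout are hypothetical.

```python
def mantel_haenszel_statistic(strata):
    """Mantel-Haenszel chi-square (1 df, no continuity correction).

    strata: iterable of 2x2 tables (a, b, c, d), one per stratum, where
            a = cases with the candidate allele,    b = cases without,
            c = controls with the candidate allele, d = controls without.
    """
    observed = expected = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        expected += row1 * col1 / n
        # Hypergeometric variance of the top-left cell.
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0:
        return 0.0
    return (observed - expected) ** 2 / variance
```

In a sequential scan, a candidate marker would be added when this statistic, or a permutation p-value derived from it, indicates association beyond what the current haplotypes already explain.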
Affiliation(s)
- Zhaoxia Yu
- Division of Biostatistics, Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, Minnesota 55905, USA
|
30
|
Maniatis N, Collins A, Morton NE. Effects of single SNPs, haplotypes, and whole-genome LD maps on accuracy of association mapping. Genet Epidemiol 2007; 31:179-88. [PMID: 17285621 DOI: 10.1002/gepi.20199] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 11/08/2022]
Abstract
We describe an association mapping approach that utilizes linkage disequilibrium (LD) maps in LD units (LDU). This method uses composite likelihood to combine information from all single marker tests, and applies a model with a parameter for the location of the causal polymorphism. Previous analyses of the poor drug metabolizer phenotype provided evidence of the substantial utility of LDU maps for disease gene association mapping. Using LDU locations for the 27 single nucleotide polymorphisms (SNPs) flanking the CYP2D6 gene on chromosome 22, the most common functional polymorphism within the gene was located at 15 kb from its true location. Here, we examine the performance of this mapping approach by exploiting the high-density LDU map constructed from the HapMap data. Expressing the locations of the 27 SNPs in LDU from the HapMap LDU map, analysis yielded an estimated location that is only 0.3 kb away from the CYP2D6 gene. This supports the use of the high marker density HapMap-derived LDU map for association mapping even though it is derived from a much smaller number of individuals compared to the CYP2D6 sample. We also examine the performance of 2-SNP haplotypes. Using the same modelling procedures and composite likelihood as for single SNPs, the haplotype data provided much poorer localization compared to single SNP analysis. Haplotypes generate more autocorrelation through multiple inclusions of the same SNPs, which could inflate significance in association studies. The results of the present study demonstrate the great potential of the genome HapMap LDU maps for high-resolution mapping of complex phenotypes.
Affiliation(s)
- Nikolas Maniatis
- Human Genetics Division, University of Southampton, Southampton General Hospital, Southampton, UK.
|
31
|
de Andrade M, Allen AS. Summary of contributions to GAW15 Group 13: candidate gene association studies. Genet Epidemiol 2007; 31 Suppl 1:S110-7. [DOI: 10.1002/gepi.20287] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
|
32
|
Abstract
Although genetic association studies have been with us for many years, even for the simplest analyses there is little consensus on the most appropriate statistical procedures. Here I give an overview of statistical approaches to population association studies, including preliminary analyses (Hardy-Weinberg equilibrium testing, inference of phase and missing data, and SNP tagging), and single-SNP and multipoint tests for association. My goal is to outline the key methods with a brief discussion of problems (population structure and multiple testing), avenues for solutions and some ongoing developments.
Affiliation(s)
- David J Balding
- Department of Epidemiology and Public Health, Imperial College, St Marys Campus, Norfolk Place, London W2 1PG, UK.
|
33
|
Minichiello MJ, Durbin R. Mapping trait loci by use of inferred ancestral recombination graphs. Am J Hum Genet 2006; 79:910-22. [PMID: 17033967 PMCID: PMC1698562 DOI: 10.1086/508901] [Citation(s) in RCA: 79] [Impact Index Per Article: 4.4] [Received: 03/31/2006] [Accepted: 09/01/2006] [Indexed: 12/26/2022] Open
Abstract
Large-scale association studies are being undertaken with the hope of uncovering the genetic determinants of complex disease. We describe a computationally efficient method for inferring genealogies from population genotype data and show how these genealogies can be used to fine map disease loci and interpret association signals. These genealogies take the form of the ancestral recombination graph (ARG). The ARG defines a genealogical tree for each locus, and, as one moves along the chromosome, the topologies of consecutive trees shift according to the impact of historical recombination events. There are two stages to our analysis. First, we infer plausible ARGs, using a heuristic algorithm, which can handle unphased and missing data and is fast enough to be applied to large-scale studies. Second, we test the genealogical tree at each locus for a clustering of the disease cases beneath a branch, suggesting that a causative mutation occurred on that branch. Since the true ARG is unknown, we average this analysis over an ensemble of inferred ARGs. We have characterized the performance of our method across a wide range of simulated disease models. Compared with simpler tests, our method gives increased accuracy in positioning untyped causative loci and can also be used to estimate the frequencies of untyped causative alleles. We have applied our method to Ueda et al.'s association study of CTLA4 and Graves disease, showing how it can be used to dissect the association signal, giving potentially interesting results of allelic heterogeneity and interaction. Similar approaches analyzing an ensemble of ARGs inferred using our method may be applicable to many other problems of inference from population genotype data.
Affiliation(s)
- Mark J Minichiello
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, United Kingdom
|
34
|
Mailund T, Besenbacher S, Schierup MH. Whole genome association mapping by incompatibilities and local perfect phylogenies. BMC Bioinformatics 2006; 7:454. [PMID: 17042942 PMCID: PMC1624851 DOI: 10.1186/1471-2105-7-454] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.5] [Received: 07/01/2006] [Accepted: 10/16/2006] [Indexed: 11/21/2022] Open
Abstract
Background With current technology, vast amounts of data can be cheaply and efficiently produced in association studies, and to prevent data analysis from becoming the bottleneck of studies, fast and efficient analysis methods that scale to such data set sizes must be developed. Results We present a fast method for accurate localisation of disease causing variants in high density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and scored by its accuracy as a decision tree. The rationale for this is that the perfect phylogeny near a disease affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, due to e.g. large marker spacing, the algorithm can allow the inclusion of incompatibility markers in order to enlarge the regions prior to estimating their phylogeny. Haplotype data and phased genotype data can be analysed. The power and efficiency of the method are investigated on 1) simulated genotype data under different models of disease determination, 2) artificial data sets created from the HapMap resource, and 3) data sets used for testing of other methods in order to compare with these. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease causing mutation and a constant recombination rate.
However, when it comes to more complex scenarios of mutation heterogeneity and more complex haplotype structure such as found in the HapMap data, our method outperforms SMA as well as other fast data-mining approaches such as HapMiner and Haplotype Pattern Mining (HPM), despite being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method. The method was also found to accurately localise the known susceptibility variant in an empirical data set (the ΔF508 mutation for cystic fibrosis) and to find significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this dataset the highest association score is about 60 kb from the CYP2D6 gene. Conclusion Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours.
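The idea of a region around a marker that is "compatible with a single phylogenetic tree" can be illustrated with the four-gamete test for biallelic markers. The Python sketch below is a simple greedy version written for illustration. It is not the Blossoc implementation, it does not guarantee the largest possible region, and the function names and input layout are assumptions.

```python
def compatible(site_a, site_b):
    """Four-gamete test: two biallelic sites fit one perfect phylogeny
    unless all four allele combinations 00, 01, 10, 11 occur."""
    return len(set(zip(site_a, site_b))) < 4

def compatible_region(haplotypes, centre):
    """Grow an interval around marker `centre` while every pair of markers
    inside it stays compatible. Returns (left, right), inclusive."""
    n_markers = len(haplotypes[0])
    # Transpose so that columns[m] holds the alleles at marker m.
    columns = [[hap[m] for hap in haplotypes] for m in range(n_markers)]

    def fits(candidate, left, right):
        return all(compatible(columns[candidate], columns[m])
                   for m in range(left, right + 1))

    left = right = centre
    grew = True
    while grew:
        grew = False
        if left > 0 and fits(left - 1, left, right):
            left -= 1
            grew = True
        if right < n_markers - 1 and fits(right + 1, left, right):
            right += 1
            grew = True
    return left, right
```

Pairwise compatibility is sufficient here because, for binary markers, a set of sites admits a perfect phylogeny exactly when every pair passes the four-gamete test.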
Affiliation(s)
- Thomas Mailund
- Department of Statistics, University of Oxford, UK
- Bioinformatics Research Center, University of Aarhus, Denmark
|
35
|
Morris AP. A flexible Bayesian framework for modeling haplotype association with disease, allowing for dominance effects of the underlying causative variants. Am J Hum Genet 2006; 79:679-94. [PMID: 16960804 PMCID: PMC1592560 DOI: 10.1086/508264] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Received: 03/17/2006] [Accepted: 08/02/2006] [Indexed: 11/03/2022] Open
Abstract
Multilocus analysis of single-nucleotide-polymorphism (SNP) haplotypes may provide evidence of association with disease, even when the individual loci themselves do not. Haplotype-based methods are expected to outperform single-SNP analyses because (i) common genetic variation can be structured into haplotypes within blocks of strong linkage disequilibrium and (ii) the functional properties of a protein are determined by the linear sequence of amino acids corresponding to DNA variation on a haplotype. Here, I propose a flexible Bayesian framework for modeling haplotype association with disease in population-based studies of candidate genes or small candidate regions. I employ a Bayesian partition model to describe the correlation between marker-SNP haplotypes and causal variants at the underlying functional polymorphism(s). Under this model, haplotypes are clustered according to their similarity, in terms of marker-SNP allele matches, which is used as a proxy for recent shared ancestry. Haplotypes within a cluster are then assigned the same probability of carrying a causal variant at the functional polymorphism(s). In this way, I can account for the dominance effect of causal variants, here corresponding to any deviation from a multiplicative contribution to disease risk. The results of a detailed simulation study demonstrate that there is minimal cost associated with modeling these dominance effects, with substantial gains in power over haplotype-based methods that do not incorporate clustering and that assume a multiplicative model of disease risks.
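The clustering of haplotypes by marker-SNP allele matches can be sketched as a single deterministic assignment step. The full method is a Bayesian partition model sampled over many partitions, which this sketch does not attempt. The use of explicit cluster centres, the tie-breaking rule, and the function names are assumptions for illustration.

```python
def allele_matches(hap_a, hap_b):
    """Number of marker positions at which two haplotypes carry the same allele."""
    return sum(a == b for a, b in zip(hap_a, hap_b))

def assign_to_centres(haplotypes, centres):
    """Assign each haplotype to the centre sharing the most marker alleles
    (ties go to the earliest centre in the list)."""
    clusters = {centre: [] for centre in centres}
    for hap in haplotypes:
        best = max(centres, key=lambda centre: allele_matches(hap, centre))
        clusters[best].append(hap)
    return clusters

# Haplotypes written as strings of 0/1 alleles at four markers.
groups = assign_to_centres(["0001", "1110", "0011"], ["0000", "1111"])
```

Haplotypes placed in the same cluster would then share one probability of carrying a causal variant, as the abstract describes.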
Affiliation(s)
- Andrew P Morris
- Wellcome Trust Centre for Human Genetics, Oxford, OX3 7BN, United Kingdom.
|
36
|
Johnson T. Bayesian method for gene detection and mapping, using a case and control design and DNA pooling. Biostatistics 2006; 8:546-65. [PMID: 16984977 DOI: 10.1093/biostatistics/kxl028] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022] Open
Abstract
Association mapping studies aim to determine the genetic basis of a trait. A common experimental design uses a sample of unrelated individuals classified into 2 groups, for example cases and controls. If the trait has a complex genetic basis, consisting of many quantitative trait loci (QTLs), each group needs to be large. Each group must be genotyped at marker loci covering the region of interest; for dense coverage of a large candidate region, or a whole-genome scan, the number of markers will be very large. The total amount of genotyping required for such a study is formidable. DNA pooling, a technique that is efficient in laboratory effort, could reduce the amount of genotyping required, but the data generated are less informative and require novel methods for efficient analysis. In this paper, a Bayesian statistical analysis of the classic model of McPeek and Strahs is proposed. In contrast to previous work on this model, I assume that data are collected using DNA pooling, so individual genotypes are not directly observed, and also account for experimental errors. A complete analysis can be performed using analytical integration, a propagation algorithm for a hidden Markov model, and quadrature. The method developed here is both statistically and computationally efficient. It allows simultaneous detection and mapping of a QTL, in a large-scale association mapping study, using data from pooled DNA. The method is shown to perform well on data sets simulated under a realistic coalescent-with-recombination model, and is shown to outperform classical single-point methods. The method is illustrated on data consisting of 27 markers in an 880-kb region around the CYP2D6 gene.
Affiliation(s)
- Toby Johnson
- School of Biological Sciences, The University of Edinburgh, Edinburgh EH9 3JT, UK.
|
37
|
Nicolae DL. Testing Untyped Alleles (TUNA)—applications to genome-wide association studies. Genet Epidemiol 2006; 30:718-27. [PMID: 16986160 DOI: 10.1002/gepi.20182] [Citation(s) in RCA: 98] [Impact Index Per Article: 5.4] [Indexed: 11/07/2022]
Abstract
The large number of tests performed in analyzing data from genome-wide association studies has a large impact on the power to detect risk variants, and analytic strategies specifying the optimal set of hypotheses to be tested are necessary. We propose a genome-wide strategy that is based on one-degree-of-freedom tests for all the genotyped variants, and for all the untyped variants for which there is sufficient information in the observed data. The set of untyped variants to be tested is found using multi-locus measures of linkage disequilibrium and haplotype frequencies from a reference database such as HapMap (The International HapMap Consortium [2003] Nature 426:789-796). We introduce a novel statistic for testing differences in allele frequencies for untyped variation that is based on linear combinations of estimable haplotype frequencies. Algorithms for finding the sets of genotyped markers to be used in testing an untyped allele, and ways of incorporating haplotypes observed in the study data but not in the reference database, are also described. The proposed testing strategy can be used as the first step in the analysis of genome-wide association data, and, because every performed test is directed to a marker, it can be used to specify the set of polymorphisms to genotype in follow-up studies. The described methodology also provides a tool for joint analysis of data from studies done on different platforms.
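The idea of estimating an untyped allele's frequency as a linear combination of haplotype frequencies can be sketched as follows. This is a simplified illustration, not the TUNA statistic itself: it assumes that haplotypes over the genotyped markers are observed directly (phase known), that the weights are the reference-panel probabilities of carrying the untyped allele given each haplotype, and that the weights are treated as fixed. The function names and the example numbers are hypothetical.

```python
from math import sqrt


def untyped_allele_frequency(weights, hap_freqs):
    """Estimate an untyped allele frequency as sum_h weights[h] * hap_freqs[h]."""
    return sum(weights[h] * f for h, f in hap_freqs.items())


def linear_combination_z(weights, case_freqs, ctrl_freqs, n_case, n_ctrl):
    """Z statistic for a case-control difference in the estimated frequency.

    n_case and n_ctrl are numbers of chromosomes. Haplotype counts are
    treated as multinomial, so Var(w . f_hat) = (sum w^2 f - (sum w f)^2) / n.
    """
    def mean_and_var(freqs, n):
        mean = untyped_allele_frequency(weights, freqs)
        second_moment = sum(weights[h] ** 2 * f for h, f in freqs.items())
        return mean, (second_moment - mean ** 2) / n

    p_case, v_case = mean_and_var(case_freqs, n_case)
    p_ctrl, v_ctrl = mean_and_var(ctrl_freqs, n_ctrl)
    se = sqrt(v_case + v_ctrl)
    # Guard against a degenerate zero-variance configuration.
    return 0.0 if se == 0 else (p_case - p_ctrl) / se


# Hypothetical weights: P(untyped allele | haplotype) from a reference panel.
weights = {"AB": 1.0, "Ab": 0.0, "aB": 0.5, "ab": 0.0}
case_freqs = {"AB": 0.4, "Ab": 0.2, "aB": 0.2, "ab": 0.2}
ctrl_freqs = {"AB": 0.2, "Ab": 0.3, "aB": 0.2, "ab": 0.3}
z = linear_combination_z(weights, case_freqs, ctrl_freqs, n_case=2000, n_ctrl=2000)
```

Because the result is a single contrast between two estimated frequencies, it corresponds to a one-degree-of-freedom comparison. A real analysis would also have to account for uncertainty in phasing and in the reference-panel weights, which this sketch ignores.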
Affiliation(s)
- Dan L Nicolae
- Departments of Medicine and Statistics, The University of Chicago, Illinois 60637, USA
|